
Three months into the job, and there have already been at least 4 incidents, plus other smaller ones that were not recorded, i.e. ones that got covered up.
Most of the incidents were due to human error. Like today: one of the operators mis-keyed a command, mistaking 'z' for 'd'. As a result, the whole system hung and needed a reboot, or in mainframe terms, an IPL. When everything came back up, the system, database and website all checked out OK, but the next morning during online hours, users could not log on. The system and middleware guys found nothing wrong on their side, but the database side saw a utility holding on to some tables. After getting agreement from the Apps team, the System DBA terminated the utility and everything went back to normal; logins were OK again.
I had to do the incident report again, plus some damage control. The system owner was asking for the name of the operator who made the mistake. Oh gosh, should I give names? I'm in a dilemma now, man...