IBM employee sparks massive bank outage
Big Blue liveware triggers seven-hour FAIL
Last Monday, one of Singapore's largest banks suffered a seven-hour IT outage that took down everything from back-office services to ATMs. This Tuesday, the flawed component was identified: an IBM employee.
"We take full responsibility for this incident," wrote DBS Group Holdings CEO Piyush Gupta in a statement. A laudably mature response, to be sure, but his communiqué went on to explain that the blame for the outage, which lasted from 3am to 10am on Monday July 5, is to be borne by IBM.
Specifically, an IBM employee who made "a procedural error in what was to have been a routine maintenance operation [that] subsequently caused a complete system outage."
The cascading failure began when a storage subsystem began giving error messages that indicated intermittent failures. A fix was scheduled for 3am, "a quiet period," in Gupta's words.
Unfortunately for DBS and IBM, an "outdated procedure" was used to initiate the repair, and all IT hell broke loose. By 3:40 "a technical command function" was mobilized, and at 5:20 a system restart was attempted. Didn't work.
Following "complications during the machine restart," Gupta wrote, the "bankwide disaster recovery command centre" was activated, but by 8:30 it was determined that the core troubles could be fixed by 10:00, so full-scale disaster recovery wasn't needed. Main services were, indeed, up by 10:00, and, Gupta wrote, "All other services were progressively restored through the morning and virtually everything was back on track by lunchtime." No data was lost during the outage, he reports.
IBM and DBS entered into a S$1.2bn ($872m, £575m) agreement in 2002 in which the bank outsourced "selected IT services and infrastructure in Singapore and Hong Kong to IBM."
IBM on Tuesday released a statement noting that it had "taken steps to enhance training of our personnel related to current procedures and brought in experts from our global team to provide further assistance."
Big Blue did not note if that one unlucky IT admin was receiving the enhanced training, or if he has now become an uptick in global unemployment statistics. ®
I know what was done...
Squawk box reports that NAS is having issues
IBM monkey pulls the wrong faulty hot swap disks from a Raid 5
IBM monkey replaces correct broken disk but RAID is borked
IBM monkey runs chkdsk /fix on the broken volume
IBM monkey notices chkdsk breaking the volume more, panics and hits the reset button.
IBM monkey reboots NAS, and the chkdsk restarts, pooches the volume even more..
Someone who actually KNOWS what they are doing is called around 6am, and it takes the rest of the time to fix the issue, restore backups, and see to it that someone meets with a "terrible accident"
The last hour would have been devoted to sourcing a big enough bag of lime, a shovel, and a roll of carpet.
RE: There, but for the grace of god
".....The biggest failure in IT is that anyone with root has the power, or bad luck, to place the company they work for in exactly this situation....." Yeah, so comforting to just blame the sysadmin, but the truth is this is a management failure, as just about every "laugh-at-the-silly-admin-that-pulled-the-wrong-disk" situation actually resolves down to. Why? Because it is management that selects the admin and gives them that root access. You wouldn't give a novice driver the keys to your Ferrari, would you? If you did, and they bent it, wouldn't you feel just a bit to blame for putting them in the driving seat?
This wouldn't have been some architect-level tech genius, this was probably the junior admin if they were doing the overnight shift. Read the article - the admin thought he was using a good procedure, the fact he didn't know it was a wrong procedure highlights several possible management failings:
1/ They hired an incompetent admin that didn't have the up-to-date training he claimed to have (i.e., he lied on his CV), which means their selection process was flawed (probably because they didn't include a skilled sysadmin in the selection team, who would have spotted the "exaggerations", just used HR drones).
2/ The bank introduced new kit but IBM didn't do the requisite staff training, either because they didn't check their staff's skillsets; or IBM decided to save a few pennies and just told the sysadmin to "self-train on the job"; or IBM actually didn't know what the new kit required, and hence couldn't provide a correctly skilled resource, probably because it was another vendor's kit.
3/ IBM management didn't assign a competent technical project manager or technical team leader who should have looked at the new kit when it was introduced, reviewed any new procedures, updated the sysadmin procedures and planned any additional training to get their skillset right.
So, blame the sysadmin if it makes you feel better, but it was incompetent management that put that incorrectly prepared sysadmin at the console.
an "outdated procedure" was used to initiate the repair,
So not the Grunt's fault then, but the management for not updating the procedure or not notifying the Grunt of the update.
There, but for the grace of god
... goes pretty much every major company in the world.
The biggest failure in IT is that anyone with root has the power, or bad luck, to place the company they work for in exactly this situation. The only surprise is that this sort of thing doesn't happen more often - or maybe just that it isn't reported more often.
Until systems are built robust enough to survive the onslaught of a trainee with the manual held upside-down, we really can't call what we do a "profession".
Couldn't agree more...
and with every large company on the face of the Earth sucking out all the cash for executive bonuses in the multi-millions instead of spending it on training we'll see a lot more of this. The last of the folks who know what they are doing, and who had sufficient training to work on complex systems, are starting/have started to retire.
Fun times ahead, wonder if companies will be able to sue retired executives for bad business practices after they've retired or moved on. You know, once it becomes apparent to everyone that they have ruined the companies that paid them.