'Mainframe blowout' knackered millions of RBS, NatWest accounts
Bankers blame hardware fault, sources point to IBM big iron
Posted in Management, 7th March 2013 14:08 GMT
A hardware fault in one of the Royal Bank of Scotland Group's mainframes prevented millions of customers from accessing their accounts last night.
A spokesman said an unspecified system failure was to blame after folks were unable to log into online banking, use cash machines or make payments at the tills for three hours on Wednesday evening. The fault affected customers of RBS, NatWest and Ulster Bank.
The taxpayer-owned financial giant stressed this was not a repeat of the catastrophic three-day outage that kicked off less than a year ago, during which RBS banks couldn't process payments to millions of accounts. The cock-up left infuriated customers short of cash and unable to settle bills, shop for food or keep up with mortgage repayments.
Speaking of yesterday's titsup service, an RBS Group spokesman told The Reg: "This problem was caused by a hardware fault and was not related to the issues we experienced last summer."
El Reg traced the cause of last year’s downtime to human error: an inexperienced IT operative hit the wrong button during what should have been a routine overnight CA-7 batch job to process inbound payments. The cock-up caused a huge backlog that took days, and in some cases weeks, to clear.
Our sources close to the banking group's IT systems told us today that last night's outage was likely caused by a failure in the IBM mainframe that handles customer accounts. The fault may have been something as simple as a corrupted hard drive, a broken disk controller or faulty interconnecting hardware.
But such a minor issue shouldn't have taken down core activities. In theory, the banking group’s disaster-recovery procedures should have kicked in straight away without a glitch in critical services.
Yet, between 9pm and 11pm, customers were unable to access accounts online, by phone or through cash machines and smartphones.
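For the curious, the sort of automated failover our sources say should have stepped in boils down to a watchdog that probes the primary system and, after a handful of failed checks, promotes the standby. The sketch below is purely illustrative: the hostnames, ports, thresholds and probe method are our invention, not a description of RBS Group's actual DR tooling.

```python
# Illustrative failover watchdog - NOT RBS Group's actual DR setup.
# Hostnames, ports and thresholds below are invented for the example.
import socket
import time

PRIMARY = ("primary.core-banking.example", 443)   # hypothetical primary endpoint
FAILURE_THRESHOLD = 3          # consecutive failed probes before failing over
PROBE_INTERVAL_SECONDS = 10

def probe(address, timeout=5):
    """Return True if a TCP connection to the primary succeeds."""
    try:
        with socket.create_connection(address, timeout=timeout):
            return True
    except OSError:
        return False

def promote_standby():
    """Placeholder for the real work: repoint traffic at the standby site."""
    print("Primary unreachable - promoting standby site")

def watchdog():
    failures = 0
    while True:
        if probe(PRIMARY):
            failures = 0       # healthy again, reset the counter
        else:
            failures += 1
            if failures >= FAILURE_THRESHOLD:
                promote_standby()
                break
        time.sleep(PROBE_INTERVAL_SECONDS)

if __name__ == "__main__":
    watchdog()
```

The threshold is there simply to avoid failing over on a single dropped probe; real DR orchestration is, of course, considerably more involved.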
One source told The Reg: “This one looks like a standard outage, rather than anything more complicated. It's just so unusual for these to happen with mainframes.”
RBS Group runs its core banking operations, including all customer accounts, on IBM zSeries machines, among the most reliable hardware in the industry.
Another source told The Reg that human error is “most likely” to blame for the delay in initiating the recovery; the bank’s IT procedures will in some way require system administrators to understand a problem before they start flipping switches.
If you're in the know and would like to add to the coverage of this latest outage at RBS Group, drop The Reg a note or call the London office on 020 3189 4620. ®
COMMENTS
I reckon the other source had it spot on
"the bank’s IT procedures will in some way require system administrators to understand a problem before they start flipping switches."
Naturally. However, let's not forget the best-of-breed world-class fault resolution protocol that's been implemented to ensure a right-first-time customer-centric outcome.
That protocol means that a flustercluck of management has to be summoned to an immediate conference call. That takes time - dragging them out of bed, out of the pub, out of the brothel gentlemen's club and so on.
Next, they have to dial into the conference call. They wait while everyone joins. Then the fun begins:
Manager 1: "Ok what's this about?"
Operator: "The mainframe's shat itself, we need to fail over NOW. Can you give the OK, please?"
Manager 2: "Hang on a minute. What's the problem exactly?"
Operator: "Disk controller's died."
Manager 3: "Well, can't you fix it?"
Operator: "Engineer's on his way, but this is a live system. We need to fail over NOW."
Manager 4: "All right, all right. Let's not get excited. Why can't we just switch it off and switch it on again? That's what you IT Crowd people do, isn't it?"
Operator: "Nggggg!"
Manager 1: "I beg your pardon?"
Operator: (after deep breath): "We can't just switch it off and on again. Part of it's broken. Can I fail it over now, please?"
Manager 2: "Well, where's your change request?"
Operator: "I've just called you to report a major failure. I haven't got time to do paperwork!"
Manager 3: "Well, I'm not sure we should agree to this. There are processes we have to follow."
Manager 4: "Indeed. We need to have a properly documented change request, impact assessment from all stakeholders and a timeframe for implementation AND a backout plan. Maybe you should get all that together and we'll reconvene in the morning?"
Operator: "For the last bloody time, the mainframe's dead. This is an emergency!"
Manager 1: "Well, I'm not sure of the urgency, but if it means so much to you..."
Manager 2: "Tell you what. Do the change, write it up IN FULL and we'll review it in the morning. But it's up to you to make sure you get it right, OK"
Operator: "Fine, thanks."
<click>
Manager 3: "He's gone. Was anyone taking minutes?"
Manager 4: "No. What a surprise. These techie types just live on a different planet."
Manager 1: "Well, I'm off to bed now. I'll remember this when his next appraisal's due. Broken mainframe indeed. Good night."
Manager 2: "Yeah, night."
Manager 3: "Night."
Manager 4: "Night."
Re: I doubt it
> It would take the failure of multiple pieces of hardware to take down an IBM zServer, but that doesn't mean it can't happen.
It also assumes someone noticed the first failure. I remember our DEC service bod (it was a while ago :) ) complaining about a customer who'd had a total cluster outage after a disk controller failed. Customer was ranting & raving about the useless "highly available" hardware they'd spent so much money on.
Investigation showed that one of the redundant controllers had failed three months before, but none of the system admins had been checking logs or monitoring things. The spare controller took over without a glitch, no-one noticed, and it was only when it failed that the system finally went down.
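A crude illustration of the sort of scheduled check that would have caught that first failure - the component names are entirely made up, and in real life you'd query the hardware management console or SNMP rather than a hard-coded status map:

```python
# Hypothetical sketch: alert as soon as any redundant group loses a member,
# rather than waiting for someone to read the logs. All data here is invented.
redundant_groups = {
    "disk-controller": ["ctrl-a", "ctrl-b"],
    "power-supply": ["psu-1", "psu-2"],
}

# Stand-in for a real query to the management console / SNMP agent.
component_status = {
    "ctrl-a": "failed",
    "ctrl-b": "ok",
    "psu-1": "ok",
    "psu-2": "ok",
}

def check_redundancy():
    """Return a warning for every redundant group running below full strength."""
    alerts = []
    for group, members in redundant_groups.items():
        healthy = [m for m in members if component_status.get(m) == "ok"]
        if len(healthy) < len(members):
            alerts.append(f"{group}: {len(healthy)}/{len(members)} healthy - redundancy degraded")
    return alerts

if __name__ == "__main__":
    for alert in check_redundancy():
        print("ALERT:", alert)   # in practice: page someone, don't just print
```

Run something like that from cron and the dead controller gets flagged the day it dies, not three months later.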
High Availability and Resilience
This should NOT have been about DR or backups. This should have been handled as part of any high-availability, RESILIENT cluster design. I've designed and architected HA on IBM SP2 supercomputer clusters and can well attest that it works - our "system test" was walking the floor of the data centre randomly pulling drive controller cables and CPU boards out of their sockets, while the core systems kept running their processes without failing! And that was 10+ years ago - I find it appalling that a live banking system would not be engineered to the same degree of _resilience_. Don't talk in terms of how many minutes of downtime it will have per year - it should be engineered to tolerate the failure of x disks, y controllers and z processors within a chassis/partition/etc. before it goes down. For a live financial system, those should be the metrics that are quoted, not reliability alone.
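To put rough numbers on that sort of "tolerate x failures out of n components" requirement, here's a back-of-the-envelope sketch - toy figures and an independent-failure assumption, not a real reliability model:

```python
# Back-of-the-envelope sketch: probability that an n-component group stays up
# when it can tolerate up to k member failures, assuming independent failures
# with per-component probability p over the period. Figures are invented.
from math import comb

def survival_probability(n, k, p):
    """P(at most k of n components fail) for independent failures with prob p."""
    return sum(comb(n, i) * (p ** i) * ((1 - p) ** (n - i)) for i in range(k + 1))

# Example: 8 disk controllers, design tolerates 2 failures, 1% failure rate each.
print(f"{survival_probability(8, 2, 0.01):.6f}")   # ~0.999946 with these toy numbers
```

Quoting the system that way ("survives any two controller failures") is a far more honest spec than a headline minutes-per-year figure.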
Re: I reckon the other source had it spot on
"Bunch of experts in the hardware, OS, software, Network, Storage and Backup get on call to discuss, chaired by a trained professional recovery manager."
Well, quite. That's exactly what should happen. Been there myself, admittedly not in financial services.
I've seen it done properly, and it's precisely as you describe.
And I've seen it done appallingly, with calls derailed by people who knew next to nothing about the problem, but still insisted on adding value by not keeping their traps shut.
I guess I'm just too old and cynical these days :-)
Re: I doubt it
It would take the failure of multiple pieces of hardware to take down an IBM zServer, but that doesn't mean it can't happen. The only thing you can be sure of with any system is that it will eventually fail.
To accuse them of running a "Mickey Mouse" operation suggests that you've no idea how big or complex the IT setup at RBS is. I believe they currently have the largest "footprint" of zServers in Europe, and that's before you even get to the vast amount of other hardware on a globally distributed network.
Small IT = Easy.
Big IT = Exponentially more complicated.
