Original URL: http://www.theregister.co.uk/2013/06/21/rbs_chernobyl_one_year_on/

RBS Mainframe Meltdown: A year on, the fallout is still coming

When the totally brand new kit comes on ... what do you think will happen?

By Gavin Clarke

Posted in CIO, 21st June 2013 08:24 GMT

A year ago, RBS experienced its Chernobyl moment – an incident in which simple human error by those running critical systems resulted in a crisis.

IT staff badly botched routine maintenance of the IBM mainframe handling millions of customers' accounts – a system processing 20 million transactions a day. The mistake was compounded by their inability to recover the situation quickly enough.

The fallout saw up to 16.7 million customers at three banks in the group – RBS, NatWest and Ulster Bank – unable to access their money for four days.

RBS couldn’t hide and MPs monitoring the City pounced, demanding immediate answers from senior management on what went wrong with the bank's computers.

The press of Middle England weighed in, too, pillorying the bank’s already unpopular chairman as he made a grovelling apology to MPs for the whole episode.

A year on and the fallout is still landing as the Financial Conduct Authority (FCA) decides whether action is needed against RBS.

The bank is splashing out £450m on top of its £2bn annual IT spend to replace the mainframe that failed and on new backup systems. RBS told The Reg that it's instituting a “complete refresh of the mainframe” system in Edinburgh.

It’s an unprecedented move: RBS has had computer problems in the past but nothing has warranted a complete rip and replace of entire systems on this scale.

A year in, though, has RBS learned its lessons? Can throwing money at new hardware save customers from future problems? It’s still unclear whether RBS has truly reversed course on its policy of wholesale outsourcing of IT jobs - a policy that helped ignite last year's crisis.

Then there's the possibility that what happened at RBS could be replicated elsewhere: old and overloaded mainframes like the one at RBS hold millions of accounts at other banks that have also sent their IT jobs overseas.

RBS is spending nearly half a billion to replace the system that failed in June 2012. The group is buying a new IBM mainframe and making “significant change to many of the systems to improve disaster recovery and automated error recovery”, a bank spokesperson tells The Reg. “This is a result of the IT problems [last] June.”

One former RBS IT insider described to us the usual procedure for outages at the bank:

“I've dealt with a few outages at RBS in the past. There will have been a standard 'lessons learned' or 'drains up' type of investigation. The exact reason for the problem will have been found and pored over in a tedious level of detail, then processes will have been put in place to prevent that set of circumstances repeating. Often this is to tighten security and/or process, so for example you may have found individual user groups being tied down more or process documentation required to be more granular. Potentially more post-change reviewing to make sure that people did what they said they would.”

RBS faces a Herculean job in bringing online a new mainframe operating in a core part of its day-to-day business. It must plan and execute the job without interrupting the existing service, which means it cannot simply take the old mainframe offline during the transition.

RBS did not say when it plans to bring the new mainframe online.

But hardware is only one thing: RBS must also determine what to do with the existing apps running on the system. Either it must port existing apps to the new system - which is likely - or write or buy new apps. If the former, RBS must design, write, test and then shift. If the latter, RBS must make sure the new apps work on the new mainframe and interoperate with RBS’s other, connected systems.

RBS did not say whether the old CA-7 software would be ported to run on the new mainframe.

Our ex-RBS techie outlines the complexity of the challenge facing the RBS team making the switchover:

Being a bank, everything is by nature hung together with schedules and batches; this is really the correct way of doing things. You make a product, make it stable, then if it needs to do something else you add another system and a dataflow … I once saw a diagram of all the dataflow in RBS: just the ones between major systems fairly impressively filled a projection screen, then all the minor systems were added and it was just a black screen!

It’s a measure of just how bad things were for RBS that it’s spending £450m to be more or less back where it started - on a mainframe, just a newer and, fingers crossed, more reliable mainframe. There have been reports of companies dumping mainframes, but the mainframe remains a standard for banks: 25 of the world’s top banks use mainframes from IBM, according to Gartner.

Another ex-RBS IT staffer told us:

“A lot of talk has been had in the news about how these systems are too complex and bound to fail, but I guarantee that re-writing the systems and making them monolithic programs would result in some serious pain and cost in the short to medium term and in the long term you'd just end up with the same satellite batch controlled systems when the appetite to change the central system runs out.”

Mainframes are embedded thanks to their history: the S/360, launched in the mid-1960s, gave companies access to the kind of fast computing power that had previously been available only to governments and academics building machines on a project-by-project basis. Over the years, IBM has extended and upgraded the family, through the S/390 to the zSeries.

Bad-tasting layered cake

Banks have layered more apps, different code and data flows onto these systems as the business has changed. The RBS mainframe has been stretched by the acquisitions of NatWest in 2000 and Ulster Bank, to the point where it’s audibly creaking.

The Edinburgh mainframe system was so old that parts of its code had been written in assembler for hardware going back to the 1970s. The mainframe talks to RBS’s network of ATMs, once CICS terminals but today PCs serving modern amenities such as mobile phone top-up. The back-end systems are strung together using a clone of IBM's MQ (Message Queuing) middleware. One of our ex-insiders points out:

“Heaven knows how much extra cruft you have to know and understand.”
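For readers who haven't met message-queuing middleware, the rough idea is sketched below: one system drops a message on a named queue and gets on with its day, while a separate consumer drains the queue and processes the work asynchronously. This is a minimal illustration in Python using the standard queue module as a stand-in; the real IBM MQ (or RBS's clone of it) provides persistent, networked queues with its own APIs, and the account number and amount shown here are purely hypothetical.

```python
# Sketch of queue-based decoupling between systems (illustrative only:
# Python's in-process queue stands in for MQ-style middleware).
import queue
import threading

atm_requests: queue.Queue = queue.Queue()   # stand-in for a named message queue

def atm_front_end() -> None:
    """An ATM-facing system posts a withdrawal request and moves on."""
    atm_requests.put({"account": "12345678", "amount_pence": 5000})

def core_banking_consumer() -> None:
    """A back-end consumer drains the queue and posts the entries."""
    while True:
        msg = atm_requests.get()
        if msg is None:                     # sentinel: shut down cleanly
            break
        print(f"posting {msg['amount_pence']}p withdrawal to account {msg['account']}")

worker = threading.Thread(target=core_banking_consumer)
worker.start()
atm_front_end()
atm_requests.put(None)                      # tell the consumer to stop
worker.join()
```

The point of the pattern is that neither side needs the other to be awake at the same moment, which is why so many loosely connected satellite systems can be strung off one mainframe.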

All of this layering and glue has made the Edinburgh mainframe hard to run and to maintain. Another former RBSer explained how complicated it had become:

“I recall getting into discussions regarding adding an itemized ATM withdrawal fee to statements if such a thing was incurred, and waving around a bit of listing with the relevant code on it. The meeting (involving about 12 people) happily discussed all sorts of peripheral things, with the only question to me being 'can you actually read that stuff'. They gave up on that idea after a while on the grounds that it was 'too complicated'.”

Spending for a fresh start was perhaps inevitable given the levels of flak RBS and its chairman took. But such systems don’t run on their own, and there’s a need to make sure those running them understand the systems and the jobs. Is RBS really making the changes that might help stave off a repeat of the crisis of June 2012?

Treasury Select Committee chairman Andrew Tyrie MP wrote to RBS CEO Stephen Hester at the time of the outage saying his committee was “extremely concerned about the current crisis at RBS.”

In his letter, Tyrie demanded Hester explain what caused the meltdown, what it revealed about RBS’s ability to spot potential risks and to develop contingency plans, and whether outsourcing had contributed to the crisis. The latter question is pertinent because among the 36,000 jobs cut by RBS since March 2012 as part of cost cutting are 500 IT jobs that have been outsourced to suppliers in India.

Hester and RBS have said their investigations found the error occurred during operations “managed and operated by our team in Edinburgh.”

It’s a carefully constructed defense because, as The Reg found out at the time, among the 500 IT staff cut by RBS were those running the CA-7 process that went wrong. The job of running CA-7 went, at least in part, to staffers working in Hyderabad who were paid 8-10 lakhs of rupees, a salary of roughly £9,000 - £11,000, according to an “urgent” job ad in February 2012.

RBS uses CA-7 to perform routine batch scheduling of jobs on the mainframe:

Batch scheduling software is used to process routine jobs and avoid the need for manual input: jobs are prioritized, scheduled and performed automatically. RBS runs updates on accounts on the mainframe concerned overnight, with thousands of batch jobs scheduled by CA-7 [from CA].
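For those who have never dealt with batch scheduling, the sketch below shows the general idea only: jobs queued with priorities run automatically, in order, with no manual input. It's a hedged illustration in Python, not how CA-7 itself is configured (CA-7 keeps its job definitions and dependencies in its own scheduling database), and the job names are made up.

```python
# Minimal sketch of priority-driven batch scheduling (illustrative only;
# this is not CA-7, which defines jobs in its own scheduling database).
import heapq
from dataclasses import dataclass, field
from typing import Callable

@dataclass(order=True)
class Job:
    priority: int                                  # lower number runs first
    name: str = field(compare=False)
    action: Callable[[], None] = field(compare=False)

def run_overnight_batch(jobs: list) -> None:
    """Run every queued job in priority order, with no manual input."""
    heap = list(jobs)
    heapq.heapify(heap)
    while heap:
        job = heapq.heappop(heap)
        print(f"running {job.name}")
        job.action()                               # e.g. post the day's transactions

# Hypothetical overnight run: post transactions first, then cut statements.
run_overnight_batch([
    Job(priority=2, name="generate_statements", action=lambda: None),
    Job(priority=1, name="post_transactions", action=lambda: None),
])
```

In the real thing, thousands of such jobs with dependencies between them run against live account data every night, which is why a botched change to the scheduler itself can stop a bank in its tracks.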

The important difference is the geographical gap between those running the CA-7 process, in Hyderabad, and those managing the CA-7 team – in Edinburgh.

One ex-RBS IT staffer pointed to a communication gap between teams in India and the staff in the UK managing them, and to how this could have slowed RBS’s response to the unfolding crisis last June.

“A lot of people miss the fact that there are very different cultural references and behaviors, lots of people in UK presume that they speak English so they must understand everything said in the same way that someone from the UK does and it's just not the case,” our source said.

Another consequence of outsourcing has been a loss of those skilled in running the mainframe and knowledgeable about how the mainframe’s owner, RBS, operates in what is a sensitive and demanding sector. One reason mainframes are so popular among banks is that they are reliable, so an RBS-style meltdown should be relatively rare.

Robin Bloomfield, professor of software and system dependability at City University, London, told The Reg that skilled IT staff are as important as the hardware because they get to know the individual systems and learn to spot early warning signs and apply the appropriate remediation before things escalate.

“Sometimes people see legacy equipment as a legacy issue and all you need do is plug in something more modern,” Bloomfield says. “But they are reliable because of the culture around them – the people around them, the safeguards. That can be ignored in an organization if it’s seen as an IT issue and ‘all we need is a black box’.”

Bloomfield, who specializes in dependability and safety of software-based systems and in operational risk, says he’s seen many cases in financial IT where the technology is treated as a black box – meaning it can be installed and operated without much thought to who runs it.

It is unclear whether RBS is reversing its policy of outsourcing or whether the team running the new mainframe will be brought back on-shore. We asked RBS what had happened to the old outsourced teams that ran the CA software last June but RBS did not respond. We asked RBS whether it had updated or changed the policies used to manage the risks associated with the mainframe to improve recovery as a result of last year’s outage, but – again – RBS did not respond.

As noted, the matter of outsourcing and running critical banking functions on legacy mainframes loaded with a spaghetti of dated code is not restricted to RBS.

One potential consequence of the June 2012 RBS meltdown is that other banks are forced to update or change their ways, too, especially if regulators act.

Regulator pass-the-parcel

The former Financial Services Authority (FSA) chairman Lord Adair Turner told Tyrie of the Treasury Select Committee that he wanted a full independent review to establish what had gone wrong at RBS and to “provide an assessment of the consequences and the subsequent management of the IT failure.

“On receipt of the independent review, we will consider whether further regulator action is required,” Turner wrote to Tyrie.

The FSA no longer exists, and its responsibilities have passed to the Financial Conduct Authority (FCA) and the Prudential Regulation Authority (PRA), with the job of getting to the bottom of what really happened at RBS falling to the FCA. In April the FCA said it had started an enforcement investigation into the crisis.

“The FCA will reach its conclusions in due course and decide whether or not enforcement action should follow that investigation,” the body said in a statement.

Should the FCA’s investigation throw up anything more than a one-off systems failure, then there’s every chance the fallout from RBS could settle on other banks, too.

There are three possible outcomes from the FCA’s investigation: it might take no further action, it could fine RBS, or it may propose regulation that would be enforced by the PRA. The latter would happen if the RBS crash was caused by failures in technology, risk management, disaster recovery or outsourcing that are also present inside other banks and lenders.

Since RBS went titsup there have been at least two more outages attributed to IT problems: up to 22 million customers of Lloyds Banking Group - which includes Halifax and the Bank of Scotland - were unable to use cash machines, debit cards or connect to their accounts via the web in October 2012. Up to 2.4 million customers of Co-op Bank were also blocked from their accounts. And it doesn’t just happen in the UK: technical issues in December 2011 took out ATMs, retail points of sale and telephone banking systems for Commonwealth Bank of Australia.

Chris Skinner, chairman of banking and financial services networking group the FSClub, tells The Reg:

“The problem is that most financial institutions are hamstrung by their heritage – the mainframe,” Skinner says. “We are seeing more outages because technology’s part in banking and finance is becoming more common.”

Skinner has spoken out on banks running IT systems that he says are no longer fit for purpose. Most problems are relatively small – resulting in outages of just a few hours as in the case of Lloyds and Co-op – and are often down to upgrades to the old systems that aren’t applied properly.

But there’s a growing awareness of a need to overhaul IT to avoid becoming the next RBS and – also – to dodge regulation. “Banks are under pressure to keep up with the speed of technology change,” Skinner said. “Every bank I know has been through a core system replacement or is undergoing one.”

If regulation is mandated then don’t expect a quick fix. Unlike, say, the nuclear or airline industries, where accidents have led to investigations that have produced operational and safety standards, similar standards will be difficult to establish for financial systems because of a fundamental refusal to share information.

Financial services is a competitive sector and IT systems are varied and valued, so it’s unlikely companies will volunteer the kinds of performance data, risk assessments or outage information needed to help regulators impose standards or force change. They will fear ceding competitive advantage should they reveal what they’re running and where they are exposed.

Dave Cliff, a contributor to last year’s Government Office for Science report on the Future of Computer Trading in Financial Markets, told us policy makers are already several steps behind the markets because there’s not enough raw data to inform their decisions.

Cliff was talking about financial markets’ increasing reliance on High-Frequency Trading (HFT) - a system of trading dependent on algorithms that execute at millisecond speeds. It is suspected that HFT has been responsible for exaggerating wild market swings, and it certainly led to the downfall of Knight Capital Group in August 2012. Knight lost more than $450m after a trading algorithm it had used bought and sold shares at the wrong prices before it was noticed or could be stopped. Knight was sold to Getco as a result of the crippling loss. But when it comes to HFT, the hedge funds and traders who write and employ them don’t like to disclose their algorithms or the special systems running them.

That might just be HFT, but retail banks are equally coy.

“Something that’s a major systemic issue when we talked about what happened at Knight Capital and RBS is we are just recycling stories we heard over coffee or in a bar,” Cliff told The Reg. “That’s the difference between companies in banking and other advanced areas where computer technology creates risks and where you have advanced legislation.

“If an accident occurs - like a plane crash - there is a very detailed inquiry where all the participants are required by law to disclose all the information the investigators want.”

A year on, RBS has committed hundreds of millions of pounds to a new mainframe and disaster recovery to avoid a repeat of last year’s disaster, but new hardware alone will be insufficient if the bank has not changed the people, software or processes that saw millions of customers locked out of their accounts.

And, with regulators only just warming up and unlikely to get the information they need to force a meaningful or deep change, we should expect more account outages thanks to our banks' faulty computers. ®