BA IT systems failure: Uninterruptible Power Supply was interrupted
Potentially by a panicking contractor, if reports are to be believed
Analysis An IT bod from a data centre consultancy has been fingered as the person responsible for killing wannabe budget airline British Airways' Boadicea House data centre – and an explanation has emerged as to what killed the DC.
Earlier this week Alex Cruz, BA's chief exec, said a major "power surge" at 0930 on Saturday 27 May caused the airline's systems to "collapse". Its Boadicea House (BoHo) data centre went down for around a quarter of an hour, seemingly taking the airline's failover systems with it.
The resulting mess caused total chaos for tens of thousands of passengers, stranding them overseas, sending their luggage hither and thither, and grounding aircraft while frantic staff tried to sort out the madness.
Today the Daily Mail fingered a contractor from CBRE Global Workplace Solutions as the culprit for the power restoration problem. BA said that it was the uncontrolled restoration of power that fried BoHo's supply rather than the shutdown.
A spokeswoman for CBRE told The Register: "We are the manager of the facility for our client BA and fully support its investigation. No determination has been made yet regarding the cause of this incident. Any speculation to the contrary is not founded in fact."
It appears, however, that BA's IT was experiencing problems before the main outage.
On the Friday before the catastrophic outage, Reg reader Anton Ivanov said his booking confirmation emails took eight hours to reach him instead of the usual minutes between clicking "confirm" on the website and receiving the email.
Failover. Haven't you heard of it?
Although Cruz blamed a "power surge" for knocking out BoHo, the National Grid confirmed to El Reg that there were no problems with its transmission network in the Heathrow area over the weekend. Similarly, Scottish and Southern Electricity Networks, the local electricity distribution network operator, told Reuters: "The power surge that BA is referring to could have taken place at the customer side of the meter."
This doesn't explain why the outage then killed BA's second data centre – or, indeed, why the contagion spread around the world and grounded all BA flights at the height of the failure.
Informed sources told us that the power failure did occur as described by BA – but they weren't sure how or why the failover DC itself promptly keeled over when called upon.
The airline's UK IT infrastructure is said to span more than 500 cabinets in six halls across two different data centres, both of which are no more than a mile from the eastern end of Heathrow's two runways. Neither is under the flightpaths. BoHo has been in the airline's ownership since it was called the British Overseas Airways Corporation, though the DC's internals have been replaced a good few times since the dusty days of Empire.
BoHo's uninterruptible power supplies (UPSes) were replaced three years ago with equipment from electrical firm Socomec, which refused to comment for this article. The firm's Smart Powerport product can be specified to supply up to 2.4MW of backup juice. A very rough rule of thumb for DC power consumption is around 900kW per 10,000 square feet, and from aerial views (no, we're not going to pinpoint it on a map for you) BoHo looks to be around about that size.
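For readers who want to check that back-of-envelope sum, here is a minimal sketch. The 900kW per 10,000 square feet figure and the 2.4MW ceiling come from the paragraph above; the function names and the 10,000 square feet floor area are our own assumptions, not measurements of BoHo.

```python
# Rule-of-thumb figures quoted in the article.
KW_PER_10K_SQFT = 900.0   # rough DC power draw per 10,000 square feet
UPS_MAX_KW = 2400.0       # 2.4MW, the quoted upper limit for the UPS product


def estimated_load_kw(floor_area_sqft: float) -> float:
    """Estimate data centre load from floor area using the rule of thumb."""
    return floor_area_sqft / 10_000 * KW_PER_10K_SQFT


def ups_headroom_kw(floor_area_sqft: float, ups_capacity_kw: float = UPS_MAX_KW) -> float:
    """Spare UPS capacity left after the estimated load (negative means undersized)."""
    return ups_capacity_kw - estimated_load_kw(floor_area_sqft)


if __name__ == "__main__":
    assumed_area = 10_000  # assumption: roughly the size suggested by aerial views
    print(estimated_load_kw(assumed_area), ups_headroom_kw(assumed_area))
```

On those assumptions the estimated load is 900kW, which would leave about 1,500kW of headroom if the UPS were specified at the full 2.4MW. We do not know what capacity BA actually bought.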
We are told that the two DCs normally operate in active:active configuration. Power was lost at BoHo but, other than Cruz's odd reference to a "power surge" that affected nobody else in the area, it is not yet known for certain what caused the initial power loss. What caused the second (failover) DC to go down is still a mystery for now, though there are plausible technical explanations.
Too much juice?
When mains feed is lost, the UPS (if properly specified) should have enough battery power to keep the DC running in the minute or two it normally takes the backup generators to spool up to full power, a data centre source who spoke to us on condition of anonymity said. In most common setups, we were told, backup power, regardless of whether it is from the standby batteries or the generators, flows through the UPS.
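The "properly specified" test can be sketched as a simple runtime check. Everything numeric here is a hypothetical illustration: the battery energy, the load and the two-minute generator start are assumed values, and the sum ignores inverter losses and the way real batteries deliver less energy at high discharge rates.

```python
def battery_runtime_minutes(battery_kwh: float, load_kw: float) -> float:
    """Idealised runtime: stored energy divided by load, converted to minutes."""
    return battery_kwh * 60 / load_kw


def bridges_generator_start(battery_kwh: float, load_kw: float,
                            generator_start_minutes: float = 2.0) -> bool:
    """True if the batteries can carry the load until the generators are ready."""
    return battery_runtime_minutes(battery_kwh, load_kw) >= generator_start_minutes


if __name__ == "__main__":
    # Hypothetical figures: 75kWh of usable battery feeding a 900kW load.
    print(battery_runtime_minutes(75, 900))   # idealised minutes of runtime
    print(bridges_generator_start(75, 900))   # compared with a 2-minute start
```

In this idealised sum, 75kWh against a 900kW load gives five minutes, comfortably more than the minute or two our source mentions.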
The UPS can also have a secondary function of smoothing the current from the backup generators so the servers are only fed 240V at 50Hz, rather than the fluctuating voltages and frequency produced by a local genset as the load varies.
If, our source said, the power control software controlling the multiple feeds into the data centre didn’t get the switch between battery and backup generator supply right at the critical moment – or, potentially, if someone panicked and interrupted the automatic switchover sequence – it could have resulted in both the battery supply and the generator supply being briefly connected in series to the power bus feeding the racks. That would result in the data centre’s servers being fed 480V instead of 240V, causing a literal meltdown.
This is one potential scenario that could have caused the physical damage to BoHo’s infrastructure that BA’s Cruz described. It fits what the Press Association reported late last night, based on a leaked email:
Bill Francis, Head of Group IT at BA's owner International Airlines Group (IAG), has sent an email to staff saying an investigation so far had found that an Uninterruptible Power Supply to a core data centre at Heathrow was over-ridden on Saturday morning. He said: "This resulted in the total immediate loss of power to the facility, bypassing the backup generators and batteries. This in turn meant that the controlled contingency migration to other facilities could not be applied. After a few minutes of this shutdown of power, it was turned back on in an unplanned and uncontrolled fashion, which created physical damage to the system, and significantly exacerbated the problem."
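To illustrate the switchover hazard our source described, here is a toy model of a break-before-make interlock. It is a sketch only: the class, its method names and the `override_interlock` flag are hypothetical and do not describe BA's, CBRE's or Socomec's equipment. The voltage sum simply encodes our source's claim that two 240V supplies tied together in series would put 480V on the bus.

```python
from enum import Enum

NOMINAL_VOLTS = 240  # per-supply voltage quoted in the article


class Source(Enum):
    BATTERY = "battery"
    GENERATOR = "generator"


class TransferSwitch:
    """Toy break-before-make switch; hypothetical, not any vendor's controller."""

    def __init__(self) -> None:
        # Assume the load is riding on batteries when the model starts.
        self.connected = {Source.BATTERY}

    def disconnect(self, source: Source) -> None:
        self.connected.discard(source)

    def connect(self, source: Source, override_interlock: bool = False) -> None:
        # Interlock: refuse to close onto a bus that another supply still feeds.
        if self.connected and not override_interlock:
            raise RuntimeError("interlock: bus still fed, disconnect first")
        self.connected.add(source)

    def bus_volts(self) -> int:
        # Encodes the source's series-connection scenario: voltages add up.
        return NOMINAL_VOLTS * len(self.connected)
```

In this model, disconnecting the battery before connecting the generator leaves the bus at 240V. Forcing the connection with the override while the battery is still attached gives 480V, the scenario our source suggested.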
What should have happened, and didn’t, was that the second data centre (either Comet House or one at Cranebank) should have more-or-less seamlessly picked up the load from BoHo.
“A data failure is quicker to fix than a hardware failure,” commented our source. “How has a power surge at one site killed the other? That sounds like data, not hardware and software, and that suggests [in relation to BA’s lack of communication about the cause of the failure] they’re worried about a PR problem.”
An uncommanded shutdown of the data centre may have caused corrupted data to be synchronised between the two as BoHo died. That in turn, said our source, could have caused the failover DC to be populated with enough bad data to crash all the systems depending on it – leaving BA with one dead DC, one live DC of no immediate use, and a ton of very angry stranded passengers.
There is a possibility that BA was operating an active:active:passive configuration, with the third DC acting as a cold standby populated with backed-up data, our source said. This fits with some web forum chat seen by El Reg dating back to Sunday 28 May as BA customers claimed to have seen old flight booking data in their online accounts which later vanished, presumably as more recent backups were populated to the airline’s systems.
Outsourcing not to blame, says everyone involved in it
Although various people have speculated that operations and jobs outsourced to India's Tata Consultancy Services (TCS) contributed to the cockup, both the airline and TCS vehemently deny it.
A BA spokeswoman told us: "On Saturday we experienced a power supply issue at one of our UK data centres. We have been very clear that this was a UK data centre, operated by UK staff. The issue is in no way related to any work being conducted by Tata Consulting Services. We are conducting an exhaustive investigation into every aspect of the IT outage."
The airline has not responded to Vulture South's detailed questions about disaster recovery and resilience.
Meanwhile, a TCS spokeswoman told El Reg: "As BA has already confirmed, the problems over the weekend were caused by a power supply issue and not due to outsourcing of IT services, so we can't comment further. BA and all its partners including TCS have been working very hard to restore the services fully."
BA has promised to publish its investigation, but has not indicated whether that will be before or after it faces the potential £150m bill for the disruption, a figure The Times reported as a possibility. The GMB trade union has demanded a halt to BA parent company IAG’s plans to offshore another 600 IT jobs. ®