The biggest British Airways IT meltdown WTF: 200 systems in the critical path?
It's not the velociraptor you can see that kills you
One of the key principles of designing any high availability system is to make sure that only vital apps or functions use it, and that nothing else does – sometimes referred to as KISS (Keep It Simple, Stupid).
High availability or reliability is always technically challenging at whatever systems level it is achieved, be it hardware or software. The colossal systems failure at British Airways has been blamed on a "power surge" trigger followed up by a messaging system failure.
However, within the comments of the BA chief executive there is one telling statement:
"Tens of millions of messages every day that are shared across 200 systems across the BA network and it actually affected all of those systems across the network."
Sorry for the text speak, but WTF? How does it require 200 systems to issue a boarding pass, check someone in and pass their security details on to the US – even if they aren't going there? Buried deep in The Register comments on the article is an allegedly former BA employee claiming that this is in fact the case, that all of these systems are required for BA to function. How did BA get to the point that there are 200 systems in the critical path?
The problem with current IT systems is that even with no high availability elements in path, once an initial burn-in period has passed, they are hugely reliable. Failures even in this setting are sufficiently rare that unless you look at IT systems as a whole, it can seem like they never occur.
So, sure, we need this new function in path: just add another server (virtual machine) and let's go. Maybe spread it over a couple of data centres; that won't be a problem. Can't be a problem – we've never seen a failure, so why do those IT guys keep telling me I have to spend millions on refactoring the system to ensure it is highly available?
Another organisation that struggled to internally communicate the true nature of the reliability risk they were facing was NASA – and the consequences of that were even more visible than BA's. This also demonstrated a spectacularly poor understanding of the nature of risk on the part of senior management.
During the Rogers Commission's investigation into the Challenger disaster, Richard Feynman examined the NASA approach to estimating failure rates. NASA's management believed that the risk of shuttle failure was "necessarily" one in 10^5 (one in 100,000). This figure seemed "fantastical" to Feynman and so he estimated the failure rate himself and obtained a figure of one in 100.
Moreover, once he involved NASA's engineers in the calculation, this figure came in between one in 50 and one in 200. How could there be such a disconnect between the engineers' view of the failure rate of the system they designed, and the management's view of the system they commissioned?
In the case of the shuttle, many engineers had raised the issue that ultimately led to the failure, but their warnings fell on deaf ears. Indeed, it was far from clear that even senior NASA management were actually capable of understanding the warnings their engineers were raising – often having neither an engineering nor a scientific background.
NASA were well aware of the exposure that a failed space shuttle, likely to be both explosive and public, would cause. Indeed, from a risk consequence perspective, the outcome was regarded as having similarly negative connotations to the assassination of a president. So how could they get it so wrong? There are almost no organisations – actually there are none – that like or encourage prophets of doom.
So what, if any, are the parallels for large-scale IT systems?
For many organisations (citing practically all UK government utterances on IT issues as evidence), senior management has practically no meaningful IT knowledge beyond the ability to press the buttons on their smartphone or tablet. Within the IT function, senior management figures are generally chosen – by non-technical managers – for their management rather than their technical abilities.
How many organisations, BA included, have a detailed model of why their systems are fit for purpose? Just as the space shuttle was "necessarily" good for one in 10^5, how many IT systems are claimed to be five nines on the basis of a box and line diagram showing the presence of duplicate resilient systems? Are the models used by IT management to understand the underlying failure rate of their systems any better than the ones used by NASA management to achieve their necessary 99.999 per cent?
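To see why a box and line diagram is not a model, here is a minimal sketch of the arithmetic such a diagram hides. The availability figures are invented for illustration and are not BA's; the function names are hypothetical. It assumes the two duplicated servers fail independently of each other, then adds a single shared dependency (a power feed, say) that both rely on.

```python
HOURS_PER_YEAR = 24 * 365


def parallel(*availabilities):
    """Availability of redundant units, ASSUMING their failures are independent."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down


def series(*availabilities):
    """Availability of units that must all be up at the same time."""
    all_up = 1.0
    for a in availabilities:
        all_up *= a
    return all_up


def downtime_hours_per_year(availability):
    return (1.0 - availability) * HOURS_PER_YEAR


# Illustrative assumptions, not measured figures.
server = 0.999         # one server on its own: three nines
shared_power = 0.9995  # something both "independent" servers depend on

# What the slide shows: two boxes, so the pair must be better than five nines.
on_paper = parallel(server, server)

# What is actually deployed: the pair sits behind the shared dependency.
with_shared = series(shared_power, parallel(server, server))

for label, a in (("on paper", on_paper), ("with shared dependency", with_shared)):
    print(f"{label}: {a:.6f} available, "
          f"{downtime_hours_per_year(a):.2f} hours down per year")
```

Under these assumptions the duplicated pair clears five nines on paper, but the whole arrangement can never be better than the one thing the two boxes have in common.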
It is unlikely that any of the 200 systems BA needs to be functional to keep operating is a simple computational unit. Each of these sub-systems will itself have complex internal interdependencies between the servers, network, storage and the software that come together to deliver the function. The sheer number of potential points of failure that BA was exposed to is hard to believe. Fortunately, as a default they fail very, very rarely, so it is easy to believe that failure simply cannot occur.
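The effect of putting 200 systems in the critical path can be sketched with one line of arithmetic. The sketch below assumes, purely for illustration, that every system has the same availability, that failures are independent, and that all 200 must be up for the business to operate; the per-system figures are made up and the function names are hypothetical. With each system at 99.99 per cent, the chain as a whole comes out at roughly 98 per cent.

```python
def chain_availability(per_system, n_systems):
    """Availability of n systems that must ALL be up, assuming independent failures."""
    return per_system ** n_systems


def required_per_system(target, n_systems):
    """Per-system availability needed for the whole chain to hit the target."""
    return target ** (1.0 / n_systems)


N_SYSTEMS = 200  # the figure quoted by the BA chief executive

# Illustrative per-system availabilities: three, four and five nines.
for per_system in (0.999, 0.9999, 0.99999):
    total = chain_availability(per_system, N_SYSTEMS)
    print(f"each system {per_system:.5f} -> whole chain {total:.5f}")

needed = required_per_system(0.99999, N_SYSTEMS)
print(f"for a five-nines chain, each system needs {needed:.9f}")
```

Even if every one of the 200 systems is individually five nines, the chain is not; to get five nines end to end, each system would need to be better than seven nines under these assumptions.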
It is clear that BA is suffering from criticality bloat. They have permitted systems to be added to the critical business path willy-nilly. The systems fail so rarely that surely this cannot be a problem – but what about the system you add to the critical delivery path but don't know about?
When confronted with complexity, people have an inevitable tendency to retreat into hope and historic belief. One consequence is the conviction that if an event hasn't happened yet, it is very unlikely ever to happen. This is a close cousin of what probability circles call the gambler's fallacy, the basis of a significant fraction of the earnings currently achieved on the web – a great example is the so-called guaranteed winning "doubling" strategy for roulette.
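The "doubling" strategy is a tidy illustration of a rare failure that eventually wipes out a long run of small successes. The sketch below works out the exact expected profit of one run of the strategy, assuming an even-money bet on a single-zero (European) wheel and a bankroll that covers only a fixed number of consecutive losing bets. Both are assumptions made for the example, and the function name is hypothetical.

```python
from fractions import Fraction

# Assumption: even-money bet on a single-zero wheel, 18 winning pockets out of 37.
P_LOSE = Fraction(19, 37)


def doubling_expected_value(max_bets):
    """Exact expected profit, in units of the first stake, of one run.

    The player doubles the stake after every loss and stops at the first win
    (net profit: 1 unit) or after max_bets straight losses (bankroll gone).
    """
    p_bust = P_LOSE ** max_bets
    loss_if_bust = 2 ** max_bets - 1  # 1 + 2 + 4 + ... + 2^(max_bets - 1)
    return (1 - p_bust) * 1 - p_bust * loss_if_bust


for max_bets in (1, 5, 10, 20):
    p_bust = float(P_LOSE ** max_bets)
    ev = float(doubling_expected_value(max_bets))
    print(f"{max_bets:2d} bets covered: bust chance {p_bust:.7f}, expected profit {ev:+.4f}")
```

The deeper the bankroll, the rarer the bust and the more convincing the winning streak looks, yet the expected profit is negative at every depth and gets worse as the depth grows. Algebraically it simplifies to 1 - (38/37)^n for n bets covered.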
For any IT dependent organisation, which in reality is pretty much everything these days, a fundamental question should be: Why does the organisation believe its IT is sufficiently robust to allow it to meet its operational goals? What is the evidence that belief is based on? How has the evidence been validated? Is there a predictive model, not a picture on a slide deck, of why the system as a whole stays up?
Just like velociraptors, it's not the one you can see that kills you. ®