Risky business: You'd better have a plan for tech to go wrong

Power outages, automation, rollback... and more

Analysis Back in the days of the mainframe, technology risk looked a lot different. You paid a lot of money for a big box in the corner, using software often written by the same supplier. If it went wrong, a nice techie came along and fixed it. Business moved slowly enough that the world didn’t come crashing down if you couldn’t process data for a few days.

Today, risk is a constant factor in business technology. Computing and software systems are so complex and interconnected that things are far more likely to snarl up, and the cause may not always be obvious. When they do, the cost of downtime to the business can be far greater than it used to be.

How much?

It is difficult to assess the cost of today’s downtime, and therefore to quantify outage risk. It depends on your type and size of business and your level of automation. Information Technology Intelligence Consulting (ITIC) took a stab at finding an average with an updated survey last June.

The company polled 709 global businesses in a web-based survey, backed up by selected first-person customer interviews, covering 47 verticals and a roughly equal spread of company sizes. Of those, 81 per cent estimated the average hourly cost of downtime at above $300,000, with a third citing costs of over $1m.

That might sound high, but it fits with a 2014 Gartner assessment, which cited an average of $5,600 per minute ($336,000 per hour).

Companies in top-tier and highly regulated verticals stand to lose far more from downtime, the ITIC report suggested. Finance, government, healthcare, retail, transportation and utilities were among those sectors hit the hardest by downtime. Costs for some larger companies in those sectors could top $5m, it reckoned.

Crash landing

When you account for not just the effects of lost business, but also regulatory fines, staff costs and customer compensation, overall downtime costs can soar. Some of these outages cost companies dear.

When BA’s systems crash landed last May because of what it said was a power supply issue, the price tag hit an estimated £80m. Parent firm International Airlines Group also saw its share value fall.

When Delta Air Lines suffered its own computer crash in August 2016, it cut its profit margins from 19-21 per cent down to 18-19 per cent after the costs of the IT outage slashed $150m from its pre-tax income in the third quarter.

Falling share prices may be temporary, but the cost to brand reputation can be longer-ranging and more difficult to calculate.

The source of IT risk

The outage risk comes from several sources in the data centre. According to a 2016 Ponemon study on data centre outages, the stuff you might expect to make your IT fall over didn’t figure that highly. Failing IT equipment accounted for just 4 per cent of outages.

Instead, the biggest outage sources were power related. The report, which surveyed US entities experiencing at least one unplanned outage over the past 12 months, found that uninterruptible power supply (UPS) system failure was still the biggest cause of data centre borkage. One in four unplanned outages stemmed from a dodgy power supply.

Denial-of-service attacks were the next biggest cause of outages, at 22 per cent, followed by Problem Exists Between Keyboard and Chair (PEBKAC): another 22 per cent of belly-ups in the data centre were down to accidental or human error.

Human error is also a big factor in network outages, according to a 2016 survey of 315 IT pros by Dimensional Research. A little less than half of those respondents said that human error caused a lot of their outages, and most said that changes to the network led to an outage or performance issue at least several times a year. That’s probably because most respondents relied on manual checks to ensure that the network was functioning properly.

Raining risk from the public cloud

Human error needn’t be a problem when you have control over your own users. You put in proper automation controls and threaten anyone who mistypes a command that you’ll publicly flagellate them with cat 5 cable. But the problem is less tractable when you don’t control the environment. Welcome to public cloud, which vendors would have you believe runs smoothly all of the time but frequently falls over.

To wit: Level 3 took down VoIP service across most of the US after someone misconfigured a call routing table. Bang! Google stopped users from adding more compute power to their cloud workloads after a bad algorithm tweak. Crash! IBM’s global load balancing and reverse DNS services went dark for almost an entire day after someone forgot to renew some domain names. Wallop!

These public cloud incidents can cost companies dearly when they happen, because those companies are putting mission-critical stuff in the cloud. When Amazon’s US-East-1 data centre went kaput last March, analytics firm Cyence estimated that it cost S&P 500 companies alone around $150m, and US financial firms about $160m.

Absorbing risk

What can we do to mitigate these risks? Start by identifying them and putting some numbers behind them. What are the most likely risks based on your current operating profile, and what would they mean for the business if they occurred? How are you offsetting them, and what must you do to plug the gap?
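One rough way to put numbers behind risks is to multiply how often you expect an incident, how long it lasts and what an hour of downtime costs, then rank the results. The sketch below does that in Python. The `Risk` class, the incident frequencies and the durations are hypothetical placeholders for illustration; only the $336,000 hourly cost comes from the Gartner figure cited earlier, and your own number may differ wildly.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """One outage scenario, with illustrative (not measured) estimates."""
    name: str
    incidents_per_year: float   # how often you expect it to happen
    hours_per_incident: float   # how long each outage is likely to last
    cost_per_hour: float        # what an hour of downtime costs the business

    def expected_annual_loss(self) -> float:
        # Frequency x duration x hourly cost: a simple expected-value estimate.
        return self.incidents_per_year * self.hours_per_incident * self.cost_per_hour


def rank_risks(risks):
    """Return risks sorted from largest to smallest expected annual loss."""
    return sorted(risks, key=lambda r: r.expected_annual_loss(), reverse=True)


# Hypothetical inputs; the hourly cost reuses Gartner's $336,000 average.
HOURLY_COST = 336_000
example_risks = [
    Risk("UPS failure", incidents_per_year=0.5, hours_per_incident=4, cost_per_hour=HOURLY_COST),
    Risk("Human error", incidents_per_year=2.0, hours_per_incident=0.5, cost_per_hour=HOURLY_COST),
    Risk("DDoS attack", incidents_per_year=1.0, hours_per_incident=3, cost_per_hour=HOURLY_COST),
]

if __name__ == "__main__":
    for risk in rank_risks(example_risks):
        print(f"{risk.name}: ${risk.expected_annual_loss():,.0f} per year")
```

The ranking is only as good as the estimates you feed it, but even crude figures make it easier to see which gap deserves budget first.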

It’s common to think of availability in terms of “nines”. Three nines (99.9 per cent uptime) allows your systems to go down for about eight hours and 46 minutes each year. Four nines (99.99 per cent) gives you 52 minutes of borkage, while five nines (99.999 per cent) allows for just five minutes and 15 seconds. Working out your tolerance for downtime will put you in one of those categories, and it may change depending on the workload you’re talking about.
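The arithmetic behind the nines is simple enough to check yourself. Here is a minimal Python sketch, assuming a 365-day year and rounding to the nearest second; the function names are made up for this example.

```python
def allowed_downtime_seconds(availability_percent: float, days_per_year: float = 365.0) -> float:
    """Seconds of downtime per year permitted by a given availability target."""
    seconds_per_year = days_per_year * 24 * 60 * 60
    return seconds_per_year * (100 - availability_percent) / 100


def format_duration(seconds: float) -> str:
    """Render a number of seconds as hours, minutes and seconds."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = allowed_downtime_seconds(target)
        print(f"{target} per cent uptime -> {format_duration(budget)} of downtime a year")
```

Swap in 365.25 days if you want to account for leap years; the budgets shift by a few seconds at most.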

That can help you to tier your chosen data centre facility properly. The Uptime Institute uses a proprietary four-tier system, starting with tier 1, which provides basic capacity – a dedicated space for IT systems, a UPS and dedicated cooling equipment. By the book, according to Colocation America, you get uptime of 99.671 per cent.

Some vendors in this field, however, have gone beyond the basic parameters of tier 1 with their kit. They claim to provide full redundancy with 99.999 per cent uptime – and on Flash (albeit that's six nines over a two-year period). You are, in other words, theoretically getting more for less by not having to pay for the full tier-4 price tag.

Tier 2 provides some redundant critical power and cooling components to make some maintenance easier, while tier 3 requires no shutdowns at all for replacement and maintenance. Finally, tier 4 is the granddaddy of reliable data centres; it’s a fault-tolerant facility that can ride out facility equipment failures and distribution path interruptions without affecting IT operations.

Tiering your data centre appropriately will help to avoid some of the technical failures I’ve discussed here, along with the 11 per cent of outages caused by water, heat or CRAC-based failures that Ponemon also saw in its 2016 survey.

As we’ve seen, though, a solid facilities infrastructure won’t be enough to lower your risk and get your nines up. Reducing the opportunity for human error is also critical. This is a function of proper process.

Doing change management properly is important. This includes understanding the implications of a change, testing it where possible, scheduling it for minimal fallout if something goes wrong, and understanding how to roll back to a prior configuration if necessary.
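The roll-back part of that process can be sketched in a few lines of Python. This is an illustration rather than a real change management tool: the configuration is a plain dictionary, and `apply_with_rollback`, the `change` callable and the `health_check` callable are all hypothetical names invented for the example.

```python
import copy


def apply_with_rollback(config: dict, change, health_check) -> bool:
    """Apply `change` to `config` in place; restore the old config if it fails.

    Returns True if the change was kept, False if it was rolled back.
    """
    # Take a snapshot first so there is always a known-good state to return to.
    snapshot = copy.deepcopy(config)
    try:
        change(config)
        if health_check(config):
            return True
    except Exception:
        # A change that blows up is treated the same as one that fails the check.
        pass
    # Roll back: replace the contents of the live config with the snapshot.
    config.clear()
    config.update(snapshot)
    return False


# Hypothetical example: a routing table that must always keep a default route.
def has_default_route(cfg: dict) -> bool:
    return "default" in cfg.get("routes", {})


def remove_default_route(cfg: dict) -> None:
    del cfg["routes"]["default"]


def add_backup_route(cfg: dict) -> None:
    cfg["routes"]["backup"] = "gw2"


if __name__ == "__main__":
    routing = {"routes": {"default": "gw1"}}
    print(apply_with_rollback(routing, add_backup_route, has_default_route), routing)
    print(apply_with_rollback(routing, remove_default_route, has_default_route), routing)
```

The point is the order of operations: capture the prior state before touching anything, verify the result automatically, and make the path back to the old configuration a single step rather than something worked out during the outage.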

With DDoS attacks becoming more prevalent, talking to your ISP or hosting company about options to mitigate attacks upstream becomes more important, as does understanding the risk to your own infrastructure from DDoS attacks incurred by any other of their tenants.

This is all important stuff, but what happens if the worst kind of risk hits, and a hurricane ends up flooding your data centre or wiping out your server room? Proper business continuity planning and disaster recovery strategies can help deal with less likely physical risks that have potentially large impacts. The first describes how a business should continue to function after a disaster hits. The second gives IT a playbook to recover as quickly as possible from that disaster.

We’re sure there’s something else you’d rather be doing with your time, but taking a quantitative, analytical approach to risk will help when it comes to allocating budget to help mitigate it. A stitch in time could save several nines – of downtime. ®




Biting the hand that feeds IT © 1998–2018