Whoops, there goes my cloud: What to do when AWS forsakes you
Multiple providers? Hybrid approach? Or, just take the hit?
Comment It's been an interesting period for cloud services, with both Amazon and Skype suffering major outages. In Amazon's case it was a good old-fashioned network interruption that caused significant knock-on effects (it released a very frank explanation that is well worth a read). As for Skype, it was a configuration change that broke the world.
Makes you think though, doesn't it? Even the biggest cloud services aren't immune to problems. So what's the implication for companies using the cloud for service provision?
First of all, it's important to realise that the recent failures in the cloud change nothing. Just as gravity existed long before Newton's apple fell, so the potential for system failure has been around for a long, long time prior to AWS and Skype going AWOL.
All that's happened is that we've had a bit of proof that the possibility of failure is real. One would hope, therefore, that when choosing to move services into the cloud an organisation weighs up the risks alongside the benefits.
You're a minnow
The first thing you need to realise is that you're insignificant to a big cloud provider. Of course, they pride themselves on customer service levels, and when they're signing you up they'll assure you of your importance to their business.
But face it: if (say) your Office 365 world crumbles to dust next Thursday, nothing you can say to Microsoft will get it fixed any more quickly. You'll get some service credits, and they'll be utterly insignificant in the context of the productivity or sales you lost in the meantime. (That's not a dig at Microsoft, of course – service credits are universally pointless because they seldom provide recompense for the actual loss.)
You could, of course, decide not to go with the big cloud providers and instead choose one of the smaller, more boutiquey suppliers. Of course this has its own risks – smaller companies may well not have the scale you require to burst your processing requirements, or geographic redundancy, or the same levels of architectural or support expertise.
So although your business is more of a big deal to them than to the big boys, the small-supplier approach has its drawbacks. And in many cases it's not an option: for apps such as Salesforce, for instance, there's only one supplier.
Planning for outages
You should assume, then, that at some point a given cloud service will die. Hence you should plan what you're going to do about it. And there are several ways to go.
First is the approach of using multiple providers. Not so easy if you're an Office 365 user because there's only one provider, but if you're using cloud-based server provision then you can spread it between providers or at least between regions within a single provider.
The latter can often be sufficient – and there's a very cool blog article from Netflix describing how their service is split between AWS regions and hence was largely unaffected by the outage on 20 September.
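The multi-region idea boils down to: prefer your usual region, but route around it when it goes dark. A minimal sketch of that failover logic follows – the region names mirror AWS's, but the health checks here are stubs rather than real AWS API calls, and the function names are illustrative:

```python
import random

# Hypothetical deployment regions; the health map below stands in for
# whatever real health checks your monitoring provides.
REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]

def healthy_regions(health):
    """Return the regions whose (stubbed) health check passes."""
    return [r for r in REGIONS if health.get(r, False)]

def pick_region(health, preferred="us-east-1"):
    """Prefer the usual region, but fail over to any healthy one."""
    candidates = healthy_regions(health)
    if not candidates:
        raise RuntimeError("no healthy region available")
    return preferred if preferred in candidates else random.choice(candidates)

# Example: us-east-1 is down, so traffic shifts to a surviving region.
status = {"us-east-1": False, "us-west-2": True, "eu-west-1": True}
print(pick_region(status))
```

The same shape works for spreading across providers rather than regions – the list just holds provider endpoints instead.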
Next is the hybrid approach, where you have your primary systems on-premise and use the cloud for the Disaster Recovery setup. In this case a cloud failure would simply cause a loss of resilience, so unless something on-prem decided to turn up its toes during the cloud outage you'd see no ill effects.
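The hybrid point can be put more starkly: the cloud DR site only matters when the on-prem primary has already failed. A toy illustration (function names are mine, not from any particular DR tool):

```python
def service_available(primary_up: bool, dr_up: bool) -> bool:
    # Users are served as long as either site is up.
    return primary_up or dr_up

def resilient(primary_up: bool, dr_up: bool) -> bool:
    # You only have a fallback while both sites are healthy.
    return primary_up and dr_up

# A cloud outage alone: the service keeps running on-prem,
# but you've quietly lost your safety net.
print(service_available(primary_up=True, dr_up=False))  # True
print(resilient(primary_up=True, dr_up=False))          # False
```

The danger window is the overlap: an on-prem failure during the cloud outage is the only combination that takes you down.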
There's a third approach, and one that's sometimes forgotten: accept the risk and live with the downtime should it happen. This is generally a financial decision, but it's perfectly reasonable to give your business the choice between a risk of (say) two days' downtime a year and a bill for fifty thousand quid for the resilient option.
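That choice is straightforward arithmetic. The £50,000 resilience figure comes from the example above; the per-day downtime cost is an assumption plugged in purely for illustration:

```python
# Illustrative numbers: resilience_cost is from the article's example;
# cost_per_downtime_day is an assumed figure for lost productivity/sales.
resilience_cost = 50_000          # annual cost of the resilient option (£)
expected_downtime_days = 2        # assumed downtime per year if you accept the risk
cost_per_downtime_day = 20_000    # assumed loss per day of downtime (£)

expected_loss = expected_downtime_days * cost_per_downtime_day
print(f"Accept the risk: expected loss £{expected_loss:,}")   # £40,000
print(f"Buy resilience:  £{resilience_cost:,}")               # £50,000
```

With these made-up numbers, accepting the risk is the cheaper bet; a business losing £30,000 a day would reach the opposite conclusion, which is exactly why it should be the business's call.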
Stick with your providers
If you have a big lump of unexpected downtime, it's natural for your confidence in the vendor to be dented. Human nature is to mistrust the vendor until the service has been back up for a while and they've shown it can work properly for extended periods.
But don't be hasty to dump the vendor in favour of an alternative. Amazon's wash-up following their AWS outage notes that: “There are several actions we'll take immediately to avoid a recurrence of Sunday's DynamoDB event,” and then goes on to talk about increasing capacity for metadata, improving the monitoring of the service, tweaking the way nodes communicate with each other and segmenting the nodes in their systems to alleviate overload.
Yes, it had a problem, but the system has come out the other side working better, better monitored and with a reduced propensity to fall over.