Hosts with the mosts: Getting to grips with SLAs for the cloud
Hey baby, I’m your telephone man
When email is down, businesses cease to function. If the email goes down due to a mishandling of the Exchange server, the appropriate sysadmin is found and duly berated.
Finger-pointing exercises are less well defined when email stops working because Gmail is down. Again. In this case the sysadmin in question bears no direct responsibility for the issue. His burden is the indirect responsibility of having recommended Google’s business-class email services.
A sysadmin caught in this particular trap can do little. He neither controls the servers in question nor has any way of ensuring that an appropriate Google sysadmin is on the job. Google quite famously doesn’t take phone calls. A calm reminder about how to make use of whatever backups and contingencies exist is all a sysadmin in such a situation can muster. One can only trust that Google will live up to its Service Level Agreement (SLA).
It is perhaps unfair to single out Google for this theoretical exercise; it has proven able to live up to its SLA. It offers a massive array of services with outages so short and infrequent that each one is news. Google has become the poster child for upholding a punishing SLA.
It is also the poster child for “not getting it” regarding customer service. Microsoft earns some points over Google here; though limited, it offers phone, live chat and even Twitter support for many of its services.
Amazon offers yet another approach; you may pay for whichever level of support you feel is appropriate. One-on-one online support is available starting with the basic support package. Phone support starts at $400 and goes up from there. Still other hosted providers seem to treat support as nothing more than a public relations requirement.
Trust in me
Regardless of how well the technical requirements of an SLA are executed, there is a sense of helplessness experienced by those asked to trust in that SLA. People aren’t very good at bearing statistical uptime in mind when a critical service goes down at an inconvenient time. The quality and type of customer service are an important, though often neglected, consideration in any hosted service SLA.
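To make "statistical uptime" concrete, here is a minimal sketch that converts an uptime percentage into the downtime it still permits. The function name, the 30-day period and the sample percentages are illustrative assumptions, not figures taken from any provider's SLA.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float = 30.0) -> float:
    """Return the minutes of downtime permitted in a period at a given uptime percentage.

    Both the 30-day default period and the percentages used below are
    assumptions chosen for illustration only.
    """
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    # Hypothetical uptime targets, not quoted from a real agreement.
    for pct in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(pct)
        print(f"{pct}% uptime still permits about {minutes:.1f} minutes down per 30 days")
```

Even a seemingly strict figure leaves room for an outage long enough to be noticed, which is why the statistic offers little comfort to the person waiting for email to come back.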
Such feelings may not be entirely rational, but they are human. People need to feel in control. When something goes wrong, it is simply not enough to fix it quickly. We require reassurance that the problem is acknowledged and being worked on. A timeframe for repairs is vital; downtime costs money and past a certain point backup plans need to be engaged.
Some of the support issues can legitimately be solved through automation. Service dashboards let customers know that an outage is known and being worked on, even in cases where live support is not offered. Google and Microsoft both offer serviceable examples. Google Apps has a status page for select applications. Microsoft’s Windows Live services are similarly monitored. Microsoft’s Azure cloud also has a comprehensive offering.
How these status pages are handled is critical. Consider both Google’s approach to an incident on 2011-03-09 and Microsoft’s approach to an incident on 2011-03-16. In both cases, incidents were handled with professionalism. As soon as the support desk became aware of the incident it was reflected on the status page. Users who knew about the status pages, and checked them, were kept in the loop throughout both outages.
As professional as both approaches are, their real-world serviceability has limits. Automated support is completely inadequate when downtime is costing your business thousands, or millions, of dollars an hour.
Hosted cloud services are risky. The right SLA is critical to the success of hosted services in your organisation. Selecting a provider with the right mix of support options is as vital as selecting one that can deliver on its promises of high uptime.
Trevor Pott is a sysadmin for a small-ish company based in Edmonton, Canada.