Amazon fine print limits potential credits for cloud outage

Rackspace CTO talks EC2 failure


Amazon's EC2 contract promises its infrastructure cloud will provide 99.95 per cent "uptime" over the course of a year. But that doesn't mean the company will dish out credits in the wake of the outage that affected some users for as many as four days, if not more.

Though the EC2 service level agreement says users will be eligible to receive credits if the service doesn't meet a 99.95 per cent "annual uptime percentage" within a particular geographical region, this only applies to users who have spread their applications across multiple "availability zones" – subsections of Amazon's regional services designed not to fail at the same time.

The outage did hit multiple zones in EC2's East Region – served up from at least one facility in Northern Virginia – but it appears that multiple zones were affected for only about three hours.

Amazon has yet to provide details about the outage, and many third-party commentators have failed to realize that the service level agreement is more complex than it seems. The availability zone setup continues to cause confusion, in part because people don't actually read SLAs, but also because Amazon has yet to describe how the zones are designed and how they operate.

At 1:41am Pacific time on Thursday, Amazon said in a post on its status page that it was investigating connectivity issues with its Elastic Compute Cloud (EC2) service, which provides on-demand access to processing power across the net. According to one status message, the problem began with a "network event" that caused the service to re-mirror a large number of Elastic Block Store (EBS) volumes in the East Region. Elastic Block Store provides storage that's independent of particular server instances on EC2.

Amazon divides EC2 into multiple geographic regions, and some regions – including the East Region – are divided into multiple "availability zones". Amazon has always said that these zones are protected from each other's outages. "Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones," the company's website reads. But the East Region outage spread across multiple zones.

Some felt that Amazon had broken its promise over availability zones. But the particulars of the service-level agreement add a new twist to this discussion. "'Annual Uptime Percentage' is calculated by subtracting from 100% the percentage of 5 minute periods during the Service Year in which Amazon EC2 was in the state of 'Region Unavailable'," the agreement reads. "'Region Unavailable'...means that more than one Availability Zone in which you are running an instance, within the same Region, is 'Unavailable' to you."
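To make the arithmetic concrete, here is a minimal sketch of the calculation the agreement describes. It is an illustration, not Amazon's own method: the 365-day service year, the zone labels, and the function names are assumptions, the three-hour figure is the rough estimate reported for the multi-zone portion of the outage, and the SLA's own definition of when a zone counts as "Unavailable" is not modeled.

```python
# Illustrative sketch of the SLA arithmetic quoted above.
# Assumptions (not from the SLA text): a 365-day service year,
# hypothetical zone labels, and hypothetical function names.

PERIOD_MINUTES = 5        # the SLA counts 5-minute periods
SERVICE_YEAR_DAYS = 365   # assumed length of a "Service Year"
SLA_THRESHOLD = 99.95     # promised annual uptime percentage


def periods_in_year(days=SERVICE_YEAR_DAYS):
    """Number of 5-minute periods in the service year."""
    return days * 24 * 60 // PERIOD_MINUTES


def region_unavailable(zones_in_use, unavailable_zones):
    """True when more than one zone the customer runs instances in is down."""
    return len(set(zones_in_use) & set(unavailable_zones)) > 1


def annual_uptime_percentage(unavailable_periods, days=SERVICE_YEAR_DAYS):
    """100% minus the percentage of periods spent 'Region Unavailable'."""
    return 100.0 - 100.0 * unavailable_periods / periods_in_year(days)


# A customer in a single zone never meets the "more than one zone" test.
single_zone = region_unavailable(["zone-a"], ["zone-a", "zone-b"])

# A customer spread over two zones, both down, does meet it.
multi_zone = region_unavailable(["zone-a", "zone-b"], ["zone-a", "zone-b"])

# Roughly three hours of multi-zone trouble is 36 five-minute periods.
three_hour_periods = 3 * 60 // PERIOD_MINUTES
uptime = annual_uptime_percentage(three_hour_periods)

print(single_zone, multi_zone)
print(f"{uptime:.4f}% uptime, credit eligible: {uptime < SLA_THRESHOLD}")
```

Under these assumptions, a 99.95 per cent threshold leaves room for about 4.4 hours of "Region Unavailable" time in a year, so three hours on its own, with no other qualifying downtime that year, would not push a customer below it.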

According to Amazon's status messages, multiple availability zones experienced problems for about three hours on Thursday, then the problem was isolated in the zone where it began. John Engates, the chief technology officer at Rackspace, which operates a cloud service similar to Amazon's, believes Amazon is unlikely to provide many credits in the wake of the outage.

"More than one availability [zone] would have to go down for you to receive a credit, and you have to be down for a considerable amount of time," Engates told us during a conversation at this week's OpenStack design summit in Santa Clara, California. "I really doubt they'll pay a lot on credits."

Rackspace's Cloud Servers service does not provide a setup analogous to Amazon's availability zones. The Rackspace service-level agreement guarantees uptime for particular components within each service region, including its network, its data center infrastructure, and individual hosts. The company operates separate data centers in Texas, Chicago, and London.

Judging from Amazon's status messages, Engates says, he believes that Amazon's outage spread across multiple availability zones because the company was using availability zones to mirror Elastic Block Store data for other zones. "Rather than replicating data within a zone, I think they were replicating between zones," he said. "And it seems that when they had a failure in one zone, traffic waterfalled into the other zones. It's like if there was a fire in a hotel. We would have to evacuate to the hotel across the street, and there may not be enough room in the hotel across the street for everyone to get a room."

It appears that the outage affected only those who were using Amazon's Elastic Block Store service.

Engates says that Amazon's cloud service and its service-level agreement are set up in such a way that users must ensure redundancy across zones – if not across entire regions. "You have to think about how to allocate your application across multiple resources to maximize that SLA," he said. "Those that did so – Netflix is one example of a big customer – did not experience the same kind of outages as people who were very localized. You could put some of the blame on Amazon, but some of the blame on the customer."

Yes, multiple zones were hit by the outage. But Amazon does not promise 100 per cent availability. The company has said, however, that it is unable to restore EBS volumes for some customers. About 0.07 per cent of EBS volumes in the East Region, a status message indicates, "will not be fully recoverable". ®
