
Amazon fine print limits potential credits for cloud outage

Rackspace CTO talks EC2 failure


Amazon's EC2 contract promises its infrastructure cloud will provide 99.95 per cent "uptime" over the course of a year. But that doesn't mean the company will dish out credits in the wake of the outage that affected some users for as many as four days, if not more.

Though the EC2 service level agreement says users will be eligible to receive credits if the service doesn't meet a 99.95 per cent "annual uptime percentage" within a particular geographical region, this only applies to users who have spread their applications across multiple "availability zones" – subsections of Amazon's regional services designed not to fail at the same time.

The outage did hit multiple zones in EC2's East Region – served up from at least one facility in Northern Virginia – but it appears that multiple zones were affected for only about three hours.

Amazon has yet to provide details about the outage, and many third-party commentators have failed to realize that the service level agreement is more complex than it seems. The availability zone setup continues to cause confusion, in part because people don't actually read SLAs, but also because Amazon has yet to describe how the zones are designed and how they operate.

At 1:41am Pacific time on Thursday, Amazon said in a post to its status page that it was investigating connectivity issues with its Elastic Compute Cloud (EC2) service, which provides on-demand access to processing power across the net. According to one status message, the problem began with a "network event" that caused the service to re-mirror a large number of Elastic Block Storage volumes in the East Region. Elastic Block Storage provides storage that's independent of particular server instances on EC2.

Amazon divides EC2 into multiple geographic regions, and some regions – including the East Region – are divided into multiple "availability zones". Amazon has always said that these zones are protected from each other's outages. "Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones," the company's website reads. But the East Region outage spread across multiple zones.

Some felt that Amazon had broken its promise over availability zones. But the particulars of the service-level agreement add a new twist to this discussion. "'Annual Uptime Percentage' is calculated by subtracting from 100% the percentage of 5 minute periods during the Service Year in which Amazon EC2 was in the state of 'Region Unavailable'," the agreement reads. "'Region Unavailable'...means that more than one Availability Zone in which you are running an instance, within the same Region, is 'Unavailable' to you."
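The two definitions quoted above can be turned into a rough calculation. What follows is a minimal sketch, not Amazon's own method: the function names and zone labels are hypothetical, the service year is assumed to be 365 days, and the three-hour figure is the approximate multi-zone outage window reported in Amazon's status messages.

```python
# Assumption: a 365-day service year, split into five-minute periods.
FIVE_MINUTE_PERIODS_PER_YEAR = 365 * 24 * 12


def region_unavailable(zones_in_use, unavailable_zones):
    """Return True if more than one zone in which the customer is
    running an instance is unavailable (the quoted SLA wording)."""
    affected = set(zones_in_use) & set(unavailable_zones)
    return len(affected) > 1


def annual_uptime_percentage(unavailable_periods,
                             total_periods=FIVE_MINUTE_PERIODS_PER_YEAR):
    """Subtract from 100% the percentage of five-minute periods
    spent in the 'Region Unavailable' state."""
    return 100.0 - 100.0 * unavailable_periods / total_periods


# Hypothetical customer running in a single zone: never 'Region Unavailable'.
single_zone = region_unavailable(["zone-a"], ["zone-a", "zone-b"])

# Hypothetical customer spread across two zones, both of them down.
multi_zone = region_unavailable(["zone-a", "zone-b"], ["zone-a", "zone-b"])

# About three hours of multi-zone trouble is 36 five-minute periods.
three_hours = 3 * 12
uptime = annual_uptime_percentage(three_hours)

print(single_zone, multi_zone, f"{uptime:.4f}")
```

Under these assumptions, and with no other downtime in the year, three hours of "Region Unavailable" leaves the annual figure at roughly 99.966 per cent, above the 99.95 per cent threshold. It would take about 4.4 hours in that state to drop below it.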

According to Amazon's status messages, multiple availability zones experienced problems for about three hours on Thursday, then the problem was isolated in the zone where it began. John Engates, the chief technology officer at Rackspace, which operates a cloud service similar to Amazon's, believes Amazon is unlikely to provide many credits in the wake of the outage.

"More than one availability [zone] would have to go down for you to receive a credit, and you have to be down for a considerable amount of time," Engates told us during a conversation at this week's OpenStack design summit in Santa Clara, California. "I really doubt they'll pay a lot on credits."

Rackspace's Cloud Servers service does not provide a setup analogous to Amazon's availability zones. The Rackspace service-level agreement guarantees uptime for particular components within each service region, including its network, its data center infrastructure, and individual hosts. The company operates separate data centers in Texas, Chicago, and London.

Judging from Amazon's status messages, Engates says, he believes that Amazon's outage spread across multiple availability zones because the company was using availability zones to mirror Elastic Block Storage data for other zones. "Rather than replicating data within a zone, I think they were replicating between zones," he said. "And it seems that when they had a failure in one zone, traffic waterfalled into the other zones. It's like if there was a fire in a hotel. We would have to evacuate to the hotel across the street, and there may not be enough room in the hotel across the street for everyone to get a room."

It appears that the outage affected only those who were using Amazon's Elastic Block Storage service.

Engates says that Amazon's cloud service and its service-level agreement are set up in such a way that users must ensure redundancy across zones – if not across entire regions. "You have to think about how to allocate your application across multiple resources to maximize that SLA," he said. "Those that did so – Netflix is one example of a big customer – did not experience the same kind of outages as people who were very localized. You could put some of the blame on Amazon, but some of the blame on the customer."

Yes, multiple zones were hit by the outage. But Amazon does not promise 100 per cent availability. The company has said, however, that it is unable to restore EBS volumes for some customers. About 0.07 per cent of EBS volumes in the East Region, a status message indicates, "will not be fully recoverable". ®
