Feeds

Amazon cloud fell from sky after botched network upgrade

'Catholic penance' awards 10 days of credit

Choosing a cloud hosting partner with confidence

Clouds as dominos

This caused a kind of domino effect. The EBS cluster couldn't handle API requests to create new volumes, and as these requests backed up in a queue, it couldn't handle API requests from other availability zones. At 2:40 am, engineers disabled all requests to create new volumes in the affected availability zone, and ten minutes later, the company said, requests from other zones were operating normally.

But then EBS nodes in the affected zone started failing, and at about 5:40 am, this again caused problems in other zones. Amazon said that within about 3 hours, engineers began to lower error rates and latencies in those other zones and that by 12:04 pm, they had isolated the problem in the original zone. For about 11 hours that morning, users were also unable to launch new EBS-backed EC2 instances in the affected zone.

Just after noon, about 13 per cent of EBS volumes in the original zone remained "stuck" and EBS APIs remained disabled. By 12:30pm on April 22 (the next day), all but 2.2 per cent of EBS volumes were restored. By 2 pm on April 24, all but 0.07 per cent was restored, and these, Amazon said, won't be restored. The company did not explain why.

The outage also affected Amazon's Relational Database Service (RBS), as RBS relies on EBS for storage.

Amazon said it will automatically provide customers with 10 days of credit to equal to 100 per cent of their usage of EBS volumes, EC2 instances and RDS database instances that were running in the affected availability zone at the time of the outage. It did not mention credits for services operating in the other availability zones.

The company did say that availability zones are physically separate from each other, but did not elaborate. It's unclear whether they're in separate data centers. In the post mortem, the company also said it intends to improve the design of the availability zones so that an EBS outage like this cannot spread from one zone to another.

Amazon also promises to expose additional APIs that will allow customers to more easily determine whether their instances are affected by an outage. This move was applauded by FathomDB's Santa Barbara, but he believes that the world should consider alternatives to Amazon, which pioneered the infrastructure cloud market and controls the largest market share.

"Amazon has been open about admitting to failure, and has promised to expose more of the private APIs so that customers and partners can be better able to help themselves in future outages, without relying on AWS to do so," he said.

"This is reassuring, though I believe that in the long term customers will be looking at other cloud operators and technologies, for redundancy and different philosophies in terms of timely and open customer communication, but also in terms of relying on well-understood technologies and on the broader community of engineering talent rather than just those at AWS." ®

Internet Security Threat Report 2014

More from The Register

next story
Azure TITSUP caused by INFINITE LOOP
Fat fingered geo-block kept Aussies in the dark
NASA launches new climate model at SC14
75 days of supercomputing later ...
Yahoo! blames! MONSTER! email! OUTAGE! on! CUT! CABLE! bungle!
Weekend woe for BT as telco struggles to restore service
You think the CLOUD's insecure? It's BETTER than UK.GOV's DATA CENTRES
We don't even know where some of them ARE – Maude
Cloud unicorns are extinct so DiData cloud mess was YOUR fault
Applications need to be built to handle TITSUP incidents
BOFH: WHERE did this 'fax-enabled' printer UPGRADE come from?
Don't worry about that cable, it's part of the config
Stop the IoT revolution! We need to figure out packet sizes first
Researchers test 802.15.4 and find we know nuh-think! about large scale sensor network ops
DEATH by COMMENTS: WordPress XSS vuln is BIGGEST for YEARS
Trio of XSS turns attackers into admins
prev story

Whitepapers

Why and how to choose the right cloud vendor
The benefits of cloud-based storage in your processes. Eliminate onsite, disk-based backup and archiving in favor of cloud-based data protection.
Forging a new future with identity relationship management
Learn about ForgeRock's next generation IRM platform and how it is designed to empower CEOS's and enterprises to engage with consumers.
10 ways wire data helps conquer IT complexity
IT teams can automatically detect problems across the IT environment, spot data theft, select unique pieces of transaction payloads to send to a data source, and more.
5 critical considerations for enterprise cloud backup
Key considerations when evaluating cloud backup solutions to ensure adequate protection security and availability of enterprise data.
High Performance for All
While HPC is not new, it has traditionally been seen as a specialist area – is it now geared up to meet more mainstream requirements?