
Bad generator and bugs take out Amazon cloud

From failover to fallover


Failovers set to manual

In the wake of the outage, Amazon says it now knows it must keep enough techies around to do manual failovers if automatic failovers don't work. The faulty equipment will be repaired and recertified if it can be, or replaced.

Amazon will also tweak procedures, giving the generators a little more time to get up to stable power before switching over. Although it didn't say this, perhaps some bigger batteries are in order, too, to buy some time.

Customers who ran EC2 compute or EBS storage volumes inside of that failed availability zone, or who were running Relational Database Service or Elastic Load Balancer services from that iron, were obviously affected by the outage, which lasted about 20 minutes if you count server downtime only.

AWS says that about seven per cent of the EC2 instances in the US-East 1 region were knocked out and had to be recovered, and "the vast majority" of these instances were back online by midnight Pacific time, or about three-and-a-half hours after the servers in the one data center started to go dark because the UPSes were out of juice.

"Time for the completion of this recovery was extended by a bottleneck in our server booting process," AWS explained in the blog. "Removing this bottleneck is one of the actions we’ll take to improve recovery times in the face of power failure."

About seven per cent of the EBS volumes in the US East-1 region of the Amazon cloud were affected by the power outage, and it took a bit longer to get volumes back because any volumes that had in-flight writes when power was lost come back in an impaired state, and customers have to check them for consistency before turning them back on.
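
Customers who want to script that check today can lean on the EC2 API, which flags such volumes as impaired and offers an EnableVolumeIO call to switch I/O back on once the filesystem has been verified. Here is a minimal sketch using boto3, which postdates this incident; the region, credentials, and status strings are assumptions based on the current API rather than anything in Amazon's post-mortem:

```python
# Sketch: list EBS volumes whose status is "impaired" after a power event,
# then (optionally) re-enable I/O once a filesystem check has been run.
# Assumes boto3 credentials and permissions are already configured.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.describe_volume_status()  # pagination omitted for brevity
for vol in resp["VolumeStatuses"]:
    if vol["VolumeStatus"]["Status"] == "impaired":
        vol_id = vol["VolumeId"]
        print(f"{vol_id} is impaired - run fsck (or equivalent) before re-enabling it")
        # Once consistency has been verified, turn I/O back on:
        # ec2.enable_volume_io(VolumeId=vol_id)
```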

After last year's outage in the US East-1 region – caused by a botched network upgrade, and affecting 13 per cent of the EBS volumes – Amazon changed some of its procedures to speed up EBS recovery time. But it admitted that it still has work to do, with 10 per cent of the affected EBS volumes still not back online by 2:45 AM Pacific on Sunday.

"We have identified several areas in the recovery process that we will further optimize to improve the speed of processing recovered volumes."

The control planes for EC2 and EBS also need a little work, according to Amazon, which said it took until 9:10 PM on Friday to get this vital part of the Amazon cloud working again in the region, and that a lot of manual intervention by its own techies was required to get it going.

When the power was brought back up, a bug went nuts in the code

The control planes for EC2 and EBS keep replicated metadata sets in other availability zones, and when the primary copy fails, the control plane is supposed to flip over to a read-only copy stored in another availability zone.

"We are addressing the sources of blockage which forced manual assessment and required hand-managed failover for the control plane, and have work already underway to have this flip happen automatically," Amazon said.

The Elastic Load Balancer service directs traffic within and across availability zones inside a region for companies with multiple EC2 and EBS instances. When the power was brought back up on the failed ELB instances in the downed data center on Friday night, a bug in the ELB code went nuts.

The bug tried to scale up restarted ELB instances with more resources and also flooded the control plane with requests to update the ELB services as EC2 and EBS instances came back online. Making matters worse, customers piled in at the same time, asking for instances in the nine other data centers because the tenth one was hosed.
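
Customers can't do anything about Amazon's internal request storm, but clients retrying against an already struggling control plane is a classic thundering-herd problem. A generic sketch of exponential backoff with jitter – standard client-side hygiene, not anything Amazon prescribed in its post-mortem:

```python
# Generic retry helper with exponential backoff and full jitter, to avoid
# hammering an API whose control plane is already struggling.
import random
import time

def call_with_backoff(fn, max_attempts=6, base_delay=0.5, max_delay=30.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Sleep a random amount up to the capped exponential backoff.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

# Hypothetical usage: call_with_backoff(lambda: ec2.run_instances(...))
```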

The RDS database service also had issues. Customers can run RDS in a single availability zone, or across multiple zones for high availability. Those instances running in the failed data center crashed and were eventually rebooted.

However, those spanning two zones – with one of them being the failed data center – had issues because of a bug that Amazon itself introduced when it patched some storage management software back in April. RDS instances had to be failed back manually.
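
For customers who ever need to force that flip themselves, today's RDS API lets you trigger a Multi-AZ failover by rebooting the instance with the failover flag set. A minimal sketch with boto3; the instance identifier is a placeholder, and this reflects the current API rather than what was available at the time of the outage:

```python
# Sketch: force a Multi-AZ failover for an RDS instance by rebooting it
# with ForceFailover=True. Only valid for Multi-AZ deployments.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.reboot_db_instance(
    DBInstanceIdentifier="my-database",   # placeholder identifier
    ForceFailover=True,
)
```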

"We apologize for the inconvenience and trouble this caused for affected customers," Amazon ended its post.

"We know how critical our services are to our customers’ businesses. If you’ve followed the history of AWS, the customer focus we have, and the pace with which we iterate, we think you know that we will do everything we can to learn from this event and use it to drive improvement across our services. We will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make further changes to improve our services and processes."

Equipment, software, and people screw up, whether they are in the cloud or not (as the multi-day RBS outage shows). That is not going to change any time soon.

It is interesting to contemplate how many server crashes there are in the world every day and whether Amazon will be increasing or decreasing the average. (People are moved by large numbers, which is why airplane crashes are shocking, but far more people die in car crashes in any given year.)

Tens of thousands of servers that would have been distributed across many states – sitting under desks or in data closets or in proper but much smaller data centers – were concentrated into one area that had violent storms, a prolonged power outage, and a cranky generator that, had it done its job, would not have caused the outage at all.

The real story is not that a generator can bring down a chunk of a cloud, but that the recovery process when there is a failure is still hairy even after some of the best minds on the planet have tried to think of all the angles. ®
