Bad generator and bugs take out Amazon cloud

Original URL: https://www.theregister.com/2012/07/03/amazon_outage_post_mortem/

From failover to fallover

Posted in SaaS, 3rd July 2012 19:52 GMT

Amazon Web Services battened down the hatches and called in extra troops ahead of the vicious thunderstorms that ripped through the Ohio valley and across Virginia and Maryland on Friday night.

But despite all the precautions – and there were many – the US East-1 region of Amazon's cloud was brought down after an electrical surge.

The cause was not the power outage itself, but faulty generators and bugs in some of the control software that Amazon has created to deal with failover and recovery of various cloudy services.

Netflix, Pinterest, Instagram, and Heroku, which run their services atop Amazon's infrastructure cloud, all reported outages because of the power failure.

"The outage was triggered during a large scale electrical storm which swept through the Northern Virginia area," Amazon said in a post-mortem on the failure. "We regret the problems experienced by customers affected by the disruption and, in addition to giving more detail, also wanted to provide information on actions we'll be taking to mitigate these issues in the future."

As El Reg previously reported, the triple-digit temperatures in the American Midwest supplied the energy for a line of storms called a derecho that cut a path of havoc from Indiana to the Atlantic Ocean late Friday night. Sustained winds in excess of 80 miles per hour killed a dozen people and knocked out power to more than 3.5 million people in Virginia and Maryland.

Fallen trees took out a lot of power lines, and even with hundreds of utility crews from all over the United States and Canada rushing in to help Dominion Virginia Power, BGE, and Pepco, the main suppliers of juice in the region, it will take until Friday, 6 July to get everyone back online. More than 371,000 people were still without power as of Tuesday morning.

Failover fallover

The US East-1 region is a collection of ten data centers run by Amazon in Ashburn, Virginia and broken into four electrically isolated availability zones to allow for some measure of redundancy for customers who want to pay extra for it.

As the derecho came roaring overhead on Friday night at 7:27 PM Pacific time, two of these data centers, which both supported the same availability zone, had large voltage spikes, according to the AWS team.

As they were supposed to, these two data centers tried to fail over to generator power. One flipped over to generator power, and one did not, and in that case, the data center ran on juice stored in its uninterruptible power supplies.

Thirty minutes later, the power coming in from the utility failed across the whole US East-1 region – meaning all ten data centers. In its post mortem, Amazon said this was the second time that night that power failed across the region but it has not identified when the first time was. The generators were supposed to kick over again in that bad data center, and once again they did not, but those in the other nine did.

Seven minutes after the blackout and roughly 40 minutes after the voltage spike, servers started to go offline in that one data center without generators as the UPSes were depleted of their life-giving juice. AWS techies got the generators going by hand, fired up the UPSes, and by 8:24 PM Pacific had the full facility running with power to all racks.
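Laying the stated times end to end makes the sequence easier to follow. The short Python sketch below uses only the figures given above; the calendar date is an assumption inferred from the article's dateline (the Friday before 3 July 2012), and the variable names are purely illustrative.

```python
from datetime import datetime, timedelta

# Assumed date: Friday 29 June 2012, inferred from the dateline. Times are Pacific.
spike = datetime(2012, 6, 29, 19, 27)                # voltage spike, 7:27 PM
utility_loss = spike + timedelta(minutes=30)         # utility power fails region-wide
servers_dark = utility_loss + timedelta(minutes=7)   # UPSes run dry, servers drop
power_restored = datetime(2012, 6, 29, 20, 24)       # full facility powered, 8:24 PM

# Whole minutes between the key events.
spike_to_dark = int((servers_dark - spike).total_seconds() // 60)
dark_to_restored = int((power_restored - servers_dark).total_seconds() // 60)

print(f"Servers began going dark at {servers_dark:%H:%M} Pacific")
print(f"Spike to first servers down: {spike_to_dark} minutes")
print(f"First servers down to full power: {dark_to_restored} minutes")
```

On these figures the servers began dropping at 8:04 PM Pacific, 37 minutes after the spike (which rounds to the 40 minutes cited above), and the facility was without full power for 20 minutes.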

AWS knows what you are thinking: They did not test the backup generators. Amazon is not about to divulge what gear it has, but says that the generators and electrical switching equipment came from the same vendor and were installed in late 2010 and early 2011.

The gear was "rigorously tested" by the manufacturer before installation, according to AWS, and was also run through eight hours of load testing when it was installed. The gear is tested weekly by the manufacturer, and all of the gear used across the ten data centers "worked flawlessly" for over 30 hours until power supplies from the electric utility were restored early Sunday morning local time. The generators that did not kick on were actually tested on May 12 this year in a simulated outage, running the full load of that data center, and worked fine.

Failovers set to manual

In the wake of the outage, Amazon says it now knows it must keep enough techies around to do manual failovers if automatic failovers don't work. The faulty equipment will be repaired and recertified if it can be, or replaced.

Amazon will tweak procedures, giving the generators a little more time to get up to stable power before switching over, too. Although it didn't say this, perhaps some bigger batteries are in order, too, to buy some time.
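To make the idea of a longer stabilization window concrete, here is a minimal Python sketch. It is not Amazon's switching logic, which the company has not published; the function name, the nominal voltage, the tolerance, and the sample readings are all hypothetical. It only illustrates waiting until a generator has held stable output for a set number of consecutive readings before load is moved onto it.

```python
from typing import Iterable, Optional

def first_safe_transfer_index(
    readings: Iterable[float],
    nominal: float = 480.0,     # hypothetical nominal voltage
    tolerance: float = 0.05,    # hypothetical: within 5% of nominal counts as stable
    required_stable: int = 3,   # hypothetical: consecutive stable readings required
) -> Optional[int]:
    """Return the index of the reading at which load may be transferred,
    or None if the generator never holds stable output long enough."""
    stable_run = 0
    for i, volts in enumerate(readings):
        if abs(volts - nominal) <= nominal * tolerance:
            stable_run += 1
            if stable_run >= required_stable:
                return i
        else:
            stable_run = 0  # any unstable reading restarts the wait
    return None

# Made-up readings: the generator wobbles before it settles.
samples = [300.0, 470.0, 410.0, 478.0, 481.0, 479.0, 480.0]
short_wait = first_safe_transfer_index(samples, required_stable=1)
long_wait = first_safe_transfer_index(samples, required_stable=3)
print(short_wait, long_wait)
```

With a one-reading window the load would move at index 1, just before the dip to 410 volts; a three-reading window holds off until index 5. The price is that the UPSes have to carry the load for longer, which is where the bigger batteries would come in.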

Customers who ran EC2 compute or EBS storage volumes inside of that failed availability zone, or who were running Relational Database Service or Elastic Load Balancer services from that iron, were obviously affected by the outage, which lasted about 20 minutes if you count server downtime only.

AWS says that about seven per cent of the EC2 instances in the US East-1 region were knocked out and had to be recovered, and "the vast majority" of these instances were back online by midnight Pacific time, or about three-and-a-half hours after the servers in the one data center started to go dark because the UPSes were out of juice.

"Time for the completion of this recovery was extended by a bottleneck in our server booting process," AWS explained in the blog. "Removing this bottleneck is one of the actions we’ll take to improve recovery times in the face of power failure."

About seven per cent of the EBS volumes in the US East-1 region of the Amazon cloud were affected by the power outage, and it took a bit longer to get volumes back because any volumes that had in-flight writes when power was lost come back in an impaired state, and customers have to check them for consistency before turning them back on.

After last year's outage in the US East-1 region, caused by a botched network upgrade, which affected 13 per cent of the EBS volumes, Amazon changed some of its procedures to speed up EBS recovery time, but admitted that it still has work to do, with 10 per cent of the affected EBS volumes still not being online by 2:45 AM Pacific on Sunday.

"We have identified several areas in the recovery process that we will further optimize to improve the speed of processing recovered volumes."

The control planes for EC2 and EBS also need a little work, according to Amazon, which said it took until 9:10 PM on Friday to get this vital part of the Amazon cloud working in the region and which still required a lot of manual intervention on the part of its own techies to get going again.

The control planes for EC2 and EBS keep replicated metadata sets in other availability zones, and when a primary copy fails, the control plane is supposed to flip over to a copy stored in another availability zone, one that is made read-only.
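The intended flip can be sketched roughly as follows. This is an illustration only: the class and method names are invented for the example, replication between copies is not modelled, and nothing here reflects AWS's actual control plane code. It shows a control plane that answers from a primary metadata copy and, when that copy is unreachable, switches to a replica in another zone that serves reads but rejects writes.

```python
class MetadataStore:
    """Hypothetical metadata copy living in one availability zone."""

    def __init__(self, zone, data, healthy=True, read_only=False):
        self.zone = zone
        self.data = dict(data)
        self.healthy = healthy
        self.read_only = read_only


class ControlPlane:
    """Hypothetical control plane fronting a primary copy and one replica."""

    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica
        self.active = primary

    def _ensure_active(self):
        # Flip to the replica if the copy currently in use has failed.
        if not self.active.healthy:
            if self.active is self.primary and self.replica.healthy:
                self.replica.read_only = True  # the replica serves reads only
                self.active = self.replica
            else:
                raise RuntimeError("no healthy metadata copy available")

    def describe(self, key):
        self._ensure_active()
        return self.active.data.get(key)

    def update(self, key, value):
        self._ensure_active()
        if self.active.read_only:
            raise PermissionError("control plane is in read-only mode")
        self.active.data[key] = value


primary = MetadataStore("zone-a", {"vol-1": "attached"})
replica = MetadataStore("zone-b", {"vol-1": "attached"})
plane = ControlPlane(primary, replica)

primary.healthy = False          # simulate losing the primary's zone
state = plane.describe("vol-1")  # answered from the read-only replica
print(state, plane.active.zone)
```

In this toy version, customers could still look up the state of their resources after the flip, but any call that changes state would be refused until a writable copy is available again.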

"We are addressing the sources of blockage which forced manual assessment and required hand-managed failover for the control plane, and have work already underway to have this flip happen automatically," Amazon said.

The Elastic Load Balancer service, which directs traffic within and across availability zones inside a region for companies with multiple EC2 and EBS instances, also had problems: when the power was brought back up on the failed ELB instances in the downed data center on Friday night, a bug went nuts in the code.

The bug tried to scale up restarted ELB instances with more resources and flooded the control plane with requests to update the ELB services as EC2 and EBS instances came online. Making matters worse, customers were coming online at the same time and asking for instances in the nine other data centers because the tenth one was hosed.

The RDS database service also had issues. Customers can run RDS in a single availability zone, or across multiple zones for high availability. Those instances running in the failed data center crashed and were eventually rebooted.

However, those spanning two zones – with one of them being the failed data center – had issues because of a bug that Amazon itself introduced when it patched some storage management software back in April. RDS instances had to be failed back manually.

"We apologize for the inconvenience and trouble this caused for affected customers," Amazon ended its post.

"We know how critical our services are to our customers’ businesses. If you’ve followed the history of AWS, the customer focus we have, and the pace with which we iterate, we think you know that we will do everything we can to learn from this event and use it to drive improvement across our services. We will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make further changes to improve our services and processes."

Equipment, software, and people screw up, whether they are in the cloud or not (as the multi-day RBS outage shows). That is not going to change any time soon.

It is interesting to contemplate how many server crashes there are in the world every day and whether or not Amazon will be increasing or decreasing the average. (People are moved by large numbers, which is why airplane crashes are shocking, but far more people die in car crashes in any given year.)

Tens of thousands of servers that would have been distributed across many states – sitting under desks or in data closets or in proper but much smaller data centers – were concentrated into one area that had violent storms, a prolonged power outage, and a cranky generator that, had it done its job, would not have caused the outage at all.

The real story is not that a generator can bring down a chunk of a cloud, but that the recovery process when there is a failure is still hairy even after some of the best minds on the planet have tried to think of all the angles. ®