Bad generator and bugs take out Amazon cloud
From failover to fallover
Failovers set to manual
In the wake of the outage, Amazon says it now knows it must keep enough techies around to do manual failovers if automatic failovers don't work. The faulty equipment will be repaired and recertified if it can be, or replaced.
Amazon will tweak procedures, giving the generators a little more time to get up to stable power before switching over, too. Although it didn't say this, perhaps some bigger batteries are in order, too, to buy some time.
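To make the idea of giving a generator "a little more time" concrete, here is a minimal sketch in Python of a transfer check that only switches load once the output has stayed near nominal for several consecutive readings. Everything in it is an assumption made for illustration: the function name, the voltage figures, the tolerance, and the sample counts are hypothetical and are not taken from Amazon's report.

```python
def should_transfer(voltage_samples, nominal, tolerance, required_stable_samples):
    """Return True once the most recent readings are all close to nominal.

    voltage_samples: generator output readings, oldest first (hypothetical).
    nominal: the target voltage.
    tolerance: allowed deviation as a fraction of nominal (0.02 means 2%).
    required_stable_samples: how many consecutive recent readings must be
        in range before load is switched over; raising this value is the
        "give it more time" knob.
    """
    if len(voltage_samples) < required_stable_samples:
        return False
    recent = voltage_samples[-required_stable_samples:]
    return all(abs(v - nominal) <= tolerance * nominal for v in recent)


# Illustrative readings from a generator that is still spinning up.
readings = [400, 470, 480, 481, 479]
print(should_transfer(readings, nominal=480, tolerance=0.02, required_stable_samples=3))
print(should_transfer(readings, nominal=480, tolerance=0.02, required_stable_samples=4))
```

A longer stability window only helps if the UPS batteries can carry the load while the check runs, which is why bigger batteries and a longer wait go together.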
Customers who ran EC2 compute or EBS storage volumes inside of that failed availability zone, or who were running Relational Database Service or Elastic Load Balancer services from that iron, were obviously affected by the outage, which lasted about 20 minutes if you count server downtime only.
AWS says that about seven per cent of the EC2 instances in the US-East 1 region were knocked out and had to be recovered, and "the vast majority" of these instances were back online by midnight Pacific time, or about three-and-a-half hours after the servers in the one data center started to go dark because the UPSes were out of juice.
"Time for the completion of this recovery was extended by a bottleneck in our server booting process," AWS explained in the blog. "Removing this bottleneck is one of the actions we’ll take to improve recovery times in the face of power failure."
About seven per cent of the EBS volumes in the US East-1 region of the Amazon cloud were affected by the power outage. It took a bit longer to get volumes back because any volume that had in-flight writes when power was lost comes back in an impaired state, and customers have to check it for consistency before turning it back on.
After last year's outage in the US East-1 region, caused by a botched network upgrade, which affected 13 per cent of the EBS volumes, Amazon changed some of its procedures to speed up EBS recovery time, but admitted that it still has work to do, with 10 per cent of the affected EBS volumes still not being online by 2:45 AM Pacific on Sunday.
"We have identified several areas in the recovery process that we will further optimize to improve the speed of processing recovered volumes."
The control planes for EC2 and EBS also need a little work, according to Amazon, which said it took until 9:10 PM on Friday to get this vital part of the Amazon cloud working in the region and which still required a lot of manual intervention on the part of its own techies to get going again.
The control planes for EC2 and EBS keep replicated metadata sets in other availability zones. When a primary copy fails, the control plane is supposed to flip over to a copy stored in another availability zone, one that is made read-only.
"We are addressing the sources of blockage which forced manual assessment and required hand-managed failover for the control plane, and have work already underway to have this flip happen automatically," Amazon said.
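The automatic flip Amazon describes can be sketched roughly as follows. This is a toy model in Python, not Amazon's implementation: the `Replica` and `ControlPlane` classes, their fields, and the zone names are all hypothetical, and the only behavior taken from the text is that a failed primary should hand over to a copy in another zone that is then treated as read-only.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of the control plane metadata (hypothetical model)."""
    zone: str
    healthy: bool = True
    read_only: bool = False
    data: dict = field(default_factory=dict)


class ControlPlane:
    """Toy control plane with a primary and one standby metadata copy."""

    def __init__(self, primary: Replica, standby: Replica):
        self.primary = primary
        self.standby = standby
        self.active = primary

    def check_and_failover(self) -> bool:
        """Flip to the standby, marked read-only, if the primary is down.

        Returns True if a failover happened on this call.
        """
        if self.active is self.primary and not self.primary.healthy:
            if not self.standby.healthy:
                raise RuntimeError("no healthy metadata copy available")
            self.standby.read_only = True
            self.active = self.standby
            return True
        return False

    def read(self, key):
        return self.active.data.get(key)

    def write(self, key, value):
        # Writes are refused while running on the read-only copy.
        if self.active.read_only:
            raise PermissionError("metadata is read-only during failover")
        self.active.data[key] = value
```

In this model, calling `check_and_failover()` from a periodic health check stands in for the automation; during the outage, by Amazon's account, that step needed people to assess the situation and do it by hand.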
The Elastic Load Balancer service, which directs traffic within and across availability zones inside a region for companies with multiple EC2 and EBS instances, was hit too. When the power was brought back up on the failed ELB instances in the downed data center on Friday night, a bug went nuts in the code.
The bug tried to scale up restarted ELB instances with more resources and also flooded the control plane with requests to update the ELB services as EC2 and EBS instances came on line. Making matters worse, customers were coming back online at the same time and asking for instances in the nine other data centers because the tenth one was hosed.
The RDS database service also had issues. Customers can run RDS in a single availability zone, or across multiple zones for high availability. Those instances running in the failed data center crashed and were eventually rebooted.
However, those spanning two zones – with one of them being the failed data center – had issues because of a bug that Amazon itself introduced when it patched some storage management software back in April. RDS instances had to be failed back manually.
"We apologize for the inconvenience and trouble this caused for affected customers," Amazon ended its post.
"We know how critical our services are to our customers’ businesses. If you’ve followed the history of AWS, the customer focus we have, and the pace with which we iterate, we think you know that we will do everything we can to learn from this event and use it to drive improvement across our services. We will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make further changes to improve our services and processes."
Equipment, software, and people screw up, whether they are in the cloud or not (as the multi-day RBS outage shows). That is not going to change any time soon.
It is interesting to contemplate how many server crashes there are in the world every day and whether Amazon will be increasing or decreasing the average. (People are moved by large numbers, which is why airplane crashes are shocking, but far more people die in car crashes in any given year.)
Tens of thousands of servers that would have been distributed across many states – sitting under desks or in data closets or in proper but much smaller data centers – were concentrated into one area that had violent storms, a prolonged power outage, and a cranky generator that, had it done its job, would not have caused the outage at all.
The real story is not that a generator can bring down a chunk of a cloud, but that the recovery process when there is a failure is still hairy even after some of the best minds on the planet have tried to think of all the angles. ®