
Bad generator and bugs take out Amazon cloud

From failover to fallover


Failovers set to manual

In the wake of the outage, Amazon says it now knows it must keep enough techies around to do manual failovers if automatic failovers don't work. The faulty equipment will be repaired and recertified if it can be, or replaced.

Amazon will also tweak its procedures, giving the generators a little more time to get up to stable power before switching over. Although it didn't say this, perhaps some bigger batteries are in order as well, to buy some time.

Customers who ran EC2 compute or EBS storage volumes inside of that failed availability zone, or who were running Relational Database Service or Elastic Load Balancer services from that iron, were obviously affected by the outage, which lasted about 20 minutes if you count server downtime only.

AWS says that about seven per cent of the EC2 instances in the US-East 1 region were knocked out and had to be recovered, and "the vast majority" of these instances were back online by midnight Pacific time, or about three-and-a-half hours after the servers in the one data center started to go dark because the UPSes were out of juice.

"Time for the completion of this recovery was extended by a bottleneck in our server booting process," AWS explained in the blog. "Removing this bottleneck is one of the actions we’ll take to improve recovery times in the face of power failure."

About seven per cent of the EBS volumes in the US East-1 region of the Amazon cloud were affected by the power outage, and it took a bit longer to get those volumes back: any volume with in-flight writes when power was lost comes back in an impaired state, and customers have to check it for consistency before turning it back on.
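Amazon's post doesn't spell out the mechanics, but this matches the way impaired EBS volumes are handled through the public API today: I/O is disabled until the customer confirms the data is consistent and re-enables it. A rough sketch using boto3 – the volume ID is a placeholder, and the actual filesystem check has to be run from the attached instance:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
volume_id = "vol-0123456789abcdef0"  # placeholder

# Ask EBS whether the volume is impaired (I/O disabled after possible data loss)
status = ec2.describe_volume_status(VolumeIds=[volume_id])
for vol in status["VolumeStatuses"]:
    print(vol["VolumeId"], vol["VolumeStatus"]["Status"])

# Once the volume has been attached and checked (fsck or similar), tell EBS it
# is safe to resume I/O
ec2.enable_volume_io(VolumeId=volume_id)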

After last year's outage in the US East-1 region – caused by a botched network upgrade and affecting 13 per cent of EBS volumes – Amazon changed some of its procedures to speed up EBS recovery time, but admitted that it still has work to do: 10 per cent of the affected EBS volumes were still not back online by 2:45 AM Pacific on Sunday.

"We have identified several areas in the recovery process that we will further optimize to improve the speed of processing recovered volumes."

The control planes for EC2 and EBS also need a little work, according to Amazon: it took until 9:10 PM on Friday to get this vital part of the Amazon cloud working again in the region, and that still required a lot of manual intervention by its own techies.

When the power was brought back up a bug went nuts in the code

The control planes for EC2 and EBS keep replicated metadata sets in other availability zones, and when the primary copy fails, the control plane is supposed to flip over to a read-only copy stored in another availability zone.

"We are addressing the sources of blockage which forced manual assessment and required hand-managed failover for the control plane, and have work already underway to have this flip happen automatically," Amazon said.

The Elastic Load Balancer service directs traffic within and across availability zones inside a region for companies with multiple EC2 and EBS instances. When the power was brought back up on the failed ELB instances in the downed data center on Friday night, a bug in the ELB code went nuts.

The bug tried to scale up the restarted ELB instances with more resources and also flooded the control plane with requests to update ELB services as EC2 and EBS instances came back online – and, making matters worse, as customers came online and started asking for instances in the nine other data centers because the tenth one was hosed.
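That flood of requests is the classic retry-storm problem. One standard mitigation – a general client-side pattern, not something Amazon's post prescribes – is capped exponential backoff with jitter, so retrying callers don't stampede an already struggling control plane. A minimal sketch:

import random
import time

def call_with_backoff(request, max_attempts=8, base=0.5, cap=60.0):
    # Retry `request` (a zero-argument callable) with capped, jittered
    # exponential backoff; re-raise the last error if every attempt fails
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:  # in practice, catch only throttling/transient errors
            if attempt == max_attempts - 1:
                raise
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))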

The RDS database service also had issues. Customers can run RDS in a single availability zone, or across multiple zones for high availability. Those instances running in the failed data center crashed and were eventually rebooted.

However, those spanning two zones – with one of them being the failed data center – had issues because of a bug that Amazon itself introduced when it patched some storage management software back in April, and those RDS instances had to be failed over manually.
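For what it's worth, a customer-initiated manual failover of a Multi-AZ RDS instance looks something like the boto3 call below. This illustrates the customer-facing API, not what Amazon's own engineers did behind the scenes, and the instance identifier is a placeholder:

import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Rebooting with ForceFailover=True flips a Multi-AZ deployment to its standby
rds.reboot_db_instance(
    DBInstanceIdentifier="my-multi-az-db",  # placeholder
    ForceFailover=True,
)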

"We apologize for the inconvenience and trouble this caused for affected customers," Amazon ended its post.

"We know how critical our services are to our customers’ businesses. If you’ve followed the history of AWS, the customer focus we have, and the pace with which we iterate, we think you know that we will do everything we can to learn from this event and use it to drive improvement across our services. We will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make further changes to improve our services and processes."

Equipment, software, and people screw up, whether they are in the cloud or not (as the multi-day RBS outage shows). That is not going to change any time soon.

It is interesting to contemplate how many server crashes there are in the world every day, and whether Amazon is increasing or decreasing that average. (People are moved by large numbers, which is why airplane crashes are shocking even though far more people die in car crashes in any given year.)

Tens of thousands of servers that would otherwise have been distributed across many states – sitting under desks, in data closets, or in proper but much smaller data centers – were instead concentrated into one area that had violent storms, a prolonged power outage, and a cranky generator that, had it done its job, would not have caused the outage at all.

The real story is not that a generator can bring down a chunk of a cloud, but that the recovery process when there is a failure is still hairy even after some of the best minds on the planet have tried to think of all the angles. ®
