Bad generator and bugs take out Amazon cloud

From failover to fallover

Failovers set to manual

In the wake of the outage, Amazon says it now knows it must keep enough techies around to do manual failovers if automatic failovers don't work. The faulty equipment will be repaired and recertified if it can be, or replaced.

Amazon will also tweak procedures, giving the generators a little more time to get up to stable power before switching over. Although it didn't say this, perhaps some bigger batteries are in order too, to buy some time.

Customers who ran EC2 compute or EBS storage volumes inside of that failed availability zone, or who were running Relational Database Service or Elastic Load Balancer services from that iron, were obviously affected by the outage, which lasted about 20 minutes if you count server downtime only.

AWS says that about seven per cent of the EC2 instances in the US-East 1 region were knocked out and had to be recovered, and "the vast majority" of these instances were back online by midnight Pacific time, or about three-and-a-half hours after the servers in the one data center started to go dark because the UPSes were out of juice.

"Time for the completion of this recovery was extended by a bottleneck in our server booting process," AWS explained in the blog. "Removing this bottleneck is one of the actions we’ll take to improve recovery times in the face of power failure."

About seven per cent of the EBS volumes in the US East-1 region of the Amazon cloud were affected by the power outage, and it took a bit longer to get volumes back because any volumes that had in-flight writes when power was lost come back in an impaired state, and customers have to check them for consistency before turning them back on.
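For the curious, here is roughly what that cleanup looks like from the customer side. This sketch assumes the modern boto3 Python SDK, which did not exist at the time and is not something Amazon mentioned; it simply lists volumes reporting an impaired status and re-enables I/O on one once it has been checked:

```python
# Illustrative sketch (boto3 assumed, not part of Amazon's post-mortem):
# find EBS volumes left "impaired" after a power loss and, once each has
# been checked for consistency, re-enable I/O on it.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def impaired_volumes():
    """Yield the IDs of volumes whose status checks report 'impaired'."""
    kwargs = {}
    while True:
        page = ec2.describe_volume_status(**kwargs)
        for status in page["VolumeStatuses"]:
            if status["VolumeStatus"]["Status"] == "impaired":
                yield status["VolumeId"]
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token

def reenable_after_check(volume_id):
    """Re-enable I/O on a volume -- only after its filesystem has been
    verified (for example with fsck from an instance it is attached to)."""
    ec2.enable_volume_io(VolumeId=volume_id)

for vol_id in impaired_volumes():
    print(f"{vol_id} is impaired; check it for consistency before re-enabling I/O")
```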

After last year's outage in the US East-1 region, which was caused by a botched network upgrade and affected 13 per cent of the EBS volumes, Amazon changed some of its procedures to speed up EBS recovery time. But it admitted that it still has work to do, with 10 per cent of the affected EBS volumes still not back online by 2:45 AM Pacific on Sunday.

"We have identified several areas in the recovery process that we will further optimize to improve the speed of processing recovered volumes."

The control planes for EC2 and EBS also need a little work, according to Amazon, which said it took until 9:10 PM on Friday to get this vital part of the Amazon cloud working again in the region, and that doing so still required a lot of manual intervention on the part of its own techies.

When the power was brought back up a bug went nuts in the code

The control planes for EC2 and EBS keep replicated metadata sets in other availability zones, and when the primary copy fails, the control plane is supposed to flip over to a read-only copy stored in another availability zone.
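A rough sketch of that pattern, with entirely hypothetical names rather than anything from Amazon's actual control plane, looks something like this: reads can flip over to a healthy read-only copy in another zone, while promoting a replica for writes is the step that still needed hand-holding:

```python
# Hypothetical sketch of the failover pattern described above; the names and
# structure are illustrative, not Amazon's control plane implementation.
from dataclasses import dataclass, field

@dataclass
class MetadataReplica:
    zone: str
    read_only: bool
    data: dict = field(default_factory=dict)
    healthy: bool = True

class ControlPlaneStore:
    def __init__(self, primary: MetadataReplica, replicas: list):
        self.primary = primary
        self.replicas = replicas

    def read(self, key):
        # Reads flip to a read-only copy in another zone if the primary is gone.
        source = self.primary if self.primary.healthy else self._failover_target()
        return source.data.get(key)

    def write(self, key, value):
        if not self.primary.healthy:
            # This is the step the post-mortem wants to automate: promoting a
            # replica to writable primary still required manual assessment.
            raise RuntimeError("primary lost; manual failover required for writes")
        self.primary.data[key] = value

    def _failover_target(self) -> MetadataReplica:
        for replica in self.replicas:
            if replica.healthy and replica.read_only:
                return replica
        raise RuntimeError("no healthy replica available")
```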

"We are addressing the sources of blockage which forced manual assessment and required hand-managed failover for the control plane, and have work already underway to have this flip happen automatically," Amazon said.

The Elastic Load Balancer service directs traffic within and across availability zones inside a region for companies running multiple EC2 and EBS instances. When the power was brought back up on the failed ELB instances in the downed data center on Friday night, a bug in the ELB code went nuts.

The bug tried to scale up restarted ELB instances with more resources and also flooded the control plane with requests to update the ELB services as EC2 and EBS instances came back online and, making matters worse, as customers came online and started asking for instances in the nine other data centers because the tenth one was hosed.
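That pile-up is a classic thundering-herd problem. Amazon did not say how it will fix it, but the standard defence on the client side is to retry with exponential backoff and jitter, sketched below with a hypothetical submit_update() call rather than any real AWS API:

```python
# Illustrative sketch, not Amazon's fix: retry a control-plane request with
# exponential backoff and jitter so that thousands of recovering clients do
# not hammer the service in lock-step. submit_update is a hypothetical callable.
import random
import time

def call_with_backoff(submit_update, max_attempts=6, base_delay=0.5, cap=30.0):
    """Call submit_update(), sleeping a randomised, exponentially growing
    interval between failed attempts."""
    for attempt in range(max_attempts):
        try:
            return submit_update()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
            time.sleep(delay)
```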

The RDS database service also had issues. Customers can run RDS in a single availability zone, or across multiple zones for high availability. Those instances running in the failed data center crashed and were eventually rebooted.

However, those spanning two zones – with one of them being the failed data center – had issues because of a bug that Amazon itself introduced when it patched some storage management software back in April. Those RDS instances had to be failed over manually.
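For what it is worth, a manual Multi-AZ failover is the sort of thing that can be forced with a reboot-with-failover call. The sketch below assumes the modern boto3 SDK and a hypothetical instance identifier, neither of which comes from Amazon's post-mortem:

```python
# Illustrative sketch (boto3 assumed; instance identifier is hypothetical):
# force a Multi-AZ RDS instance to fail over to its standby by hand, the
# sort of manual step left when automatic failover misbehaves.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

def force_failover(db_instance_id: str):
    """Reboot a Multi-AZ RDS instance and force failover to its standby."""
    desc = rds.describe_db_instances(DBInstanceIdentifier=db_instance_id)
    instance = desc["DBInstances"][0]
    if not instance["MultiAZ"]:
        raise ValueError(f"{db_instance_id} is not a Multi-AZ deployment")
    rds.reboot_db_instance(DBInstanceIdentifier=db_instance_id, ForceFailover=True)

force_failover("example-db-instance")  # hypothetical identifier
```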

"We apologize for the inconvenience and trouble this caused for affected customers," Amazon ended its post.

"We know how critical our services are to our customers’ businesses. If you’ve followed the history of AWS, the customer focus we have, and the pace with which we iterate, we think you know that we will do everything we can to learn from this event and use it to drive improvement across our services. We will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make further changes to improve our services and processes."

Equipment, software, and people screw up, whether they are in the cloud or not (as the multi-day RBS outage shows). That is not going to change any time soon.

It is interesting to contemplate how many server crashes there are in the world every day and whether Amazon will be increasing or decreasing the average. (People are moved by large numbers, which is why airplane crashes are shocking, but far more people die in car crashes in any given year.)

Tens of thousands of servers that would have been distributed across many states – sitting under desks or in data closets or in proper but much smaller data centers – were concentrated into one area that had violent storms, a prolonged power outage, and a cranky generator that, had it done its job, would not have caused the outage at all.

The real story is not that a generator can bring down a chunk of a cloud, but that the recovery process when there is a failure is still hairy even after some of the best minds on the planet have tried to think of all the angles. ®
