Bad generator and bugs take out Amazon cloud

From failover to fallover

Amazon Web Services battened down the hatches and called in extra troops ahead of the vicious thunderstorms that ripped through the Ohio valley and across Virginia and Maryland on Friday night.

But despite all the precautions – and there were many – the US East-1 region of Amazon's cloud was brought down after an electrical surge.

And not because of the power outage itself, but because of faulty generators and bugs in some of the control software that Amazon has created to deal with failover and recovery of various cloudy services.

Netflix, Pinterest, Instagram, and Heroku, which run their services atop Amazon's infrastructure cloud, all reported outages because of the power failure.

"The outage was triggered during a large scale electrical storm which swept through the Northern Virginia area," Amazon said in a post-mortem on the failure. "We regret the problems experienced by customers affected by the disruption and, in addition to giving more detail, also wanted to provide information on actions we'll be taking to mitigate these issues in the future."

As El Reg previously reported, the triple-digit temperatures in the American Midwest supplied the energy for a line of storms called a derecho that cut a path of havoc from Indiana to the Atlantic Ocean late Friday night. Sustained winds in excess of 80 miles per hour killed a dozen people and knocked out power to more than 3.5 million people in Virginia and Maryland.

Fallen trees took out a lot of power lines, and even with hundreds of utility crews from all over the United States and Canada rushing in to help Dominion Virginia Power, BGE, and Pepco, the main suppliers of juice in the region, it will take until Friday, 6 July to get everyone back online. More than 371,000 people were still without power as of Tuesday morning.

Failover fallover

The US East-1 region is a collection of ten data centers run by Amazon in Ashburn, Virginia and broken into four electrically isolated availability zones to allow for some measure of redundancy for customers who want to pay extra for it.

As the derecho came roaring overhead on Friday night at 7:27 PM Pacific time, two of these data centers, which both supported the same availability zone, had large voltage spikes, according to the AWS team.

As they were supposed to, these two data centers tried to fail over to generator power. One flipped over to its generators and the other did not, so the second data center ran on juice stored in its uninterruptible power supplies.

Thirty minutes later, the power coming in from the utility failed across the whole US East-1 region – meaning all ten data centers. In its post-mortem, Amazon said this was the second time that night that power failed across the region, but it has not identified when the first time was. The generators were supposed to kick in again in that bad data center, and once again they did not, although those at the other nine did.

Seven minutes after the blackout and 40 minutes after the voltage spike, servers started to go offline in the one data center whose generators had not kicked in, as the UPSes were depleted of their life-giving juice. AWS techies got the generators going by hand, fired up the UPSes, and by 8:24 PM Pacific had the full facility running with power to all racks.

AWS knows what you are thinking: They did not test the backup generators. Amazon is not about to divulge what gear it has, but says that the generators and electrical switching equipment came from the same vendor and were installed in late 2010 and early 2011.

The gear was "rigorously tested" by the manufacturer before installation, according to AWS, and was also run through eight hours of load testing when it was installed. The gear is tested weekly by the manufacturer, and all of the gear used across the ten data centers "worked flawlessly" for over 30 hours until power supplies from the electric utility were restored early Sunday morning local time. The generators that did not kick on were actually tested on May 12 this year, in a simulated outage that ran the full load of that data center, and they worked fine.
