Bad generator and bugs take out Amazon cloud

From failover to fallover

Amazon Web Services battened down the hatches and called in extra troops ahead of the vicious thunderstorms that ripped through the Ohio valley and across Virginia and Maryland on Friday night.

But despite all the precautions – and there were many – the US East-1 region of Amazon's cloud was brought down after an electrical surge.

And not because of the power outage itself, but because of faulty generators and bugs in the control software Amazon has created to handle failover and recovery of its various cloudy services.

Netflix, Pinterest, Instagram, and Heroku, which run their services atop Amazon's infrastructure cloud, all reported outages because of the power failure.

"The outage was triggered during a large scale electrical storm which swept through the Northern Virginia area," Amazon said in a post-mortem on the failure. "We regret the problems experienced by customers affected by the disruption and, in addition to giving more detail, also wanted to provide information on actions we'll be taking to mitigate these issues in the future."

As El Reg previously reported, the triple-digit temperatures in the American Midwest supplied the energy for a line of storms called a derecho that cut a path of havoc from Indiana to the Atlantic Ocean late Friday night. Sustained winds in excess of 80 miles per hour killed a dozen people and knocked out power to more than 3.5 million people in Virginia and Maryland.

Fallen trees took out a lot of power lines, and even with hundreds of utility crews from all over the United States and Canada rushing in to help Dominion Virginia Power, BGE, and Pepco, the main suppliers of juice in the region, it will take until Friday, 6 July to get everyone back online. More than 371,000 people were still without power as of Tuesday morning.

Failover fallover

The US East-1 region is a collection of ten data centers run by Amazon in Ashburn, Virginia and broken into four electrically isolated availability zones to allow for some measure of redundancy for customers who want to pay extra for it.
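The redundancy scheme described here is worth making concrete. A minimal sketch in Python of how a customer might spread a fleet across availability zones, and shift load off a zone when it fails, is below. The zone names and functions are illustrative assumptions, not Amazon's actual tooling or zone assignments:

```python
from itertools import cycle

# Hypothetical zone names for illustration; real zone-to-account
# mappings are assigned by AWS and vary per account.
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d"]

def place_instances(instance_ids, zones):
    """Round-robin instances across availability zones so that no
    single zone failure takes out the whole fleet."""
    placement = {zone: [] for zone in zones}
    for inst, zone in zip(instance_ids, cycle(zones)):
        placement[zone].append(inst)
    return placement

def evacuate_zone(placement, failed_zone):
    """Redistribute instances from a failed zone across survivors,
    the kind of failover the control software is meant to automate."""
    survivors = [z for z in placement if z != failed_zone]
    displaced = placement[failed_zone]
    new_placement = {z: list(placement[z]) for z in survivors}
    for inst, zone in zip(displaced, cycle(survivors)):
        new_placement[zone].append(inst)
    return new_placement
```

The point of the sketch is the trade-off it makes visible: spreading across zones limits the blast radius of one failure, but the evacuation step is exactly the control-software path where, as the rest of this story shows, bugs can bite.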

As the derecho came roaring overhead on Friday night at 7:27 PM Pacific time, two of these data centers, which both supported the same availability zone, had large voltage spikes, according to the AWS team.

As they were supposed to, these two data centers tried to fail over to generator power. One flipped over successfully; the other did not, and instead ran on juice stored in its uninterruptible power supplies.

Thirty minutes later, the power coming in from the utility failed across the whole US East-1 region, meaning all ten data centers. In its post-mortem, Amazon said this was the second time that night that power failed across the region, but it did not identify when the first time was. The generators in that bad data center were supposed to kick over again, and once again they did not; the generators in the other nine did.

Seven minutes after the blackout, and 40 minutes after the voltage spike, servers in the one data center without generator power started to go offline as its UPSes were depleted of their life-giving juice. AWS techies got the generators going by hand, and by 8:24 PM Pacific had the full facility running with power to all racks.

AWS knows what you are thinking: that it never tested the backup generators. Amazon is not about to divulge what gear it has, but says the generators and electrical switching equipment came from the same vendor and were installed in late 2010 and early 2011.

The gear was "rigorously tested" by the manufacturer before installation, according to AWS, and was also run through eight hours of load testing when installed. It is tested weekly, and all of the gear used across the ten data centers "worked flawlessly" for over 30 hours until supplies from the electric utility were restored early Sunday morning local time. The generators that did not kick on had been tested as recently as May 12 this year, in a simulated outage running the full load of that data center, and worked fine.
