Bad generator and bugs take out Amazon cloud

From failover to fallover

Amazon Web Services battened down the hatches and called in extra troops ahead of the vicious thunderstorms that ripped through the Ohio valley and across Virginia and Maryland on Friday night.

But despite all the precautions – and there were many – the US East-1 region of Amazon's cloud was brought down after an electrical surge.

And not because of the power outage itself, but because of faulty generators and bugs in some of the control software that Amazon has created to deal with failover and recovery of various cloudy services.

Netflix, Pinterest, Instagram, and Heroku, which run their services atop Amazon's infrastructure cloud, all reported outages because of the power failure.

"The outage was triggered during a large scale electrical storm which swept through the Northern Virginia area," Amazon said in a post-mortem on the failure. "We regret the problems experienced by customers affected by the disruption and, in addition to giving more detail, also wanted to provide information on actions we'll be taking to mitigate these issues in the future."

As El Reg previously reported, the triple-digit temperatures in the American Midwest supplied the energy for a line of storms called a derecho that cut a path of havoc from Indiana to the Atlantic Ocean late Friday night. Sustained winds in excess of 80 miles per hour killed a dozen people and knocked out power to more than 3.5 million people in Virginia and Maryland.

Fallen trees took out a lot of power lines, and even with hundreds of utility crews from all over the United States and Canada rushing in to help Dominion Virginia Power, BGE, and Pepco, the main suppliers of juice in the region, it will take until Friday, 6 July to get everyone back online. More than 371,000 people were still without power as of Tuesday morning.

Failover fallover

The US East-1 region is a collection of ten data centers run by Amazon in Ashburn, Virginia and broken into four electrically isolated availability zones to allow for some measure of redundancy for customers who want to pay extra for it.

As the derecho came roaring overhead on Friday night at 7:27 PM Pacific time, two of these data centers, which both supported the same availability zone, had large voltage spikes, according to the AWS team.

As they were supposed to, these two data centers tried to fail over to generator power. One flipped over to generator power, and one did not, and in that case, the data center ran on juice stored in its uninterruptible power supplies.

Thirty minutes later, the power coming in from the utility failed across the whole US East-1 region, meaning all ten data centers. In its post mortem, Amazon said this was the second time that night that power failed across the region, but it has not identified when the first time was. The generators were supposed to kick in again in that bad data center, and once again they did not, but those in the other nine did.

Seven minutes after the blackout and 40 minutes after the voltage spike, servers started to go offline in that one data center without generators as the UPSes were depleted of their life-giving juice. AWS techies got the generators going by hand, fired up the UPSes, and by 8:24 PM Pacific had the full facility running with power to all racks.

AWS knows what you are thinking: They did not test the backup generators. Amazon is not about to divulge what gear it has, but says that the generators and electrical switching equipment came from the same vendor and were installed in late 2010 and early 2011.

The gear was "rigorously tested" by the manufacturer before installation, according to AWS, and was also run through eight hours of load testing when it was installed. The gear is tested weekly by the manufacturer, and all of the gear used across the ten data centers "worked flawlessly" for over 30 hours until power supplies from the electric utility were restored early Sunday morning local time. The generators that did not kick on were actually tested on May 12 this year, in a simulated outage running the full load of that data center, and worked fine.
