Feeds

Bad generator and bugs take out Amazon cloud

From failover to fallover

Next gen security for virtualised datacentres

Amazon Web Services battened down the hatches and called in extra troops ahead of the vicious thunderstorms that ripped through the Ohio valley and across Virginia and Maryland on Friday night.

But despite all the precautions – and there were many – the US East-1 region of Amazon's cloud was brought down after an electrical surge.

And not because of the power outage, but because of faulty generators and some bugs in some of the control software that Amazon has created to deal with failover and recovery of various cloudy services.

"Netflix, Pinterest, Instagram, and Heroku, which run their services atop Amazon's infrastructure cloud, all reported outages because of the power failure.

The outage was triggered during a large scale electrical storm which swept through the Northern Virginia area," Amazon said in a post-mortem on the failure. "We regret the problems experienced by customers affected by the disruption and, in addition to giving more detail, also wanted to provide information on actions we'll be taking to mitigate these issues in the future."

As El Reg previously reported, the triple-digit temperatures in the American Midwest supplied the energy for a line of storms called a derecho that wreaked a path of havoc from Indiana to the Atlantic ocean late Friday night. Sustained winds in excess of 80 miles per hour killed a dozen people and knocked out power to more than 3.5 million people in Virginia and Maryland.

Fallen trees took out a lot of power lines, and even with hundreds of utility crews from all over the United States and Canada rushing in to help Domino Virginia Power, BGE, and Pepco, the main suppliers of juice in the region, it will take until Friday, 6 July to get everyone back online. More than 371,000 people were still without power as of Tuesday morning.

Failover fallover

The US East-1 region is a collection of ten data centers run by Amazon in Ashburn, Virginia and broken into four electrically isolated availability zones to allow for some measure of redundancy for customers who want to pay extra for it.

As the derecho came roaring overhead on Friday night at 7:27 PM Pacific time, two of these data centers, which both supported the same availability zone, had large voltage spikes, according to the AWS team.

As they were supposed to, these two data centers tried to fail over to generator power. One flipped over to generator power, and one did not, and in that case, the data center ran on juice stored in its uninterruptible power supplies.

Thirty minutes later, the power coming in from the utility failed across the whole US East-1 region – meaning all ten data centers. In its post mortem, Amazon said this was the second time that night that power failed across the region but it has not identified when the first time was. The generators were supposed to kick over again in that bad data center, and once again it did not, but the other nine did.

Seven minutes after the blackout and 40 minutes after the voltage spike, servers started to go offline in that one data center without generators as the UPSes were depleted of their life-giving juice. AWS techies got the generators going by hand, fired up the UPSes, and by 8:24 PM Pacific had the full facility running with power to all racks.

AWS knows what you are thinking: They did not test the backup generators. Amazon is not about to divulge what gear it has, but says that it installed the generators and electrical switching equipment from the same vendor and installed them in late 2010 and early 2011.

The gear was "rigorously tested" by the manufacturer before installation, according to AWS, and were also run through eight hours of load testing when they were installed. The gear is tested weekly by the manufacturer, and all of the gear used across the ten data centers "worked flawlessly" for over 30 hours until power supplies from the electric utility were restored early Sunday morning local time. The generators that did not kick on were actually tested on May 12 this year, simulating an outage and running the full load in that data center and worked fine.

Secure remote control for conventional and virtual desktops

More from The Register

next story
HP busts out new ProLiant Gen9 servers
Think those are cool? Wait till you get a load of our racks
Shoot-em-up: Sony Online Entertainment hit by 'large scale DDoS attack'
Games disrupted as firm struggles to control network
Like condoms, data now comes in big and HUGE sizes
Linux Foundation lights a fire under storage devs with new conference
Community chest: Storage firms need to pay open-source debts
Samba implementation? Time to get some devs on the job
Silicon Valley jolted by magnitude 6.1 quake – its biggest in 25 years
Did the earth move for you at VMworld – oh, OK. It just did. A lot
Forrester says it's time to give up on physical storage arrays
The physical/virtual storage tipping point may just have arrived
prev story

Whitepapers

5 things you didn’t know about cloud backup
IT departments are embracing cloud backup, but there’s a lot you need to know before choosing a service provider. Learn all the critical things you need to know.
Implementing global e-invoicing with guaranteed legal certainty
Explaining the role local tax compliance plays in successful supply chain management and e-business and how leading global brands are addressing this.
Backing up Big Data
Solving backup challenges and “protect everything from everywhere,” as we move into the era of big data management and the adoption of BYOD.
Consolidation: The Foundation for IT Business Transformation
In this whitepaper learn how effective consolidation of IT and business resources can enable multiple, meaningful business benefits.
High Performance for All
While HPC is not new, it has traditionally been seen as a specialist area – is it now geared up to meet more mainstream requirements?