Lightning strikes Amazon cloud (honest)
The dangers of sky-high computing
Amazon's cloud was struck by lightning earlier this week. And that's the truth.
On Wednesday evening at about 6:30pm Pacific time, some Amazon cloud sitters saw their floating servers disappear - and yes, the company blamed the temporary outage on a lightning strike.
According to a web post from the company, the strike zapped a power distribution unit in one of its data centers, taking out server instances in one - and only one - Availability Zone. Amazon's Elastic Compute Cloud (EC2) serves up on-demand processing power from two geographic regions - the US and Europe - and each region is split into multiple Availability Zones designed never to vanish at the same time.
"A lightning storm caused damage to a single Power Distribution Unit (PDU) in a single Availability Zone," the company said in a web post at 7:33pm. "While most instances were unaffected, a set of racks does not currently have power, so the instances on those racks are down."
At 9:26pm, Amazon said power had been restored and the affected server instances were beginning to recover. By 1:20am, the company said the problem had been fully resolved.
While Amazon was correcting the problem, it told customers they could launch new server instances to replace those that went down - or simply wait for their original instances to come back up once power was restored to the hardware in question.
This was a relatively minor issue compared to the two major outages Amazon's cloud suffered in October 2007 and February 2008. And it's nowhere near as amusing as the time an engineer accidentally deleted Flexiscale's infrastructure cloud. Well, not nearly as amusing except for the lightning bit. ®
It's up to the customer to plan for the worst too
To the uninformed....
I've worked in data centres where we've had N+1 and better resilience on every aspect of the infrastructure. However, a well-placed lightning strike, chiller leak, roof collapse, car crash or other act of nature/technology can cause the unexpected to happen.
Ultimately it is down to the customer to have business continuity/contingency plans for these events, however unlikely they may be. As Eric Hammond rightly states on his blog:
"A well designed architecture built on top of EC2 keeps important information (databases, log files, etc) in easy to manage persistent and redundant data stores which can be snapshotted, duplicated, detached, and attached to new servers."
If your application is mission critical then you need to spread it around.
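The recovery pattern Hammond describes - snapshot a persistent volume, clone it in a healthy zone, attach the clone to a fresh server - can be sketched roughly as below. `Ec2Stub` is a hypothetical in-memory stand-in for an EC2 API client, used here so the sketch runs without AWS credentials; the real calls would be the EC2 `CreateSnapshot`, `CreateVolume` and `AttachVolume` operations (e.g. via boto3's EC2 client, which exposes them with the same parameter names).

```python
class Ec2Stub:
    """Hypothetical in-memory stand-in for an EC2 client (illustration only).

    A real client (e.g. boto3's EC2 client) exposes create_snapshot,
    create_volume and attach_volume with these parameter names.
    """
    def __init__(self):
        self._n = 0

    def _id(self, prefix):
        # Generate fake resource IDs like "snap-0001", "vol-0002".
        self._n += 1
        return f"{prefix}-{self._n:04d}"

    def create_snapshot(self, VolumeId):
        return {"SnapshotId": self._id("snap"), "VolumeId": VolumeId}

    def create_volume(self, SnapshotId, AvailabilityZone):
        return {"VolumeId": self._id("vol"),
                "SnapshotId": SnapshotId,
                "AvailabilityZone": AvailabilityZone}

    def attach_volume(self, VolumeId, InstanceId, Device):
        return {"VolumeId": VolumeId, "InstanceId": InstanceId,
                "Device": Device, "State": "attaching"}


def recover_to_new_zone(client, old_volume_id, new_instance_id, new_zone):
    """Snapshot the stranded volume, clone it in a healthy zone, attach it."""
    snap = client.create_snapshot(VolumeId=old_volume_id)
    vol = client.create_volume(SnapshotId=snap["SnapshotId"],
                               AvailabilityZone=new_zone)
    return client.attach_volume(VolumeId=vol["VolumeId"],
                                InstanceId=new_instance_id,
                                Device="/dev/sdf")


# Demo: the zone holding "vol-dead" has lost power; rebuild elsewhere.
attachment = recover_to_new_zone(Ec2Stub(), "vol-dead", "i-new", "us-east-1b")
print(attachment["State"])  # → attaching
```

The point of the pattern is that the instance itself is disposable: as long as the important state lives on a volume that can be snapshotted, losing a rack (or a whole Availability Zone) costs you only the time to re-attach elsewhere.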
Also, many years ago, we had the bearings on an 80kW motor fail so catastrophically that the rotor shorted out the windings. It tripped every overload back to and including the main site incomer (north). The emergency generator started up on auto, ran up to speed and connected into the system - then cut out on overload. It did this twice more on auto before locking out. We started to check the system before connecting to the alternative incomer (south - what else) and found a dead short: the distribution panel in the original fault path had all the terminals (three-phase) fused together into one solidified lump of molten metal.
I can imagine a direct lightning strike being somewhat worse!
Get real, complainers
"The REAL question is what else in their system failed that allowed a surge to get that far into the network..."
EVERYTHING failed; it was a direct lightning strike. If you think you have a surge protector that works against that, I suggest you get yourself to the patent office straight away.
"The better question is why didn't UPS and generator kick in"
Oh come ON. Lightning doesn't cause a power INTERRUPTION; it eats your infrastructure for breakfast. Nothing is going to protect your data center from a lightning strike. Oh, so you have a UPS system? Well, when the massive current fuses every metallic component in your UPS into a giant conductor, fat lot of good that will do you.