AWS celebrates Labor Day weekend by roasting customer data in US-East-1 BBQ

Postmortem report: Power outage knackered instances, volumes for unlucky punters

A power outage fried hardware within one of Amazon Web Services' data centers during America's Labor Day weekend, causing some customer data to be lost.

When the power went out, and backup generators subsequently failed, some virtual server instances evaporated – and some cloud-hosted volumes were destroyed and had to be restored from backups, where possible, we're told.

A Register reader today tipped us off that on Saturday morning, Amazon's cloud biz started suffering a breakdown within its US-East-1 region.

Our tipster told us they had more than 1TB of data in Amazon's cloud-hosted Elastic Block Store (EBS), which disappeared during the outage: they were told "the underlying hardware related to your EBS volume has failed, and the data associated with the volume is unrecoverable."

Our reader, who asked to remain anonymous, was able to restore their data by hand from an EBS snapshot conveniently taken roughly eight hours earlier. Without this backup, they may not have been able to recover any of the lost information: Amazon's engineers were able to resuscitate the vast majority of downed systems, though not every storage volume survived the hard crash.

Unlucky customers who had data on the zapped storage systems were told by AWS staff that, despite attempts to revive the missing bits and bytes, some of the ones and zeroes were permanently scrambled: "A small number of volumes were hosted on hardware which was adversely affected by the loss of power. However, due to the damage from the power event, the EBS servers underlying these volumes have not recovered.

"After further attempts to recover these volumes, they were determined to be unrecoverable."

Meanwhile, one customer and tech consultant, Andy Hunt, not only complained on Twitter that their data was trashed in the power cut, but also claimed the cause of the failure wasn't swiftly communicated to subscribers: "AWS had a power failure, their backup generators failed, which killed their EBS servers, which took all of our data with it. Then it took them four days to figure this out and tell us about it.

"Reminder: The cloud is just a computer in Reston with a bad power supply."

A spokesperson for AWS was not available for comment.

'Impaired'

Although some details about the downtime were published, albeit buried, on AWS's status page, El Reg has seen a more detailed series of notices sent to customers explaining the blunder.

Just before 1100 PDT on Saturday, AWS noted that, at about 0430 PDT, "one of ten data centers in one of the six Availability Zones in the US-East-1 Region saw a failure of utility power. Backup generators came online immediately, but for reasons we are still investigating, began quickly failing at around 0600 PDT."

"This resulted in 7.5 per cent of all instances in that Availability Zone failing by 0610 PDT," it continued. "Over the last few hours we have recovered most instances but still have 1.5 per cent of the instances in that Availability Zone remaining to be recovered. Similar impact existed to EBS and we continue to recover volumes within EBS. New instance launches in this zone continue to work without issue."

About two and a half hours later, at 1330 PDT, the cloud goliath clarified and expanded its note as follows:

At 0433 PDT one of ten data centers in one of the six Availability Zones in the US-East-1 Region saw a failure of utility power. Our backup generators came online immediately but began failing at around 0600 PDT. This impacted 7.5 per cent of EC2 instances and EBS volumes in the Availability Zone.

Power was fully restored to the impacted data center at 0745 PDT. By 1045 PDT, all but one per cent of instances had been recovered, and by 1230 PDT only 0.5 per cent of instances remained impaired. Since the beginning of the impact, we have been working to recover the remaining instances and volumes. A small number of remaining instances and volumes are hosted on hardware which was adversely affected by the loss of power. We continue to work to recover all affected instances and volumes and will be communicating to the remaining impacted customers via the Personal Health Dashboard. For immediate recovery, we recommend replacing any remaining affected instances or volumes if possible.

So, in effect, according to Amazon, early on Saturday morning, US West Coast time, an AWS data center lost power, then about an hour and a half later, the backup generators began failing, taking down 7.5 per cent of EC2 virtual machines and EBS volumes in that availability zone, or roughly one in 13.
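To put numbers on that timeline, here is a minimal Python sketch that works only from the times and percentages in the AWS notices quoted above. The helper `minutes` and the variable names are ours, not Amazon's, and "around 0600" is treated as exactly 0600, so the durations are approximate.

```python
def minutes(hhmm: str) -> int:
    """Convert a 24-hour 'HHMM' string to minutes after midnight."""
    return int(hhmm[:2]) * 60 + int(hhmm[2:])


# Times (PDT) as quoted in the 1330 PDT AWS notice.
utility_power_lost = minutes("0433")
generators_failing = minutes("0600")  # assumption: "around 0600" taken as 0600
power_restored = minutes("0745")

# How long the backup generators held up before they began failing.
on_generators = generators_failing - utility_power_lost

# How long the data center went without power after that.
without_power = power_restored - generators_failing

# Share of EC2 instances and EBS volumes hit in the Availability Zone.
affected_pct = 7.5
one_in = 100 / affected_pct

print(f"Generators ran for about {on_generators} minutes")
print(f"Power was out for about {without_power} minutes")
print(f"{affected_pct} per cent is roughly one in {one_in:.1f}")
```

On those figures, the generators carried the load for roughly an hour and a half, the facility then went without power for about an hour and three quarters, and 7.5 per cent works out to about one in 13.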

A few hours later, all but 0.5 per cent of instances were back up, and of those still "impaired," some were unrecoverable, forcing subscribers to pull out their backups – assuming they kept them. ®

Biting the hand that feeds IT © 1998–2019