AWS's S3 outage was so bad Amazon couldn't get into its own dashboard to warn the world

Websites, apps, security cams, IoT gear knackered

Tuesday's Amazon Web Services mega-outage not only knocked websites big and small offline, by yanking away their backend storage, but also knackered apps and Internet of Things gadgets relying on the technology.

In fact, the five-hour breakdown was so bad, Amazon couldn't even update its own AWS status dashboard: its red warning icons were stranded, hosted on the broken-down side of the cloud.

Essentially, S3 buckets in the US-East-1 region in northern Virginia, US, became inaccessible at about 0945 PST (1745 UTC). Software, from web apps to smartphone applications, relying on this cloud-based storage quickly broke, taking out a sizable chunk of the internet as we know it.

AWS has many regions, and US-East-1 is just one of them. Developers are supposed to spread their applications over different regions so when one goes TITSUP, it doesn't take the whole platform down. For various reasons – from the fact that programmers find distributed computing hard to the costs involved – this redundancy isn't always coded in. And so here we are.
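
For the curious, here is a rough idea of what "coding in" that redundancy can look like on the read path. This is a minimal sketch in Python, not anyone's production setup: the fetcher functions, the RegionUnavailable exception, and the second region name (us-west-2) are all hypothetical stand-ins, and it assumes your data has already been copied to more than one region.

```python
from typing import Callable, Dict, Tuple


class RegionUnavailable(Exception):
    """Raised by a fetcher when its region cannot serve the request."""


def fetch_with_failover(
    key: str, fetchers: Dict[str, Callable[[str], bytes]]
) -> Tuple[str, bytes]:
    """Try each region in order; return (region, data) from the first that answers."""
    failed = []
    for region, fetch in fetchers.items():
        try:
            return region, fetch(key)
        except RegionUnavailable:
            # Note the failure and move on to the next region in the list.
            failed.append(region)
    raise RegionUnavailable(f"all regions failed for {key!r}: {failed}")


# Hypothetical stand-ins for real storage clients.
def broken_region(key: str) -> bytes:
    raise RegionUnavailable("region is returning errors")


def healthy_region(key: str) -> bytes:
    return b"contents of " + key.encode()


# Dicts keep insertion order, so the preferred region is tried first.
fetchers = {"us-east-1": broken_region, "us-west-2": healthy_region}
region, data = fetch_with_failover("logo.png", fetchers)
print(region, data)
```

The hard parts the sketch skips are exactly the ones the article mentions: keeping the copies in sync between regions, and paying for the duplicate storage and traffic.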

The carnage on Tuesday midday, Pacific Time, as shown on the AWS dashboard customers see after logging in

This is by no means an exhaustive list of things that fell over or were wobbly today, due to the S3 downtime, but here's a start: Docker's Registry Hub, Trello, Travis CI, GitHub and GitLab, Quora, Medium, Signal, Slack, Imgur, Twitch.tv, Razer, heaps of publications that stored images and other media in S3, Adobe's cloud, Zendesk, Heroku, Coursera, Bitbucket, Autodesk's cloud, Twilio, Mailchimp, Citrix, Expedia, Flipboard, and Yahoo! Mail (which you probably shouldn't be using anyway). Readers also reported that Zoom.us and some Salesforce.com services were having problems, as were Xero, SiriusXM, and Strava. Another reader reports being unable to order coffee because the Hey You app was broken.

Launching new EC2 instances in the broken AWS region was also impossible at one point, we're told. Other AWS services knackered in the data center include: Elastic File System and Elastic Load Balancing, Simple Email Service, Relational Database Service, Lambda, Elastic MapReduce and Elastic Beanstalk.

It was not only websites and hosted services that were affected by the outage. A number of IoT products also rely on S3 buckets, and were thrown for a loop by the downtime.

Nest warned customers that its internet-connected security cameras and smartphone apps were not functioning properly – as in, weren't recording video footage for several hours – as a result of the AWS blunder.

Other IoT devices were also affected, causing some rather surreal scenarios for their owners. We're told that cloud-connected lightbulbs, thermostats, and similar gear stopped working properly as their backends fell over.

Ironically, outage monitoring sites DownDetector and isitdownrightnow.com were also offline, thanks to the issue. This may have been due to overwhelming attention from netizens wondering why their favorite sites and apps weren't working.

The US-East-1 region is one of the most popular regions for AWS, as today's calamity reflects. Its outage was so severe that Amazon was unable to update its own online public dashboard to warn users about the outage: the icons were stuck on green lights because the red icons warning of failures were hosted in the downed systems, apparently.

Amazon was able to regain control of the dashboard by noon PST, and update it to reflect the ongoing downtime. At 1300 PST, the board stated: "We are seeing recovery for S3 object retrievals, listing and deletions." By 1330 PST AWS added, "S3 object retrieval, listing and deletion are fully recovered now."

Finally, by about 1400 PST, Amazon reported that the service had returned to near-enough full functionality, roughly five hours after the first error reports surfaced. Some services, such as Lambda, still weren't up and running at time of writing. Here's the full timeline, according to the AWS team:

[RESOLVED] Increased Error Rates for Amazon S3

Update at 2:08 PM PST: As of 1:49 PM PST, we are fully recovered for operations for adding new objects in S3, which was our last operation showing a high error rate. The Amazon S3 service is operating normally.

Update at 1:12 PM PST: S3 object retrieval, listing and deletion are fully recovered now. We are still working to recover normal operations for adding new objects to S3.

Update at 12:52 PM PST: We are seeing recovery for S3 object retrievals, listing and deletions. We continue to work on recovery for adding new objects to S3 and expect to start seeing improved error rates within the hour.

Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard. The service updates are below. We continue to experience high error rates with S3 in US-EAST-1, which is impacting various AWS services. We are working hard at repairing S3, believe we understand root cause, and are working on implementing what we believe will remediate the issue.

Update at 10:33 AM PST: We're continuing to work to remediate the availability issues for Amazon S3 in US-EAST-1. AWS services and customer applications depending on S3 will continue to experience high error rates as we are actively working to remediate the errors in Amazon S3.

AWS has not yet said what the exact nature of the problem was. We're trying to get to the bottom of it. If you have any info, email us, please. ®


Biting the hand that feeds IT © 1998–2017