Amazon S3-izure cause: Half the web vanished because an AWS bod fat-fingered a command

Basically, Team Bezos pulled a GitLab

Thu 2 Mar 2017 // 18:59 UTC

Amazon has provided the postmortem for Tuesday's AWS S3 meltdown, shedding light on what caused one of its largest cloud facilities to bring a chunk of the web down.

In a note today to customers, the tech giant said the storage system was knocked offline by a staffer trying to address a problem with its billing system. Essentially, someone mistyped a command within a production environment while debugging a performance gremlin.

"The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process," the team wrote in its message.

"Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems."

Those two subsystems handled the indexing for objects stored on S3 and the allocation of new storage instances. Without these two systems operating, Amazon said it was unable to handle any customer requests for S3 itself, or those from services like EC2 and Lambda functions connected to S3.

As a result, websites small and large that relied on the cheap and popular Virginia US-East-1 region stopped working properly, costing hundreds of millions of dollars in losses for customers. It also broke smartphone apps and Internet of Things gadgets – from lightbulbs to Nest security cameras – that were relying on the S3 storage backend.

Among the collateral damage from the outage was the AWS service dashboard, which relies on S3 for some of its data. With S3 offline, staff could not access the dashboard to post updates for about two hours. AWS says it was able to restore full S3 service and operations by 1:54PM PST, more than four hours after the mistyped command.

"While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years," the AWS team explained on Thursday.

"S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected."

Amazon says that it will be putting several safeguards in place to prevent similar outages in the future. These include limiting the ability of its debugging tools to take multiple subsystems offline, and partitioning the entire service into smaller "cells" that can individually be taken offline and updated without affecting other parts of S3.

"We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level," AWS said of the offending software.

"This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks."
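For readers wondering what such a check might look like, here is a minimal Python sketch of the two safeguards AWS describes: refusing any removal that would leave a subsystem below its minimum capacity, and pulling servers in small batches rather than all at once. This is purely illustrative and is not AWS's tooling. The names `Subsystem`, `CapacityError`, `plan_removal` and `max_per_step`, along with all the numbers, are hypothetical.

```python
from dataclasses import dataclass
from typing import List


class CapacityError(ValueError):
    """Raised when a removal would leave a subsystem under its minimum."""


@dataclass
class Subsystem:
    # Hypothetical model of one subsystem's server pool.
    name: str
    active_servers: int
    min_required: int


def plan_removal(subsystem: Subsystem, requested: int, max_per_step: int = 2) -> List[int]:
    """Return the batch sizes for removing `requested` servers.

    Rejects the whole request up front if it would breach the minimum,
    so a mistyped count fails loudly instead of removing servers.
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")
    if max_per_step <= 0:
        raise ValueError("max_per_step must be positive")

    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityError(
            f"removing {requested} servers from {subsystem.name} would leave "
            f"{remaining}, below the minimum of {subsystem.min_required}"
        )

    # Split the removal into small steps so capacity drains slowly.
    batches = []
    left = requested
    while left > 0:
        step = min(max_per_step, left)
        batches.append(step)
        left -= step
    return batches


if __name__ == "__main__":
    index = Subsystem(name="index", active_servers=100, min_required=90)
    print(plan_removal(index, requested=5))  # a small, valid removal
    try:
        plan_removal(index, requested=50)  # the fat-fingered case
    except CapacityError as err:
        print("refused:", err)
```

In this sketch the mistyped request is refused before a single server is touched, while the valid one is broken into steps that an operator, or an automated health check, could pause between.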

Mistyping a command and crippling service for hours. Where have we heard that one before? ®
