DevOps

$310m AWS S3-izure: Why everyone put their eggs in one region

Lessons learned from Tuesday's cloud, er, fog storage mega-failure

By Shaun Nichols in San Francisco


Analysis With Amazon now recovered from a four-hour outage that brought a large portion of the internet to a grinding halt, analysts are looking back to see what lessons companies can learn from the ordeal.

The system breakdown – or as AWS put it, "increased error rates" – knocked out a single region of the AWS S3 storage service on Tuesday. That in turn brought down AWS's hosted services in the region, preventing EC2 instances from launching, Elastic Beanstalk from working, and so on. In the process, organizations from Docker and Slack to Nest, Adobe and Salesforce.com had some or all of their services knocked offline for the duration.

According to analytics firm Cyence, S&P 500 companies alone lost about $150m (£122m) from the downtime, while financial services companies in the US dropped an estimated $160m (£130m).

The epicenter of the outage was one region on the east coast of America: the US-East-1 facility in Virginia. Due to its lower cost and its familiarity to application programmers, that one location is an immensely popular destination for companies that use AWS for their cloud storage and virtual machine instances.

Because so many developers centralized their code there, when it fell over, it took out a chunk of the web. Startups and larger orgs find US-East-1 cheaper and easier to use than any of the other regions AWS provides: it's Amazon's oldest location, and the one customers are most familiar with.

Coders are, ideally, supposed to spread their software over multiple regions so any failures can be absorbed and recovered from. This is, to be blunt, too difficult for some developers to implement; it introduces extra complexity, which means extra bugs, which makes engineers wary; and it pushes up costs.
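For the curious, here's a minimal sketch of what spreading your data across regions can look like at the S3 level: switching on cross-region replication from a US-East-1 bucket to a copy on the west coast using boto3. The bucket names and IAM role ARN below are hypothetical placeholders, and both buckets would need versioning enabled for this to work.

```python
# Minimal sketch: replicate objects from a bucket in US-East-1 to a copy
# in US-West-2, so reads can fall back to the replica if a region wobbles.
# Bucket names and the IAM role ARN are hypothetical placeholders, and
# versioning must already be enabled on both buckets.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

s3.put_bucket_replication(
    Bucket="example-primary-bucket",  # source bucket in US-East-1
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/example-replication-role",
        "Rules": [
            {
                "ID": "replicate-everything",
                "Status": "Enabled",
                "Prefix": "",  # empty prefix: replicate all objects
                "Destination": {
                    "Bucket": "arn:aws:s3:::example-replica-bucket-us-west-2"
                },
            }
        ],
    },
)
```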

For instance, for the first 50TB, S3 storage in US-East-1 costs $0.023 per GB per month compared to $0.026 for US-West-1 in California. Transferring information between apps distributed across multiple data centers also costs money: AWS charges $0.010 per GB to copy data from US-East-1 to US-East-2 in Ohio, and $0.020 to any other region.
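To put those per-gigabyte rates in perspective, here's a quick back-of-the-envelope calculation for a hypothetical 10TB workload – the workload size is ours, the rates are the published figures quoted above:

```python
# Back-of-the-envelope costs using the per-GB rates quoted above.
# The 10TB workload size is hypothetical, purely for illustration.
TB = 1024  # GB per TB
storage_gb = 10 * TB

us_east_1_storage = storage_gb * 0.023  # $/month stored in US-East-1
us_west_1_storage = storage_gb * 0.026  # $/month stored in US-West-1

copy_to_ohio = storage_gb * 0.010       # one-off transfer US-East-1 -> US-East-2
copy_elsewhere = storage_gb * 0.020     # one-off transfer to any other region

print(f"US-East-1 storage:   ${us_east_1_storage:,.2f}/month")
print(f"US-West-1 storage:   ${us_west_1_storage:,.2f}/month")
print(f"Copy to Ohio:        ${copy_to_ohio:,.2f} one-off")
print(f"Copy further afield: ${copy_elsewhere:,.2f} one-off")
```

On that hypothetical 10TB, the US-West-1 premium works out to roughly $30 a month, plus a one-off $100 to $200 to copy the data across – small numbers in isolation, but they add up across fleets of buckets and applications.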

Then there are latency issues, too. It obviously takes time for packets from US-East-1 to reach US-West-1. In the end, it's easier to plonk your web application and smartphone app's backend in one friendly region, and ride out any storms. It's rare for a whole region to evaporate.

"Being the oldest region, and the only public region in the US East coast until 2016, it hosts a number of their earliest and largest customers," said IDC research director Deepak Mohan. "It is also one of their largest regions. Due to this, impacts to the region typically affect a disproportionately high percentage of customers."

Cost was a big factor, says Rob Enderle, principal analyst at the Enderle Group. "The issue with public cloud providers – particularly aggressively priced ones like Amazon – is that your data goes to the cheapest place. It is one of the tradeoffs you make when you go to Amazon versus an IBM Softlayer," Enderle said.

"With an Amazon or Google you are going to have that risk of a regional outage that takes you out."

'Pouring one hundred gallons of water through a one gallon hose'

While those factors made the outage particularly difficult for customers who had come to rely on the US-East-1 region for their service, even those who had planned for such an occurrence and set up multiple regions were likely caught up in the outage. After US-East-1's cloud buckets froze and services vanished, some developers discovered their code running in other regions was unable to pick up the slack for various reasons.
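The kind of logic that turned out to be missing or untested is not exotic. Below is a naive sketch, with hypothetical bucket names, of a read path that falls back to a replica bucket in another region when the primary errors out – which, of course, only helps if the data was actually replicated there in the first place.

```python
# A naive sketch of the fallback logic many apps lacked: try the primary
# region first, then read from a replica bucket in another region. Bucket
# names are hypothetical, and this assumes the data was already replicated
# to the second region -- exactly the part many teams had skipped.
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

REGIONS = [
    ("us-east-1", "example-primary-bucket"),
    ("us-west-2", "example-replica-bucket-us-west-2"),
]

def fetch(key: str) -> bytes:
    last_error = None
    for region, bucket in REGIONS:
        s3 = boto3.client("s3", region_name=region)
        try:
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except (ClientError, EndpointConnectionError) as err:
            last_error = err  # primary down or erroring: try the next region
    raise RuntimeError("all regions failed") from last_error
```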

"It is hard to say exactly what happened, but I would speculate that whatever occurred created enough of an issue that multiple sites attempted to fail over to other zones or regions simultaneously," Charles King, principal analyst with Pund-IT, told El Reg.

"It's like trying to pour one hundred gallons of water through a one gallon hose, and you end up with what looks like a massive breakdown."

The takeaway, say the industry analysts, is that companies should consider building redundancy into their cloud instances just as they would for on-premises systems. This could come in the form of setting up virtual machines in multiple regions or sticking with the hybrid approach of keeping both cloud and on-premises systems. And, just as with backups, it means testing that failovers actually work.
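As a rough illustration of that last point, here's a minimal sketch of a canary check that regularly reads a known object from every region you claim to be able to fail over to, so a stale or broken replica gets noticed before the next S3izure. The bucket names, canary key, and where you wire the alert are all hypothetical.

```python
# Minimal sketch of 'test your failover': periodically confirm a canary
# object is readable in every region you expect to fail over to. Bucket
# names, the canary key, and the alerting hook are all hypothetical.
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

CANARY_KEY = "health/canary.txt"
REPLICAS = {
    "us-east-1": "example-primary-bucket",
    "us-west-2": "example-replica-bucket-us-west-2",
}

def check_replicas() -> dict:
    status = {}
    for region, bucket in REPLICAS.items():
        s3 = boto3.client("s3", region_name=region)
        try:
            s3.head_object(Bucket=bucket, Key=CANARY_KEY)
            status[region] = "ok"
        except (ClientError, EndpointConnectionError) as err:
            status[region] = f"FAILED: {err}"  # wire this into your alerting
    return status

if __name__ == "__main__":
    for region, state in check_replicas().items():
        print(f"{region}: {state}")
```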

"I think we have grown accustomed to the idea that the cloud has become a panacea for a lot of companies," King said. "It is important for businesses to recognize that the cloud is their new legacy system, and if the worst does occur the situation can be worse for businesses using cloud than those choosing their own private data centers, because they have less visibility and control."

While the outage will probably do little to slow the move of companies into cloud services, it could give some a reason to pause, and that might not be a bad thing.

"What this emphasizes is the importance of a disaster recovery path, for any application that has real uptime requirements, be it a consumer-facing website or an internal enterprise application," said IDC's Mohan.

"The biggest takeaway here is the need for a sound disaster recovery architecture and a plan that meets the needs and constraints of the application. This may be through usage of multiple regions, multiple clouds, or other fallback configurations." ®
