Amazon gets 'F' for communication amidst cloud outage

CTO's distributed computing pal analyzes EC2 failure

Thorsten von Eicken – a former academic colleague of Amazon CTO Werner Vogels and the brains behind RightScale, one of the organizations best positioned to comment on the Amazon "cloud" – has heavily criticized Vogels and company for providing so little information about the massive outage that hit their service last week and continues to affect at least some Amazon customers.

"Amazon’s communication, while better than during previous outages, still earns an F. This is probably the #1 threat to AWS’s business," von Eicken said in a blog post on Monday.

"The biggest failure in this event was Amazon’s communication, or rather lack thereof. The status updates were far too vague to be of much use and there was no background information whatsoever. Neither the official AWS blog nor Werner Vogels’ blog had any post whatsoever 4 days after the outage!"

Von Eicken called the outage the worst in Amazon's history, but he believes that – at least on one level – the company proved that it was prepared for such an outage. "The Amazon cloud proved itself in that sufficient resources were available worldwide such that many well-prepared users could continue operating with relatively little downtime," he said.

"But because Amazon’s reliability has been incredible, many users were not well-prepared leading to widespread outages. Additionally, some users got caught by unforeseen failure modes rendering their failure plans ineffective."

Amazon's lack of communication aside, von Eicken said that the "biggest problem" was that the outage affected more than one "availability zone".

Amazon divides its so-called infrastructure cloud into multiple geographic regions, and some regions are divided into availability zones that are ostensibly "insulated" from each other's failures. But the outage that began in the early morning hours Pacific time on Thursday originated in one zone inside the service's East region and spread to other zones.

'Not supposed to happen'

RightScale runs a service for managing the use of Amazon's Elastic Compute Cloud and similar "infrastructure clouds," services that provide on-demand access to readily scalable processing power. RightScale did not see failures outside the zone where the problem originated with a breakdown in Amazon's Elastic Block Store (EBS) service, but it was nevertheless unable to launch new server instances outside that zone.

"We didn't see servers or volumes fail in other zones but we were unable to create fresh volumes elsewhere, which of course makes it difficult to move services," von Eicken says. "This is 'not supposed to happen' and is an indication that the EBS control plane has dependencies across zones."
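To make von Eicken's point concrete, here is a minimal sketch of a zone-failover attempt. It is purely illustrative and does not use Amazon's API: the `create_volume` callable, the `VolumeCreateError` exception, the zone names and the fake control plane are all hypothetical stand-ins. It shows why a failover plan that assumes isolated zones stops working once the provisioning layer shares a dependency across zones.

```python
from typing import Callable, Iterable, Optional


class VolumeCreateError(Exception):
    """Raised by the (hypothetical) provisioning call when a zone refuses a request."""


def create_volume_with_failover(
    zones: Iterable[str],
    create_volume: Callable[[str], str],
) -> Optional[str]:
    """Try each zone in order; return the first volume ID created, or None."""
    for zone in zones:
        try:
            return create_volume(zone)
        except VolumeCreateError:
            continue  # this zone refused the request, try the next one
    return None


def make_control_plane(broken_zones: set, shared_dependency_down: bool):
    """Build a fake provisioning call (an assumption, for illustration only)."""
    def create_volume(zone: str) -> str:
        # With a cross-zone dependency, trouble in one zone blocks every zone.
        if shared_dependency_down or zone in broken_zones:
            raise VolumeCreateError(zone)
        return f"vol-in-{zone}"
    return create_volume


ZONES = ["zone-a", "zone-b", "zone-c"]  # hypothetical zone names

# Isolated zones: failing over out of the broken zone succeeds.
isolated = make_control_plane({"zone-a"}, shared_dependency_down=False)
isolated_result = create_volume_with_failover(ZONES, isolated)

# Coupled control plane: every zone refuses, so failover has nowhere to go.
coupled = make_control_plane({"zone-a"}, shared_dependency_down=True)
coupled_result = create_volume_with_failover(ZONES, coupled)

print(isolated_result, coupled_result)
```

In the isolated case the request lands in the second zone. In the coupled case the loop exhausts every zone and returns nothing, which matches the behaviour RightScale describes: healthy servers elsewhere, but no way to create fresh volumes to move services onto.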

However, von Eicken points out, Amazon did contain the problem to one zone approximately three hours after it started.

Amazon first reported the problem at 1:41am Pacific on Thursday, and von Eicken says RightScale started noticing problems about 40 minutes before. "They finally posted a status message at 1:41am containing no useful details, sadly this is a typical sequence of events," he says.

In one of its brief status messages, Amazon said the problem began with a "network event" that caused the service to re-mirror a large number of EBS volumes in the East region. "It appears that a major network failure was the initial cause of problems but that the real damage happened when EBS (Elastic Block Store) volume replication was disrupted," von Eicken says.

"It appears that a significant fraction of the volumes concluded that the replication mirroring was out-of-sync and started re-replicating causing further havoc, including an overload of the EBS control plane. It is also possible that the EBS replication problem was the root cause and that the network issues were a consequence, hopefully Amazon’s root cause analysis will shed light on this."
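The overload von Eicken speculates about can be pictured with a toy queue model. This is a sketch under invented assumptions, not a description of how EBS works: the request counts and the per-tick capacity are made-up numbers, and `backlog_over_time` is a hypothetical helper. It only illustrates how one burst of simultaneous re-replication requests can leave a fixed-capacity control plane backed up long after the triggering event has passed.

```python
from typing import List, Sequence


def backlog_over_time(
    requests_per_tick: Sequence[int],
    capacity_per_tick: int,
) -> List[int]:
    """Return the number of queued requests left at the end of each tick."""
    backlog = 0
    history = []
    for arriving in requests_per_tick:
        # Serve up to capacity_per_tick requests; the rest wait in the queue.
        backlog = max(0, backlog + arriving - capacity_per_tick)
        history.append(backlog)
    return history


CAPACITY = 10  # invented figure: requests the control plane handles per tick

# Ordinary load stays under capacity, so nothing queues up.
normal = backlog_over_time([5, 5, 5, 5], CAPACITY)

# One tick in which many volumes decide to re-replicate at once.
storm = backlog_over_time([5, 500, 5, 5], CAPACITY)

print(normal)
print(storm)
```

With these invented numbers the queue drains by only five requests per tick after the burst, so new requests, such as creating a volume in another zone, would wait behind the backlog.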
