Amazon cloud still on fritz after 36 hours

'All hands on deck'

Amazon's cloud is still on the fritz, a day and a half after the company first reported connection problems, latency issues, and increased error rates across the service. But on Friday morning, the company said that full service should be restored for a "majority" of users by the afternoon Pacific time.

"We continue to see progress in recovering volumes, and have heard many additional customers confirm that they're recovering. Our current estimate is that the majority of volumes will be recovered over the next 5 to 6 hours," the company said in a post to its Amazon Web Services status page.

In some cases, Amazon said, it will take longer to restore data. For these volumes, the company is having to restore from backups it made on Thursday to its own S3 online storage service.

The problems began in the early hours of Thursday morning Pacific time. At 1:41am, Amazon said on its status page that it was investigating connectivity issues with its EC2 (Elastic Compute Cloud) service, which provides on-demand access to processing power via the web. The outage brought down several websites that run atop the service, including Quora, Sencha, Reddit, and Foursquare.

The outage also affected Amazon's Elastic Block Store, Relational Database Service, and Elastic Beanstalk services. And according to one post from the company, it all began with a "networking event" that triggered a large amount of re-mirroring of EBS volumes in the "East region" of Amazon Web Services. Amazon divides its so-called infrastructure cloud service into multiple geographic regions, and it guarantees 99.95 per cent availability within each region.
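To put that 99.95 per cent figure in perspective, here is a back-of-the-envelope sketch in Python. It assumes the guarantee is measured over a 365-day year and that the full 36 hours counts as downtime for an affected customer. Neither assumption is confirmed here, and Amazon's service agreement may define both differently. The function name is ours, purely for illustration.

```python
# Rough downtime budget implied by an availability percentage.
# Assumptions (not from Amazon): a 365-day measurement period, and
# the whole 36-hour span counted as downtime.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def allowed_downtime_hours(availability_pct: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted in the period at the given availability."""
    return period_hours * (1 - availability_pct / 100)


budget = allowed_downtime_hours(99.95)
outage_hours = 36

print(f"Yearly downtime budget at 99.95%: {budget:.2f} hours")
print(f"A {outage_hours}-hour outage is about {outage_hours / budget:.1f}x that budget")
```

On those assumptions, the yearly budget works out to a little under four and a half hours, so a 36-hour outage would overshoot it several times over.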

Some regions, including the East region, are divided into multiple "availability zones". For years, Amazon has said that these zones are "insulated" from each other's failures. But yesterday's outage spread across zones in the East region. Amazon has never said how these zones are designed, and it's unclear whether they sit in separate data centers.

"We can assure you that all-hands are on deck to recover as quickly as possible," the company said late last night. ®

Update

Amazon has now said that a majority of volumes have indeed been restored. "These volumes were recovered by ~1:30pm PDT," the company said at 2:15pm Pacific time. "We mentioned that a 'smaller number of volumes will require a more time consuming process to recover, and we anticipate that those will take longer to recover.' We're now starting to work on those."
