Windows Azure Compute cloud goes TITSUP PLANET-WIDE
Looks like a distributed system, breaks like a single tenant
Microsoft's Windows Azure cloud was hit by a worldwide partial compute outage today, calling into question how effectively Redmond has partitioned its service.
The problems emerged at 2.35AM UTC, and were still ongoing as of 10.20PM UTC the same day, according to the company's service dashboard.
"Manual actions to perform Swap Deployment operations on Cloud Services may error, which will then restrict Service Management functions," the company said.
Every single Azure region – each a geographically distant and independent set of data centers – was affected. For posterity, the list was: West US, West Europe, Southeast Asia, South Central US, North Europe, North Central US, East Asia, and East US.
"We are taking all necessary steps to mitigate this incident for the affected hosted services as soon as possible. Further updates will be published within 2 hours to keep you apprised of the situation. We apologize for any inconvenience this causes our customers," the company wrote at 10PM UTC.
Swap Deployment operations let developers initiate a virtual IP address swap between the staging and production environments of a service. Swap Deployment is an asynchronous operation that interacts with an Azure management service. Though the feature is not a main component of the IaaS cloud, an outage would be irritating for some heavy users, and a global outage is likely to damage confidence in Microsoft's ability to manage services at scale.
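For readers who have never used the feature, the idea behind a virtual IP swap can be sketched in a few lines of Python. This is a toy model and not Microsoft's API: the `CloudService` class, its field names, and the example addresses (taken from the reserved documentation range) are all made up for illustration, and the asynchronous hand-off to the management service is left out.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CloudService:
    """Hypothetical model of a service with a production and a staging slot."""
    production_vip: str     # address that customers connect to
    staging_vip: str        # address used for pre-release testing
    routes: Dict[str, str]  # virtual IP -> label of the deployment behind it

    def swap_deployment(self) -> None:
        """Exchange the deployments behind the two virtual IPs.

        The addresses themselves stay put; only what they route to changes,
        so clients of the production address see the new build straight away.
        """
        p, s = self.production_vip, self.staging_vip
        self.routes[p], self.routes[s] = self.routes[s], self.routes[p]


# Illustrative set-up: build "v1" is live, build "v2" is waiting in staging.
service = CloudService(
    production_vip="203.0.113.10",
    staging_vip="203.0.113.20",
    routes={"203.0.113.10": "v1", "203.0.113.20": "v2"},
)
service.swap_deployment()  # "v2" now sits behind the production address
```

Calling the method a second time puts the old build back, which is why developers lean on the operation for quick rollbacks, and why losing it is a nuisance.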
Alongside the global fail in a sub-component of Compute, the Azure cloud's Website feature also reported a global problem with "FTP data access", which began at 7PM UTC. That suggests a cascading fail from some part of the problem that downed Swap Deployment.
A problem that hits every region simultaneously is the antithesis of cloud computing, and yet this marks the second time in under a year that Microsoft has had a concurrent global fail.
Last time we had a Blue Sky of Death, it was due to a lapsed security certificate that downed all worldwide Windows Azure storage services. This time a much more minor component of the cloud has gone down, but the fact that it has failed globally is a severe indictment of whatever partitioning policies Microsoft may have put in place. ®