Microsoft reveals train of mistakes that killed Azure in the South Central US 'incident'
Thunderbolt and lightning, Azure outage frightening
Microsoft has published its preliminary findings on what it calls “the South Central US incident”, but what many will call “the day the Azure cloud fell from the sky”, and it doesn’t make for happy reading.
Thunder and lightning, very very frightening
As is well known now, high-energy storms hit Southern Texas early in the morning on 4 September, with multiple Azure data centers in the region seeing what Microsoft described as “voltage sags and swells.” A lightning strike at 0842 UTC caused one data center to switch to generator power and also overloaded suppressors on the mechanical cooling system, shutting it down.
The data center struggled on for a bit, but as its thermal buffers were depleted, temperatures rose, and a shutdown started. Alas, this was not before temperatures had risen to the point where actual hardware, including storage units and network devices, was damaged.
It’s at this point that a fateful decision was taken by engineers. The team could have failed over to another data center but instead put a higher priority on the integrity of customer data (since the asynchronous nature of geo-replication could have led to data loss). Thus the engineers began working through the damaged hardware, replacing where necessary and migrating customer data to healthy servers as needed, while customers kept hitting Refresh and staring at their screens in bafflement.
In the preliminary report, Microsoft admitted that “this particular set of issues also caused a cascading impact to services outside of the region”.
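To see why failing over an asynchronously replicated store can lose data, consider a minimal sketch. It is not Microsoft's implementation: the `Region` class, the function names and the write labels are all hypothetical, and replication is reduced to copying a prefix of the primary's write log.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """Hypothetical stand-in for one data center's write log."""
    name: str
    writes: list = field(default_factory=list)


def replicate(primary: Region, secondary: Region, upto: int) -> None:
    """Copy the first `upto` primary writes to the secondary (asynchronous catch-up)."""
    secondary.writes = primary.writes[:upto]


def writes_lost_on_failover(primary: Region, secondary: Region) -> list:
    """Writes the primary acknowledged that the secondary never received."""
    return primary.writes[len(secondary.writes):]


primary = Region("primary")
secondary = Region("secondary")

# Five writes have been acknowledged to customers by the primary...
primary.writes = ["w1", "w2", "w3", "w4", "w5"]
# ...but replication has shipped only the first three when the primary fails.
replicate(primary, secondary, upto=3)

lost = writes_lost_on_failover(primary, secondary)
print(lost)  # ['w4', 'w5']
```

Promoting the secondary at that moment would silently drop the last two writes, which is the trade-off the engineers chose to avoid by repairing the primary instead.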
And goodness, it certainly did cause an impact.
For customers directly using the South Central US region, problems began at 0929 UTC on 4 September with pretty much everything going down. Microsoft states that the majority of Azure services were up again just over a day later, by 1100 UTC on 5 September, but it was not until 0840 UTC on 7 September that “full mitigation” was complete.
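For anyone keeping score, the gaps between those timestamps can be worked out with a few lines of Python. The year is not given in the text; 2018 is assumed here and makes no difference to the results. The `utc` helper is purely illustrative.

```python
from datetime import datetime

YEAR = 2018  # assumption: the article does not state the year


def utc(day: int, hhmm: str) -> datetime:
    """Build a September timestamp from a day and a four-digit UTC time."""
    return datetime(YEAR, 9, day, int(hhmm[:2]), int(hhmm[2:]))


lightning_strike = utc(4, "0842")
customer_impact = utc(4, "0929")
majority_restored = utc(5, "1100")
full_mitigation = utc(7, "0840")

strike_to_impact = customer_impact - lightning_strike
impact_to_majority = majority_restored - customer_impact
impact_to_full = full_mitigation - customer_impact

print(strike_to_impact)    # 0:47:00
print(impact_to_majority)  # 1 day, 1:31:00
print(impact_to_full)      # 2 days, 23:11:00
```

In other words, 47 minutes passed between the lightning strike and customers feeling it, and just under three days between the start of customer impact and full mitigation.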
Azure Service Manager does not support automatic failover
So far so bad. However, as Azure users know all too well, the problem was not isolated to the South Central US region. Microsoft has revealed that the legacy Azure Service Manager (ASM), which manages ‘classic’ resource types, uses South Central US as its primary site to store resource metadata. While it also uses other locations to store metadata, ASM does not support automatic failover.
Uh oh. It wasn’t until 0110 UTC on 5 September that service was fully resumed.
Microsoft is keen to point out that its shiny new Azure Resource Manager (ARM) features global resiliency and stores data in every region. Unfortunately, it appeared that ARM also struggled, with customers experiencing timeouts and, of course, problems with resources that had underlying dependencies.
The incident also served as a pointer to weaknesses in Azure Active Directory (AAD). The affected data center was, unfortunately, one of the AAD sites for North America.
The good news is that as the data center fell over, authentication traffic was routed to the other sites automatically. The bad news is that automatic throttling kicked in, leading to timeouts for customers. It took until 1440 UTC on 4 September for Microsoft to deal with routing and bump up capacity elsewhere.
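Microsoft's report does not spell out how the throttling works, so the following is only a toy model of the general effect: when one site drops out and its traffic lands on the survivors, anything above a site's capacity gets rejected. The site names, load figures, capacities and the even split of rerouted traffic are all invented for illustration.

```python
def throttled_requests(load_per_site, capacity_per_site, failed_site):
    """Spread the failed site's load evenly over the surviving sites and
    return how many requests each survivor must reject as over capacity."""
    survivors = {s: load for s, load in load_per_site.items() if s != failed_site}
    extra = load_per_site[failed_site] / len(survivors)
    return {
        site: max(0, load + extra - capacity_per_site[site])
        for site, load in survivors.items()
    }


# Hypothetical requests per second at three authentication sites.
load = {"A": 900, "B": 800, "C": 700}

# With the original capacity, losing site A pushes B and C over their limits.
before = throttled_requests(load, {"A": 1000, "B": 1000, "C": 1000}, failed_site="A")

# After capacity is raised at the surviving sites, nothing is rejected.
after = throttled_requests(load, {"A": 1000, "B": 1300, "C": 1300}, failed_site="A")

print(before)
print(after)
```

In this model the rejected requests are what customers would see as timeouts, and they disappear only once the surviving sites are given more headroom, which matches the sequence of events Microsoft describes.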
Finally, Visual Studio Team Services (VSTS) customers discovered that the affected data center provided capabilities used by services in other regions. Again, the decision by engineers not to fail over in order to protect data led to a long wait for affected customers. The VSTS impact was not fully mitigated until 0005 UTC on 6 September, roughly 39 hours after the initial failure.
We have drawn a discreet veil over the fact that it was over 12 hours before Azure was even able to reliably show its status page.
We are so so sorry
Microsoft obviously said it is very sorry, although you’ll have to check your service level agreement and October billing statement to see just how sorry. It also promised that it will deal with the hardware problems – be it the design of the data center itself or the lack of resilience in its storage units to “environmental factors”.
More importantly, Microsoft has seemingly recognised that the whole ASM thing isn’t good and plans to migrate dependencies away from it to ARM as rapidly as possible. Customers would be well advised to take a good long look at their own designs as well. ®