Channel

This article is more than 1 year old

TALE OF FAIL: Microsoft offers blow-by-blow Azure outage account

Buggy software and bad deployment borked Redmond's cloud

Thu 18 Dec 2014 // 08:29 UTC

Microsoft has published a full, frank, and ugly account of just what went wrong when Azure Storage entered Total Inability To Support Usual Performance – TITSUP - mode in November.

The nub of the problem was that Azure's update procedures and code had “... a gap in the deployment tooling that relied on human decisions and protocol.”

At the time of the incident, Microsoft said it was caused by “... an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting (testing).”

Microsoft says its flighting process works like this:

“There are two types of Azure Storage deployments: software deployments (i.e. publishing code) and configuration deployments (i.e. change settings). Both software and configuration deployments require multiple stages of validation and are incrementally deployed to the Azure infrastructure in small batches. This progressive deployment approach is called ‘flighting.’ When flights are in progress, we closely monitor health checks. As continued usage and testing demonstrates successful results, we will deploy the change to additional slices across the Azure Storage infrastructure.”

The new analysis of the outage fingers faulty flighting as the cause of the mess, saying it started with “a software change to improve Azure Storage performance by reducing CPU footprint of the Azure Storage Table Front-Ends.”

During the upgrade, “The standard flighting deployment policy of incrementally deploying changes across small slices was not followed.” The engineer doing the upgrade “believed that because the change had already been flighted on a portion of the production infrastructure for several weeks, enabling this across the infrastructure was low risk.”

But it wasn't, because “the configuration switch was incorrectly enabled for Azure Blob storage Front-Ends.”

“Enabling this change on the Azure Blob storage Front-Ends exposed a bug which resulted in some Azure Blob storage Front-Ends entering an infinite loop and unable to service requests.”

Microsoft has since changed its processes and “released an update to our deployment system tooling to enforce compliance to the above testing and flighting policies for standard updates, whether code or configuration.”

Those updates mean “policy is now enforced by the deployment platform itself.”

Microsoft's being very open about this issue. Not only is there a detailed article about the mess, there's also a video interview with more detail.

That's an unusual amount of explanatory material for any incident and a far more detailed dump than cloudy rivals have offered after their own outages.

With cloud services now hard to differentiate on price, and often not highly-differentiated in terms of features, might this kind of openness sway customers? Or is it safer to assume that if Microsoft has a big SNAFU like this in it, there are others waiting to happen and Azure is best avoided?

Topics

Special Features

Vendor Voice

Resources

Channel

TALE OF FAIL: Microsoft offers blow-by-blow Azure outage account

Buggy software and bad deployment borked Redmond's cloud

More about

More about

Narrower topics

Broader topics

More about

More about

More about

Narrower topics

Broader topics

TIP US OFF

Other stories you might like

Microsoft foresees a new type of AI PC: A Surface designed with help from machines

French lawmakers take a swing at cloud monopolies

SharePoint logs are easily circumvented and Microsoft is dragging its heels

Protecting distributed branch office environments from ransomware

Microsoft slammed for lax security that led to China's cyber-raid on Exchange Online

Researchers claim Windows Defender can be fooled into deleting databases

October 2025 will be a support massacre for a bunch of Microsoft products

Microsoft is a national security threat, says ex-White House cyber policy director

Open source versus Microsoft: The new rebellion begins

Microsoft breach allowed Russian spies to steal emails from US government

AI gold rush continues as Microsoft invests $1.5B in UAE's G42

Microsoft shrinks AI down to pocket size with Phi-3 Mini

About Us

Our Websites

Your Privacy