Feeds

How a tiny leap-day miscalculation trashed Microsoft Azure

Redmond drills into cause of eight-hour outage

Beginner's guide to SSL certificates

As soon as Microsoft's cloudy platform Azure crashed to Earth, and stayed there for eight hours, on 29 February, every developer who has ever had to handle dates immediately figured it was a leap-day bug.

Now the software biz behemoth has put its hands up and admitted in a detailed dissection of the blunder how a calendar glitch trashed its server farm. It's also a handy guide to setting up your own wholesale-sized cloud platform.

The mega-crash stemmed from the handling of messages passed between a virtual machine running a client's application and the underlying host operating system running on each of the Azure servers.

These messages are encrypted using a public-private key pair taken from a "transfer certificate" generated within the VM. This security measure allows the host OS and the VM to trust their communications channel, through which stuff such as SSL certificates and diagnostic health checks are sent.

A transfer certificate is valid for a year from its creation date. A certificate created by an agent in a VM on 29 February 2012 will expire on 29 February 2013, a date that simply doesn't exist. This crashed the certificate validation process, bringing the start up of the VM to a halt.

The host OS tries to restart a crashed app VM every 25 minutes, which would have been fruitless in this case. After a series of failures, the host OS declares the hardware to be at fault and reports the server as knackered. Automatic systems that manage the clusters of servers try to self-heal the cloud by restarting the VMs on other boxes - in this scenario it caused them to also fall over, cascading the gaffe into a full-blown outage.

It didn't help that, at the time, new versions of the cloud's platform software were being rolled out, which required the generation of new albeit broken transfer certificates. Once enough servers are reported faulty in a cluster, the whole set is put on red alert - halting self-healing and software updates to minimise the damage.

Microsoft engineers confessed that in a rush to roll out the fix to the servers, they hit incompatibility problems within their own code, which knocked out services again.

You can read the full cock-up, blow by blow, here. Microsoft has said it will cough up service credits for customers walloped by the Azure outage. ®

Top 5 reasons to deploy VMware with Tegile

More from The Register

next story
IT crisis looming: 'What if AWS goes pop, runs out of cash?'
Public IaaS... something's gotta give - and it may be AWS
Linux? Bah! Red Hat has its eye on the CLOUD – and it wants to own it
CEO says it will be 'undisputed leader' in enterprise cloud tech
BT claims almost-gigabit connections over COPPER WIRE
Just need to bring the fibre box within 19m ...
Oracle SHELLSHOCKER - data titan lists unpatchables
Database kingpin lists 32 products that can't be patched (yet) as GNU fixes second vuln
Ello? ello? ello?: Facebook challenger in DDoS KNOCKOUT
Gets back up again after half an hour though
Hey, what's a STORAGE company doing working on Internet-of-Cars?
Boo - it's not a terabyte car, it's just predictive maintenance and that
prev story

Whitepapers

Providing a secure and efficient Helpdesk
A single remote control platform for user support is be key to providing an efficient helpdesk. Retain full control over the way in which screen and keystroke data is transmitted.
Intelligent flash storage arrays
Tegile Intelligent Storage Arrays with IntelliFlash helps IT boost storage utilization and effciency while delivering unmatched storage savings and performance.
Beginner's guide to SSL certificates
De-mystify the technology involved and give you the information you need to make the best decision when considering your online security options.
Security for virtualized datacentres
Legacy security solutions are inefficient due to the architectural differences between physical and virtual environments.
Secure remote control for conventional and virtual desktops
Balancing user privacy and privileged access, in accordance with compliance frameworks and legislation. Evaluating any potential remote control choice.