Microsoft's own code should prevent an Azure SSL fail: So what went wrong?
Cloud service fell over despite cert automation in Server 2012
Sysadmin blog Server 2012 is the Microsoft operating system that, in my opinion, makes cloud computing a reality. As far as I am concerned it is as big a leap over Server 2008 R2 as that OS was over Server 2003. With it you can build anything from a small cluster to a service as big as Microsoft's own Azure platform.
Which is why I am completely baffled as to how it is possible that Azure was knocked offline by last week's SSL cock-up.
Let me start out by saying that I have the utmost sympathy – and respect – for the poor bastards working behind the scenes to fix this particular embarrassing incident. I'm not too proud to admit that I have done the exact same thing; like Microsoft, I've accidentally let an HTTPS certificate lapse more than once.
I could throw up excuses such as the ever infamous "I was too busy". I could even hand-wave at Apache's maddening certificate management (which makes it easy to miss a node) or RapidSSL's long delays in verifying the certs.
I could make those excuses, but I won't; none of them are valid. I screwed up because I was lazy, and any users trying to access an Outlook Web App late at night last Christmas (and the one before) were terribly inconvenienced for nearly six hours. The bit that bothers me about this snafu is that Microsoft doesn't even get to try those excuses. Not only can Microsoft sign its own damned certs, Server 2012 makes this whole process so simple that web administrators will weep.
Microsoft has code to save itself from this sort of blunder
One of the features buried inside the release notes for Server 2012 is Centralized SSL Certificate (CSC) support. You can run a farm of up to 10,000 IIS web server nodes off a single CSC server; each node can be directed to fetch its certs automatically from that one server, which gives you a reasonably simple interface to direct a symphony of re-validation.
Considering everything in Microsoft's new cloudy world is PowerShell scriptable, you can even stagger renewals so that no one certificate expiration can tank everything. Microsoft doesn't have to worry about licensing its own kit, so how exactly did this happen?
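To make "stagger renewals" concrete, here is a minimal sketch of the idea in Python rather than PowerShell. It is not Microsoft's tooling or any real CSC API: the function name `stagger_renewals`, the certificate names and the `lead_days` and `spacing_days` parameters are all assumptions made up for illustration. Given each cert's expiry date, it assigns a renewal date at least `lead_days` ahead of expiry and pulls renewals earlier where needed so that no two land within `spacing_days` of each other.

```python
from datetime import date, timedelta


def stagger_renewals(expiries, lead_days=30, spacing_days=2):
    """Map each cert name to a renewal date (hypothetical helper).

    Every renewal falls at least `lead_days` before that cert's expiry,
    and consecutive renewals are at least `spacing_days` apart.
    """
    schedule = {}
    next_slot = None  # renewal date already given to the later-expiring cert

    # Walk from the latest expiry to the earliest, so that clashes are
    # resolved by moving a renewal earlier, never closer to its expiry.
    ordered = sorted(expiries.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
    for name, expires in ordered:
        renew_on = expires - timedelta(days=lead_days)
        if next_slot is not None:
            renew_on = min(renew_on, next_slot - timedelta(days=spacing_days))
        schedule[name] = renew_on
        next_slot = renew_on
    return schedule


if __name__ == "__main__":
    # Made-up certificates: two expire on the same day, one later.
    certs = {
        "storage-front": date(2013, 3, 1),
        "web-front": date(2013, 3, 1),
        "mgmt-portal": date(2013, 6, 1),
    }
    for cert, when in sorted(stagger_renewals(certs).items(), key=lambda kv: kv[1]):
        print(cert, "renew on", when.isoformat())
```

The two certs that share an expiry date end up with renewal dates two days apart, which is the whole point: one botched renewal run can only hurt one cert at a time.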
Even if it was the cryptographic certificate upstream from the end nodes that expired, why wasn't the CSC server auto-renewing from elsewhere? Since Redmond can sign its own certs, between CSC and Server 2012's more traditional certificate manager you could have a great big circle jerk with servers auto-renewing in an endless frolic of crypto-hedonism.
So let's set this aside for the moment and assume that for whatever reason someone somewhere decided that it was vitally important to manually update a certificate along the chain. What could have prevented them from doing so? Maybe it was the data centre edge blacklist that Office 365 users can't control. Nah; you'd think that the cert guy would have an internal staff list that would tell him where to send the bottle of scotch to make sure that the people who try to send him email actually can.
Still working on the assumption that an expired cert was at fault, last I checked, Microsoft had some money lying around, so if it was getting the certificate verified by an external entity it should have been possible to pay the bill. Laziness? I doubt it. Surely Microsoft pays its systems administrators enough to actually care about their job. It is highly unlikely to be the fault of any one person not pulling the trigger on the update.
That leaves me with two remaining possibilities. The first: Microsoft isn't using its own rather excellent technology to handle these certs. I'm not fully sure of the underpinnings of Azure; does it run on Server 2012? Bing.com does. Even if Azure isn't using off-the-shelf Windows Server, there would be a delicious irony if Microsoft – enthusiastic player of the constant, cacophonous drumbeat of "upgrade for your own good" – had failed to take advantage of technology it invented to solve this exact problem.
I find it hard to buy that Microsoft doesn't have a version of CSC for its Azure infrastructure, leaving me with only one solid hypothesis about Azure's outage. I believe Microsoft is coming face to face with the fact that when pretty much all automation relies on scripting – using PowerShell or otherwise – a simple change to one line of code in one script can topple the mightiest cloud. Even one built on a foundation as solid as Server 2012.
I have a lot of respect for the systems administrators running Azure. That's a big, complicated job with an enormous amount of pressure. Right now, they are probably getting emotionally flayed alive – I won't envy them for the next few weeks. I would, however, like to offer a suggestion to Microsoft – especially the script-all-the-things-happy server division. Pick up the phone and call Luke Kanies over at Puppet Labs.
Ask him nicely for an education on why enforced states are better than scripts. Learn from those who have solved the problem of leaving the reputation of their flagship cloud service hanging on a single forgotten semicolon. ®
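For readers who haven't met the "enforced state" idea, the following Python sketch shows the general shape of it. It is emphatically not how Puppet is implemented; `reconcile`, `apply_actions`, the node names and the thumbprint strings are hypothetical. Instead of a script that says "do these steps", you declare which cert each node should hold, compare that with what is actually there, and act only on the differences.

```python
def reconcile(desired, actual):
    """Return the actions needed to move `actual` to `desired`.

    Both arguments map node name -> certificate thumbprint (made-up data).
    """
    actions = []
    for node, thumbprint in sorted(desired.items()):
        if actual.get(node) != thumbprint:
            actions.append(("install", node, thumbprint))
    for node in sorted(set(actual) - set(desired)):
        actions.append(("remove", node, actual[node]))
    return actions


def apply_actions(actions, actual):
    """Apply the actions to a copy of `actual` and return the new state."""
    state = dict(actual)
    for verb, node, thumbprint in actions:
        if verb == "install":
            state[node] = thumbprint
        else:
            state.pop(node, None)
    return state


if __name__ == "__main__":
    desired = {"web1": "NEW-CERT", "web2": "NEW-CERT"}
    actual = {"web1": "NEW-CERT", "web2": "OLD-CERT", "web3": "OLD-CERT"}

    plan = reconcile(desired, actual)
    print(plan)

    # Running it again against the result finds nothing left to do.
    print(reconcile(desired, apply_actions(plan, actual)))
```

The property that matters is the second run: once the state matches, the plan is empty, so the check can run every few minutes on every node and a missed step gets corrected on the next pass instead of lurking until the cert expires.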