The TRUTH behind Microsoft Azure's global cloud mega-cock-up

"Machines will fail, software has bugs, people will make mistakes"

Windows Azure suffered a global meltdown at the end of October that caused us to question whether Microsoft had effectively partitioned off bits of the cloud from one another. Now we have some answers.

After a bit of prodding, Redmond sat us down with Windows Azure general manager Mike Neil, who explained to us why a sub-component of the system failed worldwide last week.

"There are three truths of cloud – machines will fail, software has bugs, people will make mistakes," he said.

The trick is to build your cloud system so that end users never notice these faults, but when everyone experiences a worldwide wobble in your service, it's hard to claim, as Microsoft does, that "Windows Azure provides on-demand infrastructure that scales."

The problem that befell Azure lay in the "Red Dog front end" (RDFE), which provides the publicly exposed management portal and the service management API. Users' requests are fed through the RDFE to the fabric front end, which disperses requests through aggregators and load balancers to fabric controllers that direct the cloud's virtual machines and other resources.

Microsoft made a change to the RDFE, which it tested on a small percentage of nodes within a single cluster. Upon seeing no problems, the company pushed it out worldwide.

"We were seeing good results and then we rolled it out more broadly," Neil said. When the update went global, the company detected "a small number of customers having issues with swap."

The swap feature lets admins move virtual machines from staging to production. It's a minor service, but some customers were affected (and one that El Reg spoke to was a bit miffed by the impact).

The key problem was that this fault, however "intermittent", occurred in all of Microsoft's regions, thus causing disruption across the world, because RDFE turned into a single point of failure in Azure – all thanks to the problematic update.
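For contrast, here is a minimal sketch of a rollout that proceeds stage by stage and halts as soon as a health check fails, so a bad update stops short of every region. It does not represent Microsoft's deployment tooling: the function, the region names and the health check are all hypothetical, and a single global front end like RDFE could not be staged this way without further work.

```python
from typing import Callable, List


def staged_rollout(
    stages: List[List[str]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy one stage at a time; stop after the first unhealthy stage.

    Returns the regions that received the update, in deployment order.
    """
    updated: List[str] = []
    for stage in stages:
        for region in stage:
            deploy(region)
            updated.append(region)
        # Gate the next stage on every region in this stage passing its check.
        if not all(healthy(region) for region in stage):
            break
    return updated


# Hypothetical example: the update misbehaves in the second stage.
example_stages = [["canary-cluster"], ["region-a"], ["region-b", "region-c"]]
touched = staged_rollout(
    example_stages,
    deploy=lambda region: None,              # stand-in for a real deployment step
    healthy=lambda region: region != "region-a",
)
print(touched)
```

In this toy run the fault is caught at the second stage, so the last two regions never receive the update.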

"One of the most difficult problems for us to address which creates a single point of failure in the system is the software itself," Neil said.

Though Microsoft eventually fixed the service by tweaking the RDFE, the company drew significant flak for having a planet-wide fault. Due to the way Azure is built, there can only be one RDFE functioning worldwide at any one time, rather than multiple versions running in an A/B testing arrangement of the kind other major clouds, such as Amazon and Google, are thought to use.

"The challenge we have for RDFE is that it's the API everybody talks to," he said. "In this particular case, we need to present a common API to the rest of the world. RDFE is that one special case."

We here at El Reg's floating cloud bureau understand Microsoft issued a full report on the outage to affected customers today. If it happened to plop into your inbox, do get in touch. ®
