Inside Microsoft's Autopilot: Nadella's secret cloud weapon

Redmond man spills the beans on Microsoft's top-secret software

Exclusive Satya Nadella may have just taken the reins as Microsoft's chief executive, but he's already intimately familiar with one of the company's key internal tools to let it compete with Amazon and Google: a complex software system named Autopilot.

Autopilot is the system that lets Microsoft knit together millions of servers and hundreds and hundreds of petabytes of data into a great, humming lake of compute and storage capacity. Without Autopilot, Nadella's former divisions of Server and Tools, Online Services, Search and Advertising, and Cloud and Enterprise would have performed poorly and been less reliable.

Gaining access to Autopilot, Windows Azure's general manager Mike Neil told The Reg, is like being handed "the keys to a multi-billion dollar car."

Microsoft rarely talks publicly about Autopilot, and has only published two official documents about it: a now-outdated academic paper in 2007 titled Autopilot: Automatic Data Center Management, and a 2013 web page describing how Autopilot's development team were given an "Outstanding Technical Achievement" award for their work on the system.

Part of the reason Autopilot has never been talked about much – until now – is that its presence jars with Microsoft's marketing goal of claiming that everything it uses to run its cloud can be bought by Joe Public.

To distributed-systems cognoscenti aware of the idiosyncratic, complex needs of huge IT estates, that claim was always an odd one, and now we know why: yes, Microsoft uses a vast amount of its own commercial software internally to run its cloud, but "the vast majority" of applications running in Microsoft data centers ultimately sit on top of the Autopilot system.

"Autopilot software now completely automates the entire server operational lifecycle, from power on and OS installation, to fault detection and repair, to power cycling and vendor RMA," explains Microsoft. "The [Autopilot] team can take a bow for a quietly effective operation that has profoundly transformed Internet-scale services at Microsoft."

It also helps assign resources to applications, schedules when jobs should run, gathers information from millions of computers to give up-to-the-minute capacity utilization information, and forms the underlay of other even-more-secret technologies, such as the exabyte-scale COSMOS data analysis engine that sits beneath services such as Bing, Xbox Live, and Windows Azure.

Finally, Autopilot has gone hand-in-hand with a redesign of Microsoft's data center hardware, which has seen the company move away from buying high-end gear from traditional vendors and toward designing its own commodity-style, cut-down servers – these computers were declassified in January when Microsoft contributed their designs to Facebook's Open Compute Project.

In other words – if Microsoft's servers are puppets, Autopilot is the unseen puppeteer that animates both them and the stage they dance upon.

Neil compares Autopilot to a 747 jet: "It's a big, complex, honking thing," he told us, explaining that the system is designed "to take load off of the [data center sysadmin] pilot, so the pilot can concentrate on more important things."

One of Autopilot's main jobs is handling low-level infrastructure provisioning.

When Microsoft wants to add capacity to its global fleet of "10 to 100" data centers, it typically does so by loading in a shipping container stuffed with around 10,000 nodes, dubbed in Microsoft parlance an "ITPAC". Once these machines are connected to the data center's power grid, Autopilot is the system that checks that all the new servers are configured correctly and that the network works well, and helps link them to the rest of the system.

"Autopilot deploys and manages the OS image for the host as well as managing the applications that are deployed," Neil explains. "The agent comes along with the OS image and part of that is our SDN solution. The SDN solution manages both east-west and north-south traffic, and our topology gives us great cross-sectional bandwidth and path redundancy."

Once the servers have been brought into Microsoft's global network of "over a million servers," Autopilot helps manage them as well.

If a server fails, then Autopilot has a "self-healing" capability that can prevent a cluster-scale brownout, he said. "Things are going to fail all the time – Autopilot can take remediation actions for you to address failures. There's a bunch of auto-healing autonomic behavior in the system – you don't have to trim the flaps."
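For readers who want a feel for what "auto-healing" can mean in practice, here is a minimal sketch of an escalating repair ladder, loosely following the lifecycle Microsoft describes (fault detection, power cycling, OS reinstallation, vendor RMA). It is our own illustration, not Autopilot's code: the `Repair` levels, the `Server` record, the `observe` function, and the one-step-per-failure escalation rule are all assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Repair(IntEnum):
    """Hypothetical repair actions, ordered from least to most disruptive."""
    NONE = 0
    POWER_CYCLE = 1
    REINSTALL_OS = 2
    VENDOR_RMA = 3


@dataclass
class Server:
    name: str
    failures: int = 0  # consecutive failed health checks (assumed metric)


def next_repair(server: Server) -> Repair:
    """Map the consecutive failure count to a repair step, capped at RMA."""
    return Repair(min(server.failures, Repair.VENDOR_RMA))


def observe(server: Server, healthy: bool) -> Repair:
    """Record one health-check result and return the action to take.

    A healthy check resets the counter; each failed check escalates one step.
    """
    server.failures = 0 if healthy else server.failures + 1
    return next_repair(server)
```

In this sketch a machine that keeps failing is power cycled first, then reimaged, then flagged for return to the vendor, and a single healthy check sends it back to the bottom of the ladder. A real system would presumably also limit how many machines it repairs at once so that the cure does not cause the brownout.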

Autopilot also has a sophisticated scheduling component, which – to stretch the aeronautical metaphor a wee bit further – lets it play the role of an air traffic controller for the innumerable large and small workloads flying in and out of Microsoft's global pool of computers.
