Original URL: http://www.theregister.co.uk/2009/10/12/microsoft_and_google_without_chillers/

Microsoft yawns at Google's chillerless data center antidote

Instant failover? We do that too

By Cade Metz

Posted in Servers, 12th October 2009 19:03 GMT

Microsoft wouldn't be surprised if Google is using some sort of custom-built mystery software that automatically shifts workloads between its mega data centers. After all, Microsoft is doing much the same thing.

"We are at such an enormous scale. Think about this world where many data centers and hundreds of thousands of servers are running search and enterprise services and all sorts of services," Microsoft data center chief Arne Josefsberg tells The Reg.

"These infrastructures that we run - and that Google does too - are so large, you can't really rely on individuals to manually make these decisions on an application failing-over from one [data center] site to another. Essentially, it all has to be built into automation software that makes these types of decisions."

This summer, at a cloud-happy mini-conference in San Francisco, Google architecture guru/quip-meister Vijay Gill hinted that the Mountain View Chocolate Factory had developed some sort of back-end technology that automatically moves live compute loads to other locations when a data center verges on overheating.

"You have to have integration with everything right from the chillers down all the way to the CPU," Gill said. "Sometimes, there's a temperature excursion, and you might want to do a quick load-shedding to prevent a temperature excursion because, hey, you have a data center with no chillers. You want to move some load off. You want to cut some CPUs and some of the processes in RAM."

And, yes, he indicated the company has a way of redistributing these workloads (near-)instantly. "How do you manage the system and optimize it on a global level? That is the interesting part," he continued. "What we've got here [with Google] is massive - like hundreds of thousands of variable linear programming problems that need to run in quasi-real-time. When the temperature starts to excurse in a data center, you don't have the luxury of sitting around for half an hour... You have on the order of seconds."
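
Google hasn't published how any of this works - Gill only sketches it as a mountain of linear programming run in quasi-real-time - so take the snippet below as nothing more than an illustration: a toy greedy reshuffle in Python, with invented site names, temperatures, and capacities, standing in for whatever optimization Google actually runs.

    # Toy load-shedding heuristic: when a site runs hot, move part of its load
    # to cooler sites with spare capacity. All numbers and names are invented;
    # a greedy pass stands in for the large-scale optimization Gill describes.

    TEMP_LIMIT_C = 35.0  # hypothetical trip point for a chillerless site

    sites = {
        "belgium": {"temp_c": 37.2, "load": 80.0, "capacity": 100.0},
        "ireland": {"temp_c": 24.1, "load": 55.0, "capacity": 100.0},
        "finland": {"temp_c": 18.6, "load": 40.0, "capacity": 100.0},
    }

    def shed_load(sites, temp_limit=TEMP_LIMIT_C, shed_fraction=0.5):
        """Move a fraction of load off any overheating site onto cooler sites."""
        for name, hot in sites.items():
            if hot["temp_c"] <= temp_limit:
                continue
            to_move = hot["load"] * shed_fraction
            # Prefer the coolest sites that still have headroom.
            targets = sorted(
                (s for n, s in sites.items()
                 if n != name and s["temp_c"] <= temp_limit),
                key=lambda s: s["temp_c"],
            )
            for target in targets:
                moved = min(to_move, target["capacity"] - target["load"])
                target["load"] += moved
                hot["load"] -= moved
                to_move -= moved
                if to_move <= 0:
                    break

    shed_load(sites)
    print({name: round(s["load"], 1) for name, s in sites.items()})
    # {'belgium': 40.0, 'ireland': 55.0, 'finland': 80.0}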

Apparently, that bit about the "data center with no chillers" was a reference to Google's new facility in Saint-Ghislain, Belgium. According to a report from Data Center Knowledge, the Belgium facility really does operate without chillers, using nothing but the outside Belgian air - aka "free-cooling" - to keep temperatures low in the server room.

And it seems that when the Belgian summer gets too hot, Google uses its mystery software platform to shift loads elsewhere. Though the company won't actually fess up. "I don't believe we have published any papers regarding that," uber-Googler Matt Cutts recently told The Reg.

A typically coy Google remark? Microsoft seems to think so.

Chillerless v chillerless

Late last month, Microsoft unveiled its own chillerless data center, a 303,000-square-foot facility in Dublin, Ireland. But unlike Google's facility, Microsoft's data center includes a backup for its free-cooling setup: Direct eXpansion (DX) cooling units similar to ordinary air conditioners.

"These are fairly simple units, and we don't run them unless we absolutely have to," Josefsberg says. "It will be for only a very short period of time." According to Josefsberg, there will only be a few hours each year - spread out over a few days - where the Irish air is too hot to cool the server rooms, and that's when the DX units kick in.

Asked whether Microsoft had considered a Google-like setup with no cooling backup, Josefsberg indicated the company had not, going so far as to question the efficiency of such an arrangement. "If you're offloading data like that, essentially that would mean you would have to have a second data center with the same infrastructure somewhere else," he says. "And while each one can be energy efficient, you now need two of them. So the net is actually energy inefficient. You need a lot more infrastructure.

"If you make a reasonable investment in the reliability of the data center, you don't have to failover as much. Otherwise, you need more data centers. They're costly, and they're not good for the environment. We try to strike a balance. We don't want to invest in more data centers than we have to."

No doubt, Google would argue the other way. As Vijay Gill explains, the company's entire back-end philosophy is to create a unified infrastructure that spans all its data center facilities - an infrastructure that behaves as much as possible like a single machine.

In theory, this could lead to even greater levels of efficiency. Some so-called cloud evangelists have trumpeted a "follow the moon" setup, where workloads are constantly shifted to facilities where night has fallen. Night hours mean lower power costs.
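
Nobody in this story claims to run follow-the-moon in production, but the scheduling idea is simple enough to sketch. Assuming a hypothetical list of sites and the crude rule that night means cheaper power, a scheduler could pick a destination like this:

    # "Follow the moon" in miniature: favour the site where it is currently
    # night, on the crude assumption that off-peak power is cheaper there.
    # Sites, offsets, and the night window are illustrative placeholders.
    from datetime import datetime, timedelta, timezone

    SITES = {
        "dublin":    {"utc_offset_hours": 1},
        "quincy":    {"utc_offset_hours": -7},
        "singapore": {"utc_offset_hours": 8},
    }

    def is_night(utc_now, utc_offset_hours, night_start=22, night_end=6):
        local_hour = (utc_now + timedelta(hours=utc_offset_hours)).hour
        return local_hour >= night_start or local_hour < night_end

    def pick_site(utc_now=None):
        """Return the first site where it is night, else fall back to any site."""
        utc_now = utc_now or datetime.now(timezone.utc)
        for name, info in SITES.items():
            if is_night(utc_now, info["utc_offset_hours"]):
                return name
        return next(iter(SITES))

    print(pick_site())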

But like Google, Microsoft needs a reliable means of maintaining service when a data center malfunctions. Whether it has backup cooling or not, there will be times when Microsoft needs to shift workloads out of its shiny new Dublin data center. And Josefsberg says that Redmond can do so on the fly.

As an example, he points to the "fabric controller" built into Windows Azure. "That's essentially what it does," he says. "It measures events and incidents and moves processing for customers to alternate servers. These could be in the same data center, if it's a smaller localized problem, or to another data center if there's a problem with the data center itself."

And this can happen on a grand scale. "There was an earthquake in Asia that cut a lot of the undersea fibre optic networks," Josefsberg says. "This is the sort of situation where you have to be smart about failing over your services. The data center itself might be fine, but it might not be connected to the rest of the world. You've got to be able to quickly and automatically detect such situations and re-direct your customers to a failover situation."
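
Microsoft hasn't said how the fabric controller makes these calls, so the sketch below is only a generic illustration of the detect-and-redirect idea, with placeholder hostnames: probe each site, and when one stops answering - say, because its links have been cut - point traffic at a healthy alternate.

    # Generic detect-and-redirect sketch (not the Azure fabric controller):
    # probe each site over TCP and serve from the first one that answers.
    import socket

    SITES = ["dublin.example.net", "chicago.example.net", "quincy.example.net"]

    def reachable(host, port=443, timeout=2.0):
        """Crude health probe: can we open a TCP connection at all?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def pick_serving_site(preferred, sites=SITES):
        """Serve from the preferred site if it answers, else the first healthy one."""
        if reachable(preferred):
            return preferred
        for site in sites:
            if site != preferred and reachable(site):
                return site
        raise RuntimeError("no healthy site available")

    # Example: fail away from Dublin if it has dropped off the network.
    # pick_serving_site("dublin.example.net")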

And, he adds, Windows Azure is just one example. There's software that performs similar functions for other Microsoft services.

But there are limits to the automatic nature of Redmond's setup. "In some cases, if we have a potential problem in the data center, the decision isn't completely handled by software. We do have very highly trained staff on site that can determine if we really want to failover everything in the data center.

"Generally, we want the software to make as many of the decisions as possible. But there will be cases where trained engineers and architects will look at it and make the final determination."

Once again, the big difference here is that Redmond thinks in terms of disparate services. There's one setup for Windows Azure, and then there's a separate setup for the next service. Google squeezes all its services into the same unified infrastructure. This is meant to improve performance. But at least in theory, it can also handle failover on a much larger scale.

Of course, there's theory, and then there's practice. Over the past year, two much-discussed Gmail outages occurred when Google was moving workloads between data centers. ®