Original URL: http://www.theregister.co.uk/2009/10/23/google_spanner/

Google Spanner — instamatic redundancy for 10 million servers?

Mountain View wants your exabyte

By Cade Metz

Posted in HPC, 23rd October 2009 21:40 GMT

Google’s massively global infrastructure now employs a proprietary system that automatically moves and replicates loads between its mega data centers when traffic and hardware issues arise.

The distributed technology was first hinted at — in classically coy Google fashion — during a conference this summer, and Google fellow Jeff Dean has now confirmed its existence in a presentation (PDF) delivered at a symposium earlier this month.

The platform is known as Spanner. Dean’s presentation calls it a “storage and computation system that spans all our data centers [and that] automatically moves and adds replicas of data and computation based on constraints and usage patterns.” This includes constraints related to bandwidth, packet loss, power, resources, and “failure modes”.
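To put that in more concrete terms, here is a minimal Python sketch of what constraint- and usage-driven replica placement could look like. Every site name, metric, and weight below is invented for illustration; Dean’s slides describe the constraints only in the broad terms quoted above.

    from dataclasses import dataclass

    @dataclass
    class Site:
        name: str
        spare_bandwidth_gbps: float   # headroom on links into the site
        packet_loss_pct: float        # recent loss on those links
        spare_power_kw: float         # electrical headroom
        local_read_qps: float         # how much nearby demand exists

    def score(site: Site) -> float:
        """Higher is better: plenty of headroom, low loss, lots of local demand."""
        if site.spare_power_kw <= 0 or site.spare_bandwidth_gbps <= 0:
            return float("-inf")      # hard constraint: cannot place a replica here
        return (site.local_read_qps
                - 50.0 * site.packet_loss_pct
                + 0.1 * site.spare_bandwidth_gbps)

    def best_site_for_new_replica(candidates: list[Site]) -> Site:
        return max(candidates, key=score)

    sites = [
        Site("dc-belgium", 80.0, 0.1, 500.0, 12_000.0),
        Site("dc-oregon", 20.0, 0.0, 50.0, 30_000.0),
        Site("dc-taiwan", 5.0, 2.0, 0.0, 9_000.0),   # no power headroom: excluded
    ]
    print(best_site_for_new_replica(sites).name)      # dc-oregon

The point is not the particular weights but the shape of the decision: hard constraints rule sites out, and usage patterns rank whatever remains.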

Dean speaks of an “automated allocation of resources across [Google’s] entire fleet of machines” — and that's quite a fleet. Google now has at least 36 data centers across the globe — though a handful may still be under construction. And as Data Center Knowledge recently noticed, the goal is to span a far larger fleet.

According to Dean’s presentation, Google is intent on scaling Spanner to between one million and 10 million servers, encompassing 10 trillion (10^13) directories and a quintillion (10^18) bytes of storage. And all this would be spread across “100s to 1000s” of locations around the world.

Imagine that. A single corporation housing an exabyte of the world's data across thousands of custom-built data centers.

[Slide from Dean’s presentation: Google’s 10-million-server vision]

Dean declined to discuss the presentation with The Reg. And Google’s PR arm has yet to respond to specific questions about the Spanner setup. But Google senior manager of engineering and architecture Vijay Gill alluded to the technology during an appearance at the cloud-happy Structure 09 mini-conference in San Francisco earlier this year.

Google’s favorite sentence

Asked what he would do if he could “wave a magic wand” to create a back-end net technology that “we don’t have today,” Gill waxed cryptic about Google’s famously distributed online infrastructure — which treats data centers as “warehouse-scale” machines — touching on the idea of moving loads from any data center that’s in danger of overheating.

“What we are building here...is warehouse-sized compute platforms,” Gill said. “You have to have integration with everything right from the chillers down all the way to the CPU.

“Sometimes, there’s a temperature excursion, and you might want to do a quick load-shedding — a quick load-shedding to prevent a temperature excursion because, hey, you have a data center with no chillers. You want to move some load off. You want to cut some CPUs and some of the processes in RAM.”

And he indicated the company could do this automatically and near-instantly — meaning without human intervention. “How do you manage the system and optimize it on a global level? That is the interesting part,” Gill continued.

“What we’ve got here [with Google] is massive — like hundreds of thousands of variable linear programming problems that need to run in quasi-real-time. When the temperature starts to excurse in a data center, you don’t have the luxury of sitting around for half an hour… You have on the order of seconds.”
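For a rough sense of what a load-shedding decision phrased as a linear program looks like, here is a toy Python example using scipy.optimize.linprog. The data centers, capacities, and costs are made up; Gill gives no detail beyond the quote above.

    from scipy.optimize import linprog

    # Hypothetical numbers: spare capacity (arbitrary load units) and a
    # relative cost of shifting work to each healthy data center.
    spare = [40.0, 25.0, 60.0]        # dc-a, dc-b, dc-c
    cost = [1.0, 3.0, 2.0]            # cheaper = closer, better connected
    load_to_shed = 70.0               # work leaving the overheating site

    # Minimise total transfer cost subject to:
    #   x[0] + x[1] + x[2] == load_to_shed   (all shed load lands somewhere)
    #   0 <= x[j] <= spare[j]                (no destination is overloaded)
    result = linprog(
        c=cost,
        A_eq=[[1.0, 1.0, 1.0]],
        b_eq=[load_to_shed],
        bounds=[(0.0, s) for s in spare],
    )

    print(result.x)  # [40., 0., 30.] -- fill the cheapest sites first

The real problem would carry vastly more variables and constraints, which is presumably why Gill frames it as quasi-real-time optimization rather than something a human operator could do by hand.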

Asked if this was a technology Google is using today, Gill responded with one of Google’s favorite sentences. “I could not possibly comment on that,” he said. When we later asked uber Googler Matt Cutts about this — with a Google PR man listening on the line — Cutts gave another Googly response: “I don’t believe we have published any papers regarding that,” he said.

But it would seem that Gill was referring to Spanner. And judging from Dean’s presentation, the technology has already been deployed. As reported by Data Center Knowledge, Google has also said that its new data center in Saint-Ghislain, Belgium, operates without chillers. Apparently, Spanner is used to automatically move loads out of the Belgium facility when the outside air gets too hot during the summer.

Additional information is sketchy. Dean refers to Spanner as a “single global namespace,” with names completely independent of the location of the data. The design is similar to BigTable, Google’s distributed database, but rather than organizing data in rows, it uses hierarchical directories.
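The snippet below is only a guess at the flavour of such a design, in Python: hierarchical names stay fixed while the list of sites holding each directory’s replicas changes underneath them. All paths and site names are invented.

    # Toy location-independent namespace: names carry no placement information.
    namespace = {
        "/mail/users/eu-west": ["dc-belgium", "dc-ireland"],
        "/mail/users/us-east": ["dc-carolina", "dc-oregon", "dc-taiwan"],
    }

    def replicas_for(directory: str) -> list[str]:
        """Resolve a directory name to the sites currently holding its replicas."""
        return namespace[directory]

    def move_replica(directory: str, src: str, dst: str) -> None:
        """Relocate one replica; callers keep using the same directory name."""
        sites = namespace[directory]
        sites[sites.index(src)] = dst

    move_replica("/mail/users/eu-west", "dc-belgium", "dc-netherlands")
    print(replicas_for("/mail/users/eu-west"))  # name unchanged, location not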

Dean also mentions “zones of semi-autonomous control,” indicating that Google splits its distributed infrastructure into various subsections that provide redundancy by operating independently of each other.

The goal, Dean says, is to provide access to data in less than 50 milliseconds 99 per cent of the time. And Google aims to store data on at least two disks in the European Union, two in the US, and one in Asia.
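Expressed as a quick sanity check in Python, that placement rule might look like the following. The site-to-region table is invented; Dean’s slides state only the per-region minimums.

    # Hypothetical site -> region lookup; all names are made up.
    REGION = {
        "dc-belgium": "eu", "dc-ireland": "eu",
        "dc-carolina": "us", "dc-oregon": "us",
        "dc-taiwan": "asia",
    }

    MINIMUM = {"eu": 2, "us": 2, "asia": 1}

    def satisfies_policy(replica_sites: list[str]) -> bool:
        """True if the replica set meets every per-region minimum."""
        counts: dict[str, int] = {}
        for site in replica_sites:
            region = REGION[site]
            counts[region] = counts.get(region, 0) + 1
        return all(counts.get(r, 0) >= n for r, n in MINIMUM.items())

    print(satisfies_policy(["dc-belgium", "dc-ireland",
                            "dc-carolina", "dc-oregon", "dc-taiwan"]))  # True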

But one has to wonder how far this technology has actually progressed. Over the past year, two much-discussed Gmail outages occurred when Google was moving workloads between data centers.

Clearly, Google has a talent for distributed computing. But it also has a talent for leaking just enough information to make you think it must be doing something that no one else could possibly do. ®