Google Caffeine: What it really is
Wake up and smell the file system
As it invites the world to play in a mysterious sandbox it likes to call "Caffeine," Google is testing more than just a "next-generation" search infrastructure. It's testing at least a portion of a revamped software architecture that will likely underpin all of its online applications for years to come.
Speaking with The Reg, über-Googler Matt Cutts confirms that the company's new Caffeine search infrastructure is built atop a complete overhaul of the company's custom-built Google File System, a project two years in the making. At least informally, Google refers to this file system redux as GFS2.
"There are a lot of technologies that are under the hood within Caffeine, and one of the things that Caffeine relies on is next-generation storage," he says. "Caffeine certainly does make use of the so-called GFS2."
Asked whether Caffeine also includes improvements to MapReduce, Google's distributed number-crunching platform, or BigTable, its distributed real-time database, Cutts declines to comment. But he does say that with Caffeine, Google is testing multiple platforms that could be applied across its entire online infrastructure - not just its search engine.
"I wouldn't get caught up on next-generation MapReduce and next-generation BigTable. Just because we have next-generation GFS does not automatically imply that we've got other next-generation implementations of platforms we've publicly talked about," he says. "But certainly, we are testing a lot of pieces that we would expect to - or hope to - migrate to eventually."
And he hints that Caffeine includes some novel platforms that could be rolled out to Google's famously unified online empire. "There are certainly new tools in the mix," he says.
Matt Cutts is the man who oversees the destruction of spam on the world's most popular search engine - the PageRank guru who typically opines about the ups and downs of Google's search algorithms. So, on Monday afternoon, when Cutts published a blog post revealing a "secret project" to build a "next-generation architecture for Google's web search," many seemed to think this was some sort of change in search-ranking philosophy. But Cutts made it perfectly clear that this is merely an effort to upgrade the software sitting behind its search engine.
"The new infrastructure sits 'under the hood' of Google's search engine," read his blog post, "which means that most users won't notice a difference in search results."
Caffeine includes, as Cutts tells us, a top-to-bottom rewrite of Google's indexing system - i.e., the system that builds a database of all known websites, complete with all the metadata needed to describe them. It's not an effort to change the way that index is used to generate search results.
"Caffeine is a fundamental re-architecting of how our indexing system works," Cutts says. "It's larger than a revamp. It's more along the lines of a rewrite. And it's really great. It gives us a lot more flexibility, a lot more power. The ability to index more documents. Indexing speeds - that is, how quickly you can put a document through our indexing system and make it searchable - is much much better."
Building an index is a number-crunching exercise - an epic number-crunching exercise. And for tasks like this, Google uses a home-grown, proprietary distributed infrastructure to harness a sea of servers built from commodity hardware. That means GFS, which stores the data, and MapReduce, which crunches it.
Yes, Cutts plays down the idea that Google has overhauled MapReduce. But just as Yahoo!, Facebook, and others are working to improve the speed of Hadoop - the open source platform based on MapReduce - Google is eternally tweaking the original.
"The ideas behind MapReduce are very solid," Cutts tells us, "and that abstraction works very well. You can almost think of MapReduce as an abstraction - this idea of breaking up a task into many parts, mapping over them, and outputting data which can then be reduced. That's almost more of an abstraction, and specific ways you implement it can vary."
But when pressed, Cutts leaves no doubt that Caffeine employs the GFS2. And as we detailed earlier this week, GFS2 is a significant departure from the original Google File System that made its debut nearly ten years ago and is now used not only for search but for all of Google's online services.
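The abstraction Cutts describes - break a task into parts, map over them, then reduce the output - can be illustrated in a few lines. What follows is a minimal single-process sketch, not Google's implementation: the names `map_reduce`, `word_mapper`, and `sum_reducer` are hypothetical, and the word-count task is simply the customary toy example.

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Run mapper over every input, group emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for item in inputs:                      # "breaking up a task into many parts"
        for key, value in mapper(item):      # "mapping over them"
            groups[key].append(value)
    # "outputting data which can then be reduced"
    return {key: reducer(key, values) for key, values in groups.items()}

def word_mapper(document):
    # Emit a (word, 1) pair for every word in the document.
    for word in document.lower().split():
        yield word, 1

def sum_reducer(key, values):
    # Collapse all the counts emitted for one word into a total.
    return sum(values)

docs = ["caffeine is fast", "caffeine is new"]
counts = map_reduce(docs, word_mapper, sum_reducer)
print(counts)  # {'caffeine': 2, 'is': 2, 'fast': 1, 'new': 1}
```

In a real distributed system the map and reduce phases would run on many machines and the grouping step would move data across the network. As Cutts says, the abstraction stays the same while the implementation can vary.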
Today, Caffeine. Tomorrow, The Empire
Google's philosophy is to build a single distributed architecture that treats its vast network of data centers as a single virtual machine.
"[Data centers] are just atoms," Google senior manager of engineering and architecture Vijay Gill said recently. "Any idiot can build atoms together and then create this vast infrastructure. The question is: How do you actually get the applications to use the infrastructure? How do you distribute it? How do you optimize it? That's the hard part. To do that you require an insane amount of force of will...
"We have a set of primitives, if you would, that takes those collections of atoms - those data centers, those networks - that we've built, and then they abstract that entire infrastructure out as a set of services - some of the public ones are GFS obviously, BigTable, MapReduce."
Caffeine is about the search index. But GFS2 is designed specifically for applications like Gmail and YouTube, applications that - unlike an indexing system - are served up directly to the end user. Such apps require ultra-low latency, and that's not something the original GFS was designed for.
With GFS, a master node oversees data spread across a series of distributed chunkservers. And for apps that require low latency, that lone master - a single point of failure - is a problem.
"One GFS shortcoming that this immediately exposed had to do with the original single-master design," former GFS tech lead Sean Quinlan has said. "A single point of failure may not have been a disaster for batch-oriented applications, but it was certainly unacceptable for latency-sensitive applications, such as video serving."
GFS2 uses not only distributed slaves, but distributed masters as well.
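To see why the lone master matters, consider a toy model. Google has not published how GFS2 distributes its masters, so the hash-based partitioning below is purely an assumption for illustration, and the `Master` and `ShardedMasters` classes are hypothetical. The sketch shows only the general point: with one master, a failure makes every file lookup fail, while with several masters a failure affects only the files that master owns.

```python
import hashlib

class Master:
    """Toy metadata server: maps file paths to the chunkservers holding their data."""
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.chunk_locations = {}

    def register(self, path, chunkservers):
        self.chunk_locations[path] = list(chunkservers)

    def lookup(self, path):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.chunk_locations[path]

class ShardedMasters:
    """Assumed design: each path is owned by one of several masters, chosen by hash."""
    def __init__(self, masters):
        self.masters = masters

    def owner(self, path):
        digest = hashlib.sha256(path.encode()).digest()
        return self.masters[int.from_bytes(digest[:8], "big") % len(self.masters)]

    def register(self, path, chunkservers):
        self.owner(path).register(path, chunkservers)

    def lookup(self, path):
        return self.owner(path).lookup(path)

def unavailable(cluster, paths):
    """Return the paths whose metadata lookup currently fails."""
    failed = []
    for path in paths:
        try:
            cluster.lookup(path)
        except ConnectionError:
            failed.append(path)
    return failed

paths = [f"/videos/clip-{i}" for i in range(100)]

single = Master("master-0")
masters = [Master(f"master-{i}") for i in range(3)]
sharded = ShardedMasters(masters)
for p in paths:
    single.register(p, ["chunkserver-1", "chunkserver-2"])
    sharded.register(p, ["chunkserver-1", "chunkserver-2"])

# Knock out one master in each setup.
single.alive = False
masters[0].alive = False

single_down = unavailable(single, paths)
sharded_down = unavailable(sharded, paths)
owned_by_m0 = [p for p in paths if sharded.owner(p) is masters[0]]
print(len(single_down), "of", len(paths), "lookups fail with a single master")
print(len(sharded_down), "of", len(paths), "lookups fail with three masters")
```

A production design would also replicate each master's metadata so that even the affected slice stays reachable. That is beyond what this sketch attempts.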
So, today, Caffeine - tomorrow, everything else. Cutts confirms that Caffeine is running in a single Google data center - and that would seem to imply that GFS2 has only been deployed in that one facility. Reg readers have marveled at the scope of Google's pending upgrade, with one commenter hoping that Google has equipped its engineers with "massively reinforced underwear."
But Cutts downplays the risks and hassle, saying the migration is a matter of taking one data center offline at a time. "At any point, we have the ability to take one data center out of the rotation, if we wanted to swap out power components or different hardware - or change the software," he says. "So you can imagine building an index at one of the data centers and then copying that data throughout all the other data centers.
"If you want to deploy new software, you could take one of the data centers out of the traditional rotation. And you can send any degree of traffic to it."
Vijay Gill has even hinted that Google has developed some sort of magical software layer that can automatically migrate loads in and out of data centers in near real time. But when asked about this - with a Google PR man listening on the line - Cutts gave a very Googly response. "I don't believe we have published any papers regarding that." The company likes being coy.
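Cutts's description of taking a data center out of rotation and then sending it "any degree of traffic" amounts to weighted routing. The sketch below is a guess at the general idea, not at Google's actual tooling: the `Rotation` class, the data center names, and the weights are all hypothetical.

```python
import random
from collections import Counter

class Rotation:
    """Toy traffic router: picks a data center for each request in proportion to its weight."""
    def __init__(self, weights, seed=0):
        self.weights = dict(weights)
        self._rng = random.Random(seed)  # seeded so runs are repeatable

    def set_weight(self, data_center, weight):
        if weight < 0:
            raise ValueError("weight must be non-negative")
        self.weights[data_center] = weight

    def route(self):
        names = list(self.weights)
        weights = [self.weights[name] for name in names]
        # random.choices never selects an entry whose weight is zero.
        return self._rng.choices(names, weights=weights, k=1)[0]

rotation = Rotation({"dc-a": 1.0, "dc-b": 1.0, "dc-c": 1.0})

# Take dc-c out of the rotation, e.g. to swap hardware or deploy new software.
rotation.set_weight("dc-c", 0.0)
drained = Counter(rotation.route() for _ in range(1000))

# Bring it back with a small share of traffic to trial the new software.
rotation.set_weight("dc-c", 0.05)
trial = Counter(rotation.route() for _ in range(1000))

print("while drained:", dict(drained))
print("during trial:", dict(trial))
```

With the weight at zero, the drained data center receives no requests at all. Raising the weight gradually is one way to expose new software to a controlled slice of users before committing to it everywhere.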
In similar fashion, Cutts won't say all that much about the tools rolled into Caffeine, which is publicly available here (except when it's not). But he leaves no doubt that this, well, semi-secret project isn't just a search upgrade. ®