Google Caffeine: What it really is
Wake up and smell the file system
As it invites the world to play in a mysterious sandbox it likes to call "Caffeine," Google is testing more than just a "next-generation" search infrastructure. It's testing at least a portion of a revamped software architecture that will likely underpin all of its online applications for years to come.
Speaking with The Reg, über-Googler Matt Cutts confirms that the company's new Caffeine search infrastructure is built atop a complete overhaul of the company's custom-built Google File System, a project two years in the making. At least informally, Google refers to this file system redux as GFS2.
"There are a lot of technologies that are under the hood within Caffeine, and one of the things that Caffeine relies on is next-generation storage," he says. "Caffeine certainly does make use of the so-called GFS2."
Asked whether Caffeine also includes improvements to MapReduce, Google's distributed number-crunching platform, or BigTable, its distributed real-time database, Cutts declines to comment. But he does say that with Caffeine, Google is testing multiple platforms that could be applied across its entire online infrastructure - not just its search engine.
"I wouldn't get caught up on next-generation MapReduce and next-generation BigTable. Just because we have next-generation GFS does not automatically imply that we've got other next-generation implementations of platforms we've publicly talked about," he says. "But certainly, we are testing a lot of pieces that we would expect to - or hope to - migrate to eventually."
And he hints that Caffeine includes some novel platforms that could be rolled out to Google's famously unified online empire. "There are certainly new tools in the mix," he says.
Matt Cutts is the man who oversees the destruction of spam on the world's most popular search engine - the PageRank guru who typically opines about the ups and downs of Google's search algorithms. So, on Monday afternoon, when Cutts published a blog post revealing a "secret project" to build a "next-generation architecture for Google's web search," many seemed to think this was some sort of change in search-ranking philosophy. But Cutts made it perfectly clear that this is merely an effort to upgrade the software sitting behind Google's search engine.
"The new infrastructure sits 'under the hood' of Google's search engine," read his blog post, "which means that most users won't notice a difference in search results."
Caffeine includes, as Cutts tells us, a top-to-bottom rewrite of Google's indexing system - i.e., the system that builds a database of all known websites, complete with all the metadata needed to describe them. It's not an effort to change the way that index is used to generate search results.
"Caffeine is a fundamental re-architecting of how our indexing system works," Cutts says. "It's larger than a revamp. It's more along the lines of a rewrite. And it's really great. It gives us a lot more flexibility, a lot more power. The ability to index more documents. Indexing speeds - that is, how quickly you can put a document through our indexing system and make it searchable - is much much better."
Building an index is a number-crunching exercise - an epic number-crunching exercise. And for tasks like this, Google uses a home-grown, proprietary distributed infrastructure to harness a sea of servers built from commodity hardware. That means GFS, which stores the data, and MapReduce, which crunches it.
Yes, Cutts plays down the idea that Google has overhauled MapReduce. But just as Yahoo!, Facebook, and others are working to improve the speed of Hadoop - the open source platform based on MapReduce - Google is eternally tweaking the original.
"The ideas behind MapReduce are very solid," Cutts tells us, "and that abstraction works very well. You can almost think of MapReduce as an abstraction - this idea of breaking up a task into many parts, mapping over them, and outputting data which can then be reduced. That's almost more of an abstraction, and specific ways you implement it can vary."
But when pressed, Cutts leaves no doubt that Caffeine employs GFS2. And as we detailed earlier this week, GFS2 is a significant departure from the original Google File System, which made its debut nearly ten years ago and is now used not only for search but for all of Google's online services.