Feeds

Google Caffeine: What it really is

Wake up and smell the file system

Internet Security Threat Report 2014

As it invites the world to play in a mysterious sandbox it likes to call "Caffeine," Google is testing more than just a "next-generation" search infrastructure. It's testing at least a portion of a revamped software architecture that will likely underpin all of its online applications for years to come.

Speaking with The Reg, über-Googler Matt Cutts confirms that the company's new Caffeine search infrastructure is built atop a complete overhaul of the company's custom-built Google File System, a project two years in the making. At least informally, Google refers to this file system redux as GFS2.

"There are a lot of technologies that are under the hood within Caffeine, and one of the things that Caffeine relies on is next-generation storage," he says. "Caffeine certainly does make use of the so-called GFS2."

Asked whether Caffeine also includes improvements to MapReduce, Google's distributed number-crunching platform, or BigTable, its distributed real-time database, Cutts declines to comment. But he does say that with Caffeine, Google is testing multiple platforms that could be applied across its entire online infrastructure - not just its search engine.

"I wouldn't get caught up on next-generation MapReduce and next-generation BigTable. Just because we have next-generation GFS does not automatically imply that we've got other next-generation implementations of platforms we've publicly talked about," he says. "But certainly, we are testing a lot of pieces that we would expect to - or hope to - migrate to eventually."

And he hints that Caffeine includes some novel platforms that could be rolled out to Google's famously unified online empire. "There are certainly new tools in the mix," he says.

Matt Cutts is the man who oversees the destruction of spam on the world's most popular search - the PageRank guru who typically opines about the ups and downs of Google's search algorithms. So, on Monday afternoon, when Cutts posted a blog post revealing a "secret project" to build a "next-generation architecture for Google's web search," many seemed to think this was some sort of change in search-ranking philosophy. But Cutts made it perfectly clear that this is merely an effort to upgrade the software sitting behind its search engine.

"The new infrastructure sits 'under the hood' of Google's search engine," read his blog post, "which means that most users won't notice a difference in search results."

Caffeine includes, as Cutts tells us, a top-to-bottom rewrite of the Google's indexing system - i.e., the system that builds a database of all known websites, complete with all the metadata needed to describe them. It's not an effort to change the way that index is used to generate search results.

"Caffeine is a fundamental re-architecting of how our indexing system works," Cutts says. "It's larger than a revamp. It's more along the lines of a rewrite. And it's really great. It gives us a lot more flexibility, a lot more power. The ability to index more documents. Indexing speeds - that is, how quickly you can put a document through our indexing system and make it searchable - is much much better."

Building an index is a number-crunching exercise - an epic number-crunching exercise. And for tasks like this, Google's uses a home-grown, proprietary distributed infrastructure to harness a sea of servers built from commodity hardware. That means GFS, which stores the data, and MapReduce, which crunches it.

Yes, Cutts plays down the idea that Google has overhauled MapReduce. But just as Yahoo!, Facebook, and others are working to improve the speed of Hadoop - the open source platform based on MapReduce - Google is eternally tweaking the original.

"The ideas behind MapReduce are very solid," Cutts tells us, "and that abstraction works very well. You can almost think of MapReduce as an abstraction - this idea of breaking up a task into many parts, mapping over them, and outputting data which can then be reduced. That's almost more of an abstraction, and specific ways you implement it can vary."

But when pressed, Cutts leaves no doubt that Caffeine employs the GFS2. And as we detailed earlier this week, GFS2 is a significant departure from the original Google File System that made its debut nearly ten years ago and is now used not only for search but for all of Google's online services.

Internet Security Threat Report 2014

More from The Register

next story
Azure TITSUP caused by INFINITE LOOP
Fat fingered geo-block kept Aussies in the dark
NASA launches new climate model at SC14
75 days of supercomputing later ...
Yahoo! blames! MONSTER! email! OUTAGE! on! CUT! CABLE! bungle!
Weekend woe for BT as telco struggles to restore service
You think the CLOUD's insecure? It's BETTER than UK.GOV's DATA CENTRES
We don't even know where some of them ARE – Maude
DEATH by COMMENTS: WordPress XSS vuln is BIGGEST for YEARS
Trio of XSS turns attackers into admins
BOFH: WHERE did this 'fax-enabled' printer UPGRADE come from?
Don't worry about that cable, it's part of the config
Cloud unicorns are extinct so DiData cloud mess was YOUR fault
Applications need to be built to handle TITSUP incidents
Astro-boffins start opening universe simulation data
Got a supercomputer? Want to simulate a universe? Here you go
prev story

Whitepapers

Choosing cloud Backup services
Demystify how you can address your data protection needs in your small- to medium-sized business and select the best online backup service to meet your needs.
Getting started with customer-focused identity management
Learn why identity is a fundamental requirement to digital growth, and how without it there is no way to identify and engage customers in a meaningful way.
5 critical considerations for enterprise cloud backup
Key considerations when evaluating cloud backup solutions to ensure adequate protection security and availability of enterprise data.
High Performance for All
While HPC is not new, it has traditionally been seen as a specialist area – is it now geared up to meet more mainstream requirements?
How to simplify SSL certificate management
Simple steps to take control of SSL certificates across the enterprise, and recommendations centralizing certificate management throughout their lifecycle.