Google Caffeine: What it really is

Wake up and smell the file system

As it invites the world to play in a mysterious sandbox it likes to call "Caffeine," Google is testing more than just a "next-generation" search infrastructure. It's testing at least a portion of a revamped software architecture that will likely underpin all of its online applications for years to come.

Speaking with The Reg, über-Googler Matt Cutts confirms that the company's new Caffeine search infrastructure is built atop a complete overhaul of the company's custom-built Google File System, a project two years in the making. At least informally, Google refers to this file system redux as GFS2.

"There are a lot of technologies that are under the hood within Caffeine, and one of the things that Caffeine relies on is next-generation storage," he says. "Caffeine certainly does make use of the so-called GFS2."

Asked whether Caffeine also includes improvements to MapReduce, Google's distributed number-crunching platform, or BigTable, its distributed real-time database, Cutts declines to comment. But he does say that with Caffeine, Google is testing multiple platforms that could be applied across its entire online infrastructure - not just its search engine.

"I wouldn't get caught up on next-generation MapReduce and next-generation BigTable. Just because we have next-generation GFS does not automatically imply that we've got other next-generation implementations of platforms we've publicly talked about," he says. "But certainly, we are testing a lot of pieces that we would expect to - or hope to - migrate to eventually."

And he hints that Caffeine includes some novel platforms that could be rolled out to Google's famously unified online empire. "There are certainly new tools in the mix," he says.

Matt Cutts is the man who oversees the destruction of spam on the world's most popular search engine - the PageRank guru who typically opines about the ups and downs of Google's search algorithms. So, on Monday afternoon, when Cutts published a blog post revealing a "secret project" to build a "next-generation architecture for Google's web search," many seemed to think this was some sort of change in search-ranking philosophy. But Cutts made it perfectly clear that this is merely an effort to upgrade the software sitting behind Google's search engine.

"The new infrastructure sits 'under the hood' of Google's search engine," read his blog post, "which means that most users won't notice a difference in search results."

Caffeine includes, as Cutts tells us, a top-to-bottom rewrite of Google's indexing system - i.e., the system that builds a database of all known websites, complete with all the metadata needed to describe them. It's not an effort to change the way that index is used to generate search results.

"Caffeine is a fundamental re-architecting of how our indexing system works," Cutts says. "It's larger than a revamp. It's more along the lines of a rewrite. And it's really great. It gives us a lot more flexibility, a lot more power. The ability to index more documents. Indexing speeds - that is, how quickly you can put a document through our indexing system and make it searchable - is much much better."

Building an index is a number-crunching exercise - an epic number-crunching exercise. And for tasks like this, Google uses a home-grown, proprietary distributed infrastructure to harness a sea of servers built from commodity hardware. That means GFS, which stores the data, and MapReduce, which crunches it.

Yes, Cutts plays down the idea that Google has overhauled MapReduce. But just as Yahoo!, Facebook, and others are working to improve the speed of Hadoop - the open source platform based on MapReduce - Google is eternally tweaking the original.

"The ideas behind MapReduce are very solid," Cutts tells us, "and that abstraction works very well. You can almost think of MapReduce as an abstraction - this idea of breaking up a task into many parts, mapping over them, and outputting data which can then be reduced. That's almost more of an abstraction, and specific ways you implement it can vary."
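For readers who want to see the abstraction Cutts describes in miniature, here is a rough single-machine sketch in Python. It is not Google's code and says nothing about how Caffeine works. The function names (`map_phase`, `shuffle`, `reduce_phase`, `build_index`) and the two toy documents are our own inventions, and the "index" is nothing more than a word-to-document lookup table. A real deployment would spread each phase across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(doc_id: str, text: str) -> Iterator[Tuple[str, str]]:
    """Map step: emit one (word, doc_id) pair per distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group the mapped pairs by key so each word's values end up together."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for word, doc_id in pairs:
        groups[word].append(doc_id)
    return groups


def reduce_phase(word: str, doc_ids: List[str]) -> Tuple[str, List[str]]:
    """Reduce step: collapse one word's values into a sorted posting list."""
    return word, sorted(doc_ids)


def build_index(docs: Dict[str, str]) -> Dict[str, List[str]]:
    """Run map, shuffle, and reduce in sequence on a single machine."""
    pairs = (
        pair
        for doc_id, text in docs.items()
        for pair in map_phase(doc_id, text)
    )
    return dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())


# Hypothetical toy corpus, invented purely for illustration.
docs = {
    "a": "google file system",
    "b": "google search index",
}
index = build_index(docs)
```

Because each call to `map_phase` looks at a single document and each call to `reduce_phase` looks at a single word, the pieces can in principle run independently, which is the property that lets the same idea be implemented in very different ways.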

But when pressed, Cutts leaves no doubt that Caffeine employs GFS2. And as we detailed earlier this week, GFS2 is a significant departure from the original Google File System that made its debut nearly ten years ago and is now used not only for search but for all of Google's online services.
