Feeds

Google Caffeine: What it really is

Wake up and smell the file system

Top 5 reasons to deploy VMware with Tegile

As it invites the world to play in a mysterious sandbox it likes to call "Caffeine," Google is testing more than just a "next-generation" search infrastructure. It's testing at least a portion of a revamped software architecture that will likely underpin all of its online applications for years to come.

Speaking with The Reg, über-Googler Matt Cutts confirms that the company's new Caffeine search infrastructure is built atop a complete overhaul of the company's custom-built Google File System, a project two years in the making. At least informally, Google refers to this file system redux as GFS2.

"There are a lot of technologies that are under the hood within Caffeine, and one of the things that Caffeine relies on is next-generation storage," he says. "Caffeine certainly does make use of the so-called GFS2."

Asked whether Caffeine also includes improvements to MapReduce, Google's distributed number-crunching platform, or BigTable, its distributed real-time database, Cutts declines to comment. But he does say that with Caffeine, Google is testing multiple platforms that could be applied across its entire online infrastructure - not just its search engine.

"I wouldn't get caught up on next-generation MapReduce and next-generation BigTable. Just because we have next-generation GFS does not automatically imply that we've got other next-generation implementations of platforms we've publicly talked about," he says. "But certainly, we are testing a lot of pieces that we would expect to - or hope to - migrate to eventually."

And he hints that Caffeine includes some novel platforms that could be rolled out to Google's famously unified online empire. "There are certainly new tools in the mix," he says.

Matt Cutts is the man who oversees the destruction of spam on the world's most popular search - the PageRank guru who typically opines about the ups and downs of Google's search algorithms. So, on Monday afternoon, when Cutts posted a blog post revealing a "secret project" to build a "next-generation architecture for Google's web search," many seemed to think this was some sort of change in search-ranking philosophy. But Cutts made it perfectly clear that this is merely an effort to upgrade the software sitting behind its search engine.

"The new infrastructure sits 'under the hood' of Google's search engine," read his blog post, "which means that most users won't notice a difference in search results."

Caffeine includes, as Cutts tells us, a top-to-bottom rewrite of the Google's indexing system - i.e., the system that builds a database of all known websites, complete with all the metadata needed to describe them. It's not an effort to change the way that index is used to generate search results.

"Caffeine is a fundamental re-architecting of how our indexing system works," Cutts says. "It's larger than a revamp. It's more along the lines of a rewrite. And it's really great. It gives us a lot more flexibility, a lot more power. The ability to index more documents. Indexing speeds - that is, how quickly you can put a document through our indexing system and make it searchable - is much much better."

Building an index is a number-crunching exercise - an epic number-crunching exercise. And for tasks like this, Google's uses a home-grown, proprietary distributed infrastructure to harness a sea of servers built from commodity hardware. That means GFS, which stores the data, and MapReduce, which crunches it.

Yes, Cutts plays down the idea that Google has overhauled MapReduce. But just as Yahoo!, Facebook, and others are working to improve the speed of Hadoop - the open source platform based on MapReduce - Google is eternally tweaking the original.

"The ideas behind MapReduce are very solid," Cutts tells us, "and that abstraction works very well. You can almost think of MapReduce as an abstraction - this idea of breaking up a task into many parts, mapping over them, and outputting data which can then be reduced. That's almost more of an abstraction, and specific ways you implement it can vary."

But when pressed, Cutts leaves no doubt that Caffeine employs the GFS2. And as we detailed earlier this week, GFS2 is a significant departure from the original Google File System that made its debut nearly ten years ago and is now used not only for search but for all of Google's online services.

Beginner's guide to SSL certificates

More from The Register

next story
Ellison: Sparc M7 is Oracle's most important silicon EVER
'Acceleration engines' key to performance, security, Larry says
Oracle SHELLSHOCKER - data titan lists unpatchables
Database kingpin lists 32 products that can't be patched (yet) as GNU fixes second vuln
Lenovo to finish $2.1bn IBM x86 server gobble in October
A lighter snack than expected – but what's a few $100m between friends, eh?
Ello? ello? ello?: Facebook challenger in DDoS KNOCKOUT
Gets back up again after half an hour though
Hey, what's a STORAGE company doing working on Internet-of-Cars?
Boo - it's not a terabyte car, it's just predictive maintenance and that
prev story

Whitepapers

Forging a new future with identity relationship management
Learn about ForgeRock's next generation IRM platform and how it is designed to empower CEOS's and enterprises to engage with consumers.
Storage capacity and performance optimization at Mizuno USA
Mizuno USA turn to Tegile storage technology to solve both their SAN and backup issues.
The next step in data security
With recent increased privacy concerns and computers becoming more powerful, the chance of hackers being able to crack smaller-sized RSA keys increases.
Security for virtualized datacentres
Legacy security solutions are inefficient due to the architectural differences between physical and virtual environments.
A strategic approach to identity relationship management
ForgeRock commissioned Forrester to evaluate companies’ IAM practices and requirements when it comes to customer-facing scenarios versus employee-facing ones.