Feeds

Google Percolator – global search jolt sans MapReduce comedown

The machine that brews the Caffeine

Application security programs and practises

Google Caffeine — the revamped search infrastructure recently rolled out across Google's worldwide network of data centers — is based on a distributed data-processing system known as Percolator. Designed by Google and, until now, jealously guarded by Google, Percolator is a platform for "incremental processing" — a means of continually updating the company's epic search index without reprocessing the entire thing from scratch.

As Google senior director of engineering Eisar Lipkovitz told The Register earlier this month, the new platform is a speedier alternative to MapReduce, the distributed number-crunching platform that underpinned the company's previous indexing system. Two New York-based Google engineers — Daniel Peng and Frank Dabek — discuss the platform at length in a paper they are scheduled to present at the annual USENIX Symposium on Operating Systems Design and Implementation (OSDI) next month in Vancouver.

"MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency," the paper reads. "We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day, while reducing the average age of documents in Google search results by 50%."

Speaking with The Register, Lipkovitz compared the system to classic database programming and the use of "database triggers." Because the index can be updated incrementally, the median document moves through Caffeine over 100 times faster than it moved through the company's old MapReduce setup. "The Percolator-based indexing system (known as Caffeine), crawls the same number of documents, but we feed each document through Percolator as it is crawled. The immediate advantage, and main design goal, of Caffeine is a reduction in latency."

“By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day, while reducing the average age of documents in Google search results by 50%.”

In the past, Google's search index — an index of the entire web — was built with a series of batch operations. The MapReduce platform "maps" tasks across a vast collection of distributed machines, splitting them into tiny sub-tasks, before "reducing" the results into one master calculation. Google's webcrawlers would supply the raw data — the webpages and weblinks — and MapReduce would process this data, determining, among other things, each site's PageRank, that famous measure of how many other sites it links to.

Eight steps to building an HP BladeSystem

Next page: MapReduce reduced

More from The Register

next story
Sysadmin Day 2014: Quick, there's still time to get the beers in
He walked over the broken glass, killed the thugs... and er... reconnected the cables*
Apple fanbois SCREAM as update BRICKS their Macbook Airs
Ragegasm spills over as firmware upgrade kills machines
SHOCK and AWS: The fall of Amazon's deflationary cloud
Just as Jeff Bezos did to books and CDs, Amazon's rivals are now doing to it
Amazon Reveals One Weird Trick: A Loss On Almost $20bn In Sales
Investors really hate it: Share price plunge as growth SLOWS in key AWS division
EU's top data cops to meet Google, Microsoft et al over 'right to be forgotten'
Plan to hammer out 'coherent' guidelines. Good luck chaps!
US judge: YES, cops or feds so can slurp an ENTIRE Gmail account
Crooks don't have folders labelled 'drug records', opines NY beak
Auntie remains MYSTIFIED by that weekend BBC iPlayer and website outage
Still doing 'forensics' on the caching layer – Beeb digi wonk
Manic malware Mayhem spreads through Linux, FreeBSD web servers
And how Google could cripple infection rate in a second
prev story

Whitepapers

Top three mobile application threats
Prevent sensitive data leakage over insecure channels or stolen mobile devices.
Implementing global e-invoicing with guaranteed legal certainty
Explaining the role local tax compliance plays in successful supply chain management and e-business and how leading global brands are addressing this.
Boost IT visibility and business value
How building a great service catalog relieves pressure points and demonstrates the value of IT service management.
Designing a Defense for Mobile Applications
Learn about the various considerations for defending mobile applications - from the application architecture itself to the myriad testing technologies.
Build a business case: developing custom apps
Learn how to maximize the value of custom applications by accelerating and simplifying their development.