Feeds

Google Percolator – global search jolt sans MapReduce comedown

The machine that brews the Caffeine

High performance access to file storage

MapReduce reduced

This worked well enough. But, the Google engineers say, it wasn't suited to rapidly updating the index. "Consider how to update that index after recrawling some small portion of the web. It’s not sufficient to run the MapReduces over just the new pages since, for example, there are links between the new pages and the rest of the web. The MapReduces must be run again over the entire repository, that is, over both the new pages and the old pages," they explain.

"Given enough computing resources, MapReduce’s scalability makes this approach feasible, and, in fact, Google’s web search index was produced in this way prior to the work described here. However, reprocessing the entire web discards the work done in earlier runs and makes latency proportional to the size of the repository, rather than the size of an update."

With the old system, the company crawled several billion documents each day and fed them, together with an epic collection of existing documents, through a sequence of roughly 100 MapReduces. Because the system was sequential, this meant that each document spent two to three days being indexed before it would actually turn up in Google's live search results.

Percolator slashes this delay by providing random access to the existing multi-petabyte index, letting Google make updates without reprocessing the entire repository. "Random access allows us to process documents individually, avoiding the global scans of the repository that MapReduce requires," the paper says. The system runs across a sea of machines, making vast numbers of changes in parallel with what the company calls ACID-complaint database transactions.

Google Percolator v MapReduce

As Lipkovitz indicated, Percolator runs atop BigTable (Google's distributed database platform) and GFS (its distributed file system). Lipkovitz also explained that the system uses a new version of GFS known as "Colossus" or GFS2, but this is not explicitly discussed in the paper. (You can find more on GFS2 here).

Percolator uses the same basic interface as BigTable. Data is stored in BigTable rows and columns, with Percolator metadata stacked up in its own "special" columns. And it uses a modified version of the BigTable API, with BigTable operations wrapped in computations specific to Percolator.

Unlike BigTable, Percolator does multi-row transactions, and it offers a compute framework for executing code atop the database. This framework is built around what Google calls "observers". In essence, these provide a means of organizing the myriad transactions. "Programmers of an incremental system need to keep track of the state of the incremental computation," the paper says. "To assist them in this task, Percolator provides observers: pieces of code that are invoked by the system whenever a user-specified column changes."

High performance access to file storage

Next page: Death to stragglers

More from The Register

next story
Seagate brings out 6TB HDD, did not need NO STEENKIN' SHINGLES
Or helium filling either, according to reports
European Court of Justice rips up Data Retention Directive
Rules 'interfering' measure to be 'invalid'
Dropbox defends fantastically badly timed Condoleezza Rice appointment
'Nothing is going to change with Dr. Rice's appointment,' file sharer promises
Cisco reps flog Whiptail's Invicta arrays against EMC and Pure
Storage reseller report reveals who's selling what
Just what could be inside Dropbox's new 'Home For Life'?
Biz apps, messaging, photos, email, more storage – sorry, did you think there would be cake?
IT bods: How long does it take YOU to train up on new tech?
I'll leave my arrays to do the hard work, if you don't mind
Amazon reveals its Google-killing 'R3' server instances
A mega-memory instance that never forgets
USA opposes 'Schengen cloud' Eurocentric routing plan
All routes should transit America, apparently
prev story

Whitepapers

Mainstay ROI - Does application security pay?
In this whitepaper learn how you and your enterprise might benefit from better software security.
Five 3D headsets to be won!
We were so impressed by the Durovis Dive headset we’ve asked the company to give some away to Reg readers.
3 Big data security analytics techniques
Applying these Big Data security analytics techniques can help you make your business safer by detecting attacks early, before significant damage is done.
The benefits of software based PBX
Why you should break free from your proprietary PBX and how to leverage your existing server hardware.
Mobile application security study
Download this report to see the alarming realities regarding the sheer number of applications vulnerable to attack, as well as the most common and easily addressable vulnerability errors.