Google's MapReduce patent - no threat to stuffed elephants

Hadoop will keep its head

In mid-January, Google won a patent for MapReduce, the distributed data crunching platform that underpins its globe-spanning online infrastructure. And that means there's at least a question mark hanging over Hadoop, the much-hyped open source platform that helps drive Yahoo!, Facebook, Microsoft's Bing, and an ever-expanding array of other web services and back-end business applications.

Hadoop is based in part on a MapReduce research paper Google published in 2004, about six months after it applied for the patent.

The Mountain View Chocolate Factory doesn't officially comment on specific patents in its portfolio. "Like other responsible, innovative companies, Google files patent applications on a variety of technologies it develops," the company recently told GigaOM, in response to questions about its MapReduce patent. "We feel that our behavior to date has been in line with our corporate values and priorities."

But the general assumption is that Google wouldn't use its patent against Hadoop or any other software that takes a lead from MapReduce, including databaseware from the likes of Aster Data Systems or Teradata. This is certainly the view of Cloudera, the all-star Silicon Valley startup that recently commercialized Hadoop in Red Hat-like fashion.

"I don't speak for Google. But Google has lots of patents, and it basically has no track record of using those patents offensively, either involving licensing or pursuing people for infringement," Cloudera chief executive Mike Olson tells The Reg, before pointing out that Google is a member of the Open Invention Network, a patent pool that grants use licenses for patented technology in an effort to promote Linux.

"All of this convinces us that this is a strategic move from Google and not something that is aimed at the head of any Hadoop adopter or satellite company - Cloudera included."

Olson adds that Cloudera has "excellent ties" back to the Mountain View search giant and that he and his backers were well aware of Google's patent application before Cloudera was founded. "We - and our investors - talked about it in detail and at length, and without a qualm, we went ahead and founded the company."

The salient Google link is Cloudera vice president Christophe Bisciglia - the former Google engineer whom Mountain View famously dispatched to the University of Washington to teach a course on what it likes to call Big Data, i.e. net-scale distributed computing. Bisciglia's curriculum actually made use of Hadoop, and he stresses that the open source platform has become an important teaching tool for Google.

"In the past, it took three to six months to get hires up to speed with how to work with [Google] technology," Bisciglia has told The Reg. "But if schools are teaching this as part of the standard undergraduate curriculum, Google saved that three to six months - multiplied by thousands of engineers."

Google hired about half the students who took Bisciglia's first class.

But even if Google did change tack, if it suddenly went on the offensive with that MapReduce patent, you wonder how successful it would be. As Yahoo! vice president of labs and research Ron Brachman points out, the basic concepts behind MapReduce are far from revolutionary. "To my mind, having grown up as a computer scientist in the 70s and taking courses on what was then thought of as parallel processing, there were techniques around that felt very similar to [MapReduce's] type of parallelism," Brachman tells The Reg.

The patent describes a "system and method for efficient large-scale data processing," and this involves "map" and "reduce" functions that have indeed been a part of parallel programming since Brachman's school days.

In essence, Google's platform "maps" data-crunching tasks across a collection of distributed machines, splitting them into tiny sub-tasks, before "reducing" the results into one master calculation. As the patent abstract puts it, one or more map modules read input data, apply an operation to "produce intermediate data values," and distribute these values "across multiple processors in the parallel processing environment." One or more reduce modules then retrieve the intermediate data and apply a new operation to provide the ultimate output.
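For the curious, the map-then-reduce pattern can be sketched in a few lines of Python. This is a toy word count on a single machine, not Google's implementation or Hadoop's API: the function names (map_words, shuffle, reduce_counts, map_reduce) are made up for illustration, and a thread pool stands in for the "multiple processors" of the patent abstract.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(chunk):
    """Map step: emit an intermediate (word, 1) pair for each word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group intermediate values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_counts(item):
    """Reduce step: collapse all values for one key into a single total."""
    key, values = item
    return key, sum(values)


def map_reduce(chunks, workers=4):
    """Run the map step over all chunks in parallel, then reduce each key."""
    # Assumption: threads stand in for the distributed machines of a real cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_words, chunks))
        groups = shuffle(mapped)
        return dict(pool.map(reduce_counts, groups.items()))


if __name__ == "__main__":
    chunks = ["the elephant", "The yellow elephant"]
    print(map_reduce(chunks))
```

Each chunk is mapped independently, which is what lets the work spread across machines. The grouping step in the middle is what a real platform does over the network, and it is where much of the engineering effort goes.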

In any event, Hadoop mirrors this general setup, as Google described it in a research paper published in December 2004. The platform was originally developed by Nutch founder Doug Cutting, who needed a distributed data crunching platform for his open source web crawler, and after he open sourced it at Apache, the platform - named for his son's yellow stuffed elephant - soon spread to some of the web's biggest names.

Yahoo! uses it to generate, among other things, the Yahoo! Search Webmap, which provides the index for its search engine. And it underpins Powerset, the so-called semantic search engine that was purchased by Microsoft and now drives portions of Bing.

Meanwhile, Cloudera is helping to deploy the platform on clusters used by countless other companies, including Rackspace, Netflix, LinkedIn, Samsung, and eHarmony. Rackspace, for one, is using a Hadoop cluster to crunch log data from its hosting infrastructure and serve up reports to support reps. The platform can be applied to almost any breed of Big Data - and not so big data.

"We really don't like the term 'Big Data,'" Olson says. "To use Hadoop, you don't need to have petabytes of data. You don't even need terabytes. When customers hear a word like 'Big Data', they think 'It must be a Google thing.' But it's not."

It's not - no matter what's on file at the US patent office. And we're quite sure that Google would agree. ®
