
Google's MapReduce patent - no threat to stuffed elephants

Hadoop will keep its head


In mid-January, Google won a patent for MapReduce, the distributed data crunching platform that underpins its globe-spanning online infrastructure. And that means there's at least a question mark hanging over Hadoop, the much-hyped open source platform that helps drive Yahoo!, Facebook, Microsoft's Bing, and an ever-expanding array of other web services and back-end business applications.

Hadoop is based in part on a MapReduce research paper Google published in 2004, about six months after it applied for the patent.

The Mountain View Chocolate Factory doesn't officially comment on specific patents in its portfolio. "Like other responsible, innovative companies, Google files patent applications on a variety of technologies it develops," the company recently told GigaOM, in response to questions about its MapReduce patent. "We feel that our behavior to date has been in line with our corporate values and priorities."

But the general assumption is that Google wouldn't use its patent against Hadoop or any other software that takes a lead from MapReduce, including databaseware from the likes of Aster Data Systems or Teradata. This is certainly the view of Cloudera, the all-star Silicon Valley startup that recently commercialized Hadoop in Red Hat-like fashion.

"I don't speak for Google. But Google has lots of patents, and it basically has no track record of using those patents offensively, either involving licensing or pursuing people for infringement," Cloudera chief executive Mike Olson tells The Reg, before pointing out that Google is a member of the Open Invention Network, a patent pool that grants use licenses for patented technology in an effort to promote Linux.

"All of this convinces us that this is a strategic move from Google and not something that is aimed at the head of any Hadoop adopter or satellite company - Cloudera included."

Olson adds that Cloudera has "excellent ties" back to the Mountain View search giant and that he and his backers were well aware of Google's patent before Cloudera was founded. "We - and our investors - talked about it in detail and at length, and without a qualm, we went ahead and founded the company."

The salient Google link is Cloudera vice president Christophe Bisciglia - the former Google engineer who Mountain View famously dispatched to the University of Washington to teach a course on what it likes to call Big Data, i.e. net-scale distributed computing. Bisciglia's curriculum actually made use of Hadoop, and he stresses that the open source platform has become an important teaching tool for Google.

"In the past, it took three to six months to get hires up to speed with how to work with [Google] technology," Bisciglia has told The Reg. "But if schools are teaching this as part of the standard undergraduate curriculum, Google saved that three to six months - multiplied by thousands of engineers."

Google hired about half the students who took Bisciglia's first class.

But even if Google did change tack, if it suddenly went on the offensive with that MapReduce patent, you wonder how successful it would be. As Yahoo! vice president of labs and research Ron Brachman points out, the basic concepts behind MapReduce are far from revolutionary. "To my mind, having grown up as a computer scientist in the 70s and taking courses on what was then thought of as parallel processing, there were techniques around that felt very similar to [MapReduce's] type of parallelism," Brachman tells The Reg.

The patent - which you can see here - describes a "system and method for efficient large-scale data processing," and this involves "map" and "reduce" functions that have indeed been a part of parallel programming since Brachman's school days.

In essence, Google's platform "maps" data-crunching tasks across a collection of distributed machines, splitting them into tiny sub-tasks, before "reducing" the results into one master calculation. As the patent abstract puts it, one or more map modules read input data, apply an operation to "produce intermediate data values," and distribute these values "across multiple processors in the parallel processing environment." One or more reduce modules then retrieve the intermediate data and apply a new operation to provide the ultimate output.
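For readers who want to see the shape of the thing, here is a minimal single-machine sketch of the map-then-reduce pattern in Python. It is not Google's implementation or Hadoop's API. The word-count task, the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`), and the use of a thread pool as a stand-in for "a collection of distributed machines" are all assumptions made purely for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(chunk):
    """Map step: emit an intermediate (word, 1) pair for each word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group intermediate values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: merge all intermediate values for one key into a single result."""
    return key, sum(values)


def word_count(chunks, workers=4):
    """Run the map and reduce steps over a thread pool (hypothetical stand-in for a cluster)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is mapped independently, so the work can be spread across workers.
        mapped = list(pool.map(map_words, chunks))
        grouped = shuffle(mapped)
        # Each key is reduced independently as well.
        reduced = pool.map(lambda item: reduce_counts(*item), grouped.items())
        return dict(reduced)


if __name__ == "__main__":
    sample = ["the elephant is yellow", "the patent is filed", "the elephant stays"]
    print(word_count(sample))
```

The point of the split is that neither step needs to know about the others' inputs: any chunk can be mapped on any worker, and any key can be reduced on any worker, which is what lets a real system spread the job across many machines.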

In any event, Hadoop mirrors this general setup, as Google described it in a research paper published in December 2004. The platform was originally developed by Nutch founder Doug Cutting, who needed a distributed data crunching platform for his open source web crawler, and after he open sourced it at Apache, the platform - named for his son's yellow stuffed elephant - soon spread to some of the web's biggest names.

Yahoo! uses it to generate, among other things, the Yahoo! Search Webmap, which provides the index for its search engine. And it underpins Powerset, the so-called semantic search engine that was purchased by Microsoft and now drives portions of Bing.

Meanwhile, Cloudera is helping to deploy the platform on clusters used by countless other companies, including Rackspace, Netflix, LinkedIn, Samsung, and eHarmony. Rackspace, for one, is using a Hadoop cluster to crunch log data from its hosting infrastructure and serve up reports to support reps. The platform can be applied to almost any breed of Big Data - and not so big data.

"We really don't like the term 'Big Data,'" Olson says. "To use Hadoop, you don't need to have petabytes of data. You don't even need terabytes. When customers hear a word like 'Big Data', they think 'It must be a Google thing.' But it's not."

It's not - no matter what's on file at the US patent office. And we're quite sure that Google would agree. ®
