Yahoo! looks beyond Google's data cruncher
Can you really MapReduce natural language?
Nowadays, when it comes to crunching epic amounts of web data, Google's MapReduce credo is all the rage. The Mountain View method of distributing back-end compute tasks across a sea of commodity machines has given rise to the open source Hadoop platform, which now underpins Yahoo!, Facebook, and even a chunk of Microsoft Bing.
But for Ron Brachman - the former Bell Labs and DARPA man who now serves as vice president of Yahoo! labs and research - a future interwebs may need something very different. MapReduce splinters compute tasks into tiny pieces that are processed independently of each other, and this sort of parallelism by complete separation, he argues, may be ill-suited to a more nuanced breed of web application.
One example is a web that leans heavily on natural language processing. "When we get closer to doing broad-scale language processing that's more, if you will, semantic, we might need to move away from a MapReduce architecture to something that may be equally parallel but with a very different computational architecture," Brachman tells The Reg.
Yahoo! calls itself the leading Hadoop contributor, and the general assumption is that its Yahoo! Search Webmap - which generates the index for its public search engine - is still the world's largest Hadoop application. But two years after the launch of Webmap, the net giant is looking beyond the much-hyped open source platform.
"We continue to explore how to run complex computational jobs on data, and that starts with MapReduce," Brachman says. "But we're looking at other methods of very large scale parallelism. All of this stuff is still emerging - even though some people claim to offer the be-all, end-all 'cloud computing' product already."
Google's MapReduce framework maps data-crunching tasks across distributed machines, splitting them into tiny sub-tasks, before reducing the results into one master calculation. Mountain View published a research paper on the platform in 2004, and this inspired Hadoop, an Apache project founded by former Yahooligan Doug Cutting.
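To make that map-then-reduce flow concrete, here is a minimal single-process sketch in Python. It is not Google's or Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) and the word-count task are assumptions chosen purely for illustration, and a real cluster would run the map calls on separate machines.

```python
from collections import defaultdict

def map_phase(document):
    """Emit (word, 1) pairs for one document, with no knowledge of any other document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group the emitted values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)

def map_reduce(documents):
    # Each map_phase call is independent, which is what lets a real
    # framework spread them across many machines (hypothetical here).
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = map_reduce(["the web is big", "the index is bigger"])
print(counts)
```

Note that no map call ever looks at another document's words. That independence is the "complete separation" Brachman describes.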
Though Hadoop is "quite a significant piece" of Yahoo!'s current distributed-computing research, Brachman wouldn't call it the only way to crunch data. "I couldn't tell you whether [it's a significant piece] just because it's there or because it's the most essential way to do data processing."
Certainly, it gets the job done on today's web. After all, it handles back-end processing for three net giants likely juggling more data than any other web outfits on the planet. But that doesn't mean it's future-proof.
"There are cases where running very large scale parallelism on completely separable units of data - where there is no interaction between the units - and then gathering up the results is the natural way to attack a problem," Brachman says. "But clearly, there are also problems where we need to invent new ways of doing large-scale computing that are not MapReduce-oriented."
Brachman points to natural-language processing in part because the so-called semantic web is his particular area of expertise. "If you start trying to do true language understanding - which is beyond our reach right now, especially if you want to do it deeply - you need something else," he says. "If you're trying to understand, say, a single English sentence with multiple clauses, you can't just process a sentence sequentially and know the meaning as you add one word and then another...
"Now imagine growing that to not just sentences, but discourses and dialogues across the entire web - or, more broadly, the entire internet. There will be cases where you can't just process little pieces completely divorced from everything else. You need to pull things together."
This is not to say that MapReduce is completely incompatible with semantic processing. Hadoop drives Carnegie Mellon's Read the Web project - an effort to create a semantic map of the web that runs on the M45 cluster Yahoo! serves up to various academic institutions - and it underpins Powerset, the semantic search engine that Microsoft has applied to portions of Bing. But Brachman is looking further down the road, to an altogether different level of machine "understanding."
Yes, Yahoo! will continue to explore such far-reaching avenues. Regulators are on the verge of approving the company's mega-pact with Microsoft, which will see Bing handle search duties on Yahoo.com, but even without search, Brachman and company are still in the business of juggling epic amounts of web data.
And as it pushes for advancements in distributed computing, Yahoo! will do so in tandem with the community at large - through continued contributions to Hadoop and other open source projects and through partnerships with academic institutions such as Carnegie Mellon and the University of California at Berkeley. "A rising tide," Brachman says, "lifts all ships."
This too contrasts with the Mountain View credo. Google did release that MapReduce research paper. And like Yahoo!, it offers back-end compute resources to academic researchers. But Google MapReduce is decidedly closed. And as of last month, it's patented. ®
Back to the Future...
The idea that MapReduce might not be the be-all and end-all of parallel computing shouldn't be surprising. Consider the more "traditional" field of large-scale scientific computing (i.e. solving PDEs). MapReduce is absolutely NOT the fundamental way to get parallelism because the different parallel parts need to communicate on a fine-grained scale.
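A toy illustration of that point, offered as a sketch rather than a real solver: a 1D Jacobi relaxation (a simple stand-in for a PDE step), where every point is updated from its neighbours. All names here (`jacobi_step`, `solve_whole`, `solve_isolated_chunks`) and the grid values are assumptions made up for the example.

```python
def jacobi_step(u):
    """One sweep: each interior point becomes the mean of its two neighbours."""
    interior = [(u[i - 1] + u[i + 1]) / 2 for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]

def solve_whole(u, steps):
    """Relax the full grid, so information can flow across every point."""
    for _ in range(steps):
        u = jacobi_step(u)
    return u

def solve_isolated_chunks(u, steps):
    """Split once, relax each half with no communication, then concatenate.

    This mimics 'completely separable units' and is deliberately wrong for
    this problem: the halves never exchange their edge values.
    """
    mid = len(u) // 2
    return solve_whole(u[:mid], steps) + solve_whole(u[mid:], steps)

grid = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # hot left boundary, cold elsewhere
whole = solve_whole(grid, 10)
split = solve_isolated_chunks(grid, 10)
print(whole[4], split[4])
```

In the isolated version the right half never hears about the hot boundary, so it stays at zero, while the full solve lets the value spread. A correct parallel version would have to swap edge values between chunks on every sweep, which is exactly the fine-grained communication MapReduce does not provide.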
The surprise to me was that something as simple as MapReduce was able to solve AS MANY problems as it has.
Re: Back to the Future...
That shouldn't really be surprising. As they say, when all you have is a hammer, everything looks like a nail.
Ever since MapReduce was described in that seminal white paper, it has been all the rage, and a lot of people did see it as the be-all and end-all of distributed processing; and I posit that *because* of that, a lot of people have replaced their entire toolbox with this single hammer.
So then, it's not that MapReduce is able to solve so many problems, it's that it's been applied to so many problems, regardless of whether it suits them.
I'm not sure?
So does this mean Yahoo! considers my web connection a distributed resource? Maybe I should send them a bill for using my leccy and PC? Hey - what about the cost of lighting, heating, beer, CPU cooling, coffees, biscuits? OMG - spreadsheet required...