Yahoo! looks beyond Google's data cruncher

Can you really MapReduce natural language?


Nowadays, when it comes to crunching epic amounts of web data, Google's MapReduce credo is all the rage. The Mountain View method of distributing back-end compute tasks across a sea of commodity machines has given rise to the open source Hadoop platform, which now underpins Yahoo!, Facebook, and even a chunk of Microsoft Bing.

But for Ron Brachman - the former Bell Labs and DARPA man who now serves as vice president of Yahoo! labs and research - a future interwebs may need something very different. MapReduce splinters compute tasks into tiny pieces that are processed independently of each other, and this sort of parallelism by complete separation, he argues, may be ill-suited to a more nuanced breed of web application.

One example is a web that leans heavily on natural language processing. "When we get closer to doing broad-scale language processing that's more, if you will, semantic, we might need to move away from a MapReduce architecture to something that may be equally parallel but with a very different computational architecture," Brachman tells The Reg.

Yahoo! calls itself the leading Hadoop contributor, and the general assumption is that its Yahoo! Search Webmap - which generates the index for its public search engine - is still the world's largest Hadoop application. But two years after the launch of Webmap, the net giant is looking beyond the much-hyped open source platform.

"We continue to explore how to run complex computational jobs on data, and that starts with MapReduce," Brachman says. "But we're looking at other methods of very large scale parallelism. All of this stuff is still emerging - even though some people claim to offer the be-all, end-all 'cloud computing' product already."

Google's MapReduce framework maps data-crunching tasks across distributed machines, splitting them into tiny sub-tasks, before reducing the results into one master calculation. Mountain View published a research paper on the platform in 2004, and this inspired Hadoop, an Apache project founded by former Yahooligan Doug Cutting.
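For readers who have not seen the pattern, the map-then-reduce flow described above can be sketched in a few lines of Python. This is a single-machine toy that counts words, not Google's or Hadoop's actual API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are assumptions made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    # Each document is handled on its own, with no knowledge of the others.
    # Emit one (word, 1) pair per word.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    # Group the emitted values by key so each key can be reduced separately.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Collapse all the values for one key into a single result.
    return key, sum(values)


def word_count(documents):
    # In a real cluster the map calls would run on different machines;
    # here they simply run one after another.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, for illustration only.
counts = word_count(["the web is big", "the index is bigger"])
print(counts)
```

The point to notice is that `map_phase` never looks beyond the one document it is handed, which is what lets the work spread across many machines.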

Though Hadoop is "quite a significant piece" of Yahoo!'s current distributed-computing research, Brachman wouldn't call it the only way to crunch data. "I couldn't tell you whether [it's a significant piece] just because it's there or because it's the most essential way to do data processing."

Certainly, it gets the job done on today's web. After all, it handles back-end processing for three net giants likely juggling more data than any other web outfits on the planet. But that doesn't mean it's future-proof.

"There are cases where running very large scale parallelism on completely separable units of data - where there is no interaction between the units - and then gathering up the results is the natural way to attack a problem," Brachman says. "But clearly, there are also problems where we need to invent new ways of doing large-scale computing that are not MapReduce-oriented."

Brachman points to natural-language processing in part because the so-called semantic web is his particular area of expertise. "If you start trying to do true language understanding - which is beyond our reach right now, especially if you want to do it deeply - you need something else," he says. "If you're trying to understand, say, a single English sentence with multiple clauses, you can't just process a sentence sequentially and know the meaning as you add one word and then another...

"Now imagine growing that to not just sentences, but discourses and dialogues across the entire web - or, more broadly, the entire internet. There will be cases where you can't just process little pieces completely divorced from everything else. You need to pull things together."
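A toy Python sketch can make the contrast concrete. It is emphatically not real language understanding, and nothing here comes from Yahoo!: the `resolve_pronouns` function, the `KNOWN_NAMES` set, and the sample sentences are all hypothetical. It only shows that a task depending on earlier context gives a different answer when each piece is processed in isolation.

```python
# Hypothetical name list, standing in for real entity recognition.
KNOWN_NAMES = {"Ada"}


def resolve_pronouns(sentences, carry_context):
    # Replace "she" with the most recently seen known name.
    # With carry_context=False every sentence is treated as an isolated
    # unit, the way a fully separable job would see it.
    last_name = None
    resolved = []
    for sentence in sentences:
        if not carry_context:
            last_name = None  # forget everything from earlier sentences
        words = []
        for word in sentence.split():
            if word in KNOWN_NAMES:
                last_name = word
            if word.lower() == "she" and last_name is not None:
                words.append(last_name)
            else:
                words.append(word)
        resolved.append(" ".join(words))
    return resolved


sentences = ["Ada wrote the parser", "she tested it later"]

isolated = resolve_pronouns(sentences, carry_context=False)
linked = resolve_pronouns(sentences, carry_context=True)

print(isolated[1])  # pronoun left unresolved
print(linked[1])    # pronoun resolved using the previous sentence
```

Processed separately, the second sentence keeps its unresolved "she"; only when context carries across sentences does it become "Ada tested it later".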

This is not to say that MapReduce is completely incompatible with semantic processing. Hadoop drives Carnegie Mellon's Read the Web project - an effort to create a semantic map of the web that runs on the M45 cluster Yahoo! serves up to various academic institutions - and it underpins Powerset, the semantic search engine that Microsoft has applied to portions of Bing. But Brachman is looking further down the road, to an altogether different level of machine "understanding."

Yes, Yahoo! will continue to explore such far-reaching avenues. Regulators are on the verge of approving the company's mega-pact with Microsoft, which will see Bing handle search duties on Yahoo.com, but even without search, Brachman and company are still in the business of juggling epic amounts of web data.

And as it pushes for advancements in distributed computing, Yahoo! will do so in tandem with the community at large - through continued contributions to Hadoop and other open source projects and through partnerships with academic institutions such as Carnegie Mellon and the University of California at Berkeley. "A rising tide," Brachman says, "lifts all ships."

This too contrasts with the Mountain View credo. Google did release that MapReduce research paper. And like Yahoo!, it offers back-end compute resources to academic researchers. But Google MapReduce is decidedly closed. And as of last month, it's patented. ®
