Yahoo! looks beyond Google's data cruncher

Can you really MapReduce natural language?

Nowadays, when it comes to crunching epic amounts of web data, Google's MapReduce credo is all the rage. The Mountain View method of distributing back-end compute tasks across a sea of commodity machines has given rise to the open source Hadoop platform, which now underpins Yahoo!, Facebook, and even a chunk of Microsoft Bing.

But for Ron Brachman - the former Bell Labs and DARPA man who now serves as vice president of Yahoo! labs and research - a future interwebs may need something very different. MapReduce splinters compute tasks into tiny pieces that are processed independently of each other, and this sort of parallelism by complete separation, he argues, may be ill-suited to a more nuanced breed of web application.

One example is a web that leans heavily on natural language processing. "When we get closer to doing broad-scale language processing that's more, if you will, semantic, we might need to move away from a MapReduce architecture to something that may be equally parallel but with a very different computational architecture," Brachman tells The Reg.

Yahoo! calls itself the leading Hadoop contributor, and the general assumption is that its Yahoo! Search Webmap - which generates the index for its public search engine - is still the world's largest Hadoop application. But two years after the launch of Webmap, the net giant is looking beyond the much-hyped open source platform.

"We continue to explore how to run complex computational jobs on data, and that starts with MapReduce," Brachman says. "But we're looking at other methods of very large scale parallelism. All of this stuff is still emerging - even though some people claim to offer the be-all, end-all 'cloud computing' product already."

Google's MapReduce framework maps data-crunching tasks across distributed machines, splitting them into tiny sub-tasks, before reducing the results into one master calculation. Mountain View published a research paper on the platform in 2004, and this inspired Hadoop, an Apache project founded by former Yahooligan Doug Cutting.
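For readers who haven't met the pattern, the map-then-reduce flow described above can be sketched in a few lines of single-machine Python. This is only an illustrative toy, not Google's or Hadoop's actual API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real framework would spread the map and reduce calls across many machines rather than run them in one process.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document.

    Each document is handled with no knowledge of any other document,
    which is what lets a real framework farm the work out to many machines.
    """
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["the web is big", "the web is parallel"]
    print(word_count(docs))
```

The point to notice is that `map_phase` never looks beyond the one document it was handed. That independence is the source of the parallelism, and it is the property Brachman suggests may not hold for deeper language work.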

Though Hadoop is "quite a significant piece" of Yahoo!'s current distributed-computing research, Brachman wouldn't call it the only way to crunch data. "I couldn't tell you whether [it's a significant piece] just because it's there or because it's the most essential way to do data processing."

Certainly, it gets the job done on today's web. After all, it handles back-end processing for three net giants likely juggling more data than any other web outfits on the planet. But that doesn't mean it's future-proof.

"There are cases where running very large scale parallelism on completely separable units of data - where there is no interaction between the units - and then gathering up the results is the natural way to attack a problem," Brachman says. "But clearly, there are also problems where we need to invent new ways of doing large-scale computing that are not MapReduce-oriented."

Brachman points to natural-language processing in part because the so-called semantic web is his particular area of expertise. "If you start trying to do true language understanding - which is beyond our reach right now, especially if you want to do it deeply - you need something else," he says. "If you're trying to understand, say, a single English sentence with multiple clauses, you can't just process a sentence sequentially and know the meaning as you add one word and then another...

"Now imagine growing that to not just sentences, but discourses and dialogues across the entire web - or, more broadly, the entire internet. There will be cases where you can't just process little pieces completely divorced from everything else. You need to pull things together."
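A deliberately crude Python toy can make the objection concrete. It is not Brachman's example and it is nothing like real language understanding: the entity list, the `resolve` function, and the pronoun rule are all hypothetical stand-ins, chosen only to show what is lost when each sentence is processed as an isolated unit.

```python
# Hypothetical entity list, for illustration only.
KNOWN_ENTITIES = {"Hadoop", "Webmap"}


def resolve(sentences, share_context):
    """Replace the pronoun "it" with the most recently seen known entity.

    With share_context=False each sentence is treated as an isolated unit,
    the way independently processed pieces of data would be. With
    share_context=True, the last entity seen carries over between sentences.
    """
    last_entity = None
    resolved = []
    for sentence in sentences:
        if not share_context:
            last_entity = None  # forget everything from earlier sentences
        words = []
        for word in sentence.split():
            if word in KNOWN_ENTITIES:
                last_entity = word
                words.append(word)
            elif word.lower() == "it" and last_entity is not None:
                words.append(last_entity)
            else:
                words.append(word)
        resolved.append(" ".join(words))
    return resolved


if __name__ == "__main__":
    text = ["Hadoop builds the index", "It scales well"]
    print(resolve(text, share_context=False))
    print(resolve(text, share_context=True))
```

Processed in isolation, the second sentence keeps its unresolved "It". Only when state flows from one sentence to the next can the toy work out that "It" means Hadoop, and that cross-unit dependency is the kind of thing the fully separable model rules out.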

This is not to say that MapReduce is completely incompatible with semantic processing. Hadoop drives Carnegie Mellon's Read the Web project - an effort to create a semantic map of the web that runs on the M45 cluster Yahoo! serves up to various academic institutions - and it underpins Powerset, the semantic search engine that Microsoft has applied to portions of Bing. But Brachman is looking further down the road, to an altogether different level of machine "understanding."

Yes, Yahoo! will continue to explore such far-reaching avenues. Regulators are on the verge of approving the company's mega-pact with Microsoft, which will see Bing handle search duties on Yahoo.com, but even without search, Brachman and company are still in the business of juggling epic amounts of web data.

And as it pushes for advancements in distributed computing, Yahoo! will do so in tandem with the community at large - through continued contributions to Hadoop and other open source projects and through partnerships with academic institutions such as Carnegie Mellon and the University of California at Berkeley. "A rising tide," Brachman says, "lifts all ships."

This too contrasts with the Mountain View credo. Google did release that MapReduce research paper. And like Yahoo!, it offers back-end compute resources to academic researchers. But Google MapReduce is decidedly closed. And as of last month, it's patented. ®
