Yahoo! looks beyond Google's data cruncher

Can you really MapReduce natural language?

Nowadays, when it comes to crunching epic amounts of web data, Google's MapReduce credo is all the rage. The Mountain View method of distributing back-end compute tasks across a sea of commodity machines has given rise to the open source Hadoop platform, which now underpins Yahoo!, Facebook, and even a chunk of Microsoft Bing.

But for Ron Brachman - the former Bell Labs and DARPA man who now serves as vice president of Yahoo! labs and research - a future interwebs may need something very different. MapReduce splinters compute tasks into tiny pieces that are processed independently of each other, and this sort of parallelism by complete separation, he argues, may be ill-suited to a more nuanced breed of web application.

One example is a web that leans heavily on natural language processing. "When we get closer to doing broad-scale language processing that's more, if you will, semantic, we might need to move away from a MapReduce architecture to something that may be equally parallel but with a very different computational architecture," Brachman tells The Reg.

Yahoo! calls itself the leading Hadoop contributor, and the general assumption is that its Yahoo! Search Webmap - which generates the index for its public search engine - is still the world's largest Hadoop application. But two years after the launch of Webmap, the net giant is looking beyond the much-hyped open source platform.

"We continue to explore how to run complex computational jobs on data, and that starts with MapReduce," Brachman says. "But we're looking at other methods of very large scale parallelism. All of this stuff is still emerging - even though some people claim to offer the be-all, end-all 'cloud computing' product already."

Google's MapReduce framework maps data-crunching tasks across distributed machines, splitting them into tiny sub-tasks, before reducing the results into one master calculation. Mountain View published a research paper on the platform in 2004, and this inspired Hadoop, an Apache project founded by former Yahooligan Doug Cutting.
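For readers who want to see the shape of the idea, here is a minimal single-machine sketch of the map, group, and reduce steps described above, using word counting as the task. It is an illustration only, not Google's or Hadoop's code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample chunks are assumptions made up for this example, and a real cluster would run each map call on a separate machine.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk, with no knowledge of other chunks."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    """Run the three steps in sequence on one machine (hypothetical driver)."""
    pairs = []
    for chunk in chunks:  # on a cluster, each chunk would go to a different node
        pairs.extend(map_phase(chunk))
    return reduce_phase(shuffle(pairs))


# Hypothetical sample input: two chunks that never need to see each other.
chunks = ["the web is big", "the index is bigger"]
counts = word_count(chunks)
print(counts)
```

The point to notice is that `map_phase` never looks outside its own chunk. That independence is what lets the work spread across commodity machines, and it is the same property Brachman questions for deeper language processing.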

Though Hadoop is "quite a significant piece" of Yahoo!'s current distributed-computing research, Brachman wouldn't call it the only way to crunch data. "I couldn't tell you whether [it's a significant piece] just because it's there or because it's the most essential way to do data processing."

Certainly, it gets the job done on today's web. After all, it handles back-end processing for three net giants likely juggling more data than any other web outfits on the planet. But that doesn't mean it's future-proof.

"There are cases where running very large scale parallelism on completely separable units of data - where there is no interaction between the units - and then gathering up the results is the natural way to attack a problem," Brachman says. "But clearly, there are also problems where we need to invent new ways of doing large-scale computing that are not MapReduce-oriented."

Brachman points to natural-language processing in part because the so-called semantic web is his particular area of expertise. "If you start trying to do true language understanding - which is beyond our reach right now, especially if you want to do it deeply - you need something else," he says. "If you're trying to understand, say, a single English sentence with multiple clauses, you can't just process a sentence sequentially and know the meaning as you add one word and then another...

"Now imagine growing that to not just sentences, but discourses and dialogues across the entire web - or, more broadly, the entire internet. There will be cases where you can't just process little pieces completely divorced from everything else. You need to pull things together."

This is not to say that MapReduce is completely incompatible with semantic processing. Hadoop drives Carnegie Mellon's Read the Web project - an effort to create a semantic map of the web that runs on the M45 cluster Yahoo! serves up to various academic institutions - and it underpins Powerset, the semantic search engine that Microsoft has applied to portions of Bing. But Brachman is looking further down the road, to an altogether different level of machine "understanding."
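The dependency Brachman describes, where one piece of text cannot be interpreted without another, can be shown with a deliberately crude sketch. The resolver below is a toy heuristic invented for this illustration (it swaps "it" for the most recent capitalized word), not a real language-understanding technique and not anything Yahoo! has described. The sample sentences and the name `resolve_pronouns` are likewise assumptions.

```python
def resolve_pronouns(sentences):
    """Toy heuristic: replace 'it' with the most recent capitalized word seen."""
    last_name = None
    resolved = []
    for sentence in sentences:
        words = []
        for word in sentence.split():
            if word == "it" and last_name is not None:
                words.append(last_name)
            else:
                if word[0].isupper():
                    last_name = word  # remember a candidate antecedent
                words.append(word)
        resolved.append(" ".join(words))
    return resolved


# Hypothetical two-sentence discourse.
discourse = ["Hadoop runs on cheap machines", "it scales well"]

# Processed together, the second sentence can borrow context from the first.
together = resolve_pronouns(discourse)

# Processed as isolated pieces, MapReduce-style, that context is gone.
separately = [resolve_pronouns([sentence])[0] for sentence in discourse]

print(together)
print(separately)
```

Processed together, the second sentence comes out as "Hadoop scales well". Processed in isolation it stays "it scales well", because the piece holding the answer was sent somewhere else.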

Yes, Yahoo! will continue to explore such far-reaching avenues. Regulators are on the verge of approving the company's mega-pact with Microsoft, which will see Bing handle search duties on Yahoo.com, but even without search, Brachman and company are still in the business of juggling epic amounts of web data.

And as it pushes for advancements in distributed computing, Yahoo! will do so in tandem with the community at large - through continued contributions to Hadoop and other open source projects and through partnerships with academic institutions such as Carnegie Mellon and the University of California at Berkeley. "A rising tide," Brachman says, "lifts all ships."

This too contrasts with the Mountain View credo. Google did release that MapReduce research paper. And like Yahoo!, it offers back-end compute resources to academic researchers. But Google MapReduce is decidedly closed. And as of last month, it's patented. ®
