Yahoo! tech boss gazes beyond Hadoop

MapReduce goes only so fast

Hadoop underpins everything from Facebook to eBay to Yahoo!, but it's not designed for online applications.

Based on Google MapReduce – the distributed software platform that was long used to build the search giant's web index – Hadoop was designed for off-line data-crunching. It's used to feed online services – to, say, build a web index that can then be accessed by a search engine – but it's not suited to driving the sort of "real-time" interactions that happen on the web.

"With the paths that go through Hadoop [at Yahoo!], the latency is about fifteen minutes," Yahoo! chief technology officer Raymie Stata told The Register at the company's annual Hadoop Summit on Wednesday in Santa Clara, California. "In some cases, that's plenty 'real-time'. And we have been trying to get that down to less than fifteen minutes. But it will never be true real-time. It will never be what we call 'next click', where I click and by the time the page loads, the semantic implication of my decision is reflected in the page."

HBase – the open source distributed database that works in tandem with the Hadoop Distributed File System (HDFS) – is suited to some real-time web applications. But Stata questions whether HBase will ultimately reinvent the real-time web in the way Hadoop has overhauled batch processing. Yahoo! actually uses HBase to drive a service that customizes its homepage for each visitor. But the company is exploring other real-time platforms, including two Yahoo!-driven open source projects: MapReduce Online and S4.

"We have a number of internal solutions to the 'next-click' problem," Stata says. "And there's active dialogue on how to consolidate this, but we don't quite know enough yet to say 'this is the answer'."

A joint project between Yahoo! Research and the University of California, Berkeley, MapReduce Online is essentially an effort to modify the Hadoop framework for use with real-time tasks. "MapReduce is a sequential process. You map all the data. You reduce all the data. You can get the batch size smaller and smaller. But when you get to a batch size of one, it doesn't make sense," Stata says.

"MapReduce Online moves to a windowing model, like windowed SQL, where you map a 'window' of data...You take a window of stuff that comes in during a given time period, and map that and reduce that. The data is always streaming through."
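
The windowing idea Stata describes can be sketched in a few lines of Python. This is only an illustration, not MapReduce Online's code or API: the function names, the (timestamp, text) event format and the fixed-size time windows are all assumptions made for the example. Each event is mapped as it arrives, and the reduce step runs once per window rather than once over the whole data set.

```python
from collections import Counter, defaultdict


def window_start(timestamp, window_seconds):
    """Return the start time of the fixed-size window holding this timestamp."""
    return int(timestamp // window_seconds) * window_seconds


def map_event(event):
    """Map step (hypothetical): emit a (word, 1) pair for each word in the event."""
    _, text = event
    for word in text.lower().split():
        yield word, 1


def reduce_pairs(pairs):
    """Reduce step (hypothetical): sum the counts for each key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def windowed_mapreduce(events, window_seconds):
    """Map each event, group the output by time window, then reduce per window."""
    windows = defaultdict(list)
    for event in events:
        timestamp, _ = event
        windows[window_start(timestamp, window_seconds)].extend(map_event(event))
    return {start: reduce_pairs(pairs) for start, pairs in sorted(windows.items())}


if __name__ == "__main__":
    stream = [(0, "click search"), (5, "click"), (12, "search search")]
    for start, counts in windowed_mapreduce(stream, 10).items():
        print(start, counts)
```

A real streaming system would emit each window's result as soon as the window closes rather than collecting the whole stream first; the sketch keeps everything in memory purely to show how the map and reduce steps are scoped to a window.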

It's unclear whether the platform is actually used on Yahoo!'s live infrastructure. But Stata says that, at least in theory, it can run the same code as runs on Hadoop MapReduce. The problem, he adds, is that the platform is still burdened by the inherent limitations of the MapReduce framework. In some ways, it's still a batch platform.

S4 is similar to MapReduce Online – except it parts ways with the Hadoop framework. "It's a streaming analytics library," says Todd Papaioannou, vice president of cloud architecture at Yahoo!. "You have data coming through along a pipe and you're doing analytics on a window of data as it goes through." S4 now drives the delivery of Yahoo!'s sponsored search results.

Off HBase

Though he's still unsure about the longterm prospects of MapReduce Online and S4, Stata seems particularly cold on the real-time future of HBase, which is based on Google's BigTable. "HBase can be used in a real-time infrastructure. It's not [necessarily suited] to real-time. Where do you host the application code itself? What do you do about fault tolerance?" he says. "For applications where you truly need to be fault tolerant, HBase is not a great solution. It's not transactional. It doesn't have a lot of features that help write robust code in a simple fashion. It puts a lot of burden on the app developer."

The same limitations apply to BigTable, but Google is looking to push through them. The company no longer builds its web index with MapReduce. It uses a new proprietary platform called Percolator that runs atop BigTable, letting the company continually update the index without having to reprocess the entire thing from scratch, and there's a compute framework that lets engineers execute code atop the platform.

"MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency," reads Google's paper describing Percolator. "By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day, while reducing the average age of documents in Google search results by 50 per cent."
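
The difference between rebuilding an index in batch and updating it incrementally can be shown with a toy inverted index. This is a minimal sketch under stated assumptions, not Percolator's design or API: the function names and the in-memory dictionaries are hypothetical, and nothing here reflects how Google stores or coordinates updates. The point is only that a changed document touches the handful of index entries it affects, rather than forcing a pass over every document.

```python
from collections import defaultdict


def build_index(docs):
    """Batch approach: rebuild the whole word -> document-ids index from scratch."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            index[word].add(doc_id)
    return index


def update_document(index, docs, doc_id, new_text):
    """Incremental approach: touch only the entries affected by one changed document."""
    old_words = set(docs.get(doc_id, "").lower().split())
    new_words = set(new_text.lower().split())
    for word in old_words - new_words:
        index[word].discard(doc_id)
        if not index[word]:
            del index[word]  # drop entries that no longer point at any document
    for word in new_words - old_words:
        index[word].add(doc_id)
    docs[doc_id] = new_text


if __name__ == "__main__":
    docs = {"a": "hadoop batch", "b": "hbase realtime"}
    index = build_index(docs)
    update_document(index, docs, "a", "hadoop streaming")
    # The incrementally updated index matches a full rebuild.
    print(dict(index) == dict(build_index(docs)))
```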

Stata declined to discuss Google's paper. But his take on HBase indicates this is another area where Yahoo! will not follow in Google's footsteps. With Hadoop – a project Yahoo! bootstrapped after hiring project founder Doug Cutting – the company mimicked the Google way, but only in part. Yahoo! chose to open source the platform. Just this week, it renewed its commitment to the open-source project with the creation of Hortonworks, a Hadoop startup built around the company's 25 core Hadoop developers. And both MapReduce Online and S4 are open source. Google's platforms are decidedly closed.

Stata believes that web giants like Yahoo! will better serve themselves by exchanging their infrastructure ideas with the open source world, whether those ideas involve distributed batch processing or real-time calculations. "Yahoo! is often a few years ahead, just because of our scale and the complexities we have to deal with, but that's not where we want to gain longterm proprietary value," Stata says. "In fact, if what we do becomes what everyone else does, that actually benefits us." How very un-Google. ®
