Beyond MapReduce: Hadoop hangs on
Open ... and Shut Hadoop is all the rage in enterprise computing, and has become the poster child for the big-data movement. But just as the enterprise consolidates around Hadoop, the web world, including Google – which originated the technology ideas behind Hadoop – is moving on to real-time, ad-hoc analytics that batch-oriented Hadoop can't match.
Is Hadoop already outdated?
As Cloudant chief scientist Mike Miller points out, Google's MapReduce approach to big data analytics may already be passé. It certainly is at Google:
[Google's MapReduce] no longer holds such prominence in the Google stack... Google seems to be moving past it. In fact, many of the technologies [Google now uses like Percolator for incremental indexing and analysis of frequently changing datasets and Dremel for ad-hoc analytics] aren’t even new; they date back the second half of the last decade, mere years after the seminal [MapReduce] paper was in print.
By one estimate, Hadoop, which is an open-source implementation of Google's MapReduce technology, hasn't even caught up to Google's original MapReduce framework. And now people like Miller are arguing that a MapReduce approach to Big Data is the wrong starting point altogether.
For a slow-moving enterprise, what to do?
The good news is that soon most enterprises likely won't have to bother with Hadoop at all, as Hadoop will be baked into the cloud applications that enterprises buy. And as those vendors figure out better technologies to handle real-time (like Storm) or ad hoc analysis (like Dremel), they, too, will be baked into cloud applications.
As an interim step to such applications, big-data tools vendors like Datameer and Karmasphere are already releasing cloud-based tools for analyzing Hadoop data. This is critical to Hadoop's short-term success as Forrester notes that Hadoop is still "an immature technology with many moving parts that are neither robust nor well integrated." Good tooling helps.
But is Hadoop the right place to start, good tooling or no?
Cloudscale chief executive Bill McColl, writing back in 2010, says "definitely not." He argues:
Simple batch processing tools like MapReduce and Hadoop are just not powerful enough in any one of the dimensions of the big data space that really matters. Sure, Hadoop is great for simple batch processing tasks that are “embarrassingly parallel”, but most of the difficult big data tasks confronting companies today are much more complex than that.
McColl isn't a neutral observer of Hadoop: his company competes with vanilla Hadoop deployments. My own company, Nodeable, offers a real-time complement to Hadoop, based on the open-source Storm project, but I'm much more sanguine about Hadoop's medium-term prospects than either McColl or Miller. But his point is well-taken, especially in light of Miller's observation that even the originator of MapReduce, Google, has largely moved on for faster, more responsive analytical tools.
Does it matter?
Probably not. At least, not anytime soon. It has long been the case that web giants like Facebook and Google have moved faster than enterprise IT, which tends to be much more risk-averse and more prone to hanging onto technology once it's made to work. So it's a Very Good Thing, as Businessweek highlights, that the web's technology of today is being open sourced to fuel the enterprise technology of tomorrow.
Hadoop still has several kinks to work out before it can go truly mainstream in the enterprise. It's not as if enterprises are going to go charging ahead into Percolator or other more modern approaches to big data when they have yet to squeeze Hadoop for maximum value. Enterprise IT managers like to travel in packs, and the pack is currently working on Hadoop. There may be better options out there, but they're going to need to find ways to complement Hadoop, not displace it. Hadoop simply has too much momentum going for it.
I suspect we'll see Hadoop continue forward as the primary engine of big data analytics. We're looking at many years of dominance for Hadoop. However, I think we'll also see add-on technologies offered by cloud vendors to augment the framework. Hadoop is never going to be a real-time system, so things like Storm will come to be viewed as must-have tools to provide real-time insight alongside Hadoop's timely, deep analytics.
Some early adopters will figure these tools out on their own without help from cloud application vendors. But for most, they're going to buy, not build, and that "buy" decision will include plenty of Hadoop, whether from Cloudera or Metamarkets or Hortonworks or EMC or anybody else. That's why Forrester pegs today's Hadoop ecosystem at $1bn, a number that is only going to grow, no matter what Google thinks is a better approach to big data. ®
Matt Asay is senior vice president of business development at Nodeable, offering systems management for managing and analysing cloud-based data. He was formerly SVP of biz dev at HTML5 start-up Strobe and chief operating officer of Ubuntu commercial operation Canonical. With more than a decade spent in open source, Asay served as Alfresco's general manager for the Americas and vice president of business development, and he helped put Novell on its open source track. Asay is an emeritus board member of the Open Source Initiative (OSI). His column, Open...and Shut, appears three times a week on The Register.
Sponsored: Becoming a Pragmatic Security Leader