Beyond MapReduce: Hadoop hangs on

Tooling up

Open ... and Shut

Hadoop is all the rage in enterprise computing, and has become the poster child for the big-data movement. But just as the enterprise consolidates around Hadoop, the web world, including Google – which originated the technology ideas behind Hadoop – is moving on to real-time, ad-hoc analytics that batch-oriented Hadoop can't match.

Is Hadoop already outdated?

As Cloudant chief scientist Mike Miller points out, Google's MapReduce approach to big data analytics may already be passé. It certainly is at Google:

[Google's MapReduce] no longer holds such prominence in the Google stack... Google seems to be moving past it. In fact, many of the technologies [Google now uses, like Percolator for incremental indexing and analysis of frequently changing datasets, and Dremel for ad-hoc analytics] aren’t even new; they date back to the second half of the last decade, mere years after the seminal [MapReduce] paper was in print.

By one estimate, Hadoop, which is an open-source implementation of Google's MapReduce technology, hasn't even caught up to Google's original MapReduce framework. And now people like Miller are arguing that a MapReduce approach to big data is the wrong starting point altogether.

For a slow-moving enterprise, what to do?

The good news is that soon most enterprises likely won't have to bother with Hadoop at all, as Hadoop will be baked into the cloud applications they buy. And as those application vendors figure out better technologies to handle real-time processing (like Storm) or ad-hoc analysis (like Dremel), those, too, will be baked into cloud applications.

As an interim step towards such applications, big-data tool vendors like Datameer and Karmasphere are already releasing cloud-based tools for analyzing Hadoop data. This is critical to Hadoop's short-term success, since Forrester notes that Hadoop is still "an immature technology with many moving parts that are neither robust nor well integrated". Good tooling helps.

But is Hadoop the right place to start, good tooling or no?

Cloudscale chief executive Bill McColl, writing back in 2010, said "definitely not". He argued:

Simple batch processing tools like MapReduce and Hadoop are just not powerful enough in any one of the dimensions of the big data space that really matters. Sure, Hadoop is great for simple batch processing tasks that are “embarrassingly parallel”, but most of the difficult big data tasks confronting companies today are much more complex than that.

McColl isn't a neutral observer of Hadoop: his company competes with vanilla Hadoop deployments. My own company, Nodeable, offers a real-time complement to Hadoop based on the open-source Storm project, and I'm much more sanguine about Hadoop's medium-term prospects than either McColl or Miller. Still, McColl's point is well taken, especially in light of Miller's observation that even MapReduce's originator, Google, has largely moved on to faster, more responsive analytical tools.
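To see why MapReduce-style batch work gets called "embarrassingly parallel", here is a minimal sketch of the programming model in plain Python (not Hadoop's actual Java API): a word count in which every chunk of input is mapped with no knowledge of any other chunk, and a reduce step then merges the partial results. The in-memory chunk list and shuffle below are stand-ins for the input splits and cluster-wide shuffle that Hadoop manages for you.

from collections import Counter
from itertools import chain

# Minimal sketch of the MapReduce programming model in plain Python.
# Hadoop's real API is Java and spreads map/reduce tasks across a cluster;
# here each "mapper" is just a function call over one in-memory chunk.

def map_phase(chunk):
    # Map: emit (word, 1) pairs for one chunk, independently of every other chunk.
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Reduce: sum the counts per word after the "shuffle" groups them together.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals

if __name__ == "__main__":
    # Each string stands in for an input split that a separate mapper would handle.
    chunks = [
        "hadoop is batch oriented",
        "storm is real time",
        "hadoop and storm complement each other",
    ]
    mapped = [map_phase(c) for c in chunks]  # embarrassingly parallel: no chunk depends on another
    counts = reduce_phase(chain.from_iterable(mapped))
    print(counts.most_common(3))

That independence between chunks is exactly why the model scales so easily, and exactly why McColl argues it falls short once the analysis can't be broken into isolated pieces.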

Does it matter?

Probably not. At least, not anytime soon. It has long been the case that web giants like Facebook and Google have moved faster than enterprise IT, which tends to be much more risk-averse and more prone to hanging onto technology once it's made to work. So it's a Very Good Thing, as Businessweek highlights, that the web's technology of today is being open sourced to fuel the enterprise technology of tomorrow.

Hadoop still has several kinks to work out before it can go truly mainstream in the enterprise. It's not as if enterprises are going to go charging ahead into Percolator or other more modern approaches to big data when they have yet to squeeze Hadoop for maximum value. Enterprise IT managers like to travel in packs, and the pack is currently working on Hadoop. There may be better options out there, but they're going to need to find ways to complement Hadoop, not displace it. Hadoop simply has too much momentum going for it.

I suspect we'll see Hadoop continue as the primary engine of big data analytics, and we're looking at many years of dominance for the framework. However, I think we'll also see cloud vendors offer add-on technologies to augment it. Hadoop is never going to be a real-time system, so things like Storm will come to be viewed as must-have tools that provide real-time insight alongside Hadoop's slower but deeper analytics.
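As a rough illustration of that split – and not Storm's actual Java/Clojure topology API – here is a hypothetical Python sketch of the kind of continuous computation a Storm-style system performs: counts are updated inside a sliding time window as each event arrives, rather than recomputed by a periodic batch job over the full dataset.

import time
from collections import Counter, deque

# Hypothetical sketch of a Storm-style continuous computation in plain Python.
# Real Storm wires spouts and bolts into a topology running across a cluster;
# this only illustrates updating results per event instead of per batch run.

class SlidingWindowCounter:
    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, key) pairs currently inside the window

    def add(self, key, now=None):
        # Ingest one event and return up-to-the-moment counts for the window.
        now = time.time() if now is None else now
        self.events.append((now, key))
        while self.events and self.events[0][0] < now - self.window_seconds:
            self.events.popleft()  # evict events that have aged out of the window
        return Counter(k for _, k in self.events)

if __name__ == "__main__":
    counter = SlidingWindowCounter(window_seconds=60)
    for event in ["login", "error", "login", "purchase", "error", "error"]:
        snapshot = counter.add(event)  # fresh counts are available after every event,
    print(snapshot)                    # not only when an overnight batch job finishes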

Some early adopters will figure these tools out on their own, without help from cloud application vendors. But most will buy rather than build, and that "buy" decision will include plenty of Hadoop, whether from Cloudera or Metamarkets or Hortonworks or EMC or anybody else. That's why Forrester pegs today's Hadoop ecosystem at $1bn, a number that is only going to grow, no matter what Google thinks is a better approach to big data. ®

Matt Asay is senior vice president of business development at Nodeable, which offers systems management and analysis for cloud-based data. He was formerly SVP of biz dev at HTML5 start-up Strobe and chief operating officer of Ubuntu commercial operation Canonical. With more than a decade spent in open source, Asay served as Alfresco's general manager for the Americas and vice president of business development, and he helped put Novell on its open source track. Asay is an emeritus board member of the Open Source Initiative (OSI). His column, Open...and Shut, appears three times a week on The Register.
