Beyond MapReduce: Hadoop hangs on

Tooling up

Open ... and Shut

Hadoop is all the rage in enterprise computing, and has become the poster child for the big-data movement. But just as the enterprise consolidates around Hadoop, the web world, including Google – which originated the technology ideas behind Hadoop – is moving on to real-time, ad-hoc analytics that batch-oriented Hadoop can't match.

Is Hadoop already outdated?

As Cloudant chief scientist Mike Miller points out, Google's MapReduce approach to big data analytics may already be passé. It certainly is at Google:

[Google's MapReduce] no longer holds such prominence in the Google stack... Google seems to be moving past it. In fact, many of the technologies [Google now uses like Percolator for incremental indexing and analysis of frequently changing datasets and Dremel for ad-hoc analytics] aren’t even new; they date back [to] the second half of the last decade, mere years after the seminal [MapReduce] paper was in print.

By one estimate, Hadoop, which is an open-source implementation of Google's MapReduce technology, hasn't even caught up to Google's original MapReduce framework. And now people like Miller are arguing that a MapReduce approach to Big Data is the wrong starting point altogether.

For a slow-moving enterprise, what to do?

The good news is that soon most enterprises likely won't have to bother with Hadoop at all, as Hadoop will be baked into the cloud applications that enterprises buy. And as the vendors of those applications figure out better technologies to handle real-time analysis (like Storm) or ad hoc analysis (like Dremel), those technologies, too, will be baked into cloud applications.

As an interim step to such applications, big-data tools vendors like Datameer and Karmasphere are already releasing cloud-based tools for analyzing Hadoop data. This is critical to Hadoop's short-term success, as Forrester notes that Hadoop is still "an immature technology with many moving parts that are neither robust nor well integrated." Good tooling helps.

But is Hadoop the right place to start, good tooling or no?

Cloudscale chief executive Bill McColl, writing back in 2010, says "definitely not." He argues:

Simple batch processing tools like MapReduce and Hadoop are just not powerful enough in any one of the dimensions of the big data space that really matters. Sure, Hadoop is great for simple batch processing tasks that are “embarrassingly parallel”, but most of the difficult big data tasks confronting companies today are much more complex than that.

McColl isn't a neutral observer of Hadoop: his company competes with vanilla Hadoop deployments. My own company, Nodeable, offers a real-time complement to Hadoop, based on the open-source Storm project, though I'm much more sanguine about Hadoop's medium-term prospects than either McColl or Miller. Still, McColl's point is well taken, especially in light of Miller's observation that even the originator of MapReduce, Google, has largely moved on to faster, more responsive analytical tools.

Does it matter?

Probably not. At least, not anytime soon. It has long been the case that web giants like Facebook and Google have moved faster than enterprise IT, which tends to be much more risk-averse and more prone to hanging onto technology once it's made to work. So it's a Very Good Thing, as Businessweek highlights, that the web's technology of today is being open sourced to fuel the enterprise technology of tomorrow.

Hadoop still has several kinks to work out before it can go truly mainstream in the enterprise. It's not as if enterprises are going to go charging ahead into Percolator or other more modern approaches to big data when they have yet to squeeze Hadoop for maximum value. Enterprise IT managers like to travel in packs, and the pack is currently working on Hadoop. There may be better options out there, but they're going to need to find ways to complement Hadoop, not displace it. Hadoop simply has too much momentum going for it.

I suspect we'll see Hadoop continue forward as the primary engine of big data analytics. We're looking at many years of dominance for Hadoop. However, I think we'll also see add-on technologies offered by cloud vendors to augment the framework. Hadoop is never going to be a real-time system, so things like Storm will come to be viewed as must-have tools to provide real-time insight alongside Hadoop's timely, deep analytics.
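To make the batch versus real-time distinction concrete, here is a minimal Python sketch. It is not Hadoop or Storm code: `batch_count`, `StreamingCounter` and the sample log lines are hypothetical names and data invented for illustration. The first function mimics a MapReduce-style job that must read the whole dataset before it produces an answer; the class mimics a stream processor that keeps a running answer current as each event arrives.

```python
from collections import Counter, defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one record."""
    return [(word.lower(), 1) for word in record.split()]


def batch_count(records):
    """MapReduce-style batch job: map every record, group by key, then reduce.

    No result is available until the entire input has been processed.
    """
    groups = defaultdict(list)
    for record in records:                 # map + shuffle over the full dataset
        for key, value in map_phase(record):
            groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}  # reduce


class StreamingCounter:
    """Stream-style processing: update a running result one event at a time."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, record):
        for word in record.split():
            self.counts[word.lower()] += 1
        return self.counts                 # an up-to-date answer after every event


# Hypothetical sample data, used only to exercise both styles.
log = ["error disk full", "warn disk slow", "error net down"]

batch_result = batch_count(log)

stream = StreamingCounter()
for line in log:
    stream.on_event(line)
```

Both approaches reach the same totals; the difference is when the answer becomes available, which is why the two styles complement rather than replace each other.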

Some early adopters will figure these tools out on their own without help from cloud application vendors. But most are going to buy, not build, and that "buy" decision will include plenty of Hadoop, whether from Cloudera or Metamarkets or Hortonworks or EMC or anybody else. That's why Forrester pegs today's Hadoop ecosystem at $1bn, a number that is only going to grow, no matter what Google thinks is a better approach to big data. ®

Matt Asay is senior vice president of business development at Nodeable, offering systems management for managing and analysing cloud-based data. He was formerly SVP of biz dev at HTML5 start-up Strobe and chief operating officer of Ubuntu commercial operation Canonical. With more than a decade spent in open source, Asay served as Alfresco's general manager for the Americas and vice president of business development, and he helped put Novell on its open source track. Asay is an emeritus board member of the Open Source Initiative (OSI). His column, Open...and Shut, appears three times a week on The Register.
