
Cloudera, MapR, IBM, and Intel bet on Spark as the new heart of Hadoop

Screeching sound heard amid Spark stampede as Cloudera changes opinion on Hive


Cloudera has rallied four major companies behind a scheme to tie two open source projects together for the benefit of the Hadoop community.

The partnership between Cloudera, IBM, Intel, Databricks, and MapR to port Apache Hive onto Apache Spark is due to be announced this week at the Spark Summit in San Francisco. El Reg heard rumors of it last week after stumbling across a proposal by Cloudera to lift Hive onto Spark.

For those not familiar with the galaxy of codenames in the Hadoop world: Spark is a general-purpose cluster computing system originally developed at the University of California, Berkeley, and designed to work with data stored in the Hadoop Distributed File System (HDFS). It can be used as an alternative data processor to Hadoop MapReduce and is billed as being around 100 times faster than MapReduce when running in memory or 10 times faster when running on disk.

Hive, meanwhile, is data warehouse software that uses a SQL-like language to query data stored in Hadoop.

Both projects are important, with Spark seen by many as a potential successor to MapReduce and Hive viewed as a likely candidate for accomplishing SQL-on-Hadoop work.

By lifting Hive up onto Spark, Cloudera & Co are hoping to force some consolidation in the labyrinthine Hadoop ecosystem, and in doing so are placing less emphasis on one of Cloudera's own projects, Impala.

In a chat with El Reg, Justin Erickson, Cloudera's director of product management, said the company has decided to push Hive because it wants to "go and combine the forces of the Spark community with the Hive community to make batch processing [in Hadoop] faster."

"Hive is the standard for doing batch on Hadoop right now," said Matt Brandwein, the company's head of product marketing. "We want to reduce fragmentation in the community. People are getting a bit aware of the fact there are so many options for so many different objects. Spark is the successor."

The move has big ramifications for the Hadoop ecosystem, and for Cloudera. In the past, Cloudera has been skeptical of the value of Hive. In a blog post late last year, Mike Olson, the company's chief strategy officer, wrote, "Decades of experience had taught people to expect real-time responses from their databases. Hive, built on MapReduce, couldn't deliver."

To address the perceived shortcomings of Hive, Cloudera built its own software, Impala. But with the new partnership between Cloudera, MapR, IBM, Databricks, and Intel, it seems Cloudera has warmed to Hive and will use the tech as its main way of dealing with the wider Hadoop community, while continuing to develop Impala as a way to make a buck.

Another little complication in this story is that there already is a Hive-on-Spark project called Shark. But Cloudera feels that Shark has diverged too much from mainstream Hive.

"Shark took an approach of replacing several key components of Hive, including the query planner and other elements of Hive," Cloudera explained. "The result of this was that maintaining compatibility with Hive became very difficult as changes to Hive can not be transparently back-ported to Shark. With the Hive-on-Spark approach, we are making a much more limited change to only the physical query planner, which means that the Hive community can make changes and add new functionality to Hive and have this transparently work with either Spark or MapReduce or Tez. As such, the maintenance burden will be much lower for Hive on Spark and will be more deeply integrated with the core Hive community."

Speaking of Tez, Cloudera's move also puts pressure on Hortonworks, which helped develop the competing data-processing framework. But Cloudera says Spark, like Tez, is merely an option.

As the company explains in an FAQ document, "It is not a goal for the Spark execution backend to replace Tez or MapReduce. It is healthy for the Hive project for multiple backends to coexist. Users have a choice whether to use Tez, Spark or MapReduce. Each has different strengths depending on the use case. And the success of Hive does not completely depend on the success of either Tez or Spark."
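The design Cloudera describes, one engine-independent query plan handed to interchangeable execution backends, can be sketched in a few lines of Python. This is a toy illustration only and not Hive's actual code: `LogicalPlan`, `BACKENDS`, `execute`, and the engine labels `"mr"` and `"spark"` are all hypothetical names made up for the example.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, str]


@dataclass(frozen=True)
class LogicalPlan:
    """Engine-independent query description (hypothetical, not a Hive class)."""
    group_by: str  # column whose values we count rows by


def mapreduce_style(plan: LogicalPlan, rows: List[Row]) -> Dict[str, int]:
    """Count rows per key using explicit map, shuffle and reduce phases."""
    # Map phase: emit a (key, 1) pair for every row.
    pairs = [(row[plan.group_by], 1) for row in rows]
    # Shuffle phase: gather all values that share a key.
    shuffled = defaultdict(list)
    for key, value in pairs:
        shuffled[key].append(value)
    # Reduce phase: sum the values for each key.
    return {key: sum(values) for key, values in shuffled.items()}


def in_memory_style(plan: LogicalPlan, rows: List[Row]) -> Dict[str, int]:
    """Count rows per key in a single in-memory pass."""
    return dict(Counter(row[plan.group_by] for row in rows))


# Registry of physical backends; the logical plan never changes.
BACKENDS: Dict[str, Callable[[LogicalPlan, List[Row]], Dict[str, int]]] = {
    "mr": mapreduce_style,
    "spark": in_memory_style,
}


def execute(plan: LogicalPlan, rows: List[Row], engine: str) -> Dict[str, int]:
    """Run the same logical plan on whichever backend the user picks."""
    try:
        backend = BACKENDS[engine]
    except KeyError:
        raise ValueError(f"unknown engine: {engine!r}") from None
    return backend(plan, rows)


if __name__ == "__main__":
    rows = [{"site": "a"}, {"site": "b"}, {"site": "a"}]
    plan = LogicalPlan(group_by="site")
    for engine in ("mr", "spark"):
        print(engine, execute(plan, rows, engine))
```

Because only the function behind `BACKENDS[engine]` differs, anything added at the plan level works on every backend without extra effort, which is the maintenance argument Cloudera makes for touching only the physical query planner.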

When contacted for comment, Hortonworks said the decision to pour development resources into Hive on Spark is broadly a good thing. "It's an admission that the open source community driven model is the right one," Shaun Connolly, the company's VP of strategy, said in a chat with El Reg.

Another way of interpreting all of this is that now that Cloudera has raised $900m in funding – $740m of which came from Intel – it's trying to throw its weight around and take more of a leading role in the Hadoop community.

By taking charge of some of the planets in the galaxy of open source projects associated with Hadoop, Cloudera can develop a better understanding of the future direction of the software and more carefully hone its business to reap cash off the growing user base. ®
