Hadoop spinner Cloudera lights Spark on MapReduce retirement

Big Data pioneer succumbs to mounting pressure from the crowd

Cloudera, one of the Big Data pioneers founded on Hadoop – the open-source implementation of Google’s MapReduce – is replacing... MapReduce.

On Wednesday, the firm announced the One Platform Initiative, which will see it replace MapReduce with the Apache Software Foundation’s Spark, a cluster-computing framework which has attracted big-name support.

Cloudera said its initiative would let Spark become the successor to Hadoop’s MapReduce framework for general Hadoop data processing. The firm claimed “wide adoption” of Spark among its customers in the last 18 months, with Spark also becoming the most popular open-source project in the Hadoop ecosystem.

Cloudera founder and chief strategy officer Mike Olson said in a statement:

Spark is well on its way to succeeding MapReduce in enabling jobs with hundreds of executors each, running simultaneously on large multi-tenant clusters with tens of thousands of nodes – but there is still some heavy lifting to do.

It's an ambitious goal, but with the community of committers and supporters, and our leadership, we think that's highly achievable.

Cloudera is no recent convert to Spark. The firm – along with IBM, Intel, Databricks, and MapR – last year announced a collaboration to port the Apache Hive data warehouse to Apache Spark. The One Platform Initiative will tackle security, scale, management and streaming.

Hadoop was actually developed by Cloudera's Doug Cutting, along with Mike Cafarella, as a project at their employer Yahoo! in 2005, with version 1.0 following in 2011. They drew on a paper on MapReduce published by that framework's creator, Google.

Spark was developed by the AMPLab at the University of California, Berkeley, and open-sourced under a BSD license in 2010, before being donated to the ASF in 2013. It includes Spark SQL, Spark Streaming, a machine-learning library called MLlib and the GraphX distributed graph-processing framework.

It’s seen growing support from many in the Hadoop, MapReduce and NoSQL ecosystem: NoSQL provider MapR announced Spark-based offerings for security, analytics and genome-sequencing software, and Cloudera’s fellow Hadoop spinner Hortonworks released Spark as a part of its Hortonworks Data Platform in April.

But it was IBM that in June lent Spark what will be seen as its breakthrough: the giant announced “a major commitment” to Apache Spark, promising to embed the framework in its analytics and commerce software. Big Blue also pledged to donate its SystemML machine-learning technology to the Spark ecosystem and to offer Spark as a cloud service on IBM Bluemix.

MapReduce, and thus Hadoop, have long been under pressure for their complexity and lack of flexibility and performance, with many wondering what would come next.

Spark is seen as faster – able to process jobs between 10 and 100 times faster than MapReduce – and better suited to iterative and interactive processing, while it is able to run not just on Hadoop but also with other Hadoopy tools such as Hive and Pig. ®

Biting the hand that feeds IT © 1998–2017