Spark man Zaharia on 2.0 and why it's 'not that important' to upstage MapReduce

Matei tells us about his brainchild

Interview Spark is the open source cluster computing system started in 2009 by Matei Zaharia, when he was but an 'umble PhD candidate at Berkeley's AMPLab. Some people hope it will become the logical successor to MapReduce.

Donated to the Apache Software Foundation in 2013, Spark has been backed by IBM.

Proponents of Spark say it is superior to MapReduce, Google's data processing framework, which formed the basis for Hadoop – another cluster system and competitor in the Big Data support and services sector.

Companies such as Cloudera have announced initiatives allowing Spark to become the successor to Hadoop, and Spark is now supported by practically every business in the sector.

Ahead of February's Spark Summit in New York City, Zaharia told The Register that merely “replacing MapReduce is not that important.”

“Spark was designed as a superset of MapReduce,” he added, and noted that it has become “very optimised.”

“Most of the commercial companies that were shifting and providing MapReduce – like Cloudera and Hortonworks – have still not switched their current customers over, and they may never do so with the older ones, but newer customers are coming onboard with Spark.”

Rather than replacing MapReduce, however, Zaharia said it was more important to “bring in new users for whom MapReduce was too complex and giving them a way to work with big data.”

“Both on the Spark mailing list, and also at Databricks – our Spark company – most of the users who come in have not used Hadoop before,” Zaharia told us. “They're coming in directly from single-machine tools like Python and R, and I think the reasons for that are that it's easier to use even for the batch workload because of our APIs' similarity to Python and R, and we have streaming and machine learning applications which aren't a good fit for MapReduce.”

“All the big data tools out there are still pretty difficult to use,” added Zaharia, “and to make them useful and impactful we have to expand the pool of people who can use these tools.”

System-agnostic compute for the masses

Zaharia remains the Spark project's veep at the Apache Software Foundation, where he also works as a Hadoop Committer. He is additionally an assistant professor at MIT, and the CTO at the aforementioned Spark business Databricks.

Spark is “basically a framework for parallel computing on clusters. One of the things about it which is interesting compared to previous frameworks – compared to Hadoop, which is for both computing and storage – is that Spark is just a computing layer which is agnostic to the system,” said Zaharia. “You can connect it to Hadoop, or Cassandra, or DynamoDB. The only focus in Spark is how you compute stuff, not where you store it.”

“No-one from the original team knew or expected how much it would be taken up,” the original team's original member stated. “I was a PhD student when we started the project at UC Berkeley, and at the beginning I used to know all of the users. Then, for a year or two, I worked with start-up companies in the Bay Area, and then with Yahoo! for machine learning after a while, and then it very quickly changed to me only knowing a small fraction of the users.”

“That's been exciting to see,” added Zaharia, who explained how the Spark team has managed to achieve its growth. “What we've got a good formula for is scaling the development process. We can let a lot of people contribute, let it meet their needs, and not lock the project down. We spent a bunch of time automating testing for patches, and figuring out how we make releases on a predictable schedule,” he added, but ultimately “it was the right project at the right time.”

“There were great people behind it, computer science systems researchers who wanted to do interesting things with it,” and there are “a bunch of things happening for 2016,” Zaharia told us.
