Spark man Zaharia on 2.0 and why it's 'not that important' to upstage MapReduce

Matei tells us about his brainchild

Mon 15 Feb 2016 // 12:41 UTC

Interview Spark is the open source cluster computing system started in 2009 by Matei Zaharia, when he was but an 'umble PhD candidate at Berkeley's AMPLab. Some people hope it will become the logical successor to MapReduce.

Donated to the Apache Software Foundation in 2013, Spark has been backed by IBM.

Proponents of Spark say it is superior to MapReduce, Google's data processing framework, which formed the basis for Hadoop – another cluster system and competitor in the Big Data support and services sector.

Companies such as Cloudera have announced initiatives allowing Spark to become the successor to Hadoop, and Spark is now supported by practically every business in the sector.

Ahead of February's Spark Summit in New York City, Zaharia told The Register that merely “replacing MapReduce is not that important.”

“Spark was designed as a superset of MapReduce,” he added, and noted that it has become “very optimised.”

“Most of the commercial companies that were shifting and providing MapReduce – like Cloudera and Hortonworks – have still not switched their current customers over, and they may never do so with the older ones, but newer customers are coming onboard with Spark.”

Rather than replacing MapReduce, however, Zaharia said it was more important to “bring in new users for whom MapReduce was too complex and giving them a way to work with big data.”

“Both on the Spark mailing list, and also at Databricks – our Spark company – most of the users who come in have not used Hadoop before,” Zaharia told us. “They're coming in directly from single-machine tools like Python and R, and I think the reasons for that are that it's easier to use even for the batch workload because of our APIs' similarity to Python and R, and we have streaming and machine learning applications which aren't a good fit for MapReduce.”

“All the big data tools out there are still pretty difficult to use,” added Zaharia, “and to make them useful and impactful we have to expand the pool of people who can use these tools.”

System-agnostic compute for the masses

Zaharia remains the Spark project's veep at the Apache Software Foundation, where he also works as a Hadoop Committer. He is additionally an assistant professor at MIT, and the CTO at the aforementioned Spark business Databricks.

Zaharia described Spark as “basically a framework for parallel computing on clusters. One of the things about it which is interesting compared to previous frameworks – compared to Hadoop, which is for both computing and storage – is that Spark is just a computing layer which is agnostic to the system.” He added: “You can connect it to Hadoop, or Cassandra, or DynamoDB. The only focus in Spark is how you compute stuff, not where you store it.”
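To make the "compute layer, not storage layer" idea concrete, here is a minimal sketch in plain Python. It is not Spark's API. The names `memory_source`, `file_like_source` and `word_count` are hypothetical, invented for this illustration. The point is only that the computation is written once against "something that yields lines" and does not care where those lines are stored.

```python
import io
from collections import Counter
from typing import Callable, Iterator, List

# A "source" is any zero-argument callable that yields lines of text.
# These two are hypothetical stand-ins for different storage systems.
Source = Callable[[], Iterator[str]]


def memory_source(lines: List[str]) -> Source:
    """Serve lines that already sit in local memory."""
    return lambda: iter(lines)


def file_like_source(text: str) -> Source:
    """Serve lines from a file-like object (io.StringIO stands in for
    a real file, a distributed file system, or a database cursor)."""
    return lambda: (line.rstrip("\n") for line in io.StringIO(text))


def word_count(source: Source) -> Counter:
    """The compute step: it only iterates over lines and never asks
    where they came from."""
    counts: Counter = Counter()
    for line in source():
        counts.update(line.lower().split())
    return counts


# The same computation runs unchanged against both back ends.
from_memory = word_count(memory_source(["spark is fast", "spark is general"]))
from_file = word_count(file_like_source("spark is fast\nspark is general\n"))
```

Swapping the storage back end here means passing a different source and leaving `word_count` alone, which is the separation Zaharia describes, shown at toy scale on a single machine.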

“No-one from the original team knew or expected how much it would be taken up,” the original team's original member stated. “I was a PhD student when we started the project at UC Berkeley, and at the beginning I used to know all of the users. Then, for a year or two, I worked with start-up companies in the Bay Area, and then with Yahoo! for machine learning after a while, and then it very quickly changed to me only knowing a small fraction of the users.”

“That's been exciting to see,” added Zaharia, who explained how the Spark team has managed to achieve its growth. “What we've got a good formula for is scaling the development process. We can let a lot of people contribute, let it meet their needs, and not lock the project down. We spent a bunch of time automating testing for patches, and figuring out how we make releases on a predictable schedule,” he added, but ultimately “it was the right project at the right time.”

“There were great people behind it, computer science systems researchers who wanted to do interesting things with it,” and there are “a bunch of things happening for 2016,” Zaharia told us.
