Spark meets HAL: Apache's cluster master goes deep
IBM and co: Data for the masses – up the workers!
As a phrase, "democratisation of data" is rather glib – but it does have a serious purpose. The thinking: making use of company data should not be the preserve of just "professionals".
This was certainly a theme at the recent European Spark summit in Brussels.
Spark, the open-source cluster-computing framework overseen by the Apache Software Foundation, is finding its way into a variety of organisations – from startups to major enterprises – and is very much associated with data analysis and predictive applications.
Spark dates from 2009, originating among researchers at the University of California, Berkeley, but arguably hit the big time when IBM flung its corporate weight behind the open-source cluster framework last year.
IBM committed 3,500 researchers and developers to Spark, which provides the basis of all of Big Blue's analytics and commerce platforms, including work on Watson.
Spark is catching on because it gives software engineers and data scientists an easy way to provision and run server clusters, leaving them free to get on with the actual data analysis.
The idea is that a company doesn't want to waste techies' time on hardware problems that take them away from their core work – a consideration that matters most for small teams.
As Dr Shuai Yuan, a data scientist at British startup MediaGamma, put it when talking to The Reg: "We don't have to have an engineering resource to manage clusters."
But conference speakers were going deeper, and talked of using Spark as the underlying framework for building applications behind that other meme of 2016 – AI.
And that drive towards more accessible deep learning is of interest to IBM, which has been heavily involved in Spark since 2015. IBM used the summit to make much of this shift, with data analysis moving from being the preserve of data scientists to being for everybody.
"We're moving from machine learning to learning machine," Dinesh Nirmal – IBM's vice president for Big Data and Next Gen Analytics Platform development – told The Reg.
This sounds like yet another glib marketing phrase but there's a serious point behind it, one that typifies another theme of the conference – the emergence of deep learning.
Much of the interest around Spark has been as a means to handle machine learning applications, but it's the announcements around deep learning that took discussions to a new level. Deep learning is the name given to a particular branch of machine learning, one that uses a set of algorithms to "learn" from the paths that data takes. It's strongly connected to neural networks, where computer systems ape the nerve pathways of a human body, a field of study that emerged in the 1970s but hit a dead end... until now, that is.
IBM is one of the companies looking to build on this renewed interest in deep learning. With the launch of the Watson Data Platform, it aims to take the handling of machine learning in a new direction.
As Nirmal points out, the usual way that companies approach machine learning is to build a model that can handle the data being generated. But, he explains, this has its limits. "What that doesn't take into account is that the system could degrade – the data could change, the model could change necessitating alterations to the model," he said.
What IBM is aiming for, Nirmal says, is a continual feedback mechanism so that "once you train the model, you don't have to retrain the model and we are training the model the whole time".
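The continual-feedback idea Nirmal describes can be illustrated with a minimal sketch. This is not IBM's implementation, nor anything taken from the Watson Data Platform: the class `OnlineLinearModel`, the `stream` helper and the simulated change in the data are all hypothetical, chosen only to show a model that is nudged by every new observation rather than trained once and left to degrade.

```python
import random


class OnlineLinearModel:
    """One-feature linear model updated one observation at a time (hypothetical)."""

    def __init__(self, learning_rate=0.05):
        self.weight = 0.0
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.weight * x + self.bias

    def update(self, x, y):
        """Take one stochastic-gradient step on squared error.

        Returns the prediction error measured before the step.
        """
        error = self.predict(x) - y
        self.weight -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error
        return error


def stream(slope, n, rng):
    """Yield n noise-free (x, y) pairs with y = slope * x (assumed toy data)."""
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        yield x, slope * x


rng = random.Random(0)
model = OnlineLinearModel()

# Phase 1: the data follows y = 2x and the model learns from each record.
for x, y in stream(2.0, 2000, rng):
    model.update(x, y)
weight_before_drift = model.weight

# Phase 2: the data changes to y = -x. There is no separate retraining job;
# the same update loop keeps running and the model follows the data.
for x, y in stream(-1.0, 2000, rng):
    model.update(x, y)
weight_after_drift = model.weight
```

In this toy setting the learned weight settles near the first slope and then moves to the second once the data shifts, which is the behaviour a continual feedback loop is meant to provide. A production system would also need monitoring, validation and safeguards against learning from bad data, none of which is shown here.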
Nirmal reckons what IBM brings is simplicity, personalisation and convergence. "Most of our customers have tremendous amounts of data and they have to consider a cost-effective way of looking at it. They can run a Hadoop cluster but that involves a lot of specialist skills and they won’t always have the people," he said. It's this skills gap that saw IBM launch Watson Data Platform.
Ali Ghodsi, CEO of Databricks (the company founded by the original creators of Spark), believes there's a growing active interest in Spark, but complexity in deep learning remains an issue, making it something that can generally only be handled by large corporations with big teams of data scientists. "Someone like Google with thousands of PhDs is able to do this," says Ghodsi, "but it is something that requires a lot of resources."
That said, smaller teams can do innovative things with deep learning. As an example there's Real Eyes, used by marketing agencies to examine the responses to video ads.
It monitors physical reactions and matches them to learned responses from its store of video – in other words, a methodology that combines machine learning with deep learning.
Javier Orozco, a data scientist with Real Eyes, told us that the system has built up a store of 3.6 million recordings from 240 countries.
"We have hundreds of people giving up their own time to look at videos. We have to extract that emotion and analyse it – frame by frame, second by second," he said.
Spark has been vital in solving the problems the company had in pulling in data from different sources and making use of its historical information, he added.
2015 was a pivotal year for Spark as IBM waded in, and the framework is clearly going to become increasingly important.
That's because it takes the work of provisioning and running server clusters off the hands of software engineers and data scientists, leaving them free to get on with the actual data analysis.
But it's not all plain sailing. There is a shortage of data scientists and that will have an impact on companies wanting to employ deep learning in any meaningful way.
For Ghodsi, the wide-scale adoption of the technology will require many more people with the requisite skills. "There needs to be a big change in the way we teach, starting from schools but in universities too. We need to have people who understand data science," he said.
There are so many areas of data that need to be looked at, he added. "For a start, you have to have clean data – that's half the work – and you have to have it presented in the right format."
But the promise is there. Spark is offering a way to help tackle some of the crucial problems when it comes to predictive behaviour, setting a standard for deeper analysis. It's early days, still, but the initiatives of the likes of IBM and Databricks offer hope for the future. As Nirmal says: "Democratisation of machine learning is the key: how can we make it available?"
He draws a parallel with mobile phones: "Phones have been made so simple, you can make trades; you can do transactions with your bank and so on. Machine learning has to go in that direction. The question should be, 'how do we make things so simple that the end user can add value?'"