Databricks pushes machine learning on easy mode: Rock star data scientist, meet sweaty engineer
Co-founders chat to El Reg about liability, data silos and raw data pain
Interview Ninety-nine per cent of companies are struggling to make a success of machine learning, according to execs at analytics biz Databricks.
Firms have spent years amassing data in the hopes of doing something useful with it eventually. Meanwhile, the meteoric rise of Facebook and Google has helped make data analytics sexy, spawning a frenzied interest in machine learning.
But most have struggled to apply it to their business – despite a vague awareness that it has huge potential.
For Databricks' bosses, the problem is a lack of unification within businesses and an abundance of silos.
"The big divide in most enterprises is between IT and the line of business," CEO Ali Ghodsi told attendees at the European Spark+AI Summit yesterday.
"Data engineers have liability; they need to make sure systems are secure and will work for decades to come," he said. "In the line of business, where data scientists sit, they know the business and know what to do to move the needle, but need the data."
Unification has long been Databricks' mantra, but it is now focusing on the machine learning life cycle.
For Matei Zaharia, who created Spark, the open-source analytics engine on which Databricks is built, the silos that run deep into the technologies companies use are at the heart of this problem.
'Everyone who has tried to do machine learning knows it's complex'
The basic machine learning life cycle – taking raw data, preparing it, training your model and deploying it – is fraught with variables and complications. It needs to support a variety of tools, and parameters have to be constantly tuned, with models retrained and measured, to catch those that are drifting or becoming less accurate.
Or as Zaharia summed it up: "Everyone who has tried to do machine learning knows it's complex."
One of the major bottlenecks Zaharia sees is that machine learning systems involve many different people. There is a division between the data scientists who build the models and the engineers who prepare the data or handle production deployment.
"These are experts in different areas, and it's often very difficult getting things [to] reliably work across them," Zaharia told The Reg in an interview after his keynote at the London conference.
For instance, a production team's implementation might not work in quite the same way as a data scientist intended, or a data engineer might prep data without knowing the exact requirements downstream. "There's a lot of back and forth," the CTO said.
Moreover, it can take a long time to change any step along the way, and subtle mistakes or bugs can creep in when something is re-implemented.
A machine learning platform for the masses
One of the ways companies tackle this is by creating a machine learning platform that simplifies development, and improves usability for both data scientists and engineers.
But such platforms are tied to one company's infrastructure, and Zaharia wanted to open this up. So this summer Databricks launched MLflow, an open-source platform to standardise the complex ML life cycle.
One of the aims is to move the firm's enterprise customers from talking about ML to actually using it – although Zaharia emphasises that such platforms are widely used by companies that have plenty of expertise in data science.
"Even when they have expertise they still want to accelerate the process and accelerate the time into production."
The platform allows companies to track the parameters, code and data they use for each experiment and bundle up the code into reproducible and reusable packages and machine learning models.
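To make the tracking idea concrete, here is a minimal sketch in plain Python of what recording parameters, a code version and a data fingerprint for each experiment might look like. It is not MLflow's API: `RunRecord`, `fingerprint`, `record_run` and `best_run` are hypothetical names, and the accuracy figures are made-up illustrative values.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """One experiment run: what went in and what came out (hypothetical structure)."""
    params: dict
    code_version: str
    data_fingerprint: str
    metrics: dict = field(default_factory=dict)


def fingerprint(rows):
    """Hash the training data so a later run can tell whether the data changed."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def record_run(log, params, code_version, rows, metrics):
    """Append a run record to an in-memory log and return it."""
    run = RunRecord(dict(params), code_version, fingerprint(rows), dict(metrics))
    log.append(run)
    return run


def best_run(log, metric):
    """Return the recorded run with the highest value for the given metric."""
    return max(log, key=lambda run: run.metrics[metric])


# Illustrative values only: two runs on the same data with different parameters.
log = []
data = [[1.0, 2.0], [3.0, 4.0]]
record_run(log, {"learning_rate": 0.1}, "abc123", data, {"accuracy": 0.81})
record_run(log, {"learning_rate": 0.01}, "abc123", data, {"accuracy": 0.87})
winner = best_run(log, "accuracy")
print(winner.params, winner.data_fingerprint)
```

Because every run carries its parameters, code version and data fingerprint, anyone picking up the winning run later can see what produced it and whether the inputs have since changed.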
Zaharia acknowledged MLflow has "rough edges" to be smoothed, but said that – four months in – it was getting good traction.
"People are excited about having an open-source project in this space," he said. "They're excited about having an ML platform – it's something that resonates with them, and that many wanted to build already – and having one that is a community effort will be much better than what any company could build on its own."
As evidence of this, he noted that about two-thirds of the 48 contributors are from outside Databricks, and that they are already using it and adding new features, such as the ability to add notes to their experiments.
New integrations, moving on from alpha
The project yesterday announced it has gained the backing of RStudio, which is developing an R API for MLflow v0.7.0, significantly expanding its reach.
"R is extremely widely used for both machine learning, statistics and data processing... it has very powerful libraries," Zaharia said. "It's also one of those things most alien to production and data engineers. So productionising R is considered a challenge."
He's also viewing RStudio's interest as validation of the project. "It's another well-known open-source vendor that's decided to join... they're the experts on interfaces for R users."
At the moment, MLflow is still in alpha, but Zaharia said the firm was pushing to move to a version that isn't alpha or beta as quickly as possible, with the hope this will be in the first half of 2019.
On top of this, the firm is working on specific tasks and integrations. For instance, there is a lot of interest in improving integrations with hyperparameter search libraries to complement the way MLflow works with training frameworks.
MLflow can already package up a step, such as a training job, as a project with parameters, Zaharia said. "It's kind of like a black box, you feed in the parameters and get a result. And that works really well with these existing hyperparameter search tuning systems that are already out there, and we can just connect with them."
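The "black box" description can be sketched in a few lines of plain Python. This is an illustration of the idea only, not MLflow's project format or any particular tuning library: `training_step` is a hypothetical stand-in for a packaged training job, its scoring formula is a toy, and `grid_search` is the simplest possible search strategy.

```python
import itertools


def training_step(learning_rate, depth):
    """Hypothetical stand-in for a packaged training job: parameters in, score out.

    A real job would train a model; this toy formula simply peaks at
    learning_rate=0.1 and depth=4 so the search has something to find.
    """
    return 1.0 - abs(learning_rate - 0.1) - 0.05 * abs(depth - 4)


def grid_search(step, grid):
    """Call the black-box step once per parameter combination and keep the best."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = step(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


search_space = {"learning_rate": [0.01, 0.1, 0.5], "depth": [2, 4, 8]}
best_params, best_score = grid_search(training_step, search_space)
print(best_params, best_score)
```

The search code never looks inside `training_step`; it only supplies parameters and reads back a score, which is why a step packaged this way can be handed to an existing tuning system without modification.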
There's also integration with the Databricks environment. The firm yesterday announced what it's calling "time travel", which combines MLflow's tracking with the tracking in Delta, the unified data management system it launched last year.
'It scratches all the itches'
But the whole point of the work is to get the enterprise consistently and effectively deploying machine learning, so is the messaging working?
"It's been surprising that, of the early adopter programme, the most traction is with large enterprise customers," co-founder Andy Konwinski told El Reg. One might have expected that group to lag.
"Typically, they've hired a data science team and got in one or two rock star data scientists, who have got between one and five models in production."
This has proved there's a lot of value in machine learning, Konwinski said, and the business as a whole believes in it – but they're now left "scratching their head about how to scale it up".
Meanwhile, the data scientists "typically aren't happy because they're hitting that wall" of a lack of unification across the business.
"Nobody is retraining the model, measuring if it's drifting or making sure results stay within very high accuracy. They're barely able to keep these five models in production – and it's costly."
As a result, he said, there's a "uniform hunt for data science platform tooling", not just from the data scientists but also the C-suite and management level. "They all have a different bundle of concerns... [MLflow] scratches all the itches." ®