MIT boffins hope to speed up analytics with GitHub-style platform
FeatureHub could help arduous process of feature engineering
Boffins at MIT have proposed a GitHub-style collaborative platform to speed up one of the first, most challenging stages of data analysis.
When faced with a large dataset, scientists need to first identify the features – individual measurable characteristics – of the variables.
It's crucial the chosen features are good representations of the raw variables because they will be used to train the machine-learning algorithm. Effectively, they determine whether the algorithm performs well or badly.
To do this, data scientists have to come up with ideas for the features and write scripts to extract the features – a process called feature engineering that is both challenging and time-consuming.
But, according to scientists at MIT, it's also the part of the data science pipeline most amenable to collaboration – and they've developed a platform, FeatureHub, to enable exactly that.
"Feature engineering is a main bottleneck preventing rapid development of data-driven solutions and machine learning models," said Micah Smith, lead author on the paper (PDF) describing the framework.
"It's recognised as the most important factor in generating an accurate model, but it hasn't been given as much attention as developing the machine-learning model given the features."
There have been efforts to automate feature engineering – including some by the same MIT group – but Smith said that the aim of this work was to encourage humans to work together, in a bid to unlock "more subtle and creative features".
FeatureHub takes inspiration from the code-sharing platform GitHub: it allows data scientists to collaborate on a feature engineering task, as well as view and discuss each other's features in real time.
The source code they write is integrated into a single predictive machine-learning model; the platform abstracts away model training, selection and tuning.
The idea is to automate all the parts of the data science pipeline that aren't feature engineering, to keep teams focused on the job at hand – and hopefully speed things up.
Smith added that it might also "lower the barrier to entry" for less experienced data scientists, as it would help them generate and test new ideas more quickly and easily.
Counting the days – or the months?
To test whether the platform did speed things up, 32 freelance data scientists – each spending about five hours learning about and using FeatureHub – were set to work on two data science problems.
The resulting predictive models were tested against entries submitted to the competitive platform Kaggle, and – on a 100-point scale – the FeatureHub models landed within three to five points of the winning entries.
However, the FeatureHub results were reached in a matter of days, whereas the winning Kaggle entries generally took weeks or months of work.
Merve Alanyali, a doctoral researcher at the Alan Turing Institute and Warwick Business School, told The Reg that speeding the process up was "the main strength" of the study.
"It will definitely save researchers time, especially if they are working on their own," she said.
Although "data scientists usually follow the same procedure to decide on features," she added, "I can see them benefitting from the collaborative aspect of the platform."
Alanyali added that there are other collaborative data science platforms and tools – like Kaggle, GitHub and Mechanical Turk – that allow people to share code, create training datasets and discuss problems, but this is the first she had seen to focus on feature identification.
"It has usually been done in project groups or among researchers, but I haven’t seen an online platform for it," Alanyali said.
"I think users will feel the benefits of the platform more and more as the number of users increases – this means there will be more feature submissions in a shorter period of time."
How does it work?
The paper offers as an example a project that aims to predict which country Airbnb users will travel to.
This problem came with background information and a relational dataset describing users, their interactions with the Airbnb website, and potential destinations.
The FeatureHub workflow for this project begins with a data coordinator, who prepares the prediction programme for feature engineering. They will deploy a FeatureHub instance, upload the dataset and metadata to the servers and perform minimal cleaning and preprocessing.
Next, the data scientists are let loose: they start by reading the background information, loading the dataset and carrying out exploratory data analysis and visualisation. Then they start writing features, either solo or using the discussion tools.
The user interface is built on top of the data analysis software suite Jupyter Notebook, and features are written in Python following a template that keeps the syntax simple.
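In this setup, a feature is essentially a function that maps the raw dataset to a column of values, one per row of the prediction target. The sketch below is a hypothetical illustration of such a template – the function name, signature and `dataset` layout are assumptions for this example, not FeatureHub's actual API:

```python
import pandas as pd

def age_at_signup(dataset):
    """Hypothetical feature: a user's age when they created their account.

    `dataset` is assumed here to be a dict of pandas DataFrames keyed by
    table name, mirroring the relational format described in the paper.
    Returns one value per user.
    """
    users = dataset["users"]
    return users["signup_year"] - users["birth_year"]
```

A simple convention like this – one function in, one column out – is what lets the platform plug many contributors' features into a single model without manual integration work.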
Once the features are written, FeatureHub automatically builds a simple machine-learning model on training data using each feature, and reports metrics back to users in real time. If its predictive performance meets expectations, the feature is submitted to a feature database.
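The per-feature check described above can be approximated in a few lines of scikit-learn – this is an illustrative stand-in for the platform's evaluation step, not its actual code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def evaluate_feature(feature_values, labels):
    """Train a simple model on a single candidate feature and return its
    cross-validated accuracy, roughly mirroring the real-time metric a
    contributor would see before submitting the feature (sketch only)."""
    X = np.asarray(feature_values, dtype=float).reshape(-1, 1)
    return cross_val_score(LogisticRegression(), X, labels, cv=3).mean()
```

A score near the no-information baseline would suggest the feature adds little; a clearly better score would justify submitting it to the feature database.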
Every time a new feature is submitted, a model is selected and trained automatically, and the performance reported to the coordinator.
Once the model reaches a suitable performance threshold, the coordinator can stop the project, and the data scientists move on.
Smith said FeatureHub could be incorporated into most systems, and that the hope was that it would be used to address difficult projects that need a lot of person-power.
"We are looking at how we can further enhance the scaffolding, collaboration and make it as effective as how collaborative models like GitHub have been for software engineering. We are also looking to use this platform to solve some of the pressing societal problems through collaboration." ®