TensorFlow: I want to like you, but you're tricksy
Wrestling with Google's machine learning framework
Hands-on Occasionally a technology comes along that changes the way that people work. Docker has had a profound effect on how applications are deployed in the cloud, Hadoop changed how analysis of big data was done and the R language has disrupted the statistics market.
And so to TensorFlow, which emerged from the Machine Learning team at the Google Brain project. Building on their experience of a system called DistBelief, TensorFlow is a second-generation framework for the implementation of machine learning at scale.
Users describe their ML models as dataflow graphs, combining a number of machine learning techniques into a single model. TensorFlow itself does nothing to reduce the learning curve found in ML (in fact it might make it steeper), but Google's framework does enormously simplify the deployment of ML models. If you think of ML model construction as data science, then TensorFlow is a data engineering tool for deployment.
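To make the dataflow-graph idea concrete, here is a minimal sketch in plain Python rather than TensorFlow itself. The `Node` class and the `placeholder`, `constant` and `evaluate` helpers are hypothetical names invented for this illustration. TensorFlow's real graph machinery is far richer, but the principle of describing a computation first and running it later is the same.

```python
import operator


class Node:
    """One vertex in a toy dataflow graph: an operation plus its input nodes."""

    def __init__(self, op, inputs=(), name=None):
        self.op = op            # callable applied to the evaluated inputs
        self.inputs = inputs    # upstream nodes feeding this one
        self.name = name        # set only for placeholders fed at run time


def placeholder(name):
    """A graph input whose value is supplied when the graph is run."""
    return Node(op=None, name=name)


def constant(value):
    """A node with no inputs that always produces the same value."""
    return Node(op=lambda: value)


def evaluate(node, feed, cache=None):
    """Run the graph up to `node`, evaluating each node at most once."""
    cache = {} if cache is None else cache
    if id(node) in cache:
        return cache[id(node)]
    if node.op is None:
        # Placeholder: look up the value fed in by the caller.
        result = feed[node.name]
    else:
        args = [evaluate(n, feed, cache) for n in node.inputs]
        result = node.op(*args)
    cache[id(node)] = result
    return result


# Describe y = w * x + b as a graph; nothing is computed yet.
x = placeholder("x")
w = constant(3.0)
b = constant(1.0)
y = Node(operator.add, (Node(operator.mul, (w, x)), b))

# Only now is the graph executed, with a concrete value fed in for x.
print(evaluate(y, {"x": 2.0}))  # 3.0 * 2.0 + 1.0 = 7.0
```

Because the graph is just data until it is evaluated, a framework is free to decide where and how each node runs, which is what makes the deployment story possible.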
Google open-sourced TensorFlow in November 2015 under an Apache 2.0 license, with a 1.0 release in February this year that introduced a number of improvements. For example, TensorFlow had been limited to C++ and Python, but version 1.0 introduces interfaces to Go and Java, and there's a C library for building new language interfaces.
What makes TensorFlow so powerful for deployment is its ability to map ML models on to a number of hardware devices and platforms with very little change to the model. The TensorFlow framework will happily run on anything from Android or iOS devices and laptops, through machines with one or more GPUs, to huge networks of specialised machines with many GPU cards.
Most ML can be divided into two stages: training a model, and running the model to make inferences from new data. In the training phase, the model is fed data that has already been classified and learns (using a number of well-known techniques) the features in that data that allow it to identify the same features in unknown data. This is generally the phase that needs the most computational power. Once a model has been trained it can be used to infer new facts about data it hasn't "seen before". Running this inference model generally takes less power, so it can be run on a mobile phone, for instance.
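The split between the two phases can be sketched with a deliberately tiny model. This is plain Python, not TensorFlow, and the `train` and `infer` functions, the learning rate and the toy data are all assumptions made for illustration. Training loops over the labelled data thousands of times, while inference is a single multiply and add.

```python
def train(examples, epochs=2000, learning_rate=0.05):
    """Training phase: sweep the labelled data repeatedly to fit y = w*x + b."""
    w, b = 0.0, 0.0
    n = len(examples)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in examples:
            error = (w * x + b) - y
            # Gradient of the mean squared error with respect to w and b.
            grad_w += 2 * error * x / n
            grad_b += 2 * error / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def infer(model, x):
    """Inference phase: one cheap calculation using the learned parameters."""
    w, b = model
    return w * x + b


# Labelled training data drawn from y = 2x + 1.
labelled = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
model = train(labelled)

# The model has never seen x = 10, but should predict a value close to 21.
print(round(infer(model, 10.0), 2))
```

The learned model here is just two numbers, which is why the inference side can be shipped to something as modest as a phone while the training loop stays on bigger hardware.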
The bigger the dataset used for training, the more power is required. Luckily TensorFlow scales very well and has built-in support for running training models across large networks of machines. TensorFlow itself will map computation of the model on to the devices it knows are available, be they a local CPU, GPU or remote machines.
TensorFlow's placement algorithm actually simulates running the model on the available devices in order to minimise the completion time.
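A rough feel for simulation-driven placement can be had from the sketch below. It is a greedy toy in plain Python under my own assumptions, not TensorFlow's actual algorithm: the `place` function, the device speeds and the fixed `transfer_cost` are all hypothetical. For each operation it simulates running on every device and picks the one that would finish soonest.

```python
def place(ops, devices, transfer_cost=1.0):
    """Greedily assign each op to the device where a simulated run ends soonest.

    ops: list of (name, work, [input names]), already in dependency order.
    devices: dict mapping device name to speed (work units per time unit).
    Returns the placement and the simulated overall completion time.
    """
    device_free = {d: 0.0 for d in devices}  # when each device is next idle
    finish = {}                               # simulated finish time per op
    placement = {}
    for name, work, inputs in ops:
        best = None
        for dev, speed in devices.items():
            # Inputs produced on a different device pay a transfer penalty.
            ready = max(
                (finish[i] + (0.0 if placement[i] == dev else transfer_cost)
                 for i in inputs),
                default=0.0,
            )
            done = max(ready, device_free[dev]) + work / speed
            if best is None or done < best[0]:
                best = (done, dev)
        finish[name], placement[name] = best
        device_free[best[1]] = best[0]
    return placement, max(finish.values())


# Hypothetical hardware: the GPU is ten times faster than the CPU.
devices = {"cpu": 1.0, "gpu": 10.0}
ops = [
    ("load", 1.0, []),
    ("matmul", 100.0, ["load"]),
    ("log", 1.0, ["load"]),
]
placement, makespan = place(ops, devices)
print(placement, makespan)
```

In this run the heavy `matmul` lands on the GPU, while the cheap `log` step goes to the otherwise idle CPU even after paying the transfer cost, because waiting for the busy GPU would take longer.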
What is it like to work with TensorFlow? I'm not going to pretend I'm in a position to start writing new ML models on the platform but, luckily, we may not need to. The TensorFlow webpage contains a link to a TensorFlow Zoo model library on GitHub with a number of user-contributed models. The zoo has around 27 exhibits at the moment, but it should be pointed out that other user-generated models are available around the interwebz so, for instance, if you want a movie recommendation engine then you could pick TF-recomm from user songgc.
I wanted to see how easy it would be to build my own useable recommendation engine, based on this repository, running under TensorFlow. This engine uses a small dataset from the MovieLens library.
The dataset consists of rows of data with a user ID, a film ID and the user's rating of the film. The recommendation engine reads this file and makes links between films and users to learn what films a user would like. It's important to note that although the dataset includes details of the films (title, year, genre and so on) this information is not used at all by the model, which only knows users and films by their IDs and the only link is the strength of recommendations.
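How a model can recommend films knowing nothing but IDs and ratings is easier to see in code. The sketch below is a small matrix-factorisation routine in plain Python, trained by stochastic gradient descent. It is my own illustration of the general technique, not the TF-recomm code, and the function names, hyperparameters and toy ratings are all assumptions.

```python
import random


def train_factors(ratings, n_factors=2, epochs=500, lr=0.02, reg=0.02, seed=0):
    """Learn a small vector per user ID and per film ID from rating rows."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    films = sorted({f for _, f, _ in ratings})
    p = {u: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for u in users}
    q = {f: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for f in films}
    mean = sum(r for _, _, r in ratings) / len(ratings)
    for _ in range(epochs):
        for u, f, r in ratings:
            pred = mean + sum(a * b for a, b in zip(p[u], q[f]))
            err = r - pred
            for k in range(n_factors):
                pu, qf = p[u][k], q[f][k]
                # Nudge both vectors to shrink the error, with a small penalty
                # (reg) that stops the values growing without limit.
                p[u][k] += lr * (err * qf - reg * pu)
                q[f][k] += lr * (err * pu - reg * qf)
    return mean, p, q


def predict(model, user, film):
    """Predicted rating: global mean plus the dot product of the two vectors."""
    mean, p, q = model
    return mean + sum(a * b for a, b in zip(p[user], q[film]))


def rmse(model, ratings):
    """Root mean squared error of the model over a set of rating rows."""
    total = sum((r - predict(model, u, f)) ** 2 for u, f, r in ratings)
    return (total / len(ratings)) ** 0.5


# Toy rows of (user ID, film ID, rating); no titles or genres anywhere.
ratings = [
    (1, 10, 5), (1, 11, 4), (1, 12, 1),
    (2, 10, 4), (2, 11, 5), (2, 12, 1), (2, 13, 2),
    (3, 10, 1), (3, 12, 5), (3, 13, 4),
]
model = train_factors(ratings)
print(rmse(model, ratings))
print(predict(model, 3, 11))  # a film user 3 has not rated
```

The model never sees a title or a genre. Any sense that two films are alike comes purely from the same users rating them in similar ways.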
Running the original code was easy: I packaged it into a Docker container for easy deployment on any platform and was able to train the model simply by running the svd_train.py file. The original code trains a model but does not use it for inference.
I built a socket interface on to the code that will read a user ID and provide a list of recommended movies. My version, cloned from the original, is available here.
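For readers wondering what such a socket interface might look like, here is a minimal sketch using Python's standard `socketserver` module. It is not my actual code: the `SCORES` table is a hypothetical stand-in for a trained model, and the film names, the `recommend` helper and the one-ID-per-line protocol are assumptions made for illustration.

```python
import socketserver

# Hypothetical stand-in for a trained model: predicted scores per user ID.
SCORES = {
    1: {"Film A": 4.8, "Film B": 2.1, "Film C": 3.9},
    2: {"Film A": 1.5, "Film B": 4.4, "Film C": 3.0},
}


def recommend(user_id, top_n=2):
    """Return the top_n highest-scoring film titles for a user, best first."""
    scores = SCORES.get(user_id, {})
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


class RecommendHandler(socketserver.StreamRequestHandler):
    """Read one user ID per line and reply with a comma-separated list."""

    def handle(self):
        for line in self.rfile:
            try:
                user_id = int(line.strip())
            except ValueError:
                self.wfile.write(b"error: expected a numeric user ID\n")
                continue
            reply = ",".join(recommend(user_id))
            self.wfile.write(reply.encode("utf-8") + b"\n")


def make_server(host="127.0.0.1", port=0):
    """Create the server; port 0 asks the OS for any free port."""
    return socketserver.TCPServer((host, port), RecommendHandler)


print(recommend(1))  # ['Film A', 'Film C']
```

Calling `make_server(port=9999).serve_forever()` (the port number is arbitrary) would leave the service listening, and sending the line `1` to it from any TCP client should bring back `Film A,Film C`.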
Sharp-eyed readers will notice a number of problems with my code. First, if you want to run the inference engine you will need to retrain the model. Once a model is trained, it should be saved and then reloaded in an inference engine; this shouldn't be hard to accomplish but, as with all machine learning, it appears to be a black art. Second, this version of the code packages only the small version of the dataset which, even so, takes tens of minutes to train on my reasonably powered laptop. Larger datasets are available from MovieLens, but I've not been able to get them to read into memory on any machine.
One option may be to use other techniques for reading data into the model, but taking the existing code base and extending it is problematic without a deep dive into the documentation, and that will take some time.
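One such technique, sketched here in plain Python rather than with TensorFlow's own input machinery, is to stream the ratings file in small batches instead of loading it whole. The `rating_batches` function is hypothetical, and the tab delimiter and three-column layout are assumptions, since the file format varies between datasets.

```python
import io
from itertools import islice


def rating_batches(lines, batch_size, delimiter="\t"):
    """Yield lists of (user_id, film_id, rating), one batch at a time.

    `lines` can be any iterable of text lines, such as an open file, so only
    one batch needs to be held in memory at once.
    """
    def parse(line):
        fields = line.strip().split(delimiter)
        return int(fields[0]), int(fields[1]), float(fields[2])

    rows = (parse(line) for line in lines if line.strip())
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        yield batch


# An in-memory stand-in for a ratings file; a real run would pass open(path).
sample = io.StringIO("1\t10\t5\n1\t11\t4\n2\t10\t3\n")
for batch in rating_batches(sample, batch_size=2):
    print(batch)
```

Each batch could then be fed to a training step in turn, so memory use depends on the batch size rather than the size of the file.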
Differences in the datasets from MovieLens have also caused a number of problems, with at least one dataset crashing the learning phase of the model. Not only that, but the problem flagged its presence using an error message that adds nothing to the cause of understanding. Another dataset, meanwhile, terminated the program with a simple "killed" message, and no traceability!
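One inexpensive defence against opaque failures like these is to validate each row before it gets anywhere near the model. The sketch below is my own suggestion in plain Python, not part of TensorFlow or TF-recomm. The `parse_rating` function is hypothetical, and the tab delimiter and three-column layout are assumptions.

```python
def parse_rating(line, line_no, delimiter="\t"):
    """Parse one 'user, film, rating' row, or fail saying where and why."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) < 3:
        raise ValueError(
            f"line {line_no}: expected at least 3 fields separated by "
            f"{delimiter!r}, got {len(fields)}: {line!r}"
        )
    try:
        return int(fields[0]), int(fields[1]), float(fields[2])
    except ValueError as exc:
        raise ValueError(
            f"line {line_no}: non-numeric field in {line!r}"
        ) from exc


print(parse_rating("1\t10\t4.5\n", line_no=1))  # (1, 10, 4.5)

# A row from a file that uses a different separator now fails loudly.
try:
    parse_rating("1::10::4.5\n", line_no=7)
except ValueError as err:
    print(err)
```

A message that names the line number and shows the offending row turns a mysterious crash in the learning phase into a two-minute fix.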
Simply making errors more readable and easy to trace would assist programmers just getting started. Saving and reading trained models proved to be problematic, but at least I could see them on disk. Attempts to read them back in, however, have proved fruitless.
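The round trip I was after looks something like the following in outline. This is a generic sketch that stores learned parameters as JSON using only the Python standard library. It does not use TensorFlow's own checkpoint mechanism, and the `save_model`, `load_model` and `predict` functions and the parameter values are hypothetical.

```python
import json
import os
import tempfile


def save_model(model, path):
    """Write the learned parameters to disk as JSON."""
    with open(path, "w") as fh:
        json.dump(model, fh)


def load_model(path):
    """Read the parameters back, restoring the integer IDs JSON stringifies."""
    with open(path) as fh:
        raw = json.load(fh)
    return {
        "mean": raw["mean"],
        "users": {int(k): v for k, v in raw["users"].items()},
        "films": {int(k): v for k, v in raw["films"].items()},
    }


def predict(model, user_id, film_id):
    """Predicted rating: global mean plus the dot product of the two vectors."""
    u, f = model["users"][user_id], model["films"][film_id]
    return model["mean"] + sum(a * b for a, b in zip(u, f))


# Hypothetical parameters standing in for the output of a training run.
trained = {
    "mean": 3.5,
    "users": {1: [0.5, -1.0]},
    "films": {10: [1.0, 0.5], 11: [-1.0, 1.0]},
}

path = os.path.join(tempfile.mkdtemp(), "model.json")
save_model(trained, path)      # the last step of the training program
restored = load_model(path)    # the first step of the inference program
print(predict(restored, 1, 11))  # 3.5 + (0.5 * -1.0) + (-1.0 * 1.0) = 2.0
```

The point is the shape of the workflow: train once, write the parameters out, and let a separate, much lighter program read them back to answer queries.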
TensorFlow has much promise. Google's original paper states it has been used in machine vision, speech recognition, playing Go and in language representation.
What really sets TensorFlow apart, though, is its graph model that lets end users build ML models out of sets of well-known modelling techniques. In addition, because the inference engine can be run on devices as small as mobile phones, ML could appear on devices everywhere. Yes, even your fridge may one day tap into Google Brain.
There's a way to go between now and then, though. Using TensorFlow effectively still requires a deep understanding of ML techniques and how to tie them together into a meaningful dataflow graph. I found the TensorFlow learning curve to be steep, and that's without the curve associated with the ML algorithms themselves. ®