The hidden horsepower driving Machine Learning models
When owning tin is a cheap(er) alternative to expensive cloud
Machine Learning is becoming the only real available method to perform many modern computational tasks in near real time. Machine Vision, speech recognition and natural language processing have all proved difficult to crack without ML techniques.
When it comes to hardware, the tasks themselves do not need a great deal of computational power; but training the machine does – not to mention an awful lot of data. In the machine learning world, the more data you have, the more accurate your ML model can be. Of course the more data you have the longer the training process will take.
Teaching a machine to fish
Take for example the Imagenet 2012 dataset, a set of test images that can be used for evaluating machine learning methods. This data set is a mere 138GB that I’m told will take one day to download to my machine. Once I have the data, it can be fed into the training phase of a machine learning algorithm - typically an iterative process where the model is trained with data, tested, the parameters tweaked and then the model is trained again.
Each iteration is termed an "epoch" in machine learning terms. For an idea of how long this can take, let’s turn to the Tensorflow program I built for learning movie recommendations. This will typically learn fairly good movie recommendations in 100 epochs. Development was carried out on a dataset of 100,000 ratings, and this can realistically be carried out on a MacBook Pro in less than an hour. Once you move up to the full 24 million ratings, however, the training time per epoch moves to more than 140 seconds, and training will run for around nine days on the MacBook Pro. This really isn’t viable.
There's a, er, cloud on the horizon...
Something needs to be done. Maybe we could move this problem into the cloud and let the big boys with their big machines take over. The problem is moving your data into the cloud. For universities and the likes of Google, this isn’t really a problem, providing you’ve got access to end-to-end fast networks. Universities in Britain are all connected over the Janet network, whose backbone runs at 100Gbps, more than enough to shift large datasets around. Google, of course, has its own dark net, but what if we want to move data out of our walled garden and onto a public cloud ML system?
This was just the problem we faced a few years back at Dundee University when trying to use Microsoft’s Azure to process mass spectrometer data. These files were fairly big - a few gigabytes in size - but we were hoping to process lots of them in near real time. Sadly, we just couldn’t get the data into the cloud fast enough.
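To put rough numbers on the problem, here is a back-of-the-envelope sketch in Python using two figures quoted above: the 138GB Imagenet download that takes a day, and Janet's 100Gbps backbone. The function names are invented for illustration, and the sums assume decimal gigabytes and an ideal link with no protocol overhead or contention, so treat the results as best cases rather than measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60
IMAGENET_GB = 138          # dataset size quoted in the article
JANET_BACKBONE_GBPS = 100  # backbone speed quoted in the article


def transfer_seconds(size_gb: float, link_gbps: float) -> float:
    """Ideal time to move size_gb (decimal gigabytes) over a link_gbps link."""
    size_bits = size_gb * 1e9 * 8
    return size_bits / (link_gbps * 1e9)


def implied_mbps(size_gb: float, seconds: float) -> float:
    """Sustained throughput, in megabits per second, implied by a transfer time."""
    size_bits = size_gb * 1e9 * 8
    return size_bits / seconds / 1e6


print(f"One-day download implies {implied_mbps(IMAGENET_GB, SECONDS_PER_DAY):.1f} Mbps")
print(f"Ideal 100Gbps transfer takes {transfer_seconds(IMAGENET_GB, JANET_BACKBONE_GBPS):.1f} s")
```

On these assumptions the one-day download works out at a little under 13Mbps of sustained throughput, while the same dataset would cross a 100Gbps backbone in about 11 seconds. The gap between those two figures is the whole problem in miniature.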
This is why Amazon released Snowball, essentially a box of hard disks that is delivered to your door, which you fill with data (at 10Gbps) and return to be loaded into AWS. It is remarkably cheap - around $200 per job - with 10 days to fill the disks with your data, but it is not real-time.
It is for this reason that companies are starting to offer hardware that can be situated close to the data production (in terms of network speed) for machine learning. These appliances employ GPUs to speed up the maths needed for machine learning. Essentially the mathematics performed to produce the fast action in a video game is similar to that done in machine learning (and other fields); you just want to do lots of sums in parallel on fairly small chunks of data.
There are a number of manufacturers of GPU cards – Nvidia, AMD and Intel’s Xeon – but it is only really Nvidia that has grasped the ML nettle and made it easy to do deep learning on its cards.
So what sort of speed-up can you get? I used an Nvidia Titan X and set it up to run the film recommendation engine with the large dataset. The training time per epoch fell from 140 seconds to around 11, making training on this large dataset a realistic proposition. A Titan GPU card will set you back around £900, plus the cost of the machine to run it in. You will still have the problem of getting the data onto the machine, and will need to find a way to stream the data through the card efficiently.
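One common way to keep a fast card fed is to prepare the next batch in the background while the current one is being processed. The sketch below shows that idea in plain Python with a worker thread and a bounded queue. It is an illustration only, not the code used for the recommendation engine, and the helper names `batches` and `prefetch` are hypothetical.

```python
import queue
import threading
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the stream


def batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group items into lists of batch_size; the last batch may be shorter."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def prefetch(source: Iterator[T], depth: int = 2) -> Iterator[T]:
    """Yield items from source while a worker thread reads up to depth ahead."""
    buffer: queue.Queue = queue.Queue(maxsize=depth)

    def worker() -> None:
        try:
            for item in source:
                buffer.put(item)  # blocks while the buffer is full
        finally:
            buffer.put(_DONE)     # always signal the end, even after an error

    # Daemon thread: a sketch-level shortcut so an abandoned stream cannot
    # keep the program alive.
    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        yield item


# Stand-in for real training data: ten "ratings" in batches of four.
for batch in prefetch(batches(range(10), batch_size=4)):
    print(batch)
```

The bounded queue matters: it stops the loader racing ahead and filling memory, while still letting loading overlap with whatever the consumer is doing with the previous batch.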
So if one GPU is good, multiple GPUs must be better. Nvidia certainly thinks so, and it is pushing the DGX-1 for Deep Learning.
The DGX-1 is the only machine of its type around at the moment. Sure, you can build your own machine with five GPU cards, but if you want to get close to its performance you will still need the DGX-1, because its custom bus allows data to be transferred to the GPU cards at impressive speeds.
The DGX-1 I got a shot at was lent to the University of Dundee, where I’m a senior lecturer, for testing by our machine vision group.
It’s an impressive looking machine - and a very loud one, once all its GPU cores are up and running. To get an idea of its speed, a researcher loaded up the Imagenet 2012 dataset and trained a Resnet50 machine learning model on it. With one GPU the machine could process 210 images a second, with two GPUs that was 404 and with five GPUs that was 934 images per second.
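Those three figures are enough to work out how well the training scales. The short Python sketch below compares each result with perfect linear scaling from the single-GPU rate. The throughput numbers are the ones quoted above; the function name `scaling` is just an illustrative choice.

```python
from typing import Dict, Tuple

# Images per second reported for Resnet50 on Imagenet 2012, keyed by GPU count.
THROUGHPUT = {1: 210, 2: 404, 5: 934}


def scaling(throughput: Dict[int, float]) -> Dict[int, Tuple[float, float]]:
    """Return (speed-up, efficiency) per GPU count, relative to one GPU.

    Efficiency is the fraction of ideal linear scaling actually achieved.
    """
    base = throughput[1]
    return {
        gpus: (rate / base, rate / (base * gpus))
        for gpus, rate in throughput.items()
    }


for gpus, (speedup, efficiency) in scaling(THROUGHPUT).items():
    print(f"{gpus} GPU(s): {speedup:.2f}x speed-up, {efficiency:.0%} of linear")
```

On these numbers, two GPUs deliver about 96 per cent of linear scaling and five deliver about 89 per cent, so adding cards does pay off, though with gently diminishing returns.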
All plain sailing? Not quite
This kind of speed comes at a price, of course: the machine isn’t cheap, and you should expect to pay more than £100,000 for a system. For best performance you’ll also need to use Nvidia’s customised version of Google’s Tensorflow, which could mean a delay in getting at the latest features from the original open source well of Tensorflow code. In addition, trying to load the images from the dataloader will cause severe CPU bottlenecks; the data set needs to be optimised – a task that, in itself, can take several hours.
There’s one other problem that could deter some: every machine learning job on the DGX-1 needs to be packaged into a Docker container - not a problem for devops engineers, but the learning curve can be off-putting to researchers who aren’t. If we return to the movie recommendation engine, I was hoping to run this on the DGX-1 to get an idea of how that could be sped up. Sadly it’s not that simple. I’ll admit, making this code run in parallel was beyond me in the short time I had on the system. Others, too, have reported problems in taking code and making it run across multiple GPUs, with some reporting no increase in speed either.
Power at a price
The DGX-1 is an impressive machine delivering a lot of raw processing power, but at its price, it won’t be for everyone. Certainly, you can get a lot of poke for your money using just a single GPU card and that might be enough to get your ML model up and running, providing you’ve got the time to wait for the training to complete. On the upside, though, you won’t need one of these machines to run the model once it’s been trained; you might only need your mobile phone and a copy of Tensorflow for Mobile or Apple’s Core ML.
The market for ML appliances is in its infancy: impressive machines, lots of raw power but at a prohibitive price. You probably won’t even need one of these machines once your model has been trained, so to really get your money’s worth, you need to build and test a lot of models at present.
If – and it’s a big “if” - OEMs get on board, I’d expect a greater number of machines to show up using multiple GPU cards optimised for ML libraries. What will happen is difficult to predict; over in the world of relational databases the market for powerful, hardware-and-software optimised appliances, while attractive, never really broke out into the mass market.
Instead, they remained relatively expensive devices for the few with demanding data needs. That was a time before the HPC expansion and the availability of raw compute online in the cloud. ®