Cloud giants 'ran out' of fast GPUs for AI boffins

Capacity droughts hit just before conference paper deadlines, say researchers

Top cloud providers struggled to provide enough GPUs on-demand last week, AI experts complained to The Register.

As a deadline for research papers loomed at a major machine-learning conference, teams around the globe scrambled to rent cloud-hosted accelerators to run final experiments and complete their work in time to submit their studies for inclusion in the event.

That, we're told, sparked a temporary shortage in available GPU capacity.

Graphics processors are well suited to machine learning, as they can perform the vector calculations neural networks need massively in parallel, far faster than general-purpose CPUs.
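To see why, here's a minimal sketch, assuming PyTorch is installed and a CUDA-capable GPU is present, that times the same large matrix multiply on the CPU and then on the GPU:

```python
# Minimal sketch: time one large matrix multiply on CPU, then on GPU.
# Assumes PyTorch is installed and an Nvidia CUDA-capable GPU is present.
import time
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

start = time.time()
torch.matmul(a, b)                      # runs on the CPU
print(f"CPU: {time.time() - start:.3f}s")

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()   # copy the matrices into GPU memory
    torch.matmul(a_gpu, b_gpu)          # warm-up: pays the one-off CUDA setup cost
    torch.cuda.synchronize()            # wait for the warm-up to finish
    start = time.time()
    torch.matmul(a_gpu, b_gpu)
    torch.cuda.synchronize()            # wait for the GPU kernel to complete
    print(f"GPU: {time.time() - start:.3f}s")
```

Training a neural network boils down to millions of such multiplies, hence the scramble for accelerators.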

Using more GPUs means shorter training times and faster results, something wide-eyed researchers desperately craved as they stayed up into the early hours racing to submit their latest efforts to the Conference and Workshop on Neural Information Processing Systems (NIPS) by 1pm Pacific Time on Friday. NIPS is the biggest academic AI conference, drawing over 5,000 attendees.

Conference deadlines tend to put a strain on cloud GPU availability, researchers told The Register, some speaking on condition of anonymity. The run-up to this year's NIPS was particularly bad, we gather, and as AI development ramps up, hosting providers must scale to meet demand. It's something to bear in mind the next time a cloud giant boasts about new features and hardware acceleration on its platform: this tech is not always immediately available to everyone.

It's claimed a drought of GPUs hit Google Cloud and Microsoft's Azure in particular as the pair strained to keep up with demand ahead of the NIPS deadline.

One researcher told The Register his team managed to grab one of the last remaining Nvidia DGX-1 boxes, with eight Tesla P100 GPUs, from Nimbix, a smaller rival cloud platform, after failing to secure the resources needed at the larger cloud players.

A source familiar with Amazon's operations claimed AWS had no capacity issues; others speculated, though, that this may have been because AWS's spot pricing put off pressed-for-time AI teams and anyone else needing a stable service.

AWS offers "spot instances," in which customers bid on spare capacity, with prices fluctuating according to demand. It can be a cheaper option than on-demand pricing, where users are billed by the hour at a fixed rate.

A bid only works if it exceeds the current spot price for the CPUs or GPUs wanted. If capacity runs low and the spot price climbs above a user's maximum bid, the instance is reclaimed: users are notified two minutes before their session is terminated that they are being kicked off.
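For illustration, here's a minimal Python sketch of that flow, assuming the boto3 and requests libraries; the bid price, region, and AMI ID are placeholders rather than values from this story:

```python
# Minimal sketch: bid for a GPU spot instance and check for the
# two-minute termination warning. Bid, region, and AMI are placeholders.
import boto3
import requests

ec2 = boto3.client("ec2", region_name="us-east-1")

# The request is only fulfilled while the current spot price stays
# at or below the maximum price offered here.
resp = ec2.request_spot_instances(
    SpotPrice="14.40",                  # hypothetical maximum bid, $/hour
    InstanceCount=1,
    Type="one-time",
    LaunchSpecification={
        "ImageId": "ami-xxxxxxxx",      # placeholder machine image
        "InstanceType": "p2.16xlarge",  # 16-GPU instance type
    },
)
print(resp["SpotInstanceRequests"][0]["SpotInstanceRequestId"])

# From inside the instance itself, the termination warning can be polled
# via the EC2 metadata service; a 404 means no termination is pending.
r = requests.get(
    "http://169.254.169.254/latest/meta-data/spot/termination-time",
    timeout=2,
)
if r.status_code == 200:
    print("Instance will be reclaimed at", r.text)  # checkpoint the model now
```

A training job running on such an instance needs to checkpoint regularly, since that two-minute warning is the only notice it gets before the capacity vanishes.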

A graph uploaded by Reza Zadeh, CEO of machine-learning startup Matroid and a professor at Stanford, shows that two days before the NIPS deadline, the price to rent a p2.16xlarge instance, with 16 GPUs, hit a whopping $144 per hour: the maximum possible price. Considering that models often take days to train, AI research is not cheap; a three-day run at that rate would cost more than $10,000.

Adam Gibson, cofounder and CTO of Skymind, a startup geared towards enterprises implementing AI on a large scale, said: “A lot of companies can’t keep up with the GPU demand. Cloud vendors often have data centers in regions. Most data centers don’t have enough GPUs per region. Azure is particularly bad at this.”

Spot instances aren't ideal for research, Gibson added, as "they are inconsistent and jobs can die at any time." Google and Azure offer similar spot-style pricing tiers.

Meanwhile, Xavier Amatriain, VP of engineering at question-and-answer site Quora, said on Twitter that a Google engineer had been struggling to nab GPUs due to the major influx of requests.

Google's cloud offers Nvidia Tesla K80 GPUs; Nvidia P100 and AMD chips are coming soon. Google's pricing calculator is broken right now for GPU instances, but we understand GPUs cost from $0.70 an hour apiece, bolted onto a generic compute instance. A bog-standard instance with 16 vCPU cores and 60GB of RAM starts from about $0.65 an hour.

Azure's GPU instances start at one K80, 6 vCPU cores, and 56GB of RAM for $0.90 an hour. Amazon's K80 instances start from $0.90 an hour for one GPU, 4 vCPU cores, and 61GB of RAM.
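To put those rates in perspective, here's a back-of-the-envelope sketch using the hourly prices quoted above; the 72-hour run length is our assumption for illustration, not a figure from any of the providers:

```python
# Back-of-the-envelope cost comparison using the hourly rates quoted above.
# The 72-hour (three-day) run length is an assumed figure for illustration.
HOURS = 72

rates = {  # $/hour for entry-level single-K80 instances, as quoted above
    "Google (per-GPU add-on only)": 0.70,
    "Azure (1x K80, 6 vCPUs)": 0.90,
    "AWS on-demand (1x K80, 4 vCPUs)": 0.90,
}

for provider, rate in rates.items():
    print(f"{provider}: ${rate * HOURS:,.2f} for a {HOURS}-hour run")
```

Multiply that by the dozens of experiments behind a single paper, and the appeal of snapping up cheap spot capacity, when it exists, becomes obvious.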

Spokespeople for Microsoft Azure and Amazon Web Services (AWS) declined to comment on the AI researchers' claims. A spokesperson for Google Cloud did not respond to a request for comment. ®
