This article is more than 1 year old

Microsoft Azure to spin up AMD MI200 GPU clusters for 'large scale' AI training

Windows giant carries a PyTorch for chip designer and its rival Nvidia

Microsoft Build: Microsoft Azure on Thursday revealed it will use AMD's top-tier MI200 Instinct GPUs to perform “large-scale” AI training in the cloud.

“Azure will be the first public cloud to deploy clusters of AMD's flagship MI200 GPUs for large-scale AI training,” Microsoft CTO Kevin Scott said during the company’s Build conference this week. “We've already started testing these clusters using some of our own AI workloads with great performance.”

AMD launched its MI200-series GPUs at its Accelerated Datacenter event last fall. The GPUs are based on AMD’s CDNA2 architecture and pack 58 billion transistors and up to 128GB of high-bandwidth memory into a dual-die package.

At launch, AMD’s Forrest Norrod, SVP and GM of datacenter and embedded solutions, claimed the chips were nearly 5X faster than Nvidia’s then top-tier A100 GPU, at least in “highly precise” FP64 calculations. That lead narrowed substantially in more common FP16 workloads, where AMD claimed the chips were about 20 percent faster than the A100 from Nvidia, the dominant datacenter GPU player.
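For a rough sanity check on those ratios, the sketch below divides peak-throughput figures from the vendors' public spec sheets. The numbers are assumptions brought in for illustration and do not come from this article: 47.9 TFLOPS of FP64 vector and 383 TFLOPS of FP16 for the MI250X, against 9.7 TFLOPS of FP64 and 312 TFLOPS of FP16 tensor throughput for the A100. The dictionary and the `speedup` helper are hypothetical names.

```python
# Peak throughput in TFLOPS, taken from vendor spec sheets (assumed here,
# not stated in the article). Peak numbers are theoretical maxima.
PEAK_TFLOPS = {
    "MI250X": {"fp64": 47.9, "fp16": 383.0},
    "A100": {"fp64": 9.7, "fp16": 312.0},
}


def speedup(precision: str, chip: str = "MI250X", baseline: str = "A100") -> float:
    """Return the ratio of peak throughput between chip and baseline."""
    return PEAK_TFLOPS[chip][precision] / PEAK_TFLOPS[baseline][precision]


if __name__ == "__main__":
    for precision in ("fp64", "fp16"):
        print(f"{precision}: {speedup(precision):.2f}x")
```

Peak figures are a ceiling, not a benchmark, so a ratio like this says little about how either chip performs on a real training job.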

However, it remains to be seen if and when Azure instances based on AMD's graphics chips will become generally available, or if Microsoft plans to use them only for internal workloads.

What we do know is Microsoft is working closely with AMD to optimize the chipmaker’s GPUs for PyTorch machine learning workloads.

“We're also deepening our investments in the open-source PyTorch framework, working with the PyTorch core team and AMD both to optimize the performance and developer experience for customers running PyTorch on Azure, and to ensure that developers' PyTorch projects work great on AMD hardware,” Scott said.
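For a sense of what "work great on AMD hardware" means for a developer, the sketch below checks which GPU backend a local PyTorch build targets. It relies on PyTorch's ROCm builds exposing AMD GPUs through the same `torch.cuda` interface used for Nvidia parts, with `torch.version.hip` set on those builds. The `describe_backend` helper is a hypothetical name, and nothing here reflects how Azure's clusters are actually configured.

```python
def describe_backend() -> str:
    """Report which GPU backend, if any, the local PyTorch build targets."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    if not torch.cuda.is_available():
        return "no GPU visible, falling back to CPU"

    name = torch.cuda.get_device_name(0)
    # ROCm builds set torch.version.hip; CUDA builds leave it as None.
    hip_version = getattr(torch.version, "hip", None)
    if hip_version:
        return f"ROCm/HIP {hip_version} on {name}"
    return f"CUDA {torch.version.cuda} on {name}"


if __name__ == "__main__":
    print(describe_backend())
```

Because both vendors sit behind the same device API, model code that moves tensors with `.to("cuda")` generally needs no changes to run on a ROCm build.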

The Register reached out to Microsoft about its plans for AMD’s GPUs. We'll let you know if we hear anything.

The enterprise GPU market heats up

PyTorch development was a central component of Microsoft’s strategic partnership with Meta AI, announced earlier this week.

In addition to working with Microsoft to optimize its infrastructure for PyTorch workloads, the social media giant announced it would deploy “cutting-edge ML training workloads” on a dedicated Azure cluster of 5,400 Nvidia A100 GPUs.

The deployment came just as datacenter became Nvidia’s largest business unit, with $3.75 billion in quarterly revenue, outstripping gaming ($3.62 billion) for the first time.

While Nvidia may still command the bulk of the enterprise GPU market, it’s facing its stiffest competition in years, and it's not just AMD knocking at its door.

Intel’s long-hyped Ponte Vecchio GPUs are expected to roll out later this year alongside the chipmaker’s long-delayed Sapphire Rapids Xeon Scalable processors.

And, earlier this month, Intel debuted its second-generation AI training and inference accelerators, which it claims offer A100-beating performance. ®
