If you've got $1m+ to blow on AI, meet Pure, Nvidia's AIRI fairy: A hyperconverged beast

0.5 PFLOPS FP32, 0.5 PB of effective flash storage


Pure Storage and Nvidia have produced a converged machine-learning system to train AI models using millions of data points.

It’s called AIRI – AI-Ready Infrastructure – and combines a Pure FlashBlade all-flash array with four Nvidia DGX-1 GPU-accelerated boxes and a pair of 100GbE switches from Arista.

The system has been designed by Pure and Nvidia, and is said to be easier and simpler to buy, deploy, and operate than buying and integrating the components separately; the standard converged infrastructure pitch.

AIRI's rack is meant to be an object of desire in your data centre.

FlashBlade is Pure Storage’s all-solid-state-storage array for fast access to unstructured data. It is a 4U box containing 15 vertically-mounted blades or object node servers. Each blade has a multi-core Intel Xeon CPU and 17TB of flash, totaling 255TB overall or around 523TB effective capacity after data reduction. Each blade also includes a pair of ARM cores, an FPGA, NVRAM, and PCIe-based networking. It is definitely not a commodity SSD-based system.
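The capacity figures imply a data-reduction ratio of roughly 2:1; a quick sketch of the arithmetic, using only the numbers quoted above:

```python
# FlashBlade capacity arithmetic from the quoted specs.
blades = 15
flash_per_blade_tb = 17

raw_tb = blades * flash_per_blade_tb      # 255 TB raw across the 4U chassis
effective_tb = 523                        # quoted figure after data reduction
reduction_ratio = effective_tb / raw_tb   # implied ratio, roughly 2.05:1

print(raw_tb, round(reduction_ratio, 2))
```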

This is a powerful parallel-access flash array, and, in the AIRI, it has to feed more than 160,000 GPU cores.

It supplies data to the four DGX-1 systems, which are Intel Xeon-based servers with eight Tesla V100 GPUs, the graphics chips interlinked with NVLink. These are seriously powerful GPUs, each with 5,120 CUDA cores, 640 Tensor cores, a 16MB cache and a 16GB HBM2 memory bank with a bandwidth of 900GB/sec. A V100 maxes out at 7.5TFLOPS using 64-bit floating-point math and 15TFLOPS using 32-bit.

With its 32 V100s, an AIRI has 163,840 CUDA cores and 20,480 Tensor cores at its disposal. We were told it provides four Tensor PFLOPS. One caveat: machine-learning jobs have to run in containers.
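The aggregate figures fall straight out of the per-GPU specs; the sketch below assumes Nvidia's quoted 125 Tensor TFLOPS per V100, which is how four DGX-1s get to four Tensor PFLOPS:

```python
# Aggregate GPU resources across the four DGX-1s in an AIRI rack.
v100s = 4 * 8                        # four DGX-1s, eight V100s each
cuda_cores = v100s * 5_120           # 163,840 CUDA cores
tensor_cores = v100s * 640           # 20,480 Tensor cores
tensor_pflops = v100s * 125 / 1000   # 125 Tensor TFLOPS/V100 -> 4 PFLOPS

print(cuda_cores, tensor_cores, tensor_pflops)
```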

To connect to the outside world, there are a couple of Arista 100GbE switches supporting GPUDirect RDMA. This enables a direct and fast path for data transmission between the GPUs and FlashBlades using PCIe features. The interface between FlashBlade and the DGX-1s is file-based: NFS.

The specific Arista products used weren’t revealed.

There are two included software items: Nvidia's GPU Cloud (NGC) deep-learning software stack and Pure's AIRI Scaling Toolkit.

Pure said this software should help data scientists get machine-learning projects up and running in hours, not days or weeks.

AIRI will compete with HPE’s Apollo 6500 gen-10 system and GPU-enhanced IBM AC922 servers in the machine-learning system stakes. AIRI looks much more powerful.

HPE’s Apollo 6500 gen-10 can crunch up to 125 TFLOPS using single-precision floating-point math. A single V100 manages about 15 TFLOPS on that measure, meaning one DGX-1 will perform up to roughly 120 TFLOPS in comparison, and an AIRI with four of them 480 TFLOPS. It’s in a different league.
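The comparison can be reproduced in a couple of lines, using the per-V100 single-precision figure quoted above:

```python
# Single-precision (FP32) comparison implied by the quoted figures.
v100_fp32_tflops = 15
dgx1_tflops = 8 * v100_fp32_tflops   # ~120 TFLOPS per DGX-1
airi_tflops = 4 * dgx1_tflops        # ~480 TFLOPS across the rack
apollo_6500_tflops = 125             # HPE's quoted FP32 peak

print(airi_tflops / apollo_6500_tflops)  # AIRI is nearly 4x the Apollo
```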

IBM’s AC922 supports up to six Tesla V100 GPUs, two fewer than a single DGX-1, and AIRI has four DGX-1s. Again, it’s in a different league from the IBM system.

Pure said multiple training jobs can run in parallel and complete faster on AIRI than with other systems, with run time cut to a quarter of what it would otherwise be.

What about the price? Pure wouldn’t provide one, saying it was down to the channel supplier, but would reflect the cost of the components.

Back-of-the-envelope math says four DGX-1s will cost around $600,000. The Arista switches cost, say, $3,000-plus apiece, while FlashBlade costs under $1 per effective GB, meaning, with its 523TB of effective capacity, it will come in at less than $523,000.
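That estimate, totted up as a sketch; the per-unit figures are the article's rough guesses, not list prices:

```python
# Back-of-the-envelope AIRI pricing using the figures in the article.
dgx1_cost = 600_000 / 4      # ~$150k apiece, four units
switch_cost = 3_000          # per Arista switch, two of them
flashblade_max = 523_000     # < $1 per effective GB x 523 TB

total = 4 * dgx1_cost + 2 * switch_cost + flashblade_max
print(total)                 # ~$1.13m before integration and support
```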

We are looking at a million-bucks-plus system here – a large enterprise or specialized customer purchase. The system as introduced supports four DGX-1s but deployments might start with one to two DGX-1s, which would lessen the upfront cost.

Customers will be organisations that need to run lots of large-scale machine learning jobs to train models on millions of data items.

Each DGX-1 draws 3.2kW of power, so you're looking at near enough 13kW before factoring in storage and networking and the Intel compute silicon.
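The power arithmetic, for the GPU boxes alone:

```python
# Rack power from the quoted per-box draw, DGX-1s only.
dgx1_kw = 3.2
gpu_power_kw = 4 * dgx1_kw   # 12.8 kW before storage and networking

print(gpu_power_kw)
```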

AIRI is available now through selected reseller partners, such as ePlus Technology, FusionStorm, GroupWare Technology, PNY, Trace3, World Wide Technology and Xenon. ®
