Wow, machine learning, what a snoozefest... less so if you strap a bunch of GPUs to your storage

GPU-boosted system market is, like, literally so hot right now

Wed 28 Mar 2018 // 16:56 UTC

Analysis Machine learning stresses storage because training models means feeding millions, if not billions, of files to a GPU-equipped training system as quickly as possible.

Suppliers are devising converged, hyperconverged and composable systems to sidestep chokepoints and make it simpler to get ML customers up and running.

Recently we have had the Pure Storage and Nvidia AIRI converged system, which brings four Nvidia DGX-1 GPU-enhanced servers to bear on FlashBlade-stored data.

This follows HPE's Apollo 6500 Gen10 and IBM's AC922 supercharged servers.

Now Nvidia has released an updated DGX, the DGX-2. Chinese server firm Inspur and composable infrastructure supplier Liqid have produced a Matrix Rack Composable Platform for machine learning, while X-IO has added GPUs and SQream database software to its Axellio combined server+storage box.

Nvidia DGX-2

The DGX-2 is two DGX-1s plus more CPU, memory, interconnect bandwidth and storage:

	DGX-1	DGX-2	Notes
GPUs	8x V100	16x V100
Interconnect	NVlink	NVlink2 with 12 NVSwitches	216 ports
CPUs	2x 20-core Xeon E5-2698 v4 2.2GHz	2x Xeon Platinum	Faster CPUs
GPU Memory	256GB HBM	512GB
System Memory	512GB DDR4	1.5TB	Triple pooled memory space
Storage	4x 1.92TB SSD – 7.68TB	30-60TB NVMe SSD	4-8x more capacity
Performance	960 TFLOPS	1,920 TFLOPS	Bigger memory pool means larger jobs
CUDA Cores	40,960	81,920
Tensor Cores	5,120	10,240
Weight	134lbs	350lbs	More than 2x
Networking	4x EDR InfiniBand & 2x 10GbitE	8x EDR InfiniBand or 100GbitE
Power	3.5kW	10kW
Price	$149,000	$399,000	More than 2x

The much larger system memory means larger jobs can be run on the DGX-2. They should complete more than twice as fast as on a DGX-1 because of this.
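To put numbers on the "More than 2x" notes, here is a minimal Python sketch that works out the DGX-2 to DGX-1 ratios and the list price per quoted TFLOP. The figures are copied from the table above; the dictionary layout and function names are illustrative assumptions, and nothing here is a measured benchmark.

```python
# Figures copied from the comparison table in this article; nothing is measured.
# The dictionary layout and function names are hypothetical, for illustration.
SPECS = {
    "DGX-1": {"gpus": 8, "tflops": 960, "weight_lb": 134,
              "power_kw": 3.5, "price_usd": 149_000},
    "DGX-2": {"gpus": 16, "tflops": 1_920, "weight_lb": 350,
              "power_kw": 10.0, "price_usd": 399_000},
}


def ratio(field):
    """Return the DGX-2 value divided by the DGX-1 value for one table row."""
    return SPECS["DGX-2"][field] / SPECS["DGX-1"][field]


def usd_per_tflop(system):
    """Return list price divided by the quoted TFLOPS figure."""
    spec = SPECS[system]
    return spec["price_usd"] / spec["tflops"]


for field in ("gpus", "tflops", "weight_lb", "power_kw", "price_usd"):
    print(f"{field:10s} {ratio(field):.2f}x")

for name in SPECS:
    print(f"{name}: ${usd_per_tflop(name):,.0f} per quoted TFLOP")
```

On these list figures the GPU count and quoted TFLOPS double, while weight, power draw and price each grow by more than 2.6x.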

With the DGX-2 being announced so close to the Pure-Nvidia AIRI system, it's clear that Pure and Nvidia decided not to have a DGX-2-based AIRI. However, it's possible that a subsequent AIRI system could be DGX-2-based, and have larger flash drives inside to keep the 16 GPUs occupied. This would be, we suppose, a $2m-plus system which would reduce the number of potential customers.

Inspur and Liqid

Inspur and Liqid have co-developed their Matrix Rack Composable Platform which lets users dynamically set up CPU-GPU-storage combinations composed for specific workloads. Inspur provides the i24 servers and GX4 chassis, Nvidia the Tesla V100 and P100 GPUs, and Liqid the Grid PCIe-based fabric hardware and software.

Start with a set of disaggregated pools of compute, GPU, storage and Ethernet networking resources. Elements from these pools can be combined, clustered, orchestrated and shared over the PCIe fabric.

The pool elements are:

24x Compute Nodes (Dual Intel Xeon Scalable Processors)
144x U.2 Solid-State Drives (SSD), 6.4 TB per SSD (922TB)
24x Network Adapters (NIC), Dual 100 Gb/NIC
48x NVIDIA GPUs (V100 and P100)
Liqid Grid (Managed PCIe Gen 3.0 Fabric) and Liqid Command Center (software)

A maximally configured system might blow the Pure-Nvidia AIRI system away and has three times more V100 GPUs than Nvidia's own DGX-2. The cost of such a fully configured Matrix Rack would be astronomical.
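The pool arithmetic can be checked with a short Python sketch. The counts come from the list above; the even per-node split is purely a hypothetical illustration, since the point of a composable system is that users choose their own combinations, and the variable and function names are assumptions.

```python
# Pool sizes as listed in the article. Names are hypothetical.
POOL = {"compute_nodes": 24, "ssds": 144, "ssd_tb": 6.4, "nics": 24, "gpus": 48}
DGX2_GPUS = 16  # GPU count from the DGX-2 column of the earlier table


def raw_flash_tb(pool):
    """Total raw flash: number of SSDs times capacity per SSD."""
    return pool["ssds"] * pool["ssd_tb"]


def even_share(pool, key):
    """Resources per compute node if the pool were split evenly (an assumption)."""
    return pool[key] / pool["compute_nodes"]


print(f"Raw flash: {raw_flash_tb(POOL):.1f} TB")
print(f"SSDs per node (even split): {even_share(POOL, 'ssds'):.0f}")
print(f"GPUs per node (even split): {even_share(POOL, 'gpus'):.0f}")
print(f"GPU count versus one DGX-2: {POOL['gpus'] / DGX2_GPUS:.0f}x")
```

The 3x figure only holds for V100s if all 48 GPU slots are filled with V100s rather than a V100 and P100 mix.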

Dolly Wu, GM and VP at Inspur Systems, said: "AI and deep learning applications will determine the direction of next-generation infrastructure design, and we believe dynamically composing GPUs will be central to these emerging platforms."

We might expect the other composable server system suppliers to add GPUs to their disaggregated pools too, meaning Attala Systems, HPE with Synergy, DriveScale and Intel with its RackScale product.

X-IO, SQream and Nvidia

Back on the more affordable side of planet Earth we have X-IO's Axellio edge compute+storage product receiving an Nvidia GPU implant and SQream database software to deliver a "converged appliance for extremely rapid data analytics of massive datasets".

What SQream has done with its DBMS software is to take repetitive low-level SQL query operations and run them on a server GPU accelerator. The company says complex queries contain multiple filters, type conversions, complex predicates, exotic join semantics, and subqueries. When these are run on 100TB-level datasets, with billions of rows in several tables, they can take several minutes to hours to complete (query latency).

SQream says it can provide a 20x speedup of queries on columnar datasets, and query large and complex data up to 100x faster than other relational databases. Its latency on complex queries against 100TB-level datasets is, it claims, in seconds-to-minutes territory.

Its ingest speed is up to 2TB/hour.

This enables a large-scale reduction in the servers needed to run SQL queries on large data sets; SQream claims a single 2U server plus GPU is equivalent to a 42U rack full of servers. Basically, SQream says: use our relational database to get screaming SQL performance.

Then X-IO says run it on our hardware and go faster still.

The server/storage base is X-IO's Axellio Edge Micro-Datacenter appliance: a 2U box containing two Xeon server modules with two Xeons apiece, 2x Tesla P100 GPUs, a PCIe fabric, and one to six FlashPacs, each holding up to 12 dual-port NVMe SSDs (800, 1,600, 3,200 or 6,400GB), for a maximum capacity of 500TB.
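As a rough check on the capacity figure, the FlashPac numbers above can be multiplied out in Python. This is a sketch under stated assumptions: decimal units (1TB = 1,000GB), every slot filled with the same drive size, and hypothetical constant and function names.

```python
# Counts and drive sizes as quoted in the article. Names are hypothetical.
FLASHPACS_MAX = 6
SSDS_PER_FLASHPAC = 12
SSD_SIZES_GB = (800, 1_600, 3_200, 6_400)


def raw_capacity_tb(flashpacs, ssd_gb, ssds_per_pac=SSDS_PER_FLASHPAC):
    """Raw capacity in decimal TB (1TB = 1,000GB is an assumption)."""
    return flashpacs * ssds_per_pac * ssd_gb / 1000


for size_gb in SSD_SIZES_GB:
    full = raw_capacity_tb(FLASHPACS_MAX, size_gb)
    print(f"{size_gb:>5}GB drives, {FLASHPACS_MAX} FlashPacs: {full:.1f} TB raw")
```

With the largest listed drive this comes to about 461TB raw, in the same region as the quoted 500TB maximum.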

SQream and X-IO claim a two-node example of their combined system can push data from storage to the GPU at up to 3.2GB/sec per GPU. Their combined system can reach 11.5TB/hour in an analytics run.
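The throughput claims mix GB/sec and TB/hour, so a small Python conversion sketch helps compare them. It assumes decimal units (1TB = 1,000GB) and sustained rates; the function names are illustrative, and the full-scan estimate ignores any data skipping a columnar database may do.

```python
GB_PER_TB = 1000          # decimal units, an assumption
SECONDS_PER_HOUR = 3600


def gb_per_sec_to_tb_per_hour(gb_per_sec):
    """Convert a sustained GB/sec rate into TB/hour."""
    return gb_per_sec * SECONDS_PER_HOUR / GB_PER_TB


def tb_per_hour_to_gb_per_sec(tb_per_hour):
    """Convert a sustained TB/hour rate into GB/sec."""
    return tb_per_hour * GB_PER_TB / SECONDS_PER_HOUR


def hours_to_scan(dataset_tb, tb_per_hour):
    """Hours to read a whole dataset once at a sustained rate."""
    return dataset_tb / tb_per_hour


print(f"3.2 GB/sec   = {gb_per_sec_to_tb_per_hour(3.2):.2f} TB/hour")
print(f"2 TB/hour    = {tb_per_hour_to_gb_per_sec(2.0):.2f} GB/sec")
print(f"100 TB at 11.5 TB/hour = {hours_to_scan(100, 11.5):.1f} hours")
```

A single 3.2GB/sec stream works out to roughly 11.5TB/hour, which lines up with the quoted analytics rate, although the vendors do not say that is how the figure was derived.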

They say users can get real-time answers to queries that took minutes before, or expand their query windows from weeks to years to find trends, query trillions of rows of data and get results faster.

X-IO might also be looking at the machine learning space. In theory it would be easy enough to climb into bed with a machine learning framework software supplier. Just another partnership, right?

Salivating

Machine learning is seen as a hot growth market. Combine that with on-premises NVMe flash storage and big data analytics applications, and the result is hot boxes galore.

We must surely expect Dell EMC and NetApp to enter the GPU-boosted system market, not to mention Huawei and Lenovo. Other all-flash array vendors, such as Kaminario, Tintri and WDC Tegile, might look at the Pure-Nvidia deal and think "me too".

The performance gains over non-GPU systems are so impressive that profit margins can be set high enough to get on-commission sales reps salivating like crazy. This GPU-accelerated server/storage product development space is going to see frenzied development as suppliers pile in to take advantage of the growth prospects. ®
