Server makers leap on GPU bandwagon
HPC visions dance in their heads
GTC Server makers are swarming the GPU Technology Conference put on by graphics card and GPU co-processor maker Nvidia in San Jose this week. They smell money – HPC money in its many flop-happy variants, including traditional HPC simulation as well as electronic design automation, data analytics, financial modeling, machine vision, and digital media rendering. Many applications are only now getting GPU extensions, and the appropriate iron now has to be brought into the field to run them.
The fanless M2050 and M2070 GPUs announced in June were an important piece of the hybrid CPU-GPU puzzle, since server makers need a smaller GPU and want to use the server's own cooling fans to keep the GPU from frying. (The GPU cards with fans on them are too fat and can't be packed into pizza box servers easily.) It is no wonder that, now that the M series GPUs are out from Nvidia - as well as their FireStream 9350 and 9370 fanless GPU co-processor counterparts from Advanced Micro Devices, also announced in June - server makers are finally weaving them into their hard wares.
Supercomputer maker Cray said at the GPU Tech Conference that it would be creating a variant of its blade servers in the new XE6 massively parallel supers that would allow Tesla GPU co-processors from Nvidia to be mixed in with the x64 processors and offload floating point work.
Thanks to its OctigaBay acquisition back in February 2004, Cray has engineering expertise in weaving field programmable gate array (FPGA) and other types of accelerators into massively parallel x64 machines. Speaking to El Reg, Barry Bolding, vice president of products at Cray, said that Cray was "very picky" about the components that it puts into its supers, but that the GPU co-processors had matured and that supercomputer customers were saying they were interested in GPU acceleration. No doubt about that, when a petaflops of massively parallel x64 server capacity runs on the order of $45m, based on Cray's own sales of the XE6 systems.
Bolding didn't say much more about exactly how the Tesla 20 GPU co-processors would be put into the XE6 supers, which are composed of eight-socket Opteron blades and the "Gemini" XE interconnect and which made their debut as a complete system in May. He did confirm that the Tesla GPUs would be put on blades, would link to the Opteron blades using PCI-Express links, and would be able to use the Gemini interconnect to share data and work.
Cray is planning to base the blades on the next generation of Tesla GPUs, which are code-named "Kepler" and which are due in 2011. Bolding said that Cray, being a partner of Advanced Micro Devices for CPUs, was looking at the GPUs coming out of AMD and is in discussions to see how they might be used in Cray massively parallel supers as well. Just like Cray has learned to have two sources of x64 chips, it will no doubt want to have two sources of GPU accelerators.
Over at Silicon Graphics, the company announced today that its high-end Altix UV 1000 parallel supers, which use SGI's NUMAlink 5 interconnect to scale to 256 of Intel's Xeon 7500s in a shared memory parallel super, will also be equipped with Tesla 20 GPUs. According to Bill Mannel, vice president of product marketing at SGI, the company will plug in the 1U GPU S2050 chassis, which sports two GPUs in the chassis. The Altix UV blade server has a PCI-Express riser card and the S2050 links to the blades through it.
You can't do a one-for-one pairing of GPUs and CPU sockets on the Altix UVs, however. You can only hook four of these S2050s into each 256-socket Altix box. What matters is that the single memory space of the Altix UV design and the high-speed NUMAlink 5 interconnect mean that an application using GPU co-processors can gather up its data into main memory and feed it directly at very high speed to the GPUs, making them run at a kind of efficiency that Mannel says is not possible in a normal CPU-GPU cluster.
SGI is also supporting Tesla 20 GPU co-processors in its Octane III personal supercomputer, the Altix XE workgroup servers, and the Altix ICE x64-based clusters. In a tip of the hat to its Rackable Systems heritage, SGI is also doing some bespoke server designs for customers in the HPC space, as it has always done for hyperscale Web customers. In this case, the designs feature compact chassis, low thermals, and GPUs, says Mannel.
A whole lot of flops in the box
Perhaps the most interesting new machine launched today came out of Moscow's T-Platforms: a very clever design that crams 16 blade servers into a 7U blade enclosure, with each blade having two Intel Xeon processors and two Tesla X2070 GPU co-processors. (This is the first time I have ever heard of the X2070 GPU, but it appears to be an M2070 with all of its packaging ripped off.) The TB2-TL blades have some real engineering. Memory sticks have been tipped on their sides to keep the blades skinny, and the entire blade is covered with a giant heat sink that keeps it from melting. Here's what the blade looks like with the heat sink removed:
The T-Platforms T-Blade 2 GPU blade server
As you can see, floating off to the right is a naked Tesla GPU co-processor. Presumably there is one on the other side of the card to give you the two you need. T-Platforms says the X2070 GPU is rated at 515 gigaflops in double precision, which matches the M2070 fanless GPU specs. The compute blade holds two of Intel's low-voltage L5630 processors, which are four-core processors that run at 2.13 GHz and which have a 40 watt thermal design power rating. The blade supports up to 24 GB of main memory and has one Gigabit Ethernet port and two Mellanox 40 Gb/sec InfiniBand ports. Here's what it looks like in the back of the T-Blade 2 chassis when you slide 16 of these bad boys in:
The hot end of the T-Blade 2 chassis with CPU-GPU blades installed
Alexey Nechuyatov, director of product marketing at T-Platforms, says that the T-Blade 2 chassis can be equipped with two 36-port InfiniBand switches from Mellanox. When you add it all up, the T-Blade 2 chassis with the CPU-GPU blades can pack 17.5 teraflops in a 7U enclosure for a list price of around $300,000 for a fully loaded box. If you buy ten racks of these babies - that's 60 blade chassis for $24m - you are near the top of the Top 500 supercomputing list for a little more than half the price of what a Cray XE6 will run you for the same oomph. The only difference is that you have to have code that can make real use of the GPUs.
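For readers who want to check the sums, here is a rough back-of-envelope sketch in Python of how the 17.5 teraflops and "a little more than half the price" figures might be reached. The blade counts, GPU rating, clock speed, and prices are the ones quoted in this story. The four double-precision flops per clock per core for the Xeons is an assumption, not a figure from T-Platforms, and everything here is peak theoretical performance rather than the sustained Linpack number the Top 500 list ranks by.

```python
# Back-of-envelope check of the T-Blade 2 figures quoted in the article.
# Every input is taken from the article except DP_FLOPS_PER_CYCLE,
# which is an assumption about the CPU's per-core peak.

BLADES_PER_CHASSIS = 16
GPUS_PER_BLADE = 2
GPU_DP_GFLOPS = 515.0        # X2070 double precision, per T-Platforms
CPUS_PER_BLADE = 2
CORES_PER_CPU = 4            # Xeon L5630
CPU_CLOCK_GHZ = 2.13
DP_FLOPS_PER_CYCLE = 4       # ASSUMPTION: not stated in the article

CHASSIS_COUNT = 60           # "ten racks of these babies"
TEN_RACK_PRICE_USD = 24e6    # quoted price for those 60 chassis
CRAY_USD_PER_PFLOPS = 45e6   # "on the order of $45m" per petaflops


def blade_peak_gflops():
    """Peak double-precision gigaflops of one CPU-GPU blade."""
    gpu = GPUS_PER_BLADE * GPU_DP_GFLOPS
    cpu = CPUS_PER_BLADE * CORES_PER_CPU * CPU_CLOCK_GHZ * DP_FLOPS_PER_CYCLE
    return gpu + cpu


def chassis_peak_tflops():
    """Peak teraflops of a fully loaded 7U chassis."""
    return BLADES_PER_CHASSIS * blade_peak_gflops() / 1000.0


def ten_rack_peak_pflops():
    """Peak petaflops of the 60-chassis configuration."""
    return CHASSIS_COUNT * chassis_peak_tflops() / 1000.0


def usd_per_pflops():
    """Dollars per peak petaflops for the 60-chassis configuration."""
    return TEN_RACK_PRICE_USD / ten_rack_peak_pflops()


if __name__ == "__main__":
    print(f"Chassis peak: {chassis_peak_tflops():.2f} teraflops")
    print(f"Ten racks:    {ten_rack_peak_pflops():.2f} petaflops")
    print(f"Cost per PF:  ${usd_per_pflops() / 1e6:.1f}m")
    print(f"vs Cray XE6:  {usd_per_pflops() / CRAY_USD_PER_PFLOPS:.0%}")
```

Under that assumption, the GPUs account for nearly all of the chassis peak, with the Xeons contributing only about a teraflops of it, which is why the comparison with the XE6 only holds for code that can keep the GPUs busy.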
The T-Blade 2 chassis and CPU-GPU blades will go into production in the fourth quarter, with shipments to initial customers starting at that time. They will be generally available in the first quarter of 2011. Nechuyatov says that T-Platforms is using the GPU Tech Conference, along with the fact that it has a CPU-only blade cluster installed at Moscow State University that ranked number 15 on the June 2010 Top 500 list, as a means of breaking into the HPC market in Western Europe and North America. T-Platforms has more than 200 customers worldwide and has labs in Moscow, Hannover, Kiev, and Taipei.
The other news coming out of the GPU Tech Conference is that IBM will launch a GPU-based blade for its BladeCenter chassis tomorrow. Hewlett-Packard is also whispering a bit about its own GPU blade for its BladeSystem enclosures, which sources tell El Reg the company has been quietly shipping to a few customers. The word we hear is that this HP blade will launch in October. ®