Nvidia shows off Tesla K10 performance
No single-precision flops anxiety here
ISC 2012 The Top 500 supercomputer ranking is based on the performance of machines running the Linpack Fortran matrix math benchmark using double-precision floating point math, but a lot of applications will do just fine with single-precision math. And it is for these workloads, graphics chip maker and supercomputing upstart Nvidia says, that it designed the new Tesla K10 server coprocessors.
Nvidia just hosted its own GPU Technology Conference back in May, where it unveiled the next-generation Tesla K10 and K20 GPU coprocessors, the "Kepler" GK104 and GK110 GPUs that are used in these two server coprocessor cards, and some of the new hardware and software features that will make them substantially better coprocessors than the existing "Fermi" Tesla 20 series.
Nvidia left a little something for the International Supercomputing Conference in Hamburg, Germany this week, however, talking about the uptake of GPU coprocessing among the HPC sites in the world and providing some real-world benchmarks on the Tesla K10 that are more indicative of performance than raw peak number-crunching oomph or the Linpack test.
The ramp of coprocessors has been pretty fast since the "Roadrunner" hybrid, which married Advanced Micro Devices' Opteron CPUs to IBM's Cell Power-derived vector coprocessors to become the first machine to break the petaflops barrier four years ago. It demonstrated on a vast scale the benefits of using a parallel coprocessor to do offloaded math in a massively parallel supercomputer cluster.
The ramp of GPU coprocessors in the Top 500 supercomputer list
Nvidia launched the first Tesla GPU coprocessors in 2007, and reworked them to support a modicum of double-precision math in 2008. With the "Fermi" Tesla 20 series GPUs in 2010, the GPU coprocessors got error correction on their GDDR memory and a slew of features, including a much-improved CUDA programming environment that made the Tesla 20s not only the cheap math option, but one from which HPC centers could get some actual work done without having to be a Department of Energy nuke lab with a seemingly unlimited budget.
The June 2012 ranking announced at ISC this week showed 58 coprocessor-assisted machines. Of those, 53 use Nvidia Tesla GPU coprocessors, two use IBM Cell processors, two use AMD Radeon graphics cards, and one uses Intel's "Knights Corner" MIC x86 coprocessor, now known as the Xeon Phi chip. But this ramp is perhaps not as important as where the coprocessors are being added, and who is adding them to their machines.
Count of supers using Tesla GPU coprocessors in the Top 500 supers list
Sure, there are three machines in the top ten that have Tesla GPU coprocessors, just like in the list a year ago. But, explains Sumit Gupta, senior product manager of the Tesla line at Nvidia, who spoke to El Reg ahead of ISC this week, the GPU is democratizing HPC, with a factor of eight increase in the number of machines in the Bottom 400 of the list now adopting Tesla GPU coprocessors to do most of the mathematical lifting.
"In fact, a lot of our focus has been on the bottom four thousand HPC sites in the world," says Gupta with a laugh. "We are driving supercomputing for the masses, and now, with GPUs, small universities can build a small cluster and still make it onto the Top 500 ranking."
Rather than try to make a GPU coprocessor do both single-precision and double-precision math well, as it has done with prior generations of Teslas, Nvidia is bifurcating its line, with the Tesla K10 aimed at single-precision math and delivering a total of 4.58 teraflops of oomph on a card with two GK104 GPUs soldered onto it with 8GB of GDDR5 graphics memory.
The Tesla K20 will use a monster, monolithic GPU called the GK110 that has 7.1 billion transistors and that may offer 2 teraflops of double-precision performance. That's twice what Intel can do with its current Xeon Phi coprocessor, which will ship to OEMs by the end of the year – about the same time that OEMs can get their hands on Tesla K20s.
Memory bandwidth and flops, Xeon versus Tesla
Each GPU on the Tesla K10 card has 160GB/sec of memory bandwidth, for a total of 320GB/sec of bandwidth in and out of that memory block. This is 6.4 times the 50GB/sec into and out of a "Sandy Bridge" Xeon E5-2600 processor and considerably larger than the 177.6GB/sec on the prior Tesla M2090 GPU coprocessor. That K10 card with two GK104 GPUs also sports about a dozen times the floating point oomph of the Xeon E5 processor, but that may not be as important as you think.
"A lot of applications scale faster with memory bandwidth than they do with flops," Gupta tells El Reg. "And a lot of machines have plenty of flops already but they just don't have enough memory bandwidth to feed them."
Given that, you would think that Nvidia would have put a lot more memory and a lot more memory bandwidth than it has onto the Tesla K10 relative to its floating point power. Go figure.
As it stands, each single GK104 chip has 2.29 teraflops of oomph but only 160GB/sec of memory bandwidth, compared to 1.3 teraflops of oomph and 177.6GB/sec of memory bandwidth on the Tesla M2090. But by doubling up the GPUs and widening the pipes into the GDDR memory (and presumably allowing data sharing across those two ranks of memory), Nvidia has been able to put considerably more performance into the same 225 watt thermal envelope, and this is also important to many customers.
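To put that imbalance in numbers, here is a rough back-of-the-envelope sketch in Python that divides peak single-precision flops by memory bandwidth for each part. The figures are the ones quoted in this article; pairing the 1.3 teraflops figure with the Tesla M2090 is an assumption based on the matching 177.6GB/sec bandwidth number quoted earlier, and the function and variable names are made up for illustration.

```python
# Peak single-precision flops and memory bandwidth as quoted in the article.
# These are per-GPU figures; a Tesla K10 card carries two GK104 GPUs.
# Assumption: the 1.3 teraflops figure belongs to the Tesla M2090.
PARTS = {
    "GK104 (one of two on a Tesla K10)": {"tflops": 2.29, "gb_per_sec": 160.0},
    "Tesla M2090": {"tflops": 1.3, "gb_per_sec": 177.6},
}


def flops_per_byte(tflops, gb_per_sec):
    """Peak flops available for every byte moved to or from GPU memory."""
    return (tflops * 1e12) / (gb_per_sec * 1e9)


# Round to one decimal place; the inputs are only good to a few digits anyway.
ratios = {name: round(flops_per_byte(**spec), 1) for name, spec in PARTS.items()}

for name, ratio in ratios.items():
    print(f"{name}: about {ratio} peak flops per byte of memory bandwidth")
```

On these numbers, each GK104 has roughly twice as many peak flops to feed per byte of bandwidth as the older part, which is the tension Gupta's own comment points at. Peak ratios like this say nothing about how a given application actually behaves.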
This is what matters to the oil and gas industry, for instance. Gupta says that right now, more than 50 per cent of the seismic processing that happens in the world is done on GPU coprocessors. In Houston, Texas, alone, where a lot of this processing takes place, there are tens of thousands of GPUs installed in systems, doing work all day and all night, looking for oil. The seismic processors who have been kicking the tires on the Tesla K10s are seeing a speedup of between 1.8 and 2 times over the Tesla M2090s they currently have installed.
In the life sciences area, Nvidia has run a bunch of benchmarks to see how the apps stack up on the new versus old GPU coprocessors, and the performance boost ranges from just shy of a 2X improvement to as much as a 3X improvement, depending on the workload.
Nvidia Tesla K10 benchmarks
The ability to boost the oomph of life sciences workloads is illustrated by the AMBER molecular dynamics modeling software. Back in 2009, it would take 96 x86 server nodes all day to simulate 46 nanoseconds of a molecule with 23,558 atoms. With eight Tesla M2050 cards running in eight server nodes, you could do 52 nanoseconds of simulation for that molecule in a day.
With the shift to the M2090s, you could move to four of these babies and - over the course of a 24 hour period - simulate 69 nanoseconds of the molecule wiggling about. This year, you can move to two Tesla K10 processors and get 66 nanoseconds of simulation done, or you can move to eight K10s and get four times as much wiggling computed in the same 24-hour period for the same 23,558-atom molecule.
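As a rough way to compare those AMBER runs, the Python sketch below divides each quoted nanoseconds-per-day figure by the number of nodes or cards used. It uses only the numbers given above and leaves out the eight-K10 case, for which no absolute figure is quoted. Dividing evenly like this assumes the work scales linearly across units, which molecular dynamics codes generally do not do perfectly, so treat the output as indicative only; the names are made up for illustration.

```python
# (configuration, number of nodes or cards, nanoseconds simulated per day),
# all taken from the AMBER figures quoted in the article.
RUNS = [
    ("x86 server nodes (2009)", 96, 46.0),
    ("Tesla M2050 cards", 8, 52.0),
    ("Tesla M2090 cards", 4, 69.0),
    ("Tesla K10 cards", 2, 66.0),
]

# Simplifying assumption: throughput divides evenly across the units in a run.
per_unit = {name: ns_per_day / units for name, units, ns_per_day in RUNS}

for name, value in per_unit.items():
    print(f"{name}: {value:.2f} ns/day per unit")

# Card-for-card comparison of the new part against its predecessor.
k10_vs_m2090 = per_unit["Tesla K10 cards"] / per_unit["Tesla M2090 cards"]
print(f"K10 versus M2090, per card: {k10_vs_m2090:.2f}x")
```

Per card, the K10 comes out a little under twice the M2090 on this workload, which lines up with the low end of the 2X to 3X range Nvidia cites for life sciences applications.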
As it goes with seismic processing and molecular dynamics, so it goes with satellite imaging, video enhancement, signal processing, computer vision, video transcoding, and digital rendering – just to name other areas where DP math doesn't mean squat.
Incidentally, the London Olympics will be streamed to all kinds of devices, with both live and archived streams, using GPU-based video streaming servers from Elemental. It expects to have over 1 billion people hit its connections, and it needs the oomph of GPUs to process the hundreds of different streams it has to send to the TVs, phones, tablets, and other devices on which people will be watching the competition this summer. ®