Nvidia shows off Tesla K10 performance

No single-precision flops anxiety here

ISC 2012 The Top 500 supercomputer ranking is based on the performance of machines running the Linpack Fortran matrix math benchmark using double-precision floating point math, but a lot of applications will do just fine with single-precision math. And it is for these workloads, graphics chip maker and supercomputing upstart Nvidia says, that it designed the new Tesla K10 server coprocessors.

Nvidia hosted its own GPU Technology Conference back in May, where it unveiled the next-generation Tesla K10 and K20 GPU coprocessors, the "Kepler" GK104 and GK110 GPUs that are used in these two server coprocessor cards, and some of the new hardware and software features that will make them substantially better coprocessors than the existing "Fermi" Tesla 20 series.

Nvidia left a little something for the International Supercomputing Conference in Hamburg, Germany this week, however, talking about the uptake of GPU coprocessing among the HPC sites in the world and providing some real-world benchmarks on the Tesla K10 that are more indicative of performance than raw peak number-crunching oomph or the Linpack test.

The ramp of coprocessors has been pretty fast since the "Roadrunner" hybrid, which married Advanced Micro Devices' Opteron CPUs to IBM's Cell Power-derived vector coprocessors to become the first machine to break the petaflops barrier four years ago. It demonstrated on a vast scale the benefits of using a parallel coprocessor to do offloaded math in a massively parallel supercomputer cluster.

The ramp of GPU coprocessors in the Top 500 supercomputer list

Nvidia launched the first Tesla GPU coprocessors in 2007, and reworked them to support a modicum of double-precision math in 2008. With the "Fermi" Tesla 20 series GPUs in 2010, the coprocessors got error correction on their GDDR memory and a slew of features, including a much-improved CUDA programming environment that made the Tesla 20s not only the cheap math option, but one from which HPC centers could get some actual work done without having to be a Department of Energy nuke lab with a seemingly unlimited budget.

The June 2012 ranking announced at ISC this week showed 58 coprocessor-assisted machines. Of those, 53 use Nvidia Tesla GPU coprocessors, two use AMD Radeon graphics cards, and one uses Intel's "Knights Corner" MIC x86-coprocessor, now known as the Xeon Phi chip. But this ramp is not as important perhaps as where the coprocessors are being added, and who is adding them to their machines.

Count of supers using Tesla GPU coprocessors in the Top 500 supers list

Sure, there are three machines in the top ten that have Tesla GPU coprocessors, just like in the list a year ago. But, explains Sumit Gupta, senior product manager of the Tesla line at Nvidia, who spoke to El Reg ahead of ISC this week, the GPU is democratizing HPC, with a factor of eight increase in the number of machines in the Bottom 400 of the list now adopting Tesla GPU coprocessors to do most of the mathematical lifting.

"In fact, a lot of our focus has been on the bottom four thousand HPC sites in the world," says Gupta with a laugh. "We are driving supercomputing for the masses, and now, with GPUs, small universities can build a small cluster and still make it onto the Top 500 ranking."

Rather than try to make a GPU coprocessor do both single-precision and double-precision math well, as it has done with prior generations of Teslas, Nvidia is bifurcating its line, with the Tesla K10 aimed at single-precision math and delivering a total of 4.58 teraflops of oomph on a card with two GK104 GPUs soldered onto it with 8GB of GDDR5 graphics memory.

The Tesla K20 will use a monster, monolithic GPU called the GK110 that has 7.1 billion transistors and that may offer 2 teraflops of double-precision performance. That's twice what Intel can do with its current Xeon Phi coprocessor, which will ship to OEMs by the end of the year – about the same time that OEMs can get their hands on Tesla K20s.

Memory bandwidth and flops, Xeon versus Tesla

Each GPU on the Tesla K10 card has 160GB/sec of memory bandwidth, for a total of 320GB/sec of bandwidth in and out of that memory block. This is 6.4 times the 50GB/sec into and out of a "Sandy Bridge" Xeon E5-2600 processor and considerably larger than the 177.6GB/sec on the prior Tesla M2090 GPU coprocessor. That K10 card with two GK104 GPUs also sports about a dozen times the floating point oomph of the Xeon E5 processor, but that may not be as important as you think.

"A lot of applications scale faster with memory bandwidth than they do with flops," Gupta tells El Reg. "And a lot of machines have plenty of flops already but they just don't have enough memory bandwidth to feed them."

Given that, you would think that Nvidia would have put a lot more memory and a lot more memory bandwidth than it has onto the Tesla K10 relative to its floating point power. Go figure.
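One common way to reason about Gupta's point is the roofline model, which is not something Gupta or Nvidia is quoted as using here. The sketch below applies it to the Tesla K10 card figures cited above (4.58 teraflops single precision, 320GB/sec); the kernel intensities are hypothetical values picked only to show where the memory limit gives way to the compute limit.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline bound: a kernel is capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


def ridge_point(peak_gflops, bandwidth_gb_s):
    """Flops per byte at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gb_s


# Tesla K10 card figures quoted in the article.
K10_PEAK_GFLOPS = 4580.0    # 4.58 teraflops, single precision
K10_BANDWIDTH_GB_S = 320.0  # two GPUs at 160GB/sec each

# Hypothetical kernel intensities (flops per byte moved), for illustration only.
EXAMPLE_INTENSITIES = (0.25, 1.0, 4.0, 16.0, 64.0)

ridge = ridge_point(K10_PEAK_GFLOPS, K10_BANDWIDTH_GB_S)
print(f"ridge point: {ridge:.2f} flops/byte")

for intensity in EXAMPLE_INTENSITIES:
    bound = attainable_gflops(K10_PEAK_GFLOPS, K10_BANDWIDTH_GB_S, intensity)
    limiter = "memory" if intensity < ridge else "compute"
    print(f"{intensity:6.2f} flops/byte -> {bound:7.1f} gigaflops ({limiter}-bound)")
```

Under this simple model, any kernel doing fewer than roughly 14 floating point operations per byte moved cannot reach the card's peak flops, which is one way to read the claim that machines have flops to spare but not the bandwidth to feed them.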

As it stands, each single GK104 chip has 2.29 teraflops of oomph but only 160GB/sec of memory bandwidth, compared to 1.3 teraflops of oomph and 177.6GB/sec of memory bandwidth for the Tesla M2090. But by doubling up the GPUs and widening the pipes into the GDDR memory (and presumably allowing data sharing across those two ranks of memory), Nvidia has been able to put considerably more performance into the same 225 watt thermal envelope, and this is also important to many customers.
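The balance between flops and bandwidth can be made concrete as bytes of memory bandwidth per peak single-precision flop. This is a back-of-the-envelope sketch using only the per-GPU figures quoted in the article; the `Gpu` class and its field names are illustrative, not anything from Nvidia.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    sp_teraflops: float    # peak single-precision teraflops, as quoted
    bandwidth_gb_s: float  # memory bandwidth in GB/sec, as quoted

    def bytes_per_flop(self) -> float:
        """Memory bandwidth available per peak floating point operation."""
        return (self.bandwidth_gb_s * 1e9) / (self.sp_teraflops * 1e12)


GK104 = Gpu("GK104 (one of two on a Tesla K10)", 2.29, 160.0)
M2090 = Gpu("Tesla M2090", 1.3, 177.6)

for gpu in (GK104, M2090):
    print(f"{gpu.name}: {gpu.bytes_per_flop():.3f} bytes/flop")

ratio = GK104.bytes_per_flop() / M2090.bytes_per_flop()
print(f"GK104 has {ratio:.2f}x the bandwidth per flop of the M2090")
```

On these numbers, each GK104 has roughly half the bandwidth per flop that the M2090 had, so the K10's gains come from raw flops and the second GPU on the card rather than from a better-fed chip.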

This is what matters to the oil and gas industry, for instance. Gupta says that right now, more than 50 per cent of the seismic processing that happens in the world is done on GPU coprocessors. In Houston, Texas, alone, where a lot of this processing takes place, there are tens of thousands of GPUs installed in systems, doing work all day and all night, looking for oil. Seismic processors who have been kicking the tires on the Tesla K10s are seeing a speedup of between 1.8 and 2 times over the Tesla M2090s they currently have installed.

In the life sciences area, Nvidia has run a bunch of benchmarks to see how the apps stack up on the new versus old GPU coprocessors, and the performance boost ranges from just shy of a 2X improvement to as much as a 3X improvement, depending on the workload.

Nvidia Tesla K10 benchmarks

The ability to boost the oomph of life sciences workloads is illustrated by the AMBER molecular dynamics modeling software. Back in 2009, it would take 96 x86 server nodes all day to simulate 46 nanoseconds of a molecule with 23,558 atoms. With eight Tesla M2050 cards running in eight server nodes, you could do 52 nanoseconds of simulation for that molecule in a day.

With the shift to the M2090s, you could move to four of these babies and - over the course of a 24 hour period - simulate 69 nanoseconds of the molecule wiggling about. This year, you can move to two Tesla K10 processors and get 66 nanoseconds of simulation done, or you can move to eight K10s and get four times as much wiggling computed in the same 24-hour period for the same 23,558-atom molecule.
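To compare the AMBER results on an equal footing, the quoted figures can be normalized to nanoseconds simulated per day per node or card. This is a rough sketch based only on the numbers in the two paragraphs above; it ignores host CPUs, interconnect, and scaling effects, and it leaves out the eight-K10 configuration because the article gives no absolute figure for it.

```python
# (configuration, number of nodes or cards, nanoseconds simulated per day),
# all as quoted in the article for the 23,558-atom molecule.
AMBER_RUNS = [
    ("x86 server nodes (2009)", 96, 46.0),
    ("Tesla M2050 cards", 8, 52.0),
    ("Tesla M2090 cards", 4, 69.0),
    ("Tesla K10 cards", 2, 66.0),
]


def ns_per_day_per_unit(units, ns_per_day):
    """Throughput of a single node or card, assuming perfectly even sharing."""
    return ns_per_day / units


per_unit = {label: ns_per_day_per_unit(units, ns) for label, units, ns in AMBER_RUNS}

for label, rate in per_unit.items():
    print(f"{label:26s} {rate:6.2f} ns/day per unit")

k10_vs_m2090 = per_unit["Tesla K10 cards"] / per_unit["Tesla M2090 cards"]
print(f"K10 vs M2090, per card: {k10_vs_m2090:.2f}x")
```

Per card, the K10 comes out at about 1.9 times the M2090 on this workload, which lines up with the "just shy of 2X" low end of the life sciences range Nvidia cites. Note that a K10 card carries two GPUs, so this is a per-card comparison, not a per-GPU one.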

As it goes with seismic processing and molecular dynamics, so it goes with satellite imaging, video enhancement, signal processing, computer vision, video transcoding, and digital rendering – just to name other areas where DP math doesn't mean squat.

Incidentally, the London Olympics will be streamed to all kinds of devices, with both live and archived streams, using GPU-based video streaming servers from Elemental. The company expects to have over 1 billion people hit its connections, and it needs the oomph of GPUs to process the hundreds of different streams it has to send to the TVs, phones, tablets, and other devices on which people will be watching the competition this summer. ®
