Nvidia snaps out snappier Tesla GPU coprocessors
All fired up on all 512 cores
GPU chipmaker Nvidia knows that it has to do more to grow its Tesla biz than slap some passive heat sinks on a fanless GPU card and talk up its CUDA parallel-programming tools. It has to keep delivering price/performance improvements, as well.
And that's exactly what it's doing with the new Tesla M2090 GPU coprocessor.
Back when the "Fermi" GPU chips were previewed at the SC2009 supercomputing event a year and a half ago, Nvidia showed off a chip with 512 cores, plus L1 and L2 cache memories for those cores (this was new) and ECC memory scrubbing (also new). The design bundled the cores into 16 streaming multiprocessors of 32 cores apiece, each with 64KB of L1 cache, plus a higher-level L2 cache weighing in at 768KB that all of the cores can share.
That Fermi chip sported GDDR5 memory controllers, and the cards using the Fermi chips (either as discrete graphics cards or GPU coprocessors for accelerating floating point calculations) could have 3GB or 6GB of main memory. The memory controllers on the Fermi GPUs can address up to 1TB of memory, in theory.
But in the chip racket, theory does not always happen on the first iteration of a product, and so it was with the Fermi GPUs.
When the Fermi chips started shipping in the Tesla line of GPU coprocessors in May 2010, the initial Teslas had only 448 cores activated. Nvidia never explained this, but most people surmised that this had to do with yield issues (gunk on some cores in the chip) and the chips generating too much heat at a particular clock speed.
With those 448 cores running at 1.15GHz and GDDR5 memory chips running at 1.56GHz, the Tesla M2050 GPU coprocessor was rated at 515 gigaflops double-precision and 1.03 teraflops single-precision when performing floating-point operations.
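Those ratings can be roughly reproduced from the core count and clock speed. The sketch below is back-of-the-envelope arithmetic, not Nvidia's published method: it assumes each core retires one fused multiply-add (counted as two flops) per clock in single precision, and that double precision runs at half that rate. Neither assumption is stated in the article, and the function and constant names are made up for illustration.

```python
# Assumption (not stated in the article): each core retires one fused
# multiply-add per clock, counted as 2 flops, in single precision, and
# double precision runs at half that rate.
SP_FLOPS_PER_CORE_PER_CLOCK = 2.0
DP_FLOPS_PER_CORE_PER_CLOCK = 1.0


def peak_gflops(cores: int, clock_ghz: float, flops_per_core_per_clock: float) -> float:
    """Theoretical peak in gigaflops: cores x clock (GHz) x flops per core per clock."""
    return cores * clock_ghz * flops_per_core_per_clock


# Tesla M2050 figures quoted in the article: 448 cores at 1.15GHz.
m2050_sp = peak_gflops(448, 1.15, SP_FLOPS_PER_CORE_PER_CLOCK)
m2050_dp = peak_gflops(448, 1.15, DP_FLOPS_PER_CORE_PER_CLOCK)

print(f"single precision: {m2050_sp:.1f} gigaflops")
print(f"double precision: {m2050_dp:.1f} gigaflops")
```

Under those assumptions the numbers land at about 1,030 gigaflops single-precision and 515 gigaflops double-precision, which lines up with the quoted ratings.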
The Tesla M2050 is a single-wide PCI-Express 2.0 device that has 3GB of GDDR5 memory, while the M2070 is a two-slot device that packs 6GB of memory and has the same floppish performance.
Both are rated at a top-end 225 watts of peak power draw, but Nvidia says the actual heat thrown off by the device is often a lot less and depends on the workload. That is a little bit less than the 238 watts of the Tesla C2050 and C2070 coprocessors, which have fans built into them and which are aimed at goosing the number-crunching power of workstations to create a "personal supercomputer". These devices, too, are rated at the same 515 gigaflops of double-precision and 1.03 teraflops single-precision.
Sumit Gupta, senior product manager of the Tesla line at Nvidia, says that the Fermi GPUs used in the new M2090 coprocessors are not just a bin sort, looking for Fermis with more working cores or clocks that can run faster reliably. Nvidia has actually done a new tape-out of the Fermi design using Taiwan Semiconductor Manufacturing Co's 40-nanometer processes, which Gupta says have some improvements that make chips run better.
When you add up some nips and tucks here and there on the Fermi chip plus the process improvements from TSMC, Nvidia can crank up the Fermi core clock speed by 13 per cent to 1.3GHz, and the GDDR5 memory speed by 18.6 per cent, to 1.85GHz, on the Tesla M2090.
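For anyone who wants to check the sums, the percentages fall straight out of the old and new clock speeds. This is a minimal sketch using only the figures quoted above; the helper name `pct_increase` is invented for illustration.

```python
def pct_increase(old: float, new: float) -> float:
    """Percentage increase going from old to new."""
    return (new - old) / old * 100.0


# Clock speeds in GHz as quoted in the article (M2050 vs M2090).
core_clock_gain = pct_increase(1.15, 1.30)
memory_clock_gain = pct_increase(1.56, 1.85)

print(f"core clock:   +{core_clock_gain:.1f} per cent")
print(f"memory clock: +{memory_clock_gain:.1f} per cent")
```

The core clock works out to roughly 13 per cent and the memory clock to roughly 18.6 per cent, matching the figures given.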
Nvidia's Tesla M2090 server GPU coprocessor
Those increases help performance considerably. And so does the fact that with the TSMC process improvement, Nvidia can now have all 512 cores in the Fermi design activated, which yields a theoretical 14.3 per cent improvement over those initial Fermi chips with only 448 active cores.
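The core-count gain and the core clock bump compound rather than add. The sketch below multiplies the two ratios to get a theoretical peak-compute uplift; it assumes peak throughput scales linearly with both core count and clock speed, which is a simplification for illustration and not a rating quoted by Nvidia.

```python
# Figures quoted in the article.
OLD_CORES, NEW_CORES = 448, 512
OLD_CLOCK_GHZ, NEW_CLOCK_GHZ = 1.15, 1.30

# Assumption: peak compute scales linearly with cores and with clock speed.
core_ratio = NEW_CORES / OLD_CORES
clock_ratio = NEW_CLOCK_GHZ / OLD_CLOCK_GHZ
combined_ratio = core_ratio * clock_ratio

core_gain_pct = (core_ratio - 1.0) * 100.0
combined_gain_pct = (combined_ratio - 1.0) * 100.0

print(f"more cores alone:  +{core_gain_pct:.1f} per cent")
print(f"cores plus clocks: +{combined_gain_pct:.1f} per cent")
```

On paper, then, the extra cores alone are worth about 14.3 per cent, and cores plus clocks together come to a bit over 29 per cent. Real workloads will depend on memory bandwidth and much else besides.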