The GPU tails wag the CPU dogs at Nvidia show
Where are the Tesla roadmaps?
Updated The Nvidia-sponsored 2010 GPU Technical Conference kicks off today in San Jose, California, and all of the key HPC players as well as some upstarts will be on hand to try to surf on the cresting wave of CPU-GPU hybrid computing models that will no doubt start taking over the HPC centers of the world and start moving out to our desktops and into corporate data centers in the coming years.
If you were expecting some insight into what Nvidia has cooking for the generation of GPU chips beyond the current "Fermi" chips that are used in the company's GeForce and Quadro graphics cards and Tesla co-processors, then you are going to be disappointed. Because according to Sumit Gupta, senior product manager of the Tesla line, Nvidia is not talking about roadmaps at the GPU Tech Conference.
And that is a damned shame, because now that the Fermi-based graphics cards, GPUs co-processors (both regular C2050 and C2070 PCI-Express cards and fanless M2050 and M2070 models for ceepie-geepie HPC clusters) are in the field, what everyone wants to really know is what Nvidia is going to do next.
There's plenty of chatter, of course, and it doesn't take a genius to figure out what Nvidia's next moves will be. First, there will be a process shrink that allows the company to get more flops out of a GPU, very likely offeringroughly twice the GPU cores and twice the oomph of the current machines.
The path is pretty plain. The first generation Tesla co-processor, the C870, debuted in the summer of 2007 with 128 cores running at 600 MHz, 1.5 GB of GDDR3 memory running at 1.6 GHz, and only offering single-precision floating point math. The C870, which burned 171 watts, was rated at 345.6 gigaflops. The second generation Tesla 10 GPU co-processors made their debut in November 2008 in the Tesla C1060, launching at the SC08 supercomputing conference, which had 240 cores running at 600 MHz, 4 GB of GDDR3 memory at the same 1.6 GHz speed, and the addition of double-precision math. The C1060 was rated at 622.1 gigaflops on single-precision math, but only 77.8 gigaflops on double precision.
With the Fermi GPUs at the heart of the current generation of Tesla 20 co-processors, Nvidia is shipping the C2050 and C2070, which have 448 cores running at 575 MHz and either 3 GB (C2050) or 6 GB (C2070) of GDDR5 memory running at a much faster 3 GHz. The Tesla 20 GPU co-processors offer more balanced floating point performance, with 1.03 teraflops of single-precision oomph and 515.2 gigaflops of double-precision number crunching. The Tesla 20s, which were announced in November 2009 at the SC09 conference, had the added extra goodie of ECC scrubbing on the GDDR5 memory inside the GPU co-processor - something that a lot of HPC workloads require and something that is missing from AMD's line of FireStream GPU co-processors.
A betting man would say that at SC10 this year in New Orleans Nvidia will be talking about the guts behind the forthcoming Tesla 30 co-processors and related discrete GPU graphics cards. The GPUs were designed with 128, 256, and 512 cores in the first three generations, delivering 128, 240, and 448 working cores after the boogers in each chip were de-allocated.
It is a pretty safe bet that Nvidia is trying to cram 1,024 cores in its next GPU design, and based on current trends, where a successively larger percent of the cores don't make it, a fair guess is that 838 cores will be live in whatever future designs come out unless Taiwan Semiconductor Manufacturing Corp gets better yields on future processes than it is getting on current ones.
Adding so many cores on what I will call the Tesla 30 GPU co-processors probably means dropping the clock speed a bit, too. Maybe to somewhere around 500 MHz or so, depending on how hot the chip gets. If this is the case, then the single precision math on such a future Tesla 30 GPU co-processor would come in at 1.5 teraflops or so. If the clock speed can be pushed up to 600 MHz, that gets you about 1.8 teraflops.
Obviously, having more cores not be duds means you can drop the clock speed and still get the same flops. In an ideal universe, all 1,024 potential cores would run at 600 MHz and you'd get 2.2 teraflops. I am assuming that Nvidia can keep double-precision math to half the rate of single-precision math going forward, and I bet Tesla customers are, too. And if the naming conventions mean anything, then this future GPU co-processor will be called the C3040, with a C3060 variant with extra GDDR5 memory.
I'll take my Tesla 30s now, Nvidia. Thank you very much. Now let's talk about the Tesla 40s...
But seriously, there is one other interesting possibility that Nvidia could throw into some future generation of GPU co-processor, and it reminds me of an old joke: A man walks into a doctor's office with a chicken on his head and the chicken says, "Hey, doc, can you cut this idiot off my ass?"
Next page: One more interesting option
Why not indeed
That is just what I thought and commented in a related article (submitted but not cleared yet). We can only hope!
Rugged systems and bandwidth
Wander over to Simon Collins from GE at the show - see some defense and aerospace targetted rugged CUDA boards in OpenVPX chassis - 20 x 10G links or more
What you need
Is a graphics co-processor with an inbuilt arm core. I know of at least one designed for the mobile space (<1w) which would make a great building block. Lots of oomp somewhere between PS2 and PS3 3D performance, HD encode decode etc using vector processors and dedicated HW blocks, and a nice Arm core to keep things in order.