Original URL: http://www.theregister.co.uk/2010/09/21/nvidia_gpu_tech/
The GPU tails wag the CPU dogs at Nvidia show
Where are the Tesla roadmaps?
Updated The Nvidia-sponsored 2010 GPU Technology Conference kicks off today in San Jose, California. All of the key HPC players, as well as some upstarts, will be on hand to try to surf the cresting wave of CPU-GPU hybrid computing, which will no doubt start taking over the HPC centers of the world and move out to our desktops and into corporate data centers in the coming years.
If you were expecting some insight into what Nvidia has cooking for the generation of GPU chips beyond the current "Fermi" chips that are used in the company's GeForce and Quadro graphics cards and Tesla co-processors, then you are going to be disappointed. Because according to Sumit Gupta, senior product manager of the Tesla line, Nvidia is not talking about roadmaps at the GPU Tech Conference.
And that is a damned shame, because now that the Fermi-based graphics cards and GPU co-processors (both regular C2050 and C2070 PCI-Express cards and fanless M2050 and M2070 models for ceepie-geepie HPC clusters) are in the field, what everyone really wants to know is what Nvidia is going to do next.
There's plenty of chatter, of course, and it doesn't take a genius to figure out what Nvidia's next moves will be. First, there will be a process shrink that allows the company to get more flops out of a GPU, very likely offering roughly twice the GPU cores and twice the oomph of the current machines.
The path is pretty plain. The first generation Tesla co-processor, the C870, debuted in the summer of 2007 with 128 cores running at 600 MHz and 1.5 GB of GDDR3 memory running at 1.6 GHz, and it offered only single-precision floating point math. The C870, which burned 171 watts, was rated at 345.6 gigaflops. The second generation Tesla 10 GPU co-processors made their debut in November 2008 at the SC08 supercomputing conference in the form of the Tesla C1060, which had 240 cores running at 600 MHz, 4 GB of GDDR3 memory at the same 1.6 GHz speed, and the addition of double-precision math. The C1060 was rated at 622.1 gigaflops on single-precision math, but only 77.8 gigaflops on double precision.
With the Fermi GPUs at the heart of the current generation of Tesla 20 co-processors, Nvidia is shipping the C2050 and C2070, which have 448 cores running at 575 MHz and either 3 GB (C2050) or 6 GB (C2070) of GDDR5 memory running at a much faster 3 GHz. The Tesla 20 GPU co-processors offer more balanced floating point performance, with 1.03 teraflops of single-precision oomph and 515.2 gigaflops of double-precision number crunching. The Tesla 20s, which were announced in November 2009 at the SC09 conference, had the added extra goodie of ECC scrubbing on the GDDR5 memory inside the GPU co-processor - something that a lot of HPC workloads require and something that is missing from AMD's line of FireStream GPU co-processors.
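One way to line up the three generations is to back out, from the figures quoted above, how many single-precision flops each core delivers per quoted clock cycle, and how double precision compares with single precision. The sketch below does only that arithmetic. The names are illustrative, and the clock speeds are taken at face value as quoted here.

```python
# Figures exactly as quoted in the article:
# model -> (cores, quoted clock in MHz, single-precision gigaflops)
QUOTED_SP = {
    "C870": (128, 600.0, 345.6),
    "C1060": (240, 600.0, 622.1),
    "C2050": (448, 575.0, 1030.0),  # quoted as "1.03 teraflops"
}

# Double-precision gigaflops, for the models where the article gives one.
QUOTED_DP = {"C1060": 77.8, "C2050": 515.2}


def flops_per_core_per_clock(cores, clock_mhz, gigaflops):
    """Implied flops per core per quoted clock cycle."""
    return (gigaflops * 1e9) / (cores * clock_mhz * 1e6)


# Implied per-core, per-clock throughput for each model.
per_core = {
    model: flops_per_core_per_clock(*spec) for model, spec in QUOTED_SP.items()
}

# Ratio of double-precision to single-precision peak, where both are quoted.
dp_to_sp = {model: QUOTED_DP[model] / QUOTED_SP[model][2] for model in QUOTED_DP}

for model, value in per_core.items():
    print(f"{model}: {value:.2f} flops per core per quoted clock")
for model, ratio in dp_to_sp.items():
    print(f"{model}: double precision is {ratio:.3f} of single precision")
```

On these figures, double precision goes from about one eighth of the single-precision rate on the C1060 to about half on the C2050, which is the "more balanced" performance described above.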
A betting man would say that at SC10 this year in New Orleans Nvidia will be talking about the guts behind the forthcoming Tesla 30 co-processors and related discrete GPU graphics cards. The GPUs were designed with 128, 256, and 512 cores in the first three generations, delivering 128, 240, and 448 working cores after the boogers in each chip were de-allocated.
It is a pretty safe bet that Nvidia is trying to cram 1,024 cores into its next GPU design. Based on current trends, where a successively larger percentage of the cores doesn't make it, a fair guess is that 838 cores will be live in whatever future designs come out, unless Taiwan Semiconductor Manufacturing Corp gets better yields on future processes than it is getting on current ones.
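The article does not say how the 838 figure was reached. As a rough cross-check, the sketch below takes the designed and working core counts quoted above and simply continues the drop in the working fraction in a straight line. That straight-line rule is an assumption made for illustration, not the article's method, and the variable names are hypothetical.

```python
# Designed versus working core counts for the first three generations,
# as quoted in the article.
designed = [128, 256, 512]
working = [128, 240, 448]

# Fraction of designed cores that ended up live in each generation.
fractions = [w / d for w, d in zip(working, designed)]  # 1.0, 0.9375, 0.875

# Assumption: the fraction keeps falling by the same step as last time.
step = fractions[-1] - fractions[-2]
next_fraction = fractions[-1] + step

# Apply that fraction to a hypothetical 1,024-core design.
next_designed = 1024
next_working = round(next_designed * next_fraction)

print(f"working fraction per generation: {fractions}")
print(f"extrapolated fraction: {next_fraction}")
print(f"extrapolated working cores: {next_working}")
```

Under that assumption the extrapolation gives 832 live cores, in the same neighborhood as the 838 guessed at above.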
Adding so many cores on what I will call the Tesla 30 GPU co-processors probably means dropping the clock speed a bit, too. Maybe to somewhere around 500 MHz or so, depending on how hot the chip gets. If this is the case, then the single precision math on such a future Tesla 30 GPU co-processor would come in at 1.5 teraflops or so. If the clock speed can be pushed up to 600 MHz, that gets you about 1.8 teraflops.
Obviously, having more cores not be duds means you can drop the clock speed and still get the same flops. In an ideal universe, all 1,024 potential cores would run at 600 MHz and you'd get 2.2 teraflops. I am assuming that Nvidia can keep double-precision math to half the rate of single-precision math going forward, and I bet Tesla customers are, too. And if the naming conventions mean anything, then this future GPU co-processor will be called the C3040, with a C3060 variant with extra GDDR5 memory.
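The three single-precision guesses above all fit one scaling rule: peak flops equals live cores times clock speed times flops per core per clock. The per-core factor is not stated in the article, so the sketch below backs it out of the article's own estimates and then applies the assumed half-rate double-precision math. Function and variable names are illustrative.

```python
def peak_gflops(cores, clock_mhz, flops_per_core_per_clock):
    """Peak gigaflops for a given core count, clock and per-core factor."""
    return cores * clock_mhz * flops_per_core_per_clock / 1000.0


# The article's estimates: (live cores, clock in MHz, single-precision teraflops).
estimates = [
    (838, 500.0, 1.5),
    (838, 600.0, 1.8),
    (1024, 600.0, 2.2),
]

# Back out the per-core, per-clock factor implied by each estimate.
implied = [tflops * 1e6 / (cores * mhz) for cores, mhz, tflops in estimates]
factor = sum(implied) / len(implied)

for (cores, mhz, tflops), value in zip(estimates, implied):
    sp = peak_gflops(cores, mhz, factor) / 1000.0
    dp = sp / 2.0  # assumption from the article: double precision at half rate
    print(f"{cores} cores at {mhz:.0f} MHz: implied factor {value:.2f}, "
          f"SP {sp:.2f} TF (article: {tflops}), DP {dp:.2f} TF")
```

All three estimates imply roughly the same factor, about 3.58 flops per core per clock, so changing either the number of live cores or the clock speed moves the result in direct proportion.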
I'll take my Tesla 30s now, Nvidia. Thank you very much. Now let's talk about the Tesla 40s...
But seriously, there is one other interesting possibility that Nvidia could throw into some future generation of GPU co-processor, and it reminds me of an old joke: A man walks into a doctor's office with a chicken on his head and the chicken says, "Hey, doc, can you cut this idiot off my ass?"
One more interesting option
What I mean by this is very simple. Last November, tongues were a-wagging that Nvidia had hired some engineers from the defunct Transmeta, a maker of low-power clone x86 chips. A year earlier, there were rumors that Nvidia would buy clone x64 chip maker VIA Technologies.
Both rumors have been stepped on, and Nvidia has said it would stick with the GPU business and not get into the CPU business. But what about embedding an x64 processor in the GPU package? What would happen if you created a GPU package whose embedded processor could run a Linux kernel and the same MPI-based applications, and that processor had a fast link on the package to the GPU and its memory? You could put a bunch of these PCI-Express ceepie-geepies inside of a barebones server that provides connectivity out to switches and disk/flash storage.
It's an interesting thought. But Gupta didn't want to talk about any of this, and instead wanted to focus on the success of the Nvidia Tesla 20 GPU co-processors and the uptake of the CUDA programming tools for these GPUs.
As best as Nvidia can figure, there are more than 1,000 HPC clusters configured with Tesla GPUs running out there in the world. There are some 100,000 active developers working on porting code to GPUs, says Gupta, and 350 universities worldwide are teaching CUDA as part of their computer science programs.
The Nvidia GPU paper chase
"I have been in the parallel computing business for a long time, and I have never seen a programming model take off like this," says Gupta. As part of the festivities at the GPU Tech Conference this week, the Portland Group, which makes C/C++ and Fortran compilers popularly used in the HPC community, will announce a CUDA for x86 compiler that can sense whether an Nvidia GPU is present or not. It compiles two sets of code - one for just CPUs and one for CPU-GPU hybrids - and runs whichever set makes sense given the underlying hardware. The CPU side of the compiler has optimizations to help it make better use of the multiple threads and cores in modern x64 CPUs; it is not just tossing the code on the x64 chip and saying that this is good enough.
Different HPC communities are also getting behind GPUs, mostly because there is no other way to get cheap flops within a given thermal envelope. The NIMA weather model at the National Oceanic and Atmospheric Administration is getting a ground-up redesign with GPUs in mind, and so is the Tsunami weather model supported by the Japanese government and the ocean circulation model maintained by the Massachusetts Institute of Technology. The National Institute for Environmental Studies in Japan has just installed a CPU-GPU cluster and is porting its carbon dioxide emissions model for the country to the hybrid setup.
Various life sciences, engineering, astrophysics, and mathematical applications are being ported to CUDA as well. The Amber molecular dynamics, Ansys finite element analysis, and MATLAB mathematical programs were singled out for supporting CUDA and Tesla GPUs. The first two have hundreds of thousands of users worldwide, who have been trapped by the limitations of their workstations, whose multicore x64 processors just don't provide the kind of oomph they need.
On one test, the Amber program was ported using MPI to the "Kraken" Opteron-based massively parallel supercomputer installed at Oak Ridge National Laboratory, which has 99,072 cores and is rated at 1.03 petaflops. Running a molecular modeling benchmark called JAC NVE, a 192-core partition of the Kraken XT5 super was able to simulate 46 nanoseconds of the molecule per day of compute time, while a server with eight of the M2050 embedded GPUs could do 52 nanoseconds of molecular simulation per day. It is not clear what would happen if you tried to run the benchmark across all of Kraken's cores, but most of us don't have the $45m or so it would take to build such a petaflops workstation.
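To put the two Amber results on a common footing, the sketch below divides each quoted throughput by the number of cores or GPUs that produced it. This assumes throughput divides evenly across devices, which molecular dynamics codes rarely manage in practice, so treat the result as a back-of-the-envelope ratio only. The variable names are illustrative.

```python
# Benchmark figures as quoted in the article (JAC NVE, nanoseconds per day).
kraken_ns_per_day = 46.0
kraken_cores = 192

gpu_ns_per_day = 52.0
gpu_count = 8

# Assumption: throughput is shared evenly across cores or GPUs.
ns_per_day_per_core = kraken_ns_per_day / kraken_cores
ns_per_day_per_gpu = gpu_ns_per_day / gpu_count

# How many Kraken cores it takes to match one M2050 on this benchmark.
cores_per_gpu = ns_per_day_per_gpu / ns_per_day_per_core

print(f"per Opteron core: {ns_per_day_per_core:.3f} ns/day")
print(f"per M2050 GPU:    {ns_per_day_per_gpu:.3f} ns/day")
print(f"one GPU is worth roughly {cores_per_gpu:.1f} cores on this test")
```

On those figures, each M2050 does the work of roughly 27 Opteron cores on this one benchmark.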
Bootnote: Although Sumit Gupta, senior product manager of the Tesla line, told El Reg that Nvidia was not going to talk roadmaps, Dan Olds, our HPC compatriot, said that Jen-Hsun Huang, president and CEO at Nvidia, pulled rank and decided to give out some details on future GPUs. We'll let you know as soon as we find out the details. ®