Nvidia shows off superjuiced Kepler GPU
From workhorse to racehorse
HPC blog There were quite a few surprises in today’s GTC12 keynote by NVIDIA CEO and co-founder Jen-Hsun Huang.
If NVIDIA were just introducing a new and faster rev of its latest GPU processor, one that brings three times the performance without breaking the bank on energy usage, that would be a solid win, and in line with expectations. But there was more to this announcement – much more. Our buddy TPM gives the down-and-dirty details on Kepler here.
I’m not sure this is exactly the right analogy, but to me, what NVIDIA has done with Kepler is transform the GPU from a simple task-worker into a much more productive member of the knowledge working class. But Kepler isn’t a paper-shuffling, PowerPoint-wielding MBA type: it still has a solid work ethic, outperforming predecessor Fermi by more than three times. More important than performance, however, is Kepler’s sophistication in processing work.
The features I’m talking about below apply to the Kepler K20, the GK110-based behemoth that’s due in Q4 2012. You’ll also need to be using the new version of CUDA, since it contains the instructions to take advantage of these new capabilities.
The first new feature is something called Hyper-Q. With Hyper-Q, a Kepler GPU can now accept work from up to 32 CPU cores simultaneously. Before Hyper-Q, only one CPU core at a time could dispatch work to the GPU, which meant that there were long stretches of time when the GPU would sit idle while waiting for more tasks from whichever CPU core it was working with.
With Hyper-Q, the GPU is now a full-fledged team player in the system, able to accept work from many cores at the same time. This will drive GPU utilisation up, of course, but it will also push CPU utilisation higher, as more CPU cores at a time can dispatch work to the GPU and receive results from it.
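For the programmers in the audience, here is a rough sketch of what "many cores feeding one GPU" looks like in CUDA code. It is my own illustration, not NVIDIA's: the `scale` kernel, the `cpu_worker` helper and the `submit_from_cpu_threads` function are made-up names, and the sketch assumes a single GPU and an array size greater than zero. Each CPU thread gets its own CUDA stream and queues a copy, a kernel and a copy back on it. The same code should run on older GPUs too; as I understand it, the difference with Hyper-Q is that the hardware no longer funnels those streams through a single work queue.

```cuda
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Trivial stand-in workload: multiply every element by a factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Work done by one CPU thread: its own stream, its own buffer, its own kernel.
// Writes 1 to *ok if the data came back as expected, 0 otherwise.
static void cpu_worker(int id, int n, int *ok) {
    *ok = 0;
    const size_t bytes = n * sizeof(float);
    const float factor = static_cast<float>(id + 1);
    std::vector<float> host(n, 1.0f);

    cudaStream_t stream;
    float *dev = nullptr;
    if (cudaStreamCreate(&stream) != cudaSuccess) return;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) {
        cudaStreamDestroy(stream);
        return;
    }

    // Everything below is queued on this thread's private stream.
    cudaMemcpyAsync(dev, host.data(), bytes, cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(dev, n, factor);
    cudaMemcpyAsync(host.data(), dev, bytes, cudaMemcpyDeviceToHost, stream);
    bool good = cudaStreamSynchronize(stream) == cudaSuccess;

    // Every element started at 1.0, so it should now equal the factor.
    for (int i = 0; good && i < n; ++i) good = (host[i] == factor);

    cudaFree(dev);
    cudaStreamDestroy(stream);
    *ok = good ? 1 : 0;
}

// Starts num_threads CPU threads that all submit work to the same GPU.
// Returns how many of them got correct results back.
int submit_from_cpu_threads(int num_threads, int n) {
    std::vector<int> ok(num_threads, 0);
    std::vector<std::thread> pool;
    for (int id = 0; id < num_threads; ++id)
        pool.emplace_back(cpu_worker, id, n, &ok[id]);
    for (std::thread &t : pool) t.join();

    int passed = 0;
    for (int v : ok) passed += v;
    return passed;
}
```

Calling `submit_from_cpu_threads(32, 1 << 20)` from `main` would mimic the 32-core case. Whether the 32 kernels actually overlap on the GPU depends on the hardware and on how much of the chip each kernel occupies, so treat this as a picture of the submission pattern rather than a benchmark.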
The next new wrinkle is something called Dynamic Parallelism, a feature that will also serve to radically increase overall processing speed and system utilisation while reducing programming time and complexity.
Today, without Dynamic Parallelism, GPUs are very fast, but they’re limited in what they can do on their own. Lots of routines are recursive or data-dependent, meaning that the results from one set of steps or calculations dictate what happens in the next set of steps or calculations. GPUs can run through these calculations very fast, but then they have to ship the results out to the CPU and wait for further instructions. The CPU then evaluates the results and gives the GPUs another set of tasks to do – perhaps run the same calculations with new data or different assumptions.
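To make that back-and-forth concrete, here is a minimal sketch of a CPU-driven loop in CUDA. The workload is a toy I made up for illustration: keep halving every value in an array until none is above a tolerance. The names `halve_and_flag` and `relax_host_driven` are hypothetical. The point is the shape of the loop: after every round the CPU has to copy a flag back from the GPU before it can decide whether to launch another round.

```cuda
#include <cuda_runtime.h>

// One round of the toy workload: halve each element and raise a flag
// if any element is still above the tolerance.
__global__ void halve_and_flag(float *x, int n, float tol, int *any_above) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] *= 0.5f;
        if (x[i] > tol) *any_above = 1;  // all writers store the same value
    }
}

// The CPU drives the loop. Returns the number of rounds run, or -1 on error.
// On success, host[] holds the final values.
int relax_host_driven(float *host, int n, float tol, int max_rounds) {
    if (n < 1) return -1;
    const size_t bytes = n * sizeof(float);
    float *x = nullptr;
    int *flag = nullptr;
    int rounds = -1;

    if (cudaMalloc(&x, bytes) == cudaSuccess &&
        cudaMalloc(&flag, sizeof(int)) == cudaSuccess) {
        cudaMemcpy(x, host, bytes, cudaMemcpyHostToDevice);

        int again = 1;
        rounds = 0;
        while (again && rounds < max_rounds) {
            cudaMemset(flag, 0, sizeof(int));
            halve_and_flag<<<(n + 255) / 256, 256>>>(x, n, tol, flag);
            // Round trip: the CPU waits here for the GPU, then decides.
            cudaMemcpy(&again, flag, sizeof(int), cudaMemcpyDeviceToHost);
            ++rounds;
        }

        if (cudaMemcpy(host, x, bytes, cudaMemcpyDeviceToHost) != cudaSuccess)
            rounds = -1;
    }
    cudaFree(x);     // cudaFree on a null pointer is a no-op
    cudaFree(flag);
    return rounds;
}
```

With an array holding the values 1 to 256 and a tolerance of 1.0, the loop runs eight rounds, which means eight separate trips between CPU and GPU for what is really one job.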
But with Dynamic Parallelism, a GPU can now run recursive loops on its own – no need to run back to the CPU for instructions. Kepler can run almost limitless loops, cranking through calculation after calculation using thousands of cores. It can launch new kernels and new streams of work without having to depend on the CPU to give it directions.
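Here is a minimal sketch of a GPU-driven loop, written in CUDA under a few assumptions of my own. The workload is a toy: keep halving every value in an array until none is above a tolerance. The `relax` kernel, the `relax_device_driven` wrapper and the `MAX_DEPTH` cap are hypothetical names, and the sketch is limited to arrays that fit in one thread block (1,024 elements at most). The kernel does a round of work, takes a block-wide vote on whether another round is needed, and if so launches the next round itself. Building it needs a GPU of compute capability 3.5 or later and relocatable device code, along the lines of `nvcc -rdc=true file.cu -lcudadevrt`.

```cuda
#include <cuda_runtime.h>

#define MAX_DEPTH 20  // hypothetical cap, kept under the runtime's nesting limit

// One block; each thread owns one element. Halve it, vote on whether anyone
// is still above tol, and if so let thread 0 launch the next round.
__global__ void relax(float *x, int n, float tol, int *rounds, int depth) {
    int i = threadIdx.x;
    int above = 0;
    if (i < n) {
        x[i] *= 0.5f;
        above = x[i] > tol;
    }
    // Barrier plus block-wide OR: non-zero if any thread still has work.
    int again = __syncthreads_or(above);

    if (threadIdx.x == 0) {
        *rounds = depth + 1;
        if (again && depth + 1 < MAX_DEPTH)
            relax<<<1, blockDim.x>>>(x, n, tol, rounds, depth + 1);  // GPU launches GPU work
    }
}

// The CPU launches once and waits once. Returns the number of rounds the GPU
// ran (capped at MAX_DEPTH), or -1 on error. On success, host[] holds the
// final values.
int relax_device_driven(float *host, int n, float tol) {
    if (n < 1 || n > 1024) return -1;
    const size_t bytes = n * sizeof(float);
    float *x = nullptr;
    int *rounds = nullptr;
    int result = -1;

    if (cudaMalloc(&x, bytes) == cudaSuccess &&
        cudaMalloc(&rounds, sizeof(int)) == cudaSuccess) {
        cudaMemcpy(x, host, bytes, cudaMemcpyHostToDevice);
        relax<<<1, n>>>(x, n, tol, rounds, 0);
        // The first kernel only counts as finished once its children are done.
        if (cudaDeviceSynchronize() == cudaSuccess) {
            cudaMemcpy(host, x, bytes, cudaMemcpyDeviceToHost);
            cudaMemcpy(&result, rounds, sizeof(int), cudaMemcpyDeviceToHost);
        }
    }
    cudaFree(x);       // cudaFree on a null pointer is a no-op
    cudaFree(rounds);
    return result;
}
```

With an array holding the values 1 to 256 and a tolerance of 1.0, the GPU runs eight rounds while the CPU makes a single launch and a single wait. Real workloads would spread across many blocks and need a more careful convergence test, so this only shows the shape of the idea.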
Taking advantage of Dynamic Parallelism will obviously result in higher efficiency and utilisation as highly parallelised work is performed on speedy GPUs, leaving CPUs either free to perform other work or to simply stand quietly off to one side.
I’m not a programmer by any stretch of the imagination – that’s probably obvious. But from what little experience I have, buttressed by conversations with real programmers, it’s clear that using Dynamic Parallelism will also make the CUDA programmer’s job much easier. According to NVIDIA, programming jobs that used to take 300 steps can now be accomplished with as few as 20, because programmers no longer have to code all of the back-and-forth traffic between CPUs and GPUs.
Just Hyper-Q and Dynamic Parallelism on their own are pretty big steps in the evolution of the GPU and hybrid computing. With the addition of these two features, the GPU is now able to be shared by an entire system, rather than just a single core, and it’s able to generate its own workload and complete much more of that workload without needing to be led through it by a slower, general-purpose CPU.
Before the Kepler K20, the CPU’s role in a hybrid system was mainly as a traffic cop – responsible for sending traffic (data and tasks) to the GPU and accepting the results. With Kepler and its advanced feature set, the GPU can now work for 32 different cops at the same time and manage a larger part of the overall job on its own. This gives the cops more time to handle other tasks, write some parking tickets, or just pull their hats down over their eyes and catch a nap. ®