Petaflops beater: Nvidia chief talks exascale
Programming for parallel processes
"Power is now the limiter of every computing platform, from cellphones to PCs and even data centres," said NVIDIA chief executive Jen-Hsun Huang, speaking at the company's GPU Technology Conference in Beijing last week. There was much talk there about the path to exascale, a form of supercomputing that can execute 10^18 flop/s (floating point operations per second).
Currently, the world's fastest supercomputer, Japan's K computer, achieves 10 petaflops (one petaflop = a thousand trillion floating point operations per second), just 1 per cent of exascale. The K computer consumes 12.66MW (megawatts), and Huang suggests that a realistic limit for a supercomputer is 20MW, which is why achieving exascale is a matter of power efficiency as well as size. At the other end of the scale, power efficiency determines whether your smartphone or tablet will last the day without a recharge, making this a key issue for everyone.
Huang's thesis is that the CPU, which is optimised for single-threaded execution, will not deliver the required efficiency. "With four cores, in order to execute an operation, a floating point add or a floating point multiply, 50 times more energy is dedicated to the scheduling of that operation than the operation itself," he says.
Power limits: NVIDIA chief executive Jen-Hsun Huang
"We believe the right approach is to use much more energy-efficient processors. Using much simpler processors and many of them, we can optimise for throughput. The unfortunate part is that this processor would no longer be good for single-threaded applications. By adding the two processors, the sequential code can run on the CPU, the parallel code can run on the GPU, and as a result you can get the benefit of the both. We call it heterogeneous computing."
He would say that. NVIDIA makes GPUs after all. But the message is being heard in the supercomputing world, where 39 of the top 500 use GPUs, up from 17 a year ago, and including the number 2 supercomputer: Tianhe-1A in China. Thirty-five of those 39 systems use NVIDIA GPUs.
At a mere 2.57 petaflops though, Tianhe-1A is well behind the K computer, which does not use GPUs. Does that undermine Huang's thesis? "If you were to design the K computer with heterogeneous architecture, it would be even more," he insists. "At the time the K computer was conceived, almost 10 years ago, heterogeneous was not very popular."
Using GPUs for purposes other than driving a display is only practical because of changes made to the architecture to support general-purpose programming. NVIDIA's system is called CUDA and is programmed using CUDA C/C++. The latest CUDA compiler is based on LLVM, which makes it easier to add support for other languages. In addition, the company has just announced that it will release the compiler source code to researchers and tool vendors. "It's open source enough that anybody who would like to develop their target compiler can do it," says Huang.
Another strand to programming the GPU is OpenACC, a set of directives you can add to C code that tell the compiler to transform it to parallelised code that runs on the GPU when available. "We've made it almost trivial for people with legacy applications that have large parallel loops to use directives to get a huge speedup," claims Huang.
OpenACC is not yet implemented, though it is based on an existing product from the Portland Group called PGI Accelerator. Cray and CAPS also plan to have OpenACC support in their compilers. These compilers will require NVIDIA GPUs to get the full benefit, though OpenACC is a standard that others could implement. There is a programming standard called OpenCL that is already supported by multiple GPU vendors, but it is lower level and therefore less productive than CUDA or OpenACC.
The distinction between driving a display and general-purpose programming is blurring. As game visuals become more advanced, more of the code is devoted to simulating real-world physics. "The combination of simulation and visualisation is going to transform how people enjoy games," Huang says.
In the same way, designers and engineers with workstations can use GPU accelerators to render accurate simulations of their designs. NVIDIA Maximus uses two GPUs, one from its Tesla line for general purpose programming and the other a Quadro for the display. "Now the workstation is completely changed because it can combine the workflow of two parts of the design, the design part, and the simulation part," claims Huang.
Huang is looking forward to Windows on ARM. He talks about the Asus Transformer tablet and its long battery life, and then says: "Imagine Windows on ARM on that device, and next-generation versions of that device. It's a foregone conclusion that the PC industry will be revolutionised. I'm anxious to see Windows on ARM come to market and I think Microsoft is going to be very successful with it."
There are a few clouds on NVIDIA's horizon. One is that ARM, which dominates the world of mobile CPUs, is now also designing mobile GPUs, under the brand Mali. That could undermine NVIDIA's business in Tegra, an SoC (System on a Chip) which combines an ARM CPU with an NVIDIA GPU. Huang does his best to dismiss Mali as having only "basic capabilities". He adds, "We have to continue to find our value-add, if we don't then we don't have a role in the world."
Huang will not be drawn on the subject of Kepler, his company's next-generation GPU family, which seems to be delayed, though only in a notional sense, since no date has been announced.
The Intel issue
There is also Intel to think about. Intel's multi-core evangelist James Reinders says its forthcoming "Knights Corner" MIC (Many Integrated Core) processor will solve the efficiency issues Huang describes. "Knights Corner is superior to any general-purpose GPU type solution for two reasons," Reinders tells us.
"We don't have the extra power-sucking silicon wasted on graphics functionality when all we want to do is compute in a power efficient manner, and - second - we can dedicate our design to being highly programmable because we aren't a GPU - we're an x86 core, a Pentium-like core for 'in order' power efficiency - every algorithm that can run on GPGPUs will certainly be able to run on a MIC co-processor."
"MIC used to be a GPU," says Huang when asked about Intel's co-processor. "MIC is Larrabee 3, and Larrabee 1 was a GPU. So there is no difference, except of course that we care very much about GPU computing, and we believe this is going to be the way that high performance computing is performed."
NVIDIA's other advantage? CUDA is available now. ®