The Register® — Biting the hand that feeds IT

Feeds

Petaflops beater: Nvidia chief talks exascale

Programming for parallel processes

SaaS data loss: The problem you didn’t know you had

"Power is now the limiter of every computing platform, from cellphones to PCs and even data centres," said NVIDIA chief executive Jen-Hsun Huang, speaking at the company's GPU Technology Conference in Beijing last week. There was much talk there about the path to exascale, a form of supercomputing that can execute 1018 flop/s (Floating Point Operations per Second).

Currently, the world's fastest supercomputer, Japan's K computer, achieves 10 petaflops (one petaflop = a thousand trillion floating point operations per second), just 1 per cent of exascale. The K computer consumes 12.66MW (megawatts), and Huang suggests that a realistic limit for a supercomputer is 20MW, which is why achieving exascale is a matter of power efficiency as well as size. At the other end of the scale, power efficiency determines whether your smartphone or tablet will last the day without a recharge, making this a key issue for everyone.

Huang's thesis is that the CPU, which is optimised for single-threaded execution, will not deliver the required efficiency. "With four cores, in order to execute an operation, a floating point add or a floating point multiply, 50 times more energy is dedicated to the scheduling of that operation than the operation itself," he says.

Jen-Hsun Huang, photo Tim Anderson

Power limits: NVIDIA chief executive Jen-Hsun Huang

"We believe the right approach is to use much more energy-efficient processors. Using much simpler processors and many of them, we can optimise for throughput. The unfortunate part is that this processor would no longer be good for single-threaded applications. By adding the two processors, the sequential code can run on the CPU, the parallel code can run on the GPU, and as a result you can get the benefit of the both. We call it heterogeneous computing."

He would say that. NVIDIA makes GPUs after all. But the message is being heard in the supercomputing world, where 39 of the top 500 use GPUs, up from 17 a year ago, and including the number 2 supercomputer: Tianhe-1A in China. Thirty-five of those 39 GPUs are from NVIDIA.

At a mere 2.57 petaflops though, Tianhe-1A is well behind the K computer, which does not use GPUs. Does that undermine Huang's thesis? "If you were to design the K computer with heterogeneous architecture, it would be even more," he insists. "At the time the K computer was conceived, almost 10 years ago, heterogeneous was not very popular."

Using GPUs for purposes other than driving a display is only practical because of changes made to the architecture to support general-purpose programming. NVIDIA's system is called CUDA and is programmed using CUDA C/C++. The latest CUDA compiler is based on LLVM, which makes it easier to add support for other languages. In addition, the company has just announced that it will release the compiler source code to researchers and tool vendors. "It's open source enough that anybody who would like to develop their target compiler can do it," says Huang.

Another strand to programming the GPU is OpenACC, a set of directives you can add to C code that tell the compiler to transform it to parallelised code that runs on the GPU when available. "We've made it almost trivial for people with legacy applications that have large parallel loops to use directives to get a huge speedup," claims Huang.

OpenACC is not yet implemented, though it is based on an existing product from the Portland Group called PGI Accelerator. Cray and CAPS also plan to have OpenACC support in their compilers. These will require NVIDIA GPUs to get the full benefit, though it is a standard that others could implement. There is a programming standard called OpenCL that is already supported by multiple GPU vendors, but it is lower level and therefore less productive than CUDA or OpenACC.

Steps to Take Before Choosing a Business Continuity Partner

Next page: Blurred lines

Latest Comments

Asynchronous computing.

There are a few big problems with asynchronous computing. One of then is instruction coordination and dependency. A lot of computing tasks in a CPU are interdependent so have to wait on the state of different units. In synchronous units, timings are better known so processors can be tuned so that dependent results arrive at predictable times, reducing the need for gatekeeping.

The second and more fundamental issue goes to uncertainty. Without the clock ruling the processor, there must still exist some form of coordination between the parts of a processor to determine who gets what first and so on. Otherwise, you can end up in metastable states which can produce dangerous uncertainties.

0
0

Good point...

Until we can get processors with freely scaling speed, it would be difficult indeed to combine them - especially in terms of power. It's not worth it, otherwise.

Though I could see it working - turn off all the low speed/high parallel processors, turn on the high speed/low parallel, process the serial command quickly, then flip it back the other way - it wouldn't be worth the effort to put it on one chip, rather than two distinct chips.

0
0

Dual CPU GPU

@ArmanX tricky to do with big complex x86 type processors. One problem is that they are synchronous - all the billions of transistors clock together which uses a huge amount of power. GPUs typically run at a much lower clock speed.

Designing CPUs where different parts run at different speeds, or free run has been a series of failures for at least the last 25years. One possible approach is to add a small linear CPU to the GPU to do the boring housekeeping stuff - but probably an ARM rather than x86

0
0

More from The Register

SCO vs. IBM battle resumes over ownership of Unix
Zombie lawsuit back and wants to suck the brains out of Linux
 breaking news
You don't need phone lines or cable for ANYTHING, says Dish
The satellite-dish man can sort you out with phone and broadband over the air too
 breaking news
What's HP got under wraps? Looks awfully flash and tape shaped
What happens in Vegas won't stay there - we've got the details
Microsoft borks botnet takedown in Citadel snafu
Stupid Redmond kicked over our honeypots, wail white hats
IBM's $1bn layoffs latest: Now axe swings in US, Canada - reports
Union claims 121 storage bods canned after dismal sales
NetApp musters muscular cluster bluster for ONTAP busters
Storage array OS overhauled to juggle more nodes, go down on you, er, less
HP adds 'Haswell' Xeon E3s to entry ProLiant servers
Gussies up MicroServer for SMBs, adds baby switches