Original URL: http://www.theregister.co.uk/2013/03/20/the_future_of_cuda/

Nvidia and ARM: It's a parallel, parallel, parallel world

Big changes coming to the CUDA programming model

By Rik Myslewski

Posted in Developer, 20th March 2013 21:34 GMT

GTC 2013 Nvidia envisions a future in which ARM processors and the GPU-maker's CUDA parallel-computing platform and programming model will work together in perfect harmony, and the company has a raft of planned CUDA enhancements to not only make that coexistence seamless, but to enhance that programming environment for discrete GPUs, as well.

"If we look five years out, we expect that ARM will be a very important platform for CUDA," Nvidia's chief technologist for GPU computing software, Mark Harris, told his audience on Tuesday at the GPU Technology Conference in San José, California.

Today, there's no SoC that combines ARM compute cores with a CUDA-enabled GPU, but that's about to change. Nvidia's next Tegra processor, code-named "Logan", will incorporate CUDA 5 support when it hits full production early next year, and its follow-on, "Parker", will upgrade that capability in a processor that Nvidia president and CEO Jen-Hsun Huang promises will have 100 times the performance of 2011's Tegra 2.

Harris and his team's work on CUDA aims to smooth the notoriously difficult challenge of parallel programming.

One of the first items on his list, as The Reg has reported in detail, is Nvidia's work with Continuum Analytics to create the NumbaPro Python-to-GPU compiler. Why Python? Well, Harris said, it's not only an exceptionally popular language, but coding in it is productive, interactive, and "even fun."

But there's more to CUDA's future than ARM and Python. For one, the CUDA development team plans compiler improvements such as just-in-time (JIT) compilation and linking of device code. "This will enable you to specialize code and even generate code on the fly," he said.
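
To give a flavor of what "specialize code and even generate code on the fly" can mean, here is a minimal sketch in plain Python running on the CPU. It is not CUDA's JIT interface, and the names `make_saxpy` and `saxpy` are hypothetical, chosen purely for illustration. The idea shown is runtime specialization: a value known only at run time is baked into freshly generated source, which is then compiled and called.

```python
# Illustrative only: plain Python on the CPU, not CUDA's JIT API.
# make_saxpy and saxpy are hypothetical names used for this sketch.

def make_saxpy(a):
    """Generate and compile a SAXPY routine with the scalar `a` baked in."""
    # Build source text with the runtime value of `a` written in as a literal.
    source = (
        "def saxpy(x, y):\n"
        f"    return [{a!r} * xi + yi for xi, yi in zip(x, y)]\n"
    )
    namespace = {}
    # Compile the generated source and run it to define saxpy in `namespace`.
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["saxpy"]


# Specialize for a = 2.0, then call the generated function.
saxpy2 = make_saxpy(2.0)
result = saxpy2([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(result)
```

A GPU toolchain would do the equivalent with device code, where baking in constants can let the compiler unroll loops or drop branches, but the mechanics above are only an analogy.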

Support for C++ 11 is also in the cards, plus what Harris characterized as "really fast" sparse solvers and the addition of multi-GPU support to some CUDA libraries "where it makes sense."

C++ 11 will not, of course, be the last iteration of that language, and Harris said he hopes accelerators will figure in the next version, scheduled for around 2017: "We would like for accelerators such as GPUs to be a core part of programming in C++." Towards that goal, Nvidia is working with others to include a library of parallel algorithms in that version of C++.

"And of course," he said, "we're always improving the development tools" for CUDA, such as adding "step-by-step guidance to finding developments in your application" to the next generations of Nvidia's Visual Profiler and Nsight Eclipse Edition. Harris said that these additions would present developers with information about the bottlenecks in their applications in a more-visual way.

Currently, CUDA developers need to use a separate GPU to run their display from the GPU on which they're debugging their application. "In the future," Harris said, "we'll be lowering that restriction and enabling single-GPU debug."

Those improvements, he said, will come relatively soon, but a number of larger challenges remain. As heterogeneous computing becomes more prevalent, for example, it becomes increasingly important to control the locality of data – keeping it near the core or cores that are working with it. The challenge there, he said, is to add that capability "without getting in the way, without making development more difficult."

Nvidia may be able to tackle that challenge on its own, Harris said, but a bit further down the road it's going to become increasingly important for operating systems to support what he described as "hybrid computer architectures." To that end, he said, Nvidia is working with the developers of Windows, Linux, and OS X.

There's also work to be done with compiler developers. "Obviously," he said, "compiling code for these interesting hybrid architectures is essential."

Harris believes that hybrid parallel computing will become ubiquitous. All processors being designed today are parallel in some form or another, he said, because what he referred to as "the power wall" has pushed processor designers to improve performance by adding parallelism rather than by simply cranking up clocks.

"In the future," he said, "all programmers should be parallel programmers, or should be able to at least create parallel programs" – a subtle but important distinction, and one that compiler developers can help to define.

That aforementioned power wall, along with the increasing number of transistors in modern processors that need to be power-managed, will cause programmers to keep their eyes on the power prize. In the future, Harris said, programmers will need to tune their apps not just for performance, but for performance-per-watt as well. Tools and programming models need to be created to give developers that ability.
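
The difference between tuning for performance and tuning for performance-per-watt can be shown with a little arithmetic. The sketch below is Python, and every figure in it is made up for illustration: the configuration names, throughput numbers, and power draws are hypothetical and do not come from Harris's talk or from any real hardware.

```python
# Hypothetical numbers throughout: not measurements of any real device.

def perf_per_watt(gflops, watts):
    """Return throughput per unit of power, in GFLOPS per watt."""
    if watts <= 0:
        raise ValueError("power draw must be positive")
    return gflops / watts


# Two made-up tunings of the same application: (GFLOPS, watts).
configs = {
    "high_clock": (1000.0, 250.0),
    "low_clock": (800.0, 160.0),
}

# Rank once by raw throughput and once by efficiency.
fastest = max(configs, key=lambda name: configs[name][0])
most_efficient = max(configs, key=lambda name: perf_per_watt(*configs[name]))

print(fastest, most_efficient)
```

With these invented figures the high-clock tuning wins on raw throughput, while the low-clock tuning delivers more work per watt, so the two goals can pull in different directions.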

"There's a lot of really big challenges in this," Harris told his audience, "so those of you that are researchers, there's a lot of hard problems to solve here, so lots of great research areas and directions."

Harris wrapped up his talk with his vision of an ARM-heavy future, one in which, he said, "it will be very common to be programming on the CUDA platform on ARM processors, on ARM systems across various industries and architectures."

And on discrete GPUs, as well. "Of course, we're Nvidia, so we envision GPUs everywhere," Harris said. ®