Nvidia revs up CUDA GPU coder toolkit
Unified CPU and GPU memories
Nvidia has staked a large part of its future on the idea that GPUs and their massively parallel architectures can replace CPUs for a big chunk of computational jobs. But parallel programming on one device is tough, across two incompatible devices is very difficult, and across clusters of hybrid machines can be very tricky indeed. That's why Nvidia's CUDA parallel programming environment is probably as important as any chip or Tesla GPU co-processor that Nvidia will ever ship.
Nvidia figured out even before the first Tesla GPU co-processors came to market a few years back that programming tools were going to be the lubricant that got the Teslas moving for HPC and commercial applications. And to its credit, Nvidia has put a lot of work into making hybrid and parallel programming easier than it has been historically. Luckily, some of the techniques that Nvidia has come up with to parallelize C and C++ code across the many cores in a GPU can also be applied to the increasingly cored and threaded processors on the market; ditto for the work that The Portland Group has done to get its Fortran compilers to spread their work over GPU and CPU cores.
But the GPU nirvana of transparent and automatic parallelization and optimization of codes running across ceepie-geepie gear is not yet in sight, although Nvidia is taking a few steps closer to it with the CUDA 4.0 release. The software has three key new features: GPUDirect 2.0, unified virtual addressing, and support for the Thrust C++ parallel libraries.
One of the big problems with using CPU-GPU hybrids is that the CPU is in control of everything on the system while the GPU co-processors just hang off the PCI-Express buses, waiting for data to chew on and spit back out to the bus. With GPUDirect 1.0, Nvidia worked with InfiniBand networking adapter makers Mellanox and QLogic to allow GPU data to be copied out to system main memory. That way, when a GPU in one server needed data from another server in the cluster, it could go out over the PCI-Express bus and up through the chipset and the CPU's memory controller to access that data right there, rather than taking the extra steps of sending a request to the GPU on the second machine and waiting for the data to come back through the CPU stack again.
This simple change boosted network communication performance by around 30 per cent, according to Nvidia.
With GPUDirect 2.0, which is embedded in the CUDA 4.0 toolkit, a GPU co-processor in the system has a new driver stack that allows it to talk directly to another GPU on the same system over the PCI-Express bus, getting the system chipset, CPU memory controller, and system main memory out of the loop entirely.
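To make the peer-to-peer idea concrete, here is a minimal sketch of a GPU-to-GPU copy using the CUDA runtime's peer calls (`cudaDeviceCanAccessPeer`, `cudaDeviceEnablePeerAccess`, `cudaMemcpyPeer`). The helper name `peer_copy_mismatches`, the device numbers 0 and 1, and the buffer size are assumptions made for illustration, and whether a given copy really bypasses system memory depends on the hardware and driver.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: copies n floats from GPU `src` to GPU `dst`, reads them
// back on the host, and returns the number of mismatched elements.
// Returns -1 if the devices are not available or a runtime call fails.
int peer_copy_mismatches(int src, int dst, size_t n) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;
    if (src < 0 || dst < 0 || src >= count || dst >= count || src == dst) return -1;

    const size_t bytes = n * sizeof(float);
    std::vector<float> host_in(n), host_out(n, -1.0f);
    for (size_t i = 0; i < n; ++i) host_in[i] = static_cast<float>(i);

    float *d_src = nullptr, *d_dst = nullptr;
    bool ok = true;

    // Stage the test data on the source GPU.
    ok = ok && cudaSetDevice(src) == cudaSuccess;
    ok = ok && cudaMalloc(&d_src, bytes) == cudaSuccess;
    ok = ok && cudaMemcpy(d_src, host_in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    // Allocate on the destination GPU and, where the hardware allows it,
    // let that GPU address the source GPU's memory directly.
    ok = ok && cudaSetDevice(dst) == cudaSuccess;
    ok = ok && cudaMalloc(&d_dst, bytes) == cudaSuccess;
    int can_access = 0;
    if (ok && cudaDeviceCanAccessPeer(&can_access, dst, src) == cudaSuccess && can_access) {
        cudaDeviceEnablePeerAccess(src, 0);  // flags must be 0
    }

    // GPU-to-GPU copy. Without peer access the runtime can still perform the
    // copy, typically by staging it through host memory.
    ok = ok && cudaMemcpyPeer(d_dst, dst, d_src, src, bytes) == cudaSuccess;
    ok = ok && cudaMemcpy(host_out.data(), d_dst, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    if (d_dst) { cudaSetDevice(dst); cudaFree(d_dst); }
    if (d_src) { cudaSetDevice(src); cudaFree(d_src); }
    if (!ok) return -1;

    int bad = 0;
    for (size_t i = 0; i < n; ++i) {
        if (host_out[i] != host_in[i]) ++bad;
    }
    return bad;
}

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    if (count < 2) {
        std::printf("This sketch needs at least two GPUs; found %d\n", count);
        return 0;
    }
    int bad = peer_copy_mismatches(0, 1, 1 << 20);
    std::printf("peer copy finished with %d mismatches\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Built with `nvcc`, the program exits quietly on a machine with fewer than two GPUs. The host never touches the payload between the two device buffers; it only seeds the source and verifies the destination.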
In a future CUDA release, says Sumit Gupta, senior product manager of the Tesla line, the GPUDirect software will be tweaked so GPUs on different servers within a cluster of machines can directly access information from each other over InfiniBand links, without involving the CPUs or staging copies of the data in system main memory. So there will be peer-to-peer communication between GPUs over the PCI-Express bus within a system as well as between GPUs linked to each other over the InfiniBand links that lash together multiple servers. (Which once again raises the question, what will you need the CPU for? Oh, right, the operating system that holds the C, C++, or Fortran code.)
The Message Passing Interface (MPI) protocol commonly used for clustering x64-based servers together into parallel machines is not able to use the GPUDirect 2.0 functionality yet, according to Gupta, but in the CUDA 4.0 release functionality similar to GPUDirect 1.0 allows for data inside of GPUs to be moved to system memory and be available for MPI collective operations. Modified versions of MPI, such as OpenMPI, can move data from and to the GPU memory over InfiniBand when applications do an MPI send or receive operation.
Another neat feature of the CUDA 4.0 environment is called unified virtual addressing, and it takes the memory space of the system and the memory spaces of the multiple GPUs in the machine and maps them as a single unified address space. Developing applications with prior CUDA toolkits required programmers to maintain pointers to CPU and GPU memories in their code, but now they won't have to do that. CUDA will keep track of what data is stored where. According to Gupta, programs that were written with these pointers will continue to work, but coders working to port applications to CPU-GPU hybrids will now have less work to do. Gupta says that the use of the unified memory scheme will not have an adverse effect on performance.
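As a rough sketch of what a single address space means in code: with unified virtual addressing, `cudaMemcpy` can be given `cudaMemcpyDefault` and the runtime works out from the pointer values which side each buffer lives on. The helper name `uva_roundtrip_mismatches`, the use of device 0, and the buffer size are assumptions for this example, and the sketch gives up if the device does not report `unifiedAddressing`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: sends n floats host -> GPU -> host without ever telling
// cudaMemcpy which direction the copy goes. Returns the number of mismatched
// elements, or -1 if no UVA-capable GPU is available or a call fails.
int uva_roundtrip_mismatches(size_t n) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return -1;

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess || !prop.unifiedAddressing) return -1;
    if (cudaSetDevice(0) != cudaSuccess) return -1;

    const size_t bytes = n * sizeof(float);
    float *h_in = nullptr, *h_out = nullptr, *d_buf = nullptr;
    bool ok = true;

    // Pinned host buffers and the device buffer all live in one address space.
    ok = ok && cudaMallocHost((void**)&h_in, bytes) == cudaSuccess;
    ok = ok && cudaMallocHost((void**)&h_out, bytes) == cudaSuccess;
    ok = ok && cudaMalloc((void**)&d_buf, bytes) == cudaSuccess;

    if (ok) {
        for (size_t i = 0; i < n; ++i) { h_in[i] = 0.5f * i; h_out[i] = -1.0f; }
    }

    // cudaMemcpyDefault: the direction is inferred from the pointers themselves.
    ok = ok && cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyDefault) == cudaSuccess;
    ok = ok && cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDefault) == cudaSuccess;

    int bad = -1;
    if (ok) {
        bad = 0;
        for (size_t i = 0; i < n; ++i) {
            if (h_out[i] != h_in[i]) ++bad;
        }
    }

    if (d_buf) cudaFree(d_buf);
    if (h_out) cudaFreeHost(h_out);
    if (h_in) cudaFreeHost(h_in);
    return bad;
}

int main() {
    int bad = uva_roundtrip_mismatches(1 << 20);
    if (bad < 0) {
        std::printf("No UVA-capable GPU available, or a runtime call failed\n");
        return 0;
    }
    std::printf("round trip finished with %d mismatches\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Code that spells out `cudaMemcpyHostToDevice` and `cudaMemcpyDeviceToHost` keeps working as before, which is consistent with Gupta's point that existing programs do not need to change.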
"This unified virtual addressing is a step towards Denver, which will have a single address space anyway," explains Gupta, referring to the hybrid ARM CPU-Maxwell GPU chip that Nvidia said it was working on back in January.
The Maxwell GPUs are expected to offer 16 times the gigaflops per watt of the current Fermi GPUs when they are delivered in 2013, and some of them will have one or more ARM processors on them so they can be used in servers and PCs. (My guess is Nvidia will do a quad-core ARM chip, but the company has provided few details on its CPU plans.)
The other big feature in CUDA 4.0 is the Thrust C++ library, which is an open source project that Nvidia has been contributing to. The Thrust library is similar to the Standard Template Library (STL) for C++, except it has been tweaked for parallel algorithms and data structures and, in the case of CUDA, to work on GPUs as well as CPUs. With the enhancements in CUDA 4.0, the toolkit will analyze the code and automatically divvy up work between the CPUs and GPUs to get the fastest code path and the best performance for an algorithm. In many cases, of course, the GPU will give the best oomph, such as with sorting algorithms. Gupta says that the Thrust library for C++ running on a GPU can do parallel sorting anywhere from 5 to 100 times faster than the C++ STL library.
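To show how close Thrust stays to STL style, here is a minimal sketch of a GPU sort. The helper name `gpu_sort` and the random test data are made up for this example; `thrust::device_vector`, `thrust::sort`, and `thrust::copy` are the library's own names. In this sketch the container decides where the work runs: the data is moved into a `device_vector`, so the sort executes on the GPU.

```cuda
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Hypothetical helper: sorts a copy of `input` on the GPU and returns it.
std::vector<int> gpu_sort(const std::vector<int>& input) {
    // Constructing a device_vector from host iterators copies the data to the GPU.
    thrust::device_vector<int> d(input.begin(), input.end());

    // Same call shape as std::sort, but it runs on the device.
    thrust::sort(d.begin(), d.end());

    // Copy the sorted data back into ordinary host memory.
    std::vector<int> out(d.size());
    thrust::copy(d.begin(), d.end(), out.begin());
    return out;
}

int main() {
    std::vector<int> data(1 << 20);
    for (size_t i = 0; i < data.size(); ++i) data[i] = std::rand();

    std::vector<int> sorted = gpu_sort(data);

    // Check the GPU result against the CPU's std::sort.
    std::sort(data.begin(), data.end());
    bool same = (sorted == data);
    std::printf("GPU result %s the CPU result\n", same ? "matches" : "differs from");
    return same ? 0 : 1;
}
```

This sketch only checks correctness. Any speedup over `std::sort` will depend on the data size, the element type, and the cost of the transfers in and out of GPU memory, which the copy steps above include.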
CUDA 4.0 has some improvements in its threading model, too. Now a single CPU thread in a system can access all of the GPUs in the system at the same time to dispatch work, and conversely, multiple CPU threads inside of an application can share contexts on a single GPU at the same time rather than having to wait their turn. The updated toolkit also has a new GPU binary disassembler, adds support for the cuda-gdb debugger for MacOS clients, and improves C++ debugging, including new/delete and virtual functions.
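The single-thread, many-GPU pattern can be sketched as one loop that switches devices with `cudaSetDevice` and launches a kernel on each. The kernel `fill`, the helper `dispatch_to_all_gpus`, and the sizes are hypothetical names chosen for illustration. Because kernel launches are asynchronous, the host thread moves on to the next GPU without waiting for the previous one to finish.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: writes `value` into every element of `out`.
__global__ void fill(int* out, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

// Hypothetical helper: one host thread gives every GPU in the system a buffer
// to fill with its own device number, then checks the results.
// Returns the number of wrong elements, or -1 if no GPU is usable.
int dispatch_to_all_gpus(int n) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return -1;

    std::vector<int*> bufs(count, nullptr);
    const size_t bytes = static_cast<size_t>(n) * sizeof(int);

    // Launch phase: the loop does not wait for any GPU to finish.
    for (int dev = 0; dev < count; ++dev) {
        if (cudaSetDevice(dev) != cudaSuccess) continue;
        if (cudaMalloc(&bufs[dev], bytes) != cudaSuccess) { bufs[dev] = nullptr; continue; }
        fill<<<(n + 255) / 256, 256>>>(bufs[dev], n, dev);
    }

    // Collect phase: the blocking cudaMemcpy waits for that GPU's kernel.
    int bad = 0;
    std::vector<int> host(n);
    for (int dev = 0; dev < count; ++dev) {
        if (!bufs[dev]) { bad += n; continue; }
        cudaSetDevice(dev);
        if (cudaMemcpy(host.data(), bufs[dev], bytes, cudaMemcpyDeviceToHost) != cudaSuccess) {
            bad += n;
        } else {
            for (int i = 0; i < n; ++i) {
                if (host[i] != dev) ++bad;
            }
        }
        cudaFree(bufs[dev]);
    }
    return bad;
}

int main() {
    int bad = dispatch_to_all_gpus(1 << 16);
    if (bad < 0) {
        std::printf("No CUDA device found\n");
        return 0;
    }
    std::printf("all GPUs dispatched from one thread, %d wrong elements\n", bad);
    return bad == 0 ? 0 : 1;
}
```

On a single-GPU machine the loop simply runs once; the point of the pattern is that nothing in the code changes as more GPUs are added.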
The first release candidate for the CUDA 4.0 toolkit is being announced today, and will be available for download for free on March 4. You have to be a registered developer to get your hands on it. Gupta says that Nvidia expects it will take six to eight weeks to shake whatever bugs developers find out of the release candidate, and then it will become generally available. Nvidia has a large pool of programmers to pull from to help it harden the CUDA 4.0 code. Through the end of 2010, the company had more than 700,000 cumulative downloads of CUDA tools and estimates that this represents around 100,000 active developers. ®
I was surprised not to see a mention of OpenCL in the article. Or did I totally miss the point that if GPU computing was to become widespread it would have to run on different hardware from different manufacturers, including Intel's future laughabee and AMD?
Look at the Thrust library. If you are comfortable with the STL, Thrust will sort your data on the GPU with code that looks just like a stock STL algorithm.
Spent the afternoon playing with the Thrust library for CUDA. Very very nice work.
There is nothing wrong with OpenCL but it has some hoops to jump through. Thrust is very much like simple C++ STL programming. I like Thrust a lot!
Re: mere mortals
I think we both know the answer to that one. There isn't.
If your data happens to already be in the right place, there probably *is* a way of sorting it really quickly, but the overhead of putting your data in the right place for each wave of the magic wand will be comparable to the savings.
And of course you'll note that a tool was mentioned in connection with partitioning your problem. In the future, you'll be shipping something semantically equivalent to source code, to be compiled by the OS for however many processors the OS scheduler reckons it can make available right now. That's another overhead, unless you precompile for various possibilities, in which case you'll never be quite as efficient.
These problems are soluble, but it's not quite the free lunch they want you to believe in.
100,000 active developers?
"Through the end of 2010, the company had more than 700,000 cumulative downloads of CUDA tools and estimates that this represents around 100,000 active developers."
Call me cynical, but I'd be surprised if the correct figure was within an order of magnitude of that.
First you have to discard all those who simply downloaded an SDK to read the documentation -- and then decided it wasn't appropriate for them (or knew beforehand that it wouldn't be, such as journalists and bloggers). That *alone* probably takes you down to the 100,000 mark.
Next you have to discard all those who are still interested, but have a day job that doesn't fit the CUDA hole. Yes, I expect you can contribute in your spare time, but most of us don't have enough spare time to make a real difference. Just look at how many open source projects can't get the developers they need.
Lastly, you have to discard that fraction who call themselves programmers but who struggle with anything harder than simple scripts and macros. No matter how committed they may be, they'll never be "active" in the sense of developing real CUDA applications.