Weather gets granular with GPUs
Just say NOAA
HPC Blog Everyone complains about the weather, but no one is doing anything about it.
The folks at the National Oceanic and Atmospheric Administration (NOAA) aren't doing anything about the weather. They're too busy trying to figure out what it's going to do tomorrow and next week.
I sat in on a very interesting presentation from NOAA on Tuesday afternoon at the GPU Technology Conference about how they're going to use GPUs to sharpen up their forecasts and dial them in to a much greater resolution. This is quite a computational problem, as it turns out. In 2008, it took 800 cores to drive their model at a 15 to 30 km resolution. To get to a 10 km resolution, it took a bit more hardware (125,200 more processor cores, to be exact) for a grand total of 126,000 cores.
Their next step is to get to 3.5 km resolution, which is an entirely different kettle of fish. The only way to get to this level of granularity is to move to GPUs in a big way - which is what they're pursuing right now. They've learned some lessons along the way, the foremost being that the key to taking advantage of GPUs efficiently is knowing your code intimately.
For example, when they were running their models with CPUs exclusively, interprocess communications used up about 5 per cent of the cycles. The move to GPUs didn't change the need for these communications or the time they take, but because the GPUs get through the computation so much faster, communication grew to about 50 per cent of total run time - making these processes enemy number one.
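To see how a 5 per cent overhead balloons into half the run time, here is a back-of-the-envelope sketch in Python. It assumes a simple model that is mine, not NOAA's: communication time stays fixed while only the compute portion gets faster. The function names are made up for illustration, and the compute speed-up the model spits out is implied by the two percentages above rather than a figure quoted in the talk.

```python
def comm_fraction(comm_share_cpu, compute_speedup):
    """Share of run time spent communicating once compute is accelerated.

    comm_share_cpu: fraction of the CPU-only run spent communicating (0..1).
    compute_speedup: factor by which the compute portion gets faster.
    Assumes communication time itself does not change.
    """
    compute_share_cpu = 1.0 - comm_share_cpu
    accelerated_compute = compute_share_cpu / compute_speedup
    return comm_share_cpu / (comm_share_cpu + accelerated_compute)


def compute_speedup_for(comm_share_cpu, target_fraction):
    """Compute speed-up at which communication reaches target_fraction."""
    compute_share_cpu = 1.0 - comm_share_cpu
    return (compute_share_cpu * target_fraction) / (
        comm_share_cpu * (1.0 - target_fraction)
    )


if __name__ == "__main__":
    # 5 per cent communication on CPUs, 50 per cent after the move to GPUs
    s = compute_speedup_for(0.05, 0.50)
    print(f"implied compute speed-up: {s:.1f}x")
    print(f"communication share at that speed-up: {comm_fraction(0.05, s):.0%}")
```

Under this model, the faster the GPUs chew through the arithmetic, the more the untouched communication time dominates, which is why it ends up at the top of the hit list.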
Memory management is also hugely important. GPUs are incredibly fast on the right code, but not understanding how to best utilize the memory on the GPU card can keep you from getting the most out of them. There are two classes of memory on the cards: the 16 KB that is closest to each GPU core, and then the much larger (1 GB in the NOAA situation) global memory shared by every core on the card.
The difference in speed in accessing this memory is profound - it takes only two cycles to get to the close memory, and 100 cycles to get to the global memory. Accessing memory on the server host would, presumably, be measured in geologic time. Wise use of the blazingly fast, but tiny, memory attached to each core can make the difference between going faster and going a whole hell of a lot faster.
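A rough way to put numbers on that gap is to work out the average cost of a memory access when some fraction of accesses hit the close memory. The sketch below uses the two-cycle and 100-cycle figures from the talk; everything else is an assumption for illustration, including the function name and the flat weighted-average model, which ignores the latency hiding and access-pattern effects that real GPUs rely on.

```python
FAST_CYCLES = 2      # close, per-core memory (figure from the talk)
GLOBAL_CYCLES = 100  # global memory on the card (figure from the talk)


def average_access_cycles(fast_fraction):
    """Weighted-average cycles per access.

    fast_fraction: share of accesses served from the close memory (0..1).
    A deliberately naive model: every access pays its full latency.
    """
    if not 0.0 <= fast_fraction <= 1.0:
        raise ValueError("fast_fraction must be between 0 and 1")
    return fast_fraction * FAST_CYCLES + (1.0 - fast_fraction) * GLOBAL_CYCLES


if __name__ == "__main__":
    for fraction in (0.0, 0.5, 0.9, 1.0):
        cycles = average_access_cycles(fraction)
        print(f"{fraction:.0%} from close memory: {cycles:.1f} cycles per access")
```

Even in this crude model, the last few per cent of accesses that still go out to global memory account for most of the average cost, which is the case for staging data carefully in the small, fast memory.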
Constantly fetching data from the CPU-based host server is likewise costly from a performance standpoint. One weather model, called WRF (pronounced "Worf," like the Star Trek guy) showed a 20x speed-up in raw performance that shrank to 7x when taking into account the time needed to copy data from the server over and over again. The NOAA folks have restructured their programs to minimize data copying and have seen performance rise commensurately.
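The 20x and 7x figures are enough to estimate how much of the GPU run is eaten by copying. This is a minimal sketch, assuming the only difference between the two numbers is host-to-card copy time and normalizing the CPU-only run time to 1; the function names are hypothetical and the breakdown is my arithmetic, not something NOAA presented.

```python
def effective_speedup(kernel_speedup, copy_time):
    """Overall speed-up once copy time is added to the accelerated compute.

    kernel_speedup: raw compute speed-up (e.g. 20 for 20x).
    copy_time: time spent copying, as a fraction of the CPU-only run time.
    """
    return 1.0 / (1.0 / kernel_speedup + copy_time)


def implied_copy_time(kernel_speedup, observed_speedup):
    """Copy time (fraction of CPU-only run time) implied by the two figures."""
    return 1.0 / observed_speedup - 1.0 / kernel_speedup


def copy_share_of_gpu_run(kernel_speedup, observed_speedup):
    """Fraction of the GPU-side run time spent copying rather than computing."""
    return 1.0 - observed_speedup / kernel_speedup


if __name__ == "__main__":
    copy = implied_copy_time(20, 7)
    print(f"copy time: {copy:.3f} of the CPU-only run time")
    print(f"share of GPU run spent copying: {copy_share_of_gpu_run(20, 7):.0%}")
    # Halving the copying, with the kernel untouched, under this model:
    print(f"speed-up with half the copying: {effective_speedup(20, copy / 2):.1f}x")
```

On these assumptions, roughly two thirds of the accelerated run goes to shuttling data rather than computing, so trimming copies pays off faster than squeezing more out of the kernel.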
Right now they're seeing speed-ups ranging from 15x to 39x with GPU + CPU systems vs. exclusively CPU-based hardware. This is with fully optimized CUDA code running on a smallish pilot system, but it has proven the validity of their approach and is a pretty big win. Their push going forward is to scale the model to larger hardware - fueled by GPUs - completing the transition to the 3.5 km resolution.