Deconstructing databases with Jim Gray
A genuine guru
Graphical Processor Units or GPUs
What are your views on using Graphical Processor Units (GPUs) for data processing?
Unbeknownst to most of us, the graphics community has been building vector co-processors and putting them into every laptop, every desktop. At present, the typical vector co-processor has about 40 times the CPU power - instruction per second power - of the CPU. It also has about ten times the memory bandwidth.
Intel and AMD talk about multi-core; well, a typical GPU has a hundred cores, a hundred pipelines and is doing all this texture mapping in parallel. So if you could figure out how to program the GPU you could go a lot faster, and the coda to this is that the GPU continues to double at a much faster rate of speed than CPUs. So it is incumbent upon us software guys to use the instructions where we find them, and we are beginning to find lots and lots of instructions in the GPU.
There is a Microsoft research project, which has published a technical report, I think two months ago, about Accelerator [Microsoft Tech Report 2005-184 - Ed], a general purpose programming language that extends C# and allows you to program array-oriented operations in video and ATI chips. The language is generic; you can program not knowing which GPU you've got.
This project was working with some people at the University of North Carolina and they are graphics guys who know how to program GPUs and they came up with a novel algorithm for sorting. So the general thing we are doing is going methodically through the problems and seeing if we can map those problems to GPUs. I'm currently looking at processing satellite imagery and much of the satellite image processing fits GPUs beautifully because it is fundamentally pixel programming. My main role in working on that paper was that, in doing something like sorting, you have to balance I/O and CPU and memory utilisation. So it's a system-wide design and I helped the people at North Carolina with making sure that the I/O could deliver data quickly enough so that we weren't disk-bound. We had to use several disks in order to keep the system busy because a disk only gives you data at about 50MB per second and the GPU can sort at about 500MB per second.
What kind of applications would the GPU processing would suit?
Crypto would be a good application. In addition, many of the large data applications; like mapping applications or sensor applications. Of course, the most natural thing is any kind of the processing of cameras so if you can get Photoshop recast using the GPU instead of the CPU it would run a lot faster. Fundamentally highly parallel "do this to all the pixels" applications are good.
Of course, this doesn't fit within the x86 instruction set and it requires writing algorithms in a very different way; you have to think in a very different way. The bitonic sort [see a general definition of this term here - Ed] that we used with the GPUs really is radically different from any sort anyone has ever thought of before. It is massively parallel. You have to think of a single instruction stream, multi-output, data-stream architecture, and most people are used to the ALGOL, COBOL single instruction stream, single data-stream idea [and, in my experience, only a few such programmers can reliably code parallel processes, as Gray says later - Ed]
Why is this taking so long to be accepted?
Remember the year 2000 problem? Looking back it's amazing people had to work so hard changing two digit to four digit dates. Many of us use libraries that already exist, many of us use algorithms that already exist, we all know that weighted sorting is pretty good and quick sort is pretty good. Then somebody comes along and says "No, no, no you've got to use bitonic sort and you've got to figure out how to program everything in parallel and incidentally the memory model is different and you just have to think differently, and oh, incidentally they vary from one machine to the next". So you say, "Oh, excuse me, how does this work on a server? I have a multi-threaded system and how does the GPU get shared among the multiple threads?"
There are a few details like that that haven't, quote "been worked out". About two weeks ago IBM came out with a cell architecture which is a fairly radical departure from...it's not just a GPU any more 48:42 it's an array of array processors and so we are seeing an architectural blossoming right now which could make life so confusing for people.
You can see why people end up saying, "You know, I'm a C programmer, I'm just going to stick to my good old C programs and I'm sure they'll be working in a few years and if I invest in one of these bizarre architectures they are likely to have evaporated and I'll have wasted all that effort."
Another problem is that people find it hard to think in parallel. If you read a cookbook, it tells you how to make things and there's parallelism inside the cookbook. They never tell you to do things at the same time...you know, each step is do this, do this, do this, it's all sequential. Knitting is another very sequential process.
Sponsored: IBM FlashSystem V9000 product guide