James Reinders says that, soon, a programmer who doesn't "think parallel" won't be a programmer.
Here are Reinders' rules for multiprogramming, in my words, and as I see them:
- Think parallel first. Don't even contemplate bolting on parallel processing capabilities afterwards.
- Code to express the parallel nature of the problem. Don't write thread management code yourself; leaving it to the tools is the equivalent of writing in C# or Java instead of Assembler.
- Don't tie threads to particular processors. You don't want to write programs that only run properly on a particular number of cores.
- Plan to scale through increased workload. Amdahl's Law often limits the performance gain you can get from parallel processing applied to a fixed-size workload (there is usually some significant serial part of the process which can't easily be parallelised); but Gustafson observed that if you increase the workload, the serial part of the process often remains fixed and parallel processing then lets you get through the much bigger workload with similar performance.
- Only create programs which can add tasks to the workload arbitrarily, so that if more processors become available, the workload can take advantage of them.
- Only write programs that can run serially, mainly because (assuming that all new PCs will be multicore) they'll then be easier to debug. However, for the time being your programs will still be expected to run OK on legacy single processor machines - and remember that a program optimised for multiprocessors will usually run more slowly on a uniprocessor, so be aware of this and don't rush headlong into coding for multicore architectures.
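The contrast between Amdahl and Gustafson in the fourth rule can be made concrete with the two standard formulas. This is a minimal sketch: the 10% serial fraction and the core counts are made-up illustration values, and the function names are my own.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Speedup for a fixed-size workload (Amdahl's Law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def gustafson_speedup(serial_fraction: float, processors: int) -> float:
    """Scaled speedup when the workload grows with the processor count
    (Gustafson's Law); serial_fraction is measured on the parallel run."""
    return processors - serial_fraction * (processors - 1)


if __name__ == "__main__":
    serial = 0.10  # assumed: 10% of the run cannot be parallelised
    for cores in (2, 8, 64, 1024):
        print(f"{cores:5d} cores: "
              f"Amdahl {amdahl_speedup(serial, cores):6.2f}x, "
              f"Gustafson {gustafson_speedup(serial, cores):7.2f}x")
```

With a 10% serial part, the fixed-size speedup can never exceed 10x however many cores are added, while the scaled speedup keeps growing almost linearly with the core count.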
And some of the tools which Intel thinks will help you follow these rules are:
- The OpenMP standard, which bolts efficient parallelising onto C++ and Fortran compilers using compiler hints. This is what Sutter calls "industrial strength duct tape", but it works.
- Threading Building Blocks – C++ algorithms for scalable threading (Reinders seems very confident in this tool).
- Thread Profiler - which highlights potential performance bottlenecks.
- Thread Checker - which detects latent race conditions and potential deadlocks.
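The "express the parallelism, don't manage threads" idea can be illustrated without any of the Intel tools. The sketch below uses Python's standard `concurrent.futures` as a stand-in (it is not OpenMP or TBB): the code states which pieces of work are independent and lets the library decide how many workers to run, and passing `workers=1` runs the tasks one at a time for debugging. `word_count` and `count_all` are hypothetical names, and since CPython threads will not speed up CPU-bound work, this shows the shape of the code rather than a performance result.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


def word_count(text: str) -> int:
    """One independent task: no shared state, so tasks can run in any order."""
    return len(text.split())


def count_all(documents: list[str], workers: Optional[int] = None) -> list[int]:
    """Apply word_count to every document.

    workers=None lets the library pick a pool size for the machine;
    workers=1 runs the tasks one at a time, which is easier to debug.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, whatever the completion order.
        return list(pool.map(word_count, documents))


if __name__ == "__main__":
    docs = ["think parallel first", "no thread management code", "runs serially too"]
    print(count_all(docs))             # library-chosen worker count
    print(count_all(docs, workers=1))  # one task at a time, same answer
```

Nothing in the calling code mentions a core count, so the same program runs unchanged on one processor or many.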
But now a note of caution. Parallel processing has always been a holy grail of computing (although Intel came to it late, perhaps). Many of the issues talked about in this conference I've met before – on multiprocessor mainframes (the most efficient way to achieve parallel processing in practice may be the mainframe job scheduler).
I've told good programmers to think about the consequences of running on multiprocessors, only to be told that "the compiler will look after it" (in general, it can't). And I've seen the results of programmers forgetting that their code can run on several processors and, in production, things may sometimes run in the wrong order as a result. This seldom shows up in test as, even if several processors are available to the test system, the chances are that you don't process enough data to see the latent race conditions, which tend to appear when the system is overloaded.
I've had to deal with the consequences of programmers deciding that they can do locks better than IBM and coding them for themselves (the application I'm thinking of was very fast – for a while, until the consequences of never releasing locks became apparent).
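The "never releasing locks" failure usually comes from an error path that skips the unlock. Here is a minimal sketch of the usual defence, using Python's standard `threading.Lock` as a context manager so that the lock is released even when the protected code raises. `update_balance` and the `accounts` dictionary are hypothetical, invented for the illustration.

```python
import threading

accounts = {"alice": 100}          # hypothetical shared state
accounts_lock = threading.Lock()   # library-provided lock, not a home-made one


def update_balance(name: str, delta: int) -> None:
    """Adjust a balance; the lock is released on every exit path."""
    with accounts_lock:            # acquired here ...
        if accounts[name] + delta < 0:
            raise ValueError("insufficient funds")
        accounts[name] += delta
    # ... and released here, whether or not an exception was raised


if __name__ == "__main__":
    try:
        update_balance("alice", -500)
    except ValueError:
        pass
    print(accounts_lock.locked())  # False: the failed update did not leak the lock
    update_balance("alice", -30)
    print(accounts["alice"])       # 70
```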
This stuff seems to be hard, so we're going to need very good tools and more training. And probably, much better adherence to good development process.
Do I think that parallel processing of this sort is the way of the future? Yes, emphatically. If you run on Intel or similar models, it's the only way (it seems to me) to scale computer processor power effectively. Whether we need to scale computer processor power at all, or whether lots of specialised small computers (another kind of parallel processing) will work better, might be another question.
Reinders tried to make the point that parallelism was intuitive. His example was the queue – it's really quite intuitive that if you have a long queue, you just need more people on the desks servicing it. Simple. But this can hide a lot of complexity – if you have more desks and shorter queues checking in at Heathrow, things go faster. But you don't expect to get past check-in and find several people are assigned to one seat.
This is a trivial example, but move back a bit and airlines have gone bust because their booking systems couldn't cope with the essentially parallel activity of selling seats in an aeroplane at travel agents across the country. Planes flying three quarters full with spare capacity to cover "collisions" for seats – or upgrading overbooked passengers for travel on the next flight - can get expensive.
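The check-in example comes down to making "find a free seat and mark it taken" a single indivisible step. Below is a minimal sketch under that assumption, with a hypothetical `SeatMap` class and Python threads standing in for travel agents. It covers only the in-process version of the problem, not a distributed booking system.

```python
import threading
from typing import Optional


class SeatMap:
    """Hands out each seat at most once, even with concurrent callers."""

    def __init__(self, seats: int) -> None:
        self._free = list(range(1, seats + 1))
        self._lock = threading.Lock()

    def book(self) -> Optional[int]:
        """Return a seat number, or None when the flight is full."""
        with self._lock:  # the check and the update happen as one step
            return self._free.pop() if self._free else None


def sell(seat_map: SeatMap, requests: int, sold: list) -> None:
    """One agent: try to book `requests` seats and record the successes."""
    for _ in range(requests):
        seat = seat_map.book()
        if seat is not None:
            sold.append(seat)


def run_agents(seats: int, agents: int, requests_each: int) -> list:
    """Run several agents against one seat map and return every seat sold."""
    seat_map = SeatMap(seats)
    per_agent = [[] for _ in range(agents)]  # one private list per agent
    threads = [threading.Thread(target=sell, args=(seat_map, requests_each, sold))
               for sold in per_agent]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [seat for sold in per_agent for seat in sold]


if __name__ == "__main__":
    sold = run_agents(seats=100, agents=8, requests_each=20)
    print(len(sold), len(set(sold)))  # 100 100: full flight, no seat sold twice
```

Take the lock out of `book` and the check and the `pop` become two separate steps, which is exactly the gap in which two passengers end up with one seat.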
Do I think that parallelism is intuitive? "Only up to a point, Lord Copper". The consensus among the speakers at the conference was that this would be a revolution in thinking comparable with the OO revolution or structured programming. And (rather like OO) it will probably only become routine once the "old guard" dies off and a new generation of graduates that knows no other way of thinking takes over.
Parallel Needs New Designs
Parallel programming is tough and will remain so, but it need not be one step short of impossible.
One of the keys to good parallel programs is the decomposition of the job into tasks which perform independent (or close to that) operations. The highest level of this needs to be done at the design stage.
Multitasking happens at two scales: the large individual tasks and the small steps that make up a task. In many cases, working at the task level is enough for effective parallel programming, but where individual tasks are large, their steps may need to run in parallel too, essentially micro-tasking.
However, micro-tasking the steps runs the danger of increasing complexity and overhead. So the task-level parallel work will require design-level choices, while the step-level parallelism will need checking and testing for overhead. If the steps are so small that the parallel code adds a 50% overhead, the extra complexity will probably not pay useful dividends.
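One common way to keep step-level overhead down is to group many small steps into a few coarse tasks. This is a sketch, assuming the steps are independent. `chunks`, `sum_squares_chunk`, `sum_of_squares` and the default chunk size are all hypothetical, and the right size still has to be found by measurement.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, Sequence


def chunks(items: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Split items into consecutive slices of at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def sum_squares_chunk(chunk: Sequence[int]) -> int:
    """One coarse task: many tiny steps done serially inside it."""
    return sum(x * x for x in chunk)


def sum_of_squares(items: Sequence[int], chunk_size: int = 10_000) -> int:
    """Schedule one task per chunk instead of one task per element."""
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(sum_squares_chunk, chunks(items, chunk_size)))


if __name__ == "__main__":
    data = range(100_000)
    print(sum_of_squares(data) == sum(x * x for x in data))  # True
```

Here 100,000 steps become ten scheduled tasks, so the scheduling cost is paid ten times rather than 100,000 times.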
Ultimately, if task level parallelism still can't handle the workload, the best answer may be splitting the workload and running multiple copies of the whole program, essentially segmenting the workload rather than increasing the internal level of parallel programming.
We already do this with multi-system web servers or file servers, as Akamai does. The same segmentation and replication approach could be a better answer than trying to increase parallel operations at the micro level, because of the overhead involved.
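Segmenting the workload can be as simple as routing each item to one of several identical copies of the program by a stable hash of its key. This is a sketch, assuming the items have string keys. `shard_for`, `partition` and the replica count are hypothetical, and `zlib.crc32` is used only because it gives the same answer in every process (Python's built-in `hash` for strings does not by default).

```python
import zlib
from collections import defaultdict


def shard_for(key: str, replicas: int) -> int:
    """Map a key to a replica index in [0, replicas), stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % replicas


def partition(keys: list[str], replicas: int) -> dict[int, list[str]]:
    """Group keys by the replica that should handle them."""
    shards: dict[int, list[str]] = defaultdict(list)
    for key in keys:
        shards[shard_for(key, replicas)].append(key)
    return dict(shards)


if __name__ == "__main__":
    requests = [f"customer-{n}" for n in range(10)]
    for replica, keys in sorted(partition(requests, replicas=3).items()):
        print(replica, keys)
```

Each copy of the program then sees only its own slice of the work and needs no locking against the others.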
Old news or not?
To those of us in the HPC community this seems like very old news. Every major supercomputer for the last 10 years has needed parallel programming.
However in this arena we have long since hit the point where codes are limited by the memory system more than the instruction rate. For this reason most really big systems are distributed memory rather than thread based. If you think thread programming is hard you should try distributing a problem across multiple memory systems.
Intel's move to massive core counts could have one of two outcomes.
1) Large-scale scalable shared memory systems become commodity and the HPC arena becomes much easier.
2) People discover that having lots of cores attached to a rubbish memory system goes at the same speed as a couple of cores no matter what you do.
Guess which one I believe :-(
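The second outcome can be put as a back-of-envelope bound: throughput is capped by whichever is smaller, what the cores can compute or what the memory system can feed them (the idea behind the roofline model). All the numbers below are invented for illustration, not measurements of any real chip, and `attainable_gflops` is a hypothetical helper.

```python
def attainable_gflops(cores: int, gflops_per_core: float,
                      memory_gb_per_s: float, flops_per_byte: float) -> float:
    """Upper bound on throughput: compute limit or memory limit, whichever is lower."""
    compute_limit = cores * gflops_per_core
    memory_limit = memory_gb_per_s * flops_per_byte
    return min(compute_limit, memory_limit)


if __name__ == "__main__":
    # Assumed figures: 10 GFLOP/s per core, 20 GB/s of memory bandwidth,
    # and a code that does 1 floating-point operation per byte moved.
    for cores in (1, 2, 4, 16, 64):
        print(cores, attainable_gflops(cores, 10.0, 20.0, 1.0))
```

With these assumed figures the bound stops rising at two cores; every core after that just waits on memory.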
Intuitive up to a point
The problem with parallel programming is that it adds a third dimension in which one has to think. It's no longer just data organisation and control flow, there is now also communication to think about.
The change in thinking that is required does not compare with the structured programming and OO "revolutions", which are just more convenient ways of thinking about the Von Neumann architecture.
Because parallel programming is inherently more complex than sequential coding, it is both easier to make mistakes and harder to track them down (especially timing-dependent bogeys that go into hiding when one hauls out the debugger).
One solution may be to reduce complexity in other parts of the development environment, although this has already been tried in the form of the Occam language. In retrospect there was nothing wrong with Occam (once floating-point support had been added!); its sin was that it was not Fortran or C.
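Occam built communication into the language as channels between processes. Below is a rough Python analogue, using a standard `queue.Queue` as the channel between two threads. It illustrates the message-passing style only and is not a rendering of Occam semantics (Occam channels are unbuffered and synchronous, a queue is not); `producer`, `consumer` and `run_pipeline` are hypothetical names.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def producer(out: queue.Queue, count: int) -> None:
    """Send the squares of 0..count-1 down the channel, then signal the end."""
    for n in range(count):
        out.put(n * n)
    out.put(DONE)


def consumer(inp: queue.Queue, results: list) -> None:
    """Receive values until the sentinel arrives."""
    while True:
        item = inp.get()
        if item is DONE:
            break
        results.append(item)


def run_pipeline(count: int) -> list:
    """Connect one producer to one consumer and return what was received."""
    channel: queue.Queue = queue.Queue(maxsize=1)  # small buffer keeps the two in step
    results: list = []
    workers = [threading.Thread(target=producer, args=(channel, count)),
               threading.Thread(target=consumer, args=(channel, results))]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(5))  # [0, 1, 4, 9, 16]
```

The two sides share no variables except the channel, so the only thing left to reason about is the order of messages.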
In my view the ideal solution for applications outside the number-crunching realm would be something similar to the swingeingly expensive G2 "real time intelligent system" in which it is remarkably easy to express searches that are executed as concurrent tasks. When I last used it (some 10 years ago) G2 was an interpretive system and so was not exactly suitable for number crunching.
Over and above issues with the development environment, there is also the challenge of developing parallel algorithms. At least in the fluid dynamics area in which I've worked, the "smarter" algorithms always came with an increased degree of data coupling - in space or time. Loosening the coupling between data elements would generally result in a loss of algorithmic efficiency.