A Deadlock Holiday

Original URL: https://www.theregister.com/2009/12/10/verity_stob/

What to do now there's no Moore's Law?

Posted in Columnists, 10th December 2009 11:26 GMT

Stob Moore's Law, I need hardly remind a top-notch industry professional like you, states that as the density of silicon circuitry doubles, the probability of you not being able to find some sensibly-priced extra memory to fit your old lappy approaches 1.0.

In recent times it has become generally admitted that, if this well-known observation has not yet quite joined Elvis, it is at the very least fiddling determinedly with the fire exit door. Instead of increasingly quick processors, we are given increasingly cored processors. Whereas it used to take just one running instance of Access 2000 to bring your CPU usage meter to 100%, it now takes two, four or possibly 128.

Once upon a time, all one needed to know about multi-tasking code was how to hang a few lines of badly-written (I speak for myself) assembly language off the timer interrupt. Those were the days, my friend, we thought they'd carry on; yet here I am giving you a whistle-stop tour through the baffling mechanisms designed to help you envisage and write para-multi-threaded applications.

Let us begin with some ancient history.

Co- not Sub-

The great Donald Knuth described the coroutine, a mechanism for writing single-tasked multi-tasking that allowed any programmer to cope with any situation, merely by thinking of everything at once.

Although it has never really taken off as a popular programming construct, it has been very influential, in particular pioneering the important adjective 'lightweight' (subroutines are a lightweight specialisation of coroutines). Nowadays, of course, 'lightweight' is used as a synonym for 'good' throughout comp. sci., yet it still retains a special affinity for the parallel and pseudo-parallel arts.

Threads

A 'thread', abbreviating 'thread of execution', is a lightweight version of the heavier 'process', abbreviating 'process of execution' (as in the phrase 'during the process of execution, the program abruptly died'). Threads save all the tedious mucking about creating state and context demanded by processes, by simply enabling multiple threads to cavort together in the same address space, and with the same resources, like drowning kittens in a bucket of water. Adding cores to the processor improves the threading model by significantly increasing the amount of water available. (Peace: no actual kittens were drowned in the manufacture of that simile.)

In particular, threads suffer badly from 'race conditions'. The race of despised worker threads is made to do boring, low status, 'background' tasks. Meanwhile, the high privilege 'system' threads get to party with the hardware. It's the same the whole world over...

Three old ladies

In order to overcome race conditions, and perhaps to compensate for taking away our beloved 'goto' away from us, top Dutch comp. sci. genius Edsger Dijkstra invented the 'semaphore'. The semaphore is a data structure that allows friendly communication between the threads and tasks of all nations. There are two kinds: the counting semaphore and later, when it has been compiled, the binary semaphore.

The semaphore helped do away with race conditions, and for a while all was sweetness and synchronicity. But it soon became clear that a brand new peril had been introduced: the deadlock.

Today there are at least four well-known kinds of deadlock breeding in the wild:

Recursive deadlock, which bloody well happens again and again
Deadly mutual embrace deadlock, which is much less fun than it sounds, being a kind of inter-task stalemate
Death-of-process deadlock, where the process that claimed the semaphore dies intestate
Lady Dedlock, a plaintiff in Jarndyce v. Jarndyce

By way of countering the threat and obtaining a deadlock holiday, it was decided to invent the 'mutex' (or 'mutant' as it is known by posher, system threads).

There is a lot of confusion about the difference between mutexes and semaphores, which frankly I do not see as part of my business to clear up. Instead I will refer you to the conventional explanatory model which is, weirdly, based on the notion of the lockable lavatory (or securable brickhouse, as it is almost known by commoner, worker threads).

The scenario is this: imagine a loo in a restaurant of the dismal kind where you must humiliatingly apply to staff for the key. If there is one lavatory and one key, you have a mutex and a long, fidgety wait. If there are four toilets and four interchangeable keys, you have a counting semaphore. The important thing about a counting semaphore is: it doesn't prevent four threads entering one cubicle together.

This licentious view of synchronisation is disputed by other writers, who say that the difference with mutexes is that, before sitting down, they draw the bolt of ownership across the door.

My own view is: synchronisation primitives will never be understood until somebody goes over their metaphors with gallons of industrial-strength bleach.

Yet more

As well as the tools I have listed above, there are many other OS structures to support better multitasking. For example:

Fibers (sic, geddit?) are a Windows lightweight alternative to threads, which combine the pleasure of the Win32 threading interface with the intellectual opportunity afforded by the coroutine.
Critical sections are Windows's lightweight alternative to mutexes, thus cleverly getting more use out of a previously well-defined phrase by using it to mean something related to but slightly different from the earlier general understanding.
Not to be outdone by its closed source competition, Linux has sixteen types of mutex, varying the whole gamut of mutexity from lightweight to ultra-, hyper- lightweight.
Lockless programming is a wonderful lightweight alternative to locky programming, which is great if it is applicable, which it rather often isn't. Most OSes offer special atomic increment and decrement operations to support this approach, and there is sometimes assistance in the programming languages too, which feeds into...

Language facilities

The old C language offered a pair of keywords with which one could season variable declarations: register, which sounded rather staid, was an adjective essentially meaning 'fast'; whereas volatile, which sounded exciting, was an adjective essentially meaning 'slow'. (Yes, yes; only volatile is relevant here. I know.) Part of the expertise of the K&R generation was judicial decoration of one's code with these keywords. It often was only decoration. I discovered, by examining the output of my compiler. The damn thing took no notice at all. But at least the code looked good, which is the main thing.

These days register is gone, and volatile is very old hat, being merely an unsuccessful technique to try to get the double checked locking pattern back on its feet. [pdf] So what is new?

Well, C++ is finally getting its own standard threading library. There. I am sure that Java people in particular are all mightily impressed by this, and are queuing all around the block to rejoin Bjarne's merry throng, like High Church C of E traditionalists flocking outside the local RC recruitment office.

More radically, the Intel people - who I suppose given their core enthusiasm are entitled to an opinion in these matters - believe that threads are far too hard for mere C++ programmers to manage. To this end, they have constructed the Threaded Building Blocks library, and released it under dual GPL and commercial licences. Here is a taster, cribbed from the pages of the accompanying O'Reilly book:

parallel_for(blocked_range(0, n, grainsize), avg);

This implements a parallel for loop (duh!), with an iteration of the range 0 to n-1 calling functor avg. The work is automatically partitioned and handed out to the available processor cores, with each core dealing with at least grainsize iterations.

I thought this looked like a really easy-peasy approach to threading, and planned to make a parallel version of my latest project, which in one place iterates over an array of nearly 40 items. I was a bit taken aback when I found that the recommended first cut value of grainsize was 10,000.

Walter Bright's D language takes an interesting approach to the problem of accidentally shared state. Instead of using a mechanism like C++'s __declspec(thread) or Delphi's threadvar to declare thread local storage - which, as Bright notes, hardly anybody uses - D defaults to all variables being allocated thread-local, and obliges the programmer to declare thread shared variables explicitly. Many other good things follow without much programmer effort - see here [pdf of talk slides] for details.

No round up such as this would be complete without mentioning the language du jour. Google's Go's goroutines are lightweight threads, often set to execute an inline lightweight lambda function and synchronise data via lightweight language-supported channels. It all looks jolly clever, and is so lightweight it might all be capable of being supported by hot air. Of course, I haven't actually tried it. I'm holding off until Google sees sense and implements full MI.

Shang Erlang

Finally we come to my hot tip for concurrency champion. Erlang is the functional language that differentiates itself from its competition by having been around for nearly 25 years. No danger of being burned by V1.0 compiler bugs, then.

Erlang has a very different paradigm from procedural languages. For one thing, its so-called variables are WORM (Write Once Read Many), like 1990s backup media. Presumably its designers at Ericsson thought to encourage programmers to jolly well get it right first time.

The style of programming relies heavily on recursion and pattern matching. For example, a simple Erlang routine to perform a factorial might exploit the fact that, for example,

20! = 21! / 21

Generalising this into code, and adding pattern matching to supply the stopping condition, we have a definition for a fact() function:

fact(∞) -> ∞; % Exploit the fact that ∞! = ∞; fact(N) -> fact(N+1)/(N+1).

Don't worry about that use of very large numbers there - Erlang uses arbitrary-sized integers, so it won't fall over just because we pass the 64-bit limit! (Actually, that example doesn't look quite right. Perhaps you better google it up before you use it. I don't think it implements proper tail recursion.)

Anyway, Erlang programs are organised as lots of independent-but-lightweight processes, which communicate by sending messages to each other. There is no shared state, which means all that ghastly business with races and locks and lavatories I have been discussing just goes away, and your programs are automatically parallel without any extra work on your part.

Finally, there is one particular Erlang trick which I think is the clincher - and must surely impress all who have returned to an overnight run test on a Windows box to discover that the machine has instead automatically been rebooted in order to update some security-compromised DLL or other. You can update running Erlang programs on the fly. That in my view is the feature rival systems need to top. Happy parallelising.

Revision Questions

If your double-checked locking is broken on a multi-core machine (it is), explain why this isn't observable behaviour.
Which is trendier: to eschew mere multicore programming and dump all your hard processing onto the GPU? Or port it all to Occam and start a campaign to reinstate transputers as the Next Big Thing? Show your working. ®