Has writing operating systems changed since OS/2?
Comment As Vista slowly slips further into the mists of the future, I sometimes wonder if anything has really changed since I was on the losing side of the IBM-Microsoft OS/2 war. Why do we now hear of huge re-writes in a product that's supposed to be almost ready?
As a former O/S (Operating System) bug hunter, it sounds rather familiar. No matter how disciplined you are in architecture, little issues have a way of forcing you to implement wider-ranging changes, and these build up like water behind a dam. This can become a nasty techno/political problem. Managers will play bluffing games, each hoping another team will take the bullet of making the project late. So, you have managers expressing increasingly fictitious optimism, until the breaking point - when suddenly torrents of issues will flood through, with lots of relief (and a bit of schadenfreude) all round.
Then you can "fix" the project, merely by firing managers at random and allowing them to become blame sinks. It would be too cynical of me to say that the management changes we've seen reported here are the result (El Reg's take on this is here). Way too cynical - I can't be right can I?
It's hard to gauge progress in a huge project like an O/S, so management focuses on a range of statistics and capability milestones. OS/2 reached the point several times where we found bugs faster than they could be fixed. One set were caused by the compiler and processor having different ideas about valid opcodes. Thus, bugs could depend upon the revision level of the CPU, and no code review could help you. Management responded to this more than once by simply banning the finding of bugs, in a denial of bad news that would shame a Soviet spin doctor; you had to be a third level manager to "approve" a bug at one point.
Bug hunters are seen as the bad guys by individual managers, whatever the top level says (some managers think that the testing process itself actually "makes" bugs, which wouldn't be there if you didn't test - Ed]. They bring bad news. In all organisations the bringers of good news prosper over the bringers of bad. At IBM, however, us bug hunters were generally seen as valuable by the organisation as a whole, not least because we annoyed Microsoft; and, yes, blame management became a big issue, complete with its negative productivity. Microsoft has lots of smart people; "evangelists" who ride out to inspire us all with the wisdom of Microsoft ways. Ever heard of its testers? Reckon they're the best paid people in Redmond?
A tester should have a superset of the skills used by developers; but at too many firms testing is a sin bin, and paid accordingly; and it's hard to see that not being the cause of many disasters on the scale of Vista. At IBM, some of us had a simplistic model. Each line of new/fixed code put into the system has some probability of breaking it in a new way, and this probability grows with the size of code already written. So, for a given quality and size of programming team, there is a maximum size of system. Beyond that, any attempt at change is as likely to break something else as to fix a problem. This steady state is permanent and fatal; have the Vista teams hit it?
Yes, I do mean "teams" plural. The failure of OS/2 commercially was partly down to the decision to do version 1.3, and thus stop 2.0 in its tracks. 1.3 was a result of listening to customers, and was a lighter, faster, crap version, of 1.2, whereas 2.0 had loads of cool features like being 32 bit and having the ability to run Windows apps. There are competing teams within any large project – and when one version hits the sweet spot of desirability and plausibility for delivery, do you think that makes them friends?
Individual Microsoft programming units are often larger than several whole companies, and so (it seems to me) "us" may be our bit of Microsoft, and "them" another part of Microsoft; not the official Linux enemy. Also, techies are known for their bitchiness (I have no doubt that this article will prod people to point out some of my screw-ups over the last 20 years) and often bug reports and fix requests are seen as coming from malicious incompetents, not colleagues. We referred to the IBM team rewriting MS compilers as "children playing with matches" and refused point-blank to use their output; and they doubtless had equally unkind things to say about us. "Arrogant prima donnas" is probably all I can use here, without getting Reg Developer added to the banned site list.
The Vista teams must also be hitting "deadline fatigue" by now. People get increasingly cynical about the assertions made by other groups and, after a while, by their own managers, who are torn between honesty with the troops and being seen as a "good team player" by their peers and bosses. Also, in order to actually get a program out of the door, there are always compromises and fudges and things that just happen to work. More than one bug in a Microsoft product has been fixed by deleting something from the manual [that happened back in the days of MVS mainframes too, nothing changes - Ed]. We've all done this, but the more deadlines you rush for, the more "clever hacks" accumulate, and by their nature are not only undocumented, but often unknown to anyone other than the developer who bodged them.
Source code control can fray a bit in the last desperate hours - it's not unknown for the source for the version of memory manager that actually shipped to be "lost". These issues actually make net progress slower, and the sort of managers who genuinely believe that sport is a good metaphor in setting goals for teams often don't realise that trying for an unachievable target doesn't bring out the best in people. In fact, it actually digs a hole for the next wave of cannon fodder to fall into.
DRM (Digital Rights Management) doesn't help. Ever since Fred Brooks wrote The Mythical Man-Month, based on his experiences with OS/360 (and since augmented with new material), we've tried to segment large systems so that bugs don't spread. But to be effective, digital rights must be managed right across the system in a cooperative fashion. That massively increases the effort and bug count, yet as far as the customer is concerned DRM itself is the bug. A "feature" is something you'd pay money to get. DRM does not pass that test.
But I'm not in the OS-writing game any more, thankfully, so I'm looking forward to lots of feedback telling me in detail where my ideas are obsolete.®