Stratus, NEC see double with fault-tolerant iron
Xeon 5500 engineering push
Server partners Stratus Technologies and NEC have revamped their fault-tolerant server lineups to take advantage of Intel's quad-core Nehalem EP Xeon 5500 processors.
The two companies typically like to get their fault tolerant machines out the door within a quarter of a new chip launch from Intel.
This time, though, the big shift from the old frontside bus on Xeon chips to the new QuickPath Interconnect used with the Xeon 5500s, plus other changes in the system BIOS, forced them to do more engineering work and testing to hit the market. In a few weeks they will begin shipping the fault-tolerant machines they created together, with their respective labels.
Fault-tolerant servers are distinct from the more popular clusters in that they consist of two completely mirrored systems running two copies of an operating system and its applications, kept in absolute lockstep by a chipset and electronics in Intel's Xeon chips.
High-availability clusters, by contrast, link two or more server nodes and replicate data between machines so they can take over each other's work in the event one node in the cluster goes down.
Historically, clustering has been cheaper to do even if it is more complex for IT shops to manage. The advent of x86 and x64 servers from Stratus and NEC, though, has brought the price of a fault-tolerant setup down even as the performance has gone up with every successive Xeon chip generation.
With support for the Xeon 5500s, even a two-socket fault-tolerant box is going to have enough oomph for a lot of workloads, including the kinds of applications run by police, fire, and other emergency responders, who like fault-tolerant boxes. These are environments where the IT skills are a little thin and paying extra for mirrored machines that manage themselves is easier than trying to build and support a high-availability fail-over cluster, and where the budgets for doubling up on software licenses are kinda thin too.
To make the current generation of fault-tolerant machines - this is the fifth generation of Xeon-based FT boxes from Stratus, but the sixth generation from NEC, which did its own before partnering with Stratus - the two companies collaborated on the design of the GeminiEngine chipset. This chipset accesses the lockstepping functions inside the Xeon chip and allows for the two server modules in the fault-tolerant box to be kept in absolute synch in terms of CPU, memory, disk, and network processing.
Thanks for that explanation. I understand that one of Intel's chips was designed to do something similar (the PII?), but we're talking several generations ago. IIRC the 68040 would be running at ~30MHz, against current clock speeds roughly 2 *orders of magnitude* faster. I don't know, but I'm guessing the clock is on-chip, so how do you sync 2 sockets? Even if it is off-chip you would have significant clock skew.
Dunno, it seems more plausible but still extremely hard.
You must have heard this story
Told to me a long time ago; somebody phoned up the service centre and said 'we had an earthquake, and our fault tolerant box has fallen over'. Service guy; 'it can't fall over, it's a fault tolerant system'. Customer; 'No, no, it's still *running*, but it's fallen over on its side and we need someone to help us get it upright again' :-) :-) :-)
The setup inside the box seems to be basically the same as the original 68xxx based machines that I last used, so here it is in a nutshell.
Every board in the machine is paired and hot pluggable
All disks are mirrored - RAID 1
Each logical CPU consists of 4 physical CPUs, two on each of the paired boards. The same sort of arrangement also applies to comms boards, disk controllers, etc, but for simplicity I'll just describe CPU boards.
The basis of the system is that the two CPUs on a board are very tightly synchronised. They both execute the same instruction at the same time and fast comparators compare the outputs of the pair. If a comparator spots a difference it can turn the board off before the bad data gets onto the main busses that connect all boards in the cabinet.
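A minimal Python sketch of that compare-and-disable idea, purely for illustration. The `Board` class, the `good_cpu` and `flaky_cpu` functions, and the notion of a CPU as a function from an input word to an output word are all hypothetical; the real machines do this in hardware comparators, not software.

```python
from typing import Callable, Optional

# Hypothetical model: a "CPU" maps an input word to an output word.
Cpu = Callable[[int], int]


class Board:
    """One board carrying two CPUs that execute the same step in lockstep."""

    def __init__(self, cpu_a: Cpu, cpu_b: Cpu) -> None:
        self.cpu_a = cpu_a
        self.cpu_b = cpu_b
        self.in_service = True

    def step(self, word: int) -> Optional[int]:
        """Run one step; return the agreed output, or None if the board is off."""
        if not self.in_service:
            return None
        out_a = self.cpu_a(word)
        out_b = self.cpu_b(word)
        if out_a != out_b:
            # Comparator mismatch: switch the board off so that
            # nothing from this step reaches the shared bus.
            self.in_service = False
            return None
        return out_a


def good_cpu(word: int) -> int:
    return word + 1


def flaky_cpu(word: int) -> int:
    # Hypothetical fault: a wrong result for one particular input.
    return word + 2 if word == 3 else word + 1


board = Board(good_cpu, flaky_cpu)
results = [board.step(w) for w in range(5)]
print(results)
```

Once the two CPUs disagree, the board stays off and produces nothing further, which is the point: bad data is stopped at the board rather than detected after the fact.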
The entire box has a single system image with paired boards synchronised at the bus read/write level. This means that a failing board can be turned off without affecting system operation - unless, of course, its pair has already failed. IOW the system is completely tolerant to single point failures and also to a more limited set of multi-point failures. You can pull up to half the boards out of the system without affecting its operation or performance provided that you don't pull both halves of a pair.
There is only one copy of the OS and of each application program in memory. Each running process is a single image, but each executes simultaneously on all four processors that make a logical processor: when everything is working correctly the data from three of them is discarded. If a board fails, things go on in the same way, but now the logical processor has only two chips until the board is replaced. On replacement the board is tested, brought up and synchronised with its active pair-mate, so the logical processor is again made of four physical processors. During all this the affected processes have continued to run without interruption and at full speed.
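The board-pair behaviour described above can be sketched roughly as follows. This is a toy model, not how the hardware is built: `Board`, `LogicalProcessor`, the `faulty_at` parameter, and the doubling "computation" are all made up for illustration.

```python
class Board:
    """A board with two lockstepped CPUs (hypothetical model)."""

    def __init__(self, name, faulty_at=None):
        self.name = name
        self.faulty_at = faulty_at  # step at which its two CPUs disagree, if any
        self.in_service = True

    def step(self, n, word):
        if not self.in_service:
            return None
        if n == self.faulty_at:
            self.in_service = False  # the board turns itself off
            return None
        return word * 2  # stand-in for the real computation


class LogicalProcessor:
    """Two paired boards, so four physical CPUs when fully populated."""

    def __init__(self, board_a, board_b):
        self.boards = [board_a, board_b]

    def step(self, n, word):
        outputs = [b.step(n, word) for b in self.boards]
        live = [o for o in outputs if o is not None]
        if not live:
            raise RuntimeError("both boards of the pair have failed")
        return live[0]  # the duplicate results are simply discarded

    def replace(self, index, new_board):
        # Hot-swap: the new board joins its still-running pair-mate.
        self.boards[index] = new_board

    def physical_cpus(self):
        return 2 * sum(b.in_service for b in self.boards)


lp = LogicalProcessor(Board("A", faulty_at=2), Board("B"))
outs, cpus = [], []
for n in range(5):
    if n == 4:
        lp.replace(0, Board("A2"))  # engineer fits the replacement board
    outs.append(lp.step(n, n))
    cpus.append(lp.physical_cpus())
print(outs)
print(cpus)
```

The output stream is unbroken across the failure and the replacement; only the count of physical CPUs behind the logical processor dips from four to two and back.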
It used to be said that any non-fault-tolerant OS could be run on a Stratus with one change: it needed a special fault detecting interrupt handler whose only job was to kick off the phone-home process if an error interrupt occurred.
Two things happen when a board fails: it's switched off and, after a 30 second delay, the system rings Stratus and tells them what broke so they can send an engineer round with a replacement. The delay was introduced because in the early days actual faults were hugely exceeded by false alarms caused by people showing their mates that you could pull a board or two without anything happening apart from BOARD FAILED messages appearing on the console as you took them out and IN SERVICE messages appearing when you stuck the board back in. The delay let you pull the board, say "look Ma, no fault" and stick it back in without the system phoning home.
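As a rough illustration of that delay logic, here is a small Python sketch. The `PhoneHomeTimer` class and its method names are invented for the example, and timestamps are passed in explicitly rather than read from a clock; only the 30 second figure comes from the description above.

```python
PHONE_HOME_DELAY = 30.0  # seconds, per the description above


class PhoneHomeTimer:
    """Report a failed board only if it stays failed for the full delay."""

    def __init__(self, delay=PHONE_HOME_DELAY):
        self.delay = delay
        self.failed_at = {}  # board id -> time the failure was first seen
        self.calls = []      # board ids that were actually reported

    def board_failed(self, board, now):
        # Corresponds to a BOARD FAILED console message.
        self.failed_at.setdefault(board, now)

    def board_in_service(self, board, now):
        # Corresponds to an IN SERVICE message: cancel any pending call.
        self.failed_at.pop(board, None)

    def tick(self, now):
        # Called periodically; phones home for failures older than the delay.
        for board, since in list(self.failed_at.items()):
            if now - since >= self.delay:
                self.calls.append(board)
                del self.failed_at[board]


timer = PhoneHomeTimer()
timer.board_failed("cpu0", now=0.0)       # someone pulls a board to show off
timer.board_in_service("cpu0", now=10.0)  # and pushes it back in
timer.tick(now=30.0)                      # nothing pending, no call
timer.board_failed("disk1", now=40.0)     # a genuine fault
timer.tick(now=60.0)                      # only 20 seconds so far
timer.tick(now=70.0)                      # 30 seconds elapsed: phone home
print(timer.calls)
```

Only the board that stayed failed for the whole delay gets reported; the one that was pulled and reseated inside 30 seconds never triggers a call.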