Stratus tolerates faults on Windows HPC
Think banking transactions (if you dare)
There are many ways to gang up machinery to scale applications on groups of servers or provide a measure of disaster recovery or fault tolerance for those applications. Supercomputer customers are known for spending big bucks on exotic technology, but they're also notorious cheapskates. That's why Linux and the clustering of commodity x86 servers took off a decade ago, essentially wiping out the market for vector supercomputers and nearly knocking out RISC architectures.
So, it may come as some surprise to many of you that Stratus Technologies - one of the venerable vendors of fault tolerant servers for commercial applications - is now trying to get its x64-based ftServer machinery into supercomputer sites, thanks to Microsoft's charge into the high performance computing arena with its Windows HPC Server 2008 edition.
Here's the deal. Fault tolerant servers are really aimed at commercial transaction processing - and they're meant for the kinds of workloads where a system crash is a big problem. (Think banking transactions - if you can bear to think of such things right now). Tandem and Stratus were two of the early sellers of fault tolerant servers; Stratus was a big partner of IBM, while Tandem disappeared into Compaq more than a decade ago and is now part of Hewlett-Packard.
Stratus doesn't make its own processors and systems any more, like it did in the old days, but it does take Xeon servers from Japanese partner NEC and then equip them with chipsets and firmware that allow for the absolute lockstepping of applications running on two distinct physical machines. These machines are identically configured, including identical processors, memory, disks, and such, and they can provide 99.999 per cent uptime for Windows or Linux operating systems.
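As a rough sanity check on what "five nines" actually buys you, here is a back-of-the-envelope calculation - a minimal sketch, not a Stratus figure beyond the 99.999 per cent quoted above:

```python
# Back-of-the-envelope arithmetic: what an availability percentage
# means in expected downtime per year. Illustrative only.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

def downtime_seconds_per_year(availability_pct: float) -> float:
    """Expected downtime per year for a given availability percentage."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100)

print(downtime_seconds_per_year(99.999))  # "five nines": ~315 s, or ~5.3 minutes a year
print(downtime_seconds_per_year(99.9))    # "three nines": ~31,536 s, or ~8.8 hours a year
```

In other words, the gap between a well-run conventional box and a lockstepped pair is the difference between hours and minutes of outage per year.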
Supercomputer centers haven't generally worried about fault tolerance, basically because they have much larger issues to worry about concerning their Linux and, Microsoft hopes, Windows clusters. There are a lot of ways to lash together server nodes to create parallel supercomputers, which are used to model everything from car crashes to the weather to nuclear explosions to the interaction of subatomic particles on an absurdly small scale and the Universe on the large scale.
The typical parallel supercomputer today uses a protocol called Message Passing Interface, or MPI, to link server nodes together so they can pass information between the nodes. This information passing is necessary since calculations that define a simulated state in a region within that simulation - say a 3D chunk of air in the atmosphere - depend on the state of the regions that surround it. So, in simulating stuff, parallel supercomputers chop the job up into pieces, model what's going on in those pieces and their interactions, and show how the whole system changes over time based on initial conditions.
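The decomposition described above can be sketched in a toy example - no real MPI here, just one process standing in for the whole cluster, with the function and variable names being illustrative inventions. Real codes would exchange the boundary values between nodes with MPI calls such as `MPI_Sendrecv`:

```python
# Toy sketch of domain decomposition: the domain is chopped into
# per-node chunks, each step updates a cell from its neighbours, and
# edge ("ghost") cells are exchanged between chunks - the part a real
# cluster would do with MPI messages between nodes.

def step_chunk(chunk, left_ghost, right_ghost):
    """One smoothing step: each cell becomes the average of itself
    and its two neighbours, using ghost values at the chunk edges."""
    padded = [left_ghost] + chunk + [right_ghost]
    return [(padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
            for i in range(1, len(padded) - 1)]

def simulate(domain, n_chunks, steps):
    size = len(domain) // n_chunks
    chunks = [domain[i * size:(i + 1) * size] for i in range(n_chunks)]
    for _ in range(steps):
        new = []
        for i, c in enumerate(chunks):
            # "Message passing": fetch the neighbouring chunks' edge
            # cells (repeating the outermost cell at the domain ends).
            left = chunks[i - 1][-1] if i > 0 else c[0]
            right = chunks[i + 1][0] if i < n_chunks - 1 else c[-1]
            new.append(step_chunk(c, left, right))
        chunks = new
    return [x for c in chunks for x in c]

# Splitting the work across two "nodes" gives the same answer as
# running the whole domain on one - the point of the technique.
print(simulate([0.0, 0.0, 1.0, 0.0, 0.0, 0.0], 2, 1))
```

The key property is that the chunked run and the single-chunk run produce identical results; the ghost-cell exchange is what keeps the pieces consistent with each other.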
While parallel supercomputers cluster their nodes for scalability, they do not generally cluster them for high availability. This could be done, and in fact Microsoft and Stratus will be making the argument that, for key nodes in an HPC cluster, it should be done.
This is something of a surprise position for Stratus to be taking, and the company is up front about it. "Generally, Stratus has had this aversion to clusters, and clusters have been the enemy when it comes to availability," explains Denny Lane, director of product and marketing management at Stratus.
And for commercial data processing - despite the headache of setting up and maintaining high availability clusters, which switch one server's applications over to a backup set of servers in the event of a failure on the primary machine - there is an order of magnitude (or two) difference in the adoption of HA clusters over fault tolerant machines out there in the data center. Lane said that Stratus has sold 10,000 fault tolerant servers worldwide. The world consumes more than 8 million servers a year.
According to Lane, Microsoft approached Stratus with the idea that the marriage of key nodes in a supercomputing cluster running Windows HPC Server (which manage the workloads running on the cluster and access to the nodes) with fault tolerance like that provided by ftServers would make Windows supercomputers more resilient. (And since ftServers support Linux, the idea applies equally well to Linux clusters.)
In terms of Windows-based clusters, Microsoft and Stratus are suggesting that ftServers should be used in what is called the head node, as well as in the broker nodes that run the Windows Communication Foundation (WCF) stack. And perhaps file systems in baby Windows clusters could use ftServers too, Stratus believes, now that it is coming around to Microsoft's thinking. That leaves the workstations, an Active Directory server, a System Center server, and maybe a mail server within the company network where the Windows HPC cluster sits running regular Windows on regular x64 servers.
So why bother using fault tolerance with Windows HPC? "We do a lot to harden the operating system," explains Lane. "We do a lot of work with the I/O vendors to allow Windows and Linux to ride out transient errors." The kind of thing you don't want to have happen to the head and broker nodes in a supercomputer cluster. No one wants to restart a job that takes days, weeks, or months to finish.
Microsoft and Stratus are targeting Windows clusters with around 50 compute nodes, which they reckon is the sweet spot for what they are offering together. That would include one head node and maybe three broker nodes running Windows HPC Server on an ftServer setup (which is actually two servers working in lockstep). A two-socket ftServer running Windows costs somewhere between $20,000 and $25,000 in a reasonable configuration. This is not cheap, of course, but neither is HA clustering and neither is losing work.
Stratus has no plans to take the idea of fault tolerant nodes in HPC clusters to the Linux market yet, but this is something that the company can - and probably will - do in the long run. Right now, Microsoft's marketing muscle is important to get the idea out there. ®
"TheDude" ...time to look at Stratus again
As an employee of Stratus, I am sorry that you encountered problems with the ftServers you worked with in the past. Maybe it is time to look at them again.
Our products are designed to provide users with the highest availability possible for their most critical workloads and to deliver this solution in an operationally simple fashion. This level of focus on availability is not for every customer nor every workload, just those that demand the most available systems in the market. As you may know, we monitor the uptime (hardware & OS) of the thousands of ftServers installed around the world and it currently is 99.99990% - actual, not calculated. That's an average of 32 seconds of downtime a year with off-the-shelf Windows or Linux.
A couple of points to update you on:
* Current models ship with the latest Intel Xeon quad core processors and 15K SAS hard disks
* Our systems are 4U (2 x 2U), which is the same as if you were to cluster two HP DL380 servers
* Pricing will vary depending on configuration, reseller, and geography. If you were to compare our system to the above mentioned HP cluster with shared storage - we would have about a 20% price premium. Most customers feel that is a small price for the higher availability and simpler operation than a cluster.
Any IT professional building an HPC cluster, either Windows or Linux, should consider using a fault tolerant server at the critical points like the Head and Broker nodes.
Hm... I doubt we'd still be using command-line systems even if MS had never existed. We already had ISPF, that Xerox graphical environment, the X Window System and even the Macintosh. (Note: It wasn't called "MacOS" until sometime around 7.5)
Hell, some large organizations still use ISPF to this day, including my local telco and some banks. Even the web basically works as a glorified 3270 terminal!
I somehow doubt MS would get any significant traction in the HPC business; HPC users are more of the scientific type, not the "average Joe Bloggs" user they're used to. It's bad enough that RISC architectures and vector processors have been mostly displaced by el-cheapo x86 junk; now trying to put Windows on top of that is even uglier. I really hope that actual HPC users don't go down this road.
Stratus doesn't live up to the marketing
This would be great if Stratus was a clean and transparent hardware solution; but it isn't.
The Stratus Windows 2003 install relies on a *lot* of proprietary drivers and is hooked into Windows on such a fundamental level that upgrading Windows patches can break the Stratus drivers! I have seen nodes randomly rebooting *both sides* of an ftServer after installing a regular set of patches. Stratus' software is also only rarely updated, which means you have to wait for Stratus to give you a thumbs up on any Microsoft updates you want to install.
Not to mention that a Stratus server with 2-year old hardware (2xdual core Core 2 Duo and 4GB of RAM with SATA, yes, SATA disks) costs over £15000!!! How is that scalable? Why not buy 10 commodity servers instead? Not to mention that your pathetic 4 core server is taking 4U of rack space!! I can get 8 cores in 1U, why do I want this?
All in all I cannot support Stratus and having worked with them for a couple of years I have seen no reason to approach them again whatsoever. They happen to be the only vendor in this particular market but that doesn't mean that they have a good product.
All in all another case of good marketing, bad product. Stay well away....