Stratus tolerates faults on Windows HPC

Think banking transactions (if you dare)

There are many ways to gang up machinery to scale applications on groups of servers or provide a measure of disaster recovery or fault tolerance for those applications. Supercomputer customers are known for spending big bucks on exotic technology, but they're also notorious cheapskates. That's why Linux and the clustering of commodity x86 servers took off a decade ago, essentially wiping out the market for vector supercomputers and nearly knocking out RISC architectures.

So, it may come as some surprise to many of you that Stratus Technologies - one of the venerable vendors of fault tolerant servers for commercial applications - is now trying to get its x64-based ftServer machinery into supercomputer sites, thanks to Microsoft's charge into the high performance computing arena with its Windows HPC Server 2008 edition.

Here's the deal. Fault tolerant servers are really aimed at commercial transaction processing - the kinds of workloads where a system crash is a big problem. (Think banking transactions - if you can bear to think of such things right now.) Tandem and Stratus were two of the early sellers of fault tolerant servers; Stratus was long a big partner of IBM, while Tandem disappeared into Compaq more than a decade ago and is now part of Hewlett-Packard.

Stratus doesn't make its own processors and systems any more, as it did in the old days. Instead, it takes Xeon servers from Japanese partner NEC and equips them with chipsets and firmware that allow applications to run in absolute lockstep on two distinct physical machines. These machines are identically configured - identical processors, memory, disks, and such - and can provide 99.999 per cent uptime for Windows or Linux operating systems.

Supercomputer centers haven't generally worried about fault tolerance, basically because they have much larger issues to worry about concerning their Linux and, Microsoft hopes, Windows clusters. There are a lot of ways to lash together server nodes to create parallel supercomputers, which are used to model everything from car crashes to the weather to nuclear explosions to the interaction of subatomic particles on an absurdly small scale and the Universe on the large scale.

The typical parallel supercomputer today uses a protocol called Message Passing Interface, or MPI, to link server nodes together so they can pass information between the nodes. This information passing is necessary since calculations that define a simulated state in a region within that simulation - say a 3D chunk of air in the atmosphere - depend on the state of the regions that surround it. So, in simulating stuff, parallel supercomputers chop the job up into pieces, model what's going on in those pieces and their interactions, and show how the whole system changes over time based on initial conditions.
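The divide-and-exchange pattern described above can be sketched without any real MPI machinery. The toy below is plain Python with hypothetical names, not actual MPI code - on a real cluster each chunk would live on a separate node and the boundary ("halo") cells would move between nodes via MPI send/receive calls (for instance through a library such as mpi4py). It splits a 1D temperature field across four simulated "ranks" and shows why each rank must see its neighbours' boundary cells before every update step.

```python
# Toy illustration of MPI-style domain decomposition with halo exchange.
# All "ranks" run in one process here; a real supercomputer would keep
# each chunk on its own node and pass the halo cells over the network.

def step(chunks):
    """One diffusion step over per-rank chunks, with halo exchange first."""
    # Phase 1: every rank gathers the boundary cells of its neighbours
    # (at the ends of the domain we just reflect the rank's own edge cell).
    halos = []
    for i, c in enumerate(chunks):
        left = chunks[i - 1][-1] if i > 0 else c[0]
        right = chunks[i + 1][0] if i < len(chunks) - 1 else c[-1]
        halos.append((left, right))
    # Phase 2: each rank updates its own cells using the halo values.
    new_chunks = []
    for (left, right), c in zip(halos, chunks):
        padded = [left] + c + [right]
        new_chunks.append([(padded[j - 1] + padded[j + 1]) / 2
                           for j in range(1, len(padded) - 1)])
    return new_chunks

# Four "ranks", each owning 3 cells; a hot spot in rank 0 diffuses outward.
chunks = [[100.0, 0.0, 0.0], [0.0] * 3, [0.0] * 3, [0.0] * 3]
for _ in range(10):
    chunks = step(chunks)
```

The two phases mirror the real thing: without the exchange in phase 1, rank 1 would never learn that rank 0's edge cell has warmed up, and the simulation would fall apart at every chunk boundary - which is exactly why the interconnect and MPI traffic matter so much on these machines.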

While parallel supercomputers make heavy use of clustering for scalability, they do not generally cluster for high availability. This could be done, and in fact Microsoft and Stratus will be making the argument that, for key nodes in an HPC cluster, it should be done.

This is something of a surprise position for Stratus to be taking, and the company is up front about it. "Generally, Stratus has had this aversion to clusters, and clusters have been the enemy when it comes to availability," explains Denny Lane, director of product and marketing management at Stratus.

And in commercial data processing - despite the headache of setting up and maintaining high availability clusters, which fail one server's applications over to a backup set of servers when the primary machine breaks - HA clusters outsell fault tolerant machines in the data center by an order of magnitude or two. Lane said that Stratus has sold 10,000 fault tolerant servers worldwide. The world consumes more than 8 million servers a year.

According to Lane, Microsoft approached Stratus with the idea that pairing the key nodes in a Windows HPC Server cluster - the ones that manage the workloads running on the cluster and access to its nodes - with the fault tolerance provided by ftServers would make Windows supercomputers more resilient. (And since ftServers support Linux, the idea applies equally well there.)

In terms of Windows-based clusters, Microsoft and Stratus are suggesting that ftServers should be used in what is called the head node, as well as in the broker nodes that run the Windows Communication Foundation (WCF) stack. And perhaps the file systems in baby Windows clusters could use ftServers too, Stratus believes, now that it is coming around to Microsoft's thinking. That leaves the workstations, an Active Directory server, a System Center server, and maybe a mail server within the company network where the Windows HPC cluster sits running regular Windows on regular x64 servers.

So why bother using fault tolerance with Windows HPC? "We do a lot to harden the operating system," explains Lane. "We do a lot of work with the I/O vendors to allow Windows and Linux to ride out transient errors." Those transient errors are exactly the kind of thing you don't want hitting the head and broker nodes in a supercomputer cluster. No one wants to restart a job that takes days, weeks, or months to finish.

Microsoft and Stratus are targeting Windows clusters with around 50 compute nodes, which they reckon is the sweet spot for what they are offering together. That would include one head node and maybe three broker nodes running Windows HPC Server on an ftServer setup (which is actually two servers working in lockstep). A two-socket ftServer running Windows costs somewhere between $20,000 and $25,000 in a reasonable configuration. This is not cheap, of course, but neither is HA clustering - and neither is losing work.

Stratus has no plans to take the idea of fault tolerant nodes in HPC clusters to the Linux market yet, but this is something that the company can - and probably will - do in the long run. Right now, Microsoft's marketing muscle is important to get the idea out there. ®
