Stratus tolerates faults on Windows HPC

Think banking transactions (if you dare)

There are many ways to gang up machinery to scale applications on groups of servers or provide a measure of disaster recovery or fault tolerance for those applications. Supercomputer customers are known for spending big bucks on exotic technology, but they're also notorious cheapskates. That's why Linux and the clustering of commodity x86 servers took off a decade ago, essentially wiping out the market for vector supercomputers and nearly knocking out RISC architectures.

So, it may come as some surprise to many of you that Stratus Technologies - one of the venerable vendors of fault tolerant servers for commercial applications - is now trying to get its x64-based ftServer machinery into supercomputer sites, thanks to the charge of Microsoft into the high performance computing arena with its Windows HPC Server 2008 edition.

Here's the deal. Fault tolerant servers are really aimed at commercial transaction processing - and they're meant for the kinds of workloads where a system crash is a big problem. (Think banking transactions - if you can bear to think of such things right now). Tandem and Stratus were two of the early sellers of fault tolerant servers, the latter being a big partner of IBM and the former having disappeared into Compaq more than a decade ago and now part of Hewlett-Packard.

Stratus doesn't make its own processors and systems any more, like it did in the old days, but it does take Xeon servers from Japanese partner NEC and then equip them with chipsets and firmware that allow for the absolute lockstepping of applications running on two distinct physical machines. These machines are identically configured, including identical processors, memory, disks, and such, and they can provide 99.999 per cent uptime for Windows or Linux operating systems.
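To put that 99.999 per cent figure in perspective, here is a minimal Python sketch that converts an availability percentage into expected downtime per year. The function name and the use of a 365.25-day year are assumptions for illustration, not anything Stratus publishes.

```python
# Back-of-the-envelope conversion from availability to yearly downtime.
# Assumption: a year of 365.25 days; the function name is illustrative.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent):
    """Return the expected minutes of downtime per year."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct} per cent uptime allows about {minutes:.1f} minutes of downtime a year")
```

By this arithmetic, five nines works out to a little over five minutes of downtime a year.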

Supercomputer centers haven't generally worried about fault tolerance, basically because they have much larger issues to worry about concerning their Linux and, Microsoft hopes, Windows clusters. There are a lot of ways to lash together server nodes to create parallel supercomputers, which are used to model everything from car crashes to the weather to nuclear explosions to the interaction of subatomic particles on an absurdly small scale and the Universe on the large scale.

The typical parallel supercomputer today uses a protocol called Message Passing Interface, or MPI, to link server nodes together so they can pass information between the nodes. This information passing is necessary since calculations that define a simulated state in a region within that simulation - say a 3D chunk of air in the atmosphere - depend on the state of the regions that surround it. So, in simulating stuff, parallel supercomputers chop the job up into pieces, model what's going on in those pieces and their interactions, and show how the whole system changes over time based on initial conditions.
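The chop-it-up-and-swap-boundaries idea can be sketched in a few lines of Python. This is a toy under stated assumptions: a one-dimensional row of cells rather than a 3D atmosphere, a simple diffusion rule, and plain list copies standing in for the messages that MPI would carry between nodes. None of the function names come from MPI or Windows HPC Server.

```python
# Toy domain decomposition: a 1D row of cells is split across simulated
# "nodes". Each step, every node needs the edge value of its neighbours
# (the exchange MPI would do over the network; imitated here with lists).


def step(cells, left, right, alpha=0.25):
    """Advance one time step; left and right are the neighbouring edge values."""
    padded = [left] + cells + [right]
    return [
        padded[i] + alpha * (padded[i - 1] - 2 * padded[i] + padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]


def run_single(cells, steps):
    """Reference run: the whole domain lives on one machine."""
    for _ in range(steps):
        cells = step(cells, cells[0], cells[-1])  # insulated ends
    return cells


def run_decomposed(cells, parts, steps):
    """Same simulation, split into equal chunks that swap edge values each step."""
    if len(cells) % parts != 0:
        raise ValueError("cell count must divide evenly across the parts")
    size = len(cells) // parts
    chunks = [cells[p * size:(p + 1) * size] for p in range(parts)]
    for _ in range(steps):
        # "Message passing": gather each chunk's neighbouring edge values.
        lefts = [chunks[p - 1][-1] if p > 0 else chunks[p][0] for p in range(parts)]
        rights = [chunks[p + 1][0] if p < parts - 1 else chunks[p][-1] for p in range(parts)]
        chunks = [step(chunks[p], lefts[p], rights[p]) for p in range(parts)]
    return [value for chunk in chunks for value in chunk]


if __name__ == "__main__":
    start = [0.0] * 8 + [100.0] * 8
    whole = run_single(start, 50)
    split = run_decomposed(start, 4, 50)
    print(all(abs(a - b) < 1e-9 for a, b in zip(whole, split)))
```

The point of the sketch is that the split run only matches the single-machine run because the chunks trade edge values on every step, which is why the nodes in a real cluster have to keep talking to each other.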

While parallel supercomputer clusters have lots of clustering for scalability, they are not generally clustered for high availability. This could be done, and in fact, Microsoft and Stratus will be making the argument that for key nodes in an HPC cluster, it should be done.

This is something of a surprise position for Stratus to be taking, and the company is up front about it. "Generally, Stratus has had this aversion to clusters, and clusters have been the enemy when it comes to availability," explains Denny Lane, director of product and marketing management at Stratus.

And for commercial data processing, despite the headache of setting up and maintaining high availability clusters, which have one server's applications switch over to a backup set of servers in the event of a failure on the primary machine, there is an order of magnitude (or two) difference in adoption of HA clusters over fault tolerant machines out there in the data center. Lane said that Stratus had sold 10,000 fault tolerant servers worldwide. The world consumes more than 8 million servers a year.

According to Lane, Microsoft approached Stratus with the idea that marrying the key nodes in a supercomputing cluster running Windows HPC Server (the ones that manage the workloads running on the cluster and access to the nodes) with fault tolerance like that provided by ftServers would make Windows supercomputers more resilient. (And since ftServers support Linux, the idea applies equally well.)

In terms of Windows-based clusters, Microsoft and Stratus are suggesting that ftServers should be used in what is called the head node, as well as in the broker nodes that run the Windows Communication Foundation (WCF) stack. And perhaps file systems in baby Windows clusters could use the ftServers too, Stratus believes, now that it is coming around to Microsoft's thinking. That leaves the workstation, an Active Directory server, a System Center Server, and maybe a mail server within the company network where the Windows HPC cluster sits running regular Windows on regular x64 servers.

So why bother using fault tolerance with Windows HPC? "We do a lot to harden the operating system," explains Lane. "We do a lot of work with the I/O vendors to allow Windows and Linux to ride out transient errors." Such errors are the kind of thing you don't want to have happen to the head and broker nodes in a supercomputer cluster. No one wants to restart a job that takes days, weeks, or months to finish.

Microsoft and Stratus are targeting Windows clusters with around 50 compute nodes, which they reckon is the sweet spot for what they are offering together. That would include one head node and maybe three broker nodes running Windows HPC Server on an ftServer setup (which is actually two servers working in lockstep). A two-socket ftServer running Windows costs somewhere between $20,000 and $25,000 in a reasonable configuration. This is not cheap, of course, but neither is HA clustering and neither is losing work.
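For a rough sense of the bill, the figures above can be multiplied out in Python. This sketch assumes that the head node and each of the three broker nodes is one two-socket ftServer at the quoted price, and it ignores the 50 compute nodes, software licences, and everything else in the cluster.

```python
# Rough cost of the fault tolerant portion of a ~50-node Windows HPC cluster.
# Assumption: one ftServer per head or broker node, priced at the quoted range.
HEAD_NODES = 1
BROKER_NODES = 3
PRICE_LOW = 20_000   # dollars per two-socket ftServer, low end
PRICE_HIGH = 25_000  # dollars per two-socket ftServer, high end


def ftserver_cost_range(head_nodes=HEAD_NODES, broker_nodes=BROKER_NODES):
    """Return (low, high) dollar totals for the fault tolerant nodes."""
    count = head_nodes + broker_nodes
    return count * PRICE_LOW, count * PRICE_HIGH


if __name__ == "__main__":
    low, high = ftserver_cost_range()
    print(f"ftServer nodes: ${low:,} to ${high:,}")
```

On those assumptions, the four fault tolerant nodes come to somewhere between $80,000 and $100,000.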

Stratus has no plans to take the idea of fault tolerant nodes in HPC clusters to the Linux market yet, but this is something that the company can - and probably will - do in the long run. Right now, Microsoft's marketing muscle is important to get the idea out there. ®
