Big-time Linux cluster breaks cover
16-node failover cluster seeks solvent, non-smoking application vendor.
SGI's FailSafe looks set to be the first high-availability clustering product for Linux to break cover.
The first public demos of the open sourced Linux FailSafe are expected at the LinuxWorld Expo in August, we gather; the binaries were made available on request last week, and should be available on SGI's site this week.
Unlike the current crop of web server clusters, FailSafe was designed to host database and TP applications using the shared-everything model pioneered by DEC in its original VAXClusters (now TruClusters) - a model adopted by almost everyone else except Tandem and Microsoft.
Largely at the request of SuSE, SGI announced earlier this year that it would make the source code available, giving a jump start to other long-term and even more ambitious groundwork to create a VAXish high-availability platform for Linux.
According to SuSE's Alan Robertson, maintainer of the Linux-HA Web site and a lead on the FailSafe project, the source code is still undergoing legal scrutiny.
However, as Robertson acknowledges, the initial release of FailSafe marks the beginning rather than the end of business. Unless a clustered file system such as GFS finds its way into the equation, allowing graceful concurrent access to shared disks, FailSafe must use a crude approach to ensuring data integrity: one node simply cuts the power from its contending rival, a technique which uses the delightful acronym STONITH (or, Shoot The Other Node In The Head).
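For the curious, the STONITH idea can be sketched in a few lines. This is a minimal illustration, not FailSafe's code: the `PowerSwitch` class, the `Node` methods and the five-second timeout are all hypothetical names and values invented here. The point it shows is the ordering: a node may only claim a silent peer's resources after it has confirmed the peer is powered off.

```python
from dataclasses import dataclass, field

HEARTBEAT_TIMEOUT = 5.0  # seconds; hypothetical value chosen for illustration


@dataclass
class PowerSwitch:
    """Stand-in for a remote power controller (hypothetical, simulated in memory)."""
    powered: dict = field(default_factory=dict)

    def power_off(self, node: str) -> bool:
        # A real device would be driven over a serial or network link.
        self.powered[node] = False
        return True


@dataclass
class Node:
    name: str
    last_heartbeat: dict = field(default_factory=dict)
    owned: set = field(default_factory=set)

    def record_heartbeat(self, peer: str, now: float) -> None:
        self.last_heartbeat[peer] = now

    def peer_is_silent(self, peer: str, now: float) -> bool:
        last = self.last_heartbeat.get(peer)
        return last is None or now - last > HEARTBEAT_TIMEOUT

    def take_over(self, peer: str, resources: set, switch: PowerSwitch, now: float) -> bool:
        # Do nothing while the peer is still sending heartbeats.
        if not self.peer_is_silent(peer, now):
            return False
        # Fence first: never touch shared disks until the peer is confirmed off.
        if not switch.power_off(peer):
            return False
        self.owned |= resources
        return True
```

In this sketch a node that has heard from its peer within the timeout refuses to act; once the peer has been silent too long, it powers the peer off and only then takes ownership of the shared resources.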
And Linux FailSafe 1.0 is as much a grab for mindshare as a finished article.
The really ambitious long-term, ground-up Linux HA work came to the fore with the short-lived Linux Cluster Cabal last fall, and continues with Stephen Tweedie's HA architecture and Peter Braam's work on a VAXish Distributed Lock Manager and clustered file systems. There's some overlap here, and Robertson says the FailSafe project is keen to ensure interoperability with the erstwhile Cabalites.
Tweedie himself describes FailSafe as "incredibly important" if Linux is to match the highly available commercial Unixes, but points out it doesn't scale beyond 16 nodes, or provide sophisticated load-balancing. Robertson says that he's keen to agree on APIs for cluster services such as quorum and heartbeat that both projects have in common.
There's no doubt that Linux FailSafe looks like a pretty complete package, right down to the GUI front end for cluster management, and its 16-node cluster stands up to SCO's NSC clusters, let alone Microsoft's two-node MSCS. However, it needs the applications, and porting teams at the likes of Oracle, Informix and IBM need to see a durable-looking API before they can propose a business case. With FailSafe, it looks like they've just got one. ®