The VAX of Life: Sun's cluster guru talks Full Moon
Net Effect™ meets Reg Effect™
Yousef Khalidi, Sun Microsystems Distinguished Engineer, Full Moon chief architect, and Register reader, took us on a ride through the history of Sun Cluster 3.0 yesterday: first as might be seen by visiting aliens, and then from a programmer's perspective. Which was nice: plenty of detail gets left out even from the "technical spec sheets" that accompany press launches these days, and so to counter the Net Effect™, here's the Reg Effect™.
First of all, Khalidi wanted to clear up the impression that the new offering introduces a new proprietary cluster file system of Sun's own making. It doesn't. Khalidi himself had hinted at this possibility in recent talks, such as this one: "next-generation clusters could rely on a Cluster File System (CFS) to enable global access to all files, devices and network resources in the system, and create a full single-system image."
But that isn't the case. "That was an explicit design decision. The last thing we want to do is invent a new proprietary file system. So if something becomes important - like the new Linux file systems - we can plug it in." Sun Cluster 3.0 does indeed employ a "Global File Service", but that boils down to using Solaris' native file system and, if you want it - and if you use disks on non-Solaris systems, you almost certainly will - Veritas VxFS. He left the door open for Linux file systems to plug in to Full Moon in the future.
A bicycle made for two... er, eight nodes
So what does Sun Cluster 3.0 most resemble, when viewed from space? For the sake of argument, is it more like Tandem NSC or, say, VAX clusters and their spiritual descendants?
"It's both shared-nothing and shared-everything. It's shared-nothing in that the hardware topology can be either. But unlike most everybody else, ours does not require a fully connected SAN," says Khalidi.
However, it requires no modifications to existing applications, which is probably the biggest difference between it and systems based on a VMS-ish distributed lock manager (DLM).
"A DLM is several things... an API. We've been talking to ISVs for five years and these ISVs already write for a DLM for VAX, Oracle has its own lock manager... that's fine. But they don't want another API," he says.
It's a measure of the enduring legacy of the world's first commercial cluster from DEC that Yousef (and Sun's marketing lead Andy Ingram) referred to it as VAX, or VAX-like, throughout, even though the technology is now called VMSClusters or TruClusters, migrated to Alpha many years ago, and VAX systems are no longer in production.
"To implement locking [in] Sun Cluster 3.0 we use simple Unix APIs. With either the Solaris file system or Veritas' file system." So programmers can map data across instances of the Solaris OS - the standard Unix way of sharing memory across nodes - although this isn't encouraged as an IPC mechanism.
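Khalidi didn't name the calls, so the following is only a sketch of what "simple Unix APIs" might look like in practice, assuming he means POSIX advisory locks (fcntl) and mmap. The function name, file name and 8-byte counter layout are ours, invented for illustration. The point is that nothing in the code knows it is running on a cluster.

```python
import fcntl
import mmap
import os
import tempfile

def increment_counter(path):
    """Add 1 to an 8-byte counter at the start of a file, under an exclusive lock."""
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)      # blocks until no other process holds the lock
        with mmap.mmap(fd, 8) as view:      # map the first 8 bytes (shared mapping by default)
            value = int.from_bytes(view[:8], "little") + 1
            view[:8] = value.to_bytes(8, "little")
            view.flush()                    # push the change back to the file
        return value
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN)
        os.close(fd)

# Hypothetical demo file; on a cluster it would sit on the globally mounted file system.
demo_path = os.path.join(tempfile.mkdtemp(), "counter.bin")
with open(demo_path, "wb") as f:
    f.write(bytes(8))

first = increment_counter(demo_path)
second = increment_counter(demo_path)
```

Whether the lock and the mapping stay coherent across nodes is the file system's job, not the application's, which is the claim being made for the Global File Service.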
Nor does Sun Cluster mirror processes in the manner of Himalaya (formerly Tandem) NSC machines. Instead, the system logs file behaviour: open, read, write and sync calls. This avoids duplication, argues Khalidi, but preserves everything except session state information. And any real transaction will be obeying these semantics, so that's OK. Since the socket calls that Internet applications use follow Unix file semantics (although yes, we know, IP preceded Unix), that ought to be pretty watertight.
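To make the idea concrete, here is a toy model of logging file operations and replaying them elsewhere. It is our own illustration, not Sun's mechanism: the Primary class, the tuple layout of a log entry and the in-memory "file store" are all hypothetical.

```python
def apply_op(files, op):
    """Apply one logged operation to an in-memory model of the file store."""
    kind, path, offset, data = op
    if kind == "open":
        files.setdefault(path, bytearray())
    elif kind == "write":
        buf = files[path]
        end = offset + len(data)
        if len(buf) < end:
            buf.extend(bytes(end - len(buf)))   # grow the file with zero bytes
        buf[offset:end] = data
    # "read" and "sync" change no file contents in this model

class Primary:
    """Serves file calls and records each one in an operation log."""
    def __init__(self):
        self.files = {}
        self.log = []

    def call(self, kind, path, offset=0, data=b""):
        op = (kind, path, offset, data)
        self.log.append(op)
        apply_op(self.files, op)

def replay(log):
    """Rebuild file state on a standby from the log alone."""
    files = {}
    for op in log:
        apply_op(files, op)
    return files

primary = Primary()
primary.call("open", "/data/orders")
primary.call("write", "/data/orders", 0, b"order-1;")
primary.call("write", "/data/orders", 8, b"order-2;")
primary.call("sync", "/data/orders")

standby_files = replay(primary.log)
```

The standby ends up with the same file contents as the primary, but nothing about who had the file open or where their cursor was, which matches the "everything except session state" caveat.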
That's just one of several fashionable, or once-fashionable, approaches to clusters that the Full Moon team was happy to trample in the design work. Another is "process migration", an approach adopted by, for example, the MOSIX Linux cluster project, which in the event of a node failure fails over individual processes.
"That was removed from the prototype - on purpose," says Khalidi.
As for Compaq's nest of clusters, "Is that five or seven clustering products they sell?" he asks. Most Q users are still using TruCluster 4x, even though when the analysts last looked at their scorecards, Compaq's TruCluster 5x was top of the pops. Sun doesn't want to concede that Compaq has any traction in the internet business, and so it's been written out of Sun's product comparison sheets completely. And it still has "quirks", as he describes them, such as requiring Q's AdvFS for read-write access.
The design goal was to cluster-enable bog-standard Sun kit and applications that are already in use, over standard interconnects (SCI is on its way), using skills that any BOFH can relate to (such as mounting file systems). Anything special that the competition may boast, goes the line, is a "bragging competition".
Those patents in full...
Forty patents have been applied for, says Khalidi. He mentioned process active pair technology (which we know nothing about, but if anyone wants to enlighten us...), mini transaction technology (ditto) and interesting quorum techniques. Quorum is commonly understood in the parallel processing world to be the way a cluster decides which nodes are members of the collective, although Sun's definition is different. For Sun, quorum isn't about membership, and Khalidi prefers fencing to quorum for that job.
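For readers who haven't met the term, the conventional reading of quorum can be sketched as a strict-majority vote count. This is the textbook version only, one vote per node by assumption; it is not Sun's definition, which Khalidi did not spell out for us.

```python
def has_quorum(votes_present, total_votes):
    """True if this partition holds a strict majority of all configured votes."""
    return 2 * votes_present > total_votes

# An eight-node cluster, one vote per node, splits after an interconnect failure.
TOTAL = 8
five_three = (has_quorum(5, TOTAL), has_quorum(3, TOTAL))
four_four = (has_quorum(4, TOTAL), has_quorum(4, TOTAL))
```

In the 5/3 split only the larger side carries on. In the 4/4 split neither side has a majority, which is why tie-breakers and fencing come into the picture.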
He says that the heartbeat code, another staple of HA cluster planning, actually runs at the highest process priority, rather than being some ad hoc or add-on process, which we thought was interesting. In fact, the toughest problems the team had to crack, he said, were around heartbeat issues: not the "someone's just yanked out the Ethernet cable" case, but the one where a node goes quiet because it's under a high workload.
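The quiet-but-alive problem is easier to see in code. Below is a minimal timeout-based monitor of our own devising, with two hypothetical thresholds so that a briefly silent node is marked "suspect" before it is declared "dead". It says nothing about how Sun's heartbeat works, and it doesn't model the scheduling priority Khalidi described.

```python
class HeartbeatMonitor:
    """Tracks when each node was last heard from and classifies its silence."""

    def __init__(self, suspect_after, dead_after):
        self.suspect_after = suspect_after   # seconds of silence before suspicion
        self.dead_after = dead_after         # seconds of silence before giving up
        self.last_seen = {}

    def beat(self, node, now):
        """Record a heartbeat received from node at time now."""
        self.last_seen[node] = now

    def status(self, node, now):
        silence = now - self.last_seen[node]
        if silence >= self.dead_after:
            return "dead"
        if silence >= self.suspect_after:
            return "suspect"
        return "alive"

monitor = HeartbeatMonitor(suspect_after=2.0, dead_after=10.0)
monitor.beat("node-a", now=0.0)
monitor.beat("node-b", now=0.0)
monitor.beat("node-a", now=5.0)   # node-b is busy and misses its beats

at_six = {n: monitor.status(n, now=6.0) for n in ("node-a", "node-b")}
at_twelve = {n: monitor.status(n, now=12.0) for n in ("node-a", "node-b")}
```

Set the thresholds too tight and a heavily loaded node gets fenced for nothing; set them too loose and a real failure goes unnoticed for longer. If the heartbeat sender itself is starved of CPU, no threshold helps, which is presumably why it runs at top priority.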
Trawling around the patent database we found plenty of patents involving Full Moon's precursors, Khalidi's Solaris MC system in particular: specifically, stuff about recovery, clustered file systems and memory mapping. Solaris MC was a clustered Solaris using CORBA for message passing. Anyone want to fill in the blanks? ®