Big Blue kills off CSM clustering
xCAT king of the HPC jungle
For a while now, IBM has had multiple and competing tools for managing AIX and Linux clusters for its supercomputer customers and yet another set of tools that were used for other HPC setups with a slightly more commercial bent to them. But Big Blue has now cleaned house, killing off its closed-source Cluster Systems Management (CSM) tool and tapping its own open source Extreme Cluster Administration Toolkit (known as xCAT) as its replacement.
IBM tapped xCAT as its future clustering tool for HPC setups as part of this week's Dynamic Infrastructure blitz, where the company made a number of storage, networking, and systems management announcements. Because there are loads of customers still using CSM on Power and x64 systems, IBM is giving them plenty of warning that CSM is going the way of all flesh.
The xCAT tool was created in 2002 by Egan Ford, a cluster architect at IBM, so that the clusters Big Blue was building for the largest supercomputer centers in the world would have an open source management tool. The tool could image and provision Red Hat Enterprise Linux, SUSE Linux Enterprise Server, or Windows instances on cluster nodes and then give HPC shops a choice of job schedulers (such as Torque, PBS, Maui, and Moab) to control how jobs are deployed on the clusters as xCAT changes them. IBM put the original xCAT V1 tool out on its alphaWorks experimental software site, and as it grew in popularity, the company decided with xCAT's Version 2 to release the code as an open source project under the Eclipse Public License. (You can see the xCAT project here.)
Last September, the handwriting was on the wall for CSM when xCAT - which had been tweaked with Version 2.3 to support IBM's System x, iDataPlex, BladeCenter, and Power Systems boxes - was given official IBM support contracts and a ridiculously low price (for IBM at least) of $25 per node per year for enhanced support and $60 per node per year for elite support.
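To put those per-node rates in perspective, here is a rough back-of-the-envelope sketch in Python. The two rates come straight from the pricing above; the function name and the 200-node cluster size are illustrative assumptions, not anything IBM publishes, and real contracts may carry terms this arithmetic ignores.

```python
# Back-of-the-envelope annual xCAT support cost.
# Rates are the per-node, per-year list prices quoted in the article;
# the function name and the example cluster size are hypothetical.
RATES_PER_NODE_PER_YEAR = {"enhanced": 25, "elite": 60}  # US dollars


def annual_support_cost(nodes: int, tier: str) -> int:
    """Return the yearly support bill in dollars for a cluster of `nodes` nodes."""
    if nodes < 0:
        raise ValueError("node count cannot be negative")
    if tier not in RATES_PER_NODE_PER_YEAR:
        raise ValueError(f"unknown support tier: {tier!r}")
    return nodes * RATES_PER_NODE_PER_YEAR[tier]


if __name__ == "__main__":
    example_nodes = 200  # assumed cluster size, for illustration only
    for tier in RATES_PER_NODE_PER_YEAR:
        cost = annual_support_cost(example_nodes, tier)
        print(f"{example_nodes} nodes, {tier} support: ${cost:,} per year")
```

On those assumptions, a 200-node cluster would run $5,000 a year for enhanced support and $12,000 a year for elite support.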
CSM was IBM's first attempt to create a cluster management tool that spanned both AIX and Linux, a condition that was forced on Big Blue because of the onslaught of Linux on all HPC installed bases in the late 1990s. At first, CSM was used to manage Linux instances alongside AIX instances on Power-based clusters, and eventually, it was extended to manage Linux instances on x64 rack and blade servers.
CSM was not created from the ground up, like xCAT was, but is rather an offshoot of the company's Parallel System Support Program (PSSP), a cluster management tool that was built specifically for the RS/6000 PowerParallel supercomputers back in the mid-1990s and then deployed on much larger SMP clusters that Big Blue built for the US Department of Energy under the ASCI supercomputer program.
According to a statement of direction put out by IBM, the company will stop selling the Linux versions of CSM on January 29, 2010. Support will be available for the product until April 30, 2011. Customers who are using CSM to manage non-HPC server clusters, such as data warehouses or parallel transaction processing systems, are being encouraged to move to IBM's Systems Director tool and its VMControl plug-ins to manage the provisioning of physical and virtual servers. (You can see more about these tools, also announced this week, here.)
IBM added that while it was mothballing the Linux versions of CSM, it would continue to sell the AIX version of the tool for now and would keep it alive for the currently supported AIX 5.3 and 6.1 releases, including future hardware that comes out and supports those releases, such as the Power7 boxes due throughout 2010. ®
Quote, "It is a mistake to assume that xCat is built from the ground up."
xCAT 2 was/is built from the ground up. In 2007 the xCAT 1 and CSM teams merged. We defined a new framework based on our combined experience and then set forth to build it. xCAT 2 is the best of both worlds (CSM and xCAT 1), and is all new code.
Quote, "It still uses underlying components that are currently used by CSM, including NIM (NIMoL for Linux) for system image deployment and RSCT for monitoring..."
NIM is not used to provision Linux. Each OS has its own unique and native solution. E.g. Kickstart for RH, Autoyast for SuSE, Windows (something) and ImageX for Windows, NIM for AIX, etc...
RSCT is not part of xCAT 2 and is not required. However, it can be used with xCAT 2 if desired. Many AIX shops do this.
Quote, "The real problem is that often the people who write the code often do not work in the real world, and end up making assumptions about the shape of the systems."
I have been designing and deploying some of IBM's largest HPC systems for 10 years.
Quote, "All I can hope for is the fact that xCat2 has come out of Alphaworks means that real system admins have had input to the requirements"
xCAT is an open project. All feedback comes directly to the developers. You can provide any input via the mailing list or the SourceForge site.
Quote, "I wonder how many 200 Power node clusters have been deployed anywhere."
The LANL Roadrunner system (Top #1 system at 1.1 PF) has more than 6000 Power-based Cell blades that boot with OpenFirmware just like any other pSeries machine. The entire system is managed with xCAT 2.
Guess CSM customers will have to migrate
On one hand, I feel sorry for all the IBM CSM customers who now have to migrate to a new software stack. On the other hand, IBM is really doing them a favor, because there are so many better software choices out there. Sun's own HPC software stack (www.sun.com/software/products/hpcsoftware/) and Unicluster from Univa UD are two good options.
Only just got used to CSM!
It is a mistake to assume that xCat is built from the ground up. It still uses underlying components that are currently used by CSM, including NIM (NIMoL for Linux) for system image deployment and RSCT for monitoring, and they all revolve around other well-known systems such as NFS, Kerberos, and rsh/ssh (to say nothing of the open source components).
It's true that the overall gloss on the top is new, but many of the bits under the covers are the same. It's interesting to also see that IBM Director still uses NIM for Power/AIX systems.
I must admit that I believe that the switch from PSSP to CSM was a bit of a dog's dinner (I understand the architectural reasons for the change, that PSSP was designed around the constraints of the AIX SP/2, which were too rigid for the Cluster/1600 offering). It took a couple of years for CSM to even approach the usability that PSSP had, and I fear the same will be true for the switch from CSM to xCat2.
The real problem is that often the people who write the code often do not work in the real world, and end up making assumptions about the shape of the systems. This means that, once you take into account the various networking, commercial support and security restraints placed on real-world systems by people like Government Agencies and the Financial organisations (the people most likely to deploy large scale commercial clusters), very often the management tools, as delivered out of the box, are about as useful as a perforated condom.
Even in HPC environments, it is comparatively unusual to have the systems configured exactly as the vendors suggest. I'm currently working with a large Power6 HPC cluster, and the requirements for outward event reporting to an enterprise reporting system that is NOT Tivoli are causing more than a few problems, along with security, access and data control that IBM had no real incentive to architect solutions for. As a result, it is necessary to dig under the glossy covers of the management and deployment tools using whatever can be found to implement what is needed.
All I can hope for is the fact that xCat2 has come out of Alphaworks means that real system admins have had input to the requirements and may have spotted any potential problems, but I wonder how many 200 Power node clusters have been deployed anywhere.
I'm still not very happy about having to learn another clustering tool, though, and I've still got IBM Director to contend with in the future for non-HPC clusters.