Original URL: https://www.theregister.com/2010/04/19/cray_third_gen_linux/

Cray mimics Ethernet atop SeaStar interconnect

Linux shortcut cooks with SLES

By Timothy Prickett Morgan

Posted in OSes, 19th April 2010 02:02 GMT

Supercomputer maker Cray doesn't talk much about the systems software that runs on its massively parallel, midrange, or entry HPC gear, but it probably will start doing so more because of the work it has done to make its non-standard XT boxes look a little less proprietary as far as Linux applications are concerned.

The company has just started shipping the third generation of its Cray Linux Environment, and this one has a new feature called Cluster Compatibility Mode that is sure to get the interest of HPC shops that might not even give Cray's gear more than a passing thought because they perceive it as exotic and expensive. While the largest supercomputer labs have enough money and manpower to create a parallel super using schools of fish with OS/2 grafted onto their gills if they decided this was a good idea, entry and midrange HPC shops have all the same budget and skills constraints that SMBs have in the "real world" of commercial computing.

So exotic and expensive will not sell down there, and that's why Cray rolled out the CX1 baby super running Windows in conjunction with Intel and Microsoft a year and a half ago and just last month debuted a new midrange lineup called the CX1000 based on Intel's Xeon 5600 and 7500 processors and graphics co-processors from Nvidia.

What Cray customers want is something that is as simple to use as these baby and midrange clusters, which run kosher Red Hat Enterprise Linux or Windows HPC Server 2008 and which have MPI software extensions that run over Ethernet or InfiniBand, but which work on massively parallel Opteron-Linux XT machines and their proprietary SeaStar interconnect.

The SeaStar interconnect at the heart of the XT line started out as the "Red Storm" project at Sandia National Laboratories and was only gradually commercialized by Cray. This interconnect doesn't look or smell anything like either Ethernet or InfiniBand, so applications have to be tweaked to run atop it, which turns a lot of customers off and which, quite frankly, has forced Cray to invent whole new product lines like the CX1 and CX1000 to chase revenue opportunities.

According to Barry Bolding, vice president of scalable systems at Cray, when the Red Storm system was first designed, the system's creators were sure that if they wanted to create a system that scaled to hundreds of thousands of processors in parallel, they would have to create a skinnied-down microkernel based on Linux to be able to squeeze more performance out of the box than would be possible based on a full-blown Linux distribution. This home-grown Linux distro was known as the Cray Linux Environment 1.0, and it ran on the Red Storm super and the XT3 commercialized versions of the boxes.

About four years ago, Cray looked at processor and Linux roadmaps and decided that it would take another approach to putting Linux on its Opteron-based parallel machines, one that would make it more compatible with plain vanilla Linux distros and one that would allow it to support more processors and peripherals besides those baked into its homegrown Linux. And so, the company took Novell's SUSE Linux Enterprise Server 10 distro and did a little semi-homemade cooking like Sandra Lee, New York State attorney general Andrew Cuomo's girlfriend, is famous for doing, quickly turning processed foods into something that looks homemade and as if it took hours of slaving in the kitchen.

Cray took SLES 10 and locked it down and hardened it in various ways, and then added some tweaks for HPC shops in general and specifically to support the SeaStar interconnect at the heart of the XT machines, which is not supported on SLES 10. Cray also disabled a whole bunch of Linux features that are not useful on XT machines and that just end up causing the supers to get less work done. The resulting modified SLES 10 was called the Cray Linux Environment 2.0, and it was supported on the XT3 and XT4 supers that were shipping at the time. The current XT5 machines also can run CLE 2.0.

With the Cray Linux Environment 3.0 operating system, Cray is moving up the Novell Linux stack to support SLES 11, which has been out for a little more than a year and which is due to get its first service pack soon. Bolding says that most of the current installed base has upgraded its iron and therefore its software to CLE 2.0, so they are ready and eager for the enhancements that come with CLE 3.0.

The big change with CLE 3.0, and something that is not part of the standard SLES 11, is the Cluster Compatibility Mode. In the past, because the SeaStar interconnect is not standard, Cray required HPC shops that wanted to run parallel Linux applications and MPI stacks on their XT machines to compile them on CLE, which had tweaks so the compiler could see the SeaStar interconnect and squeeze the absolute most performance out of it. With Cluster Compatibility Mode, Cray is adding an emulation layer to the SeaStar drivers so that, as far as Linux is concerned, the interconnect looks like a normal TCP/IP driver. That means a regular set of x64-based applications that are tuned to run on Ethernet-based clusters, using the MPI protocol for linking nodes, can run unchanged atop CLE 3.0 on the XT iron. No more recompiling applications to run on XT iron.
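
To illustrate why a TCP/IP-shaped driver matters, here is a minimal Python sketch (not Cray's code, and not tied to CLE in any way) of an application talking through the ordinary sockets API. The function names are hypothetical and the loopback address merely stands in for a cluster node; the point is that nothing in the application names the hardware underneath, so a driver that presents itself to Linux as TCP/IP can carry this traffic without the program being touched.

```python
import socket
import threading


def echo_once(server_sock: socket.socket) -> None:
    """Accept one connection and send back whatever arrives first."""
    conn, _ = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data)


def round_trip(message: bytes) -> bytes:
    """Send a short message over TCP and return the echoed reply.

    The code only asks for AF_INET/SOCK_STREAM; which interconnect
    carries the bytes is decided below the sockets API, not here.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # loopback stand-in; port picked by the OS
        server.listen(1)
        port = server.getsockname()[1]

        worker = threading.Thread(target=echo_once, args=(server,))
        worker.start()

        with socket.create_connection(("127.0.0.1", port)) as client:
            client.sendall(message)
            reply = client.recv(1024)

        worker.join()
    return reply


if __name__ == "__main__":
    print(round_trip(b"ping"))
```

An MPI library built for Ethernet clusters sits on the same kind of interface, which is why emulating TCP/IP at the driver level can spare customers a recompile.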

Preview of things to come

Bolding warns that at the moment this is a "feature release," which means the Cluster Compatibility Mode is really a technology preview. He adds that the clone TCP/IP stack riding atop the SeaStar interconnect "can provide reasonable performance for a relatively small number of nodes," but cautions that on very large XT implementations, customers are going to want to fall back on what Cray is now calling Extreme Scalability Mode - recompiling the Linux applications to have their nodes talk directly through the SeaStar interconnect. CCM can scale to 2,048 cores on the TCP/IP stack now, which means somewhere between 85 and 170 nodes, depending on the Opteron processors customers choose.
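
The node range follows from simple division. The sketch below assumes two Opteron sockets per node, which the figures above imply but do not state, and compares six-core and twelve-core chips; the function and constant names are illustrative only.

```python
def nodes_for_core_cap(core_cap: int, cores_per_socket: int,
                       sockets_per_node: int = 2) -> int:
    """Whole nodes that fit under a core cap.

    sockets_per_node=2 is an assumption, not a figure from Cray.
    """
    return core_cap // (cores_per_socket * sockets_per_node)


CCM_CORE_CAP = 2048  # the CCM scaling limit quoted for the TCP/IP stack

twelve_core_nodes = nodes_for_core_cap(CCM_CORE_CAP, 12)  # 24 cores per node
six_core_nodes = nodes_for_core_cap(CCM_CORE_CAP, 6)      # 12 cores per node

print(twelve_core_nodes, six_core_nodes)  # 85 170
```

Under that assumption, twelve-core parts land at the low end of the range and six-core parts at the high end.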

Next year, Cluster Compatibility Mode will get a whole lot more interesting, when Cray supports the OpenFabrics Enterprise Distribution (OFED) drivers for InfiniBand much as it is doing for TCP/IP drivers today. One of the key features of InfiniBand that Ethernet still does not have (but soon will) is called Remote Direct Memory Access, which allows server nodes to talk to each other directly, using InfiniBand controllers to link memory controllers, bypassing the network stack entirely and offering much lower latency than even 10 Gigabit Ethernet. In essence, with support of the OFED drivers, Cray's Cluster Compatibility Mode will allow the SeaStar interconnect to emulate InfiniBand and yield much better performance than the emulated TCP/IP stack being offered initially with CLE 3.0.

"Once we have the OFED drivers, we think we can come very close to our native communications speed," says Bolding. No word yet on how far the emulated InfiniBand will scale in terms of processor nodes, but it has to be pretty far to bother to go to the trouble.

Cray has been working on Cluster Compatibility Mode for the past two years, and Bolding admits that this clever network emulation would have been useful earlier for Cray to expand its addressable market. But at the time, Cray was more concerned with breaking the petaflops barrier at the big supercomputing centers like Oak Ridge National Laboratory that are paying the current bills.

Cray has high hopes for Cluster Compatibility Mode. "We think this will take away the fear of getting a Cray system," explains Bolding. "We have removed cost as a concern over the past few years, and when we did, some customers feared that they would end up getting something that was not compatible with other Linux machines."

Well, of course, the customers were right in this regard. But if the OFED drivers running atop SeaStar and emulating InfiniBand work as well as Bolding says they can, this would indeed be another barrier down. Provided the SeaStar interconnect has enough oomph that emulated InfiniBand performs as well as or better than the real thing, of course. By the way, the emulated Ethernet and InfiniBand drivers will support multiple MPI stacks, so you are not locked in.

CLE 3.0 will initially only ship on the new XT6 and XT6m parallel supers, which use blade servers based on the brand-new twelve-core "Magny-Cours" Opteron 6100 processors from Advanced Micro Devices. The XT6 nodes were previewed last fall at the SC09 supercomputing trade show; Cray has not said that the XT6 nodes are actually shipping yet in volume.

Later this year, Cray will support CLE 3.0 on the XT5 supers, which are based on an earlier six-core Opteron generation but which use the same SeaStar2+ generation of interconnect that was held over for the XT6 nodes. In early 2011, Cray will support CLE 3.0 on XT4 generations of supers, but has no plans to support it on XT3 machines. It is a matter of testing and qualification, which Cray is not going to spend money on with so few of these XT3 machines still in the field. If you want to run an emulated Ethernet-MPI stack on top of an XT3 machine, you have to move up to an XT4 or higher.

Presumably the combination of the upcoming "Gemini" interconnect and the XT6 nodes, which comprise the "Baker" family of Opteron supers slated for later this year, will have some sort of hardware assistance for helping speed up the emulated Ethernet or InfiniBand that the Cluster Compatibility Mode offers inside CLE 3.0. Bolding did not say.

CLE 3.0 has a number of other enhancements. First, it includes Oracle's open source Lustre 1.8 clustered file system, and also supports IBM's General Parallel File System (GPFS) and Panasas clustered file systems. GPFS and Panasas are new; the Cray XTs have been running Lustre since their inception. CLE 3.0 is also designed to scale across 500,000 cores in a parallel cluster, up from a 200,000-core ceiling with CLE 2.0. CLE 3.0 also includes a diagnostic tool called NodeKARE, short for Node Knowledge and Reconfiguration, which makes sure jobs are scheduled to run only on nodes that are behaving themselves and not acting all wobbly.

What Cray has not said is whether or not it will be offering a Cluster Compatibility Mode in conjunction with Microsoft for its XT line of supers. This would clearly be very useful. Although Cray supports Windows HPC Server 2008 on its baby and midrange lineup, this Windows variant is not supported on the massively scalable XT line. But over the long haul, that will have to be a goal for the company, since the point of having an entry and midrange super line is to get the customers and grow them up to full-scale, massively parallel machines as their workloads expand. ®