Supercomputer niche chucks rocks at Nehalem
SiCortex and the Intel 'problem'
As niche supercomputer-maker SiCortex works on the next generation of its line and watches the IT marketing machine gearing up for Intel's impending Nehalem-based Xeon EP, the company says that Chipzilla isn't moving in the right direction for high-performance computing (HPC) workloads.
"The major improvement is in the integration of the memory controller, so you don't have to send all the memory traffic over the northbridge," says Jud Leonard, co-founder and chief architect at SiCortex. "But this only solves one of the three issues that Intel faces - issues that we have already solved. Intel hasn't done anything about solving the performance-per-watt problem, and they haven't done anything in terms of on-chip communication."
That assessment is perhaps a bit harsh, since Intel has in fact moved to successively lower-powered processor designs. The Nehalem chips are an incredible improvement over the old Xeon chips - remember how awful the "Nocona" and "Paxville" Xeons were?
The dual-core Paxville Xeon DP ran at 2.8GHz, had 2MB of L2 cache, 800MT/sec of bandwidth into the front side bus (FSB), and a 135-watt thermal design power (TDP). The Paxville DP chips, which plugged into dual-socket servers in late 2005, supported HyperThreading - each socket provided four threads for operating systems and applications.
The future "Gainestown" Nehalem Xeon EP due on March 31 will run at between 1.86GHz and 3.2GHz with DDR3 main memory running at 800MHz, 1066MHz, or 1333MHz. TDPs will range from 60 to 130 watts.
That top-end, 130-watt part is just stupid unless you have a workload that needs the clocks, so the fair comparison to Paxville Xeon DPs is the 2.93GHz part that has four cores, eight threads, and 8MB of L3 cache. That Xeon EP will handle 6.4GT/sec of memory bandwidth thanks to its integrated memory controllers and the QuickPath Interconnect (QPI) Intel is finally putting into server chips. (Welcome to 2003, Chipzilla.)
When you think about it that way, maybe SiCortex has a point.
But in the last three and a half years, Intel has doubled the Xeon's cores, improved HyperThreading, added lots of L3 cache, boosted memory bandwidth by a factor of eight, and dropped TDPs by 30 per cent. Performance has improved if your application likes more threads, but if it doesn't know how to use those threads the dollars per performance-per-watt calculation isn't impressive. If your applications are memory-constrained, however, you will no doubt see big performance gains from the Xeon EP.
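For anyone who wants to check that generational math, here is a rough sketch using only the figures quoted above. The dictionary keys and variable names are made up for illustration, and the new chip's TDP is derived from the quoted 30 per cent drop rather than taken from a spec sheet.

```python
# Figures quoted in the article; key names are illustrative only.
paxville = {"cores": 2, "threads": 4, "transfer_mt_s": 800, "tdp_w": 135}
nehalem = {"cores": 4, "threads": 8, "transfer_mt_s": 6400}

# Generation-over-generation ratios per socket.
core_ratio = nehalem["cores"] / paxville["cores"]
thread_ratio = nehalem["threads"] / paxville["threads"]
bandwidth_ratio = nehalem["transfer_mt_s"] / paxville["transfer_mt_s"]

# Assumption: apply the quoted 30 per cent TDP drop to Paxville's 135 watts
# to get the implied TDP of the comparable Nehalem part.
TDP_DROP = 0.30
implied_tdp_w = paxville["tdp_w"] * (1 - TDP_DROP)

print(f"cores x{core_ratio:.0f}, threads x{thread_ratio:.0f}, "
      f"bandwidth x{bandwidth_ratio:.0f}, implied TDP ~{implied_tdp_w:.1f}W")
```

Note that this takes the article's transfer rates at face value as a stand-in for bandwidth. Real delivered bandwidth depends on bus width and protocol overhead.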
"But here's the problem," says Leonard. "When you get right down to it, these x64 chips were really designed for desktop environments, and they have to hang off an expensive network fabric in supercomputers." As an example, he said that Sun Microsystems' gazillion-port Magnum InfiniBand switch at the heart of its blade-style HPC systems costs more than an entire SiCortex system.
The SiCortex system-on-chip (SoC) nodes were upgraded last October, and sport six MIPS cores running at 700MHz with a built-in router handling mesh networking of the chips. That on-chip network means you don't have to do any external networking to get the processors to do the communication necessary for HPC workloads - usually MPI but also a fair amount of plain-vanilla TCP/IP.
The October upgrade, when SiCortex's chip fab TSMC moved to a 90nm process, boosted the speed of the chip from 500MHz to 700MHz. A full 5,832-core machine supports 8TB of DDR2 memory running at 667MHz or 800MHz, supplying 2.1Tbps of I/O bandwidth over the internal network and 8.1 teraflops of number-crunching power.
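The teraflops figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes each MIPS core retires two floating-point operations per clock, a number SiCortex did not give here but which is what the quoted totals imply. The function name is ours.

```python
# Figures from the article.
CORES = 5832
CORES_PER_NODE = 6
CLOCK_HZ = 700e6

# Assumption: two floating-point operations per core per cycle,
# inferred from the quoted totals rather than stated by SiCortex.
FLOPS_PER_CYCLE = 2


def peak_teraflops(cores, clock_hz, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak: cores x clock x flops per cycle, in teraflops."""
    return cores * clock_hz * flops_per_cycle / 1e12


nodes = CORES // CORES_PER_NODE
current_peak = peak_teraflops(CORES, CLOCK_HZ)
print(f"{nodes} nodes, peak of about {current_peak:.2f} teraflops")
```

Under that assumption the peak comes out a shade above the quoted 8.1 teraflops, which suggests the published number was rounded down.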
Because the speed of the processors is close to the speed of the memory in the SiCortex box, the MIPS cores are not wasting cycles waiting for memory to fill them up with data. In x64 machines these days, memory tends to run about half as fast as the CPUs.
"Our attitude is that HPC applications are constrained by memory access, so you have to spread the application over a large number of cores," says Leonard. But doing so efficiently requires that on-chip network, as far as SiCortex is concerned.
The SiCortex super runs a variant of Gentoo Linux and has a tweaked version of the Lustre open-source clustered file system controlled by Sun and used by many supercomputer centers. Sun bought the company behind the Lustre project in September 2007.
SiCortex bought parallel-compiler maker PathScale in August 2007 and ported the compiler stack to its MIPS-based Gentoo rev. SiCortex also replaced the Linux boot sequence and added system-management tools, but Leonard says that the machine looks and feels like any normal Linux-based Beowulf cluster.
Don't expect SiCortex to suddenly announce a version of its box based on Intel's Nehalem chips. Leonard won't divulge many details of the next generation of SiCortex machines, but he did say that they'll be "substantially speeded up," have more cores per die, and take advantage of a "major step" in semiconductor processes from TSMC. The only hint he gave is that SiCortex will exceed a 1GHz clock speed.
Leonard added that the company has created software to allow multiple SiCortex machines to be daisy-chained to share workloads, but said that code is not yet ready for prime time.
We're just guessing, but SiCortex could move to DDR3 memory and get it and the cores running at 1.33GHz. A process shrink from 90nm to 65nm could allow the clocks to be cranked that high, and maybe the core count could jump from 6 to 10 per die, as well. That would put the SiCortex kicker at about 13.6 teraflops.
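Here is how that guess pencils out, as a rough sketch. It assumes the node count stays at 972 (today's 5,832 cores divided by six per die) and that each core still manages two floating-point operations per cycle, which is inferred from the current machine's quoted totals and not confirmed by SiCortex. All names are illustrative.

```python
# Current machine, per the article.
CURRENT_CORES = 5832
CURRENT_CORES_PER_DIE = 6
CURRENT_CLOCK_HZ = 700e6

# Speculative next generation, per the guess above.
NEXT_CORES_PER_DIE = 10
NEXT_CLOCK_HZ = 1.33e9

# Assumption: two flops per core per cycle, and an unchanged node count.
FLOPS_PER_CYCLE = 2
nodes = CURRENT_CORES // CURRENT_CORES_PER_DIE
next_cores = nodes * NEXT_CORES_PER_DIE


def peak_teraflops(cores, clock_hz):
    """Theoretical peak in teraflops under the assumed flops per cycle."""
    return cores * clock_hz * FLOPS_PER_CYCLE / 1e12


# Scaling the core count only, at today's 700MHz clock.
cores_only = peak_teraflops(next_cores, CURRENT_CLOCK_HZ)
# Scaling both the core count and the clock.
cores_and_clock = peak_teraflops(next_cores, NEXT_CLOCK_HZ)

print(f"{next_cores} cores: {cores_only:.1f} teraflops at 700MHz, "
      f"{cores_and_clock:.1f} teraflops at 1.33GHz")
```

On these assumptions, the 13.6 teraflops figure lines up with the jump in core count alone. If the faster clock delivered its full benefit as well, the theoretical peak would be higher still, so treat 13.6 as the conservative end of the guess.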
There's an even more interesting possibility for future SiCortex HPC machines. With TSMC now partnering with Intel to build Atom x64-based SoCs for embedded applications, it is just faintly possible that SiCortex might move from MIPS to Atom chips, thereby allowing customers to more easily move their x64 applications to SiCortex iron.
Leonard in no way suggested this was the plan. But it's an interesting possibility, and perhaps a good long-term plan for SiCortex or anyone else who wants to take on the HPC space with low-power parallel machines.
And it is often better to have Chipzilla as an ally than an enemy. ®