ScaleMP scales up to 128 nodes
The 128-node, 64 TB virtual SMP
Big SMP boxes are expensive, and puny two-socket and four-socket boxes are cheap. And for years now, ScaleMP has been trying to use software and fast networks to make a big server out of a bunch of little ones.
Managing clusters is a pain in the neck, and applications really like to have a single address space to play in. That is what ScaleMP's vSMP Foundation is all about: it is part server BIOS, part virtualization hypervisor, and part symmetric multiprocessor glue, aggregating many servers into one virtual SMP box. The initial ScaleMP software essentially used InfiniBand as an SMP backplane to link 16 server nodes into a single system image with coherent memory.
Last fall, in order to chase the enterprise and HPC shops that wanted to make a cheap eight-socket box out of four two-socket boxes, ScaleMP announced a new version of its software, called vSMP Foundation for SMB Edition, that could link the four nodes directly by cross-wiring the InfiniBand host channel adapters inside the servers. No switch required. And a fake eight-way using four-core Xeon 5500s had 35 per cent more floating point performance, 75 per cent better memory bandwidth, and a 25 per cent lower cost of acquisition than a real eight-way box using the six-core Opteron 8400 processors from Advanced Micro Devices.
Granted, ScaleMP's vSMP Foundation only supports Linux and only supports certain kinds of workloads, like HPC and messaging jobs that would otherwise have nodes chatting with each other across some other software layer in the system. Which is one of the reasons why ScaleMP only has 175 customers so far. It really needs to support Windows to go mainstream.
With vSMP Foundation 3.0, ScaleMP is pushing scalability out by nearly an order of magnitude. Now, 128 nodes can be lashed together instead of 16, and the address space across all those nodes is 64 TB instead of the 4 TB of vSMP Foundation 2.0 and 2.1. Also, with the 2.X releases, the number of threads was maxed out at 16 per physical server node, and yes, that counted HyperThreads if you had them turned on with Xeon processors. (Opterons do not support simultaneous multithreading, so in their case you count the cores to figure out how many threads you are using.)
With vSMP Foundation 3.0, now you can have 128 threads per server node inside the vSMP cluster. And that means, for instance, that you can build an eight-socket server node using real SMP using Intel's "Beckton" chipset and its "Nehalem-EX" Xeon 7500 processor, which sports eight cores and HyperThreading. And then you could glue up to 128 of these nodes together using the vSMP Foundation code to make a big, badass virtual SMP box with 16,384 threads. Not that anyone is going to do this, because they don't have the $100,000 or more per node it would take to do it. And heaven knows if applications would be able to take advantage of this, but it would be fun to try to figure it out if you were playing with other people's money.
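For what it's worth, the arithmetic behind that thread count checks out. A quick back-of-the-envelope sketch, using the node, socket, and core figures above:

```python
# Sizing a maxed-out vSMP Foundation 3.0 cluster, per the figures above:
# 128 nodes, each an eight-socket Nehalem-EX box with eight cores per
# socket and HyperThreading turned on (two threads per core).
nodes = 128
sockets_per_node = 8
cores_per_socket = 8
threads_per_core = 2  # HyperThreading enabled

threads_per_node = sockets_per_node * cores_per_socket * threads_per_core
total_threads = nodes * threads_per_node

print(threads_per_node)  # 128 -- the new per-node thread ceiling in vSMP 3.0
print(total_threads)     # 16384 -- threads across the whole virtual SMP
```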
The important thing is that with the vSMP Foundation 3.0 code - which the company calls a virtual machine, although it is a different sort of beast from the carved-up slices of a physical server that ESX Server, KVM, Xen, or Hyper-V create - vSMP can now scale like it never could before. So as Intel and AMD add cores and threads to their commodity two-socket and four-socket boxes, vSMP clusters will be able to borg them into the vSMP collective.
By the way, according to Benjamin Baer, vice president of marketing at ScaleMP, as long as the processors are running at the same clock speed, vSMP allows for server nodes to have two, four, or eight sockets, and they can come from different vendors so long as they have InfiniBand HCAs. So just like you could cobble together a Beowulf Linux cluster to make a supercomputer out of random PCs and servers you had lying around, you can create a virtual SMP out of random servers you have generating a lot of heat but probably doing little work in the data center. It's a pity that vSMP is closed source and not free, because otherwise it might have taken off like Beowulf, pushing Linux into even more markets.
The vSMP Foundation 2.0 was more enthusiastic about supporting Ethernet as a node interconnect, but by the 2.1 release, ScaleMP stopped talking about it, and it is not supporting Ethernet as an interconnect in the 3.0 release. This is ironic, given that ScaleMP was using only Gigabit Ethernet with 16 nodes with vSMP 2.0, and here we are with ten times that bandwidth available over Ethernet today.
While you still need InfiniBand to act as the backplane in the virtual SMP with the 3.0 release, ScaleMP understands that companies have 10 Gigabit Ethernet networks, so the server nodes can now talk to the outside world through 10 GE NICs (specifically, there are drivers for Broadcom NetXtreme II 57711 10 GE adapters) and to storage through Fibre Channel (using Emulex LPe12XXX 8 Gb/sec adapters). If ScaleMP's business is to grow and break out of the HPC niche it is in now, the company knows it has to support Ethernet as an interconnect.
"We know we need to broaden the interconnect support to play on Cisco and other servers," admits Baer. But right now, the latency and bandwidth advantages, particularly as a virtual SMP backplane, go to InfiniBand.
The 3.0 release also allows for multiple InfiniBand links to be added to each server, which can be used for redundancy and multipathing or to multiply the bandwidth between nodes. In the 2.0 release, you had one InfiniBand host channel adapter per server node in the vSMP cluster, but now you can cram four into each node and vSMP knows how to make the best use of them.
In fact, having multiple HCAs is recommended for machines with high core counts to balance out performance, and with four 40 Gb/sec InfiniBand links per node, that effectively delivers a backplane with 160 Gb/sec of aggregate bandwidth. This is a lot more oomph than vSMP 2.X could deliver. Perhaps enough for it to run back-office workloads like databases. We'll see. ScaleMP is mum on that subject at the moment.
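The bandwidth math is simple enough to sketch, assuming the maximum of four HCAs per node and 40 Gb/sec InfiniBand links as cited above:

```python
# Aggregate per-node backplane bandwidth with multiple InfiniBand HCAs,
# per the figures above: up to four HCAs, each on a 40 Gb/sec link.
hcas_per_node = 4
link_speed_gbps = 40  # 40 Gb/sec InfiniBand

aggregate_gbps = hcas_per_node * link_speed_gbps
print(aggregate_gbps)  # 160 Gb/sec of aggregate bandwidth per node
```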
My experience with ScaleMP is not very good. I tested software which ran beautifully on a 24-core (six quad-core Opterons) machine (an 18-22x speed-up), crunching through a multi-scale analysis of a 1.2 gigapixel image in 60 seconds. On a 64-core (4x 16 cores, if I am right) ScaleMP box, performance was DISMAL. As more threads are added, the performance tends to drop severely. On a single thread I would get a timing of, say, 60 seconds for a smallish data set; on two threads it took anything up to 5 minutes. The scheduler NEVER puts two threads of the same program on a single board, but scatters them far and wide. You can only gain speed-up on these boxes if you have many light-weight processes which do not need to share much memory. Did we not have clusters for that?
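For readers hitting the same wall: one blunt workaround on Linux is to pin a chatty multi-threaded program to the cores of a single board, so its threads share real local memory rather than the InfiniBand backplane. A minimal sketch, assuming each board exposes 16 contiguously numbered logical CPUs (that numbering is an assumption, not something vSMP guarantees):

```python
import os

# The complaint above, in miniature: vSMP presents all boards as one big
# SMP, so the Linux scheduler is free to scatter a program's threads
# across boards, turning shared-memory traffic into InfiniBand traffic.
# A blunt workaround (a sketch, not ScaleMP's documented practice) is to
# pin the whole process to the cores of one board.

CORES_PER_BOARD = 16  # assumption: each board exposes 16 logical CPUs

def board_cores(board: int, cores_per_board: int = CORES_PER_BOARD) -> set:
    """Logical CPU numbers on one board, assuming contiguous numbering."""
    start = board * cores_per_board
    return set(range(start, start + cores_per_board))

if __name__ == "__main__":
    target = board_cores(0)
    if hasattr(os, "sched_setaffinity"):  # Linux-only call
        # Intersect with the CPUs this machine actually has.
        os.sched_setaffinity(0, target & set(range(os.cpu_count() or 1)))
        print(sorted(os.sched_getaffinity(0)))
```

The same effect can be had from the command line with `taskset` or `numactl`, without touching the program's source.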