Original URL: http://www.theregister.co.uk/2010/05/26/scalemp_vsmp_3/
ScaleMP scales up to 128 nodes
The 128 node, 64TB virtual SMP
Big SMP boxes are expensive, and puny two-socket and four-socket boxes are cheap. And for years now, ScaleMP has been trying to use software and fast networks to make a big server out of a bunch of little ones.
Managing clusters is a pain in the neck, and applications really like to have a single address space to play in. That is what ScaleMP's vSMP Foundation is all about: it is part server BIOS, part virtualization hypervisor, and part symmetric multiprocessor, aggregating many servers into one virtual SMP box. The initial ScaleMP software essentially used InfiniBand as an SMP backplane to link 16 server nodes into a single system image with coherent memory.
Last fall, in order to chase the enterprise and HPC shops that wanted to make a cheap eight-socket box out of four two-socket boxes, ScaleMP announced a new version of its software, called vSMP Foundation for SMB Edition, that could link the four nodes directly by cross-wiring the InfiniBand host channel adapters inside the servers. No switch required. And a fake eight-way using four-core Xeon 5500s had 35 per cent more floating point performance, 75 per cent better memory bandwidth, and 25 per cent lower cost of acquisition than a real eight-way box using the six-core Opteron 8400 processors from Advanced Micro Devices.
Granted, ScaleMP's vSMP Foundation only supports Linux and only supports certain kinds of workloads, like HPC and messaging jobs that would otherwise have nodes chatting with each other across some other software layer in the system. Which is one of the reasons why ScaleMP only has 175 customers so far. It really needs to support Windows to go mainstream.
With vSMP Foundation 3.0, ScaleMP is taking scalability out nearly an order of magnitude. Now, 128 nodes can be lashed together instead of 16, and the address space is now 64 TB across all those nodes instead of 4 TB with vSMP Foundation 2.0 and 2.1. Also, with the 2.X releases, the number of threads per machine was actually maxed out at 16 for each physical server node, and yes, that was counting HyperThreads if you had them turned on with Xeon processors. (Opterons do not support simultaneous multithreading, so in this case, you count the cores to figure out how many threads you are using.)
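To put numbers on that jump, here is a minimal Python sketch that compares the limits quoted above. The dictionary and function names are invented for illustration; only the figures (16 versus 128 nodes, 4 TB versus 64 TB, and the 16-thread cap per node in the 2.X releases) come from the article.

```python
# Limits quoted in the article for the 2.X and 3.0 releases.
V2 = {"nodes": 16, "memory_tb": 4}
V3 = {"nodes": 128, "memory_tb": 64}

# The 2.X releases capped each physical node at 16 threads.
V2_THREADS_PER_NODE = 16


def growth_factors(old: dict, new: dict) -> dict:
    """Return new/old for every limit present in the old release."""
    return {key: new[key] / old[key] for key in old}


GROWTH = growth_factors(V2, V3)
V2_MAX_THREADS = V2["nodes"] * V2_THREADS_PER_NODE

print(GROWTH)          # node count and address space growth
print(V2_MAX_THREADS)  # largest thread count a 2.X virtual SMP could expose
```

On these figures the node count grows eightfold and the address space sixteenfold, while the biggest 2.X virtual SMP topped out at 256 threads.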
With vSMP Foundation 3.0, you can now have 128 threads per server node inside the vSMP cluster. That means, for instance, that you can build a real eight-socket SMP server node using Intel's "Beckton" chipset and its "Nehalem-EX" Xeon 7500 processor, which sports eight cores and HyperThreading. And then you could glue up to 128 of these nodes together using the vSMP Foundation code to make a big, badass virtual SMP box with 16,384 threads. Not that anyone is going to do this, because they don't have the $100,000 or more per node it would take to do it. And heaven knows if applications would be able to take advantage of this, but it would be fun to try to figure it out if you were playing with other people's money.
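The arithmetic behind that 16,384-thread figure can be sketched in a few lines of Python. The function and constant names are hypothetical; the inputs (eight sockets, eight cores per socket, two HyperThreads per core, 128 nodes, and a floor of $100,000 per node) are taken from the paragraph above, and the hardware bill is a lower bound only, since the article says "or more".

```python
def threads_per_node(sockets: int, cores_per_socket: int, threads_per_core: int) -> int:
    """Hardware threads that one physical server node contributes."""
    return sockets * cores_per_socket * threads_per_core


def cluster_threads(nodes: int, per_node: int) -> int:
    """Total threads visible in the aggregated virtual SMP."""
    return nodes * per_node


# Eight-socket Nehalem-EX node: eight cores per socket, HyperThreading on.
NODE_THREADS = threads_per_node(sockets=8, cores_per_socket=8, threads_per_core=2)

# The largest vSMP 3.0 configuration described: 128 such nodes.
TOTAL_THREADS = cluster_threads(nodes=128, per_node=NODE_THREADS)

# Lower bound on the hardware bill at $100,000 per node.
MIN_HARDWARE_COST = 128 * 100_000

print(NODE_THREADS, TOTAL_THREADS, MIN_HARDWARE_COST)
```

A fully loaded node lands exactly on the new 128-thread ceiling, and the full 128-node machine would run to at least $12.8m in servers before any software is licensed.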
The important thing is that with the vSMP Foundation 3.0 code - which the company calls a virtual machine, and which it is, albeit of a different sort from the carved-up containers that ESX Server, KVM, Xen, or Hyper-V create on a physical server - vSMP can now scale like it never could before. So as Intel and AMD add cores and threads to their commodity two-socket and four-socket boxes, vSMP clusters will be able to borg them into the vSMP collective.
By the way, according to Benjamin Baer, vice president of marketing at ScaleMP, as long as the processors are running at the same clock speed, vSMP allows for server nodes to have two, four, or eight sockets, and they can come from different vendors so long as they have InfiniBand HCAs. So just like you could cobble together a Beowulf Linux cluster to make a supercomputer out of random PCs and servers you had lying around, you can create a virtual SMP out of random servers you have generating a lot of heat but probably doing little work in the data center. It's a pity that vSMP is closed source and not free, because otherwise it might have taken off like Beowulf, pushing Linux into even more markets.
With vSMP Foundation 2.0, ScaleMP was more enthusiastic about supporting Ethernet as a node interconnect, but by the 2.1 release it had stopped talking about it, and it is not supporting Ethernet as an interconnect in the 3.0 release. This is ironic, given that ScaleMP was using only Gigabit Ethernet with 16 nodes with vSMP 2.0 and here we are with ten times that bandwidth available over Ethernet today.
While you still need InfiniBand to act as the backplane in the virtual SMP with the 3.0 release, ScaleMP understands that companies have 10 Gigabit Ethernet networks, and the server nodes can now talk to the outside world through 10 GE NICs (specifically, there are drivers for Broadcom NetXtreme II 57711 10 GE adapters) and to storage through Fibre Channel (using Emulex LPe12XXX 8 Gb/sec adapters). If ScaleMP's business is to grow and break out of the HPC niche it is in now, the company knows it has to support Ethernet as an interconnect.
"We know we need to broaden the interconnect support to play on Cisco and other servers," admits Baer. But right now, the latency and bandwidth advantages, particularly as a virtual SMP backplane, go to InfiniBand.
The 3.0 release also allows for multiple InfiniBand links to be added to each server, which can be used for redundancy and multipathing or to provide twice the bandwidth between nodes. In the 2.0 release, you had one InfiniBand host channel adapter per server node in the vSMP cluster, but now you can cram four into each node and vSMP knows how to make the best use of them.
Having multiple HCAs is, in fact, recommended for machines with high core counts to balance out performance. With four HCAs in a node and 40 Gb/sec InfiniBand switches, that effectively delivers a backplane with 160 Gb/sec of aggregate bandwidth. This is a lot more oomph than vSMP 2.X could deliver. Perhaps enough for it to run back-office workloads like databases. We'll see. ScaleMP is mum on that subject at the moment.
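The 160 Gb/sec figure is simply the per-link rate multiplied by the number of adapters, as this small Python sketch shows. The function name is made up, and the sketch assumes the links add up perfectly; it ignores encoding and protocol overhead, and it assumes none of the adapters is held back for redundancy, so treat the result as a ceiling rather than a measured number.

```python
def aggregate_bandwidth_gbps(hcas_per_node: int, link_gbps: float) -> float:
    """Nominal aggregate bandwidth if every HCA link is used in parallel.

    Assumes perfect scaling across links, which real workloads will not see.
    """
    if hcas_per_node < 1:
        raise ValueError("a node needs at least one HCA")
    return hcas_per_node * link_gbps


# vSMP 2.0: one HCA per node. vSMP 3.0: up to four per node.
SINGLE_HCA = aggregate_bandwidth_gbps(hcas_per_node=1, link_gbps=40.0)
FOUR_HCAS = aggregate_bandwidth_gbps(hcas_per_node=4, link_gbps=40.0)

print(SINGLE_HCA, FOUR_HCAS)
```

If one or more of the four adapters is reserved for failover rather than throughput, the usable figure drops accordingly.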
In addition to the new vSMP, ScaleMP is rolling out a tech preview of a capability it calls VM-on-VM. Yup, you guessed it. You spend all this dough to aggregate servers, and then you spend some more to buy hypervisors to carve them back up again. The idea is a replay of the SMP argument, but done for server virtualization. Buying bigger and more expensive boxes to cram more VMs onto a single physical box is what a lot of companies are doing, and the rising average selling prices for servers reflect this.
So rather than buy a lot of fat boxes and run them at a certain utilization, you buy a larger number of cheaper two-socket boxes and make a virtual SMP on the fly when a fat VM is actually required. You get out of buying extra capacity for peaks and instead create it on the fly from a pool of resources when you need it. No word on what kind of overhead it takes to run KVM or Xen, the hypervisors ScaleMP is initially supporting, atop the vSMP hypervisor, but that's what the tech preview is all about: figuring out the overhead and the issues.
vSMP Foundation 3.0 is in beta right now at selected customers. The code will be generally available on June 14 for configurations supporting up to 32 nodes. Three Nehalem-EX systems will be certified to run vSMP 3.0 on day one. Dell, which already resells the prior releases, and IBM are talking to ScaleMP right now about how they might use the code, and Hewlett-Packard and Oracle have expressed some interest. (Personally, I think Amazon should use it to make proper SMPs on its EC2 cloud. For all I know, it already does.) The VM-on-VM feature is expected to be ready by the middle of summer this year, and support for between 32 and 128 nodes is expected by the end of the third quarter.
vSMP Foundation 3.0 costs $2,500 per node; a special HPC variant costs $1,700 per node. There is also a vSMP Foundation for Cloud Edition, aimed at service providers who want to be able to make virtual SMPs out of pools of thousands of server nodes on the fly, that costs $50,000 for a base license plus $500 per node under management. This cloud edition has the same scalability limits for a vSMP; it just has the tools and the license to span a lot more iron and to do so more cheaply. ®
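For anyone wanting to see where the cloud edition starts to undercut per-node licensing, here is a rough Python sketch using only the list prices quoted above. The function names are hypothetical, and the comparison assumes the editions are interchangeable on price alone with no volume discounts or support fees, which the article does not claim.

```python
STANDARD_PER_NODE = 2_500  # vSMP Foundation 3.0
HPC_PER_NODE = 1_700       # HPC variant
CLOUD_BASE = 50_000        # Cloud Edition base license
CLOUD_PER_NODE = 500       # Cloud Edition, per node under management


def per_node_cost(nodes: int, price: int) -> int:
    """License cost for an edition priced purely per node."""
    return nodes * price


def cloud_cost(nodes: int) -> int:
    """License cost for the Cloud Edition: base fee plus per-node fee."""
    return CLOUD_BASE + nodes * CLOUD_PER_NODE


def first_node_count_where_cloud_is_cheaper(price: int) -> int:
    """Smallest node count at which the Cloud Edition costs strictly less."""
    # Cloud wins when CLOUD_BASE + 500*n < price*n, i.e. n > CLOUD_BASE / (price - 500).
    return CLOUD_BASE // (price - CLOUD_PER_NODE) + 1


VS_STANDARD = first_node_count_where_cloud_is_cheaper(STANDARD_PER_NODE)
VS_HPC = first_node_count_where_cloud_is_cheaper(HPC_PER_NODE)

print(VS_STANDARD, VS_HPC)
print(per_node_cost(128, STANDARD_PER_NODE), per_node_cost(128, HPC_PER_NODE), cloud_cost(128))
```

On list prices, the two come out level at 25 nodes against the standard edition, and the cloud edition is cheaper from 26 nodes up; against the HPC variant the crossover is 42 nodes. For a 128-node pool the bill would be $320,000, $217,600, or $114,000 respectively.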