3Leaf makes big SMPs out of x64 clusters
Shared memory trumps virtualization
Everybody is looking to shake up the server business these days, it seems. But everyone had better get in line behind 3Leaf Systems, which is launching its much-awaited "Aqua" system pooling and virtualization chipset and an intriguing x64 system to match.
A little more than two years ago, 3Leaf Systems came out of stealth mode with a funky I/O virtualization product for x64 servers and the promise that it would create a special chip for system boards that would allow for the virtualization of multiple CPUs across dozens or hundreds of servers and - here's the neat part - the ability to create virtual symmetric multiprocessors on the fly from a pool of processor cores.
3Leaf then went quiet, and was not heard from again until this fall, when El Reg noticed that 3Leaf had some killer results on the SPECjbb2005 Java application benchmark. At the time, the company didn't want to talk about its Voyager ASIC and the clustering and virtualization technology it implements, and said that it was only doing the test because it was required to do so by a prospective customer. That was ahead of the launch of the Voyager ASIC and related systems, which come out today for shipment later this month.
Let's back up for a second, because El Reg seems to have missed the original launch of 3Leaf back in May 2007. The company was started back in 2003 by Bob Quinn, who is currently chairman and chief technology officer at 3Leaf. Quinn saw that as switches kept getting faster and faster (as they have), there would be a crossover point where they could be used as the backplane of SMP systems, which tightly cluster servers together and allow them to share main and cache memory and therefore present a single memory space for applications and a single operating system to play in.
With the right kind of traffic cop and virtualization support in the x64 chip and in standard operating systems, it would be possible to create a scale-up server (or, as it turns out, many of them) from a scale-out rack of the workhorse two-socket boxes that populate most of the data centers of the world.
So 3Leaf got to work on project "Aqua," a special ASIC, coming to market under the brand name Voyager, that extends the HyperTransport bus of the Opteron processors from Advanced Micro Devices out to the network switches and virtualizes the memory and cores inside the Opteron chip.
3Leaf was one of the first licensees of the HyperTransport interconnect, and is now a licensee of the QuickPath Interconnect alternative from Intel (but at the 1.1 level that will be used for the future "Sandy Bridge" family of Xeon processors). And because 3Leaf wanted to comfort its investors and customers, it hired storage chip giant LSI to manage the "back-end" part of the Voyager chip design on its behalf, working with foundry partner Taiwan Semiconductor Manufacturing Corp to ensure that the 40 nanometer processes all work well.
To date, 3Leaf has raised $67.5m in venture backing, with $500,000 in initial angel funding, plus a $12m first round with Storm Ventures and Alloy Ventures kicking in the dough; Intel led the $20m second round and LSI led the $35m third round.
One reason Intel and LSI are doing this is that the server space needs a better way to glue two-socket servers together and virtualize them. While server virtualization hypervisors are good for dicing and slicing servers, they cannot mash up small servers to create larger ones, as Voyager can do. (ScaleMP has a similar technology, all implemented in software, and Virtual Iron was able to use InfiniBand to glue together 16 servers into one image with a hypervisor, but then dumped this idea when it shifted to the Xen hypervisor. We see how well that worked out, now that Oracle ate its carcass.)
The idea is not that much different from the NUMA extensions to SMP that are prevalent in big iron servers today, except that you are using a two-socket server with 10 Gigabit Ethernet or InfiniBand as the server backplane connecting nodes together. And you do not need to use the Remote Direct Memory Access (RDMA) protocol to lash servers together, which other approaches have tried to do.
As far as the ACPI BIOS and operating system are concerned, when the Voyager ASIC groups together a pool of cores, memory, and I/O, it looks and feels just like an SMP server with those elements. The secret, apparently, is the nested page table memory management features that AMD and Intel put into their chips to support virtualized CPUs; the Voyager ASIC makes use of these to create SMPs in hardware partitions on the fly over the HyperTransport.
Here is the important idea: programming for a shared memory system is a lot easier than programming for a more loosely clustered programming environment. You let the operating system, hypervisor, and chipset create a single address space and just program as you normally would on an SMP box.
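To make that concrete, here is a minimal sketch of the shared-memory style of programming, with Python threads standing in for cores that all see one address space. It is an illustration of the programming model only, not 3Leaf code, and the function name `parallel_sum` is made up for the example.

```python
import threading

def parallel_sum(values, workers=4):
    """Sum `values` with several threads that share one address space."""
    partials = [0] * workers                      # shared list, one slot per worker
    chunk = (len(values) + workers - 1) // workers

    def work(i):
        start = i * chunk
        stop = min(start + chunk, len(values))
        # Each worker indexes straight into the shared list; there are no
        # explicit messages between workers, only ordinary reads and writes.
        partials[i] = sum(values[j] for j in range(start, stop))

    threads = [threading.Thread(target=work, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)
```

On a loosely coupled cluster, the same job would need the data partitioned by hand and the partial results sent back over the network with explicit messages, which is the extra work a single address space spares the programmer.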
The initial Voyager system is technically known as the Dynamic Data Center Server, or DDC-Server, for short, and the Voyager ASIC has been rebranded the DDC-ASIC. The server nodes have three Opteron sockets: two for the Opteron processors and one for the Voyager ASIC. The first DDC-Server system boards are being made by motherboard and white box server maker Super Micro, which is about as safe a starting point as any upstart breaking into the server racket could ask for. But Quinn says that 3Leaf is really hoping that the tier one server makers, as well as storage vendors, will license the DDC technology to create their own variants of the Voyager design.
The DDC-ASIC is currently limited to Opteron processors in the quad-core "Shanghai" and six-core "Istanbul" families, and the chip will also support the 12-core "Magny-Cours" processors coming out from AMD in the first quarter of 2010. This first DDC-ASIC spans up to 16 two-socket Opteron server nodes, and can create an image with 192 cores and 1 TB of main memory (using 4 GB DDR2 memory modules). This, says Quinn, covers 99 per cent of the workloads out there on big iron today.
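The arithmetic behind those maximums is easy to check. The sketch below assumes six-core Istanbul chips in every socket, 1 TB counted as 1,024 GB, and memory spread evenly across the nodes; the modules-per-node figure is derived from those assumptions rather than quoted by 3Leaf.

```python
NODES = 16
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 6        # six-core Istanbul
TOTAL_MEMORY_GB = 1024      # 1 TB, assuming binary units
MODULE_GB = 4               # 4 GB DDR2 modules

def total_cores(nodes, sockets, cores):
    """Cores in the largest single image."""
    return nodes * sockets * cores

def modules_per_node(total_gb, nodes, module_gb):
    """Memory modules per node, assuming an even spread (our assumption)."""
    return (total_gb // nodes) // module_gb

cores = total_cores(NODES, SOCKETS_PER_NODE, CORES_PER_SOCKET)
modules = modules_per_node(TOTAL_MEMORY_GB, NODES, MODULE_GB)
print(cores, modules)       # 192 cores, 16 modules per node
```

So the 192-core figure corresponds to a full cluster of Istanbul parts, and hitting 1 TB works out to 64 GB in each node.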
3Leaf is also working on a future version of the Voyager ASIC that will span up to 32 server nodes and up to 64 TB of shared memory. This ASIC will plug into system boards that use Intel's Sandy Bridge Xeon processors and their QPI 1.1 interconnect. 3Leaf is keeping mum about whether or not this ASIC will be tweaked to support future Opterons and their improved HyperTransport.
The DDC software for creating virtual SMPs includes a hypervisor of sorts, but this one is really for creating hardware partitions, not for dicing and slicing individual cores into tiny virtual machines. And while it is technically possible to run x64 hypervisors that do such dicing and slicing atop the DDC-Server and its software stack, Quinn says that 3Leaf is not supporting this. "We wouldn't encourage anyone doing this," Quinn says with a laugh.
The management software stack, called DDC-Software appropriately enough, can allocate pools of CPUs and memory fluidly across the nodes in the cluster. So, for instance, you could, if you were a wiseguy, allocate all 1 TB of main memory in a cluster to a single processor core, and it would work. The DDC-Software stack has three capabilities: pooling, sharing, and flexing.
The DDC-Pool function allows for the x64 servers to be coalesced into SMPs at the server node level, but only allows for OSes to be configured statically; if you want to change the number of cores or amount of memory allocated to a virtual SMP, you have to reboot the servers and restart the OS. DDC-Share has finer granularity, and allows for CPUs to be allocated down to the core level and memory to be allocated in units of 64 MB.
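As a rough illustration of what 64 MB granularity means for a memory request, consider the sketch below. The function names are hypothetical and the round-up behavior is an assumption; 3Leaf has not said how DDC-Share treats requests that are not a multiple of 64 MB.

```python
UNIT_MB = 64    # allocation unit quoted for DDC-Share

def memory_units(request_mb):
    """Number of 64 MB units needed to cover a request (rounding up is assumed)."""
    if request_mb <= 0:
        raise ValueError("request must be positive")
    return -(-request_mb // UNIT_MB)    # ceiling division

def granted_mb(request_mb):
    """Memory actually handed over under the round-up assumption."""
    return memory_units(request_mb) * UNIT_MB
```

Under that assumption a 100 MB request would come back as 128 MB, and the wiseguy's full 1 TB works out to 16,384 units.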
In 2010, DDC-Flex will come to market, allowing for the reconfiguration and reallocation of cores, memory, and I/O on the fly, without a reboot. DDC-Flex will also allow for clusters and their SMP images to make use of resources that are hot-added to the system.
At the moment, 3Leaf has two configurations of server clusters it is selling. A base DDC-Server has eight server nodes, with a total of 256 GB of memory, 96 Opteron cores (using the Istanbul chips) running at 2.4 GHz, 4 TB of disk capacity, an InfiniBand switch, the DDC-Pool software, and a Linux license for all the nodes; it sells for $99,000, which is a fraction of the cost of a big RISC/Unix server. (Yeah, that's the point.)
A larger configuration with the full 16 nodes, which has 192 cores running at 2.8 GHz, 1 TB of memory, 8 TB of storage, a suitable InfiniBand switch, DDC-Pool, and Linux, costs $250,000. Both of these machines will ship in December. Once 3Leaf works out its build-to-order operations with Super Micro, customers will be able to pick any Shanghai or Istanbul processors they want.
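For a back-of-the-envelope comparison of the two bundles, the sketch below divides the list prices by core count and by memory. The figures come from the configurations above, with 1 TB taken as 1,024 GB; the dictionary layout and function names are ours. The bundles also differ in clock speed, disk, and switch, so treat the ratios as rough.

```python
CONFIGS = {
    "base":  {"price": 99_000,  "cores": 96,  "memory_gb": 256},
    "large": {"price": 250_000, "cores": 192, "memory_gb": 1024},
}

def price_per_core(cfg):
    """List price divided by core count, in dollars."""
    return cfg["price"] / cfg["cores"]

def price_per_gb(cfg):
    """List price divided by main memory, in dollars per GB."""
    return cfg["price"] / cfg["memory_gb"]

for name, cfg in CONFIGS.items():
    print(f"{name}: ${price_per_core(cfg):,.2f} per core, "
          f"${price_per_gb(cfg):,.2f} per GB")
```

The base box comes out at about $1,031 per core and $387 per GB, while the big one is about $1,302 per core but only $244 per GB, which reflects the faster chips and the much larger memory in the 16-node machine.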
The 3Leaf DDC-Servers support Red Hat Enterprise Linux 5.2 and SUSE Linux Enterprise Server 10 and 11; Windows Server 2008 is slated to be supported atop the iron next year. Pricing for the DDC-Share feature was not available at press time.
In terms of storage, customers can use the local storage inside the server nodes or link out to Fibre Channel or iSCSI storage area networks. All of the storage links are virtualized and can be reconfigured from the DDC-Software stack, just like CPUs and memory. ®