Oracle plops true live migration onto SPARC hypervisor
From sleepy and warm to fully alive
Oracle has brought true live migration of workloads to platforms using its SPARC T series of processors by tweaking the Oracle VM Server for SPARC hypervisor, formerly known as Logical Domains – LDoms, for short.
With Oracle VM Server for SPARC 2.1, a system can be carved up into as many as 128 LDoms. These LDoms can span many threads and multiple sockets, of course, as long as the machine – from Oracle or Fujitsu, which resells Oracle's entry and midrange T series servers – has multiple sockets.
The SPARC T3 chip and four servers using it launched last September. The SPARC T3 chip has 16 cores, with eight threads per core, and comes in machines with one, two, or four sockets in a single system image with shared main memory. That's a total of 512 threads in a four-socket box. However, with the 2.1 release of the VM Server for SPARC hypervisor, the maximum of 128 LDoms is smaller than the aggregate number of threads by a factor of four on a four-socket SPARC T3-4 machine.
This is not just an Oracle issue. IBM has a similar limitation with its PowerVM hypervisor for its Power Systems machines using Power7 processors. The Power7 chips top out at eight cores, with four threads per core; the top-end Power 795 server has 32 sockets, for a total of 256 cores and 1,024 threads. But although PowerVM can create a partition in as little as 1/10th of a CPU core and can create a single partition that spans all cores, any Power Systems machine (regardless of processor vintage) can only have a maximum of 254 partitions.
It would seem like both Oracle and IBM have some work to do on their respective VM Server for SPARC and PowerVM hypervisors.
Back in July 2008, when it was still an independent company, Sun Microsystems added a "sort-of" live-migration capability to the LDom hypervisor, as it was then called. Although it was called live migration in LDom 1.2, it might better have been called "sleep migration" (Oracle calls it "warm migration"), because a machine had to be quiesced before it could be moved from one physical server to another.
With the LDom 1.3 hypervisor announced in February 2010, Sun/Oracle added dynamic resource management for virtual CPUs on the SPARC T systems, allowing admins to set low- and high-capacity thresholds, and letting LDoms automagically steal spare capacity from each other to speed up their work. The hypervisor also got memory compression for LDoms to speed up this sleepy migration.
As Jeff Savit, principal engineer at Oracle responsible for the SPARC hypervisor, put it in his blog, the LDom hypervisor is interesting because it gives each virtual machine threads that it owns – it does not timeslice a CPU, as other hypervisors do, and therefore LDoms do not incur high overheads, because caches and translation lookaside buffers don't have to be flushed every couple of milliseconds. Nor do LDoms need to worry about running instructions in privileged mode, and how doing so affects the state of the machine.
LDoms are among the most clever things that Sun cooked up in the past decade, and it's a pity that they're not supported in the high-end SPARC64-VII+ processors used in the SPARC Enterprise M midrange and high-end servers – which have hardware partitions that are not particularly virtual, and barely dynamic, by comparison.
We realize that you need hardware features to support LDoms – our point is that those features should be added to SPARC64 chips.
The true live-migration support in Oracle VM Server for SPARC 2.1 is significant, and it is no doubt something that Oracle SPARC shops have been clamoring for to bring them to parity with x64, other RISC, and Itanium hypervisors. Oracle's VirtualBox hypervisor, for example, already has it.
An LDom that is getting prepped for live migration has its complete state snapshotted – but instead of stopping the LDom, VM Server for SPARC keeps it running. That snapshot of the LDom's state is then teleported over to another machine running another copy of the hypervisor, while the system keeps taking snapshots of the changed state of the still-running partition and transmits successively smaller pieces of state info over to the new machine and its hypervisor.
Eventually there is so little new state information in the original domain since the prior snapshot (each successive chunk gets smaller and smaller) that the last chunk of LDom state is transmitted to the target machine's hypervisor, and the new VM is fired up, running the application in the final state it was in. It's a bit like running a relay race: you start running ahead of your teammate before he hands you the baton, and then you take off.
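The copy-while-running, shrink-the-delta loop described above is the classic iterative pre-copy scheme. Here's a toy, self-contained Python sketch of the general technique – the `Domain` and `Target` classes and all of the names in it are invented for illustration and have nothing to do with Oracle's actual implementation:

```python
# Toy sketch of iterative pre-copy live migration. A "page" here is just a
# dict key; real hypervisors track dirtied guest memory pages in hardware.

class Domain:
    """Toy source domain: memory is a dict of page -> contents."""
    def __init__(self, memory):
        self.memory = dict(memory)
        self.dirty = set(memory)      # round 0: every page is still unsent
        self.paused = False

    def write(self, page, value):
        """The guest keeps running during migration, dirtying pages."""
        assert not self.paused
        self.memory[page] = value
        self.dirty.add(page)

    def take_dirty(self):
        """Snapshot the current delta and reset the dirty set."""
        pages, self.dirty = self.dirty, set()
        return {p: self.memory[p] for p in pages}

class Target:
    """Toy destination hypervisor: accumulates received pages."""
    def __init__(self):
        self.memory = {}
    def receive(self, pages):
        self.memory.update(pages)

def precopy_migrate(src, dst, threshold=1, max_rounds=10, workload=None):
    """Copy the running domain in successively smaller rounds, then pause
    briefly to ship the final (small) delta - the short stop-and-copy phase."""
    rounds = 0
    while len(src.dirty) > threshold and rounds < max_rounds:
        dst.receive(src.take_dirty())         # copy current delta while src runs
        if workload:
            workload(src, rounds)             # simulate the guest still working
        rounds += 1
    src.paused = True                         # brief pause: last delta only
    dst.receive(src.take_dirty())
    return rounds
```

The `threshold` and `max_rounds` knobs capture the real trade-off: iterate until the remaining delta is small enough that the final pause is imperceptible, but cap the rounds so a write-heavy guest can't keep the migration running forever.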
With VM Server for SPARC 2.1, the dynamic resource management feature that debuted in LDom 1.3 has been tweaked to allow a domain with a higher priority to not only get dibs on any spare capacity in the machine, but also to take resources away from lower-priority LDoms on a machine. The cryptographic processors embedded on SPARC T series chips can now be linked with virtual machines that use them and be dynamically reconfigured and live migrated as a single unit instead of individually.
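The priority policy described above – grab free capacity first, then raid lower-priority domains – can be sketched in a few lines. This is a hypothetical illustration of the policy, not the ldm command-line interface or Oracle's allocator; every name in it is invented:

```python
# Toy sketch of priority-based resource reallocation: a requesting domain
# takes free vCPUs first, then steals from strictly lower-priority domains,
# lowest priority first, leaving each victim at least one vCPU.

def grant_cpus(domains, requester, wanted, free_cpus):
    """domains: {name: {"prio": int, "vcpus": int}}.
    Returns (vcpus granted, free vcpus remaining)."""
    granted = min(wanted, free_cpus)          # spare capacity comes first
    free_cpus -= granted
    still_needed = wanted - granted

    victims = sorted(
        (n for n, d in domains.items()
         if n != requester and d["prio"] < domains[requester]["prio"]),
        key=lambda n: domains[n]["prio"])     # raid lowest priority first

    for victim in victims:
        if still_needed == 0:
            break
        take = min(still_needed, domains[victim]["vcpus"] - 1)
        domains[victim]["vcpus"] -= take
        granted += take
        still_needed -= take

    domains[requester]["vcpus"] += granted
    return granted, free_cpus
```

Leaving each victim one vCPU is an assumption made here so a low-priority domain is starved rather than stopped; the point is simply that, unlike the 1.3-era policy, priority now governs takings from running domains, not just dibs on idle capacity.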
Perhaps more importantly, the cryptographic units can be used to encrypt the data comprising an LDom, send it over the network to a different machine, then decrypt it on the other side of the wire as the LDom is live-migrated.
The physical-to-virtual (P2V) conversion tool included with the hypervisor – which sucks applications running on Solaris 8, 9, or 10 off physical machines and plunks them into LDoms or Solaris containers – has also been improved to make that conversion quicker and easier.
Solaris 11 Express, which is the tech preview of the forthcoming Solaris 11 release due sometime this year, has some network tweaks that allow for virtual network devices to use shared memory to exchange packets, thereby lowering the overhead for virtual networks on SPARC T servers.
Finally, the LDom 2.1 hypervisor exposes DTrace points to the outside world, so you can use DTrace to see what the heck is going on inside of virtual machines.
You can read the full release notes for Oracle VM Server 2.1 to get all the nitty-gritty details on what's new in the hypervisor, which is supported on machines using SPARC T2, T2+, and T3 processors – if you have the original SPARC T1 processors, you're outta luck. Oracle VM Server for SPARC 2.1 is preloaded on new SPARC T series machines and is available for download to existing customers. ®
Sun/Oracle have been playing catchup on the LPAR capability from IBM for ages; this almost brings them to parity on features and certainly up to the stage where the feature set covers most of what you need. Live migration is one of those things people expect their virtualisation layer to do; Containers can't do it yet and I've not heard of it even being on the roadmap; the technical difficulties of such a migration are probably sufficiently complex as to be impossible...
Now, if they can provide LDOMs on something with decent single thread performance, it'll really give them a boost. If T4 can deliver that, it'll give Oracle a boost in the hardware arena.
re: Finally is about 2 years away
I agree that a feature like this takes time to test, but from my understanding no one really uses this feature on IBM HW, as all of requirements to make it work properly are next to impossible to meet. Correct me if I am wrong... this is what I hear from friends that use Power systems (I do not).
Regarding the cache comment on the number of partitions... bull. Neither Sun nor Oracle makes any such recommendation based on cache size. SPARC does not need as much cache as Power, since it is CMT and can get work done even while a thread is waiting on a memory load. Power will stall all threads on a cache miss, which will happen regardless of how much cache you have. When you slice up a CPU, as IBM does, you WILL have a lot of cache misses, as each partition will be doing different things.
Max number of partitions on IBM Power Systems
PowerVM was enhanced in April 2011 to support up to 320 partitions on the Power 750, 640 partitions on the Power 770 and 780 and up to 1000 partitions on the Power 795.