Server virtualisation: How to pick the right model
Which is the fairest of them all?
Virtualisation has become an over-used buzzword.
On mainframes, it has been around for ages. Its introduction to x86 took a concept formerly reserved for Big Tech and let it loose among the masses.
Once a straightforward technology with a limited number of implementation models, virtualisation has been bootstrapped and shoehorned into every crevice of IT imaginable. Even smartphones are getting the treatment.
New capabilities do nothing for refuseniks who eschew the use of virtualisation. Some feel the need to evangelise this choice, while others loudly proclaim the "one true way" to use the technologies involved.
Direct-attached virtualisation versus distributed models is a common ideological battleground.
Direct-attached virtualisation is simple. A server with local storage hosts several virtual machines. These use the virtual switch (vSwitch) provided by the hypervisor to communicate with each other without having to send packets across the network interface card (NIC) and thus out to the rest of the network.
Talk among yourselves
Typically, virtual machines hosted in a direct-attached scenario can communicate with servers and clients outside the host system, but most of their traffic stays among the virtual machines residing on that one system.
Distributed virtualisation is very different. The host server is treated as much as possible as an entirely disposable processing unit. Storage is centralised and delivered to multiple hosts over a storage area network (SAN) with communication between virtual machines offloaded to physical switches.
Each model has its quirks.
Direct-attached virtualisation is fast. The maximum theoretical speed at which a 10Gb NIC (a standard interface for modern SANs) can deliver data is 1280 megabytes per second (MBps). A fairly common eight-lane PCIe 2.0 RAID card can theoretically provide up to 4000MBps.
Real-world numbers are not so clean. I’ve only ever got a 10Gb network-attached storage system up to 900MBps, and the best I’ve wrung out of my RAID cards (SSD RAID 10) is 2200MBps. But 2200MBps beats the pants off 900MBps, and handily demonstrates the storage speed advantage that the direct-attached model can deliver.
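The arithmetic behind those figures can be sketched in a few lines; the theoretical numbers follow from the link and bus specs, and the measured numbers are the ones quoted above.

```python
# Back-of-the-envelope bandwidth comparison: 10Gb Ethernet versus an
# eight-lane PCIe 2.0 RAID card, theoretical against measured.

def gbit_to_mbyte_per_s(gbit: float) -> float:
    """Convert a link rate in gigabits/s to megabytes/s (1Gb = 128MB)."""
    return gbit * 1024 / 8

# PCIe 2.0 carries roughly 500 MBps of payload per lane after 8b/10b encoding.
PCIE2_LANE_MBPS = 500

nic_theoretical = gbit_to_mbyte_per_s(10)   # 1280 MBps
raid_theoretical = 8 * PCIE2_LANE_MBPS      # 4000 MBps

# Measured numbers from the article.
nic_measured, raid_measured = 900, 2200

print(f"10Gb NIC:  {nic_theoretical:.0f} MBps theoretical, {nic_measured} measured")
print(f"RAID card: {raid_theoretical} MBps theoretical, {raid_measured} measured")
print(f"Direct-attached advantage: {raid_measured / nic_measured:.1f}x")
```

Even after real-world losses, the local RAID card holds a better than twofold lead over the network path.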
Networking tells a similar tale. A hypervisor’s vSwitch provides each virtual machine with a virtual 10Gb NIC. This allows all the virtual machines located on a single host to chat among themselves at 10Gb, or faster if you feel like attaching multiple virtual NICs to a given virtual machine.
When heading off-host to the rest of the network, these virtual machines need to fight for the limited bandwidth provided by the hardware available. Having 30 virtual machines talking merrily away at 10Gb each is a completely different experience from asking those same 30 virtual machines to squeeze through a single 10Gb network card – and back again – to have networking processed by a physical switch.
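The contention described above reduces to simple division: 30 virtual NICs that each see 10Gb on-host must share one physical 10Gb uplink off-host.

```python
# Rough contention arithmetic for the scenario above: 30 VMs, each with
# a 10Gb virtual NIC, sharing a single physical 10Gb uplink.

VMS = 30
VSWITCH_GBIT = 10   # each VM's virtual NIC; on-host traffic never leaves RAM
UPLINK_GBIT = 10    # the one physical NIC out to the real switch

on_host_per_vm = VSWITCH_GBIT          # all 30 can burst at full rate on-host
off_host_per_vm = UPLINK_GBIT / VMS    # fair share of the uplink off-host

print(f"On-host:  {on_host_per_vm} Gb per VM")
print(f"Off-host: {off_host_per_vm:.2f} Gb per VM if all talk at once")
```

A third of a gigabit each, before you even account for the return trip through the same card.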
Were we to consider only the numbers presented so far, distributed virtualisation would seem insane. But it has its advantages, and for many they are worth the cost.
What direct-attached virtualisation can’t do is rapidly move a virtual machine from one host to another. Virtual machines can be quite large, and moving the entire thing across a network can take a long time.
This is not an issue with distributed virtualisation’s centralised storage model. Distributed virtualisation also allows for live migration of running virtual machines between hosts.
High availability is another key selling point for distributed virtualisation.
Direct-attached virtualisation relies on robust, fault-tolerant virtual hosts for high availability. Distributed virtualisation senses when a host has failed and restarts all its virtual machines on other hosts in the cluster. The more hosts you have in play, the more the distributed model makes sense.
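The failover behaviour described above can be sketched in a few lines. The hostnames, VM names and least-loaded placement rule below are illustrative, not any particular hypervisor's API.

```python
# Minimal sketch of distributed-virtualisation HA: the cluster notices a
# dead host and restarts its VMs across the survivors. All names are
# hypothetical; a real hypervisor would boot each VM from shared SAN storage.

def failover(cluster: dict[str, list[str]], dead_host: str) -> dict[str, list[str]]:
    """Redistribute the dead host's VMs across surviving hosts, least-loaded first."""
    orphans = cluster.pop(dead_host)
    for vm in orphans:
        target = min(cluster, key=lambda h: len(cluster[h]))  # pick emptiest host
        cluster[target].append(vm)
    return cluster

cluster = {"host1": ["web1", "web2"], "host2": ["db1"], "host3": ["app1"]}
print(failover(cluster, "host1"))
```

Because the VM images live on the SAN rather than on the dead host, "restart elsewhere" is a scheduling decision, not a data copy; that is the whole advantage.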
I can see you
Another benefit is that despite the speed bottlenecks, forcing all traffic through a physical switch gives network administrators visibility and manageability.
Enterprise-class networks run networking gear with tools providing end-to-end management straight down to the very last port. They can offer encryption between links, traffic isolation, monitoring, quality of service and a bingo card of other tick-box features.
All of that goes away the instant a vSwitch is brought into play. vSwitches don’t speak the same management language as the physical network kit. Instead of being able to control every packet to every system on the network, the closest you can get when using a vSwitch is control to and from the host servers.
Until recently, these two models were all we had. You picked the features that were more important to you and lived with your choices. This is unsatisfactory, and in the grand IT tradition of nothing ever remaining sacred for long, hybrid virtualisation models have started to appear.
A new generation of NICs is starting to blur the lines, employing leading-edge standards such as 802.1Qbg, also known as Edge Virtual Bridging or Virtual Ethernet Port Aggregator (VEPA).
VEPA NICs are switches in their own right. When in use, virtual machines on a host bypass the vSwitch and talk directly to the switch integrated into the NIC. The NIC can talk to the management software, and now we have all the advantages of distributed networking without the bottleneck caused by having to send all virtual machine traffic out to the physical switch.
The competing approach to VEPA is 802.1Qbh, also known as Bridge Port Extension or VN-Tag. It is backed almost exclusively by Cisco, and requires an extension to the Ethernet specification, thus lots of new hardware.
This is a stark contrast to VEPA, which doesn't require you to rip up and replace your network estate, and yet provides a viable solution to end-port management issues in virtual environments.
Configurations making use of both direct-attached storage and distributed storage in a single host are also beginning to appear. I have recently finished a deployment in which all hosts have a large amount of local storage to facilitate backups.
Each host has a virtual backup appliance (VBA) that takes live image-based backups of the virtual machines assigned to that host and stores them on the local buffer drive. This makes for very fast backups.
A central VBA reads the backups from all hosts and writes them out to tape during the day. The tape drive is mapped directly through from the host to the VBA rather than being a network-attached device.
This hybrid approach was not found in a whitepaper but born out of the necessity to make the best use of existing equipment. It has worked so well that, with refinements, I will re-use it in future deployments.
The continual introduction of new technologies into the mix will ensure that no virtualisation model stays static for long. IOMMU is the latest and greatest, promising to allow individual virtual machines direct access to system devices such as graphics cards.
Virtual machines will be able to tap into the full power of GPGPU computing, and will need to be fed data far faster than distributed technologies such as Fibre Channel can provide.
Advances in fault-tolerant hardware promise to make the individual host more reliable while new networking technologies push to 40Gb, 100Gb and beyond.
We have come full circle. Virtualisation started on the mainframe, and virtualisation is driving x86 to adopt technologies that bring it closer to behaving like a mainframe.
Regardless of the similarities, there remains a fundamental difference between a mainframe and a cluster of x86 virtual hosts.
The mainframe is designed to be a single entity. Rack after rack, node after node, everything from the operating system to the interconnects binding individual nodes together, treats the mainframe as a single gigantic computer that is then sliced up for individual tasks.
x86 virtualisation, on the other hand, is a kludge
An x86 virtual cluster is very different. Whether direct-attached, distributed or hybrid, each processing node is very much a distinct unit. Each node matters: each must be configured, licensed and designed separately, as well as with consideration for the whole.
A mainframe is an expensive computer that you custom-design software for: a high-performance system worth high-quality development. The x86 virtual cluster is a collection of cheap systems that you wrap around existing software.
A mainframe is what you build when you are running a financial system where milliseconds of latency can mean millions of dollars. It shines when you feed it applications that can break work down into small chunks and run lots of small tasks in parallel.
x86 virtualisation, on the other hand, is a kludge.
It is our way of compensating for the fact that we are dragging around decades’ worth of software that is remarkably single-threaded, not very environmentally aware, and which needs to be insulated from other programs running on the same system.
x86 virtualisation models will continue to evolve because of this need to accommodate the sheer diversity of x86 applications.
There are many options available today to accomplish large amounts of computing efficiently. You can buy a mainframe or maintain a fleet of x86 systems with applications installed on the bare metal.
You can venture into x86 virtualisation and explore all the myriad different possibilities it presents. You could even lash together a few thousand cell phones into an incredibly awkward Beowulf cluster if you so chose.
There is no "one true way" to get the job done. The needs of your software and the capabilities of the hardware available to you will determine the implementation paths you can choose. ®
Trevor Pott is a sysadmin from Edmonton, Canada.
I don't think direct-attached virtualization requires high-end hardware for HA.
What you need is fault-tolerant software. Something that scales out rather than up is ideal for direct-attached virtualization: if a physical server with 20 VMs dies, you don't care.
A couple of companies ago a decent portion of our web servers were deployed this way. Some of the core web apps could not scale beyond one physical CPU. Rather than try to rewrite the code, it was simpler to put a hypervisor on a dual-proc quad-core box and run eight copies of the app (with four servers per site, or roughly 32 copies of the app per active-active site; multiple sites for reduced customer latency). Another app was higher performance and could leverage the underlying hardware to its full extent, so that web application ran on bare metal. The cost of a good shared storage array was going to outweigh the cost of all the rest of the equipment at each site, so it didn't make a lot of sense (or cents) from a cost standpoint, as much as I would have jumped at the opportunity to have such a system from an ease-of-use standpoint.
You forgot to mention the hybrid approach, à la the vSphere 5 VSA or things like the HP LeftHand VSA, which turn direct-attached storage into fault-tolerant network storage. Good for SMBs with low I/O needs; you can get the best of both worlds.
I don't think your storage numbers are realistic either. Measuring storage by throughput is misleading when virtualization workloads are for a large part random I/O, rendering throughput numbers meaningless: it's all about IOPS, and the network is not the bottleneck. Sure, there are some workloads that are throughput-bound, but in the vast majority of cases you're going to run out of IOPS long, long before you even get close to running out of bandwidth (consider: you peak at maybe 4-8MB/second per disk at 15k RPM if you're lucky). Latency is just as important as IOPS, too. If you're truly throughput-bound then you may be better off running on bare metal. Hypervisors don't make sense for everything.
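The commenter's point is easy to put in numbers. Assuming a figure of roughly 175 IOPS per 15k RPM spindle (a typical value, not one from the comment) and the small random I/O sizes the comment mentions:

```python
# Random-I/O workloads exhaust spindle IOPS long before they saturate a
# 10Gb link. Spindle count and per-disk IOPS below are illustrative.

SPINDLES = 24              # a reasonably beefy array
IOPS_PER_15K_DISK = 175    # typical for a 15k RPM spindle
IO_SIZE_KB = 8             # small random I/O

array_iops = SPINDLES * IOPS_PER_15K_DISK
throughput_mbps = array_iops * IO_SIZE_KB / 1024   # MBps at that I/O size
link_mbps = 1280                                   # 10Gb link, from the article

print(f"{array_iops} IOPS -> {throughput_mbps:.0f} MBps of random I/O")
print(f"That fills {throughput_mbps / link_mbps:.1%} of a 10Gb link")
```

A 24-spindle array saturates its IOPS while using only a few per cent of the link, which is exactly why throughput comparisons can mislead for virtualised workloads.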
Depending on the organization, you can design your network up front so that once your hypervisors are deployed and your virtual switches are configured, you rarely have to touch them again. For my last VM deployments I can count the number of times I had to configure a vSwitch after installing the hypervisors on one hand (~60 VM hosts, several different clusters). My new VM infrastructure, which is going in early next month, is planned so that I don't expect to have to touch the vSwitches for the lifetime of the product - at least 2-3 years. Not that it's a big deal if I have to, but if I don't need to then so much the better. Anything can happen, but my experience tells me vSwitch configuration changes are few and far between.
Memory (capacity, not performance) is the driving force behind virtualization, which is why VMware added the vTAX in v5. Memory availability is just as important in these big boxes, making technologies such as HP Advanced ECC and IBM Chipkill absolutely critical to any VM deployment. ECC by itself is not enough, and has not been for years.
If mainframes were so good, then why is IBM using KVM + Red Hat for its developer cloud instead of running Linux on mainframes? (IBM used to advertise a lot about leveraging the multi-tenant abilities of mainframes; it doesn't seem to advertise that nearly as much any more. I haven't seen such an ad in years.) IBM, after all, unlike anyone else, has got to have the lowest cost of operating its very own gear, and I'm sure licensing its own software comes at no cost as well. I wrote about this a while back; the IBM developer cloud was focused around Java apps, so from a technical perspective it wouldn't matter what platform they ran it on.
What do people use when milliseconds can cost millions of dollars? More often than not these days it seems to be overclocked x86-64 systems (El Reg has many articles on this). Mainframes are what you use when you can't tolerate downtime.
One of the better-kept open secrets of open-source virtualization is XCP, and its new sibling Project Kronos (a full port available for Debian/Ubuntu via apt-get). Both are essentially FOSS versions of the $pendy Citrix XenServer (talking Enterprise/Platinum editions, not the freebie base edition).
One of the cooler new features is a hybrid storage model, enabling a pool of servers to access shared storage but have each host automatically replicate the virtual disks to local storage as they are accessed. The net result is local-disk performance after the initial read from the remote SR (storage repository).
Doubly cool if the local storage is SSD. :)
On the cheap
We're in the SMB bracket, and use a KVM cluster (using the Proxmox VE distribution).
Direct-attached storage is used with two-way disk replication (DRBD). So with a small number of VM hosts it is possible to get the speed advantage of direct-attached disk and still have live migration.
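A two-node DRBD resource of the kind described here looks roughly like this. Hostnames, device paths and addresses are placeholders, not details from the post (DRBD 8.x syntax):

```
# Hypothetical two-node DRBD resource backing local VM storage.
resource vmstore {
    protocol C;                       # synchronous replication
    on nodea {
        device    /dev/drbd0;
        disk      /dev/sdb1;          # local direct-attached volume
        address   192.168.1.1:7788;
        meta-disk internal;
    }
    on nodeb {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   192.168.1.2:7788;
        meta-disk internal;
    }
}
```

With protocol C, a write is acknowledged only once it hits both nodes' disks, which is what makes live migration between the pair safe.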
What we aren't able to do is have a pool of VM hosts that are effectively interchangeable.