Virtual management? It's complicated
Load balancing VMs by hand
Sysadmin Blog: Virtualisation is complicated. I am not talking here about implementation, or even the concepts or technologies underpinning virtualisation. I am talking about the realities of managing and maintaining a virtualised infrastructure.
My particular quest of late has been one of decreasing power utilisation, a difficult task in an almost entirely virtualised environment. Virtualisation is mature enough now to have many different vendors offering many tiers of products to help you meet your goals. All these management tools and virtualisation platforms exist to solve the same basic set of problems. Strip away the particular approaches taken by specific vendors, and we can take the time to look at solving the problems themselves.
Once you know how to solve the problem the hard way, gauging the effectiveness of tools designed to take the grunt work off your hands will probably be easier. From a power management standpoint, critical workloads with a stable resource requirement are probably bad candidates for virtualisation. I can best deal with my power consumption requirements by tailoring a piece of hardware specifically to them. (Or more accurately, introducing a new class of server into my fleet designed for ultra-low-power use.) What then of dynamic workloads? (Services with bursty resource requirements, or which are very cyclical, such as only operational during the day?)
I think these are ideal candidates for virtualisation; many of these workloads with offset usage patterns would keep server utilisation high. This becomes a game of load balancing, and load balancing is the art of knowing your network. It takes some very careful consideration and planning to deal with virtualising dynamic workloads. It is easy to look at a CPU graph without understanding the impact the workload has on other system resources. CPU usage and disk I/O are the traditional banes of virtualisation, but both network and even RAM bandwidth can become constrained by the improper mix of virtual machines (VMs) on a host.
Sadly, there are no magic bullets to solve this problem. Unless you are planning on massively overspecifying the hardware you run your workloads on, then you need to do the legwork of workload profiling. Establish a baseline for the systems currently supplying that service to your network. Look at total resource consumption, and try to identify patterns that recur over time. In order to accomplish this, I actually find that virtualising the workload onto its own server for a while is a great help.
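To make that pattern-hunting concrete, here is a minimal sketch (with made-up sample data; your monitoring tool's export format will differ) of bucketing CPU samples by hour of day to expose a recurring daily cycle:

```python
from collections import defaultdict
from statistics import mean

def hourly_baseline(samples):
    """Average CPU utilisation per hour-of-day from (hour, cpu_pct)
    samples, exposing recurring daily patterns in a workload."""
    buckets = defaultdict(list)
    for hour, cpu_pct in samples:
        buckets[hour % 24].append(cpu_pct)
    return {hour: mean(vals) for hour, vals in buckets.items()}

# Hypothetical samples: busy during office hours, quiet overnight.
samples = [(9, 80), (9, 70), (14, 65), (2, 5), (3, 4)]
baseline = hourly_baseline(samples)
```

Feed it days or weeks of real samples and the shape of the workload's day falls straight out of the dictionary.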
Virtualisation tools often offer you some reasonably good reporting on the resource consumption of individual VMs. Run your VM on its own server for a while until you have it profiled, and the configurations tweaked for running in a virtual environment. After a few days or weeks of this, you should be safe to put it in the cage with the other hamsters and introduce it to a shared resource environment.
Now, to get really complicated, you can start trying to power off systems, or at least send them into a lower power state, when your network is not under load. This requires some really advanced load balancing in which you take into account not only resource division, but also aligning those resources in such a way that whole systems can be powered down for long periods. In this sense, VDI is an almost optimal workload.
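Aligning workloads so that whole systems can go dark is essentially a bin-packing problem. As a rough illustration, and emphatically not any vendor's actual placement algorithm, a first-fit-decreasing sketch looks like this:

```python
def consolidate(vm_loads, host_capacity):
    """First-fit decreasing: place each VM (largest load first) on the
    first host with room, opening a new host only when nothing fits.
    The fewer hosts used, the more hosts you can power down."""
    hosts = []  # running load placed on each powered-on host
    for load in sorted(vm_loads, reverse=True):
        for i, used in enumerate(hosts):
            if used + load <= host_capacity:
                hosts[i] += load
                break
        else:
            hosts.append(load)  # open (power up) another host
    return len(hosts)

# Hypothetical VM loads (arbitrary units) against hosts of capacity 100.
hosts_needed = consolidate([60, 40, 30, 30, 20, 20], 100)
```

Real placement has to juggle CPU, RAM, disk and network simultaneously, but the principle is the same: pack tightly so the spare hosts can sleep.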
If your workers are primarily the 9-5 type, then it should be possible to set some basic scripts up to help you in your quest. Remember that an operating system which “suspends” itself while virtualised actually tells the virtualisation software to suspend it. Suspended VMs have their RAM written to a file on disk, ready to be woken up later. (In this way, “suspending” a virtualised operating system actually becomes a form of “hibernating” it.) If you set your VDI guests up such that their power management will suspend the operating systems after a given period of inactivity after hours, then in theory there will come a point where all those user VMs are suspended.
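Those "basic scripts" largely boil down to a decision like the following sketch. The idle threshold and office hours here are purely illustrative, and the actual suspend call depends entirely on your virtualisation platform:

```python
from datetime import datetime, timedelta

def should_suspend(now, last_activity,
                   idle_threshold=timedelta(minutes=30),
                   office_close_hour=18):
    """Suspend a VDI guest only after hours, and only once it has been
    idle longer than the threshold. (Thresholds are illustrative.)"""
    after_hours = now.hour >= office_close_hour or now.hour < 6
    idle = (now - last_activity) >= idle_threshold
    return after_hours and idle

now = datetime(2011, 3, 1, 22, 0)  # 10pm
quiet = should_suspend(now, now - timedelta(hours=2))
busy = should_suspend(now, now - timedelta(minutes=5))
```

A cron job making this check and calling your platform's suspend operation on each qualifying guest gets you most of the way there.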
At this point you have a couple of choices. If you want optimal power savings you can opt to turn the server right off. A simple script can tell the virtualisation software: “Shut off the server if no VMs are active.” The BIOS can be programmed to wake the server at a given time, or a Wake-on-LAN (WOL) scheduler of some sort can poke the server a few minutes before staff start trickling in the next morning. If your VMs are configured to auto-start on boot, then when the system comes up, it will “resume” all your virtual machines, and you’ve just saved several hours of electricity consumption on that system.
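The WOL "poke" itself is nothing exotic: it is a UDP broadcast of a magic packet consisting of six 0xFF bytes followed by the target NIC's MAC address repeated 16 times. A minimal sketch:

```python
import socket

def magic_packet(mac):
    """Build a Wake-on-LAN magic packet: 6 bytes of 0xFF followed by
    the target MAC address repeated 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("MAC address must be 6 bytes")
    return b"\xff" * 6 + mac_bytes * 16

def wake(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the magic packet; the target NIC must have WOL
    enabled in its firmware for this to rouse the machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(magic_packet(mac), (broadcast, port))

pkt = magic_packet("00:11:22:33:44:55")  # example MAC, not a real host
```

Schedule `wake()` a few minutes before opening time and the server is booted and resuming VMs before the first coffee is poured.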
The other approach is that of relying on power management features within the hardware to give you power savings without turning the system off. Your CPUs could be configured to back down to as near “off” as possible when idle, and the disks could stop spinning. Advanced servers can even power down DIMMs that are not in use.
If you need to be able to bring the VMs on that server back up in less than the five minutes or so it takes a server to boot and resume its VMs, then this is the only way to eke out power savings. No modern business ever truly shuts down. Even after all the staff have gone home, some computer systems stay active: routers, email and web servers, some desktops or VDI instances left running because certain staff remote in from home at odd hours. We even leave dedicated systems up to monitor the activity of all the other systems and report problems. These always-on systems are certainly among those that would be considered “critical”; outages of more than a few minutes result in a red-faced pointy-haired boss pacing the IT offices demanding answers.
While there may be other systems which are “critical” to have operational during the day, these 24/7 workloads are special. In an ideal world, one where the resiliency of computer hardware didn’t suck, you would consolidate all your 24/7 workloads onto a single spiffy virtual server, and shut everything else down when it wasn’t needed. In the real world though, this is “too many eggs in one basket”, and generally a terrible idea.
So let’s say that you manage to spread these 24/7 services out across a couple of different physical servers, at least partly mitigating the risk. You’ve figured out what can be shut down at night, and done it. You still need to keep an eye on the physical servers running those 24/7 VMs, and be alerted when they fail. You then need to move the workloads over to backup servers (or spin up some of the sleeping ones) so you can restart the dropped VMs. This is where, unless your sysadmins are manning the consoles around the clock, management software becomes an absolute requirement. If you want to virtualise your 24/7 critical workloads, you need management software capable of treating all the virtual servers available to it as one gigantic cluster.
You need a good SAN, so that you can move workloads from one virtual server to the other in real time, and you need to spend the time to get to know your flavour of virtual server management software in depth. You should be able to tell your management software which are your absolutely critical VMs by assigning resource priorities. You can configure it to spin up critical VMs as soon as they fail, and even to power down unused nodes. None of these tools remove the need to do proper resource modelling on your VMs; nor to properly configure suspend and resume times in order to conserve power.
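As a toy illustration of what assigning resource priorities buys you (the VM names and priority values here are entirely hypothetical), a management layer might order the VMs dropped by a failed host for restart like this:

```python
def restart_order(vms):
    """Order dropped VMs for restart: highest priority first, ties
    broken by smaller memory footprint, so the most critical and
    cheapest-to-place VMs come back soonest."""
    return [name for name, prio, mem_gb in
            sorted(vms, key=lambda v: (-v[1], v[2]))]

# Hypothetical fleet: (name, priority, memory in GB)
vms = [("mail", 10, 8), ("web", 10, 4),
       ("reporting", 2, 16), ("monitor", 8, 2)]
order = restart_order(vms)
```

The real products layer admission control and capacity checks on top, but the core idea is exactly this: you rank, and the software places.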
In all honesty, if you aren’t trying to run critical workloads on your virtual infrastructure, you can probably do all the management, including power optimisation by hand, without having to purchase any of the expensive tools from any of the vendors. (I do just fine on ESXi, thank you very much.) What these management tools do is make it realistically possible to virtualise 24/7 critical services if you so choose. That will be the focus of my next article: when talking about critical workloads (especially 24/7 ones) is virtualisation really worth it? ®