Fault tolerance in virtualised environments
Doesn’t get much more exciting than this
You the Expert In this, our final Experts column in the current server series, our reader experts look at fault tolerance in virtualised environments. As ever, we’re grateful to Reg reader experts Adam and Trevor for sharing their experience. They are joined by Intel’s Iain Beckingham and Freeform Dynamics’ Martin Atherton.
Server virtualisation has a number of benefits when it comes to fault tolerance but it also suffers from the ‘eggs-in-one-basket’ syndrome should a server go down. How can fault tolerance be built into the virtualised environment such that availability can be ensured?
As server virtualisation technology has matured and become more widely adopted, it’s fast becoming clear that we can now do far more work with far fewer resources, even on older servers. Machines that would have reached end-of-life this year can now run a handful of virtual servers with ease, and the huge savings in equipment, space and power costs are only now being appreciated. But any new technology brings new challenges and new hazards, and while some proclaim that virtualisation improves fault tolerance and provides increased redundancy, does it really, or are we simply moving the risks and the points of failure?
I think one of the biggest developments in virtualisation in the last couple of years has been the hypervisor management suites, such as System Center Virtual Machine Manager from Microsoft and vCenter from VMware. Now we have flexibility like never before: we can convert physical servers to virtual ones and redeploy virtual server images onto physical hardware. We can also take images of live, production servers for absolute redundancy and availability, or to test potential upgrades, and these tools can greatly improve the productivity of the staff charged with using them.
The freedom and versatility I now have, in an organisation currently engaged in a full-scale virtualisation project, to shift and consolidate workloads from one machine to another, one rack to another and one site to another is a massive evolution from traditional server management.
Yet our core servers, domain controllers, email servers and database servers remain largely unchanged; for us those servers are in fact more fault tolerant as dedicated physical boxes than they would be as virtual ones.
Non-business-critical systems can be, and have been, consolidated onto single servers for greater flexibility: where we once had a single server running four applications, we now have four virtual servers running one application each. Now we can reboot our patch management server without affecting the AV server or the file and print server.
At some point, however, additional servers will need to be procured to mirror these hypervisors, as now if one component fails it can potentially affect all four servers rather than just one. Of course, if the budget won’t stretch to buying mirrored or clustered servers, there are already hosting companies willing to provide various solutions, with some offering to host replicas of on- or off-site hypervisors.
There are vendors who specialise in fault tolerant hardware on which to run your hypervisors but these are expensive to buy and even more expensive to code software for, an arguably cheaper option would be to invest in a blade frame for perhaps the greatest resilience and even cheaper option than that is fault tolerant software.
Following the same principles as virtualisation software, fault-tolerant software runs as an abstraction layer across multiple off-the-shelf servers to create a single seamless interface: two physical servers appear as one, hosting perhaps half a dozen virtual servers. The software continuously scans for hardware faults and, upon finding them, directs all I/O away from the failed component. Virtualisation of the virtualised it may be, but we’re still doing far more with far less.
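The idea behind that abstraction layer can be illustrated with a toy sketch. This is not the code of any real FT product; the `Node` and `MirroredStore` classes are hypothetical stand-ins for two mirrored servers and the layer that steers I/O away from whichever one fails:

```python
class Node:
    """Hypothetical physical server behind the abstraction layer."""
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def write(self, data):
        if not self.healthy:
            raise IOError(f"{self.name} has failed")
        return f"{self.name} stored {data!r}"


class MirroredStore:
    """Toy abstraction layer: two off-the-shelf servers appear as one.

    Writes are mirrored to every healthy node; as soon as a fault is
    detected, I/O is directed away from the failed component.
    """
    def __init__(self, nodes):
        self.nodes = nodes

    def write(self, data):
        results = []
        for node in self.nodes:
            if not node.healthy:
                continue  # I/O directed away from the failed component
            try:
                results.append(node.write(data))
            except IOError:
                node.healthy = False  # mark faulty, carry on with the rest
        if not results:
            raise IOError("all nodes failed")
        return results


# Usage: the mirrored pair survives the loss of one server
a, b = Node("server-a"), Node("server-b")
store = MirroredStore([a, b])
print(store.write("payload-1"))  # both nodes acknowledge
a.healthy = False                # simulate a hardware fault
print(store.write("payload-2"))  # served by server-b alone, uninterrupted
```

Real FT layers do this at the hardware or hypervisor level, transparently to the guest operating systems, but the principle is the same: the client sees one interface and never notices which physical box served the request.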
Iain Beckingham, Manager of the Enterprise Technical Specialist team in EMEA, Intel
Today, virtualisation is targeting mission critical servers distributed in a dynamic virtual infrastructure where loads are balanced within a cluster of servers and high availability and automated failover technologies exist. Intel is designing new innovative solutions that incorporate RAS (Reliability, Availability, and Serviceability) features.
RAS features are even more important in high-end systems, where higher virtualisation ratios can be achieved. Intel’s new Xeon® processor, codenamed ‘Nehalem-EX’, will allow scaling beyond the traditional four sockets to systems with more than 32 sockets.
With Nehalem-EX, Intel has invested extensively in incremental RAS capabilities to support high availability and data integrity while minimising maintenance cycles. All told, there are over twenty new RAS features in the Nehalem-EX platform that OEMs can use to build high-availability, ‘mission-critical’ servers. Some of these features are built into the memory and processor interconnects and provide the ability to retry data transfers if errors are detected, or even to automatically heal data links with persistent errors to keep the system running until repairs are made.
Other capabilities, like Machine Check Architecture (MCA) recovery, take a page from Intel® Itanium® processor, RISC, and mainframe systems. MCA recovery supports a new level of cooperation between the processor and the operating system or VMM to recover from data errors that cannot be corrected with more standard Error Correcting Code (ECC) and that would have caused earlier systems to shut down.
Further capabilities enable OEMs to offer support for the hot addition or replacement of memory and CPUs, bringing new components online and migrating active workloads to them when existing CPUs or memory indicate that they are failing.
Hopefully this gives you an idea of the RAS platform capabilities that, along with cluster failover and VM failover configurations, will take Nehalem-EX systems much further toward providing an even more reliable and robust platform for IT.
Trevor Pott
Infrastructure Support Engineer
Can you give specific examples? Preferably cheap examples?
Oh **** yes it's a kludge. The ideal system would be something small, yummy and stable that I could run on a vast array of disposable lightweight servers. (See: mini racks of Atoms or CULV servers that are becoming a "niche thang.") Problem is that huge numbers of workloads require Windows, and Windows is both shite at high availability and bad with the not dying.
As to Photoshop (and similar client-side apps necessitating Windows) and what they have to do with the price of rice here: client-side Windows means server-side Windows. Don't bother with the "that's bollocks, you can use server-side Linux with your Windows clients." Been there, done that, went back to Windows. For all the alternatives, Windows clients talking to Windows servers (and the nice stack of vertically integrated goodies Microsoft sells CALs for) really are just way easier to use.
There is a point where the time and sanity of the admin who has to run everything must be considered, and (sad but true) if there is a significant Windows estate deployed, you will probably be better off with Windows servers running the show behind the scenes.
Still, kludge or not, x86 virtualisation *is* the greatest thing since sliced bread. It solves a gigantic pile of problems that used to give me ulcers, and I am one of them folks who are far too poor to even buy the management tools. (Let alone the blue crystals!) The world is full of crummy programmers, and great programmers restrained by crummy project managers. Virtualisation helps the man in the trenches keep it all running.
For everything else, there’s Mastercard.
Correction for Adam Salisbury
"There are vendors who specialise in fault tolerant hardware on which to run your hypervisors but these are expensive to buy and even more expensive to code software for, an arguably cheaper option would be to invest in a blade frame for perhaps the greatest resilience and even cheaper option than that is fault tolerant software".
That statement is unfortunately false with regard to the "expensive to code software for" part. There are FT systems that can run VMware ESX, and on top of ESX you can run other OSes and any application that runs on those OSes without any extra coding, let alone "expensive" coding. Both NEC and Stratus have such Xeon-based FT systems and, though they are more expensive than standard Xeon servers, they offer complete hardware redundancy and uninterrupted operation even if any component fails (CPU/memory/chipset/video card/network card, etc), without the need for any extra or special configuration, coding or software, because the fault tolerance is built into the hardware.
Keep in mind that this is not cluster-type continuous operation: in a cluster you have what is called failover time, whereas for such FT systems there is no failover; it is true uninterrupted operation in the event of a failure.
Get your facts straight please before posting erroneous articles which may mislead others!
OK, I'm with you on that. You have some apps that will only run on Windows. There's not much you can do about that. But are we talking about running VMs on the desktop or on the server? Correct me if I'm wrong, but most VMs are running on servers. Does Photoshop really care what system is providing network-based storage (I don't know; what other network services might Photoshop use)?
It seems to me (and again, I am happy to stand corrected) that many people take apps like email, web servers, database servers, etc, and run each of these in a VM instance. These are not desktop applications, and viable (and (almost?) always MUCH better) alternatives are available that will run on non-Windows platforms.
I'm not an evangelist; it is no skin off my nose what you run on your machines. But I DO have to work with Windows on a daily basis, and I find it incredible that many people and companies persist with it, and (in the shape of VMs and other techniques) constantly prop it up and then sit back and say "hey, that's really cool", when in reality it is (as you say yourself) a kludge, and a pretty awful one at that. In fact, it's not "cool" at all; it is a distinctly backward step.
One last thing
Kudos to you for admitting that the VM thing is a kludge. You must be the first person I have (virtually) seen who uses (and indeed supports the use of) VMs but admits it's a kludge. Every other VM person seems to treat it as the holy grail.