Original URL: https://www.theregister.com/2009/11/25/fault_tolerance_virtualization/

Fault tolerance in virtualised environments

Doesn’t get much more exciting than this

By Team Register

Posted in Systems, 25th November 2009 11:02 GMT

You the Expert In this, our final Experts column in the current server series, our reader experts look at fault tolerance in virtualised environments. As ever, we’re grateful to Reg reader experts Adam and Trevor for sharing their experience. They are joined by Intel’s Iain Beckingham and Freeform Dynamics’ Martin Atherton.

Server virtualisation has a number of benefits when it comes to fault tolerance but it also suffers from the ‘eggs-in-one-basket’ syndrome should a server go down. How can fault tolerance be built into the virtualised environment such that availability can be ensured?

Adam Salisbury
Systems Administrator

As server virtualisation technology has matured and become more widely adopted, it’s fast becoming clear that we can now do far more work with far less resource, even on older servers. Machines that would have turned end-of-life this year can now run a handful of virtual servers with ease, and the huge savings in equipment, space and power costs are only now being appreciated. But with any new technology come new challenges and new hazards. While some proclaim that virtualisation improves fault tolerance and provides increased redundancy, does it really, or are we simply moving the risks and the points of failure?

I think one of the biggest developments in virtualisation in the last couple of years has to be the hypervisor management suites, such as System Center Virtual Machine Manager from Microsoft and vCenter from VMware. Now we have flexibility like never before: we can convert physical servers to virtual ones and redeploy virtual server images onto physical hardware. We can also take images of live, production servers for absolute redundancy and availability, or to test potential upgrades, and those tools alone can greatly improve the productivity of the staff charged with using them.

The freedom and versatility I now have, in an organisation currently engaged in a full-scale virtualisation project, to shift and consolidate workloads from one machine to another, one rack to another and one site to another is a massive evolution from traditional server management.

Yet our core servers, the domain controllers, email servers and database servers, remain largely unchanged; for us, those servers are in fact more fault tolerant as dedicated physical boxes than they would be as virtual ones.

Non-business-critical systems can be, and have been, consolidated onto single servers for greater flexibility: where we once had a single server running four applications, we now have four virtual servers running one application each. Now we can reboot our patch management server without affecting the AV server or the file and print server.

At some point, however, additional servers will need to be procured to mirror these hypervisors, because if one component now fails it can potentially affect all four servers rather than just one. Of course, if the budget won’t stretch to mirrored or clustered servers, there are already hosting companies willing to provide various solutions, with some offering to host replicas of on- or off-site hypervisors.

There are vendors who specialise in fault-tolerant hardware on which to run your hypervisors, but these systems are expensive to buy and even more expensive to write software for. An arguably cheaper option is to invest in a blade frame for perhaps the greatest resilience, and a cheaper option still is fault-tolerant software.

Following the same principles as virtualisation software, fault-tolerant software runs as an abstraction layer across multiple off-the-shelf servers to create a single, seamless interface: two physical servers appear as one, hosting maybe half a dozen virtual servers. The software continuously scans for hardware faults and, upon finding them, directs all I/O away from the failed component. Virtualisation of the virtualised it may be, but we’re still doing far more with far less.
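
To make the idea concrete, here is a minimal sketch of that monitor-and-redirect loop. It is not any vendor’s actual product; the node names and the health probe are invented purely for illustration.

```python
# A toy model (no vendor's real product) of a fault-tolerance layer
# spanning two commodity servers: all I/O goes to the active node, and
# a monitor check redirects it to the standby as soon as a fault appears.

class Node:
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def probe(self):
        # A real product would inspect ECC counters, disk SMART data,
        # PSU and NIC state; here it is just a flag we can flip.
        return self.healthy


class FaultTolerantLayer:
    def __init__(self, primary, standby):
        self.active, self.standby = primary, standby

    def submit_io(self, request):
        if not self.active.probe():
            # Fault detected: steer all further I/O away from the bad box.
            print(f"Fault on {self.active.name}; redirecting I/O to {self.standby.name}")
            self.active, self.standby = self.standby, self.active
        return f"{request} handled by {self.active.name}"


layer = FaultTolerantLayer(Node("server-a"), Node("server-b"))
print(layer.submit_io("write block 42"))   # handled by server-a
layer.active.healthy = False               # simulate a hardware fault
print(layer.submit_io("write block 43"))   # transparently handled by server-b
```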

Iain Beckingham
Manager of the Enterprise Technical Specialist team in EMEA, Intel

Today, virtualisation is targeting mission-critical servers distributed in a dynamic virtual infrastructure, where loads are balanced within a cluster of servers and high-availability and automated-failover technologies exist. Intel is designing innovative new solutions that incorporate RAS (Reliability, Availability and Serviceability) features.

RAS features are even more important in high-end systems, where higher virtualisation ratios can be achieved. Intel’s new Xeon processor, codenamed ‘Nehalem-EX’, will allow scaling beyond the traditional four sockets to systems of more than 32 sockets.

With Nehalem-EX, Intel has invested extensively in incremental RAS capabilities to support high availability and data integrity, while minimizing maintenance cycles. All said, there are over twenty new RAS features in the Nehalem-EX platform that OEMs can use to build high-availability ‘mission-critical’ servers. Some of these features are built into the memory and processor interconnect and provide the ability to retry data transfers if errors are detected, or even to automatically heal data links with persistent errors to keep the system running until repairs are made.
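
Purely as a conceptual sketch, and not a description of Intel’s actual implementation, the retry-then-heal behaviour might be modelled like this (all names and numbers are invented):

```python
# Conceptual sketch only (not Intel's implementation, all numbers invented)
# of the two interconnect RAS ideas above: retry a transfer when an error
# is detected, and "heal" a persistently failing link by retiring the bad
# lane and retraining at reduced width so the system keeps running.

MAX_RETRIES = 3

def transfer(payload, lane_ok):
    """Pretend link transfer: returns None (a CRC error) while the lane is bad."""
    return payload if lane_ok else None

def send_with_ras(payload, link):
    for attempt in range(MAX_RETRIES):
        if transfer(payload, link["lane_ok"]) is not None:
            return f"delivered after {attempt + 1} attempt(s) at width {link['width']}"
    # Persistent errors: retire the failing lane and continue at half width
    # until the faulty part can be repaired.
    link["lane_ok"] = True
    link["width"] //= 2
    return f"lane retired, link retrained at width {link['width']}"

link = {"lane_ok": False, "width": 16}
print(send_with_ras("cache line", link))
```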

Other capabilities, like Machine Check Architecture (MCA) recovery, take a page from Intel Itanium, RISC and mainframe systems. MCA recovery supports a new level of cooperation between the processor and the operating system or VMM to recover from data errors that cannot be corrected with standard Error Correcting Code (ECC) and that would have caused earlier systems to shut down.
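
Again as an illustration of the idea only, rather than of any real firmware or operating system interface, the hand-off that MCA recovery enables could be sketched as follows:

```python
# A simplified, purely illustrative view of the hand-off MCA recovery
# enables: hardware classifies a memory error, and the OS or VMM decides
# what to do with it instead of the whole machine halting.

def handle_machine_check(error):
    if error["corrected"]:
        return "log it: ECC already fixed the data"
    if error["recoverable"]:
        # The processor marks the affected page as poisoned and signals the
        # OS/VMM, which can retire the page and, at worst, terminate only
        # the process or virtual machine that owned it.
        return f"retire page {error['page']:#x} and contain the impact to one VM or process"
    return "unrecoverable: shut down to protect data integrity"

print(handle_machine_check({"corrected": False, "recoverable": True, "page": 0x7f3a000}))
```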

Further capabilities enable OEMs to offer support for the hot addition or replacement of memory and CPUs, bringing new components online and migrating active workloads to them when existing CPUs or memory indicate that they are failing.

Hopefully this gives you an idea of the RAS platform capabilities that, along with clustering failover and virtualisation VM-failover configurations, will make Nehalem-EX systems go much further toward providing an even more reliable and robust platform for IT.

Trevor Pott
Infrastructure Support Engineer

Fault tolerance and high availability were once distinct concepts, but they have since blurred into a single goal. Two decades ago high availability was a pipe dream for even the largest corporations. Basic fault tolerance was itself something that required massive planning. Resumption of service or data availability was limited to the speed at which a system could be rebuilt or a backup restored.

As the ability to back up systems and data became commonplace, the ability to recover from failure quickly became the pressing demand. RAID allowed (with varying degrees of success) disk failures without service interruption. ECC and chipkill were born to help us survive RAM glitches. Technologies to make networks robust proliferated enough to require their own article. Eventually some bright spark lit upon the idea of SANs, rendering entire servers nothing but disposable nodes and finally turning the entire network into one big computer.

Virtualisation, when handled properly, can lead to some pretty amazing uptimes. Let us mentally design a very small and basic high-availability system using VMs. Take two SANs, each kept in sync with the other and visible to the rest of the network as “a single SAN.” The SANs themselves are attached to the network via redundant network interfaces to redundant switches. Three server nodes (A, B and C) are also attached to these switches with their own redundant interfaces. If SAN A has an accident, SAN B takes up the slack. If a switch goes down, or the boss trips over a network cable, the network deals with it.

Populate Servers A and B with virtual machines configured in a cluster, and leave Server C as an idle spare. Go kill Server A with an axe. The virtual machine on Server B senses the loss of its clustered partner, and continues on with its duties. Remap the virtual machine that used to be on Server A over to Server C and bring it online. The cluster members sense each other, sync up, and are fault tolerant once more.
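
A toy sketch of that failover sequence, with invented node names, might look like the following; a real clustering stack would do this with heartbeats, fencing and the shared SAN rather than a Python dictionary:

```python
# Toy model of the two active nodes plus idle spare described above.
# Node and VM names are invented; a real clustering stack would use
# heartbeats, fencing and the shared SAN rather than a dictionary.

cluster = {"server-a": "vm-1", "server-b": "vm-2"}
spare = "server-c"

def heartbeat(node, alive_nodes):
    """Stand-in for a real cluster heartbeat check."""
    return node in alive_nodes

def fail_over(dead_node, alive_nodes):
    global spare
    if heartbeat(dead_node, alive_nodes):
        return  # node is still alive, nothing to do
    # The surviving member keeps serving; the lost VM is remapped to the
    # spare and brought back online, restoring fault tolerance.
    workload = cluster.pop(dead_node)
    cluster[spare] = workload
    print(f"{dead_node} lost; {workload} restarted on {spare}")
    spare = None  # no idle spare left until the hardware is replaced

# "Go kill Server A with an axe":
fail_over("server-a", alive_nodes={"server-b", "server-c"})
print(cluster)  # {'server-b': 'vm-2', 'server-c': 'vm-1'}
```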

This basic example can be scaled as appropriate, and is something that hardware vendors are eagerly waiting to sell you right now. You can add even more layers of redundancy and there are entire symposiums dedicated to exactly that. Add neat virtualisation tricks like the ability to “VMotion” a live VM from one node to another and even routine maintenance doesn’t have to affect availability.

Those happy thoughts aside, in the real world of high availability, cost climbs asymptotically as you approach perfect uptime. The closer you attempt to get to it, the more resources you will be forced to expend. Though we now have tools that can get you within a fraction of a percentage point of perfect uptime, not even a company with Google’s resources has yet achieved it.
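
A quick back-of-the-envelope calculation shows why. Assuming, purely for illustration, that each redundant node is 99.5 per cent available on its own, the combined availability of n independent nodes is 1 - (1 - 0.995)^n, and each extra node removes a smaller slice of the remaining downtime:

```python
# Back-of-the-envelope illustration of diminishing returns. The 99.5 per
# cent per-node figure is an arbitrary assumption, and the model ignores
# correlated failures (shared switches, power, software bugs), which is
# one reason perfect uptime stays out of reach however much is spent.

MINUTES_PER_YEAR = 365 * 24 * 60
single = 0.995  # assumed availability of one node

for n in range(1, 5):
    combined = 1 - (1 - single) ** n          # service is up if any node is up
    downtime = (1 - combined) * MINUTES_PER_YEAR
    print(f"{n} node(s): {combined:.6%} available, ~{downtime:,.1f} minutes/year down")

# Each extra layer of redundancy costs roughly as much as the last one,
# but removes a smaller and smaller slice of the remaining downtime.
```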

Martin Atherton
Research Director, Freeform Dynamics Ltd

There are rarely any ‘single’ answers to the problems we face in IT, and the question here is a good example. One way to think about protecting the operational performance of a virtualised server is to take examples from different parts of its lifecycle. A good place to start when thinking about desirable attributes of any system is… at the start. Design them in. From a cost-efficiency point of view (avoiding redesigns, rewrites and re-engineering, basically getting it right first time), it makes perfect sense.

However, we know this doesn’t usually happen. For numerous reasons, but primarily the fragmented management of the steps from requirements through technical specification, testing and operational rollout, areas such as operational resiliency are often short-changed, as many operational IT staff will attest.

Another side of the coin is to consider the context in which the virtualised server will be used. For example, the services being supported by a particular server should define, or at least guide, the specific fault tolerances and other protective measures required in the first place. These contextual requirements are much more relevant to the business, and thus to what it pays for, than a purely ‘technical’ approach to this area.

As always, though, the answer in practice is likely to lie in a combination of approaches rather than in any one of them (or indeed any other option that exists) in isolation. It’s important to appreciate the strengths, weaknesses and risks of the different approaches, and how they complement or amplify specific attributes in each other.

If you think you've got what it takes to contribute to our experts sessions, answer the question above and email it to: philip.mitchell (at) theregister.com.