Fault tolerance in virtualised environments

Doesn’t get much more exciting than this

  • alert
  • submit to reddit

Choosing a cloud hosting partner with confidence

Trevor Pott
Infrastructure Support Engineer

Fault tolerance and high availability were once distinct concepts which have blurred into one single goal. Two decades ago high availability was a pipe dream for even the largest corporations. Basic fault tolerance was itself something that required massive planning. Resumption of service or data availability was limited to the speed at which a system could be rebuilt or a backup restored.

As the ability to back up systems and data became commonplace, the ability to recover from failure quickly became the pressing demand. RAID allowed (with varying degrees of success) disk failures without service interruption. ECC and chipkill were born to help us survive RAM glitches. Technologies to make networks robust proliferated enough to require their own article. Eventually some bright spark lit upon the idea of SANs, rendering entire servers nothing but disposable nodes and finally turning the entire network into one big computer.

Virtualisation, when handled properly, can lead to some pretty amazing uptimes. Let us mentally design a very small and basic high availability system using VMs. Take two SANs, each kept in sync with the other and visible to the rest of the network as “a single SAN.” The SANs themselves are attached to network via redundant network interfaces to redundant switches. Three server nodes (A, B and C) are also attached to these switches with their own redundant interfaces. If SAN A has an accident, SAN B takes up the slack. If a switch goes down, or the boss trips over a network cable, the network deals with it.

Populate Servers A and B with virtual machines configured in a cluster, and leave Server C as an idle spare. Go kill Server A with an axe. The virtual machine on Server B senses the loss of its clustered partner, and continues on with its duties. Remap the virtual machine that used to be on Server A over to Server C and bring it online. The cluster members sense each other, sync up, and are fault tolerant once more.

This basic example can be scaled as appropriate, and is something that hardware vendors are eagerly waiting to sell you right now. You can add even more layers of redundancy and there are entire symposiums dedicated to exactly that. Add neat virtualisation tricks like the ability to “VMotion” a live VM from one node to another and even routine maintenance doesn’t have to affect availability.

Those happy thoughts aside, in the real world of High Availability cost is asymptotic to perfect uptime. The closer you attempt to get to it, the more resources you will be forced to expend. Though we now have tools that can allow getting within a fraction of a percentage of perfect uptime, not even a company with Google’s resources has yet achieved it.

Martin Atherton
Research Director, Freeform Dynamics Ltd

There are rarely any ‘single’ answers to problems we face in IT. The question here is a good example. A way to think about protecting the operational performance of a virtualised server could be to take examples from different parts of its lifecycle. A good place to start when thinking about desirable attributes of any system is… at the start. Design them in. From a cost efficiency point of view in terms of avoiding redesigns, rewrites and re-engineering — basically getting it right first time — it makes perfect sense.

However, we know this doesn’t usually happen. For numerous reasons, but primarily due to the fragmented nature of the management of the steps involved from requirements to technical specification through testing and operational rollout, areas such as operational resiliency are often short-changed, as many operational IT staff will attest to.

Another side of the coin is to consider the context in which the virtualised server will be used in. For example, the services being supported by a particular server should define or at least guide the specific fault tolerances and other protective measures required in the first place. These contextual requirements are much more relevant to the business — and thus what it pays for — than taking a purely ‘technical’ approach to this area.

As always though, the answer in practice is likely to lie in a combination of approaches, not one or the other — or indeed any other options that exist — in isolation. It’s important to appreciate the strengths, weaknesses and risks of different approaches and how they complement or amplify any specific attributes in each other.

If you think you've got what it takes to contribute to our experts sessions, answer the question above and email it to: philip.mitchell (at) theregister.co.uk.

Beginner's guide to SSL certificates

More from The Register

next story
Ex-US Navy fighter pilot MIT prof: Drones beat humans - I should know
'Missy' Cummings on UAVs, smartcars and dying from boredom
Don't wait for that big iPad, order a NEXUS 9 instead, industry little bird says
Google said to debut next big slab, Android L ahead of Apple event
Xperia Z3: Crikey, Sony – ANOTHER flagship phondleslab?
The Fourth Amendment... and it IS better
Netscape Navigator - the browser that started it all - turns 20
It was 20 years ago today, Marc Andreeesen taught the band to play
A drone of one's own: Reg buyers' guide for UAV fanciers
Hardware: Check. Software: Huh? Licence: Licence...?
The Apple launch AS IT HAPPENED: Totally SERIOUS coverage, not for haters
Fandroids, Windows Phone fringe-oids – you wouldn't understand
Apple SILENCES Bose, YANKS headphones from stores
The, er, Beats go on after noise-cancelling spat
Here's your chance to buy an ancient, working APPLE ONE
Warning: Likely to cost a lot even for a Mac
prev story


Forging a new future with identity relationship management
Learn about ForgeRock's next generation IRM platform and how it is designed to empower CEOS's and enterprises to engage with consumers.
Cloud and hybrid-cloud data protection for VMware
Learn how quick and easy it is to configure backups and perform restores for VMware environments.
Three 1TB solid state scorchers up for grabs
Big SSDs can be expensive but think big and think free because you could be the lucky winner of one of three 1TB Samsung SSD 840 EVO drives that we’re giving away worth over £300 apiece.
Reg Reader Research: SaaS based Email and Office Productivity Tools
Read this Reg reader report which provides advice and guidance for SMBs towards the use of SaaS based email and Office productivity tools.
Security for virtualized datacentres
Legacy security solutions are inefficient due to the architectural differences between physical and virtual environments.