Building high availability into IT

How would yours fare in a disaster?

Lab What does ‘high availability’ actually mean in the context of IT today? We’ve written elsewhere about availability in general and good systems design, but what if you want to assure availability when things go unexpectedly wrong?

From a systems perspective this is where we start adding to the number of nines we talk about, for example 99.999% (“five-nines”) availability. However, systems failure is just one possible issue that might befall our IT. The terminology is a bit mixed up, in that high availability, business continuity and disaster recovery are often talked about synonymously. Business continuity extends beyond IT, and in the UK we have a standard associated with it (BS 25999). Terrorist threats, swine flu, you name it, business continuity best practice is there to get things back up and running.

While disaster recovery is also about responding to external events, it tends to refer more specifically to getting the IT infrastructure back up and running. When we researched the risks a couple of years ago from an IT perspective, we found that theft and fire were the ‘disasters’ respondents felt most needed to be dealt with; natural disasters and terrorist activity figured less prominently.

Of course, since then we have seen a number of quite major flooding incidents, power outages and health scares, so no doubt general perceptions of the risks will have changed. Above all, this shows that trying to predict which disasters might occur is about as straightforward as betting on horses at the Gold Cup: you might be able to whittle the field down to the top five, but any decision you make after that is a gamble.

In IT terms it is enough, at least, to know what the potential risks are; attention can then turn to the business-critical IT systems and services to be kept going in the face of disaster. “Business critical” implies some level of prioritisation, with business involvement. Availability and recoverability come at a cost, and it is ultimately up to senior management to decide just how important each system is, and whether they want to cough up the cash to put the belts and braces in place.

Two criteria are important. First, the recovery time objective (RTO): how long it takes to get a service back up and running should disaster strike. Here you may be able to get away with a reduced level of performance or access while more comprehensive recovery procedures are put in place. Second, the recovery point objective (RPO): how far back in time your most recent recoverable state can be, in other words how much data you can afford to lose. If you’re talking about a collaboration tool, for example, you might be more than happy to revert to whatever state it was in yesterday, should trouble befall it. For a banking transactional system, you won’t want to lose a single transaction.
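To make that distinction concrete, here is a minimal Python sketch of an RPO check; the system names and objective figures are ours, purely for illustration, not anyone’s real targets:

    from datetime import datetime, timedelta

    # Hypothetical per-system objectives, as might come out of a business
    # prioritisation exercise (all values invented for illustration).
    OBJECTIVES = {
        "collaboration-wiki": {"rto": timedelta(hours=24), "rpo": timedelta(hours=24)},
        "banking-ledger":     {"rto": timedelta(minutes=5), "rpo": timedelta(seconds=0)},
    }

    def rpo_breached(system: str, last_recovery_point: datetime, now: datetime) -> bool:
        """True if the newest usable recovery point is older than the RPO allows."""
        return (now - last_recovery_point) > OBJECTIVES[system]["rpo"]

    # The collaboration tool is happy with yesterday's state; the ledger is not.
    now = datetime(2009, 7, 1, 12, 0)
    yesterday = now - timedelta(days=1)
    print(rpo_breached("collaboration-wiki", yesterday, now))  # False
    print(rpo_breached("banking-ledger", yesterday, now))      # True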

Such criteria dictate what mechanisms we need to put in place. For core systems that need to be guaranteed to be available immediately, some kind of external facility will be required, such as a second data centre or DR facility – this may be provided as a managed service (shared with other organisations, on the premise that a disaster won’t befall all of them at once). The specific protection mechanism depends on the system – for example clustering, load balancing and failover in the case of servers, or replication and snapshots in the case of storage.
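As a toy illustration of the failover idea, the following Python sketch routes requests to a standby only when the primary stops answering; the hostnames and ports are invented, and a production health check would be considerably more thorough than a single TCP probe:

    import socket

    # Assumed endpoints for a primary instance and a warm standby at a
    # second site; names and ports are invented for illustration.
    PRIMARY = ("app.primary.example.com", 443)
    STANDBY = ("app.dr-site.example.com", 443)

    def reachable(host: str, port: int, timeout: float = 2.0) -> bool:
        """Crude liveness probe: can we open a TCP connection at all?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def pick_endpoint() -> tuple:
        """Fail over to the standby only when the primary is unreachable."""
        return PRIMARY if reachable(*PRIMARY) else STANDBY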

As well as having the service available, you also need to think about how it will be accessed. It’s all very well having a second instance of the application, for example, but if head office is flooded out, alternative arrangements for staff become an important piece of the puzzle. Again, it is worth thinking about ‘recovery mode’ arrangements, for example enabling access for a reduced set of people, perhaps connecting remotely, while the main facility is restored.

High availability and disaster recovery capabilities aren’t standing still. As we know for example, virtualisation brings quite a lot to the HA/DR party in that it is a darn sight easier to take a copy of a virtual machine than a physical one. Traditional mechanisms such as clustering are also becoming more capable with the inclusion of multi-site clusters, though these will require network bandwidth and latency criteria to be met.
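By way of example, here is what taking a point-in-time copy of a running VM can look like with the libvirt Python bindings; we are assuming a local qemu/KVM host and a domain called ‘app-server’, both invented for the sketch:

    import libvirt

    # Snapshot description; the name and wording are made up for illustration.
    SNAPSHOT_XML = """
    <domainsnapshot>
      <name>pre-dr-rehearsal</name>
      <description>Point-in-time copy taken before a DR test</description>
    </domainsnapshot>
    """

    conn = libvirt.open("qemu:///system")   # assumes a local qemu/KVM host
    dom = conn.lookupByName("app-server")   # hypothetical domain name
    snap = dom.snapshotCreateXML(SNAPSHOT_XML, 0)
    print("Created snapshot:", snap.getName())
    conn.close()

Compare that with imaging a physical server and it is easy to see why virtualisation changes the DR economics.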

Today’s IT platforms do offer a raft of options to be sure, but these do not diminish the level of expertise required to implement high availability in a workable fashion. Let’s not beat about the bush: disaster recovery can push the dependencies between IT systems to the absolute limit, and so it becomes crucial to understand the relationships, technicalities and constraints between different components of IT.

For example, it’s all very well having a separate server that can act as a warm standby should anything go wrong. But from a configuration perspective, things are not that straightforward: specifics need to be dealt with such as IP address failover, so that existing connections can be re-routed, or indeed the mappings between the server’s HBA (storage card) and the logical storage unit it accesses on the SAN. There is much about IT that is still hard coded, and if the warm standby is not configured in exactly the same way as the live system, unexpected problems can occur during the transition.
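A simple way to catch such mismatches before disaster does it for you is to diff the live and standby configurations. This Python sketch uses invented settings of the kind mentioned above (virtual IPs, HBA-to-LUN mappings); a real check would pull these values from the systems themselves:

    # Toy configuration-drift check between a live server and its warm
    # standby; every value here is made up for illustration.
    live = {
        "virtual_ip": "10.0.0.15",
        "hba_wwn": "50:01:43:80:12:34:56:78",
        "lun_mapping": {"db_data": 3, "db_logs": 4},
    }
    standby = {
        "virtual_ip": "10.0.0.15",
        "hba_wwn": "50:01:43:80:87:65:43:21",         # different card
        "lun_mapping": {"db_data": 3, "db_logs": 5},  # drifted!
    }

    def config_drift(a: dict, b: dict) -> dict:
        """Return the settings that differ between two configurations."""
        return {k: (a[k], b.get(k)) for k in a if a[k] != b.get(k)}

    print(config_drift(live, standby))
    # Flags hba_wwn and lun_mapping -- exactly the sort of mismatch that
    # otherwise only surfaces mid-failover.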

It’s worth thinking about these very real constraints when considering more esoteric approaches to IT architecture and its management. We have, on various occasions, talked about how IT could always be more dynamic, managed and provisioned more efficiently using virtualisation, etcetera etcetera. But if the systems being deployed require any level of protection against risk, their complexity goes up by an order of magnitude.

Specifically, it takes someone five seconds to create and boot a virtual machine. If that virtual machine needs to exhibit five-nines (99.999%) availability however, a number of dependencies immediately appear in terms of the hardware platform it’s running on, the storage it is using and the management processes involved in assuring the continuity of service should something go wrong. While products (including management tools and failover solutions) do exist to help assure availability in more dynamic environments, these need to be factored into the costs, as do additional overheads in terms of monitoring.
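It is worth doing the arithmetic on what five-nines actually demands; a quick back-of-the-envelope calculation in Python:

    # Downtime budget implied by each availability level: every extra
    # nine shrinks the allowance tenfold.
    MINUTES_PER_YEAR = 365.25 * 24 * 60

    for nines in range(2, 6):
        availability = 100 * (1 - 10 ** -nines)
        budget = MINUTES_PER_YEAR * 10 ** -nines
        print(f"{availability:.{nines - 2}f}% -> {budget:8.2f} minutes of downtime per year")

    # Five nines leaves barely 5.26 minutes a year: no room at all for a
    # failover that needs manual reconfiguration.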

At this stage it’s still early days for virtualisation, so it’s also unclear whether high availability best practice is affected by such new ways of doing things – for example whether it becomes easier or harder to implement a disaster recovery plan, or whether things stay much the same despite the new capabilities. If you do have any early experience of this, we’re all ears.

If you want to know more, try our webcast, recorded last week, for a heated discussion on what it takes to build high availability systems. ®
