Building high availability into IT

How would yours fare in a disaster?


Lab What does ‘high availability’ actually mean in the context of IT today? We’ve written elsewhere about more general availability and good systems design, but what if you want to assure availability when things go unexpectedly wrong?

From a systems perspective this is where we start adding to the number of nines we talk about, for example 99.999% (“five-nines”) availability. However, systems failure is just one possible issue that might befall our IT. The terminology is a bit mixed up, in that high availability, business continuity and disaster recovery are often talked about synonymously. Business continuity goes broader than IT, and in the UK we have a standard associated with it (BS 25999). Terrorist threats, swine flu, you name it, business continuity best practice is there to get things back up and running.

While disaster recovery is also about responding to external events, it tends to refer more specifically to getting the IT infrastructure back up and running. When we researched the risks from an IT perspective a couple of years ago, we found that theft and fire were the ‘disasters’ respondents felt most likely to need dealing with; natural disasters and terrorist activity were seen as less significant.

Of course, since then we have seen a number of quite major flooding incidents, power outages and health scares, so no doubt general perceptions of the risks will have changed. Above all, this is indicative that trying to predict what disasters might occur is about as straightforward as betting on horses at the Gold Cup: while you might be able to whittle it down to the top five, any decision you make after that will be a gamble.

In IT terms it is at least enough to know what the potential risks are, so that attention can turn to the business-critical IT systems and services to be kept going in the face of disaster. “Business critical” implies some level of prioritisation, with business involvement. Availability and recoverability come at a cost, and it is ultimately up to senior management to decide just how important each system is, and whether they want to cough up the cash to put the belt and braces in place.

Two criteria are important. First, the recovery time objective: how long it takes to get a service back up and running should disaster strike. Here you may be able to get away with a reduced level of performance or access while more comprehensive recovery procedures are put in place. Second, the recovery point objective: how far back you can afford to go, in other words how much data you can stand to lose. If you’re talking about a collaboration tool, for example, you might be more than happy to revert to whatever state it was in yesterday, should trouble befall it. For a banking transactional system, you won’t want to lose a single transaction.
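
As a crude illustration of how those two objectives differ from system to system, here is a minimal sketch in Python (the service names, thresholds and timestamps are all invented for the example):

    from datetime import datetime, timedelta, timezone

    # Invented per-service objectives: how quickly each must be restored (RTO)
    # and how much data loss is tolerable (RPO).
    objectives = {
        "collaboration_tool":   {"rto": timedelta(hours=8),   "rpo": timedelta(hours=24)},
        "banking_transactions": {"rto": timedelta(minutes=5), "rpo": timedelta(seconds=0)},
    }

    def rpo_breached(service: str, last_recovery_point: datetime) -> bool:
        """True if the newest usable copy of the data is older than the RPO allows."""
        age = datetime.now(timezone.utc) - last_recovery_point
        return age > objectives[service]["rpo"]

    # Yesterday's backup is fine for the collaboration tool...
    print(rpo_breached("collaboration_tool",
                       datetime.now(timezone.utc) - timedelta(hours=10)))   # False
    # ...but the same gap would be a disaster for the transactional system.
    print(rpo_breached("banking_transactions",
                       datetime.now(timezone.utc) - timedelta(hours=10)))   # True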

Such criteria dictate what mechanisms we need to put in place. For core systems that need to be guaranteed to be available immediately, some kind of external facility will be required, such as a second data centre or DR facility – this may be provided as a managed service (shared with other organisations, on the premise that a disaster won’t befall all of them at once). The specific protection mechanism depends on the system – for example clustering, load balancing and failover in the case of servers, or replication and snapshots in the case of storage.
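
The detail varies by technology, but the underlying failover pattern is much the same everywhere: monitor the primary, and promote a standby once it has missed enough consecutive health checks. A minimal sketch, with the hostnames, thresholds and helper functions all invented for illustration:

    import time

    # Invented names and thresholds, for illustration only.
    PRIMARY = "app-primary.example.com"
    STANDBY = "app-standby.example.com"
    FAILURES_BEFORE_FAILOVER = 3     # tolerate transient blips, avoid flapping
    CHECK_INTERVAL_SECONDS = 5

    def check_health(host: str) -> bool:
        """Placeholder probe: always reports failure here to simulate an outage.
        In practice a TCP connect, HTTP check or cluster heartbeat would go here."""
        return False

    def promote_standby(host: str) -> None:
        """Placeholder: promote the replica, move a virtual IP or update DNS
        so that clients follow the standby."""
        print(f"Failing over to {host}")

    consecutive_failures = 0
    while True:
        if check_health(PRIMARY):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= FAILURES_BEFORE_FAILOVER:
                promote_standby(STANDBY)
                break
        time.sleep(CHECK_INTERVAL_SECONDS)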

As well as having the service available, you also need to think about how it will be accessed. It’s all very well having a second instance of the application, for example, but if head office is flooded out, alternative arrangements for staff become an important piece of the puzzle. Again, it is worth thinking about ‘recovery mode’ arrangements, for example enabling access for a reduced set of people, perhaps connecting remotely, while the main facility is restored.

High availability and disaster recovery capabilities aren’t standing still. As we know for example, virtualisation brings quite a lot to the HA/DR party in that it is a darn sight easier to take a copy of a virtual machine than a physical one. Traditional mechanisms such as clustering are also becoming more capable with the inclusion of multi-site clusters, though these will require network bandwidth and latency criteria to be met.
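
As a simple illustration of the point, on a libvirt-managed host (assuming virsh is on the path, and with an invented guest name) a point-in-time copy of a running guest is a one-liner, something that is far more fiddly with physical kit:

    import subprocess

    # Assumes a libvirt-managed host with virsh available; the guest name
    # "finance-app" is invented for the example.
    subprocess.run(
        ["virsh", "snapshot-create-as", "finance-app", "pre-dr-rehearsal",
         "--description", "Point-in-time copy taken before a DR test"],
        check=True,
    )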

Today’s IT platforms do offer a raft of options to be sure, but these do not diminish the level of expertise required to implement high availability in a workable fashion. Let’s not beat about the bush: disaster recovery can push the dependencies between IT systems to the absolute limit, and so it becomes crucial to understand the relationships, technicalities and constraints between different components of IT.

For example, it’s all very well having a separate server that can act as a warm standby should anything go wrong. But from a configuration perspective things are not that straightforward: specifics such as IP address failover, so that existing connections can be re-routed, or the mappings between the server’s HBA (storage card) and the logical storage unit it accesses on the SAN, all need to carry across. There is much about IT that is still hard coded, and if the warm standby is not configured in exactly the same way as the live system, unexpected problems can occur during the transition.
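
One way to catch that kind of drift before it bites is to compare the live and standby configurations routinely. A minimal sketch, with entirely invented settings:

    # Entirely invented settings: the point is that anything hard coded on the
    # live system must either match, or be deliberately mapped onto, the warm
    # standby, or the failover will stumble.
    live = {
        "service_ip": "10.0.1.20",
        "hba_wwpn":   "50:01:43:80:16:7b:2a:11",
        "san_lun":    "LUN_0042",
        "dns_name":   "orders.internal",
    }
    standby = {
        "service_ip": "10.0.2.20",                  # different subnet: needs an IP failover plan
        "hba_wwpn":   "50:01:43:80:16:7b:2a:99",    # different card: needs SAN zoning/mapping
        "san_lun":    "LUN_0042",
        "dns_name":   "orders.internal",
    }

    for key in live:
        if live[key] != standby[key]:
            print(f"DRIFT in {key}: live={live[key]} standby={standby[key]}")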

It’s worth thinking about these very real constraints when considering more esoteric approaches to IT architecture and its management. We have, on various occasions, talked about how IT could always be more dynamic, managed and provisioned more efficiently using virtualisation, etcetera etcetera. But if the systems being deployed require any level of protection against risk, their complexity goes up an order of magnitude.

Specifically, it takes someone five seconds to create and boot a virtual machine. If that virtual machine needs to exhibit five-nines (99.999%) availability however, a number of dependencies immediately appear in terms of the hardware platform it’s running on, the storage it is using and the management processes involved in assuring the continuity of service should something go wrong. While products (including management tools and failover solutions) do exist to help assure availability in more dynamic environments, these need to be factored into the costs, as do additional overheads in terms of monitoring.
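
To make that concrete, some back-of-the-envelope arithmetic (the component figures below are invented for illustration): five-nines allows only around five minutes of downtime a year, and because the availabilities of components a service depends on in series multiply together, the virtual machine’s own figure is quickly eroded by the platform underneath it.

    # Back-of-the-envelope availability arithmetic; component figures invented.
    MINUTES_PER_YEAR = 365.25 * 24 * 60

    five_nines = 0.99999
    print(f"Five-nines downtime budget: "
          f"{(1 - five_nines) * MINUTES_PER_YEAR:.1f} minutes per year")   # ~5.3 minutes

    # A VM engineered to five-nines still depends on its host, storage and
    # network; for components in series, availabilities multiply.
    dependencies = {
        "virtual machine": 0.99999,
        "host hardware":   0.9999,
        "shared storage":  0.9999,
        "network":         0.9995,
    }

    composite = 1.0
    for availability in dependencies.values():
        composite *= availability

    print(f"Composite availability: {composite:.4%}")   # about 99.93%
    print(f"Implied downtime: {(1 - composite) * MINUTES_PER_YEAR:.0f} minutes per year")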

At this stage it’s still early days for virtualisation, so it’s also unclear whether high availability best practice is affected by such new ways of doing things – for example whether it becomes easier or harder to implement a disaster recovery plan, or whether things stay much the same despite the new capabilities. If you do have any early experience of this, we’re all ears.

If you want to know more about this, try our webcast, recorded last week, for a heated discussion on what it takes to build high availability systems. ®
