Building high availability into IT

How would yours fare in a disaster?

What does ‘high availability’ actually mean in the context of IT today? We’ve written elsewhere about more general availability and good systems design, but what if you want to assure availability when things go unexpectedly wrong?

From a systems perspective this is where we start adding to the number of nines we talk about, for example 99.999% (“five-nines”) availability. However, system failure is just one of the things that might befall our IT. The terminology is a bit mixed up, in that high availability, business continuity and disaster recovery are often used interchangeably. Business continuity goes broader than IT, and in the UK we have a standard associated with it (BS 25999). Terrorist threats, swine flu, you name it, business continuity best practice is there to get things back up and running.
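
To put those nines into perspective, here is a minimal Python sketch (plain arithmetic, not any particular vendor’s SLA definition) that turns an availability target into a yearly downtime budget; five-nines works out at a little over five minutes of downtime a year.

```python
# Rough sketch: translate an availability target into allowed downtime per year.
# Simple arithmetic only, not a contractual SLA definition.

MINUTES_PER_YEAR = 365.25 * 24 * 60

def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% availability -> {allowed_downtime_minutes(target):.1f} minutes/year")
```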

While disaster recovery is also about responding to external events, it does tend to refer more specifically to how to get the IT infrastructure back up and running. When we researched the risks a couple of years ago from an IT perspective, we found that theft and fire were the most likely ‘disasters’ that respondents felt needed to be dealt with, with natural disasters and terrorist activity being less significant.

Of course, since then we have seen a number of quite major flooding incidents, power outages and health scares, so no doubt general perceptions of the risks will have changed. Above all, this is indicative that trying to predict what disasters might occur is about as straightforward as betting on horses at the Gold Cup: while you might be able to whittle it down to the top five, any decision you make after that will be a gamble.

In IT terms, it is enough at least to know what the potential risks are, as attention can then turn to the business-critical IT systems and services to be kept going in the face of disaster. “Business critical” implies some level of prioritisation, with business involvement. Availability and recoverability come at a cost, and it is ultimately up to senior management to decide just how important each system is, and whether they want to cough up the cash to put the belts and braces in place.

Two criteria are important: first, the recovery time objective (RTO), namely how long it takes to get a service back up and running should disaster strike. In this case you may be able to get away with a reduced level of performance or access while more comprehensive recovery procedures are put in place. Second, we have the recovery point objective (RPO), namely how much data you can afford to lose, or how far back you can roll without suffering too badly. If you’re talking about a collaboration tool, for example, you might be more than happy to revert to whatever state it was in yesterday, should trouble befall it. For a banking transactional system, you won’t want to lose a single transaction.
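
As a rough illustration of how those objectives might be expressed and checked, here is a small Python sketch; the service names and figures are invented for the example rather than drawn from any real system.

```python
from datetime import datetime, timedelta

# Illustrative only: the services and figures below are made up for the example.
# RPO = how much data loss is tolerable; RTO = how long the service may be down.
objectives = {
    "collaboration-wiki": {"rpo": timedelta(hours=24), "rto": timedelta(hours=8)},
    "payments-ledger":    {"rpo": timedelta(seconds=0), "rto": timedelta(minutes=5)},
}

def rpo_breached(service: str, last_recovery_point: datetime, now: datetime) -> bool:
    """True if the newest usable recovery point is already older than the RPO allows."""
    return now - last_recovery_point > objectives[service]["rpo"]

now = datetime.utcnow()
print(rpo_breached("collaboration-wiki", now - timedelta(hours=20), now))  # False: within a day
print(rpo_breached("payments-ledger", now - timedelta(minutes=1), now))    # True: any loss is too much
```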

Such criteria dictate what mechanisms we need to put in place. For core systems that need to be guaranteed to be available immediately, some kind of external facility will be required, such as a second data centre or DR facility – this may be provided as a managed service (shared with other organisations, on the premise that a disaster won’t befall all of them at once). The specific protection mechanism depends on the system – for example clustering, load balancing and failover in the case of servers, or replication and snapshots in the case of storage.
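
To make the failover idea concrete, here is a deliberately simplified Python sketch of a health-check-and-fall-back routine; the URLs are hypothetical placeholders, and a real setup would lean on proper load balancers or cluster software rather than a script like this.

```python
import urllib.request

# Minimal failover sketch, not a production load balancer: try the primary
# site first and fall back to the DR site if its health check fails.
# The URLs are hypothetical placeholders.
SITES = [
    "https://primary.example.com/health",
    "https://dr-site.example.com/health",
]

def pick_live_site(timeout: float = 2.0) -> str:
    """Return the first site whose health check answers with HTTP 200."""
    for url in SITES:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except OSError:
            continue  # unreachable or timed out, try the next site
    raise RuntimeError("no site is answering its health check")
```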

As well as having the service available, you also need to think about how it will be accessed. It’s all very well having a second instance of the application, for example, but if head office is flooded out, alternative arrangements for staff become an important piece of the puzzle. Again, it is worth thinking about ‘recovery mode’ arrangements, for example enabling access for a reduced set of people, perhaps connecting remotely, while the main facility is restored.

High availability and disaster recovery capabilities aren’t standing still. As we know for example, virtualisation brings quite a lot to the HA/DR party in that it is a darn sight easier to take a copy of a virtual machine than a physical one. Traditional mechanisms such as clustering are also becoming more capable with the inclusion of multi-site clusters, though these will require network bandwidth and latency criteria to be met.

Today’s IT platforms do offer a raft of options to be sure, but these do not diminish the level of expertise required to implement high availability in a workable fashion. Let’s not beat about the bush: disaster recovery can push the dependencies between IT systems to the absolute limit, and so it becomes crucial to understand the relationships, technicalities and constraints between different components of IT.
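
One way to get a handle on those relationships is simply to model them as a dependency graph, from which a sensible recovery order falls out. The sketch below uses Python’s standard graphlib module; the component names are made up for the illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Toy model: recovery order falls out of the dependency graph.
# Component names are invented for the illustration.
depends_on = {
    "web-frontend": {"app-server"},
    "app-server":   {"database", "auth-service"},
    "auth-service": {"database"},
    "database":     {"san-storage"},
    "san-storage":  set(),
}

# Bring dependencies up before the things that rely on them.
print(list(TopologicalSorter(depends_on).static_order()))
# e.g. ['san-storage', 'database', 'auth-service', 'app-server', 'web-frontend']
```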

For example, it’s all very well having a separate server that can act as a warm standby should anything go wrong. But from a configuration perspective things are not that straightforward: specifics such as IP address failover, so that existing connections can be re-routed, or the mappings between the server HBA (storage card) and the logical storage unit the server is accessing on the SAN, all have to line up. There is much about IT that is still hard coded, and if the warm standby is not configured in exactly the same way as the live system, unexpected problems can occur during the transition.
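
A modest safeguard is to compare the live and standby configurations routinely and flag any drift. The Python sketch below shows the idea with a handful of invented settings, standing in for the IP failover rules and HBA-to-LUN mappings mentioned above.

```python
# Sketch of a drift check between live and warm-standby configuration.
# The settings shown are hypothetical; real checks would cover IP failover
# rules, HBA-to-LUN mappings and anything else that is still hard coded.
live = {
    "vip": "10.0.0.10",
    "hba_wwn": "50:01:43:80:12:34:56:78",
    "lun_mapping": "LUN7",
    "kernel": "5.15.0-91",
}
standby = {
    "vip": "10.0.0.10",
    "hba_wwn": "50:01:43:80:aa:bb:cc:dd",
    "lun_mapping": "LUN7",
    "kernel": "5.15.0-76",
}

drift = {k: (live[k], standby.get(k)) for k in live if standby.get(k) != live[k]}
for key, (expected, actual) in drift.items():
    print(f"standby differs on {key}: live={expected} standby={actual}")
```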

It’s worth thinking about these very real constraints when considering more esoteric approaches to IT architecture and its management. We have, on various occasions, talked about how IT could always be more dynamic, managed and provisioned more efficiently using virtualisation, etcetera etcetera. But if the systems being deployed require any level of protection against risk, their complexity goes up an order of magnitude.

To illustrate: it takes someone five seconds to create and boot a virtual machine. If that virtual machine needs to exhibit five-nines (99.999%) availability, however, a number of dependencies immediately appear in terms of the hardware platform it’s running on, the storage it is using and the management processes involved in assuring the continuity of service should something go wrong. While products (including management tools and failover solutions) do exist to help assure availability in more dynamic environments, these need to be factored into the costs, as do additional overheads in terms of monitoring.
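
A quick back-of-the-envelope calculation shows why: if the service depends on a chain of components, its end-to-end availability is at best the product of theirs. The Python sketch below assumes independent failures, and the figures are invented for the example.

```python
# Back-of-the-envelope illustration: a service is only as available as the
# chain of things it depends on. The figures below are invented.
components = {
    "hypervisor host":  0.9995,
    "shared storage":   0.9999,
    "network":          0.9995,
    "vm + application": 0.999,
}

end_to_end = 1.0
for availability in components.values():
    end_to_end *= availability  # independent components in series

print(f"end-to-end availability: {end_to_end * 100:.3f}%")
# roughly 99.79% here, a long way short of five-nines
```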

At this stage it’s still early days for virtualisation, so it’s also unclear whether high availability best practice is affected by such new ways of doing things – for example whether it becomes easier or harder to implement a disaster recovery plan, or whether things stay much the same despite the new capabilities. If you do have any early experience of this, we’re all ears.

If you want to know more about this, try our webcast, recorded last week, for a heated discussion on what it takes to build high availability systems. ®
