Building high availability into IT

How would yours fare in a disaster?


Lab What does ‘high availability’ actually mean in the context of IT today? We’ve written elsewhere about more general availability and good systems design, but what if you want to assure availability when things go unexpectedly wrong?

From a systems perspective this is where we start adding to the number of nines we talk about, for example 99.999% (“five-nines”) availability. However, systems failure is just one of the problems that might befall our IT. The terminology is a bit mixed up, in that high availability, business continuity and disaster recovery are often used interchangeably. Business continuity is broader than IT alone, and in the UK there is a standard associated with it (BS 25999). Terrorist threats, swine flu, you name it: business continuity best practice is there to get things back up and running.
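
To make those nines concrete, here is a minimal Python sketch of our own (not drawn from any research) that turns an availability figure into an annual downtime budget; five-nines leaves you a little over five minutes a year.

    # Rough downtime budgets implied by the various "nines" - purely
    # illustrative arithmetic; real SLAs also define measurement windows,
    # planned-maintenance exclusions and so on.
    MINUTES_PER_YEAR = 365 * 24 * 60

    def downtime_minutes_per_year(availability: float) -> float:
        """Annual downtime budget, in minutes, for a given availability."""
        return (1 - availability) * MINUTES_PER_YEAR

    for label, availability in [("two nines", 0.99),
                                ("three nines", 0.999),
                                ("four nines", 0.9999),
                                ("five nines", 0.99999)]:
        print(f"{label:>12}: {downtime_minutes_per_year(availability):8.1f} minutes/year")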

While disaster recovery is also about responding to external events, it refers more specifically to getting the IT infrastructure back up and running. When we researched the risks from an IT perspective a couple of years ago, we found that theft and fire were the ‘disasters’ respondents felt were most likely to need dealing with, with natural disasters and terrorist activity seen as less significant.

Of course, since then we have seen a number of quite major flooding incidents, power outages and health scares, so no doubt general perceptions of the risks will have changed. Above all, it shows that trying to predict which disasters might occur is about as straightforward as betting on horses at the Gold Cup: you might be able to whittle the field down to the top five, but any decision you make after that is a gamble.

In IT terms it is enough, at least, to know what the potential risks are; attention can then turn to the business-critical systems and services that must be kept going in the face of disaster. “Business critical” implies some level of prioritisation, with business involvement. Availability and recoverability come at a cost, and it is ultimately up to senior management to decide just how important each system is, and whether they want to cough up the cash to put the belts and braces in place.

Two criteria are important. First is the recovery time objective (RTO): how quickly a service must be back up and running should disaster strike. Here you may be able to get away with a reduced level of performance or access while more comprehensive recovery procedures are put in place. Second is the recovery point objective (RPO): how far back in time you can afford to roll back, in other words how much data you can stand to lose. If you’re talking about a collaboration tool, for example, you might be more than happy to revert to whatever state it was in yesterday should trouble befall it. For a banking transaction system, you won’t want to lose a single transaction.
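
By way of illustration only, here is a minimal Python sketch of how RTO and RPO targets might be recorded per system and checked against the age of the most recent recovery point; the system names and figures are invented for the example.

    # Illustrative only: record RTO/RPO targets per system and flag any system
    # whose most recent recovery point (backup or replica) already breaches
    # its RPO. Names and figures are invented for the example.
    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class RecoveryTargets:
        name: str
        rto: timedelta                 # how quickly the service must be restored
        rpo: timedelta                 # how much data loss (in time) is tolerable
        last_recovery_point: datetime  # timestamp of the last backup or replica

    systems = [
        RecoveryTargets("collaboration-wiki", timedelta(hours=8),
                        timedelta(hours=24), datetime(2009, 6, 1, 2, 0)),
        RecoveryTargets("payments-db", timedelta(minutes=15),
                        timedelta(seconds=0), datetime(2009, 6, 1, 13, 59)),
    ]

    now = datetime(2009, 6, 1, 14, 0)
    for s in systems:
        data_at_risk = now - s.last_recovery_point
        status = "OK" if data_at_risk <= s.rpo else "RPO breach"
        print(f"{s.name}: up to {data_at_risk} of data at risk ({status})")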

Such criteria dictate what mechanisms we need to put in place. For core systems that need to be guaranteed to be available immediately, some kind of external facility will be required, such as a second data centre or DR facility – this may be provided as a managed service (shared with other organisations, on the premise that a disaster won’t befall all of them at once). The specific protection mechanism depends on the system – for example clustering, load balancing and failover in the case of servers, or replication and snapshots in the case of storage.
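
As a toy illustration of the failover idea behind server clustering (ours, with invented hostnames and thresholds, and none of the fencing, quorum or split-brain handling a real cluster manager provides), a watcher might poll the primary and promote the standby after a few consecutive failures:

    # Toy failover watcher: poll the primary's health endpoint and promote the
    # standby after several consecutive failures. URL and thresholds are
    # hypothetical; real cluster managers do far more than this.
    import time
    import urllib.request

    PRIMARY_HEALTH_URL = "http://primary.example.internal/health"  # hypothetical
    FAILURE_THRESHOLD = 3
    POLL_INTERVAL_SECONDS = 5

    def primary_is_healthy() -> bool:
        try:
            with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=2) as resp:
                return resp.status == 200
        except OSError:
            return False

    def promote_standby() -> None:
        # Placeholder: in practice this would move the service IP, remount
        # storage and start the application on the standby node.
        print("Promoting standby to primary")

    failures = 0
    while True:
        failures = 0 if primary_is_healthy() else failures + 1
        if failures >= FAILURE_THRESHOLD:
            promote_standby()
            break
        time.sleep(POLL_INTERVAL_SECONDS)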

As well as having the service available, you also need to think about how it will be accessed. It’s all very well having a second instance of the application, for example, but if head office is flooded out, alternative arrangements for staff become an important piece of the puzzle. Again, it is worth thinking about ‘recovery mode’ arrangements, for example enabling access for a reduced set of people, perhaps connecting remotely, while the main facility is restored.

High availability and disaster recovery capabilities aren’t standing still. As we know for example, virtualisation brings quite a lot to the HA/DR party in that it is a darn sight easier to take a copy of a virtual machine than a physical one. Traditional mechanisms such as clustering are also becoming more capable with the inclusion of multi-site clusters, though these will require network bandwidth and latency criteria to be met.

Today’s IT platforms do offer a raft of options to be sure, but these do not diminish the level of expertise required to implement high availability in a workable fashion. Let’s not beat about the bush: disaster recovery can push the dependencies between IT systems to the absolute limit, so it becomes crucial to understand the relationships between the different components of IT, along with their technicalities and constraints.

For example, it’s all very well having a separate server that can act as a warm standby should anything go wrong. But from a configuration perspective things are not that straightforward: specifics such as IP address failover, so that existing connections can be re-routed, or indeed the mappings between the server HBA (storage card) and the logical storage unit it accesses on the SAN, all need to be dealt with. There is much about IT that is still hard-coded, and if the warm standby is not configured in exactly the same way as the live system, unexpected problems can occur during the transition.
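
A minimal sketch of the kind of drift check that helps here, with invented values standing in for whatever inventory or CMDB data is actually to hand, might compare the settings that commonly trip up a failover:

    # Illustrative configuration-drift check between a live server and its warm
    # standby. All values are invented; the point is simply that anything which
    # differs is a candidate for failover trouble.
    live = {
        "service_ip": "10.1.0.20",
        "san_lun_mapping": {"LUN0": "finance_datastore"},  # as seen via the HBA
        "app_version": "4.2.1",
        "os_patch_level": "SP2",
    }
    standby = {
        "service_ip": "10.1.0.20",        # same address must be able to fail over
        "san_lun_mapping": {"LUN0": "finance_datastore"},
        "app_version": "4.2.0",           # drifted: one patch behind the live box
        "os_patch_level": "SP2",
    }

    for key in live:
        if live[key] != standby[key]:
            print(f"DRIFT in {key}: live={live[key]!r}, standby={standby[key]!r}")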

It’s worth thinking about these very real constraints when considering more esoteric approaches to IT architecture and its management. We have, on various occasions, talked about how IT could always be more dynamic, managed and provisioned more efficiently using virtualisation, etcetera etcetera. But if the systems being deployed require any level of protection against risk, their complexity goes up by an order of magnitude.

Specifically, it takes someone five seconds to create and boot a virtual machine. If that virtual machine needs to exhibit five-nines (99.999%) availability, however, a number of dependencies immediately appear: the hardware platform it’s running on, the storage it is using, and the management processes involved in assuring continuity of service should something go wrong. While products (including management tools and failover solutions) do exist to help assure availability in more dynamic environments, these need to be factored into the costs, as do the additional overheads in terms of monitoring.
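
A back-of-the-envelope sum (with component figures we have invented purely for illustration) shows why those dependencies bite: with every layer in series, the combined availability is the product of the individual figures, and it falls well short of five nines.

    # Invented component figures; the point is the arithmetic, not the numbers.
    # A service depending on several layers in series is only as available as
    # the product of their individual availabilities.
    from functools import reduce
    from operator import mul

    components = {
        "virtual machine": 0.9999,
        "host server":     0.9999,
        "shared storage":  0.9995,
        "network":         0.9995,
    }

    combined = reduce(mul, components.values(), 1.0)
    print(f"Combined availability: {combined:.5%}")  # well short of five nines
    print(f"Implied downtime: {(1 - combined) * 365 * 24 * 60:.0f} minutes/year")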

At this stage it’s still early days for virtualisation, so it’s also unclear whether high availability best practice is affected by such new ways of doing things – for example whether it becomes easier or harder to implement a disaster recovery plan, or whether things stay much the same despite the new capabilities. If you do have any early experience of this, we’re all ears.

If you want to know more about this, try our webcast, recorded last week, for a heated discussion on what it takes to build high availability into your systems. ®
