Building high availability into IT
How would yours fare in a disaster?
What does ‘high availability’ actually mean in the context of IT today? We’ve written elsewhere about availability and good systems design more generally, but what if you want to assure availability when things go unexpectedly wrong?
From a systems perspective this is where we start adding to the number of nines we talk about, for example 99.999% (“five-nines”) availability. However, systems failure is just one possible issue that might befall our IT. The terminology is a bit mixed up, in that high availability, business continuity and disaster recovery are often talked about synonymously. Business continuity goes broader than IT, and in the UK we have a standard associated with it (BS 25999). Terrorist threats, swine flu, you name it, business continuity best practice is there to get things back up and running.
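To put those nines into perspective, here’s a back-of-envelope sketch (our own illustration, not from any standard) of how much downtime each level actually permits per year:

```python
# Rough downtime budget implied by each availability level,
# assuming a plain 365-day year; figures are illustrative only.
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes(availability_pct: float) -> float:
    """Minutes of permitted downtime per year at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% -> {downtime_minutes(nines):.1f} min/year")
```

Five-nines works out at a little over five minutes of downtime a year, which is why each extra nine costs so much more than the last.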
While disaster recovery is also about responding to external events, it tends to refer more specifically to getting the IT infrastructure back up and running. When we researched the risks from an IT perspective a couple of years ago, we found that theft and fire were the ‘disasters’ respondents felt were most likely to need dealing with, with natural disasters and terrorist activity considered less significant.
Of course, since then we have seen a number of quite major flooding incidents, power outages and health scares, so no doubt general perceptions of the risks will have changed. Above all, this is indicative that trying to predict what disasters might occur is about as straightforward as betting on horses at the Gold Cup: while you might be able to whittle it down to the top five, any decision you make after that will be a gamble.
In IT terms, it is enough at least to know what the potential risks are, as attention can then turn to the business-critical IT systems and services to be kept going in the face of disaster. “Business critical” implies some level of prioritisation, with business involvement. Availability and recoverability come at a cost, and it is ultimately up to senior management to decide just how important each system is, and whether they want to cough up the cash to put the belts and braces in place.
Two criteria are important. First, the recovery time objective (RTO): how quickly a service must be back up and running should disaster strike. Here you may be able to get away with a reduced level of performance or access while more comprehensive recovery procedures are put in place. Second, the recovery point objective (RPO): how far back in time you can afford to go, in other words how much data you can tolerate losing. If you’re talking about a collaboration tool, for example, you might be quite happy to revert to whatever state it was in yesterday, should trouble befall it. For a banking transactional system, you won’t want to lose a single transaction.
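As a toy sketch of how those two objectives play out (the figures and function names here are our own, purely for illustration), consider the nightly-backed-up collaboration tool against its objectives:

```python
from datetime import datetime, timedelta

# Hypothetical scenario: a collaboration tool backed up nightly.
# An incident counts as survivable if the data lost stays within the
# RPO and the outage stays within the RTO.
def meets_objectives(last_good_copy, failure, service_restored, rpo, rto):
    """True if data loss is within the RPO and downtime within the RTO."""
    data_loss = failure - last_good_copy      # work that would be lost
    downtime = service_restored - failure     # how long users waited
    return data_loss <= rpo and downtime <= rto

failure = datetime(2009, 6, 1, 14, 0)
ok = meets_objectives(
    last_good_copy=datetime(2009, 6, 1, 2, 0),   # last night's backup
    failure=failure,
    service_restored=failure + timedelta(hours=3),
    rpo=timedelta(hours=24),    # yesterday's state is acceptable
    rto=timedelta(hours=4),
)
print(ok)  # True: the collaboration tool scenario passes
```

Swap in an RPO of zero, as the banking system would demand, and the same nightly-backup arrangement fails immediately, which is the whole point of setting the objectives per system.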
Such criteria dictate what mechanisms we need to put in place. For core systems that need to be guaranteed to be available immediately, some kind of external facility will be required, such as a second data centre or DR facility – this may be provided as a managed service (shared with other organisations, on the premise that a disaster won’t befall all of them at once). The specific protection mechanism depends on the system – for example clustering, load balancing and failover in the case of servers, or replication and snapshots in the case of storage.
As well as having the service available, you also need to think about how it will be accessed. It’s all very well having a second instance of the application, for example, but if head office is flooded out, alternative arrangements for staff become an important piece of the puzzle. Again, it is worth thinking about ‘recovery mode’ arrangements, for example enabling access for a reduced set of people, perhaps connecting remotely, while the main facility is restored.
High availability and disaster recovery capabilities aren’t standing still. As we know for example, virtualisation brings quite a lot to the HA/DR party in that it is a darn sight easier to take a copy of a virtual machine than a physical one. Traditional mechanisms such as clustering are also becoming more capable with the inclusion of multi-site clusters, though these will require network bandwidth and latency criteria to be met.
Today’s IT platforms do offer a raft of options to be sure, but these do not diminish the level of expertise required to implement high availability in a workable fashion. Let’s not beat about the bush: disaster recovery can push the dependencies between IT systems to the absolute limit, and so it becomes crucial to understand the relationships, technicalities and constraints between different components of IT.
For example, it’s all very well having a separate server that can act as a warm standby should anything go wrong. But from a configuration perspective, things are not that straightforward: specifics such as IP address failover need to be handled so that existing connections can be re-routed, as do mappings between the server’s HBA (host bus adapter, i.e. storage card) and the logical storage unit the server is accessing on the SAN. There is much about IT that is still hard coded, and if the warm standby is not configured in exactly the same way as the live system, unexpected problems can occur during the transition.
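The failover decision itself can be caricatured in a few lines (a deliberately simplified sketch; in real deployments this is the job of cluster software or VRRP-style protocols, and the names here are hypothetical):

```python
# Simplified model of a floating IP following the healthy node.
# Real HA stacks also handle split-brain, fencing and connection
# state, none of which is modelled here.
class Node:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

def owner_of_floating_ip(primary, standby):
    """The virtual IP sits on the primary while healthy, else the standby."""
    if primary.healthy:
        return primary
    if standby.healthy:
        return standby
    raise RuntimeError("no healthy node to take the floating IP")

live, warm = Node("live"), Node("warm-standby")
assert owner_of_floating_ip(live, warm).name == "live"
live.healthy = False   # disaster strikes the primary
assert owner_of_floating_ip(live, warm).name == "warm-standby"
```

The hard part in practice is not this logic but ensuring the standby’s configuration (addresses, HBA-to-LUN mappings and the rest) matches the live system exactly, as the paragraph above describes.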
It’s worth thinking about these very real constraints when considering more esoteric approaches to IT architecture and its management. We have, on various occasions, talked about how IT could always be more dynamic, managed and provisioned more efficiently using virtualisation, etcetera etcetera. But if the systems being deployed require any level of protection against risk, their complexity goes up an order of magnitude.
Specifically, it takes someone five seconds to create and boot a virtual machine. If that virtual machine needs to exhibit five-nines (99.999%) availability however, a number of dependencies immediately appear in terms of the hardware platform it’s running on, the storage it is using and the management processes involved in assuring the continuity of service should something go wrong. While products (including management tools and failover solutions) do exist to help assure availability in more dynamic environments, these need to be factored into the costs, as do additional overheads in terms of monitoring.
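Those dependencies compound in an unforgiving way. As a back-of-envelope illustration (our own, with made-up figures): if the VM’s service depends on host hardware, storage and network in series, its availability is at best the product of theirs:

```python
# Back-of-envelope only: components in series multiply their
# availabilities, so the chain is weaker than its weakest link.
def serial_availability(*components: float) -> float:
    """Overall availability of components that must all be up."""
    overall = 1.0
    for c in components:
        overall *= c
    return overall

# Three components, each at a respectable 99.9% ("three nines") ...
overall = serial_availability(0.999, 0.999, 0.999)
print(f"{overall:.5f}")  # ~0.99700 -- nowhere near five nines
```

Which is why a five-nines VM needs redundancy in, not just reliability of, everything underneath it.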
At this stage it’s still early days for virtualisation, so it’s also unclear whether high availability best practice is affected by such new ways of doing things – for example, whether it becomes easier or harder to implement a disaster recovery plan, or whether things stay much the same despite the new capabilities. If you have any early experience of this, we’re all ears.
If you want to know more about this, try our webcast, recorded last week, for a heated discussion on what building High Availability systems takes. ®