Data integrity and failover in the hybrid cloud
Between two stools one falls to the ground
Discussions of information security tend to revolve around keeping confidential information confidential: preventing intruders from compromising the protection of the systems and gaining access to data they're not entitled to see. But there's more to security than just keeping information secret: it's a three-pronged concept.
Sadly, security is subject to a hideously twee three-letter abbreviation: CIA. Nothing to do with an American intelligence agency, but Confidentiality, Integrity and Availability. Confidentiality is the obvious one, and it tends to draw attention away from the other two – yet all three elements are as important as each other.
In this feature we'll look at how you deal with integrity and availability in a hybrid cloud setup – a multi-location arrangement in which part of your infrastructure sits on-premise and part sits in a public cloud.
What do we mean by these two terms? First is integrity, which Techopedia says “refers to the overall completeness, accuracy and consistency of data”.
We care about integrity in the context of security because even if an intruder isn't actually stealing your data, they can cause damage to your business by corrupting the data and rendering it unusable.
With regard to availability, TechTarget tells us it's about ensuring “that data continues to be available at a required level of performance”.
The classic example of an attack on the availability of data is a Denial of Service (DoS) attack, where a system is bombarded with unwanted connections such that it can't keep up and fails to respond to legitimate requests.
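One common way a front end sheds that kind of unwanted load is rate limiting; here's a minimal token-bucket limiter as a Python sketch, with made-up parameters and no claim to be a complete DoS defence:

```python
import time

class TokenBucket:
    """Token-bucket limiter: admits bursts of up to `capacity` requests,
    then refills at `rate` tokens per second."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Top the bucket up according to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=10, capacity=5)
burst = [bucket.allow() for _ in range(8)]
# In a tight burst, only the first `capacity` requests are admitted;
# legitimate traffic arriving at a sane rate sails through untouched.
```

Real deployments would do this at the load balancer or firewall, per client address, rather than in application code – but the principle is the same.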
Resilient systems are the first step to preserving availability in the event of a system failure. One server dies, a second server takes up the load either by taking over the role of the original server (in an active/passive system) or by soaking up the load of the first server as well as its own (in an active/active system).
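The distinction between the two modes can be sketched in a few lines of Python; the node names and health map here are purely illustrative:

```python
def pick_active(nodes, healthy):
    """Active/passive: traffic goes to the first healthy node in priority
    order; the standby only takes over when the primary fails."""
    for node in nodes:
        if healthy.get(node, False):
            return node
    return None  # everything is down

def healthy_pool(nodes, healthy):
    """Active/active: every healthy node shares the load, so losing one
    simply shifts its share onto the survivors."""
    return [n for n in nodes if healthy.get(n, False)]

nodes = ["primary", "standby"]
# Primary up: active/passive keeps the standby idle
assert pick_active(nodes, {"primary": True, "standby": True}) == "primary"
# Primary down: the standby takes over the role
assert pick_active(nodes, {"primary": False, "standby": True}) == "standby"
```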
You need to ask yourself, though, how you want to spread the resilient systems within your infrastructure. You might automatically think that the answer is to have the on-premise world replicated in the cloud and vice versa, and this is definitely a nice option (we'll come back to it in a bit). But it's not the only answer.
You could, for instance, decide to use your public cloud provider's multi-location capabilities if they have them. Implementing data replication and failover may well be significantly easier between (say) two of Amazon's US regions than it is between AWS and your own private systems.
At the other extreme, you could duplicate your on-premise system hardware instead – but the problem in that case is the risk of having just one location. Fair enough if you have more than one private data centre, but in most cases you won't. So at the very least you'll probably want to replicate your on-premise data into the public cloud installation, even if you decide not to do the same in reverse.
What resilience is required?
Before you start putting the protection mechanisms in place you need to decide precisely what you're protecting, and to what extent. Because the one thing that's certain is that while you'd like to make everything super-resilient with seamless failover, that will not be possible for some systems (particularly legacy applications) and it will probably be too costly in other cases. Real-time data replication takes bandwidth, and implementing automated failover on applications can be complex. So you need to go through each of your applications and data stores and decide what level of protection you need to implement.
You're likely to want near-real-time replication of your key fileserver storage areas, and the level at which you can do it depends on the kit you have and on your choice of public cloud provider. Some cloud vendors provide mechanisms that let you replicate data easily between the private and public installations; in other cases you may have to implement a higher level mechanism to deal with shoving the data around.
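If you do end up rolling your own higher-level mechanism, it amounts to little more than a scheduled one-way sync. This Python sketch shows the idea, with crude change detection by size and modification time and no handling of deletions – vendor replication tools do all of this far more robustly:

```python
import os
import shutil

def sync_tree(src, dst):
    """Copy files from src to dst when they are new or have changed,
    judged (crudely) by modification time and size. One-way; run it on
    a short schedule to approximate near-real-time replication."""
    copied = []
    for root, _dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        target = os.path.join(dst, rel)
        os.makedirs(target, exist_ok=True)
        for name in files:
            s = os.path.join(root, name)
            d = os.path.join(target, name)
            if (not os.path.exists(d)
                    or os.path.getmtime(s) > os.path.getmtime(d)
                    or os.path.getsize(s) != os.path.getsize(d)):
                shutil.copy2(s, d)  # copy2 preserves the mtime
                copied.append(name)
    return copied
```

In practice you'd point `dst` at a mount backed by the cloud side of the link, or use rsync or the provider's own gateway appliance instead.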
At an application level, walk through each and understand what you can do. Let's take a database as an example: database apps have multiple levels of replication, from clustering (where everything is instantly replicated and application failover is seamless in the event of server or storage loss) down to log shipping (where data is replicated in an active-passive setup and the front-end application has to be manually switched to point at the other storage area).
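Log shipping in particular is simple enough to sketch. This toy Python version just copies log segments the standby hasn't yet seen; replay on the standby and the manual switch-over are deliberately left out:

```python
import os
import shutil

def ship_logs(primary_logdir, standby_logdir):
    """Log shipping in miniature: copy any log segments the standby
    doesn't yet have. The standby replays them later, and failing over
    the front-end application remains a manual step."""
    shipped = []
    have = set(os.listdir(standby_logdir))
    for seg in sorted(os.listdir(primary_logdir)):
        if seg.endswith(".log") and seg not in have:
            shutil.copy2(os.path.join(primary_logdir, seg),
                         os.path.join(standby_logdir, seg))
            shipped.append(seg)
    return shipped
```

Real database engines ship their own transaction log formats over their own channels; the point here is only the active-passive shape of the arrangement.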
Be judicious with the level of availability you need for each application. It's probably a big deal if your core IP telephony system goes down, and so you'll seek strong protection and automated failover. But for less critical apps it may be acceptable to wait even a few hours for files to be restored from a backup and the system resurrected.
Which brings us to backups. I'm a great believer, in multi-location setups, in having each site back itself up to another (so in the usual two-location installation that means site A backs up to a server at site B, and vice versa). Disk-to-disk backups are increasingly the order of the day, and although they need to run to a schedule, they don't have to be real time.
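A scheduled cross-site backup job can be as simple as archiving one site's data into the other site's backup area. This sketch assumes the remote area is reachable as a local path (in practice a mounted share or a transfer over the inter-site link); names and parameters are illustrative:

```python
import os
import shutil
import time

def backup_site(data_dir, remote_backup_dir, site_name):
    """Disk-to-disk backup of one site's data into the *other* site's
    backup area, as a timestamped compressed archive. Run from a
    scheduler; it doesn't need to be real time."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base = os.path.join(remote_backup_dir, f"{site_name}-{stamp}")
    # make_archive appends the .tar.gz suffix and returns the full path
    return shutil.make_archive(base, "gztar", root_dir=data_dir)
```

Run the mirror-image job at the other site and each location holds the other's backups – which is exactly the property you want when one site is lost.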
Links in the chain
Core to the task of maintaining availability, then, is ensuring you have sufficient bandwidth to ship the required data between the public and private cloud worlds; it'll usually be a VPN connection over the Internet, and you'll want a proper corporate Internet connection with symmetric bandwidth and service guarantees.
Better still, go for resilient connections with a provider that's able to give you automated failover in the event of a link problem, because without the connection between the private and public cloud setups you're stuck.
Next is any special kit you need in order to replicate the data (for instance if you're using AWS storage you'd be daft not to consider Amazon's Storage Gateway, which runs as a virtual machine on your private setup).
Then you have the apps themselves: where the app has its own failover capability you simply need to configure it appropriately, but where this isn't the case you should look to write detailed documentation and the necessary scripts to enable the failover and failback to be done effectively when the need arises.
Finally is the presentation layer stuff. Say you want to make your email system's Webmail interface accessible from either of your locations without the user having to enter a different address. This means you need to look at whether to use global load balancing (for a nicely seamless failover) or a less reactive process such as a DNS change in the event of a server switch. Again, if you go for options such as the latter you should make sure to automate it as much as possible and document it to death.
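The DNS-change route can at least be partly automated. This sketch only decides what record should be published given a health check; the actual push to your DNS provider's API, which varies by provider, is deliberately left out, and the addresses are documentation examples:

```python
def plan_dns_update(current_ip, primary_ip, standby_ip,
                    primary_healthy, ttl=60):
    """Decide which address the service hostname should resolve to.
    Keep the TTL short so a switch propagates to clients quickly;
    pushing the result to the DNS provider is left to their API."""
    want = primary_ip if primary_healthy else standby_ip
    return {"ip": want, "ttl": ttl, "update_needed": want != current_ip}

# Primary healthy and already published: nothing to do
plan = plan_dns_update("203.0.113.10", "203.0.113.10",
                       "198.51.100.20", primary_healthy=True)
```

Even with automation, document the manual fallback: the day you need this is exactly the day the automation itself may be degraded.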
So what about integrity?
Data integrity is a curious concept, because systems that are running properly can actually help an attacker to cause an integrity problem.
Sounds slightly odd, but let's take an example where you have an enterprise database application whose data is replicated from the primary repository to the secondary. A malware package finds its way in and is able to do something like a DROP TABLE CUSTOMERS on the database.
What happens? Easy – the replicator throws this command over to the secondary and blows away the table in the next second or two. The means of making the systems redundant actually assists an integrity attack. (I had a real-life example of this a few years ago: by fluke we were able to recover from the secondary, because the replicator had broken a few hours before someone ran a DROP DATABASE.)
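One defence is to put a filter (or a deliberate delay) in front of the replicator, so destructive statements are held for human review rather than replayed automatically. The pattern list in this Python sketch is illustrative rather than complete:

```python
import re

# Statements we never replay automatically on the secondary.
# (An illustrative list, not an exhaustive one.)
DESTRUCTIVE = re.compile(r"^\s*(DROP|TRUNCATE)\b", re.IGNORECASE)

def filter_replication(statements):
    """Split a replication stream into statements safe to replay
    immediately and destructive ones held back for review, so an
    attacker's DROP TABLE isn't faithfully copied to the secondary."""
    replay, held = [], []
    for stmt in statements:
        (held if DESTRUCTIVE.match(stmt) else replay).append(stmt)
    return replay, held
```

The trade-off is that held-back statements make the secondary lag for legitimate schema changes too, so you need a process for reviewing and releasing them.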
So the first steps for integrity are the obvious ones: control access tightly, with each user ID having minimal permissions; run a rigorous anti-malware installation whose centralised console manages both the public and private cloud installations; and keep the management and user-facing elements of your systems separate, so that an attack on a server OS can't give access to the hypervisor or storage management layer.
And we've already discussed the concept of backups, which are an essential mechanism for recovering from disastrous data losses, whether through system failure or administrator error.
But on top of what the backups give you, explore the other facilities you have available for securing your data. If you have a reasonably modern SAN in your private cloud setup, the chances are that it provides the ability to take snapshots of storage volumes – point-in-time copies of the data as it was at a given time.
You'd never use data snapshots as your only integrity protection mechanism (an offsite backup is an essential get-out-of-jail card) but what they do give you is a layer on top of your backups. Storage snapshots cost you next to no disk space and can be removed with no service interruption, so they're an immensely useful way to supplement your backups with extra restore points.
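Managing those extra restore points usually comes down to a retention policy. This sketch keeps the newest snapshot per hour for a day and per day for a week; the cut-offs are assumed parameters, and real SAN or cloud tooling would do the actual deletion:

```python
from datetime import datetime, timedelta

def snapshots_to_keep(snapshot_times, now, hourly=24, daily=7):
    """Pick which point-in-time snapshots to retain: the newest per hour
    for the last `hourly` hours, plus the newest per day for the last
    `daily` days. Everything else can be dropped with no interruption."""
    by_hour, by_day = {}, {}
    for t in snapshot_times:
        if now - t <= timedelta(hours=hourly):
            key = t.strftime("%Y%m%d%H")
            if key not in by_hour or t > by_hour[key]:
                by_hour[key] = t
        if now - t <= timedelta(days=daily):
            key = t.strftime("%Y%m%d")
            if key not in by_day or t > by_day[key]:
                by_day[key] = t
    return set(by_hour.values()) | set(by_day.values())
```

Because snapshots cost so little, you can afford a generous policy like this on top of your normal backup schedule.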
And if you have an integrity problem, rolling back to a previous snapshot is a straightforward operation. Helpfully, the big public cloud providers have storage snapshot capabilities too, so it's something you can implement on both sides of the hybrid world.
There's not a great deal that's difficult in implementing integrity and availability protection in the hybrid cloud. The features available in today's public cloud installations are advanced and provide straightforward mechanisms for replicating your data to and fro, and in fact the main tricky aspects are those that would exist even if you had (say) two private data centres in different locations, because it's about how you make applications seamlessly fail over between different geographic locations (and hence internet-facing IP address ranges). ®