There's more than one way to back up your data

Here's how to tell them apart

In the world of data protection you don't get fired for losing money; you get fired for losing data.

Companies tend to make many copies of data, some of which hang around, zombie-like, for years. Data protection is without question critically important, and if we are to decide which methods to use in which scenario we need to understand how it has evolved.

The four main data protection methods in use today are:

  • Traditional backups
  • Replication
  • Continuous data protection
  • Snapshots

Each category can blur a little into the next as companies implement these concepts in different ways. This makes for lovely debates about which terminology should be used where.

Backup companies fight it out over who implements what and who is better at doing so. Some pick one or two data protection methods as their thing and then spend an eternity slagging off the other methods.

Others talk about how flexible their product is – and meanwhile interfaces grow ever more complex, ease of use is left by the wayside and implementing data protection becomes ever more tedious.

Fear of failure

Before we dive into the different categories, we should understand that data protection is an umbrella term encompassing protection from different events.

The most basic event is hardware failure (a disk failure, a server failure, a switch failure or what have you). Here we would find solutions such as RAID (lashing multiple disks together to protect against disk failure) and RAIN (lashing multiple servers together to protect against the failure of a single server).

I generally don't consider RAID, RAIN or MPIO (multipath I/O, which provides redundant paths through multiple switches) to be part of data protection. There is an argument to be made for their inclusion, but I prefer to call it "designing your infrastructure using more brain cells than shoe size".

To be blunt, if a single disk (or even the server) dying means you lose a critical amount of data, you shouldn't be designing infrastructure, even if you are only a small or medium-sized business (SMB) with one server.

This leaves the elements of worry that more commonly fall under the term data protection.

There is protection from disaster (failure of an entire data centre). There is protection from Oopsie McFumblefingers ("oh no, I didn't mean to delete that") and there is protection from corruption.

Define your terms

Here is where we should whip out the hated recovery point objective (RPO) and the recovery time objective (RTO).

RPO is the industry term for "how much data can you afford to lose?" The answer to this is different for every application and depends entirely upon who you ask.

An SMB, for example, may be just fine if all it has to do is revert to the previous night's backup of the financials database. Hard copies of all the customer transactions are printed and filed and you can always make some unfortunate minion re-enter the data manually.

Of course, if your building goes up in flames you can't re-enter that day's data. Depending on your jurisdiction, the taxman may have a few things to say about that.

Time of year can matter a great deal too. Production, sales, marketing and PR may not care if you lose a day's worth of customer orders in the January lull. A handful of customers are affected, sales people call them up, soothe some ruffled feathers – hey, it gives them a reason to be employed for that month, no?

But in peak season those same people would be summoning up genetically engineered viruses of extra suffering if you let a day's worth of customer orders go walkabout. The same company that has a handful of orders during the off season could be coping with tens or hundreds of thousands a day at its peak.

RTO is how quickly you need your backups restored. Again, this depends on what the data is, who you ask and what time of year it is.

If your RTO is right frakking now (RFN) and you are dealing with data protection by streaming terabytes over an ADSL connection no longer classified as broadband in the US, you should give some serious thought to how you will implement restores.

Sucking several terabytes across a local network is a time sink. Sucking it down across a small-business ADSL connection is hilarious in a weeping uncontrollably kind of way. Cloud to cloud begins to make a lot of sense now.
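To put rough numbers on "hilarious", here is a back-of-the-envelope calculator. It is a sketch: the 8 Mbps ADSL downstream rate, the 100 Mbps WAN and the 80 per cent link efficiency are assumptions picked for illustration, not measurements, and real restores also pay for disk, index and application overheads that this ignores.

```python
def transfer_hours(data_tb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate hours to move data_tb (decimal) terabytes over a link_mbps link.

    efficiency is an assumed fraction of the nominal line rate actually
    achieved after protocol overhead and contention; 0.8 is a guess.
    """
    bits = data_tb * 1e12 * 8                 # terabytes -> bits
    usable_bps = link_mbps * 1e6 * efficiency  # megabits/s -> usable bits/s
    return bits / usable_bps / 3600            # seconds -> hours


if __name__ == "__main__":
    # Assumed link speeds, for illustration only.
    for label, mbps in [("ADSL (8 Mbps)", 8), ("100 Mbps WAN", 100), ("10 GbE LAN", 10_000)]:
        hours = transfer_hours(3, mbps)
        print(f"{label:>14}: {hours:8.1f} hours ({hours / 24:.1f} days)")
```

Under those assumptions a 3 TB restore over the ADSL line takes roughly 1,040 hours (about 43 days), against well under an hour on 10 GbE.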

The old ways

Traditional backup methods are pretty straightforward. A robot wakes up at a given time and performs a backup. The primitive – but hard to screw up – way the robot does this is to vacuum up everything it is supposed to back up and ship this complete copy to wherever you store your files.

This might seem wasteful but it has its purpose. Let's say that your financials application and database are a whopping two or three gigs, and that your financials application can be set to "pause for backups" every night. You pause the database, copy all the files off into a folder whose name is based on the date and you un-pause the database.

This can be accomplished with a batch file and Windows scheduled tasks. No backup software required.
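The pause, copy and un-pause routine is simple enough to spell out. The original describes a batch file; the sketch below does the same steps in Python so they are explicit. The `pause` and `resume` hooks are hypothetical stand-ins for whatever "pause for backups" switch your financials application actually offers, and the paths are placeholders.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Callable, Optional


def dated_full_backup(
    source: Path,
    backup_root: Path,
    pause: Callable[[], None],
    resume: Callable[[], None],
    today: Optional[date] = None,
) -> Path:
    """Pause the app, copy everything into a folder named after the date, un-pause.

    pause/resume are hypothetical hooks for the application's own
    "pause for backups" mechanism.
    """
    day = today or date.today()
    target = backup_root / day.isoformat()  # e.g. <backup_root>/2024-01-31
    pause()
    try:
        # copytree creates missing parent folders and refuses to overwrite an
        # existing target, so a second run on the same day fails loudly.
        shutil.copytree(source, target)
    finally:
        resume()  # un-pause even if the copy blew up
    return target
```

The `try`/`finally` matters more than it looks: a backup job that leaves the database paused because the copy failed is its own kind of outage.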

If you are sending that data to a backup target that has deduplication, you don't even have to worry about having eleventy squillion copies of everything, and restoring in the event of a failure is simple.
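Why deduplication makes nightly full copies affordable: a dedup target splits incoming data into chunks, fingerprints each chunk and stores any given chunk only once. The toy store below uses fixed-size chunks and SHA-256 fingerprints; both are simplifying assumptions (many real products use variable-size, content-defined chunking), and the class is illustrative rather than any vendor's design.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: each unique chunk is kept exactly once.

    Fixed-size chunking is an assumption made for simplicity.
    """

    def __init__(self, chunk_size: int = 4096) -> None:
        self.chunk_size = chunk_size
        self.chunks: dict[str, bytes] = {}        # sha256 hex digest -> chunk
        self.backups: dict[str, list[str]] = {}   # backup name -> chunk recipe

    def put(self, name: str, data: bytes) -> int:
        """Store a full backup; return how many new bytes actually had to be kept."""
        recipe, new_bytes = [], 0
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i : i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:  # only unseen chunks cost storage
                self.chunks[digest] = chunk
                new_bytes += len(chunk)
            recipe.append(digest)
        self.backups[name] = recipe
        return new_bytes

    def get(self, name: str) -> bytes:
        """Reassemble a backup from its recipe of chunk fingerprints."""
        return b"".join(self.chunks[d] for d in self.backups[name])
```

Two nightly "full" copies of a file in which only one chunk changed cost just that one chunk of extra storage the second night, yet each night restores as a complete, standalone copy.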

The slightly more advanced version takes a look at all your files, figures out if anything has changed and then copies only the things that have. Typically restoring the backup requires that you have access to the backup software and an index that is intact and uncorrupted. (For the record, the index is always borked.)
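The change-detection step can be sketched as follows. Comparing each file's size and modification time against the previous run's index is a common heuristic and an assumption here (some tools hash contents instead). The index is a plain dictionary that a real tool would persist somewhere, and which you would then need intact at restore time. Note the sketch does not record deletions, which a real incremental backup must also track.

```python
import shutil
from pathlib import Path


def incremental_backup(source: Path, dest: Path, index: dict[str, tuple[int, int]]) -> list[str]:
    """Copy only new or changed files under source into dest.

    index maps relative path -> (size, mtime_ns) from the previous run and is
    updated in place. Returns the relative paths that were copied.
    """
    copied = []
    for path in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = path.relative_to(source)
        st = path.stat()
        fingerprint = (st.st_size, st.st_mtime_ns)
        if index.get(str(rel)) == fingerprint:
            continue  # unchanged since the last run, skip it
        (dest / rel).parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, dest / rel)  # copy2 also preserves timestamps
        index[str(rel)] = fingerprint
        copied.append(rel.as_posix())
    return copied
```

Each run writes into a fresh `dest` folder, so a restore means layering the full copy and every increment in order, guided by that index.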

Traditional backups are point-in-time copies of data. They are great for protecting against Oopsie McFumblefingers, mediocre at protecting against corruption and, on their own, worthless at protecting against disasters.

Traditional backups are enormous. Some companies try to set up a procedure which tasks a human with rotating tapes and disks and then taking one offsite every day. This inevitably fails, as humans are prone to forgetfulness, and it generally fails just when you need it not to.

So everyone tries streaming backups over the internet somewhere. Today, this is typically sent to a cloud backup provider. Even with all the various technologies thrown at them to squish backups down to a transmittable size, getting a copy of them offsite within the daily backup window is a perpetual challenge.

For smaller companies it usually means getting a dedicated DSL connection just for backups (and all the networking fun that entails). For larger companies it means forking over ever more money to the local bandwidth monopoly and trying to explain to the bean counters why.

Double vision

Enter replication. In its most basic form, this is a technology used to protect against the failure of a server. Server A replicates its workload to server B. When server A fails, server B takes over, using its copy of the data.

That is all fine and good when servers A and B are side by side and you can string a wire between them to go as fast as the network cards you can afford.

If your replication software is decent, this sort of local replication can be great for detecting silent corruption of data and repairing it on the fly, but it doesn't help you at all in dealing with Oopsie McFumblefingers or with disasters.
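How local replication can spot and fix silent corruption, in sketch form: with only two copies you can tell that they disagree but not which one is right, so the sketch assumes a checksum was recorded for every block when it was written. A scrub then checks both copies against that checksum and repairs whichever one has rotted. The block lists and function names are illustrative, not any product's API.

```python
import hashlib


def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()


def scrub(primary: list[bytes], replica: list[bytes], sums: list[str]) -> list[int]:
    """Verify both copies of every block against the checksum recorded at write
    time and repair whichever copy has silently changed.

    Returns the repaired block numbers. Assumes the checksums themselves are
    stored somewhere trustworthy.
    """
    repaired = []
    for i, expected in enumerate(sums):
        ok_primary = checksum(primary[i]) == expected
        ok_replica = checksum(replica[i]) == expected
        if ok_primary and not ok_replica:
            replica[i] = primary[i]   # heal the replica from the good copy
            repaired.append(i)
        elif ok_replica and not ok_primary:
            primary[i] = replica[i]   # heal the primary from the good copy
            repaired.append(i)
        elif not ok_primary and not ok_replica:
            raise RuntimeError(f"block {i}: both copies corrupt, restore from backup")
    return repaired
```

The last branch is the honest limit of replication: if both copies are bad, you are back to needing a backup.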

Replication, however, can be great for disaster recovery. If the amount of data you are writing every minute is scaled appropriately with the internet bandwidth you have available, then you can usually place server A and server B in different locations.

You have to be careful about the bandwidth, but there are a lot of options out there that combine deduplication, compression and other technologies to ensure you can still squeeze an SMB's workload through an ADSL connection and have an RPO of about 15 minutes.

That is a heck of a step up from the RPOs of eight hours to one day we usually see with traditional backups.
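Whether a 15-minute RPO is realistic comes down to queueing: if changes (after deduplication and compression) arrive faster than the uplink can drain them, the replica falls behind. The sketch below replays a per-minute write profile through a simple backlog model. The 1 Mbps uplink, the 4:1 reduction ratio and the sample profile are assumptions for illustration, and the model ignores latency and protocol overhead.

```python
def worst_case_lag_minutes(
    mb_written_per_minute: list[float],
    reduction_ratio: float,
    uplink_mbps: float,
) -> float:
    """Replay a per-minute write profile through a replication queue and return
    the largest backlog, in minutes of uplink time, at the end of any minute.

    That backlog is a rough proxy for how far the replica (and so the RPO)
    can fall behind; bursts inside a single minute are not modelled.
    """
    drain_mb_per_min = uplink_mbps * 60 / 8  # megabits/s -> megabytes/minute
    backlog_mb = 0.0
    worst = 0.0
    for written_mb in mb_written_per_minute:
        on_wire = written_mb / reduction_ratio  # after dedup + compression
        backlog_mb = max(0.0, backlog_mb + on_wire - drain_mb_per_min)
        worst = max(worst, backlog_mb / drain_mb_per_min)
    return worst


if __name__ == "__main__":
    # Assumed profile: ten quiet minutes, one 60 MB burst, then quiet again.
    profile = [20.0] * 10 + [60.0] + [0.0] * 5
    lag = worst_case_lag_minutes(profile, reduction_ratio=4.0, uplink_mbps=1.0)
    print(f"worst-case lag: {lag:.1f} minutes; 15-minute RPO met: {lag <= 15}")
```

With these numbers the burst leaves the replica about a minute behind, comfortably inside 15 minutes; sustain that burst for long enough and the backlog grows without limit.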

RTOs are better with replication too. You can usually simply light up the copy on server B and be online in seconds, or at worst minutes. Recovering data from traditional backups can take anywhere from a few minutes to weeks if badly planned.

But replication doesn't deal with Oopsie McFumblefingers. If you are replicating everything synchronously then all changes are sent from one system to another, including accidental deletions.

You are completely reliant on the software used to make the replication go, but this is increasingly built directly into operating systems, hypervisors and applications. That makes "oh we don't support that version any more" excuses from backup vendors in your moment of need far less of an issue.

Total recall

Continuous data protection (CDP) is a bit like replication on steroids. There are actually more approaches to it than there are companies selling CDP products. Some companies offer CDP-like tech under different monikers and some offer several means of accomplishing CDP.

CDP streams all changes in data from server A to server B, in much the same way as replication. The main difference is that CDP never discards anything. Every change you have ever made (or at least the last N changes, or everything within a retention window of X) is kept.

This means your RPO is usually a matter of picking a point in time that you want to restore to. The RTO can vary greatly, depending on the implementation and whether or not you have the ability to simply light the workload up remotely or have only been backing up the data.
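A toy block-level CDP journal shows why the RPO becomes "pick a time": every write is appended with a timestamp, and rebuilding the volume as of any moment means replaying the writes up to that point. The structure is an assumption for illustration; real products prune to a retention window, store data off-box and index far more cleverly.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class CdpJournal:
    """Toy CDP target: every write is kept, in time order, forever.

    A real product would prune to its retention window (N changes or X time).
    """
    base: dict[int, bytes] = field(default_factory=dict)      # block -> data at journal start
    times: list[float] = field(default_factory=list)          # timestamp of each write
    writes: list[tuple[int, bytes]] = field(default_factory=list)

    def record(self, t: float, block: int, data: bytes) -> None:
        if self.times and t < self.times[-1]:
            raise ValueError("writes must arrive in time order")
        self.times.append(t)
        self.writes.append((block, data))

    def as_of(self, t: float) -> dict[int, bytes]:
        """Rebuild the volume as it stood at time t by replaying writes made at or before t."""
        image = dict(self.base)
        for block, data in self.writes[: bisect.bisect_right(self.times, t)]:
            image[block] = data
        return image
```

Restoring to just before Oopsie McFumblefingers struck is then simply `as_of()` with a timestamp a moment earlier than the bad write.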

Almost everyone who has read this far has file-based CDP running on their systems in the form of Dropbox or one of its clones. Every time you make a change to a file in cloud storage, a copy is kept at the provider. You can log in to the cloud storage provider's website and restore files that go back several versions.

In other cases, CDP is doing block-level copies of changes in a way that is more akin to traditional replication, but recording every change. Many solutions choose a balance: providing replication, but "snapshotting" changes every so many seconds or minutes to offer a continuous stream of point-in-time changes without the need for a write-by-write record of changes.

CDP can be the best solution. It can also be a horrible tentacled monster, roaming the world's data networks waiting for your moment of weakness before preying on your fragile sanity.

It really depends on the implementation, as CDP is not the sort of thing that is built into many operating systems, hypervisors or applications. Here you are at the mercy of your backup vendor. Choose wisely.

Take a picture

Snapshots are the current magical kitten unicorn of the storage industry. Seemingly everything from operating systems to hypervisors to storage arrays supports them, and done right you can meet tight RPOs and tight RTOs. The problem lies in the "done right".

If you are using a hyper-converged provider that specialises in data efficiency, you get a free pass here. Your virtual machines are stored on the same servers they run on and everything is deduplicated, compressed and thin-provisioned.

Snapshots take virtually no time at all to create, and they are exactly the size of the data that has been written since the last snapshot was taken. Want to take a snapshot of your workloads every two minutes? Okey dokey, no problem.
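Why efficient snapshots cost so little, sketched as a toy redirect-on-write volume: taking a snapshot freezes the block map rather than copying data, so the space a snapshot pins is just the blocks written after it. The class is hypothetical and heavily simplified; for instance it copies the map itself, which real systems also share.

```python
class Volume:
    """Toy redirect-on-write volume (illustrative only).

    A snapshot records the current block map; no block data is copied.
    """

    def __init__(self) -> None:
        self.live: dict[int, bytes] = {}            # block number -> current data
        self.snapshots: list[dict[int, bytes]] = []  # frozen block maps
        self.dirty: set[int] = set()                 # blocks written since the last snapshot

    def write(self, block: int, data: bytes) -> None:
        self.live[block] = data  # new data goes to the live map; snapshots keep the old
        self.dirty.add(block)

    def snapshot(self) -> int:
        """Freeze the map; return how many blocks changed since the previous snapshot."""
        changed = len(self.dirty)
        # Copies only the map; the bytes objects are shared, not duplicated.
        self.snapshots.append(dict(self.live))
        self.dirty.clear()
        return changed
```

Fill a volume with 100 blocks, snapshot, change two blocks and snapshot again: the second snapshot accounts for two blocks of new data, and a third with no writes in between costs nothing.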

If you have a hyper-converged setup with a stretch cluster – one in which one or more nodes are located on a separate physical site, connected by metro area network (MAN)-class network links – then you can set things up so that the replicated copy of your workload is kept on another site and you have these sweet point-in-time snapshots for dealing with Oopsie McFumblefingers. Hurray!

Of course, MAN-class connectivity is a pipe dream (literally) for most companies, and that's before we get into conversations about the cost of a second site, or the cost of racks at a colo.

And then there are various disasters that can take out entire metros (such as Hurricane Sandy), so you really should ship at least some of those snapshots to servers at some distant location.

Now the array vendors would like you to know that they do replication and snapshotting and CDP and all this too – those dang uppity hyper-converged vendors aren't the be all and end all of storage, young whippersnapper – and if you would kindly put one of their arrays in each of your sites, they will take snapshots and ship them around the world too.

Of course, some cloud providers offer the ability to light up an array vendor's software in their cloud, and many a hyper-converged vendor has something similar. So now we can send snapshots to the cloud, commensurate with our bandwidth availability. Life would seem to be good.

Unfortunately, while snapshots are great for dealing with Oopsie McFumblefingers and can protect against disaster, they are not so good at picking up on silent data corruption, unless your vendor has made a special effort.

Efficiency is all

Outside of specifically designed hyper-converged setups with a focus on data efficiency, the snapshots-all-the-time approach struggles to provide the level of service that CDP can offer.

CDP can also operate on different levels: at the level of the application, of individual files or of the whole virtual machine. Snapshots almost always operate at the level of the virtual machine (or the LUN in the case of arrays).

This brings us back to the efficiency question about traditional backups. How much of the data that you are backing up do you really need?

Snapshots and replication will both back up a bunch of writes that you simply don't care about. CDP software can be more finely tuned – at the cost of being significantly more complex.

Data efficiency becomes critical. Deduplication and compression are a must, and cloud gateways are increasingly important concepts. Something that can keep a copy of which blocks have existed on which site can really help reduce the amount of data that needs to be transmitted.
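Keeping track of which blocks already exist at the other site can be sketched like this: hash each block, and ship a payload only when the remote site is not already known to hold that hash. The remote index is an assumed local cache of hashes; a real system must also update it only after a transfer is confirmed, and cope with the remote site deleting blocks.

```python
import hashlib


def plan_transfer(blocks: list[bytes], remote_has: set[str]) -> tuple[list[str], dict[str, bytes]]:
    """Split a volume into a recipe of block hashes plus only the payloads the
    remote site lacks.

    remote_has is an assumed, locally cached index of block hashes known to
    exist at the other site.
    """
    recipe: list[str] = []
    to_send: dict[str, bytes] = {}
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        recipe.append(digest)  # the remote rebuilds the volume from this order
        if digest not in remote_has and digest not in to_send:
            to_send[digest] = block  # each missing block travels at most once
    return recipe, to_send
```

The recipe is tiny compared with the data, so the bulk of the bandwidth goes only on blocks the far end has never seen.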

Once you have all of that worked out – and assuming you haven't run screaming into the hills or developed an RTO/RPO twitch that is measurable by your local seismic team – you have the fun task of determining where to send the second copy of your data.

Can you legally store it with American cloud providers? What level of encryption should you use?

And if you are choosing a data protection method that requires software from a data protection provider, do you have to install an agent on each and every server or virtual machine to get the job done?

All important questions, but ones, dear reader, that we will discuss at a later date. ®
