Copy Data Management: What it is and why you might need it
A giant step for data protection, says Trevor Pott
Most infrastructure verticals within the datacenter are undergoing rapid evolution today, but you could be entirely forgiven if you thought data protection stopped evolving some time ago. That's a shame because Copy Data Management (CDM) is the next evolution of this space, and it's worth taking some time to learn about this area.
That the data protection market is so quiet is odd. There's the occasional article about Disaster Recovery as a Service (DRaaS), but there just isn't the buzz around data protection compared with other infrastructure segments. In part, this is because only a handful of companies are truly pushing boundaries in the data protection space.
Compare this to the overhyped and overcrowded Storage, Networking and Virtualization (SNV) market segments. Clearly a lot of companies are pushing the boundaries in the SNV space. This makes for a lot of press, social media spats and new technologies moving downmarket to the SMBs at record pace.
But in the data protection market there are not hundreds of companies each scrambling to own the entire pie. The upshot is that it is taking forever to drag data protection into the future. Rapid incremental evolution of the market, such as is happening in the SNV space, isn't going to happen here.
Fortunately, iterative advancement isn't the only way to move technology along. Every now and again you get a great big leap. This is CDM.
Traditionally, data protection has been focused on finding new and interesting ways to make copies of your data. In most organisations entire disaster recovery sites full of data copies sit around and do nothing except wait for an "oh shit" moment that may never happen.
CDM begins with the understanding that a range of IT functions depend on copies of data beyond data protection; a proper CDM solution is designed to manage the creation and use of these copies in an efficient and automated way. This indeed is a different approach - one that sees data protection as one of many “consumers” of data copies.
As such, CDM goes beyond simply taking copies of data and focuses on doing something useful with them. And by looking for a smart way to create and deliver copies as needed, it has made data protection easier to use, lowered costs in many areas of the datacenter and even given us a means to make use of traditional 24/7 workloads in the public end of a hybrid cloud scenario - without going broke.
"Run the reports at night, when nobody is using the system"
If that sounds like a lot of promises waiting to be broken, read on. CDM isn't magical. It's something many of us do already - usually badly - but made simple, and with vendor support behind it.
Copies get used
We do lots of useful things with copies of production data - and this applies even to small businesses. Let's consider three relatively straightforward uses: report generation, infrastructure analytics, and test and dev. These are all common workloads that frequently operate off copies of our production data.
Very few application developers write good algorithms for running reports, nor is it all that common to see process isolation good enough to let reports run on the same server as your production environment. Customers get tetchy if it takes five minutes to ring up a purchase, so slowing the point of sale application down to a crawl because an executive wants a sales report now is usually a bad plan.
The typical engineer's response is to organise business processes around IT capabilities: "Run the reports at night, when nobody is using the system." Of course, that doesn't allow for realtime reports to be run, and it may conflict with backups. That's before we talk about storage systems running post-process deduplication or what-have-you.
So, by and large, everyone takes a copy of the night's backups and injects the copy of that data into a copy of the production server, or takes a snapshot of the production VM each night and boots it up as a report server. In both cases the result is usually accomplished by carefully crafted scripts held together by wishes and hard work.
Worse, IT typically has to justify the extra server capacity every time it goes seeking infrastructure upgrades, because business types don't seem to understand why reports are being run off a separate instance. Let's not even touch trying to convince the management types to log in to a separate server.
According to reports
Infrastructure analytics is different from reports. Whereas reports are suits nerding over sales figures, infrastructure analytics are nerds justifying purchases to suits. How much data changes each day? What kind of data? Who is using it? For what? How much compute capacity is required? What are the projections for growth? Has there been any corruption of files since the last check?
Some of this information can be extracted from vSphere or other infrastructure servers and processed relatively cheaply. Some of it - such as checking all files for corruption, compromise, etc. - is fairly expensive to gather. So, one more layer of scripts and some complicated work later, all of this also ends up running against a copy of the data you took for your backups.
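To make the corruption check concrete, here is a minimal Python sketch of the kind of script that might run against a mounted backup copy rather than against production. The function names (`build_manifest`, `changed_files`) are hypothetical and not taken from any CDM product. The sketch only flags files whose SHA-256 digest has changed, or that have gone missing, since the previous run; deciding whether a change is legitimate or a sign of corruption is left to whatever consumes the list.

```python
import hashlib
from pathlib import Path


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads the whole file into memory; fine for a sketch,
            # a real script would hash in chunks.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def changed_files(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return files whose digest differs, or that vanished, since the last check.

    Files that are new in `current` are not reported.
    """
    return sorted(
        name for name in previous
        if current.get(name) != previous[name]
    )
```

The point of running this against the copy is that the expensive part, reading every byte of every file, never touches the production storage.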
Test and dev teams are constantly copying things: either production environments down to test and dev, or test and dev environments up to production. There are long arguments about which way that should go - and why - and I'll leave that for another time.
Some places don't use copies. They have everything building out automatically and injected using carefully manicured scripts. Entire datacenters can be made to dance, if you have a room full of PhDs to pull the strings. Most businesses, however, aren't there yet. They probably won't be for some time.
If you've sensed a pattern by now, you probably know what's coming: test and dev copies are often made off the nightly backups, because it's too costly to make them off the production storage, virtualization and compute systems. Those backup servers and disk arrays are just sitting around doing nothing, so hey, why not?
This all seems logical on its face. The problem is that it's harder to implement than it sounds, it's fragile and it's massively manpower intensive. Someone should really do something about that.
Cloudy Copy Data Management
CDM companies are, as you might imagine, experts in dealing with copies of your data. CDM is about taking the bits of infrastructure you already have in place that make copies of things and making all of it actually usable. Think along the lines of one user interface that can manage all of your production data, your backups, and your offsite copies.
Need to spin up a VM for reporting? That's simple, there's a button for that. Need to orchestrate some nightly test and dev, or analytics data or workload moves? That's what CDM does. Big data, forensics - anything you can think of that you want to do with copies of your data - CDM makes it happen.
On its own, CDM falls into the "where were you 10 years ago when I still had my sanity" category of real-world critical software. It promises to replace decades of accumulated script cruft and half-functional backup applications with something streamlined and easy to use. As a bonus, CDM is also the first good argument I've seen for running traditional workloads on the public end of a hybrid cloud.
I'm not a fan of public cloud computing. Setting aside the US of NSA issues, it's just too easy to screw up. It also costs rather a lot of money if your applications aren't designed to be burstable cloudy applications.
CDM bridges the gap. It's more than a pretty interface to manage your data. It's orchestration and automation of where that data goes and what it does when it gets there.
With modern storage technologies I can set up a CDM system on a public cloud provider and on my own premises. After the initial replication has completed, only changed blocks need to be sent to the public cloud provider, so keeping the remote site up to date is fairly simple.
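The changed-block idea can be sketched in a few lines of Python. This is an illustration only: real storage systems generally track changed blocks as writes happen rather than re-hashing whole volumes, and the tiny block size, the function names and the assumption that the volume never shrinks are all invented for the example.

```python
import hashlib

BLOCK_SIZE = 4  # absurdly small, for illustration only


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each fixed-size block of a volume image."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(local: bytes, remote_digests: list[str],
                   block_size: int = BLOCK_SIZE) -> dict[int, bytes]:
    """Return {block_index: block_bytes} for blocks the remote site lacks or has stale."""
    delta = {}
    for index, digest in enumerate(block_digests(local, block_size)):
        if index >= len(remote_digests) or remote_digests[index] != digest:
            delta[index] = local[index * block_size:(index + 1) * block_size]
    return delta


def apply_delta(remote: bytes, delta: dict[int, bytes],
                block_size: int = BLOCK_SIZE) -> bytes:
    """Patch the remote copy with the changed blocks (assumes no shrinking)."""
    blocks = [remote[i:i + block_size] for i in range(0, len(remote), block_size)]
    for index in sorted(delta):
        if index < len(blocks):
            blocks[index] = delta[index]   # overwrite a stale block
        else:
            blocks.append(delta[index])    # volume grew since the last sync
    return b"".join(blocks)
```

Only the contents of `delta` would cross the wire, which is why the ongoing bandwidth bill is tied to the rate of change rather than to the total size of the dataset.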
Even if I'm streaming my data changes to the public cloud provider as they happen, the CDM software can then create point-in-time snapshots of any workload or dataset based on whatever criteria it's fed. Workloads can be spun up off child snapshots, leaving the production copy and the initial backup copy of the data untouched by the workload while still allowing the workload to operate on the data as it needs.
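Here is a toy model of child snapshots, assuming a simple copy-on-write scheme in which a child stores only the blocks it has overwritten and falls back to its parent for everything else. The `Snapshot` class is hypothetical; it stands in for whatever the underlying storage system actually provides.

```python
class Snapshot:
    """A layer of blocks stacked on an optional parent snapshot."""

    def __init__(self, blocks=None, parent=None):
        self._blocks = dict(blocks or {})  # only blocks owned by this layer
        self._parent = parent

    def read(self, index):
        """Read a block, falling through to the parent if this layer lacks it."""
        if index in self._blocks:
            return self._blocks[index]
        if self._parent is not None:
            return self._parent.read(index)
        raise KeyError(index)

    def write(self, index, data):
        """Write lands in this layer only; the parent is never modified."""
        self._blocks[index] = data

    def child(self):
        """Create a writable child that initially stores nothing of its own."""
        return Snapshot(parent=self)


# A reporting workload scribbles on its own child; the backup copy is untouched.
backup = Snapshot({0: b"orders", 1: b"stock"})
report = backup.child()
report.write(1, b"scratch")
```

Because the child starts out empty, spinning up another reporting or test instance costs almost no extra storage until that workload begins changing data.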
This is accomplished because the CDM software is managing existing storage systems that have replication, snapshotting and other capabilities built in. Enterprises can have a seemingly unlimited number of these things and herding them all can be a right mess. CDM is all about making it usable.
All in all, CDM is to traditional data protection what the invention of the washing machine was to hand washing clothes. Some companies can live without it, but I don't know why they'd want to. ®