You want to migrate how much data?
How do you move a 500lb gorilla? Gingerly
Sooner or later, you’re going to have to move some of your data. Perhaps you’re moving to a hybrid cloud model, and need to move some offsite. Maybe it’s already out there in a third-party’s infrastructure, and the contract isn’t working out as you’d planned. Or perchance you’re being smart and replicating between two infrastructure-as-a-service providers for increased resilience. Whatever the reason, you’ve got some planning to do.
Before you even get into the technology stuff, consider where your precious bits are going to end up. “As the EU’s new data regulations illustrate, data location is key if you are handling customer or private information,” warned Mark Ebden, strategic consultant at IT services company Trustmarque.
This may of course be why you’re moving it in the first place – to get it out of a cloud provider that is looking a bit shaky in the wake of Safe Harbour’s collapse.
Should that be the case, you may have other worries. Some cloud service agreements are a bit like the roach motel: your data is enticed to check in, but not to check out. Both Amazon and Azure allow free inbound transfers, but charge you to get your data back out again. If you’re moving your data out of a cloud-based service, check the pricing and factor this in as a business cost.
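As a back-of-envelope illustration of factoring that charge in, the sketch below estimates an outbound transfer bill. The function name, the per-gigabyte rate and the free allowance are all assumptions made up for the example, not any provider's actual pricing, so substitute the figures from your own contract.

```python
def egress_cost(data_tb: float, price_per_gb: float, free_gb: float = 0.0) -> float:
    """Estimate the charge for pulling data_tb terabytes out of a provider.

    price_per_gb and free_gb are placeholders: take the real values
    from your provider's current price list.
    """
    data_gb = data_tb * 1000  # decimal terabytes to gigabytes
    billable_gb = max(data_gb - free_gb, 0.0)  # never bill a negative volume
    return billable_gb * price_per_gb


# Hypothetical figures: 100 TB leaving at an assumed $0.05 per GB.
cost = egress_cost(data_tb=100, price_per_gb=0.05)
print(f"Estimated egress cost: ${cost:,.2f}")
```

Real price lists are usually tiered by volume, so treat a flat rate like this as a first approximation only.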
There are other scenarios for moving data. A company may have decided to move to a hybrid cloud environment, offloading some of its data and applications into a cloud-based provider for cost efficiency, perhaps. Getting the data to its new destination is fraught with technology and business issues.
The number one challenge for most companies considering moving vast tracts of data will be volume. The cost of wide area bandwidth may have decreased, but it hasn’t plummeted anywhere near as quickly as hard drive storage capacity. With that in mind, companies must figure out how to shunt their data around.
WAN optimization tools can help to a certain extent by compressing and deduplicating data during transfer, but there’s still the quality of the network connection to consider.
“This largely depends on the contract the user has in place with the service provider,” warned Ebden. “Some will allow ad hoc or ‘burst’ bandwidth and make changes to cope with demand, while others will demand charges which could not only be more costly but could even incur a delay of 60 or 90 days.”
If you’re dealing with a local regional cloud provider this might be an issue, depending on the volume of data you’re sending. Some of the larger firms have their own solutions. Amazon will slurp up your data using AWS Direct Connect, which it calls a private link between its servers and yours (although those worried about US surveillance would doubtless have their concerns). Microsoft offers ExpressRoute, which will get you to Azure via your colo provider, or via a WAN with an existing service provider.
In some cases, data volumes will simply be too great to warrant a network connection. The nice thing about data and bandwidth is that you can do the maths to work out how long it will take to transfer. Marcus Jewell, EMEA VP at Brocade, recalls one company he worked with that did just that. “It would’ve taken them a year to replicate the information,” he said. Instead, a man with a van may have to do. Copying your data onto tape or hard drive and shipping it physically may be a far faster way to get it to its destination.
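The maths in question is short enough to sketch. The snippet below is a minimal illustration, not anything Jewell's customer used: the function name, the 80 per cent link efficiency and the example figures are assumptions chosen only to show the calculation.

```python
def transfer_days(data_tb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Days needed to push data_tb terabytes over a link_mbps link.

    efficiency is an assumed fraction of the nominal line rate that is
    actually usable after protocol overhead and competing traffic.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    total_bits = data_tb * 1e12 * 8          # decimal terabytes to bits
    usable_bits_per_sec = link_mbps * 1e6 * efficiency
    seconds = total_bits / usable_bits_per_sec
    return seconds / 86400                   # seconds in a day


# Hypothetical example: 100 TB over a 100 Mbps link at 80% efficiency.
print(f"{transfer_days(100, 100):.1f} days")
```

With those example numbers the answer comes out at roughly 116 days, which is the point at which a courier starts to look attractive.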
The hyperscale cloud providers understand this, and have catered for it. Amazon recently introduced Snowball, an appliance designed specifically to move data physically from a customer’s location. It is aimed at customers needing to move 10 TB or more to Amazon Web Services: they order it, connect it to their network and start it with a bash script.
For Azure, Microsoft uses its Azure Import/Export service for physical transfers. Unlike Amazon’s Snowball service, you have to mail your own hard drives in. Google also offers a media mail-in option, Offline Media Import/Export, but relies on third parties to operate it.
Security is key here. Encryption must be a part of the transfer process, whether the data is travelling on the wire or in the truck. One advantage of Amazon’s Snowball service is that the encryption comes built into the box. Otherwise, you’re going to have to do it yourself.
File vs block
Broadly speaking, the data to be transferred divides into file-level data and block-level. The file-based stuff is largely unstructured. It’s the payroll report on the finance director’s network drive, or that PowerPoint presentation that Quentin in marketing hasn’t finished yet. Richard Blanford, founder of IT infrastructure services firm Fordway Solutions, applies the Pareto principle to corporate data. An organization might have 100 TB of the stuff, but only 20 TB will be business critical, he warned.
“If it’s sitting inside a core business application, generally it’s easy to ascertain its business importance, and there’s not normally that much of it,” he said. “It’s the unstructured things – filesystems, emails – those are the sort of things where organizations migrate terabytes because they don’t know what the important stuff is.”
Classifying data is a painful but necessary step here, he said. Use data classification tools to do some initial scans, pulling out key data such as authors, file types, and age. Use keyword searches to narrow down documents on specific topics. Hopefully, you can prioritize a small subset of stuff that you actually need to keep, so that you can minimise your replication load.
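A commercial classification tool will do far more, but a first-pass scan of this kind can be sketched with a few lines of Python. The version below is an assumption-laden illustration: it only looks at file type, size and last-modified time (authorship and keyword searches need format-specific parsing that is out of scope here), and the function name and one-year staleness threshold are invented for the example.

```python
import os
import time
from collections import defaultdict
from typing import Optional


def scan_tree(root: str, stale_after_days: int = 365,
              now: Optional[float] = None) -> dict:
    """Summarise files under root by extension: count, bytes, stale bytes.

    A file counts as stale if it was last modified more than
    stale_after_days ago (an arbitrary threshold for illustration).
    """
    now = time.time() if now is None else now
    cutoff = now - stale_after_days * 86400
    summary = defaultdict(lambda: {"files": 0, "bytes": 0, "stale_bytes": 0})
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # skip files that vanish or cannot be read
            ext = os.path.splitext(name)[1].lower() or "(none)"
            entry = summary[ext]
            entry["files"] += 1
            entry["bytes"] += info.st_size
            if info.st_mtime < cutoff:
                entry["stale_bytes"] += info.st_size
    return dict(summary)


if __name__ == "__main__":
    # Print the extensions occupying the most space first.
    report = scan_tree(".")
    for ext, row in sorted(report.items(), key=lambda kv: -kv[1]["bytes"]):
        print(f"{ext:10} {row['files']:8} files {row['bytes']:14} bytes "
              f"({row['stale_bytes']} stale)")
```

Sorting by size puts the extensions worth investigating at the top; a large stale total suggests data that may not need to make the trip at all.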
This is a good time to audit your data for compliance purposes, pointed out Jewell.
“A lot of companies need to assess their data to ensure that they don’t have things they shouldn’t have. What a lot of companies can’t do, is audit where their data came from,” he said. Personally identifiable information is the obvious risk here. Moving files with sensitive individual information off-premises may compound an already-developing legal risk.
The same compliance risk holds true for those core business applications he’s talking about. These are the ones with structured information, such as database rows that typically reside at the block level, stored in binary format. “You need to look at moving the whole workload. Moving the data is one thing but you need to move the application as well, because those things tend to need to be quite close,” he said.
There will be different classes of tool for handling migration tasks, designed to support specific applications, and targeted at forklifting large amounts of data to a new destination. “Ultimately, the choice of tool depends on the user’s source (Microsoft, Oracle and others usually have their own migration tools, for example), and method of migration,” said Trustmarque’s Ebden.
The hyperscale vendors often have their own migration tools too (Microsoft offers several for porting data from other databases into SQL Server on Azure, for example) but your mileage for all this may vary depending on the technical environment.
One potential spanner in the works that has some experts worried is the very thing that was meant to save the IT administrator so much work: software-defined storage. Virtualization makes it easier to move applications, but it might create problems when moving the data to go with them. In private cloud systems, administrators can forklift a virtual machine or a Docker container to a new location. But these resources access application data which is likely to be abstracted away from the physical storage hardware in a private cloud environment.
“The key is that you can no longer link your system by saying ‘this data serves this application and it’s on these three servers’,” said Jewell. Instead, administrators must now start asking themselves difficult questions.
“Are you clear where that workload resides? Are you sure that you’re moving all of the data with that application?” he mused. If you’ve virtualized a 15-year-old legacy Oracle app relying on various arcane internal data sources, you’d better make sure you’re catching them all. “You have to understand your virtual-to-physical mapping,” he said.
Don’t touch that data - it’s live!
Another potential speed bump is the always-on nature of many structured data applications. In most cases, the data and applications being moved will be processing transactions all of the time, introducing another layer of complexity.
One approach to handling dynamically changing block-level data is to stand up a mirrored copy in the cloud in a master/slave relationship, said Giri Fox, services leader and cloud technologist at Rackspace. The switch-over comes after the initial data has been migrated and once you’re satisfied that the new instance is processing data correctly. In some cases, companies may want to continue operating in a master/slave relationship for redundancy purposes, he said.
In master/slave or multi-master replication, the bulk of the data still has to be moved somehow, during which time more transactions will be running. The worry for some customers with high-frequency transaction loads is that the destination system will never catch up with the transactions on the originating one, argues David Richards, CEO of WANdisco, which sells an ‘active-active’ replication product that keeps both sides in sync without a central co-ordinating system that could fail.
“The difference between us and everyone else’s backup is that theirs is time-based, whereas we’re based on transactions,” he said. This enables the destination side of a data replication to reconstruct incremental transactions after the bulk of the data has been moved.
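Whether a destination can ever catch up is, to a first approximation, simple arithmetic: the backlog only shrinks if the link moves data faster than the source generates changes. The sketch below illustrates that reasoning under the assumption of constant rates. It is not a description of how WANdisco's product works, and the function name and figures are hypothetical.

```python
import math


def catch_up_hours(backlog_gb: float, link_gb_per_hour: float,
                   change_gb_per_hour: float) -> float:
    """Hours until the destination has caught up with the source.

    Assumes the link rate and the rate of new changes are both constant.
    Returns math.inf when changes arrive at least as fast as the link
    can carry them, i.e. the destination never catches up.
    """
    net_rate = link_gb_per_hour - change_gb_per_hour
    if net_rate <= 0:
        return math.inf
    return backlog_gb / net_rate


# Hypothetical example: 20 TB backlog, 36 GB/hour link, 4 GB/hour of changes.
print(catch_up_hours(20_000, 36, 4))   # 625.0 hours
print(catch_up_hours(20_000, 4, 4))    # inf: the backlog never shrinks
```

Moving the bulk of the data by some faster route, such as shipping drives, reduces the backlog the link has to clear, which is why the two techniques are often combined.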
He claims this lets the destination catch up once the bulk of the historical data has been transferred. “When it was plugged back in we would synchronise the changes that had happened that day,” he explained.
No matter how you migrate the system, it’s important to ensure that you have a rollback path in case of problems, said Rackspace’s Fox. In preparing for all of these eventualities, you’re going to need a solid plan, and a lot of time to execute it. “One of the worst things with a datacentre migration is to have a forced march because you’ve got a deadline,” he said. So plan as early as you can.
When developing that plan, ensure that each asset and task has a clear owner, whether that be inside or outside the organization, recommends Ebden. “This is particularly true when considering physical issues like power, the type of connection and the commissioning of equipment, along with virtual concerns such as the preferred data format if moving between storage types,” he warned.
This ensures that all decisions are made with intent, rather than passively accepted by default. When it comes to forklifting vast amounts of corporate data from one place to another, the old carpenter’s adage is especially relevant: measure twice, cut once. ®