Cloud mega-uploads aren't easy
Google, Microsoft can't explain how to get big data into the cloud, despite rivals' import services
Google and Microsoft don't offer formal data ingestion services to help users get lots of data into the cloud, and neither seems set to do so anytime soon. Quite how would-be users take advantage of the hundreds of terabytes both offer in the cloud is therefore a bit of a mystery.
Data ingestion services see cloud providers offer customers the chance to send them hard disks for rapid upload into the cloud. Amazon Web Services' import/export service was among the first such services and offers the chance to ingest up to 16TB of data, provided the device is no more than 14 inches high by 19 inches wide by 36 inches deep (8U in a standard 19-inch rack) and weighs less than 50 pounds.
Rackspace offers a similar service, dubbed Cloud Files Bulk Import. Optus, the Australian arm of telecoms giant Singtel, will happily offer a similar service. Australian cloud Ninefold does likewise, branding it "Sneakernet".
Some other cloud providers offer such a service, even if it is not productised or advertised. The Register spoke to one cloudy migrant who (after requesting anonymity) told us they borrowed a desktop network attached storage (NAS) device from their new cloud provider, bought another, uploaded data to the devices and then despatched a staffer on a flight to the cloud facility. The NASes were carry-on luggage and the travelling staffer cradled them on their lap during the flight.
It was worth going to those lengths because, as AWS points out in the spiel for its import/export service, doing so “is often much faster than transferring that data via the Internet.”
To understand why, consider the fact that headline speeds advertised on broadband connections aren't always achieved in the real world. Optus, for example, told us that while its fastest broadband connection hums along at 3-5 Gbps, the standard service level agreement “guarantees a speed of 300 Mbps, above which we would conduct fibre checks to ensure additional capacity can be reserved for the customer.” At that speed each terabyte would take about eight hours to upload, and that's with an optimistic assumption of 10% overhead and general network messiness.
It's hard to imagine how that kind of speed will be of any use for cloud services which offer petabyte-scale cloud storage, such as Azure's (or whatever it is called this week) pricing tier for amounts of data “Greater than 5 PB.” Google's BigQuery also promises to support “analysis of datasets up to hundreds of terabytes.”
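For readers who want to check the sums, here is a minimal sketch of the arithmetic behind the eight-hour figure. It assumes decimal units (1TB = 10^12 bytes) and models the 10% overhead as 10% more bits on the wire; both are our assumptions rather than anything Optus specified, and the function name is purely illustrative.

```python
TB = 10**12    # bytes, decimal terabyte (assumption)
PB = 10**15    # bytes, decimal petabyte (assumption)
MBPS = 10**6   # bits per second


def upload_hours(size_bytes, link_bps, overhead=0.10):
    """Hours needed to push size_bytes over a link running at link_bps.

    overhead is the assumed fraction of extra traffic (protocol headers,
    retransmits and general network messiness) added to the payload.
    """
    bits_on_wire = size_bytes * 8 * (1 + overhead)
    return bits_on_wire / link_bps / 3600


# The 300 Mbps guaranteed speed quoted above.
one_tb_hours = upload_hours(1 * TB, 300 * MBPS)
five_pb_days = upload_hours(5 * PB, 300 * MBPS) / 24

print(f"1TB at 300 Mbps: {one_tb_hours:.1f} hours")
print(f"5PB at 300 Mbps: {five_pb_days:.0f} days")
```

Under those assumptions a single terabyte comes out at a little over eight hours, while filling a 5PB tier over the same link would take well over four years of continuous uploading.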
Both Google and Microsoft, however, offered no details when El Reg prodded them for an explanation of just how customers can get that much data into their clouds. That's despite Microsoft telling your correspondent, in a past professional life, that it was “evaluating” such a service back in 2010.
If you think this all sounds a bit theoretical, the lack of ingestion services from the Chocolate Factory is already leading to some bizarre work-arounds.
Craig Deveson, a serial cloud entrepreneur who currently serves as CEO and Co-Founder of WordPress backup plugin vendor cloudsafe365, says Google's lack of data ingestion services became “a genuine issue” when he worked on a Gmail migration for a large Australian software company. During that project he found the best way to get a substantial quantity of old email data into Google's cloud was first to send disks to Singapore for upload into Amazon's S3 cloud storage service. Once in Amazon's cloud “we had to run a program to ingest it into Google's back end.”
Similar tricks are needed to pump lots of data into software-as-a-service providers' clouds.
Salesforce.com, for example, advised us that bulk uploads are made possible by a Bulk API which happily puts SOAP and REST to work to suck up batches of 10,000 records at a time. “Even while data is still being sent to the server, the Force.com platform submits the batches for processing,” the company said.
Pressed on whether disks are accepted, the company responded that “All common database products provide a capability to extract to a common file format like .csv.”
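To give a feel for what that means in practice, the sketch below carves a .csv export into batches of 10,000 records, the batch size Salesforce quoted. It is not the Bulk API itself and makes no calls to Salesforce; the function name `csv_batches` and the sample data are hypothetical, and a real upload would hand each batch to whatever client the customer uses.

```python
import csv
import io
from itertools import islice

BATCH_SIZE = 10_000  # records per batch, as quoted by Salesforce


def csv_batches(lines, batch_size=BATCH_SIZE):
    """Yield (header, rows) pairs holding at most batch_size rows each.

    lines is any iterable of CSV text lines, such as an open file.
    The header row is repeated with every batch so each one stands alone.
    """
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to yield
    while True:
        rows = list(islice(reader, batch_size))
        if not rows:
            break
        yield header, rows


# Made-up export of 25,000 records, held in memory for illustration.
sample = io.StringIO(
    "id,name\n" + "".join(f"{i},user{i}\n" for i in range(25_000))
)
sizes = [len(rows) for _, rows in csv_batches(sample)]
print(sizes)  # [10000, 10000, 5000]
```

Each batch still has to cross the same network link, so splitting the file changes how the upload is organised rather than how long it takes.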
Whether anyone can afford to wait for that .csv, or other larger files, to arrive is another matter. ®
Re: Teething problems, or something worse?
Data out is certainly more interesting than data in. Just what incentive do the cloud vendors have to help you remove the data after you've threatened to move elsewhere? I wouldn't be surprised if most or all of them offer import "somehow" if you kick up a fuss but not export at all.
Do you really want to extend your contract by another month/year so that you can repeat the "guy on a plane with a NAS" trip in the hope of somehow getting all your business data back? Or spend quite literally weeks redownloading it all before you can move off to another provider? Or have to post them a huge drive array and wait for them to copy all your data out at their convenience and then pay to send it back to you?
Cloud is one of those ideas that STILL doesn't know what it's supposed to be used for. There are lots of use cases, of course, but none for which cloud is the "optimal" solution.
If you are storing PETABYTES of business critical data and requiring cloud-level redundancy and availability, are you seriously telling me that you COULDN'T buy servers yourself around the world and do it cheaper with your existing talent?
Re: Teething problems, or something worse?
And immediately the "cloud" loses its meaning.
If you have on-site and off-site copies the sensible way round is always going to be the off-site as the backup.
And then the "cloud" is shown up for exactly what it is - a marketing buzzword for a reinvented wheel.
Teething problems, or something worse?
The premise of "cloud" is that it'll be wonderful once we're all using it. It's the getting started that's lacking. This does seem to be a bit of an oversight on the pushy vendors' parts. It also spells doom for the idea of bailing out later on, i.e. getting all that data back out.
So you decided to "go cloud". Did you even think about the getting data in, much less the getting data back out issues? And I don't mean on just the technical level; there'll be an entrepreneur for that.