Original URL: https://www.theregister.com/2013/07/25/data_restoration_tested_after_calgary_flood/

Back up all you like - but can you resuscitate your data after a flood?

Trevor Pott learns a salutary lesson in data restoration

By Trevor Pott and Iain Thomson

Posted in Storage, 25th July 2013 10:02 GMT

When it comes to backups, two sayings are worth keeping in mind: "if your data doesn't exist in at least two places, it doesn't exist" and "a backup whose restore process has not been tested is no backup at all".

There is nothing like a natural disaster affecting one of your live locations to test your procedures.

I have just had to deal with this; let's take a look at how.

Pipe dreams

To have a discussion about backups we need to start with what we are backing up and why.

The client in question has two sites, one in Edmonton and one in Calgary. Each site is serviced by a fibre pipe – theoretically capable of 100Mbps in emergencies but throttled below that to meet our ISP agreement.

If we keep the cumulative usage between both sites below 'X'Mbps (measured at the 95th percentile) we can use all the bandwidth we want. No caps, no per-GB billing. It is a nice, predictable cost that we can easily manage.
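
For the curious, 95th-percentile billing is simple enough to sketch: the carrier samples usage at regular intervals, throws away the top five per cent of samples and bills on the highest one left. A minimal illustration in Python – the five-minute sampling interval and the numbers are my assumptions, not our ISP's actual maths:

```python
# Minimal sketch of 95th-percentile billing. The five-minute sampling
# interval and the sample values below are assumptions for illustration.

def ninety_fifth_percentile(samples_mbps):
    """Return the 95th-percentile value of a list of bandwidth samples (Mbps)."""
    ordered = sorted(samples_mbps)
    cutoff = int(len(ordered) * 0.95) - 1   # index of the sample that gets billed
    return ordered[max(cutoff, 0)]

# A month of hypothetical five-minute samples: mostly quiet, a few big bursts.
samples = [12.0] * 8000 + [95.0] * 200
print(ninety_fifth_percentile(samples))     # 12.0 -- the bursts fall above the cut-off
```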

More critically, in case of "oh no!", uncapping those pipes so that each can use the full 100Mbps takes a single command.

Each site receives large quantities of information from customers for use at that specific location. This information is made highly available to deal with hardware failure but is not replicated offsite.

We spec our bandwidth only slightly above what we need to handle inbound data; we could never afford to push that data offsite cloud-style, even when the destination is one of our own datacentres.

Since we have all this infrastructure in place to meet local needs it seems silly not to host our public-facing websites and IT services on our own infrastructure. We have private clouds at each location, gobs of storage, UPSs and a fat pipe. It isn't exactly Amazon, but it should be workable.

Replicate, replicate

None of the databases for our public websites can be set up for live replication because that would require rewriting code to accommodate it. For various reasons that won’t happen any time soon.

So backups come down to cron jobs on each MySQL server that create regular database dumps, then zip, encrypt and fire them off to our file server.
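
The job itself is nothing fancy. Something along these lines captures the idea – a sketch only, where the database name, paths and the choice of gpg for encryption are stand-ins rather than what the real jobs use:

```python
#!/usr/bin/env python3
"""Nightly dump-compress-encrypt-ship job, run from cron.

A sketch of the idea only: the database name, destination share, passphrase
location and the use of gpg are illustrative stand-ins, not the actual jobs.
"""
import datetime
import subprocess

DATABASES = ["website_db"]                    # hypothetical database name
DEST = "/mnt/fileserver/backups"              # hypothetical DFSR-backed share
PASSPHRASE_FILE = "/etc/backup/passphrase"    # hypothetical secret location


def backup(db):
    stamp = datetime.date.today().isoformat()
    dump = f"/tmp/{db}-{stamp}.sql.gz"

    # mysqldump piped through gzip; credentials are assumed to come from ~/.my.cnf.
    with open(dump, "wb") as out:
        mysqldump = subprocess.Popen(["mysqldump", "--single-transaction", db],
                                     stdout=subprocess.PIPE)
        subprocess.run(["gzip", "-c"], stdin=mysqldump.stdout, stdout=out, check=True)
        if mysqldump.wait() != 0:
            raise RuntimeError(f"mysqldump failed for {db}")

    # Encrypt before the file ever lands on the replicated share.
    # --pinentry-mode loopback is needed for unattended use with GnuPG 2.1+.
    subprocess.run(["gpg", "--batch", "--pinentry-mode", "loopback",
                    "--symmetric", "--passphrase-file", PASSPHRASE_FILE,
                    "--output", f"{DEST}/{db}-{stamp}.sql.gz.gpg", dump], check=True)


if __name__ == "__main__":
    for database in DATABASES:
        backup(database)
```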

At the same time the codebase for each web server undergoes a similar backup. The file servers in question are Windows systems running distributed file system replication (DFSR), which does a marvellous job of replicating the backups.

Each site has two identical file servers in a cluster and both have a copy of the files. The files are then fired across the WAN to the other site, where they live on that site's pair of clustered file servers as well. At this point, I'd say we are pretty well immune to hardware failure.

A backup server at the head office runs a truly archaic version of Retrospect that creates versioned backups to protect against Oopsie Mcfumblefingers, malware or other such issues that might delete the backups in the DFSR share. So anything that is placed in the backups directory on either site is automatically replicated to two systems per site and versioned.

Potentially personally identifiable information is encrypted – both in the database and in the rarballs (files compressed and packaged using rar) – and none of it leaves corporate control.

So far, so good; better solutions certainly exist, but with no budget I think it gets the job done.

Frying tonight

For many of our systems we don't just test our backups – restoring from those backups on a regular basis is an automated process. Like any good sysadmins we maintain a test lab, re-populated each morning with last night's data.

Almost all critical systems have this sort of automated restore sequence set up so we rarely put much thought into what it might take to do a restore.
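
The test itself doesn't need to be clever. The whole trick is that every morning, without a human in the loop, last night's dump gets decrypted, loaded into a scratch server and queried. A sketch of that idea – the server, table and file names are assumptions, not the actual lab tooling:

```python
#!/usr/bin/env python3
"""Morning job: restore last night's dump into the test lab and sanity-check it.

Server names, paths and the row-count check are assumptions for illustration.
"""
import glob
import subprocess

TEST_DB = "website_db_testlab"                # hypothetical scratch database
BACKUP_GLOB = "/mnt/fileserver/backups/website_db-*.sql.gz.gpg"
PASSPHRASE_FILE = "/etc/backup/passphrase"


def latest_backup():
    # Date-stamped names sort lexicographically, so the last one is the newest.
    return sorted(glob.glob(BACKUP_GLOB))[-1]


def restore_and_check():
    dump = latest_backup()
    # Decrypt and decompress straight into the mysql client on the lab box.
    pipeline = (f"gpg --batch --pinentry-mode loopback --decrypt "
                f"--passphrase-file {PASSPHRASE_FILE} {dump} | gunzip | mysql {TEST_DB}")
    subprocess.run(["bash", "-c", pipeline], check=True)

    # If the restore silently produced an empty database, fail loudly.
    result = subprocess.run(
        ["mysql", "-N", "-e", f"SELECT COUNT(*) FROM {TEST_DB}.users"],
        capture_output=True, text=True, check=True)
    if int(result.stdout.strip()) == 0:
        raise SystemExit("restore check failed: users table is empty")


if __name__ == "__main__":
    restore_and_check()
```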

You will note that critical systems had automated restore procedures. The websites have not traditionally been viewed as critical.

The reasoning was that if we had undergone a severe outage or disaster and these systems were down, we would have bigger fish to fry. We have the database and the website files, so a manual restore should take a matter of minutes.

While that is all true, the hole in the plan is that you have to have something to restore to. If you don't have a copy of the virtual machines where those files and databases live, you have to build a new one from scratch.

Trevor – brilliant sort that he is – forgot to put the virtual machine templates in the backups directory. City floods, servers go down and here I sit with a bunch of rarballs and nowhere to inject them. The kicker is that we even had warning that the building had to be evacuated. As soon as I heard of the pending evacuation I knew we were going to get bitten by that lack of a virtual machine.

Before we were forced to abandon the Calgary location, I had initiated a download of the virtual machine from the virtual server to the backup directory in the hope that the power would stay on long enough to get the virtual machine replicated across the WAN.

I uncapped the fibre links and let them replicate at a full 100Mbps, but no dice. There just wasn't enough time.
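
To put some numbers on it: even uncapped, a virtual machine image is a lot of bits to push across a 100Mbps link, and the clock was being run by a flood. The image size below is an assumption purely for illustration:

```python
# Back-of-the-envelope transfer time over an uncapped 100Mbps link.
# The 40GB image size is an assumption for illustration only.
image_gb = 40
link_mbps = 100

transfer_seconds = image_gb * 8 * 1000 / link_mbps    # GB -> megabits, divided by the link rate
print(f"{transfer_seconds / 60:.0f} minutes")          # roughly 53 minutes, before any overhead
```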

Unpredictable behaviour

It took only minutes to get a virtual machine up, the database injected and the files uploaded. Bleary-eyed from lack of sleep and out of coffee, it took me at least an hour to figure out that the reason the website wouldn't load was that I had forgotten to flush privileges on the MySQL server after restoring the MySQL database (and with it all the user and host authentication).
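
For anyone restoring a dump that includes the mysql grant tables, the missing step looks something like this – the dump path is hypothetical, but the FLUSH PRIVILEGES at the end is the part that cost me an hour:

```python
#!/usr/bin/env python3
"""Restore a dump that includes the mysql grant tables, then reload them.

The dump path is hypothetical; the point is the FLUSH PRIVILEGES at the end.
"""
import subprocess

DUMP = "/mnt/fileserver/backups/website_db-restore.sql"   # hypothetical path

# Feed the dump straight into the mysql client.
with open(DUMP, "rb") as dump:
    subprocess.run(["mysql"], stdin=dump, check=True)

# Restoring the grant tables rewrites them on disk, but mysqld keeps serving
# its old in-memory copies until told to reload -- skip this and logins fail.
subprocess.run(["mysql", "-e", "FLUSH PRIVILEGES;"], check=True)
```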

So now the website loads and … it doesn't quite load. About 75 per cent of the PHP is not being parsed. After combing through config files I realise that the version of PHP that ships with the latest CentOS disables short <? $somephp ?> tags by default. Everything in those applications is written with those short tags, so I had to flip short_open_tag back on in php.ini to get it working again.

Problem solved. The websites load properly and we are up and running again. Elapsed time: a little over two hours. Ouch!

That was bad enough, but we were not out of the woods yet. It turned out that email messages submitted by forms on the website were not getting through. We chased this around for a while until we came across a change in Sendmail behaviour that altered how it processed the hostname.

If the hostname wasn't quite right, Sendmail did funny things when set up to forward mail to an internal processing server before sending it on to the net: it would deliver to some addresses but not to others. Non-deterministic behaviour in computers gives me all of the sads.

Needless to say, with that behaviour pattern we didn't catch it until a day later. That makes our total recovery time from outage to 100-per-cent functional a little more than a day.

The best-laid plans

A two-hour outage to primary operations is bad, but a day to get all the bugs worked out is awful. I have lost both the primary server and the primary operating system disk on the secondary server, walked a semi-retired technophobic lady through prying the panel off and rewiring the system over a cell phone, in the dark, and had the systems back up in less time than that.

I have spent years crafting this absolutely bulletproof backup system (the ability to shoot any system in the data centre and recover from it is a requirement for my designs) but was brought low by a failure to properly plan the details of recovery.

Companies with budgets shouldn't have to worry about building the infrastructure of backups any more. We have cloud backup and recovery software providers such as Asigra. (Canada has a top-tier cloud backup software provider. Yes, I am going to milk that link for a while.)

Appliance vendors such as Unitrends and even full-bore paper + bits data lifecycle companies such as Iron Mountain provide backup. Making your data exist in more than one place should be nearly effort-free at this point.

That other part, though, is testing your restoration procedures. We should probably all be paying a little more attention to that. ®