Got it taped: The business of tape-based disaster recovery
Taking a risky – or risk-free? – hike up Iron Mountain
The quick and the undead
Not all of Hal’s clients can afford to implement a mirrored SAN in a different physical location, with the comms links and all that the protocol entails – SRDF runs across pretty expensive hardware. If you’re prepared to wait longer than a few milliseconds, then just have your target servers and storage sitting there, ready and waiting. When the time comes, send tapes over via courier, restore them and you should be up and running within as many hours or days as is quoted in the Service Level Agreement (SLA).
A lot of companies will consider this approach as it's certainly cost-effective. After all, DR is just an insurance policy; even though there are risks involved, disaster definitely doesn’t strike every company. Tape might not be the quickest option to get you up and running again, but for a policy you might never claim on, it’s excellent value. Not everyone needs the courtesy car after a crash.
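Whether a tape restore fits inside an SLA comes down to simple arithmetic: courier time, plus set-up, plus the time it takes the drives to stream the data back. The sketch below is illustrative only. The class and function names are made up for this example, and the figures in the usage note are assumptions rather than anything quoted by Hal's company. The 120MB/s figure is the native rate of an LTO-4 drive; real restores are often slower.

```python
from dataclasses import dataclass


@dataclass
class RestorePlan:
    """Hypothetical inputs for a back-of-envelope restore estimate."""
    data_gb: float         # amount of data to restore, in gigabytes
    drives: int            # tape drives restoring in parallel
    drive_mb_per_s: float  # sustained throughput per drive (assumed)
    courier_hours: float   # time for tapes to arrive from the vault (assumed)
    setup_hours: float     # loading tapes and starting jobs (assumed)


def restore_hours(plan: RestorePlan) -> float:
    """Estimate hours from recall request to restored data."""
    total_mb = plan.data_gb * 1000  # decimal units, as tape vendors quote them
    aggregate_mb_per_s = plan.drives * plan.drive_mb_per_s
    transfer_hours = total_mb / aggregate_mb_per_s / 3600
    return plan.courier_hours + plan.setup_hours + transfer_hours


def meets_sla(plan: RestorePlan, sla_hours: float) -> bool:
    """True if the estimated restore time fits within the SLA window."""
    return restore_hours(plan) <= sla_hours
```

For instance, `RestorePlan(data_gb=4000, drives=2, drive_mb_per_s=120, courier_hours=2, setup_hours=1)` works out at a little under eight hours, which would satisfy a 24-hour SLA but not a four-hour one.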
Keeping an eye on things at Iron Mountain
Regarding those SLAs, while you could be back in business in hours with tape, those quick turnaround deals can be scuppered on rare occasions, as Hal explains.
“If the tape is at Iron Mountain and we need an emergency recall, it should be with us in two hours. The only time that that falls down is if the tape has just left our site and is on its way up to Bristol. You can’t just recall the van. That doesn’t happen as other client tapes obviously need to get back to the vault.”
You can’t insist the van driver puts his foot down either, as Iron Mountain has speed limiters on its fleet that won’t permit driving above the UK maximum speed limit of 70mph. The vans are tracked too, with a Green Roads sensor system. It’s the type of tracking system that’s been tested on young motorists as a way of proving they're not boy racers, thus lowering their insurance premiums in the process. Given their precious payloads, Iron Mountain van drivers need to be saints on the road. Lloyd explains how the monitoring is utilised.
“It’s live tracking and sends back information on whether the driver is speeding, sharply turning corners, changing lanes and if the vehicle is accelerating or decelerating quickly. If the safety score is too high we re-educate them. Our in-house driver trainer will go over the safety score, discuss any issues and refocus attention. Overall the system reduces wear and tear, increases MPG and decreases accidents. We don’t want accidents – it impairs our image.”
Not that everyone would know they’d collided with an Iron Mountain van. The folks at High Fibre insist on unmarked vehicles to ship their precious tapes from site to site. It's a continuous cycle of daily and weekly refreshes along with the monthly archives. Iron Mountain manages the physical handling of this routine, providing new tapes along with off-site security for the company data.
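A daily, weekly and monthly cycle like this can be expressed as a small rule that decides which tape set a given night's backup belongs to. The following is a minimal sketch under an assumed policy: the article does not say which days High Fibre uses, so the choice of Friday for weeklies and the last day of the month for archives is purely illustrative, as is the function name.

```python
from datetime import date, timedelta


def tape_set_for(day: date) -> str:
    """Classify a backup date as 'monthly', 'weekly' or 'daily'.

    Assumed policy (not from the article): the last day of the month
    is the monthly archive, Fridays are weekly, everything else is daily.
    """
    # If tomorrow falls in a different month, today is the month's last day.
    if (day + timedelta(days=1)).month != day.month:
        return "monthly"
    if day.weekday() == 4:  # Monday is 0, so 4 is Friday
        return "weekly"
    return "daily"
```

The monthly check comes first so that a Friday falling on the last day of the month is archived rather than treated as an ordinary weekly.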
In general, the client software that Hal encounters runs on Windows Server 2003 or 2008. He describes the data centre set-up:
"On the bigger libraries such as the Quantum Scalar i500 that we use for larger installations, you might have two, six, eight or more LTO-4 fibre tape drives that are hot swappable, each with a 4Gb/s fibre channel connection at the back. They will be connected with a fibre cable into the SAN ‘fabric’ for that particular client. So the storage is in the SAN for that client (it might be dedicated or shared) and those SAN switches, typically referred to as the fabric, are the centrepoint for all the fibre cabling, and this doesn’t change once it’s in place, it normally just works.
The backup server would have HBAs (host bus adapters) with fibre cables going into the SAN switches. The client’s main servers (database and application servers) would similarly have fibre connections to the SAN switches. So they’re all interconnected with fibre which typically runs at 4Gb/s – more modern ones run at 8Gb/s, possibly higher – but we generally have everything at 4Gb/s. So that obviously optimises the flow of data from the SAN storage arrays through the SAN switches to the tape drives.
Big business: Quantum Scalar i500 tape library
For people like me, it’s the shared SAN, but we formally refer to it as the combined array and fabric. The servers connect to it, the tape drives connect to it and it’s managed by a different group: the SAN team. So they create storage areas in the arrays, then present them to individual servers and tape drives. The way that’s done is through a unique number on the fibre channel interface, which is known as the World Wide Name (WWN) – it’s like a MAC address. So every fibre channel adapter, for instance the HBAs in the server, the tape drives, the connections in the SAN switches, they all have unique WWNs.
Typically, there’s nothing more an operator has to do unless a storage area needs to be increased or reduced or, in certain SAN systems, copied to another area in the SAN. An administrator would allocate a particular area and present it and generally that’s their job done. SAN guys are not involved with the tape library.
On a normal backup the libraries are just pulling data off the SAN and putting it onto tape. For a restore they’re writing it back to SAN. These operations go on within the data centre as you’d need a very fat pipe if done remotely. However, we do have replication between data centres down dedicated links where that data is, for example, copied from a production SAN to a DR SAN.
We run our client backups at night because there are fewer users on the system, so you can expect a bandwidth improvement, especially if going on a main public interface, although there might be a separate backup interface. By mid-morning the backup administrators in India identify the tapes that need to come out of the libraries. They’ve been monitoring them and can see if they’ve all finished. They list those tapes and send that info off to our smart hands people. These guys are in the same building as me and they’ll go around all the libraries, take out the specified tapes for secure storage and put fresh ones in.
Quantum Scalar i40 tape library software shows the drives and all slots including IE and cleaning status
Library machines don’t need routine maintenance and, in my experience, the Quantum ones are very reliable compared to others that I’ve dealt with in previous companies. The worst was an HP StorageWorks MSL6060 (a model dating back to 2003). Mechanically it was horrendously unreliable; the robotics were a nightmare. The arm got stuck, the tapes got stuck, we'd be calling for an engineer every week. Working with Quantum libraries is a different world. That said, we've just had a Scalar i40 become faulty. It doesn't get beyond the start-up tests and the LCD reports that initialisation failed.
Occasionally, a drive calls for cleaning. That’s obviously a function of the library software that can be relayed out to the backup software. It could be via a GUI message, an error code in the library or a physical alert such as a little amber light. We would ask the backup admins to look into it. And they might reply, requesting we put a cleaning tape in. With some of the larger libraries you might keep it in there in a specific slot; with smaller ones, typically you don’t."
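Hal's comparison of the World Wide Name to a MAC address, and his account of the SAN team presenting storage areas to individual servers and tape drives, can be illustrated with a short sketch. A Fibre Channel WWN is 64 bits long, usually written as eight colon-separated hex pairs, whereas an Ethernet MAC address is 48 bits. The `Presentation` class and the area names below are hypothetical and exist only for this example. Real arrays and switches do this through their own management tools.

```python
import re

_HEX16 = re.compile(r"[0-9a-f]{16}")


def normalise_wwn(text: str) -> str:
    """Return a 64-bit WWN as eight colon-separated lower-case hex pairs."""
    # Strip the separators people commonly type: colons, dashes, dots, spaces.
    digits = re.sub(r"[:\-.\s]", "", text.strip().lower())
    if digits.startswith("0x"):
        digits = digits[2:]
    if not _HEX16.fullmatch(digits):
        raise ValueError(f"not a 64-bit WWN: {text!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 16, 2))


class Presentation:
    """Hypothetical record of which WWNs may see which storage area."""

    def __init__(self):
        self._allowed = {}  # area name -> set of normalised WWNs

    def present(self, area: str, wwn: str) -> None:
        """Allow the adapter with this WWN to see the given storage area."""
        self._allowed.setdefault(area, set()).add(normalise_wwn(wwn))

    def can_see(self, area: str, wwn: str) -> bool:
        """True if the area has been presented to this WWN."""
        return normalise_wwn(wwn) in self._allowed.get(area, set())
```

With this in place, presenting an area called `"client-a-db"` to a server HBA and to a tape drive lets both see it, while any adapter whose WWN has not been registered is refused. A six-byte MAC address passed to `normalise_wwn` raises `ValueError`, which underlines that the two identifiers are alike in purpose but not interchangeable.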