Original URL: http://www.theregister.co.uk/2013/09/18/tape_and_the_business_of_disaster_recovery/

Got it taped: The business of tape-based disaster recovery

Taking a risky - or risk free? - hike up Iron Mountain

By Bob Dormon

Posted in Storage, 18th September 2013 09:05 GMT

Feature For many SMEs, tape disappeared from their landscape as a data storage choice ten or more years ago. Domestically, it exists, if at all, as a legacy item with perhaps a car stereo chewing its way through a selection of fondly regarded C-90s. Still, this lack of public visibility by no means indicates that tape has come to the end of its spool.

Hard drive prices have steadily fallen while their capacities escalate unabated, and yet tape storage continues to play a pivotal role in business information management. For the big data boom, it proves cost-effective, energy efficient and easier to handle for remote back-ups and archiving. To find out more about the reliance IT places on this medium and how it’s deployed, I met with two IT professionals who share the same job title, but have very different roles.

Iron Mountain's facility in Bristol

Iron Mountain, Bristol: daily proving its mettle as a secure storage facility

It didn't take long to discover that data centre companies aren't too keen on exposure, so it's lucky that I have a friend of old who is a data centre manager who explained the various strategies deployed. He also took me along to Iron Mountain, the data sanctuary his company uses where we met with a data centre manager of a different sort, Gwyn Lloyd, who oversees the storage and distribution of client media and, in a separate facility, handles storage and disposal of paper documents too.

The name Iron Mountain rather suggests the musical meeting of minds from two late 1960s hard rock bands, but even though this company provides facilities for tape storage, it’s very unlikely you’ll find much in the way of recorded music on most of them.

Yet, just like record companies regard master tapes as crown jewels from years of labouring in the studio, the businesses that rely on Iron Mountain’s services have similar feelings about the precious data stored on tape that is kept in this safe and secure environment.

If you’re thinking this could get boring, then think again. This is a serious business and one that companies integrate into their disaster recovery (DR) strategies for one very simple reason: survival. Think about it: lose your data and you’ve lost everything. That’s why Iron Mountain and other tape storage facilities exist: they are an insurance policy against acts of God, theft (oh, where’s my server gone?) and the occasional disgruntled employee who might try their hand at sabotage. In the trade, the upshot of this risk management process is referred to as Business Continuity Planning (BCP).

ENISA flow chart of Business Continuity Planning


Depending on your business size and scale, this can involve a separate facility that is effectively a clone of the disaster stricken site, that’s kept on ice for when the nightmares begin. All you have to do to get your business back on its feet is load it up with tapes. And as those tapes were kept off site, in a place like Iron Mountain, you’ll be back in business in no time.

So, before we begin our tour of Iron Mountain, let's take a moment for my IT buddy – I’ll name him Hal – to explain more about the role tape plays and its value in a multi-client data environment. The company he works for, let’s call it High Fibre, handles some big players in retail and does have a cloned facility, performing regular DR tests. A real test is the only way to find out if the right procedure is in place and according to Gwyn Lloyd, some Iron Mountain clients undergo DR tests as often as once a quarter.

At High Fibre the DR tests occur annually. Hal explains that this is a service the company offers to manage and perform on behalf of its clients: “It would be quite a big event with a project plan and a project manager would be assigned to it. The client’s users would be brought in to test out the DR systems and they’d have a test plan that they’d run through which would typically happen over a weekend.”

If only it was as simple as stuffing in a few tapes and letting the tech churn into life, but there’s quite a bit of prep behind this and it depends on quite a number of factors; the most significant one being how much money the client has.

You’d think deploying a fully replicated physical data centre backup system would be enough to satisfy the terminally nervous. As a solution, it’s undoubtedly costly but there’s always something more you can do to make the DR site more effective.

EMC Symmetrix Remote Data Facility (SRDF)


“In a DR situation it’s all down to the latency between the production and DR data. So, if your production site went down and you had to move to your DR site, how old is that data? Obviously if you’re shipping tapes, it could be very old; it could be weeks old. But if you’ve got a dedicated link, transferring data from one SAN to another, the latency could be as low as a few tens of milliseconds. That would be typically referred to as a synchronous link.

There’s an EMC protocol called SRDF (Symmetrix Remote Data Facility). It requires a dedicated high speed link, but the data on that target (EMC) SAN is very, very up-to-date. It’s a synchronous mechanism, so it’s handshaking as the data goes over in packets - the target end is saying: Yup, got that, next one, got that, next...”
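The handshaking Hal describes can be sketched in a few lines. This is only an illustration of the general idea of a synchronous link, where a write counts as done only after the target end has acknowledged its copy; it is not SRDF's actual protocol, and the `Array` class and `synchronous_write` function are hypothetical names invented for the example.

```python
class Array:
    """Hypothetical stand-in for a storage array: just a dict of blocks."""

    def __init__(self):
        self.blocks = {}

    def write(self, block_id, data):
        self.blocks[block_id] = data
        return True  # acknowledgement: "Yup, got that"


def synchronous_write(production, target, block_id, data):
    """Treat the write as complete only once the target has acknowledged it."""
    production.write(block_id, data)
    acked = target.write(block_id, data)
    if not acked:
        raise IOError(f"target did not acknowledge block {block_id}")
    return acked


production, dr_site = Array(), Array()
for i, payload in enumerate([b"orders", b"stock", b"invoices"]):
    synchronous_write(production, dr_site, i, payload)

# Because every write waited for the target, the two copies stay in step.
in_step = production.blocks == dr_site.blocks
print(in_step)
```

The price of keeping the copies in step is that every production write waits on the round trip to the DR site, which is why the link has to be fast and dedicated.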

The quick and the undead

Not all of Hal’s clients can afford to implement a mirrored SAN in a different physical location with the comms links, and all that that protocol entails – SRDF runs across pretty expensive hardware. If you’re prepared to wait longer than a few milliseconds, then just have your target servers and storage sitting there ready, waiting. When the time comes, send tapes over via courier, restore them and you should be up and running within as many hours or days as are quoted in the Service Level Agreement (SLA).

A lot of companies will consider this approach as it's certainly cost-effective. After all, DR is just an insurance policy; even though there are risks involved, disaster definitely doesn’t strike every company. Tape might not be the quickest option to get you up and running again, but for a policy you might never claim on, it’s excellent value. Not everyone needs the courtesy car after a crash.

Iron Mountain security camera

Keeping an eye on things at Iron Mountain

Regarding those SLAs, while you could be back in business in hours with tape, those quick turnaround deals can be scuppered on rare occasions, as Hal explains.

“If the tape is at Iron Mountain and we need an emergency recall, it should be with us in two hours. The only time that that falls down is if the tape has just left our site and is on its way up to Bristol. You can’t just recall the van. That doesn’t happen as other client tapes obviously need to get back to the vault.”

You can’t insist the van driver puts his foot down either, as Iron Mountain has speed limiters on its fleet that won’t permit driving above the UK maximum speed limit of 70mph. The vans are tracked too, with a Green Roads sensor system. It’s the type of tracking system that’s been tested on young motorists as a way of proving they're not boy racers, thus lowering their insurance premiums in the process. Given their precious payloads, Iron Mountain van drivers need to be saints on the road. Lloyd explains how the monitoring is utilised.

“It’s live tracking and sends back information on whether the driver is speeding, sharply turning corners, changing lanes and if the vehicle is accelerating or decelerating quickly. If the safety score is too high we re-educate them. Our in-house driver trainer will go over the safety score, discuss any issues and refocus attention. Overall the system reduces wear and tear, increases MPG and decreases accidents. We don’t want accidents – it impairs our image.”

Not that everyone would know they’d collided with an Iron Mountain van. The folks at High Fibre insist on unmarked vehicles to ship their precious tapes from site to site. It's a continuous cycle of daily and weekly refreshes along with the monthly archives. Iron Mountain manages the physical handling of this routine, providing new tapes along with off-site security for the company data.

Fabric design

In general, Windows Server 2003 or 2008 deals with the client software side of things that Hal encounters. He describes the data centre set-up:

"On the bigger libraries such as the Quantum Scalar i500 that we use for larger installations, you might have two, six, eight or more LTO-4 fibre tape drives that are hot swappable, each with a 4Gb/s fibre channel connection at the back. They will be connected with a fibre cable into the SAN ‘fabric’ for that particular client. So the storage is in the SAN for that client (it might be dedicated or shared) and those SAN switches, typically referred to as the fabric, are the centrepoint for all the fibre cabling, and this doesn’t change once it’s in place, it normally just works.

The backup server would have HBAs (host bus adapters) with fibre cables going into the SAN switches. The client’s main servers (database and application servers) would similarly have fibre connections to the SAN switches. So they’re all interconnected with fibre which typically runs at 4Gb/s – more modern ones run at 8Gb/s, possibly higher – but we generally have everything at 4Gb/s. So that obviously optimises the flow of data from the SAN storage arrays through the SAN switches to the tape drives.

Quantum Scalar i500 tape library

Big business: Quantum Scalar i500 tape library

For people like me, it’s the shared SAN, but we formally refer to it as the combined array and fabric. The servers connect to it, the tape drives connect to it and it’s managed by a different group: the SAN team. So they create storage areas in the arrays, then present them to individual servers and tape drives. The way that’s done is through a unique number on the fibre channel interface, which is known as the World Wide Name (WWN) – it’s like a MAC address. So every fibre channel adapter, for instance the HBAs in the server, the tape drives, the connections in the SAN switches, they all have unique WWNs.

Typically, there’s nothing more an operator has to do unless a storage area needs to be increased or reduced or, in certain SAN systems, copied to another area in the SAN. An administrator would allocate a particular area and present it and generally that’s their job done. SAN guys are not involved with the tape library.
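To make the idea of presenting storage by WWN concrete, here is a minimal sketch. It assumes the common 64-bit form of a WWN written as eight hex bytes; the addresses, the area names and the `presented` table are all made up for illustration and do not reflect any particular SAN vendor's tooling.

```python
import re


def normalise_wwn(text):
    """Return a WWN as lower-case, colon-separated hex, or raise ValueError."""
    digits = re.sub(r"[:\-\s]", "", text).lower()
    if len(digits) != 16 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError(f"not a 64-bit WWN: {text!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 16, 2))


# Hypothetical presentation table: storage area -> WWNs allowed to see it.
presented = {
    "client_a_lun_01": {normalise_wwn("10:00:00:00:C9:12:34:56")},
    "client_a_tape_drive_1": {normalise_wwn("500104f000aabbcc")},
}


def can_see(area, wwn):
    """True if the adapter with this WWN has been presented the given area."""
    return normalise_wwn(wwn) in presented.get(area, set())


print(can_see("client_a_lun_01", "10-00-00-00-c9-12-34-56"))
```

Normalising the address first means the same adapter is recognised however an operator happens to type it.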

On a normal backup the libraries are just pulling data off the SAN and putting it onto tape. For a restore they’re writing it back to SAN. These operations go on within the data centre as you’d need a very fat pipe if done remotely. However, we do have replication between data centres down dedicated links where that data is, for example, copied from a production SAN to a DR SAN.

We run our client backups at night because there are fewer users on the system, so you can expect a bandwidth improvement, especially if going on a main public interface, although there might be a separate backup interface. By mid-morning the backup administrators in India identify the tapes that need to come out of the libraries. They’ve been monitoring them and can see if they’ve all finished. They list those tapes and send that info off to our smart hands people. These guys are in the same building as me and they’ll go around all the libraries, take out the specified tapes for secure storage and put fresh ones in.

Quantum Scalar i40 tape library software

Quantum Scalar i40 tape library software shows the drives and all slots including IE and cleaning status

Library machines don’t need routine maintenance and, in my experience, the Quantum ones are very reliable compared to others that I’ve dealt with in previous companies. The worst was an HP StorageWorks MSL6060 (a model dating back to 2003), mechanically it was horrendously unreliable, the robotics were a nightmare. The arm got stuck, the tapes got stuck, we'd be calling for an engineer every week. Working with Quantum libraries is a different world. That said, we've just had a Scalar i40 become faulty. It doesn't get beyond the start-up tests and the LCD reports that initialisation failed.

Occasionally, a drive calls for cleaning. That’s obviously a function of the library software that can be relayed out to the backup software. It could be via a GUI message, an error code in the library or a physical alert such as a little amber light. We would ask the backup admins to look into it. And they might reply, requesting we put a cleaning tape in. In some of the larger libraries you might keep it in there in a specific slot; with smaller ones, typically you don’t."

Rack space

You won’t find any heavy duty servers or even LTO tape machines at Iron Mountain though, just like you’re unlikely to find a printing press at a public library.

When you walk inside the Iron Mountain secure storage area at the Bristol premises, you’re immediately struck by the controlled environmental conditions. The temperature is set to 20°C (±3°C) and the humidity hangs around 50 per cent (±5%). It’s normal enough and not unpleasant, you just notice it as being a different, albeit an unchanging atmosphere.

Mobile shelving by Bruynzeel

Mobile shelving similar to that found in the secure vault at Iron Mountain. Source: Bruynzeel Storage Systems

Each area is dominated by numerous rows of mobile shelving – these wide metal racks can house up to 400 tapes apiece – and, combined, up to 250,000 items can be stored here. At the end of every shelf is a smoothly geared handle that, when turned, effortlessly moves the rack along the floor. Space can be conserved this way as shelves holding tapes less frequently accessed can be bunched together, to be rolled apart again when needed.

To make it easier to maintain the environmental conditions, the upper floor is merely a grating, so everyone wears trousers there, including the men. If a fire were to break out, this flooring mesh also serves the purpose of allowing the Inergen gas to circulate rapidly to extinguish it. Overhead, stripes of bright red Inergen gas piping cover the data store and are connected to 50 x 80-litre gas bottles racked up outside in the packing warehouse.

Inergen gas bottles

Inergen gas bottle array

While there are only a handful of people working in the store, they’re busy collecting tapes to put in cages for dispatch or unboxing client collections for storage on the shelves. At every stage, Motorola Mobile Computer scanners are used to log the transit of each tape with Odyssey software (now Symantec Mobile Management) keeping track of these movements. As Lloyd declares, “Nothing moves that we don’t scan.”

Each client tape has its own unique tri-optic bar code and Iron Mountain includes its own system labelling too for client identification. While certain areas of shelving remain reserved for specific companies, only those in the know would recognise their tapes from just a glance. Logging plays such a critical part in this business that not only is there a very clear procedure for tape retrieval, dispatch, collection and storage for the Iron Mountain staff to follow, there are also very strict authorisation protocols to be adhered to as well. Any requests from a client will likewise be logged and identity/authority checks performed before any media is handled.

It’s not just the client staff that get the once over either, as Lloyd explains. “All staff are vetted to the baseline personnel security standard (BPSS), enhanced CRB (Criminal Records Bureau), employment checks, other background info and every three years they are revetted.” It sounds like overkill for just driving a van around and stacking shelves with tapes, but Iron Mountain is a List X site. This is the term used to describe commercial premises for government suppliers and contractors that meet security standards to handle sensitive material with staff vetted to a specific level.

Motorola Omnii XT15 logistics handheld


Even the deliveries and collections have certain protocols – they have to be by road and the vehicles are double manned. There are exceptions, says Lloyd: “Sometimes there’s a panic to retrieve records for a court case and you’ll have someone carry the media to deliver on the next available flight.”

In most cases this is sensitive material that’s all sealed up. As you can guess, that goes down well with airport security.

“US Customs can have ‘a bit of an opinion’ about things, which can slow down entry but we know the correct procedures, so there's no reason why we wouldn't get through.”

Reel world scenarios

In the UK, Iron Mountain has a fleet of 350 vehicles, including sixty-nine 7.5 tonne lorries. The Bristol premises has 18 vehicles at its disposal and although some are reserved for paper transport, the facility still gets through 1500 tape movements a day. Tapes are typically transported in foam-lined containers – Peli cases being the preferred option for transit as they are waterproof and robust. The handheld barcode reader is primed to know what to expect when these boxes are scanned as they are loaded on to the van. If something’s not right, it’ll flag it up.
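The check the handheld performs at the van can be thought of as comparing what was scanned against what was expected. The sketch below is an assumption about the general logic, not Iron Mountain's software; the `reconcile` function and the tape labels are invented for the example.

```python
def reconcile(expected, scanned):
    """Compare the labels a van should carry with the labels actually scanned."""
    expected, scanned = set(expected), set(scanned)
    return {
        "missing": sorted(expected - scanned),      # expected but never scanned
        "unexpected": sorted(scanned - expected),   # scanned but not on the list
    }


manifest = ["HF0001L4", "HF0002L4", "HF0003L4"]   # hypothetical labels
loaded = ["HF0001L4", "HF0003L4", "ZZ9999L4"]

result = reconcile(manifest, loaded)
print(result)
```

Either list being non-empty is the cue to flag the load before the van leaves.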

You’d think this was a perfect system, but when wandering around the vault there was one small space on a bench that caught my eye: the Quarantine Area. If a mystery tape appears from nowhere, that’s where it will be kept, as the process of identification begins. Lloyd was quick to point out that such an occurrence is extremely rare and, if proof were needed, the Quarantine Area was, indeed, untenanted.

Handling unknown tapes is one thing, but what about unwanted media? If a company goes bust, what happens to its backups? This tape archive is taking up space and nobody’s paying for it. According to Lloyd, such a situation is handled by the legal department and nothing gets destroyed without the nod from the lawyers first. Of course, there are times when tapes are disposed of at the client’s request. That process isn’t simply a matter of degaussing – which would happen if the tapes were removed from site prior to destruction – but amounts to shredding the tape into tiny pieces that, even if you could reconstruct, certainly wouldn’t be in the right order.

Capacity planning

The other side of the coin works in Iron Mountain’s favour: people don’t like to delete anything, which surely means the shelves in the vault are groaning with more data than ever before? Yet Lloyd has witnessed some interesting changes to how the vault operates.

“We have noticed how the bigger capacity tapes effectively shrink what we store, but there’s more data arriving now than was seen five to ten years ago.”

LTO-6 Ultrium logo

So does this mean the place is getting stuffed with LTO-6 tapes now – the 2.5TB format released at the end of last year? Apparently not: most of the tape cartridges are LTO-4, with new system builds featuring LTO-5. Glancing at the shelves, it seems that Iron Mountain is no stranger to a regular, but small scale turnover of DAT tapes too.

Hal has yet to see LTO-5 take off with his clients and he doubts many have heard of LTO-6.

“We handle almost entirely LTO-4 tapes, a few LTO-3 from a legacy system. We’ve stuck with LTO-4 because you can get 1.6TB on each tape [using 2:1 compression] and they cost about £20 each. The price has come down a lot. They used to be £50 about five years ago.”
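Hal's figures translate into a cost per terabyte with a little arithmetic. The sketch uses only the numbers he quotes (about £20 a tape, 1.6TB at 2:1 compression); the variable names are ours, and real compression ratios depend entirely on the data.

```python
price_per_tape_gbp = 20.0      # "about £20 each"
compressed_tb = 1.6            # quoted capacity at 2:1 compression
compression_ratio = 2.0

# Capacity without compression, and cost per terabyte on both bases.
native_tb = compressed_tb / compression_ratio
cost_per_compressed_tb = price_per_tape_gbp / compressed_tb
cost_per_native_tb = price_per_tape_gbp / native_tb

print(f"native capacity: {native_tb:.1f}TB")
print(f"cost per compressed TB: £{cost_per_compressed_tb:.2f}")
print(f"cost per native TB: £{cost_per_native_tb:.2f}")
```

At the old price of £50 a tape, the same sums come out two and a half times higher.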

Iron Mountain tape vault

All in hand: tape storage at Iron Mountain

The purchasing and turnover of tape works out as a very modest investment. Admittedly, the tape library machines from the likes of Quantum, IBM and HP are expensive, but in Hal’s experience, for the most part, this equipment is extremely reliable. The tapes are too, but they do have their moments.

“The tapes are so resilient that unless they are physically damaged they are good for many, many years. Very occasionally the backup admin might report: this tape is unreadable, the software doesn’t recognise it. It’s very rare and I can take a look at the tape and yet I can’t see anything physically wrong with it, but for some reason they can’t use it.”

“We’ve one client who sent us a bunch of really old tapes and we have had quite a few failures with those. You don’t spend too much time on this as it’s easier to use a fresh tape. Occasionally we encounter mechanical failures: it's been dropped, or you might have a leader displaced within the tape, or perhaps nothing obviously wrong at all. We’ve been lucky whenever this has happened, as the tape has always ‘expired’ – it’s reached the end of its retention period – so there hasn’t been any value in the data. So it’s been easy to dispose of them.”

As tapes are assigned to specific tasks they become available for use once more after the retention period expires. It’s a continuous cycle of retention and release to refresh the backup content on the tape. As Hal outlines, the regular purchase of new tapes isn’t necessarily that frequent.

“When a client environment is ‘stood up’ new tapes are usually needed and the number of tapes purchased is proportional to the data that they’re likely to be storing. You’d have an initial estimate – a larger installation could be 100 tapes – and that could last you two years.”

Feeding the machine

Tape libraries are equipped with at least one tape drive – some have several, maybe eight or more – and feature multiple slots that can house handfuls to hundreds of tapes. A critical aspect is the import/export (IE) station which is where operators retrieve backup tapes or insert expired ones for re-use. Tapes aren’t kept in the IE station; it’s just a holding bay for the movement of them while avoiding any disruption of ongoing tasks within the library, as it's not uncommon for backups to be running while tapes are being put in. Hal describes the procedure.

Quantum Scalar i40 tape library

Quantum Scalar i40 tape library: drives at the back with tape slot magazines on the sides

“A 'smart hands' operator simply presses a few buttons on the library to get the IE station drawer open, puts the tapes in and closes the door, followed by a few more button presses on the LCD interface. Incidentally, nobody’s scanning the tapes before they go into the library. The administrators have specified their recall, so we're confident that they’re good to go back into the libraries to be re-used. There’s an online list of those tapes that we can always check if we need to.”

“The machinery inside the library itself then takes over. A robotic arm equipped with a barcode scanner goes up and down and scans the tape label and that information is relayed to the backup software, which in almost all cases is CA ARCserve. The backup administrators are data centre guys, usually working remotely for our company across three continents and they have a picture of the ARCserve installation for that particular client. They know what’s going on with the backup system at any point in time.

CA ARCServe Bare Metal Recovery

CA ARCServe Bare Metal Recovery restore in progress

Obviously, the backup software has a scheduler in it so the jobs start at specific times and the software can issue alerts if there are any anomalies: backup failures, that kind of thing, they all get sent through to the administrators."

“The tapes would be in one of a number of pools, and the pools would relate to the type of backup jobs they were used for. You might have tapes that are in a daily pool and the retention period on those backups might be as low as two weeks. Over that two week period, you’ve presumably done a full backup on a different pool of tapes. Now those two weeks have elapsed, that old daily becomes of no value so you can bring that tape back from Iron Mountain and into the pool where it will be logically erased and re-used."
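The pool-and-retention logic Hal outlines can be sketched as a simple date comparison. This is an illustration under assumptions, not how ARCserve works internally: only the two-week daily retention comes from the text, while the weekly figure, the labels and the `expired` function are hypothetical.

```python
from datetime import date, timedelta

RETENTION = {
    "daily": timedelta(weeks=2),   # "as low as two weeks" in the text
    "weekly": timedelta(weeks=8),  # hypothetical value for illustration
}

tapes = [
    {"label": "100001L4", "pool": "daily", "written": date(2013, 9, 1)},
    {"label": "100002L4", "pool": "daily", "written": date(2013, 9, 10)},
    {"label": "100003L4", "pool": "weekly", "written": date(2013, 9, 1)},
]


def expired(tapes, today):
    """Labels whose retention period has elapsed, so they can be recalled."""
    return [
        t["label"]
        for t in tapes
        if t["written"] + RETENTION[t["pool"]] <= today
    ]


ready = expired(tapes, date(2013, 9, 18))
print(ready)
```

A list like `ready` is what would go to the vault as a recall request, with the tapes logically erased once they are back in the pool.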

The recovery position

Tape library machine costs aside, 100 tapes at about £20 each works out at around £2,000 for all your backup storage over those two years – under £3 a day – and although the needs will increase, the expansion is slow and steady.
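The sum behind that figure is easy to check using Hal's own numbers: an initial purchase of 100 tapes at about £20 each, lasting two years. The variable names are ours and the calculation ignores the library hardware, as the text does.

```python
tapes_purchased = 100     # "a larger installation could be 100 tapes"
price_per_tape_gbp = 20   # "about £20 each"
years_covered = 2         # "that could last you two years"

total_gbp = tapes_purchased * price_per_tape_gbp
per_day_gbp = total_gbp / (years_covered * 365)

print(f"total: £{total_gbp}")
print(f"per day: £{per_day_gbp:.2f}")
```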

"It’s typically when a client has databases that grow larger that you start using more. They might spill over from one to two tapes or five to six tapes. Hence, you have more tapes in your cycle, so after a while your stock of re-usable incoming tapes gets smaller and smaller.”

Quantum LTO-6 Ultrium tapes

LTO-6 Ultrium tapes have yet to catch on but offer 6.25TB compressed capacity and claim a 30 year archival life
This pack of 20 notches up 125TB and comes to around £10 per terabyte

Chuckling, he adds, “And what normally happens is somebody notices: we haven’t got many tapes for so and so – are there any at Iron Mountain that can come back, are there any in the cupboard? Eventually, you get to the point where you need to order some tapes. It is fairly rare and might only happen a few times a year.”

The shifting of both old and new tapes between the remote secure vault, the data centre and into the tape machines themselves involves some careful management and integration between the logging systems of High Fibre and Iron Mountain, which is where the labelling comes into play.

“I instigated a numbering system so I could identify the client just by looking at the number. We buy them blank and labelled with these numbers and barcodes. There’s an additional site sticker that goes on the top of the tape which identifies us amongst Iron Mountain’s clients. However, the tape library only reads the trioptic label. Any tape within our data centre is unique, so no two clients have the same numbered tapes. The backup administrators in India know which backup system the tapes are on and which client it belongs to. The tape library is obviously connected to that backup server – logically and physically – with copper or fibre.”
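Hal doesn't spell out his numbering scheme, so the sketch below invents one purely for illustration: each client gets its own block of serial numbers, and a label is assumed to be a six-digit serial followed by a two-character media code. The ranges, client names and function names are all hypothetical.

```python
# Hypothetical allocation: each client owns a distinct block of serials.
CLIENT_RANGES = {
    "Client A": range(100000, 200000),
    "Client B": range(200000, 300000),
}


def client_for(label):
    """Identify the client from the six-digit serial at the start of a label."""
    serial = int(label[:6])
    for client, numbers in CLIENT_RANGES.items():
        if serial in numbers:
            return client
    return None


def ranges_overlap(ranges):
    """True if any two clients could end up with the same tape number."""
    spans = sorted((r.start, r.stop) for r in ranges.values())
    return any(
        prev_stop > start
        for (_, prev_stop), (start, _) in zip(spans, spans[1:])
    )


print(client_for("100042L4"))
print(ranges_overlap(CLIENT_RANGES))
```

Keeping the blocks from overlapping is what guarantees that no two clients share a tape number.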

Buncefield fire scene

Buncefield: Source: MIIB, Chiltern Air Support Unit

So the clients can sleep easy in their beds with different backups: daily, weekly, monthly and even annually, with the longer periods kept for years. Lloyd recalls the sobering example of the Buncefield oil terminal explosion from 2005 which wiped out 92 neighbouring businesses and left 9,500 employees with no premises to work in. Among them was a data centre run by Northgate Information Solutions. Yet on the strength of its business continuity planning, it was able to retain its customers. It can be a very different outcome for those without a DR safeguard, as Lloyd points out.

"At Buncefield, you couldn’t get back to the premises for some weeks; in that time your business could go to the wall. With no DR strategy in place, a large number of companies who have a fire never get back into business again."

As mentioned earlier, the whole disaster recovery routine is tested at regular intervals but to be effective, procedures need to be followed and the idiosyncrasies of the various components involved need to be accounted for in meticulous detail. Hal has first hand experience of the sorts of issues that arise with live DR testing.

“You’ve got a physical replication of your SAN at the target site which is generally not in use. The server builds, the OS patching, that’s all up-to-date and ready to go. All it’s waiting for is the data on the tapes, and, of course, the switching of the communication links from one site to another.

Some of the target systems have the same local IP addresses as the production systems, so when you’re pulling data off a tape, the host names and IP addresses are all the same. Also, various databases and applications have hard coded IP addresses in them, I wish they hadn’t but they do tend to do that in the SAP world.

You have to be very careful when you switch over to the DR environment. The users involved have to be sure that they are not actually looking at the original production data and that they are looking at the DR data. Furthermore, they need to keep in mind that any updates that they might make are simply going to be lost. During a test we can’t have them working on a DR system thinking that they are working on the main production systems. So it’s all very tightly controlled.”

Borrowed time

A full suite of tests will go on for several hours and at a specified point in the project, there’s a switchover back to the production system. The production communications are re-established and off you go, the DR’s left again for the next test. Now this might seem like a lot of horsepower idling away for six months in between tests but in Hal’s experience, clients will use their DR equipment for other projects.

“We have one that uses it for dev, tests and training and so on. But if a DR occurred on their main production systems they would effectively sacrifice the training, test and development environment. The DR needs would take over but that’s all manageable as you might have extra storage to preserve any important projects being run on these systems.”

Certainly, development and training is a very practical way to utilise such costly equipment that's being saved for a very rainy day, ready to be transformed into a fully functioning data centre by simply inserting a few tapes. Whether the end users know it or not, a vast range of enterprises still depend upon this extremely reliable and affordable media to resurrect their fortunes when disaster strikes. The continued development of LTO drives, with ever increasing capacities, suggests an enduring future, particularly as big data and the cloud reign over data centres whose business is to ensure that they have got it taped. ®