Tech is the biggest problem facing archiving

Mountains of unreadable obsolete magnetic tapes!

Blocks and Files

Technology is the biggest problem facing archiving. Archives keep growing, and the amount of data to be kept threatens to overflow an archive installation. So, let's use LTO-6 tapes instead of LTO-5 ones because they hold much more data in the same physical space: 2.5TB raw against 1.5TB, and about twice as much once compression is counted.

That's logical but there is an unwanted side effect. LTO-5 drives can read LTO-3, LTO-4 and LTO-5 tapes. LTO-6 drives can read LTO-4, LTO-5 and LTO-6 tapes but not LTO-3. All the LTO-3 tape contents have to be migrated, and they may as well go straight up to LTO-6 to minimise future migrations, because when LTO-7 comes along its drives won't be able to read LTO-4 tapes and all their content will have to be migrated too, and so on, ad nauseam.
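The compatibility rule described above can be sketched in a few lines of Python. This is a minimal illustration, assuming (as the text does) that a drive reads its own generation and the two before it; the names `READ_BACK`, `readable_generations` and `must_migrate` are made up for this example and come from no vendor tool.

```python
# Assumption taken from the text: a drive of generation N can read tapes of
# generations N, N-1 and N-2. READ_BACK is a hypothetical parameter name.
READ_BACK = 2


def readable_generations(drive_gen: int, read_back: int = READ_BACK) -> list[int]:
    """Return the tape generations a drive can read, oldest first."""
    oldest = max(1, drive_gen - read_back)
    return list(range(oldest, drive_gen + 1))


def must_migrate(held_gens: set[int], new_drive_gen: int) -> list[int]:
    """Return the held tape generations that the new drive cannot read."""
    readable = set(readable_generations(new_drive_gen))
    return sorted(held_gens - readable)


if __name__ == "__main__":
    archive = {3, 4, 5}              # generations of tape sitting in the library
    print(must_migrate(archive, 6))  # [3]
    print(must_migrate(archive, 7))  # [3, 4]
```

Moving the library to LTO-6 drives strands the LTO-3 tapes; moving it to LTO-7 drives would strand LTO-4 as well.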

If their content isn't migrated then we can surely expect LTO-3 drive manufacture to cease, shortly followed by LTO-3 drive support, break-and-fix skills and spare parts availability withering away, followed in the fullness of time by the same fate for LTO-4, LTO-5 and so on. Eventually it will be impossible to read an old tape format.

Future archive tape libraries will almost inevitably need to automate the migration of earlier tape formats to the newest ones to preserve content readability.

It would be great if the ability to read and write tapes could be divorced from the actual tape media. Note this problem doesn't exist so much with disk drives, because disk and drive are a unity. As long as the interface electronics and software exist (Fibre Channel or SAS or SATA) and as long as there is software that can interpret the data format on the drives … it's a different flavour of the same problem.

Newer versions of Word cannot always read documents produced with much older versions of Word. It also seems inevitable that, before long, some archive software will include old application and system software version plug-ins so that old data can be restored from an archive in human-readable format. Of course, there are only two ways to present data to the Mark 1 eyeball: as numbers, text and diagrams on a display of some sort or as printed marks on paper.

Vatican Library; five centuries of stored paper, still readable by the Mark 1 Eyeball (screenshot)

The screen version is effectively an analogue of the paper version, and it is paper that is the enduring archive medium. Stick that in front of the Mark 1 eyeball and the jolly old intra-cranial computational unit will do its job.

Advancing storage technology, including hardware, system software and application software, gets in the way of this. It would be better if a digital archive put as few steps as possible between the storage medium and what the eyeball needs to see, while still having the advantages of a digital medium's storage density.

That would tend to reduce the archive's exposure to the side effects of technology advances.

Royal Dutch Petroleum dock in the former East Indies (now Indonesia)

100-year archive

The problem is actually a very large one. Take the second-biggest company in the world in revenue terms: Shell, properly known as Royal Dutch Shell, which came into being 106 years ago. It has, in effect, a 100+ year archive consisting almost entirely (bar the last few years) of paper documents.

Let's imagine a 100-year tape archive. How would that work?

Its tape format would advance a generation a little over every two years. LTO-1 was announced in 2000 with a 100GB capacity. Now, 13 years later, we have LTO-6 with 2.5TB capacity; that's six format generations over 13 years. Even delaying the format transitions for the archive to every five years (instead of every two years, as recent history shows us) would mean 20 tape format transitions in a century.
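The sums behind those figures are simple enough to check. Here is a small Python sketch using only the numbers quoted above; the function names are invented for the example.

```python
def years_per_generation(first_year: int, latest_year: int, generations: int) -> float:
    """Average number of years per tape generation over the period."""
    return (latest_year - first_year) / generations


def transitions(span_years: int, interval_years: int) -> int:
    """Whole format transitions that fit into a span at a fixed interval."""
    return span_years // interval_years


if __name__ == "__main__":
    # Six generations (LTO-1 to LTO-6) between 2000 and 2013.
    print(round(years_per_generation(2000, 2013, 6), 2))  # 2.17
    # A century of transitions at the slowed-down and the historical pace.
    print(transitions(100, 5))  # 20
    print(transitions(100, 2))  # 50
```

At the historical pace of roughly one generation every two years, the century would see around 50 format transitions rather than 20.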

As the archive capacity mounted up, the bulk of the tape archive's work would increasingly consist of migrating the contents of old tapes to new ones. It would devote an ever smaller share of its resources to answering archive users' data access requests. The bulk of the cost of the archive would be spent internally, having it chase its own data migration tail, and its cost per user access would skyrocket.
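A toy model shows how quickly the migration tail starts wagging the dog. The Python sketch below rests on assumptions of its own, not on any real archive: a constant amount of new data arrives every year, nothing is ever deleted, and at every format transition everything held is copied to the new format. The name `migration_share` and its parameters are hypothetical.

```python
def migration_share(years: int, ingest_tb_per_year: float,
                    interval_years: int) -> list[tuple[int, float]]:
    """For each transition year, return (year, share of that cycle's tape
    writes spent on migration rather than on newly arrived data)."""
    results = []
    held_tb = 0.0
    for year in range(1, years + 1):
        held_tb += ingest_tb_per_year
        if year % interval_years == 0:
            migrated_tb = held_tb  # assumption: everything held is recopied
            new_tb = ingest_tb_per_year * interval_years
            results.append((year, migrated_tb / (migrated_tb + new_tb)))
    return results


if __name__ == "__main__":
    for year, share in migration_share(100, 100.0, 5):
        print(f"year {year:3d}: {share:.0%} of writes are migration")
```

Under these assumptions the share at the k-th transition is k/(k+1), whatever the ingest rate: half of all writes at the first transition and more than 95 per cent by the twentieth. An archive whose intake grows every year would look somewhat better, but the direction of travel is the same.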

We are not even thinking yet about how Word 2113 would be able to read a Word 2013 format document; that sort of problem would possibly have to be dealt with by constant content format migration as well.

In a word, this is nonsense.

Unless we reach a stage where archival technology becomes as stable as paper and printing have been for decades, centuries even, we cannot unquestioningly keep all the data we digitally collect. The oldest, least-wanted data will have to be let go. Unless there is a clear need to keep it, some kind of digital filtering mechanism will have to be used to find the least-wanted data and delete it.

The archive will have to be trawled by digital spider-bots, data killers, looking for useless data and destroying it to make space for wanted data.
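What such a data killer might do can be sketched briefly in Python. This is an illustration only: it assumes "least wanted" can be approximated by "least recently accessed", and the `Item` fields and the `pick_for_deletion` function are hypothetical, not taken from any archive product.

```python
from dataclasses import dataclass


@dataclass
class Item:
    name: str
    size_tb: float
    last_access_year: int
    must_keep: bool = False  # hypothetical flag for "clear need to keep"


def pick_for_deletion(items: list[Item], space_needed_tb: float) -> list[Item]:
    """Choose the least recently accessed items, skipping protected ones,
    until enough space has been freed or the candidates run out."""
    candidates = sorted(
        (item for item in items if not item.must_keep),
        key=lambda item: item.last_access_year,
    )
    chosen, freed_tb = [], 0.0
    for item in candidates:
        if freed_tb >= space_needed_tb:
            break
        chosen.append(item)
        freed_tb += item.size_tb
    return chosen


if __name__ == "__main__":
    archive = [
        Item("survey-1998", 4.0, 1999),
        Item("contracts", 2.0, 2001, must_keep=True),
        Item("logs-2005", 3.0, 2006),
        Item("video-2012", 5.0, 2013),
    ]
    print([item.name for item in pick_for_deletion(archive, 6.0)])
    # ['survey-1998', 'logs-2005']
```

The hard part, of course, is not the loop but deciding what goes in the `must_keep` column.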

Somebody could make a business out of taking this old data and storing it in a kind of digital deep-freeze for potential re-activation. Maybe this could be on the Moon, in a nuclear-powered mega-flash-vault, with plenty of space to expand; there'll always be another crater ... but this is science fiction.

The real moral of this tale is that virtually no data is needed for ever. Big data bigots' mad claims notwithstanding, digital archives will have to be regularly cleared. Physical space runs out; digital space runs out; formats change; applications change; and preserving access to older and older data will become crushingly expensive.

Technological change might come up with a solution to this problem, but it's a problem created by that very process of technological change. Beware what you wish for. ®