Retaining 100 years of information
El Reg has teamed up with the Storage Networking Industry Association (SNIA) for a series of deep dive articles. Each month, the SNIA will deliver a comprehensive introduction to basic storage networking concepts. This month the SNIA examines the idea of 100-year archives.
Organisations store information over the long term for different reasons. Laws and regulations force them to keep data for specific lengths of time: life insurance information, for example, has to be retained for at least the remaining validity of the policy, which often depends on how long people live, and people tend to live longer and longer. Other fields where data has to be stored for a long time include web services and fixed content repositories.
Fixed content repositories store information such as books, movies, music, scientific and historical data. Archiving such data is a basis to analyse it, to discover patterns and to learn from the past.
Our society has become an electronic one; the cloud, social networks et al - life today happens online. If we are not able to preserve information we might end up in a new dark age, where future generations will not be able to understand our time because there would not be enough information left behind.
How common is this?
SNIA carried out the 100 year archive requirements survey. When asked for the longest retention time they have for any data, 53 per cent of companies stated “permanent”, i.e. some data needs to be kept forever. Long-term retention is clearly a relevant and widespread requirement.
What does long-term retention mean?
Data should remain accessible, usable and of course uncorrupted for as long as necessary, beyond the lifetime of any particular storage system or technology. And this has to be achieved at an affordable cost.
Keeping information for many years is challenging. There are many threats that we need to be aware of, some of them outside our control, such as:
• Large scale disasters
• Human error
• Media fault
• Economic fault
• Organisational fault
• Infrastructure obsolescence
• Software or format obsolescence
• Lost context/metadata
• Bit preservation
While the first five threats are relevant for all kinds of stored data, the last four are specific to long-term retention.
Let’s start with the last point, bit preservation, sometimes also referred to as the problem of random bit changes. All technology can fail; in the case of storage, a bit can change, e.g. from 0 to 1. This is a pretty unlikely event in day-to-day storage operations. But when you store larger and larger amounts of data for longer and longer periods of time, even improbable events become probable. In the worst case you might have to keep a couple of petabytes of data forever, and in such a situation anything that could happen will happen. The only question is when, so you must be prepared. A recipe against random bit changes is to create more than one copy of the data set and to check regularly whether anything has changed. If a copy has changed, abandon it and create a new one from a copy that is still intact.
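The recipe against random bit changes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a method prescribed by the SNIA: it assumes the copies are ordinary files, that a SHA-256 digest was recorded when the data was archived, and the function names `sha256_of` and `audit_and_repair` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_and_repair(copies: list[Path], expected_digest: str) -> list[Path]:
    """Check every copy against the digest recorded at archive time.

    Copies that are missing or no longer match are overwritten from a
    copy that still verifies. Returns the copies that had to be repaired.
    """
    good = [p for p in copies if p.exists() and sha256_of(p) == expected_digest]
    if not good:
        raise RuntimeError("no intact copy left; the data cannot be repaired")
    bad = [p for p in copies if p not in good]
    for damaged in bad:
        # Abandon the changed copy and create a new one from an intact copy.
        shutil.copyfile(good[0], damaged)
    return bad
```

Comparing against a digest stored at archive time, rather than only comparing the copies with each other, lets the audit tell which of two differing copies is the damaged one.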
The other three points all deal with the effect on your archive of time passing and technology and environment changing. Let’s discuss the different levels you need to have a look at:
1) The media itself, e.g. disk, tape or even something from the remote past like a floppy disk
2) The physical devices you need in order to access the media, e.g. the floppy drive and what you need to make it read your floppy
3) The logic/software behind, from the operating system on a server to the drivers to the application software
4) And finally staff who can operate the application and understand the context, i.e. know what kind of data they see and what it is good for.
Let’s have a look at possible ways to address these challenges:
The classic approach is to choose a medium that stands the test of time. Ancient civilisations used stone that we can still decipher thousands of years later; unfortunately its access time and bandwidth fall far short of our expectations today. Nevertheless there are options such as microfiches or specific metal plates that can provide long-term data retention. Unfortunately, they still have access time and throughput deficiencies. Therefore, in practice, they can only be used if access is improbable and the main objective is to keep data for as long as possible.
IT media sooner or later lose their information; there are hardly any guarantees, but as a rule of thumb hard disk drives fail after approximately five years, tape after 30. So if you need to preserve data for longer, the only feasible approach is to regularly copy data from old to new media. This also eliminates issues related to changes in technology, and newer media often bring the benefit of enhanced capacity and throughput. And if you create more than one copy, you increase the probability that the data will still be there.
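The effect of additional copies can be made concrete with a little arithmetic. The sketch below rests on assumptions that are not from the survey or any vendor data: copies fail independently of each other, lost copies are recreated at every refresh, and the 5 per cent loss rate per refresh cycle is purely illustrative. The function names are hypothetical.

```python
def loss_probability(per_copy_loss: float, copies: int) -> float:
    """Probability that every copy is lost within one refresh cycle,
    assuming the copies fail independently of each other."""
    if not 0.0 <= per_copy_loss <= 1.0:
        raise ValueError("per_copy_loss must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return per_copy_loss ** copies


def survival_over_cycles(per_copy_loss: float, copies: int, cycles: int) -> float:
    """Probability that the data survives `cycles` refresh cycles,
    assuming lost copies are recreated at the end of each cycle."""
    return (1.0 - loss_probability(per_copy_loss, copies)) ** cycles


# Illustrative only: 5% chance of losing a single copy per cycle, 20 cycles.
for n in (1, 2, 3):
    print(n, "copies:", round(survival_over_cycles(0.05, n, 20), 4))
```

With these made-up inputs, one copy survives 20 cycles with a probability of roughly 36 per cent, two copies with roughly 95 per cent and three with about 99.8 per cent. The independence assumption is the weak point: copies on the same technology or at the same site tend to fail together.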
There are basically three ways to address the next level, the devices needed to read the media. The first option is to store on a medium in a way that does not require a specific device to read the information; this is the “hieroglyph on a stone” approach. In this case the necessary device is the human eye. Today writing images to microfiches would get close to this approach; you are very much on the safe side, but again, access and speed are not very efficient.
The second option is to archive the device with the data. This creates various challenges: do you also have to archive some spare parts? Will there be someone to repair the device when it becomes necessary? Who will be able to operate the device 100 years from now? Will you be able to connect it anywhere? If the device was an old floppy drive, for example, you would also need the right generation of computer with the proper drivers. If in doubt you would probably have to archive the complete infrastructure stack including manuals.
With option three we come back to the best practice on the media challenge: keep on copying, and copy regularly to up-to-date infrastructure (e.g. the latest generation of LTO tape). This is probably the most convenient and safest approach to avoid running into old hardware issues. Unfortunately neither copying data nor buying the latest equipment comes free of charge, so you’d better factor that into the TCO calculations of your archiving solution.
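How often you will have to copy is simple to estimate. The following sketch counts the migrations over a retention period and multiplies them by a cost figure. Both function names and the flat cost per migration are assumptions for illustration; a real TCO model would also cover equipment, power, staff and price changes over time.

```python
def migrations_needed(retention_years: int, refresh_interval_years: int) -> int:
    """Number of copy operations after the initial write, when data is
    refreshed every `refresh_interval_years` until retention ends."""
    if retention_years <= 0 or refresh_interval_years <= 0:
        raise ValueError("both arguments must be positive")
    # Integer ceiling of retention / interval, minus the initial write.
    cycles = (retention_years + refresh_interval_years - 1) // refresh_interval_years
    return cycles - 1


def migration_cost(retention_years: int, refresh_interval_years: int,
                   cost_per_migration: float) -> float:
    """Total migration cost, assuming a flat (hypothetical) cost per migration."""
    return migrations_needed(retention_years, refresh_interval_years) * cost_per_migration


# Using the rule-of-thumb lifetimes as upper bounds for the refresh interval:
print(migrations_needed(100, 5))   # disk refreshed every five years
print(migrations_needed(100, 30))  # tape refreshed every 30 years
```

In practice the refresh interval would be chosen shorter than the expected media lifetime, which raises the number of migrations further.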
The logical layer
If you don’t want to archive software stacks you will need to use abstraction or virtualisation layers that will stay the same regardless of technological progress. Such standards do exist on different levels; let’s have a look at some examples.
a. PDF is an example of a rather well-defined file format. The idea is that even though applications might change, future applications would still be able to read PDF documents. Unfortunately standards also progress over time, which would not be an issue as long as upgrades never break compatibility. Compatibility is the natural enemy of innovation, and the question is whether such breaks can be avoided over very long timeframes, like 100 years.
b. Another example is XML or any other approach where you store the metadata or the description of the data together with the data itself. If you extend that idea to an object-based model, the software or application necessary to work with the data could be stored with the data. But this concept has its limits. Should it go as far as the operating system?
c. An approach to archive the necessary applications could be to use server virtualisation. When using VMware, for example, a server is just a file. So why not just archive this file, which represents the server, together with the software? The hypervisor is a very nice example of such an abstraction layer as described above. But is this interface stable enough to span 100 years? What if server virtualisation looks fundamentally different in the future?
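Example b, keeping the description of the data together with the data itself, can be sketched with Python's standard XML library. The element names (`archiveObject`, `metadata`, `field`, `data`) and the helper functions are hypothetical and do not follow any particular archive standard; the payload is base64-encoded so that arbitrary bytes fit into the XML text.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_with_metadata(payload: bytes, metadata: dict[str, str]) -> str:
    """Return one XML document holding the payload and its description."""
    root = ET.Element("archiveObject")  # hypothetical element names
    meta = ET.SubElement(root, "metadata")
    for key, value in metadata.items():
        field = ET.SubElement(meta, "field", name=key)
        field.text = value
    data = ET.SubElement(root, "data", encoding="base64")
    data.text = base64.b64encode(payload).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def unwrap(document: str) -> tuple[bytes, dict[str, str]]:
    """Recover the payload and its metadata from the XML document."""
    root = ET.fromstring(document)
    metadata = {f.get("name"): f.text or "" for f in root.find("metadata")}
    payload = base64.b64decode(root.find("data").text or "")
    return payload, metadata
```

A reader who finds such an object decades later gets the context (format, origin, meaning of the content) in the same place as the bits, which addresses the lost context/metadata threat at least in part.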
There are many more examples of how standards can ease the pain of software, format and infrastructure obsolescence. As long as you stick to standards that are relevant or widely used today, the probability is high that you will at least find converters or emulators in the future that let you use your data.
For any data created today to be usable in 100 years, it must be possible for our descendants to operate the applications and to understand the data. This means, for example, that user manuals have to be archived. Long-term retention creates situations where the original creator of the data is no longer around, so everything must be self-explanatory. Just imagine you had to work on archived information from the early days of IT, finding some punch cards in a cabinet without any additional information, and that is just some 60 years old!
To summarise, here are some best practices you could apply for successful long term data retention:
• Create at least two copies of data and distribute them as far apart as possible for DR purposes
• Use heterogeneous technologies; avoid interdependencies and correlated failures in your concept
• Find hidden issues; carry out audits and access tests
• Use widely-understood (and used) standards
A good archive is almost always active; we can’t predict what will change, only that it will. ®
This article was written by Marcus Schneider, SNIA Europe Board member, and Director of Product Marketing at Fujitsu for Storage Solutions.
For more information on this topic, visit: www.snia.org and www.snia-europe.org. To download the tutorial and see other tutorials on this subject, please visit: http://www.snia.org/education/tutorials/2010/fall
About the SNIA
The Storage Networking Industry Association (SNIA) is a not-for-profit global organisation, made up of some 400 member companies spanning virtually the entire storage industry. SNIA's mission is to lead the storage industry worldwide in developing and promoting standards, technologies, and educational services to empower organisations in the management of information. To this end, the SNIA is uniquely committed to delivering standards, education, and services that will propel open storage networking solutions into the broader market.
About SNIA Europe
SNIA Europe educates the market on the evolution and application of storage infrastructure solutions for the data centre through education, knowledge exchange and industry thought leadership. As a Regional Affiliate of SNIA Worldwide, we represent storage product and solutions manufacturers and the channel community across EMEA.