The intelligent data storage imperative
Information Lifecycle Management
Analysis: Storing data intelligently has suddenly become a major imperative for companies, and how you handle replication and redundancy is becoming a critical factor. Information Lifecycle Management has become one of the focuses of the IT industry because the amount of data we store is growing at an alarming pace. The world produces about 5 exabytes of new data per year (that's five million terabytes) and the rate of growth is about 30 per cent - at least that's what Hal Varian's researchers at the University of California, Berkeley tell us. Some of this is data that few organizations store much of (video and audio), but a great deal of it is data that they do store - website content and email being the main culprits.
Add to this the simple fact that regulation and compliance in the US are starting to demand that companies keep audit trails of changes to data (or at least to important data). Keeping audit trails is likely to multiply the amount of transactional data we store by a factor of two or more. Regulation and compliance in Europe are likely to follow suit and present the same or similar demands. So the amount of data we store is going to continue to grow, and just storing it in an intelligent way is going to be problematic.
However, there is another issue to consider: a fair amount of data is simply out of control. Hal Varian's team estimates that 80 per cent of stored data is replicated, or redundant, or both. This means that there are about five copies (on average) of every single chunk of data. We can readily acknowledge that there should be two copies (the real one and a back-up), and maybe there should be an average of three, because data often needs to be distributed for the sake of further usage or performance. But two of the average five copies are probably completely redundant.
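The copy counts above follow from simple arithmetic, sketched here in Python. The sketch assumes that "80 per cent replicated" means only 20 per cent of stored data is a unique original; the function names are illustrative and not taken from the Berkeley study.

```python
# Worked arithmetic for the replication figures quoted in the text.
# Assumption: if 80 per cent of stored data is replicated or redundant,
# the remaining 20 per cent is the set of unique originals.

def average_copies(replicated_percent: int) -> float:
    """Average number of stored copies per unique chunk of data."""
    unique_percent = 100 - replicated_percent
    return 100 / unique_percent

def redundant_share(total_copies: float, justified_copies: float) -> float:
    """Share of stored data that exceeds the justified number of copies."""
    return (total_copies - justified_copies) / total_copies

copies = average_copies(80)          # five copies on average
waste = redundant_share(copies, 3)   # original + back-up + one distributed copy
print(f"{copies:.0f} copies on average, {waste:.0%} of stored data redundant")
```

With three justified copies out of five, two fifths of what is stored serves no purpose, which is where a figure of around 40 per cent comes from.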
It isn't surprising really. It is common within organizations for data to be left lying on a disk somewhere because no one dares to delete it, even though everyone is reasonably convinced that it isn't required. Unfortunately there is usually no accurate record of why the data exists, although there is usually some way of knowing when it was created and when it was last accessed. Let's add another fact to the mix: 90 per cent of data on disk is never or seldom accessed after 90 days. Actually, a good deal of it is never accessed after a week. The 90 per cent figure applies to all data, but data held in databases gets more usage for longer than data held in files and, particularly, data held in email systems.
To summarize the situation: a good deal of data is redundant (around 40 per cent) and a good deal of data doesn't need to be on-line. So the intelligent archiving of data starts to become an imperative, because the cost of holding data in an archive is lower than the cost of holding it on disk. Unfortunately, it isn't so simple, because the data that is still accessed after 90 days tends to be important data, and it needs to stay available for quick access.
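As a rough illustration of how last-access records could be used to find archive candidates, here is a minimal Python sketch that walks a directory tree and lists files not read within a chosen number of days. It assumes the filesystem actually maintains access times (volumes mounted with options such as `noatime` do not), and the function name `idle_files` and the 90-day default are illustrative choices only.

```python
import os
import time

SECONDS_PER_DAY = 86400

def idle_files(root, max_idle_days=90, now=None):
    """Return (path, idle_days) for files not accessed in max_idle_days.

    Assumption: st_atime is trustworthy on this filesystem. On volumes
    that do not update access times the result will be misleading.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * SECONDS_PER_DAY
    idle = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                last_access = os.stat(path).st_atime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if last_access < cutoff:
                idle.append((path, (now - last_access) / SECONDS_PER_DAY))
    return idle
```

A list like this only identifies candidates. As the text notes, some of the data that is rarely touched is still important, so access age alone is not enough to decide what can be archived.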
So it is important to analyze the usage of data proactively, so that its future usage pattern can be estimated accurately. To complicate the situation, there are different options for data storage. Physically there is: solid state disk, fast disk, capacity disk, optical disk, near-line tape, far-line tape and non-digital means of storage - the options getting less expensive as retrieval gets slower. But unless you know how quickly data needs to be made available, it is not possible to organize a sensible flow of data from an instantly available state to an archived state. Back-ups are a natural part of this migration too: backed-up data falls into the same migration cycle and needs to be stored for a specific speed of recovery. To further complicate the situation, the price of the technology is constantly changing. It is moving agreeably downwards, but the cost equation is still complex.
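The trade-off between retrieval speed and cost can be sketched as a simple placement rule: choose the cheapest tier that still meets the required retrieval time. The Python below is only an illustration. The tier names come from the list above, but the retrieval times and relative costs are made-up placeholders, not measured or quoted figures.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    retrieval_seconds: float  # hypothetical worst-case time to retrieve data
    relative_cost: float      # hypothetical cost per gigabyte, arbitrary units

# Placeholder values for illustration only; real figures change constantly.
TIERS = [
    Tier("solid state disk", 0.001, 100.0),
    Tier("fast disk", 0.01, 20.0),
    Tier("capacity disk", 0.05, 8.0),
    Tier("optical disk", 30.0, 4.0),
    Tier("near-line tape", 120.0, 2.0),
    Tier("far-line tape", 86400.0, 1.0),
]

def cheapest_tier(max_retrieval_seconds, tiers=TIERS):
    """Pick the lowest-cost tier whose retrieval time meets the requirement."""
    candidates = [t for t in tiers if t.retrieval_seconds <= max_retrieval_seconds]
    if not candidates:
        raise ValueError("no tier meets the required retrieval time")
    return min(candidates, key=lambda t: t.relative_cost)

# Data that must come back within an hour can sit on near-line tape,
# given the placeholder numbers above.
print(cheapest_tier(3600).name)
```

The rule is trivial once the required retrieval time is known, which is the point of the paragraph above: the hard part is knowing that requirement for each piece of data.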
The market for digital tape is gradually being eroded by disk, as disk is a far more reliable medium and its cost per gigabyte is in steep decline. But this needs to be balanced against the fact that most organizations store ever more data - indeed data storage is usually the most expensive component of data center costs, despite the decline in prices. The complexity of the situation suggests that the more automated the solution, the more practical it will be. Indeed, the ideal is to move towards a solution which monitors data growth and is able to predict what type of extra resource is required and when - optimizing a cost equation in the process.
As I said in my previous column on this topic, it all depends on an analysis of the data resource and on setting policy in line with what is known. As it happens, no vendor has a complete out-of-the-box solution yet, although the likes of EMC, IBM, Hitachi, StorageTek and the rest are all moving towards getting smart about the problem and treating storage as a "virtual" resource. Indeed, if you think in terms of information lifecycle management, it is easy to understand EMC's acquisition of Legato and Documentum.
In my view, the Information Lifecycle Management problem will not be resolved by the storage vendors alone, but will ultimately involve the controlled versioning of all data and the attaching of a much richer set of metadata (using XML) to the data itself - so that data of any kind knows who created it, when, where, how and why and also, perhaps, carries some indication of what its value actually is. This is really the domain of the database - although the database is still very far from being the natural store for all data. If you are getting the idea that you'll be hearing about Information Lifecycle Management for many years to come, then you're probably right. We're only at the beginning of its life cycle.
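To make the idea of richer XML metadata concrete, here is a minimal Python sketch that builds a small metadata record covering who, when, where, how and why, plus an optional value indicator. The element names, the function `build_metadata` and the sample values are all hypothetical; no existing metadata standard or vendor format is implied.

```python
import xml.etree.ElementTree as ET

def build_metadata(creator, created, location, method, purpose, value=None):
    """Return an XML string describing a piece of data.

    All element names here are illustrative assumptions, not a standard.
    """
    root = ET.Element("metadata")
    fields = {
        "who": creator,
        "when": created,
        "where": location,
        "how": method,
        "why": purpose,
    }
    for tag, text in fields.items():
        ET.SubElement(root, tag).text = text
    if value is not None:
        # Optional indication of what the data is worth to the business.
        ET.SubElement(root, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")

# Hypothetical example record.
record = build_metadata(
    creator="finance-batch",
    created="2004-03-01T09:30:00",
    location="london-dc",
    method="nightly export",
    purpose="quarterly audit trail",
    value="high",
)
print(record)
```

A record like this would give an archiving policy something to reason about beyond creation and last-access times.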