Original URL: http://www.theregister.co.uk/2011/10/08/tiered_storage/
Storage is ending in tiers
Respect the data hierarchy
One basic rule of storage is don’t keep items you need to access quickly in hard-to-get-at places: that’s like keeping instant coffee in a safe. You suit the storage type to the type of item you want to store.
One size does not fit all, and valuable data should not be stored on slow-speed disk drives along with ordinary data.
The “hot” data – information that is needed often and fast – should be stored on fast disk drives, not slow ones. But these drives should not be cluttered with old data that no one accesses more than once in a blue moon.
Disk drive array suppliers and storage administrators have understood this for years, but they still have to resolve the following: how is hot data to be identified, how is data to be moved between fast and slow storage as it heats up or cools down, and how often should this be done?
Disk drives come with various spin speeds and interfaces. Other things being equal, the speed of rotation is the most important factor in data access time. A read/write head has to move to the right track on a disk's surface and then wait for the disk's rotation to bring the target sector under the head.
Setting aside disk form factor (3.5in or 2.5in), moving the slider to bring the read/write head to the target track takes the same time irrespective of the disk's spin speed. Rotation is another matter: a 15,000rpm disk will bring the target sector under the head almost three times faster than a disk spinning at 5,200rpm.
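As a rough check on that ratio, average rotational latency can be modelled as the time for half a revolution. The sketch below assumes that simple model and ignores seek and transfer time; the function name is illustrative only.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average wait for the target sector: half of one revolution, in ms."""
    ms_per_revolution = 60_000.0 / rpm
    return ms_per_revolution / 2.0


# Spin speeds mentioned in the article.
for rpm in (15_000, 7_200, 5_200):
    print(f"{rpm:>6}rpm: {avg_rotational_latency_ms(rpm):.2f} ms")

# How much sooner the sector arrives on the fastest drive than on the slowest.
ratio = avg_rotational_latency_ms(5_200) / avg_rotational_latency_ms(15_000)
print(f"15,000rpm vs 5,200rpm: {ratio:.2f}x")
```

Under this model the 15,000rpm drive waits about 2ms on average against roughly 5.8ms at 5,200rpm, a factor of about 2.9.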
Also, a 15,000rpm drive typically has a lower capacity than a 7,200rpm drive, which in turn has a lower capacity than a 5,200rpm drive. As a rule, the faster a drive spins, the lower its capacity and the higher its cost per gigabyte.
Make way for the new
Let's imagine a company that has all of its data on a drive array with a single disk type. Some of the data is there because it has to be kept just in case, but is of low importance and is accessed infrequently – last year's expense reports and manufacturing records, for example, or the data kept for regulatory compliance reasons.
Some of the data is accessed more frequently and is probably newer. This could be:
- Emails between one week and one month old
- PowerPoint decks created in the past three months
- Marketing collateral such as whitepapers
- HR staff records
Other data is accessed much more often, for example:
- Customer and accounting databases
- New hire records in HR
- The current manufacturing run
- Sales order processing
A warehouse has aisles near to or far away from the loading bay. High turnover goods are stored nearer the loading bay so they can be dealt with faster. That is an example of storage tiering.
So we can allocate data to three overall categories based on access rate and newness. These could be called hot, cool and cold; or high, medium and low-access. We could also say high access-rate data is high-value data.
Ideally we would put the relatively small amount of hot data on fast 15K drives, which are expensive and low capacity. Medium-access data could go on 10K drives; and low-access data, the bulk of our information, could go on high-capacity 7.2K Sata drives.
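The three-way split can be sketched as a simple threshold test on access rate. The thresholds, the sample access rates and the function name below are hypothetical, chosen only to illustrate the idea; real cut-off points depend on the workload.

```python
# Hypothetical thresholds, in accesses per day.
HOT_MIN_ACCESSES_PER_DAY = 100
COOL_MIN_ACCESSES_PER_DAY = 1

# Drive type suggested in the article for each category.
DRIVES_FOR_CATEGORY = {
    "hot": "fast disks",
    "cool": "10K drives",
    "cold": "7.2K Sata drives",
}


def categorise(accesses_per_day: float) -> str:
    """Return 'hot', 'cool' or 'cold' for a given access rate."""
    if accesses_per_day >= HOT_MIN_ACCESSES_PER_DAY:
        return "hot"
    if accesses_per_day >= COOL_MIN_ACCESSES_PER_DAY:
        return "cool"
    return "cold"


# Made-up access rates for three of the article's example data sets.
samples = {
    "sales order processing": 5000,
    "marketing whitepapers": 12,
    "last year's expense reports": 0.02,
}
for name, rate in samples.items():
    category = categorise(rate)
    print(f"{name}: {category} -> {DRIVES_FOR_CATEGORY[category]}")
```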
The difficult bit
That seems simple enough. What's the problem?
The problem is threefold: data is not static; identifying its state is difficult; and moving it is tricky.
Data is created, used and then kept for reference. These three stages constitute a data lifecycle.
Newly created data could be stored on fast access disks, but as its access rate slows down it takes up space that is needed for newer data and should be moved to an intermediate tier of storage. On the intermediate tier, meanwhile, data is cooling and needs to be moved to the bulk Sata tier to make space.
Do we employ storage admins to identify data that is in the wrong storage tier and move it? Of course not: it should be automated.
System software in an array or server could track the access frequency over time to files and database records and move high-access rate data up the drive tiers and low-access rate data down the tiers.
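A minimal sketch of such tracking follows, assuming a per-period hit counter and fixed promote/demote thresholds. The class name, tier labels and threshold values are all invented for illustration; this is not how any particular array does it.

```python
from collections import Counter

TIERS = ["sata", "10k", "fast"]  # index 0 is the slowest tier


class AccessTracker:
    def __init__(self, promote_at: int = 50, demote_below: int = 5):
        self.promote_at = promote_at      # hits per period needed to move up
        self.demote_below = demote_below  # fewer hits than this moves down
        self.tier = {}                    # item -> index into TIERS
        self.hits = Counter()             # accesses in the current period

    def add(self, item: str, tier: str = "fast") -> None:
        self.tier[item] = TIERS.index(tier)

    def record_access(self, item: str) -> None:
        self.hits[item] += 1

    def rebalance(self) -> list:
        """Move each item at most one tier, then start a new period."""
        moves = []
        for item, level in self.tier.items():
            count = self.hits[item]
            if count >= self.promote_at and level < len(TIERS) - 1:
                new_level = level + 1
            elif count < self.demote_below and level > 0:
                new_level = level - 1
            else:
                continue
            moves.append((item, TIERS[level], TIERS[new_level]))
        for item, _, destination in moves:
            self.tier[item] = TIERS.index(destination)
        self.hits.clear()
        return moves


tracker = AccessTracker()
tracker.add("orders.db", tier="10k")
tracker.add("expenses-2010.xls", tier="10k")
for _ in range(60):
    tracker.record_access("orders.db")
moves = tracker.rebalance()
print(moves)
```

Here the busy database climbs to the fast tier while the untouched spreadsheet drops to Sata. Counting hits per period rather than forever lets data cool down once people stop using it.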
Job done? Not quite. The tracking of access rates is a significant burden and the moving of data occupies array resources too. If you move large chunks of data then you reduce the number of move operations, which is good; but you might move inappropriate data, which is bad.
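The trade-off can be illustrated with a toy model in which a whole chunk is promoted whenever it contains at least one hot block. The block numbers and chunk sizes are made up, and the function is a hypothetical helper rather than anything an array actually exposes.

```python
from typing import Iterable, Tuple


def promotion_cost(hot_blocks: Iterable[int], total_blocks: int,
                   chunk_size: int) -> Tuple[int, int]:
    """Return (move_operations, cold_blocks_moved) when whole chunks move.

    Assumes every hot block number is below total_blocks.
    """
    hot = set(hot_blocks)
    hot_chunks = {block // chunk_size for block in hot}
    # The last chunk may be shorter than chunk_size.
    blocks_moved = sum(
        min(chunk_size, total_blocks - chunk * chunk_size)
        for chunk in hot_chunks
    )
    return len(hot_chunks), blocks_moved - len(hot)


hot = {0, 1, 2, 3, 500, 501, 900}  # seven hot blocks in a 1,024-block volume
for chunk_size in (1, 16, 512):
    moves, wasted = promotion_cost(hot, total_blocks=1024, chunk_size=chunk_size)
    print(f"chunk={chunk_size:>3}: {moves} moves, {wasted} cold blocks moved")
```

In this example block-sized chunks need seven moves and carry no cold data, while 512-block chunks need only two moves but drag more than a thousand cold blocks onto fast storage.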
A storage array delivers files or blocks. Ideally, a database should have its blocks allocated to different disk tiers according to the access rate of those blocks.
But that requires either the database to know about disk tiers or the storage array to present a single pool of storage to the database although it is spread across different kinds of disk. Clever software is needed whichever route is taken.
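The second route, a single pool in front of several kinds of disk, amounts to a mapping from logical block to the tier that currently holds it. The sketch below is a toy in-memory version; the class, its methods and the tier names are assumptions for illustration, not a real array interface.

```python
class TieredPool:
    """One logical block address space; each block lives on exactly one tier."""

    def __init__(self, tiers, default_tier):
        self._store = {name: {} for name in tiers}  # tier -> {block: data}
        self._location = {}                         # block -> tier name
        self._default = default_tier

    def write(self, block: int, data: bytes) -> None:
        # New blocks land on the default tier; existing ones stay where they are.
        tier = self._location.setdefault(block, self._default)
        self._store[tier][block] = data

    def read(self, block: int) -> bytes:
        # The caller never says which tier to read from.
        return self._store[self._location[block]][block]

    def migrate(self, block: int, to_tier: str) -> None:
        from_tier = self._location[block]
        if from_tier == to_tier:
            return
        self._store[to_tier][block] = self._store[from_tier].pop(block)
        self._location[block] = to_tier

    def tier_of(self, block: int) -> str:
        return self._location[block]


pool = TieredPool(["fast", "10k", "sata"], default_tier="10k")
pool.write(7, b"customer row")
pool.migrate(7, "fast")
print(pool.tier_of(7), pool.read(7))
```

The database keeps using block 7 throughout; only the pool's internal map changes when the block moves.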
The same is true for large files. In principle, the smaller the unit of data moved, the more efficiently a storage array can match access frequency and storage tiers.
Some storage arrays – Dell Compellent, for example – track individual block access rates (you can't get more granular than this) and move blocks up and down the tiers dynamically. That means a lot of data movement inside the array, and the Compellent array operating system requires multi-core X86 processors to provide the CPU horsepower to do this.
What is the effect of using flash solid state drives in storage arrays on this automated data tiering?
None whatsoever, in principle, flash being just another tier. However, writing data to flash should be minimised in the interests of flash longevity.
Automated tiering, in the form of products such as EMC's Fast (fully automated storage tiering), has become a standard feature of modern drive arrays.
It is the best way to reduce the amount of expensive fast storage in arrays, using cheaper bulk storage for infrequently accessed data, medium-cost storage for medium-value data, and gold-standard fast storage for gold-standard data. ®