What your database needs is a good thermometer

Data blows hot and cold

In the very early days of databases, when they contained relatively small quantities of information, the distinction between frequencies of access was pretty immaterial.

Now that we collect and store data in massive quantities, however, the distinction is becoming increasingly important. And apposite terms have evolved to describe this facet of data: "hot" data is accessed frequently, "cold" data less frequently.
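If you want to make that notion concrete, a temperature is easy to compute: count accesses, weighting recent ones more heavily than stale ones. Here's a minimal sketch in Python - the half-life and the figures are invented for illustration, not taken from any particular product:

```python
import math
import time

def temperature(access_times, now=None, half_life_days=30.0):
    """Recency-weighted access count: each access contributes 1.0
    when fresh and decays exponentially as it ages."""
    now = now or time.time()
    half_life = half_life_days * 86400  # half-life in seconds
    return sum(math.exp(-math.log(2) * (now - t) / half_life)
               for t in access_times)

now = time.time()
day = 86400
this_weeks_orders = [now - i * day for i in range(7)]    # hit every day
old_archive = [now - 400 * day, now - 200 * day]         # barely touched

print(f"this week's orders: {temperature(this_weeks_orders, now):.2f}  (hot)")
print(f"old archive:        {temperature(old_archive, now):.2f}  (cold)")
```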

In almost any business system this year's data will be considered hot, because it's accessed frequently and at the transactional level. People, for example, will want to know things like "What did we sell to Customer X last week?" and "Have we fulfilled Customer Y's order placed yesterday?"

Much of the data held in a warehouse is cold - the bulk of its content is likely to be the transactional data from, say, the previous four years - with a much smaller quantity of very hot data from the current year.

Broadly speaking, there are three classes of disk that serve us, which you can see in the table below. The price of one of them - Solid State Drives - is, as I wrote here, plummeting. In that earlier article, I also discussed the use of SSDs in databases and, as may already be apparent, the concept of hot and cold data is particularly applicable to data warehouses.

Disk type                 | Power consumption | Capacity | Speed      | Cost
--------------------------|-------------------|----------|------------|--------------------
Slow hard disks (5K rpm)  | 6-8 W             | High     | Slow       | Cheap
Fast hard disks (15K rpm) | 16 W              | Medium   | Fast       | Expensive
Solid-state drives (SSDs) | 150 mW            | Low      | Blistering | Painfully expensive
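Given those three classes, the placement policy the rest of this article discusses can be sketched as a simple threshold rule. The thresholds and tier names below are invented for illustration:

```python
# Thresholds and tier names are invented, not taken from any vendor.
TIERS = [
    (50.0, "SSD"),           # only the very hottest data: capacity is scarce
    (5.0,  "15K rpm HDD"),   # warm data: fast spinning disk
    (0.0,  "5K rpm HDD"),    # cold data: cheap bulk storage
]

def choose_tier(temp):
    """Walk the tiers from hottest to coldest; return the first match."""
    for threshold, tier in TIERS:
        if temp >= threshold:
            return tier

for t in (120.0, 20.0, 0.3):
    print(f"temperature {t:6.1f} -> {choose_tier(t)}")
```

The mapping itself is trivial; the hard part, as we'll see, is keeping the temperatures up to date.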

But surely any DBA with any sense will already be partitioning tables, putting frequently accessed data onto fast media and less frequently accessed data onto slow. Indeed, we've been handling data this way for years, archiving older material off to tape. So why the sudden interest in data temperature?

The answer is that two factors - the widening spread between hot and cold data, and the growing diversity of storage media - are making data management more complex. More accurately, they mean we can gain significant performance advantages if we are prepared to manage the data in more complex ways.

For example, the obvious approach is to put the cold data on slow, cheap disks and, ignoring cost implications, the hot data on SSDs. What may be less obvious is a counter-intuitive argument for putting some hot data on slow disks. A slow disk holding mainly cold data should, by definition, spend much of its time in an underworked state.

If you add a little hot data to that disk, the chances are that when a request for it arrives, the disk won't be busy doing anything else, and the hot data will be delivered rapidly - if not piping hot, at least moderately warm. Neat.

Essentially it is a matter of bottlenecks, and the queue at a slow disk holding archived data will often be empty. You are simply balancing probabilities to take advantage of this fact: it won't work every time, but in practice it tends to improve performance rather than degrade it.
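You can put rough numbers on that with textbook queueing arithmetic. The sketch below uses the M/M/1 mean-response formula and invented request rates - it illustrates the balance-of-probabilities argument, it doesn't model any real disk:

```python
def mm1_mean_response(arrival_rate, service_rate):
    """Mean time in system for an M/M/1 queue: 1 / (mu - lambda)."""
    assert arrival_rate < service_rate, "queue would grow without bound"
    return 1.0 / (service_rate - arrival_rate)

# Invented figures: a fast disk saturated with warehouse traffic
# versus a slow archive disk that is nearly idle.
busy_fast = mm1_mean_response(arrival_rate=180, service_rate=200)
idle_slow = mm1_mean_response(arrival_rate=2, service_rate=80)

print(f"busy fast disk: {busy_fast * 1000:5.1f} ms per request")  # 50.0 ms
print(f"idle slow disk: {idle_slow * 1000:5.1f} ms per request")  # 12.8 ms
```

The slow disk wins not because it is quick, but because nothing is queueing for it.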

And once you have decided to put hot data on a slower disk, it makes sense to put that data onto the outer tracks of the platter, because data there can often be read off faster. If you are, by now, thinking "Whoa, this may be theoretically correct, but it's way too difficult to manage in practice", you're right.
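That outer-track effect, incidentally, is simple geometry: the platter spins at a constant rate and bits are laid down at a roughly constant linear density, so an outer track carries more data past the head per revolution. A back-of-envelope sketch, with invented figures:

```python
import math

RPM = 5400                   # spindle speed, constant across the platter
REVS_PER_SEC = RPM / 60.0    # 90 revolutions per second
BITS_PER_MM = 60_000         # hypothetical linear bit density along a track

def track_mb_per_sec(radius_mm):
    """Sequential throughput when streaming one track at this radius."""
    bits_per_rev = 2 * math.pi * radius_mm * BITS_PER_MM
    return bits_per_rev * REVS_PER_SEC / 8 / 1e6

print(f"inner track (r = 20 mm): {track_mb_per_sec(20):5.0f} MB/s")
print(f"outer track (r = 45 mm): {track_mb_per_sec(45):5.0f} MB/s")
```

Sequential throughput scales directly with the radius - here the outer track streams more than twice as fast as the inner one.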

Understanding that data has a temperature and that disks vary in performance lets us do a great deal manually. The problem is that the temperature of data changes over time, and the fine-tuning is so complex that in practice it simply doesn't get done.

This is exactly the kind of situation that is far better managed under algorithmic control, which can position data according to its actual usage. In fact, this way of handling hot and cold data sits extremely well with Massively Parallel Processing (MPP) architectures.
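The general shape of such an algorithm is easy to sketch, even if real engines do it far more elaborately: warm a block on every access, cool everything periodically, and migrate whatever crosses a tier boundary. All the names and thresholds below are invented:

```python
import collections

class TemperatureTracker:
    """Warm blocks on access, cool them each epoch, and report
    tier-boundary crossings. Thresholds are illustrative only."""

    def __init__(self, decay=0.5):
        self.decay = decay                       # cooling factor per epoch
        self.temp = collections.defaultdict(float)
        self.tier = {}                           # block -> current tier

    def record_access(self, block):
        self.temp[block] += 1.0                  # every read warms the block

    def end_of_epoch(self):
        """Cool everything, then list the blocks that should move."""
        moves = []
        for block in list(self.temp):
            self.temp[block] *= self.decay
            new_tier = ("SSD" if self.temp[block] > 8 else
                        "fast HDD" if self.temp[block] > 1 else
                        "slow HDD")
            if self.tier.get(block) != new_tier:
                moves.append((block, self.tier.get(block), new_tier))
                self.tier[block] = new_tier
        return moves

tracker = TemperatureTracker()
for _ in range(20):
    tracker.record_access("orders_current")      # hammered all day
tracker.record_access("orders_archive")          # touched once

print(tracker.end_of_epoch())
# [('orders_current', None, 'SSD'), ('orders_archive', None, 'slow HDD')]
```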

And this is already being done. Teradata, for example, is way ahead of the curve in this area: The data is monitored for temperature and automatically moved to the optimal storage location. Teradata has also been interested in SSDs for years.

Teradata's director of platform and solutions marketing for EMEA, Martin Wilcox, said: "Not all database engines are created equal when it comes to data access. In contrast to most other engines, Teradata has always been optimized for random I/O patterns, rather than assuming sequential access, and is therefore inherently capable of immediately exploiting all of the advantages that SSDs have to offer."

Teradata has pushed the boundaries of SSD use even further. At its conference a year ago, the company demonstrated a data warehouse system running entirely on SSDs, which it was using to evaluate the performance and cost benefits - and the price of SSDs has done nothing but plummet since then.

Thinking about data temperature can help us tune our databases better; badgering your database engine supplier to add it as a feature is better still. ®
