Feeds

What your database needs is a good thermometer

Data blows hot and cold

Choosing a cloud hosting partner with confidence

In the very early days of databases, when they contained relatively small quantities of information, the distinction between frequencies of access was pretty immaterial.

Now that we collect and store data in massive quantities, however, the distinction is becoming increasingly important. And apposite terms have evolved to describe this facet of data: "Hot" data is accessed frequently, and "cold" is accessed less frequently.

In almost any business system this year's data will be considered hot, because it's accessed frequently and at the transactional level. People, for example, will want to know things like "What did we sell to Customer X last week?" and "Have we fulfilled Customer Y's order placed yesterday?"

Much of the data held in a warehouse is cold - the bulk of its content is likely to be the transactional data from, say, the previous four years - with a much smaller quantity of very hot data from the current year.

Broadly speaking, there are three classes of disk that serve us, which you can see in the table below. The price of one of them - Solid State Drives - is, as I wrote here, plummeting. In that earlier article, I also discussed the use of SSDs in databases and, as may already be apparent, the concept of hot and cold data is particularly applicable to data warehouses.

Disk Type Power consumption Capacity Speed Cost
Slow hard disks (5K rpm) 6 – 8 watts High Slow Cheap
Fast hard disks (15K rpm) 16 watts Medium Fast Expensive
SSD (Solid State Disks) 150mW Low Blistering Painfully expensive

But surely any DBA with any sense will already be partitioning tables, putting frequently accessed data onto fast media and less frequently accessed data onto slow and so on. In fact, we've been handling data in this way for many years, archiving older material onto tape and so on. So, why the sudden interest in data temperature?

The answer is that these two factors - the increasing diversity in the heat of the data, combined with the changing storage media - mean that the management of the data is becoming more complex. Slightly more accurately, it means that we can gain significant performance advantages if we are prepared to manage the data in more complex ways.

For example, the obvious approach is to put the cold data on slow cheap disks and, ignoring cost implications, the hot on SSDs. What may be less obvious is a counter-intuitive argument for putting some hot data on slow disks. A slow disk holding mainly cold data should, by definition, spend much of its time in an underworked state.

If you add a little hot data to the disk the chances are that when a request for it arrives, the disk won't be busy doing anything else and the hot data will be delivered rapidly and, if not piping hot, at least moderately warm. Neat.

Essentially it is a matter of bottlenecking, and the bottleneck on a slow disk for archived data may sometimes be zero. You are simply balancing probabilities to take advantage of this fact: It won't work every time but in practice it is found to improve performance rather than degrade it.

And once you have decided to put hot data on a slower disk, it makes sense to put that data onto the outer sectors of the disk because data there can often be read off the disk faster. And if you are, by now, thinking "Whoa, this may be theoretically correct but it's way too difficult to manage in practice," you're right.

By understanding that data has temperature and disks vary in performance, we can do a great deal manually but in practice the problem is that the temperature of the data changes over time and the fine-tuning is so complex that it is unlikely to be done in practice.

This is exactly the kind of situation that is far better managed under algorithmic control, which can determine the positioning of data depending on its actual usage. In fact, this way of handling hot and cold data sits extremely well with handling data as Massively Parallel Processing (MPP).

And this is already being done. Teradata, for example, is way ahead of the curve in this area: The data is monitored for temperature and automatically moved to the optimal storage location. Teradata has also been interested in SSDs for years.

Teradata's director or platform and solutions marketing for EMEA Martin Wilcox said: "Not all database engines are created equal when it comes to data access. In contrast to most other engines, Teradata has always been optimized for random I/O patterns, rather than assuming sequential access, and is therefore inherently capable of immediately exploiting all of the advantages that SSDs have to offer."

Teradata has pushed the boundaries of SSD use even further. At its conference a year ago the company demonstrated a data warehouse system running entirely on SSDs and was using it to evaluate performance and cost benefits - and the price of SSDs has done nothing but plummet since then.

Thinking about data temperature can help us to tune our databases better, and badgering your database engine supplier to think about adding it as a feature is even better. ®

Internet Security Threat Report 2014

More from The Register

next story
Be real, Apple: In-app goodie grab games AREN'T FREE – EU
Cupertino stands down after Euro legal threats
Download alert: Nearly ALL top 100 Android, iOS paid apps hacked
Attack of the Clones? Yeah, but much, much scarier – report
You stupid BRICK! PCs running Avast AV can't handle Windows fixes
Fix issued, fingers pointed, forums in flames
Microsoft: Your Linux Docker containers are now OURS to command
New tool lets admins wrangle Linux apps from Windows
Bada-Bing! Mozilla flips Firefox to YAHOO! for search
Microsoft system will be the default for browser in US until 2020
Facebook, working on Facebook at Work, works on Facebook. At Work
You don't want your cat or drunk pics at the office
Soz, web devs: Google snatches its Wallet off the table
Killing off web service in 3 months... but app-happy bonkers are fine
prev story

Whitepapers

Why cloud backup?
Combining the latest advancements in disk-based backup with secure, integrated, cloud technologies offer organizations fast and assured recovery of their critical enterprise data.
A strategic approach to identity relationship management
ForgeRock commissioned Forrester to evaluate companies’ IAM practices and requirements when it comes to customer-facing scenarios versus employee-facing ones.
How to determine if cloud backup is right for your servers
Two key factors, technical feasibility and TCO economics, that backup and IT operations managers should consider when assessing cloud backup.
Reg Reader Research: SaaS based Email and Office Productivity Tools
Read this Reg reader report which provides advice and guidance for SMBs towards the use of SaaS based email and Office productivity tools.
Choosing a cloud hosting partner with confidence
Download Choosing a Cloud Hosting Provider with Confidence to learn more about cloud computing - the new opportunities and new security challenges.