Original URL: https://www.theregister.com/2013/03/21/big_data_diet/

Time to put 'Big Data' on a forced diet

There ain't nothing cheap about big storage

By Dave Cartwright

Posted in On-Prem, 21st March 2013 08:02 GMT

Data is big business. These days they've even started calling it “Big Data”, just in case its potential for unbridled magnitude had escaped anyone. Of course, if you have Big Data you need somewhere to put it. Hence storage is also big business.

On the one hand this is a good thing, but that's just because several of my relatives work in sales in the storage industry - which means the commission they earn from selling another couple of petabytes of disk space can usefully be redeployed in buying me the occasional pint.

In general, however, storage is a bad thing if you don't use it sensibly.

A company I worked with a while back had a problem: their server estate was growing and their SAN was running low on space. The obvious question came up: what's the cost of adding another array to expand the space available? With disks considered a commodity these days, a nice cheap quotation was expected, but imagine the look of surprise when they were told: “It doesn't really matter what it costs to buy, there's no space in the server room to put it.”

So they asked themselves the question: can we use this stuff more wisely? And the answer was “yes” – with some relatively simple steps they could free up well over a terabyte of storage. And while that doesn't sound very much with regard to modern technology (I just picked a random IT retailer's website and found a 1TB drive for £53, for instance, and for that matter the Apple Time Capsule on my desk is a 3TB unit), it's a far bigger deal if that terabyte is enterprise-class storage, with super-high-speed switched Fibre Channel connectivity, in a high-availability configuration, with complex compression algorithms eking out every last byte of its available capacity.

Storage looks cheap in theory, but enterprise-class storage is not, because of all the complex technology you need to wrap around it to make it usable and useful.

Do you even KNOW what you're storing?

In this case, the solution to the problem was to look at some of the items that were stored on the disks and decide that they really didn't need quite so many copies of the various backups of backups of backups that had materialised on the disks over the years.
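For anyone wanting to hunt down their own backups of backups, here is a minimal Python sketch of one way to do it: group files by size, hash the ones that share a size, and report the groups with identical content. It is an illustration, not what the company in question actually ran; the function names (`file_digest`, `find_duplicates`, `reclaimable_bytes`) are made up for this example, and on a real file server you would want to test it on a small directory first.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Map (size, digest) to the sorted paths under root sharing that content.

    Only groups with two or more files are returned.
    """
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so the same file is not counted twice.
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        for path in paths:
            by_digest[(size, file_digest(path))].append(path)
    return {key: sorted(paths) for key, paths in by_digest.items() if len(paths) > 1}


def reclaimable_bytes(duplicates):
    """Bytes that would be freed if each group kept only one copy."""
    return sum(size * (len(paths) - 1) for (size, _digest), paths in duplicates.items())
```

Calling `find_duplicates` on a directory and passing the result to `reclaimable_bytes` gives a rough figure for how much space the redundant copies are soaking up. Deciding which copy to keep is still a job for a human.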

And you know what? In the years I've been in IT, I've lost count of the number of clients who have moaned about running out of storage space (and, more frequently, running out of hours in the day to run backups) but who, when pressed to identify their data, have been unable to explain satisfactorily just what happened to all their free space.

Data management in the average organisation is, frankly, appalling. In a way, though, that's understandable: unless you employ a team of storage Nazis to interrogate everyone regularly about their files, you really don't stand a chance of keeping tabs on your storage requirements.

And let's face it, if you're given the choice of deleting a document (and running the risk of needing it later) or hanging onto it (just in case), what are you going to do? Nobody ever got fired for keeping a document they believed might be needed one day, but I bet many have been fired for doing the opposite.

So how do you trim down? Here are three questions to ask yourself.

Data: How can I squeeze as much as possible of it onto the disks?

As I’ve already mentioned, the disk itself is one of the cheaper, more commoditised elements of the storage system; it's all the SAN fabric and controllers that you wrap around it that ramp the cost up.

If you want to compress data, the obvious way to go is to employ a compression algorithm to encode more data into less space. And since the compromise is that compressing stuff slows access times down, you then employ expensive ASIC-based compression to speed it up again (“ker-ching!”). De-duplication is also a complete no-brainer, particularly if you live in a virtual world – for instance, SAN controllers that store a single physical copy of something and present it as multiple virtual entities do a spectacular job of optimising the storage of, say, dozens of Windows VMs that are all packed with bazillions of identical system files.

At the very least, then, turn on the optimisation features of your storage hardware. You've paid for them, after all, so use them.
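To get a feel for why de-duplication and compression pay off, the two ideas can be sketched in a few lines of Python. This is a toy model only: it assumes fixed 4KB blocks and uses zlib as a stand-in compressor, whereas a real storage controller will use its own block sizes, algorithms and hardware. The function names are invented for the example.

```python
import hashlib
import zlib


def dedup_stats(data, block_size=4096):
    """Split data into fixed-size blocks and return (total_blocks, unique_blocks).

    Blocks are compared by SHA-256 digest, so identical blocks count once.
    """
    seen = set()
    total = 0
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        seen.add(hashlib.sha256(block).digest())
        total += 1
    return total, len(seen)


def compressed_ratio(data, level=6):
    """Return compressed size as a fraction of the original (lower is better)."""
    if not data:
        return 1.0
    return len(zlib.compress(data, level)) / len(data)
```

Feed `dedup_stats` a sample that contains the same block many times over, as a pile of near-identical VM images would, and the gap between total and unique blocks is the space a de-duplicating array would not have to store.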

Storage: What sort do I need?

It's no surprise if your high-end database applications need high-speed SAN storage in order to ensure they perform adequately. What's interesting is that even today it's rare to see a software product's data sheet cite the IOPS (input/output operations per second) requirement of the product. Just recently I was reading the spec of a software product and did a double-take at the fact that it actually cited an IOPS figure.

The point is that loads of your applications and users won't need super-fast disk. You can mix your storage infrastructure to match your storage requirements: so you may have SATA disks for your less onerous systems and Fibre Channel for the heavy stuff, with SAN-connected storage for heavy processing and iSCSI or even NAS-style (NFS or CIFS) presentation for lighter loads. By choosing your storage wisely, even if you don't manage to reduce the space you decide you need to buy, you can at least lop a zero off the price tag of some areas of it.

Data: Come on, spit it out: do you really need it all?

At the end of the day, though, you're always going to end up at this question and if you're being honest you're always going to answer it: “No, of course not”.

Take my personal data collection, for example. I have a box of DVDs instead of the home office server and the vast raft of external hard disks I used to keep hanging around, as I realised that in fact I probably dig one thing per year out of the archive. Now look at the average business user (especially if they're a techie): it doesn't take that many graphics-heavy PowerPoints, downloaded ISO DVD images, backups of mail files and the like to soak up a few hundred gigabytes. Multiply this by a few hundred staff, and even with your de-duping and compression hammering away for all they're worth, your storage requirements will spiral out of control.

The thing with storage capacity management is you tend only to do something about it when you're panicking. Some organisations actively monitor and manage capacity, but most don't, which means that they only do anything about it when either (a) they get close to capacity and stuff starts slowing down, or (b) the overnight backup finally tips over the end of its window and they can't get <insert name of mission-critical system> back online before the start of business the next morning.
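Monitoring before the panic sets in need not be elaborate. The sketch below, in Python, covers both failure modes just described: a volume creeping towards full, and a backup that no longer fits its window. The 80 and 90 per cent thresholds, the function names and the throughput figure you would feed in are all assumptions for illustration; pick values that suit your own kit.

```python
import shutil


def capacity_status(used, total, warn=0.80, critical=0.90):
    """Classify a volume as 'ok', 'warning' or 'critical' by fraction used."""
    if total <= 0:
        raise ValueError("total must be positive")
    fraction = used / total
    if fraction >= critical:
        return "critical"
    if fraction >= warn:
        return "warning"
    return "ok"


def check_volume(path, warn=0.80, critical=0.90):
    """Look up real usage for the volume holding path and classify it."""
    usage = shutil.disk_usage(path)
    return capacity_status(usage.used, usage.total, warn, critical)


def backup_hours(data_gb, throughput_mb_per_s):
    """Estimate how long a full backup takes at a sustained throughput."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return data_gb * 1024 / throughput_mb_per_s / 3600


def fits_window(data_gb, throughput_mb_per_s, window_hours):
    """True if the estimated backup time fits inside the backup window."""
    return backup_hours(data_gb, throughput_mb_per_s) <= window_hours
```

Run something like `check_volume` from a scheduled job and alert on anything other than "ok", and you at least hear about the problem before the users do.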

So be ruthless with your data. Of course you need to keep much of it: if you're a business then the law requires you to do so, and of course in order to actually do business you also need much of it. But review it frequently, and do something about it proactively.

If you don't need it online, store it offline and educate people on how to get to it should they need it. If you don't really need it readily accessible but can't face binning it, archive it to tape and store the tapes safely.
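Finding the candidates for offline storage or tape can be roughed out in Python by listing files nobody has modified for a while. This is a sketch under assumptions: the one-year cut-off is arbitrary, modification time is only a proxy for "nobody needs this", and `stale_files` is a name made up for the example.

```python
import os
import time


def stale_files(root, max_age_days=365, now=None):
    """Yield (path, size_in_bytes) for files not modified within max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_mtime < cutoff:
                yield path, st.st_size
```

Summing the sizes it yields gives a first estimate of how much could move off the expensive disks. The list itself is the starting point for a conversation with the owners, not a deletion script.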

But I'll bet that after you've done all this, you'll still have hundreds of gigabytes of stuff that you actually, genuinely, really don't need. So identify it, grin smugly to yourself and throw it away. ®

Dave Cartwright is a senior network and telecoms specialist who has spent 20 years working in academia, defence, publishing and intellectual property. He is the founding and technical editor of Network Week and Techworld and his specialities include design, construction and management of global telecoms networks, infrastructure and software architecture, development and testing, database design, implementation and optimization. Dave and his family live in St Helier on the island paradise of Jersey.