The Register® — Biting the hand that feeds IT

Big Data bites back: How to handle those unwieldy digits

When you can't just cram it into tables

Data is easy. It comes in tables that store facts and figures about particular items – say, people. The columns define the data to be stored about each item (such as FirstName, LastName) and there is one row for each person. Most tabular database engines are relational and we use SQL for querying. So this "Big Data" thang must simply be very, very big tables with lots and lots of rows.

It’s a tempting definition, but inaccurate. Big Data describes a genuinely different class of data. The best definition that I know is more or less a negative one: Big Data is any data that doesn’t fit well into tables and that generally responds poorly to manipulation by SQL.

So we have to find other ways to store it and to analyse it. To understand why, think about what tables and SQL do well. The tables can store relatively complex data about very large numbers of very similar items.

In SQL, the SELECT clause lets you choose the columns you want to see; in other words, it lets you subset the table by columns. The WHERE clause lets you choose the rows. So SQL is very good at subsetting data. It can, of course, do more: it can join tables and it can summarise (GROUP BY). But essentially it is designed to go into a large set of well-organised data, extract a subset and present it to you in an answer table.
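As a minimal sketch of subsetting by column and by row, here is a throwaway example using Python's built-in sqlite3 module. The people table reuses the FirstName and LastName columns mentioned earlier; the Age column, the sample rows and the function name are invented purely for illustration.

```python
import sqlite3

def subset_people(min_age):
    """Return (FirstName, LastName) for people aged min_age or over."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute(
        "CREATE TABLE people (FirstName TEXT, LastName TEXT, Age INTEGER)"
    )
    # Made-up sample rows, not real data.
    conn.executemany(
        "INSERT INTO people VALUES (?, ?, ?)",
        [("Ann", "Archer", 30), ("Bob", "Baker", 45), ("Cat", "Cooper", 60)],
    )
    # SELECT subsets by column; WHERE subsets by row.
    rows = conn.execute(
        "SELECT FirstName, LastName FROM people "
        "WHERE Age >= ? ORDER BY LastName",
        (min_age,),
    ).fetchall()
    conn.close()
    return rows
```

The answer is itself a small table: fewer columns and fewer rows than the original, but the same shape.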

In direct contrast, Big Data is pretty varied in structure and, even if we do cram it into a table, the analysis we run against it isn’t usually subsetting.

Picture it

The easiest way to illustrate this is by example. Imagine a digitised X-ray image, perhaps a JPEG file. You want to analyse it algorithmically, looking for bright spots of a particular size, shape and intensity. Or, if this sounds too tame, think about scanning satellite images for little cruciform and deltoid shapes.

You can, of course, tabularise an image file, creating one row for each pixel and columns for X position, Y position, intensity and so on. One problem is that you end up with a narrow, very deep table, which is unwieldy. In terms of analysis, SQL can very easily find the bright pixels; however, it is a very poor tool for deciding which groups of rows represent all the pixels in a single spot.
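To make the contrast concrete, here is a rough sketch in Python of the two steps. It assumes the pixel table is a list of (x, y, intensity) rows and treats a "spot" as a group of bright pixels that touch horizontally or vertically. The function name, the threshold and the four-neighbour rule are all illustrative assumptions, not a description of any real imaging pipeline.

```python
from collections import deque

def find_spots(pixels, threshold):
    """Group bright pixels into spots (4-connected components).

    pixels: iterable of (x, y, intensity) rows, as in the pixel table.
    Returns a list of spots, each a sorted list of (x, y) positions.
    """
    # The easy, SQL-like step: subset the rows by intensity.
    bright = {(x, y) for x, y, intensity in pixels if intensity >= threshold}

    # The hard step: decide which bright rows belong to the same spot.
    spots = []
    unvisited = set(bright)
    while unvisited:
        start = min(unvisited)  # deterministic starting pixel
        unvisited.remove(start)
        queue = deque([start])
        spot = [start]
        while queue:
            x, y = queue.popleft()
            for neighbour in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if neighbour in unvisited:
                    unvisited.remove(neighbour)
                    queue.append(neighbour)
                    spot.append(neighbour)
        spots.append(sorted(spot))
    return spots
```

The first step is a one-line filter, the equivalent of a WHERE clause. The second is a graph traversal over neighbouring pixels, which is exactly the kind of work that sits awkwardly in plain SQL.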

“But,” you cry, “I’d use a User Defined Function for that!” Yes, so would I. In my experience all data can be squeezed into a table and analysed in a relational database system, but at some point the effort required makes you think about other, more suitable, containers and alternative analytical languages.

Given that we are defining Big Data as “Not tabular” we aren’t saying that all Big Data is similar in structure. So, in a diagram of all possible data, there is a subset where the structure is well-defined (tabular data) and there is the rest – which we are now calling Big.

The name itself comes from the volume, which is usually huge, and that brings us neatly to the three “V”s that are often used to characterise Big Data:

  • Volume: Big Data often appears in huge volumes – think terabytes and petabytes
  • Velocity: It tends to come at you very fast – think Twitter feeds
  • Variety (of structure): see above

I have no (particular) problem with these three Vs; I’ve even seen some additions:

  • Value: if it isn’t valuable, why are you storing and analysing it?
  • Veracity: It has to be accurate otherwise your analysis is worthless

But I will admit to being slightly sceptical of definitions driven by the desire for absolute alliteration.

So, for me, despite the name, the most important feature of Big Data is its structure, with different classes of big data having very different structures.

With that definition, we can start to look at examples. A Twitter feed is Big Data; the census isn’t. Images, graphical traces, Call Detail Records (CDRs) from telecoms companies, web logs, social data, RFID output can all be Big Data. Lists of your employees, customers, products are not.

So how can you store and manipulate Big Data? The answer depends on the structure of your particular flavour, but take a look at the large – and increasing – number of NoSQL database systems out there, for example Cassandra, CouchDB and MongoDB.

Ultimately, it is worth remembering that Big Data and its associated database systems are not in competition with existing relational systems. The analysis of tabular data is not going away, but it was only ever part of the story.

In the 1970s and '80s we tackled tabular data because it is common and (relatively) easy to store and manipulate. I say "relatively easy" because it took us at least 30 years to develop a good understanding of tabular data and transactions.

Big Data has always been there; we just couldn’t process it very well. That’s now changing and we are finally taking on the much harder – but very rewarding and lucrative – job of tackling it. It’s a big job. ®

Mark Whitehorn holds the chair of analytics at the University of Dundee. His role involves working on data output from mass spectrometers, two-dimensional graphical traces of three-dimensional peaks that must be detected and their volumes calculated. The trick isn’t to do the sums; it’s to do them rapidly because another 8Gbyte output file is always coming.

Sorted! [Was: Re: Structure]

Hard to blame this one on the relational database model when what it is, in fact, is a failure to exercise proper user interface design. Had you not stopped breaking it down partway through, you'd see your way clear of the problem without having to be told -- but, lucky for you, here's me to do the telling.

Street number is /^[0-9]+$/, or, if you prefer, /^[0-9]+[a-z-]{0,3}$/ -- and if you get the latter form, you split it at the digit-letter boundary, then stuff the leading number into your "street number" field, and the trailing gubblick minus any punctuation into something like "suite number" or "flat number" or whatever you like.
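For what it's worth, the split described above might look something like this in Python. This is only a sketch: the function name is made up, and lowercasing the input before matching is an assumption, since the pattern as written only accepts lowercase letters.

```python
import re

# The looser pattern from above: digits, then up to three letters or hyphens.
STREET_NUMBER = re.compile(r"^([0-9]+)([a-z-]{0,3})$")

def split_street_number(raw):
    """Return (street_number, unit), or None if the input doesn't match."""
    # Assumption: trim whitespace and lowercase before matching.
    match = STREET_NUMBER.match(raw.strip().lower())
    if match is None:
        return None
    number, suffix = match.groups()
    unit = suffix.replace("-", "")  # drop punctuation from the trailing part
    return number, unit
```

So "221B" comes back as ("221", "b"), a plain "42" comes back with an empty unit, and anything like a house name returns None so the form can ask again.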

If you absolutely have to handle house names and rural stuff like "last on the left" along with numbers -- which, I dunno how it is in the UK, but it's extremely rare in the United States to encounter an address which doesn't have a number, even if that number is e.g. "#240, Carrier Route 461" -- then you can present a dropdown or other choose-one-of-many control, with options for "House number" and "House name", and use that info to present the appropriate input control or controls. That way, you get your semantic info (which is what you mean when you say "syntactic information" there), you're unambiguous about what you're receiving, and the user is neither confused nor annoyed by UI controls which he doesn't need to fill in and which don't behave the way he's expecting.

Address lines: Why do you need them to remain individual lines in the database? Just join them with "; " on the way in, and dump any empty fields ("empty" here matching /^\s*$/) before the join. That way, you don't need to denormalize the address fields or split them into a separate table; you can still do fulltext search across them; and if you ever need to get the multiple lines back again, say for printing an address label or similar, you can just split on "; " and Bob's your uncle. If that's not good enough for you, then split the address lines off into a separate table keyed by address entry ID, like you're talking about wanting people to do -- I mean, if you're trying to build a normalized schema, that's not too far to go, right?

(OK, I guess someday somebody might put a semicolon into one of the address fields, but you can reject it during form validation -- you are validating your forms, right, before you hand them to your poor unsuspecting backend? If not, you can fuck right off 'til you've learned the basics of your craft!)
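A bare-bones sketch of that join-and-split round trip, in Python, with the semicolon check folded in. The function names are invented, and stripping whitespace from each line before joining is an extra assumption on top of what's described above.

```python
import re

SEPARATOR = "; "
BLANK = re.compile(r"^\s*$")  # the definition of "empty" given above

def join_address_lines(lines):
    """Drop empty lines, reject semicolons, and join the rest with '; '."""
    # Assumption: surrounding whitespace on each kept line is trimmed.
    kept = [line.strip() for line in lines if not BLANK.match(line)]
    for line in kept:
        if ";" in line:
            raise ValueError("address lines must not contain ';'")
    return SEPARATOR.join(kept)

def split_address_lines(stored):
    """Recover the individual lines, e.g. for printing an address label."""
    return stored.split(SEPARATOR) if stored else []
```

Feeding it ["Flat 2", "  ", "10 High Street", ""] stores "Flat 2; 10 High Street", and splitting that gives the two original lines back.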

Each of city, state/province/whatever, and country is a field of its own, which solves your problem of lacking semantic information -- splitting out the fields makes it entirely unambiguous what's what. Ideally, you'd present them in that order, but you probably need to know the country in order to know whether to present state or province, so you probably want to ask for country first -- in fact, since the country has the largest effect on how the address is formatted in any case, you probably want to ask for that before anything else, and present address fields appropriate to the format in question.

Once you know the country, you know enough to choose between state or province, and also to choose between presenting a "ZIP code" or a "postcode" field -- no need to show both, because that's incompetent, inelegant, and sloppy; there's also no need for a page load in between choosing country and the rest of it, not if you know what you're doing with jQuery, or if not jQuery then whatever inferior JavaScript library you prefer to use. Similarly, if you work your way inward from the broadest possible category -- i.e., country first, then province, then &c., &c. -- at each step you get enough information to decide what to present in the next. With any luck you've only got maybe a half-dozen address formats to cope with, instead of forty, but even in the latter case you can still handle it properly this way, and without inflicting the agony of unmaintainability on yourself -- the effort involved scales linearly with each new address format, rather than exponentially, so it might be a big pain in the ass but at least it's manageable.

And why are you relying for a primary key on anything the user gives you? That's what an autoincrement field is for!

As with the apparent lack of guaranteed street numbering, I don't know how it is in the UK, but here in the States, you can skip all this bloody mess entirely by using the USPS's "Web Tools" address normalization API. That takes whatever you care to give it, and responds with either a validated and normalized equivalent of that address, or an "I dunno wtf" if the input it gets is too bogus for it to comprehend. My experience with that API has been very good, and in the rare case where it can't normalize something, I have no problem with presenting the user a "Sorry, but the Postal Service wasn't able to verify your address. Please double-check..." sort of response.

Again, dunno if the Royal Mail offers anything similar, but if they don't they bloody well should, and if they do then you should bloody well use it. (And if they don't, or they do but you can't, then you're more or less on your own, sure -- but if you listen to what I'm telling you here, you should be able to do a pretty damn solid job of it, even without being backstopped by the agency which defines whatever addressing format you find yourself having to deal with.)

There! Sorted -- and every bit of what I describe here I have done, in production, on real websites used by real people and developed on behalf of real clients. Go thou and do likewise.

Really good short article. I've never had to work with unstructured data, just tabular, and this was very insightful for me. More of the same please.

@BlueGreen -- Re: This is an ongoing argument with little consensus.

Essentially, I agree with you.

I've not hardline or extreme views on this issue one way or the other. In fact, perhaps, if pushed, I'd class myself more an irritated spectator although regularly I hit the boundaries of most storage/file systems because I've been involved in sorting/storing very diverse forms of data in the same machine/same location.

Both nomenclature and type/class are problems. Naming conventions are problematic with little or no agreement on the methods of naming, file naming, truncating and rematching truncated with originals etc. And data itself is so variable, for instance, a 950MB TIF file (the largest on this machine) coexists with files of as little as zero length, that is they only exist as name metadata.

Let me give you a practical example. I regularly hit the file-name/path-length limit of 255/260 characters in NTFS. Name a file with a long filename in a top directory then move it down a few nested directories and the file becomes inaccessible as filename/'MAX_PATH' now exceeds the allowable limit. OK you say, how about making the filename/path smaller? Yes, I can devise a method that's fine for me but how does one do it in a standard way that someone who doesn't know my method of truncating/abridging can understand? It's not only an IT/computer problem but one that libraries, repositories, armies, NATO and such have faced since time immemorial.

Let's take a real world example. I find an old book, say on the Internet Archive, so how do I allocate it a totally unambiguous and unique filename--one that would automatically be identical to that produced by everyone else if they'd acted independently? If everyone has to apply only the most elementary rule, then it can only be done by reproducing the book's title page. Rule: 'title page becomes the filename'. OK, let's go to an example, this one for instance: http://archive.org/details/shorttreatiseona00rums At first glance, selecting a book with a very long title might seem extreme but in practice it's not. Almost every technical/scientific/geographic/philosophical book written before 1900 had very long titles as a matter of course, the same today goes for scientific articles and such (and there's also the problem of the abstract info).

The Internet Archive knows it's not practical to apply the title page rule to the filename so it allocates a unique identifier which becomes the filename, specifically, 'shorttreatiseona00rums'. By adding an extension (.pdf, .djvu, etc.) we get the different available formats. When this unique identifier is used with the IA's RDBMS all problems are solved, we've a very workable library.

Trouble is, with no agreed universal system of nomenclature, the unique identifier is only known to or accepted at the IA. At the other extreme, if we apply the obvious title-page rule then we'd end up with a filename something like this:

"A short treatise on the application of steam, whereby is clearly shewn, from actual experiments, that steam may be applied to propel boats or vessels of any burthen against rapid currents with great velocity. The same principles are also introduced with effect, by a machine of a simple and cheap construction, for the purpose of raising water sufficient for the working of grist-mills, saw-mills, &c. and for watering meadows and other purposes of agriculture; Rumsey, James, 1743?-1792; Publisher: Printed by Joseph James: Chestnut-Street Philadelphia, 1788_location: ULS Lib, copy #: xyz123"

Without any extension, this name is 593 characters long, which is well over double the NTFS limit of 255, and this is only the beginning of the problems, as:

- NTFS is not a database file system, so even if it accepted longer filenames other user-metadata cannot be included in the file information. (Microsoft promised WinFS, a database filing system, with Vista but it never eventuated.)

- NTFS/Windows' antiquated and maddening reserved character list means that translations are necessary: the ':' replaced with '--', '/' with '_', '?' with '¿' and so on. This results in inconsistencies and matching errors with other systems.

- The problem isn't limited to NTFS: very few other file systems allow filenames longer than 255 characters. Nevertheless, Microsoft has acknowledged the problem with ReFS (Resilient File System) and it has extended the filename length to 32k in ReFS (although with Win 8 server it'll be limited to only 255 for compatibility): http://blogs.msdn.com/b/b8/archive/2012/01/16/building-the-next-generation-file-system-for-windows-refs.aspx

- Thus the solution is to use NTFS with a RDBMS, but alas there's no universally agreed database standard. (There's more but that'll do.)

So, the horns of the dilemma return, for the time being we're back to where we started.

-------

Your comments

> Depending on what you're trying to do, seems quite doable. index & Toc are separate tables, content is various tables with links. Not saying you should, just that you can.

Yeah right. However, the book paradigm's been around a long while and it's easy to format in any way the author wants as essentially there are no rules dictating where ink etc. goes on a page. This freedom is still harder to achieve in electronic models (e.g.: one can't scribble or type between the lines or in the margins in most wordprocessors or text editors).

> Probably because it would be a damn sight slower & more complex to do this at a very low level, which is why sectors are abstracted away so I never have to see or worry about them, and just use the natty "file" thingies.

Agreed it's more complex, and in the days of the 8272 FDC it would have been difficult to implement and slow. However, today very sophisticated industrial controllers are the norm in all HDs. Someone will have to give me a very good reason why it'd be impractical to implement. Sectoring and formatting is a legacy issue, thus a no-brainer for manufacturers as it's cheaper.

> AFAIK win & linux allow packing of many small files into sectors. Anyone know what this is called?

Variously called tail packing, slack-space packing or block sub-allocation, it tries to use slack space efficiently and it's used in some Linux F/S and compression systems.

> Elsewise, if you think it's such a good idea, you're free to implement your own space efficient FS at an application level, or as a library so everyone can use it.

Yeah, in another life perhaps. Sooner or later--probably later as this stuff isn't user-candy a la iPhones. Eventually it'll happen as we leave the vestiges of 1950s computing behind.

> …hmm, or you could be happy with 16 bits a 2/3 bits unnecessary extra depth (although I'd bet someone would say 16 bits isn't enough depth for a colour).

A few bits don’t matter, but the difference does matter with, say, 36-bit colour versus 48-bit colour. 24-bit colour is, nowadays, severely limiting at the top end of imaging but genuine 48-bit is very difficult--I mean a full 16-bit dynamic range per channel and not 12-bit span shoved into a 16-bit channel. That's why camera manufacturers have RAW formats: writing out junk increases storage space needs and takes extra time.

>…And IBM's Stretch machine which had variable length bytes, was slow, I understand <http://en.wikipedia.org/wiki/IBM_7030_Stretch#Data_formats>

The 7030's before my time but my uni had a 7040 and a CDC 6600.

Just because an idea is ancient doesn’t mean that it's obsolete. Here's the first IT example that comes to mind. Kim Watt of Breeze Computing wrote a utility for the Tandy TRS-80 called 'Super Utility'. Amongst the routines was a gem called 'Format Without Erase'. Take a floppy/storage that was getting a bit flaky and, so long as the read data checksummed OK, FWE would rebuild the magnetic image by reading the data into memory, then formatting the disk and finally laying the data back down on a freshly formatted track.

I can't tell you how many times I've missed this utility since I switched to an IBM machine; it must have been hundreds and hundreds of times.

I simply cannot understand why this isn't a major part of the S.M.A.R.T feature of HDs. If S.M.A.R.T provided a monitoring interface with info about data's magnetic threshold then a Format Without Erase feature built into an O/S could run in the background when machine activity was low to protect HDs. A bright idea that's died.
