There's a tide of unstructured data coming - start swimming

Or you could just work out a plan...

Mountains of insight

Adrian Simpson, chief innovation officer at SAP UK, suggests similar automation intelligence for roles in business processes such as recruitment.

“Having a system in place that can understand a candidate’s CV without the need for human intervention is crucial. It is important to have access to this unstructured information but in a controlled environment to avoid littering databases with mountains of insight,” he says.

We start to see that structuring unstructured data for its own sake is both a waste of time and almost symbolic of some kind of big data science experiment.

This view is echoed by Tibco chief technology officer Matt Quinn, who believes we must question how we are going to use the insights gleaned from unstructured data. Do we need the insight in real time? Will the insight be wasted in six months if we wait that long?

“The approach I often suggest is to use lightweight processing of unstructured information to add important context to structured and actionable real-time data,” he says.

“For example, correlating point-of-sale transactions with social feeds can provide great insight into how a consumer felt about the company and the product – without breaking the bank.”
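As a rough illustration of the lightweight processing Quinn describes, the sketch below attaches a crude sentiment score from social posts to point-of-sale records. Everything in it is an assumption made for illustration: the record layout, the tiny keyword lists, the 24-hour window and the `correlate` function are hypothetical, and a real deployment would use a proper sentiment model and a streaming join.

```python
from datetime import datetime, timedelta

# Hypothetical keyword lists; a real system would use a trained sentiment model.
POSITIVE = {"love", "great"}
NEGATIVE = {"hate", "broken"}

def sentiment(text):
    """Crude score: positive keywords present minus negative keywords present."""
    words = set(text.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

def correlate(sales, posts, window=timedelta(hours=24)):
    """For each sale, sum the sentiment of posts that mention the product
    and were published within `window` after the sale."""
    result = []
    for sale in sales:
        score = sum(
            sentiment(p["text"])
            for p in posts
            if sale["product"].lower() in p["text"].lower()
            and timedelta(0) <= p["time"] - sale["time"] <= window
        )
        result.append({**sale, "sentiment": score})
    return result

# Made-up sample data, purely for illustration.
sales = [{"id": 1, "product": "Kettle", "time": datetime(2014, 4, 1, 9, 0)}]
posts = [
    {"text": "I love my new kettle", "time": datetime(2014, 4, 1, 12, 0)},
    {"text": "kettle arrived broken", "time": datetime(2014, 4, 3, 12, 0)},  # outside the window
]

enriched = correlate(sales, posts)
print(enriched[0]["sentiment"])
```

The structured sale record stays the primary data; the unstructured posts only add a small piece of context to it, which is the point of the approach.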

Genius or idiocy?

Quinn warns we must also consider the reverse if we are indexing and searching unstructured information without understanding relevance and context.

Was the document or data created by someone who is considered to be a thought leader or an idiot? Once again it comes back to context.

The digital universe of western Europe will double every two and a half years

EMC conducted a Digital Universe study with IDC at the end of last year entitled Extracting Value from Chaos. This estimates that the digital universe of western Europe will grow from 538 exabytes to 5.0 zettabytes between 2012 and 2020 – more than 30 per cent a year.

That means it will double about every two and a half years. But Chris Roche, EMEA chief technology officer at Pivotal, cites projections that 45 per cent of western Europe’s digital universe in 2020 could still be useful if tagged and analysed correctly.
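The growth figures above can be checked with a little arithmetic. The sketch below assumes decimal units (1 zettabyte = 1,000 exabytes) and steady compound growth over the eight years; the variable names are ours, not the study's.

```python
import math

start_eb = 538.0        # exabytes in 2012, as quoted from the study
end_eb = 5000.0         # 5.0 zettabytes in 2020, assuming 1 ZB = 1,000 EB
years = 2020 - 2012

# Compound annual growth rate: (end / start) ** (1 / years) - 1
annual_growth = (end_eb / start_eb) ** (1 / years) - 1

# Doubling time at that rate: ln(2) / ln(1 + rate)
doubling_years = math.log(2) / math.log(1 + annual_growth)

print(f"annual growth: {annual_growth:.1%}")
print(f"doubling time: {doubling_years:.2f} years")
```

Under those assumptions the rate comes out at roughly 32 per cent a year and the doubling time at just under two and a half years, which matches the figures quoted.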

So even if we vaguely know what we should be doing to structure our unstructured data, how do we do it and what tools should we use?

More specifically, isn’t it important to have a granular discussion as to what type of database (and indeed, database management environment) we should use?

Whether it is structured, unstructured or even semi-structured data we have at hand, John Glendenning, vice-president of Apache Cassandra distributor DataStax, argues that NoSQL is nearly always better at tackling this need than a relational database management system (RDBMS) such as Oracle.

“To cope with the huge volume and variety of data that can be coming into a business, flexible or dynamic schema design is required to accommodate all the formats of big data applications, including structured, semi-structured and unstructured data,” he says.

“In Cassandra, data can be represented via column families that are dynamic in nature and accommodate all modifications online.

“For businesses that track unstructured data such as social media entries, or every interaction that a user has with an online video or movie, the amount of data tracked for one user might equate to only a handful of interactions versus another user who has hundreds.

“Now, there are ways of modeling this in an RDBMS, but they don’t come out as clean as they do in a NoSQL database, which allows you to have rows in the same table that have wildly different numbers of columns and data types.”
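To make the idea of rows with wildly different numbers of columns concrete, here is a minimal in-memory sketch in plain Python. It is not Cassandra code and does not use any Cassandra API; the `interactions` table, the `record` helper and the column naming scheme are hypothetical stand-ins for a dynamic column family.

```python
from collections import defaultdict

# Hypothetical stand-in for a wide-row table: row key -> {column name: value}.
interactions = defaultdict(dict)

def record(user_id, column, value):
    """Add one column to a user's row. A new column name needs no schema change."""
    interactions[user_id][column] = value

# One user with a single interaction, another with several of mixed types.
record("alice", "2014-04-08T10:00:play", "video-17")
record("bob", "2014-04-08T10:01:play", "video-17")
record("bob", "2014-04-08T10:05:pause", 312)            # integer offset in seconds
record("bob", "2014-04-08T10:09:comment", "great ending")

column_counts = {user: len(cols) for user, cols in interactions.items()}
print(column_counts)
```

In a fixed relational schema the same data would typically need either a separate interactions table joined to users or many nullable columns, which is the modelling awkwardness Glendenning alludes to.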

Intelligence quotient

Stemming the unstructured data tsunami is all about intelligence in data framework design (possibly the artificial type too). It is all about content and also all about data model flexibility.

We know that a huge amount of unstructured data is spam, so a re-engineering of the way data is treated by users' inboxes may be needed. This commonsense approach, along with de-duping and data mining, will also help.
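De-duping, at its simplest, can be sketched as hashing a normalised form of each item and keeping only the first occurrence. The normalisation rule below (lowercasing and collapsing whitespace) and the function names are assumptions for illustration; catching near-duplicates such as lightly reworded spam would need fuzzier techniques.

```python
import hashlib

def fingerprint(text):
    """SHA-256 of the text after lowercasing and collapsing whitespace."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def dedupe(documents):
    """Drop items whose normalised text has been seen before, keeping first occurrences in order."""
    seen = set()
    unique = []
    for doc in documents:
        key = fingerprint(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

# Made-up sample messages, purely for illustration.
messages = [
    "Win a prize now",
    "win a   PRIZE now",
    "Quarterly report attached",
]
unique_messages = dedupe(messages)
print(unique_messages)
```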

But we need to exercise caution. Remember what Einstein said: “Everything should be made as simple as possible, but no simpler.”

We can try to strip down our unstructured data all we like, but if we go too far we will ultimately lose the context and clarity we first sought. To face the unstructured data tsunami, best learn to swim. ®
