Feeds

The devil is in the details

Grabbing data quality by the tail

High performance access to file storage

Editors blog It's well-known that data quality is often the devil you find hiding in the implementation detail of many big integration (and other) projects. It's a particular issue when companies merge. Of course, "your" salespeople always fill in contact and customer details accurately and honestly but "their" salespeople apparently got through the volumes by inventing contacts, selling to the entire population of Disneyworld - and if the spelling of a name looked a little odd they simply put up a new customer with a different spelling. Now, your merged database contains twice as many customers as it should, the stockmarket is taking an interest in the merged companies - and, hey, I wonder if there are compliance implications...

Easy enough to fix, of course, with an army of clerical people - if you have the resources (money and time, lots of both) and if your business activities can cope with a poor quality database and still maintain good service levels.

There must be a better way. Well, according to Hugo Verwijs, CEO of Human Inference, there is. Not simply the usual "quick fix" tool but a process change that deals with data quality at the appropriate places in the operational lifecycle - plus, of course, automated tools (its HIquality Suite for Total Data Quality Management) that help to ensure consistency and reduce the manual overhead.

What makes the Human Inference solution different to much ordinary validation is that it doesn't simply think of quality in terms of data formats (rejecting blank fields, numeric surnames and that sort of thing); it deals with the semantics of the data. It treats data quality in a business context. So, it can potentially recognise common misspellings, "translations" of foreign names - even well-known filmstars and cartoon characters used by people who can't be bothered filling in forms.

When a website annoys you by making you type in reams of validated information before dispensing a white paper or something, it is amazing what rubbish a conventional validation will accept. Human Inference tools can do better - leaving the site designer to think about why s/he is annoying potential customers in the first place and about the real business value of the data being collected

Human Inference deals with large companies with specific problems. It isn't really interested in commoditised off-the-shelf solutions for small companies - not yet anyway - when pulling big companies out of potential data quality disasters is such good business. And, of course there are other ways to validate data semantically – you can hard-code domain knowledge for validation (although this isn’t terribly “agile”), validate data against its essential properties (discovered during analysis), use “hard typing” to reject invalid data rigorously at entry. The key point to remember is that allowing poor quality data into your database is dangerous to your business and will be expensive to fix later. And also that “poor quality” has to be defined in a business context and usually isn’t a “yes/no” thing– if you see “Johnny Depp” in a showbiz context it is possibly OK, if he’s a potential customer for your Layer 4-7 Switch technology, then you might wonder – although you still can’t just assume it’s a spoof name.

This sort of fuzzy data quality issue is ubiquitous - and not just in databases. Mistyped email addresses are a dynamic data quality problem. With a "norfolk.org.uk" domain, my email queue collects a worrying amount of information that it shouldn't - private medical records, criminal records, health records etc, all intended for some branch of Norfolk's local government or similar organisations with a duty of confidentiality. Shouldn't commodity email software used in such places automatically learn not to send confidential matter to me after I've returned a couple of mis-addressed emails - or, even, learn to notice proactively that local government communications may be being sent outside local government (by checking the form of the address)? At the least, email software could reduce one addressing quality issue by forcing people select To an CC addresses from an officially sanctioned directory. It can't be that hard to discover that "norfolk.org.uk" isn't a government address, at least to a certain degree of confidence!

Well, of course it isn’t, if you’re a person. Getting a computer to do it without bothering you overmuch doesn’t seem to be as easy as you’d hope.®

High performance access to file storage

More from The Register

next story
Windows 8.1, which you probably haven't upgraded to yet, ALREADY OBSOLETE
Pre-Update versions of new Windows version will no longer support patches
Android engineer: We DIDN'T copy Apple OR follow Samsung's orders
Veep testifies for Samsung during Apple patent trial
OpenSSL Heartbleed: Bloody nose for open-source bleeding hearts
Bloke behind the cockup says not enough people are helping crucial crypto project
Microsoft lobs pre-release Windows Phone 8.1 at devs who dare
App makers can load it before anyone else, but if they do they're stuck with it
Half of Twitter's 'active users' are SILENT STALKERS
Nearly 50% have NEVER tweeted a word
Windows XP still has 27 per cent market share on its deathbed
Windows 7 making some gains on XP Death Day
Internet-of-stuff startup dumps NoSQL for ... SQL?
NoSQL taste great at first but lacks proper nutrients, says startup cloud whiz
This time it's 'Personal': new Office 365 sub covers just two devices
Redmond also brings Office into Google's back yard
US taxman blows Win XP deadline, must now spend millions on custom support
Gov't IT likened to 'a Model T with a lot of things on top of it'
prev story

Whitepapers

Securing web applications made simple and scalable
In this whitepaper learn how automated security testing can provide a simple and scalable way to protect your web applications.
Five 3D headsets to be won!
We were so impressed by the Durovis Dive headset we’ve asked the company to give some away to Reg readers.
HP ArcSight ESM solution helps Finansbank
Based on their experience using HP ArcSight Enterprise Security Manager for IT security operations, Finansbank moved to HP ArcSight ESM for fraud management.
The benefits of software based PBX
Why you should break free from your proprietary PBX and how to leverage your existing server hardware.
Mobile application security study
Download this report to see the alarming realities regarding the sheer number of applications vulnerable to attack, as well as the most common and easily addressable vulnerability errors.