The devil is in the details

Grabbing data quality by the tail

Editor's blog It's well known that data quality is often the devil you find hiding in the implementation detail of many big integration (and other) projects. It's a particular issue when companies merge. Of course, "your" salespeople always fill in contact and customer details accurately and honestly, but "their" salespeople apparently got through the volumes by inventing contacts, selling to the entire population of Disneyworld - and if the spelling of a name looked a little odd they simply set up a new customer with a different spelling. Now your merged database contains twice as many customers as it should, the stock market is taking an interest in the merged companies - and, hey, I wonder if there are compliance implications...

Easy enough to fix, of course, with an army of clerical people - if you have the resources (money and time, lots of both) and if your business activities can cope with a poor quality database and still maintain good service levels.

There must be a better way. Well, according to Hugo Verwijs, CEO of Human Inference, there is. Not simply the usual "quick fix" tool but a process change that deals with data quality at the appropriate places in the operational lifecycle - plus, of course, automated tools (its HIquality Suite for Total Data Quality Management) that help to ensure consistency and reduce the manual overhead.

What makes the Human Inference solution different from much ordinary validation is that it doesn't think of quality simply in terms of data formats (rejecting blank fields, numeric surnames and that sort of thing); it deals with the semantics of the data. It treats data quality in a business context. So it can potentially recognise common misspellings, "translations" of foreign names - even well-known film stars and cartoon characters used by people who can't be bothered filling in forms.
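To illustrate the kind of semantic check being described - a minimal sketch, not Human Inference's actual method; the spoof-name list and similarity threshold are invented for the example - fuzzy matching can catch near-duplicate spellings and well-known spoof names that format-level validation would happily accept:

```python
import difflib

# Illustrative only: a real system would use a far larger, curated list.
KNOWN_SPOOF_NAMES = {"mickey mouse", "donald duck", "johnny depp"}

def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so comparisons are fair."""
    return " ".join(name.lower().split())

def looks_like_spoof(name: str) -> bool:
    """Flag entries matching well-known film stars or cartoon characters."""
    return normalise(name) in KNOWN_SPOOF_NAMES

def probable_duplicates(name: str, existing: list[str],
                        threshold: float = 0.8) -> list[str]:
    """Return existing customer names suspiciously similar to a new entry.

    The 0.8 threshold is an assumption for the sketch; in practice it
    would be tuned against real data in a business context.
    """
    target = normalise(name)
    return [e for e in existing
            if difflib.SequenceMatcher(None, target,
                                       normalise(e)).ratio() >= threshold]
```

A check like this only raises candidates for review - as the article notes, "Johnny Depp" might be a spoof or might be a real customer, so the decision can't be fully automated.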

When a website annoys you by making you type in reams of validated information before dispensing a white paper or something, it is amazing what rubbish a conventional validation will accept. Human Inference tools can do better - leaving the site designer to think about why s/he is annoying potential customers in the first place, and about the real business value of the data being collected.

Human Inference deals with large companies with specific problems. It isn't really interested in commoditised off-the-shelf solutions for small companies - not yet anyway - when pulling big companies out of potential data quality disasters is such good business. And, of course, there are other ways to validate data semantically - you can hard-code domain knowledge for validation (although this isn't terribly "agile"), validate data against its essential properties (discovered during analysis), or use "hard typing" to reject invalid data rigorously at entry. The key point to remember is that allowing poor quality data into your database is dangerous to your business and will be expensive to fix later. And also that "poor quality" has to be defined in a business context and usually isn't a "yes/no" thing - if you see "Johnny Depp" in a showbiz context it is possibly OK; if he's a potential customer for your Layer 4-7 switch technology, then you might wonder - although you still can't just assume it's a spoof name.
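The "hard typing" approach mentioned above can be sketched as a value type that refuses to construct from invalid input, so bad data never reaches the database at all. This is a minimal illustration; the simplified postcode pattern is an assumption for the example, not a complete UK postcode grammar:

```python
import re

# Deliberately simplified pattern - real UK postcode validation is
# more involved (special cases like GIR 0AA, restricted letter sets).
UK_POSTCODE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}$")

class Postcode:
    """A postcode that cannot exist in an invalid state.

    Construction is the validation gate: code elsewhere that accepts a
    Postcode never needs to re-check it.
    """
    def __init__(self, raw: str):
        value = raw.strip().upper()
        if not UK_POSTCODE.match(value):
            raise ValueError(f"not a plausible UK postcode: {raw!r}")
        self.value = value

    def __str__(self) -> str:
        return self.value
```

Note the trade-off the article hints at: hard typing is rigorous but only catches format-level problems - it would happily accept a syntactically valid postcode attached to a customer called Mickey Mouse.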

This sort of fuzzy data quality issue is ubiquitous - and not just in databases. Mistyped email addresses are a dynamic data quality problem. With a "norfolk.org.uk" domain, my email queue collects a worrying amount of information that it shouldn't - private medical records, criminal records, health records and so on, all intended for some branch of Norfolk's local government or similar organisations with a duty of confidentiality. Shouldn't commodity email software used in such places automatically learn not to send confidential matter to me after I've returned a couple of mis-addressed emails - or even learn to notice proactively that local government communications may be being sent outside local government (by checking the form of the address)? At the least, email software could reduce one addressing quality issue by forcing people to select To and CC addresses from an officially sanctioned directory. It can't be that hard to discover that "norfolk.org.uk" isn't a government address, at least to a certain degree of confidence!
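A sanctioned-directory check of the sort suggested above could be as simple as comparing each recipient's domain against an approved list before releasing mail marked confidential. The domain list here is invented for the example; a real deployment would draw it from the organisation's directory:

```python
# Illustrative sketch: domains assumed sanctioned for confidential mail.
SANCTIONED_DOMAINS = {"norfolk.gov.uk", "norfolk.nhs.uk"}

def domain_of(address: str) -> str:
    """Extract the domain part of an email address, lower-cased."""
    return address.rsplit("@", 1)[-1].lower()

def unsanctioned_recipients(recipients: list[str]) -> list[str]:
    """Return recipients whose domain is not on the sanctioned list,
    so the client can warn before sending - e.g. norfolk.org.uk
    looks plausible but isn't local government."""
    return [r for r in recipients
            if domain_of(r) not in SANCTIONED_DOMAINS]
```

Even this crude check would flag the "norfolk.org.uk" mis-addressing the article describes, at the cost of maintaining the directory - which is itself a data quality problem, of course.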

Well, of course it isn’t, if you’re a person. Getting a computer to do it without bothering you overmuch doesn’t seem to be as easy as you’d hope.®
