The devil is in the details

Grabbing data quality by the tail

Editors blog It's well known that data quality is often the devil you find hiding in the implementation detail of many big integration (and other) projects. It's a particular issue when companies merge. Of course, "your" salespeople always fill in contact and customer details accurately and honestly, but "their" salespeople apparently got through the volumes by inventing contacts, selling to the entire population of Disney World - and if the spelling of a name looked a little odd, they simply put up a new customer with a different spelling. Now your merged database contains twice as many customers as it should, the stock market is taking an interest in the merged companies - and, hey, I wonder if there are compliance implications...

Easy enough to fix, of course, with an army of clerical people - if you have the resources (money and time, lots of both) and if your business activities can cope with a poor quality database and still maintain good service levels.

There must be a better way. Well, according to Hugo Verwijs, CEO of Human Inference, there is. Not simply the usual "quick fix" tool but a process change that deals with data quality at the appropriate places in the operational lifecycle - plus, of course, automated tools (its HIquality Suite for Total Data Quality Management) that help to ensure consistency and reduce the manual overhead.

What makes the Human Inference solution different from most ordinary validation is that it doesn't simply think of quality in terms of data formats (rejecting blank fields, numeric surnames and that sort of thing); it deals with the semantics of the data. It treats data quality in a business context. So, it can potentially recognise common misspellings, "translations" of foreign names - even the names of well-known film stars and cartoon characters used by people who can't be bothered filling in forms.
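To make the distinction concrete, here is a minimal sketch of a format check sitting next to a crude semantic check. It is not Human Inference's method: the spoof-name list, the function names and the 0.85 similarity threshold are all assumptions made for illustration, and the fuzzy match uses Python's standard `difflib` rather than any real name-knowledge base.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of placeholder names; a real tool would draw on a far
# larger knowledge base than this.
SPOOF_NAMES = {"mickey mouse", "donald duck", "john doe"}


def format_ok(name: str) -> bool:
    """Conventional format check: non-blank, letters, spaces, hyphens, apostrophes."""
    return bool(re.fullmatch(r"[A-Za-z][A-Za-z' -]*", name.strip()))


def normalise(name: str) -> str:
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def semantic_flags(name: str, existing: list[str], threshold: float = 0.85) -> list[str]:
    """Return warnings that a format check alone would never raise."""
    flags = []
    key = normalise(name)
    if key in SPOOF_NAMES:
        flags.append("possible spoof name")
    for other in existing:
        # Similarity ratio between 0.0 and 1.0; the threshold is an assumed value.
        ratio = SequenceMatcher(None, key, normalise(other)).ratio()
        if ratio >= threshold:
            flags.append(f"possible duplicate of {other!r}")
    return flags
```

A name such as "Mickey Mouse" passes `format_ok` without complaint; only `semantic_flags` marks it for a second look, and "Jon Smith" is flagged as a likely respelling of an existing "John Smith" rather than becoming a new customer.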

When a website annoys you by making you type in reams of validated information before dispensing a white paper or something, it is amazing what rubbish conventional validation will accept. Human Inference tools can do better - leaving the site designer to think about why s/he is annoying potential customers in the first place and about the real business value of the data being collected.

Human Inference deals with large companies with specific problems. It isn't really interested in commoditised off-the-shelf solutions for small companies - not yet anyway - when pulling big companies out of potential data quality disasters is such good business. And, of course, there are other ways to validate data semantically - you can hard-code domain knowledge for validation (although this isn’t terribly “agile”), validate data against its essential properties (discovered during analysis), or use “hard typing” to reject invalid data rigorously at entry. The key point to remember is that allowing poor quality data into your database is dangerous to your business and will be expensive to fix later. And also that “poor quality” has to be defined in a business context and usually isn’t a “yes/no” thing - if you see “Johnny Depp” in a showbiz context it is possibly OK; if he’s a potential customer for your Layer 4-7 Switch technology, then you might wonder - although you still can’t just assume it’s a spoof name.
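The "not a yes/no thing" point can be sketched as a graded assessment instead of a pass/fail validator. Everything here is hypothetical: the name list, the context labels and the suspicion weights are invented for illustration. Note that the sketch deliberately never rejects a record outright, since a famous name may still belong to a real customer.

```python
from dataclasses import dataclass

# Hypothetical data for illustration only.
WELL_KNOWN_NAMES = {"johnny depp"}
CONTEXT_SUSPICION = {"showbiz": 0.1, "network_hardware": 0.7}
REVIEW_THRESHOLD = 0.5  # assumed cut-off for sending a record to a human


@dataclass(frozen=True)
class Assessment:
    name: str
    suspicion: float  # 0.0 = no concern, 1.0 = very likely bogus
    action: str       # "accept" or "review"; never an automatic reject


def assess(name: str, context: str) -> Assessment:
    """Score a name against the business context it was captured in."""
    key = " ".join(name.lower().split())
    if key in WELL_KNOWN_NAMES:
        # Unknown contexts get a middling score rather than a verdict.
        suspicion = CONTEXT_SUSPICION.get(context, 0.5)
    else:
        suspicion = 0.0
    action = "review" if suspicion >= REVIEW_THRESHOLD else "accept"
    return Assessment(name, suspicion, action)
```

With these assumed weights, `assess("Johnny Depp", "showbiz")` is accepted, while the same name in the `"network_hardware"` context is queued for review instead of being thrown away.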

This sort of fuzzy data quality issue is ubiquitous - and not just in databases. Mistyped email addresses are a dynamic data quality problem. With a "norfolk.org.uk" domain, my email queue collects a worrying amount of information that it shouldn't - private medical records, criminal records, health records etc, all intended for some branch of Norfolk's local government or similar organisations with a duty of confidentiality. Shouldn't commodity email software used in such places automatically learn not to send confidential matter to me after I've returned a couple of mis-addressed emails - or, even, learn to notice proactively that local government communications may be being sent outside local government (by checking the form of the address)? At the least, email software could reduce one addressing quality issue by forcing people to select To and CC addresses from an officially sanctioned directory. It can't be that hard to discover that "norfolk.org.uk" isn't a government address, at least to a certain degree of confidence!
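Checking the form of the address could be as simple as the sketch below. It assumes a policy in which only addresses under a sanctioned suffix (here "gov.uk") go out without a prompt; the suffix list and the function names are hypothetical, and a real mail client would more likely consult a directory than a hard-coded tuple.

```python
# Assumed policy: only these domain suffixes count as "inside government".
SANCTIONED_SUFFIXES = ("gov.uk",)


def domain_of(address: str) -> str:
    """Return the lower-cased domain part of an email address."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    return domain.lower().rstrip(".")


def is_sanctioned(address: str) -> bool:
    """True if the domain equals, or is a subdomain of, a sanctioned suffix."""
    domain = domain_of(address)
    return any(domain == s or domain.endswith("." + s) for s in SANCTIONED_SUFFIXES)


def recipients_needing_confirmation(recipients: list[str]) -> list[str]:
    """List the recipients the sender should be asked to confirm before sending."""
    return [r for r in recipients if not is_sanctioned(r)]
```

Under that assumed policy an address at "norfolk.gov.uk" goes straight through, while one at "norfolk.org.uk" triggers a confirmation prompt. The suffix test matches on a leading dot, so a lookalike such as "notgov.uk" is not mistaken for a government domain.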

Well, of course it isn’t, if you’re a person. Getting a computer to do it without bothering you overmuch doesn’t seem to be as easy as you’d hope.®
