The devil is in the details

Grabbing data quality by the tail

It's well known that data quality is often the devil you find hiding in the implementation detail of many big integration (and other) projects. It's a particular issue when companies merge. Of course, "your" salespeople always fill in contact and customer details accurately and honestly, but "their" salespeople apparently got through the volumes by inventing contacts and selling to the entire population of Disneyworld - and if the spelling of a name looked a little odd, they simply set up a new customer with a different spelling. Now your merged database contains twice as many customers as it should, the stock market is taking an interest in the merged companies - and, hey, I wonder if there are compliance implications...

Easy enough to fix, of course, with an army of clerical people - if you have the resources (money and time, lots of both) and if your business activities can cope with a poor-quality database and still maintain good service levels.

There must be a better way. Well, according to Hugo Verwijs, CEO of Human Inference, there is. Not simply the usual "quick fix" tool but a process change that deals with data quality at the appropriate places in the operational lifecycle - plus, of course, automated tools (its HIquality Suite for Total Data Quality Management) that help to ensure consistency and reduce the manual overhead.

What makes the Human Inference solution different from most ordinary validation is that it doesn't simply think of quality in terms of data formats (rejecting blank fields, numeric surnames and that sort of thing); it deals with the semantics of the data. It treats data quality in a business context. So it can potentially recognise common misspellings, "translations" of foreign names - even the names of well-known film stars and cartoon characters used by people who can't be bothered to fill in forms properly.
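To make the distinction concrete, here is a minimal sketch in Python of a format check sitting alongside a crude semantic check. It is not how the HIquality Suite works (the article does not describe its internals); the function names, the tiny `KNOWN_SPOOF_NAMES` list and the 0.85 threshold are all assumptions made for illustration, and the fuzzy matching simply uses the standard library's `difflib`.

```python
from difflib import SequenceMatcher

# Hypothetical reference data: a real system would need far larger, curated lists.
KNOWN_SPOOF_NAMES = {"mickey mouse", "donald duck", "johnny depp"}


def format_ok(name: str) -> bool:
    """Conventional format check: reject blank names and names containing digits."""
    stripped = name.strip()
    return bool(stripped) and not any(ch.isdigit() for ch in stripped)


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two names, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def looks_like_spoof(name: str, threshold: float = 0.85) -> bool:
    """Semantic check: does the name closely resemble a well-known spoof name?"""
    return any(similarity(name, spoof) >= threshold for spoof in KNOWN_SPOOF_NAMES)


def probable_duplicates(name: str, existing: list[str], threshold: float = 0.85) -> list[str]:
    """Return existing customer names that are probably misspellings of `name`."""
    return [other for other in existing if similarity(name, other) >= threshold]
```

A name such as "Mickey Mouse" passes `format_ok` without complaint, which is exactly the weakness of format-only validation; only the second check flags it. Similarly, `probable_duplicates("Jon Smith", ["John Smith", "Jane Doe"])` picks out "John Smith" as a likely respelling rather than a new customer.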

When a website annoys you by making you type in reams of validated information before dispensing a white paper or something, it is amazing what rubbish conventional validation will accept. Human Inference tools can do better - leaving the site designer to think about why s/he is annoying potential customers in the first place and about the real business value of the data being collected.

Human Inference deals with large companies with specific problems. It isn't really interested in commoditised off-the-shelf solutions for small companies - not yet, anyway - when pulling big companies out of potential data quality disasters is such good business. And, of course, there are other ways to validate data semantically: you can hard-code domain knowledge for validation (although this isn’t terribly “agile”), validate data against its essential properties (discovered during analysis), or use “hard typing” to reject invalid data rigorously at entry. The key point to remember is that allowing poor-quality data into your database is dangerous to your business and will be expensive to fix later. Remember, too, that “poor quality” has to be defined in a business context and usually isn’t a “yes/no” thing: if you see “Johnny Depp” in a showbiz context it is possibly OK; if he’s a potential customer for your Layer 4-7 switch technology, then you might wonder - although you still can’t just assume it’s a spoof name.
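Two of these ideas can be sketched briefly in Python. The `Surname` class below is one possible reading of "hard typing": an invalid value cannot be constructed at all. The `suspicion` function shows quality as a graded, context-dependent score rather than a yes/no answer. The class, the function, the `CELEBRITY_NAMES` list, the context labels and the scores themselves are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Surname:
    """'Hard typing' sketch: an invalid surname can never be constructed."""

    value: str

    def __post_init__(self) -> None:
        if not self.value.strip():
            raise ValueError("surname must not be blank")
        if any(ch.isdigit() for ch in self.value):
            raise ValueError("surname must not contain digits")


# Hypothetical reference list; the scores below are illustrative, not calibrated.
CELEBRITY_NAMES = {"johnny depp"}


def suspicion(full_name: str, context: str) -> float:
    """Return a score in [0, 1]: how suspicious is this name in this business context?"""
    if full_name.strip().lower() not in CELEBRITY_NAMES:
        return 0.0
    # A celebrity name is plausible in showbiz and doubtful elsewhere,
    # but never certain to be a spoof, so the score stays below 1.0.
    return 0.1 if context == "showbiz" else 0.7
```

A score like this is something to route to a human or a follow-up check, not grounds for automatic rejection: `suspicion("Johnny Depp", "network-sales")` returns 0.7, not 1.0.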

This sort of fuzzy data quality issue is ubiquitous - and not just in databases. Mistyped email addresses are a dynamic data quality problem. With a "norfolk.org.uk" domain, my email queue collects a worrying amount of information that it shouldn't - private medical records, criminal records, health records etc, all intended for some branch of Norfolk's local government or similar organisations with a duty of confidentiality. Shouldn't commodity email software used in such places automatically learn not to send confidential matter to me after I've returned a couple of mis-addressed emails - or even learn to notice proactively that local government communications may be being sent outside local government (by checking the form of the address)? At the least, email software could reduce one addressing quality issue by forcing people to select To and CC addresses from an officially sanctioned directory. It can't be that hard to discover that "norfolk.org.uk" isn't a government address, at least to a certain degree of confidence!
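As a rough sketch of what "checking the form of the address" might look like, the Python below scores a recipient address against a sanctioned directory and a government domain suffix. The directory contents, the confidence values and the function names are assumptions for illustration only; no real email product is being described.

```python
# Hypothetical sanctioned directory of recipient domains for confidential mail.
SANCTIONED_DOMAINS = {"norfolk.gov.uk"}

# UK government bodies use the gov.uk second-level domain.
GOVERNMENT_SUFFIXES = (".gov.uk",)


def domain_of(address: str) -> str:
    """Return the lower-cased domain part of an email address."""
    return address.rsplit("@", 1)[-1].strip().lower()


def government_confidence(address: str) -> float:
    """Illustrative confidence in [0, 1] that the address is a government one."""
    domain = domain_of(address)
    if domain in SANCTIONED_DOMAINS:
        return 1.0  # listed in the official directory
    if domain.endswith(GOVERNMENT_SUFFIXES):
        return 0.8  # right form, but not in the directory
    return 0.0  # e.g. norfolk.org.uk: wrong form entirely


def should_warn(address: str, confidential: bool, threshold: float = 0.5) -> bool:
    """Warn the sender before confidential mail goes to a doubtful address."""
    return confidential and government_confidence(address) < threshold
```

With these assumptions, `should_warn("someone@norfolk.org.uk", confidential=True)` returns `True`, while the same message addressed to a `norfolk.gov.uk` recipient goes through without a prompt.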

Well, of course it isn’t, if you’re a person. Getting a computer to do it without bothering you overmuch doesn’t seem to be as easy as you’d hope.®
