DataLens demystifies complex matching

A solution that's not statistically challenged

Comment: I have long espoused the cause of semantic approaches in a variety of areas. I think that natural language processing works better in search engines, and I think that LAS (which has just been acquired by IBM) offers just about the best name matching on the market, thanks to its semantic and linguistic basis as opposed to the statistical approach that is common among data quality vendors.

As it happens, I have recently uncovered another supplier in the data quality space that espouses the cause of semantics as opposed to statistics: Silver Creek Systems. In this case, however, the company is focused on product matching rather than name matching.

Product matching is an order of magnitude more complex than conventional name and/or address matching. For example, suppose you sell electrical resistors. Resistors have a variety of attributes: resistance, power, tolerance, general description, manufacturer, UNSPSC and FSC industry codes, and so on. You get these product details in from your suppliers: how do you build a consistent catalogue when you bear in mind that these details may come in any order, that you have numeric values mixed in with letters, and that you may have electrical symbols (Ω instead of ohms, for example)?
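To make the resistor example concrete, here is a minimal Python sketch of the normalisation problem. It is purely illustrative and is not how DataLens or any other product works: the function name `parse_resistor`, the regular expressions, and the sample descriptions are all assumptions invented for this illustration. It pulls resistance, power and tolerance out of a free-text description regardless of order, and treats Ω and "ohm" as the same unit.

```python
import re
import unicodedata

# Hypothetical multipliers: "k" or "K" means kilo, "M" means mega.
_MULTIPLIERS = {"": 1.0, "k": 1e3, "K": 1e3, "M": 1e6}

# Case-sensitive on purpose so that "M" (mega) is not confused with "m".
_RESISTANCE = re.compile(r"(\d+(?:\.\d+)?)\s*([kKM]?)\s*(?:Ω|[Oo]hms?)")
_POWER = re.compile(r"(\d+/\d+|\d+(?:\.\d+)?)\s*(?:watts?|w)\b", re.IGNORECASE)
_TOLERANCE = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def _to_number(text):
    """Convert '1/4' or '0.25' to a float."""
    if "/" in text:
        numerator, denominator = text.split("/")
        return float(numerator) / float(denominator)
    return float(text)


def parse_resistor(description):
    """Return (ohms, watts, tolerance_percent); None for any missing attribute."""
    # NFKC folds the dedicated ohm sign (U+2126) into Greek capital omega.
    text = unicodedata.normalize("NFKC", description)

    ohms = watts = tolerance = None
    match = _RESISTANCE.search(text)
    if match:
        ohms = round(float(match.group(1)) * _MULTIPLIERS[match.group(2)], 6)
    match = _POWER.search(text)
    if match:
        watts = _to_number(match.group(1))
    match = _TOLERANCE.search(text)
    if match:
        tolerance = float(match.group(1))
    return (ohms, watts, tolerance)


if __name__ == "__main__":
    # Two invented supplier descriptions of the same part, in different orders.
    a = parse_resistor("Resistor 4.7kΩ 1/4W 5% carbon film")
    b = parse_resistor("RES CARBON 5% 0.25 watt 4700 ohm")
    print(a == b)
```

Even this toy version needs a hand-written rule for every attribute and every notation, which gives a sense of why the problem grows quickly once real supplier data is involved.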

Traditional approaches to this sort of matching use a statistical methodology: that is, you look for patterns within the data. The problem is that with product and similarly complex data derived from diverse sources, the data is all intermingled and it is difficult to extract relevant patterns. While there are some relatively simple product-based environments in which traditional methods can work well, in more complex situations involving such things as electrical components, electronic consumer goods (for example, digital cameras), office supplies, computer ancillaries and cables, even land title documents, success rates are seldom above 50 per cent.

The problem with 50 per cent is that it isn't adequate: so much manual intervention is required to do the other half of the matching that it is more cost-effective to do the whole thing by hand. Some manual intervention can be tolerated, but not at this level. You need a solution such as Silver Creek Systems’, where the degree of manual work is at a more reasonable level. Having said that, I do not know of any vendor apart from Silver Creek Systems that can offer this sort of capability.

Silver Creek Systems’ product is called DataLens and it includes facilities to support semantically-based content profiling (that is, classifying records into content groups), standardisation (enforcing standards and normalising content), attribute identification, classification (aggregating data into taxonomies and schemas), and internationalisation (so you can have Spanish, Russian and other versions of the catalogue). The software can also operate in either real-time or batch mode, as appropriate.

I am not going to mince my words about DataLens: if you have a complex matching problem that goes beyond conventional name and address matching (not necessarily for products) then you must talk to Silver Creek Systems. To misquote a well-known beer advertisement: DataLens can get to parts of your matching problem that other data quality solutions cannot reach.

Copyright © 2006, IT-Analysis.com
