Search me, guv...
The search engine - the last resort of the well-informed
Comment I'm always a little nervous about the idea of a search engine as the solution to the tide of "unstructured data" we're all drowning in. For a start, most of it isn't really unstructured - show me an unstructured email invoice and I'll show you something that is useless because you aren't sure who it came from and what it applies to. This means that by treating information which comes with structure and semantic metadata as “unstructured” we are increasing our business risk and making our business processing less efficient.
Secondly, we usually get more hits than we can cope with, so we reduce the number of hits by refining the search in a fairly arbitrary way (e.g. by sticking quotes around a key phrase) or, even worse, by just reading the first page or so, assuming this contains the important stuff. Surely we all realise that information sources can be designed to optimise their own retrieval, so that they rank ahead of other, possibly more authoritative, sources. Again, this increases business risk.
Several approaches provide more useful searches that, for example, put company information resources ahead of the general internet - see, for example, Coppereye Greenwich's indexed approach here. This makes use of the structural information - metadata - that you already have elsewhere about the information you're trying to find in flat files such as transaction logs and audit trails. It can have a Google-like front end but is more specialised than Google.
An alternative, more general, approach that has always interested me is the Google Appliance, as exploited by Information Builders (see here). WebFOCUS Intelligent Search applies tags to company data and passes it, after processing against company security and access policies, to the Google Appliance, thus providing a richer Google search - because you know that you're searching information of relevance and have some idea of what search criteria will make sense.
Information Builders has taken this idea a step further again, by incorporating its Active Reports technology to deliver user "self service" reports at the portal, in something called WebFOCUS Magnify. This means users can take a complete set of Google hits and apply spreadsheet-style reporting in near real-time - the hits can be categorised and sorted by category. So, you can carry out data mining and analysis against the data found by your search engine.
WebFOCUS Magnify exploits iWay Software's integration technology to enrich the content of an Information Bus, thus reducing the overheads involved in trawling information sources, databases especially. WebFOCUS Magnify can use metadata tags in this enriched content to produce a "navigation tree" that will help users find the content they need - even if it doesn't turn up on the first few pages of the search engine report.
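To make the "navigation tree" idea concrete, here is a minimal Python sketch of grouping search hits by their metadata tags. It is an illustration only, not Information Builders' implementation: the hit records, the two-level tag scheme and the function name build_navigation_tree are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical search hits: each carries a (category, subcategory) tag pair
# alongside its position in the raw search-engine ranking.
hits = [
    {"rank": 1, "title": "Q3 invoice summary", "tags": ("finance", "invoices")},
    {"rank": 2, "title": "Audit trail export", "tags": ("compliance", "audit")},
    {"rank": 57, "title": "Supplier invoice 1042", "tags": ("finance", "invoices")},
    {"rank": 203, "title": "Expense policy", "tags": ("finance", "policy")},
]


def build_navigation_tree(hits):
    """Group hit titles under category -> subcategory, keeping input order."""
    tree = defaultdict(lambda: defaultdict(list))
    for hit in hits:
        category, subcategory = hit["tags"]
        tree[category][subcategory].append(hit["title"])
    # Convert to plain dicts so the result is easy to inspect or serialise.
    return {category: dict(subs) for category, subs in tree.items()}


tree = build_navigation_tree(hits)

# Print the tree as an indented outline.
for category in sorted(tree):
    print(category)
    for subcategory in sorted(tree[category]):
        print("  " + subcategory)
        for title in tree[category][subcategory]:
            print("    " + title)
```

In this sketch the hit ranked 203rd sits in the tree alongside the hit ranked first, which is the point: a user browsing by category can reach it without paging through the raw results.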
So far so good. This is an enterprise tool that's going to interest Information Builders' loyal customer-base. It's powerful, but it's not exactly cheap - Information Builders tends to deal with enterprise and government customers, with huge information stores and mission-critical applications.
However, Dave Armstrong from Information Builders presages much wider applications for Magnify in the future. Fundamentally, it is intended to be search-engine neutral - it currently supports both Google and Lucene. But, Armstrong says, it would have trouble with engines like Autonomy, which make use of metadata categorisation in their own way. So, you could imagine an ISP, say, using Magnify to provide a low- or zero-cost value-add service to its general customer base.
Nevertheless, if you have serious business-critical questions to answer, this sort of advanced search is only technology-enabling. Provenance, as my “information professional” wife points out, is all - where does the information come from and can it be trusted.
A basic Google search isn't much help here, but at least WebFOCUS Intelligent Search, say, points you at categorised company information. Next, a lack of systematic bias helps - obviously online searches are limited to online information (see your librarian, or "company information professional"; in many fields, the authoritative information isn't online or isn’t available to search engines), and various "Google hacks", for example, can bias the information that appears in the first pages of a Google search. Magnify, say, could help here.
Then, you want to make use of everything you know about the semantics and structure of the information you’re searching, to make the search rather more efficient than, say, a Google keyword search returning hundreds of thousands of hits. Finally, as a last resort, you can spend the afternoon playing with Google or whatever, hoping that what you find (which is usually useful enough) doesn't miss out on something that you'd have really, really wanted to use - if you'd known it was there.
Information Builders' (and other vendors’) technology could be part of the solution to this "information" - as opposed to "data" - retrieval problem. But so too, of course, is employing an information professional, who actually understands the difference between Data and Information and knows about information quality and provenance issues, to “mentor” searchers.
Always remember, in this world of automated systems, that there is a lot more to the professional “librarian” than the stereotype suggests. For a start, compare the job spec here with that of the average IT professional, who is usually self-certified and needs no formally recognised education, although the BCS is manfully trying to address this challenge. ®