Search me, guv...

The search engine - the last resort of the well-informed

Comment I'm always a little nervous about the idea of a search engine as the solution to the tide of "unstructured data" we're all drowning in. For a start, most of it isn't really unstructured - show me an unstructured email invoice and I'll show you something that is useless because you aren't sure who it came from and what it applies to. This means that by treating information which comes with structure and semantic metadata as “unstructured” we are increasing our business risk and making our business processing less efficient.

Secondly, we usually get more hits than we can cope with, so we reduce the number by refining the search in a fairly arbitrary way (e.g. by sticking quotes around a key phrase) or, even worse, by just reading the first page or so and assuming it contains the important stuff. Surely we all realise that information sources can be designed to rank well in search results, ahead of other, possibly more authoritative, sources. Again, this increases business risk.

Several approaches provide more useful searches that, say, put company information resources ahead of the general internet - see, for example, Coppereye Greenwich's indexed approach here. This makes use of the structural information - metadata - that you already hold elsewhere about the information you're trying to find in flat files such as transaction logs and audit trails. It can have a Google-like front end but is more specialised than Google.

An alternative, more general, approach that has always interested me is the Google Appliance, as exploited by Information Builders (see here). WebFOCUS Intelligent Search applies tags to company data and passes it, after processing against company security and access policies, to the Google Appliance, thus providing a richer Google search - because you know that you're searching information of relevance and have some idea of what search criteria will make sense.
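As a rough illustration of that tag-then-filter idea, here is a minimal Python sketch. It is not Information Builders' implementation: the function name, the document fields and the role-based access rule are all hypothetical, and the real product may well apply security at a different stage.

```python
def prepare_for_search(documents, user_roles):
    """Drop documents the user may not see, then attach tags for the search engine."""
    roles = set(user_roles)
    prepared = []
    for doc in documents:
        # Hypothetical access policy: at least one shared role grants access.
        if roles.isdisjoint(doc["allowed_roles"]):
            continue
        prepared.append({
            "title": doc["title"],
            # Hypothetical tags taken from metadata the company already holds.
            "tags": {"department": doc["department"], "type": doc["type"]},
        })
    return prepared


# Illustrative sample data only.
documents = [
    {"title": "Supplier invoice", "department": "finance", "type": "invoice",
     "allowed_roles": ["finance", "audit"]},
    {"title": "Salary review", "department": "hr", "type": "report",
     "allowed_roles": ["hr"]},
]
visible = prepare_for_search(documents, user_roles=["audit"])
```

Here an auditor's search would only ever see the invoice, already labelled with its department and type, so the search criteria have something meaningful to work against.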

Information Builders has taken this idea a step further again, by incorporating its Active Reports technology, to deliver user "self service" reports at the portal, in something called WebFOCUS Magnify. This means users can take a complete set of Google hits and apply spreadsheet-style reporting in near real-time - the hits can be categorised and sorted by category. So, you can carry out data mining and analysis against the data found by your search engine.
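To make "categorised and sorted by category" concrete, a minimal Python sketch follows. It assumes each hit arrives as a dictionary with hypothetical `category` and `score` fields; it illustrates the general idea and is not how Magnify itself works.

```python
from collections import defaultdict


def categorise_hits(hits):
    """Group search hits by category, best-scoring hit first within each group."""
    groups = defaultdict(list)
    for hit in hits:
        # Hits without a tag fall into a catch-all bucket.
        groups[hit.get("category", "uncategorised")].append(hit)
    # Categories in alphabetical order, hits in descending score order.
    return {
        category: sorted(members, key=lambda h: h["score"], reverse=True)
        for category, members in sorted(groups.items())
    }


# Illustrative sample data only.
hits = [
    {"title": "Q3 invoice run", "category": "finance", "score": 0.71},
    {"title": "Audit trail export", "category": "compliance", "score": 0.88},
    {"title": "Q3 budget review", "category": "finance", "score": 0.93},
]
report = categorise_hits(hits)
```

Once the hits are grouped like this, counting, filtering or totalling per category is ordinary spreadsheet-style work.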

Navigation tree

WebFOCUS Magnify exploits iWay Software's integration technology to enrich the content of an Information Bus, thus reducing the overheads involved in trawling through information sources, databases in particular. WebFOCUS Magnify can use metadata tags in this enriched content to produce a "navigation tree" that will help users find the content they need - even if it doesn't turn up on the first few pages of the search engine report.
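A navigation tree of this kind can be sketched in a few lines of Python. The assumption here, which is mine rather than Information Builders', is that each document carries an ordered list of tags running from general to specific; the function names and the `_docs` key are hypothetical.

```python
def build_navigation_tree(documents):
    """Nest documents under their tags, one tree level per tag."""
    tree = {}
    for doc in documents:
        node = tree
        for tag in doc["tags"]:
            node = node.setdefault(tag, {})
        # "_docs" is a reserved key in this sketch, so no tag may use that name.
        node.setdefault("_docs", []).append(doc["title"])
    return tree


def render(tree, depth=0):
    """Return the tree as indented lines, branches first, then documents."""
    lines = []
    for key in sorted(k for k in tree if k != "_docs"):
        lines.append("  " * depth + key)
        lines.extend(render(tree[key], depth + 1))
    for title in tree.get("_docs", []):
        lines.append("  " * depth + "- " + title)
    return lines


# Illustrative sample data only.
documents = [
    {"title": "Supplier invoice 1", "tags": ["finance", "invoices"]},
    {"title": "Access log summary", "tags": ["compliance", "audit"]},
    {"title": "Supplier invoice 2", "tags": ["finance", "invoices"]},
]
tree = build_navigation_tree(documents)
outline = render(tree)
```

The point is that a user can drill down through `finance` and `invoices` to reach both invoices directly, wherever they happened to rank in the raw hit list.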

So far so good. This is an enterprise tool that's going to interest Information Builders' loyal customer-base. It's powerful, but it's not exactly cheap - Information Builders tends to deal with enterprise and government customers, with huge information stores and mission-critical applications.

However, Dave Armstrong from Information Builders presages much wider applications for Magnify in the future. Fundamentally, it is intended to be search-engine neutral - it currently supports both Google and Lucene. But, Armstrong says, it would have trouble with engines like Autonomy, which make use of metadata categorisation in their own way. So, you could imagine an ISP, say, using Magnify to provide a low- or zero-cost value-add service to its general customer base.

Nevertheless, if you have serious business-critical questions to answer, this sort of advanced search is only technology-enabling. Provenance, as my “information professional” wife points out, is all - where does the information come from and can it be trusted.

A basic Google search isn't much help here, but at least WebFOCUS Intelligent Search, say, points you at categorised company information. Next, a lack of systematic bias helps. Obviously, online searches are limited to online information (see your librarian, or "company information professional"; in many fields, the authoritative information isn't online or isn't available to search engines), but on top of that, techniques such as the various "Google hacks" bias the information that appears in the first pages of a Google search. Magnify, say, could help here.

Then, you want to make use of everything you know about the semantics and structure of the information you’re searching, to make the search rather more efficient than, say, a Google keyword search returning hundreds of thousands of hits. Finally, as a last resort, you can spend the afternoon playing with Google or whatever, hoping that what you find (which is usually useful enough) doesn't miss out on something that you'd have really, really wanted to use - if you'd known it was there.

Information Builders' (and other vendors’) technology could be part of the solution to this "information" - as opposed to "data" - retrieval problem. But so too, of course, is employing an information professional, who actually understands the difference between Data and Information and knows about information quality and provenance issues, to “mentor” searchers.

Always remember, in this world of automated systems, that there is a lot more to the professional “librarian” than the stereotype suggests. For a start, compare the job spec here with that of the average IT professional, who is usually self-certified and needs no formally recognised education, although the BCS is manfully trying to address this challenge. ®
