The Semantic Web comes to ApacheCon
Themes du jour
Google gives us a window into all the world's information, together of course with all the world's spam and other crap. It's granular information: words and phrases are searched; the page is merely where you go to read the selected information. It can even make a choice of media for us: when I searched for "London Underground map" in preparation for the journey to Amsterdam (to check my options between Paddington and Liverpool Street), the top three hits were indeed the tube map itself, as an image, at various sites. But Google does all this with the web as it is, not with the web as we would like it to be. What would Google look like on the semweb?
That question only becomes meaningful if the semweb scales beyond the "geek toy" and attains a certain critical mass. But at that point, the spammers inevitably move in. How is the semweb going to deal with spam? Well, what happened when a metadata format on the web became popular, with HTML's <META> elements for KEYWORDS and DESCRIPTION? How is RDF metadata supposed to escape the same fate?
RDF is all about reducing the unit of information from the page to the statement. Strip out extraneous guff and machines can work with it far more efficiently. Great. Strip out context, and you've got a bundle of context-free statements. Scale it, and you've got a bundle of statements of which 99 per cent tell you where to make money fast or buy prescription drugs. Google needs that context. If the semweb is to scale above geek toy, then any machine that accepts RDF other than from known/trusted sources is going to need that context. So much for simplifying things.
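To make the point about context concrete, here is a minimal sketch in Python, not tied to any real RDF library. The `Statement` and `SourcedStatement` types, the `accept` function and the example.org and example.net addresses are all invented for illustration. It shows that a bare statement carries nothing to tell a fact from spam, so a consumer that wants to filter has to carry the source along with each statement.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """A bare RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


class SourcedStatement(NamedTuple):
    """The same statement plus the context RDF strips out: where it came from."""
    statement: Statement
    source: str


# Hypothetical allow-list of sources this consumer already trusts.
TRUSTED_SOURCES = {"http://example.org/known-publisher"}


def accept(statements, trusted=TRUSTED_SOURCES):
    """Keep only statements whose source is on the trusted list."""
    return [s.statement for s in statements if s.source in trusted]


harvested = [
    SourcedStatement(
        Statement("http://example.org/tube-map", "depicts", "London Underground"),
        "http://example.org/known-publisher",
    ),
    SourcedStatement(
        Statement("http://example.net/pills", "depicts", "London Underground"),
        "http://example.net/spam-farm",
    ),
]

accepted = accept(harvested)
```

Without the `source` field the two statements are indistinguishable in form, which is the problem: the filtering only works once the context has been put back.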
No, what could really use simplification is the semweb itself, whose barriers to entry and usage are absurdly high. It's not entirely the semweb's fault that most people who come to it (your humble scribe included) already know XML and see it through the abomination of rdf+xml. But why is no one making serious efforts to do anything else? Where, for instance, are the tools for working with RDF/N3?
Worst of all, in practical terms, is the use of URIs as words. The underlying premise that URIs can be globally unique by virtue of namespacing has merit, though it inevitably makes RDF hard for humans (Java's uniqueness through namespacing is beautifully right).
The use of HTTP URLs is just plain bonkers. Even the W3C Annotea folks, at the cutting edge of the semweb, got themselves terminally confused and invented a system that was fundamentally broken, when they confused RDF usage (as words) with HTTP usage (as a protocol). As soon as Annotea dereferences a URL to reference a page (let alone an ill-specified XPointer within a page), it completely loses the RDF properties of uniqueness and invariance. And if the experts at W3C got so hopelessly confused for the entire duration of the project, what hope for the rest of us?
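The distinction between a URL used as a name and a URL used as an address can be sketched in a few lines of Python. This is an illustration only, not how Annotea works: the `PAGE_HISTORY` table, the dates, the `dereference` function and the example.org URL are all made up, and no network access is involved.

```python
URL = "http://example.org/article"

# Hypothetical snapshots of what the same URL serves on two different days.
PAGE_HISTORY = {
    (URL, "2006-06-01"): "<p>Original paragraph being annotated.</p>",
    (URL, "2006-07-01"): "<p>Rewritten page with different text.</p>",
}


def same_name(a, b):
    """RDF-style usage: a URI is an opaque word, compared as a string."""
    return a == b


def dereference(url, date):
    """HTTP-style usage: fetch whatever the URL happens to serve on that date."""
    return PAGE_HISTORY[(url, date)]


def annotation_still_applies(target_text, url, date):
    """An annotation tied to page content only holds while that content is served."""
    return target_text in dereference(url, date)


target = "Original paragraph"
name_is_stable = same_name(URL, "http://example.org/article")
applies_in_june = annotation_still_applies(target, URL, "2006-06-01")
applies_in_july = annotation_still_applies(target, URL, "2006-07-01")
```

The name compares equal forever; the thing it fetches does not. An annotation anchored to the fetched page inherits the page's instability, which is where the uniqueness and invariance go missing.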
I'm somewhat at a loss how to conclude a rant about the semweb. If you're reading this, I daresay you've already seen the pro arguments, so it would be superfluous to repeat them here. It's today's reincarnation of 1980s Expert Systems, and there's no doubt that added connectivity can enable some very exciting applications, like FOAF (Friend Of A Friend) and DOAP (Description Of A Project).
The range of tools is growing: for example, Apache's new "triplesoup" project is building on mod_sparql, which is itself a recent work. But I think the biggest potential is in the road already taken by RSS, in simplified and rather bastardised spinoffs. ®