Southampton Uni shows way to a truly open web
Making Berners-Lee's vision a reality
Untangling the semantic web
Southampton is pushing to be the go-to place for expertise on linked data in the UK, and researchers at its main university launched a site earlier this month containing no fewer than 21 "non-confidential" datasets that underline that semantic web desire.
The University of Southampton (UoS) is one of the first academic institutions in Blighty to follow in the footsteps of its neighbour – map-making agency the Ordnance Survey, which released some open datasets in April 2010. Indeed, some of the city's boffins are dead keen to put a linked data strategy for government, academic and public sector organisations on the map, if you'll pardon the pun. However, finding wider enthusiasm expressed in Southampton for Sir Tim Berners-Lee's relatively newborn and somewhat niche software modeling system remains a big challenge.
After all, there's a lot of baggage in the way of all that juicy data. Genuine enthusiasts for linked data, a term coined by Berners-Lee back in 2006, first need to brush aside web standards arguments and political grandstanding among MPs apparently desperate to push a "transparency" agenda. On top of that, they also have to work around the big social content farms and closed data silos, like Facebook and chums, whose vision of an "open web" has massively undermined what is now considered an almost redundant term.
Mark Zuckerberg has famously declared that he "is trying to make the world a more open place" with his social network site Facebook. Berners-Lee recently questioned that claim in public.
"The sites [Facebook, LinkedIn, Friendster and others] assemble these bits of data into brilliant databases and reuse the information to provide value-added service – but only within their sites. Once you enter your data into one of these services, you cannot easily use them on another site. Each site is a silo, walled off from the others," he said in the December 2010 issue of Scientific American.
"Yes, your site's pages are on the web, but your data are not. You can access a web page about a list of people you have created in one site, but you cannot send that list, or items from it, to another site."
The likes of the UoS linked data team are fighting that silo effect by freeing up useful datasets to help the university better cope with its non-confidential student bureaucracy. Some would argue that their efforts should be applauded, even if questions remain about how such a model for making data more accessible online might eventually be used to link big, unwieldy government datasets together in a truly meaningful way. Others might complain that trust and privacy could take an almighty blow online if such supposedly non-sensitive data, even if considered entirely vanilla and "out there in HTML form anyway", was so easily opened up on a grand scale to all comers.
"What we need is an information shaman," explains Christopher Gutteridge, a member of the technical staff at the UoS. Gutteridge has worked closely with big data researchers for a long time and launched the institution's linked data site with a small team earlier this month.
"Bring the data back for everyone and then it's useful, and you don't just bring it back in silos," he says.
But he acknowledges that the linked data model isn't for everyone.
Bag it, tag it and let's see what else is there
Can you cross-reference tiger owners with all the moustache-wearing sous chefs within a 4-mile radius?
"People want controversy," says Gutteridge. "It's more useful if you use a universally unique key [such as a URI, about which more later] for public data – more useful wherever possible to use the same keys as other people to join up the data. Well, 'doh' ... But the common response is 'I thought all computers already did that'."
"On TV when you see that person who says 'can you cross-reference it with all the people who have blue SUVs and like Golf' and the FBI researcher just goes 'yes, sure', *taps furiously into pretend keyboard on desk*. Well yeah, we're trying to build that, cross-referenced with massive privacy issues. But open data researchers are the ones who are most concerned about privacy because we're the ones who know what's going to happen next."
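Mechanically, the telly-style cross-reference Gutteridge describes is just a join over datasets that happen to share a key – which is why he bangs on about using the same identifiers as everyone else. A toy sketch in Python, with every name and URI invented for illustration:

```python
# Two independently published datasets, joined on a shared key
# (a hypothetical person URI). All data here is made up.

vehicle_owners = {
    "http://example.org/person/alice": "blue SUV",
    "http://example.org/person/bob": "red hatchback",
}
hobbies = {
    "http://example.org/person/alice": {"golf", "chess"},
    "http://example.org/person/bob": {"golf"},
}

# The join only works because both datasets use the SAME key for the
# same person -- Gutteridge's point about shared identifiers.
suspects = [
    person for person, vehicle in vehicle_owners.items()
    if vehicle == "blue SUV" and "golf" in hobbies.get(person, set())
]
# suspects == ["http://example.org/person/alice"]
```

If the two publishers had minted their own private identifiers for the same people, no amount of furious keyboard-tapping would line the records up.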
However, privacy isn't the only issue potentially hampering university researchers who are keen to push the linked data agenda. Web standards are a thorny issue, too.
"Resource Description Framework [RDF, queried using the SPARQL language, is the data model of choice in Southampton as well as for the government's own data.gov.uk effort] is simple to use; it has got this mystique amongst some people as being very complicated and difficult, but it has a very simple data model," says John Goodwin, the Ordnance Survey's senior research scientist, who just so happens to have a PhD in time travel – so maybe he can see into the future ...
Linked data's timelord, perhaps?
"There are a lot of people in the web developer community who do prefer standard ways of doing things that they are used to such as APIs, XML, JSON. 'This is another new thing, why do I have to learn this?' tends to be the argument among hold-outs.
"With linked data it's a big bucket of datasets. You can keep enhancing it and adding stuff, which makes it much more flexible than other software modelling systems."
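Goodwin's "big bucket" flexibility is easy to see in miniature: adding a brand-new kind of fact to a pile of triples needs no schema change, just another triple. A minimal sketch in plain Python, with hypothetical example URIs:

```python
# A "dataset" as a bucket of (subject, predicate, object) triples.
# URIs below are invented for illustration.
dataset = {
    ("http://example.org/id/zepler-cafe",
     "http://example.org/ontology/on-campus",
     "http://example.org/id/highfield-campus"),
}

# Later, someone decides to publish opening hours too. In a fixed
# relational schema that would mean altering a table; here it is
# simply one more triple with a new predicate.
dataset.add(
    ("http://example.org/id/zepler-cafe",
     "http://example.org/ontology/opens-at",
     "08:30"),
)
```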
But it remains a relatively niche skill in developer circles. The UK is at the forefront of linked data projects, with smaller efforts underway in the US, Germany and Ireland.
"There are probably less than 1,000 people in the world who can just sit down and write RDF. Build a full linked data website using RDF, backend, etc ... The whole point is we now need to demystify the whole thing," says Gutteridge.
"The skills of taking a data system and understanding how to map it into RDF so that it can be useful is bloody hard. It requires someone who can see the data, understand the structure, understand how it will be used and then map between two spaces in their head."
He simply wants to get on with the work rather than see the UoS and other universities caught up in a data-churning loop.
So how does the metadata encoder for the semantic web work?
"RDF comes in triples – a thing [the subject] is related to another thing [the object] via some kind of property [the predicate]," explains Goodwin.
"Each of these things is identified by URI [uniform resource identifier]. Think URL, roughly speaking. But a URI can represent absolutely anything. All web addresses are URIs but not the other way around."
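Goodwin's triple model can be sketched in a few lines of plain Python – facts as (subject, predicate, object) tuples, with URIs naming things and properties. The URIs below (bar the real W3C rdfs:label one) are hypothetical examples:

```python
# Each fact is a (subject, predicate, object) triple; the parts that
# name things or properties are URI strings. The example.org URIs are
# illustrative, not real dereferenceable resources.
triples = {
    ("http://example.org/id/building-32",
     "http://example.org/ontology/located-in",
     "http://example.org/id/highfield-campus"),
    ("http://example.org/id/building-32",
     "http://www.w3.org/2000/01/rdf-schema#label",
     "Building 32"),  # objects may also be plain literals, like a name
}

def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# Everything known about building 32:
facts = match(triples, s="http://example.org/id/building-32")
```

The `match` pattern – fix some positions, leave others as wildcards – is, in spirit, what a SPARQL query does at scale.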
In other words, a set of rules needs to be adhered to in order to get data published on the web in a meaningful yet heavily distributed way. According to both Gutteridge and Goodwin, more linked data is available to them today than just two years ago. But for such a project to prove a success in the long term, many more web developers need to join the show.
Waiting to exhale
In effect, Berners-Lee advocates want to link up data in an eloquent and constructive way on the web using something called DBpedia as the central repository for information garnered online. And yes, chillingly for some Reg readers, that does involve using Wikipedia as a major data source. It doesn't just take a 'suck it and see' approach, however; instead it grabs "structured information" using sophisticated queries against Jimbo Wales's database and, importantly, links other datasets to Wikipedia from around the web.
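By way of illustration, a client might pose one of those structured queries to DBpedia's public SPARQL endpoint along these lines. Treat the query and property choice as a sketch; nothing is actually sent over the network here, we only build the URL:

```python
from urllib.parse import urlencode

# Ask for things DBpedia records as being located in Southampton.
# dbo:location is a real DBpedia ontology property, but take this
# query as illustrative rather than battle-tested.
sparql = """
SELECT ?thing WHERE {
  ?thing <http://dbpedia.org/ontology/location>
         <http://dbpedia.org/resource/Southampton> .
}
LIMIT 10
"""

url = "https://dbpedia.org/sparql?" + urlencode({"query": sparql})
# `url` could now be fetched with urllib.request, with results coming
# back as XML or JSON.
```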
In other words it's a bit like telling a web surfer that the populist, if not wholly reliable, online encyclopedia shouldn't be the only source of information. Perhaps proponents of DBpedia would be happy if the database was eventually likened to a detail-obsessed librarian whose middle name is Pedant. That certainly appears to be the goal, at least.
But unlocking information online remains a huge challenge, despite having a government in the UK that endorses the linked data desires of Berners-Lee, Southampton and others.
"Many research and evaluation projects in the few years of the Semantic Web technologies produced ontologies, and significant data stores, but the data, if available at all, is buried in a zip archive somewhere, rather than being accessible on the web as linked data," explained Berners-Lee back in 2006.
'PDF is an embarrassment to our species'
Currently, if public information is made available online, problems remain with the kind of data formats that are all too readily used by local government departments, academic institutions and other parts of the public sector.
"PDF is an embarrassment to our species," Gutteridge says of Adobe Systems' once-proprietary but now open standard for document exchange.
"PDF is a brilliant way to simulate A4 or portrait views. It was natural to create a new piece of technology to simulate the old ... But our screens are all A4 landscape yet there is this stupid insistence that the portrait way is still developed. It's a legacy thing and we haven't got around to getting rid of it yet. I've been cringing at it for the past 10 years."
The reality of course is that it's here to stay for now, even if the government is trying to shunt local authorities over to publishing data in CSV and other more open data-friendly formats.
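Those CSV files are the raw material for the mapping job Gutteridge earlier described as "bloody hard": turning row-oriented council data into triples. A minimal sketch, with the CSV columns and URI scheme invented for illustration:

```python
import csv
import io

# A made-up two-row CSV of the sort a council might publish.
raw = io.StringIO(
    "id,name,postcode\n"
    "b32,Building 32,SO17 1BJ\n"
)

triples = []
for row in csv.DictReader(raw):
    # Mint a URI per row so other datasets can point at the same thing.
    subject = "http://example.org/id/" + row["id"]
    triples.append((subject, "http://example.org/ontology/name", row["name"]))
    triples.append((subject, "http://example.org/ontology/postcode", row["postcode"]))
```

The hard part Gutteridge means isn't the loop – it's deciding which columns become identifiers, which become properties, and which keys to share with everyone else.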
"We can publish papers in a way that anyone can read for free without restriction, it should be open and eventually linked ... It's going to be a long uphill struggle. People are wasting massive amounts of effort by building spreadsheets in each university with the same sort of data and building custom tools," says Gutteridge.
"But you can do so much more in an open model, keeping in mind some things are still commercially sensitive and you still exercise common sense. So I don't publish my home address or banking details in semantic form, for example ... The only real risk is the people who are used to a closed world and haven't worked out they're saying too much about themselves on Facebook."
Interestingly, all researchers at the UoS are "obliged" to make their data open. "They don't have the right to make it appear only enclosed ... We've shifted the tide, it's not perfect yet," Gutteridge explains.
He admits that the notion of a semantic web is "a challenge because you need to trust your sources".
But Gutteridge prefers to be knee-deep in code.
"Linked data is still semantic web – it's just ditching all the hard stuff. We're not abandoning it, but we're not making it the goal. Ultimately, we provide the tools. Let the politicians do the arguments."
He also concedes: "We will learn down the line that we've cocked up certain ways of doing things with linked data. It's a learning process. Things restructure all the bloody time. A renumbered building, for example, could break the linked data system. It's down to temporal, real-time data. The system's not perfect, but you've got to relax, these are the 404s of the semantic web. For it to work, it has got to work while being a bit broken." ®