Unsourced, unreliable, and in your face forever: Wikidata, the future of online nonsense

Like the 'pedia but without the footnotes

Sewage: Half of Wikidata's assertions are unsourced

Special report Lobbying companies, PR professionals and search engine optimisers are flocking to influence Wikidata, a sister project of Wikipedia that’s backed by serious money. And that’s just one of the reasons to be concerned about a project that could become the world’s default source of information.

The information in Google's Knowledge Graph – the info box you see on the right-hand side of search result pages – has given the internet a good few laughs.

Based on crowdsourced projects like Wikipedia and Freebase, it sometimes displays wrong (or nude) images and Wikipedia vandalism, and it's been susceptible to SEO manipulations of its underlying Freebase layer. With the demise of Freebase, Wikipedia spin-off Wikidata is set to step into the breach. But will that make things better?

Wikidata was kickstarted three years ago with money from Microsoft co-founder Paul Allen, Google*, and the Gordon and Betty Moore Foundation. According to Wikimedia statistics, after three years of work – much of it done by Wikipedia-scraping bots – half of the statements in Wikidata lack any source reference whatsoever. Another 30 per cent only indicate they come from Wikipedia. They don't even identify a specific article version (Wikipedia articles can have hundreds of them), but simply state, for example, “Latvian Wikipedia”.

Perhaps not every statement in Wikidata is in desperate need of references. The lack of a source confirming that the mother of Jesus Christ was called Mary is probably forgivable. But more statements in Wikidata are sourced to Wikipedia than to all other sources combined.
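The sourcing breakdown described above can be illustrated programmatically. Below is a minimal sketch, assuming the general JSON shape of statements returned by Wikidata's `wbgetclaims` API: each statement may carry a `references` list, and a reference whose snaks use only P143 (“imported from Wikimedia project”) points back at Wikipedia rather than an outside source. The sample statements are invented for illustration.

```python
from collections import Counter

# Wikidata property "imported from Wikimedia project" – a reference
# consisting solely of this property is circular (it cites Wikipedia).
WIKIPEDIA_IMPORT = "P143"

def classify(statement):
    """Return 'unsourced', 'wikipedia-only', or 'externally sourced'."""
    refs = statement.get("references", [])
    if not refs:
        return "unsourced"
    # A statement counts as externally sourced if any reference cites
    # something other than a Wikimedia import (e.g. P248, "stated in").
    for ref in refs:
        if any(prop != WIKIPEDIA_IMPORT for prop in ref.get("snaks", {})):
            return "externally sourced"
    return "wikipedia-only"

# Invented sample mirroring the shapes the article describes:
sample_statements = [
    {},                                            # no reference at all
    {"references": [{"snaks": {"P143": [{}]}}]},   # "Latvian Wikipedia"
    {"references": [{"snaks": {"P248": [{}]}}]},   # cited to an outside work
]

print(Counter(classify(s) for s in sample_statements))
```

Run against a real item's claims, a tally like this would reproduce the proportions the Wikimedia statistics report: statements with no reference, statements referenced only to a Wikipedia, and the minority referenced to the outside world.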

As search engines move from being directories of links to publishing content themselves, there are obvious reasons for them to be interested in a project like Wikidata. By displaying free content in response to queries, they can stop users from clicking through to other sites, making them stay until they click on a paying ad.

Google's answer, via its Infobox, asserts that Jerusalem is the capital of Israel.

Wikidata provides another advantage: unlike Wikipedia and Freebase, it has a very permissive licence allowing third parties to use its content without attribution. Today, when Bing shows you content from Freebase or Wikipedia, it says so. With Wikidata, it won’t have to.

Wikimedian Max Klein made an insightful comment on this in an interview last year, acknowledging that Google's and Microsoft's funding of Wikidata “could seem like they are just paying to remove a blemish on their perceived omniscience” (before banishing such doubts).

Zombie data

Wikipedia contains hoaxes. Some of them have lasted as long as ten years. Wikidata’s bots don’t notice when a Wikipedia article they’ve harvested information from is deleted as a hoax.

Among the fifteen longest-lived hoaxes currently listed at Wikipedia:List of hoaxes, six (nos. 1, 2, 6, 7, 11, 13) still have active Wikidata entries at the time of writing.

For five months in 2014, Wikidata said that Franklin D. Roosevelt was also known as "Adolf Hitler". What, then, are the chances that more subtle falsehoods and manipulations will be detected before they spread to other sites?

Yet this is the project that Wikimedians like Max Klein imagine could become the "one authority control system to rule them all".

Citogenesis on Steroids

Cartoonist and author Randall Munroe coined the word “Citogenesis” to describe the circular flow of information: an unsourced claim added to Wikipedia is repeated by outside publications, which are then cited in Wikipedia as references for the very claim they copied.

For example, Wikidata is now used as a source by the Virtual International Authority File (VIAF), while VIAF in turn is used as a source by Wikidata. In the opinion of one Wikimedia veteran and librarian I spoke to at the recent Wikiconference USA 2015, the inherent circularity in this arrangement is destined to lead to muddles which will become impossible to disentangle later on.

Whither Wikidata?

Many Wikimedians feel problems such as those described here are not all that serious. They feel safe in the knowledge that they can fix anything instantly if it's wrong, which provides a subjective sense of control. It's a wiki! And they take comfort in the certainty that someone surely will come along one day to correct any other error that might be present today. This is a fallacy.

The typical end user is blissfully unaware of Wikidata. To them, the fact that any error in Wikidata or Wikipedia may be fixed at some point in the future is immaterial. Undetected falsehoods have consequences for them today.

Wikidata needs more emphasis on controlling incoming quality. Statements in Wikidata should be referenced to sources published outside the Wikimedia universe, following the same principles that underpin Wikipedia’s Verifiability policy.

Control the Infobox, Control the People

Just over half of all statements in Wikidata are unreferenced, according to the latest published figures.

“Making the Web more machine-readable comes with a price,” the Oxford Internet Institute’s Mark Graham wrote recently in Slate. The price is paid when data are simplified and stripped of context so that machines can read them.

In a paper co-written with Heather Ford of the School of Media and Communication at the University of Leeds, he has examined the problems that can result when Wikidata and/or the Knowledge Graph provide the public with a single, unattributed answer. Nuance is lost along with the data’s provenance. The process of generating the information is more opaque to the end user than ever before.

As Wikidata informs search engines and other sites, its content may reach an audience of billions. This is the sort of power many will desire, and they’ll surely flock to Wikidata.

It's a propagandist's dream. Anonymous accounts. Assured identity protection. Plausible deniability. No legal liability. Automated import and dissemination without human oversight. Authoritative presentation without the reader being any the wiser as to who placed the information and which sources it is based on. Massive impact: search engines have the power to sway elections.

Is a global information system with such vulnerabilities all that wonderful? The right to enjoy a pluralist media landscape, populated by players who are accountable to the public, was hard won in centuries past. Some countries still do not enjoy that luxury today. It should not be given up carelessly, in the name of progress, for the greater glory of technocrats. ®

Andreas Kolbe serves on the editorial board of Wikipedia's community newsletter, The Signpost, where a longer version of this article can be found.

*Bootnote

Wikidata project leader Denny Vrandecic is a Google employee and became a Wikimedia Foundation board member this year. Russian search engine Yandex has also invested in Wikidata.

