The Ropebridges: Authority Control in Wikidata

You may recall that our Wikipedia reciprocal linking robot “VIAFbot” finished adding Authority Control to more than a quarter of a million English-language Wikipedia articles. But what was the utility? Five months on, that question has been answered. Luckily, and unsurprisingly, other netizens have found further uses for the Wikipedia-to-VIAF links. Unanticipated reuse is the magic of collaborative and open datasets, and four such examples highlight the benefits of library data in Wikipedia.

First was John Mark Ockerbloom’s Forward To Libraries, which proposes “find in a Library” boxes in Wikipedia pages. The idea is compelling: facilitate automatic searches in your preferred library site on the topics of Wikipedia articles. One option utilizes VIAF IDs.

Similar look-up facilities were created by Owen Stephens and Thomas Meehan, directing pointed inquiries at the British Library site and other UK academic resources. Stephens’ “contemporaneous” finds authors who share a birth year with the subject of the Wikipedia page in question. Meanwhile, Meehan’s bookmarklet will funnel you into relevant pages linked by VIAF at UCL’s Explore and at COPAC.

VIAF connections can also pave the way for new scholarly research. A team from Vienna University of Technology released a paper that visualized the Art History networks of Wikipedia, linking through VIAF IDs and then ULAN. Here you can see the proportion of Art History subjects in Wikipedia, displayed on two dimensions derived from the ULAN connection: time and nationality.

Credit http://vsem.ec.tuwien.ac.at/wikiarthistory/

All of this is to say that VIAF data in English Wikipedia can act as a very good ropebridge that allows for reuse or recombination. The idea of a ropebridge is apt because the connection is somewhat shaky: at the moment it’s free-text, semi-structured data that can be changed by anybody. But that doesn’t mean that the chasm isn’t being crossed.
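To make "free text, semi-structured" concrete, here is a minimal sketch of how a reuser might pull identifiers out of an article's wikitext. It assumes the identifiers sit in an `{{Authority control|...}}` template with `NAME=value` parameters; the sample wikitext, the identifier values, and the helper names `parse_authority_control` and `viaf_url` are all made up for illustration, and this is not the code VIAFbot itself uses.

```python
import re

# Hypothetical wikitext; the identifier values are placeholders, not real IDs.
SAMPLE_WIKITEXT = (
    "Some article text.\n"
    "{{Authority control|VIAF=12345678|LCCN=n/00/000000}}\n"
)

# Assumption: identifiers live in one flat template with no nested templates.
TEMPLATE_RE = re.compile(
    r"\{\{\s*Authority control\s*\|([^{}]*)\}\}", re.IGNORECASE
)


def parse_authority_control(wikitext):
    """Return {identifier name: value} from the first template found."""
    match = TEMPLATE_RE.search(wikitext)
    if match is None:
        return {}
    identifiers = {}
    for part in match.group(1).split("|"):
        if "=" not in part:
            continue  # skip positional or malformed parameters
        key, value = part.split("=", 1)
        key, value = key.strip().upper(), value.strip()
        if value:
            identifiers[key] = value
    return identifiers


def viaf_url(viaf_id):
    """Build a VIAF link, assuming the https://viaf.org/viaf/<id> pattern."""
    return "https://viaf.org/viaf/" + viaf_id


ids = parse_authority_control(SAMPLE_WIKITEXT)
link = viaf_url(ids["VIAF"])
```

Because anybody can edit the template text, a parser like this has to tolerate missing or malformed parameters, which is exactly the shakiness the ropebridge metaphor describes.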

Can you spot the weakness in all this collaboration, though? We focused our first effort on English-language Wikipedia. The Germans, to their credit, have just as many VIAF IDs in their Wikipedia. The Italians copied the English-language data. However, these separate efforts are not scalable to all 285 Wikipedias, nor do they allow all 285 Wikipedias to collaborate on the language-neutral VIAF unique identifiers.

Fortunately there is a solution, and that solution is Wikidata. Wikidata is the first new Wikimedia project since 2006, and it will do three things. It will organize inter-language links into a central database (inter-language linking before was arduous and asymmetric). It will provide a central store of semantic data from the Wikipedia articles. And in the future it will make that semantic data queryable. Want to know more about Wikidata? Then look up Wikidata on Wikidata (obviously?!).

Now for a surprise: I’ve just finished migrating English Wikipedia’s VIAF data to Wikidata, and the German, French, Italian, and Japanese datasets are in progress. (Code on GitHub.) It takes about two weeks to inspect, clean, and copy the data over from each Wikipedia. I’ll post a full statistical breakdown once all the languages have finished. For now I’ll just say that the Wikidata VIAFbot is also migrating LCCN, GND, BNF, and SUDOC identifiers, and is integrating ISNI IDs for the first time. At the time of this writing it has recorded 750,000 edits and counting.

What does VIAF in Wikidata look like, you ask? All pages about encyclopedic concepts are known as “Items” in Wikidata parlance, so let’s inspect the item for Germaine Greer.

We first see all the semantic data Wikidata has about this topic. Each modicum of data is known as a “Claim” in Wikidata. A claim is a triple, structured as [this page] [property] [value]. You can see that [Germaine Greer] [GND (read: “is a”, according to the German National Library)] [Person], and that [Germaine Greer] [is of sex] [female]. You can also see here that she’s got a lot of identifiers associated with her thanks to VIAFbot, which has recorded the source where it found the original VIAF ID. Now let’s turn our attention to the bottom of the page to understand the impact.
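The triple structure of a claim can be sketched in a few lines. This is only an illustrative model, not Wikidata’s actual data format or API: the `Claim` type, the property labels, the `values_for` helper, the placeholder identifier value, and the “English Wikipedia” source string are all assumptions chosen to mirror the examples above.

```python
from typing import NamedTuple, Optional


class Claim(NamedTuple):
    """One claim: [item] [property] [value], plus an optional source."""
    item: str
    prop: str
    value: str
    source: Optional[str] = None


# Illustrative claims mirroring the Germaine Greer examples in the text.
# The VIAF value is a placeholder; the source string is an assumption.
claims = [
    Claim("Germaine Greer", "GND type", "person"),
    Claim("Germaine Greer", "sex", "female"),
    Claim("Germaine Greer", "VIAF identifier", "<VIAF ID>",
          source="English Wikipedia"),
]


def values_for(all_claims, item, prop):
    """Return every value claimed for the given item and property."""
    return [c.value for c in all_claims if c.item == item and c.prop == prop]


sex_values = values_for(claims, "Germaine Greer", "sex")
sourced = [c for c in claims if c.source is not None]
```

Keeping the source alongside the triple is what lets a reader trace an identifier back to where the bot found it.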

This Wikidata page is associated with articles in 48 other languages. Each of those articles can capitalize on the semantic data stored above. That’s the beauty of Wikidata. It means that all of the data reuse cases that previously only worked for the English-language Wikipedia will now work for all of them. Austrian researchers can inspect the Art History biases of not just English Wikipedia, but of dansk, Ελληνικά, हिन्दी, interlingua, Runa Simi, 中文, and so on. That’s one of the first reasons why it’s important to have Authority Control in Wikidata. There is, of course, more than one direction to travel across a ropebridge. Leading data-mules of bibliographic information across from VIAF into Wikidata is next.

Max Klein

One Comment on “The Ropebridges: Authority Control in Wikidata”

Andy Mabbett says:

May 17, 2013 at 2:39 pm

Could you migrate ORCID data, too? There isn’t much in Wikipedia yet, but it’s fast growing. We hope to add ORCID IDs for the authors of papers in Wikipedia references, too, even if there isn’t an article about them.
