VIAFbot Debriefing

Shortly after reaching the quarter-million edits milestone, VIAFbot finished linking Wikipedia biography articles to VIAF. Examining the bot’s logs reveals telling statistics about the landscape of authorities on Wikipedia. We now know how much linked authority data is on Wikipedia, its composition, and the similarities between languages.

First, let’s understand the flow of the bot’s job. With VIAFbot I sought to reciprocate the links from VIAF to Wikipedia, which were algorithmically matched by name, important dates, and selected works. It therefore started by visiting all the Wikipedia links that existed on VIAF. Note that, owing to the delay between when the links were created and now, some of the pages had been deleted or merged (Fig. 1, orange region). For the rest of the set-up it used German Wikipedia, which has put a lot of focus on its authorities data. VIAFbot also loaded the German Wikipedia article equivalent to each of our English matches where one was available, via the “interwiki link” in Wikipedia parlance.

Next, VIAFbot searched for the equivalent structured-data templates, Authority control on English Wikipedia and Normdaten on German Wikipedia, to see what preexisting authorities data those pages held. German Wikipedia shone with 92,253 Normdaten templates (Fig. 1, purple region), of which 74,864 had the VIAF parameter filled (Fig. 1, pink region), compared to English Wikipedia’s mere 9,034 templates with 770 VIAF IDs.

Figure 1.
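
The template lookup described above can be sketched roughly as follows. This is a minimal illustration, not the bot’s actual code: the function name `find_viaf_id` and the regular expression are assumptions, and real wikitext (nested templates, alternative template names) would need a proper parser. It only shows how one might tell apart "no template", "template without a VIAF parameter", and "template with a VIAF ID", which are the three cases counted in Fig. 1.

```python
import re

# Hypothetical pattern: a flat {{Authority control|...}} or {{Normdaten|...}}
# template with no nested templates inside it.
TEMPLATE_RE = re.compile(
    r"\{\{\s*(Authority control|Normdaten)\s*(\|[^{}]*)?\}\}",
    re.IGNORECASE,
)


def find_viaf_id(wikitext):
    """Return (template_found, viaf_id), where viaf_id is None if absent or empty."""
    match = TEMPLATE_RE.search(wikitext)
    if match is None:
        return (False, None)
    params = match.group(2) or ""
    for part in params.split("|"):
        name, sep, value = part.partition("=")
        # Only accept a named VIAF parameter that actually has a value.
        if sep and name.strip().upper() == "VIAF" and value.strip():
            return (True, value.strip())
    return (True, None)
```

For example, `find_viaf_id("{{Normdaten|TYP=p|VIAF=12345}}")` gives a found template with the ID `"12345"`, while a bare `{{Authority control}}` counts as a template without a VIAF ID.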

The program then compared the VIAF IDs supplied by English Wikipedia, German Wikipedia, and VIAF, although all three sources were not always present. As long as the sources that were present didn’t conflict, VIAFbot wrote the VIAF ID to the English Wikipedia page. If a conflict was found, the bot noted it for human inspection on Wikipedia, along with which sources conflicted. One telling statistic was how often the different sources disagreed with one another. These disagreement rates were surprisingly similar, but German Wikipedia seemed to disagree with VIAF marginally less, at 11.3% compared to English Wikipedia’s 15.9% (Fig. 2).

Figure 2.
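
The comparison step can be sketched as below. This is a minimal sketch under stated assumptions rather than VIAFbot’s real implementation: the function name `reconcile`, the source labels, and the return shape are all hypothetical. It assumes a missing source is passed as `None`, writes when every source that is present agrees, and otherwise flags the case for human inspection together with what each source said.

```python
def reconcile(english_id, german_id, viaf_id):
    """Decide what to do with up to three candidate VIAF IDs.

    Returns (action, viaf_id_to_write, sources_present), where action is
    "write", "flag" or "skip". Labels and return shape are illustrative only.
    """
    sources = {"enwiki": english_id, "dewiki": german_id, "viaf": viaf_id}
    # Keep only the sources that actually supplied an ID.
    present = {name: value for name, value in sources.items() if value}
    distinct = set(present.values())
    if not distinct:
        return ("skip", None, present)
    if len(distinct) == 1:
        # Every source that is present agrees, so the ID can be written.
        return ("write", distinct.pop(), present)
    # At least two sources disagree: leave it for a human, with the evidence.
    return ("flag", None, present)
```

Calling `reconcile(None, "123", "123")` yields a write of `"123"`, whereas `reconcile("1", "2", None)` is flagged with both conflicting values attached.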

In the uncontroversial cases with no disagreement, of which there were 254,678, some errors of a different variety were still found. Even though there was no disagreement among the sources, and probably in instances in which VIAF was the only source, the wrong VIAF number was sometimes written. Some very dedicated Wikipedians took to reporting these errors, and VIAF will incorporate those corrections. That is the power of crowdsourcing to refine algorithmic accuracy.

The question still remains: how much are these links being used? Google Analytics on the VIAF site can help answer that. German Wikipedia was the largest Wikipedia referrer to VIAF as late as September 2012. VIAFbot started editing in October, and the effect was immediately tangible: English Wikipedia soon gained pole position and then doubled total referrals (Fig. 3). It must be said, though, that this level of viewership may not be sustained as the “curiosity clicks” of Wikipedians being notified of changes through their watchlists start to fade.

Figure 3. Referral traffic to VIAF

Still, don’t doubt the usefulness of the project. For instance, we received this email from John Myers of Union College in Schenectady, NY:

 “I had an Arabic name to enter into a record as part of a note, and I wasn’t confident about the diacritics.  So, I look in the authority file to temporarily download it, copy the form of the name, and then move on.  Couldn’t find the name in OCLC.  Look in Wikipedia under his common name – bingo.  Even better, Wikipedia has a link to VIAF, double bingo!  With the authorized form from VIAF, I could readily find the record in OCLC (I was tempted to copy the name form directly from VIAF, but didn’t want to push my luck.)  The miracles of an interconnected bibliographic dataverse!”

VIAFbot had written the link for ‘Aziz ‘Aku ak-Misri only a few days prior.

The principal benefit of VIAFbot is the interconnected structure. Recognizing this, other Wikipedias (Italian and Swedish) have been in contact and asked for the same on their wikis. Yet to be truly interconnected, the next step forward is to integrate VIAF IDs not into any one Wikipedia, but into the forthcoming Wikidata, a central database for all Wikipedias across languages. Fortuitously, the pywikidata bot framework is stabilizing, and I’m in need of a new project now.

Without confusion,

Max Klein (@notconfusing)








7 Comments on “VIAFbot Debriefing”

  1. I believe that VIAFBot is no longer active on en:wp — Max can correct me if I’m wrong (and I frequently am!) but VIAFbot was a bootstrapping exercise to seed en:wp with authority control and VIAF links. I think the idea is that now the community can step in and take over that work.

    1. Hi Bryan,
      The way I think this should be solved is by using the Wikidata data (although oddly it says that the correct one did come from English Wikipedia). The Authority control template would then detect changes in Wikidata, which the bot would corroborate. There is still discussion about, and inertia against, doing that, but it’s what I think would be best.

  2. JH,
    These are some excellent points. We will have to make the benefits obvious (such as automatic content generation) to better prove the worth of Authority records. VIAF is updated approx. once every six months, and I plan to run an update maintenance bot to help with those funny cases you mentioned.

    VIAF takes note of all corrections and installs them into the periodic updates.

  3. Thanks for the debrief!

    A couple of things:

    I encountered VIAF over the summer, when I was making some {{Creator}} entries over on WP Commons for some aquatintists and engravers that I’d uploaded some pics for. Looking to see whether they had VIAF entries was sort of neat, but for me the real payoff was seeing if the VIAF entry then linked to an LCCN entry, because that would automatically populate a WorldCat link from the template.

    I think you missed a trick with VIAFbot, not getting it to automatically add LCCN references where possible too. The link to VIAF is all very well, but apart from somebody who really knows their bibliographic systems, it’s not got that much value for the average end-user. On the other hand, the WorldCat link adds something they can immediately use, and get value from.

    This is important, if you compare it to eg {{coord}} information. There’s been a huge push to add this on en-wiki, because it is so instantly useful, there’s a real pay-off to the chore of having to work with the template and get the data into the right format. Similarly with authority control, I think. If you want people to take these to heart, and make the effort to find them, and add them, and look after them, that’s much more likely to happen if they obviously do something that’s user-friendly and useful. Currently an LCCN code does that, I think; but at the moment a VIAF code I suspect… doesn’t.

    The second thing that would be interesting would be a bit more on some of the bizarre errors that came back: for example, an article picking up a VIAF code for somebody with a completely different date of birth, while VIAF itself had a link to the correct en-wiki article. Or (my favourite) WP’s article on the Biblical character Aaron picking up a link to the VIAF for an obscure 19th century author.

    Is anything more known about these? Can you confirm whether they were errors in the VIAF when you made your data extraction, that have since been fixed? Or can they be traced back to a subtle bug in the bot; and if so, were there other VIAF mis-links introduced, that now need to be fixed? (Do you plan to re-extract from VIAF, and identify links to en-wiki that they have changed, which will need to be changed on WP?)

    But overall, a big thanks for introducing the Authority Control template to a quarter of a million WP articles, and here’s hoping that the idea that a bio ought to have an AC template gets taken up further, and that they become as well-curated and cherished as the co-ord templates have become.
