One key distinction among links to different online databases is opacity. If a URI gives you an indication of its contents, it can be called transparent (e.g. https://en.wikipedia.org/wiki/Lewis_Carroll); if it consists of an arbitrary identifier such as a string of numbers, it is considered opaque (e.g. https://www.wikidata.org/wiki/Q38082). There are advantages and disadvantages to both schemes.
Wikidata consists entirely of opaque identifiers, so to help us each item has a label and possibly many aliases. For item Q38082, the English label is “Lewis Carroll,” and the aliases are “Charles Lutwidge Dodgson” and “Charles Dodgson.” These names can be created and displayed in any of the 280 languages that Wikidata supports. Try setting your language to German and notice the differences.
Labelling and aliasing are important for understanding what we don’t understand. Even though there are 4 million articles in English Wikipedia, there are 10 million more Wikidata items. Wikidata items can, and do, have English labels without having English Wikipedia pages, for instance the item for Wolf Vilenski. Label completeness is the idea that every Wikidata item should be labelled in every language, regardless of which languages have linked Wikipedia articles. By moving towards label completeness we can see which articles exist in other languages but not in our own. Knowing what you don’t know is the first step towards expanding your knowledge into non-obvious areas and fighting your own systemic bias.
I automated solving part of this problem. Because I had connected over 400,000 Wikidata items with their VIAF IDs, and because VIAF records alternate names in different languages, new information was potentially available. The procedure went as follows. For each item with an associated VIAF ID, I found the potential alternative names in the VIAF record. These are the MARC 4XX fields, for those who care to know. Then I used a Python library called Guess Language to determine the probable language of each string of text. If the alternate name was in a language that didn’t already have a Wikidata label, the VIAF alternate name could immediately be added as a label.
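The label step above can be sketched roughly as follows. This is a minimal illustration, not the script that was actually run: the function name `new_labels_from_viaf`, the dictionary shapes, and the `toy_guesser` stand-in are all assumptions made for the example. A real run would pass in a proper language detector instead of the toy one.

```python
from typing import Callable, Dict, List, Optional


def new_labels_from_viaf(
    existing_labels: Dict[str, str],
    viaf_names: List[str],
    guess_language: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    """Return {language: name} for languages that have no label yet.

    existing_labels: hypothetical {language code: label} for one item.
    viaf_names: alternate names taken from the item's VIAF record.
    guess_language: any callable returning a language code or None.
    """
    proposed: Dict[str, str] = {}
    for name in viaf_names:
        lang = guess_language(name)
        if lang is None:
            continue  # the detector could not decide, so skip the name
        # Only propose a label where the language has none yet.
        if lang not in existing_labels and lang not in proposed:
            proposed[lang] = name
    return proposed


def toy_guesser(text: str) -> Optional[str]:
    """Toy detector (assumption): Cyrillic characters mean Russian."""
    if any("\u0400" <= ch <= "\u04ff" for ch in text):
        return "ru"
    return None


labels = {"en": "Lewis Carroll"}
names = ["Льюис Кэрролл", "Carroll, Lewis"]
print(new_labels_from_viaf(labels, names, toy_guesser))
```

Passing the detector in as a parameter keeps the sketch independent of any particular library's API.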
If, however, there was a preexisting label in the language of the VIAF alternate name, I compared the VIAF alternate name against the label and aliases using Levenshtein distance. If it was less than 65% similar (a threshold I determined by training a Bayesian filter), then I would write the VIAF alternate name as an alias. Here are the results.
- “New AKA” means that an alias was added with VIAF data.
- “New Label” means that a label was added with VIAF data.
- “Had AKA” means that the VIAF alternate name was already in Wikidata as an alias.
- “Had Label” means that the VIAF alternate name was already in Wikidata as a label.
In total, 5,904 new aliases and 8,197 new labels were added to Wikidata using the VIAF links that were imported. This goes to show some of the value of having such an interconnected world.
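For the curious, the comparison step and the four result categories could be sketched like this. It is one reading of the procedure, under stated assumptions: similarity is taken to be 1 minus the Levenshtein distance divided by the longer string's length (the post does not say how the distance was normalised), the helper names are hypothetical, and the Bayesian training of the threshold is not reproduced.

```python
from typing import List, Optional

THRESHOLD = 0.65  # similarity cut-off quoted in the post


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 is identical, 0.0 is nothing shared."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def classify(viaf_name: str, label: Optional[str], aliases: List[str]) -> str:
    """Map one VIAF alternate name to one of the four result categories."""
    if label is None:
        return "New Label"  # no label yet in the guessed language
    if similarity(viaf_name, label) >= THRESHOLD:
        return "Had Label"
    if any(similarity(viaf_name, alias) >= THRESHOLD for alias in aliases):
        return "Had AKA"
    return "New AKA"  # different enough from everything already there


print(classify("Charles Dodgson", "Lewis Carroll", ["Charles Dodgson"]))
```

Under this normalisation, a much longer variant of an existing alias can still fall below the threshold and be written as a new alias, which is one reason the choice of threshold matters.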
One of the problems with this method is that it could be much stronger. Say, for instance, that a VIAF alternate name is spelled the same way in English and German and could be added to both the English and German aliases: it would not be added to both. It would be added only to the language that the Python library Guess Language determines for the name. Some business logic could probably be added to say that English and German can share aliases, but I erred on the side of caution because of the possibly unforeseen consequences of those kinds of assumptions. I’d be glad to hear any suggestions for solving this particular problem, if anyone can make it less confusing, or to hear from you at twitter.com/notconfusing in general.