Now that VIAFbot has finished importing VIAF IDs into Wikidata, I wanted to demonstrate what kind of work could be done with those connections. In March, using Wikidata, I investigated the sex ratios of different Wikipedias. I cross-referenced the Wikidata items that use the semantic property ‘sex’ with the Wikipedia language versions that contain those items. (If you’re getting confused by the jargon, I give explanations in this YouTube tutorial.) It turned out to be a flattering affair for the Tagalog and Chinese Wikipedias, whose sex ratios were the most even, albeit at 29.4% and 20.6% female respectively. Another interesting finding was that some of the items that had the ‘sex’ property also had the ‘VIAF’ property, a link into the VIAF database. The national libraries contributing to VIAF also record the sex of certain entities, which means that new comparisons are possible.
The bot first queried for all the Wikidata items that have the VIAF property. There are 388,829 such Wikidata items as of 18 June 2013. Each of them contains a VIAF ID, which can then be made into the URI of the record at VIAF.org. At the VIAF.org record we can get VIAF’s opinion on the sex. That opinion is the result of a behind-the-scenes merge of all the national library files. Unfortunately, with such a merge not all the data and its provenance are preserved. But because we live in a linked data era, it’s possible to follow links out of VIAF into the online databases of the contributing libraries. Here I used the Library of Congress because I like their data model for dealing with complex sex cases. Library of Congress will record multiple sexes with applicable dates if they exist, which is a step in the right direction compared to the problematic binary (well, trinary) Wikidata model. That gave me three data sources to compare. The rule I used was that specific information from Library of Congress trumped the merged information on VIAF. I then took that single ruling library opinion and held it up against Wikidata. If the opinions matched, I added a source to the Wikidata claim; if no Wikidata opinion existed, I added a sourced claim. In the cases where Wikidata and VIAF disagreed with one another, I made a list. Let’s take a look at how often each of those cases occurred.
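The decision rule described above can be sketched in a few lines of Python. This is only an illustration, not VIAFbot's actual code: the function names, the string labels for each outcome, and the use of `None` for "no opinion" are all assumptions made for the example, and the sketch does no network access. The URI pattern `http://viaf.org/viaf/<ID>` is the usual VIAF permalink form.

```python
from typing import Optional


def viaf_uri(viaf_id: str) -> str:
    """Build the VIAF.org record URI from a VIAF ID (standard permalink form)."""
    return f"http://viaf.org/viaf/{viaf_id}"


def library_opinion(viaf_sex: Optional[str], loc_sex: Optional[str]) -> Optional[str]:
    """Specific Library of Congress data trumps the merged VIAF value."""
    return loc_sex if loc_sex is not None else viaf_sex


def reconcile(wikidata_sex: Optional[str],
              viaf_sex: Optional[str],
              loc_sex: Optional[str]) -> str:
    """Return a (hypothetical) label for the action to take on one item."""
    library = library_opinion(viaf_sex, loc_sex)
    if library is None:
        return "no_library_data"      # nothing to compare against
    if wikidata_sex is None:
        return "add_sourced_claim"    # Wikidata had no opinion yet
    if wikidata_sex == library:
        return "add_source"           # opinions match: cite the library
    return "list_conflict"            # disagreement: log it for human review


if __name__ == "__main__":
    print(viaf_uri("102333412"))
    print(reconcile("female", "male", None))
```

Keeping the rule as a pure function of the three opinions makes it easy to count how often each outcome occurs over the whole data set.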
As I mentioned, our entire eligible data set consists of 388,829 Wikidata items that have the ‘VIAF’ property. Of those, there is a subset that has both Wikidata sex data and VIAF sex data. That subset totals 131,650. Reassuringly, only 0.2% of this subset disagree. For each of the 311 items in that 0.2%, a bit of hunting should be able to right it. For instance, the Deutsche Nationalbibliothek thinks Nadine Warmuth is male, but Wikidata thinks female. On the other hand, Wikidata thinks that Nguyen Thi Binh is female, but VIAF suggests otherwise.
There are also instances where we don’t have two sources to conflict with each other. There were 125,781 times when Wikidata had sex data that VIAF did not. Maybe this is a case where libraries could glean a datum or two. Certainly Wikidata was pleased to be informed in the 44,526 scenarios where VIAF or LoC had sex information but Wikidata did not.
Lastly, and just for perspective, each time I handled any sex information I kept track of the content as well. 257,431 of the Wikidata items had sex data, split 14.7% female, 85.3% male, and 0.002% “Intersex”. (The strict classification system that has this poorly named “other” category is a problem I’ve talked about before.) The sex data that came from VIAF told a very similar story at 14.6% / 85.4% / 0.006% female / male / “nuanced”, even with a smaller sample of 176,187. It goes to show that sex data as a whole in both Wikidata and VIAF is skewed. At least now there is more of it, and it has a citation.
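As a quick sanity check, the counts quoted in this post can be added up in Python. The variable names are made up for the example; the numbers are the ones given above. The Wikidata total reproduces exactly, while the library-side sum comes to 176,176, slightly below the 176,187 quoted; the figures given here do not explain that gap of 11.

```python
# Counts quoted in the post (variable names are illustrative only).
BOTH_HAVE_DATA = 131_650   # items with Wikidata and VIAF sex data
DISAGREEMENTS = 311        # items where the two opinions conflict
WIKIDATA_ONLY = 125_781    # Wikidata had sex data, VIAF did not
LIBRARY_ONLY = 44_526      # VIAF or LoC had sex data, Wikidata did not

# Share of the overlapping subset that disagrees, as a percentage.
disagreement_pct = DISAGREEMENTS / BOTH_HAVE_DATA * 100

# Totals per source, rebuilt from the three disjoint cases.
wikidata_total = BOTH_HAVE_DATA + WIKIDATA_ONLY
library_total = BOTH_HAVE_DATA + LIBRARY_ONLY

if __name__ == "__main__":
    print(f"disagreement: {disagreement_pct:.2f}%")
    print(f"Wikidata items with sex data: {wikidata_total:,}")
    print(f"library records with sex data: {library_total:,}")
```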