Last week I wrote about the ‘rope bridge’ between Wikidata and VIAF, and the new research it would afford. Today I bring you a sample of that research. I am investigating the sex associated with different Wikipedia Biography Articles for two reasons. Firstly, the Properties “Sex” and “VIAF” are two of the top 10 most used Wikidata Properties, with Sex at 587,312 items tagged, and VIAF at 301,763 (and rising, since VIAFbot hasn’t finished scraping all languages yet). VIAF independently records sex per VIAF item, which gives us two comparable datasets. Secondly, after the so-called “Categorygate” piece in the New York Times, I dug into Wikidata’s Sex Property and wanted to shed some light on the model currently in use.
Currently the Wikidata Property for Sex states:
Sex for humans, should be one of male, female, intersex, or the special “unknown” value
Finding this to be a rather rigid view of the world, I started discussing it on the Discussion Page as per protocol. Of note, on the other hand, is that VIAF records “gender,” not “sex.” The current VIAF data model similarly limits values to male, female, or unknown, but a change to a more nuanced model is planned for June. It’s worth remembering that VIAF is populated with data from the many authority files it aggregates. One underlying authority file, which has a more nuanced view on this recording, is the Library of Congress Control Number (LCCN). The LCCN will record many “sexes” for a specific person with accompanying dates of validity. This at least shows that there are better ways of recording sex (if it’s necessary to record it at all), which prompts me to invite your input on the Wikidata Discussion Page about better ways to record sex. With that said, let’s dig into some graphs. (Click to see larger versions.)
Sex Ratios by Language
The method behind this visualization is to take all the Wikidata items with Property:Sex and then look at the inter-language link section of each item to see which languages have articles relating to it. Dividing along the lines of language, we can find sex ratios per language. The graph below shows each language with more than 1,000 articles tagged with sex data, sorted by the percentage of Female values.
Wikidata Sex Ratios By Language, Minimum 1000 Items
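For readers who want to try the counting themselves, here is a minimal sketch of the per-language method described above. It is not the script I actually ran: the record layout (`id`, `sex`, `sitelinks`) and the function names are assumptions made up for illustration, and real dump records look different.

```python
from collections import Counter, defaultdict

# Hypothetical, simplified item records (not the real dump format): each has
# a sex value and the list of wikis that have an article about the item.
items = [
    {"id": "Q1", "sex": "male", "sitelinks": ["enwiki", "dewiki"]},
    {"id": "Q2", "sex": "female", "sitelinks": ["enwiki"]},
    {"id": "Q3", "sex": "male", "sitelinks": ["dewiki"]},
    {"id": "Q4", "sex": None, "sitelinks": ["enwiki"]},  # untagged, skipped
]


def sex_counts_by_wiki(items):
    """Count sex values per wiki; an item counts once for every wiki it links to."""
    counts = defaultdict(Counter)
    for item in items:
        if item["sex"] is None:
            continue
        for wiki in item["sitelinks"]:
            counts[wiki][item["sex"]] += 1
    return counts


def female_percentages(counts, minimum=1000):
    """Return (wiki, percent female) pairs for wikis with at least `minimum`
    tagged items, sorted from highest to lowest percentage."""
    rows = []
    for wiki, counter in counts.items():
        total = sum(counter.values())
        if total >= minimum:
            rows.append((wiki, 100.0 * counter["female"] / total))
    return sorted(rows, key=lambda row: row[1], reverse=True)


counts = sex_counts_by_wiki(items)
# The toy data is tiny, so the threshold is lowered from 1,000 to 1.
ranking = female_percentages(counts, minimum=1)
print(ranking)
```

Note that an item with articles in several languages is counted once per language, which is why the per-language totals add up to more than the number of tagged items.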
If you’re not well versed in Wikidata’s use of language codes, you can look them up. And if you’ve never browsed the winning and losing htwiki and tlwiki, the Haitian and Tagalog Wikipedias, then you can peruse the list containing minimum 10,000 Items with Sex Data.
Wikidata Sex Ratios By Language, Minimum 10000 Items
Two notable things arise here. Firstly, Chinese Wikipedia is seemingly the most progressive. Secondly, the Intersex category fails to score a single pixel of recognition. In fact the Wikipedia with the highest ratio of Intersex values, as determined by Wikidata, is Korean Wikipedia, but at just 0.0078%.
Is this data reliable? A lot of it was imported from the German and other major Wikipedias. That can be a problem, because for any given Wikipedia language there exist articles that have no linked equivalents in other languages. There may very well be Wikipedias with more or less skewed sex ratios, but they haven’t migrated their sex data to Wikidata, or they have no equivalent article in a language which has migrated its sex data. Let’s see which languages have the most articles associated with sex data, of those above 1,000.
Total Number of Wikidata Items Tagged with Sex
Unsurprisingly we get a very Western view of the world. But wait, there are other data sources to corroborate against; that was one of the points of VIAFbot importing VIAF IDs into Wikidata. Let’s imagine an enhanced version of Wikidata that uses VIAF sex data in addition to what’s currently tagged, using that VIAF ID bridge. I ran a simulation of such an enhanced version of Wikidata, but before we look at it, let’s understand VIAF’s own biases.
VIAF IDs have gender info derived from National Library files. There’s hope this may give us a different picture because VIAF may be ever so slightly less severe in its skew, although its list of contributors also reveals a Western bias. Of ~24 million VIAF records (not all about people), 1,299,396 have gender “male” and 418,394 have gender “female.” This comes out to 24.35% female among the records marked male or female. (Unfortunately VIAF doesn’t note directly where LCCN has a more nuanced view, but it can be determined by crawling the RDF link to LCCN’s MARC XML, which I explain later.) Now, to compare the Wikidata and VIAF-enhanced-Wikidata sex ratios, we overlay the two graphs. Wherever you see light green, Wikidata’s data alone gave a higher female ratio; where you see red, VIAF-enhanced-Wikidata data gives a higher female ratio.
Comparison of Wikidata Sex Ratios with and Without VIAF by Language, Minimum 1000 Items with sex
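To make the “VIAF-enhanced” idea concrete, here is a rough sketch of how such a simulation and the overlay coloring could work for one language. The rule of keeping Wikidata’s value and only falling back to VIAF when Wikidata has none is an assumption for this example, as are the function names and the toy data.

```python
def enhanced_sex(wikidata_sex, viaf_gender):
    """Assumed rule: keep the Wikidata value, use VIAF only when Wikidata has none."""
    if wikidata_sex is not None:
        return wikidata_sex
    return viaf_gender


def female_share(values):
    """Percentage of 'female' among the non-missing values, or None if none are known."""
    known = [v for v in values if v is not None]
    if not known:
        return None
    return 100.0 * known.count("female") / len(known)


def overlay_color(wikidata_share, enhanced_share):
    """Light green where Wikidata alone is higher, red where the enhanced data is higher."""
    if wikidata_share > enhanced_share:
        return "light green"
    if enhanced_share > wikidata_share:
        return "red"
    return "none"


# Toy (wikidata_sex, viaf_gender) pairs for the articles of one language.
pairs = [
    ("male", "male"),
    ("male", None),
    (None, "female"),  # untagged in Wikidata, filled in from VIAF
    ("female", "female"),
]

wikidata_only = female_share([wd for wd, _ in pairs])
enhanced = female_share([enhanced_sex(wd, viaf) for wd, viaf in pairs])
color = overlay_color(wikidata_only, enhanced)
print(wikidata_only, enhanced, color)
```

In this toy case the one record filled in from VIAF happens to be female, so the enhanced ratio is the higher one and the bar would show red.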
Reassuringly, VIAF and Wikidata only disagreed on 0.0024% of 91,406 matches. There were seven cases where LCCN did have multiple sexes with qualifying dates. Furthermore, there are 52,407 cases where VIAF has sex data but Wikidata does not. This might be a good juncture to import that data, if the Wikidata community wants.
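The kind of tally behind those numbers can be sketched as follows. This is an illustration only: the pair format, the category names, and the sample data are assumptions, and it treats any two differing non-empty values as a disagreement without any of the normalization a real comparison would need.

```python
def audit(pairs):
    """Tally how Wikidata and VIAF relate over (wikidata_sex, viaf_gender) pairs."""
    tally = {"agree": 0, "disagree": 0, "viaf_only": 0, "wikidata_only": 0, "neither": 0}
    for wikidata_sex, viaf_gender in pairs:
        if wikidata_sex is None and viaf_gender is None:
            tally["neither"] += 1
        elif wikidata_sex is None:
            tally["viaf_only"] += 1  # candidates for importing into Wikidata
        elif viaf_gender is None:
            tally["wikidata_only"] += 1
        elif wikidata_sex == viaf_gender:
            tally["agree"] += 1
        else:
            tally["disagree"] += 1
    return tally


def disagreement_percent(tally):
    """Disagreements as a percentage of matches (items where both sources have a value)."""
    matches = tally["agree"] + tally["disagree"]
    if matches == 0:
        return None
    return 100.0 * tally["disagree"] / matches


# Toy data, one pair per Wikidata item that carries a VIAF ID.
pairs = [
    ("male", "male"),
    ("female", "female"),
    ("male", "female"),
    (None, "female"),
    ("male", None),
    (None, None),
]

tally = audit(pairs)
rate = disagreement_percent(tally)
print(tally, rate)
```

The `viaf_only` bucket corresponds to the cases where VIAF has data but Wikidata does not, which are the ones that could be imported.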
There are articles in Wikidata which are not currently tagged with sex information, but whose sex information can be programmatically determined. There is some indication that tagging more articles would tend to produce more even sex ratios in Wikidata. If that were true, it would mean that “male” articles are more likely to be associated with sex data, though we cannot be positive about that claim. Finally recall that Wikidata’s data model for sex could also use some attention, and you the community are the instruments for that.
I wrote some simple scripts to crawl Wikidata and compare it to VIAF and LCCN; they’re on GitHub. I also modified code from the Wikidata community for parsing dumps, which I plan to contribute back.
Did you find anything confusing? Leave your comments below or find me online. On Twitter I’m @notconfusing.