Sex Ratios in Wikidata, Wikipedias, and VIAF

May 13th, 2013 by Max

Last week I wrote about the ‘rope bridge’ between Wikidata and VIAF, and the new research it would afford. Today I bring you a sample of that research. I am investigating the sex associated with different Wikipedia Biography Articles for two reasons. Firstly, the Properties “Sex” and “VIAF” are two of the top 10 most used Wikidata Properties, with Sex at 587,312 items tagged, and VIAF with 301,763 (and rising, VIAFbot hasn’t finished scraping all languages yet). VIAF independently records sex per VIAF item, which gives us two comparable datasets. Secondly, after the so-called “Categorygate” piece in the New York Times I dug into Wikidata’s Sex Property and wanted to shed some light on the model currently in use.

Currently the Wikidata Property for Sex states:

Sex for humans, should be one of male, female, intersex, or the special “unknown” value

Finding this to be a rather rigid view of the world, I started discussing it on the Discussion Page as per protocol. It is worth noting, by contrast, that VIAF records “gender,” not “sex.” The current VIAF data model similarly limits values to male, female, or unknown, but a change to a more nuanced model is planned for June. It’s worth remembering that VIAF is populated with data from the many authority files it aggregates. One underlying authority file with a more nuanced approach to this is the Library of Congress Control Number (LCCN). The LCCN can record several “sexes” for a specific person, each with accompanying dates of validity. This at least shows that there are better ways of recording sex – if it’s necessary to record it at all – which prompts me to invite your input on the Wikidata Discussion Page about better ways to record sex. With that said, let’s dig into some graphs. (Click to see larger versions.)

Sex Ratios by Language

To build this visualization, I take all the Wikidata items with Property:Sex and then look at the inter-language link section of each item to see which languages have articles relating to it. Dividing along the lines of language, we can find sex ratios per language. The chart below shows each language with more than 1,000 articles tagged with sex data, sorted by the percentage of Female values.

Wikidata Sex Ratios By Language, Minimum 1000 Items
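
The per-language counting described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used for the charts: the input format (a list of `(sex, sitelinks)` pairs) and the function name `sex_ratios_by_language` are assumptions made for the example.

```python
from collections import Counter, defaultdict


def sex_ratios_by_language(items, min_items=1000):
    """Return (wiki, shares) pairs sorted by ascending female share.

    `items` is an iterable of (sex, sitelinks) pairs, where `sex` is a
    string such as "male" or "female" and `sitelinks` lists the wikis
    (e.g. "enwiki") that have an article for that item. This input
    shape is hypothetical; a real dump needs parsing first.
    """
    counts = defaultdict(Counter)
    for sex, sitelinks in items:
        # An item counts once for every language that has an article on it.
        for wiki in sitelinks:
            counts[wiki][sex] += 1

    ratios = {}
    for wiki, by_sex in counts.items():
        total = sum(by_sex.values())
        # Skip languages with too few tagged items to be meaningful.
        if total >= min_items:
            ratios[wiki] = {sex: n / total for sex, n in by_sex.items()}

    return sorted(ratios.items(), key=lambda kv: kv[1].get("female", 0.0))


# Tiny made-up sample, with the threshold lowered so that it shows up.
sample = [
    ("male", ["enwiki", "dewiki"]),
    ("female", ["enwiki"]),
    ("male", ["dewiki"]),
    ("female", ["zhwiki"]),
]
ranked = sex_ratios_by_language(sample, min_items=1)
for wiki, shares in ranked:
    print(wiki, round(shares.get("female", 0.0), 2))
```

Note that an item with articles in several languages is counted once per language, so the per-language totals add up to more than the number of items.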

If you’re not well versed in Wikidata’s use of language codes, you can look them up. And if you’ve never browsed the winning and losing htwiki and tlwiki, the Haitian and Tagalog Wikipedias, then you can peruse the list restricted to languages with a minimum of 10,000 items with sex data.

Wikidata Sex Ratios By Language, Minimum 10000 Items

Two notable things arise here. Firstly, Chinese Wikipedia is seemingly the most progressive. Secondly, the Intersex category fails to score a single pixel of recognition. In fact the Wikipedia with the highest ratio of Intersex values – as determined by Wikidata – is Korean Wikipedia, but at just 0.0078%.

Data Caveats

Is this data reliable? A lot of it was imported from the German and other major Wikipedias. That can be a problem, because for any given Wikipedia language there exist articles that have no linked equivalents in other languages. There may very well be Wikipedias with more or less skewed sex ratios, but they haven’t migrated their sex data to Wikidata, or they have no equivalent article in a language which has migrated its sex data. Let’s see which languages have the most articles associated with sex data, of those above 1,000.

Total Number of Wikidata Items Tagged with Sex

Unsurprisingly, we get a very Western view of the world. But wait, there are other data sources to corroborate against; that was one of the points of VIAFbot importing VIAF IDs into Wikidata. Let’s imagine an enhanced version of Wikidata that uses VIAF sex data in addition to what’s currently tagged, using that VIAF ID bridge. I ran a simulation of such an enhanced version of Wikidata, but before we look at it, let’s understand VIAF’s own biases.

Introducing VIAF

VIAF IDs have gender info derived from National Library files. There’s hope this may give us a different picture, because VIAF may be ever so slightly less severe in its skew, although looking at its list of contributors also reveals a Western bias. Of ~24 million VIAF records (not all about people), 1,299,396 have gender “male,” and 418,394 have gender “female.” Among records with one of those two values, that comes out to about 24.36% female. (Unfortunately VIAF doesn’t note directly where LCCN has a more nuanced view, but it can be determined by crawling the RDF link to LCCN’s MARC XML, which I explain later.) Now, to compare the Wikidata and VIAF-enhanced-Wikidata sex ratios, we overlay the two graphs. Wherever you see light green, Wikidata’s data alone gave a higher female ratio, and where you see red, VIAF-enhanced-Wikidata data gives a higher female ratio.

Comparison of Wikidata Sex Ratios with and Without VIAF by Language, Minimum 1000 Items with sex
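
For readers who want to check the VIAF female share quoted in this section, it follows directly from the two counts. A quick sketch, assuming the share is taken over only the records marked male or female (the variable names are just for illustration):

```python
# Counts quoted in the text for VIAF records with a recorded gender.
viaf_male = 1_299_396
viaf_female = 418_394

# Share of female records among those marked male or female;
# records with unknown or missing gender are left out (an assumption).
female_share = viaf_female / (viaf_male + viaf_female)
print(f"{female_share:.1%}")
```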

Reassuringly, VIAF and Wikidata only disagreed on 0.0024% of 91,406 matches. There were seven cases where LCCN did have multiple sexes with qualifying dates. Furthermore, there are 52,407 cases where VIAF has sex data but Wikidata does not. This might be a good juncture to import that data, if the Wikidata community wants.
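
The comparison behind these numbers can be sketched along the following lines. This is only an illustration under assumed inputs: two dictionaries keyed by a shared item identifier, with the function name `compare_sex_data` and the sample IDs invented for the example. The real scripts crawl Wikidata and VIAF rather than start from ready-made dictionaries.

```python
def compare_sex_data(wikidata, viaf):
    """Tally agreement between two {item_id: sex or None} mappings.

    Only items for which VIAF has a value are considered, mirroring
    the idea of enhancing Wikidata with VIAF data.
    """
    tally = {"agree": 0, "disagree": 0, "viaf_only": 0}
    for item_id, viaf_sex in viaf.items():
        if viaf_sex is None:
            continue  # VIAF has nothing to contribute for this item
        wikidata_sex = wikidata.get(item_id)
        if wikidata_sex is None:
            tally["viaf_only"] += 1  # candidate for import into Wikidata
        elif wikidata_sex == viaf_sex:
            tally["agree"] += 1
        else:
            tally["disagree"] += 1
    return tally


# Made-up identifiers and values, purely for demonstration.
wikidata_sample = {"Q1": "male", "Q2": "female", "Q3": None, "Q4": "male"}
viaf_sample = {"Q1": "male", "Q2": "male", "Q3": "female", "Q5": None}

result = compare_sex_data(wikidata_sample, viaf_sample)
print(result)
```

The `viaf_only` bucket corresponds to the cases where VIAF has data but Wikidata does not, while `agree` plus `disagree` gives the number of matches on which a disagreement rate is based.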

Conclusions

There are articles in Wikidata which are not currently tagged with sex information, but whose sex information can be programmatically determined. There is some indication that tagging more articles would tend to produce more even sex ratios in Wikidata. If that were true, it would mean that “male” articles are more likely to be associated with sex data, though we cannot be positive about that claim. Finally, recall that Wikidata’s data model for sex could also use some attention, and you, the community, are the instruments for that.

Software used

I wrote some simple scripts to crawl Wikidata and compare it to VIAF and LCCN; they’re on GitHub. I also modified code from the Wikidata community for parsing dumps, which I plan to contribute back.

Did you find anything confusing? Leave your comments below or find me online. On Twitter I’m @notconfusing.

9 Responses to “Sex Ratios in Wikidata, Wikipedias, and VIAF”

  1. Joseph Reagle Says:

    Hey Max, for gender guessing this might be of interest: http://reagle.org/joseph/pelican/technology/guessing-the-gender-of-bibliographic-subjects.html

  2. Max Says:

    Very curious. This would mean that we can run a further simulation to find the ‘true(r)’ sex ratios of Wikidata by algorithmically guessing the sexes of those articles that we know to be about people but that lack Property:Sex. We can guess that they are people if they have Property:Main Type (GND) = Person, or if they have properties of people like Property:Birth Date.

  3. Andy Mabbett Says:

    Great stuff! I stared a discussion about recording the sex or gender of Wikipedia’s biographical subjects in its infoboxes, but it got bogged down in bikeshedding^W discussion of visual presentation. I must re-start it!

  4. Andy Mabbett Says:

    s/stared/started. Sorry.

  5. Andreas Kolbe Says:

    Thanks, Max. Is there a way to do a separate analysis for living people?

  6. Max Says:

    I suppose at least one way to approach that question is to look at Wikidata items that have Property:Date of Birth, but not Property:Date of Death when that Wikidata property arrives (waiting on the Date data-type). Can you think of other ways to determine living/dead people?

  7. Andy Mabbett Says:

    Max: on en.Wikipedia, Category:Living people is used; also {{WikiProject Biography|living=yes}} on the talk pages.

  8. Nora Says:

    I have been looking for the correct place to ask my question and this rather timely article may be the place.
    While lurking about the Technical page of The Village Pump, I happened across a discussion that speaks to gender equality; the possibility that multiple words in a Template may = Female and the seeming fact that a template was changed suppressing the weight box when Gender = any word that could = female. There is a lot that might be said but the bottom line would seem to be very simple – if the quality is not gender specific, then the quality should be available to be filled in for all genders.
    I am new to Wikipedia but experienced in database admin and the questions of equity for all. Please let me know if I should direct my questions elsewhere.

    Thank you for any assistance you may be able to give me.

  9. Max Says:

    @Andy,
    Thanks, I’ll use that trick.
