Wikipedia Analytics Engine

Wikipedia has its own data-structure in templates with parameters — if you are not familiar with Wikipedia templates, an example is “infoboxes,” which show up as fixed-format tables in the top right-hand corner of articles. Templates, and the metadata they contain, have been exploited for research in the past, but I’ve wanted to create a toolchain that would connect Wikipedia data and library data. I also wanted to be able to include a few more features than the standard Wikipedia statistics engines. For instance (a) working over all pages in a MediaWiki dump to analyze the differences between pages that do and don’t include certain templates (b) take into account what I term subparameters of of templates, and (c) do it all in a multithreaded way. Here is an early look at some analysis which may shed light on the notion of systemic biases in Wikipedia.

Birthdates

Of all the biases Wikipedia is accused “recentism” has seemed to me one of the more subtle. To investigate I wanted to compare the shape of the curve of global population to that of birthdates of biography articles on Wikipedia. For data, I looked in templates, specifically English Wikipedia’s {{Persondata}} for parameter DATE OF BIRTH, and German Wikipedia’s {{Personendaten}} for the parameter GEBURTSDATUM. For the comparison of Global Population I used UN data. In both cases you can see that the Wikipedia curves are below global population until about 1800, and outpace population in growth thereafter. These more exponential curves corroborate Wikipedia leaning covering more recent events more heavily. Curiously both Wikipedia lines peak at about 1988 and then all but disappear. If you want a biography article on Wikipedia apparently it helps to be 25 years old.

Occurences of Birth Dates in English and German Wikipedia Compared to Global Population

Simple Metrics

This is quite a simple analysis. One of the chief benefits of working with OCLC is that there is a lot of bibliographic data to play with, so lets marry the two sources: Wikipedia template data and OCLC data. For this section I queried all the Wikipedia pages from December 2012 for all the citation templates, and extracted all the ISBNs and OCLC numbers.

One way to characterize the cited books is audience level, derived from WorldCat holdings data. Audience level is expressed as a “a decimal between 0.01 (juvenile books) and 1.00 (scholarly research works).” Taking simple mean averages of audience level across all citations gives 0.47 on English Wikipedia. In German it’s 0.44. If we plot the histograms of each, we get moderately normal curves, that actually even tend to skew left.

Audience Level English Audience Level German

Is Wikipedia stuffed with incomprehensibly dense knowledge? Maybe, but it’s citations aren’t necessarily.

Subject Analysis

Another bias claim lodged against Wikipedia is that content is heavily concentrated towards certain subjects. Is the same true for its citations? Every Wikipedia article could have any number of ISBNs or OCLC numbers, (see figure below). In FRBR terms, these identifiers relate to manifestations so using WorldCat they were clustered into works, at the expression level. And every work is about any number of subjects. Here I used the FAST subject headings, which are a faceted version of Library of Congress Subject Headings.

Subject Anaylsis Procedure for Wikipedia
Subject Analysis Procedure for Wikipedia

Then I totaled the number of citations on Wikipedia within each subject, creating a list of subjects with their respective citation frequency. Utilizing that list here is a word-cloud visualization of Wikipedia’s 100 most cited subjects, inferred through the subjects assigned to the works cited.

A world cloud of the FAST Subject Headings of the most cited Books in Wikipedia
A world cloud of the FAST Subject Headings of the most cited books in English Wikipedia

There is a large preponderance of subjects that confirm subcultures that Wikipedia is noted for its bias. Politics, Military History, Religion, Math and Physics,  Comics and Video Games, and Mycology. At least of they are going to be overrepresetented in general, they should be well cited.

Below is the same algorithm applied to a different Wikipedia – can you guess the language?  Quite funny to see courts, administrative agencies, and executive departments with such prominence.

dewiki-fast-word-cloud

That should give just a glimpse as to the range of avenues of inquiries available from being able to deeply search and connect Wikipedia template parameters with library data. Any special requests for specific queries?

Wikily yours,

Max

twitter: notconfusing