Wikipedia has its own data structure in templates with parameters. If you are not familiar with Wikipedia templates, an example is “infoboxes,” which show up as fixed-format tables in the top right-hand corner of articles. Templates, and the metadata they contain, have been exploited for research in the past, but I’ve wanted to create a toolchain that would connect Wikipedia data and library data. I also wanted to include a few more features than the standard Wikipedia statistics engines offer: (a) working over all pages in a MediaWiki dump to analyze the differences between pages that do and don’t include certain templates, (b) taking into account what I term subparameters of templates, and (c) doing it all in a multithreaded way. Here is an early look at some analysis which may shed light on the notion of systemic biases in Wikipedia.
Birthdates
Of all the biases Wikipedia is accused of, “recentism” has seemed to me one of the more subtle. To investigate, I wanted to compare the shape of the curve of global population to that of birthdates of biography articles on Wikipedia. For data, I looked in templates, specifically English Wikipedia’s {{Persondata}} for the parameter DATE OF BIRTH, and German Wikipedia’s {{Personendaten}} for the parameter GEBURTSDATUM. For the global population comparison I used UN data. In both cases you can see that the Wikipedia curves are below global population until about 1800, and outpace population in growth thereafter. These steeper, more exponential curves corroborate the claim that Wikipedia leans toward covering more recent events more heavily. Curiously, both Wikipedia lines peak at about 1988 and then all but disappear. If you want a biography article on Wikipedia, apparently it helps to be 25 years old.
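For readers who want to try the birthdate extraction themselves, here is a minimal sketch in Python. It is not the toolchain described above. The function names `template_params` and `birth_year` are made up for illustration, the sample wikitext is hand-written, and the parsing assumes a flat template with no nested templates or piped links inside it.

```python
import re

def template_params(wikitext, template_name):
    """Return a dict of parameters for the first occurrence of a template.

    Simplifying assumption: the template body holds no nested {{...}}
    and no piped [[links|like this]], so splitting on "|" is safe.
    """
    pattern = r"\{\{\s*" + re.escape(template_name) + r"\s*\|(.*?)\}\}"
    match = re.search(pattern, wikitext, flags=re.DOTALL | re.IGNORECASE)
    if match is None:
        return {}
    params = {}
    for part in match.group(1).split("|"):
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip()] = value.strip()
    return params

def birth_year(date_text):
    """Pull the first 3- or 4-digit number out of a free-text date.

    Returns None when no such number is present. Dates marked BC are
    not handled here.
    """
    match = re.search(r"\b(\d{3,4})\b", date_text)
    return int(match.group(1)) if match else None

# Hand-written sample in the shape of a {{Persondata}} block.
sample = """{{Persondata
| NAME = Ada Lovelace
| DATE OF BIRTH = 10 December 1815
| DATE OF DEATH = 27 November 1852
}}"""

params = template_params(sample, "Persondata")
year = birth_year(params.get("DATE OF BIRTH", ""))
print(year)
```

Counting the resulting years across every biography page gives the kind of curve compared against population here. Passing "Personendaten" and reading GEBURTSDATUM would be the German equivalent.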

Simple Metrics
This is quite a simple analysis. One of the chief benefits of working with OCLC is that there is a lot of bibliographic data to play with, so let’s marry the two sources: Wikipedia template data and OCLC data. For this section I queried all the Wikipedia pages from December 2012 for all the citation templates, and extracted all the ISBNs and OCLC numbers.
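As a rough illustration of that extraction step, the sketch below pulls `isbn` and `oclc` parameters out of citation templates with regular expressions. This is an assumption about how one might do it, not the actual query used. The helper name `cited_identifiers` is hypothetical, and the pattern only recognizes templates whose names begin with "cite" or "citation" and which contain no nested templates.

```python
import re

# Matches {{cite ...}} and {{citation ...}} templates without nested braces.
CITATION = re.compile(r"\{\{\s*cit(?:e|ation)[^{}]*\}\}", re.IGNORECASE)
# Captures the value of an isbn= or oclc= parameter.
FIELD = re.compile(r"\|\s*(isbn|oclc)\s*=\s*([^|}]+)", re.IGNORECASE)

def cited_identifiers(wikitext):
    """Return (isbns, oclcs) found in citation templates, in page order."""
    isbns, oclcs = [], []
    for cite in CITATION.findall(wikitext):
        for name, value in FIELD.findall(cite):
            # Drop spaces and hyphens so equal identifiers compare equal.
            cleaned = re.sub(r"[\s-]", "", value).upper()
            if not cleaned:
                continue  # parameter present but left empty
            if name.lower() == "isbn":
                isbns.append(cleaned)
            else:
                oclcs.append(cleaned)
    return isbns, oclcs

# Hand-written sample wikitext.
text = (
    "Some claim.<ref>{{cite book | title = Example "
    "| isbn = 978-0-306-40615-7 }}</ref> "
    "Another.<ref>{{Citation | title = Other | oclc = 12345 "
    "| year = 1990 }}</ref>"
)
isbns, oclcs = cited_identifiers(text)
print(isbns, oclcs)
```

A real dump has messier input (nested templates, comments, malformed ISBNs), so a proper wikitext parser would be a safer choice than regular expressions at scale.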
One way to characterize the cited books is audience level, derived from WorldCat holdings data. Audience level is expressed as “a decimal between 0.01 (juvenile books) and 1.00 (scholarly research works).” Taking the simple mean of audience level across all citations gives 0.47 on English Wikipedia. In German it’s 0.44. If we plot the histograms of each, we get roughly normal curves that, if anything, tend to skew left.
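The mean and histogram are easy to reproduce once each citation has an audience level attached. The sketch below assumes the levels are already in a list. The values shown are invented for illustration and are not the WorldCat data, and `audience_histogram` is a hypothetical helper that uses ten bins of width 0.10.

```python
from collections import Counter
from statistics import mean

def audience_histogram(levels):
    """Count levels in ten bins: 0.01-0.10, 0.11-0.20, ..., 0.91-1.00.

    Levels are converted to whole hundredths first so that values on a
    bin edge are not misplaced by floating-point division.
    """
    counts = Counter()
    for level in levels:
        hundredths = round(level * 100)      # 0.47 -> 47
        counts[(hundredths - 1) // 10] += 1  # 47 -> bin 4 (0.41-0.50)
    return [counts[i] for i in range(10)]

# Invented audience levels, one per citation.
levels = [0.12, 0.35, 0.44, 0.47, 0.47, 0.52, 0.58, 0.61, 0.90]

average = mean(levels)
histogram = audience_histogram(levels)
print(round(average, 2), histogram)
```

Plotting `histogram` as bars gives the shape discussed above, and comparing the mean with the median is a quick check on which way a distribution leans.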
Is Wikipedia stuffed with incomprehensibly dense knowledge? Maybe, but its citations aren’t necessarily.
Subject Analysis
Another bias claim lodged against Wikipedia is that content is heavily concentrated on certain subjects. Is the same true for its citations? Every Wikipedia article can have any number of ISBNs or OCLC numbers (see figure below). In FRBR terms, these identifiers relate to manifestations, so using WorldCat they were clustered into works, at the expression level. And every work is about any number of subjects. Here I used the FAST subject headings, which are a faceted version of Library of Congress Subject Headings.

Then I totaled the number of citations on Wikipedia within each subject, creating a list of subjects with their respective citation frequency. Using that list, here is a word-cloud visualization of Wikipedia’s 100 most cited subjects, inferred through the subjects assigned to the works cited.
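The tallying step itself can be sketched in a few lines. The two lookup tables below stand in for the WorldCat clustering (identifier to work) and the FAST headings (work to subjects). Their contents, the identifier format, and the name `subject_frequencies` are all made up for illustration.

```python
from collections import Counter

def subject_frequencies(citations, work_of, subjects_of):
    """Count how many citations fall under each subject heading.

    citations   -- identifiers, one entry per citation (repeats allowed)
    work_of     -- maps an identifier (manifestation) to its work
    subjects_of -- maps a work to its list of subject headings
    """
    counts = Counter()
    for identifier in citations:
        work = work_of.get(identifier)
        if work is None:
            continue  # identifier could not be clustered into a work
        counts.update(subjects_of.get(work, []))
    return counts

# Invented data: two editions of one work, one other work, one unknown.
citations = ["isbn:1", "isbn:2", "oclc:9", "isbn:1", "isbn:404"]
work_of = {"isbn:1": "work-A", "isbn:2": "work-A", "oclc:9": "work-B"}
subjects_of = {
    "work-A": ["Military history", "Politics"],
    "work-B": ["Mycology"],
}

counts = subject_frequencies(citations, work_of, subjects_of)
top_subjects = counts.most_common(100)  # the list a word cloud would use
print(top_subjects)
```

Counting per citation rather than per distinct work means a heavily cited book pulls its subjects up the list, which matches the idea of citation frequency described above.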

There is a large preponderance of subjects that confirm the subcultures Wikipedia is noted for favoring: Politics, Military History, Religion, Math and Physics, Comics and Video Games, and Mycology. At least if they are going to be overrepresented in general, they should be well cited.
Below is the same algorithm applied to a different Wikipedia – can you guess the language? Quite funny to see courts, administrative agencies, and executive departments with such prominence.
That should give just a glimpse of the range of avenues of inquiry that open up when you can deeply search Wikipedia template parameters and connect them with library data. Any special requests for specific queries?
Wikily yours,
Max