Archive for the 'Wikipedia' Category

Graph-ted on: Dewey and Wikipedia

Monday, February 24th, 2014 by Max

Introduction

Since entering the library world, I’ve been fascinated by the substream of uber-geeks who are the Dewey Decimal Classification (DDC) nerds. It was the enthusiastic and technical-leaning banter of Editor-in-Chief Michael Panzer that enticed me. When I learned that the DDC is designed to be able to describe all possible concepts, and that it is not a strict hierarchy but really a web of complex relations, I was intrigued by the foresight of the approach. I see the DDC as a best effort made 100 years before anyone would utter “crowdsourcing,” which is how its modern analog, the Wikipedia Category system, works. Using Graph Theory we develop a new way to look at and compute with the DDC, and then connect it to Wikipedia Categories.

(In this post we assume some knowledge of Graph Theory, so if you need a quick primer, it’s at the end: Graph Theory tl;dr.)

Representing the Dewey Decimal Classification in Graph Theory

In order to represent DDC using Graph Theory we have to translate its components into graph theory. That means,

  • Classifications become the “nodes”.
  • Relationships between Classifications become the “edges.”
  • Relation colourings:
    • Notational hierarchy – Red
    • See References – Green
    • Class elsewhere – Blue

We are actually defining a directed graph because the relations we have are not in general reciprocal. That is, if a classification A is below B in the hierarchy, or A has a “see also” reference to B, B does not necessarily have that reference to A. (There are, however, some reciprocal “class elsewhere” relations, for example 391 and 646.3 in the 4-hop graph below.) Note also that colourings, in the mathematical parlance, indicate that relationships are different without assigning the relationship a numerical value.

Visualizations

This has all been quite abstract so far, so let’s look at something concrete. We’ll pick classification 394.1, whose caption is the enjoyable “Eating, drinking; using drugs”. The distance to another node is the number of edges we travel from our source to reach it. We will look at all the connections that are within n hops of 394.1. Below is the evolution of visualizations for distances 2, 3, 4, and 5. By 5 hops you can see that the graph begins to be too complicated to represent visually in a single image.

 

394_1-2hops

394_1-3hops

394_1-4hops

394_1-5hops
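The n-hop neighbourhoods pictured above can be sketched in a few lines. This is only a minimal illustration, not the code behind the images: the edge list is hypothetical (the class numbers appear in this post, but the particular relations and colours are invented), and it assumes we follow edges in their stated direction only.

```python
from collections import deque

# Hypothetical coloured, directed edges: (source, target, colour).
# The relations and colours are invented for illustration.
EDGES = [
    ("394.1", "394", "hierarchy"),
    ("394", "395", "see_reference"),
    ("394.1", "641", "class_elsewhere"),
    ("395", "327.2", "class_elsewhere"),
]


def build_graph(edges):
    """Return an adjacency map: node -> list of (neighbour, colour)."""
    graph = {}
    for source, target, colour in edges:
        graph.setdefault(source, []).append((target, colour))
        graph.setdefault(target, [])
    return graph


def within_hops(graph, start, max_hops):
    """Breadth-first search: {node: hop count} for nodes within max_hops."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbour, _colour in graph[node]:
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    return hops


graph = build_graph(EDGES)
two_hops = within_hops(graph, "394.1", 2)
```

Raising `max_hops` from 2 to 5 on the real graph is what makes the pictures above grow so quickly.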

Weighting the Graph

One of the powerful aspects of rendering the classification system as a graph is that we can determine the shortest distance between two nodes. However, as mentioned, our colourings have no intrinsic value, so in order to find the shortest paths we have to assign weights to colours. The weights are adjustable, but for the rest of the examples I am using the settings: Notational Hierarchy = 0.5, See Reference = 1, Class Elsewhere = 2. That means that our shortest path between 394.1 and 341.33 (“Diplomatic Law”) is 394.1 -> 394 -> 395 -> 327.2 -> 341.33, which has a total value of 5.5.
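To make the weighting concrete, here is a minimal sketch of a weighted shortest-path search (Dijkstra's algorithm) over coloured edges. The weights are the ones given above. The colours attached to each edge are an assumption: they are one assignment that is consistent with the 5.5 total, and the detour through 390, 340 and 341 is entirely made up so that the search has something to reject.

```python
import heapq

WEIGHTS = {"hierarchy": 0.5, "see_reference": 1.0, "class_elsewhere": 2.0}

# Hypothetical coloured edges; the colours are assumed, not taken from the DDC.
EDGES = [
    ("394.1", "394", "hierarchy"),
    ("394", "395", "see_reference"),
    ("395", "327.2", "class_elsewhere"),
    ("327.2", "341.33", "class_elsewhere"),
    # An invented, more expensive detour (total 6.5).
    ("394.1", "390", "hierarchy"),
    ("390", "340", "class_elsewhere"),
    ("340", "341", "class_elsewhere"),
    ("341", "341.33", "class_elsewhere"),
]


def shortest_path(edges, weights, source, target):
    """Dijkstra over coloured edges; returns (path, total weight)."""
    graph = {}
    for u, v, colour in edges:
        graph.setdefault(u, []).append((v, weights[colour]))
        graph.setdefault(v, [])
    best = {source: 0.0}
    previous = {}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbour, weight in graph.get(node, []):
            candidate = dist + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                previous[neighbour] = node
                heapq.heappush(heap, (candidate, neighbour))
    if target not in best:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return path, best[target]


path, cost = shortest_path(EDGES, WEIGHTS, "394.1", "341.33")
```

Changing the values in `WEIGHTS` changes which route wins, which is why the choice of weights matters for everything that follows.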

Transforming Wikipedia Categories

Consider a Wikipedia Category. It can contain many Wikipedia articles, each of which has zero or more citations of books. Some of those book citations include the ISBN, which is precisely what the OCLC Classify API is hungry for. Classify will transform an ISBN or OCLC number into a DDC, based on how it is most popularly cataloged in WorldCat. That means we can arrive at a set of classifications associated with the category. From this set it would be instructive if we could somehow average them into one representative classification. In Graph Theory terms this is known as finding the center, and there are several ways to approach this problem. The method we employ here is betweenness centrality. That is, we calculate the weighted shortest paths between all the classifications associated with the category, and find the nodes that were most touched in all the shortest paths. This is rather abstruse without some examples. We ran the algorithm on some categories, and below are the top 5 most central nodes for each:

  • Category:Cooking techniques:
    1. 641 – Food and Drink
    2. 641.5 – Cooking
    3. 641.3 – Food
    4. 613.2 – Dietetics
    5. 641.815 – Bread and bread-like foods
  • Category:Feminism
    1. 305.42 – Social role and status of Women
    2. 305.43 – Women by occupation
    3. 305.9 – People by occupation and miscellaneous social statuses
    4. 305 – Social Groups
    5. 305.4 – Women
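The “most touched” idea can be sketched as follows. This is a simplified stand-in for betweenness centrality, under stated assumptions: it takes a single shortest path per ordered pair of classifications and counts only the interior nodes of each path. The small weighted graph is hypothetical, with weights already assigned to its edges.

```python
import heapq
from collections import Counter
from itertools import permutations

# Hypothetical weighted, directed graph: node -> {neighbour: weight}.
GRAPH = {
    "641": {"641.5": 0.5, "641.3": 0.5, "613.2": 2.0},
    "641.5": {"641": 0.5},
    "641.3": {"641": 0.5},
    "613.2": {"641.3": 1.0},
}


def shortest_path(graph, source, target):
    """Dijkstra; returns the list of nodes on one shortest path, or None."""
    best = {source: 0.0}
    previous = {}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            break
        if dist > best[node]:
            continue  # stale heap entry
        for neighbour, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                previous[neighbour] = node
                heapq.heappush(heap, (candidate, neighbour))
    if target not in best:
        return None
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1]


def touch_counts(graph, classes):
    """Count how often each node lies strictly inside a shortest path
    between two classifications from the category's set."""
    counts = Counter()
    for source, target in permutations(sorted(classes), 2):
        path = shortest_path(graph, source, target)
        if path:
            counts.update(path[1:-1])
    return counts


counts = touch_counts(GRAPH, {"641.5", "641.3", "613.2"})
top = counts.most_common(1)
```

In this toy graph the parent class 641 sits on most of the paths, which is the same effect that pushes broad classes to the top of the lists above.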

These top-5 lists give some basis for believing that the technique works. The central classifications we find cluster around the same topic that organizes the category. We usually find a strand or run in the system that hovers around the main ideas. One caveat, though, is that for both of these Categories only about 25% of Wikipedia articles actually contain an ISBN citation. Yet articles that do include at least one ISBN citation average 2.7 citations per article. With that in mind we continue by looking at a set of articles that is not organized by topic and is also more ISBN-rich. What better category than Featured Articles?

  • Category:Featured articles
    1. 930 – History of the ancient world c. 499
    2. 940 – History of Europe
    3. 361 – Social problems and welfare in general
    4. 800 – Literature and Rhetoric
    5. 362 – Social welfare problems & services
    6. 580 – Plants (Botany)

Featured articles, of which there were ~4,000, led to 25,000 ISBNs, which related to 17,000 Classifications. There are only approximately 50,000 standard Classifications. This at first would seem impressive coverage, but the distribution of the coverage in the graph was not calculated, so it’s still difficult to say how diverse the featured article set is. Computing on this set was very intensive; in fact just computing the centre of these 17,000 Classifications took 9 hours parallelized over 16 cores. These topics seem to be more abstract and less interrelated. Without being able to infer some sort of connection, we next run the algorithm not on a category, but on the list of articles that your blogger has ever edited (307 articles).

  • Contributions of User:Maximilianklein
    1. 930 – History of the ancient world c. 499
    2. 800 – Literature and Rhetoric
    3. 302.23 -  Media (Means of communication)
    4. 910.45 – Ocean travel and seafaring adventures
    5. 361 – Social problems and welfare in general

Uncovered here is a latent passion for seafaring adventures! This is probably owing to the fact that travel articles may have cited travel books with ISBNs that were disproportionately focused on sea travel. More importantly, what arises here is that several of the central Classifications that describe the edit history are similar to the featured articles results. This would seem to suggest that there are some very central nodes that most connections are channelled through if they lie in disparate regions of the graph. The implication is that the problem of classifying an arbitrary group of objects remains difficult! Of course these results are only for one possible set of values for our weights dictionary; other values could make these top-level channels more “expensive” to hop through. So future research should try to calibrate these values to produce the most varied set of central nodes.

This work is available as an IPython notebook and on GitHub, and comments are welcome.
notconfusing

Graph Theory? tl;dr

We are not talking about the colloquial meaning of “graph”, but rather the mathematical concept.

A Graph is a collection of:

  • “Nodes” or “Vertices” – the objects
    • Here letters A through F
  • “Edges” – relations (directed or undirected)
    • Here the arrows between letters

Not talking about this kind of graph. (Image By Inductiveload [Public domain], via Wikimedia Commons)

An example of a “Graph Theory” Graph (By David W. at de.wikipedia (Original text : David W.) [Public domain], from Wikimedia Commons)

 

Harvesting Book Metadata From Wikipedia to Wikidata

Wednesday, November 27th, 2013 by Max

Infoboxes were for a long time Wikipedia’s way of storing data, and Wikidata is set to replace that technology, with added bonuses like inter-language sharing. To get to that promise, one first step is for Infoboxes to be harvested into Wikidata. I have started by harvesting Infobox Book in the 9 biggest Wikipedia languages that share the template: English, Italian, French, Spanish, Russian, Polish, Portuguese, Swedish, and Japanese.

The point of harvesting Infobox Book specifically is that the Wikidata citation guidelines for books specify that the Library FRBR concept should be used, so I wanted to build out infrastructure to that end. FRBR is about describing bibliographic records at many different levels, and here’s an example of what this kind of citation would look like in Wikidata:

With that in mind let’s have a look at the data. Our entry point is the set of Wikipedia pages that use Infobox Book (“transclusion” in Wikipedia parlance) in the 9 aforementioned languages. This measure is only an approximation and does not completely reflect how many Wikipedia topics are about books in a language, for three reasons. The first is that the conception of what a book is is not strictly enforced on Wikipedia. An article could be about a physical item or an amorphous work idea, and sometimes the inclusion of an Infobox Book template is only a nod to a book, as in the French article on this racing pigeon. The second is that not all articles about a book necessarily contain a transclusion of Infobox Book. And thirdly, some specialised Infobox Books have been developed and are used instead, like Infobox Doctor Who Book.

In this next chart we look at the total Infobox Book transclusions, the total articles of a language, and the ratio between the two. Despite large variation in absolute numbers, the percentage of book articles in a Wikipedia is somewhere between 0.1% and 1% of all articles. Italians affirm themselves as the most bibliophilic. We’ll also see later on how their practice of labelling genre differs from the others.

 

Infobox Book Transclusion Counts By Language

Language   Infobox Book Transclusions   Total Articles (000’s)   Percentage of Total Articles
en         30582                        4432                     0.690
es         3534                         1057                     0.334
sv         3023                         1598                     0.189
pl         2782                         1005                     0.277
pt         1975                         803                      0.246
ru         1865                         1061                     0.176
it         10788                        1082                     0.997
ja         1446                         886                      0.163
fr         7935                         1441                     0.551

 

In each Infobox I crawled for the most used properties across all languages and whose values were either string identifiers or links to other Wikipedia pages. When a value is a link to another Wikipedia page, for instance a link to the page of the author, that is useful because when harvested Wikidata can store the author property as a link to another Wikidata item. This is desirable as in Wikidata we seek to build a Wiki of relations.

Here is a graph of the properties that were found, which of them were added to Wikidata, and which were already in the database.

Properties Harvest

So as you can see there are now over 30,000 relations between books and their authors and illustrators in Wikidata, as well as the original language and genres of the books. In addition knowing which book is which from a disambiguation perspective is made easier by the inclusion of over 50,000 identifiers.

One difficulty that was encountered was that even though ISBNs are recorded in Infobox Book, the type of ISBN (10 or 13) was not distinguished. Wikidata does however distinguish, and so as I was sorting these ISBNs I thought it would be sage to also verify them. OCLC runs an API called xID for this very purpose. While using xID it also struck me that the OCLC control number could be returned for a given ISBN. As Wikidata is rapidly evolving into a hub of identifiers, I included those in pushing to Wikidata. During this harvest I therefore also inserted an additional 10,117 OCNs (not pictured above).
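Sorting ISBNs into the 10 and 13 flavours and verifying them can also be done locally with the standard check-digit rules. The sketch below is not the xID API and returns no OCLC numbers. It only illustrates the arithmetic, and the function names are my own for this example.

```python
def _clean(isbn):
    """Strip hyphens and spaces; upper-case a trailing 'x' check digit."""
    return isbn.replace("-", "").replace(" ", "").upper()


def is_valid_isbn10(isbn):
    """ISBN-10: weighted sum (weights 10 down to 1) must be divisible by 11."""
    digits = _clean(isbn)
    if len(digits) != 10 or not digits[:9].isdigit():
        return False
    if not (digits[9].isdigit() or digits[9] == "X"):
        return False
    values = [int(c) for c in digits[:9]]
    values.append(10 if digits[9] == "X" else int(digits[9]))
    return sum((10 - i) * v for i, v in enumerate(values)) % 11 == 0


def isbn13_check_digit(first12):
    """ISBN-13 check digit: alternating weights 1 and 3, modulo 10."""
    total = sum(int(c) * (1 if i % 2 == 0 else 3)
                for i, c in enumerate(first12))
    return str((10 - total % 10) % 10)


def is_valid_isbn13(isbn):
    digits = _clean(isbn)
    return (len(digits) == 13 and digits.isdigit()
            and isbn13_check_digit(digits[:12]) == digits[12])


def isbn10_to_isbn13(isbn10):
    """Prefix 978, drop the old check digit, compute the new one."""
    first12 = "978" + _clean(isbn10)[:9]
    return first12 + isbn13_check_digit(first12)


def isbn_kind(raw):
    """Return 'ISBN-10', 'ISBN-13', or None if the value fails both checks."""
    if is_valid_isbn10(raw):
        return "ISBN-10"
    if is_valid_isbn13(raw):
        return "ISBN-13"
    return None


converted = isbn10_to_isbn13("0-306-40615-2")
```

A harvester could use `isbn_kind` to pick the matching Wikidata property and skip anything that fails both checks.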

As I mentioned, it’s not just boring, nameless identifiers that we want to eventually integrate into all the Wikipedia pages by way of Wikidata. I inspected genre data as well to see how much cross-cultural benefit we’d receive by doing these sorts of harvests. Below are the Top 10 genres found in Infobox Book for each language. The text shown is the English label of the Wikidata item for each link found in the local Infobox. I’ve also outlined those genres which are unique. So you can see that Swedes care a bit more about choir books and the Japanese have a bent towards police drama.

 

Infobox Book Top 10s

What first jumped out at me is how inconsistently the idea of genre is used. In some ways it’s used to describe the content’s emotion and focus, like “science fiction” or “horror”. Other times it’s used to describe form, like “novel”. In fact only the Italians are really consistent: their top ten sticks to form, with “novel”, “essay”, “short story”, “poetry”, “anthology”, “autobiography”, “novella”, “dialogue”, and “poem”.

Another problem between languages is that the genres mismatch often because they are pointing to only slightly different articles. That is we see appearances from the Wikidata items for “fantasy”, “fantasy literature”, “Fantastique”, and “high fantasy”. (By the way you can draw your own conclusions about the demographics of Wikipedia editors when this much fantasy lit pervades the results.)

A conclusion that can be drawn from all this is that there is still some work to be done on negotiating cultural differences on Wikidata. Wikidata has made a lot of connections between Wikipedia articles in different languages, but not all of those merges are clean. The French conflate a pigeon and a book about a pigeon, and the item is linked to languages that discuss only the pigeon. Meanwhile, how the Italians interpret “genre” is a different, though not necessarily incompatible, notion from that of the others. There are probably some discussions still to be had before Infoboxes completely switch over to using Wikidata data, but we are at least one step closer to that goal.

OCLC Control Numbers in the wild

Friday, October 11th, 2013 by Merrilee

A few weeks ago, Jim posted about OCLC Control Numbers and their public domain status. In that posting, Jim wrote, of the OCN, “It’s an important element in linked library data that helps in the creation and maintenance of work sets and provides a mechanism to disambiguate authors and titles.” He went on to detail the numerous ways that the OCN has been “widely used within the broad system of information that flows among libraries, national information agencies, commercial information providers and organizations that supply consumers with book and journal-oriented services.”

While that’s all well and good (and true) I wanted to provide some specific details on how the OCN is being used outside of what most of us consider normal channels of book information flows, and that is how the OCN is being used in (English Language) Wikipedia and in the ambitious Wikidata project. These are based on some counts that Max and I did a while ago, but I think they are current enough to make my point, which is that the OCN is recognized as having value outside of the library and publishing domain.

Wikipedia relies pretty heavily on a number of templates — one such template is the Authority Control Template, which Max has written about before. Another template, which I know you’ve all seen before, is the Infobox Book template.

Infobox Book For Alice Munro’s Too Much Happiness

This template, like most Wikipedia templates, contains what we would immediately recognize as metadata. This example, of Alice Munro’s Too Much Happiness, includes the author, date and country of publication, ISBN, and our friend the OCN. The OCN has been used in this template for a long time, and helps, as does the ISBN, to disambiguate works from one another. In Wikipedia, as in library catalogs, disambiguation is important, which is why the Wiki community values trusted identifiers like the OCN. And the OCN can really come in handy when there is no ISBN, as is the case with any book published before 1970.

Not every Wikipedia article that is about a book has this template, but many do, so it can be a good way to see how many books have Wikipedia articles about them. A few months ago, I did a count on a dump of Wikipedia (using some jazzy scripts that Max wrote) and found that there were 29,673 instances of Infobox Book. In those templates, there were 23,304 ISBNs and 15,226 OCNs. Let’s hear it for identifiers!

Max did a count of identifiers in the newer Wikidata, and found that of around 14 million Wikidata items, 28,741 were books. 5403 Wikidata items have an ISBN-13 associated with them, and 12,262 have OCNs. Why is the number of ISBNs so low? Because Wikidata has a slot for ISBN-13 only; they are assuming that contributors will pad any ISBN-10s, but the numbers speak for themselves. Identifiers are of even greater importance in Wikidata than in Wikipedia, since Wikidata is all about metadata.

So there’s a look at how the humble OCN is being used, even outside the library and publishing context.

As a sidenote, there are several different flavors of Infobox Book, and one of them, I recently learned, is Infobox Dr Who Book. Go figure.

“Allow me to reintroduce myself,” say those in Wikidata and VIAF

Wednesday, September 25th, 2013 by Max

There is a main distinction in links to different online databases – that of opacity. In the case that the URI gives you an indication as to its contents, that URI can be called transparent (e.g. https://en.wikipedia.org/wiki/Lewis_Carroll); but in the case where the URI consists of a long string of numbers it is considered opaque (e.g. https://www.wikidata.org/wiki/Q38082). There are advantages and disadvantages to both schemes.

Wikidata consists entirely of opaque identifiers, so to help us the items have a label, and possibly many aliases. For Item Q38082, the English label is  “Lewis Carroll,” and is also known as “Charles Lutwidge Dodgson,” and “Charles Dodgson.” These are a set of identifiers that can be made and displayed in any of the 280 languages that Wikidata supports. Try setting your language to German and notice the differences.

LanguageSelect

Labelling and aliasing are important for understanding what we don’t understand. Even though there are 4 million articles in English Wikipedia, there are 10 million more Wikidata items. There can, and do, exist Wikidata items that have English labels without having English Wikipedia pages, for instance that of Wolf Vilenski. Label completeness is the idea that every Wikidata item will be labelled in every language, regardless of which languages have linked Wikipedia articles. By moving towards label completeness we can understand which articles exist in other languages but not in our own. Knowing what you don’t know is the first hint to really expanding your knowledge into non-obvious areas, and fighting your own systemic bias.

I automated solving part of this problem. Because I had connected over 400,000 Wikidata items with their VIAF IDs, and because VIAF has alternate names in different languages, new information was potentially available. The procedure went as follows. For each item with an associated VIAF ID I found the potential alternative names in the VIAF record. These are the MARC 4XX fields, for those who care to know. Then I used a Python library called Guess Language to determine the probable language of each string of text. If the alternate name was in a language that didn’t already have a Wikidata label, then the VIAF alternate name could immediately be added as a label.

If however there was a preexisting label in the language of the VIAF alternate name, I checked the Levenshtein distance of the label and aliases versus the VIAF alternate name. If it was less than 65% similar (a threshold I determined by training a Bayesian filter), then I would write the VIAF alternate name as an alias. Here are the results.

Results of populating Aliases in Wikidata using VIAF data.

  • “New AKA” means that an alias was added with VIAF data.
  • “New Label” means that a label was added with VIAF data.
  • “Had AKA” means that the VIAF alternate name was already in Wikidata as an alias.
  • “Had Label” means that the VIAF alternate name was already in Wikidata as a label.

That is, 5,904 new aliases and 8,197 new labels were added to Wikidata using the VIAF links that were imported. This goes to show some of the value of having such an interconnected world.
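The label-or-alias decision described above can be sketched like this. Two things here are assumptions of mine rather than a record of the actual bot: similarity is taken to be 1 minus the Levenshtein distance divided by the longer string’s length, and the outcome names mirror the four chart categories. The language-guessing step is left out.

```python
THRESHOLD = 0.65  # the 65% similarity cut-off mentioned above


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalisation: 1.0 means identical, 0.0 means nothing shared."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def decide(viaf_name, label, aliases):
    """Classify a VIAF alternate name against one language's label/aliases."""
    if label is None:
        return "new_label"   # no label yet in this language
    if similarity(viaf_name, label) >= THRESHOLD:
        return "had_label"   # close enough to the existing label
    if any(similarity(viaf_name, alias) >= THRESHOLD for alias in aliases):
        return "had_alias"   # close enough to an existing alias
    return "new_alias"       # sufficiently different: add as alias


outcome = decide("Charles Lutwidge Dodgson", "Lewis Carroll",
                 ["Charles Dodgson"])
```

With this normalisation “Charles Lutwidge Dodgson” is 62.5% similar to “Charles Dodgson”, just under the threshold, so it would be written as a new alias.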

One of the problems with this method is that it could be much stronger. Say, for instance, that a VIAF alternate name exists and could be added to both the English and German aliases, because the name is spelled the same way in both English and German. It would not be added to both. It would be added only to the language which the Python library Guess Language determines for the name. Probably some business logic could be added to say that English and German could share aliases, although I erred on the side of being conservative, given the possibly unforeseen consequences of those kinds of assumptions. I’d be glad to hear of any suggestions to solving this particular problem, if anyone can make it less confusing, or twitter.com/notconfusing in general.

Yours,
Max

Sex Ratios in Wikidata, Wikipedias, and VIAF part 2

Wednesday, June 19th, 2013 by Max

Now that VIAFbot has finished importing VIAF IDs into Wikidata, I wanted to demonstrate what kind of work could be done with those connections.  In March, using Wikidata, I investigated the sex ratios of different Wikipedias. I compared the Wikidata Items that used the semantic property ‘sex’ to which Wikipedia language versions contained those items. (If you’re getting confused by the jargon, I give explanations in this YouTube tutorial.) It turned out to be a flattering affair for the Tagalog and Chinese Wikipedias whose sex ratios were the most even, albeit at 29.4% and 20.6% female respectively. Another interesting finding was that some of the items that had the ‘sex’ property, also had the ‘VIAF’ property – a link into the VIAF database. The national libraries contributing to VIAF also record the sex of certain entities, which means that new comparisons are possible.

SexStatsFlow

The way this bot worked was to first query for all the Wikidata items that had the VIAF property. There are 388,829 such Wikidata items as of 18 June 2013. Each of them contains a VIAF ID, which can then be made into the URI of the record at VIAF.org. At the VIAF.org record we can get VIAF’s opinion on the sex. That opinion is a result of a behind-the-scenes merge of all the national library files. Unfortunately with such a merge not all the data and its provenance are preserved. But because we live in a linked data era, it’s possible to follow links out of VIAF into the online databases of the contributing libraries. Here I used the Library of Congress because I like their data model for dealing with complex sex cases. Library of Congress will record multiple sexes with applicable dates if they exist, which is a step in the right direction compared to the problematic binary (well, trinary) Wikidata model. That gave me three data sources to compare. The rule I used was that specific information from Library of Congress trumped the merged information on VIAF. Then I took that single ruling library opinion and held it against Wikidata. If the opinions matched I added a source to the Wikidata claim; if no Wikidata opinion existed I added a sourced claim. In the cases where Wikidata and VIAF disagreed with one another I made a list. Let’s take a look at how often each of those cases occurred.

 

SexStats

As I mentioned our entire eligible data set consists of 388,829 Wikidata items that have the ‘VIAF’ property. Of those there is a subset which have Wikidata sex data, and VIAF sex data. That subset totals 131,650. Reassuringly only .2% of this subset disagree. For each of those 311 items in that .2% a bit of hunting should be able to right it. For instance the Deutsche National Bibliothek thinks Nadine Warmuth is male, but Wikidata thinks female. On the other hand, Wikidata thinks that Nguyen Thi Binh is female, but VIAF suggests otherwise.

There are also instances where we don’t have two sources to conflict with each other. There were 125,781 times when Wikidata had sex data that VIAF did not. Maybe this is a case where libraries could glean a datum or two. Certainly Wikidata was pleased to be informed in the 44,526 scenarios where VIAF or LoC had sex information but Wikidata did not.
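The comparison rule (Library of Congress trumps merged VIAF, then the result is held against Wikidata) can be written down as a small decision function. This is a sketch of the logic only. The outcome names and the sample records are hypothetical, and no network calls or Wikidata edits are shown.

```python
from collections import Counter


def library_opinion(viaf_sex, loc_sex):
    """A specific Library of Congress value overrides the merged VIAF value."""
    return loc_sex if loc_sex is not None else viaf_sex


def reconcile(wikidata_sex, viaf_sex, loc_sex=None):
    """Return which action applies to one item."""
    library = library_opinion(viaf_sex, loc_sex)
    if library is None and wikidata_sex is None:
        return "no_data"
    if library is None:
        return "wikidata_only"      # libraries could glean a datum here
    if wikidata_sex is None:
        return "add_sourced_claim"  # Wikidata learns something new
    if wikidata_sex == library:
        return "add_source"         # agreement: cite the library
    return "disagreement"           # goes on the list for hunting


# Hypothetical records: (Wikidata value, VIAF value, LoC value).
RECORDS = [
    ("female", "female", None),
    (None, "male", None),
    ("female", "male", None),
    ("male", None, None),
    ("male", "female", "male"),  # LoC overrides VIAF and matches Wikidata
]

tally = Counter(reconcile(*record) for record in RECORDS)
```

Tallying the outcomes over all 388,829 items gives the kind of breakdown shown in the chart above.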

Lastly, and just for perspective, each time I handled any sex information I kept track of the content as well. 257,431 of the Wikidata items had sex data and split to 14.7% female, 85.3% male, and 0.002% “Intersex”. (The strict classification system that has this poorly named “other” category is a problem I’ve talked about before.) The sex data that came from VIAF showed a very similar story at 14.6% / 85.4% / .006% female / male / “nuanced”, even at a lower sample size of 176,187. It goes to show that sex data as a whole in both Wikidata and VIAF is skewed. At least now there is more of it, and it has a citation.

@notconfusing

Sex Ratios in Wikidata, Wikipedias, and VIAF

Monday, May 13th, 2013 by Max

Last week I wrote about the ‘rope bridge’ between Wikidata and VIAF, and the new research it would afford. Today I bring you a sample of that research. I am investigating the sex associated with different Wikipedia Biography Articles for two reasons. Firstly, the Properties “Sex” and “VIAF” are two of the top 10 most used Wikidata Properties, with Sex at 587,312 items tagged, and VIAF with 301,763 (and rising, VIAFbot hasn’t finished scraping all languages yet). VIAF independently records sex per VIAF item, which gives us two comparable datasets. Secondly, after the so-called “Categorygate” piece in the New York Times I dug into Wikidata’s Sex Property and wanted to shed some light on the model currently in use.

Currently the Wikidata Property for Sex states:

Sex for humans, should be one of male, female , intersex, or the special “unknown” value

Finding this to be a rather rigid view of the world, I started discussing it on the Discussion Page as per protocol. Of note, on the other hand, is how VIAF records “gender” not “sex.” The current VIAF data model similarly limits values to male, female, or unknown, but a change to a more nuanced model is planned for June. It’s worth remembering that VIAF is populated with data from the many authority files it aggregates. One underlying authority file, which has a more nuanced view on this recording, is the Library of Congress Control Number (LCCN). The LCCN will record many “sexes” for a specific person with accompanying dates of validity. This at least shows that there are better ways of recording sex (if it’s necessary to record it at all), which prompts me to invite your input on the Wikidata Discussion Page about better ways to record sex. With that said, let’s dig into some graphs. (Click to see larger versions.)

Sex Ratios by Language

The method used to perform this visualization is to view all the Wikidata items with Property:Sex and then look at the inter-language link section of the item to see which languages have articles relating to this item. Dividing along the lines of language, we can find sex ratios per language. Below shows each language with more than 1,000 articles tagged with sex data, sorted by the percentage of Female values.

Wikidata Sex Ratios By Language

Wikidata Sex Ratios By Language, Minimum 1000 Items
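The per-language tally behind this chart can be sketched as follows. The item records are hypothetical stand-ins for what would be parsed out of a Wikidata dump, and the field names `sex` and `sitelinks` are my own labels for this example.

```python
from collections import Counter, defaultdict

# Hypothetical items: a sex value plus the wikis that have an article.
ITEMS = [
    {"sex": "female", "sitelinks": ["enwiki", "dewiki"]},
    {"sex": "male", "sitelinks": ["enwiki"]},
    {"sex": "male", "sitelinks": ["enwiki", "dewiki", "tlwiki"]},
    {"sex": "female", "sitelinks": ["tlwiki"]},
    {"sex": None, "sitelinks": ["enwiki"]},  # no sex data: ignored
]


def female_share_by_language(items, minimum=1):
    """Return [(wiki, female share)] sorted from highest to lowest share,
    keeping only wikis with at least `minimum` items that have sex data."""
    tallies = defaultdict(Counter)
    for item in items:
        if item["sex"] is None:
            continue
        for wiki in item["sitelinks"]:
            tallies[wiki][item["sex"]] += 1
    shares = {}
    for wiki, counts in tallies.items():
        total = sum(counts.values())
        if total >= minimum:
            shares[wiki] = counts["female"] / total
    return sorted(shares.items(), key=lambda pair: pair[1], reverse=True)


ranking = female_share_by_language(ITEMS)
large_only = female_share_by_language(ITEMS, minimum=3)
```

The `minimum` argument plays the role of the 1,000 and 10,000 item cut-offs used for the charts in this post.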

If you’re not well versed in Wikidata’s use of language codes, you can look them up. And if you’ve never browsed the winning and losing htwiki and tlwiki, the Haitian and Tagalog Wikipedias, then you can peruse the list containing minimum 10,000 Items with Sex Data.

WikidataSexRatiosByLangAlone_Min10000

Wikidata Sex Ratios By Language, Minimum 10000 Items

Two notable things arise here. Firstly, Chinese Wikipedia is seemingly the most progressive. Secondly, the Intersex category fails to score a single pixel of recognition. In fact the Wikipedia with the highest ratio of Intersex values, as determined by Wikidata, is Korean Wikipedia, but at just 0.0078%.

Data Caveats

Is this data reliable? A lot of it was imported from the German and other major Wikipedias. That can be a problem, because for any given Wikipedia language there exist articles that have no linked equivalents in other languages. There may very well be Wikipedias with more or less skewed sex ratios, but they haven’t migrated their sex data to Wikidata, or they have no equivalent article in a language which has migrated its sex data. Let’s see which languages have the most articles associated with sex data, of those above 1,000.

WikidataSexTotalByLangAlone_Min1000

Total Number of Wikidata Items Tagged with Sex

Unsurprisingly we get a very Western view of the world. But wait, there are other data sources to corroborate against; that was one of the points of VIAFbot importing VIAF IDs into Wikidata. Let’s imagine an enhanced version of Wikidata that uses VIAF sex data in addition to what’s currently tagged, using that VIAF ID bridge. I ran a simulation of such an enhanced version of Wikidata, but before we look at it, let’s understand VIAF’s own biases.

Introducing VIAF

VIAF IDs have gender info derived from National Library files. There’s hope this may give us a different picture because VIAF may be ever so slightly less severe in its skew, although looking at its list of contributors also reveals a Western bias. Of ~24 million VIAF records (not all about people), 1,299,396 have gender “male” and 418,394 have gender “female.” That is 418,394 out of 1,717,790 gendered records, or 24.36% female. (Unfortunately VIAF doesn’t note directly where LCCN has a more nuanced view, but it can be determined by crawling the RDF link to LCCN’s MARC XML, which I explain later.) Now to compare the Wikidata and VIAF-enhanced-Wikidata sex ratios we overlay the two graphs. Wherever you see light green, Wikidata’s data alone gave a higher female ratio, and where you see red, VIAF-enhanced-Wikidata data gives a higher female ratio.

Comparison of Wikidata Sex Ratios with and Without VIAF by Language, Minimum 1000 Items with sex

Reassuringly VIAF and Wikidata only disagreed on 0.0024% of 91,406 matches. There were seven cases where LCCN did have multiple sexes and qualifying dates. Furthermore there are 52,407 cases where VIAF has sex data but Wikidata does not. This might be a good juncture to import that data, if the Wikidata community wants.

Conclusions

There are articles in Wikidata which are not currently tagged with sex information, but whose sex information can be programmatically determined. There is some indication that tagging more articles would tend to produce more even sex ratios in Wikidata. If that were true, it would mean that “male” articles are more likely to be associated with sex data, though we cannot be positive about that claim. Finally recall that Wikidata’s data model for sex could also use some attention, and you the community are the instruments for that.

Software used

I wrote some simple scripts to crawl Wikidata and compare it to VIAF and LCCN; they’re on GitHub. I also modified code from the Wikidata community for parsing dumps, which I plan to contribute back.

Did you find anything confusing? Leave your comments below or find me online. On twitter I’m @notconfusing.

The Ropebridges: Authority Control in Wikidata

Thursday, May 9th, 2013 by Max

You may recall that our Wikipedia reciprocal linking robot “VIAFbot” finished adding Authority Control to more than a quarter of a million (English language) Wikipedia articles, but what was the utility? Five months on, that question has been answered. Luckily, and unsurprisingly, other netizens proved additional Wikipedia -> VIAF linking utility. Unanticipated reuse is the magic of collaborative and open datasets, and four such examples highlight the benefits of Library data in Wikipedia.

First was John Mark Ockerbloom’s Forward To Libraries which proposes “find in a Library” boxes in Wikipedia pages. The idea is compelling: facilitate automatic searches in your preferred library site on the topics of Wikipedia articles — one option utilizes VIAF IDs.

Similar look-up facilities were created by Owen Stephens and Thomas Meehan conducting pointed inquiry at the British Library site and other UK Academic resources. Stephens’ contemporaneous finds authors sharing their birth year with the Wikipedia page in question. Meanwhile Meehan’s bookmarklet will funnel you into relevant pages linked by VIAF at UCL’s Explore, and COPAC.

VIAF connections can also pave the way for new scholarly research. A team from Vienna University of Technology released a paper that visualized Art History networks of Wikipedia, through VIAF IDs, and then ULAN. Here you can see the proportion of Art History subjects in Wikipedia, displayed on two dimensions derived from the ULAN connection: time and nationality.

All of this is to say that VIAF data in English Wikipedia can act as a very good ropebridge that allows for reuse, or recombination. The idea of a ropebridge is apt because the connection is somewhat shaky: at the moment it’s free-text, semi-structured data that can be changed by anybody. But that doesn’t mean that the chasm isn’t being crossed.

Can you spot the weakness in all this collaboration though? We focused our first effort on English Language Wikipedia. The Germans, to their credit, have just as many VIAF IDs in their Wikipedia. The Italians copied the English Language data. However, these separate efforts are not scalable to all 285 Wikipedias, nor do they allow all 285 Wikipedias to collaborate on the language-neutral VIAF unique identifiers.

Fortunately there is a solution, and that solution is Wikidata. Wikidata is the first new Wikimedia Project since 2006, and will do three things. It will organize inter-language links into a central database (inter-language linking before was arduous and asymmetric). It will provide a central store of Semantic Data from the Wikipedia articles. And in the future it will be able to query that semantic data. Want to know more about Wikidata? Then look up Wikidata on Wikidata (obviously?!).

    Now for a surprise: I’ve just finished migrating English Wikipedia’s VIAF data to Wikidata, and German, French, Italian, and Japanese datasets are in progress. (Code on GitHub). It takes about two weeks to inspect, clean, and copy the data over from each Wikipedia. I’ll post a full statistical breakdown once all the languages have finished. For now I’ll just say that the Wikidata VIAFbot is also migrating LCCN, GND, BNF, and SUDOC identifiers, as well as integrating ISNI IDs for the first time. At the time of this writing it records 750,000 edits and counting.

    What does VIAF in Wikidata look like, you ask? All pages about encyclopedic concepts are known as “Items” in Wikidata parlance, so let’s inspect the item for Germaine Greer.

    wikidata_claims

    We first see all the Semantic Data Wikidata has about this topic. Each modicum of data is known as a “Claim” in Wikidata; it is a triple, structured as [this page] [property] [value]. You can see that [Germaine Greer] [GND (read: "is a " according to the German National Library)] [Person], and that [Germaine Greer] [is of sex] [female]. You can also see here that she’s got a lot of identifiers associated with her thanks to VIAFbot, which has sourced where it found the original VIAF ID. Now let’s draw our attention to the bottom of the page to understand the impact.

    wikidata_iwlinks

    This Wikidata page is associated with articles in 48 other languages. Each of those articles can capitalize on the semantic data stored above. That’s the beauty of Wikidata. It means that all of the data reuse cases that previously only worked for the English language Wikipedia will now work for all of them. Austrian researchers can inspect Art History biases of not just English Wikipedia, but of dansk, Ελληνικά, हिन्दी, interlingua, Runa Simi, 中文, etc. etc. That’s one of the starting reasons why it’s important to have Authority Control in Wikidata. There are of course more directions than one to travel across a ropebridge. Leading data-mules of bibliographic information across from VIAF into Wikidata is next.
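
To make the claim structure from this post concrete, here is a toy model of an item whose claims are stored once and shared by every linked language edition. It is a sketch only: the `Item` class and its methods are invented for illustration and are not Wikidata's real data model or API, and the property names are plain labels rather than real property IDs.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """A toy stand-in for a Wikidata item (not the real data model)."""
    label: str
    claims: list = field(default_factory=list)      # (property, value) pairs
    sitelinks: dict = field(default_factory=dict)   # language code -> article title

    def add_claim(self, prop, value):
        """Store one [this page] [property] [value] triple."""
        self.claims.append((prop, value))

    def values(self, prop):
        """Return every value recorded for a property."""
        return [v for p, v in self.claims if p == prop]

    def claims_for(self, lang):
        """Claims visible from the article in `lang`, or [] if there is none."""
        return list(self.claims) if lang in self.sitelinks else []


greer = Item("Germaine Greer")
greer.add_claim("GND type", "person")
greer.add_claim("sex", "female")
greer.sitelinks.update({"en": "Germaine Greer", "de": "Germaine Greer"})

print(greer.values("sex"))
print(greer.claims_for("de") == greer.claims_for("en"))
```

Because the claims live on the item rather than on any one article, adding an identifier once makes it available to every language listed in the sitelinks.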

    Wikipedia Analytics Engine

    Monday, January 14th, 2013 by Max

    Wikipedia has its own data structure in templates with parameters. If you are not familiar with Wikipedia templates, an example is “infoboxes,” which show up as fixed-format tables in the top right-hand corner of articles. Templates, and the metadata they contain, have been exploited for research in the past, but I’ve wanted to create a toolchain that would connect Wikipedia data and library data. I also wanted to include a few more features than the standard Wikipedia statistics engines offer: for instance, (a) working over all pages in a MediaWiki dump to analyze the differences between pages that do and don’t include certain templates, (b) taking into account what I term subparameters of templates, and (c) doing it all in a multithreaded way. Here is an early look at some analysis which may shed light on the notion of systemic biases in Wikipedia.
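
For readers who have not seen template data up close, here is a minimal sketch of pulling parameters out of page text and checking whether a page uses a given template. It is not the engine described in this post: it uses a plain regular expression that only handles flat templates (no nested templates, no piped links inside values), and the function names and sample page are invented for illustration.

```python
import re

# Matches only flat templates: no nested {{...}} inside the body.
TEMPLATE_RE = re.compile(r"\{\{\s*([^{}|]+?)\s*(?:\|([^{}]*))?\}\}")


def parse_templates(wikitext):
    """Return [(template_name, {param: value})] for each flat template."""
    results = []
    for match in TEMPLATE_RE.finditer(wikitext):
        name, body = match.group(1), match.group(2)
        params, position = {}, 0
        for part in (body.split("|") if body is not None else []):
            if "=" in part:
                key, value = part.split("=", 1)
                params[key.strip()] = value.strip()
            else:
                position += 1  # unnamed parameters are numbered from 1
                params[str(position)] = part.strip()
        results.append((name.strip(), params))
    return results


def has_template(wikitext, template_name):
    """True if the page text uses the given template (case-insensitive)."""
    wanted = template_name.lower()
    return any(name.lower() == wanted for name, _ in parse_templates(wikitext))


# Hypothetical page text with made-up values.
page = """{{Persondata
| NAME = Example, Person
| DATE OF BIRTH = 1 January 1900
}}
Some article text. {{Authority control|VIAF=12345}}"""

for name, params in parse_templates(page):
    print(name, params)
```

Splitting a dump into pages where `has_template` is true and pages where it is false gives the two groups needed for the kind of comparison in (a). A real run over a full dump would want a proper wikitext parser instead of a regular expression.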

    Birthdates

    Of all the biases Wikipedia is accused of, “recentism” has seemed to me one of the more subtle. To investigate, I wanted to compare the shape of the curve of global population to that of birthdates of biography articles on Wikipedia. For data, I looked in templates, specifically English Wikipedia’s {{Persondata}} for the parameter DATE OF BIRTH, and German Wikipedia’s {{Personendaten}} for the parameter GEBURTSDATUM. For the comparison with global population I used UN data. In both cases you can see that the Wikipedia curves are below global population until about 1800, and outpace population in growth thereafter. These more exponential curves corroborate that Wikipedia leans toward covering more recent events more heavily. Curiously, both Wikipedia lines peak at about 1988 and then all but disappear. If you want a biography article on Wikipedia, apparently it helps to be 25 years old.

    Occurrences of Birth Dates in English and German Wikipedia Compared to Global Population
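
The tally behind a chart like this one can be sketched as follows, assuming the DATE OF BIRTH or GEBURTSDATUM values have already been extracted as free-text strings. The helper names and the sample strings are hypothetical, and the year detection is deliberately crude: it takes the first three- or four-digit number and ignores BC dates.

```python
import re
from collections import Counter

# First standalone 3- or 4-digit number is treated as the year.
YEAR_RE = re.compile(r"\b(\d{3,4})\b")


def birth_year(date_text):
    """Return the year found in a free-text date, or None if there is none."""
    match = YEAR_RE.search(date_text)
    return int(match.group(1)) if match else None


def births_per_decade(date_texts):
    """Count birth dates per decade, skipping values with no usable year."""
    years = (birth_year(text) for text in date_texts)
    return Counter(year // 10 * 10 for year in years if year is not None)


# Hypothetical parameter values in a few of the formats seen in templates.
sample = ["29 January 1939", "1988-03-14", "March 1985", "unknown"]
print(births_per_decade(sample))
```

Bucketing by decade smooths out the noise of individual years; plotting the resulting counts against a population series gives the kind of comparison shown in the figure.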

    Simple Metrics

    This is quite a simple analysis. One of the chief benefits of working with OCLC is that there is a lot of bibliographic data to play with, so let’s marry the two sources: Wikipedia template data and OCLC data. For this section I queried all the Wikipedia pages from December 2012 for all the citation templates, and extracted all the ISBNs and OCLC numbers.
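
As a rough sketch of that extraction step, the snippet below looks for `isbn` and `oclc` parameters in page text and normalizes what it finds. It makes some simplifying assumptions: it scans the whole page text rather than only citation templates, it checks ISBN length but not the check digit, and the function name and sample numbers are made up for illustration.

```python
import re

# Parameter patterns as they appear inside citation templates.
ISBN_RE = re.compile(r"\|\s*isbn\s*=\s*([0-9Xx][0-9Xx\- ]*)", re.IGNORECASE)
OCLC_RE = re.compile(r"\|\s*oclc\s*=\s*(\d+)", re.IGNORECASE)


def extract_identifiers(wikitext):
    """Return (isbns, oclc_numbers) found in the page text."""
    isbns = []
    for raw in ISBN_RE.findall(wikitext):
        # Drop hyphens and spaces so the same ISBN always looks the same.
        digits = raw.replace("-", "").replace(" ", "").upper()
        if len(digits) in (10, 13):
            isbns.append(digits)
    return isbns, OCLC_RE.findall(wikitext)


# Hypothetical citation with made-up identifiers.
text = "{{cite book |title=Example |isbn=978-0-00-000000-2 |oclc=123456}}"
print(extract_identifiers(text))
```

Normalizing the ISBNs before counting keeps differently punctuated copies of the same number from being treated as separate citations.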

    One way to characterize the cited books is audience level, derived from WorldCat holdings data. Audience level is expressed as “a decimal between 0.01 (juvenile books) and 1.00 (scholarly research works).” Taking the simple mean of audience level across all citations gives 0.47 on English Wikipedia. In German it’s 0.44. If we plot the histograms of each, we get moderately normal curves that, if anything, tend to skew left.

    Audience Level English Audience Level German

    Is Wikipedia stuffed with incomprehensibly dense knowledge? Maybe, but its citations aren’t necessarily.

    Subject Analysis

    Another bias claim lodged against Wikipedia is that content is heavily concentrated towards certain subjects. Is the same true for its citations? Every Wikipedia article could have any number of ISBNs or OCLC numbers (see figure below). In FRBR terms, these identifiers relate to manifestations, so using WorldCat they were clustered into works, at the expression level. And every work is about any number of subjects. Here I used the FAST subject headings, which are a faceted version of Library of Congress Subject Headings.

    Subject Analysis Procedure for Wikipedia
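
The procedure in the figure can be sketched in a few lines, from cited identifiers through works to per-subject totals. The two lookup tables below are hypothetical stand-ins for the WorldCat and FAST lookups, with made-up identifiers, and the function name `subject_counts` is illustrative.

```python
from collections import Counter

# Hypothetical lookup tables standing in for WorldCat and FAST queries.
WORK_OF = {"isbn:A": "work1", "isbn:B": "work1", "oclc:C": "work2"}
SUBJECTS_OF = {"work1": ["Mycology"], "work2": ["Military history", "Politics"]}


def subject_counts(citations):
    """Count citations per subject via identifier -> work -> subjects."""
    counts = Counter()
    for identifier in citations:
        work = WORK_OF.get(identifier)  # cluster the manifestation into its work
        for subject in SUBJECTS_OF.get(work, []):
            counts[subject] += 1        # one citation counts once per subject
    return counts


# Every cited identifier, repeats included; unknown ones are simply skipped.
citations = ["isbn:A", "isbn:B", "oclc:C", "isbn:A", "isbn:unknown"]
top = subject_counts(citations).most_common(100)
print(top)
```

Clustering into works first means that two different editions of the same book both add to the same subjects, rather than being treated as unrelated titles.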

    Then I totaled the number of citations on Wikipedia within each subject, creating a list of subjects with their respective citation frequency. Utilizing that list here is a word-cloud visualization of Wikipedia’s 100 most cited subjects, inferred through the subjects assigned to the works cited.

    A word cloud of the FAST Subject Headings of the most cited books in English Wikipedia

    There is a large preponderance of subjects that confirm the subcultures toward which Wikipedia is noted for its bias: Politics, Military History, Religion, Math and Physics, Comics and Video Games, and Mycology. At least if they are going to be overrepresented in general, they should be well cited.

    Below is the same algorithm applied to a different Wikipedia – can you guess the language?  Quite funny to see courts, administrative agencies, and executive departments with such prominence.

    dewiki-fast-word-cloud

    That should give just a glimpse of the range of avenues of inquiry available from being able to deeply search and connect Wikipedia template parameters with library data. Any special requests for specific queries?

    Wikily yours,

    Max

    twitter: notconfusing

    OCLC Research 2012: Wikipedia and Libraries

    Tuesday, December 18th, 2012 by Merrilee

    At the end of 2012, we are doing a mini series of blog postings to reflect on some of the year’s high points. This posting is the first in the series. Watch for updates!

    2012 has been a great year for me, because I’ve had the privilege of seeing a project I’ve been passionate about for some time come to life — exploring the connection between Wikipedia and Libraries. Around this time last year I began making connections with the Wikipedia GLAM community, and exploring the idea of OCLC Research hosting a Wikipedian in Residence. We were fortunate enough to receive organizational support for this idea, and with help from folks in the Wikipedia community, craft a position description, and bring Max Klein into our team in OCLC Research. Having Max working with us has been terrific and not just because of his Wikipedia skills.

    Since we’ve had Max on board, we have attended Wikimania, held not one but two Wikipedia Loves Libraries events, held two successful webinars attended by more than 500 librarians, and done countless videos (okay, I counted them up and there are at least 8). And then there was the Open Access Wikipedia Challenge on P2PU. Oh, and VIAFbot, which brought authority control templates and VIAF links to thousands of articles on the English language Wikipedia.

    Earlier this month, I presented a breakout session at CNI (along with Sara Snyder, from the Archives of American Art) on the connection between Wikipedia and Libraries. The session was well attended but more importantly, there was a lot of interest and excitement about the connection between Wikipedia and libraries. I’m very pleased that Max’s term has been extended, so he can help us explore some of those possibilities. So as we close out a successful and productive year, I look forward to another year of highlights in this area.

    Want to know more? View all the HangingTogether blog posts on this topic!

    VIAFbot Debriefing

    Wednesday, November 28th, 2012 by Max

    Shortly after reaching the 1/4 million edits milestone VIAFbot finished linking Wikipedia biography articles to VIAF.org. Examining the bot’s logs reveals telling statistics about the landscape of Authorities on Wikipedia. We can now know how much linked authority data is on Wikipedia, its composition, and the similarities between languages.

    First, let’s understand the flow of the bot’s job. With VIAFbot I sought to reciprocate the links from VIAF.org to Wikipedia, which were algorithmically matched by name, important dates, and selected works. Therefore it started by visiting all the Wikipedia links that existed on VIAF.org. Note that owing to the delay between when the links were created and now, some of the pages had been deleted or merged (Fig. 1 orange region). For the rest of the set-up it utilized German Wikipedia, which has focused a lot on its authorities data. VIAFbot also loaded all available German Wikipedia articles equivalent to our English matches, via the “interwiki link” in Wikipedia parlance.

    Next, VIAFbot searched for the equivalent structured-data templates, Authority control and Normdaten, to see what preexisting authorities data those pages held. German Wikipedia shone with 92,253 Normdaten templates (Fig. 1 purple region), of which 74,864 had the VIAF parameter filled (Fig. 1 pink region), compared to English Wikipedia’s mere 9,034 templates with 770 VIAF IDs.

    Figure 1.
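
The gap between the two Wikipedias is easier to see as a fill rate, that is, the share of authority templates that already carried a VIAF ID. A small sketch using the four counts quoted above (the dictionary layout and the name `fill_rate` are illustrative):

```python
# Template counts quoted in the post.
counts = {
    "German Wikipedia": {"templates": 92_253, "with_viaf": 74_864},
    "English Wikipedia": {"templates": 9_034, "with_viaf": 770},
}


def fill_rate(templates: int, with_viaf: int) -> float:
    """Share of authority templates that already had a VIAF ID."""
    return with_viaf / templates


for wiki, c in counts.items():
    print(f"{wiki}: {fill_rate(**c):.1%} of templates had a VIAF ID")
```

On these numbers roughly four in five German templates already had a VIAF ID, against fewer than one in ten English ones.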

    The program then compared the VIAF IDs supplied by English Wikipedia, German Wikipedia, and VIAF.org, although all three sources were not always present. As long as the available sources didn’t conflict, VIAFbot wrote the VIAF ID to the English Wikipedia page. If a conflict was found, the bot noted it for human inspection on Wikipedia, along with which sources conflicted. One telling statistic was how often the different sources disagreed with one another. These disagreement rates were surprisingly similar, but German Wikipedia seemed to disagree marginally less with VIAF.org, at 11.3% compared to English’s 15.9% (Fig. 2).

    Figure 2.
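
The decision rule described above can be sketched as a small function. This is an illustration of the logic only, not VIAFbot's actual code: the function name `reconcile`, the source labels, and the return values are assumptions, and the IDs in the usage lines are made up.

```python
def reconcile(enwiki=None, dewiki=None, viaf_org=None):
    """Decide what to do with the VIAF IDs offered by up to three sources.

    Returns ("write", viaf_id) when the available sources agree,
    ("conflict", {source: id}) when they differ, and ("skip", None)
    when no source supplies an ID.
    """
    supplied = {
        name: viaf_id
        for name, viaf_id in (
            ("enwiki", enwiki),
            ("dewiki", dewiki),
            ("viaf.org", viaf_org),
        )
        if viaf_id
    }
    distinct = set(supplied.values())
    if not distinct:
        return ("skip", None)
    if len(distinct) == 1:
        return ("write", distinct.pop())
    return ("conflict", supplied)  # left for human inspection


print(reconcile(dewiki="111", viaf_org="111"))
print(reconcile(enwiki="111", viaf_org="222"))
```

Returning the full source-to-ID mapping in the conflict case mirrors the bot noting which sources conflicted, so a human reviewer can see where each candidate ID came from.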

    In the noncontroversial cases without disagreement, of which there were 254,678, there were still some errors of a different variety. Even though there was no disagreement among the sources, the wrong VIAF number was sometimes written, probably in the instances in which VIAF.org was the only source. Some very dedicated Wikipedians took to reporting these errors, and VIAF.org will incorporate those corrections. That is the power of crowdsourcing refining algorithmic accuracy.

    The question still remains: how much are these links being used? Google Analytics on the VIAF.org site can help answer that. German Wikipedia was the largest Wikipedia referrer to VIAF.org as late as September 2012. VIAFbot started editing in October, and the effect was immediately tangible, soon gaining pole position and then doubling total referrals (Fig. 3). It must be said, though, that this level of viewership may not be sustained as the “curiosity clicks” of Wikipedians being notified of changes through their watchlists start to fade.

    Figure 3. Referral traffic to VIAF.org.

    Still, don’t doubt the usefulness of the project. For instance we received this email from John Myers of Union College in  Schenectady NY,

     “I had an Arabic name to enter into a record as part of a note, and I wasn’t confident about the diacritics.  So, I look in the authority file to temporarily download it, copy the form of the name, and then move on.  Couldn’t find the name in OCLC.  Look in Wikipedia under his common name – bingo.  Even better, Wikipedia has a link to VIAF, double bingo!  With the authorized form from VIAF, I could readily find the record in OCLC (I was tempted to copy the name form directly from VIAF, but didn’t want to push my luck.)  The miracles of an interconnected bibliographic dataverse!”

    VIAFbot had written the link for ‘Aziz ‘Aku ak-Misri only a few days prior.

    The principal benefit of VIAFbot is the interconnected structure. Recognizing this, other Wikipedias (Italian and Swedish) have been in contact and asked for the same on their wikis. Yet to truly be interconnected, the next step forward is to integrate VIAF IDs not into any one Wikipedia, but into the forthcoming Wikidata, a central database for all Wikipedias across languages. Fortuitously, the pywikidata bot framework is stabilizing, and I’m in need of a new project now.

    Without confusion,

    Max Klein (@notconfusing)