Harvesting Book Metadata From Wikipedia to Wikidata

For a long time, Infoboxes were Wikipedia’s way of storing data, and Wikidata is set to replace that technology, with added bonuses like inter-language sharing. To deliver on that promise, one first step is for Infoboxes to be harvested into Wikidata. I have started by harvesting Infobox Book in the 9 biggest Wikipedia languages that share the template: English, Italian, French, Spanish, Russian, Polish, Portuguese, Swedish, and Japanese.

The point of harvesting Infobox Book specifically is that the Wikidata citation guidelines for books specify that the library FRBR concept should be used, so I wanted to build out infrastructure to that end. FRBR is about describing bibliographic records at many different levels, and here’s an example of what this kind of citation would look like in Wikidata:

With that in mind, let’s have a look at the data. Our entry point is the set of Wikipedia pages that use Infobox Book (a transclusion, in Wikipedia parlance) in the 9 aforementioned languages. This measure is only an approximation and does not completely reflect how many Wikipedia topics are about books in a language, for three reasons. The first is that the conception of what a book is is not strictly enforced on Wikipedia. An article could be about a physical item or an amorphous work idea, and sometimes the inclusion of an Infobox Book template is only a nod to a book, like the French article on this racing pigeon. The second is that not all articles about a book necessarily contain a transclusion of Infobox Book. The third is that some specialised book infoboxes have developed and are used instead, like Infobox Doctor Who Book.

In this next chart we look at the total Infobox Book transclusions, the total articles of a language, and the ratio between the two. Despite large variation in absolute numbers, the percentage of book articles in a Wikipedia is somewhere between 0.1% and 1% of all articles. The Italians show themselves to be the most bibliophilic. We’ll also see later on how their practice of labelling genre differs from the others.


Infobox Book Transclusion Counts By Language

Language  Infobox Book Transclusions  Total Articles (000’s)  Percentage of Total Articles
en        30582                       4432                    0.690
es        3534                        1057                    0.334
sv        3023                        1598                    0.189
pl        2782                        1005                    0.277
pt        1975                        803                     0.246
ru        1865                        1061                    0.176
it        10788                       1082                    0.997
ja        1446                        886                     0.163
fr        7935                        1441                    0.551
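For anyone who wants to check the last column, here is a minimal sketch that recomputes it from the first two. The counts are copied from the table; the names `COUNTS` and `book_share` are my own and purely illustrative.

```python
# (transclusions, total articles in thousands), copied from the table above.
COUNTS = {
    "en": (30582, 4432),
    "es": (3534, 1057),
    "sv": (3023, 1598),
    "pl": (2782, 1005),
    "pt": (1975, 803),
    "ru": (1865, 1061),
    "it": (10788, 1082),
    "ja": (1446, 886),
    "fr": (7935, 1441),
}


def book_share(transclusions, total_thousands):
    """Percentage of a wiki's articles that transclude Infobox Book."""
    return 100.0 * transclusions / (total_thousands * 1000)


shares = {lang: book_share(*pair) for lang, pair in COUNTS.items()}

# Print the languages from most to least "bibliophilic".
for lang, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang}  {pct:.3f}%")
```

Sorting by the computed share puts Italian first and Japanese last, which is the same ordering the percentage column gives.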


In each Infobox I crawled for the most-used properties across all languages whose values were either string identifiers or links to other Wikipedia pages. A value that links to another Wikipedia page, for instance the page of the author, is useful because, once harvested, Wikidata can store the author property as a link to another Wikidata item. This is desirable because in Wikidata we seek to build a wiki of relations.
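To make the "links as values" idea concrete, here is a rough sketch of pulling link targets out of infobox fields. It is not the harvesting code itself: it assumes a template written one field per line, the regular expressions and function names are illustrative, and the sample wikitext is made up for the example. Real wikitext has nested templates and other quirks, so a proper wikitext parser would be the safer choice in practice.

```python
import re

# Matches [[Target]], [[Target#Section]] and [[Target|label]];
# only the link target is captured.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

# Matches one "| name = value" line (assumes one field per line).
FIELD = re.compile(
    r"^[ \t]*\|[ \t]*([^=|\n]+?)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE
)


def infobox_fields(wikitext):
    """Return {lower-cased field name: raw value} for the template text."""
    return {name.lower(): value for name, value in FIELD.findall(wikitext)}


def link_targets(value):
    """Return the page titles linked from a field value, in order."""
    return [target.strip() for target in WIKILINK.findall(value)]


# Made-up sample, for illustration only.
SAMPLE = """{{Infobox book
| name   = The Hobbit
| author = [[J. R. R. Tolkien]]
| genre  = [[High fantasy]], [[Children's literature|juvenile fiction]]
}}"""

fields = infobox_fields(SAMPLE)
author_links = link_targets(fields["author"])
genre_links = link_targets(fields["genre"])
name_links = link_targets(fields["name"])
```

A field like `name` yields no links and would have to be stored as a plain string, while `author` and `genre` yield page titles that can each be resolved to a Wikidata item.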

Here is a graph of the properties that were found, showing which were added to Wikidata and which were already in the database.

Properties Harvest

So as you can see, there are now over 30,000 relations between books and their authors and illustrators in Wikidata, as well as the original language and genres of the books. In addition, telling which book is which from a disambiguation perspective is made easier by the inclusion of over 50,000 identifiers.

One difficulty I encountered was that even though ISBNs are recorded in Infobox Book, the type of ISBN (10 or 13) was not distinguished. Wikidata does distinguish them, however, and so as I was sorting these ISBNs I thought it would be sage to also verify them. OCLC runs an API called xID for this very purpose. While using xID it also struck me that the OCLC control number (OCN) could be returned for a given ISBN. As Wikidata is rapidly evolving into a hub of identifiers, I included those when pushing to Wikidata. During this harvest, then, I also inserted an additional 10,117 OCNs (not pictured above).
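The sorting and verifying step can also be approximated offline, since both ISBN types carry a check digit. The sketch below is not what I ran (that was xID); it just applies the standard ISBN-10 and ISBN-13 checksum rules, and the function names are illustrative. A checksum only shows that a number is well formed, not that it was ever assigned to a book.

```python
import re


def normalize_isbn(raw):
    """Strip hyphens and whitespace, and upper-case a trailing 'x'."""
    return re.sub(r"[\s-]", "", raw).upper()


def is_valid_isbn10(isbn):
    """ISBN-10: weights 10 down to 1, weighted sum divisible by 11."""
    if not re.fullmatch(r"[0-9]{9}[0-9X]", isbn):
        return False
    digits = [10 if ch == "X" else int(ch) for ch in isbn]
    return sum(w * d for w, d in zip(range(10, 0, -1), digits)) % 11 == 0


def is_valid_isbn13(isbn):
    """ISBN-13: weights alternate 1, 3, ...; weighted sum divisible by 10."""
    if not re.fullmatch(r"[0-9]{13}", isbn):
        return False
    total = sum((3 if i % 2 else 1) * int(ch) for i, ch in enumerate(isbn))
    return total % 10 == 0


def classify_isbn(raw):
    """Return 'ISBN-10', 'ISBN-13', or None if neither checksum passes."""
    isbn = normalize_isbn(raw)
    if is_valid_isbn10(isbn):
        return "ISBN-10"
    if is_valid_isbn13(isbn):
        return "ISBN-13"
    return None
```

With this, `classify_isbn("0-306-40615-2")` gives `"ISBN-10"`, `classify_isbn("978-0-306-40615-7")` gives `"ISBN-13"`, and a string with a wrong check digit gives `None`, so it can be set aside rather than pushed to Wikidata.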

As I mentioned, it’s not just boring, nameless identifiers that we want to eventually integrate into all the Wikipedia pages through Wikidata. I inspected genre data as well, to see how much cross-cultural benefit we’d receive by doing these sorts of harvests. Below are the Top 10 genres found in Infobox Book for each language. The text shown is the English label of the Wikidata item for each link found in the local Infobox. I’ve also outlined those genres which are unique to one language. So you can see that the Swedes care a bit more about choir books and the Japanese have a bent towards police drama.
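The counting behind such a chart is simple enough to sketch. The snippet below assumes the genre links have already been resolved to Wikidata item labels. The function names and the tiny sample data are invented for illustration and are not the harvested numbers.

```python
from collections import Counter


def top_genres(links_by_language, n=10):
    """Map each language to its n most frequently linked genre items."""
    return {
        lang: [item for item, _ in Counter(items).most_common(n)]
        for lang, items in links_by_language.items()
    }


def unique_genres(top_by_language):
    """For each language, keep genres found in no other language's list."""
    appearances = Counter(
        item for items in top_by_language.values() for item in set(items)
    )
    return {
        lang: [item for item in items if appearances[item] == 1]
        for lang, items in top_by_language.items()
    }


# Invented toy data: one entry per genre link found in an infobox.
SAMPLE_LINKS = {
    "sv": ["novel", "novel", "novel", "choir book", "choir book", "fantasy"],
    "ja": ["novel", "novel", "police drama", "police drama", "police drama",
           "fantasy"],
}

top2 = top_genres(SAMPLE_LINKS, n=2)
outlined = unique_genres(top2)
```

Counting items rather than raw link text is what lets the same genre line up across languages, since every language's link resolves to one shared item.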


Infobox Book Top 10s

What first jumped out at me is how inconsistently the idea of genre is used. Sometimes it’s used to describe the content’s emotion and focus, like “science fiction” or “horror”. Other times it’s used to describe form, like “novel”. In fact, only the Italians are really consistent, as their top ten sticks to form: “novel”, “essay”, “short story”, “poetry”, “anthology”, “autobiography”, “novella”, “dialogue”, and “poem”.

Another problem between languages is that the genres often mismatch because they point to only slightly different articles. For example, we see appearances from the Wikidata items for “fantasy”, “fantasy literature”, “Fantastique”, and “high fantasy”. (By the way, you can draw your own conclusions about the demographics of Wikipedia editors when this much fantasy lit pervades the results.)

A conclusion that can be drawn from all this is that there is still some work to be done on negotiating cultural differences on Wikidata. Wikidata has made a lot of connections between Wikipedia articles in different languages, but not all of those merges are clean. The French conflate a pigeon and a book about a pigeon, and that article is linked to languages that discuss only the pigeon. Meanwhile, how the Italians interpret “genre” is a different, though not necessarily incompatible, notion from that of the others. There are probably some discussions still to be had before Infoboxes completely switch over to using Wikidata data, but we are at least one step closer to that goal.

4 Comments on “Harvesting Book Metadata From Wikipedia to Wikidata”

  1. Stephen, the way the genre data is collected is not really by term, but by links. So if there is a link in a novel’s page to the “whodunit” page, then I count that link. You will notice that the Wikipedia page “whodunit” has a counterpart on the Italian Wikipedia, “Giallo deduttivo”. Each collection of Wikipedia pages that are linked to each other is known as a Wikidata item (in this case https://www.wikidata.org/wiki/Q1494434#sitelinks-wikipedia). I am counting which Wikidata items are linked to in the genre fields. Each Wikidata item has an English label, which I am using, and that’s why it appears to be in English, although I could have just as easily used any other language, including a language which I didn’t crawl at all.

  2. This raises so many questions. Just focusing on genre–
    A quick sampling suggests that genre terms are assigned in blocks. Ellery Queen novels are assigned “Mystery novel/Whodunit”. Agatha Christie novels are assigned “Crime novel.” This suggests a certain lumpiness and idiosyncrasy in the data rather than an even application of the terms. How could this be explored?
    Are these all intact terms? Or has “Mystery novel” been split into “mystery” and “novel”? If all intact, was there no redundancy in the counts for top ten genre terms (e.g., “Mystery” made the top ten but “Mystery novel” didn’t)? Or have term counts been combined for the report?
    The Italian InfoboxBook for Queen’s “The Chinese Orange Mystery” (“Il delitto alla rovescia”) lists Genere as “Romanzo” and Sottogenere as “giallo”. Google Translate gives “Detective,” “Whodunit,” and “Mystery novel” all as translations of “Giallo.” How were these variations in expression managed, given that all the top ten terms lists were assembled in English?
    As Karen said, fascinating. Thanks

  3. Fascinating, Max! I was also struck by the relatively high proportion of book Infoboxes on World War II and theology in the Polish Wikipedia and that the Russians care so much about the International Union for Conservation of Nature’s Red List of endangered species, even a bit more than thrillers.
