Harvesting Book Metadata From Wikipedia to Wikidata

Wednesday, November 27th, 2013 by Max

For a long time, Infoboxes were Wikipedia’s way of storing data. Wikidata is set to replace that technology, with added bonuses like inter-language sharing. To deliver on that promise, a first step is to harvest Infoboxes into Wikidata. I have started by harvesting Infobox Book in the 9 biggest Wikipedia languages that share the template: English, Italian, French, Spanish, Russian, Polish, Portuguese, Swedish, and Japanese.

The point of harvesting Infobox Book specifically is that the Wikidata citation guidelines for books specify that the library FRBR concept should be used, so I wanted to build out infrastructure to that end. FRBR is about describing bibliographic records at many different levels, and here’s an example of what this kind of citation would look like in Wikidata:

With that in mind, let’s have a look at the data. Our entry point is the set of Wikipedia pages that use Infobox Book (that transclude it, in Wikipedia parlance) in the 9 aforementioned languages. This measure is only an approximation and does not completely reflect how many Wikipedia topics are about books in a language, for three reasons. The first is that the notion of what counts as a book is not strictly enforced on Wikipedia. An article could be about a physical item or an amorphous work idea, and sometimes the inclusion of an Infobox Book template is only a nod to a book, as in the French article on this racing pigeon. The second is that not all articles about a book necessarily transclude Infobox Book. And thirdly, some specialised Infobox Books have developed and are used instead, like Infobox Doctor Who Book.
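One way to obtain such a transclusion count is the MediaWiki API's `list=embeddedin` query. The following is a minimal sketch rather than the script used for this harvest: the function names, the User-Agent string, the restriction to the article namespace, and the template title `Template:Infobox book` are assumptions, and the template is named differently on most non-English Wikipedias.

```python
import json
import urllib.parse
import urllib.request


def embeddedin_url(lang, template, cont=None):
    """Build a MediaWiki API URL listing pages that transclude `template`."""
    params = {
        "action": "query",
        "list": "embeddedin",
        "eititle": template,
        "einamespace": "0",   # assumption: count articles only
        "eilimit": "500",
        "format": "json",
    }
    if cont:
        params["eicontinue"] = cont
    return f"https://{lang}.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Fetch a URL and decode the JSON body (placeholder User-Agent)."""
    request = urllib.request.Request(url, headers={"User-Agent": "infobox-count-sketch/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def count_transclusions(lang, template, fetch=fetch_json):
    """Page through the embeddedin results and count the transcluding pages."""
    total, cont = 0, None
    while True:
        data = fetch(embeddedin_url(lang, template, cont))
        total += len(data["query"]["embeddedin"])
        cont = data.get("continue", {}).get("eicontinue")
        if not cont:
            return total
```

Calling `count_transclusions("en", "Template:Infobox book")` would then page through the live API. Passing a different `fetch` function lets the paging logic be exercised without network access.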

In this next chart we look at the total Infobox Book transclusions, the total articles in a language, and the ratio between the two. Despite large variation in absolute numbers, book articles make up somewhere between 0.1% and 1% of all articles in each Wikipedia. The Italians affirm themselves as the most bibliophilic. We’ll also see later on how their practice of labelling genre differs from the others.


Infobox Book Transclusion Counts By Language

Language   Infobox Book Transclusions   Total Articles (000’s)   Percentage of Total Articles
en         30582                        4432                     0.690
es         3534                         1057                     0.334
sv         3023                         1598                     0.189
pl         2782                         1005                     0.277
pt         1975                         803                      0.246
ru         1865                         1061                     0.176
it         10788                        1082                     0.997
ja         1446                         886                      0.163
fr         7935                         1441                     0.551
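The percentage column can be reproduced from the other two columns, keeping in mind that total articles are given in thousands. A small sketch, with names of my own choosing and only three rows copied from the table:

```python
# Rows copied from the table: (Infobox Book transclusions, total articles in thousands).
counts = {
    "en": (30582, 4432),
    "it": (10788, 1082),
    "ja": (1446, 886),
}


def book_share(transclusions, total_thousands):
    """Return Infobox Book transclusions as a percentage of all articles."""
    return 100 * transclusions / (total_thousands * 1000)


for lang, (transclusions, total_thousands) in counts.items():
    print(lang, round(book_share(transclusions, total_thousands), 3))
```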


In each Infobox I looked for the properties that were most used across all languages and whose values were either string identifiers or links to other Wikipedia pages. A value that links to another Wikipedia page, for instance the page of the author, is useful because, once harvested, Wikidata can store the author property as a link to another Wikidata item. This is desirable because in Wikidata we seek to build a wiki of relations.
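To illustrate what a value that is a link looks like in practice, here is a rough sketch that pulls link targets out of an infobox parameter value written in wikitext. The regular expression and the function name are illustrative assumptions, not the harvesting code itself, and the pattern only handles plain `[[Target]]` and `[[Target|label]]` links.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
# only the target page title is captured.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def linked_pages(value):
    """Return the page titles linked from one infobox parameter value."""
    return [title.strip() for title in WIKILINK.findall(value)]


# An author field that links to a page yields a title that can be resolved
# to a Wikidata item; a plain-text value yields nothing to link.
print(linked_pages("[[J. R. R. Tolkien]]"))
print(linked_pages("J. R. R. Tolkien"))
```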

Here is a graph of the properties that were found, showing which were added to Wikidata and which were already in the database.

Properties Harvest

So as you can see there are now over 30,000 relations between books and their authors and illustrators in Wikidata, as well as the original language and genres of the books. In addition knowing which book is which from a disambiguation perspective is made easier by the inclusion of over 50,000 identifiers.

One difficulty I encountered was that even though ISBNs are recorded in Infobox Book, the type of ISBN – 10 or 13 – is not distinguished. Wikidata does however distinguish between them, and so as I was sorting these ISBNs I thought it would be sage to also verify them. OCLC runs an API called xID for this very purpose. While using xID it also struck me that the OCLC control number (OCN) could be returned for a given ISBN. As Wikidata is rapidly evolving into a hub of identifiers, I included those when pushing to Wikidata. During this harvest, then, I also inserted an additional 10,117 OCNs (not pictured above).
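Sorting and sanity-checking ISBNs can also be done locally with the standard check-digit rules. The sketch below is not the xID service and does not look up OCLC control numbers; it only shows how a raw infobox value could be classified as ISBN-10 or ISBN-13. The function names are my own.

```python
DIGITS = set("0123456789")


def clean_isbn(raw):
    """Drop hyphens and spaces and upper-case a trailing 'x' check character."""
    return raw.replace("-", "").replace(" ", "").upper()


def is_valid_isbn10(isbn):
    """ISBN-10: weights 10 down to 1, sum divisible by 11; 'X' stands for 10."""
    if len(isbn) != 10 or not set(isbn[:9]) <= DIGITS:
        return False
    if isbn[9] not in DIGITS and isbn[9] != "X":
        return False
    values = [int(c) for c in isbn[:9]] + [10 if isbn[9] == "X" else int(isbn[9])]
    return sum(w * v for w, v in zip(range(10, 0, -1), values)) % 11 == 0


def is_valid_isbn13(isbn):
    """ISBN-13: alternating weights 1 and 3, sum divisible by 10."""
    if len(isbn) != 13 or not set(isbn) <= DIGITS:
        return False
    return sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(isbn)) % 10 == 0


def classify_isbn(raw):
    """Return 'ISBN-10', 'ISBN-13', or None if the checksum fails."""
    isbn = clean_isbn(raw)
    if is_valid_isbn10(isbn):
        return "ISBN-10"
    if is_valid_isbn13(isbn):
        return "ISBN-13"
    return None
```

A checksum only catches typing errors. It says nothing about whether the number was ever assigned to a book, which is one reason to consult a service such as xID.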

As I mentioned, it’s not just boring, nameless identifiers that we eventually want to integrate into all the Wikipedia pages through Wikidata. I inspected genre data as well, to see how much cross-cultural benefit we’d receive from doing these sorts of harvests. Below are the Top 10 genres found in Infobox Book for each language. The text shown is the English label of the Wikidata item for each link found in the local Infobox. I’ve also outlined those genres which are unique to one language. So you can see that the Swedes care a bit more about choir books and the Japanese have a bent towards police drama.


Infobox Book Top 10s
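A tally like this can be sketched in a few lines: count the genre labels per language, keep the most common ones, and flag those that appear in only one language's list. The function names are assumptions and the sample input is an invented placeholder, not harvest data.

```python
from collections import Counter


def top_genres(genre_labels, n=10):
    """Return the n most frequent genre labels for one language."""
    return [genre for genre, _ in Counter(genre_labels).most_common(n)]


def unique_genres(top_by_lang):
    """For each language, list the genres that appear in no other language's top list."""
    seen = Counter(g for genres in top_by_lang.values() for g in set(genres))
    return {
        lang: [g for g in genres if seen[g] == 1]
        for lang, genres in top_by_lang.items()
    }


# Placeholder input only: two made-up languages with made-up label lists.
sample = {
    "xx": top_genres(["novel", "novel", "essay"]),
    "yy": top_genres(["novel", "horror", "horror"]),
}
print(unique_genres(sample))
```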

What first jumped out at me is how inconsistently the idea of genre is used. Sometimes it’s used to describe the content’s emotion and focus, like “science fiction” or “horror”. Other times it’s used to describe form, like “novel”. In fact only the Italians are really consistent, as their top ten is largely about form: “novel”, “essay”, “short story”, “poetry”, “anthology”, “autobiography”, “novella”, “dialogue”, and “poem”.

Another problem between languages is that the genres often mismatch because they point to only slightly different articles. For example, we see appearances from the Wikidata items for “fantasy”, “fantasy literature”, “Fantastique”, and “high fantasy”. (By the way, you can draw your own conclusions about the demographics of Wikipedia editors when this much fantasy lit pervades the results.)

A conclusion that can be drawn from all this is that there is still some work to be done on negotiating cultural differences on Wikidata. Wikidata has made a lot of connections between Wikipedia articles in different languages, but not all of those merges are clean. The French article conflates a pigeon and a book about a pigeon, and it’s linked to articles in languages that discuss only the pigeon. Meanwhile, how the Italians interpret “genre” is a different, though not necessarily incompatible, notion from that of the others. There are probably still some discussions to be had before Infoboxes completely switch over to using Wikidata data, but we are at least one step closer to that goal.

User studies and risks for research libraries?

Monday, July 8th, 2013 by Jennifer

This is the first in a series of posts that synthesize conclusions of published user studies about desires and needs for research support. I’ve collected quite a stack of them. For the past three years I’ve been reading up on what academics themselves say about all this. Along the way, I’ve also gathered studies that include administrators and librarians. When the latest Ithaka US and UK faculty surveys came out this spring[1], I integrated their findings into my growing pile of evidence.

The cumulative effect is rather foreboding. Academic libraries appear to be somewhat out of touch with the needs of researchers. This shouldn’t be a surprise. The typical library often does not provide the support that researchers need to do their research. As a result, researchers report not being as well-served as they should be, and in their eyes academic libraries are losing relevance.

Today’s synthesis introduces user studies about risks for research libraries, especially the risk of doing nothing. In separate future posts I’ll focus on what researchers themselves say. If I get ambitious, I may delve into user studies with university administrators and – last but not least – librarians.

Why user studies?

Many leaders of research libraries are concerned that their institutions have become less relevant to faculty members and academics, whether due to advances in technology, success in licensing journals, or over-investment in teaching services for undergraduates at the expense of research. In the current context of disintermediation of libraries – combined with constraints on funding – administrators at research-intensive universities perceive that libraries are presently at risk. Internationally, significant attention has been given to demonstrating the value and ‘business’ of libraries to universities and funding agencies. Managing research information – whether research data, articles, or administrative information about researchers and their work – has recently become a strategy for libraries to weave themselves into the fabric of the research lifecycle, in order to demonstrate their value and mitigate risk of losing relevance and funding.

To re-establish the research libraries’ alignment with research needs, the community has called for investment in developing new services that support research workflows and university administrations. Considerable thought has been given to the nature and function of such new services. National and institutional initiatives have enabled a handful of research libraries to spend significant resources planning and developing up-to-date research services.

For years, librarians have called for studies that articulate what researchers desire by way of support for their research.[2] These blog posts are my meta-analysis of the results of some 30+ years of studies – including recent reports from RIN, Ithaka, OCLC Research, and the DCC – in order to gather together evidence of system-wide needs for research services, both within and outside libraries. Of course, methodologies used to generate the numerous studies vary, such as interviews, surveys, and focus groups. Also, the objectives of the various projects are different, so exact or parallel comparisons are difficult and the conclusions are not necessarily overlapping or consistent.

Nevertheless, clear trends and distinct patterns emerge from the body of work as a whole. Recent research on scholarly behavior converges on conclusions about all manner of information-related services in universities and across academic disciplines. Qualitative and quantitative studies of scholars and academic administrators provide a mountain of evidence about the nature of services and infrastructure required to span the entire lifecycle of the fruits of research. While we have witnessed simultaneous evolution of discipline-based and institution-based services, diverse international reports have identified gaps in digital infrastructure and provision of services to manage research information, both by libraries and by university administrations.

Risks: disintermediation, funding, value

Martin Feijen, in a literature review from the Dutch SURFfoundation titled What researchers want, gleans the crux of the matter: “There is one very concise statement about risk: ‘The biggest risk is to do nothing.’”[3]

OCLC Research 2011: “Seeking Synchronicity,” insights into virtual reference

Thursday, December 22nd, 2011 by Merrilee

At the end of 2011, we are doing a mini series of blog postings to reflect on some of the year’s highpoints. This posting is the second in the series.

“Seeking Synchronicity” was published as an OCLC members report in 2011, but is based on many research projects on virtual reference, both research conducted by OCLC Research Scientist Lynn Connaway and Rutgers Professor Marie Radford, and others. Marie and Lynn have helpfully boiled down findings to a very readable set of recommendations and guidelines about virtual reference and optimizing your chances for satisfaction and success.

What has stuck with me after reading the report is the importance of building relationships. Practicing good customer service goes well beyond virtual reference.

You can view a webinar (and find more information about the project) here.

National systems of research assessment and implications for libraries

Tuesday, December 22nd, 2009 by John

Research assessment is a very big deal in some countries. Countries whose university systems are largely publicly-funded routinely check up on the research quality of individual universities to ensure that they are squeezing the best possible performance out of their systems. They do this because they see a link between high-quality research and economic development. The economic potential of research is growing in importance as national ‘knowledge economies’ recognise the need for international research excellence, and see universities as a key driver.

We have just published a report which reviews the research assessment regimes of five countries, and the role of libraries in the processes of assessment that exist. This report was produced by Key Perspectives Ltd, a UK consultancy, and it surveys the research assessment situation in the Netherlands, Ireland, the UK, Denmark and Australia. We chose countries that we knew were doing interesting things in assessment – or in preparation for its introduction. The high political stakes involved were evident even as the report was being written. In the UK, the pilot exercise for the system that will replace the Research Assessment Exercise (RAE) ditched one of its proposed new thrusts (bibliometrics) and found another (economic impact) for the country’s universities to stress about. In Australia, a recent change of government led to temporary abandonment of a system that tied assessment outcomes to government funding, and arguably lost the country some ground in the international scramble for both reputation and economic advantage.

The Review provides a fascinating account of different cultural understandings of the purposes of assessment, and a glimpse of the trend of concentrating research excellence in a small number of top universities that is now taking shape in many countries, as the competition for research income, top faculty and students becomes one that occurs within a single international marketplace. We found countries that tied research assessment to large amounts of government funding, and others that did not (yet); countries that operated systems based on bibliometrics and others that mistrusted them; countries that devised league tables of journals and awarded points to researchers on those they published in – and others that assembled national panels of experts to determine the rankings.

Libraries are involved in these assessment exercises in a range of ways, from the clerical (data entry) to the highly strategic, and from the specialist (bibliometric expertise) to a role as providers of general infrastructure (institutional repositories). Whatever differences there may be in the assessment systems adopted by different countries, they all share a focus upon the research outputs produced by their researchers and faculty. These outputs are managed by libraries – both indirectly (via publications) and, increasingly, directly (via arrangements with the authors themselves at pre-publication stages). Does this suggest that libraries play a central role in research assessment within their institutions? Or that they should? At the very least, shouldn’t libraries seek a shared view on this question?

Climate change for libraries

Monday, November 30th, 2009 by John

At the RLG Partnership Annual Meeting in 2007, Timothy Burke told the assembled research librarians ‘you have to figure out how to be hydraulic engineers of information flow rather than the guardians of the fortress’. It’s an image that has stuck with me. Everywhere now in our professional literature we see the challenges of our work represented by the imagery of flow and fluidity. We try to scope and identify workflows that are changing or need to change. The platform of the web dips and peaks faster and differently than we can predict, and as it does so content suddenly flows in different directions, taking new channels. Stability in this environment is rare, and a relief when we find it, even though it may lie in places that librarians take some time to trust – like Google and Wikipedia.

I often show a slide produced by Rick Luce, Vice-Provost and Director of Libraries at Emory University, when describing the territory of our Research Information Management (RIM) programme. This appeals to me because it indicates that library attention needs to be focused on the workflow layer, rather than the repository layer that sits below it.

Understanding the particular environments of researchers, and the flows that matter to them, is perhaps not a new challenge for research libraries, but it is a newly urgent one. In the pre-digital world the flows were not digital flows, with the capture challenges and opportunities that now exist. The library dealt mainly in the solid world of published literature. It collected from the physical outputs that emerged at the end of flow processes, and could structure its operations around that bounded reality (within its ‘fortress’ print stores, to use Tim Burke’s analogy). Now, we see potential for library services everywhere, because we have systems that capture flows, and allow them to combine, split and replicate wherever it is useful for them to do so, and legal barriers do not obstruct. But to do so optimally, we need to understand researchers’ worlds at a level of detail that is still not familiar to libraries.

Mendeley scrobbles your papers

Thursday, September 24th, 2009 by John

Mendeley is a social web application for academic authors that has been receiving quite a lot of attention recently. Victor Keegan wrote about it in The Guardian last week, likening it to the streaming music service

How does it work? At the basic level, students can “drag and drop” research papers into the site at which automatically extracts data, keywords, cited references, etc, thereby creating a searchable database and saving countless hours of work. That in itself is great, but now the bit kicks in, enabling users to collaborate with researchers around the world, whose existence they might not know about until Mendeley’s algorithms find, say, that they are the most-read person in Japan in their niche specialism. You can recommend other people’s papers and see how many people are reading yours, which you can’t do in Nature and Science. Mendeley says that instead of waiting for papers to be published after a lengthy procedure of acquiring citations, they could move to a regime of real-time citations, thereby greatly reducing the time taken for research to be applied in the real world and actually boost economic growth. There are lots of research archives. For the physical (but not biological) sciences there is ArXiv, with more than half a million e-papers free online – but nothing on the potential scale of Mendeley. Around 60,000 people have already signed up and a staggering 4m scientific papers have been uploaded, doubling every 10 weeks. At this rate it will soon overtake the biggest academic databases, which have around 20m papers.

Journals and the tainting of science

Friday, August 21st, 2009 by John

The main feature article in last week’s Times Higher, A threat to scientific communication: do academic journals pose a threat to the advancement of science?, by Zoë Corbyn, examines the scholarly journals system and asks some penetrating questions about dysfunctionality in the academy, at least in the UK. We are all aware of some troubling issues caused by the link between journal publication and academic reputation, both individual and institutional. This article is one of the boldest yet to appear in the press on the subject, and it suggests that the detriment to the advancement of knowledge due to the stranglehold of the impact factor, compounded by the artificial behaviours induced by a regime of research assessment tied to funding, is now at a level that warrants serious attention. One of the most perversely reassuring things about the article is that it quotes several senior academics, editors and policy makers, whose concerns include many that librarians have been shaking their heads about for years now. Rather than rehearse the article, which can be found on the Higher’s website, I provide below my extrapolations of some of the most disturbing symptoms identified both by the correspondents in the article, and by those who are still sending in responses to the article on the website:

  • Scientists over-hype, over-interpret, destructively split out and prematurely publish their findings.
  • Ridiculously long authorship claims are almost fraudulent. This motivated me to search for an indication of the extent of this absurdity. Finding that a Thomson Scientific study indicated that a paper published in 2006 had 2,512 authors raises the question of whether such a distortion of research to benefit the credentials of scientists is not likely to bring their own work into disrepute?
  • Editorial incentives, even in top journals, are distorted by the impact factor in favour of certain types of article written by researchers in wealthy western universities. The effects could be considered racist.
  • New textbooks are not being written by UK-based humanists and social scientists because they are being horse-whipped into producing journal articles in high impact journals. This means that teaching is suffering because the available textbooks are becoming out-dated, and outmoded ideas and attitudes are being perpetuated.
Remedies suggested centre upon the academy taking back the means of control into its own hands, which should provide some encouragement to initiatives such as open access repositories, though their role needs considerable development if they are to provide a corrective. Among the measures suggested are:

  • Universities should develop their own metrics.
  • Learned societies should abandon commercial publishing operations.
  • Researchers working in areas of strong public concern should engage in ‘mass disobedience’ and publish their findings on the web immediately.
  • Peer review should be less imperious, more workmanlike and more democratic.
  • Open access papers should be deposited in a national repository for the UK.
  • Wealthy universities, via their reputationally secure researchers, should lead the rest in preferring open access journals for their publications.
  • Research libraries should take on the burden of presenting choice of publication venues to academic authors.

Coming from scholars themselves, these views are important for us to note for our Research Information Management work, where some projects are getting underway with surveying researchers in focus groups and via interviews. It seems clear that the academic community has a number of concerns and possible solutions that librarians have not yet thought of, or dared to think of.

    Special collections and university rankings

    Thursday, August 6th, 2009 by John

    The University of Leeds has made two prestigious acquisitions recently which have been deemed worthy of announcing from the university’s own news page. In early June, the university acquired the archive of Marks & Spencer, one of the UK’s most prestigious stores, which began its life in Leeds some 125 years ago (and has created an online exhibition drawn from its archive). Now headquartered in London, the return of the company’s archive is a nice example of regional cultural repatriation, and will undoubtedly provide a basis for a great deal of interesting research as suggested by the University’s Vice Chancellor, Michael Arthur:

    We already have one of the best academic libraries in the country, and the arrival of this tremendous archive will further strengthen it. The collection spans economic, social, artistic and cultural history and will be of interest to staff and students from all parts of the University as well as the public.

    And just a few days ago came news of the acquisition of a collection relating to Frederick Rolfe, Baron Corvo, a controversial early 20th century English novelist. This collection adds to Leeds’ substantial holdings in Victorian and early 20th century literature, and illustrates well the importance of cultivating vital relationships in a collecting strategy that gives gravity to a strong research library.

    I was interested in these library stories that had made the ‘front page’ of the university’s website, since Leeds is anxious to improve its reputation internationally. The university’s ambitions are expressed very starkly in one of the standard footnotes for editors: ‘The University’s vision is to secure a place among the world’s top 50 by 2015’. News stories based on research developments, awards to staff or students, and prestigious acquisitions like these, are of course now common on university websites, and a standardised list of notes to editors is frequently used. But even in the reputationally aggressive UK, it is unusual to see a university stake its claim quite as boldly as this. This is probably because the league tables themselves are still neither widely respected nor held as authoritative – though Leeds may be banking on that position having changed by 2015.

    It does as yet have some distance to travel though, since the Times Higher table currently lists Leeds in 104th position, having dropped 24 places since the previous year. The Shanghai Jiao Tong Index has it in 131st, down one place. But the new edition of the oddly named Ranking Web of World Universities, which judges institutions on the strength of the web presence of their research rather than on prizes won or citations, has boosted Leeds from position 180, in January, to 167 in July. Perhaps stories about research, including research collections, are beginning to have the desired effect.

    Impact Measures and Library Selection

    Thursday, May 14th, 2009 by Constance

    I have just been reading a recent article by Kathy Enger* published in Library & Information Science Research that examines the potential value of citation analysis as a selection tool in academic library acquisitions. Enger proposes that citation analysis of the journal literature might be used to identify potentially high-impact books for inclusion in a college or university library collection. The reasoning here is quite interesting: based on the observation that humanities and social science scholars rely more heavily on monographs than journals as a vehicle of scholarly communication, a sampling method is used to identify high impact journals in the social sciences and then cull from these the top cited authors. If these authors have also published books not already represented in the local collection, the titles are acquired on the premise that the content is likely to represent ‘high value’ scholarship. Library circulation figures are later examined to determine if these titles are used (borrowed) more frequently than titles selected through traditional means.

    Efficiency and scholarly information practices

    Tuesday, March 31st, 2009 by Constance

    There is a good article* in the most recent issue of JASIS&T by a group of Canadian scholars who challenge James Evans’ controversial claim that the increase in online availability of research publications has resulted in more focused and narrowly concentrated scholarly citation patterns. Evans’ study (2008) was the subject of a previous post on the ‘narrowing prospective.’

    Vincent Larivière, Yves Gingras and Eric Archambault present research findings that suggest that the dispersion of citations has actually increased over the past century.  According to their research, the range of literature cited in contemporary scholarship grows over time as a function of the total supply or availability of published research. The percentage of papers cited at least one time increases steadily as the body of literature grows and matures. They characterize the implications of these findings in fairly categorical terms:

    All these measures converge to demonstrate that citations are not becoming more concentrated but increasingly dispersed, and one can therefore argue that the scientific system is increasingly efficient at using published knowledge.  Moreover, what our data shows is not a tendency toward an increasingly exclusive and elitist scientific system, but rather one that is increasingly democratic.

    Larivière, V., Gingras, Y., & Archambault, E. (2009): 861.

I was struck by the authors' references to the 'scientific system' of scholarly communication, since it connotes not only a methodical approach but also a set of norms and expectations about the progressive advancement of human knowledge.