My colleagues Jean Godby, Karen Smith-Yoshimura, and Bruce Washburn, along with a host of partners, have just released Creating Library Linked Data with Wikibase: Lessons Learned from Project Passage, a fascinating account of their experiences working with a customized instance of Wikibase to create resource descriptions in the form of linked data. In the spirit of their report, I’d like to offer a modest yet illustrative use case showing how access to the relationships and properties of the linked data in another Wikibase environment – Wikidata – smoothed the way for OCLC Research’s recent study of the Canadian presence in the published record.
Maple Leaves: Discovering Canada Through the Published Record is the latest in a series of OCLC Research studies that explore national contributions to the world’s accumulated body of published materials. A national contribution is defined as materials published in, about, and/or by the people of a given country. The last category presents a special challenge: how do you assemble a list of entities – people and organizations – associated with a particular country, from which authors, musicians, filmmakers, and other creators of published works can be identified?
For the first three reports in our series – covering Scotland, New Zealand, and Ireland – I turned to DBpedia, a resource that converts the information in Wikipedia into structured data. I used data dumps from DBpedia and processed them in two ways. First, I parsed a file of structured data, looking at certain attributes or properties, such as birthPlace, to find entities associated with the country in question. Because attributes pertaining to country of birth or nationality are not consistently populated across all entities, I supplemented this approach by searching the short abstracts associated with each entry for strings such as “is a Scottish” or “is an Irish”, in order to identify entities described as being of a particular nationality in their unstructured descriptions. Merging the results of these scans produced a list of entities associated with a given country. We were then able to match these entities to publications via our WorldCat Identities database, which maps published works to their creators.
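In rough outline, the two scans looked something like the sketch below (simplified for illustration, using the Irish case – the dump file names, place names, and search patterns are stand-ins, not the actual scripts I used):

```python
import re

# Simplified sketch of the two DBpedia scans described above (Irish example).
# File names, place names, and patterns are illustrative stand-ins.

# Pre-compiled place names associated with the country of interest.
PLACES = {"Ireland", "Munster", "County_Kerry", "Dublin"}

# Nationality phrase to look for in the short abstracts.
NATIONALITY = re.compile(r"\bis an Irish\b")

entities = set()

# Scan 1: structured data -- keep entities whose birthPlace points to one of
# the pre-compiled place names (an N-Triples dump is assumed).
with open("mappingbased_objects_en.nt", encoding="utf-8") as f:
    for line in f:
        if not line.startswith("<"):
            continue  # skip comment lines in the dump
        subj, pred, obj = line.split(" ", 2)
        if pred.endswith("/ontology/birthPlace>"):
            place = obj.rsplit("/", 1)[-1].rstrip("> .\n")
            if place in PLACES:
                entities.add(subj)

# Scan 2: unstructured data -- keep entities whose short abstract describes
# them as being of the relevant nationality.
with open("short_abstracts_en.nt", encoding="utf-8") as f:
    for line in f:
        if not line.startswith("<"):
            continue
        subj, _, abstract = line.split(" ", 2)
        if NATIONALITY.search(abstract):
            entities.add(subj)

print(len(entities), "candidate entities")
```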
Although this methodology was reasonably effective in producing the desired list, there were also drawbacks, not least of which was that it required a great deal of brute-force processing. In parsing the structured data, I had to look for an assortment of place names relevant to a particular country: e.g., Ireland, Munster, County Kerry, Dublin, and so on. This necessitated the compilation of a list of pre-determined search terms, with attendant problems of disambiguation – did you know there is a village named California in Scotland? In addition, the process operated on the data dumps made available periodically by DBpedia, which could be many months out of sync with Wikipedia at the time of use.
Earlier this year, as I prepared to launch the fourth and latest study, focusing on Canada, I had a conversation with my colleague Jeff Young about this work. In talking about options for accessing and querying data from Wikipedia, Jeff encouraged me to explore Wikidata as an alternative method, and generously gave me a tutorial on the basics.
Like the Project Passage environment, Wikidata is an implementation of Wikibase with MediaWiki. Wikibase is an open source package for storing and managing structured data. Pair it with MediaWiki – another open source package that supports collaborative creation and editing of wiki pages (if that sounds familiar, it should: if you’ve used Wikipedia, you’ve used MediaWiki) – and you have a platform for collaboratively creating and editing a database of structured data.
The most visible implementation of Wikibase is Wikidata. Among other things, Wikidata is a database containing structured data about all of the entities found across all of the different language versions of Wikipedia. But Wikipedia-derived data is only a subset of Wikidata: an entity can appear in Wikidata that has no corresponding entry in any of the Wikipedias. All told, Wikidata contains structured information on nearly 59 million “items” – people, concepts, places, and so on. But the really valuable feature of Wikidata is that this data is translated into an RDF triplestore, which can then be queried, via the Wikidata Query Service, using the powerful SPARQL language. In other words, Wikidata permitted me to search all of the Wikipedias – and more – as linked data.
Using Wikidata, I was able to reduce the extraction of Canadian entities from the Wikipedia-based universe to a single query, with no computational exertions on my part. The query I constructed looked for all entities registered in Wikidata representing individuals born in, or citizens of, Canada, as well as corporate bodies that were formed or are headquartered in Canada. My query essentially linked a series of entity attributes – place of birth and country of citizenship (for people); location of formation, headquarters, or location (for organizations) – with the geographical requirement that their values be locations within the administrative territorial entity of Canada. The result was a list of about 80,000 distinct entities, which I was able to easily export in the form of Wikidata URIs (e.g., http://www.wikidata.org/entity/Q273034 for the Canadian author Lucy Maud Montgomery). It was these URIs that my colleague Ralph LeVan used to match Canadian entities to works in WorldCat.
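For the curious, a query along these lines – simplified here for illustration, and not the exact query I ran – can be sent to the Wikidata Query Service with just a few lines of Python. The property and item identifiers are Wikidata’s own (P19 place of birth, P27 country of citizenship, P740 location of formation, P159 headquarters location, P17 country, Q16 Canada); in this sketch, places are anchored to Canada via the simple country property rather than by walking the full administrative territorial hierarchy:

```python
import requests

WDQS = "https://query.wikidata.org/sparql"

# Illustrative reconstruction, not the exact query used for the study.
QUERY = """
SELECT DISTINCT ?entity WHERE {
  { ?entity wdt:P27 wd:Q16 . }                            # person: citizen of Canada
  UNION
  { ?entity wdt:P19  ?place . ?place wdt:P17 wd:Q16 . }   # person: born in a Canadian place
  UNION
  { ?entity wdt:P740 ?place . ?place wdt:P17 wd:Q16 . }   # organization: formed in a Canadian place
  UNION
  { ?entity wdt:P159 ?place . ?place wdt:P17 wd:Q16 . }   # organization: headquartered in a Canadian place
}
"""

response = requests.get(
    WDQS,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "canadian-entities-sketch/0.1 (example)"},
)
response.raise_for_status()

# Each result is a Wikidata URI, e.g. http://www.wikidata.org/entity/Q273034
uris = [row["entity"]["value"] for row in response.json()["results"]["bindings"]]
print(len(uris), "entities")
```

A query this broad can bump up against the Query Service’s timeout; in that case it can be split up by property, or run through the Query Service’s web interface and exported from there.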
There were several big advantages to using Wikidata over my previous methodology. First and foremost was speed: once I had settled on my query parameters, I had my list in a couple of minutes. Second, Wikidata draws in data from all of the different language versions of Wikipedia, as well as other sources. This was especially useful in identifying francophone Canadians, as Wikidata includes entries from the French-language version of Wikipedia. This helped make our list of Canadian authors as complete as possible – much more so than would have been possible through the English-language version of Wikipedia alone. And third, my contribution to the process of extracting the list was simply typing in a couple of lines of SPARQL: no need to write elaborate programs to parse multiple flat files, no need to pre-compile a list of place names, and no need to worry about disambiguation. The appropriate linkages between a person or an organization, citizenship/location, and the country of Canada were established for me within the relationships documented in Wikidata.
I want to emphasize that my experience is not intended to diminish DBpedia vis-à-vis Wikidata – I am not an expert user of either resource, and no doubt I did not use the DBpedia resources to their full advantage. The point I would like to highlight is the significant advantages I enjoyed by shifting from my old method of utilizing structured data to one that leverages a linked data approach. The Project Passage report addresses the practical realities of creating and editing linked data about entities; what I hope I have illustrated in this post is a small example of how that linked data can release value.
Acknowledgement: I’d like to thank my colleague Karen Smith-Yoshimura for helpful comments that improved this post.
Brian Lavoie is a Research Scientist in OCLC Research. He has worked on projects in many areas, such as digital preservation, cooperative print management, and data-mining of bibliographic resources. He was a co-founder of the working group that developed the PREMIS Data Dictionary for preservation metadata, and served as co-chair of a US National Science Foundation blue-ribbon task force on economically sustainable digital preservation. Brian’s academic background is in economics; he has a Ph.D. in agricultural economics. Brian’s current research interests include stewardship of the evolving scholarly record, analysis of collective collections, and the system-wide organization of library resources.