Libraries think they are notable: Using WorldCat holdings data to fill gaps in Wikipedia

Photo by Sansern Prakonsin on Unsplash

[This post was co-authored by Chris Cyr, Associate Research Scientist, OCLC Research]

It is well established that there are gaps in coverage in Wikipedia. In response, a number of efforts to close those gaps have been launched; for example, a recent Edit-a-Thon, Black Artists Matter, focused on intersectional gaps around gender, race, feminism, and the arts. In 2018, Rosie Stephenson-Goodknight spoke to OCLC member libraries and staff in her Distinguished Seminar Series talk, “Wikipedia’s gender gap, and what would Hari Seldon do about it?” Her talk inspired OCLC Research staff to ask: what could OCLC do to address gaps in Wikipedia? In this post, we demonstrate how Wikipedians can make use of library holdings data from OCLC to fill these gaps.

Addressing gaps in Wikipedia: how can library data help?

We started with a notion that library authority data and WorldCat holdings might help. However, a problem with library name authority data (which OCLC has a lot of, in WorldCat authority files, the Virtual International Authority File (VIAF), and WorldCat Identities) is that it doesn’t natively reveal the characteristics that gap practitioners focus on (such as race, gender, and nationality). So, on its own, library authority data is unlikely to reveal gaps.

Lists to fill Wikipedia gaps

To help us identify possible ways to contribute to Wikipedia gap projects, we conducted informal interviews with a number of Wikipedia volunteers who are passionate about working in gap areas (we’ll refer to them as “gap practitioners”) in order to get a better sense of their practices and workflows. We found that in many cases, these gap practitioners make extensive use of lists. Many of the themes we heard regarding lists are echoed in a 21 November 2019 blog post by GLAM strategist Alex Stinson, “Lists in the Wikimedia movement? Why? What?” In this post, Stinson characterizes lists as helping to drive task management and providing a “compelling way to engage” potential contributors, including librarians and other content experts (such as curators).

For many years, lists have been generated manually, by knowledgeable volunteers taking note of what is absent from the encyclopedia. In our conversations we found that many gap projects, such as the Women in Red project, are using Wikidata to accelerate list building. (Women in Red “focuses on creating content regarding women’s biographies, women’s works, and women’s issues” in order to address the noted gender gap on Wikipedia.) An example of a list that is created and maintained manually by volunteers on the Women in Red project is this modest list of “women crafters”, which is subdivided by country. On the other hand, we are seeing projects use Wikidata in list making and maintenance, such as this more extensive “women educators” list.

These conversations, reflecting examples of current practice, allowed us to expand our original notion of what OCLC might contribute. They helped us move from our original idea of generating lists from authority data and library holdings to giving gap projects a pathway to augment existing lists with library holdings data, which can help prioritize pages to add. This could give those who work from lists a way to identify individuals who have many published works by and about them.

WorldCat Identities

WorldCat Identities is an experimental project that provides library holdings data for personal, corporate, or subject-based identities. While it is not the only source of such data, it is especially useful for contributing to gap projects for three reasons.

  • It provides both holdings numbers and data on the works associated with the person.
  • It identifies the most widely held works and provides links to them, making it easier to access them.
  • It is available on the open web for non-commercial use.

We will use the WorldCat Identities page for author Minerva Mirabel as an example.

The number of library holdings for works by or about an individual offers a rough and imperfect measure of their notability. When someone is widely represented in a curated collection, this could be an indication that many people consider that person to be notable. It also offers a rough indicator of the amount of information available. Creating a Wikipedia page for someone associated with a large number of library holdings is generally expected to be easier than creating one for someone about whom information is scarce.

It is important to acknowledge that library holdings data in general can be biased. Libraries have room to improve the diversity of their collections. In addition, WorldCat, from which WorldCat Identities is derived, is the most comprehensive assemblage of library holdings data in the world, but it does not account for all libraries. Additionally, the fact that someone has not had a book published by or about them does not in any way mean that the person lacks notability. This means that there are biases in who is most represented in WorldCat Identities. Having said that, the library holdings data WorldCat Identities provides is still the best available source of such data.

Matching Wikidata with WorldCat Identities  

We created an easy tool for Wikipedia gap practitioners to match individuals on lists with their respective WorldCat Identities page. This tool runs in OpenRefine, a common piece of software in the Wikipedia community. It requires a spreadsheet with the Wikidata QIDs of the individuals on the list. When the code is run (and it requires no programming knowledge to do so), it finds the VIAF or LCCN in each person’s Wikidata page (if available), and uses these to create a column with links to their WorldCat Identities pages. It then pulls in the number of library holdings of works by or about each person and populates this information as a new column.
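The lookup step described above can be sketched in Python, outside OpenRefine. This is a minimal illustration, not the OpenRefine project itself. It assumes a Wikidata entity in the standard JSON form (claims keyed by property ID) and uses the Wikidata properties P214 (VIAF ID) and P244 (Library of Congress authority ID). The WorldCat Identities URL pattern, the function names, the choice to try the LCCN before the VIAF ID, and the sample entity with its placeholder identifier are all assumptions made for illustration.

```python
from typing import Optional

# Wikidata property IDs for the two identifiers the tool relies on.
VIAF_PROPERTY = "P214"   # VIAF ID
LCCN_PROPERTY = "P244"   # Library of Congress authority ID

# Assumed URL pattern for WorldCat Identities pages (not confirmed by the post).
IDENTITIES_BASE = "https://worldcat.org/identities/"


def first_claim_value(entity: dict, prop: str) -> Optional[str]:
    """Return the first string value for a property in a Wikidata entity, if any."""
    for claim in entity.get("claims", {}).get(prop, []):
        datavalue = claim.get("mainsnak", {}).get("datavalue")
        if datavalue and isinstance(datavalue.get("value"), str):
            return datavalue["value"]
    return None


def identities_url(entity: dict) -> Optional[str]:
    """Build a WorldCat Identities link from an LCCN or VIAF ID, or return None."""
    lccn = first_claim_value(entity, LCCN_PROPERTY)
    if lccn:
        return f"{IDENTITIES_BASE}lccn-{lccn}"
    viaf = first_claim_value(entity, VIAF_PROPERTY)
    if viaf:
        return f"{IDENTITIES_BASE}viaf-{viaf}"
    return None  # no identifier on the item, so no match is attempted


# Hypothetical entity with a placeholder VIAF value.
sample = {
    "id": "Q0",
    "claims": {
        "P214": [{"mainsnak": {"datavalue": {"value": "0000000", "type": "string"}}}],
    },
}
print(identities_url(sample))
print(identities_url({"id": "Q0", "claims": {}}))
```

Returning None when neither identifier is present, rather than guessing from the name, keeps false positives out at the cost of fewer matches.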

As a first test of this project, we hand-matched 100 names from the Women in Red – Educators list with their WorldCat Identities pages and compared this with the number that were matched by the OpenRefine project. Manual matching resulted in 55 WorldCat Identities links, while the OpenRefine tool found 29 matches. It was possible to increase the number of matches to 41 by matching only on the name, but doing so created 9 false positives. We decided to err on the side of fewer matches and fewer errors by only matching cases where VIAFs or LCCNs were already available in the person’s Wikidata page.
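The trade-off in this first test can be made concrete with a little arithmetic. The sketch below rests on two assumptions that the figures above do not settle: that every identifier-based match is correct, and that the 41 name-based matches include the 9 false positives. The variable names are illustrative.

```python
# Figures reported for the 100-name test.
manual_matches = 55        # links found by hand
id_matches = 29            # links found by the OpenRefine tool (VIAF/LCCN only)
name_matches = 41          # links found when matching on name alone
name_false_positives = 9   # wrong links among the name-based matches

# Assumption: every identifier-based match is correct.
id_coverage = id_matches / manual_matches

# Assumption: the 41 name-based matches include the 9 false positives.
name_correct = name_matches - name_false_positives
name_precision = name_correct / name_matches
name_coverage = name_correct / manual_matches

print(f"ID-based coverage of manual matches: {id_coverage:.1%}")
print(f"Name-based correct matches: {name_correct}")
print(f"Name-based precision: {name_precision:.1%}")
print(f"Name-based coverage of manual matches: {name_coverage:.1%}")
```

Under these assumptions, name matching adds only three correct links (32 versus 29), while roughly one in five of the links it produces is wrong.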

As a second test, we used this tool to match 3,967 women from the Women in Red – Activists list to WorldCat Identities. Of these, 852 were matched with WorldCat Identities pages. Three of the women on the list were represented in more than 10,000 library holdings by or about them, and another 47 had more than 1,000. This indicates that there are many people who, based on library holdings, are notable and have a lot of information about them available, but are not represented in Wikipedia. With the help of Rosie Stephenson-Goodknight and Robert Fernandez, we created a “# of Library Holdings” column for the list, which was received favorably by those who discussed it.
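One way an editor might use such a holdings column is to sort a list so that the most widely held people rise to the top. The sketch below is an illustration only: the rows are made-up placeholders, not data from the Activists list, and the `tier` helper is hypothetical, reusing the 10,000 and 1,000 thresholds mentioned above. The final line computes the share of the list that was matched from the reported figures.

```python
from typing import Optional

# Hypothetical rows: (name, holdings count or None when no match was found).
rows: list[tuple[str, Optional[int]]] = [
    ("Person A", 120),
    ("Person B", None),
    ("Person C", 12500),
    ("Person D", 1800),
]


def tier(holdings: Optional[int]) -> str:
    """Bucket a holdings count using the thresholds mentioned in the post."""
    if holdings is None:
        return "unmatched"
    if holdings > 10_000:
        return "more than 10,000"
    if holdings > 1_000:
        return "more than 1,000"
    return "1,000 or fewer"


# Matched people first, highest holdings at the top; unmatched rows go last.
ranked = sorted(rows, key=lambda r: (r[1] is None, -(r[1] or 0)))

for name, holdings in ranked:
    print(f"{name}: {holdings} ({tier(holdings)})")

# Share of the Activists list that the tool matched, from the reported figures.
match_rate = 852 / 3967
print(f"Matched: {match_rate:.1%}")
```

Unmatched rows are kept rather than dropped, since a missing WorldCat Identities link says nothing about a person's notability.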

Moving forward

We have made the OpenRefine project freely available and provide instructions for matching Wikidata items with WorldCat Identities pages and library holdings data. We hope that gap practitioners can use library holdings data to augment their own projects and workflows. Adding these data will give editors another data point that helps them easily and efficiently identify individuals who warrant Wikipedia pages. Increased links between Wikidata and WorldCat Identities also have long-term potential to help librarians address inequities in their collections and discovery systems. Since authority files do not include, for example, data on ethnicity, Wikidata can help fill those gaps. This would enable librarians to assess the diversity of their libraries’ collections and highlight the work by or about individuals in under-represented groups.