This blog post was written by Jenny Toves, Senior Technical Manager, OCLC Data Science and Analytics.
Any system aggregating data from thousands of sources needs sophisticated processes that mitigate duplication* and ensure the correct data remain. WorldCat is one such system, receiving thousands of bibliographic records from libraries worldwide each day. Whether manual or automated, some form of deduplication has occurred on bibliographic records since the early 1980s. While some manual data review is completed daily by OCLC staff and library workers at institutions participating in the Member Merge Program, most WorldCat records rely on automated deduplication programs. Automated processes introduced in the 1990s, known as Duplicate Detection and Resolution (DDR), have matured. Currently, an average of 11,000 records are removed manually and 1 million are removed through automation each month. Additionally, we merge millions of newly ingested records into existing WorldCat records every month, which means we also work to mitigate duplicate records before their inception.
Cataloging rules and instructions have evolved many times over the decades. This means that the rules going into deduplication must evolve continuously to keep up with the latest and greatest. Over the lifetime of our merging processes, OCLC staff have carefully reviewed outcomes to improve processes, paying particular attention to inappropriate or missed merges, and have updated the rules-based system accordingly. While this works well in many cases, duplicate records still find their way into WorldCat, affecting the workflows of catalogers, researchers, and library staff.
Fortunately, technology continues to advance, and we can incorporate new technologies into the automated processes. Machine Learning (ML) has been around for several decades, but only in recent years has it made its way into the mainstream. An excellent working person’s definition of ML is “…algorithms [that] build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so.” (This Wikipedia article gives a solid basis for a general understanding of ML and how it fits into other areas like Artificial Intelligence, or AI.) The critical difference between ML and our current methods lies in the last part of this definition: without being explicitly programmed to do so. ML looks at training data (data labeled with the correct answers) and works out the patterns that explain why the data is labeled as it is. It then applies what it has “learned” to a new data set and, for each prediction, provides a percentage indicating how confident it is that the label is correct.
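To make the idea of learning from labeled data concrete, here is a minimal sketch in Python. It is not OCLC's model, and the post does not say which algorithm was chosen. The sketch assumes a simple logistic regression, and the two features (a title similarity score and a same-publication-year flag) and all the numbers are hypothetical, made up purely for illustration.

```python
import math


def predict(weights, bias, features):
    """Return the model's estimated probability that a pair is a duplicate."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(examples, epochs=500, learning_rate=0.5):
    """Fit weights by stochastic gradient descent on (features, label) examples."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = predict(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical training data: [title_similarity, same_publication_year]
# with a label of 1 (duplicate) or 0 (not a duplicate).
examples = [
    ([0.95, 1.0], 1),
    ([0.90, 1.0], 1),
    ([0.85, 1.0], 1),
    ([0.40, 0.0], 0),
    ([0.30, 0.0], 0),
    ([0.20, 1.0], 0),
]

weights, bias = train_logistic(examples)

# Score two pairs the model has not seen before.
likely_duplicate = predict(weights, bias, [0.92, 1.0])
likely_distinct = predict(weights, bias, [0.25, 0.0])
print(f"{likely_duplicate:.0%} vs {likely_distinct:.0%}")
```

Nothing in this code states a rule such as "titles must match". The weights come entirely from the labeled examples, which is the contrast with a rules-based system drawn above.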
In early 2022, the OCLC Data Science team was presented with a challenge to use ML to identify duplicate records within WorldCat. If ML could identify additional duplicates over DDR, then those duplicates could be removed via our standard resolution processes, ensuring that the appropriate record was retained. Different ML algorithms were investigated, but the more significant hurdle was gathering a training set of data to run through the chosen algorithm. The Data Science team approached the Data Quality team to find data sets. Data Quality was able to provide information for the initial sets of reviews. Still, we saw an opportunity for our members to participate in this process, much as many do with manual deduplication. This was the beginning of the exercise known as data labeling, where we would ask member libraries (i.e., the cataloging experts) to review pairs of records the ML model thought were duplicates and to label each pair as a duplicate or not.
During the middle part of 2022, the ML model was refined with continuing analysis and input from the Data Quality team. A user interface for the data labeling exercise was also built and tested. The interface allowed users to retrieve a pair of bibliographic records that were potential duplicates. Users could generate the pair by selecting values for the language of cataloging, material type, and record age**. Once these options were chosen, two records would appear on the screen. Fields were highlighted across the records depending on whether they differed: yellow meant the field differed between the two records; green meant the fields were exactly the same; and unshaded meant the field was present in only one of the records. Users were then asked whether the two records described the same thing and could answer yes, no, or unsure. Users could also check boxes next to each field to indicate which fields they used in making their decision. Overall, the tool held 20,000 pairs, with the goal of having each pair reviewed three times by different reviewers.
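The highlighting rule described above can be sketched in a few lines of Python. This is an illustration, not the labeling tool's actual code. It assumes each record is simplified to a dictionary mapping a field tag to a single string, whereas real bibliographic records have repeatable fields and subfields. The function name and sample values are hypothetical.

```python
from typing import Dict


def highlight_fields(record_a: Dict[str, str],
                     record_b: Dict[str, str]) -> Dict[str, str]:
    """Return a shading label for every field tag seen in either record."""
    shading = {}
    for tag in sorted(set(record_a) | set(record_b)):
        value_a = record_a.get(tag)
        value_b = record_b.get(tag)
        if value_a is None or value_b is None:
            shading[tag] = "unshaded"  # field is present in only one record
        elif value_a == value_b:
            shading[tag] = "green"     # field is exactly the same in both
        else:
            shading[tag] = "yellow"    # field is in both records but differs
    return shading


# Hypothetical, heavily simplified records keyed by field tag.
record_a = {
    "100": "Melville, Herman",
    "245": "Moby Dick",
    "260": "New York : Harper, 1851",
}
record_b = {
    "100": "Melville, Herman, 1819-1891",
    "245": "Moby Dick",
    "300": "635 p.",
}

shading = highlight_fields(record_a, record_b)
for tag, color in shading.items():
    print(tag, color)
```

An exact string comparison is the simplest reading of "exactly the same". A real interface might normalize punctuation or case first, but the post does not say.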
By November, participants in the Member Merge Program were introduced to the tool, which was rolled out to all OCLC members in early December. The tool was open through mid-April 2023. Just over 34,000 pairs of potential duplicate records were evaluated by then. Although this fell short of three reviews for each pair, it provided plenty of data to train the ML model. We found that over 95% of the pairs that received multiple reviews had no disagreements between reviewers. This demonstrated that the model was on par with humans in identifying duplicates. This data was used to refine the model, and the Data Quality team reviewed new outcomes for accuracy.
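For readers curious how an agreement figure like this can be computed, here is a small Python sketch. It assumes the reviews are available as (pair identifier, answer) tuples. The function name and the sample reviews are hypothetical and are not the project's data.

```python
from collections import defaultdict


def unanimous_share(reviews):
    """Share of multiply-reviewed pairs whose reviewers all gave the same answer.

    `reviews` is an iterable of (pair_id, answer) tuples. Pairs with only one
    review are ignored. Returns None if no pair has more than one review.
    """
    answers_by_pair = defaultdict(list)
    for pair_id, answer in reviews:
        answers_by_pair[pair_id].append(answer)

    multi_reviewed = [a for a in answers_by_pair.values() if len(a) >= 2]
    if not multi_reviewed:
        return None

    unanimous = sum(1 for a in multi_reviewed if len(set(a)) == 1)
    return unanimous / len(multi_reviewed)


# Hypothetical sample: p1 and p4 are unanimous, p2 is split, p3 has one review.
reviews = [
    ("p1", "yes"), ("p1", "yes"),
    ("p2", "yes"), ("p2", "no"),
    ("p3", "no"),
    ("p4", "yes"), ("p4", "yes"), ("p4", "yes"),
]

share = unanimous_share(reviews)
print(f"{share:.0%} of multiply-reviewed pairs had no disagreement")
```

This sketch counts an "unsure" answer as a disagreement whenever another reviewer answered yes or no. Whether the project counted it that way is not stated in the post.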
We will soon implement the machine learning model as part of our ongoing efforts to mitigate and resolve duplicate records in WorldCat. Beginning late August 2023, an initial run of one (1) million records—500,000 pairs—will be processed through the machine learning algorithm. This will result in 500,000 duplicate records being merged in WorldCat, which will improve and streamline cataloging, discovery, and interlibrary loan experiences for both library staff and end users.
Thank you to every individual who participated in the project! Your collaboration helps advance the profession and the mission of libraries worldwide.
*The concept of duplicates completely depends on the user experiencing the duplicates. The labeling project described later in the post is asking catalogers if two records are analogous. A formal definition of duplication likely needs its own post.
**Record age corresponds to when the item was published. DDR uses a different set of rules for items published before 1830 since many of these fall under cataloging rules for rare materials.
This blog post was co-authored by Nathan Putnam, former Director of Data Quality and Governance at OCLC. Special thanks to Richard Urban, Senior Program Officer for OCLC Research Library Partnership, for reviewing this post.
Merrilee Proffitt is Senior Manager for the OCLC RLP. She provides community development skills and expert support to institutions within the OCLC Research Library Partnership.