We first shared our efforts to leverage machine learning to improve de-duplication in WorldCat in the 2023 blog post “Machine Learning and WorldCat.”

De-duplication has always been essential to maintaining the quality of WorldCat by enhancing cataloging efficiency and streamlining quality efforts. But with bibliographic data pouring in faster than ever, we need to address the challenge of keeping records accurate, connected, and accessible at speed. AI-powered de-duplication offers an innovative way to scale this work quickly and efficiently, but its success depends on human expertise. At OCLC, we’ve invested resources into a hybrid approach, leveraging AI to process vast amounts of data while ensuring catalogers and OCLC experts remain at the center of decision-making.

From paper slips to machine learning

Long before I joined OCLC, I worked in bibliographic data quality when de-duplication was entirely manual. As part of a “Quality Improvement Program,” libraries would mail us paper slips detailing suspected duplicates, each with a cataloger’s rationale. We’d sort thousands of these color-coded slips into stationery cabinets: green for books, blue for non-books, pink for serials. We even repurposed stationery drawers to store the overflowing duplicate slips, so pens and notepads were impossible to find.

[Image: A cluttered office storage room filled with tall cabinets and shelves overflowing with stacks of colorful paper folders in pink, green, and yellow. The papers are piled on top of cabinets, spilling out of shelves, and scattered on the floor, creating a chaotic and disorganized environment. Bright fluorescent lighting illuminates the space, emphasizing the abundance of materials.]

This image was generated using AI to recreate my memory of the cluttered corridors where we kept the duplicate slips. AI makes it look much neater than it really was.

In hindsight, it was a forward-looking community effort. But it was slow, methodical work that reflected the painstaking nature of our efforts at that time. Each slip was a decision, a piece of human judgment shaping how records in our system were merged or maintained. And for all its effort, this process was inherently limited by scale. We were always chasing duplicates rather than getting ahead of them.

Now, working on AI-powered de-duplication at OCLC, I’m struck by how far we’ve come. What once took years now takes weeks, with more accuracy, across more languages, scripts, and material types than ever before. The heart of the work remains the same: human expertise matters. AI is not a magic solution. It learns from our cataloging standards, our professional judgment, and our corrections.

By taking a hybrid approach to de-duplication, we can use machine learning to do the heavy lifting while ensuring that human oversight guides and refines the process.

Balancing innovation and stewardship in WorldCat

For decades, catalogers, metadata managers, and OCLC teams have worked to maintain the integrity of WorldCat, ensuring that it remains a high-quality, reliable resource for libraries and researchers. De-duplication has always been central to this effort, eliminating redundant records to improve efficiency, discovery, and interoperability.

Now, AI is allowing us to approach de-duplication in new ways, dramatically expanding our ability to identify and merge duplicate records at scale. The key challenge, however, is not simply how to apply AI but how to do so responsibly, transparently, and in alignment with professional cataloging standards.

This approach to scaling de-duplication is an extension of our longstanding role as stewards of shared bibliographic data. AI presents an opportunity to amplify human expertise, not to replace it.

The fundamental shift in de-duplication

Historically, de-duplication has relied on deterministic algorithms and manual effort on the part of catalogers and OCLC staff. While effective, these methods have limits.

OCLC’s AI-powered de-duplication methods enable us to:

◆ Expand beyond English and Romance languages: Our machine learning algorithm can process non-Latin scripts and records across all languages accurately and more efficiently, improving rapid de-duplication at scale across global collections.
◆ Address a vast array of record types: AI enables us to identify duplicates across a broad spectrum of bibliographic records and affords new insights into certain material types that are challenging to address.
◆ Preserve rare and special collections: We do not currently touch rare materials with AI de-duplication processes, so unique records in archives and special collections are preserved.

These advancements mean more accurate metadata across a broader range of materials and languages, helping us to scale metadata quality efforts in WorldCat responsibly.

What “responsible AI” means in practice

The term “AI” is broad and often met with skepticism. Rightly so—many AI applications raise concerns about bias, accuracy, and reliability.

Our approach has been guided by a few key ideas:

◆ AI should extend human expertise, not replace it. We have integrated human review and data labeling to ensure that AI models are trained with cataloging best practices in mind.
◆ Efficiency should not come at the expense of accuracy. AI-powered de-duplication is designed to optimize computing resources, ensuring that automation does not compromise the quality of records.
◆ Sustainability matters. Our approach is designed to be computationally efficient, reducing unnecessary resource use while maintaining high-quality results. By optimizing AI’s footprint, we ensure that de-duplication remains cost-effective and scalable for the long term.

This approach to de-duplication is not about reducing the role of people—it’s about refocusing their expertise where it matters most. Catalogers can focus on high-value work that connects them to their communities instead of spending hours resolving duplicate records.

Moreover, catalogers and our experienced OCLC staff are active participants in this process. Through data labeling and feedback, professionals are helping to refine and improve AI’s ability to recognize duplicates.

AI as a collaborative effort and the road ahead

I don’t miss the piles of paper slips or quarterly cabinet purges, but I respect what they stood for. AI isn’t replacing that care—it’s scaling it. While tools evolve, our principles don’t. OCLC has long used technology to help libraries manage their catalogs and collections, and now we’re applying that same mindset to AI: deliberate, effective, and grounded in our shared commitment to metadata quality. This approach to innovation empowers libraries to meet changing needs and deliver value to their users.

Join OCLC’s data labeling initiative today and help refine AI’s role in de-duplication.
AI-powered de-duplication is an ongoing, shared effort that will continue to evolve with community input and professional oversight. Your contributions will directly impact the quality and efficiency of WorldCat, benefiting the entire library community.

Here’s how to participate:
◆ Data labeling interface (WorldShare login credentials required)
◆ Participation instructions
◆ FAQs

Bemal Rajapatirana

Bemal Rajapatirana is the Director of WorldCat Data Management. She leads initiatives on WorldCat data quality, evolution, and new data ecosystems serving libraries worldwide.