This is the third post in a short blog series on what we learned from the OCLC RLP Managing AI in Metadata Workflows Working Group.
Institutional repositories (IRs) play a critical role in preserving and showcasing the research outputs of universities and other organizations. However, managing metadata for these repositories presents unique challenges, largely due to user creation of metadata during the deposit process. Researchers and students generally care less about metadata quality than libraries and may rush through deposit to satisfy what they may see as an academic or funder mandate. Legacy metadata in repositories can also accumulate inconsistencies, errors, and gaps over time, further complicating workflows.
Today’s blog post, the third in a four-part series, summarizes findings from the “Institutional Repositories” workstream of the OCLC RLP Managing AI in Metadata Workflows Working Group, whose members were:
- Michael Bolam, University of Pittsburgh
- Helen Williams, London School of Economics
The group met five times between April and July 2025, with four meetings focusing on the opportunities for AI to make IR metadata workflows more efficient and productive. They also thoughtfully considered potential implementation barriers and implications for preserving professional skills and job satisfaction. As with most new technologies, the use of AI in repository workflow contexts presents both opportunities and trade-offs. The key is balancing open-mindedness about AI’s potential with a realistic assessment of current capabilities and institutional readiness.
Core workflow opportunities for AI
The working group identified two critical IR workflow areas where AI could be a useful tool:
- Deposit processes (including metadata creation)
- Legacy metadata clean-up and management
From these contexts, we identified several opportunities where AI-powered tools could boost efficiency and productivity.
Improving the self-deposit experience
Working group members discussed many ways that AI might improve the IR deposit process, most of which focused on addressing incomplete or erroneous metadata at submission. In both self-deposit and mediated deposit workflows, users often fail to supply complete and accurate metadata because students and researchers find metadata creation burdensome and time-consuming. Examples where AI tools might ameliorate this problem include:
- Subject classification suggestions based on full-text analysis
- Abstract generation for materials lacking summaries
- Basic metadata extraction from uploaded files (or campus system information) to pre-populate repository records
By streamlining file processing, extracting metadata elements from full-text files to pre-populate or enrich repository records, and checking submissions for completeness, AI tools could reduce the burden both on researchers during self-deposit and on repository staff managing mediated deposits.
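As a minimal illustration of the pre-population idea, the sketch below pulls a candidate title and abstract out of extracted full text using plain heuristics. The function name, field names, and heuristics are assumptions for illustration only, not any particular repository platform's API, and any suggestions would need depositor or staff review before being saved.

```python
def suggest_metadata(full_text: str) -> dict:
    """Heuristically pre-populate deposit fields from extracted full text.

    Assumes the first non-empty line is a title candidate and that a lone
    'Abstract' heading introduces the summary (both illustrative assumptions).
    """
    lines = [ln.strip() for ln in full_text.splitlines() if ln.strip()]
    record = {"title": lines[0] if lines else "", "abstract": ""}

    # Find a standalone 'Abstract' heading; give up quietly if there isn't one.
    try:
        start = next(i for i, ln in enumerate(lines) if ln.lower() == "abstract")
    except StopIteration:
        return record

    # Collect lines until the next obvious section heading.
    body = []
    for ln in lines[start + 1:]:
        if ln.lower().startswith("introduction") or ln[:1].isdigit():
            break
        body.append(ln)
    record["abstract"] = " ".join(body)
    return record
```

In practice a tool like this would run alongside embedded file metadata and campus system lookups, presenting its guesses as editable defaults in the deposit form rather than committing them directly.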
Enhancing legacy metadata
Working group members suggested that there is significant potential for metadata improvement on the back end, and identified the following opportunities:
- Managing complex entity relationships. Repositories have persistent problems in disambiguating author names and affiliations, especially for research outputs that involve multiple authors who may be affiliated with other institutions. AI tools could help cluster names and affiliations, suggest connections to persistent identifiers like ORCID, ISNI, and WorldCat Person Entities, and provide more linkages between authors, institutions, and research outputs.
- Enriching and correcting legacy data. Existing repository records can harbor a number of accumulated problems, including missing data, inconsistent metadata practices and standards, and anomalies introduced by system migrations and other technological changes. AI tools could support automated scanning of repository records to identify gaps and inconsistencies, and either correct them automatically or flag them for staff review. They could also suggest enrichments, such as abstracts, for materials that lack them.
- Publication status tracking. Research outputs in institutional repositories are often preprints or other materials that are involved in formal publication processes elsewhere. AI tools could conduct automated checks for changes in the publication status of “in press” materials and make appropriate changes to repository records as needed.
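The gap-and-inconsistency scan described under "Enriching and correcting legacy data" can be sketched very simply: check each record against a required-field list and a few consistency rules, and emit flags for staff review rather than changing anything automatically. The field names and rules below are illustrative assumptions, not any particular repository schema.

```python
REQUIRED_FIELDS = ("title", "author", "date", "abstract")  # illustrative schema

def scan_records(records: list[dict]) -> list[dict]:
    """Flag legacy records with missing or inconsistent metadata for review."""
    flags = []
    for rec in records:
        problems = [f"missing {f}" for f in REQUIRED_FIELDS if not rec.get(f)]
        # Example consistency rule: a bare four-digit year expected in 'date'.
        date = rec.get("date", "")
        if date and not (len(date) == 4 and date.isdigit()):
            problems.append("non-standard date format")
        if problems:
            flags.append({"id": rec.get("id"), "problems": problems})
    return flags
```

The point of the human-review design is the one the group kept returning to: the scan surfaces candidates cheaply, while a person decides what actually changes in the record.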
This is certainly not an exhaustive list of opportunities, but it points to core workflow challenges and operational pain points that AI tools could potentially address—especially incomplete or inaccurate metadata provided at the time of deposit by researchers who find metadata creation onerous, and who are, in any event, likely not skilled metadata practitioners. Poor metadata negatively impacts activities like compliance reporting, research impact assessments, and the discoverability of institutional research outputs. AI tools could help repository staff redirect their time from manual metadata management to more strategic, high-value activities.
Making the most strategic use of AI in IR workflows
Our discussions revealed some principles for maximizing the value of AI in repository workflows.
Focus on the front end of the IR workflow: The most successful AI implementations would focus on early intervention during the deposit process. Supplying missing metadata or correcting inaccuracies at the time of deposit is more efficient than later remediation.
This suggests that prioritizing AI integration on “front-end” deposit support rather than “back-end” cleanup may be the optimal approach. Another suggestion to help smooth integration of AI support into repository workflows was to prioritize processes that might require less human review.
Make a critical assessment of the value added by AI: A key theme that emerged from our discussions was that AI integration requires careful consideration of its benefits. Does AI actually save time, or does it simply shift work to other parts of the workflow? Many identified problems may be solved with non-AI approaches—such as scripting, enhanced system features, or workflow redesign—rather than requiring AI technologies. What is the necessary quality threshold AI tools must achieve to equal or surpass existing metadata practices, such as traditional authority control or linked data methods?
Addressing these questions will help repository staff determine whether AI integration offers genuine value and whether expected benefits justify the costs of implementation and use.
Open questions about AI
The working group surfaced several considerations specific to repositories. One touched on a lack of clarity about the library’s authority to modify metadata created by users during the deposit process, which raises questions about depositor control over AI-generated content. Furthermore, AI solutions must address the need to restrict access to certain deposits, such as doctoral dissertations that include potentially patentable information; these materials, for example, may not be suitable for use as training data for AI models.
Several other issues emerged, including the need for more training and skills development, concerns about the loss of professional skills, the importance of ensuring quality metadata, and the necessity of maintaining human expertise in the loop.
The subgroup identified several key takeaways for the institutional repository community moving forward:
- Real-world examples are essential. The community needs concrete case studies that document actual workflows and demonstrate measurable time savings and improved outcomes, not just theoretical possibilities.
- Technical guidance must be accessible. Best practices for AI implementation need to be communicated in language that nontechnical metadata teams can understand and act upon.
- Ethical frameworks are needed. The community would benefit significantly from practical guidelines for the responsible use of AI, specifically tailored to repository metadata creation.
- Strategic focus matters. Rather than pursuing AI for efficiency alone, repository leaders should align implementations with clear strategic objectives, such as improving discovery and enabling research intelligence.
Conclusion
The most important takeaway from our discussions was the critical role of the human element of AI integration. This includes maintaining “human in the loop” oversight for AI-generated content as well as addressing the professional development and job satisfaction needs of repository staff. As repositories develop AI strategies, they must strike a balance between the promise of AI-powered automation and the essential role of human mediation in metadata creation and management.
NB: As you might expect, AI technologies were used extensively throughout this project. We used a variety of tools—including Copilot, ChatGPT, and Claude—to summarize notes, recordings, and transcripts. These were useful for synthesizing insights from each of the three subgroups and for quickly identifying the types of overarching themes described in this blog post.
Brian Lavoie is a Research Scientist in OCLC Research. He has worked on projects in many areas, such as digital preservation, cooperative print management, and data-mining of bibliographic resources. He was a co-founder of the working group that developed the PREMIS Data Dictionary for preservation metadata, and served as co-chair of a US National Science Foundation blue-ribbon task force on economically sustainable digital preservation. Brian’s academic background is in economics; he has a Ph.D. in agricultural economics. Brian’s current research interests include stewardship of the evolving scholarly record, analysis of collective collections, and the system-wide organization of library resources.