Next-generation metadata and the semantic continuum

Many thanks to Titia van der Werf, OCLC, for writing the majority of this blog post.

On 13 April 2021, the Closing Plenary webinar took place to synthesize the OCLC Research Discussion Series on Next Generation Metadata. Senior OCLC colleagues Andrew Pace (Technical Research), Rachel Frick (Research Library Partnership), and John Chapman (Metadata Services) were invited to respond and react to the findings.

The series

The “Transitioning to the Next Generation of Metadata” report served as background reading and inspiration for the eight virtual round table discussions that were held in six different European languages, throughout the month of March. The main discussion question: “How do we make the transition to the next generation of metadata happen at the right scale and in a sustainable manner, building an interconnected ecosystem, not a garden of silos?” struck a chord with most participants and made for lively conversations.

In total, 86 participants from 17 countries – mostly in Europe, but also a few from the Middle East and North Africa – contributed to the round table discussions. They represented 71 different institutions, 48 of which were university libraries. Hosting the round tables in local languages was conducive to deep and meaningful interactions, and that was much appreciated. Attendees found the discussions inspiring; they liked comparing experiences, the rich reflections, and the free format of the conversations. Many found the 90-minute sessions too short and asked for follow-up sessions. Some thought it would be good to have more decision makers and other stakeholders at the table. A summary of each discussion has been shared via posts on HangingTogether.org (the blog of OCLC Research), and all are available in the language of the session as well:

First English round table: Towards a critical mass of interoperable library data

Italian round table: Interoperability, Sustainability and More

Second English round table: Silos and other challenges

French round table: The challenge lies in managing multiple, co-existing ‘right scales’

German round table: Formats, contexts and deficits

Spanish round table: Managing researcher identities is top of mind

Third English round table: Investing in the utility of authorities and identifiers

Dutch round table: Think bigger than NACO and WorldCat

Predominant conversation threads

All the sessions started with a mapping exercise. We asked attendees to place next generation metadata projects they were aware of on a 2×2 matrix characterizing three different application areas: 1) bibliographic data and the supply chain, 2) cultural heritage data and 3) research information management (RIM) data and scholarly communications. The collage of the eight maps that came out of the round tables shows that cultural heritage data projects and bibliographic data projects were predominant, reflecting the focus and expertise of the attendees. There were few RIM projects on the maps. All the maps showed interesting clusters of post-it notes relating to persistent identifiers (PIDs) – such as ISNI, ORCID, VIAF, DOI – which demonstrated the importance attributed to them by participants.

Image: Collage of the next generation metadata maps from the 8 round table discussions

Collaborating to produce and publish authority data

From the conversations, we learned that libraries are strongly investing in the transformation and publication of their authority files (both the name authorities and the subject headings) to leverage them as next generation metadata. The open government data policies are driving this focus. We also heard that these policies are inciting collaboration at the national level to maximize the benefits of centralization, normalization, and efficiency of data production and publication. The collaboration between the two largest producers of library data in France, BnF and Abes, as mentioned in the French session, is an exemplar of this.

Achieving a critical mass of authority data

In an effort to achieve critical mass, libraries are intentionally feeding external systems with their authority data – e.g., the University’s Research Portal, the ORCID database, Wikidata, etc. They are also embedding PIDs in the value chain – this is particularly true for libraries and bibliographic agencies that act as registration agencies for identifiers such as ISNI, and that operate in the context of their national bibliography and legal deposit tasks. Some of them, like the British Library, proactively encourage the adoption of ISNIs upstream in publishers’ metadata records and downstream through reconciliation with VIAF and LC/NACO files – so that the ISNIs become part of the libraries’ cataloging workflow.

Where to let go of control

At several sessions, we heard concern about the large numbers of cultural heritage projects that create separate and dedicated ontologies and vocabularies, which then remain isolated and are of limited value to others. There were many observations about duplication of efforts and the reluctance to refer to data that has already been defined by others. The key question is: Where to let go of control and where to focus and control? The control issue is also one of organization and governance. There is a growing sense of the need to negotiate with the different bibliographic data stakeholders and parties to agree on who does what in this newly emerging ecosystem.

How to participate in the connecting game?

We heard that harvesting, cleaning, normalizing, reconciling, and transforming data at scale – what aggregators do – is still important during the transition period, but libraries want the enriched data, or at least the identifiers, to flow back to them, so they can participate in the connecting game. They also believe that decentralizing the workflow would allow them to better leverage local expertise. There was much enthusiasm about the many opportunities of linked data, like the ability to connect different languages, to link to more granular information than authority files provide, and to automate subject indexing and named entity disambiguation with the help of AI technologies.

Managing multiple scales into a semantic continuum

Finally, there is no such thing as one “right scale” for doing linked data. There are many different reasons that justify the choices institutions make for the scale of their workflows: local expertise, efficiency, convenience, national policies, consortial economies of scale, differences between humanities, social science, and hard science data, etc.

Andrew Pace described the challenge of managing these multiple scales as “bridging the effort between the short tail and the long tail”, in other words, between scaled effort and localized domain and collection expertise. He explained that to achieve a ‘semantic continuum’:

“We balance large, shared, homogenous collections and data, while accounting for a myriad of de-centralized and heterogenous collections. We improve machine-learning and scaled reconciliation with the necessary tools for the dedicated knowledge work that happens in libraries. We can start in the big spaces involving person names and bibliographic works, while acknowledging and preparing for the more difficult work ahead like concepts and aboutness. And we can prepare for the pending paradigm shift that comes with blending bibliographic and authority work together and the challenges of balancing object description with an increase in contextual description. And across this continuum, we know that a large centralized infrastructure is needed and that custom applications will enhance the effort.”

Ongoing professional development and training needs

With the paradigm shift, we need to prepare for a new kind of knowledge work in metadata management, discovery, and access in libraries. During the round table discussions, one thread running through all the conversations was transitioning from the old to the new, or rather, the question of “How to build the new while still having to maintain the old and the established?” We know that the systems and services required are not ready and we know that there will be ongoing professional development and training needs.

Rachel Frick responded to the skills need and distinguished between the need for 1) practitioners skilled to implement, 2) managers who understand the opportunities, and 3) leaders who recognize the priority. She pointed to OCLC’s programs that support library metadata upskilling needs and active learning, namely:

WebJunction Course Catalog, which offers library-specific courses and webinar recordings, for free, to all library workers and volunteers;
OCLC Research Library Partnership Metadata Managers Focus Group, which offers an opportunity to engage with peers who are responsible for creating and managing metadata; and
OCLC Community Center, which offers a community space for exchanging views on cataloging and metadata issues and practices.

Q&A on the OCLC Shared Entity Management Infrastructure

During the round table discussions and the closing plenary webinar, participants shared their expectations, interest and questions about the Shared Entity Management Infrastructure (SEMI).

John Chapman took the opportunity to provide some additional insights on aspects of interest to the participants. The goal of SEMI is to address infrastructure needs identified by libraries during past efforts such as Project Passage and the CONTENTdm Linked Data Pilot, and in conversations with the OCLC Research Library Partnership.

To make library linked data workflows more effective, and to deliver on both sides of the “semantic continuum” that Andrew Pace described, OCLC has been building a new infrastructure. This effort is funded in part by a grant from the Andrew W. Mellon Foundation. A first version will be operational by the end of 2021, with plans to explore integration with other OCLC services and applications next.

To respond to questions concerning the business model, John explained that, in their grant award, the Mellon Foundation specified that OCLC provide free access to data, while also providing valuable services that earn the revenue required to keep the infrastructure sustainable. To that end, OCLC will be publishing entity data as linked open data, and will also be providing subscription access to user interfaces and APIs to work with the data. As with VIAF, there will be public-facing information on each identifier, so libraries can have a common reference for the entity URIs.

Continuing the conversation

It has been delightful to organize the OCLC Research Discussion Series on next generation metadata with such inspiring participation. We received invitations to organize follow-up conversations on this topic regionally. You can also revisit all the content from the series by going to the event page.

We also plan to repeat the OCLC Research Series in the EMEA region next year, on another topic. So, stay tuned and thank you all for your contributions!

Annette Dortmund

Dr. Annette Dortmund led OCLC’s European product management and research concerned with next-generation metadata solutions for libraries and other cultural heritage institutions, with a particular focus on persistent identifiers in scholarly communication and library linked data. She also coordinated and supported European research and engagement programs for the OCLC Research Library Partnership.