Shift to Linked Data for production

That was the topic discussed several times recently by OCLC Research Library Partners metadata managers, initiated by Philip Schreur of Stanford, who is also involved in the Linked Data for Libraries (LD4L) project. Linked data may well be the next common infrastructure both for communicating library data and embedding it into the fabric of the semantic web. There have been a number of different models developed: Digital Public Library of America’s Metadata Application Profile, schema.org, BIBFRAME, etc. Much of a research library’s routine production is tied directly to its local system and makes use of MARC for internal and external data communication. Linked data offers an opportunity to go beyond the library domain and authority files to draw on information about entities from diverse sources.

Publishing metadata for digital collections as linked data directly, bypassing MARC record conversion, may offer more flexibility and accuracy. (An example of losing information when converting from one format to another is documented in Jean Godby’s 2012 report, A Crosswalk from ONIX 3.0 for Books to MARC 21.) Stanford is pulling together information about faculty members and publications in a way that they could never do without utilizing linked data.

Some of the issues raised in the focus group discussions included:

Critical components in linked data that could be started now: Including persistent identifiers in the MARC bibliographic and authority records created now will help in transitioning to a future linked data environment. The entities are more clearly identified in authority records than in bibliographic records where it’s not always clear which elements represent a work versus an expression of a work. OCLC is already adding FAST identifiers in the $0 subfield (the authority control number or standard number) in the subject fields of WorldCat records. The British Library expects to launch a pilot this summer to match the LC/NACO authority file against the ISNI database and add ISNI identifiers to the authority record’s 024 field. Adding $4 role codes in personal name added entries will help establish relationships among name entities in the future. Creating identifiers for entities that do not yet have them will build a larger pool of data to help disambiguate them later. The community could also consider a wider range of authorities beyond the LC/NACO authority file for re-using existing identifiers (e.g., VIAF, ISNI and identifiers in other national authority files) and “get us into the habit”.
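As a rough illustration of what adding identifiers and role codes to existing records might look like, the sketch below models a MARC field as a plain Python structure and appends $0 and $4 subfields. The MarcField class, the sample headings and the identifier values are all hypothetical placeholders invented for this example. This is not how any particular cataloging system or library stores records.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MarcField:
    """A deliberately simplified stand-in for one MARC variable field."""
    tag: str
    indicators: str
    subfields: List[Tuple[str, str]] = field(default_factory=list)

    def add_subfield(self, code: str, value: str) -> None:
        """Append a subfield unless the same code/value pair is already present."""
        if (code, value) not in self.subfields:
            self.subfields.append((code, value))

    def values(self, code: str) -> List[str]:
        """Return every value recorded under the given subfield code."""
        return [value for c, value in self.subfields if c == code]


# Subject heading: record the identifier of the matching authority in $0.
subject = MarcField("650", " 7", [("a", "Linked data"), ("2", "fast")])
subject.add_subfield("0", "(OCoLC)fst00000000")  # placeholder, not a real FAST number

# Personal name added entry: $4 carries a relator code, $0 an identifier.
contributor = MarcField("700", "1 ", [("a", "Doe, Jane")])  # hypothetical name
contributor.add_subfield("4", "edt")  # MARC relator code for "editor"
contributor.add_subfield("0", "(isni)0000000000000000")  # placeholder identifier

for marc_field in (subject, contributor):
    body = "".join(f"${code}{value}" for code, value in marc_field.subfields)
    print(marc_field.tag, marc_field.indicators, body)
```

Keeping the identifier next to the text string means a later conversion can emit a link to the entity instead of trying to match the heading text again.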

Provenance: How should conflicts between statements be resolved or reconciled? We will likely see different types of inconsistencies than we see now with, for example, different birthdates. OCLC has been looking at the work of Google researchers on a “knowledge graph” (the basis of knowledge cards). As Google harvests the Web, it comes across incorrect or conflicting statements. Researchers have documented using algorithms based on frequency and the source of links to come up with a “confidence measure” (Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion). Aggregations such as WorldCat, VIAF and Wikidata may allow the library community to view statements from these sources with more confidence than others.
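To make the idea of a frequency-and-source-based confidence measure concrete, here is a minimal sketch of a weighted vote over conflicting statements. It is a simplified illustration only, not the algorithm from the Knowledge Vault paper. The function name, the source names and the trust weights are assumptions made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def confidence(
    statements: Iterable[Tuple[str, str]],
    source_weight: Dict[str, float],
    default_weight: float = 0.5,
) -> Dict[str, float]:
    """Score each asserted value by the summed weight of the sources asserting it.

    statements: (source, value) pairs, e.g. claims about one person's birth year.
    source_weight: hypothetical trust weight per source; unknown sources
    fall back to default_weight.
    Returns each distinct value's share of the total weight (shares sum to 1).
    """
    totals: Dict[str, float] = defaultdict(float)
    for source, value in statements:
        totals[value] += source_weight.get(source, default_weight)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {value: weight / grand_total for value, weight in totals.items()}


# Three hypothetical sources disagree about a birth year.
claims = [
    ("aggregation_a", "1950"),
    ("aggregation_b", "1950"),
    ("web_page_c", "1951"),
]
weights = {"aggregation_a": 1.0, "aggregation_b": 1.0, "web_page_c": 0.5}

scores = confidence(claims, weights)
for value, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(value, round(score, 2))
```

In this toy setup a value gains confidence both by being repeated and by coming from a source with a higher weight, which is the intuition behind trusting statements from large aggregations more than isolated ones.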

Importance of holdings data in a linked data environment: Metadata managers see the need to communicate both the availability and eligibility of the resource being described. A W3C document, Holdings via Offer, recommends mappings from bibliographic holdings data to schema.org.
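A holdings statement expressed with schema.org might look roughly like the sketch below, which turns a small holdings record into a JSON-LD Offer. The input record, its field names and the holding_to_offer function are hypothetical. The property choices draw on the general schema.org Offer vocabulary and are an assumption here, so the W3C document remains the place to check the recommended mappings.

```python
import json


def holding_to_offer(holding: dict) -> dict:
    """Map a simplified, hypothetical holdings record to a schema.org Offer."""
    available = "https://schema.org/InStock"
    unavailable = "https://schema.org/OutOfStock"
    return {
        "@context": "https://schema.org",
        "@type": "Offer",
        "itemOffered": {"@type": "Book", "name": holding["title"]},
        "seller": {"@type": "Library", "name": holding["library"]},
        "availableAtOrFrom": {"@type": "Place", "name": holding["location"]},
        "sku": holding["call_number"],       # assumed: call number as sku
        "serialNumber": holding["barcode"],  # assumed: item barcode
        "availability": available if holding["on_shelf"] else unavailable,
    }


# Invented sample data for illustration only.
sample_holding = {
    "title": "An Example Title",
    "library": "Example University Library",
    "location": "Main Stacks",
    "call_number": "Z666.7 .E93",
    "barcode": "00000000000000",
    "on_shelf": True,
}

offer = holding_to_offer(sample_holding)
print(json.dumps(offer, indent=2))
```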

Impact on workflow: In the next phase of the Linked Data for Libraries project, six libraries (Columbia, Cornell, Harvard, Stanford, Princeton and the Library of Congress) hope to figure out how to use linked data in production using BIBFRAME. They will be looking at how to link into acquisitions and circulation as well as cataloging workflows, and hope to collaborate with cataloging and local system vendors. Metadata managers noted it’s important to collaborate with the book vendors that supply them with MARC records now – even if they cannot generate linked data themselves, perhaps they could enhance MARC records so that transforming them into BIBFRAME is cleaner. Linked data may also encourage more sharing of metadata via statements rather than copy-cataloging a record that is then maintained as a local copy that is not shared with others.

Challenges

During this transition period the environment and standards are a moving target.
It’s unclear how libraries will share “statements” rather than records in a linked data environment.
How to involve the many vendors that supply or process MARC records now?
Working with others in the linked data environment involves people unfamiliar with the library environment, requiring metadata specialists to explain what their needs are in terms non-librarians can understand.
Differing interpretations of what is a “work” may hamper the ability to re-use data created elsewhere.

Success metrics: Moving into a production linked data environment will take time, and each institution may well have a different timetable. Discussions indicated that linked data experiments could be considered successful if:

The data is more integrated than it is now.
Data created by different workflows are interoperable.
Libraries can offer users new, valued services that current data models can’t support.
The resource descriptions are more machine-actionable than current standards.
Outside parties use library resource descriptions more.
The data is better and richer because more parties share in its creation.

—————————————

Graphic: Partial view of Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/

Karen Smith-Yoshimura

Karen Smith-Yoshimura, senior program officer, worked on topics related to creating and managing metadata with a focus on large research libraries and multilingual requirements. Karen retired from OCLC in November 2020.