OCLC Research Mini-Symposium on Linked Data

Mini-symposium participants suggested possible topics for discussion with post-it notes

Titia van der Werf and I, with our colleague Shenghui Wang, organized a “mini-symposium on Linked Data” held in the OCLC Leiden office on 19 November 2018. We brought together 43 staff from a range of Dutch academic libraries, cultural heritage institutions, the National Library, the National Archive, and the Institute for Sound and Vision, along with OCLC colleagues based in Europe, to discuss how linked data meets interoperability challenges in the online information environment. Masterfully facilitated by Titia, the mini-symposium featured Art of Hosting techniques to amplify interaction among the participants. Participants convened around tables of 6 or 7, determined by their interest in a specific topic, and took notes on flip charts that served as the basis of reporting on their discussions.

Participants came primarily to learn more about linked data, to network with others, and to find out what institutions were doing with linked data. The afternoon was devoted to two sessions: 1) Knowledge Exploration, where experts described their linked data implementations and the lessons learned and answered questions from the other participants, and 2) Open Space, where participants selected specific topics proposed by other participants and discussed them. By the end of the day, 85% of the participants rated the mini-symposium “useful” or “valuable.”

Knowledge Exploration: The six linked data implementations discussed:

OCLC Linked Data Wikibase Prototype: I gave an overview of OCLC Research’s work with 16 U.S. libraries using the Wikibase entity-based platform that generates linked data. The prototype participants enjoyed working with the platform, as they could concentrate on entities and create linked data without needing to know the technical underpinnings like RDF and SPARQL. Wikibase embeds multilingualism, and it was easy to create statements in any script. We learned that differentiating presentation from the data itself was a challenge for catalogers unfamiliar with using Wikidata. We added a discovery layer so that catalogers could see the relationships they created and see the added value of interoperating with data retrieved from other linked data sources. In the absence of rules, documentation and best practices became more important. We also learned that we needed to add a “retriever” application, so prototype participants could ingest data already created elsewhere. Participants also valued the ability to include the source for each statement created, so people could evaluate each statement’s trustworthiness.

Tresoar linked data project ECHOES (RedBot.frl): Olav Kwakman explained that Project ECHOES provides the infrastructure for RedBot.frl, a site that aggregates digital collections describing the Frisian cultural heritage. The project aims to dissolve barriers to accessing diverse collections of different groups and languages through an integrated platform, offering a view of Europe as a whole. The project partners are: Erfgoed Leiden en Omstreken (project lead), Tresoar (secretary of the Deltaplan Digitalisering Cultureel Erfgoed in Fryslân), Diputació de Barcelona, Generalitat de Catalunya, and the Consorci de Serveis Universitaris de Catalunya (technology partner). The source data sets come from libraries, archives, and museums, but most are not available as linked data; the project transforms them into linked data using Schema.org, the Europeana Data Model, and the CIDOC Conceptual Reference Model. He noted the difficulties of transforming legacy data into linked data: “data quality is king.” On the first attempt, only 30% of the data could be converted into linked data; the remaining 70% had technical errors because of typos, inconsistent data, or misuse of fields. A key lesson is the value of enabling the community to help correct the data. The project uses the FryskeHanne platform both to improve the data quality and to expand its linked data datasets. Because the linked data is stored in a separate linked data datastore, it can be made available to the public to expand and enrich the source data. He also noted that no single presentation will fit all use cases, and suggested building presentations that showcase the data based on a theme and enabling communities to build their own presentations.

International Institute of Social History website (changing to https://iisg.amsterdam on 31 January 2019): Eric de Ruijter noted that the original intention of the IISH was to improve its website and make its data re-usable and interoperable. Linked data provided the means to do this and was not the goal itself. They hired a contractor, Triple, which aggregated three sources of data: MARC records, Encoded Archival Descriptions of the IISH’s archival collections, and data sets, plus some articles. The data was marked up as RDF and saved in a local triple store, which is currently available only to the website. It supports searching through all collections and provides a faceted overview of all entries extracted from the aggregation of data sources. They made use of “unifying resources” such as the Getty’s Art and Architecture Thesaurus (AAT), Library of Congress headings, and the Virtual International Authority File (VIAF) for identifiers which “turned names into people.” Lessons learned: Monitor your contractors; they know the technology but not the semantics of library data. Employ an iterative process, and just start and learn, rather than trying to address all possible eventualities. The extra work helps improve your source data.

Koninklijke Bibliotheek’s data.bibliotheken.nl: René Voorburg noted that as the KB is responsible for collecting everything published in the Netherlands and publishing the corresponding metadata, linked data would expose that metadata to a larger audience on the web. But linked data does not yet have a “proper place” in the KB’s daily workflow. Although the KB has converted its data to RDF in triple stores using Schema.org (where vagueness is considered a plus), only those who are familiar with SPARQL (an RDF query language) can take advantage of it, such as scientists, who are happy with the results. They have enriched their metadata for authors with identifiers pulled from VIAF and Wikidata, and publish linked data about book titles on the edition level. But the published linked data dataset is not maintained and has not had any updates since it was first published. They are struggling with privacy issues and with how to make it easier to provide the data.
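
As a rough illustration of the kind of edition-level description mentioned above, the sketch below builds a Schema.org record as JSON-LD, with the author linked to external identifiers through `sameAs`. It is not the KB's actual data model: the function name `edition_to_jsonld`, the sample title and author, and all `example.org` URIs are placeholders invented for this example.

```python
import json


def edition_to_jsonld(title, edition, author_name, author_ids):
    """Build a Schema.org Book description as a JSON-LD dict.

    Hypothetical helper for illustration only; not the KB's real schema.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Book",
        "name": title,
        "bookEdition": edition,
        "author": {
            "@type": "Person",
            "name": author_name,
            # External identifiers (for example VIAF or Wikidata URIs) go here.
            "sameAs": list(author_ids),
        },
    }


record = edition_to_jsonld(
    title="Example title",
    edition="2nd edition",
    author_name="Example Author",
    author_ids=[
        "https://example.org/viaf/0000",    # placeholder, not a real VIAF URI
        "https://example.org/wikidata/Q0",  # placeholder, not a real Wikidata URI
    ],
)
print(json.dumps(record, indent=2))
```

The `sameAs` links are what let a consumer connect the local author name to the same person in other datasets without needing to match on name strings.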

Data Archiving and Networked Services (DANS): Andrea Scharnhorst focused on issues around the long-term archiving of RDF datasets. Increasingly, the social sciences and humanities (SSH) produce RDF datasets, and consequently they become objects for long-term archiving with DANS services (EASY, Dataverse). What to do with specific vocabularies used in SSH? Which vocabularies should be deposited alongside the RDF datasets which use part of them? Who is responsible for the maintenance of these vocabularies and for resolving an archive’s URI when the domain name changes? How much of an “audit trail” (provenance) is necessary (who did what when and why)? What criteria should be used to decide whether a web site should be archived, and which URIs should be snapshotted for ingest? Together with the Vrije Universiteit Amsterdam, host of the LOD Laundromat (a curated version of the LOD cloud that DANS archives), DANS is working on a Digging into the Knowledge Graph project on issues of indexing and preserving Knowledge Organization Systems to provide access to data both in the present and the future.

Rijksdienst Cultureel Erfgoed (RCE) [Cultural Heritage Agency of the Netherlands]: Joop Vanderheiden noted that the RCE is in the process of implementing a linked open dataset describing Dutch monuments and archaeology, to expose its data to a wider audience on the web. RCE will be publishing all its data about built monuments (62,000) and archaeological monuments (1,400). In addition, all data from its archaeological information system ARCHIS will be published as linked data. Its API and SPARQL endpoint will be available to everyone in January 2019. Even in the initial stages, the RCE can demonstrate what linked data makes possible. They have garnered more interest in their work through “LOD for Dummies” presentations. Their work has benefited from the Digital Environment Act (Digitaal Stelsel Omgevingswet or DSO) and the Dutch Digital Heritage Network (NDE).

Open Space: Participants discussed three questions:

How can we explain/understand benefits of linked data without getting lost in technical details? Participants suggested pointing to specific examples of successful linked data implementations. Among the benefits cited: providing bridges among different silos of data without destroying the integrity of the respective sources; taking advantage of others’ databases so that you don’t have to replicate the work, saving labor costs; enriching your own data with the expertise provided by others; providing cross-cultural, multilingual access; giving your users a better, richer experience; and increasing the visibility of your collections and expanding your user base now and in the future.

Many individual projects—how can they be related to each other? Participants referred to Tim Berners-Lee’s original set of linked data principles and stressed the need to conform to international standards. Wikidata is an example of bringing together multiple sources. Relationships among individual projects could be more easily established if implementers reused existing ontologies rather than creating their own. We need to share best practices with each other.

How to integrate LOD into your CBS/local cataloging system? Participants recommended that source data should not be modified but linked (via link tables) to trustworthy “sources of truth.” WorldCat was viewed as a “meta-cache” for a discovery layer. Participants wondered what criteria to use to determine which data sources should be consumed. They noted the importance of maintaining relationships among entities in linked data sources and the need for all thesauri to be publicly available as linked data.
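
The link-table idea can be sketched roughly as follows: the catalog records stay exactly as they are, and a separate table maps a record and field to an external URI. This is only a minimal sketch under stated assumptions. The record IDs, the `Link` structure, the `enrich` function, and the `example.org` URIs are all hypothetical and do not describe any participant's system.

```python
from dataclasses import dataclass

# Hypothetical local catalog records; these are never modified.
SOURCE_RECORDS = {
    "rec-001": {"title": "Max Havelaar", "author": "Multatuli"},
    "rec-002": {"title": "De avonden", "author": "Reve, Gerard"},
}


@dataclass(frozen=True)
class Link:
    """One row of the link table: which field of which record points where."""
    record_id: str
    field: str
    target_uri: str
    source: str  # name of the external "source of truth"


# The link table lives apart from the source data (placeholder URI).
LINK_TABLE = [
    Link("rec-001", "author", "https://example.org/authority/person/1",
         "example-authority"),
]


def enrich(record_id):
    """Return a copy of the record with its external links attached."""
    record = dict(SOURCE_RECORDS[record_id])  # copy, so the source stays intact
    record["links"] = {
        link.field: link.target_uri
        for link in LINK_TABLE
        if link.record_id == record_id
    }
    return record


print(enrich("rec-001"))
```

Keeping the links in their own table means a link can be corrected or dropped, or a different external source chosen, without touching the cataloging data itself.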

Takeaways: Participants valued the discussions. One noted that they received some answers to the questions they came with, but were returning home with even more questions: “potentially a good thing?” Some found consolation in the fact that cultural heritage institutions were more or less at the same level. The brief descriptions of a variety of specific linked data projects were appreciated. If linked data is published, you have to “keep it alive” (update it). Some noted the gap between people who work with linked data and those with the technical know-how. Proper planning and funding at the institutional level are needed. “We are not alone in this! I would like to come together more often.”

Karen Smith-Yoshimura

Karen Smith-Yoshimura, senior program officer, worked on topics related to creating and managing metadata, with a focus on large research libraries and multilingual requirements. Karen retired from OCLC in November 2020.

One Comment on “OCLC Research Mini-Symposium on Linked Data”

René Voorburg says:

December 11, 2018 at 7:46 am

A small update regarding the situation at the KB. The linked data at data.bibliotheken.nl is not just usable by those who know SPARQL; however, the added SPARQL interface enables those who do know SPARQL to do very powerful queries.
We are indeed examining better ways to provide the data. For example, ideally we would like to implement content negotiation mechanisms that would allow the user not just to retrieve an RDF/XML vs a JSON view (serialization) but also a ‘schema.org’ vs an ‘RDA’ view of the data. However, W3C is still in the process of designing mechanisms for content negotiation on that level.
Further, our online platform for linked data makes it easier for us to supply scientists with desired data. It seems to carry fewer privacy-related restrictions than supplying tailor-made databases (and is much easier to implement!).
We are in the process of adding more data to data.bibliotheken.nl, but indeed, this is not yet done as an integrated step of running metadata management processes, more as an afterthought.
