OCLC Research Mini-Symposium on Linked Data (Marseille edition)

My colleague Titia van der Werf and I organized a “mini-symposium on Linked Data” as part of the OCLC EMEA (Europe-Middle East-Africa) Regional Council conference held in Marseille on 26 February 2019. Fifty staff from OCLC member institutions throughout the EMEA region participated in this interactive session. The month before, we had conducted a short survey to determine the registrants’ key interests in the session. Most wanted to learn how other institutions were implementing linked data, so we arranged for five “lightning talks” summarizing different linked data implementations or perspectives, each followed by discussion and questions.

A national library’s experiences: Sébastien Peyrard of the Bibliothèque nationale de France (BnF) reminded us that linked data and open data are related but not identical. In France, few datasets in the data.gouv.fr portal are linked data, but all are available under the Etalab license, the French equivalent of the Creative Commons Attribution License (CC-BY), which permits others to freely distribute and re-use the data as long as they give appropriate credit to the source. The BnF’s linked data source data.bnf.fr aggregates the entities represented in the library’s various silos of resources: main catalogue, archival and manuscripts catalogue, virtual exhibitions, digital library, educational resources, etc. The BnF views linked data as an export format, not necessarily a cataloguing format. What is most important is that the cataloguing format is compatible with linked data principles: entity-driven (one record per entity, with links between entities expressed as links between records through their identifiers) and rich in controlled values that can be easily consumed by machines. This shift is underway at the BnF as it builds a new cataloguing system, but the cataloguing format will still be MARC. However, this “MARC flavor” will be entity-driven, with a bibliographic record becoming a Work, Expression, Manifestation, and Item representation, allowing linking between entities. It will also rely more heavily on controlled subfields.
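To make “entity-driven” a little more concrete, here is a minimal Python sketch of one record per entity, with links between entities carried as links between record identifiers. It is only an illustration and not the BnF’s data model: the `EntityRecord` class, the identifiers (`w1`, `e1`, ...), the relation names, and the labels are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EntityRecord:
    """One record per entity; links point at other records by identifier."""
    identifier: str
    entity_type: str  # "Work", "Expression", "Manifestation" or "Item"
    label: str
    links: Dict[str, str] = field(default_factory=dict)  # relation -> target id


# Hypothetical records; identifiers and relation names are made up.
records = {
    r.identifier: r
    for r in [
        EntityRecord("w1", "Work", "Example novel"),
        EntityRecord("e1", "Expression", "Original-language text", {"realizes": "w1"}),
        EntityRecord("m1", "Manifestation", "Printed edition", {"embodies": "e1"}),
        EntityRecord("i1", "Item", "Library copy", {"exemplifies": "m1"}),
    ]
}


def chain_to_work(identifier: str) -> List[str]:
    """Follow links from a record up to its Work; return the entity types visited."""
    path = []
    current = records[identifier]
    while True:
        path.append(current.entity_type)
        if not current.links:
            return path
        # In this sketch each record has at most one outgoing link.
        current = records[next(iter(current.links.values()))]


print(chain_to_work("i1"))
```

Because every link is an identifier rather than a text string, the same structure could be exported as linked data without changing how the records are catalogued.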

Peyrard stressed that the choice of a cataloguing format is specific to the institution: it could be BIBFRAME, other flavors of MARC, or something else. The BnF’s choice is the next generation of INTERMARC. The impact of linked data is more about having an entity-driven database, however that is achieved.

Research Information Management Graph: My OCLC colleague Annette Dortmund outlined the benefits of identifiers and linked data for a “global Research Information Management (RIM) graph.” She defined RIM as the “aggregation, curation, and utilization of research information” and described persistent identifiers as an important infrastructure for “unambiguous referencing and linking to resources.” RIM metadata almost always includes publications and may also include research data sets, preprints, and other outputs. It attempts to connect these outputs to grants, funders, equipment, and a growing number of other categories.

At the local level, this metadata can be captured in a traditional relational database, such as a CRIS. However, the information is rarely complete, and each institution does similar work. Once this metadata is aggregated at a national level, you need identifiers understood across all systems to identify and merge researchers, projects, funders, etc., and to see the network-level activity. Otherwise you end up with duplicates. But this information may still be incomplete. Research is international, with international collaborations and funding. Setting up a “global system” to capture, merge, and de-duplicate all the information is not feasible, but linked data can help create a global RIM graph.

If we rely on persistent identifiers to uniquely identify entities (researchers, organizations, projects, funders, etc.) and establish links between these identifiers, we could then connect them to locally held information. With identifiers such as ORCID or ISNI, it is much easier to reliably identify the one “John Smith” in question. Organization identifiers help get the affiliation information right. Publication identifiers such as DOIs or ISBNs help with that part. Research data is often citable by a DOI. Identifiers for projects, funders, grants, and many other entities either exist or are in development. This information can be found and used by anyone interested. The global RIM graph decentralizes the task of collecting all this information, and provides a central, global source of information.

Dortmund concluded that the one thing we can all do today to help create this future global RIM graph is to include resolvable persistent identifiers in our systems, for as many categories as possible, in addition to local or national ones.
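One small way to picture “resolvable” identifiers kept alongside local ones is the Python sketch below. It stores identifiers as full URIs built on the public resolver addresses for ORCID and DOI. The record layout, the local ID, and the DOI are hypothetical, and the ORCID iD shown is the sample iD that ORCID uses in its own documentation for a fictitious researcher.

```python
# Public resolver base URLs for two widely used identifier schemes.
RESOLVERS = {
    "orcid": "https://orcid.org/",
    "doi": "https://doi.org/",
}


def to_resolvable_uri(scheme: str, value: str) -> str:
    """Turn a bare identifier into a URI that can be followed on the web."""
    try:
        base = RESOLVERS[scheme.lower()]
    except KeyError:
        raise ValueError(f"no resolver known for scheme {scheme!r}") from None
    return base + value.strip()


# Hypothetical local record: the local ID is kept, the global IDs are added.
record = {
    "local_id": "staff-0042",
    "identifiers": [
        to_resolvable_uri("orcid", "0000-0002-1825-0097"),
        to_resolvable_uri("doi", "10.5555/example-1"),
    ],
}

for uri in record["identifiers"]:
    print(uri)
```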

A cultural heritage institution’s perspective: Gildas Illien of the Muséum national d’Histoire naturelle provided the context of a natural history museum full of databases in silos. The museum had just launched a proof-of-concept project based on a sample of about 500 local personal names, identifying the resources attached to those names that are spread throughout those silos. The museum sees linked data as a way to connect people talking about the same thing in different databases and to provide context for their objects to end-users without implementing a Google-like search box. All cultural heritage institutions (museums, archives, and libraries) have similar silos of data that could benefit from being connected for end-users through linked data.

OCLC Research’s experiences with Wikibase: I talked about how last year OCLC Research explored creating linked data in the entity-based Wikibase platform in collaboration with metadata practitioners in sixteen U.S. libraries. Wikibase is the platform underlying Wikidata, which contains the structured data you may see in the “information boxes” in various language Wikipedias. The attraction of using Wikibase is that metadata practitioners could focus on creating entities and their relationships without knowing any of the technical underpinnings of linked data. For example:

Entity: Photograph, which depicts this person who has this role in this time period

>> is part of this collection >> curated by this archive >> is part of this institution, which is located in this place >> which is located in this country, which had this previous name in this time period.
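The chain above can be sketched as a handful of subject-property-object statements. The Python below is only an illustration of that idea, not Wikibase’s data model or API: the entity names and property names are hypothetical, and time-period qualifiers are left out to keep it short.

```python
# Hypothetical statements mirroring the example chain (subject, property, object).
triples = [
    ("photograph", "depicts", "person"),
    ("photograph", "part_of", "collection"),
    ("collection", "curated_by", "archive"),
    ("archive", "part_of", "institution"),
    ("institution", "located_in", "place"),
    ("place", "located_in", "country"),
    ("country", "previous_name", "former name"),
]


def follow(start, properties):
    """Follow the given properties in order, starting from one entity.

    Assumes at most one object per (subject, property) pair, which holds
    for this small example but not for real data in general.
    """
    index = {(s, p): o for s, p, o in triples}
    path = [start]
    for prop in properties:
        nxt = index.get((path[-1], prop))
        if nxt is None:
            break  # the chain ends where no matching statement exists
        path.append(nxt)
    return path


steps = ["part_of", "curated_by", "part_of", "located_in", "located_in"]
print(" >> ".join(follow("photograph", steps)))
```

Each statement is created once, on the entity it describes; the longer chain falls out of following the links.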

We started with 1.2 million entities that we imported from Wikidata which matched entities in WorldCat and VIAF, so practitioners could link to existing ones and focus on creating new ones and establishing new relationships. We added a discovery layer so that the practitioners could see the relationships they created as part of their workflow and the added value of retrieving related data from other linked data sources.

Another valued feature of the Wikibase platform is that it embeds multilingualism. By changing the language interface, the participants could create labels and descriptions in their preferred language and script, deferring to others to provide transliterations or labels in other languages.

I reiterated the value of including identifiers wherever possible in the metadata people create now and noted that good metadata translates into good linked data!

Strategic choices by a national library: Jan Willem van Wessel of the Koninklijke Bibliotheek (KB) summarized its strategic choices about linked data. Its Strategic Plan for 2015-2018 included recommendations for the KB to adopt linked data. The KB did not choose Linked Data as a goal in itself but as a simpler way to present information that is connected and easily accessible to its users, which is the core function of the KB as a National Library.

The KB is now creating a platform (not system!) to bring people and information together as part of a network, leveraging the work done by network partners from the heritage field (public libraries, university libraries, museums, archives). If everyone does what they are good at, and does only that, joint work will proceed faster and be of higher quality. He noted that we are also working in an environment where users are creating their own information through Wikipedia articles, blogs, and social media. Search engines now structure and present information in Knowledge Graphs; they have developed a common language, schema.org.

Forty years of machine-readable cataloging have given us a legacy that includes software and structures that are poorly supported. The KB catalog includes 14,000 different keywords, while the average vocabulary of a native Dutch speaker is 42,000 words. How useful is that? The KB has much metadata remediation to do! Which of the several hundred record fields and subfields are really needed? Bibliographic metadata is dispersed and lacks the structure to glue disparate parts together. The KB does not have a publicly shared set of references for a publication.

Although linked data is not a panacea to solve all these problems, it does help to integrate and link sources of information from within the KB, from the library world in general and (why not?) from the entire world.

The KB has not yet achieved what it wants. It has conducted pilots, demonstrations, and proofs of concept, and organized HackaLODs (hackathons about cultural Linked Open Data) with great success. It has succeeded in knowledge building and experimentation, for example a Linked Data extension to its Delpher platform with marked-up entities contained in 12 million newspapers (check out lab.kb.nl). But van Wessel observed that the KB has just scratched the surface. Its goal is to set up the KB as an authoritative linked data source for bibliographic data accessible to the outside world.

Suggested resources:

Coursera online Introduction to linked data (INRIA MOOC): https://www.coursera.org/learn/web-data/home/welcome
The INRIA MOOC in French (available until February 2020): https://www.fun-mooc.fr/courses/course-v1:inria+41002+self-paced/about#
On the nascent RIM graph: https://www.project-freya.eu/en/deliverables/freya_d4-1.pdf, chapter 5 (p. 30f.). This is one output of Project FREYA (https://www.project-freya.eu/en/resources/project-output) and its “PID Graph” subproject. Chapter 5 describes the PID graph in a RIM context (the “RIM graph”); the other chapters, also worth reading, give different examples where the PID graph creates value.
The TED talk by Tim Berners-Lee about open data: https://www.ted.com/talks/tim_berners_lee_the_year_open_data_went_worldwide
Tutorials available at Cambridge Semantics: https://www.cambridgesemantics.com/blog/semantic-university/intro-semantic-web/towards-the-semantic-web/
A short introduction to data.bnf.fr in English: https://www.youtube.com/watch?v=sCbaHJu91-w

Karen Smith-Yoshimura

Karen Smith-Yoshimura, senior program officer, worked on topics related to creating and managing metadata, with a focus on large research libraries and multilingual requirements. Karen retired from OCLC in November 2020.