Preparing for the future: supporting the transition to Linked Data in libraries

the future soon / k rupp [Flickr]
For the past few years, Linked Data has been a buzzword at many of the major library conferences around the world. It has been the subject of paper presentations, panel discussions, and project reports at conferences such as ALA, DCMI, and ASIS&T. The themes of these talks vary. Sometimes the discussion revolves around the importance of Linked Data for libraries and how the transition will finally facilitate the long-overdue retirement of the now 40-plus-year-old MARC standard. Other times the discussion is more nuanced, focusing on how Linked Data can help libraries better publish and share their data with the wider world. While the response to the more evangelical messaging has been mixed, one point has been made clear by both supporters and skeptics of Linked Data: we, as an industry, need to update and change our technical and functional infrastructure to allow for the integration and proliferation of Linked Data.

As a research assistant at OCLC, I have had the opportunity to be involved in a variety of industry initiatives to help modernize library infrastructure and prepare libraries for the eventual arrival of Linked Data. Two of these projects underscore the fundamental changes that the library industry needs to undergo in order to support the adoption of Linked Data. The first is the redevelopment of the MODS data model as an RDF ontology; the second is the conversion of the Getty vocabularies from traditional string-based controlled thesauri into Linked Data datasets that use URIs as identifiers for people, organizations, places, concepts, etc. The two projects each address different but equally important changes that will need to occur in order for libraries to create Linked Data and integrate it into their everyday workflows.

Both the Getty project and, to a greater extent, the MODS project serve as good examples of OCLC Research contributing to community efforts in a significant fashion. With the redevelopment of MODS, I am a formal member of the working group that has been meeting since January 2014. In the conversion of the Getty vocabularies, I, along with OCLC Research colleague Jeff Young and OCLC Technology Evangelist Richard Wallis, have played an informal but still substantial advisory role. We are pleased to contribute to these initiatives, which we believe will carry the community forward. It should be noted that other, similar projects have already been undertaken to help prepare for Linked Data. Past and current efforts include the transformation of the Library of Congress vocabularies into Linked Open Data, the publication of FAST and VIAF as Linked Data, the ongoing work of the W3C Schema Bib Extend Community Group to create and propose extensions to Schema.org for better description of bibliographic materials, and the ongoing development of the BIBFRAME model.

Data models are fundamental to describing library metadata. They define the overall structure of the data and serve as the basis for best practices in data development. Very few libraries have begun to adopt models that are suitable for expressing their data as Linked Data. In an attempt to provide librarians with a well-understood, recognizable RDF data model, a team from Columbia University, led by Melanie Wacker, is currently working on a project to convert the widely used MODS XML schema into an RDF model. If successful, this project will allow libraries to easily convert their existing MODS XML data into RDF. On the surface, this might seem like a simple crosswalking exercise that converts XML data into RDF data, but if one probes a bit deeper, it soon becomes apparent that much more complex issues are at hand. For libraries to adopt and integrate RDF, a paradigm shift is required in the way that librarians think about and catalog bibliographic materials. You might be familiar with the phrase “Things not Strings”. This idea is fundamental to Linked Data but diametrically opposed to the way that librarians have been creating and accessing bibliographic data for the past century. In the world of cataloging, librarians have relied on controlled strings to add authority to bibliographic records. Linked Data, unlike record-based description, relies on linking together entities. Each entity defined in a Linked Data model has its own URI and a corresponding set of properties (such as publication date or birth date). For Linked Data to replace current record-based bibliographic description, models must exist that split records into separate entities, as the sketch below illustrates. The work of the MODS task force attempts to address this issue using a current library model that is widely used and understood.
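To make the “Things not Strings” idea concrete, here is a minimal sketch in Python using the rdflib library. The example.org URIs, the choice of Schema.org properties, and the sample values are illustrative assumptions on my part, not output of the MODS RDF work:

    # A sketch contrasting a string-valued author with an entity-valued
    # author, using rdflib. All URIs here are illustrative placeholders.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    SCHEMA = Namespace("http://schema.org/")
    g = Graph()
    g.bind("schema", SCHEMA)

    book = URIRef("http://example.org/book/1")
    g.add((book, RDF.type, SCHEMA.Book))
    g.add((book, SCHEMA.name, Literal("Moby Dick")))

    # Record-style cataloging: the author is a controlled string.
    # Nothing further can be said about, or linked to, this value.
    g.add((book, SCHEMA.author, Literal("Melville, Herman, 1819-1891")))

    # Entity-style description: the author is a 'thing' with its own
    # URI and properties, which other datasets can reference and extend.
    author = URIRef("http://example.org/person/melville")
    g.add((author, RDF.type, SCHEMA.Person))
    g.add((author, SCHEMA.name, Literal("Herman Melville")))
    g.add((author, SCHEMA.birthDate, Literal("1819-08-01")))
    g.add((book, SCHEMA.author, author))

    print(g.serialize(format="turtle"))

The string version is a dead end for a machine, while the entity version gives downstream systems a node they can follow, enrich, and merge with other data.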

Although the development and implementation of RDF data models is vital in helping to change the library’s technical infrastructure, another key piece of infrastructure needs to undergo transformation in order for Linked Data to gain wide support and use in the library domain. As mentioned above, librarians currently rely on controlled strings to add ‘authority’ to a bibliographic record. In the world of Linked Data, controlled strings are replaced by references to entities. To allow for the continued use of controlled vocabularies, various projects have produced new models of traditional library authority files. The Getty has joined this effort and is currently working on a set of projects that will transform all of its vocabularies (i.e., AAT, TGN, ULAN and CONA) into Linked Data datasets. The result will be globally unique URIs that identify every person, place, organization, concept, etc. defined in the Getty vocabularies. These URIs can then be used within the framework of new RDF models, such as MODS RDF or BIBFRAME, or non-library-specific models such as Schema.org.
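As a rough illustration of what using such a URI looks like in practice, the sketch below swaps a controlled subject string for a Getty AAT reference. The URI pattern follows the Getty’s published http://vocab.getty.edu/ conventions, but the numeric identifier shown is a placeholder rather than a verified AAT term, and the modeling choices are my own assumptions:

    # A sketch of replacing a controlled subject string with a reference
    # to a Getty AAT entity, using rdflib. The numeric identifier is a
    # placeholder, not a verified AAT term.
    from rdflib import Graph, Literal, Namespace, URIRef

    SCHEMA = Namespace("http://schema.org/")
    g = Graph()
    g.bind("schema", SCHEMA)

    work = URIRef("http://example.org/work/42")

    # Before: authority conveyed by a controlled string.
    g.add((work, SCHEMA.about, Literal("Etchings (prints)")))

    # After: authority conveyed by a link to an entity. Resolving the
    # URI can yield multilingual labels, broader/narrower terms, and
    # links to related entities, none of which a bare string provides.
    aat_term = URIRef("http://vocab.getty.edu/aat/300000000")  # placeholder id
    g.add((work, SCHEMA.about, aat_term))

    print(g.serialize(format="turtle"))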

These two projects highlight how the standards and infrastructure that libraries have relied on for decades are slowly beginning to change. More importantly, they underscore why change is needed. The two projects are like a lock and key. You can buy a new lock (i.e., develop a new data model), but if you do not have a new key (i.e., Linked Data datasets), it simply will not work. Conversely, a shiny new key is not very useful unless you have a lock to use it with. The two must be bought, installed, and used in tandem for the mechanism to work properly. This transition to Linked Data represents a fundamental break from how libraries have historically created and managed data, and epitomizes the term ‘disruptive innovation’. In the end, I believe that the switch from records to entities will help libraries better fulfill one of their primary missions, providing and sharing data with patrons, and will help prepare them for the future.