That was the topic discussed recently by OCLC Research Library Partners metadata managers, initiated by Myung-Ja (MJ) Han of the University of Illinois at Urbana-Champaign. She wanted to find out why libraries do metadata reconciliation work, what kinds of linked data sources libraries use and what libraries do with the reconciled identifiers. Many libraries perform metadata reconciliation (searching for matching terms in linked data sources and replacing text strings in metadata records with URIs, or storing the URIs alongside them) as a necessary first step toward a linked data environment or as part of metadata enhancement work. Several conferences, notably Code4Lib, have offered preconferences or workshops introducing tools for metadata reconciliation. Europeana’s report on metadata enhancement outlines the benefits and challenges of the semantic reconciliation and enrichment process, based on its own experience.
Three main reasons why libraries do metadata reconciliation:
- Efficient batch enhancement
- Preparation for linked data by using identifiers (Uniform Resource Identifiers, or URIs) rather than text strings
- Enhancing user services and discovery
Metadata reconciliation is done on a variety of data, including traditional MARC library data, digital collections, institutional repositories and archival materials. Improving the quality of the data improves users’ experience in the short term, and will help with the transition to linked data later.
Most metadata reconciliation is done on personal names, subjects and geographic names. Sources used for such reconciliation include the Virtual International Authority File (VIAF); the Library of Congress’ linked data service (id.loc.gov); the International Standard Name Identifier (ISNI); the Getty’s Union List of Artist Names (ULAN), Art & Architecture Thesaurus (AAT) and Thesaurus of Geographic Names (TGN); the Faceted Application of Subject Terminology (FAST); GeoNames; DBpedia; and various national authority files. Selection of a source depends on the trustworthiness of the organization responsible for it, the subject matter and the richness of its information.
Much metadata reconciliation is devoted to normalizing variants. The University of Auckland, for example, has encountered Māori terms with dozens of spelling variants that had to be normalized to a preferred form. Large aggregators like Libraries Australia also must both normalize variant forms and remove duplicates; each incoming file submitted by an individual institution requires some level of data cleanup. Much of this work requires manual checking and is time-consuming. Each institution is doing similar types of reconciliation: how can this work be shared? And when someone makes a correction, how can it be disseminated?
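The mechanical core of this normalization work can be sketched as a lookup from known variants to a preferred form. This is a minimal sketch, assuming a locally maintained variant mapping; the term and its variant spellings below are illustrative, not Auckland’s actual data:

```python
# Sketch: normalize variant spellings to a preferred form using a
# locally maintained mapping. The variants listed here are hypothetical.
PREFERRED = {
    "kapa haka": ["kapahaka", "kapa-haka", "Kapa Haka"],
}

# Invert the mapping: lowercased variant -> preferred form.
VARIANT_TO_PREFERRED = {
    variant.lower(): preferred
    for preferred, variants in PREFERRED.items()
    for variant in variants
}

def normalize(term: str) -> str:
    """Return the preferred form of a term, or the term unchanged."""
    return VARIANT_TO_PREFERRED.get(term.lower(), term)
```

The manual, time-consuming part is building and vetting the mapping itself; sharing that table across institutions is exactly the open question raised above.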
A number of institutions have experimented with obtaining identifiers (persistent URIs from linked data sources) to eventually replace our current reliance on text strings. Institutions have concluded that it is more efficient to include URIs in authority records at the outset than to reconcile them later. The University of Washington has created an experimental RDA input form that generates identifiers for various descriptors such as place of publication, edition, language and carrier. The University of Michigan has developed an LCNAF Named Entity Reconciliation program using OpenRefine (formerly Google Refine) that searches VIAF through the VIAF API for matches, looks for Library of Congress source records within a VIAF cluster and extracts the authorized heading. The result is a dataset pairing each original heading with the authorized LC Name Authority File heading and a link to its URI in the LCNAF linked data service. The program could be modified to bring in the VIAF identifier instead, and it gets fair results even though it relies on string matching. A number of NACO contributors have started to include URIs from linked data sources in the 024 fields of authority records when they are confident of exact matches.
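The core step of that workflow, pairing an original heading with the LC heading and identifier found through VIAF, might be sketched as follows. The response shape loosely follows VIAF’s AutoSuggest JSON, and all identifiers shown are placeholders rather than real VIAF or LCCN values:

```python
import json

# Hand-constructed sample shaped like a VIAF AutoSuggest response;
# the viafid and lc values below are placeholders, not real identifiers.
sample_response = json.dumps({
    "query": "Stein, Gertrude",
    "result": [
        {"term": "Stein, Gertrude, 1874-1946",
         "viafid": "0000000",
         "lc": "n00000000"},
    ],
})

def reconcile(original_heading, response_text):
    """Return (original heading, matched term, LCNAF URI) for the first
    match that carries a Library of Congress source record."""
    for match in json.loads(response_text).get("result") or []:
        if match.get("lc"):
            lc_uri = "http://id.loc.gov/authorities/names/" + match["lc"]
            return original_heading, match["term"], lc_uri
    return original_heading, None, None
```

In the actual program this pairing is done in bulk, over whole columns of headings, inside OpenRefine.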
Some portion of terms cannot be matched to any existing entity identifier. How should libraries provide identifiers for entities not represented in any of the above sources? OCLC Research has coined a “placeholder URI” that encodes the work identifier, the type of entity and the name (text string). For example:
http://experiment.worldcat.org/entity/work/data/836692365#Organization/william_morrow_and_company
If and when a persistent URI becomes available, the placeholder URI can be “deprecated” by linking the two with owl:sameAs.
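A placeholder URI of this shape can be produced mechanically. This is a minimal sketch; the slugification rule (lowercase, runs of non-alphanumerics replaced with underscores) is an assumption made for illustration, not OCLC Research’s documented algorithm:

```python
import re

def placeholder_uri(work_id: str, entity_type: str, name: str) -> str:
    """Build a placeholder URI encoding a work identifier, entity type
    and name. The slugification rule here is an illustrative assumption."""
    slug = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    return (f"http://experiment.worldcat.org/entity/work/data/"
            f"{work_id}#{entity_type}/{slug}")
```

Applied to the example above, `placeholder_uri("836692365", "Organization", "William Morrow and Company")` reproduces the URI shown.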
The University of Wisconsin-Madison has developed a prototype that shows how linked data sources could be incorporated into user services. It retrieves the URIs for person entities from its local Alma system, searches VIAF with each retrieved URI, and then extracts factual information from the links within the VIAF cluster, such as biographies or abstracts from the Getty’s ULAN or DBpedia and alma mater from Wikidata. See, for example, the bottom part of its catalog record for Gertrude Stein on Picasso.
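The routing step, deciding which facts to fetch from which links in a VIAF cluster, might look roughly like this. The domains, example links and facts listed per source are illustrative assumptions, not the prototype’s actual configuration:

```python
# Sketch: route external links found in a VIAF cluster to the kind of
# factual information each source can contribute. The routing table and
# example links are illustrative, not the prototype's actual logic.
SOURCE_FACTS = {
    "wikidata.org": "alma mater, birthplace, occupations",
    "dbpedia.org": "abstract",
    "vocab.getty.edu": "biography (ULAN)",
}

def plan_enrichment(cluster_links):
    """Map each recognized link in a VIAF cluster to the facts to fetch."""
    plan = {}
    for url in cluster_links:
        for domain, facts in SOURCE_FACTS.items():
            if domain in url:
                plan[url] = facts
    return plan
```

Unrecognized links are simply skipped, so adding a new source is a one-line change to the routing table.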
Such demonstrations of the value of ingesting or fetching related information from different sources to improve discovery services help make the investment in metadata reconciliation and using identifiers worthwhile.
——————
Note on the graphic: The example was provided by colleague Janifer Gatenby. All three identifiers are for “a” Russell Thomas, but only one of them is for the Russell Thomas pictured. A metadata specialist would need to determine the correct one.
Karen Smith-Yoshimura, senior program officer, worked on topics related to creating and managing metadata with a focus on large research libraries and multilingual requirements. Karen retired from OCLC in November 2020.
One way to deal with compound access points (e.g., personal names appearing in name/title strings in the description of a resource) would be to find or provisionally create a URI for the whole string signifying the work entity. The person as creator, and the title as a kind of label, could then be specified in relation to the work entity, and the person’s name could be reconciled to obtain the person’s URI, found or provisional, as part of the linked data about the work, and indirectly about the resource. Is this kind of deconstruction of the entities rolled up into compound statements in resource descriptions part of the vision for metadata reconciliation? Or is this kind of expanded work space a luxury we can’t afford?