That was the topic discussed recently by OCLC Research Library Partners metadata managers, initiated by Roxanne Missingham of Australian National University and Dawn Hale of Johns Hopkins University. Originally entitled “workflows associated with sharing digital collections (both born-digital and digitized)”, the topic arose from institutions increasingly sharing the metadata for their digital collections with both national and international discovery services. Within individual organizations, librarians create and recreate metadata for digital and digitized resources in a plethora of systems—the library catalog, archive management, digital asset and preservation systems, the institutional repository, research management systems and external subscription-based repositories. Targets for sharing this metadata range from tailored topic-based digital discovery services to national and international aggregations such as Google Scholar, HathiTrust, Digital Public Library of America (DPLA), Internet Archive, Trove and WorldCat. (The graphic above shows some of the targets identified by OCLC Research Library Partners.) Such aggregations can also help inform an institution’s own collection development, as librarians can see their contributions in the context of others’ content and identify gaps that they may wish to fill locally.
Workflows for sharing metadata are often highly manual, involving significant reworking of data through retyping, cumbersome spreadsheets and processes that impede rapid and effective access to digital content. The resources these practices consume delay timely access and hamper innovation and the development of scholarly research.
Given the variety of sources for digital collections’ metadata, even within the same institution, we should not be surprised that a number of different metadata schemas are used, including Dublin Core, Encoded Archival Description (EAD), Resource Description Framework (RDF), Metadata Authority Description Schema (MADS), Metadata Object Description Schema (MODS), Metadata Encoding and Transmission Standard (METS), Text Encoding Initiative (TEI), as well as locally customized schemas. Libraries often rely on crosswalks to massage the metadata from their databases into a schema acceptable to the aggregator, which means that source information that cannot be mapped is lost. Increased exposure of one’s digital collections in a national or international aggregation is important enough to justify this effort, and the metadata will usually include a pointer to the original source containing the more detailed information.
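The lossiness of crosswalking can be sketched in a few lines of code. This is only an illustration with invented field names, not any institution’s actual mapping; real crosswalks (say, MODS to simple Dublin Core) are far richer and often handle repeated and structured fields.

```python
# Hypothetical source record using MODS-like element names.
source_record = {
    "titleInfo": "Letters from the goldfields",
    "name": "Smith, Jane",
    "originInfo": "1862",
    "physicalDescription": "12 leaves, handwritten",  # no target in the mapping below
}

# Partial crosswalk from the source schema to simple Dublin Core
# (invented for illustration).
crosswalk = {
    "titleInfo": "dc:title",
    "name": "dc:creator",
    "originInfo": "dc:date",
}

def apply_crosswalk(record, mapping):
    """Map the fields the aggregator accepts; report what was lost."""
    mapped = {mapping[k]: v for k, v in record.items() if k in mapping}
    lost = [k for k in record if k not in mapping]
    return mapped, lost

mapped, lost = apply_crosswalk(source_record, crosswalk)
```

Here the physical description survives only in the source system, which is why the shared metadata usually points back to the fuller original record.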
Focus group members thought it unlikely that a single set of “best practices” could cover the entire range of potential aggregations, as each can differ in terms of audience, scope, context, purpose and functionality. However, developing best practices for a given target is more feasible. For example, OCLC Research Library Partners Pennsylvania State and Temple Universities have collaborated with other institutions in Pennsylvania to develop Pennsylvania’s DPLA Metadata Guidelines, “Requirements, Recommendations and Best Practices for Preparing Metadata”, which also drew on guidelines prepared by other DPLA hubs, service providers and other information professionals.
Some of the key challenges in sharing and re-using metadata describing digital collections include:
- Aggregators often have different guidelines and input formats. There is a tension between aggregators’ very reasonable contention that they cannot support many variations in submitted metadata and contributors’ very reasonable contention that they cannot support the particular needs of a wide range of aggregators. Similarly, aggregators are beginning to strongly encourage or require specific metadata elements and values to support functions they want to offer, whereas contributors lack the tools or resources to supply such information for existing digital resources.
- Disseminating corrections or updates between the source and the aggregation can be problematic. Information corrected somewhere in the chain leading to the aggregation may not be pushed back to the source, so the same errors must be corrected repeatedly. It is often not clear which data elements have been updated, when or by whom.
- Rights information for exposed digital collections may not be easily shareable. Not all metadata is descriptive; it also includes administrative, technical and preservation information, including the terms under which material may be shared, and this information can be difficult to pass along. Some data must be embargoed for a period of time. OCLC Research held a seminar in 2010, “Undue Diligence: Seeking Low-risk Strategies for Making Collections of Unpublished Materials More Accessible”, resulting in the document Well-intentioned practice for putting digitized collections of unpublished materials online, endorsed by the Society of American Archivists the following year.
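One way to address the “what was updated, when and by whom” problem above is element-level change logging. The sketch below is purely illustrative, with invented names; production systems would more likely record such events in a standard such as PREMIS.

```python
from datetime import date

# Hypothetical record with a typo a cataloger wants to correct.
record = {"dc:title": "Letters from the goldfelds"}
change_log = []

def update_element(record, element, new_value, agent, log):
    """Change one metadata element and record who changed it and when,
    so corrections can be traced between source and aggregation."""
    log.append({
        "element": element,
        "old": record.get(element),
        "new": new_value,
        "agent": agent,
        "date": date.today().isoformat(),
    })
    record[element] = new_value

update_element(record, "dc:title", "Letters from the goldfields",
               "cataloger@example.org", change_log)
```

With a log like this, an aggregator (or the source) could at least detect that a correction happened downstream and decide whether to propagate it.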
Although most digital collections are not yet exposed as linked data, a number of the OCLC Research Library Partners expect to publish some digital collections as linked data within the next year or two. The potential of using persistent identifiers to link to the most current version of a digital object, entity or term is very promising, and would mitigate problems associated with correcting data among different databases. It also raises a number of questions, including:
- How could we bundle together “statements” associated with a specific collection to provide the needed context? Would it suffice to include attributes that a digital object is a “part of” a collection?
- How would aggregators determine which statements were appropriate to ingest or fetch for a given audience or purpose?
- How could ontologies be aligned, especially when the same objects could be described in statements using different models?
- How could consumers determine the provenance of a given statement in order to assess its trustworthiness and authority?
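The first and last questions above can be made concrete with a toy model of linked-data statements. This is a minimal sketch, not RDF tooling: statements are plain (subject, predicate, object) triples extended with an “asserted by” source, and all URIs and predicate names are hypothetical. It shows one answer to the bundling question, gathering everything about objects that are “part of” a collection, and one answer to the provenance question.

```python
statements = [
    # (subject, predicate, object, asserted_by) — all values invented.
    ("ex:object/1", "dcterms:title", "Goldfields diary", "ex:library-a"),
    ("ex:object/1", "dcterms:isPartOf", "ex:collection/7", "ex:library-a"),
    ("ex:object/2", "dcterms:isPartOf", "ex:collection/7", "ex:library-b"),
    ("ex:object/1", "dcterms:date", "1862", "ex:aggregator"),
]

def bundle_for_collection(statements, collection_uri):
    """Gather every statement about objects that are 'part of' a collection,
    providing the collection-level context the statements need."""
    members = {s for s, p, o, _ in statements
               if p == "dcterms:isPartOf" and o == collection_uri}
    return [st for st in statements if st[0] in members]

def provenance(statements, subject, predicate):
    """Who asserted a given statement? A basis for judging its authority."""
    return [by for s, p, _, by in statements
            if s == subject and p == predicate]

bundle = bundle_for_collection(statements, "ex:collection/7")
```

Even this toy version surfaces the harder issue in the questions above: once two sources describe the same object with different models, simple predicate matching no longer suffices and the ontologies must be aligned.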