This is the third and final post reporting on the work of our OCLC Research Library Partnership Web Archiving Working Group. The first post summarizes our work overall and lists some of the conundrums that librarians and archivists have experienced when trying to describe web content. The second post goes into some detail about our two supplementary reports (a literature review and an analysis of harvesting tools).
This post focuses on the primary report: Descriptive Metadata for Web Archiving: Recommendations of the OCLC Research Library Partnership Web Archiving Metadata Working Group (WAM). Here I’ll talk about our objectives in doing the work, list the criteria we used to scope the data element set, and briefly describe each of the 14 elements.
WAM’s overall objective was to develop practices for creating consistent metadata that address the unique characteristics of websites and collections. More specifically, we sought to:
- Develop community-neutral, standards-neutral practices for descriptive metadata for archived web content, taking into account the needs of end users and metadata practitioners.
- Define a lean set of data elements with usage notes to guide the preparation of data content.
- Ensure that the data elements can be used in concert with other standards that have far more granular data element sets.
- Use a scalable approach that requires neither in-depth description nor extensive changes to records over time.
- Enable practitioners to have confidence that they are contributing to the application of consistent practice in this emerging area.
- Bridge bibliographic and archival approaches to description.
Bridging different descriptive practices has become important as digital content increasingly permeates our collections, and our literature review showed that metadata practitioners want this. We recognized that attempting to do so would be challenging, and that our recommendations might seem problematic to those with deep experience in a single standards context. As you evaluate our work, consider the value of such blending of practices in the web context. Sites may be described while either “live” or archived, have characteristics associated with both bibliographic and archival practice, and be described at the item level, collection level, or both. In some settings, this work may offer opportunities for collaboration across multiple organizational units.
WAM’s primary output is a set of 14 recommended data elements selected for their applicability to archived web content. In developing this data dictionary, we considered two key questions: what types of content are most important to include in a metadata record describing an archived website or collection of sites, and what data element should be designated for each of these content types? To guide our decisions, we followed these criteria:
- The data element set can be used either on its own or in concert with library and archival standards that are far more granular.
- The element set is lean to facilitate scalable metadata creation.
- Data element names and definitions were adopted or adapted from existing standards whenever feasible to enhance compatibility and to encourage consistency across discovery environments.
- Usage notes explain the recommended application of each element to assist practitioners in creating metadata whose meaning is completely clear to end users.
- Common elements that are key to identification and discovery of any resource are included (such as Creator, Date, Subject and Title).
- All other elements must have clear applicability for description of archived websites (such as Description, Rights and URL).
- Elements are excluded if they rarely (if ever) appear in institutional guidelines and/or extant metadata records for archived web content and have no special meaning in this context (such as audience, publisher, and statement of responsibility).
The same elements are meant to be used at all levels of description, in accord with the multilevel description principles expressed in archival standards such as DACS and EAD. The WAM data elements also overlap significantly with Dublin Core: eight Dublin Core elements match their WAM counterparts in both name and meaning (Contributor, Creator, Date, Description, Language, Relation, Subject and Title). WAM’s usage guidelines, however, go far beyond the brief statements that Dublin Core provides to guide preparation of the content of each element.
Here are WAM’s 14 recommended elements and the definition of each:
COLLECTOR: The organization responsible for curation and stewardship of an archived website or collection.
CONTRIBUTOR: An organization or person secondarily responsible for the content of an archived website or collection.
CREATOR: An organization or person principally responsible for creating the intellectual content of an archived website or collection.
DATE: A single date or span of dates associated with an event in the lifecycle of an archived website or collection.
DESCRIPTION: One or more notes explaining the content, context and other aspects of an archived website or collection.
EXTENT: An indication of the size of an archived website or collection.
GENRE/FORM: A term specifying the type of content in an archived website or collection.
LANGUAGE: The language(s) of the archived content, including visual and audio resources with language components.
RELATION: Used to express part/whole relationships between a single archived website and any collection to which it belongs.
RIGHTS: Statements of legal rights and permissions granted by intellectual property law or other legal agreements.
SOURCE OF DESCRIPTION: Information about the gathering or creation of the metadata itself, such as sources of data or the date on which source data was obtained.
SUBJECT: Primary topic(s) describing the content of an archived website or collection.
TITLE: The name by which an archived website or collection is known.
URL: Internet address for an archived website or collection.
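For readers who think in code, here is a minimal sketch of the element set as a simple vocabulary check. It is not taken from the report, which does not define a programming interface. The names `WAM_ELEMENTS`, `DC_MATCHES`, `unknown_elements` and `without_dc_match` are hypothetical, and representing a record as a dictionary of repeatable string values is an assumption made only for illustration. The sample record reuses the Creator example quoted later in this post.

```python
# The 14 element names as listed in this post.
WAM_ELEMENTS = (
    "Collector", "Contributor", "Creator", "Date", "Description",
    "Extent", "Genre/Form", "Language", "Relation", "Rights",
    "Source of Description", "Subject", "Title", "URL",
)

# The eight elements this post says match Dublin Core in name and meaning.
DC_MATCHES = frozenset({
    "Contributor", "Creator", "Date", "Description",
    "Language", "Relation", "Subject", "Title",
})


def unknown_elements(record):
    """Return the keys of a record that are not WAM element names."""
    return [name for name in record if name not in WAM_ELEMENTS]


def without_dc_match(elements=WAM_ELEMENTS):
    """Return the WAM elements not listed as name-and-meaning matches."""
    return [name for name in elements if name not in DC_MATCHES]


# Assumption: a record is a dict mapping element name -> list of values.
record = {
    "Creator": ["Sherman, Aliza"],
    "Title": ["Aliza Sherman rants and raves"],
}

print(unknown_elements(record))   # no keys outside the element set
print(without_dc_match())         # the six remaining elements
```

A check like this only confirms that a record stays within the 14 names; it says nothing about whether the content of each element follows the usage notes.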
Detailed usage notes offer direction on how to apply each element. Here are a few paragraphs selected from element descriptions:
- Use Collector for the organization that selects the web content for archiving, creates metadata and performs other activities associated with “ownership” of a resource. Stated another way, this is the organization that has taken responsibility for the archived content, although the digital files are not necessarily stored and maintained by this organization (collections harvested using Archive-It are a prominent example).
- Use Date to record any known date or span of dates associated with an archived website or collection that will help users understand the content. Always make clear the meaning of a Date element by adding appropriate wording to enable user understanding. Without this, any date for a website or collection is inherently ambiguous.
- To fulfill WAM’s objective of a lean set of data elements, the Description element can contain any type of note. In this regard it matches Dublin Core, which is widely used to describe digital resources and also has a single description element. When applying these recommendations with one or more detailed standards, use the more granular elements that they provide.
- The Extent of a collection of sites often is expressed as an approximate number of sites. This approach is easy to maintain and so is recommended for sites that will be harvested repeatedly and for collections to which additional sites will be added.
- Use the URL element to record URLs, URNs or URIs that are useful to users, particularly seed and access URLs. Include text to explain its function. Repeat the element as many times as necessary.
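The Date and URL notes above share one idea: a bare value is ambiguous until wording explains what it means. One possible way to enforce that in a local workflow is sketched below. The `Labeled` type and `render` function are hypothetical, the "value (label)" display format is an assumption rather than anything the report specifies, and the URLs and date range are made-up placeholder values.

```python
from typing import NamedTuple


class Labeled(NamedTuple):
    """A value paired with wording that explains its meaning."""
    label: str
    value: str


def render(element, entry):
    """Format one element occurrence, refusing values with no explanation."""
    if not entry.label.strip():
        raise ValueError(f"{element} needs wording that explains its meaning")
    return f"{element}: {entry.value} ({entry.label})"


# The URL element may be repeated; each occurrence states its function.
# These addresses are placeholders, not real archived sites.
urls = [
    Labeled("seed URL", "https://example.org/"),
    Labeled("access URL", "https://archive.example.org/example.org/"),
]
url_lines = [render("URL", u) for u in urls]

# A hypothetical date span, labeled so users know which event it refers to.
date_line = render("Date", Labeled("dates of capture", "2015-2017"))

for line in url_lines + [date_line]:
    print(line)
```

Keeping the label next to the value, rather than in a separate note, means the explanation travels with the data if the record is exported to another system.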
The description of each element also includes examples to guide metadata staff. Here is a section under the Creator element that includes both usage notes and examples:
An individual person is the Creator only when he/she is clearly the creator of the intellectual content, such as an individual’s personal blog or Twitter feed.
Creator: Sherman, Aliza
Title: Aliza Sherman rants and raves
Many sites about an individual are created by someone else (who may or may not be named), such as those for politicians, authors, musicians and other public figures. The individual’s name is often the title of the site and should be repeated in the Subject element.
[no Creator element]
Title: Jacqui Lambie
Subject: Jacqui Lambie, 1971-
The report concludes with a section on future research needs in five areas: technical and preservation metadata, discovery systems, machine-actionable description, multiple levels of description, and MARC record types. A brief appendix offers an example of a single archived website encoded in MARC, a collection encoded in MARC, a single site encoded in Dublin Core, and a multilevel finding aid encoded in EAD.
We look forward to hearing from you as you consider adopting these recommendations, whether in whole or in part. As always, your feedback is important to us!
Jackie Dooley retired from OCLC in 2018. She led OCLC Research projects to inform and improve archives and special collections practice.