Metadata for archived websites

That was the topic discussed recently by OCLC Research Library Partners metadata managers, initiated by Dawn Hale of Johns Hopkins University. For some years now, archives and libraries have been archiving web resources of scholarly or institutional interest to ensure their continuing access and long-term survival. Some websites are ephemeral or intentionally temporary, such as those created for a specific event. Institutions would also like to archive and preserve the content of their own websites as part of their historical record. The large majority of archived web content is harvested by web crawlers, but the metadata generated by harvesting alone is considered insufficient to support discovery.

Examples of archived websites among OCLC Research Library Partnership institutions include:

Ivy-Plus collaborative collections: Collaborative Architecture, Urbanism, and Sustainability Web Archive (CAUSEWAY) and Contemporary Composers Web Archive (CCWA);
The New York Art Resources Consortium (NYARC), which captures dynamic web-based versions of auction catalogs and artist, gallery and museum websites;
Thematic collections supporting a specific research area, such as Columbia University’s Human Rights, Historic Preservation and Urban Planning, and New York City Religions;
Teaching materials, such as MIT’s OpenCourseWare (OCW), which aspires to make the content available to scholars and instructors for reuse for the foreseeable future;
Government archives, such as the Australian Government Web Archive.

Approaches to web archiving are evolving. Libraries are developing policies regarding content selection, exploring potential uses of archived content and considering the requirements for long-term preservation. Our discussion focused on the challenges of creating and managing the metadata needed to enhance machine-harvested metadata from websites.

Some of the challenges raised in the discussions:

Descriptive metadata requirements may depend on the type of website archived, e.g., transient sites, research data, social media, or organizational sites. Sometimes only the content of the sites is archived when the look-and-feel of the site is not considered significant.
Practices vary. Some characteristics of websites are not addressed by existing descriptive rules such as RDA (Resource Description and Access) and DACS (Describing Archives: A Content Standard). Metadata tends to follow bibliographic description traditions or archival practice depending on who creates the metadata.
Metadata requirements may differ depending on the scale of material being archived and its projected use. For example, digital humanists look at web content as data and analyze it for purposes such as identifying trends, while other users merely need individual pages.
Many websites are updated repeatedly, requiring re-crawling when the content has changed. Some types of change can result in capture failures.
The level of metadata granularity (collection, seed/URL, document) may vary based on anticipated user needs, scale of material being crawled, or available staffing.
Some websites are archived by more than one institution. Each may have captured the same site on different dates and with varying crawl specifications. How can they be searched and used in conjunction with one another?
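
As a purely illustrative sketch of that last question, captures of the same site held by different institutions could be brought together by grouping them on a normalized seed URL and ordering them by capture date. Everything here is an assumption made for the example: the Capture record, its field names, the normalize and merged_timeline helpers, and the institution names and URLs are hypothetical and do not come from any standard or from the focus group discussions. The URL normalization is deliberately crude and would need more care in practice.

```python
from dataclasses import dataclass
from datetime import date
from itertools import groupby


@dataclass(frozen=True)
class Capture:
    """One crawl of a seed URL by one institution (hypothetical record)."""
    institution: str
    seed_url: str
    captured_on: date


def normalize(url: str) -> str:
    """Crude normalization so trivially different seed URLs compare equal."""
    url = url.lower().rstrip("/")
    for prefix in ("https://", "http://", "www."):
        if url.startswith(prefix):
            url = url[len(prefix):]
    return url


def merged_timeline(captures):
    """Group captures by normalized seed URL, each group sorted by date."""
    ordered = sorted(
        captures, key=lambda c: (normalize(c.seed_url), c.captured_on)
    )
    return {
        url: list(group)
        for url, group in groupby(ordered, key=lambda c: normalize(c.seed_url))
    }


# Hypothetical captures of the same site held by two institutions.
captures = [
    Capture("Library A", "http://www.example.org/", date(2016, 3, 1)),
    Capture("Library B", "https://example.org", date(2015, 11, 20)),
    Capture("Library A", "http://other.example.net", date(2016, 1, 5)),
]

for url, group in merged_timeline(captures).items():
    for c in group:
        print(url, c.captured_on.isoformat(), c.institution)
```

A shared key such as the normalized seed URL only helps if each institution records the seed and capture date consistently, which is one argument for the kind of common core metadata set discussed below.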

Some of the issues raised, such as deciding on the correct level of granularity, determining relevance to one’s existing collection and handling concerns about copyright, are routinely addressed by archivists. Jackie Dooley’s The Archival Advantage: Integrating Archival Experience into Management of Born-Digital Library Materials is applicable to archiving websites as well.

Focus group members agreed we had a common need for community-level metadata best practices applicable to archived websites, perhaps a “core metadata set”. Since the focus group discussions started in January, my colleagues Jackie Dooley and Dennis Massie have convened a 26-member OCLC Research Library Partnership Web Archiving Metadata Working Group with a charge to “evaluate existing and emerging approaches to descriptive metadata for archived websites” and “recommend best practices to meet user needs and to ensure discoverability and consistency”. Stay tuned!

Karen Smith-Yoshimura

Karen Smith-Yoshimura, senior program officer, worked on topics related to creating and managing metadata, with a focus on large research libraries and multilingual requirements. Karen retired from OCLC in November 2020.

One Comment on “Metadata for archived websites”

Lotte Belice Baltussen says:

March 23, 2016 at 4:35 am

Dear Karen,

This is very interesting! I’m currently managing a project within the Network Digital Heritage (Netherlands) in which — among other things — we’re developing a metadata model for the web archive of the archive RHV Groninger Archieven. You can read more in this blog post [1], but I’ve also done some research on models and practices [2] in order to come to a choice that fits our project (MODS-lite, most likely).

All texts are in Dutch for now, but I’d be happy to share more information with the Working Group if that might be useful.

Cheers,
Lotte

[1] http://www.den.nl/blog/bericht/5255
[2] https://docs.google.com/document/d/1pte4xY4GgV25fraa_sIdGUgJM7QVksf7HtaOfT8EHqg/edit?usp=sharing
