Two Reports on Web Archiving: Literature Review and Tools Analysis

In yesterday’s post I briefly summarized the three reports issued last month by the OCLC Research Library Partnership Web Archiving Metadata Working Group. This post goes into more depth about the two reports that supplement our recommendations for descriptive metadata: a literature review and an analysis of harvesting tools.


Literature review

The working group’s first task was to scope its work by writing a charge, and a key concept immediately emerged: the need to recommend an approach that would meet end-user needs, enhance discovery and improve metadata consistency. To that end, we conducted a literature review to inform our development of a recommended approach to descriptive metadata. We limited our scope to readings that include a section related to descriptive metadata, but most covered a wider swath of issues. This helped us learn about who the users of web archives are, the strategies they use and the challenges they face. Descriptive Metadata for Web Archiving: Literature Review of User Needs is the outcome of this work.

The literature falls into two clear categories: the needs of end users and those of metadata practitioners. The report characterizes types of end users, their research methodologies, barriers to use, discovery interfaces and the need for support services and outreach. The review of practitioner literature addresses the need for scalable practices, the standards and shared practices currently in use, the outcomes of a variety of case studies, and other approaches to metadata. Here are our takeaways for each community:

End Users

The literature on end-user needs largely focuses on academic researchers in a wide variety of disciplines.
Users express a strong need for provenance information to add context beyond standard descriptive metadata elements, reflecting a widespread desire for transparency around the selection process and the completeness of individual captures.
Given the ease and ubiquity of access to the open web, restrictions on access—such as being limited to onsite viewing in a library—are both mystifying and frustrating to users.
Complex web content that has been archived is sometimes presented in a way that exceeds the limits of users’ technical knowledge, constituting a widespread barrier to use.
A need for user support services derives from the complexity of accessing and using web archives.
Libraries and archives should actively engage in outreach to both current and potential web archives users.

Metadata Practitioners

Scalable descriptive metadata practices are needed because staff resources are extremely limited at most institutions.
Existing library and archival standards for data structure and content are being used for web archiving descriptive metadata. Dublin Core is most widely used, in part because use of the Archive-It tool is so widespread.
Bibliographic, archival and hybrid approaches are in use. Practitioners widely perceive a need to find appropriate ways to blend standard library and archival practices.
In devising metadata at various levels of description (collection, site, document), practitioners should consider carefully the elements they will use at each level.
Metadata describing archived web content is often delivered via multiple discovery systems, which clearly suggests the need for smooth processes to re-use metadata.
Experimentation with nontraditional approaches is underway.

This rich literature is a testament to the vitality of the relatively new field of web archiving.

Review of harvesting tools

As we read the literature and reached out to various communities for feedback while the project was in progress, we learned that metadata practitioners long for machine-generated metadata that they can extract from harvesting tools and re-use, in lieu of having to create or rekey it. Descriptive Metadata for Web Archiving: Review of Harvesting Tools reports on our analysis of eleven tools with an eye to their functionality for extracting descriptive metadata.

The tools we reviewed are Archive-It, Heritrix, HTTrack, Memento, Netarchive Suite, SiteStory, Social Feed Manager, Wayback Machine, Web Archive Discovery, Web Curator Tool, and Webrecorder.

We came to several conclusions:

Most tools built for web archives focus on capturing and storing technical metadata for accurate transmission and re-creation but capture minimal descriptive metadata, in part because so little exists in the captured files. Descriptive metadata therefore must be created manually, either within the tool or externally.
The title of a site (as recorded in its metadata) and the date of harvesting are routinely captured, but it may not be possible to extract them automatically. Titles are sometimes unhelpful, such as “home page” or “title.”
Not all tools define descriptive metadata in the same way.
The hope for auto-generation of descriptive metadata may be fruitless unless or until creators of textual web pages routinely embed more metadata that is available for capture.
Development of new tools and enhancement of existing ones are actively underway.
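
To make the point about sparse embedded metadata concrete, here is a minimal sketch in Python of what automatic extraction from a single captured HTML page might look like. It is not taken from any of the eleven tools we reviewed. The class name, function name and list of "generic" titles are hypothetical (only "home page" and "title" come from the report), and the harvest date is assumed to be supplied by whatever tool made the capture.

```python
from html.parser import HTMLParser

# Assumption: titles treated as uninformative. Only "home page" and "title"
# are examples from the report; the rest are illustrative additions.
GENERIC_TITLES = {"home page", "title", "home", "index", "untitled"}


class DescriptiveMetadataParser(HTMLParser):
    """Collects the <title> text and any <meta> name/content pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # Accept both name="..." and property="..." style meta tags.
            name = attr.get("name") or attr.get("property")
            content = attr.get("content")
            if name and content:
                self.meta[name.lower()] = content.strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_descriptive_metadata(html, harvest_date):
    """Return the little descriptive metadata a captured page offers.

    harvest_date is passed in by the caller (hypothetical: it would come
    from the harvesting tool, not from the page itself).
    """
    parser = DescriptiveMetadataParser()
    parser.feed(html)
    parser.close()
    title = " ".join(parser.title.split())  # collapse stray whitespace
    return {
        "title": title,
        "title_is_generic": not title or title.lower() in GENERIC_TITLES,
        "harvest_date": harvest_date,
        "embedded": parser.meta,
    }
```

Run against a page whose title is "Home Page" and which carries no meta tags, a function like this returns a flagged title, the harvest date and an empty dictionary. That is roughly the situation described above, and anything further would still have to be created manually.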

Ultimately, we came to wonder whether harvesting is the most appropriate stage of the web archiving process during which to add descriptive metadata for most types of content. Given the limited functionality of metadata features in current web archiving tools, perhaps there is a clearer path forward for approaches that leverage external services and APIs.

Watch for a third post tomorrow describing our recommendations for descriptive metadata. We would be delighted to have your feedback on these reports.

Jackie Dooley

Jackie Dooley retired from OCLC in 2018. She led OCLC Research projects to inform and improve archives and special collections practice.