Hanging Together

the OCLC Research blog

Two Reports on Web Archiving: Literature Review and Tools Analysis

March 28, 2018 (updated September 4, 2020) - by Jackie Dooley

In yesterday’s post I briefly summarized the three reports issued last month by the OCLC Research Library Partnership Web Archiving Metadata Working Group. This post goes into more depth about the two reports that supplement our recommendations for descriptive metadata: a literature review and an analysis of harvesting tools.

Image: computer screen. Photo by Markus Spiske on Unsplash

Literature review

The working group’s first task was to scope its work by writing a charge, and a key concept immediately emerged: the need to recommend an approach that would meet end-user needs, enhance discovery and improve metadata consistency. To that end, we conducted a literature review to inform our development of a recommended approach to descriptive metadata. We limited our scope to readings that include a section related to descriptive metadata, but most covered a wider swath of issues. This helped us learn about who the users of web archives are, the strategies they use and the challenges they face. Descriptive Metadata for Web Archiving: Literature Review of User Needs is the outcome of this work.

The literature falls into two clear categories: the needs of end users and those of metadata practitioners. The report characterizes types of end users, their research methodologies, barriers to use, discovery interfaces and the need for support services and outreach. The review of practitioner literature addresses the need for scalable practices, the standards and shared practices currently in use, the outcomes of a variety of case studies, and other approaches to metadata. Here are our takeaways for each community:

End Users

  • The literature on end-user needs largely focuses on academic researchers in a wide variety of disciplines.
  • Users express a strong need for provenance information to add context beyond standard descriptive metadata elements, reflecting a widespread desire for transparency around the selection process and the completeness of individual captures.
  • Given the ease and ubiquity of access to the open web, restrictions on access—such as being limited to onsite viewing in a library—are both mystifying and frustrating to users.
  • Complex web content that has been archived is sometimes presented in a way that exceeds the limits of users’ technical knowledge, constituting a widespread barrier to use.
  • A need for user support services derives from the complexity of accessing and using web archives.
  • Libraries and archives should actively engage in outreach to both current and potential web archives users.

Metadata Practitioners

  • Scalable descriptive metadata practices are needed because staff resources are extremely limited at most institutions.
  • Existing library and archival standards for data structure and content are being used for web archiving descriptive metadata. Dublin Core is most widely used, in part because use of the Archive-It tool is so widespread.
  • Bibliographic, archival and hybrid approaches are in use. The need to find appropriate ways to blend standard library and archival practices is widely perceived.
  • In devising metadata at various levels of description (collection, site, document), practitioners should consider carefully the elements they will use at each level.
  • Metadata describing archived web content is often delivered via multiple discovery systems, which clearly suggests the need for smooth processes to re-use metadata.
  • Experimentation with nontraditional approaches is underway.

This rich literature is a testament to the vitality of the relatively new field of web archiving.

Review of harvesting tools

As we read the literature and reached out to various communities to obtain feedback while the project was in process, we learned that metadata practitioners long for machine-generated metadata that they can extract and re-use from harvesting tools in lieu of having to create and/or rekey it. Descriptive Metadata for Web Archiving: Review of Harvesting Tools reports on our analysis of eleven tools with an eye to their functionality for extracting descriptive metadata.

The tools we reviewed are Archive-It, Heritrix, HTTrack, Memento, Netarchive Suite, SiteStory, Social Feed Manager, Wayback Machine, Web Archive Discovery, Web Curator Tool, and Webrecorder.

We came to several conclusions:

  • Most tools built for web archives focus on capturing and storing technical metadata for accurate transmission and re-creation but capture minimal descriptive metadata, in part because so little exists in the captured files. Descriptive metadata therefore must be created manually, either within the tool or externally.
  • The title of a site (as recorded in its metadata) and the date of harvesting are routinely captured, but it may not be possible to extract them automatically. Titles are sometimes unhelpful, such as “home page” or “title.”
  • Not all tools define descriptive metadata in the same way.
  • The hope for auto-generation of descriptive metadata may be fruitless unless or until creators of textual web pages routinely embed more metadata that can be available for capture.
  • Development of new tools and enhancement of existing ones are actively underway.
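To make the point about titles concrete, here is a minimal sketch, not taken from the reports or from any of the tools reviewed, of pulling the embedded title out of a captured HTML page and flagging it for manual review when it is missing or generic. It uses only Python's standard library. The names `TitleParser`, `describe_capture`, and `UNHELPFUL_TITLES` are hypothetical, the list of placeholder titles is an assumption beyond the two examples given above, and the `title` and `date` keys only loosely follow Dublin Core element names.

```python
from html.parser import HTMLParser

# Assumed list of placeholder titles. The post itself only names
# "home page" and "title"; the other entries are illustrative guesses.
UNHELPFUL_TITLES = {"home page", "title", "home", "untitled", "index"}


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element in a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect all of them.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace; return None if nothing was found.
        text = " ".join("".join(self._parts).split())
        return text or None


def describe_capture(html, harvest_date):
    """Build a minimal descriptive record for one captured page.

    harvest_date is supplied by the caller (for example, from the
    harvesting tool's own log); this sketch does not derive it.
    """
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = parser.title
    needs_review = title is None or title.casefold() in UNHELPFUL_TITLES
    return {"title": title, "date": harvest_date, "needs_review": needs_review}


if __name__ == "__main__":
    page = "<html><head><title> Home  Page </title></head><body></body></html>"
    print(describe_capture(page, "2018-03-28"))
```

A record with `needs_review` set would still need a person to supply a meaningful title, which is consistent with the conclusion that descriptive metadata largely has to be created manually.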

Ultimately, we came to wonder whether harvesting is the most appropriate stage of the web archiving process during which to add descriptive metadata for most types of content. Given the limited functionality of metadata features in current web archiving tools, perhaps there is a clearer path forward for approaches that leverage external services and APIs.

Watch for a third post tomorrow describing our recommendations for descriptive metadata. We would be delighted to have your feedback on these reports.

Jackie Dooley

Jackie Dooley retired from OCLC in 2018. She led OCLC Research projects to inform and improve archives and special collections practice.

www.oclc.org/research/people/dooley.html