Thoughts from Partner staff about web archiving

Jackie, Dennis, and I wondered whether there might be something valuable the Partnership could work on that would further web archiving efforts without duplicating others’ initiatives, so we had discussions with partner staff, attended meetings, read up on what others are doing, and then presented some options to the Partnership in the form of a survey.

There were 76 responses from 60 institutions in 6 countries.  We asked respondents to indicate which of five topics they felt were important to advance and which they’d be interested in working on.

The two most important topics – and the two that most people would be willing to work on – are:  [It’s so great when those two things align!]

  • Metadata Guidelines, described as “Web archives often are hidden in silos, making access difficult.  We could work on developing metadata guidelines to bridge the archival and bibliographic traditions so that records for live and/or harvested websites can appear in local catalogs, as well as in WorldCat and other aggregations.”
  • Use of web archives, described as “Not enough is known about use of harvested websites.  We might think about how to study users and potential users of web archives to find out what they want, what they want to do with them, and how they find and navigate what is available.”

We’re going to begin by launching a project related to metadata guidelines.

[If you work at a Partner institution and would like to be added to the web archiving listserv, send an email to Dennis Massie (  If you’d like to work on metadata guidelines, send a message to Jackie Dooley (]

What I want to talk about in this blog post, though, is the open-ended responses to a question asking for additional thoughts.  There were 50 very thoughtful responses that I will summarize here.

Several respondents had suggestions about metadata, including coming up with a way to describe the harvest approach to researchers (e.g., so they’ll know what was selected, how deep the harvest went, and what was not captured).  There were suggestions to explore use of linked data models and standardized vocabularies.  And there was urging to investigate integration of web archiving with existing tools, such as ArchivesSpace and Archivematica.

There were additional thoughts about studying use of web archives.  It was pointed out that there are two very different types of use: one focuses on trends, big data, digital humanities, and social network analysis, and the other is more akin to “traditional” research – essentially looking at archived websites as records of what happened or what was published at a point in time.  Several people urged us to consider whether users would actually use library systems to discover web archives.  Since so many rely on the Internet Archive’s web archives, it was suggested that we study how they are queried and used.

Several additional topics were brought up:

Many stated the need for tools.  There was concern about there being basically only one tool for web harvesting (Archive-It).  Some wished for tools to provide support throughout the workflow: to support automated quality analysis for capture, to make tasks such as description and quality assurance more efficient, and to automate the tasks associated with providing access to web archives.  The other big set of tools wanted was those that go beyond the harvest of static HTML web pages to collect applications, embedded media, video, social media, non-public-facing content on websites, streaming media, and more.  And some expressed the need for tools that would capture structured data before it is transformed to HTML.  There was a wish for a browser plug-in that would inform users as they look at a site whether there is an archived version.

Many advocated for advocacy, citing the need to communicate with website owners about the challenges of capture and ways they can help; addressing the issue of consent and deeds of gift in harvesting others’ web sites; working with the community (owners, users, archivists) on property rights, fair use, and other policy issues; promoting persistent URLs and evolving web archiving standards, such as ResourceSync; working with the Internet Archive to see to what extent they are meeting library needs.

Some suggestions can be grouped under the topic of administrative needs: we should better understand how to make web archiving sustainable; we should share position descriptions; we should promote understanding about the types of investment that would be meaningful; we should explore trade-offs and relationships between large scale archiving efforts and targeted ones; we should improve metrics and assessment to inform financial/staffing allocations; and we should help to build the business case and strategy for a future state for web archiving.

Selection of what is to be archived is a big challenge.  We can help to set evaluation criteria, help with appraisal of sites, and consider how the content of archived web sites affects appraisal decisions for both paper and electronic records in traditional archives.  Respondents wanted help with deciding how deep the capture should go.  Are there efficient approaches to continued archiving of web sites with added portions or connected to web sites with different directory addresses?  What are the options for archiving huge sites?  Another challenge was how to proceed when data or document limits are encountered.  And we should help weigh selectivity against a broad swath approach:

  • Is it worth the trouble of being selective?
  • Is vacuuming it all up ultimately just as helpful?
  • Should each institution take special care for a small subset?
  • When is a scattershot approach acceptable?
  • How can we make scoping rules easier to manage?

Not surprisingly, collaboration was another theme.  Collaboration with web archivists across institutions, disciplines, and borders is necessary to:

  • develop collecting profiles, so we know what other institutions are collecting and we won’t duplicate work being done
  • share workflows, QA procedures, metadata guidelines, and useful tools
  • understand the existing roles of national services and others
  • come up with a master harvest with different institutions archiving based on their particular collection needs
  • encourage and improve use of web archives
  • identify gaps
  • and coordinate our work with other organizations and efforts.

We should also embed web archiving services with a community of practice to:

  • get subject specialists/scholars involved in selecting materials for web archiving
  • get researchers to use web archives in their research
  • get faculty to use web archives in their teaching


So we got what we wanted (identification of two important projects that people are willing to work on) and so much more.  Some of these ideas are ripe for others in the community to take on—and some are no doubt happening elsewhere.  Some may need to percolate a bit more.  And others may be ideas for future activities of the OCLC Research Library Partnership.

We are always happy for more input.  If you’re working on something interesting or have suggestions, let us know!

One Comment on “Thoughts from Partner staff about web archiving”

  1. Wonderful. Thank you for making some of the results available and thanks to your group for taking some of the next steps! Tools to support the stages of the workflow and tools to capture embedded materials would be great to have. When doing a scan of our university websites using Archive-It, it was clear that a lot was missed (e.g. databases, video, and so on) along with the large amount of materials gathered.
    Further down the line, we need to think of how to make this material easily accessible to users.
