We all know that over the past 30+ years the World Wide Web has become an indispensable tool (understatement!) for disseminating information, extending the reputations of organizations and businesses, enabling Betty the Blogger to establish an international reputation, and ruining dinner table debate by providing the answer to every conceivable question. It has caused a sea change in how humans communicate and learn. Some types of content are new, but huge quantities of material once published in print are now issued only in bytes. For example, if you’re a university archivist, you know that yesterday’s endless flood of high-use content such as graduate program brochures, course listings, departmental newsletters, and campus information dried up a decade or more ago. If you’re a public policy librarian, you know that the enormously important “grey literature” once distributed as pamphlets is now found mostly on the web. Government information? It’s almost all e-only. In addition, the scope of the scholarly record is evolving to embrace new types of content, much of which is also web-only. Without periodic harvesting of the websites that host all this information, the content is gone, gone, gone. In general, we’ve been very slow to respond to this imperative. Failure to adequately preserve the web is at the heart of the Digital Dark Ages.
The Internet Archive’s astonishing Wayback Machine has been archiving the web since the mid-1990s, but its content is far from being complete or reliable, and searching is possible only by URL. In some countries, such as the U.K. and New Zealand, the national library or archives is charged with harvesting the country’s entire web domain, and they struggle to fulfill this charge. In the U.S., some archives and libraries have been harvesting websites for a number of years, but few have been able to do so at scale. Many others have yet to dip their toes in the water. Why do so many of us lack a sense of urgency about preserving all this content? Well, for one thing, web archiving is rife with challenges.
Within the past week Ricky, Dennis, and I hosted two Webex conversations with members of our OCLC Research Library Partnership to surface some of the issues that are top-of-mind for our colleagues. Our objective was to learn whether there are shared problems that make sense for us to work on together to identify community-based solutions. All told, more than sixty people came along for the ride, which immediately suggested that we had touched a nerve. In promoting the sessions, we posited ten broad issues and asked registrants to vote for their top three. The results of this informal poll gave us a good jumping-off point. Master synthesizer Ricky categorized the issues and counted the aggregate votes for each: capture (37), description (41), and use (61). (I confess to having been glad to see use come out on top.)
OK, take a guess … what was the #1 issue? Not surprisingly … metadata guidelines! As with any type of cataloging, no one wants to have to reinvent the wheel themselves. Guidelines do exist, but they don’t meet the needs of all institutions. #2: Increase access to archived websites. Many sites are archived but are not then made accessible, for a variety of good reasons. #3: Ensure capture of your institution’s own output. If you’re worried about this one, you should be. #4: Measure access to archived websites. Hard to do. Do you have an analytics tool that can ever do what you really want it to?
Other challenges received some votes: getting descriptions of websites into local catalogs and WorldCat, establishing best practices for quality assurance of crawls, collaborating on selection of sites, and increasing discovery through Google and other search engines (we were a tad mystified about why this last one didn’t get more votes). Some folks offered up their own issues, such as capture of file formats other than HTML, providing access in a less siloed way, improving the end-user experience, sustaining a program in the face of minimal resources, and developing convincing use cases.
When we were done, Ricky whipped out a list of her chief off-the-cuff takeaways, to wit:
- We need strong use cases to convince resource allocators that this work is mission-critical.
- Let’s collaborate on selection so we don’t duplicate each other’s work.
- Awareness of archived websites is low across our user communities: let’s fix that.
- In developing metadata guidelines, we should bridge the differing approaches of the library and archival communities.
- We need meaningful use metrics.
- We need to know how users are navigating aggregations of archived sites and what they want to do with the content.
- Non-HTML file formats are the big capture challenge.
Our Webex conversations were lively and far-ranging. Because we emphasized that we needed experienced practitioners at the table, we learned that even the experts responsible for large-scale harvesting struggle in various ways. Use issues loomed large: no one tried to claim that archived websites are easy to locate, comprehend, or use. Legal issues are sometimes complex, depending on the sites being crawled. Much like ill-behaved serials, websites change title, move, split, and disappear without warning. Cataloging at the site or document level isn’t feasible if, like the British Library, you crawl literally millions of sites. Tools for analytics are too simplistic for answering the important questions about use and users.
Collecting, preserving, and providing access to indispensable informational, cultural, and scholarly content has always been our shared mission. The web is where today’s content is. Let’s scale up our response before we lose more decades of human history.
What are your own web archiving challenges? Let us know by submitting a comment below, or get in touch by whatever means you prefer so we can add your voice to the conversation. We’re listening.