RLG and web archiving

Continuing my walk down memory lane….

I attended the Coalition for Networked Information Spring 2006 Task Force Meeting and gave a project briefing titled "Archiving and Preserving the Web: Future Directions and Applications." [I am hopeful that our presentations will be linked from the abstract soon.] I was joined by two colleagues from the Internet Archive, Dan Avery and Kristine Hanna.

My part in the presentation was to give background on the importance of web archiving (capturing today what will surely be the basis of tomorrow’s scholarship), and to give an overview of the RLG members that are currently engaged in web archiving. Most of these institutions are national archives, national libraries, and other organizations, such as the Library of Congress, that are doing web archiving in a big way. However, a broad range of smaller member institutions, such as Indiana University, Swarthmore, and University of Toronto, are also interested in web archiving. Some institutions are interested in archiving their own web domain; others are interested in saving web sites that represent a particular subject area.

Because RLG members, both large and small, are interested in web archiving, we’ve launched a program on web archiving. I’ll get back to that in a moment, but first I want to say how some of our smaller institutions are stepping up to the daunting task of archiving the web. Those that are not able or ready to step up to web archiving on their own (or that might want to get their feet wet slowly) might be interested in taking a look at Archive-It, a new product of the Internet Archive. Archive-It makes it possible, at a fairly modest price, to get started with web archiving without a lot of technical expertise or investment. Because Archive-It makes use of the same open source tools that many of the big dogs are using (Heritrix for web crawling and creating ARC files, NutchWAX for searching, etc.), it’s possible to start out using Archive-It and then later switch to your own internally hosted service.

Once an institution gets up and running with web archiving, there are still a lot of issues beyond the technical: description, sharing, collecting, end-user issues, etc. RLG’s web archiving program is bringing together large and small institutions that are working on web archiving to start documenting best practices and procedures for the rest of the community to use. If you are at an RLG member institution and want to participate, please let us know! More information about RLG’s web archiving program can be found here.

Incidentally, all of the podcasts from the CNI Spring Task Force Meeting are now available.