This is the second of three posts about a lightning talk session at SAA. Part 1 began with descriptions of the array of media an archives might confront and an effort to test how much can be done in house.
Part 2 picks up with four archivists talking about solutions to particularly challenging formats.
Abby Adams is the Digital Archivist at Hagley Museum & Library, an independent research library in Wilmington, Delaware, documenting American enterprise from its inception to present day with a focus on the intersection of industry, technology, and society. In 2012, Hagley received a large hybrid collection, consisting primarily of textual analog materials, in addition to a number of born-digital records. The records were created by various tech corporations during the normal course of business in the late 1990s and early 2000s and document aspects of the dot-com boom and bust, an area of research where primary sources are sorely lacking. Given the potentially high research value of the collection, Adams gave the preservation of the born-digital content high priority and culled hundreds of records cartons to discover the following obsolete media formats: 349 compact discs; 134 3.5” floppy disks; 113 digital linear tapes (DLT); 49 digital data storage tapes (DDS); 19 quarter-inch mini cartridges; 15 Travan cartridges; and 8 zip disks.
Although the CDs and floppy disks presented few problems, the remaining obsolete formats offered a lesson in how complex data recovery can be. Adams’ attempts to use “freecycled” drives and jerry-rig old PCs were just not working. Even if she could connect a computer to the exact generation DLT or DDS drive to read the tapes, she would also need to know the software program used to create the backup, which could vary widely depending on the date of creation, then successfully install it, and cross her fingers the media isn’t encrypted or corrupt. Since Hagley is a small shop with limited in-house resources, it was clear outsourcing the data extraction was the best course of action. After consulting several vendors, Adams and her coworker Kevin Martin found a company that specializes in data extraction and indexing of backup tapes. After establishing a budget for the first phase of the project, Adams and Martin sent the vendor a sample consisting of five DLT and three DDS tapes. Less than a week later, the vendor provided them access to the indexed data from seven out of eight tapes. Due to the size of the collection and Hagley’s limited in-house resources, Adams was strict with appraisal, retaining only about ten percent of the data. The original media was returned to Hagley a few weeks later. Having successfully completed the first phase of the project, Hagley will continue to use the same company for the remaining backup tapes.
Elise Warshavsky, is the Digital Archivist at the Presbyterian Historical Society, which serves as the national archives of the Presbyterian Church, documenting the political and social history of the church. The archives acquired the laptop of Clifton Kirkpatrick former Stated Clerk, the highest elected official within the church. The laptop contained files he had worked on as well as his email. Five years later Elise was hired and was asked to archive the Stated Clerk’s laptop. This was the nature of the “detailed instructions” she received regarding passwords, the types of files, and that there were 28,000 emails in the Novell GroupWise account:
The records manager who had originally received the laptop had converted the account to a Remote account enabling the email to live solely on the laptop. The records archivist had also reorganized the inbox and appraised each individual email, resulting in lost folder structure and possibly other lost metadata. The emails were readable, but because of a 50-year embargo on access to them, the goal was to ensure that these files would be readable in 50 years. After not being able to find a way to convert the GroupWise Remote email to another format, she finally contacted a company that makes a commercial grade email converter called Transend. They agreed to resurrect the Remote account on their GroupWise servers and then convert it to .pst, Microsoft’s open proprietary file format. Then she was able to move forward with her migration plan: convert to a more archival email format, .MBOX, as well as run a tool to batch export PDFs from each individual email and convert them to PDF/As – a format researchers would be able to search and access in 50 years.
Elise’s advice: If you get frustrated about not having the tools or skills necessary to complete a project, reach out to find help. There’s no need to develop resources in house when dealing with a unique, most likely not repeatable incident. Get help, and move on to doing what you do know how to do – accession, appraise, and preserve.
Ted Hull, Director of the Electronic Records Division at the National Archives at College Park, told of a project to recover content from 7-track tapes.
The Electronic Records Division accessions, processes, arranges for preservation, describes, and provides access to the born-digital federal records scheduled for permanent retention in the National Archives. They hold 932 series from over 100 federal agencies; consisting of over 750 million unique files and over 320 terabytes of data. 7-track magnetic tape was an industry standard from the 1950s -1970s, when it was generally replaced with 9-track magnetic tape. While most of the Archives’ content had already been transferred off of 7-track tape, in 2013, staff identified 13 remaining tapes containing records from the Federal Home Loan Bank Board, the Bureau of Indian Affairs, and the U.S. Joint Chiefs of Staff. The Archives reached out and found that the National Center for Atmospheric Research (NCAR) in Boulder, CO still had the capability to read 7-track tapes and were able to recover data from 9 tapes; the other 4 were blank. NCAR converted the binary-coded decimal encoding to ASCII and made the files available to NARA for direct download from their FTP site; NARA processed and accessioned the records and the original tapes were returned to NARA for disposal.
Ben Goldman, the Digital Records Archivist for Pennsylvania State University Libraries, discovered 27 3-inch disks in a modern literary manuscript collection. They didn’t have the equipment needed to read the disks, and we weren’t even sure if the disks were readable or even contained data worth recovering.
Amstrad disk from the Fiona Pitt-Kethley papers, Penn State University Special Collections Library
The author confirmed that she did own an Amstrad computer (a somewhat popular computer in the UK for a brief period in the 1980s), but because Ben didn’t know exactly what hardware or software was needed to read the disks, he decided to outsource recovery of the disks. He wanted to use the opportunity to come up with a model vendor agreement and to make the project an extension of their internal born-digital workflow. To that end, he created a media inventory spreadsheet to be used to identify the disks, their labels, their contents, the images derived from them, and to accommodate checksums after their eventual transfer. Mostly, however, he wanted to see if outsourcing was a viable option for archivists confronting elusive computer media formats and to see if core archival requirements could be met by outsourcing, whether service providers could adhere to emerging best practices, and to see if the costs were viable for archives. PSU provided funding for a project at $40 per disk.
[Tweet] Jason P. Evans Groth: $40/disk is same as person making $40k spending two hours to image obsolete disks, so maybe it is the right deal? #s601 #saa14
Soon Ben had a signed vendor agreement with the Museum of Computer Culture to provide disk images that could be processed using forensic tools. They were to work from the inventory and follow naming conventions and provide checksums to ensure accurate transfer.
Many months later, however, Ben was working with two other vendors – without a signed agreement. They found that disk images that were native to the Amstrad operating system couldn’t be migrated to modern formats or processed using common forensic tools. Instead, Ben received three versions of every file in three different formats, each with its own brand of lossy-ness and, in the end, there was no adherence to naming conventions and no checksums. Despite not really meeting his expectations, Ben doesn’t think of the project as a failure. “Fugitive media is defiant,” he warns. Communication is key and the vendor agreement should establish communication requirements. Beyond that, Ben is not sure this cost model will be sustainable. Instead, he suggests that archivists need to develop in-network options. There are technologies, resources, and talented people working on these issues. It would be nice to see some better community strategies for tackling the issues and supporting each other.
Next up: Part 3 will continue with three speakers representing the service provider point of view.
Ricky Erway, Senior Program Officer at OCLC Research, worked with staff from the OCLC Research Library Partnership on projects ranging from managing born digital archives to research data curation. Ricky left OCLC in 2015.