Summarized: rough and ready text conversion methodologies

A few weeks ago I put out a request (here and on some listservs) asking how institutions are converting finding aids or other metadata from paper to electronic form. This post summarizes the responses.

Several institutions are starting the process the same way: they are using a sheet feeder to scan documents, saving the resulting file as a PDF, and then using Adobe Acrobat Pro 9 to OCR the text. One respondent reported that the “OCR capability on the full Adobe version is really quite good for typewritten documents. I was pretty surprised.”

From there, many institutions go on to mark up the resulting document in EAD, after spell checking, “sense checking,” removing white space, etc. EAD markup is done using a variety of templates. Sometimes institutions use a combination of templates: Word or OpenOffice for the “wordy bits” of the finding aid (<did>, <bioghist>, <scopecontent>, etc.) and Excel templates for the container list or other information contained in the <dsc>. Many of these tools are described in our 2010 report, Over, Under, Around, and Through: Getting Around Barriers to EAD Implementation.
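To make the spreadsheet-to-<dsc> step a little more concrete, here is a rough sketch of what such a conversion could look like in Python. It is not the method any respondent described, and it is not how the templates in the report work. The column names (box, folder, title, date), the sample rows, and the choice to treat every row as a file-level <c01> are all assumptions made for illustration; the element names come from EAD 2002.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_dsc(rows):
    """Build an EAD <dsc> element from container-list rows (dicts).

    Assumes hypothetical columns: box, folder, title, date.
    Every row becomes a file-level <c01>; real lists may need series nesting.
    """
    dsc = ET.Element("dsc")
    for row in rows:
        c01 = ET.SubElement(dsc, "c01", level="file")
        did = ET.SubElement(c01, "did")
        # One <container> per non-empty box/folder cell.
        for kind in ("box", "folder"):
            value = (row.get(kind) or "").strip()
            if value:
                ET.SubElement(did, "container", type=kind).text = value
        # Collapse stray runs of white space left over from OCR or typing.
        title = " ".join((row.get("title") or "").split())
        ET.SubElement(did, "unittitle").text = title
        # Only emit <unitdate> when the spreadsheet actually has a date.
        date = (row.get("date") or "").strip()
        if date:
            ET.SubElement(did, "unitdate").text = date
    return dsc


# Made-up sample data standing in for a spreadsheet saved as CSV.
SAMPLE_CSV = """box,folder,title,date
1,1,Correspondence,1901-1905
1,2,  Ledger   pages ,
"""


def csv_to_dsc_xml(csv_text):
    """Parse CSV text and return the <dsc> fragment as an XML string."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return ET.tostring(rows_to_dsc(rows), encoding="unicode")


if __name__ == "__main__":
    print(csv_to_dsc_xml(SAMPLE_CSV))
```

The resulting fragment would still need to be pasted into a complete EAD document and validated, and the “sense checking” step remains a job for a person.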

Other institutions have had students key information from paper documents (usually short finding aids) directly into an EAD template.

Another approach, taken by the Louisiana Research Center at Tulane University, is nicely described by Eira Tansey’s poster, presented at the 2011 Society of American Archivists meeting. Here Tulane approaches their hidden description problem in stages, first making a basic MARC record available along with a PDF of the finding aid (allowing for basic discovery), and then moving fuller descriptions into Archon with the help of a vendor.

Speaking of handy tools, I was also pointed to Adrianna Del Collo’s nifty tools for preparing text to be imported into the Archivists’ Toolkit (see the links on this page).

I should note that none of the institutions that self-reported are what I would consider small institutions; indeed, they are all ARL member libraries. However, I do think that the nearly uniform use of sheet feeders and Adobe Acrobat is an encouraging development for institutions hoping to undertake text conversion.

Do you have a different technique to report? Doing something similar? Please do email or leave a comment below!

  1. If any of your readers are in Louisiana, I will be expanding my poster into a paper at the Louisiana Archives and Manuscripts Association conference on November 11. A link to that paper will be posted on my website soon after.

