Mining the archives

November 12th, 2008 by Merrilee

As part of our ongoing effort to highlight the work we’re doing in our new Archives and Special Collections program, I thought I’d talk a little about our current project mining data out of WorldCat.

In the Define the State of Holdings and Description for Archives Project (which I often refer to as our data mining project), we are looking at archival descriptive practice as represented in WorldCat. As a first step, this will be an analysis of 1.7 million MARC records under archival control. Right now, we are seeing how well these records match up with the recommendations for single-level minimum and single-level optimum description in DACS. This will be a little tricky: DACS is a data content standard, which says what information a description should contain, while MARC is a data format standard, which says how that information is encoded, so mapping DACS elements onto MARC fields requires some interpretation. We’d also like to go beyond reporting on field/indicator/subfield usage; by sampling content, we may be able to say something about the characteristics of the data as well.
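
For readers who want a concrete picture of the field/indicator/subfield tally, something along the following lines would do it, written here in Python with the pymarc library. The crosswalk from DACS elements to MARC fields and the file name are simplified placeholders for illustration; this is a sketch of the approach, not our actual working code.

# A rough sketch: tally MARC field/indicator/subfield usage across records
# under archival control, and check for fields that commonly carry DACS
# single-level minimum elements. The crosswalk below is a simplified
# placeholder, not a full DACS-to-MARC mapping.
from collections import Counter
from pymarc import MARCReader

DACS_MINIMUM = {
    "title": ["245"],
    "date": ["245"],                  # 008 dates could also be checked
    "extent": ["300"],
    "creator": ["100", "110", "111"],
    "scope_and_content": ["520"],
    "access_conditions": ["506"],
    "language": ["546"],
}

field_usage = Counter()      # tag -> occurrences
indicator_usage = Counter()  # (tag, ind1, ind2) -> occurrences
subfield_usage = Counter()   # (tag, subfield code) -> occurrences
element_present = Counter()  # DACS element -> records with a mapped field
total = 0

with open("archival_records.mrc", "rb") as fh:   # placeholder file name
    for record in MARCReader(fh):
        if record is None:              # skip records pymarc cannot parse
            continue
        if record.leader[8] != "a":     # Leader/08 'a' = archival control
            continue
        total += 1
        tags = set()
        for field in record.get_fields():
            field_usage[field.tag] += 1
            tags.add(field.tag)
            if not field.is_control_field():
                ind1, ind2 = field.indicators
                indicator_usage[(field.tag, ind1, ind2)] += 1
                for code in field.subfields_as_dict():
                    subfield_usage[(field.tag, code)] += 1
        for element, candidates in DACS_MINIMUM.items():
            if tags & set(candidates):
                element_present[element] += 1

for element in DACS_MINIMUM:
    pct = 100.0 * element_present[element] / max(total, 1)
    print(f"{element}: present in {pct:.1f}% of {total} records")

Of course, the presence of a mapped field says nothing about the quality of what’s inside it, which is exactly why we also want to sample content.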

We are far from done (in fact, we’ve only recently started), but we should have preliminary results to share soon, including information about dates of creation, encoding levels, the geographic location of materials, and more. I will be presenting those findings, along with an overview of the project, at the PACSCL conference “Something Old for Something New,” and will report back on this blog in early December.
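
To give a sense of what those data points look like at the record level, here is another small pymarc sketch, pulling the encoding level from Leader/17 and the coded dates and place from the 008 field. Again, the file name is a placeholder, the code is illustrative only, and the 008 place code is just one possible proxy for geographic information; the actual analysis may draw on other fields.

# A small sketch of record-level data points: encoding level (Leader/17)
# plus the dates and place coded in the 008 control field. Positions
# 008/00-17 are common to all MARC material types.
from collections import Counter
from pymarc import MARCReader

encoding_levels = Counter()
decades = Counter()
places = Counter()

with open("archival_records.mrc", "rb") as fh:   # placeholder file name
    for record in MARCReader(fh):
        if record is None:
            continue
        encoding_levels[record.leader[17]] += 1
        fields_008 = record.get_fields("008")
        if fields_008 and len(fields_008[0].data) >= 18:
            data = fields_008[0].data
            date1 = data[7:11]                   # 008/07-10: Date 1
            if date1[:3].isdigit():
                decades[date1[:3] + "0s"] += 1   # bucket by decade
            places[data[15:18]] += 1             # 008/15-17: place code

print("Encoding levels:", dict(encoding_levels))
print("Most common decades:", decades.most_common(5))
print("Most common place codes:", places.most_common(5))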

Further down the line, I would like to see us do a similar analysis of the EAD records represented in ArchiveGrid, but this will be tougher (given the hierarchical, multi-level nature of EAD) and perhaps less impactful (given the relative scarcity of EAD records compared to MARC records).

I should also note that we are open to other investigations, so if you have research questions, let’s hear them!

One Response to “Mining the archives”

  1. Mark Matienzo Says:

    Based on a few conversations I had at the MCN conference last week, I’ve been thinking about some similar questions. A big one that arose for me is whether we should be thinking about creating some sort of research corpus for metadata, much as linguists have text and speech corpora.

    As far as the analysis against DACS goes, what direction are you thinking of taking technically? Something similar to the MARC::Errorchecks module for Perl might make sense, but not without significant adaptation.