Some of you may already know about my “MARC Usage in WorldCat” project, where I simply expose the contents of a number of MARC subfields in ordered lists of strings. The point, as I state on the site itself, is to expose “which elements and subfields have actually been used, and more importantly, how? This work seeks to use evidence of usage, as depicted in the largest aggregation of library data in the world — WorldCat — to inform decisions about where we go from here.”
One aspect of this is the quality, or lack thereof, of the actual data recorded. As an aggregator, we see it all. We see the typos and the added punctuation where none should be. We see the made-up elements and subfields (yes, made up). We see data that is clearly in the completely wrong place in the record (what were they thinking?). We see it all.
So this week when I received a request for a specific report, as sometimes happens, I was happy to comply. The correspondent wanted to see the contents of the 775 $e subfield, which, according to the documentation, should only contain a “language code”. Catalogers know that you can’t make these up; they must come from the Library of Congress’ MARC Code List for Languages.
Sounds simple, right? If you encode a language in the 775 $e, it must come from that list. But that doesn’t prevent catalogers from embellishing (see all the variations for “eng” below and the number of times they were found; this does not include variations like “anglais”). Why not add punctuation? Or additional information, such as “bilingual”? I’ll tell you why not. Because it renders the data increasingly unusable without normalization.
And normalization comes at a cost. Easy normalization, such as removing punctuation, is straightforward. But at some point the easiest thing to do is to simply throw it away. If a string only occurs once, how important can it be?
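To make that trade-off concrete, here is a minimal sketch in Python of what "easy normalization, then throw the rest away" could look like. It is an illustration only, not the code behind the MARC Usage in WorldCat reports. The names normalize_language and tally are hypothetical, and VALID_CODES is a tiny stand-in for the full MARC Code List for Languages.

```python
import re
from collections import Counter

# Assumption: a tiny stand-in for the MARC Code List for Languages.
# A real run would load the complete list.
VALID_CODES = {"eng", "fre", "ger", "spa"}


def normalize_language(raw):
    """Return a valid language code, or None if the string is not salvageable.

    Only the cheap fixes are attempted: trim whitespace, lowercase,
    and strip punctuation from both ends of the string.
    """
    cleaned = raw.strip().lower()
    # Remove any run of non-alphanumeric characters at the start or end.
    cleaned = re.sub(r"^[\W_]+|[\W_]+$", "", cleaned)
    return cleaned if cleaned in VALID_CODES else None


def tally(values):
    """Count normalized codes, and separately count the strings we gave up on."""
    kept = Counter()
    discarded = Counter()
    for raw in values:
        code = normalize_language(raw)
        if code is None:
            discarded[raw] += 1
        else:
            kept[code] += 1
    return kept, discarded


if __name__ == "__main__":
    sample = ["eng", "eng.", "Eng;", "[eng]", "bilingual eng", "eng(CaOOP) a359090"]
    kept, discarded = tally(sample)
    print(kept)
    print(discarded)
```

In this sketch, strings like “eng.” or “[eng]” are rescued by the cheap fixes, while “bilingual eng” and the strings with trailing control numbers land in the discarded pile. Recovering those would take more rules, and each new rule is more cost for fewer records.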
As we move into a more fully machine-supported world for library metadata, we will be facing more of these choices. Some will be harder than others. If you don’t believe me, just check out what we have to do with dates.
1 eng~w(CaOOP) a472415
1 engw(CaOOP) a459037
1 engw(CaOOP) a371268
1 engw(CaOOP) 1-181456
1 engw(CaOOP) 01-0314275
1 engw(CaOOP) 01-0073869
1 eng(CaOOP) a359090
1 eng(CaOOP) 1-320212
1 bilingual eng
Photo by Suzanne Chapman, Creative Commons license CC BY-NC-SA 2.0
Roy Tennant works on projects related to improving the technological infrastructure of libraries, museums, and archives.