Some of you may already know about my “MARC Usage in WorldCat” project, where I simply expose the contents of a number of MARC subfields in ordered lists of strings. The point, as I state on the site itself, is to expose “which elements and subfields have actually been used, and more importantly, how? This work seeks to use evidence of usage, as depicted in the largest aggregation of library data in the world — WorldCat — to inform decisions about where we go from here.”
One aspect of this is the quality, or lack thereof, of the actual data recorded. As an aggregator, we see it all. We see the typos and the added punctuation where none should be. We see the made-up elements and subfields (yes, made up). We see data that is clearly in the completely wrong place in the record (what were they thinking?). We see it all.
So this week when I received a request for a specific report, as sometimes happens, I was happy to comply. The correspondent wanted to see the contents of the 775 $e subfield, which, according to the documentation, should contain only a “language code”. Catalogers know that you can’t make these up; they must come from the Library of Congress’ MARC Code List for Languages.
Sounds simple, right? If you encode a language in the 775 $e, it must come from that list. But that doesn’t prevent catalogers from embellishing (see all the variations for “eng” below and the number of times they were found; this does not include variations like “anglais”). Why not add punctuation? Or additional information, such as “bilingual”? I’ll tell you why not. Because it renders the data increasingly unusable without normalization.
And normalization comes at a cost. Some normalization, such as removing punctuation, is straightforward. But at some point the cheapest option is simply to throw the value away. If a string only occurs once, how important can it be?
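To make that trade-off concrete, here is a minimal sketch in Python of the kind of cheap normalization I mean. It is not the code behind the report; the function names are invented for illustration, and KNOWN_CODES is a tiny stand-in for the real MARC Code List for Languages, which is far longer. The idea is to strip surrounding punctuation and brackets, keep the value if what remains is a known code, and set it aside otherwise.

```python
from collections import Counter

# Assumption: a tiny illustrative subset of the MARC Code List for Languages.
KNOWN_CODES = {"eng", "fre", "ger", "spa"}

# Characters treated as removable decoration around a code.
WRAPPERS = "()[]., \t"


def normalize_language_code(raw):
    """Return a cleaned code, or None if cheap cleanup does not yield a known code."""
    candidate = raw.strip(WRAPPERS).lower()
    if candidate in KNOWN_CODES:
        return candidate
    return None


def tally(pairs):
    """Sum counts per normalized code; collect values that could not be salvaged.

    `pairs` is an iterable of (count, raw_string) tuples.
    """
    kept = Counter()
    discarded = Counter()
    for count, raw in pairs:
        code = normalize_language_code(raw)
        if code is None:
            discarded[raw] += count
        else:
            kept[code] += count
    return kept, discarded


if __name__ == "__main__":
    sample = [(52861, "eng"), (1249, "eng."), (400, "(eng)"), (3, "eeng"), (1, "bilingual eng")]
    kept, discarded = tally(sample)
    print(kept)
    print(discarded)
```

Under these assumptions, values like “eng.” and “(eng)” fold into “eng”, while “eeng” and “bilingual eng” land in the discard pile. Rescuing those would take typo correction or guessing at intent, which is exactly where the cost starts to climb.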
As we move into a more fully machine-supported world for library metadata, we will be facing more of these choices. Some will be harder than others. If you don’t believe me, just check out what we have to do with dates.
52861 eng
1249 eng.
400 (eng)
20 (eng.)
12 (eng).
3 eeng
2 [eng]
1 feng
1 eng~w(CaOOP) a472415
1 engw(CaOOP) a459037
1 engw(CaOOP) a371268
1 engw(CaOOP) 1-181456
1 engw(CaOOP) 01-0314275
1 engw(CaOOP) 01-0073869
1 enge
1 eng..
1 eng,
1 eng(CaOOP) a359090
1 eng(CaOOP) 1-320212
1 eng$x0707-9311
1 bilingual eng
1 (eng),
Photo by Suzanne Chapman, Creative Commons license CC BY-NC-SA 2.0
Roy Tennant works on projects related to improving the technological infrastructure of libraries, museums, and archives.
In the Consortium of European Research Libraries, where we build up the Heritage of the Printed Book db (6 million records this month), we also find that data conversions and system migrations have caused a lot of havoc… that we aim to put right, in order to ensure the best search and retrieval experience that we can hope for.
Part of the problem: our data entry programs usually don’t validate variable fields, even for subfields like 775 $e which are supposed to have controlled values. Fixed fields like the 00x do have validation, generally.
Does Connexion allow catalogers to add invalid values in controlled subfields like 775 $e?
Validation only goes so far, of course – no help with millions of records which may already have bad data, and no way to stop catalogers from accidentally entering the *wrong* valid value… but it should be relatively cheap and easy to add to cataloging programs, and would prevent a lot of new problems from being created.
Andy,
I checked with my colleagues who said:
“Connexion validation will only permit valid language codes in 775 $e. If the contents of that subfield is longer or shorter than 3 characters, validation generates an error message. If the code in that subfield is not in the list of MARC language codes, validation generates an error message.”
and
“If the question arises as to how we ended up with so many variations in WorldCat given that we do validate that subfield in Connexion, the answer is that validation for batchload is not as strict as that for Connexion and allows addition of records with variations when the errors are minor.”
I hope this clarifies the situation.