More From the “Murky Bucket”

The inspiration for my title comes from Lorcan Dempsey, who some years ago, before I joined him at OCLC, put a name to the unease I had been feeling about the state of library metadata. In a Library Journal column I had bemoaned the fact that not only was it impossible for library users to limit a search to items available online in full, it was impossible for us to even implement such a feature.

Lorcan responded to that column, citing the “‘murky bucket syndrome’ that affects any large bibliographic database—we cannot entirely, unambiguously slice and dice the database because of historic data entry and cataloging practices that…were not oriented toward our new needs.” I’ll say. Also, around that time my soon-to-be colleagues at OCLC Research wrote a paper about some related work they had done: “Mining for Digital Resources: Identifying and Characterizing Digital Materials in WorldCat”.

Later I did a deeper investigation into this while still at the California Digital Library, from which came an informal report called “Trouble in Online Paradise: An Analysis of MARC 856 Usage at One Institution”. Basically, I took 1,000,000 MARC records from UC Berkeley, pulled out all of the 856 fields (about 20,000 at the time), and analyzed them. Since I have that work on my prototype server, you can still play around with it if you want.

Enough background. Now we are working to do something about this. As part of this work, we have been taking a closer look at the nearly 40 million 856 fields in WorldCat. We’re doing a number of things, but what I want to demonstrate in this post is just how much murk lurks in the bucket.

An 856 field can carry one or more subfields. One of them is $3, which specifies the part of the described materials to which the field (in this case, a URL) applies. That is why it is of great interest to us: by looking in that subfield we may be able to detect whether a URL points to the full resource or only to a table of contents, review, or other part. So one thing we’ve been doing is parsing out this subfield, counting up the occurrences of identical strings, and printing out a report. What you see below is a small sample of the various ways “Table of contents” is depicted in this subfield:

Table of contents
Table of contents:
Table of contents only
Detailed table of contents.
Table of Contents
Table of contents online
Complete table of contents:
Table of Contents:
Table of contents of current issue
Table of contents only:
Linked table of contents
Journal table of contents
Table of contents available online:
Journal table of contents:
English table of contents
View table of contents only
Full Table of Contents
Journal issue table of contents
Table of contents at
Table of Contents for current and some back issues
Table of contents for graphics version
Free access, table of contents
Table of content
Full table of contents
Table of contents for text only version
Access table of contents online
Journal electronic table of contents
Link to table of contents (Inhalt)
access to table of contents
Table of contents page:
Table of contents for most recent issue
Table of contents :
Table of contents pages:

These are just the instances of what could basically be “Table of contents” — if you add in all of the variations like “Table of contents and abstract” or “Table of contents and publisher’s description”, you quickly get a sense of how many potential variations of strings we’re talking about.
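
For readers who want to try this kind of count on their own data, here is a minimal sketch in Python. It is not the code we ran against WorldCat. It assumes the $3 values have already been extracted into a plain list of strings, and the function names (`count_subfield_3`, `toc_variants`) and the sample data are made up for illustration.

```python
from collections import Counter


def count_subfield_3(values):
    """Count occurrences of identical $3 strings, exactly as entered."""
    return Counter(values)


def toc_variants(counts):
    """Return (string, count) pairs that mention a table of contents.

    The match ignores case and also catches the singular "Table of content".
    Results are sorted most frequent first, then alphabetically.
    """
    hits = [(s, n) for s, n in counts.items() if "table of content" in s.lower()]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical sample; real input would be the $3 values pulled from 856 fields.
sample = [
    "Table of contents",
    "Table of contents",
    "Table of Contents:",
    "Table of content",
    "access to table of contents",
    "Publisher description",
]

counts = count_subfield_3(sample)
report = toc_variants(counts)
for text, n in report:
    print(f"{n:>6}  {text}")
```

Because the count is over identical strings, every difference in capitalization, punctuation, or wording gets its own row in the report, which is exactly how a list like the one above comes about.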

This is what happens when there is not a controlled list of terms. This is what happens when catalogers lack guidance on what to enter to describe a particular situation. This is what makes for a murky bucket. And this is why we are in the sad state we are in now.

On a related note, while doing this work I discovered over 1,600 instances of the 856 field that had a pipe symbol “|” in the second indicator position, which is clearly a typographical error. Those were sent off to the WorldCat Quality Control Team to be fixed, but they are yet another indicator of the murkiness of our very large bucket.
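
A check like the one described above is easy to sketch. The following Python is an illustration only, not our actual process. It assumes each 856 field has already been reduced to a record identifier plus its two indicator characters. The `Field856` type, the function name, and the sample values are all hypothetical.

```python
from typing import NamedTuple


class Field856(NamedTuple):
    """Hypothetical, simplified view of one 856 field."""
    record_id: str
    ind1: str
    ind2: str


def fields_with_pipe_ind2(fields):
    """Return the fields whose second indicator is a pipe character."""
    return [f for f in fields if f.ind2 == "|"]


# Made-up sample data for illustration.
sample_fields = [
    Field856("rec-1", "4", "0"),
    Field856("rec-2", "4", "|"),
    Field856("rec-3", "4", "1"),
    Field856("rec-4", "4", "|"),
]

suspect = fields_with_pipe_ind2(sample_fields)
for f in suspect:
    print(f.record_id)
```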

Other OCLC staff involved in some part of this work include Brian Lavoie, Eric Childress, Ed O’Neill, Jay Weitz, Jasmine deGaia, and Laura Endress.