Archive for the 'Renovating Descriptive Practice' Category

Authority for Establishing Metadata Practice

Monday, April 7th, 2014 by Karen

A metadata flow diagram
That was the topic discussed recently by OCLC Research Library Partners metadata managers. Carlen Ruschoff (U. Maryland), Philip Schreur (Stanford) and Joan Swanekamp (Yale) had initiated the topic, observing that libraries are taking responsibility for more and more types of metadata (descriptive, preservation, technical, etc.) and its representation in various formats (MARC, MODS, RDF). Responsibility for establishing metadata practice can be spread across different divisions in the library. Practices developed in relative isolation can have unforeseen consequences for discovery, where metadata created under different practices ends up in awkward juxtapositions.

The discussion revolved around these themes:

Various kinds of splits create varying metadata needs. Splits identified included digital library vs. traditional; MARC vs. non-MARC; projects vs. ongoing operations. Joan Swanekamp noted that many of Yale’s early digitization projects involved special collections which started with their own metadata schemes geared towards specific audiences. But the metadata doesn’t merge well with the rest of the library’s metadata, and it’s been a huge amount of work to try to coordinate these different needs. There is a common belief in controlled vocabularies even when the purposes are different. The granularity of different digital projects makes it difficult to normalize the metadata. Coordination issues include using data elements in different ways, not using some basic elements, and lack of context. Repository managers try to mandate as little as possible to minimize the barriers to contributions. As a result, there’s a lot of user-generated metadata that would be difficult to integrate with catalog data.

Metadata requirements vary due to different systems, metadata standards, and communities’ needs. Some digital assets are described using MODS (Metadata Object Description Schema) or VRA. Graphic arts departments need to find images based on subject headings, which may result in what seems to be redundant data. There’s some tension between the needs of specific areas and general needs. Curators for specific communities such as music and divinity have a deeper sense of what their respective communities need than of what’s needed in a centralized database. Subject headings that rely on keyword or locally devised schemes can clash with the LC subject headings used centrally. These differences and inconsistencies have become more visible as libraries have implemented discovery layers that retrieve metadata from across all their resources.

Some sort of “metadata coordination group” is common.  Some libraries have created metadata coordination units (under various names), or are planning to. Such oversight teams provide a clearing house to talk about depth, quality and coverage of metadata. An alternative approach is to “embed” metadata specialists in other units that create metadata such as digital library projects, serving as consultants. After UCLA worked on ten different digital projects, it developed a checklist that could be used across projects: Guidelines for Descriptive Metadata for the UCLA Digital Library Program (2012). It takes time to understand different perspectives of metadata: what is important and relevant to curators’ respective professional standards.  It’s important to start the discussions about expectations and requirements at the beginning of a project.

We can leverage identifiers to link names across metadata silos. As names are a key element regardless of which metadata schema is used, we discussed the possibility of using one or more identifier systems to link them together. Some institutions encourage their researchers to use the Elsevier expert system. Some are experimenting with or considering using identifiers such as ORCID (Open Researcher and Contributor ID), ISNI (International Standard Name Identifier) or VIAF (Virtual International Authority File). VIAF is receiving an increasing number of LC/NACO Authority File records that include other identifiers in the 024 field.
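The idea of linking names through shared identifiers can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the silo names, the record layout and the identifier values are invented for illustration and do not reflect any particular institution's data.

```python
from collections import defaultdict

# Hypothetical records from three silos; names and identifier values are invented.
records = [
    {"silo": "catalog", "name": "Doe, Jane, 1970-", "ids": {"viaf:0000001"}},
    {"silo": "repository", "name": "Jane Doe",
     "ids": {"orcid:0000-0000-0000-0001", "viaf:0000001"}},
    {"silo": "digital_library", "name": "J. Doe",
     "ids": {"orcid:0000-0000-0000-0001"}},
    {"silo": "catalog", "name": "Roe, Richard", "ids": {"isni:0000000000000002"}},
]


def index_by_identifier(records):
    """Map each identifier to the (silo, name) pairs that carry it."""
    index = defaultdict(list)
    for rec in records:
        for ident in sorted(rec["ids"]):
            index[ident].append((rec["silo"], rec["name"]))
    return dict(index)


index = index_by_identifier(records)
for ident, holders in sorted(index.items()):
    print(ident, holders)
```

Name strings that differ across silos ("Doe, Jane, 1970-" and "Jane Doe") end up under the same key because they carry the same identifier. Records that share no identifier directly would still need a separate reconciliation step.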

Implications of BIBFRAME Authorities

Thursday, April 3rd, 2014 by Karen


That was the topic discussed recently by OCLC Research Library Partners metadata managers, initiated by Philip Schreur of Stanford. We were fortunate that several staff from the Library of Congress involved with the Bibliographic Framework Initiative (aka BIBFRAME) participated.

Excerpts from On BIBFRAME Authority, dated 15 August 2013, served as background, specifically the sections on the “lightweight abstraction layer” (2.1-2.3) and the “direct” approach (3). During the discussion, Kevin Ford of LC shared the link to the relatively recent BIBFRAME Authorities draft specification dated 7 March 2014, now out for public review.

The discussion revolved around these themes:

The role of identifiers for names vis-à-vis authority records. Ray Denenberg of LC noted that when the initiative first began, the framers searched unsuccessfully for an alternate name for “authorities” as it could be confused with replicating the LC/NACO or local authority files that follow a certain set of cataloging rules and are constantly updated and maintained. BIBFRAME is meant to operate in a linked data environment, giving everyone a lot of flexibility. The “BIBFRAME Authority” is defined as a class that can be used across files. It could be simply an identifier to an authoritative source, and people could link to multiple sources as needed. The identifier link could also be used to grab more information from the “real” authority record.

Concern about sharing authority work done in local “light abstraction layers.” It was posited that Program for Cooperative Cataloging libraries, and others, could share local authorities work and expose it as linked data. This is one of the objectives for the Stanford-Cornell-Harvard Linked Data for Libraries experiment. They plan to use a type of shared light abstraction model, where they may share URIs for names rather than each institution creating their own. Concerns remain about accessing, indexing and displaying shared local authorities across multiple institutions, and the risk of outages that could hamper access. Although libraries could develop a pared down approach to creating local authority data (which may not be much more than an identifier) and then have programs that pull in more information from other sources, some feared that data would only be created locally and not shared and libraries would not ingest richer data available from elsewhere.

Alternate approaches to authority work. Given the limited staff libraries have, fewer have the resources to contribute to the LC/NACO authority file as much as they have in the past. The lightweight model could serve as a place for identifiers and labels, and allow libraries to quickly create identifiers for local researchers prominent in the journal literature but not reflected in national authority files. Using identifiers instead of worrying about validating content—doing something quick locally that you can’t afford to do at a national level—is appealing. Alternatively, a library could bring in information from multiple authority sources—each serving a different community—noting the equivalents and providing an appropriate label.  BIBFRAME Authority supports both approaches. Other sources could include those favored by publishers rather than libraries, such as ORCID (Open Researcher and Contributor ID) or ISNI (International Standard Name Identifier), or by other communities such as those using EAC-CPF (Encoded Archival Context – Corporate bodies, Persons and Families). This interest overlaps the OCLC Research activity on Registering Researchers in Authority Files.

Concern about the future role of the LC/NACO Authority File. Some are concerned that if libraries choose to rely on identifiers to register their scholars or bring in information from other sources, fewer would contribute to the LC/NACO Authority File. Will we lose the great database catalogers have built cooperatively over the past few decades? Some would still prefer to have one place for all authority data and do all their authority work there. LC staff noted that a program could be run to ingest authority data created in these local (or consortial) abstraction layers into the LC/NACO Authority File.

Issues around ingesting authority data. We already have the technology to implement Web “triggers” to launch programs that pull in information from targeted sources and write the information to our own databases. OCLC Research recently held a TAI-CHI webinar demonstrating xEAC and RAMP (Remixing Archival Metadata Project), two tools that do just that. There are other challenges such as evaluating the trustworthiness of the sources, selecting which ones are most appropriate for your own context and reconciling multiple identifiers representing the same entity. Some are looking for third-party reconciliation services that would include links to other identifiers.
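Reconciling multiple identifiers that represent the same entity can be sketched as a clustering problem. The example below is a minimal illustration, not a description of any tool mentioned above: it assumes the equivalences are already available as pairs of identifier strings (the values shown are invented) and groups them with a union-find structure.

```python
def reconcile(equivalences):
    """Cluster identifiers, given pairs asserted to name the same entity."""
    parent = {}

    def find(x):
        # Follow parent links to the cluster representative.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    for a, b in equivalences:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for ident in list(parent):
        clusters.setdefault(find(ident), set()).add(ident)
    return sorted(sorted(members) for members in clusters.values())


# Invented identifier values, for illustration only.
pairs = [
    ("viaf:1", "isni:2"),
    ("isni:2", "orcid:3"),
    ("viaf:9", "local:7"),
]
clusters = reconcile(pairs)
print(clusters)
```

The sketch treats every asserted equivalence as trustworthy. In practice, the trustworthiness of each source would have to be weighed before merging, as noted above.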

Those interested in the continuing discussion of BIBFRAME may wish to subscribe to the BIBFRAME listserv.



New Scholars’ Contributions to VIAF: Syriac!

Tuesday, March 11th, 2014 by Karen

Syriac scripts added to VIAF cluster for Ephraem

We have just loaded into the Virtual International Authority File (VIAF) the second set of personal names from a scholarly resource, the Syriac Reference Portal hosted by Vanderbilt University.

Syriac is a dialect of Aramaic, developed in the kingdom of Mesopotamia in the first century A.D. It flourished in the Persian and Roman Empires, and Syriac texts comprise the third largest surviving corpus of literature from the Roman Empire, after Greek and Latin. The Syriac Reference Portal is a collaborative digital reference project funded by the National Endowment for the Humanities and the Andrew W. Mellon Foundation involving partners at Vanderbilt University, Princeton University and other affiliate institutions.

This addition represents the first time we see Syriac scripts (there are variants) as both the “preferred form” and under “alternate name forms” in a VIAF record. The Syriac Reference Portal also contributes additional Arabic and other scripts as alternate names, but selects a Syriac script form as a preferred form for people who wrote or were written about in Syriac.

The Syriac names join the Roman and Greek personal names we loaded from the Perseus Catalog last November and blogged about here as part of our Scholars’ Contributions to VIAF activity. Together they demonstrate how scholarly contributions can enrich existing VIAF clusters (generally comprising contributions from national libraries and other library agencies) by adding script forms of names that previously lacked them, as well as adding new names. Scholars benefit in two ways: they can use VIAF URIs as persistent identifiers for the names in their own databases, linked data applications and scholarly discourse, which helps disambiguate names in multinational collaborations, and they can use VIAF as a means to disseminate scholarly research on names beyond their own communities.

Adding these scholarly files demonstrates the benefits of tapping scholarly expertise to enhance and add to name authorities represented in VIAF. We look forward to more such enhancements from other scholars’ contributions.

OCLC Exposes Bibliographic Works Data as Linked Open Data

Tuesday, February 25th, 2014 by Roy

Today in Cape Town, South Africa, at the OCLC Europe, Middle East and Africa Regional Council (EMEARC) Meeting, my colleagues Richard Wallis and Ted Fons made an announcement that should make all library coders and data geeks leap to their feet. I certainly did, and I work here. However, viewed from our perspective this is simply another step along a road that we set out on some time ago. More on that later, but first to the big news:

  1. We have established “work records” for bibliographic records in WorldCat, which bring together the sometimes numerous manifestations of a work into one logical entity.
  2. We are exposing these records as linked open data on the web, with permanent identifiers that can be used by other linked data aggregations.
  3. We have provided a human readable interface to these records, to enable and encourage understanding and use of this data.

Let me dive into these one by one, although the link above to Richard’s post also has some great explanations.

One of the issues we have as librarians is to somehow relate all the various printings of a work. Think of Treasure Island, for example. Can you imagine how many times that has been published? It hardly seems helpful, from an end-user perspective, to display screen upon screen of different versions of the same work. Therefore, identifying which works are related can have a tremendous beneficial impact on the end user experience. We have now done that important work.

But we also want to enable others to use these associations in powerful new ways by exposing the data as linked (and linkable) open data on the web. To do this, we are exposing a variety of serializations of this data: Turtle, N-Triples, JSON-LD, RDF/XML, and HTML. When looking at the data, please keep in mind that this is an evolutionary process. Some possible linkages are not yet enabled in the data but will be later. See Richard’s blog post for more information on this. The license that applies to this is the Open Data Commons Attribution license, or ODC-BY.

Although it is expected that the true use of this data will be by software applications and other linked data aggregations, we also believe it is important for humans to be able to see the data in an easy-to-understand way. Thus we are providing the data through a Linked Data Explorer interface. You will likely be wondering how you can obtain a work ID for a specific item, which Richard explains:

How do I get a work id for my resources? – Today, there is one way. If you use the OCLC xISBN, xOCLCNum web services you will find as part of the data returned a work id (e.g., owi=”owi12477503”). By stripping off the ‘owi’ you can easily create the relevant work URI:
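The prefix-stripping step can be sketched in Python. This is only an illustration: the base URI for works is not reproduced in this text, so the sketch takes it as a parameter, and `PLACEHOLDER_BASE` below is a made-up value, not the real WorldCat address.

```python
def work_uri(owi, base):
    """Strip a leading 'owi' from a work id and append the remainder to `base`."""
    prefix = "owi"
    number = owi[len(prefix):] if owi.startswith(prefix) else owi
    if not number.isdigit():
        raise ValueError(f"unexpected work id: {owi!r}")
    return base.rstrip("/") + "/" + number


# Hypothetical base, used only so the example runs; substitute the real one.
PLACEHOLDER_BASE = "https://example.org/work/id"

print(work_uri("owi12477503", PLACEHOLDER_BASE))
```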

In a very few weeks, once the next update to the WorldCat linked data has been processed, you will find that links to works will be embedded in the already published linked data. For example you will find the following in the data for OCLC number 53474380:


As you can see, although today is a major milestone in our work to make the WorldCat data aggregation more useful and usable to libraries and others around the world, there is more to come. We have more work to do to make it as usable as we want it to be and we fully expect there will be things we will need to fix or change along the way. And we want you to tell us what those things are. But today is a big day in our ongoing journey to a future of actionable data on the  web for all to use.

The Most Edited Book Records in WorldCat

Friday, February 7th, 2014 by Roy

In my last post I identified the most edited records in WorldCat, which, no surprise, were all serials. Someone who read the post asked about this information by format (e.g., books, maps, scores, etc.). I doubt that I will get to all of the various formats, but I decided to take a look at books.

Unlike serials, for which I noted those that had 60 or more edits, for books I had to lower the threshold to 40 to get any at all (the most edited item had 58 edits). So here are the book records which have been edited more than 39 times in WorldCat (in no particular order):

An inevitable conclusion from the above seems to be that the more libraries that hold a book the more likely a cataloger will be to touch the record for it, which would explain how Harry Potter and the Hunger Games books made it on the list.

The Most Edited Records in WorldCat

Thursday, January 30th, 2014 by Roy

Recently I’ve been doing a large pile of data processing jobs that has me working in cycles of 20 minutes or so. In other words, I do some edits, kick off a job on our compute cluster (fondly named “Gravel” — don’t ask) and about 20 minutes later I do roughly the same thing. Yeah, I know, you’re thinking “why doesn’t he automate it?”. And I would, except that this is a shared resource and rather than kicking off my monster list of jobs that could keep the cluster running from now until…well…a long while from now I think it’s better to introduce some variability in load.

All of that is a long introduction to how I came to discover the most edited records in WorldCat. To fill in those 20 minute blocks I took up some “mini investigations” that do not take as long to perform.

For one such investigation I looked into how often WorldCat records have been edited and by whom. I will be blogging about this in an upcoming post, but a small slice of this investigation was a closer look at the records that have been edited a lot. Since we keep track of the cataloging symbol of every institution that has edited a record, these can stack up for records that require updates on a regular basis: in other words, serials.

All of the records for these serials were edited more than 60 times over their life in WorldCat, and in no particular order:

Take a bow, serials catalogers, you’ve clearly earned your pay.

The Most Used English Title Words in WorldCat

Friday, January 3rd, 2014 by Roy

This is another installment in my continuing series of eclectic, peripatetic, and yes, let’s just say it: “pathetic” data investigations. The most recent identified the top countries of publication for WorldCat records. For whatever reason, I got it into my head to determine which English words appear the most in the main title of WorldCat items.

Clearly there are at least two ways to go about this: a) a formal, well-designed, highly replicable and ultimately near perfect investigation, or b) a slapdash, fast, seat-of-your-pants investigation of questionable merit. When given such a choice, I find the latter completely irresistible. So I took part of my day today and did exactly that.

Since I already had code on our research cluster affectionately named “Gravel” that could extract a specific subfield, I powered it up and sucked out all of the 245 $a fields from WorldCat. As part of that process, I extracted only unique strings. The sharp ones among you have likely noticed a couple flaws already: 1) I was too lazy to filter based on language, and 2) I was too careless to normalize the title strings.

Flaws have never stopped me before, so I blazed on as if nothing was amiss. Then I threw that monster file onto another computer where I didn’t have to worry about interfering with any of the actually useful work that my colleagues were doing on Gravel (you’re welcome). There I wrote a special-purpose Perl script to take each title string, split it into individual words, lowercase them, and count up the occurrences. I dabbled in creating a “stop-words” list of useless words like “a” and “an” and “and” and “the” (ad infinitum) but that quickly began looking like a rabbit hole. As I was only really interested in identifying the top 30 or so words I figured my human eyeball would be sufficient to trap those in the end. Likewise with the foreign words.
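For the curious, the split, lowercase and count steps look roughly like this. It is a Python sketch rather than the original Perl script, which is not shown here. The sample titles are made up, and stripping punctuation from each word is an added assumption, since the original run did not normalize the title strings.

```python
from collections import Counter


def count_title_words(titles, stop_words=frozenset()):
    """Count lowercased words across unique title strings."""
    counts = Counter()
    for title in set(titles):  # keep only unique strings
        for word in title.lower().split():
            # Assumption: trim surrounding punctuation such as the ISBD " /" and " :".
            word = word.strip(".,:;/!?()[]\"'")
            if word and word not in stop_words:
                counts[word] += 1
    return counts


# Invented sample titles, for illustration only.
titles = ["A new history of water /", "New report :", "New report :"]
counts = count_title_words(titles, stop_words={"a", "of"})
print(counts.most_common(3))
```

The stop-word set is optional, which matches the eyeball approach: leave it empty, count everything, and skip the useless words when reading the top of the list.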

That was really about it. Well, except for all the time I spent on Facebook waiting for the operations to complete. Did I say that out loud?

Anyway, without further ado (thank god) here are the top occurring meaningful English title words in WorldCat:

2020380 new
1853252 report
1431184 study
1159042 development
1069940 analysis
1004554 history
978681 county
968097 international
929294 state
890928 guide
856935 system
789983 education
778732 school
756569 united
748894 national
736474 management
706559 social
700137 book
688993 states
688328 studies
687695 general
687665 american
679083 systems
678582 public
677286 water
671552 research
666407 life
661707 health
645966 plan
644212 world
642100 effects

OK, now move along, nothing to see here.

Countries of Publication in WorldCat

Tuesday, December 10th, 2013 by Roy

I’m a data geek. I just love processing data in various ways to see what I can find out. So recently I decided to look into the countries of publication as recorded in the 300+ million MARC records in WorldCat. Just for kicks I did some processing of the 260 $a subfield, which is  the “Place of publication, distribution, etc.” as it appears on the piece, or noted in various other ways if it doesn’t.

As you might imagine, what results from such an investigation is a complete dog’s breakfast, with a large variety of punctuation marks, typographical errors, imaginative spellings, and just plain junk. No, it is much better to parse bytes 15-17 of the 008 field, which at least are supposed to only contain values from this list maintained by the Library of Congress. Progress.

That is, until one discovers that this “Code List for Countries” is not exactly that. If you happen to be in a certain select part of the world (mostly the United States, Canada, and Australia), you can also select state or province-specific codes. So before I used this table to translate the codes for actual countries I first had to translate the table, so that the code for “California” translated instead to “United States”. Progress.

Oh, and then countries have this tiresome tendency to change over time. The Soviet Union broke up. Czechoslovakia split into two. And don’t even get me started about the hot mess that used to fall under the general term of “Micronesia”. So I had to make some executive (and no doubt indefensible) decisions about how to deal with those. By and large, if I could identify some geography (e.g., Uzbekistan) that had a former life that could also be identified (e.g., Uzbek S.S.R.), I translated them both into the current entity. But lord only knows how many items that don’t have this distinction end up being miscounted. But progress of some sort nonetheless.

Oh, and places like “West Berlin” got their own code. How quaint. But now I’m just whining.

In the end I had the table translated into my twisted view of reality and could run my program against the entirety of WorldCat, parsing out the precious three bytes from the 008 and running my undoubtedly flawed translation on the result. I just love that “Unknown” came out on top. Somehow, after this journey, it seemed fitting.
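Here is a rough Python sketch of the parse-and-translate step, not the program I actually ran. The lookup table is a tiny illustrative sample that should be checked against the Library of Congress list before use, and the 008 strings are constructed by hand rather than taken from real records.

```python
from collections import Counter

# Tiny illustrative sample; verify codes against the LC list before relying on them.
CODE_TO_PLACE = {
    "xxu": "United States",
    "cau": "United States",  # California, folded into its country
    "nyu": "United States",  # New York (State)
    "xxc": "Canada",
    "onc": "Canada",         # Ontario
    "uz": "Uzbekistan",
    "uzr": "Uzbekistan",     # Uzbek S.S.R., folded into the current entity
    "gw": "Germany",
    "fr": "France",
}


def country_of_publication(field_008):
    """Translate bytes 15-17 of an 008 field into a country name."""
    code = field_008[15:18].strip().lower()
    return CODE_TO_PLACE.get(code, "Unknown")


def make_008(place):
    """Build a 40-character 008 with the given place code (test data only)."""
    return "131210s2013" + " " * 4 + place.ljust(3) + " " * 17 + "eng d"


sample = [make_008("cau"), make_008("gw"), make_008("xxu"), make_008("")]
tally = Counter(country_of_publication(f) for f in sample)
print(tally.most_common())
```

Anything not in the table, including a blank place code, falls through to "Unknown", which is one way that category can end up so large.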

With no further ado, here are the top 25 “countries” of publication from the records in WorldCat:

74,330,023  Unknown
52,460,566  United States
34,014,675  Germany
24,374,828  United Kingdom
21,009,805  France
 9,142,988  Japan
 8,706,853  China
 7,950,373  Spain
 6,649,599  Italy
 6,312,625  Netherlands
 6,142,256  Canada
 5,641,525  Switzerland
 3,725,639  Russia
 3,516,374  Australia
 3,310,194  Poland
 2,923,655  Denmark
 2,739,910  Sweden
 2,219,850  India
 1,996,800  Slovenia
 1,936,800  Austria
 1,612,948  Belgium
 1,518,478  Israel
 1,514,824  Brazil
 1,412,034  Mexico
 1,197,454  Finland

The full list is here. Knock yourself out. I sure did.

Visualizations of MARC Usage

Monday, December 2nd, 2013 by Roy

As part of my work to reveal exactly how the MARC standard has been used over the several decades it has existed (available at “MARC Usage in WorldCat”), I’ve always wanted to produce visualizations of the data. Recently, with essential help from my colleagues JD Shipengrover and Jeremy Browning, I was able to do exactly that.

After trying various graphical depictions of the data, we finally settled on an interactive “starburst” view of the data. The initial view provides a high-level summary of how often various tags have been used within particular formats. The interactive part allows you to “drill down” a level into a more detailed view.

We are providing two views of the data: from the point of view of the formats being described (that is, the top-level is comprised of the various formats — books, journals, etc.), and from the point of view of the tags (that is, the top-level is comprised of the various MARC tags).

If you have any ideas about a visualization you would like to see, let me know.

Metadata for digital objects

Tuesday, November 26th, 2013 by Karen

That was the topic discussed recently by OCLC Research Library Partners metadata managers. It was initiated by Jonathan LeBreton of Temple, who noted the questions staff raised when describing voluminous image collections such as: Do we share the metadata even if it would swamp results? What context can be provided economically? What are others doing both in terms of data schemas and where the metadata is shared?

The discussion revolved around these themes:

Challenges in addressing the sheer volume of digital materials. Managers are making decisions based on staffing, subject expertise, the collection’s importance and funding. It was suggested that some descriptive metadata, such as dates and location, could be extracted from the technical metadata. We discussed the possibility of crowd-sourcing metadata creation, although experience to date is that a few volunteers are responsible for most contributions, and the successful examples tend to be for transcription, editing OCR’d text, and categorizations. (The At a Glance: Sites that Support Social Metadata chart indicates the ones that enhance data either through improved description or subject access.) The context must matter to people for them to volunteer their efforts. (See the OCLC Report, Social Metadata for Libraries, Archives and Museums: Executive Summary.) With the anticipated increase of born-digital and other digitized materials, there’s a greater need for batch and bulk processing.

Grappling with born-digital materials. Libraries are receiving the digital equivalents of personal papers and using the Forensic Toolkit to “process” these digital collections. Preservation and rights management, in addition to description, are important components and no commercially available system yet addresses these needs. The Association of Research Libraries is working with the Society of American Archivists to customize its Digital Archives Specialist (DAS) Program so that ARL library staff can develop the requisite skills for managing born-digital materials. OCLC Research has produced several reports in conjunction with its Demystifying Born Digital program of work.

Concerns about “siloization”, or proliferation of “boutique” collections, using different metadata schema. Metadata is being created in different native systems within an institution, metadata that is often not loaded into a central catalog or even accessible in the local discovery layer. User-created metadata in institutional repositories may be OAI harvested by OCLC and thus may appear in WorldCat even if not visible in the institution’s local discovery tool. Managers grapple with whether to spend resources on updating such metadata before it is exposed for harvesting.  Another challenge is deciding what to include in which discovery layer, and what should be silo’d.  The numerous repositories within an institution can result in complex metadata flows for discovery, as illustrated by UC San Diego’s Prezi diagram. Some institutions map their various metadata schema to MODS (Metadata Object Description Schema), but all non-MARC metadata is converted to MARC when loaded into WorldCat.

What are the “essential elements” to provide access across collections? We posited that librarians have been discussing “core” or “essential” metadata elements for decades, starting with Dublin Core and the Program for Cooperative Cataloging’s “BIBCO Standard Record”. Librarians have been entering metadata to suit the system it was designed for, but the data ultimately moves to another system. Library metadata is no longer confined to a single system: it may be exposed to search engines and viewed with lots of non-library metadata.

The Library of Congress’ Bibliographic Framework Initiative  portends a future where all metadata will be “non-MARC” and we will rely more on linked data URIs in place of metadata text strings.  How can we use the promise of that future to get to where we need to be?