Archive for the 'Renovating Descriptive Practice' Category

First Scholars’ Contributions to VIAF: Greek!

Monday, November 25th, 2013 by Karen

Perseus logo in VIAF Cluster

Contributors to the Virtual International Authority File (VIAF) have generally been national libraries and other library agencies. We have just loaded into VIAF the first set of personal names from a scholarly resource, the Perseus Catalog hosted by Tufts University, an OCLC Research Library Partner. The Perseus Catalog aims to provide access to at least one online edition of every major Latin and Greek author from antiquity to 600 CE. Adding the Greek, Arabic and other script forms of names in the Perseus Catalog enriches existing VIAF clusters that previously lacked them.

This addition represents a milestone in our Scholars’ Contributions to VIAF activity. We anticipate mutual benefits from our collaboration with scholars. Scholars benefit from using VIAF URIs as persistent identifiers in their own databases, linked data applications and scholarly discourse, both to disambiguate names in multinational collaborations and to disseminate their research on names beyond their own communities. Both scholarly societies and libraries benefit from enriching VIAF with name authority data that would not otherwise be contributed by national libraries.

As noted in an earlier blog post, Irreconcilable differences? Name authority control & humanities scholarship, OCLC Research discovered key issues important to scholars that didn’t mesh well with library practices represented in name authority files, due to differences in intended audiences, disciplinary norms and metadata needs. However, if scholars do use the Library of Congress’ Metadata Authority Description Schema (MADS), as the Perseus Catalog does, we can add their files to VIAF much more easily.

Adding these scholarly files demonstrates the benefits of tapping scholarly expertise to enhance and add to the name authorities represented in VIAF. We have already seen the number of “alternate name forms” associated with VIAF clusters that include the Perseus Catalog’s contributions increase, including scripts not previously represented. We look forward to more such enhancements from other scholars’ contributions.


Multilingual WorldCat represented by translations

Tuesday, November 12th, 2013 by Karen

Great works are translated—the cream of the world’s cultural and knowledge heritage is shared by being translated. And many of them are represented by bibliographic records in WorldCat.

A group of us working on Multilingual WorldCat projects have been focusing on datamining WorldCat for works and all translations associated with them, identifying the translator for each translation. We plan to generate “uniform title” and “expression” records (the translations) and contribute them to the Virtual International Authority File (VIAF).

We currently have roughly 15 million personal name “clusters” in VIAF, built by matching, across the 26 million personal name authority records contributed by 35 agencies, those records that represent the same person. These are not just creators of works, but also people who have had works written about them and sometimes a translator.

My colleague Jenny Toves has identified about 1 million persons in WorldCat who are associated with bibliographic records in more than one language, or roughly 7% of the people represented in VIAF.  The breakdown:

  • 624K names are associated with titles in only 2 languages
  • 283K names are associated with titles in 3 to 9 languages
  • 7K names are associated with titles in 10 or more languages
Graphic: persons with titles in multiple languages (VIAF breakdown by language)

My colleague JD Shipengrover created the accompanying graphic.
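The bucketing behind these numbers is easy to sketch. Here is a minimal, illustrative version in Python; it assumes the datamining step has already reduced WorldCat to (person identifier, language code) pairs, which is an assumed input shape, not OCLC's actual code:

```python
from collections import Counter, defaultdict

def language_breakdown(pairs):
    """Bucket persons by how many distinct languages their titles appear in.

    `pairs` is an iterable of (person_id, language_code) tuples, standing in
    for fields mined from bibliographic records.
    """
    languages = defaultdict(set)
    for person, lang in pairs:
        languages[person].add(lang)

    buckets = Counter()
    for langs in languages.values():
        n = len(langs)
        if n == 2:
            buckets["2 languages"] += 1
        elif 3 <= n <= 9:
            buckets["3-9 languages"] += 1
        elif n >= 10:
            buckets["10 or more languages"] += 1
    return buckets
```

Persons associated with only one language (the vast majority) fall outside all three buckets, matching the breakdown above.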

We expect to focus our analysis efforts on the “short head” of the names whose works have been translated the most, and rely on machine algorithms to handle the “long tail” of the names associated with titles in only two or three languages.

MARCEdit Integrates WorldCat Capabilities

Thursday, October 31st, 2013 by Roy

As recently announced by Terry Reese, his program MARCEdit now includes a great set of new capabilities for users of WorldCat, made possible by OCLC’s recent release of the WorldCat Metadata API. Here are just a few of the things you can do directly from MARCEdit:

  • Set Batch Holdings in OCLC.
  • Batch upload/edit records into WorldCat.
  • Search WorldCat directly from within MARCEdit.

This is just the kind of integration that our web services now make available for software of all kinds. By providing an application program interface (API) that enables not just search and display of records, but also updating and creating records, we are exposing the full range of WorldCat metadata capabilities to virtually any software developer.
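In outline, full read/write access means ordinary HTTP verbs against record resources. The sketch below illustrates that read/create/update cycle; the endpoint URL, authentication scheme and payload are placeholders for illustration, not the actual details of OCLC's Metadata API:

```python
from urllib.request import Request

# Placeholder base URL -- not the real OCLC endpoint.
API_BASE = "https://metadata.example/records"

def bib_request(verb, ocn=None, marcxml=None, wskey="YOUR-WSKEY"):
    """Build an HTTP request for reading (GET), creating (POST) or
    updating (PUT) a bibliographic record. All specifics are illustrative."""
    url = f"{API_BASE}/{ocn}" if ocn is not None else API_BASE
    return Request(url, data=marcxml, method=verb,
                   headers={"Authorization": wskey,
                            "Content-Type": "application/marcxml+xml"})

read_req = bib_request("GET", ocn=41266045)        # fetch one record
create_req = bib_request("POST", marcxml=b"<record/>")  # add a new record
```

The point is simply that the same identifier-addressed resource supports search, display, update and create, which is what lets a tool like MARCEdit batch-edit WorldCat directly.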

We have long said that by enabling developers to use our services at a deeper level, we would enable new kinds of services that we could not develop ourselves. Now we are seeing exactly that. Kudos to Terry Reese for building new capabilities into an already stellar application.

OCLC Control Numbers – Lots of them; all public domain

Monday, September 23rd, 2013 by Jim

For the last few years I have been part of a group of OCLC staff charged with articulating data sharing practices that are consistent with the WorldCat Rights and Responsibilities for the OCLC Cooperative (WCRR). We’ve made good progress towards openness while making expectations and practices more regular and consistent. The recommendation to use the ODC Attribution license, the release of substantial sets of bibliographic data and the understandings we reached with DPLA and Europeana are all part of that progress. Most recently, we recommended that OCLC declare OCLC Control Numbers (OCNs) to be dedicated to the public domain. We wanted to make it clear to the community of users that they can share and use the number for any purpose and without any restrictions. Making that declaration is consistent with our application of an open license to our own releases of data for re-use, and ends the needless elimination of the number from the bibliographic datasets that are at the foundation of library and community interactions.

I’m pleased to say that this recommendation received unanimous support, and my colleague Richard Wallis spoke about the declaration during his linked data session at the recent IFLA conference. The declaration now appears on the WCRR web page and on the page describing OCNs and their use.

We think this declaration is important because it counteracts practices based on misunderstandings that emerged from concerns about OCLC having an overly restrictive record use and re-use policy.

One of the most unfortunate misunderstandings grew up around the OCLC Control Number (OCN) itself. The OCN is a unique, sequentially assigned number associated with a record in WorldCat. The number is included in a WorldCat record when the record is created. More than one billion have been assigned. (Yes, a billion.) Some people thought that the Control Number represented a mechanism for identifying a record as having originated with OCLC, and therefore as subject to the cooperative’s record use policy.

This caused institutions to strip the OCN from bibliographic records. For similar reasons commercial information users would sometimes delete the OCN from the data that they used. This is unfortunate behavior that diminishes the value of the OCN as an identifier and compromises some of the innovation that could occur if the OCN were more universally used. It’s an important element in linked library data that helps in the creation and maintenance of work sets and provides a mechanism to disambiguate authors and titles.

More importantly, the OCN is also widely used within the broad system of information that flows among libraries, national information agencies, commercial information providers and organizations that supply consumers with book- and journal-oriented services. For instance,
• Cataloging and IT librarians download OCLC MARC bibliographic records to the library’s local system.
• Resource sharing librarians using third-party ILL management programs store or use the OCLC number for searching.
• Reference services librarians with WorldCat Local use it to help a patron locate an item.

Publishers, vendors and others that partner with OCLC and libraries also use the OCN. For example,
• Integrated Library Service (ILS) vendors use the OCN to manage changes and updates within their application environment,
• Publishers, material suppliers and eContent providers use OCLC MARC bibliographic records in their systems and rely on the OCN as an identifier,
• Developers maintaining or expanding services use OCLC Control Numbers as an integral component of their application architecture.

All these good things can happen because of the identifying power of the OCN and its ubiquity in the library description domain. Everyone should use them and take advantage of what they make possible. This declaration removes any residual concern that may have incorrectly informed operating practices. We hope it makes a difference.

WorldCat Linked Data Made More Easily Available to Software

Monday, June 3rd, 2013 by Roy

You may recall that a while back we announced that linked data had been added to WorldCat.org web pages. If you scroll down when viewing a single record, you can reveal a “Linked Data” section of the page that is human readable and also “scrapable” via software.

However, it is much easier for software to request a structured data version that does not contain all of the HTML markup of the page. The best way to do this is through something called “content negotiation”. Basically, it enables a requestor (that is, a software program) to tell the web server which format it wants. For example, if you want a representation of the data in JavaScript Object Notation (JSON) format, which many software developers use, you could issue a command such as this:

curl -L -H "Accept: application/ld+json" http://www.worldcat.org/oclc/&lt;OCLC-number&gt;

And that is what you would get in return. Alternatively, you can request a format simply by appending the appropriate filename extension to the record URL.

Formats supported include RDF/XML, JSON, and Turtle. Richard Wallis has written a more thorough description that is very helpful in understanding how best to use this new service.
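Content negotiation is just as easy from a program. This sketch builds such a request in Python; the worldcat.org/oclc/ URL pattern is assumed here for illustration:

```python
from urllib.request import Request

def linked_data_request(ocn, media_type="application/ld+json"):
    """Build a content-negotiated request for a record's linked data.

    The http://www.worldcat.org/oclc/<number> URL pattern is an assumption
    for illustration; the Accept header tells the server which format to send.
    """
    return Request(f"http://www.worldcat.org/oclc/{ocn}",
                   headers={"Accept": media_type})

jsonld_req = linked_data_request(41266045)                  # JSON-LD
turtle_req = linked_data_request(41266045, "text/turtle")   # or Turtle
```

Passing the built request to `urllib.request.urlopen` would then return the structured data rather than the HTML page.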

These changes make it much easier and faster to get the data a developer requires into their application in a highly usable way. We can’t wait to see what they do with it.

ISBNs in WorldCat

Thursday, May 23rd, 2013 by Roy

Recently a question came up on the BIBFRAME list about ISBNs, and how many of them were in MARC records. This is just the kind of question that OCLC Research is uniquely placed to answer, so I quickly wrote some simple Perl code to run as a Hadoop streaming job to find out.

It was remarkably quick and easy to find out, although I had to edit and re-run the code when I discovered a flaw in my logic. This is, sadly, all too frequently the case. But not too much later I had my result:

Occurrences   ISBNs per record   Percent of WorldCat
  230444194          0                77.71%
   55668178          2                18.77%
    4766652          1                 1.61%
    3708352          4                 1.25%
     616623          3                 0.21%
     411230          6                 0.14%
     125715          8                 0.04%
      65796          5                 0.02%
      45304         10                 0.02%
      30155         12                 0.01%
These are all of the occurrences of a 020 $a in WorldCat as of 1 May 2013 [Added for clarification: the prior sentence describes exactly what is being counted. That is, I am not (yet) examining ISBNs for 10-digit vs. 13-digit; therefore, many of the records with 2 ISBNs may in fact simply have both versions].  A few observations:

  • Many items in WorldCat were published before the invention of the ISBN.
  • Many items in WorldCat are not ISBN-appropriate (e.g., unpublished materials).
  • ISBNs are therefore problematic as identifiers except for a narrow slice of materials (mainly printed books since the mid-1960s).
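I used Perl for the actual job; here is a hedged Python equivalent of the counting step, assuming records arrive one per line in MARC-in-JSON form (the field layout below follows the common MARC-in-JSON shape, but the overall setup is illustrative, not my original code):

```python
import json
import sys
from collections import Counter

def isbn_count(record):
    """Count 020 $a occurrences in one MARC-in-JSON record."""
    total = 0
    for field in record.get("fields", []):
        tag, content = next(iter(field.items()))
        if tag == "020" and isinstance(content, dict):
            total += sum(1 for sf in content.get("subfields", []) if "a" in sf)
    return total

def histogram(lines):
    """Tally records by ISBN count (the aggregation step, done in-process here)."""
    return Counter(isbn_count(json.loads(line)) for line in lines)

if __name__ == "__main__" and not sys.stdin.isatty():
    # As a Hadoop streaming job, records come in on stdin, one per line.
    for count, records in sorted(histogram(sys.stdin).items()):
        print(f"{records}\t{count}")
```

In a real streaming job the tallying would be split between the mapper and reducer, but the counting logic is the same.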

A much better identifier for many purposes is, I assert, the OCLC number.

Irreconcilable differences? Name authority control & humanities scholarship

Wednesday, March 27th, 2013 by Karen

This post is co-authored by David Michelson, Vanderbilt University

Over the past year OCLC Research has been working with a group of Syriac studies scholars with the goal of tapping their expertise to enrich the Virtual International Authority File (VIAF), by adding Syriac script to existing names and adding new ones. Syriac is a dialect of Aramaic, developed in the kingdom of Mesopotamia in the first century A.D. It flourished in the Persian and Roman Empires, and Syriac texts comprise the third largest surviving corpus of literature from the fourth through seventh centuries, after Greek and Latin. We anticipated that the issues we addressed could then be applied to scholars in other disciplines. We started with the assumption that the scholars could use the Library of Congress’ Metadata Authority Description Schema, or MADS.

We have learned a lot in the process of building a bridge between scholarly interest in names as a subject of historical research and VIAF’s interest in persistent identifiers for each name in authority files. We found that we shared values for name authorities:

  • Scholars and librarians share a mutual appreciation for each others’ work on identifying names appearing in historical research.
  • Many scholarly projects in the digital humanities are already relying on VIAF for authority control and to anchor Linked Open Data. The Syriac scholars pointed us to digital humanities projects— such as the Fihrist, a union catalog of Islamic manuscripts hosted in the UK, and those listed in the Digital Classicist Wiki under “Very Clean URIs”—that have adopted VIAF URIs as the best method for authority control and to link to other data sets.
  • VIAF can provide part of the cyberinfrastructure for digital humanities, a standard way for linking and querying data, a need identified by The American Council of Learned Societies’ national Commission on Cyberinfrastructure.

We discovered two key issues important to scholars that just don’t mesh well with the library practices represented in name authority files, which VIAF aggregates, due to differences in intended audiences, disciplinary norms, and metadata needs:

  • Scholars eschew a “preferred name”. Libraries need to bring together all the variant forms of a name under one form, choosing a “predominant form” if a person writes in more than one language. This approach meets the discovery needs of a specific national or linguistic community. But scholarship is international, and the “preferred name” in one locale will differ from that in another. Further, context is crucial for classifying names. For scholars, a “preferred name” needs to also include by whom and for what purpose it is preferred. For example, a Syriac name in use in 600 may be classified as “classical Syriac”; the same name in use one thousand years later may be classified as a neo-Aramaic dialect. The same Syriac author might have multiple “preferred forms” in multiple languages (Syriac, Arabic, Greek), each used by different or competing cultural communities. This applies to other languages as well. Scholars resist declaring a “preferred form” because it could exclude some historical or cultural perspective. Each form may be “authoritative” depending on the time and place in which it appears.
  • Scholars need to know the provenance of each form of name. When a name has multiple forms, scholars—especially historians— need to know the provenance of each name, following the citation practices commonly used in their field. Historical and textual scholarship is built on conventions of evidence and values the process of contesting intellectual claims. MADS does not provide the structure for citing these sources or providing the required contextual information. Although library practices require “literary warrant” to justify why one form of name was chosen as the authorized heading or access point, they do not document the context for any of the variant forms. There is not even a field to indicate the language of a name’s form. We can deduce the language of the preferred form only by the source of the authority file. Scholars find little value in name information without provenance data, an equivalent of footnotes.

The good news is that our collaboration has pointed the way for future interaction between VIAF, the VIAF Council, and the scholarly community:

  • Syriac studies colleagues are building their own database where they can describe each personal name with the granularity that meets their scholarly requirements. We will work together to create a crosswalk so that OCLC Research can extract the information that fits into a MADS structure and can still enrich existing VIAF clusters with Syriac and other script forms or add new names. VIAF will follow its existing protocols for minting URIs in its namespace for names not yet in VIAF.
  • For those who need the additional details, people could click a link to the name in the database, much as those who want to read a biography of a VIAF name can click on a Wikipedia link, if present. Thus VIAF can still integrate scholars’ expertise and serve scholarly users without needing to overcome the fundamental differences between library and scholarly practices.
  • Our Syriac studies colleagues will work with OCLC and the VIAF Council to establish a path for other scholarly research organizations to contribute to VIAF.

The screen captures of the current VIAF cluster and a Syriac Reference Portal Demo record for Ephrem below help us imagine how VIAF could be enhanced.

VIAF Cluster

Extract from the Syriac Reference Portal Demo

David Michelson is assistant professor of early Christianity at Vanderbilt University and director of The Syriac Reference Portal, a joint project among Vanderbilt University, Princeton University, St. Michael’s College Vermont, Texas A&M University, Beth Mardutho the Syriac Institute and other affiliate institutions, funded by the National Endowment for the Humanities and the Andrew W. Mellon Foundation.

“Cataloging Unchained”

Wednesday, February 27th, 2013 by Roy

Lorcan Dempsey (VP of Research at OCLC) has long said that we need to “make our data work harder.” And for years that is exactly what OCLC Research has been doing. So when I was asked to speak on data mining at the OCLC European, Middle East, and African Regional Council Meeting in Strasbourg, France, I knew I would have a lot to talk about. Too much, in fact.

Instead of trying to cover everything we’ve been doing in a whirlwind of slides that no one would remember, I decided to use WorldCat Identities as a “poster child” for the kinds of data mining activities we have been doing recently here at OCLC Research. Then, I described another, related project — the Virtual International Authority File. To bring it all home I mentioned how we’re considering how we might be able to marry these two resources into one “super” identities service.

Consider what it would mean to take an aggregation of library-curated authority records and enhance it with algorithmically-derived data from WorldCat as well as links to other resources about creators such as Wikipedia. This would provide a rich resource of information about creators, all sitting behind authoritative and maintained identifiers that could be used in emerging new bibliographic structures such as is being created by the Library of Congress’ Bibliographic Framework Transition Initiative. The mind reels with the possibilities.

But before I could jump into all this I needed a way to quickly explain why we are doing things like this — and how we are doing them. I decided I needed to make a video. So last week that is exactly what I did, with help from colleagues in Dublin. The result was less than three-and-a-half minutes long, and yet it amply set the stage for what was to come after. Plus, it can have a life of its own.

Take a look yourself, at “Cataloging Unchained”, and let me know what you think in the comments.

Adventures in Hadoop, #5: String Searching vs. Record Parsing

Friday, January 25th, 2013 by Roy

In a previous post I described how I could now string search WorldCat using the Research compute cluster and Hadoop. This means I can find any string of characters anywhere in a MARC record and output the OCLC numbers of matches and/or the entire record, all within a few minutes. Keep in mind we are now talking about nearly 300 million records.

String searching is not, by any means, the only way we use Hadoop. It is actually more common for us to use code (typically written in Java, Python, or Perl) to parse the records and output portions for further processing. But since I had been using such code for simply identifying records of interest, I began to wonder which method of processing was faster.

In a short experiment I quickly proved that string searching was about three times faster for simple identification and output of records than was the code I had previously been using. This is because, I believe, the code I had been using would parse the record before determining if it met my criteria. This one extra step added so much overhead to the process that it would take 15 minutes (in one test) rather than 5.

This likely means that in cases where relatively few records would match your criteria, you would be better off extracting the records by string searching and then running your extraction code against them off-cluster. For example, if I wanted to pull out the 245 fields of, say, about 1,000 records, I’d likely be better off extracting the records I wanted by string searching and then processing that file directly, without using Hadoop.
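The filter-then-parse pattern is easy to sketch. This hypothetical streaming mapper does a cheap substring test on each raw record and defers all parsing; the one-record-per-line input format is an assumption for illustration:

```python
import sys

def matching_records(lines, needle):
    """Yield raw records containing `needle`, without parsing them first.

    Parsing only the survivors (possibly off-cluster) avoids paying the
    per-record parse overhead for the records you were going to discard.
    """
    for raw in lines:
        if needle in raw:
            yield raw

if __name__ == "__main__" and len(sys.argv) > 1:
    # As a streaming mapper: records on stdin, search string as an argument.
    sys.stdout.writelines(matching_records(sys.stdin, sys.argv[1]))
```

The full parse then runs over the (much smaller) matched set only.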

One last permutation, however: if your process identifies 1,000 records in some situations and several million in others, having one process through which all operations flow is more efficient than maintaining two or more separate processes.

And so it goes. Another day and another adventure in Hadoop.

Top Corporate Names in WorldCat

Tuesday, November 20th, 2012 by Roy

As I explained earlier, I have been doing some investigations into how MARC has been used over the last several decades. Curious about the contents of the 110 $a (corporate name) field, I parsed it; the top 30 headings are listed below. Keep in mind a few things, however:

  • Entities can be put together in different ways. For example, “Great Britain”, “England and Wales” and “Scotland” all appear in the list.
  • My process (as presently constituted) is simplistic. Therefore, both “Canada.” and “CANADA.” are counted separately.
  • Slight variations in headings produce different entries. For example, “Santa Fe River Baptist Association (Fla.)” and “Santa Fe River Baptist Association.”
  • Typos produce different entries.
Eventually I will make the entire list available. If you’re really eager, email me.
1417046   United States.
 587986   Great Britain.
 358417   France.
 206591   Canada.
 176754   Geological Survey (U.S.)
 101421   California.
  98397   Michigan.
  79615   Australia.
  78175   Catholic Church.
  64390   New York (State).
  57037   New Zealand.
  48218   Sotheby's (Firm)
  46196   Hôtel Drouot.
  45853   Québec (Province).
  44812   New South Wales.
  44022   England and Wales.
  43469   Massachusetts.
  41914   Pennsylvania.
  41560   Christie, Manson & Woods.
  41292   Église catholique.
  39517   Ontario.
  36636   Scotland.
  36234   Illinois.
  34691   United Nations.
  31121   India.
  31011   Agence de presse Meurisse.
  29958   Cornell University.
  29648   Church of England.
  29073   Japan.
  28675   Victoria.
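The case-sensitivity wrinkle noted above (“Canada.” vs. “CANADA.”) is fixable by normalizing headings before tallying. A minimal sketch, illustrative rather than the code behind the list:

```python
from collections import Counter

def top_headings(headings, n=30, fold_case=False):
    """Tally 110 $a headings and return the n most common.

    Optional case folding merges variants that differ only in
    capitalization, such as 'Canada.' and 'CANADA.'.
    """
    if fold_case:
        headings = (h.casefold() for h in headings)
    return Counter(headings).most_common(n)
```

Slight variations and typos would still produce separate entries; merging those would need fuzzier matching than simple case folding.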