Archive for the 'Architecture and standards' Category

OCLC Exposes Bibliographic Works Data as Linked Open Data

Tuesday, February 25th, 2014 by Roy

Today in Cape Town, South Africa, at the OCLC Europe, Middle East and Africa Regional Council (EMEARC) Meeting, my colleagues Richard Wallis and Ted Fons made an announcement that should make all library coders and data geeks leap to their feet. I certainly did, and I work here. However, viewed from our perspective, this is simply another step along a road that we set out on some time ago. More on that later, but first to the big news:

  1. We have established “work records” for bibliographic records in WorldCat, which bring together the sometimes numerous manifestations of a work into one logical entity.
  2. We are exposing these records as linked open data on the web, with permanent identifiers that can be used by other linked data aggregations.
  3. We have provided a human readable interface to these records, to enable and encourage understanding and use of this data.

Let me dive into these one by one, although the link above to Richard’s post also has some great explanations.

One of the issues we have as librarians is to somehow relate all the various printings of a work. Think of Treasure Island, for example. Can you imagine how many times that has been published? It hardly seems helpful, from an end-user perspective, to display screen upon screen of different versions of the same work. Therefore, identifying which works are related can have a tremendous beneficial impact on the end user experience. We have now done that important work.

But we also want to enable others to use these associations in powerful new ways by exposing the data as linked (and linkable) open data on the web. To do this, we are exposing a variety of serializations of this data: Turtle, N-Triples, JSON-LD, RDF/XML, and HTML. When looking at the data, please keep in mind that this is an evolutionary process. There are possible linkages not yet enabled in the data that will be added later. See Richard’s blog post for more information on this. The license that applies to this data is the Open Data Commons Attribution license (ODC-BY).

Although it is expected that the true use of this data will be by software applications and other linked data aggregations, we also believe it is important for humans to be able to see the data in an easy-to-understand way. Thus we are providing the data through a Linked Data Explorer interface. You will likely be wondering how you can obtain a work ID for a specific item, which Richard explains:

How do I get a work id for my resources? – Today, there is one way. If you use the OCLC xISBN or xOCLCNum web services you will find, as part of the data returned, a work id (e.g., owi=”owi12477503”). By stripping off the ‘owi’ prefix you can easily create the relevant work URI:
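The prefix-stripping step can be sketched in a few lines of Python. Note that the base URI used below is an assumption for illustration; check the Linked Data Explorer for the canonical form before relying on it.

```python
def work_uri_from_owi(owi: str) -> str:
    # The xISBN/xOCLCNum services return values like owi="owi12477503";
    # strip the "owi" prefix to get the bare work id.
    work_id = owi.removeprefix("owi")
    # Base URI below is an assumption for illustration, not confirmed here.
    return f"http://worldcat.org/entity/work/id/{work_id}"

print(work_uri_from_owi("owi12477503"))
# → http://worldcat.org/entity/work/id/12477503
```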

In a very few weeks, once the next update to the WorldCat linked data has been processed, you will find that links to works will be embedded in the already published linked data. For example you will find the following in the data for OCLC number 53474380:


As you can see, although today is a major milestone in our work to make the WorldCat data aggregation more useful and usable to libraries and others around the world, there is more to come. We have more work to do to make it as usable as we want it to be and we fully expect there will be things we will need to fix or change along the way. And we want you to tell us what those things are. But today is a big day in our ongoing journey to a future of actionable data on the web for all to use.

Visualizations of MARC Usage

Monday, December 2nd, 2013 by Roy

As part of my work to reveal exactly how the MARC standard has been used over the several decades it has existed (available at “MARC Usage in WorldCat”), I’ve always wanted to produce visualizations of the data. Recently, with essential help from my colleagues JD Shipengrover and Jeremy Browning, I was able to do exactly that.

After trying various graphical depictions of the data, we finally settled on an interactive “starburst” view of the data. The initial view provides a high-level summary of how often various tags have been used within particular formats. The interactive part allows you to “drill down” a level into a more detailed view.

We are providing two views of the data: from the point of view of the formats being described (that is, the top-level is comprised of the various formats — books, journals, etc.), and from the point of view of the tags (that is, the top-level is comprised of the various MARC tags).
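The two views differ only in which level sits on the outer ring. A minimal sketch of building both drill-down hierarchies from the same counts (the rows and figures below are invented purely for illustration):

```python
from collections import defaultdict

# Toy (format, MARC tag, record count) rows; the figures are made up.
usage = [
    ("Books", "245", 5),
    ("Books", "650", 3),
    ("Journals", "245", 2),
    ("Journals", "022", 4),
]

def starburst(rows, format_first=True):
    """Nest counts two levels deep: outer ring -> inner ring -> count."""
    tree = defaultdict(dict)
    for fmt, tag, count in rows:
        outer, inner = (fmt, tag) if format_first else (tag, fmt)
        tree[outer][inner] = count
    return dict(tree)

by_format = starburst(usage)                   # formats on the top level
by_tag = starburst(usage, format_first=False)  # tags on the top level
```

Either dictionary can then be fed to a sunburst/starburst charting library, which is essentially what the interactive views do.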

If you have any ideas about a visualization you would like to see, let me know.

MARCEdit Integrates WorldCat Capabilities

Thursday, October 31st, 2013 by Roy

As recently announced by Terry Reese, his program MARCEdit now includes a great set of new capabilities for users of WorldCat. Made possible by the recent release of the WorldCat Metadata API from OCLC, these are just a few of the things you can do directly from MARCEdit:

  • Set Batch Holdings in OCLC.
  • Batch upload/edit records into WorldCat.
  • Search WorldCat directly from within MARCEdit.

This is just the kind of integration that our web services now make available for software of all kinds. By providing an application programming interface (API) that enables not just search and display of records but also updating and creating them, we are exposing the full range of WorldCat metadata capabilities to virtually any software developer.

We have long said that by enabling developers to use our services at a deeper level we would enable new kinds of services that we  could not develop ourselves. Now we are seeing exactly that. Kudos to Terry Reese for building new capabilities into an already stellar application.

Thresholds for Discovery

Wednesday, October 30th, 2013 by Merrilee

I’m pleased to report that we have an article in the most recent Code4Lib Journal! The article, Thresholds for Discovery: EAD Tag Analysis in ArchiveGrid, and Implications for Discovery Systems, is based on an analysis of how EAD (Encoded Archival Description) is used in the ArchiveGrid corpus. We go beyond that to look at EAD-in-use through the lens of discovery — how well is EAD currently meeting our objectives of making our finding aids not only more discoverable, but more functional in discovery environments? In many ways, this article is a reaction to the many questions we receive about ArchiveGrid and why we do not provide a variety of indexes or advanced search features — the encoding and the data simply do not, at this point, support this functionality. We hope this paper can serve as the beginning of a discussion about some focused efforts to improve the current situation.

With more than 120,000 finding aids, we believe this is the largest analysis of EAD done to date, and I’ll immodestly propose that the article is worthy of your attention!

Many thanks to Kathy Wisser and Jackie Dean for sharing their work with us — their earlier study of EAD usage provided an excellent model, and I’m grateful that they were willing to share their early results with us. Their article will be published in the next American Archivist, providing yet another look at EAD. Thanks also to my co-authors: Bruce Washburn, the extraordinary developer behind ArchiveGrid and many other wonderful things, and Marc Bron, a PhD candidate in the computer science department at the University of Amsterdam, who worked with us as an intern in the spring and did the analysis quite handily along with a number of other projects.

We look forward to your comments on the article, as well as your ideas for how to move forward to better thresholds for discovery.

Library Authorities Alternatives

Thursday, October 3rd, 2013 by Roy

Without delving into the dysfunctional politics that has led to the shutdown of most U.S. Government services, we thought it might be helpful to our library and archives colleagues to point out some alternatives to various library authority sources that have been shuttered at the Library of Congress.

In some cases (for example, LCSH vs. FAST) they are not exactly equivalent, but they might be useful during this period nonetheless. Please note that we cannot verify the accuracy or completeness of the Internet Archive copy of any of these sources.

This is a summary of some of the most widely available resources, but please note that bibliographic data, classification schemes and authority files traditionally provided on the site are also available in subscription services from the Library of Congress (Cataloger’s Desktop and Classification Web) [note: these LoC services continue to be made available to subscribers during the shutdown], from OCLC (for cataloging service subscribers), and from other service providers.

Also, for a more complete list of alternative resources, see this spreadsheet (Excel .xlsx file, courtesy of Eric Childress).

Name Authorities

Subject Authorities

MARC Documentation


Genre Headings

Thesaurus for Graphic Materials

We hope these are useful, but even more we hope that very soon we will be back to business as usual. Even then, though, you may find some of these other sources helpful. If you know of other reasonable alternatives, leave a comment below.

WorldCat Linked Data Made More Easily Available to Software

Monday, June 3rd, 2013 by Roy

You may recall that a while back we announced that linked data had been added to WorldCat.org web pages. If you scroll down when viewing a single record you can reveal a “Linked Data” section of the page that is human readable and also “scrapable” via software.

However, it is much easier for software to request a structured data version that does not contain all of the other HTML markup of the page. The best way to do this is through something called “content negotiation”. Basically, it enables a requestor (that is, a software program) to send a request that also tells the web server which format is required. For example, if you want a representation of the data in the JavaScript Object Notation (JSON) format, which many software developers use, then you could issue a command such as this:

curl -L -H "Accept: application/ld+json"

And that is what you would get in return. Alternatively, you could simply request that format by using the appropriate filename extension:

Formats supported include RDF/XML, JSON, and Turtle. Richard Wallis has written a more thorough description of this that can be very helpful in understanding how best to use this new service.
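The content-negotiation idea can be sketched without any network code: build the URL and the Accept header, then hand them to whatever HTTP client you prefer. The MIME types below are the standard ones; the record URL pattern follows the WorldCat record pages discussed above.

```python
# Short format names mapped to the MIME types sent in the Accept header.
ACCEPT = {
    "jsonld": "application/ld+json",
    "rdfxml": "application/rdf+xml",
    "turtle": "text/turtle",
}

def negotiated_request(oclc_number, fmt="jsonld"):
    """Return the (url, headers) pair for a content-negotiated request."""
    url = f"http://www.worldcat.org/oclc/{oclc_number}"
    return url, {"Accept": ACCEPT[fmt]}

url, headers = negotiated_request("53474380")
# Equivalent on the command line:
#   curl -L -H "Accept: application/ld+json" http://www.worldcat.org/oclc/53474380
```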

These changes make it much easier and faster to get the data a developer requires into their application in a highly usable way. We can’t wait to see what they do with it.

ISBNs in WorldCat

Thursday, May 23rd, 2013 by Roy

Recently a question came up on the BIBFRAME list about ISBNs, and how many of them were in MARC records. This is just the kind of question that OCLC Research is uniquely placed to answer, so I quickly wrote some simple Perl code to run as a Hadoop streaming job to find out.

It was remarkably quick and easy to find out, although I had to edit and re-run the code when I discovered a flaw in my logic. This is, sadly, all too frequently the case. But not too much later I had my result:

Records       ISBNs per Record   Percent of WorldCat
230,444,194   0                  77.71%
 55,668,178   2                  18.77%
  4,766,652   1                   1.61%
  3,708,352   4                   1.25%
    616,623   3                   0.21%
    411,230   6                   0.14%
    125,715   8                   0.04%
     65,796   5                   0.02%
     45,304   10                  0.02%
     30,155   12                  0.01%

These are all of the occurrences of a 020 $a in WorldCat as of 1 May 2013 [Added for clarification: the prior sentence describes exactly what is being counted. That is, I am not (yet) examining ISBNs for 10-digit vs. 13-digit; therefore, many of the records with 2 ISBNs may in fact simply have both versions].  A few observations:

  • Many items in WorldCat were published before the invention of the ISBN.
  • Many items in WorldCat are not ISBN-appropriate (e.g., unpublished materials).
  • ISBNs are therefore problematic as identifiers except for a narrow slice of materials (mainly printed books since the mid-60s).

A much better identifier for many purposes is, I assert, the OCLC number.
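The tally itself is simple to reproduce in miniature. The sketch below is not the original Perl/Hadoop job; it is a hedged Python illustration of the same counting logic, using a made-up tuple representation of MARC fields and toy records.

```python
from collections import Counter

def isbn_distribution(records):
    """Tally how many records carry 0, 1, 2, ... ISBNs (020 $a)."""
    tally = Counter()
    for fields in records:
        n = sum(1 for tag, sub, _ in fields if tag == "020" and sub == "a")
        tally[n] += 1
    total = sum(tally.values())
    # Map ISBN count -> (record count, percent of the corpus).
    return {n: (c, 100.0 * c / total) for n, c in tally.items()}

# Three toy records: no ISBN, one ISBN, and both the 10- and 13-digit
# forms (illustrating why two ISBNs per record is so common above).
records = [
    [("245", "a", "Treasure Island")],
    [("020", "a", "0140437681"), ("245", "a", "Treasure Island")],
    [("020", "a", "0140437681"), ("020", "a", "9780140437683")],
]
dist = isbn_distribution(records)
```

In a real Hadoop streaming job, the per-record count would be emitted by the mapper and the tallying done in the reducer, but the logic is the same.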

“Cataloging Unchained”

Wednesday, February 27th, 2013 by Roy

Lorcan Dempsey (VP of Research at OCLC) has long said that we need to “make our data work harder.” And for years that is exactly what OCLC Research has been doing. So when I was asked to speak on data mining at the OCLC Europe, Middle East and Africa Regional Council (EMEARC) Meeting in Strasbourg, France, I knew I would have a lot to talk about. Too much, in fact.

Instead of trying to cover everything we’ve been doing in a whirlwind of slides that no one would remember, I decided to use WorldCat Identities as a “poster child” for the kinds of data mining activities we have been doing recently here at OCLC Research. Then, I described another, related project — the Virtual International Authority File. To bring it all home I mentioned how we’re considering how we might be able to marry these two resources into one “super” identities service.

Consider what it would mean to take an aggregation of library-curated authority records and enhance it with algorithmically-derived data from WorldCat as well as links to other resources about creators, such as Wikipedia. This would provide a rich resource of information about creators, all sitting behind authoritative and maintained identifiers that could be used in emerging new bibliographic structures such as the one being created by the Library of Congress’ Bibliographic Framework Transition Initiative. The mind reels with the possibilities.

But before I could jump into all this I needed a way to quickly explain why we are doing things like this — and how we are doing them. I decided I needed to make a video. So last week that is exactly what I did, with help from colleagues in Dublin. The result was less than three-and-a-half minutes long, and yet it amply set the stage for what was to come after. Plus, it can have a life of its own.

Take a look yourself, at “Cataloging Unchained”, and let me know what you think in the comments.

Top Corporate Names in WorldCat

Tuesday, November 20th, 2012 by Roy

As I explained earlier, I have been doing some investigations into how MARC has been used over the last several decades. Curious about the contents of the 110 $a (corporate names), I parsed it and the top 30 headings are listed below. Keep in mind a few things, however:

  • Entities can be put together in different ways. For example, “Great Britain,” “England and Wales,” and “Scotland” all appear in the list.
  • My process (as presently constituted) is simplistic. Therefore, both “Canada.” and “CANADA.” are counted separately.
  • Slight variations in headings produce different entries. For example, “Santa Fe River Baptist Association (Fla.)” and “Santa Fe River Baptist Association.”
  • Typos produce different entries.

Eventually I will make the entire list available. If you’re really eager, email me.

1,417,046  United States.
  587,986  Great Britain.
  358,417  France.
  206,591  Canada.
  176,754  Geological Survey (U.S.)
  101,421  California.
   98,397  Michigan.
   79,615  Australia.
   78,175  Catholic Church.
   64,390  New York (State).
   57,037  New Zealand.
   48,218  Sotheby's (Firm)
   46,196  Hôtel Drouot.
   45,853  Québec (Province).
   44,812  New South Wales.
   44,022  England and Wales.
   43,469  Massachusetts.
   41,914  Pennsylvania.
   41,560  Christie, Manson & Woods.
   41,292  Église catholique.
   39,517  Ontario.
   36,636  Scotland.
   36,234  Illinois.
   34,691  United Nations.
   31,121  India.
   31,011  Agence de presse Meurisse.
   29,958  Cornell University.
   29,648  Church of England.
   29,073  Japan.
   28,675  Victoria.
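The case-sensitivity pitfall noted in the caveats above (“Canada.” versus “CANADA.”) is easy to demonstrate, and partially fix, with a small sketch. The headings below are toy data; a production pass would also have to cope with punctuation variants and typos.

```python
from collections import Counter

headings = ["Canada.", "CANADA.", "Canada.", "United States."]

# Naive tally: case variants are counted separately, as in the list above.
naive = Counter(headings)

# Case-folding (plus whitespace stripping) merges the case variants.
folded = Counter(h.strip().casefold() for h in headings)
```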

Linked Data – for the enlightened non-geek reader (or dummies) (or managers)

Tuesday, June 26th, 2012 by Jim

OCLC had some big announcements about linked data this past week. My colleagues, Roy Tennant and Richard Wallis, both have good blog posts (Roy’s) (Richard’s) explaining the what and the why of making WorldCat data available in a linked data format. The announcements got nice press and supportive criticism from people like Ed Chamberlain and Adrian Pohl.

It also caused folks to wonder if I could explain linked data to them.


There are, however, some very brief, very elementary explanations out there that ought to do the job for this interested but non-nerd audience.

I recommend these brief videos, which convey the rudiments of RDFa, JSON and Linked Data. The fellow who did them has a nice manner, charms with his hand-drawn flash cards, and gives you enough while steering around the usual avalanche of angle brackets that characterizes other explanations. Plus, the videos share the same introductory material, so you can skip forward on the subsequent videos. (Thanks to Bruce Washburn and Jeff Young.)

For something slicker and a bit more substantial try A skim-read introduction to linked data by two of the technologists in the BBC Research and Development Group. Toggle between the slide view and the continuous scroll view if you’re impatient.

And if you need parables you could try this post Linked Data for Dummies or A dummy’s introduction to linked data (me being the dummy).

And if you insist on a use case, here’s the oldest and best: Use of Semantic Web Technologies on the BBC Web Sites.

The ‘enlightened non-geek reader’ phrase draws on a comment made to me by Chet Grycz when he was at the University of California Press. He used to talk about ENSORs, saying that all university press people believed in these mythical creatures. Press people were confident that there were lots of ENSORs out in the wild, but in fact no press person had ever had a personal encounter with one. Okay, Chet, what’s an ENSOR? An Enlightened Non-Scholarly Reader. ;)

Update, 8 August 2012: OCLC just released a video explaining linked data on our YouTube channel. It’s quite good, very informative and graphically rich. If you’re motivated to understand the basics, want to know why this is important to libraries, and wonder how linked data will make a difference, then this will reward the approximately fifteen minutes it takes to view.