Archive for the 'Infrastructure' Category

Another Step on the Road to Developer Support Nirvana

Monday, March 10th, 2014 by Roy

Today we released a brand spanking new web site for library coders. It has some cool features including a new API Explorer that will make it a lot easier for software developers to understand and use our application programming interfaces (APIs). But seen from a broader perspective, this is just another way station on a journey we began some years ago to enable our member libraries to have full machine access to our services.

When I joined OCLC in May 2007, I immediately began collaborating with my colleagues in charge of these efforts, as I knew many library developers and had been active in the Code4Lib community. As a part of this effort, we flew in some well-known library coders to our headquarters in Dublin, OH, to pick their brains about the kinds of things they would like to see us do, which helped us to form a strategy for ongoing engagement.

From there we hired Karen Coombs, a well-known library coder from the University of Houston, to lead our engagement efforts. Under Karen’s leadership we engaged with the community in a series of events we began calling hackathons, although we soon changed to calling them “mashathons” in response to the pejorative nature the term “hack” had in Europe. In those events we brought together library developers to spend a day or two of intense learning and open development. The output of those events began populating our Gallery of applications and code libraries.

Karen also dug into the difficult, but very necessary, work of documenting our APIs more thoroughly and consistently. Her yeoman work in this regard gave us a more consistent set of documentation that is easier to understand and use, and that we continue to build upon and improve.

When Karen was moved into another area of work within OCLC to better use her awesome coding ability, Shelley Hostetler was hired to carry on this important work.

I think you will find this latest web site release even easier to understand and navigate. One essential difference is that it is much easier to get started, since we have better integrated information about, and access to, key requests and key management when those are required (some services do not require a key).

Although this new site offers a great deal to developers who want to know how to use our growing array of web services, we recognize it is but another step along the road to developer nirvana. So check it out and let us know how we can continue to improve. As always, we’re listening!


OCLC Exposes Bibliographic Works Data as Linked Open Data

Tuesday, February 25th, 2014 by Roy

Today in Cape Town, South Africa, at the OCLC Europe, Middle East and Africa Regional Council (EMEARC) Meeting, my colleagues Richard Wallis and Ted Fons made an announcement that should make all library coders and data geeks leap to their feet. I certainly did, and I work here. However, viewed from our perspective this is simply another step along a road that we set out on some time ago. More on that later, but first to the big news:

  1. We have established “work records” for bibliographic records in WorldCat, which bring together the sometimes numerous manifestations of a work into one logical entity.
  2. We are exposing these records as linked open data on the web, with permanent identifiers that can be used by other linked data aggregations.
  3. We have provided a human readable interface to these records, to enable and encourage understanding and use of this data.

Let me dive into these one by one, although the link above to Richard’s post also has some great explanations.

One of the issues we have as librarians is to somehow relate all the various printings of a work. Think of Treasure Island, for example. Can you imagine how many times that has been published? It hardly seems helpful, from an end-user perspective, to display screen upon screen of different versions of the same work. Therefore, identifying which works are related can have a tremendous beneficial impact on the end user experience. We have now done that important work.

But we also want to enable others to use these associations in powerful new ways by exposing the data as linked (and linkable) open data on the web. To do this, we are exposing a variety of serializations of this data: Turtle, N-Triples, JSON-LD, RDF/XML, and HTML. When looking at the data, please keep in mind that this is an evolutionary process. Some possible linkages are not yet enabled in the data but will be later. See Richard’s blog post for more information on this. The license that applies to this is the Open Data Commons Attribution license, or ODC-BY.

Although it is expected that the true use of this data will be by software applications and other linked data aggregations, we also believe it is important for humans to be able to see the data in an easy-to-understand way. Thus we are providing the data through a Linked Data Explorer interface. You will likely be wondering how you can obtain a work ID for a specific item, which Richard explains:

How do I get a work id for my resources? – Today, there is one way. If you use the OCLC xISBN, xOCLCNum web services you will find as part of the data returned a work id (e.g. owi="owi12477503"). By stripping off the ‘owi’ you can easily create the relevant work URI:

In a very few weeks, once the next update to the WorldCat linked data has been processed, you will find that links to works will be embedded in the already published linked data. For example you will find the following in the data for OCLC number 53474380:
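The prefix-stripping step in Richard's explanation can be sketched in a few lines of Python. This is only an illustration: the function name `work_uri` and the `PLACEHOLDER_BASE` value are assumptions made up for the example, and the real base URI for works should be taken from Richard's post or the OCLC documentation.

```python
def work_uri(owi: str, base: str) -> str:
    """Build a work URI from an 'owi'-prefixed work id (e.g. 'owi12477503')."""
    prefix = "owi"
    if not owi.startswith(prefix):
        raise ValueError(f"expected an id starting with {prefix!r}, got {owi!r}")
    number = owi[len(prefix):]  # strip off the 'owi'
    if not number.isdigit():
        raise ValueError(f"work id is not numeric after the prefix: {owi!r}")
    return base.rstrip("/") + "/" + number


# Hypothetical base URI, used only so the example runs.
PLACEHOLDER_BASE = "https://example.org/work/id/"

print(work_uri("owi12477503", PLACEHOLDER_BASE))
```

Validating the id before building the URI is a design choice for the sketch; it keeps a malformed value from silently turning into a broken link.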


As you can see, although today is a major milestone in our work to make the WorldCat data aggregation more useful and usable to libraries and others around the world, there is more to come. We have more work to do to make it as usable as we want it to be and we fully expect there will be things we will need to fix or change along the way. And we want you to tell us what those things are. But today is a big day in our ongoing journey to a future of actionable data on the web for all to use.

Learning Commons: well-made in Japan

Wednesday, November 27th, 2013 by Jim

During a very hectic, very interesting visit to research libraries in Japan last week, I had the good fortune to tour the new (April 2013) Learning Commons at Doshisha University. It is not a library-managed facility, but the library helps to staff it along with other Student Support Services staff. The facility itself is as good an implementation as I’ve seen anywhere, including the new facilities at North Carolina State University’s new library.

The Doshisha University Learning Commons brochure

The Commons itself is a multi-story structure constructed adjacent to the library and connected to the library at various levels. As a consequence students can move very freely from the collections and quiet of the traditional library to the group study, presentation, production and technology areas of the learning commons. There are plenty of visible but unobtrusive staff available to the students. People in red jackets offer technology support, those in blue jackets offer peer instruction and guidance, those in yellow handle media production, and each floor has a desk staffed by a librarian.

There are no fixed furnishings in the entire facility. Everything can be moved. As an experiment they left one group study space with two tables without rollers. That space is the least frequently used in the building. I was impressed with the energy of the staff and the enthusiasm of the students. The location of the facility bordering on one of the busiest streets in Kyoto purposely serves to advertise the learning environment of this private university. The big study and computing rooms are lined up along picture windows that face out onto this boulevard ensuring that Kyoto citizens know that Doshisha is a good place to learn.

Check out some photos taken during my walk-through in this Flickr set. Look for the Global Village sign that designates an area where no Japanese is to be spoken.

P.S. After the original post my colleagues at Doshisha advised me that an English language version of their Learning Commons brochure is available (.pdf).

OCLC Control Numbers – Lots of them; all public domain

Monday, September 23rd, 2013 by Jim

For the last few years I have been part of a group of OCLC staff charged with articulating data sharing practices that are consistent with the WorldCat Rights and Responsibilities for the OCLC Cooperative. We’ve made good progress towards openness while making expectations and practices more regular and consistent. The recommendation to use the ODC Attribution license, the release of substantial sets of bibliographic data and the understandings we reached with DPLA and Europeana are all part of that progress. We also recommended that OCLC declare OCLC Control Numbers (OCN) as dedicated to the public domain. We wanted to make it clear to the community of users that they could share and use the number for any purpose and without any restrictions. Making that declaration would be consistent with our application of an open license for our own releases of data for re-use and would end the needless elimination of the number from bibliographic datasets that are at the foundation of the library and community interactions.

I’m pleased to say that this recommendation got unanimous support, and my colleague Richard Wallis spoke about this declaration in his linked data session at the recent IFLA conference. The declaration now appears on the WCRR web page and on the page describing OCNs and their use.

We think this is important to do to counteract some practices based on misunderstandings that emerged from concerns about OCLC having an overly restrictive record use and re-use policy.

One of the most unfortunate misunderstandings grew up around the OCLC Control Number (OCN). The OCLC Control Number is a unique, sequentially assigned number associated with a record in WorldCat. The number is included in a WorldCat record when the record is created. More than one billion have been assigned. (Yes, a billion.) Some people thought that the Control Number represented a mechanism for identifying a record as having originated with OCLC and therefore subject to the cooperative’s record use policy.

This caused institutions to strip the OCN from bibliographic records. For similar reasons commercial information users would sometimes delete the OCN from the data that they used. This is unfortunate behavior that diminishes the value of the OCN as an identifier and compromises some of the innovation that could occur if the OCN were more universally used. It’s an important element in linked library data that helps in the creation and maintenance of work sets and provides a mechanism to disambiguate authors and titles.

More importantly the OCN is also widely used within the broad system of information that flows among libraries, national information agencies, commercial information providers and organizations that supply consumers with book and journal-oriented services. For instance,
• Cataloging and IT librarians download OCLC MARC bibliographic records to the library’s local system.
• Resource sharing librarians using third party ILL management programs store or use the OCLC number for searching.
• Reference services librarians with WorldCat Local use it to help a patron locate an item.

Publishers, vendors and others that partner with OCLC and libraries also use the OCN. For example,
• Integrated Library System (ILS) vendors use the OCN to manage changes and updates within their application environment,
• Publishers, material suppliers and eContent providers use OCLC MARC bibliographic records in their systems and rely on the OCN as an identifier,
• Developers maintaining or expanding services use OCLC Control Numbers as an integral component of their application architecture.

All these good things can happen because of the identifying power of the OCN and its ubiquity in the library description domain. Everyone should use them and take advantage of what they can help you do. This declaration removes any residual concern that may have incorrectly informed operating practices. We hope it makes a difference.

Sliding scale: mapping local, group and system-wide library infrastructure

Sunday, July 28th, 2013 by Constance

Recently, we looked at how Sankey diagrams might be used to visualize the flow of library resources within and across inter-lending networks. It was a useful exercise, but it left me feeling that a critical dimension was lacking: a measure of the geographic distance between inter-lending partners. Knowing that a significant share of the inter-lending demand fulfilled by CIC libraries is generated outside the CIC group is important in its own right, but it doesn’t tell us much about the relative costs of serving ‘in-group’ and ‘out-of-group’ partners. If most of the non-CIC borrowers are located within close proximity of CIC lenders, the costs of fulfilling returnable requests (which must travel to and from the borrower) will be less than if the non-CIC borrowers are located farther away. I tried mapping the Sankey flows to ZIP codes, but unless one is already familiar with the codes, it is fairly difficult to visualize the distance covered.

The obvious solution was to plot the outbound and returning flows on a map. At first, I did this by dropping markers for the borrowing and lending partners on a map, using a simple Web application (BatchGeo) that uses the Google Maps API to generate map-based data visualizations. BatchGeo is a nice tool, but in this case the result wasn’t very pleasing — the density of same-sized location markers in some locations made it difficult to read and, more importantly, obscured patterns in the relative concentration and diffusion in different regions. This was particularly true in comparatively small states. It was a very noisy picture.  Even if one looks at a small fraction of the inter-lending partners, the result is an irritating blur.  Limiting the inter-lending population to the top 5% of borrowers resulted in this not especially informative picture:

Top 250 CIC borrowers by location

Highlighting the borrowers located in the ChiPitts megaregion made it only slightly more interesting:

Top 250 within and outside ChiPitts

By chance, as I was experimenting with these maps, Jim Michalko (my boss) stopped by my desk to chat about a recent article in the New York Times on the geography of economic mobility. He half-jokingly suggested that we overlay a map of North American library infrastructure on the map of economic mobility, to see if there was any correlation between the availability of library services and the likelihood that individuals can better their economic lot in life. Well, why not? I already had a geo-coded set of US libraries — all I needed to do was map those to pre-existing shape files to produce a county-level view of library infrastructure. I used a freely available ZIP code data table to map ZIP data to county-level boundaries. The mappings are not perfect, but I considered them good enough for my purpose — which was not to produce an exact map of all library locations, but simply to compare the relative density of regional library infrastructure. With this in hand, I could use a method outlined by Robert Mundigl in his Clearly and Simply blog to associate data values with colored fill gradients in choropleth maps, using Excel.
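The roll-up described above (ZIP codes mapped to counties, then counts mapped to fill classes) can be sketched in Python. This is not the Excel method from the Clearly and Simply blog, just an illustration of the same two steps; the ZIP codes, county names and class breaks below are invented for the example.

```python
from collections import Counter


def libraries_per_county(library_zips, zip_to_county):
    """Tally libraries per county; ZIPs with no county match are counted separately."""
    counts = Counter()
    unmatched = 0
    for zip_code in library_zips:
        county = zip_to_county.get(zip_code)
        if county is None:
            unmatched += 1  # imperfect mappings are expected, so track them
        else:
            counts[county] += 1
    return counts, unmatched


def shade_class(count, breaks):
    """Return the index of the fill-gradient class for a count.

    breaks holds ascending, inclusive upper bounds; anything above the
    last bound falls into one final class.
    """
    for index, upper in enumerate(breaks):
        if count <= upper:
            return index
    return len(breaks)


# Invented sample data, for illustration only.
zip_to_county = {"00001": "County A", "00002": "County B"}
library_zips = ["00001", "00001", "00002", "99999"]

counts, unmatched = libraries_per_county(library_zips, zip_to_county)
for county, n in sorted(counts.items()):
    print(county, n, "class", shade_class(n, breaks=[1, 5, 20]))
print("unmatched ZIPs:", unmatched)
```

Keeping a separate count of unmatched ZIP codes makes the gaps in the mapping visible instead of letting them quietly shrink the totals.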

Here is the result:

County-level distribution of WorldCat libraries in the United States

It is not a complete map — I wasn’t able to map every OCLC library symbol in the United States to a valid ZIP-based county, and not every library in the US has an OCLC symbol. Still, with nearly 30 thousand libraries, it is more comprehensive  than a map produced earlier this year based on IMLS data for about 17 thousand public libraries.

The first thing to be said about this map is that it does not suggest that there is any obvious correlation between the concentration of library resource (infrastructure) and economic mobility. Several of the areas that authors of the Equality of Opportunity study highlight as places where children of low-income families have a relatively greater likelihood of rising in the income distribution have comparatively limited library infrastructure. Admittedly, the geographic unit of measure in the two maps differs — I used counties (partly because they are readily available as shape files), while the researchers used commuting zones. It’s not obvious to me that if the library data were aligned to commuting zones, the picture would look much different: our data suggests that there is comparatively little library infrastructure in the upper Northeast zone of Nebraska, whether one relies on county or commuting zone boundaries — yet, this is an area where inter-generational income gains are relatively frequent. Conversely, in metro areas like Chicago where library infrastructure is comparatively dense, there is reportedly a pretty low level of inter-generational income gain.

Of course, to judge the strength of the US library system based on the geographic distribution of libraries alone is to overlook a vital — perhaps the most vital — attribute of the library enterprise. Libraries are in the business of increasing access to information by sharing resources that are distributed across broad networks of related institutions: public libraries, academic libraries, special libraries etc. Libraries are part of what is now fashionably termed the ‘sharing economy.’ To measure the integrity or vitality of the library system, one needs to take into account the efficiency of flows across the library network.  A successful library system is one that ensures that a child (or adult) in rural Nebraska has access to the same collections and services as a child (or adult) in Minneapolis or Seattle.

It would be interesting to investigate whether geographic areas that favor economic mobility are co-extensive with areas where library flows — the balance of supply and demand for library resources — are notably efficient. Perhaps some LIS PhD student will take up the challenge. My current objective is a lot more prosaic: modeling supply and demand within and outside of a given library consortium to inform decisions about local and shared stewardship of print collections. For this, I think the county-level choropleth is actually quite useful. It helps to show how demand is distributed at ‘above-the-institution’ scale, and this is important for understanding the role of logistics in optimizing the flow of library resources.

This is what a county-level heatmap of demand for CIC returnables looks like, reflecting cumulative inter-lending request activity over a period of about seven years:

Percent of CIC Returnable Borrowing by US Counties

What does this map tell us? A few important things are immediately discernible:

  • CIC libraries (which are mostly located in the Midwest) serve institutions across an enormous geographic range in the United States.
  • Regional demand is concentrated in a relatively small number of counties.
  • The relative volume of demand is quite low; counties with the greatest number of request transactions individually account for less than 7% of total demand.

Some of this information was equally visible in the institution location map that I had started with — but the county-based version is less noisy and enables me to roll up a great deal more data (about 1.3 million transactions and five thousand borrowers) in a single picture.  It also raises some additional questions:

  • Are the ‘hotspots’ of demand an artifact of aggregating demand over several years?  I.e. did all of the demand from Southern California occur in the last twelve months or is it a recurring pattern over many years?
  • Has the geographic range of demand changed over time?  Are CIC libraries serving a broader range of institutional partners today than they did five years ago?
  • Do individual CIC member libraries serve a comparable range of institutions?  Is ‘long-range lending’ associated with libraries that hold materials that are relatively scarce in the overall system, or are materials traveling farther than is necessary to meet demand?

It was easy enough to plot annual demand for an individual lender, so I produced a new series of maps looking at the county-level location of institutions who borrowed returnable items from the Ohio State University Library (symbol OSU) over a few years.  This time, I used the absolute number of loans as the input, so that even low-volume borrowers would be visible.  Here are the results for:


Locations of borrowers from OSU in CY2008.


Locations of borrowers from OSU in CY2010.

and 2012

Locations of borrowers from OSU in CY2012.

At first glance, they might appear to be the same map, yet there are minor variations from year to year. The consistency in regions of demand is interesting, since it suggests that there is some predictability in the sources of demand: year after year, institutions in Southern California (mostly Los Angeles County) and Southern Arizona (mostly Pima County) have turned to OSU as a supplier one hundred times or more.  Why does this matter?  A pattern of sustained demand might suggest that a subscription-based pricing model would benefit partners on both sides, providing predictability in budgeting while also giving OSU documentation of the continuing value its holdings are producing for other institutions (and geographies).

The variations in demand are equally interesting.  For example, demand for OSU resources from institutions in the Pacific Northwest seems to have waned somewhat — could this be a result of improved intra-regional inter-lending arrangements and courier service within the Orbis-Cascade group?  Less visible in these pictures, but no less intriguing (I think) is the decreasing range of geographies served by OSU between 2008 and 2012, which amounts to a reduction of 12%.  This trend holds up over a longer period, not reflected in the maps above.  What might account for this change?  Is demand being deflected to other suppliers? Or is the increasing volume of requests generated by ‘in-network’ CIC partners displacing fulfillment for non-CIC institutions?
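A change in geographic range like the one mentioned above could be computed as a percent change in the number of distinct counties served. The sketch below is an assumption about the method, since the post does not say which geographic unit or formula was used, and the loan figures are made up so that the arithmetic is easy to follow.

```python
def counties_served(loans_by_county):
    """Return the set of counties with at least one loan in a given year."""
    return {county for county, loans in loans_by_county.items() if loans > 0}


def percent_change(before, after):
    """Percent change from 'before' to 'after'; negative means a reduction."""
    return 100.0 * (after - before) / before


# Invented figures: 25 counties served in the first year, 22 in the second.
loans_first_year = {f"county-{i}": 1 for i in range(25)}
loans_second_year = {f"county-{i}": 1 for i in range(22)}

before = len(counties_served(loans_first_year))
after = len(counties_served(loans_second_year))
print(f"{percent_change(before, after):.0f}%")
```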

There’s a lot more to explore in these complex inter-lending networks — and I suspect that visualizing flows between institutions and across geographies will become increasingly important in monitoring, analyzing and improving efficiency in the library system as a whole.  OCLC makes an increasingly wide range of data (including ILL policy and institution data) available for programmatic use by developers and others, and this will hopefully lead to more experimentation with visualization and library analytics.

WorldCat Linked Data Made More Easily Available to Software

Monday, June 3rd, 2013 by Roy

You may recall that a while back we announced that linked data had been added to WorldCat web pages. If you scroll down when viewing a single record you can reveal a “Linked Data” section of the page that is human readable and also “scrapable” via software.

However, it is much easier for software to request a structured data version that does not contain all of the other HTML markup of the page. The best way to do this is through something called “content negotiation”. Basically it enables a requestor (that is, a software program) to send a request that also tells the web server which format is required. For example, if you want a representation of the data in the JavaScript Object Notation (JSON) format, which many software developers use, then you could issue a command such as this:

curl -L -H "Accept: application/ld+json"

And that is what you would get in return. Alternatively, you could simply request that format by using the appropriate filename extension:

Formats supported include RDF/XML, JSON, and Turtle. Richard Wallis has written a more thorough description of this that can be very helpful in understanding how best to use this new service.
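For developers who would rather do content negotiation from a script than from curl, a minimal Python sketch follows. It only builds the request, so it runs without network access. The record URL is a placeholder, and the media types are the standard registered ones for these serializations; whether the service honors each of them is an assumption.

```python
from urllib.request import Request

# Standard media types for the serializations named above.
MEDIA_TYPES = {
    "json-ld": "application/ld+json",
    "rdf-xml": "application/rdf+xml",
    "turtle": "text/turtle",
}


def build_request(url, fmt):
    """Build a GET request whose Accept header asks for one serialization."""
    return Request(url, headers={"Accept": MEDIA_TYPES[fmt]})


# Placeholder URL, for illustration only.
request = build_request("https://example.org/record/12345", "turtle")
print(request.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would send it; `urlopen` follows redirects by default, which plays the same role as the `-L` flag in the curl command.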

These changes make it much easier and faster to get the data a developer requires into their application in a highly usable way. We can’t wait to see what they do with it.

ISBNs in WorldCat

Thursday, May 23rd, 2013 by Roy

Recently a question came up on the BIBFRAME list about ISBNs, and how many of them were in MARC records. This is just the kind of question that OCLC Research is uniquely placed to answer, so I quickly wrote some simple Perl code to run as a Hadoop streaming job to find out.

It was remarkably quick and easy to find out, although I had to edit and re-run the code when I discovered a flaw in my logic. This is, sadly, all too frequently the case. But not too much later I had my result:

Records      020 $a per record   Percent of WorldCat
230444194    0                   77.71%
55668178     2                   18.77%
4766652      1                   1.61%
3708352      4                   1.25%
616623       3                   0.21%
411230       6                   0.14%
125715       8                   0.04%
65796        5                   0.02%
45304        10                  0.02%
30155        12                  0.01%

These are all of the occurrences of a 020 $a in WorldCat as of 1 May 2013 [Added for clarification: the prior sentence describes exactly what is being counted. That is, I am not (yet) examining ISBNs for 10-digit vs. 13-digit; therefore, many of the records with 2 ISBNs may in fact simply have both versions].  A few observations:

  • Many items in WorldCat were published before the invention of the ISBN.
  • Many items in WorldCat are not ISBN-appropriate (e.g., unpublished materials).
  • ISBNs are therefore problematic as identifiers except for a narrow slice of materials (mainly printed books since the mid-60s).

A much better identifier for many purposes is, I assert, the OCLC number.
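The tally behind the table above can be sketched as follows. The original job was Perl run as a Hadoop streaming job; this is a small in-memory Python stand-in, and the record layout (a list of tag and subfield pairs) is a simplified assumption rather than real MARC parsing.

```python
from collections import Counter


def isbn_count(record):
    """Count the 020 $a subfields in one record.

    record is a list of (tag, [(subfield_code, value), ...]) pairs, a
    simplified layout assumed for this example.
    """
    return sum(
        1
        for tag, subfields in record
        if tag == "020"
        for code, _value in subfields
        if code == "a"
    )


def tally(records):
    """Return (records, 020 $a per record, percent) rows, most common first."""
    per_record = Counter(isbn_count(record) for record in records)
    total = sum(per_record.values())
    return [
        (n_records, n_isbns, 100.0 * n_records / total)
        for n_isbns, n_records in per_record.most_common()
    ]


# Invented sample records, for illustration only.
sample = [
    [("245", [("a", "Treasure Island")])],
    [("020", [("a", "0000000000")]), ("020", [("a", "0000000000000")])],
    [("020", [("z", "cancelled")])],
]

for n_records, n_isbns, percent in tally(sample):
    print(n_records, n_isbns, f"{percent:.2f}%")
```

Like the original count, this treats every 020 $a as one occurrence, so a record carrying both a 10-digit and a 13-digit form of the same ISBN counts as two.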

We Want to Send You to SemTechBiz

Monday, April 29th, 2013 by Roy

SemTechBiz is a major conference for those who are using semantic web technologies like linked data, RDF, etc. It is being held June 2-5 in San Francisco, and OCLC and LITA have teamed up to send a librarian there to share the good work that libraries are doing to produce and consume linked data.

We will pay the expenses of the selected individual to attend the conference where they will also be afforded a lightning talk slot to highlight their work for conference attendees. This is the first “Library Spotlight on Innovation” that we jointly developed with the producers of the conference. Richard Wallis, our Linked Data Evangelist, was instrumental in putting this together.

So are you doing something interesting with linked data? Or do you know of someone who is? If so, you can nominate yourself or someone else for this great opportunity. We want the broader world to know about how libraries are innovating with linked data.

Regional print management and cooperative infrastructure: maps and gaps

Monday, March 4th, 2013 by Constance


We are excited to be working with the Ohio State University (OSU) and the Committee on Institutional Cooperation (CIC) on a new project to explore the contours of a regional strategy for managing the print book resource in the CHI-PITTS mega-region. Regular readers of this blog will know that mega-regions are geographic areas that typically encompass multiple population centers, exhibit a high degree of economic integration, and are bound together by a rich network of transportation, logistics, and communications infrastructure, as well as mutual cultural interests and similarities. Mega-regions are an intriguing concept for thinking about collaborative activities that scale above small groups of institutions, or even existing library consortia. OCLC Research recently published a report that used a mega-regions framework to explore the characteristics and implications of a North American network of regionally consolidated print book collections.

Over the last few months, we have explored this issue further by working with several US regional library consortia to examine their collective print book holdings in the context of the print book resource and infrastructure available in the mega-region most closely aligned with the location of the consortial membership. We have produced profiles for the Statewide California Electronic Library Consortium (SCELC) in the context of the SO-CAL mega-region; the Association of Southeastern Research Libraries (ASERL) and the Washington Research Library Consortium (WRLC) in the context of the CHAR-LANTA mega-region; and the National Institute for Technology in Liberal Education (NITLE) membership in the context of the BOS-WASH mega-region. We plan to publish a series of case studies highlighting the findings from these consortial profiles in the near future.

Our new collaboration with OSU and the CIC is an extension of this consortial profiling work. In this project, we will examine print book holdings at multiple levels: an institution (OSU); a library consortium (CIC); and a mega-region (CHI-PITTS). The purpose of the work is to conduct a detailed analysis of the factors that an individual library might bring to bear in selecting books to contribute to a shared consortial collection, as well as to compare both the individual library collection and the consortial print book resource to the broader context of the print book resource available in the surrounding mega-region. The CHI-PITTS mega-region, which extends across the upper Midwest from Chicago to Pittsburgh, is the mega-region which aligns most closely with the locations of the CIC membership.

Some of the questions we will address include:

  • What part of the OSU print book collection represents a distinctive asset when compared to the aggregate print book holdings within the CIC membership, or the broader CHI-PITTS mega-regional print book resource? What are the characteristics of these distinctive resources with respect to subject, age, and system-wide work-level holdings?
  • What part of the OSU collection is widely held across the collections of the CIC membership, or institutions within the CHI-PITTS region? Can a “core” set of titles be identified, at the consortial or regional level, that represent duplicative investment? Are there opportunities to reduce local costs by managing these titles as a shared resource at the consortial or regional level?
  • What does the ILL demand profile for OSU tell us about consortial and regional demand for its print book collection? How much of this demand is centered around OSU’s distinctive print book titles? How can OSU cooperate with other CIC members to meet local, consortial, and regional demand for print books?

Carol Pitts Diedrichs, Director of OSU Libraries, has posted a nice summary of the thinking that led up to this joint effort.

OSU volunteered to serve as a test case for this project, with the understanding that findings from the analysis will be useful to all CIC member libraries considering shared print archiving arrangements. Of course, we hope the project will be useful to other libraries as well. There is growing interest in how (or if) the lessons learned in journal archiving projects like the Western Regional Storage Trust (WEST) or the CIC Shared Print Repository can be applied to cooperative efforts to preserve monographic collections. This project should provide some answers. We expect to post periodic updates on the project over the next several months here on Hanging Together, and will publish a synthesis of findings in a final report later this year.


“Cataloging Unchained”

Wednesday, February 27th, 2013 by Roy

Lorcan Dempsey (VP of Research at OCLC) has long said that we need to “make our data work harder.” And for years that is exactly what OCLC Research has been doing. So when I was asked to speak on data mining at the OCLC Europe, Middle East and Africa Regional Council Meeting in Strasbourg, France, I knew I would have a lot to talk about. Too much, in fact.

Instead of trying to cover everything we’ve been doing in a whirlwind of slides that no one would remember, I decided to use WorldCat Identities as a “poster child” for the kinds of data mining activities we have been doing recently here at OCLC Research. Then, I described another, related project — the Virtual International Authority File. To bring it all home I mentioned how we’re considering how we might be able to marry these two resources into one “super” identities service.

Consider what it would mean to take an aggregation of library-curated authority records and enhance it with algorithmically-derived data from WorldCat as well as links to other resources about creators such as Wikipedia. This would provide a rich resource of information about creators, all sitting behind authoritative and maintained identifiers that could be used in emerging new bibliographic structures such as those being created by the Library of Congress’ Bibliographic Framework Transition Initiative. The mind reels with the possibilities.

But before I could jump into all this I needed a way to quickly explain why we are doing things like this — and how we are doing them. I decided I needed to make a video. So last week that is exactly what I did, with help from colleagues in Dublin. The result was less than three-and-a-half minutes long, and yet it amply set the stage for what was to come after. Plus, it can have a life of its own.

Take a look yourself, at “Cataloging Unchained”, and let me know what you think in the comments.