Archive for the 'Architecture and standards' Category

Two Huge Linked Data Announcements

Wednesday, June 20th, 2012 by Roy

This week we have announced two major initiatives that are now providing significant library linked data resources to the world. First was the announcement yesterday that all of the 23rd Edition of the Dewey Decimal Classification has been released on the web as linked data. From the announcement:

All assignable classes from DDC 23, the current full edition of the Dewey Decimal Classification, have been released as Dewey linked data. As was the case for the Abridged Edition 14 data, we define “assignable” as including every schedule number that is not a span or a centered entry, bracketed or optional, with the hierarchical relationships adjusted accordingly. In short, these are numbers that you find attached to many WorldCat records as standard Dewey numbers (in 082 fields), as additional Dewey numbers (in 083 fields), or as number components (in 085 fields).

Second was today’s announcement that we have now added Schema.org descriptive markup, as well as a draft set of library extensions, to all of WorldCat. From the press release:

OCLC is taking the first step toward adding linked data to WorldCat by appending Schema.org descriptive mark-up to WorldCat.org pages. WorldCat.org now offers the largest set of linked bibliographic data on the Web. With the addition of Schema.org mark-up to all book, journal and other bibliographic resources in WorldCat.org, the entire publicly available version of WorldCat is now available for use by intelligent Web crawlers, like Google and Bing, that can make use of this metadata in search indexes and other applications.

For more information, see “Linked Data at OCLC”. Please keep in mind that these efforts are beginning steps. We will be reviewing the feedback we receive and likely making changes as opportunities to improve present themselves. For example, we are working to pull together a group of institutions that can collaborate on establishing a set of extensions to the Schema.org elements. A very beginning draft is available, but it will likely go through many changes as others become more closely involved. We welcome your participation.

Follow-up addendum: We’ve had several folks ask about data dumps relative to the linked data announcement. Adding linked data to WorldCat.org is, for the time being, an experiment that we’re putting out there in order to garner feedback and get some early usage results. We expect our model to change; because of that, we’re not publishing any bulk downloads of the data at this time.

Harvard bibliographic data released with prominent nod to OCLC

Tuesday, April 24th, 2012 by Jim

Member of the Charles River Basin Community Sailing Club Enjoy an Evening Sail. for a Dollar a Year, Youngsters Up to Age 17 Can Join the Club and Learn to Handle a Boat 08/1973

Into the flow.

Back in October we were excited to announce the final step in a project on which OCLC Research worked with the University of Cambridge – the release of their library catalog data as both MARC21 and Linked Data. They worked with us and implemented our provisional recommendation to use an Open Data Commons Attribution license for the data release, which includes data that was derived from WorldCat. While we are working to finalize and formalize that recommendation (it was a major discussion item at last week’s OCLC Global Council meeting), other institutions have been working on their own data releases.

Today the Harvard University Libraries released their library catalog of more than 12 million bibliographic records. This release furthers the mandate from their Library Board and Faculty to make as much of their metadata as possible available through open access in order to support learning and research, to disseminate knowledge and to foster innovation, and it aligns with the very public and established commitment that Harvard has made to open access for scholarly communication. I’m pleased to say that they worked with OCLC as they thought about the terms under which the release would be made. Although Harvard Libraries did not ultimately accept our recommendation about the ODC-BY license, the approach chosen by the Harvard Libraries takes into account some of the primary aspects of OCLC’s recommendation.

Specifically, our discussions acknowledged the Harvard mandate as well as what was most important to the OCLC cooperative – receiving attribution and making others aware of the cooperative’s norms and expectations of one another with regard to data derived from WorldCat. And again I’m pleased to say that our Harvard colleagues took the cooperative’s desires into account. The dataset is being released subject to the Creative Commons Public Domain designation (CC0), but Harvard requests that subsequent use provide attribution to Harvard, OCLC and the Library of Congress. They also request that users be aware of and act in a manner consistent with the OCLC cooperative community norms and provide a link to those norms. We think this is a well-intentioned and well-executed compromise.

It’s true we don’t think that public domain dedications for data derived from WorldCat are consistent with the OCLC cooperative’s norms as expressed in the WorldCat Rights and Responsibilities (WCRR) statement, particularly at Section 3.B.5. We also recognize that the WCRR statement is not a legally binding document and that interpretations of these community norms within the cooperative may differ. Releasing data is ultimately the choice of the OCLC member institution, as are the terms. Would other members of the cooperative consider the release of the Harvard dataset under these terms and conditions bad acting and a risk to the long-term viability and sustainability of WorldCat? Probably not, particularly with attribution, awareness and responsible treatment of WorldCat-derived data being requested so prominently.

Our discussions and this outcome are evidence that interpretations of community norms within the cooperative may differ. The mandates of institutional mission, the imperatives of emerging local policy, and national and supra-national structures may all contribute to a differing view and legitimately demand precedence. In our discussions with Harvard we acknowledged that their direction was their choice. Their mandates took precedence. They acknowledged the cooperative’s concerns and responded as a responsible cooperative citizen by requesting attribution, and awareness of and adherence to the community norms of the OCLC cooperative. The discussion was frank and mutually supportive. After all, OCLC, like its member institutions, is in the early stages of large shifts in data technology and policy. There are inevitable tensions and conflicting goods that will need to be reconciled over time. The process in which we are engaged will, if we continue to work together with good will, ultimately lead to a new suite of best practices that balance the common good and institutional sustainability.

Image: Member of the Charles River Basin Community Sailing Club Enjoy an Evening Sail

Five Easy Pieces

Saturday, February 11th, 2012 by Roy

I seem to have acquired an obsession. This obsession manifests itself in various ways, but one clear way is that I can’t seem to stop thinking about some of the findings from my colleague’s work that resulted in the publication Implications of MARC Tag Usage on Library Metadata Practices. Chief among them, in my view, is just how few metadata elements are actually used on a consistent basis in library cataloging.

I’m so intrigued and obsessed with this that a chart Karen Smith-Yoshimura produced probably two years ago still graces my cube wall today (see picture). One of the things the chart illustrates is that out of the then 200 million or so WorldCat records, only about five elements appear in more than half of the records. They are:

  • Identifiers (OCLC number, LCCN, ISBN, cataloging source, etc.)
  • Title Statement (245)
  • Publication Statement (260)
  • Physical Description (300)
  • Personal Name (100)

From there, the use of various fields falls off the proverbial cliff, with only fields like 500, 650, and 700 even making it above the one-quarter level. The vast majority of fields and subfields congregate, on the chart, along the very bottom, somewhere in the 0-5% range.
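The kind of tally behind such a chart can be sketched in a few lines of Python. This is only an illustration, not the method used for the report: it assumes each record has already been reduced to a list of its MARC tag strings (a hypothetical input shape), and the names `tag_coverage` and `sample_records` are invented for the example.

```python
from collections import Counter


def tag_coverage(records):
    """Return {tag: fraction of records containing that tag at least once}.

    `records` is assumed to be an iterable of lists of MARC tag strings,
    one list per record (a simplification made for this sketch).
    """
    counts = Counter()
    total = 0
    for tags in records:
        total += 1
        # Use a set so a repeated field (e.g. several 650s) counts once per record.
        counts.update(set(tags))
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


# Tiny made-up sample, purely for illustration.
sample_records = [
    ["245", "260", "300", "100"],
    ["245", "260", "300", "650", "650"],
    ["245", "100", "500"],
    ["245", "260", "300", "100", "700"],
]

coverage = tag_coverage(sample_records)
# Tags present in more than half of the sample records.
common = sorted(tag for tag, share in coverage.items() if share > 0.5)
print(common)
```

Counting each tag once per record, rather than once per occurrence, is what makes the result comparable to a "share of records" figure like the one-half and one-quarter levels mentioned above.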

I stare at the chart, trying to translate its deeper hieroglyphic meaning. Is pure usage enough evidence to identify the fewest elements required to describe bibliographic objects? Has the profession really invested untold dollars and sweat into describing a few things very, very well and the vast majority hardly at all? What does this mean? What lessons can we take forward into a new bibliographic future?

I stare at it some more, as if pure observation can reveal a hidden truth.

OCLC Research 2011: it’s starting to look like a lot of Linked Data

Thursday, December 29th, 2011 by Jim

This is the sixth post in a mini series, where we look back at accomplishments in 2011.

While OCLC has gotten some (deserved and undeserved) bashing in the blogosphere during 2011 about the cooperative’s practices over the release of major bibliographic subsets, we’ve also been active in the Linked Data arena in ways that have moved the library linked data community forward.

Exhibit number one is, of course, the Virtual International Authority File (VIAF), about which much has been written. It fits the pattern that I think will emerge in the linked data arena. Rather than lots of institutional releases of data, we will see the emergence of significant hubs based around authoritative aggregations on which many applications and implementations will arise. This file, created through the manipulation of twenty-one authority files from eighteen organizations, is prominent in the Linked Data Cloud and is getting more than 2 hits/second from Google. Thom Hickey, the principal force behind the creation, extension and maintenance of VIAF, has sensible commentary about its development on his blog, including how VIAF relates to other name identifiers. The principals in VIAF – LC, DNB, BnF and OCLC – are working to formalize VIAF’s integration as an OCLC offering, where it will be offered under an Open Data Commons Attribution license. Right now it’s out in the cloud without a license, which counts as “not openly licensed” in that community.

Exhibit number two is the very recent release of the Faceted Application of Subject Terminology (FAST) file as linked data. I blogged about this not long ago. It has now shown up in the Linked Open Data Graph.

Exhibit number three is the Dewey linked data. Exhibit number four is OCLC’s support for and involvement with the Library Linked Data Incubator Group of the World Wide Web Consortium (W3C), where my Research colleague, Jeff Young, was a participant and contributor.

We expect more activity in the linked data arena during 2012 and hope to see some creative implementations and use cases as the year progresses. For now it’s Robert Burns and Auld Lang Syne time.

I understand that the earliest known manuscript of Auld Lang Syne autographed by Robert Burns is at the Lilly Library at Indiana University, but I couldn’t find a digital image…

OCLC Research 2011: “Well Intentioned Practices” adopted as a standard

Wednesday, December 28th, 2011 by Merrilee

We are closing out 2011 with a mini blog series, looking back on some highlights. This is the fifth post in the series.

Although we’ve blogged about WIP, or “Well-intentioned practice for putting digitized collections of unpublished materials online,” we failed to mention that it was endorsed as a standard by the Society of American Archivists in August.

Documenting the practices of “reasonable archivists” and encouraging the adoption of a risk management approach in digitizing materials from archival collections provides a path forward for archivists and decision makers, helping institutions to at least consider digitizing low risk materials. We’re pleased to have helped with establishing a community of practice for archivists who are concerned with making their materials available for research in an online environment.

FAST on the street

Wednesday, December 14th, 2011 by Jim

Tokyo Drift

leaving the garage for the street FAST

I’m pleased to say that today OCLC Research released FAST (Faceted Application of Subject Terminology) as linked data under an Open Data Commons Attribution license.

FAST has been a multi-year project of OCLC Research in collaboration with the Library of Congress. The FAST authority file is an enumerative, faceted subject heading schema derived from the Library of Congress Subject Headings (LCSH).

You can read the details in the press releases and announcements, but even better would be to take a look at the web search interface to FAST. You can also see a nice example of FAST in action by looking at MapFAST, which uses FAST to show library materials using the geographic focus of the content.

FAST itself has been a lot of work over many years, and I was pleased that Ed O’Neill, who led the project, was here in our San Mateo offices when the release occurred. We were able to give him a big round of applause. Of course, this project demanded a broad range of effort from many Research staff over the years, but the principal developer and the kingpin in the linked data release is Rick Bennett. We applauded him virtually.

Now I hope to sit back and hear about the interesting ways that FAST is mobilized in the linked data cloud.

Photo sourced from zweiff

The tail of the COMET (Project)

Thursday, October 27th, 2011 by Jim

1962 Mercury Comet Coupe

Today the University of Cambridge released the final dataset from its COMET (Cambridge Open METadata) project. The final dataset contains more than 600,000 records derived from OCLC’s WorldCat, available as both MARC21 and RDF triples under an Open Data Commons Attribution License (ODC-BY). All the previous data sets released, as well as this one, have been enriched with links to the FAST subject and VIAF name authority services provided by OCLC. This is the final step in the project and brings the total bibliographic records released to more than 3,600,000. OCLC Research was a formal partner in the project, which was officially announced in February 2011.

While this JISC-supported project formally ended some time ago, this final dataset release is noteworthy because of the license regime that has been applied. One of the goals set by the Cambridge University Library team was to release data derived from WorldCat in a fashion that was compliant with the rights and responsibilities of the cooperative. In that spirit they engaged OCLC in a discussion about the type of license that would be suitable and wondered whether OCLC had a recommendation. We didn’t at the start of the project, but by the end we had engaged in enough other conversations and done enough investigation to recommend the Open Data Commons Attribution license with an explicit reference to the community norms embodied in the document WorldCat Rights and Responsibilities for the OCLC Cooperative (WCRR).

It was quite clear to us that many libraries would be engaging in data experiments similar to this Cambridge project and that OCLC would be obliged to make a recommendation that could be viewed as a best practice by the members of the cooperative. Whatever recommendation we made needed to be consistent with the expectations of semantic web practitioners both in and out of the library community. That meant a standard license created by a neutral body operating globally that would be both widely used and generally understood. For a variety of reasons we settled on ODC-BY. It is a license that provides for attribution as set out in the WCRR document. Moreover, from an intellectual property perspective, it reflects the difference between the rights over a database as a whole, such as OCLC claims over WorldCat, and the rights over the contents of a database – the record data in WorldCat, for example.

We were very pleased that Cambridge, particularly the project principal, Ed Chamberlain (an arcadia@cambridge Fellow), was willing to work with us to establish a low-overhead implementation of the license as part of this final dataset release. OCLC Research and OhioLINK recently released datasets used in the OhioLINK collection and circulation analysis project under the same ODC-BY license. That project and the COMET project gave us real-world experience in the license implementation and an opportunity for the policy discussions that will result in a consistent recommendation to OCLC members wanting to honor the community norms expressed in the WCRR.

Ed and the project team, including the indefatigable Hugh Taylor, head of Collection Development and Description at the Cambridge University Library, with whom I’ve worked across many years, produced a project with very interesting results and sensible ongoing commentary, and openly shared their experiences as they struggled with the specifics of the data and the vexed nature of library catalog ownership.

It’s worth reading Ed’s COMET blog, particularly the final entry summarizing what he learned and offering advice, e.g., “‘Enliven’ linked RDF data”.

And for those who have not seen it yet, Hugh’s document describing problems inherent in understanding the origin of a MARC-encoded bibliographic record must be read. He made an heroic attempt to sort out the origins of Cambridge records with fascinating results. His analysis makes clear that most large library catalogs were created by collecting and combining whatever ingredients were at hand. And in this hobo stew the profile of rights under contract and license is complex and unclear. I was gratified to see that the conditions surrounding the WorldCat-derived data are quite clear relative to the range of records and vendors from whom they were sourced.

Congratulations to the COMET team. Working with them helped us to understand what kind of advice OCLC members want regarding the release of their catalog data and took us a long way towards a standard recommendation on a responsible and consistent licensing regime for cooperatively-sourced bibliographic data.

The photo is by Randy von Liski. Good stuff.

More From the “Murky Bucket”

Thursday, March 3rd, 2011 by Roy

The inspiration for my title comes from Lorcan Dempsey, who some years ago, before I joined him at OCLC, put a name to the unease I had been feeling about the state of library metadata. In a Library Journal column I had bemoaned the fact that not only was it impossible for library users to limit a search to items available online in full, it was impossible for us to even implement such a feature.

Lorcan responded to that column, citing the “‘murky bucket syndrome’ that affects any large bibliographic database – we cannot entirely, unambiguously slice and dice the database because of historic data entry and cataloging practices that…were not oriented toward our new needs.” I’ll say. Also, around that time my soon-to-be colleagues at OCLC Research wrote a paper about some related work they had done: “Mining for Digital Resources: Identifying and Characterizing Digital Materials in WorldCat”.

Later I did a deeper investigation into this while still at the California Digital Library, from which came an informal report called “Trouble in Online Paradise: An Analysis of MARC 856 Usage at One Institution”. Basically, I took 1,000,000 MARC records from UC Berkeley, pulled out all of the 856 fields (about 20,000 at the time), and analyzed them. Since I have that work on my prototype server, you can still play around with it if you want.
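The extraction step described above might look roughly like the following. It is a sketch under stated assumptions, not the code behind the report: records are modeled as lists of (tag, subfields) pairs rather than parsed from real MARC files, and `extract_856` and `host_counts` are hypothetical names. Tallying hostnames from subfield u (the URI subfield of field 856) is just one simple way to begin looking at where the links point.

```python
from collections import Counter
from urllib.parse import urlparse


def extract_856(records):
    """Yield the subfield dict of every 856 field found in `records`.

    Each record is assumed to be a list of (tag, subfields) pairs, where
    `subfields` maps a subfield code to its value (a simplified model).
    """
    for record in records:
        for tag, subfields in record:
            if tag == "856":
                yield subfields


def host_counts(fields):
    """Count hostnames found in subfield u (the URI) of the given 856 fields."""
    counts = Counter()
    for subfields in fields:
        url = subfields.get("u")
        if url:
            host = urlparse(url).hostname
            if host:
                counts[host] += 1
    return counts


# Made-up records for illustration only.
sample = [
    [("245", {"a": "Title one"}), ("856", {"u": "http://example.org/a"})],
    [("245", {"a": "Title two"})],
    [
        ("856", {"u": "http://example.org/b"}),
        ("856", {"u": "http://example.com/c", "z": "Table of contents"}),
    ],
]

fields = list(extract_856(sample))
print(len(fields), host_counts(fields))
```

With real data, the record model would be replaced by whatever a MARC parsing library returns; the two functions only depend on being able to read a field's tag and its subfields.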

Monster Mash

Tuesday, September 21st, 2010 by Roy

Tomorrow Bruce Washburn and I leave from the San Mateo office of OCLC Research to help run the WorldCat Mashathon in Boston (well, Cambridge, really, but you could toss a rock across the river and hit Boston). I really enjoy these events, since they are a couple of days of helping library programmers learn about OCLC Web Services, with a good chunk of time set aside to play with them. We’ll have all day Thursday and Friday to devote to learning and playing, time that can be difficult to come by when under pressure to deliver at your place of employment.

Previous Mashathons have yielded a number of new mashups, many of which have ended up in our Application Gallery. Previous attendees have also integrated a number of service improvements in their local systems using these APIs. Mashers are not limited to OCLC APIs by any means. We take pains to point out a list of library-related APIs that I maintain over on my site. Any API is fair game. Or linked data, or what have you. Whatever developers can use to improve their local services is fine with us.

So why did I title this post “Monster Mash”? Why, I’ll be there… why else?

Economics of Scholarly Production: Supplemental Materials

Wednesday, August 25th, 2010 by Constance

At the Spring CNI Taskforce meeting last April, Karen Wetzel (Standards Program Manager at NISO) announced a new piece of work related to “supplemental materials” in journal articles. In the scientific literature, it is not uncommon for articles to be accompanied by a secondary set of figures, data, and documentation of experimental protocols that aren’t considered part of the core content. Karen reported that thought-leaders from a variety of sectors had expressed concerns about the expense that publishers incur in managing this material, as well as the additional work that it creates for editorial staff and authors. Libraries were included in a long list of potential stakeholders, as potential curators of this supplemental material.

A central concern is that scholarly citation and reuse of this kind of supporting material are limited by the absence of identifiers, bibliographic metadata, etc.