Archive for the 'Infrastructure' Category

ISBNs in WorldCat

Thursday, May 23rd, 2013 by Roy

Recently a question came up on the BIBFRAME list about ISBNs, and how many of them were in MARC records. This is just the kind of question that OCLC Research is uniquely placed to answer, so I quickly wrote some simple Perl code to run as a Hadoop streaming job to find out.

It was remarkably quick and easy to find out, although I had to edit and re-run the code when I discovered a flaw in my logic. This is, sadly, all too frequently the case. But not too much later I had my result:

Number of Records    020 $a per Record    Percent of WorldCat
230444194            0                    77.71%
55668178             2                    18.77%
4766652              1                    1.61%
3708352              4                    1.25%
616623               3                    0.21%
411230               6                    0.14%
125715               8                    0.04%
65796                5                    0.02%
45304                10                   0.02%
30155                12                   0.01%

These are all of the occurrences of a 020 $a in WorldCat as of 1 May 2013 [Added for clarification: the prior sentence describes exactly what is being counted. That is, I am not (yet) examining ISBNs for 10-digit vs. 13-digit; therefore, many of the records with 2 ISBNs may in fact simply have both versions].  A few observations:

  • Many items in WorldCat were published before the invention of the ISBN.
  • Many items in WorldCat are not ISBN-appropriate (e.g., unpublished materials).
  • ISBNs are therefore problematic as identifiers except for a narrow slice of materials (mainly printed books since the mid-60s).

A much better identifier for many purposes is, I assert, the OCLC number.
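
For readers curious how a tally like the one in the table above can be produced, here is a minimal Python sketch. The original job was Perl code run through Hadoop streaming; this is only an illustration of the counting logic, not that code. It assumes each record is available as a namespace-free MARCXML string, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_020a(marcxml):
    """Return the number of 020 $a subfields in one MARCXML <record>.

    Assumes namespace-free MARCXML; real data may need namespace handling.
    """
    record = ET.fromstring(marcxml)
    return sum(
        1
        for field in record.iter("datafield")
        if field.get("tag") == "020"
        for sub in field.findall("subfield")
        if sub.get("code") == "a"
    )


def isbn_histogram(records):
    """Tally records by how many 020 $a subfields each one carries.

    Returns (record_count, isbns_per_record, percent_of_total) rows,
    largest group first.
    """
    tally = Counter(count_020a(r) for r in records)
    total = sum(tally.values())
    rows = [(n, per_record, 100.0 * n / total) for per_record, n in tally.items()]
    return sorted(rows, key=lambda row: (-row[0], row[1]))
```

Each row mirrors one line of the table: how many records, how many 020 $a subfields each of those records has, and the share of the whole. Note that this counts subfields only; it does not distinguish 10-digit from 13-digit ISBNs.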

We Want to Send You to SemTechBiz

Monday, April 29th, 2013 by Roy

SemTechBiz is a major conference for those who are using semantic web technologies like linked data, RDF, Schema.org, etc. It is being held June 2-5 in San Francisco, and OCLC and LITA have teamed up to send a librarian there to share the good work that libraries are doing to produce and consume linked data.

We will pay the expenses of the selected individual to attend the conference where they will also be afforded a lightning talk slot to highlight their work for conference attendees. This is the first “Library Spotlight on Innovation” that we jointly developed with SemanticWeb.com, the producers of the conference. Richard Wallis, our Linked Data Evangelist, was instrumental in putting this together.

So are you doing something interesting with linked data? Or do you know of someone who is? If so, you can nominate yourself or someone else for this great opportunity. We want the broader world to know about how libraries are innovating with linked data.

Regional print management and cooperative infrastructure: maps and gaps

Monday, March 4th, 2013 by Constance

We are excited to be working with the Ohio State University (OSU) and the Committee on Institutional Cooperation (CIC) on a new project to explore the contours of a regional strategy for managing the print book resource in the CHI-PITTS mega-region. Regular readers of this blog will know that mega-regions are geographic areas that typically encompass multiple population centers, exhibit a high degree of economic integration, and are bound together by a rich network of transportation, logistics, and communications infrastructure, as well as mutual cultural interests and similarities. Mega-regions are an intriguing concept for thinking about collaborative activities that scale above small groups of institutions, or even existing library consortia. OCLC Research recently published a report that used a mega-regions framework to explore the characteristics and implications of a North American network of regionally consolidated print book collections.

Over the last few months, we have explored this issue further by working with several US regional library consortia to examine their collective print book holdings in the context of the print book resource and infrastructure available in the mega-region most closely aligned with the location of the consortial membership. We have produced profiles for the Statewide California Electronic Library Consortium (SCELC) in the context of the SO-CAL mega-region; the Association of Southeastern Research Libraries (ASERL) and the Washington Research Library Consortium (WRLC) in the context of the CHAR-LANTA mega-region; and the National Institute for Technology in Liberal Education (NITLE) membership in the context of the BOS-WASH mega-region. We plan to publish a series of case studies highlighting the findings from these consortial profiles in the near future.

Our new collaboration with OSU and the CIC is an extension of this consortial profiling work. In this project, we will examine print book holdings at multiple levels: an institution (OSU); a library consortium (CIC); and a mega-region (CHI-PITTS). The purpose of the work is to conduct a detailed analysis of the factors that an individual library might bring to bear in selecting books to contribute to a shared consortial collection, as well as to compare both the individual library collection and the consortial print book resource to the broader context of the print book resource available in the surrounding mega-region. The CHI-PITTS mega-region, which extends across the upper Midwest from Chicago to Pittsburgh, is the mega-region which aligns most closely with the locations of the CIC membership.

Some of the questions we will address include:

  • What part of the OSU print book collection represents a distinctive asset when compared to the aggregate print book holdings within the CIC membership, or the broader CHI-PITTS mega-regional print book resource? What are the characteristics of these distinctive resources with respect to subject, age, and system-wide work-level holdings?
  • What part of the OSU collection is widely held across the collections of the CIC membership, or institutions within the CHI-PITTS region? Can a “core” set of titles be identified, at the consortial or regional level, that represent duplicative investment? Are there opportunities to reduce local costs by managing these titles as a shared resource at the consortial or regional level?
  • What does the ILL demand profile for OSU tell us about consortial and regional demand for its print book collection? How much of this demand is centered around OSU’s distinctive print book titles? How can OSU cooperate with other CIC members to meet local, consortial, and regional demand for print books?

Carol Pitts Diedrichs, Director of OSU Libraries, has posted a nice summary of the thinking that led up to this joint effort.

OSU volunteered to serve as a test case for this project, with the understanding that findings from the analysis will be useful to all CIC member libraries considering shared print archiving arrangements. Of course, we hope the project will be useful to other libraries as well. There is growing interest in how (or if) the lessons learned in journal archiving projects like the Western Regional Storage Trust (WEST) or the CIC Shared Print Repository can be applied to cooperative efforts to preserve monographic collections. This project should provide some answers. We expect to post periodic updates on the project over the next several months here on Hanging Together, and will publish a synthesis of findings in a final report later this year.

 

“Cataloging Unchained”

Wednesday, February 27th, 2013 by Roy

Lorcan Dempsey (VP of Research at OCLC) has long said that we need to “make our data work harder.” And for years that is exactly what OCLC Research has been doing. So when I was asked to speak on data mining at the OCLC European, Middle East, and African Regional Council Meeting in Strasbourg, France, I knew I would have a lot to talk about. Too much, in fact.

Instead of trying to cover everything we’ve been doing in a whirlwind of slides that no one would remember, I decided to use WorldCat Identities as a “poster child” for the kinds of data mining activities we have been doing recently here at OCLC Research. Then, I described another, related project — the Virtual International Authority File. To bring it all home I mentioned how we’re considering how we might be able to marry these two resources into one “super” identities service.

Consider what it would mean to take an aggregation of library-curated authority records and enhance it with algorithmically-derived data from WorldCat as well as links to other resources about creators such as Wikipedia. This would provide a rich resource of information about creators, all sitting behind authoritative and maintained identifiers that could be used in emerging new bibliographic structures such as those being created by the Library of Congress’ Bibliographic Framework Transition Initiative. The mind reels with the possibilities.

But before I could jump into all this I needed a way to quickly explain why we are doing things like this — and how we are doing them. I decided I needed to make a video. So last week that is exactly what I did, with help from colleagues in Dublin. The result was less than three-and-a-half minutes long, and yet it amply set the stage for what was to come after. Plus, it can have a life of its own.

Take a look yourself, at “Cataloging Unchained”, and let me know what you think in the comments.

Adventures in Hadoop, #5: String Searching vs. Record Parsing

Friday, January 25th, 2013 by Roy

In a previous post I described how I could now string search WorldCat using the Research compute cluster and Hadoop. This means I can find any string of characters anywhere in a MARC record and output the OCLC numbers of matches and/or the entire record, all within a few minutes. Keep in mind we are now talking about nearly 300 million records.

String searching is not, by any means, the only way we use Hadoop. It is actually more common for us to use code (typically written in Java, Python, or Perl) to parse the records and output portions for further processing. But since I had been using such code for simply identifying records of interest, I began to wonder which method of processing was faster.

In a short experiment I quickly proved that string searching was about three times faster for simple identification and output of records than was the code I had previously been using. This is because, I believe, the code I had been using would parse the record before determining if it met my criteria. This one extra step added so much overhead to the process that it would take 15 minutes (in one test) rather than 5.

This likely means that in some cases where relatively few records would match your criteria, you would still be better off extracting the records by string searching and then running your extraction code against them off-cluster. For example, if I wanted to pull out the 245 fields of, say, about 1,000 records, I’d likely be better off extracting the records I wanted by string searching and then processing that file directly without using Hadoop.
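
The trade-off described above can be sketched in Python. The idea is to run a cheap substring test on the raw record first and parse only the records that pass it. The one-record-per-line, namespace-free MARCXML input and all function names here are assumptions for illustration, not the code used on the cluster.

```python
import xml.etree.ElementTree as ET


def field_245(record):
    """Yield the text of each 245 field in a parsed MARCXML record."""
    for field in record.iter("datafield"):
        if field.get("tag") == "245":
            yield " ".join(sub.text or "" for sub in field.findall("subfield"))


def matches(record, needle):
    """True if any subfield's text contains the needle."""
    return any(needle in (sub.text or "") for sub in record.iter("subfield"))


def titles_parse_everything(lines, needle):
    """Parse every record before testing it: pays the parse cost for all records."""
    for line in lines:
        record = ET.fromstring(line)
        if matches(record, needle):
            yield from field_245(record)


def titles_search_first(lines, needle):
    """String search first, then parse only the records that might match.

    Assumes the needle contains no characters that XML would escape
    (such as & or <), so the raw test never misses a real match.
    """
    for line in lines:
        if needle not in line:        # cheap test on the raw record text
            continue
        record = ET.fromstring(line)  # expensive step, now rare
        if matches(record, needle):
            yield from field_245(record)
```

Under the stated assumption both functions return the same titles. The second simply skips the parse for every record that cannot match, which is where the savings come from when matches are rare.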

One last permutation, however. If your process is one that identifies 1,000 records in some situations and several million in others, having one process through which all operations flow is more efficient than maintaining two or more separate processes.

And so it goes. Another day and another adventure in Hadoop.

Adventures in Hadoop, #3: String Searching WorldCat

Tuesday, September 25th, 2012 by Roy

OK, I admit it. Ever since I joined OCLC over five years ago I’ve harbored a dream. My dream was to one day string search WorldCat. What that means is to have the ability to find any random string of characters anywhere in any MARC field within the over 250 million records that comprise this huge union catalog. Unix geeks call it “regular expressions” or “regex” for short, and often use the Unix command “grep” to string search files. Now I admit it, this is a very geeky dream, but it’s mine and I’m not giving it up.

Luckily, I don’t have to. In fact, just the other day, thanks to a colleague, I actually did it.
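
As a small illustration of the grep-style idea, here is a minimal Python sketch. It assumes, hypothetically, that each input line holds an OCLC number, a tab, and the raw record text; the function name is also made up for this example. It is not the cluster job itself, only a picture of regular-expression matching over whole records.

```python
import re


def grep_records(lines, pattern):
    """Yield the OCLC number of every record whose raw text matches the regex."""
    regex = re.compile(pattern)
    for line in lines:
        # Split "OCLCNUM<TAB>RECORD" into its two parts.
        oclc_number, _, record = line.rstrip("\n").partition("\t")
        if regex.search(record):
            yield oclc_number
```

In a streaming-style job one could pass `sys.stdin` as `lines` and print each yielded number.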

Adventures in Hadoop, #2: “Hello World”

Wednesday, September 12th, 2012 by Roy

In my first adventure I introduced Hadoop and the computer cluster upon which we will be running it as we perform various data mining procedures. Now I will begin to get “down and dirty” into the technologies and some of the ways we are using them.

Since I am just beginning to learn about Hadoop, HDFS (the Hadoop file system), HBase (the Hadoop database), and all kinds of other related technologies (I can’t wait to get to Pig, myself), I will likely be starting with simple “hello world!” style tasks. If you’re a programmer but don’t know anything about Hadoop, these will likely be right up your alley. If you’re already familiar with Hadoop you might be fascinated to see how badly I mangle the explanations. So here goes…

On Gravel, our Research cluster, we have a full copy of WorldCat loaded into HBase. HBase shines at random access, and the key for any record in HBase is the OCLC number, so nothing could be simpler than fetching a specific record from HBase. Well, or so you think. But like anything there are at least several ways to do it:

  • HBase shell: From the command line you can start up shell access to HBase, and access records in a particular table. WorldCat is one big table in HBase, and we have a number of other tables available as well. A simple command like “get 'TABLENAME', 'ROWNUM'” will fetch a particular row, and since the row number is equal to the OCLC number, for us to fetch a WorldCat record it’s as simple as “get 'Worldcat', '234'” to fetch the record for the item with 234 as the OCLC number.
  • Stargate REST Server: A very simple REST query to localhost can fetch records as well via the Stargate REST server. For example, the request http://localhost:PORT/TABLENAME/ROW (or, using the values above, http://localhost:PORT/Worldcat/234) will retrieve that row from the specified table. However, the record will be returned with the data encoded in Base64, so it must be translated to become usable.
  • Thrift API: A cross-language API that claims to be more lightweight than REST for many operations. That is absolutely everything I know about it.
  • Java code: Hadoop expects interaction via Java, so the best, most “native” way to write code to interact with the Hadoop family of technologies is with Java. If you prefer not to use Java, you will need to use the “streaming” function to use Hadoop.

As a simple “Hello World!” style exercise, I set out to use the REST server access above and write a small CGI script to present a web form that expects an OCLC number and then retrieves the record for that number from HBase. Getting the record was the easy part, but then you have to separate the data from the wrapper XML and decode the Base64 data. I got stuck on this, and my colleague Bruce Washburn came to my rescue (and on a weekend no less). By Monday it was working like a charm.

To make the response easier to deal with, I also put in line breaks between XML tags, otherwise the bulk of the record comes back as a single line. All told it’s 60 lines including documentation and blank lines, plus a few external subroutines and the MIME::Base64 and LWP::Simple Perl modules.
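
The steps described above (fetch the row over REST, pull the data out of the wrapper XML, decode the Base64, and add line breaks) can be sketched in Python. The actual script was Perl; this is a hedged illustration only. It assumes the REST server returns a `<CellSet>` document containing `<Row>` and `<Cell>` elements whose `key` and `column` attributes and cell text are all Base64-encoded, and that it honors an `Accept: text/xml` header. The function names are hypothetical, and the base URL (including the port) must be supplied by the caller.

```python
import base64
import urllib.request
import xml.etree.ElementTree as ET


def fetch_row(base_url, table, row):
    """Request one row as XML, e.g. fetch_row(base_url, "Worldcat", "234")."""
    request = urllib.request.Request(
        f"{base_url}/{table}/{row}", headers={"Accept": "text/xml"}
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def decode_cellset(xml_text):
    """Turn a Base64-encoded <CellSet> response into {row_key: {column: value}}."""
    rows = {}
    for row in ET.fromstring(xml_text).iter("Row"):
        key = base64.b64decode(row.get("key")).decode("utf-8")
        rows[key] = {
            base64.b64decode(cell.get("column")).decode("utf-8"):
                base64.b64decode(cell.text or "").decode("utf-8")
            for cell in row.iter("Cell")
        }
    return rows


def add_line_breaks(xml_text):
    """Put each XML tag on its own line so the record is easier to read."""
    return xml_text.replace("><", ">\n<")
```

Keeping the decoding separate from the fetching means `decode_cellset` can be reused as a subroutine by other programs, or tested against a saved response without a running server.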

Bottom line — I got my feet wet with a small but potentially useful exercise and as a side benefit I now have code to easily retrieve and decode a record from HBase that I can now potentially use as a subroutine in future programs. Not bad for a start.

Adventures in Hadoop, #1: Introduction and the Research Cluster

Thursday, August 9th, 2012 by Roy

What is it with geeks and wacky names, anyway? Despite what it sounds like, Hadoop is neither a rarely-glimpsed mammal from Tasmania, nor a children’s board game. Nope, rather it is a family of technologies that implement various aspects of Google’s MapReduce algorithm cum programming model that is optimized for processing huge data sets.

Since WorldCat is nothing if not a huge data set, it seems only natural that we would be using MapReduce technologies to data mine WorldCat, and in fact we have – for years. But what is new to us is making the transition to the Hadoop family of technologies that implements MapReduce in Java and is tuned specifically for running on clusters of computers, as we have in Research.

As I stumble along the learning trail with my colleagues (in particular, Bruce Washburn here in the San Mateo office), I hope to write about some of my experiences here — not so much to educate as to entertain through laughter. You see, I tend to learn new technologies just well enough to create havoc, and that can be quite entertaining if not instructive about what not to do. But more on that later.

For today’s post I will introduce you to “gravel,” which is replacing our former compute cluster that had been dubbed “pebbles” (don’t ask). It looks like this:

  • 1 “head” (control) node – 2 6-core 3.1 GHz processors, 64 GB RAM, 24 TB hard disk
  • 40 “compute” (processing) nodes – each with 2 4-core 2.6 GHz processors, 32 GB RAM, 6 TB hard disk

For those of you following along at home, overall we have well over a terabyte of RAM and about a quarter of a petabyte of disk storage. Needless to say, even with multiple copies of WorldCat we have plenty of headroom for data mining processes.
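
For anyone who wants to check the arithmetic behind those totals, here is a small Python sketch that uses only the node specifications listed above. The dictionary layout and function name are just for illustration.

```python
# Node groups as listed above: count, RAM per node (GB), disk per node (TB).
HEAD = {"nodes": 1, "ram_gb": 64, "disk_tb": 24}
COMPUTE = {"nodes": 40, "ram_gb": 32, "disk_tb": 6}


def cluster_totals(*groups):
    """Return (total RAM in GB, total disk in TB) across all node groups."""
    ram_gb = sum(g["nodes"] * g["ram_gb"] for g in groups)
    disk_tb = sum(g["nodes"] * g["disk_tb"] for g in groups)
    return ram_gb, disk_tb


ram_gb, disk_tb = cluster_totals(HEAD, COMPUTE)
print(f"RAM: {ram_gb} GB, disk: {disk_tb} TB")
```

That works out to 1,344 GB of RAM (40 × 32 + 64) and 264 TB of disk (40 × 6 + 24), which matches "well over a terabyte" and "about a quarter of a petabyte".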

So hardware isn’t much of a limitation right now, but it will take me more work to get up to speed on the software side. I hope to document some of those antics here over the coming weeks. It may be as pretty as watching sausage being made (which, as an Indiana farm boy with German ancestry, I actually have seen – and smelled!), but I’m hoping it will be humorous if not also informative. You be the judge.