Archive for the 'Searching' Category

Pinch Me. I’m Dreaming.

Thursday, May 30th, 2013 by Roy

I was a library assistant at a community college in California when the Commodore PET microcomputer was released. At the time I was starting to think about going to college to get a library degree. To be clear, I first needed to get my B.A., as I had not gone to college out of high school (I was actually a high school dropout, but that's a long story). Since libraries were all about information, and computers excelled at processing information, it seemed obvious to me that I needed to learn how to use computers.

That began my lifelong relationship with computers, since I wrote my very first software program on that Commodore PET, stored on a cassette tape, in the early 1980s. It was a library orientation tutorial.

From there, I minored in computer science and majored in geography at Humboldt State University. As an assistant to a geography professor there I wrote statistical analysis programs in FORTRAN and took classes in COBOL and Pascal. These were batch systems, where you submitted your job and waited for it to be run at the whim of the operators. The largest computer I ever used at Humboldt in the 80s is now far eclipsed in both power and storage by the phone in my pocket.

At UC Berkeley, where I received my MLS in 1986 and worked for 15 years, I made the acquaintance of UNIX and online time sharing computing. What a breakthrough that was. You could write and run a program immediately and get instant feedback. For a lousy code jockey like me, it was heaven. I didn’t have to wait to find out I had left out a semicolon.

So fast forward to today. Earlier today I kicked off a computing job on the OCLC Research compute cluster that took 4 minutes and 19 seconds to complete. At first that sounds like a long time, until you find out what it did. The job located 129 MARC records out of nearly 300 million that had the text string “rdacontnet”. It did this without indexes or databases. 300 million MARC records are sitting out on spinning disks somewhere in Ohio and from California I searched every byte of those records for a specific text string and it completed in less than 5 minutes. Pinch me. I’m dreaming.


Photo courtesy of Don DeBold, under a Creative Commons Attribution 2.0 Generic license

Adventures in Hadoop, #5: String Searching vs. Record Parsing

Friday, January 25th, 2013 by Roy

In a previous post I described how I could now string search WorldCat using the Research compute cluster and Hadoop. This means I can find any string of characters anywhere in a MARC record and output the OCLC numbers of matches and/or the entire record, all within a few minutes. Keep in mind we are now talking about nearly 300 million records.

String searching is not, by any means, the only way we use Hadoop. It is actually more common for us to use code (typically written in Java, Python, or Perl) to parse the records and output portions for further processing. But since I had been using such code for simply identifying records of interest, I began to wonder which method of processing was faster.

In a short experiment I quickly proved that string searching was about three times faster for simple identification and output of records than the code I had previously been using. This is because, I believe, the code I had been using parsed each record before determining whether it met my criteria. This one extra step added so much overhead that the job took 15 minutes (in one test) rather than 5.

This likely means that in cases where relatively few records would match your criteria, you would still be better off extracting the records by string searching and then running your extraction code against them off-cluster. For example, if I wanted to pull out the 245 fields of, say, about 1,000 records, I'd likely be better off extracting the records I wanted by string searching and then processing that file directly without using Hadoop.
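The comparison above comes down to a filter with no parsing step at all. Here is a minimal sketch of such a filter, written as a Python function that could serve as a Hadoop Streaming mapper; the tab-separated "OCLC number, then raw record" input layout and the function name are assumptions for illustration, not the actual cluster format.

```python
import sys

def grep_records(records, needle):
    """Yield (oclc_number, line) for each record line containing needle.

    Assumes each record arrives as a single line with the OCLC number
    first, tab-separated from the raw MARC data (an illustrative layout,
    not the actual WorldCat storage format).
    """
    for line in records:
        oclc_number, _, marc = line.partition("\t")
        if needle in marc:
            yield oclc_number, line

def main():
    # As a Hadoop Streaming mapper: read record lines on stdin and write
    # only the matches to stdout. No record is ever parsed.
    for _, line in grep_records(sys.stdin, "rdacontnet"):
        sys.stdout.write(line)
```

The parsing approach, by contrast, would decode every one of the nearly 300 million records before testing it, which is where the roughly threefold slowdown comes from.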

One last permutation, however: if your process is one that identifies 1,000 records in some situations and several million in others, having one process through which all operations flow is more efficient than maintaining two or more separate processes.

And so it goes. Another day and another adventure in Hadoop.

Introducing ArchiveGrid – the sandbox where archivists build something better

Monday, September 12th, 2011 by Jim

Those of you in the archive or research library world may be familiar with ArchiveGrid®, a database and discovery service that grew out of RLG's Archival Resources service, which leveraged all the collection-level descriptions in the union catalog and aggregated the encoded finding aids that institutions made available for their collections. For many years ArchiveGrid was a subscription service, and it has continued as such within the OCLC environment.

Although ArchiveGrid is currently available as a subscription service, it will eventually become a free discovery system. To facilitate this transition, OCLC Research is developing a new ArchiveGrid discovery interface that is now freely available to try out.

The great work of my Research colleagues Bruce Washburn and research assistant Ellen Ast has produced this experimental version of ArchiveGrid, which will significantly expand the work and impact of OCLC Research in the archives arena.

A major strand of Research investment has gone toward the broad area of Mobilizing Unique Materials, where the objective has been the achievement of economies and efficiencies that permit the unique materials in libraries, archives and museums to be effectively described, properly disclosed, successfully discovered and appropriately delivered. There's been great work done, and I hope you'll review some of it at the link above, but we've also been hampered by the lack of a proving ground where innovative approaches to description can be tested, where discovery behaviors can be watched and measured, and where we can identify the best ways to have search engines incorporate these unique institutional assets into results.

We want ArchiveGrid to fill that gap. My colleagues are structuring a program of work around the ways in which this sandbox can best be exploited to the advantage of archivists and potential users of archives. We'll look first to what the institutions in the OCLC Research Library Partnership can contribute, both in the way of content and in ideas and direction. We'll generalize our findings and feed them back to the community.

Check out ArchiveGrid now. It includes over a million descriptions of archival collections held by thousands of libraries, museums, historical societies and archives worldwide, and it enables researchers to learn about the contents of these collections, contact archives to arrange a visit to examine materials, or order copies—all from one simple, intuitive search. At the bottom of the landing page you'll see links to provide feedback and to indicate interest in including your descriptions in the aggregation. Operators are standing by.

OCLC Research 2010: Classify and WorldCat Genres

Friday, December 24th, 2010 by Merrilee

As 2010 winds down, we’d like to call attention to some of the things we’ve worked on or created this year. You can see a rundown of highlights here.

I hate those end of year “10 best” lists. For me, each list represents a number of [books, cds, movies, apps, restaurants] that I once again failed to get to in the current year and probably won’t in the next. I also hate being told what I should [read, listen to, watch, play with, eat].

But I love WorldCat Genres, which is a great way to browse and discover fiction (or movies) based on my own tastes and preferences. For example, I love autobiographical fiction, because it’s usually bittersweet and sometimes dishy. Browsing in WorldCat Genres, I can see some newer books that are in this genre that look tempting, as well as some old favorites, and related movies. I like this way of constructing my own lists, based on similarities in the WorldCat data.

And then there's Classify. Classify is an experimental web service that reveals the classification (Dewey Decimal Classification, Library of Congress Classification, or National Library of Medicine Classification) that has been assigned across a FRBR work set. A good example is a book I'm reading now, Christopher McDougall's Born to Run. You'll see that, at least for DDC, the classifications mostly adhere to one class number but are also assigned to two others.

Additionally, Classify reveals the FAST subject headings for the FRBR work set.

So what?

So this is a person-friendly prototype for what is actually a web service. Imagine farming a portion of your cataloging workflow off to a web service. If there's overwhelming agreement on classification (90% of the items that have a class number share the same one), then the class number is assigned automagically. If there's variance, a human intervenes and makes a decision. There is also an opportunity to use the provided subject terms.
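That decision rule is simple enough to sketch in a few lines. The function below is a hypothetical illustration of the 90% threshold described above; it is not part of the Classify service itself, and the name `assign_class` is made up for this example.

```python
from collections import Counter

def assign_class(class_numbers, threshold=0.9):
    """Decide whether a class number can be assigned automatically.

    class_numbers: the class numbers assigned across a FRBR work set,
    one entry per holding that carries a class number.
    Returns (class_number, "auto") when one number accounts for at least
    `threshold` of the holdings; otherwise (most_common, "review") to
    route the record to a human cataloger.
    """
    if not class_numbers:
        return None, "review"
    top, count = Counter(class_numbers).most_common(1)[0]
    if count / len(class_numbers) >= threshold:
        return top, "auto"
    return top, "review"
```

For instance, a work set classed 796.42 in nine of ten holdings would be assigned 796.42 automatically, while a 50/50 split would be flagged for a human decision.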

Classify helps to harness the wisdom of the crowds, the decisions of lots of catalogers, as represented in WorldCat.

Another cool tool to put under the Christmas tree.

You can find out more about the Classify project and more about what makes WorldCat Genres tick on our website.

And if you are thirsty for more, you can check out a three-page summary of our accomplishments over the last five years.

Breaking Open the ILS Silos

Friday, August 20th, 2010 by Roy

In 2007-2008, the Digital Library Federation (DLF) convened a Task Group to recommend standard interfaces for integrating the data and services of the Integrated Library System (ILS) with new applications supporting user discovery. The group produced a report with recommendations in December 2008. After that not much happened.

In February 2010, at the Code4Lib Conference, Karen Coombs (the OCLC Developer Network manager) and I brought together some of the people who had been on that task group, as well as other interested parties at the conference, to take this work to the next stage. At this ad hoc meeting we agreed that the next stage was to actually create a middleware layer that we could collaboratively maintain. Read the rest of this entry »

Next-Gen Harvesting

Thursday, February 4th, 2010 by Roy

Metadata harvesting (collecting metadata from others and aggregating it in a collection) is not new. Although there are any number of ways to do this, the OAI-PMH protocol for metadata harvesting is often used and has been around for years. It defines a small set of actions that lets anyone discover what sets of metadata a digital repository offers for harvesting and which metadata formats it supports, and then select and download those records. Thousands of repositories worldwide support it, sometimes even unknowingly, because many repository applications such as DSpace and EPrints come with OAI-PMH support out of the box.
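That small set of actions is expressed as six request "verbs" in the OAI-PMH 2.0 specification, sent as parameters on a plain GET request. Here is a minimal sketch of building such requests; the repository base URL below is a hypothetical example, since every repository advertises its own.

```python
from urllib.parse import urlencode

# The six request verbs defined by the OAI-PMH 2.0 specification.
OAI_VERBS = {"Identify", "ListMetadataFormats", "ListSets",
             "ListIdentifiers", "ListRecords", "GetRecord"}

def oai_request(base_url, verb, **params):
    """Build an OAI-PMH GET request URL for the given verb.

    Extra keyword arguments become protocol parameters, e.g.
    metadataPrefix="oai_dc" or resumptionToken for continuing a
    partial harvest.
    """
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    return base_url + "?" + urlencode({"verb": verb, **params})

# e.g. harvest Dublin Core records from a (hypothetical) repository:
url = oai_request("https://repository.example.edu/oai",
                  "ListRecords", metadataPrefix="oai_dc")
```

A harvester would fetch that URL, parse the XML response, and follow any resumptionToken it contains until the set is exhausted.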

This has led to a world in which there are metadata aggregators and even aggregators of aggregators. It has also led to potential confusion and difficulty. Records that are picked up from their “native” location and indexed and displayed elsewhere may not be displayed as the creator of that metadata intended. They also may not be refreshed in a timely fashion, potentially leaving out-of-date records to persist in various corners of the Internet.

This is why, when my colleagues on the services side of the house announced the WorldCat Digital Collection Gateway, I sat up and took notice. This heralds a new world in which those being harvested can exert some control not only over how frequently their records are updated, but also over how those records are displayed in the aggregation — in this case, WorldCat. Through a simple web-based interface, you can provide your OAI-PMH base URL, have the Gateway test-harvest some records, view how those records would display in WorldCat, and change the mapping if you wish. Another benefit is that your records will then appear in all of the places WorldCat is syndicated.

A pilot project to test the Digital Collection Gateway was just announced, beginning March 1, and we are seeking volunteers to try it out and provide feedback. During the pilot you will be asked to:

  • Attend a two-hour webinar reviewing the use of the Gateway
  • Upload a minimum of 500 metadata records to WorldCat
  • Offer feedback and input on your experience with the Gateway to our support and product teams so we can improve the tool and workflows

If you would like to help us create a next-generation harvesting infrastructure, in which you control your metadata more than ever before, email us at

The Straight Dope on OAIster

Monday, September 21st, 2009 by Roy

As many of you are probably aware, OCLC and the University of Michigan announced last January that OCLC was taking over the OAIster aggregation of metadata harvested from OAI-compliant repositories. The University of Michigan was no longer able to support it and was looking for assistance in sustaining this valuable community resource. As Kat Hagedorn remarked regarding our agreement, “Hosting anything of this size quickly got out of hand for UM Libraries, and it took us a long time to realize it. Besides, greater access for more folks? Sounds win-win to me, as long as it’s continuously freely available.” [reported by Dorothea Salo]

I have heard lots of questions since we started contacting contributors with the most recent phase of the transfer plan, so the purpose of this post is to bring everyone up to date on why we are doing this, where things are, and what we hope to accomplish in the future. Read the rest of this entry »

Smithsonian Web Strategy, CultureLabel: The Impact of Network Effects

Friday, July 31st, 2009 by GĂĽnter

The Smithsonian just announced the release of its Web and New Media Strategy v 1.0 [pdf], which came together swiftly in a process of marvelous openness and inclusion. As a campus-like institution with 19 museums and galleries, 9 research centers, 18 archives, 1 library with 20 branches, and a zoo, the Smithsonian has a web presence that to date is as fragmented as its administrative parts (also see this presentation), and the chief goal of the web strategy is to offer the Smithsonian Commons as a unifying platform to SI units.

The initial Smithsonian Commons will be a Web site […] featuring collections of digital assets contributed voluntarily by the units and presented through a platform that provides best-of-class search and navigation; social tools such as commenting, recommending, tagging, collecting, and sharing; and intellectual property permissions that clearly give users the right to use, re-use, share, and innovate with our content without unnecessary restrictions.

As I started to skim through the report, this line in particular caught my attention:

We are like a retail chain that has desirable and unique merchandise but requires its customers to adapt to dramatically different or outdated idioms of signage, product availability, pricing, and check-out in every aisle of each store.

I think this is an apt metaphor for how the Smithsonian currently undermines its own potential, and should serve as a memorable rallying cry for the changes the web strategy advocates.

As coincidence would have it, this metaphor also handsomely dovetails with another intriguing piece of news, gleaned from the UK Museum Computer Group list (posted by Simon Cronshaw, Director of CultureLabel):

If you haven’t come across CultureLabel yet, our aim is to facilitate a united alliance of museum e-stores to forge a new mainstream consumer shopping category of ‘cultural shopping’ – in a similar way to how ethical shopping or alternative gifts have crystallised as buying categories in the public consciousness. We see this as a great new opportunity for both income generation and innovative audience development for all our culture partners.

While the Smithsonian aims to integrate its digital collection into a more cohesive web presence, CultureLabel aims to integrate museum e-stores (for starters, those in the UK – more here) into one massive one-stop shop. What’s true for digital collections is equally true for products from the museum store: bringing together assets from a wide variety of players creates a web presence with more gravity, which in turn attracts a wider audience. The Smithsonian Commons and CultureLabel both take advantage of a fundamental network effect: the more assets, the more users (customers / site visitors); the more users, the more participation (purchasing / tagging, commenting, etc.). The brand, a term featuring prominently both in the SI Web Strategy and on the CultureLabel website, is ultimately the biggest winner.

The Smithsonian web strategy acknowledges that the fragmented offering severely limits the impact pan-institutional assets currently have. Taking a step back, of course this logic also applies to the larger community: fragmenting our offerings into thousands of institutional websites severely limits the impact and potential of the collective museum collection.

With 60 participating museums and galleries, CultureLabel breaks down those institutional barriers and stands as one of the most extensive data-sharing exercises museums have engaged in to date. It’s a little sobering, if not surprising, that the gift shop is ahead of the collection in this instance. Can we do for museum collections what CultureLabel has done for museum commerce? Can we scale the model and the values of the Smithsonian Commons to a Commons for all museums? If it works for products, let’s make it work for digital collections.

Repositories and library cultures

Tuesday, March 10th, 2009 by John

When is a repository not a repository? When it’s an OPAC? Or are OPACs in reality a species of repository, however reluctant, given that the genus is usually invoked with a specific application in mind – a newcomer to the library world whose value is still not convincingly proven?

In the UK, JISC is about to award a tender for a study on The links between library OPACs and repositories in Higher Education Institutions. The invitation to tender states:

Repositories and OPACs … share various features and requirements. Both depend for their efficiency upon accurate metadata. Both provide a primary service to the home institution but also provide services to external users, for example in enabling access to content for a user from another institution. Various items of content may be accessible both through the library OPAC and through the repository, sometimes in different versions (e.g. a preprint in a repository and a published journal article under licence in an OPAC).

Its terms of reference include:

  • survey the extent to which repository content is in scope for institutional library OPACs, and the extent to which it is already recorded there;
  • examine the interoperability of OPAC and repository software for the exchange of metadata and other information;
  • list the various services to institutional managers, researchers, teachers and learners offered respectively by OPACs and by repositories;
  • make recommendations for the development of possible further links between library OPACs and institutional repositories, identifying the benefits of such links to various stakeholder groups.
Reading this reminded me that the University of Edinburgh has recently announced the introduction of an Open Access publication mandate. The Library will continue to run its Edinburgh Research Archive (ERA) open access repository alongside a new, closed Publications Repository (PR), which will support research assessment and profiling. As the criteria for institutional deposit proliferate, the mandate document includes a FAQ section to answer researchers’ concerns. One is:

What about research outputs which are not journal articles? The PR and ERA can accept most research output types including books, book chapters, conference proceedings, performances, video, audio etc. In some cases – for example books not available electronically – the PR/ERA will hold only metadata, with the possibility of links to catalogues so that users can find locations….

Read the rest of this entry »

Easy Access to Digitized Books

Thursday, December 11th, 2008 by Roy

Over on the Developer’s Network blog, where I sometimes blog along with other colleagues involved with OCLC Grid Services, Xiaoming Liu posted something that I think deserves much wider attention than the two readers that blog normally has (Hi Mom!).

In a nutshell, he describes how easy it can be to find out whether a particular book is openly available in full text by using the xOCLCNUM Web Service, which is free to OCLC cataloging subscribers (also known as “governing members”). According to his calculation, by using FRBR principles to collect related works, there are now nearly 2.5 million titles discoverable through this service that are available from the Internet Archive and HathiTrust.

So how does it work? Easy as pie. You query the service with an OCLC number, requesting the “url” field in plain-text format, and it returns the URLs of any openly available digital copies.

If multiple URLs exist for the same OCLC number, they are separated by a space. I’ve never been employed as a computer programmer, but even I can hit this softball out of the park: grab the OCLC numbers of library catalog search results, query the xOCLCNUM service, and for any that match, drop a link to the digital versions right on the search results screen.
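That softball swing can be sketched with the service call stubbed out. In the sketch below, `lookup_fulltext_urls` is a hypothetical stand-in for an actual xOCLCNUM request (whose space-separated plain-text response you would split into a list); the dictionary keys are likewise assumptions for illustration.

```python
def annotate_results(search_results, lookup_fulltext_urls):
    """Attach full-text links to library catalog search results.

    search_results: a list of dicts, each with an "oclcnum" key.
    lookup_fulltext_urls: a callable mapping an OCLC number to a list of
    URLs; in practice this would be an HTTP call to the xOCLCNUM service,
    with the space-separated response split via .split().
    """
    for result in search_results:
        urls = lookup_fulltext_urls(result["oclcnum"])
        if urls:
            # Drop the links to the digital versions right on the
            # search result screen.
            result["fulltext_urls"] = urls
    return search_results
```

Anything the service knows about gets a link; everything else passes through untouched.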

Easy as pie. Like falling off a log. Piece of cake. So why are you still hanging around here?