Archive for the 'Searching' Category

Breaking Open the ILS Silos

Friday, August 20th, 2010 by Roy

In 2007-2008, the Digital Library Federation (DLF) convened a Task Group to recommend standard interfaces for integrating the data and services of the Integrated Library System (ILS) with new applications supporting user discovery. The group produced a report with recommendations in December 2008. After that not much happened.

In February 2010, at the Code4Lib Conference, Karen Coombs (the OCLC Developer Network manager) and I brought together some of the people who had been on that task group, as well as other interested parties who were at the conference, to take this work to the next stage. At this ad hoc meeting we agreed that we were ready to do so, and that the next stage was to actually create a middleware layer that we could collaboratively maintain. Read the rest of this entry »

Next-Gen Harvesting

Thursday, February 4th, 2010 by Roy

Metadata harvesting (collecting metadata from others and aggregating it in a collection) is not new. Although there are any number of ways to do this, the OAI-PMH protocol for metadata harvesting is often used and has been around for years. It defines a small set of actions that allows anyone to discover what sets of metadata are available for harvesting from a digital repository, which metadata formats are offered, and select and download those records. Thousands of repositories worldwide support it, sometimes even unknowingly, because many repository applications such as DSpace and ePrints come with OAI-PMH support out of the box.
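As a rough illustration of that small set of actions, the sketch below builds the request URLs a harvester would send. The verbs (ListSets, ListMetadataFormats, ListRecords) and the oai_dc metadata prefix come from the OAI-PMH specification; the base URL and the set name "photos" are hypothetical placeholders, and no network request is made.

```python
from urllib.parse import urlencode

def oai_request(base_url, verb, **params):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb}
    query.update(params)
    return base_url + "?" + urlencode(query)

# Hypothetical repository base URL, used only for illustration.
BASE = "http://repository.example.org/oai"

# Discover which sets of metadata are available for harvesting.
print(oai_request(BASE, "ListSets"))
# Discover which metadata formats are offered.
print(oai_request(BASE, "ListMetadataFormats"))
# Select and download records from one (hypothetical) set as Dublin Core.
print(oai_request(BASE, "ListRecords", metadataPrefix="oai_dc", set="photos"))
```

A real harvester would also need to fetch each URL, parse the XML response, and follow resumption tokens for large result sets, none of which is shown here.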

This has led to a world in which there are metadata aggregators and even aggregators of aggregators. It has also led to potential confusion and difficulty. Records that are picked up from their “native” location and indexed and displayed elsewhere may not be depicted as the creator of that metadata intended. They also may not be refreshed in a timely fashion, thereby potentially leading to records that are out-of-date persisting in various corners of the Internet.

This is why when my colleagues on the services side of the house announced the WorldCat Digital Collection Gateway I sat up and took notice. This heralds a new world in which those being harvested can exert some control over not only how frequently their records are updated, but also how those records are depicted in the aggregation — in this case, WorldCat. Through a simple web-based interface, you can provide your OAI-PMH base URL, have the Gateway test harvest some records, view how those records would display in WorldCat, and change the mapping if you wish. Another benefit is that your records will then appear in all of the places WorldCat is syndicated.

A pilot project to test the Digital Collection Gateway was just announced, beginning March 1, and we are seeking volunteers to try it out and provide feedback. During the pilot you will be asked to:

  • Attend a two-hour webinar reviewing the use of the Gateway
  • Upload a minimum of 500 metadata records to WorldCat
  • Offer feedback and input on your experience with the Gateway to our support and product teams so we can improve the tool and workflows

If you would like to help us create a next-generation harvesting infrastructure, in which you control your metadata more than ever before, email us at oaister@oclc.org.

The Straight Dope on OAIster

Monday, September 21st, 2009 by Roy

As many of you are probably aware, OCLC and the University of Michigan announced last January that OCLC was taking over the OAIster aggregation of metadata harvested from OAI-compliant repositories. The University of Michigan was no longer able to support it, and was looking for assistance in sustaining this valuable community resource. As Kat Hagedorn remarked in regard to our agreement, “Hosting anything of this size quickly got out of hand for UM Libraries, and it took us a long time to realize it. Besides, greater access for more folks? Sounds win-win to me, as long as it’s continuously freely available.” [reported by Dorothea Salo]

I have heard lots of questions since we started contacting contributors with the most recent phase of the transfer plan, so the purpose of this post is to bring everyone up to date on why we are doing this, where things are, and what we hope to accomplish in the future. Read the rest of this entry »

Smithsonian Web Strategy, CultureLabel: The Impact of Network Effects

Friday, July 31st, 2009 by Günter

The Smithsonian just announced the release of its Web and New Media Strategy v 1.0 [pdf], which has come together swiftly in a process of marvelous openness and inclusion. As a campus-like institution with 19 museums and galleries, 9 research centers, 18 archives, 1 library with 20 branches, and a zoo, the Smithsonian web-presence to date is as fragmented as its administrative parts (also see this presentation), and the chief goal of the web strategy is to offer the Smithsonian Commons as a unifying platform to SI units.

The initial Smithsonian Commons will be a Web site […] featuring collections of digital assets contributed voluntarily by the units and presented through a platform that provides best-of-class search and navigation; social tools such as commenting, recommending, tagging, collecting, and sharing; and intellectual property permissions that clearly give users the right to use, re-use, share, and innovate with our content without unnecessary restrictions.

Starting to skim through the report, this line in particular caught my attention:

We are like a retail chain that has desirable and unique merchandise but requires its customers to adapt to dramatically different or outdated idioms of signage, product availability, pricing, and check-out in every aisle of each store.

I think this is an apt metaphor for how the Smithsonian currently undermines its own potential, and should serve as a memorable rallying cry for the changes the web strategy advocates.

As coincidence would have it, this metaphor also handsomely dovetails with another intriguing piece of news, gleaned from the UK Museum Computer Group list (posted by Simon Cronshaw, Director of CultureLabel):

If you haven’t come across CultureLabel yet, our aim is to facilitate a united alliance of museum e-stores to forge a new mainstream consumer shopping category of ‘cultural shopping’ - in a similar way to how ethical shopping or alternative gifts have crystallised as buying categories in the public consciousness. We see this as a great new opportunity for both income generation and innovative audience development for all our culture partners.

While the Smithsonian aims to integrate its digital collection into a more cohesive webpresence, CultureLabel aims to integrate museum e-stores (for starters, those in the UK - more here) into one massive one-stop shop. What’s true for digital collections is equally true for products from the museum store: bringing together assets from a wide variety of players creates a webpresence with more gravity, which in turn will attract a wider audience. The Smithsonian Commons and CultureLabel both take advantage of a fundamental network effect: the more assets, the more users (customers / site visitors); the more users, the more participation (purchasing / tagging, commenting, etc.). The brand, a term featuring prominently both in the SI Web Strategy and on the CultureLabel website, ultimately is the biggest winner.

The Smithsonian web strategy acknowledges that the fragmented offering severely limits the impact pan-institutional assets currently have. Taking a step back, of course this logic also applies to the larger community: fragmenting our offerings into thousands of institutional websites severely limits the impact and potential of the collective museum collection.

With 60 participating museums and galleries, CultureLabel breaks down those institutional barriers, and stands as one of the most extensive data sharing exercises museums have engaged in to date. It’s a little sobering, if not surprising, that the gift shop is ahead of the collection in this instance. Can we do for museum collections what CultureLabel has done for museum commerce? Can we scale the model and the values of the Smithsonian Commons to a Commons for all museums? If it works for products, let’s make it work for digital collections.

Repositories and library cultures

Tuesday, March 10th, 2009 by John

When is a repository not a repository? When it’s an OPAC? Are OPACs in reality a species of repository, however reluctantly, given that the genus is usually used with a specific application in mind - one which is a newcomer to the library world whose value is still not convincingly proven?

In the UK, JISC is about to award a tender for a study on The links between library OPACs and repositories in Higher Education Institutions. The invitation to tender states:

Repositories and OPACs … share various features and requirements. Both depend for their efficiency upon accurate metadata. Both provide a primary service to the home institution but also provide services to external users, for example in enabling access to content for a user from another institution. Various items of content may be accessible both through the library OPAC and through the repository, sometimes in different versions (e.g. a preprint in a repository and a published journal article under licence in an OPAC).

Its terms of reference include:

  • survey the extent to which repository content is in scope for institutional library OPACs, and the extent to which it is already recorded there;
  • examine the interoperability of OPAC and repository software for the exchange of metadata and other information;
  • list the various services to institutional managers, researchers, teachers and learners offered respectively by OPACs and by repositories;
  • make recommendations for the development of possible further links between library OPACs and institutional repositories, identifying the benefits of such links to various stakeholder groups.

Reading this reminded me that the University of Edinburgh has recently announced the introduction of an Open Access publication mandate. The Library will continue to run its Edinburgh Research Archive (ERA) open access repository alongside a new, closed, Publications Repository (PR), which will support research assessment and profiling. As the criteria for institutional deposit proliferate, the mandate document includes a FAQ section to answer researchers’ concerns. One is:

    What about research outputs which are not journal articles? The PR and ERA can accept most research output types including books, book chapters, conference proceedings, performances, video, audio etc. In some cases – for example books not available electronically – the PR/ERA will hold only metadata, with the possibility of links to catalogues so that users can find locations….

    Read the rest of this entry »

    Easy Access to Digitized Books

    Thursday, December 11th, 2008 by Roy

    Over on the Developer’s Network blog, where I sometimes blog along with other colleagues involved with OCLC Grid Services, Xiaoming Liu posted something that I think deserves much wider attention than the two readers that blog normally has (Hi Mom!).

    In a nutshell, he describes how easy it can be to find out if a particular book is openly available in full-text by using the xOCLCNUM Web Service, which is free to OCLC cataloging subscribers (also known as “governing members”). According to his calculation, by using FRBR principles to collect related works, there are now nearly 2.5 million titles discoverable through this service that are available from the Internet Archive and the Hathi Trust.

    So how does it work? Easy as pie. For example, this URL:

    http://xisbn.worldcat.org/webservices/xid/oclcnum/51848364?method=getEditions&library=ebook&fl=oclcnum,url&format=txt

    Would retrieve a result like this:

    465222    http://www.archive.org/details/lifeexploitsofin02cerviala
    4730463    http://www.archive.org/details/ingeniousgentlem02cerv

    If multiple URLs exist for the same OCLC number, they are separated by a space. I’ve never been employed as a computer programmer but even I can hit this softball out of the park. Grab the OCLC numbers of library catalog search results, query the xOCLCNUM service, and for any that match, drop a link to the digital versions right on the search result screen.
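Here is a minimal sketch of that workflow, assuming the plain-text response format shown above (an OCLC number, whitespace, then one or more space-separated URLs). The function names are made up for illustration, the query parameters are copied from the example URL on the assumption that library and fl are separate parameters, and the sample response is pasted in rather than fetched, so no request is actually sent.

```python
from urllib.parse import urlencode

XID_BASE = "http://xisbn.worldcat.org/webservices/xid/oclcnum/"

def editions_url(oclcnum):
    """Build the getEditions query URL for one OCLC number."""
    params = {
        "method": "getEditions",
        "library": "ebook",
        "fl": "oclcnum,url",
        "format": "txt",
    }
    # safe="," keeps the comma in the field list unescaped, as in the example URL.
    return XID_BASE + str(oclcnum) + "?" + urlencode(params, safe=",")

def parse_editions(text):
    """Map each OCLC number in a txt response to its list of full-text URLs."""
    links = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines or lines without a URL
        links[fields[0]] = fields[1:]
    return links

# The sample result quoted above, used in place of a live response.
SAMPLE = (
    "465222    http://www.archive.org/details/lifeexploitsofin02cerviala\n"
    "4730463    http://www.archive.org/details/ingeniousgentlem02cerv\n"
)

print(editions_url(51848364))
for oclcnum, urls in parse_editions(SAMPLE).items():
    print(oclcnum, "->", urls[0])
```

In a catalog, the dictionary returned by parse_editions could be consulted while rendering each search result, adding a link wherever an OCLC number has a match.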

    Easy as pie. Like falling off a log. Piece of cake. So why are you still hanging around here?

    Te Puna Mātauranga o Aotearoa Rocks My World

    Wednesday, December 3rd, 2008 by Roy

    The National Library of New Zealand (Te Puna Mātauranga o Aotearoa in Maori and an RLG Partner) has obviously been busy. Last week they joined the Flickr Commons, and they have already reported some impressive use statistics. But today (well, yesterday in Kiwi time) came an even bigger announcement.

    Digital New Zealand, “a nation-wide project to help make New Zealand digital content easier to find, share and use was launched at the National Library of New Zealand on 3 December 2008.” The incredible array of collections made available through this one interface would be news enough for many libraries. But the joy doesn’t stop there.

    The project welcomes additional content contributors, and stands ready to provide advice and assistance to help them to do so. Visitors are offered an opportunity to create a tailored search of the site and drop the resulting widget onto any web page they like or use the special search page that is created for them right on the Digital New Zealand site.

    If a visitor doesn’t wish to create a tailored web widget, they already have a library of such from which to choose. And for the true technorati, there is the developer section, which provides a simple way for software developers to get a key to be able to use the application programming interface (API) of the site. If all of this isn’t enough to knock your socks off, stay tuned.

    The “Memory Maker” is a web-based way to mix and match video clips into your own cinematic production. I kid you not. Try it out. You can add audio or music to add your own special touches. I doubt that any movie miracles will be made here, but the level of interactivity is completely off the charts. To get the full measure of this, you simply must see this movie.

    So by now you must think surely I am done singing the praises of Te Puna Mātauranga o Aotearoa, but I’m not. There’s still more. Like I said, they’ve obviously been busy. The last thing I want to highlight is their National Digital Heritage Archive. Long in the works through a partnership with ExLibris, this preservation system went live on November 4. “The National Digital Heritage Archive (NDHA),” states the web site, “is the National Library’s technical and business solution to preserve and provide long-term public access to its digital heritage collections.” The NLNZ was the flagship partner with ExLibris, and the product is based on the Open Archival Information System (OAIS) model and conforms to trusted digital repository (TDR) requirements (which came out of joint RLG-OCLC work before the two organizations joined).

    This is an incredible array of new initiatives by any measure, and a tribute to the leadership of Penny Carnaby, Chief Executive and National Librarian, and John Truesdale, Director National Digital Library, and of course many others who were instrumental in accomplishing all of this work. For my part, it’s hard to believe that it was only a bit more than a year ago when I was talking with Penny and John in a Melbourne bar after participating in a National and State Libraries Australasia strategic planning meeting. They have much to celebrate, as do we, since they are doing much from which we can learn. I simply can’t wait to see what comes next.

    LC-Flickr: updating the catalog

    Wednesday, October 22nd, 2008 by Günter

    In the context of John MacColl’s guest blog on Karen Calhoun’s Metalogue, I was reminded of the stats from the LC-Flickr project pertaining to changes LC made in their own catalog prompted by insightful Flickr comments.

    When I last updated my Flickr slides for a class at Syracuse University, I found 174 records containing the word “flickr” in an all text field search of LC’s Prints and Photographs Online Catalog. The records in that set usually contain a credit such as “Source: Flickr Commons project” for information which has been added, like in this instance.

    The same search today yields a whopping 4,256 records - which is quite close to the entire set of images LC has on Flickr (4,615 as of today). Upon closer inspection, I found that many of these records don’t contain a change to the substance of the record - however, they now do have a useful pointer to a discussion about the photograph on the Flickr site, and that’s why my search retrieved them. For an example, see this record which includes the following language: “Additional information about this photograph might be available through the Flickr Commons project at http://www.flickr.com/photos/library_of_congress/2369119062”. On Flickr, one can then follow a playful discussion about dating the photograph.

    Interestingly enough, these links to Flickr aren’t programmatic – an item which doesn’t have comments on Flickr doesn’t seem to receive the link. See for yourself – the LC equivalent of this Flickr image does not contain the pointer in the LC record, since there was no comment on the image in Flickr.

    It looks like LC continues to update its records based on Flickr user feedback, and they’re also creating links so people searching the LC catalog exclusively don’t miss out on the oftentimes rich discussion on Flickr. A search for “Source: Flickr Commons” yields 509 exact phrase hits, which is the portion which most likely represents actual updates to the catalog.

    A Map to Destinations Uncrawled

    Friday, July 18th, 2008 by Roy

    As part of what we’re trying to accomplish in the “Modeling New Service Infrastructures” part of the RLG Work Agenda, I’ve been working on a white paper on best practices for enhancing disclosure of library, museum and archive content at the network level. Using a sitemap to better expose content to web search engines such as Google is one technique that I’m investigating, but it is clear that this technique is by no means a silver bullet.

    For those of you who aren’t familiar with sitemaps, they are a way to expose information for a web crawler that is normally hidden behind a database wall. Since web crawlers cannot intuit all the necessary queries to extract all the data from a database, other techniques must be used to provide the crawler with crawlable URLs. One such technique is to create a sitemap, which is an XML file with all the URLs you wish to have crawled on your site. Google, Yahoo!, and Microsoft have collaborated through the Sitemaps.org site to establish a common protocol for sitemaps.
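To make that concrete, here is a minimal sketch that writes such an XML file using Python's standard library. The urlset, url and loc elements and the namespace are those defined by the Sitemaps.org protocol; the record URLs are hypothetical stand-ins for pages that normally sit behind a database query.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a sitemap XML string listing every URL to be crawled."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # special characters are escaped on output
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical database-backed pages, one per record.
record_urls = [
    "http://photos.example.org/image?id=%d" % record_id
    for record_id in (1, 2, 3)
]

print(build_sitemap(record_urls))
```

The protocol also allows optional per-URL elements such as lastmod and priority, which are left out here to keep the sketch short.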

    I’ve used this for some time now to make sure all the images on my photos site were exposed to crawling. But recently I checked on it in Google’s Webmaster Tools site (where you can register your sitemap to make sure Google finds it), and I discovered to my dismay that only a small percentage (38%) of the URLs had been indexed in Google. On that site they state that “Most sites will not have all of their pages indexed,” but the only clues why are perhaps contained in the “Webmaster Guidelines” that describe what to do or not do to increase your chances of being indexed.

    I asked the subscribers to the Web4Lib discussion what their experience has been. Debbie Campbell of the National Library of Australia (an RLG partner institution) reported that of the over a million items in Picture Australia, Google had indexed only about 49% of the URLs, although it was up dramatically over a couple of weeks ago when the number was a paltry 32,000.

    Marshall Breeding of Vanderbilt University and the owner of Library Technology Guides reported a higher percentage of 62% of his URLs indexed. He also reported that he’s doing some testing of the priority setting of potentially uncrawled URLs (based on low page views) in his sitemap to see if that helps.

    Danielle Plumer of the Texas State Library and Archives Commission pointed out an article that discusses related issues, since of course this is a topic of interest to those trying to achieve “search engine optimization”. She also sent along part of a Q&A with Matt Cutts of Google that bears on this issue:

    Q: “My sitemap has about 1350 urls in it. . . . . its been around for 2+ years, but I cannot seem to get all the pages indexed. Am I missing something here?”
    A: One of the classic crawling strategies that Google has used is the amount of PageRank on your pages. So just because your site has been around for a couple years (or that you submit a sitemap), that doesn’t mean that we’ll automatically crawl every page on your site. In general, getting good quality links would probably help us know to crawl your site more deeply. You might also want to look at the remaining unindexed urls; do they have a ton of parameters (we typically prefer urls with 1-2 parameters)? Is there a robots.txt? Is it possible to reach the unindexed urls easily by following static text links (no Flash, JavaScript, AJAX, cookies, frames, etc. in the way)? That’s what I would recommend looking at.
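One of those checks is easy to automate. The sketch below flags URLs that carry more than two query parameters, taking the "1-2 parameters" preference mentioned in the answer as the threshold. The function names and example URLs are invented for illustration, and the list of unindexed URLs is assumed to come from your own comparison of the sitemap against what the search engine reports.

```python
from urllib.parse import urlsplit, parse_qsl

def parameter_count(url):
    """Count the query-string parameters in a URL."""
    return len(parse_qsl(urlsplit(url).query, keep_blank_values=True))

def flag_parameter_heavy(urls, limit=2):
    """Return the URLs that carry more than `limit` query parameters."""
    return [u for u in urls if parameter_count(u) > limit]

# Hypothetical unindexed URLs from a sitemap.
unindexed = [
    "http://photos.example.org/image?id=42",
    "http://photos.example.org/view?coll=7&item=42&size=lg&lang=en",
]

for url in flag_parameter_heavy(unindexed):
    print(parameter_count(url), "parameters:", url)
```

This covers only the parameter question; the robots.txt and static-link checks in the answer would need separate handling.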

    So what can I conclude from this? Not a whole lot yet, except that creating a sitemap and registering it with Google is not by any means a silver bullet. It’s still a good idea, but it should only be one strategy of many we can take to get the unique content that libraries, museums and archives hold in front of the roving eyes of web users. It’s exactly that suite of strategies I’m trying to identify and describe, so please let me know of any methods you’re using that you think may be effective, either as a comment to this post or direct email.

    The Quest for the Single, Simple, Successful Search

    Tuesday, April 22nd, 2008 by Ricky

    Yes, for most people the Grail is Google. For many of us in the library world (even for some library users) it’s WorldCat. But there are apparently still numerous quests for one way to search everything that’s available at an institution.

    As discovery continues to rise to the network level, there are still some valid reasons to think locally:
    • The undergrad wanting to know what she can get her hands on now, without having to think of all the various places on campus she might look (not to mention the systems she’d have to navigate)
    • The museum curator hoping to use other objects or resources in other departments to augment the exhibit he’s planning
    • The fundraiser needing to showcase the breadth and depth of the collections, highlighting whatever topic is of interest to the potential donor
    • A faculty member wanting to know what resources the university has that she might use for a new course
    • The special collections curator courting a donor of a collection, who wants to highlight other like materials across the institution and to show how easily they are accessed for research
    • An institution wishing to showcase and provide access to its collections on the institutional web site.

    In eight visits I’ve made to RLG partner institutions in the last six months, six of them said that this quest was a top priority for them (and perhaps when the other two finish their big building projects, the quest will find its way to the top of their priorities, too). Some call it metasearch, others call it federated searching, others plan to use OAI harvesting to create a single index of all their collections, and still others think about putting all the metadata into a single system. Most of them had tried something already and were unhappy with it and were determined to try something else. No matter which approach they had tried, in each case the perceived problem involved mapping the data. It wasn’t done correctly and next time they hoped it would be better.

    If only the mapping were better, the functionality would be improved. If only the mapping were better, the results could be presented in a more meaningful manner. If only the mapping were better, disparate data would coexist more easily together. Some thought the poor mapping was a characteristic of their chosen approach and if they went another route (or even the same route with different software) the problem could be addressed. The community is playing musical chairs with federated search software. When the music stops, one will try it again with SingleSearch, one will try it again with MetaLib, one will try it again with MetaFind, one with WebFeat, one with LibraryFind …, and the one left out will try OAI harvesting. All hoping to make the mapping work this time.

    Thinking about this on the plane after the last of these visits, it became clear to me that it’s not about the mapping; that’s a red herring. Stop trying to do it better; it can’t be done. Sure if you’re mapping two collections of books both using MARC and AACR2, you can do a pretty good job (and some of the visited institutions had). But these institutions wanted to allow people to search across books, special collections, archives and museum collections, digital image collections, faculty and departmental collections, and in one case, even course offerings and faculty bios.

    The network can’t provide the solution; much of the content is not likely to find its way into WorldCat or the open web, due to rights issues (think slide libraries) or other content ownership issues (think museum images with revenue potential) and some of it is really local (think teaching materials or licensed images).

    The simple fact is that, to offer fielded searching of disparate data, the data has to be mapped and the lowest common denominator prevails. As anyone who’s done any mapping knows, not all metadata is created equal. Just coming up with a lowest common denominator is impossible. Images of geological formations may not have a creator, paintings don’t necessarily have a subject, ancient artifacts may not have titles, dates are often unknown… Forcing all records to map to a set of required fields and then offering parametric searching on those fields guarantees that a lot of relevant content will be omitted from the result set.
    (At RLG, we found this to be the case even when we had a single standard (EAD) in a single union catalog (ArchiveGrid), because the standard had been applied differently by the various contributors and the collections being described varied in their nature and hence their description. While EAD allows for tagging of personal names, geographic locations, and controlled subject headings, the tags had been used so inconsistently that to offer indexes based on those fields would have resulted in vastly underreporting relevant results.)

    So what is the right approach, you might ask. I think we need to shift our focus from mapping and start looking at other ways to approach the problem.

    If we offer keyword searching of all the data in the records, we’ll get a big set that will likely include some irrelevant items. Lots of recall, not much precision. What would Google do? Improve the result set.

    We can do that in many ways:
    • We could tweak the relevance ranking algorithms. While we might not use the fields for searching, we can still use them for display – and for determining relevance. We could decide that if the search term was found in creator, title, or subject fields to rank that result higher.
    • We could track previous use and put the most viewed records at the top.
    • We could improve the data by using automated processes like Open Calais to identify personal names and geographic locations, to deduce subjects by text analysis, to normalize dates …
    o Then we could improve the user experience by using those elements in a meaningful display of the results; sorting and quantifying the results by various elements would give the user a better sense of the nature of the result set.
    o And we could offer the user ways to manipulate the results by those facets, much as WorldCat.org and IndexData’s MasterKey do.
    • We could investigate ways to pre-limit by offering ways to search just a slice of the whole (anthropological content, things that have been digitized, non-book materials…).
    • We might seek APIs to other services like name authorities or subject thesauri to improve or expand the query.
    • We might look for ways to tap into things like LibraryThing, WorldCat, or flickr to use network effects to enrich our results.
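A couple of these ideas can be sketched in a few lines: keyword searching across whatever fields a record happens to have, ranking hits higher when the term appears in creator, title, or subject, and counting results by a facet. The records, field names, and weights below are all invented for illustration, and the scoring is deliberately crude; it is not how any particular product works.

```python
from collections import Counter

# Illustrative weights: a hit in these fields counts for more than a hit elsewhere.
FIELD_WEIGHTS = {"creator": 3, "title": 3, "subject": 2}
DEFAULT_WEIGHT = 1

def score(record, term):
    """Add up the weights of every field whose text contains the search term."""
    term = term.lower()
    total = 0
    for field, value in record.items():
        if term in str(value).lower():
            total += FIELD_WEIGHTS.get(field, DEFAULT_WEIGHT)
    return total

def search(records, term):
    """Keyword search over all fields, highest-scoring records first."""
    scored = [(score(r, term), r) for r in records]
    hits = [(s, r) for s, r in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in hits]

def facet_counts(results, field):
    """Count results by the value of one field, for a faceted display."""
    return Counter(r[field] for r in results if field in r)

# Hypothetical records; note that no field is required of every record.
records = [
    {"id": 1, "type": "image", "description": "Basalt columns near the coast"},
    {"id": 2, "type": "book", "title": "Basalt and other volcanic rock", "creator": "A. Author"},
    {"id": 3, "type": "artifact", "description": "Stone tool, date unknown"},
]

results = search(records, "basalt")
print([r["id"] for r in results])
print(facet_counts(results, "type"))
```

The image record has no title or creator, yet it still turns up in the results, just ranked below the record where the term appears in the title.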

    If mapping is the roadblock on the route to Single, Simple, Successful Search, let’s choose a different route and get on with the quest.

    And once we’re good at making our wonderful resources accessible within our own institutions, won’t we be in a better position to make them accessible to the world?

    Read the rest of this entry »