Archive for April, 2008

The Quest for the Single, Simple, Successful Search

Tuesday, April 22nd, 2008 by Ricky

Yes, for most people the Grail is Google. For many of us in the library world (even for some library users) it’s WorldCat. But there are apparently still numerous quests for one way to search everything that’s available at an institution.

As discovery continues to rise to the network level, there are still some valid reasons to think locally:
• The undergrad who wants to know what she can get her hands on now, but doesn’t want to think about all the various places on campus she might look (not to mention the systems she’d have to navigate)
• The museum curator hoping to use other objects or resources in other departments to augment the exhibit he’s planning
• The fundraiser needing to showcase the breadth and depth of the collections, highlighting whatever topic is of interest to the potential donor
• A faculty member wanting to know what resources the university has that she might use for a new course
• The special collections curator courting a donor of a collection, who wants to highlight other like materials across the institution and to show how easily they are accessed for research
• An institution wishing to showcase and provide access to its collections on the institutional web site

In eight visits I’ve made to RLG partner institutions in the last six months, six of them said that this quest was a top priority for them (and perhaps when the other two finish their big building projects, the quest will find its way to the top of their priorities, too). Some call it metasearch, others call it federated searching, others plan to use OAI harvesting to create a single index of all their collections, and still others think about putting all the metadata into a single system. Most of them had tried something already and were unhappy with it and were determined to try something else. No matter which approach they had tried, in each case the perceived problem involved mapping the data. It wasn’t done correctly and next time they hoped it would be better.

If only the mapping were better, the functionality would be improved. If only the mapping were better, the results could be presented in a more meaningful manner. If only the mapping were better, disparate data would coexist more easily together. Some thought the poor mapping was a characteristic of their chosen approach and if they went another route (or even the same route with different software) the problem could be addressed. The community is playing musical chairs with federated search software. When the music stops, one will try it again with SingleSearch, one will try it again with MetaLib, one will try it again with MetaFind, one with WebFeat, one with LibraryFind …, and the one left out will try OAI harvesting. All hoping to make the mapping work this time.

Thinking about this on the plane after the last of these visits, I realized that it’s not about the mapping; that’s a red herring. Stop trying to do it better; it can’t be done. Sure, if you’re mapping two collections of books, both using MARC and AACR2, you can do a pretty good job (and some of the visited institutions had). But these institutions wanted to allow people to search across books, special collections, archives and museum collections, digital image collections, faculty and departmental collections, and in one case, even course offerings and faculty bios.

The network can’t provide the solution; much of the content is not likely to find its way into WorldCat or the open web, due to rights issues (think slide libraries) or other content ownership issues (think museum images with revenue potential) and some of it is really local (think teaching materials or licensed images).

The simple fact is that, to offer fielded searching of disparate data, the data has to be mapped and the lowest common denominator prevails. As anyone who’s done any mapping knows, not all metadata is created equal. Just coming up with a lowest common denominator is impossible. Images of geological formations may not have a creator, paintings don’t necessarily have a subject, ancient artifacts may not have titles, dates are often unknown… Forcing all records to map to a set of required fields and then offering parametric searching on those fields guarantees that a lot of relevant content will be omitted from the result set.
(At RLG, we found this to be the case even when we had a single standard (EAD) in a single union catalog (ArchiveGrid), because the standard had been applied differently by the various contributors and the collections being described varied in their nature and hence their description. While EAD allows for tagging of personal names, geographic locations, and controlled subject headings, the tags had been used so inconsistently that to offer indexes based on those fields would have resulted in vastly underreporting relevant results.)
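To make the point concrete, here is a minimal sketch in Python. The records and field names are invented for illustration and do not come from any real system. It compares a fielded (parametric) search with a plain keyword search over everything in the record.

```python
# Hypothetical records from three different collections.
# Field names and values are illustrative assumptions, not a real schema.
RECORDS = [
    {"id": "book-1", "creator": "Darwin, Charles",
     "title": "On the Origin of Species", "subject": "evolution"},
    # An image with no creator and no title, only a free-text description.
    {"id": "image-7",
     "description": "Basalt columns, photographed for a geology course"},
    # An artifact with no title and no known date.
    {"id": "artifact-3",
     "description": "Bronze vessel, origin and date unknown",
     "subject": "bronze"},
]


def fielded_search(records, field, term):
    """Return ids of records whose named field contains the term."""
    term = term.lower()
    return [r["id"] for r in records if term in r.get(field, "").lower()]


def keyword_search(records, term):
    """Return ids of records where the term appears in any field except id."""
    term = term.lower()
    return [
        r["id"]
        for r in records
        if any(term in str(v).lower() for k, v in r.items() if k != "id")
    ]


# A title search cannot find the image, because it has no title field.
print(fielded_search(RECORDS, "title", "basalt"))
# A keyword search over the whole record finds it.
print(keyword_search(RECORDS, "basalt"))
```

In this toy data the title search comes back empty while the keyword search returns the image record, which is the underreporting problem in miniature.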

So what is the right approach, you might ask. I think we need to shift our focus from mapping and start looking at other ways to approach the problem.

If we offer keyword searching of all the data in the records, we’ll get a big set that will likely include some irrelevant items. Lots of recall, not much precision. What would Google do? Improve the result set.

We can do that in many ways:
• We could tweak the relevance ranking algorithms. While we might not use the fields for searching, we can still use them for display – and for determining relevance. We could decide that if the search term was found in creator, title, or subject fields to rank that result higher.
• We could track previous use and put the most viewed records at the top.
• We could improve the data by using automated processes like Open Calais to identify personal names and geographic locations, to deduce subjects by text analysis, to normalize dates …
o Then we could improve the user experience by using those elements in a meaningful display of the results; sorting and quantifying the results by various elements would give the user a better sense of the nature of the result set
o And we could offer the user ways to manipulate the results by those facets, much as WorldCat.org and IndexData’s MasterKey do.
• We could investigate ways to pre-limit by offering ways to search just a slice of the whole (anthropological content, things that have been digitized, non-book materials…).
• We might seek APIs to other services like name authorities or subject thesauri to improve or expand the query.
• We might look for ways to tap into things like LibraryThing, WorldCat, or flickr to use network effects to enrich our results.
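Two of the ideas above, using fields for ranking rather than for searching and summarizing the result set by facets, can be sketched roughly as follows. This is only an illustration in Python: the weights, the field names, and the sample records are all assumptions, and a real system would lean on a search engine's own scoring.

```python
from collections import Counter

# Hypothetical weights: a keyword hit in creator, title, or subject counts
# for more than a hit anywhere else in the record.
FIELD_WEIGHTS = {"creator": 3.0, "title": 3.0, "subject": 2.0}
DEFAULT_WEIGHT = 1.0
# Fields used for display and faceting only, never for scoring.
UNSCORED_FIELDS = {"id", "type"}

# Invented sample records; "type" stands in for any facet-worthy element.
SAMPLE = [
    {"id": "a", "type": "image",
     "description": "Photograph of a Napoleon statue"},
    {"id": "b", "type": "book", "title": "Napoleon: A Life",
     "subject": "Napoleon I, Emperor of the French"},
    {"id": "c", "type": "archive",
     "description": "Letters of a merchant family"},
    {"id": "d", "type": "image", "title": "Napoleon crossing the Alps"},
]


def score(record, term):
    """Sum the weights of every scored field that contains the term."""
    term = term.lower()
    return sum(
        FIELD_WEIGHTS.get(field, DEFAULT_WEIGHT)
        for field, value in record.items()
        if field not in UNSCORED_FIELDS and term in str(value).lower()
    )


def rank(records, term):
    """Keyword-match every record, drop non-matches, sort best score first."""
    hits = [(score(r, term), r) for r in records]
    hits = [(s, r) for s, r in hits if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)  # stable on ties
    return [r for _, r in hits]


def facet_counts(results, field):
    """Count results by the value of one field, for a facet display."""
    return Counter(r.get(field, "unknown") for r in results)


results = rank(SAMPLE, "napoleon")
print([r["id"] for r in results])
print(facet_counts(results, "type"))
```

Every record that mentions the term anywhere still comes back, so nothing is lost for want of a title or creator; the fields only influence the order and the facet summary.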

If mapping is the roadblock on the route to Single, Simple, Successful Search, let’s choose a different route and get on with the quest.

And once we’re good at making our wonderful resources accessible within our own institutions, won’t we be in a better position to make them accessible to the world?


Get out the vote

Tuesday, April 22nd, 2008 by Merrilee

No, this is not about the Pennsylvania primary (or our seemingly endless slog to the US Democratic Party nominations, which one has to assume will come to an end sometime this summer)…

The Canadian Archivist Blog called my attention to the Webby Awards, and the fact that the U.S. National Archives and Records Administration’s Digital Vaults site has been nominated for a People’s Voice Webby in the Society/Cultural Institutions category.

I have not paid close attention to the Webby Awards in the last few years. Scanning through the list of nominees for this year, I do see a few RLG Partners nominated (NARA, Smithsonian, the National Gallery of Art, MOMA, etc.). But the vast majority of nominees are commercial players. In almost every category, I see sites that add value to my life on a daily basis. These are not library or “memory institution” sites, although in some cases (as with Flickr) content from libraries, archives, museums, etc. flows to these sites. A good list to look at, as a reminder of where attention increasingly is focussed.

In our own world, we have various web awards with accompanying nominations. “Best archives on the web” over at ArchivesNext comes to mind. Which reminds me, the nomination period for the archives Movers and Shakers is coming to a close at the end of this week. So get in your vote (or nomination, in this case) by the end of the week. Who’s making a difference in the archival community? In the interest of disclosure, I’ll reveal that I’ve been asked to judge, so please send in a nomination or two!

And those of you in Pennsylvania, do what you can to bring the Democratic Party nominations to a close, willya?

A role for ‘Libraries of the Future’ in the UK: digitisation via research-based learning

Tuesday, April 22nd, 2008 by John

The Guardian newspaper in the UK has today published an education supplement devoted to JISC’s ‘Libraries of the Future’ theme. This reflects excellent communications work by JISC, whose Annual Conference was held last week in Birmingham, with OCLC as its main sponsor.

Malcolm Read, JISC Executive Secretary. Flickr image by James F Clay

Several of our UK RLG Programs Partners are mentioned in the various articles. The British Library’s 19th century newspaper collection, made available free to universities with JISC funding, is mentioned. Sheila Cannell, Library Director at Edinburgh, is quoted on the subject of library design, and an article which profiles the changing world of library directors features Jean Sykes (LSE) and Anne Bell (Warwick).

Indeed, one of the most interesting articles describes a collaboration – funded with a two-year grant from the University’s Education Innovation Fund – between the Library of the University of Warwick and its department of French Studies, to assist in the digitisation of a large collection of 18th and 19th century French plays from the Library’s Modern Records Centre. The approach to digitisation employed here seems to belong to a new category: not ‘boutique’, in which a special collection or subset thereof is singled out and funded for digitisation; nor ‘industrial’ (the Google, OCA or Microsoft approach, in which digitisation starts at the beginning of a large collection and works indiscriminately through to the end); nor on-demand, where a user’s request leads to individual items being digitised. The approach taken with the Marandet Collection is led by research-based learning, from within an undergraduate programme.

Students chose a group of undigitised plays from the collection (in this case plays from the period 1799-1815), and focused upon it in order to produce essays analysing themes drawn from the literature and history of Napoleonic France. The plays were digitised by the Library, with the students being responsible for the quality of the finished digital resource, and the resulting essays (if marked highly enough) were then placed online by the Department in an ejournal created for the purpose (not in the Library’s nascent institutional repository, however, which – like most in the UK – will include only student work at PhD level). What is of particular interest in this approach is the requirement for students (final-year undergraduates) to appreciate the digital curation of a research collection, at the same time as showing evidence of scholarship in the literature and history of the period. The idea of learning how to do research is thus extended to include the process of documentary preservation via digitisation, which happily has the added benefit of adding to the digital corpus within a discipline. The course lecturer, Katherine Astbury, describes the project in the University’s Interactions journal. ‘The first stage for students will be to identify a corpus of plays from within the Marandet collection … These will be digitised to provide a permanent resource for researchers world-wide. The students will thus gain experience of research-based practice by selecting texts from the Marandet collection for preservation through digitisation and then be responsible for overseeing the preservation process to the finish: checking the digitised content, uploading them and adding to what is a very scant body of secondary material by writing on their selected plays’. 
Warwick is to be commended on the imagination which led to this remarkable project, which advances digitisation, scholarship among junior researchers, and research in the period of Napoleonic France in one fell swoop.

Playing with Twine

Friday, April 18th, 2008 by Merrilee

I attended the CNI Task Force meeting last week (was it just last week?). One of the project briefings I attended was on Twine, a tool that supports social bookmarking, provides file storage (and sharing), and provides collaborative editing environments. Twine also encourages you to use it for other functions: instead of using a blog, use Twine! Instead of using an email list, use Twine! As Twine gets to know you, it will give recommendations for resources that you might find interesting.

I’ve been using Twine this week, mostly to park things (webpages, so far) that I might blog about. I am super lazy about tagging things. Part of that is because I find other people’s tags not so useful, and I’m not convinced I would find my own tags so very useful either. So one of the features I find interesting is the automatic tagging of resources that Twine provides. These are broken down into people, places, organizations, other tags, and types of items. I can add tags if I want, and I can kill off tags I don’t like. Supposedly, Twine will make this data open so that others can build applications that make use of it. Right now, I’m more likely to kill tags than add them, but if someone else can make use of my work, I’m more likely to add tags.

Twine is underpinned by semantic webby stuff (“powered by semantic understanding,” is what Twine says). I’ll admit that I have never fully understood the semantic web. While I recognize that this makes me a shallow person, Twine is an application where even I can see the semantic web in action in a very small way.

If you are interested in finding out more about Twine, take the tour, or read this blog post which sings the praises of Twine.

Although I can’t give this application a ringing endorsement yet, I’m interested in having more of us in the library, archives, and museum space play around with tools like this. I think this can help give understanding of and insight into “personal research spaces” that researchers and others may be using now or in the future. If you are interested in getting an invite, I appear to have many to give out (as soon as I run out, I get more). Leave a comment or email me at proffitm@oclc.org. Just be warned that this is a true beta environment. If you are already on Twine, connect with me (I’m Merrilee). And please invite me to sit in on any interesting Twine experiments you are cooking up.

Author has written over 200,000 “books”

Wednesday, April 16th, 2008 by Merrilee

I swear I am going to get back to more serious postings at some point, but I found this New York Times article on Philip M. Parker quite interesting. He’s “generated” more than 200,000 books using computer programmers and the web.

I found the WorldCat Identities record for Parker to be particularly interesting, at least for the segment of Parker’s works held by libraries (7,914 publications, a small percentage of his reported output). Of those, 2,610 works were published in 2004 alone. The tag cloud of subjects associated with his books is also interesting.

More on Flickr

Thursday, April 10th, 2008 by Merrilee

A few more Flickr related things:

The Boston Public Library has posted photos to Flickr. Like the Library of Congress’s, the collections are open to commentary and tags, although initially they were not. From a brief scan of the collection, there are considerably fewer comments on the items than on items in LC’s Flickr collections. I hope to spend some time with folks from the Boston Public Library in May, and if I find out more about this project and can share, I will. War posters, cased photos, there’s a lot to love….

Like the Library of Congress, the Powerhouse Museum has joined the Flickr Commons.

You can now post video on Flickr. At first I thought this was a little odd, but when I read more, it made sense. It’s only for Pro accounts, and you are limited to 90-second clips. That seems right, because sometimes a photo doesn’t quite cut it (I have some “videos” I’ve taken that are really more about getting a panoramic sweep of something when I’m too lazy to actually cut and paste a series of photos together). It’s not a replacement for YouTube or other video sharing sites. Long photos.

Impact of digitization on scholarship and collecting

Tuesday, April 1st, 2008 by Merrilee

Last week there was an announcement that the Folger Library, the Bodleian Library at the University of Oxford, and the Maryland Institute for Technology in the Humanities at the University of Maryland (all RLG Partners!) have been awarded one of five grants in the new JISC/NEH Transatlantic Digitization Collaboration Grants program. The grant will help create The Shakespeare Quartos Archive, “a freely-accessible, high-resolution digital collection of the 75 pre-1641 quarto editions of Shakespeare’s plays.” This will be a boon to scholarship, indeed.

As materials move online, in both licensed and freely available forms, what will be the impact on scholarship? On teaching and learning practice? On the collecting practices of research libraries? These are questions we are hoping to explore on the third day of our annual meeting (June 4th). This symposium, which we’re calling “Digitization and the Humanities: Impact on Libraries and Special Collections,” will feature perspectives from scholars on how digital collections are affecting both their research and teaching practice. We’ll also have perspectives from university librarians (Paul Courant, University of Michigan and Robin Adams, Trinity College Dublin) on the potential impact on library collecting practices.

We’re fortunate that Philadelphia-area partners are terrific hosts. The symposium will be held at the Chemical Heritage Foundation, and on Tuesday evening (June 3rd), the Philadelphia Museum of Art will host a reception for attendees. It should be a great event and a thought-provoking conversation, and we hope you will join us. RLG Partners may register online.

While you’re at it, check out the program for our Annual Meeting. I’ll be blogging more about what we have planned at that event in the near future.