Archive for the 'Searching' Category

Te Puna Mātauranga o Aotearoa Rocks My World

Wednesday, December 3rd, 2008 by Roy

The National Library of New Zealand (Te Puna Mātauranga o Aotearoa in Maori and an RLG Partner) has obviously been busy. Last week they joined the Flickr Commons, and they have already reported some impressive use statistics. But today (well, yesterday in Kiwi time) came an even bigger announcement.

Digital New Zealand, “a nation-wide project to help make New Zealand digital content easier to find, share and use was launched at the National Library of New Zealand on 3 December 2008.” The incredible array of collections made available through this one interface would be news enough for many libraries. But the joy doesn’t stop there.

The project welcomes additional content contributors, and stands ready to provide advice and assistance to help them to do so. Visitors are offered an opportunity to create a tailored search of the site and drop the resulting widget onto any web page they like or use the special search page that is created for them right on the Digital New Zealand site.

If a visitor doesn’t wish to create a tailored web widget, they already have a library of such from which to choose. And for the true technorati, there is the developer section, which provides a simple way for software developers to get a key to be able to use the application programming interface (API) of the site. If all of this isn’t enough to knock your socks off, stay tuned.

The “Memory Maker” is a web-based way to mix and match video clips into your own cinematic production. I kid you not. Try it out. You can add audio or music to add your own special touches. I doubt that any movie miracles will be made here, but the level of interactivity is completely off the charts. To get the full measure of this, you simply must see this movie.

So by now you must think surely I am done singing the praises of Te Puna Mātauranga o Aotearoa, but I’m not. There’s still more. Like I said, they’ve obviously been busy. The last thing I want to highlight is their National Digital Heritage Archive. Long in the works through a partnership with ExLibris, this preservation system went live on November 4. “The National Digital Heritage Archive (NDHA),” states the web site, “is the National Library’s technical and business solution to preserve and provide long-term public access to its digital heritage collections.” The NLNZ was the flagship partner with ExLibris, and the product is based on the Open Archival Information System (OAIS) model and conforms to trusted digital repository (TDR) requirements (which came out of joint RLG-OCLC work before the two organizations joined).

This is an incredible array of new initiatives by any measure, and a tribute to the leadership of Penny Carnaby, Chief Executive and National Librarian, and John Truesdale, Director National Digital Library, and of course many others who were instrumental in accomplishing all of this work. For my part, it’s hard to believe that it was only a bit more than a year ago when I was talking with Penny and John in a Melbourne bar after participating in a National and State Libraries Australasia strategic planning meeting. They have much to celebrate, as do we, since they are doing much from which we can learn. I simply can’t wait to see what comes next.

LC-Flickr: updating the catalog

Wednesday, October 22nd, 2008 by Günter

In the context of John MacColl’s guest blog on Karen Calhoun’s Metalogue, I was reminded of the stats from the LC-Flickr project pertaining to changes LC made in their own catalog prompted by insightful Flickr comments.

When I last updated my Flickr slides for a class at Syracuse University, I found 174 records containing the word “flickr” in an all text field search of LC’s Prints and Photographs Online Catalog. The records in that set usually contain a credit such as “Source: Flickr Commons project” for information which has been added, like in this instance.

The same search today yields a whopping 4,256 records – which is quite close to the entire set of images LC has on Flickr (4,615 as of today). Upon closer inspection, I found that many of these records don’t contain a change to the substance of the record – however, they now do have a useful pointer to a discussion about the photograph on the Flickr site, and that’s why my search retrieved them. For an example, see this record which includes the following language: “Additional information about this photograph might be available through the Flickr Commons project at“. On Flickr, one can then follow a playful discussion about dating the photograph.

Interestingly enough, these links to Flickr aren’t programmatic – an item which doesn’t have comments on Flickr doesn’t seem to receive the link. See for yourself – the LC equivalent of this Flickr image does not contain the pointer in the LC record, since there was no comment on the image in Flickr.

It looks like LC continues to update its records based on Flickr user feedback, and they’re also creating links so people searching the LC catalog exclusively don’t miss out on the oftentimes rich discussion on Flickr. A search for “Source: Flickr Commons” yields 509 exact phrase hits, which is the portion which most likely represents actual updates to the catalog.

A Map to Destinations Uncrawled

Friday, July 18th, 2008 by Roy

As part of what we’re trying to accomplish in the “Modeling New Service Infrastructures” part of the RLG Work Agenda, I’ve been working on a white paper on best practices for enhancing disclosure of library, museum and archive content at the network level. Using a sitemap to better expose content to web search engines such as Google is one technique that I’m investigating, but it is clear that this technique is by no means a silver bullet.

For those of you who aren’t familiar with sitemaps, they are a way to expose information for a web crawler that is normally hidden behind a database wall. Since web crawlers cannot intuit all the necessary queries to extract all the data from a database, other techniques must be used to provide the crawler with crawlable URLs. One such technique is to create a sitemap, which is an XML file with all the URLs you wish to have crawled on your site. Google, Yahoo!, and Microsoft have collaborated through the site to establish a common protocol for sitemaps.
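For readers who have not seen one, here is a minimal sketch of generating such a file in Python. The XML namespace is the one defined by the sitemaps.org protocol, and the optional `priority` element is part of that protocol; the `build_sitemap` helper and the example.org URLs are made up for illustration and do not describe how any site mentioned here builds its sitemap.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string for (location, priority) pairs.

    `priority` may be None, in which case the element is omitted.
    This helper is hypothetical and only illustrates the file layout.
    """
    # Serialize the sitemap namespace as the default (prefix-less) namespace.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, priority in urls:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # ElementTree escapes characters such as "&" when serializing.
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        if priority is not None:
            ET.SubElement(url, f"{{{SITEMAP_NS}}}priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Database-backed pages a crawler could not discover by following links.
    print(build_sitemap([
        ("http://example.org/photos/item?id=1", 0.8),
        ("http://example.org/photos/item?id=2", None),
    ]))
```

The point of the exercise is simply that every record URL appears explicitly in the file, so the crawler does not have to guess at database queries.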

I’ve used this for some time now to make sure all the images on my photos site were exposed to crawling. But recently I checked on it in Google’s Webmaster Tools site (where you can register your sitemap to make sure Google finds it), and I discovered to my dismay that only a small percentage (38%) of the URLs had been indexed in Google. On that site they state that “Most sites will not have all of their pages indexed,” but the only clues why are perhaps contained in the “Webmaster Guidelines” that describe what to do or not do to increase your chances of being indexed.

I asked the subscribers to the Web4Lib discussion what their experience has been. Debbie Campbell of the National Library of Australia (an RLG partner institution) reported that of the over a million items in Picture Australia, Google had indexed only about 49% of the URLs, although it was up dramatically over a couple weeks ago when the number was a paltry 32,000.

Marshall Breeding of Vanderbilt University and the owner of Library Technology Guides reported a higher percentage of 62% of his URLs indexed. He also reported that he’s doing some testing of the priority setting of potentially uncrawled URLs (based on low page views) in his sitemap to see if that helps.

Danielle Plumer of the Texas State Library and Archives Commission pointed out an article that discusses related issues, since of course this is a topic of interest to those trying to achieve “search engine optimization”. She also sent along part of a Q&A with Matt Cutts of Google that bears on this issue:

Q: “My sitemap has about 1350 urls in it. . . . . its been around for 2+ years, but I cannot seem to get all the pages indexed. Am I missing something here?”
A: One of the classic crawling strategies that Google has used is the amount of PageRank on your pages. So just because your site has been around for a couple years (or that you submit a sitemap), that doesn’t mean that we’ll automatically crawl every page on your site. In general, getting good quality links would probably help us know to crawl your site more deeply. You might also want to look at the remaining unindexed urls; do they have a ton of parameters (we typically prefer urls with 1-2 parameters)? Is there a robots.txt? Is it possible to reach the unindexed urls easily by following static text links (no Flash, JavaScript, AJAX, cookies, frames, etc. in the way)? That’s what I would recommend looking at.
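Two of the checks in that answer (how many parameters a URL carries, and whether robots.txt blocks it) are easy to automate. The following is a rough sketch using only the Python standard library; the `audit_urls` helper, the threshold of two parameters (taken from the "1-2 parameters" remark above) and the sample robots.txt are assumptions for illustration, not anything Google publishes as a tool.

```python
from urllib.parse import urlsplit, parse_qsl
from urllib.robotparser import RobotFileParser


def count_parameters(url):
    """Count the query-string parameters in a URL."""
    return len(parse_qsl(urlsplit(url).query, keep_blank_values=True))


def audit_urls(urls, robots_txt, max_parameters=2, user_agent="*"):
    """Flag URLs that carry many parameters or are disallowed by robots.txt.

    `robots_txt` is the text of the site's robots.txt file, passed in
    directly so that the sketch makes no network requests.
    """
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    report = []
    for url in urls:
        parameters = count_parameters(url)
        report.append({
            "url": url,
            "parameters": parameters,
            "too_many_parameters": parameters > max_parameters,
            "blocked_by_robots": not robots.can_fetch(user_agent, url),
        })
    return report


if __name__ == "__main__":
    sample_robots = "User-agent: *\nDisallow: /private/\n"
    unindexed = [
        "http://example.org/photo?id=1&size=large&lang=en",
        "http://example.org/private/photo?id=2",
    ]
    for row in audit_urls(unindexed, sample_robots):
        print(row)
```

A report like this will not explain every unindexed page (PageRank and link structure are outside its reach), but it can rule out the mechanical causes quickly.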

So what can I conclude from this? Not a whole lot yet, except that creating a sitemap and registering it with Google is not by any means a silver bullet. It’s still a good idea, but it should only be one strategy of many we can take to get the unique content that libraries, museums and archives hold in front of the roving eyes of web users. It’s exactly that suite of strategies I’m trying to identify and describe, so please let me know of any methods you’re using that you think may be effective, either as a comment to this post or direct email.

The Quest for the Single, Simple, Successful Search

Tuesday, April 22nd, 2008 by Ricky

Yes, for most people the Grail is Google. For many of us in the library world (even for some library users) it’s WorldCat. But there are apparently still numerous quests for one way to search everything that’s available at an institution.

As discovery continues to rise to the network level, there are still some valid reasons to think locally:
• The undergrad wanting to know what she can get her hands on now, but not wanting to think of all the various places on campus she might look (not to mention the systems she’d have to navigate)
• The museum curator hoping to use other objects or resources in other departments to augment the exhibit he’s planning
• The fundraiser needing to showcase the breadth and depth of the collections, highlighting whatever topic is of interest to the potential donor
• A faculty member wanting to know what resources the university has that she might use for a new course
• The special collections curator courting a donor of a collection, who wants to highlight other like materials across the institution and to show how easily they are accessed for research
• An institution wishing to showcase and provide access to its collections on the institutional web site.

In eight visits I’ve made to RLG partner institutions in the last six months, six of them said that this quest was a top priority for them (and perhaps when the other two finish their big building projects, the quest will find its way to the top of their priorities, too). Some call it metasearch, others call it federated searching, others plan to use OAI harvesting to create a single index of all their collections, and still others think about putting all the metadata into a single system. Most of them had tried something already and were unhappy with it and were determined to try something else. No matter which approach they had tried, in each case the perceived problem involved mapping the data. It wasn’t done correctly and next time they hoped it would be better.

If only the mapping were better, the functionality would be improved. If only the mapping were better, the results could be presented in a more meaningful manner. If only the mapping were better, disparate data would coexist more easily together. Some thought the poor mapping was a characteristic of their chosen approach and if they went another route (or even the same route with different software) the problem could be addressed. The community is playing musical chairs with federated search software. When the music stops, one will try it again with SingleSearch, one will try it again with MetaLib, one will try it again with MetaFind, one with WebFeat, one with LibraryFind …, and the one left out will try OAI harvesting. All hoping to make the mapping work this time.

Thinking about this on the plane after the last of these visits, it became clear to me that it’s not about the mapping; that’s a red herring. Stop trying to do it better; it can’t be done. Sure, if you’re mapping two collections of books both using MARC and AACR2, you can do a pretty good job (and some of the visited institutions had). But these institutions wanted to allow people to search across books, special collections, archives and museum collections, digital image collections, faculty and departmental collections, and in one case, even course offerings and faculty bios.

The network can’t provide the solution; much of the content is not likely to find its way into WorldCat or the open web, due to rights issues (think slide libraries) or other content ownership issues (think museum images with revenue potential) and some of it is really local (think teaching materials or licensed images).

The simple fact is that, to offer fielded searching of disparate data, the data has to be mapped and the lowest common denominator prevails. As anyone who’s done any mapping knows, not all metadata is created equal. Just coming up with a lowest common denominator is impossible. Images of geological formations may not have a creator, paintings don’t necessarily have a subject, ancient artifacts may not have titles, dates are often unknown… Forcing all records to map to a set of required fields and then offering parametric searching on those fields guarantees that a lot of relevant content will be omitted from the result set.
(At RLG, we found this to be the case even when we had a single standard (EAD) in a single union catalog (ArchiveGrid), because the standard had been applied differently by the various contributors and the collections being described varied in their nature and hence their description. While EAD allows for tagging of personal names, geographic locations, and controlled subject headings, the tags had been used so inconsistently that to offer indexes based on those fields would have resulted in vastly underreporting relevant results.)

So what is the right approach, you might ask. I think we need to shift our focus from mapping and start looking at other ways to approach the problem.

If we offer keyword searching of all the data in the records, we’ll get a big set that will likely include some irrelevant items. Lots of recall, not much precision. What would Google do? Improve the result set.

We can do that in many ways:
• We could tweak the relevance ranking algorithms. While we might not use the fields for searching, we can still use them for display – and for determining relevance. We could decide to rank a result higher if the search term was found in the creator, title, or subject fields.
• We could track previous use and put the most viewed records at the top.
• We could improve the data by using automated processes like Open Calais to identify personal names and geographic locations, to deduce subjects by text analysis, to normalize dates …
  ◦ Then we could improve the user experience by using those elements in a meaningful display of the results; sorting and quantifying the results by various elements would allow the user to have a better sense of the nature of the result set
  ◦ And we could offer the user ways to manipulate the results by those facets, much as and IndexData’s MasterKey do.
• We could investigate ways to pre-limit by offering ways to search just a slice of the whole (anthropological content, things that have been digitized, non-book materials…).
• We might seek APIs to other services like name authorities or subject thesauri to improve or expand the query.
• We might look for ways to tap into things like LibraryThing, WorldCat, or Flickr to use network effects to enrich our results.
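To make the first few ideas concrete, here is a small Python sketch of keyword searching across every field of a record, with fields used only to boost relevance and to count facets afterwards. Everything in it is an assumption for illustration: the field names, the boost weights and the sample records are invented, and a real system would use a proper search engine rather than substring matching.

```python
from collections import Counter

# Hypothetical boosts: a match in one of these fields counts for more,
# but a record that lacks the field can still be retrieved.
BOOSTED_FIELDS = {"title": 3.0, "creator": 2.0, "subject": 2.0}


def score(record, term):
    """Sum a weight for every field whose value contains the term."""
    term = term.lower()
    total = 0.0
    for field, value in record.items():
        if term in str(value).lower():
            total += BOOSTED_FIELDS.get(field, 1.0)
    return total


def search(records, term):
    """Return every record matching in any field, best score first."""
    scored = [(score(record, term), record) for record in records]
    hits = [(s, record) for s, record in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in hits]


def facet_counts(results, field):
    """Count result records by the value of one field, skipping blanks."""
    return Counter(record[field] for record in results if record.get(field))


if __name__ == "__main__":
    records = [
        {"id": 1, "title": "Basalt columns", "type": "image",
         "description": "Photograph of a geological formation"},
        {"id": 2, "title": "Geological survey of the region",
         "creator": "Smith", "type": "book"},
        {"id": 3, "title": "Portrait of a lady", "type": "painting"},
    ]
    results = search(records, "geological")
    print([record["id"] for record in results])
    print(facet_counts(results, "type"))
```

Note that the image record has no creator or subject at all, yet it is still found; no record had to be forced into a required set of fields before it could be searched.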

If mapping is the roadblock on the route to Single, Simple, Successful Search, let’s choose a different route and get on with the quest.

And once we’re good at making our wonderful resources accessible within our own institutions, won’t we be in a better position to make them accessible to the world?


More on Flickr

Thursday, April 10th, 2008 by Merrilee

A few more Flickr related things:

The Boston Public Library has posted photos to Flickr. As with the Library of Congress, the collections are open to commentary and tags, although initially they were not. From a brief scan of the collection, there are considerably fewer comments on the items than on items in LC’s Flickr collections. I hope to spend some time with folks from the Boston Public Library in May, and if I find out more about this project and can share, I will. War posters, cased photos, there’s a lot to love….

Like the Library of Congress, the Powerhouse Museum has joined the Flickr Commons.

You can now post video on Flickr. At first I thought this was a little odd, but when I read more, it made sense. It’s only for Pro accounts, and you are limited to 90 second clips. This makes good sense, because sometimes a photo doesn’t quite cut it (I have some “videos” I’ve taken that are really more about getting a panoramic sweep of something when I’m too lazy to actually cut and paste a series of photos together). It’s not a replacement for YouTube or other video sharing sites. Long photos.

Flickr inaugurates “The Commons” with Library of Congress collections

Thursday, January 17th, 2008 by Günter

Flickr and the Library of Congress just announced a prototype which will bring 1,500 or 3,000 photographs (depending on whether you believe the Flickr or the LC blog) from two of the most popular LC photo collections to the immensely popular photo sharing website owned by Yahoo. This project inaugurates “The Commons” on Flickr, which has the tagline: “Your opportunity to contribute to describing the world’s public photo collections.”

[image: alexa.jpg]

The benefits to LC (and any other cultural heritage institution choosing to participate) seem so obvious that it feels surprising we haven’t seen this announcement earlier: foremost amongst the benefits, to my mind, the LC collections will enjoy unprecedented exposure on a website which receives a staggering amount of traffic. (The screenshot above shows the percentage of web traffic flowing to Flickr [red] and LC [blue] tracked over a 3 year period by Alexa. No further commentary needed.) Flickr displays what looks like a rather comprehensive LC record for each photograph, which also includes links back to the collections and the image itself on the LC website. I’d be rather curious about how these records got into Flickr – was a batch-upload mechanism created for this project? As time passes, I hope we’ll also hear from LC about how the referrals from Flickr have impacted the overall traffic on their website!

And it goes without saying that LC will also harness the collective tagging power of the Flickr community to help describe its collection, a feature of the project much touted on both blog announcements. However, I was interested to learn in the project FAQ that LC is actually hedging its bets on incorporating any of the captured tags in their own system.

The announcement reminded me of a number of other creative projects which have found ways to disclose cultural heritage materials into social networking sites or large online hubs. What comes to mind spontaneously:

  • Last summer, the University of Washington published a fascinating article in D-Lib about their experience in adding links to UW special collections to Wikipedia, which includes statistics on how this strategy increased web traffic to those collections.
  • Just in November, the Brooklyn Museum launched a Facebook application called “ArtShare” which allows users to pick their favorite images from the museum’s collection, and have them shown in rotation on their Facebook profile page. The app itself is social (meaning shareable) as well – the Victoria & Albert and the PowerHouse Museum offer up images to prettify your Facebook page as well. Along with the recently released WorldCat Facebook app (created by my RLG Programs colleague Bruce Washburn), ArtShare is about the only thing happening on my Facebook profile (you notice how I’m casually omitting a link here).

If you can think of other innovative ways in which cultural heritage organizations do or should disclose their collections on social networking sites, please drop me a line!

Browsing audio

Monday, December 10th, 2007 by Karen

Recently I realized that I’m spending almost as much time in “professional listening” as I am doing “professional reading”. So many interviews, Webcasts, TED talks, Google Tech Talks, and the like!

So I was intrigued indeed when I read about Searching Video Lectures, a tool from MIT that leverages decades of speech-recognition research to convert audio into text and make it searchable, as reported in MIT’s Technology Review of November 26, 2007.

I tried out the Lecture Browser. It currently has only 200 publicly available lectures, but still! As an astronomy buff, I was thrilled to zero in on professors’ insights about Hubble images (retrieved easily by a keyword search on “Hubble”). Definitely a fun tool to play with.

Then I thought about oral history projects I’ve known. Think of all the recordings of interviews with individuals who provide insight to our history, culture, and perspectives. The ones I know about have a MARC record about the interview with a very brief summary of the topics covered (usually with associated subject headings), the media used (e.g., “sound tape reel”), and a note that a transcript is available. But in the Brave New World, imagine what it would be like for researchers to type a few keywords and pull up both the transcript where the key words appear and the spot in the audio where the topic is discussed?
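As a rough illustration of that idea, suppose speech recognition has already produced a transcript broken into segments, each tagged with its start time in the recording. The Python sketch below then finds the segments containing all the keywords and reports where to jump in the audio. The segment format, the function names and the sample text are assumptions made up for this example; they are not how the MIT Lecture Browser is known to work.

```python
def find_in_transcript(segments, keywords):
    """Return (start_seconds, text) for segments containing every keyword.

    `segments` is a list of (start_seconds, text) pairs, assumed to come
    from a speech-recognition step that timestamps each passage.
    """
    wanted = [keyword.lower() for keyword in keywords]
    hits = []
    for start, text in segments:
        lowered = text.lower()
        if all(keyword in lowered for keyword in wanted):
            hits.append((start, text))
    return hits


def format_offset(seconds):
    """Format a position in the audio as HH:MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    interview = [
        (0.0, "Thank you for agreeing to this interview."),
        (754.5, "When the first Hubble images arrived we were astonished."),
        (1310.0, "After that I moved back to the university."),
    ]
    for start, text in find_in_transcript(interview, ["Hubble", "images"]):
        print(format_offset(start), text)
```

The same timestamped segments serve both purposes the paragraph above imagines: the text is what gets searched and displayed, and the start time is what positions the audio player.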

Visual reminder

Monday, July 2nd, 2007 by Merrilee

A visual reminder of how many places we have to search.

Far from complete., LibraryThing, etc. are not even on the list.

Books are objects too

Sunday, May 20th, 2007 by Jim

Lorcan recently brought my attention to a conference where he will be speaking. It’s the mid-term meeting of LIBER where they will have a think tank on the future value of the book as artefact and the future value of digital documentary heritage at the National Library of Sweden in Stockholm 24-25 May 2007. I’m sure he’ll share some interesting thoughts when he returns (more likely, before he returns).

The topic sent me back to one of his recent posts that focused on books as technology and to thinking about a recent RLG Partner visit to UCLA about which Merrilee posted not long ago.

During that day at UCLA a special and unexpected treat was an invitation from Victoria Steele, a long-time professional friend and the head of special collections, to join an evening event she had organized for friends where she would be offering an intimate, organized tromp through the treasures. Vicki was passionate, the group was very interested and the treasures were a great pleasure to see close up.

Part of her traverse included the following items:

Euclid - Ratdolt
Euclid, Erhard Ratdolt, Joannes Campanus, and Adelard. 1482. Elementa artis geometriae: [translated by Joh. Adelhardus Bathoniensis; edited by Joh. Campanus; with dedicatory letter by Ratdolt]. Venice: Erhard Ratdolt. [big image]

Byrne, Oliver, and Bruce Rogers. 1847. The first six books of the elements of Euclid, in which coloured diagrams and symbols are used instead of letters for the greater ease of learners. London: William Pickering. [big image]

Euclid Valery

Euclid, Paul Valéry, and Bruce Rogers. 1944. Elements of geometry. New York: Random House.

Hofstra, Sjoerd. 1994. Elements of geometry by Euclid. Amsterdam: ZET [big image]

The first and earliest book relocated the content on the page to provide a wide margin in which to present the diagrams that illustrated the text. The second nearly eliminated the text – the content instead delivered by richly colored diagrams that reminded us more of Mondrian than mathematics. The third (for which I was unable to find an image) delivered its content via beautiful italic along with diagrams in colored panels overlaid by the presence of Valéry’s essay and commentary. The final volume by a Swedish book artist transformed Euclid’s original two-dimensional drawings of geometric shapes into three-dimensional models that spring from the flat pages of grayed-out, abstracted text as pop-ups.

These four dramatically different versions of Euclid’s “text” represented to me an increasing innovation in presentation within the broad parameters of the book technology but possible largely because the book was honored as an object.

In Lorcan’s post he mentions that consideration of the book as a technology “reinforces an awareness that the book itself, the codex, represents particular technological choices which in turn have influenced how we create and engage with the intellectual and cultural record, and in turn with broader experience and intellectual development.” He thinks that this is positive because it moves us “beyond the reductive opposition between the book and the digital turn.”

Vicki’s traverse, I think, reinforces a complementary point. Certain kinds of desired and desirable activities that are now easily delivered in the digital environment have been playing out within the technology of the book for a very long time. All those re-actions – reuse, repurpose and remix – have a deep history of their own within the book as object.

P.S. All the citations above were obtained via WorldCat – the citations are in Chicago form. There’s a post for another day regarding my search for these texts…

A wonderful specimen of a user!

Friday, March 9th, 2007 by Günter

We’re always on the look-out for this mystical creature called “the user,” and I am excited to report a public sighting: for a good hour during the Bibliographic Control Working Group public meeting, we had a superb specimen right in front of us. Dr. Timothy Burke, an Associate Professor in the Department of History at Swarthmore College, spoke with detail and nuance about his information retrieval (read: search) behavior as a researcher, and provided a plethora of specific scenarios for different searching strategies. My only regret is that I didn’t walk up to him afterwards and thank him in person for his candor, wit and eloquence. I would have written this up in detail, but found that Karen Coyle already beat me to the punch. Thanks, Timothy (and Karen)!