Archive for the 'Searching' Category

LC-Flickr: updating the catalog

Wednesday, October 22nd, 2008 by Günter

In the context of John MacColl’s guest blog on Karen Calhoun’s Metalogue, I was reminded of the stats from the LC-Flickr project pertaining to changes LC made in their own catalog prompted by insightful Flickr comments.

When I last updated my Flickr slides for a class at Syracuse University, I found 174 records containing the word “flickr” in an all text field search of LC’s Prints and Photographs Online Catalog. The records in that set usually contain a credit such as “Source: Flickr Commons project” for information which has been added, as in this instance.

The same search today yields a whopping 4,256 records – which is quite close to the entire set of images LC has on Flickr (4,615 as of today). Upon closer inspection, I found that many of these records don’t contain a change to the substance of the record – however, they now do have a useful pointer to a discussion about the photograph on the Flickr site, and that’s why my search retrieved them. For an example, see this record, which includes the following language: “Additional information about this photograph might be available through the Flickr Commons project at http://www.flickr.com/photos/library_of_congress/2369119062”. On Flickr, one can then follow a playful discussion about dating the photograph.

Interestingly enough, these links to Flickr aren’t programmatic – an item which doesn’t have comments on Flickr doesn’t seem to receive the link. See for yourself – the LC equivalent of this Flickr image does not contain the pointer in the LC record, since there was no comment on the image in Flickr.

It looks like LC continues to update its records based on Flickr user feedback, and they’re also creating links so people searching the LC catalog exclusively don’t miss out on the oftentimes rich discussion on Flickr. A search for “Source: Flickr Commons” yields 509 exact phrase hits, which is the portion which most likely represents actual updates to the catalog.

A Map to Destinations Uncrawled

Friday, July 18th, 2008 by Roy

As part of what we’re trying to accomplish in the “Modeling New Service Infrastructures” part of the RLG Work Agenda, I’ve been working on a white paper on best practices for enhancing disclosure of library, museum and archive content at the network level. Using a sitemap to better expose content to web search engines such as Google is one technique that I’m investigating, but it is clear that this technique is by no means a silver bullet.

For those of you who aren’t familiar with sitemaps, they are a way to expose information for a web crawler that is normally hidden behind a database wall. Since web crawlers cannot intuit all the necessary queries to extract all the data from a database, other techniques must be used to provide the crawler with crawlable URLs. One such technique is to create a sitemap, which is an XML file with all the URLs you wish to have crawled on your site. Google, Yahoo!, and Microsoft have collaborated through the Sitemaps.org site to establish a common protocol for sitemaps.
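To make the idea concrete, here is a minimal sketch of building such an XML file with Python's standard library. The record URLs are hypothetical placeholders (nothing here comes from a real site), and the sketch emits only the required `loc` element for each URL.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string listing each URL in a <url><loc> entry."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" when serializing.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical database-backed record pages that a crawler could not
# discover by following links alone.
record_urls = [f"https://example.org/photos/record?id={n}" for n in range(1, 4)]
sitemap_xml = build_sitemap(record_urls)
print(sitemap_xml)
```

In practice the URL list would come from a query against the database behind the site. The protocol caps a single sitemap file at 50,000 URLs, so a large collection would need several files.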

I’ve used this for some time now to make sure all the images on my photos site were exposed to crawling. But recently I checked on it in Google’s Webmaster Tools site (where you can register your sitemap to make sure Google finds it), and I discovered to my dismay that only a small percentage (38%) of the URLs had been indexed in Google. On that site they state that “Most sites will not have all of their pages indexed,” but the only clues why are perhaps contained in the “Webmaster Guidelines” that describe what to do or not do to increase your chances of being indexed.

I asked the subscribers to the Web4Lib discussion what their experience has been. Debbie Campbell of the National Library of Australia (an RLG partner institution) reported that of the over a million items in Picture Australia, Google had indexed only about 49% of the URLs, although that was up dramatically from a couple of weeks ago, when the number was a paltry 32,000.

Marshall Breeding of Vanderbilt University and the owner of Library Technology Guides reported a higher percentage of 62% of his URLs indexed. He also reported that he’s doing some testing of the priority setting of potentially uncrawled URLs (based on low page views) in his sitemap to see if that helps.

Danielle Plumer of the Texas State Library and Archives Commission pointed out an article that discusses related issues, since of course this is a topic of interest to those trying to achieve “search engine optimization”. She also sent along part of a Q&A with Matt Cutts of Google that bears on this issue:

Q: “My sitemap has about 1350 urls in it. . . . . its been around for 2+ years, but I cannot seem to get all the pages indexed. Am I missing something here?”
A: One of the classic crawling strategies that Google has used is the amount of PageRank on your pages. So just because your site has been around for a couple years (or that you submit a sitemap), that doesn’t mean that we’ll automatically crawl every page on your site. In general, getting good quality links would probably help us know to crawl your site more deeply. You might also want to look at the remaining unindexed urls; do they have a ton of parameters (we typically prefer urls with 1-2 parameters)? Is there a robots.txt? Is it possible to reach the unindexed urls easily by following static text links (no Flash, JavaScript, AJAX, cookies, frames, etc. in the way)? That’s what I would recommend looking at.
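One of the checks suggested in that answer, looking for unindexed URLs that carry "a ton of parameters," is easy to automate. The sketch below is only an illustration: the URLs are made up, and the threshold of two parameters is my reading of the "1-2 parameters" remark, not a documented rule.

```python
from urllib.parse import urlsplit, parse_qsl


def count_query_parameters(url):
    """Count the query-string parameters in a URL."""
    query = urlsplit(url).query
    return len(parse_qsl(query, keep_blank_values=True))


def flag_parameter_heavy(urls, max_params=2):
    """Return the URLs whose parameter count exceeds max_params (assumed threshold)."""
    return [u for u in urls if count_query_parameters(u) > max_params]


# Hypothetical URLs taken from a list of unindexed pages.
unindexed = [
    "https://example.org/item?id=42",
    "https://example.org/search?coll=maps&page=3&sort=date&view=grid",
]
heavy = flag_parameter_heavy(unindexed)
print(heavy)
```

URLs flagged this way would be candidates for shorter, more static-looking addresses in the sitemap.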

So what can I conclude from this? Not a whole lot yet, except that creating a sitemap and registering it with Google is not by any means a silver bullet. It’s still a good idea, but it should only be one strategy of many we can take to get the unique content that libraries, museums and archives hold in front of the roving eyes of web users. It’s exactly that suite of strategies I’m trying to identify and describe, so please let me know of any methods you’re using that you think may be effective, either as a comment to this post or direct email.

The Quest for the Single, Simple, Successful Search

Tuesday, April 22nd, 2008 by Ricky

Yes, for most people the Grail is Google. For many of us in the library world (even for some library users) it’s WorldCat. But there are apparently still numerous quests for one way to search everything that’s available at an institution.

As discovery continues to rise to the network level, there are still some valid reasons to think locally:
• The undergrad who wants to know what she can get her hands on now, but doesn’t want to have to think of all the various places on campus she might look (not to mention the systems she’d have to navigate)
• The museum curator hoping to use other objects or resources in other departments to augment the exhibit he’s planning
• The fundraiser needing to showcase the breadth and depth of the collections, highlighting whatever topic is of interest to the potential donor
• A faculty member wanting to know what resources the university has that she might use for a new course
• The special collections curator courting a donor of a collection, who wants to highlight other like materials across the institution and to show how easily they are accessed for research
• An institution wishing to showcase and provide access to its collections on the institutional web site.

In eight visits I’ve made to RLG partner institutions in the last six months, six of them said that this quest was a top priority for them (and perhaps when the other two finish their big building projects, the quest will find its way to the top of their priorities, too). Some call it metasearch, others call it federated searching, others plan to use OAI harvesting to create a single index of all their collections, and still others think about putting all the metadata into a single system. Most of them had tried something already and were unhappy with it and were determined to try something else. No matter which approach they had tried, in each case the perceived problem involved mapping the data. It wasn’t done correctly and next time they hoped it would be better.

If only the mapping were better, the functionality would be improved. If only the mapping were better, the results could be presented in a more meaningful manner. If only the mapping were better, disparate data would coexist more easily together. Some thought the poor mapping was a characteristic of their chosen approach and if they went another route (or even the same route with different software) the problem could be addressed. The community is playing musical chairs with federated search software. When the music stops, one will try it again with SingleSearch, one will try it again with MetaLib, one will try it again with MetaFind, one with WebFeat, one with LibraryFind …, and the one left out will try OAI harvesting. All hoping to make the mapping work this time.

Thinking about this on the plane after the last of these visits, I realized that it’s not about the mapping; that’s a red herring. Stop trying to do it better; it can’t be done. Sure, if you’re mapping two collections of books both using MARC and AACR2, you can do a pretty good job (and some of the visited institutions had). But these institutions wanted to allow people to search across books, special collections, archives and museum collections, digital image collections, faculty and departmental collections, and in one case, even course offerings and faculty bios.

The network can’t provide the solution; much of the content is not likely to find its way into WorldCat or the open web, due to rights issues (think slide libraries) or other content ownership issues (think museum images with revenue potential) and some of it is really local (think teaching materials or licensed images).

The simple fact is that, to offer fielded searching of disparate data, the data has to be mapped and the lowest common denominator prevails. As anyone who’s done any mapping knows, not all metadata is created equal. Just coming up with a lowest common denominator is impossible. Images of geological formations may not have a creator, paintings don’t necessarily have a subject, ancient artifacts may not have titles, dates are often unknown… Forcing all records to map to a set of required fields and then offering parametric searching on those fields guarantees that a lot of relevant content will be omitted from the result set.
(At RLG, we found this to be the case even when we had a single standard (EAD) in a single union catalog (ArchiveGrid), because the standard had been applied differently by the various contributors and the collections being described varied in their nature and hence their description. While EAD allows for tagging of personal names, geographic locations, and controlled subject headings, the tags had been used so inconsistently that to offer indexes based on those fields would have resulted in vastly underreporting relevant results.)
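A toy illustration of the omission problem, using three invented records (they are assumptions for the example, not real catalog data). A search limited to the title field misses every record that lacks a title or mentions the term elsewhere, while a keyword search over all the data finds them.

```python
# Invented records with uneven metadata, as described above.
records = [
    {"title": "Basalt columns, Giant's Causeway", "subject": "geology"},  # no creator
    {"creator": "Unknown potter", "description": "Ceramic vessel with basalt temper"},  # no title
    {"creator": "A. Photographer", "title": "Quarry workers", "subject": "basalt quarrying"},
]


def fielded_search(records, field, term):
    """Match the term only against one mapped field; records lacking it never match."""
    term = term.lower()
    return [r for r in records if term in r.get(field, "").lower()]


def keyword_search(records, term):
    """Match the term against every value in the record."""
    term = term.lower()
    return [r for r in records if any(term in str(v).lower() for v in r.values())]


title_hits = fielded_search(records, "title", "basalt")
keyword_hits = keyword_search(records, "basalt")
print(len(title_hits), len(keyword_hits))
```

The fielded search returns one record and the keyword search returns all three, which is the recall gap in miniature.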

So what is the right approach, you might ask. I think we need to shift our focus from mapping and start looking at other ways to approach the problem.

If we offer keyword searching of all the data in the records, we’ll get a big set that will likely include some irrelevant items. Lots of recall, not much precision. What would Google do? Improve the result set.

We can do that in many ways:
• We could tweak the relevance ranking algorithms. While we might not use the fields for searching, we can still use them for display – and for determining relevance. We could decide that if the search term was found in creator, title, or subject fields to rank that result higher.
• We could track previous use and put the most viewed records at the top.
• We could improve the data by using automated processes like Open Calais to identify personal names and geographic locations, to deduce subjects by text analysis, to normalize dates …
o Then we could improve the user experience by using those elements in a meaningful display of the results. Sorting and quantifying the results by various elements would give the user a better sense of the nature of the result set.
o And we could offer the user ways to manipulate the results by those facets, much as WorldCat.org and IndexData’s MasterKey do.
• We could investigate ways to pre-limit by offering ways to search just a slice of the whole (anthropological content, things that have been digitized, non-book materials…).
• We might seek APIs to other services like name authorities or subject thesauri to improve or expand the query.
• We might look for ways to tap into things like LibraryThing, WorldCat, or flickr to use network effects to enrich our results.
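Two of the ideas in the list above, ranking a keyword hit higher when the term appears in the creator, title, or subject field, and counting results by a facet, can be sketched in a few lines. The field weights and sample records are assumptions chosen for illustration; a real system would tune them and would use a proper search engine rather than substring matching.

```python
from collections import Counter

# Assumed weights: a hit in creator/title/subject counts for more than a hit elsewhere.
FIELD_WEIGHTS = {"creator": 3.0, "title": 3.0, "subject": 2.0}
DEFAULT_WEIGHT = 1.0


def score(record, term):
    """Sum the weights of every field whose value contains the term."""
    term = term.lower()
    total = 0.0
    for field, value in record.items():
        if term in str(value).lower():
            total += FIELD_WEIGHTS.get(field, DEFAULT_WEIGHT)
    return total


def ranked_search(records, term):
    """Keyword search over all fields, best-scoring records first."""
    scored = [(score(r, term), r) for r in records]
    hits = [(s, r) for s, r in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in hits]


def facet_counts(results, field):
    """Count results by the value of one field, skipping records that lack it."""
    return Counter(r[field] for r in results if field in r)


# Invented records from three kinds of collections.
records = [
    {"id": 1, "title": "Letters home", "description": "Mentions Euclid in passing", "type": "archive"},
    {"id": 2, "title": "Elements of geometry", "creator": "Euclid", "type": "book"},
    {"id": 3, "title": "Euclid diagram", "subject": "Euclid", "type": "image"},
]
results = ranked_search(records, "euclid")
print([r["id"] for r in results])
print(facet_counts(results, "type"))
```

Nothing is excluded for lacking a field; the fields only influence the order and the facet display, which is the shift away from mapping-as-gatekeeper argued for here.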

If mapping is the roadblock on the route to Single, Simple, Successful Search, let’s choose a different route and get on with the quest.

And once we’re good at making our wonderful resources accessible within our own institutions, won’t we be in a better position to make them accessible to the world?

More on Flickr

Thursday, April 10th, 2008 by Merrilee

A few more Flickr related things:

The Boston Public Library has posted photos to Flickr. As with the Library of Congress, the collections are open to commentary and tags, although initially they were not. From a brief scan of the collection, there are considerably fewer comments on the items than on items in LC’s Flickr collections. I hope to spend some time with folks from the Boston Public Library in May, and if I find out more about this project and can share, I will. War posters, cased photos, there’s a lot to love….

Like the Library of Congress, the Powerhouse Museum has joined the Flickr Commons.

You can now post video on Flickr. At first I thought this was a little odd, but when I read more, it made sense. It’s only for Pro accounts, and you are limited to 90 second clips. This makes good sense, because sometimes a photo doesn’t quite cut it (I have some “videos” I’ve taken that are really more about getting a panoramic sweep of something when I’m too lazy to actually cut and paste a series of photos together). It’s not a replacement for YouTube or other video sharing sites. Long photos.

Flickr inaugurates “The Commons” with Library of Congress collections

Thursday, January 17th, 2008 by Günter

Flickr and the Library of Congress just announced a prototype which will bring 1,500 or 3,000 photographs (depending on whether you believe the Flickr or the LC blog) from two of the most popular LC photo collections to the immensely popular photo sharing website owned by Yahoo. This project inaugurates “The Commons” on Flickr, which has the tagline: “Your opportunity to contribute to describing the world’s public photo collections.”

[Image: alexa.jpg]

The benefits to LC (and any other cultural heritage institution choosing to participate) seem so obvious that it feels surprising we haven’t seen this announcement earlier. Foremost amongst the benefits, to my mind: the LC collections will enjoy unprecedented exposure on a website which receives a staggering amount of traffic. (The screenshot above shows the percentage of web traffic flowing to Flickr [red] and LC [blue] tracked over a 3 year period by Alexa. No further commentary needed.) Flickr displays what looks like a rather comprehensive LC record for each photograph, which also includes links back to the collections and the image itself on the LC website. I’d be rather curious about how these records got into Flickr – was a batch-upload mechanism created for this project? As time passes, I hope we’ll also hear from LC about how the referrals from Flickr have impacted the overall traffic on their website!

And it goes without saying that LC will also harness the collective tagging power of the Flickr community to help describe its collection, a feature of the project much touted on both blog announcements. However, I was interested to learn in the project FAQ that LC is actually hedging its bets on incorporating any of the captured tags in their own system.

The announcement reminded me of a number of other creative projects which have found ways to disclose cultural heritage materials into social networking sites or large online hubs. What comes to mind spontaneously:

  • Last summer, the University of Washington published a fascinating article in D-Lib about their experience in adding links to UW special collections to Wikipedia, which includes statistics on how this strategy increased web traffic to those collections.
  • Just in November, the Brooklyn Museum launched a Facebook application called “ArtShare” which allows users to pick their favorite images from the museum’s collection, and have them shown in rotation on their Facebook profile page. The app itself is social (meaning shareable) as well – the Victoria & Albert and the PowerHouse Museum offer up images to prettify your Facebook page as well. Along with the recently released WorldCat Facebook app (created by my RLG Programs colleague Bruce Washburn), ArtShare is about the only thing happening on my Facebook profile (you notice how I’m casually omitting a link here).

If you can think of other innovative ways in which cultural heritage organizations do or should disclose their collections on social networking sites, please drop me a line!

Browsing audio

Monday, December 10th, 2007 by Karen

Recently I realized that I’m spending almost as much time in “professional listening” as I am doing “professional reading”. So many interviews, Webcasts, TED talks, Google Tech Talks, and the like!

So I was intrigued indeed when I read about Searching Video Lectures, a tool from MIT that leverages decades of speech-recognition research to convert audio into text and make it searchable, as reported in MIT’s Technology Review of November 26, 2007.

I tried out the Lecture Browser. It currently has only 200 publicly available lectures, but still! As an astronomy buff, I was thrilled to zero in on professors’ insights about Hubble images (retrieved easily by a keyword search on “Hubble”). Definitely a fun tool to play with.

Then I thought about oral history projects I’ve known. Think of all the recordings of interviews with individuals who provide insight to our history, culture, and perspectives. The ones I know about have a MARC record about the interview with a very brief summary of the topics covered (usually with associated subject headings), the media used (e.g., “sound tape reel”), and a note that a transcript is available. But in the Brave New World, imagine what it would be like for researchers to type a few keywords and pull up both the transcript where the key words appear and the spot in the audio where the topic is discussed?
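As a rough sketch of what that could look like, assume a speech recognizer (or a human transcriber) has already produced transcript segments tagged with their start time in the recording. The segments below are invented, and this is not a description of how the MIT tool works; it only shows that keyword-to-timestamp lookup is simple once timed text exists.

```python
# Invented transcript segments: (start time in seconds, text spoken).
segments = [
    (0.0, "Welcome back. Today we continue with stellar evolution."),
    (95.5, "This Hubble image shows a planetary nebula."),
    (240.0, "Compare that with the ground-based telescope image."),
    (410.2, "Hubble also measured the expansion rate."),
]


def find_in_transcript(segments, keyword):
    """Return (start_time, text) for every segment containing the keyword."""
    keyword = keyword.lower()
    return [(start, text) for start, text in segments if keyword in text.lower()]


def format_timestamp(seconds):
    """Render a time offset as MM:SS, dropping fractions of a second."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


hits = find_in_transcript(segments, "hubble")
for start, text in hits:
    print(format_timestamp(start), text)
```

Each hit gives the researcher both the passage of transcript and the point in the audio to jump to.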

Visual reminder

Monday, July 2nd, 2007 by Merrilee

A visual reminder of how many places we have to search.

sputtr.com

Far from complete. WorldCat.org, LibraryThing, etc. are not even on the list.

Books are objects too

Sunday, May 20th, 2007 by Jim

Lorcan recently brought my attention to a conference where he will be speaking. It’s the mid-term meeting of LIBER where they will have a think tank on the future value of the book as artefact and the future value of digital documentary heritage at the National Library of Sweden in Stockholm 24-25 May 2007. I’m sure he’ll share some interesting thoughts when he returns (more likely, before he returns).

The topic sent me back to one of his recent posts that focused on books as technology and to thinking about a recent RLG Partner visit to UCLA about which Merrilee posted not long ago.

During that day at UCLA a special and unexpected treat was an invitation from Victoria Steele, a long-time professional friend and the head of special collections, to join an evening event she had organized for friends where she would be offering an intimate, organized tromp through the treasures. Vicki was passionate, the group was very interested and the treasures were a great pleasure to see close up.

Part of her traverse included the following items:

Euclid - Ratdolt
Euclid, Erhard Ratdolt, Joannes Campanus, and Adelard. 1482. Elementa artis geometriae: [translated by Joh. Adelhardus Bathoniensis; edited by Joh. Campanus; with dedicatory letter by Ratdolt]. Venice: Erhard Ratdolt.

byrne-1
Byrne, Oliver, and Bruce Rogers. 1847. The first six books of the elements of Euclid, in which coloured diagrams and symbols are used instead of letters for the greater ease of learners. London: William Pickering.

Euclid Valery
Euclid, Paul Valéry, and Bruce Rogers. 1944. Elements of geometry. New York: Random House.

hofstra
Hofstra, Sjoerd. 1994. Elements of geometry by Euclid. Amsterdam: ZET.

The first and earliest book relocated the content on the page to provide a wide margin in which to present the diagrams that illustrated the text. The second nearly eliminated the text – the content instead delivered by richly colored diagrams that reminded us more of Mondrian than mathematics. The third (for which I was unable to find an image) delivered its content via beautiful italic along with diagrams in colored panels overlaid by the presence of Valéry’s essay and commentary. The final volume by a Swedish book artist transformed Euclid’s original two-dimensional drawings of geometric shapes into three-dimensional models that spring from the flat pages of grayed-out, abstracted text as pop-ups.

These four dramatically different versions of Euclid’s ‘text’ represented to me an increasing innovation in presentation within the broad parameters of the book technology but possible largely because the book was honored as an object.

In Lorcan’s post he mentions that consideration of the book as a technology “reinforces an awareness that the book itself, the codex, represents particular technological choices which in turn have influenced how we create and engage with the intellectual and cultural record, and in turn with broader experience and intellectual development.” He thinks that this is positive because it moves us “beyond the reductive opposition between the book and the digital turn.”

Vicki’s traverse, I think, reinforces a complementary point. Certain kinds of desired and desirable activities that are now easily delivered in the digital environment have been playing out within the technology of the book for a very long time. All those re-actions – reuse, repurpose and remix – have a deep history of their own within the book as object.

P.S. All the citations above were obtained via WorldCat – the citations are in Chicago form. There’s a post for another day regarding my search for these texts…

A wonderful specimen of a user!

Friday, March 9th, 2007 by Günter

We’re always on the look-out for this mystical creature called “the user,” and I am excited to report a public sighting: for a good hour during the Bibliographic Control Working Group public meeting, we had a superb specimen right in front of us. Dr. Timothy Burke, an Associate Professor in the Department of History at Swarthmore College, spoke with detail and nuance about his information retrieval (read: search) behavior as a researcher, and provided a plethora of specific scenarios for different searching strategies. My only regret is that I didn’t walk up to him afterwards and thank him in person for his candor, wit and eloquence. I would have written this up in detail, but found that Karen Coyle already beat me to the punch. Thanks, Timothy (and Karen)!

Search in 2017

Tuesday, February 27th, 2007 by Merrilee

On Friday, I attended the long-running “Friday afternoon seminar,” (also known as the “Buckland-Larson-Lynch seminar,” also known by its formal course title…) at UC Berkeley. The seminar is available through the iSchool (formerly known as the “School for Information Management and Studies,” formerly known as the “School of Library and Information Studies”…) and is available to the general public.

Cliff Lynch gave a reprise of a talk given earlier this month on Internet Search in the Year 2017. While Cliff did not make a lot of predictions, he did make some interesting observations. Among these…

Content is where the biggest change has been in the last 10 years. Search engines used to all work from the same data set, the web. Now search engines make deals with content providers. This has transformed search engines from businesses that could be started in a garage (with a few clever people, a good algorithm, and a couple of fast machines) to businesses that have more in common with a cable network. Search engines also encourage the creation of content: think blogs, images, etc.

Cliff does not think that personalization services will have much uptake. He sees a lot of resistance from users to the idea of registering and giving enough data for this to work. Interestingly, he also does not see a lot of movement in interfaces. What we have is likely what we’ll continue to have.

As a sidenote to the remarks on content, Cliff talked about worry within the library community that users are not starting search at the catalog. This has led to much hand-wringing, and the “OPACs really suck” discussion. What is generally missed in this conversation is what led users to search engines in the first place. For the low, low price of one more click you get the data you wanted. Immediate gratification wins out over being told that the book is available in the library, or worse, available in a storage facility and you can get it in a few days.

I think there are a few more factors at play here, not mentioned during the talk but worth mentioning here. Users are already at the search engine for other reasons. While looking for weather or a place to eat, why not also do research? Search engines have massive piles of information, and why would you want to limit yourself to a smaller bucket of information? Even if it is limited to good stuff, most of “our” piles of good stuff still need to be searched individually.
To consider and discuss. Hopefully a topic at our Discovery to Delivery in New Contexts symposium, for RLG Program Partners.