A Map to Destinations Uncrawled

As part of what we’re trying to accomplish in the “Modeling New Service Infrastructures” part of the RLG Work Agenda, I’ve been working on a white paper on best practices for enhancing disclosure of library, museum and archive content at the network level. Using a sitemap to better expose content to web search engines such as Google is one technique that I’m investigating, but it is clear that this technique is by no means a silver bullet.

For those of you who aren’t familiar with sitemaps, they are a way to expose to web crawlers content that is normally hidden behind a database wall. Since web crawlers cannot intuit all the queries necessary to extract all the data from a database, other techniques must be used to provide the crawler with crawlable URLs. One such technique is to create a sitemap, an XML file listing all the URLs on your site that you wish to have crawled. Google, Yahoo!, and Microsoft have collaborated through the Sitemaps.org site to establish a common protocol for sitemaps.
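For the curious, a sitemap is a very small XML document. Here is a minimal sketch following the Sitemaps.org protocol (the example.org URLs and the date are placeholders, not real addresses):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.org/photos/1234</loc>
        <lastmod>2008-02-15</lastmod>
      </url>
      <url>
        <loc>http://www.example.org/photos/5678</loc>
      </url>
      <!-- one <url> entry per page you want crawled; only <loc> is required -->
    </urlset>

A single sitemap file is limited to 50,000 URLs (and 10MB uncompressed), so larger collections use a sitemap index file that points to multiple sitemaps.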

I’ve used sitemaps for some time now to make sure all the images on my photos site were exposed to crawling. But when I recently checked in Google’s Webmaster Tools site (where you can register your sitemap to make sure Google finds it), I discovered to my dismay that only a small percentage (38%) of the URLs had been indexed by Google. On that site Google states that “Most sites will not have all of their pages indexed,” but the only clues as to why are perhaps contained in the “Webmaster Guidelines,” which describe what to do (and not do) to increase your chances of being indexed.

I asked the subscribers to the Web4Lib discussion list what their experience has been. Debbie Campbell of the National Library of Australia (an RLG partner institution) reported that of the more than a million items in Picture Australia, Google had indexed only about 49% of the URLs, although that was up dramatically from a couple of weeks earlier, when the number was a paltry 32,000.

Marshall Breeding of Vanderbilt University, owner of Library Technology Guides, reported a higher percentage: 62% of his URLs indexed. He also reported that he is testing whether adjusting the priority setting in his sitemap for potentially uncrawled URLs (identified by their low page views) helps get them crawled.
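The priority element he is tweaking is just a per-URL hint in the sitemap, a value between 0.0 and 1.0 (0.5 is the default) that indicates how important a page is relative to the other pages on your own site. A hypothetical entry flagging a seldom-viewed page might look like this:

    <url>
      <loc>http://www.example.org/rarely-viewed-item</loc>
      <priority>0.8</priority>
    </url>

Whether the crawlers actually honor the hint is, of course, another question.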

Danielle Plumer of the Texas State Library and Archives Commission pointed out an article that discusses related issues, since of course this is a topic of interest to those trying to achieve “search engine optimization”. She also sent along part of a Q&A with Matt Cutts of Google that bears on this issue:

Q: “My sitemap has about 1350 urls in it. . . . . its been around for 2+ years, but I cannot seem to get all the pages indexed. Am I missing something here?”
A: One of the classic crawling strategies that Google has used is the amount of PageRank on your pages. So just because your site has been around for a couple years (or that you submit a sitemap), that doesn’t mean that we’ll automatically crawl every page on your site. In general, getting good quality links would probably help us know to crawl your site more deeply. You might also want to look at the remaining unindexed urls; do they have a ton of parameters (we typically prefer urls with 1-2 parameters)? Is there a robots.txt? Is it possible to reach the unindexed urls easily by following static text links (no Flash, JavaScript, AJAX, cookies, frames, etc. in the way)? That’s what I would recommend looking at.
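One of Cutts’s checks is easy to automate: make sure your own robots.txt isn’t quietly blocking the URLs that never got indexed. Here is a rough Python sketch of that check (it uses the Python 3 standard library; the host and paths are placeholders, and it obviously says nothing about PageRank or link structure):

    # Ask the site's robots.txt whether a crawler may fetch each unindexed URL.
    from urllib.robotparser import RobotFileParser

    robots = RobotFileParser("http://www.example.org/robots.txt")
    robots.read()  # fetch and parse the robots.txt file

    unindexed_urls = [
        "http://www.example.org/photos/1234",
        "http://www.example.org/photos/5678",
    ]

    for url in unindexed_urls:
        if not robots.can_fetch("Googlebot", url):
            print("Blocked by robots.txt:", url)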

So what can I conclude from this? Not a whole lot yet, except that creating a sitemap and registering it with Google is by no means a silver bullet. It’s still a good idea, but it should be only one of many strategies we employ to get the unique content that libraries, museums and archives hold in front of the roving eyes of web users. It’s exactly that suite of strategies I’m trying to identify and describe, so please let me know of any methods you’re using that you think may be effective, either as a comment on this post or by direct email.

4 Comments on “A Map to Destinations Uncrawled”

  1. Laurie,
    Thanks for all your information; I may be in touch for more details. This sounds similar to techniques the National Library of Singapore is using, which I’ll be investigating as well. It would be great if we could devise a set of tools that would help, but I’m not optimistic about this, given the very different nature of our various situations. But we’ll see. Thanks again.

  2. I’m in the University of Florida’s Digital Library Center, and we’re now using static pages in a separate directory to ensure our Digital Collections are properly crawled. We were trying to rely on sitemaps, but our sitemaps weren’t being read properly, leading to both poor search engine rankings and heavy memory loads on our servers. The search engine bots would crawl everywhere despite the directions in robots.txt and the nofollow links, and because we pre-load the next and previous page images for items, the random over-crawling was both ineffective and problematic. The static pages include the full text and full citation for each item, and this has been the most effective approach for getting search engines to crawl us.

    Our static pages live at http://www.uflib.ufl.edu/ufdc2 and our primary site is http://www.uflib.ufl.edu/ufdc. The UFDC2 pages are single static pages for each item, and all links on them go to the main UFDC site. We’re using robots.txt to deny search engines access to UFDC and then allowing them on UFDC2 (roughly along the lines of the sketch at the end of this comment), to get around the uneven support for nofollow on links. The only major problem is that this approach will require more work than maintaining sitemaps alone as search engines get better at indexing the deep web.

    We’ve also used standard SEO methods, including creating RSS feeds for new items by collection; using blogs and writing entries on specific items and collections for Wikipedia and other relevant sites to make sure the search engines get to the collection main pages and to some of the specific items in the collections; optimizing our code as much as possible for search bots; and trying to add our collection links to as many other sites as possible. Based on "Googlizing a Digital Library" from the Code4Lib Journal (http://journal.code4lib.org/articles/43), it seems like static pages are the next best step if sitemaps aren’t sufficient, but I’d love to know more about what others are using, and a suite of easily configurable and sharable tools would be ideal.
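    In rough outline, the robots.txt split looks like this (simplified; the Sitemap line is a hypothetical example of the autodiscovery directive the major engines support, and Allow is an extension honored by the big crawlers rather than part of the original robots.txt standard):

        User-agent: *
        Disallow: /ufdc/
        Allow: /ufdc2/
        # Note the trailing slashes: "Disallow: /ufdc" (no slash) would also
        # match /ufdc2, since robots.txt rules are simple path prefixes.
        # Hypothetical sitemap location, for autodiscovery by the engines:
        Sitemap: http://www.uflib.ufl.edu/sitemap.xml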
