As part of what we’re trying to accomplish in the “Modeling New Service Infrastructures” part of the RLG Work Agenda, I’ve been working on a white paper on best practices for enhancing disclosure of library, museum and archive content at the network level. Using a sitemap to better expose content to web search engines such as Google is one technique that I’m investigating, but it is clear that this technique is by no means a silver bullet.
For those of you who aren’t familiar with sitemaps, they are a way to expose information for a web crawler that is normally hidden behind a database wall. Since web crawlers cannot intuit all the necessary queries to extract all the data from a database, other techniques must be used to provide the crawler with crawlable URLs. One such technique is to create a sitemap, which is an XML file with all the URLs you wish to have crawled on your site. Google, Yahoo!, and Microsoft have collaborated through the Sitemaps.org site to establish a common protocol for sitemaps.
I’ve used this for some time now to make sure all the images on my photos site were exposed to crawling. But recently I checked on it in Google’s Webmaster Tools site (where you can register your sitemap to make sure Google finds it), and I discovered to my dismay that only a small percentage (38%) of the URLs had been indexed in Google. On that site they state that “Most sites will not have all of their pages indexed,” but the only clues why are perhaps contained in the “Webmaster Guidelines” that describe what to do or not do to increase your chances of being indexed.
I asked the subscribers to the Web4Lib discussion what their experience has been. Debbie Campbell of the National Library of Australia (an RLG partner institution) reported that of the over a million items in Picture Australia, Google had indexed only about 49% of the URLs, although it was up dramatically over a couple weeks ago when the number was a paltry 32,000.
Marshall Breeding of Vanderbilt University and the owner of Library Technology Guides reported a higher percentage of 62% of his URLs indexed. He also reported that he’s doing some testing of the priority setting of potentially uncrawled URLs (based on low page views) in his sitemap to see if that helps.
Danielle Plumer of the Texas State Library and Archives Commission pointed out an article that discusses related issues, since of course this is a topic of interest to those trying to achieve “search engine optimization”. She also sent along part of a Q&A with Matt Cutts of Google that bears on this issue:
Q: “My sitemap has about 1350 urls in it. . . . . its been around for 2+ years, but I cannot seem to get all the pages indexed. Am I missing something here?”
So what can I conclude from this? Not a whole lot yet, except that creating a sitemap and registering it with Google is not by any means a silver bullet. It’s still a good idea, but it should only be one strategy of many we can take to get the unique content that libraries, museums and archives hold in front of the roving eyes of web users. It’s exactly that suite of strategies I’m trying to identify and describe, so please let me know of any methods you’re using that you think may be effective, either as a comment to this post or direct email.