Before I got distracted, I was going to update you on some of my travels and activities. I’ll start with the Digital Library Federation Forum in Austin, Texas (I’ve already told you about the panel on the Open Content Alliance).
Here are some highlights from that meeting, from my perspective — keep in mind that I wasn't able to attend every session, and I missed several that I would have loved to see. Links in presentation titles will take you to PowerPoint presentations.
Evolution of a Digitization Program from Project to Large Scale [no PowerPoint available] (Aaron Choate, UT Austin)
Transition from unique and rare materials to high volume, and how to do both. Outsourcing for less fragile items, with a focus on efficiency and workflow. The University of Texas is a member of the Open Content Alliance, so they will be digitizing books to contribute to the overall effort. They are using Stokes Imaging workstations for high volume, and CCS/DocWorks for automated OCR and structural metadata. High-volume scanning leads to increased workflow for preservation and cataloging, as well as for collection managers and programming/web development. There is a library team dedicated to working this stuff out and testing processes and communication. They are changing image and object IDs from “metadata encumbered” (my term) to arbitrary, streamlining workflows from the more handcrafted earlier stages of digitization. Acknowledged compromise on quality. Using SharePoint (MS Windows) to manage project data. I liked this presentation because of its practical nature, and because it ties in well with our upcoming Member Forum.
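The ID change they described can be illustrated with a toy sketch (all names and values here are invented, not UT Austin's actual scheme): a “metadata-encumbered” identifier bakes descriptive facts into the ID itself, so it breaks whenever those facts change, while an arbitrary ID stays stable and keeps the description in a separate record.

```python
import itertools

# A "metadata-encumbered" ID encodes descriptive facts directly,
# so it must change whenever those facts do.
encumbered_id = "austin_mapcoll_1884_p017"  # breaks if the item is re-paginated

# Arbitrary IDs are opaque and stable; description lives elsewhere.
_counter = itertools.count(1)

def mint_id():
    """Mint an opaque, sequential object ID."""
    return f"obj{next(_counter):08d}"

metadata = {}
oid = mint_id()
metadata[oid] = {"collection": "mapcoll", "year": 1884, "page": 17}
```

If the page numbering changes, only the metadata record is updated; every reference to the arbitrary ID stays valid.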
Contextualizing the Institutional Repository within Faculty Research (Deb Holmes-Wong, USC)
Anne Van Camp and I heard about this project when we visited USC in January. Before building their institutional repository, USC conducted an assessment. The group looked in the literature and couldn’t find that anyone else had done this type of assessment before launching a repository. They interviewed USC faculty and found that faculty are uninterested in depositing published works, and more interested in depositing supporting materials (which can’t be published for space reasons) and PhD research. They want a permanent URL, and they want to be able to strictly control who accesses their materials. Faculty also want high quality scanning services, and they want their materials to persist over time and be migrated forward in terms of file formats. I liked this presentation because it ties in with RLG’s interest in working with users before developing/deploying.
Repurposing Digital Collections at the University of Michigan via Print on Demand
Interesting presentation on how U Mich is turning MOA and other projects into print on demand books, offered via Amazon with fulfillment done via Lightning Source. Growing business, working towards cost recovery for tracking, etc. I liked this presentation because it explores new economic models for libraries, and also addresses issues of availability — many of these books are out of print and unattainable at a reasonable cost, and this project makes them available to those who do not have ready access to a well-stocked library.
Serials, the Next Motherlode for Large Scale Digitization? (U Penn, John Mark Ockerbloom)
Looking at opportunities for digitizing out-of-copyright and orphaned serials, and at techniques for determining which serials qualify. There is a real need for tools to help in this area. I liked this presentation because there is a clear tie-in to the work we are doing on the Open Content Alliance.
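To give a flavor of the kind of determination tool the talk called for, here is a toy screen based on the basic US rules as they stand today (pre-1923 publications are public domain; 1923–1963 publications are public domain only if copyright was not renewed). A real tool would have to search renewal records — which is exactly where tooling is lacking — and the sample data below is invented.

```python
def pd_candidate(year, renewed):
    """Toy US public-domain screen for a single serial issue.

    Pre-1923: public domain. 1923-1963: public domain only if the
    copyright was not renewed. Later years: assume still in copyright.
    Real determinations require searching copyright renewal records.
    """
    if year < 1923:
        return True
    if year <= 1963:
        return not renewed
    return False

# Invented sample issues: (publication year, copyright renewed?)
issues = [(1910, False), (1930, False), (1930, True), (1970, False)]
flags = [pd_candidate(year, renewed) for year, renewed in issues]
```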
Surfacing consistent topics across aggregated resource collections (clustering and classification techniques)
All projects were looking, I think, at using data mining techniques to cluster and then classify documents based on metadata, not full text, so this work is analogous to work we did “under the hood” with RedLightGreen. I found this set of presentations interesting because of that tie-in.
1. Emory, MetaCombine (Martin Halbert). The clustering and classification tools are part of MetaCombine and still need work. Looking at creating tools that can be used in an unsupervised mode, and at using web services to give access to the MetaCombine tools (so you don’t have to install them at your home institution), to training sets, etc. An interesting part of the presentation was some work they have done to modify Heritrix, the open source web crawler that almost everyone uses. They’ve taken Bow, a text-classification toolkit developed at Carnegie Mellon, and adapted it to Heritrix, so now Heritrix will crawl based on relevance (only following links from relevant pages on to other relevant pages — if a page is not relevant, it stops crawling in that direction).
2. OAIster, University of Michigan (Kat Hagedorn). Used the MetaCombine tools from Emory. The conclusion was that clustering was useful over a very large data set, but classification was difficult and less useful. Also, large datasets take a long time (well, we could have told her that — a lesson learned from RedLightGreen, where processing the very large dataset that is the Union Catalog took quite some time!).
3. CDL, Bill Landis. Work from CDL’s America West project. Clustering is good at a global level, classification helps to meet local/project needs. Classification and bags of words can and should be shared.
4. If you have no idea what any of the above is about, David Newman from TopicSeek gave a nice introduction to clustering and classification.
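The relevance-directed crawling described in the MetaCombine presentation can be sketched roughly like this. The link graph and relevance test below are invented stand-ins (the real system scores page text with a trained classifier); the point is just the control flow: an irrelevant page is never collected and its outgoing links are never followed.

```python
from collections import deque

# Toy link graph standing in for the web (hypothetical data).
LINK_GRAPH = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": [],
    "d": ["e"],
    "e": [],
}

# Hypothetical relevance judgment; a real crawler would classify page text.
RELEVANT = {"seed", "a", "c", "d"}

def focused_crawl(seed, link_graph, is_relevant):
    """Breadth-first crawl that only expands links out of relevant pages."""
    visited, queue, collected = set(), deque([seed]), []
    while queue:
        page = queue.popleft()
        if page in visited:
            continue
        visited.add(page)
        if not is_relevant(page):
            continue  # dead end: don't collect, don't follow its links
        collected.append(page)
        queue.extend(link_graph.get(page, []))
    return collected

pages = focused_crawl("seed", LINK_GRAPH, lambda p: p in RELEVANT)
```

Note that page "d" is relevant but is never reached, because the only path to it runs through the irrelevant page "b" — that pruning is what keeps a focused crawl small.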
Recommending and ranking: experiments in next generation library catalogs (on Melvyl, CDL, Brian Tingle presenting)
Currently investigating how to get XTF to represent MARC data in FRBR terms, whether circulation data or holdings data are more helpful in ranking, and whether “people who checked out this book also checked out…” features would be interesting. They just finished one round of user testing and will do more in May. XTF is providing better ranking of results than the ILS does. I’m inviting this team to come to RLG to share findings, so I will have more to report in June. Lots of RedLightGreen synergies.
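A minimal sketch of how a “people who checked out this book also checked out…” feature could work, assuming circulation records grouped by patron (the data and function names here are hypothetical, not CDL's implementation): count how often other titles co-occur in the same patron histories and recommend the most frequent.

```python
from collections import Counter

# Hypothetical circulation log: patron -> books they checked out.
checkouts = {
    "p1": ["moby_dick", "billy_budd", "typee"],
    "p2": ["moby_dick", "billy_budd"],
    "p3": ["moby_dick", "walden"],
}

def also_checked_out(book, checkouts, top_n=2):
    """Rank other titles by co-occurrence with `book` in patron histories."""
    co = Counter()
    for books in checkouts.values():
        if book in books:
            co.update(b for b in books if b != book)
    return [title for title, _count in co.most_common(top_n)]

recs = also_checked_out("moby_dick", checkouts)
```

Real systems layer privacy protections and popularity damping on top of this, but co-occurrence counting is the core idea.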
Unbundling the ILS: Deploying an E-Commerce Catalog Search Solution
Andrew Pace and Emily Lynema, North Carolina State
This project has received quite a bit of play and this was my first real look at it. Using an e-commerce tool, Endeca, to help provide relevance and faceted browsing to the catalog. Runs fast, because all data is held in RAM (no surprise). Takes 7 hours to reindex data, which is done nightly, on something like 1.2 million records. They encountered the same issues we did, in working with a tech partner — wow, you have so many fields and you want to index them all?!? Future plans to FRBRize. I was gratified to see numerous acknowledgements of lessons learned from our RedLightGreen project. If you haven’t seen it, take a look.
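Faceted browsing of the kind Endeca provides boils down to counting field values across the current result set and narrowing the set when a value is clicked; a toy sketch follows (invented records, not NCSU's actual configuration — their index holds over a million MARC-derived records in RAM).

```python
from collections import Counter

# Hypothetical catalog records with a couple of facetable fields.
records = [
    {"title": "Moby-Dick", "format": "book", "subject": "whaling"},
    {"title": "Typee", "format": "book", "subject": "travel"},
    {"title": "Whaling Songs", "format": "audio", "subject": "whaling"},
]

def facet_counts(results, field):
    """Count how many records in the current result set carry each value."""
    return Counter(record[field] for record in results)

def refine(results, field, value):
    """Narrow the result set, as when a user clicks a facet value."""
    return [record for record in results if record[field] == value]

counts = facet_counts(records, "format")
books = refine(records, "format", "book")
```

Holding everything in memory is what makes recomputing these counts on every click fast enough to feel instantaneous.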
Finally, David Seaman announced that he will be stepping down as the Director of the DLF. This is sad news, and we will miss him, but fortunately he’ll be around through the next Forum in Boston.