Archive for April, 2006
Before I got distracted, I was going to update you on some of my travels and activities. I’ll start with the Digital Library Federation Forum in Austin, Texas (I’ve already told you about the panel on the Open Content Alliance).
From my perspective, some highlights from that meeting — keep in mind that I was not able to go to all of the various sessions, and I missed several that I would have loved to attend. Links in presentation titles will take you to PowerPoint presentations.
Evolution of a Digitization Program from Project to Large Scale [no PowerPoint available] (Aaron Choate, UT Austin)
Transition from unique and rare to high volume, and how to do both. Outsourcing for less fragile items; focus on efficiency and workflow. The University of Texas is a member of the Open Content Alliance, so they will be digitizing books to contribute to the overall effort. They are using Stokes Imaging workstations for high volume, and CCS docWorks for automated OCR and structural metadata. High-volume scanning increases the workload for preservation and cataloging, as well as for collection managers and programming/web development. There is a library team dedicated to working this out: testing processes and communication, changing image or object IDs from "metadata encumbered" (my term) to arbitrary, and streamlining workflows from the more handcrafted earlier stages of digitization. Acknowledged compromise on quality. Using SharePoint (MS Windows) to manage project data. I liked this presentation because of its practical nature, and because it ties in well with our upcoming Member Forum.
Contextualizing the Institutional Repository within Faculty Research (Deb Holmes-Wong, USC)
Anne Van Camp and I heard about this project when we visited USC in January. Before building their institutional repository, USC conducted an assessment. The group looked in the literature and couldn't find that anyone else had done this type of assessment before launching a repository. They interviewed USC faculty and found that they are uninterested in depositing published works, and more interested in supporting materials (which can't be published for space reasons) and PhD research. They want a permanent URL and want to be able to strictly control who accesses their materials. Faculty also want high-quality scanning services. They want their materials to persist over time and be migrated forward in terms of file formats. I liked this presentation because it ties in with RLG's interest in working with users before developing/deploying.
Repurposing Digital Collections at the University of Michigan via Print on Demand
Interesting presentation on how U Mich is turning MOA and other projects into print on demand books, offered via Amazon with fulfillment done via Lightning Source. Growing business, working towards cost recovery for tracking, etc. I liked this presentation because it explores new economic models for libraries, and also addresses issues of availability — many of these books are out of print and unattainable at a reasonable cost, and this project makes them available to those who do not have ready access to a well-stocked library.
Serials, the Next Motherlode for Large Scale Digitization? (U Penn, John Mark Ockerbloom)
Looking at opportunities for digitizing out-of-copyright and orphaned serials, and at techniques for determining copyright status. There is a real need for tools in this area. I liked this presentation because there is a clear tie-in to the work we are doing on the Open Content Alliance.
Surfacing consistent topics across aggregated resource collections (clustering and classification techniques)
All projects were looking, I think, at using data mining techniques to cluster and then classify documents based on metadata, not text, so this work is analogous to work we did “under the hood” with RedLightGreen. I found this set of presentations interesting because of the tie in with RedLightGreen.
1. Emory, MetaCombine (Martin Halbert). The clustering and classification tools are part of MetaCombine and still need work. Looking at creating tools that can be used in an unsupervised mode. Looking at using web services to give access to the MetaCombine tools (so you don't have to install them at your home institution), give access to training sets, etc. An interesting part of the presentation was some work they have done to modify Heritrix, the open source web crawler that almost everyone uses. They've taken Bow, developed at Carnegie Mellon, and adapted it to Heritrix, so now Heritrix will do crawls based on relevance (only following links from relevant pages on to other relevant pages — if a page is not relevant, it stops crawling in that direction).
2. OAIster, University of Michigan (Kat Hagedorn). Used the MetaCombine tools from Emory. The conclusion was that clustering was useful over a very large data set, but classification was difficult and less useful. Also, large datasets take a long time (well, we could have told her that — a lesson learned from RedLightGreen, where processing the very large dataset that is the Union Catalog took quite some time!).
3. CDL, Bill Landis. Work from CDL’s America West project. Clustering is good at a global level, classification helps to meet local/project needs. Classification and bags of words can and should be shared.
4. If you have no idea what any of the above is about, David Newman from TopicSeek gave a nice introduction to clustering and classification.
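The relevance-guided crawling idea described above (follow links only out of relevant pages; stop crawling in any direction once a page is not relevant) can be sketched in a few lines. This is a toy illustration only — an in-memory "web" and a crude keyword test stand in for real HTTP fetching and for the actual Bow text classifier grafted onto Heritrix, and all page names and text are invented:

```python
from collections import deque

# Toy in-memory "web": page id -> (text, outgoing links).
# A hypothetical stand-in for pages a crawler would fetch over HTTP.
PAGES = {
    "A": ("railroad history archive", ["B", "C"]),
    "B": ("railroad timetables and maps", ["D"]),
    "C": ("cooking recipes", ["E"]),
    "D": ("locomotive photographs", []),
    "E": ("railroad museum exhibits", []),
}

def relevant(text, keywords):
    """Crude relevance test: does the page mention any topic keyword?
    (The real setup uses a trained classifier, not keyword matching.)"""
    return any(k in text for k in keywords)

def focused_crawl(seed, keywords):
    """Breadth-first crawl that only follows links out of relevant pages."""
    visited, queue = set(), deque([seed])
    while queue:
        page = queue.popleft()
        if page in visited or page not in PAGES:
            continue
        visited.add(page)
        text, links = PAGES[page]
        if relevant(text, keywords):
            queue.extend(links)  # expand only relevant pages
        # irrelevant pages are fetched but their links are not followed
    return visited

crawled = focused_crawl("A", ["railroad", "locomotive"])
```

Here page "C" (recipes) is fetched because a relevant page linked to it, but since it fails the relevance test, the crawl never proceeds to "E" — exactly the pruning behavior described.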
Recommending and ranking: experiments in next generation library catalogs (on Melvyl, CDL, Brian Tingle presenting)
Currently investigating how to get XTF to represent MARC data in FRBR, whether circulation data or holdings data are more helpful in ranking, and whether "people who checked out this book also checked out…" features would be interesting. Just finished one round of user testing; more in May. XTF is providing better ranking of results than the ILS. I'm inviting this team to come to RLG to share findings, so I will have more to report in June. Lots of RedLightGreen synergies.
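The "people who checked out this book also checked out…" feature boils down to counting co-occurrences in circulation data. A minimal sketch, using an invented circulation log (patron/item pairs are hypothetical, and real systems would of course anonymize and aggregate):

```python
from collections import Counter, defaultdict

# Hypothetical circulation log: (patron, item) checkout pairs.
checkouts = [
    ("p1", "moby_dick"), ("p1", "billy_budd"),
    ("p2", "moby_dick"), ("p2", "billy_budd"), ("p2", "walden"),
    ("p3", "walden"),    ("p3", "civil_disobedience"),
]

# Invert the log to patron -> set of items checked out.
by_patron = defaultdict(set)
for patron, item in checkouts:
    by_patron[patron].add(item)

def also_checked_out(item, top_n=3):
    """Count how often other items co-occur with `item` across patrons."""
    counts = Counter()
    for items in by_patron.values():
        if item in items:
            counts.update(items - {item})
    return counts.most_common(top_n)

recs = also_checked_out("moby_dick")
```

The co-occurrence counts double as a crude ranking signal, which is one reason circulation data is interesting for relevance experiments like these.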
Unbundling the ILS: Deploying an E-Commerce Catalog Search Solution
Andrew Pace and Emily Lynema, North Carolina State
This project has received quite a bit of play, and this was my first real look at it. They are using an e-commerce tool, Endeca, to provide relevance ranking and faceted browsing for the catalog. It runs fast because all the data is held in RAM (no surprise). Reindexing the data takes 7 hours, done nightly, on something like 1.2 million records. They encountered the same issues we did in working with a tech partner — wow, you have so many fields and you want to index them all?!? Future plans include FRBRizing. I was gratified to see numerous acknowledgements of lessons learned from our RedLightGreen project. If you haven't seen it, take a look.
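The core of faceted browsing is simple enough to sketch: for each facet field, count how many records carry each value, then narrow the record set as the user clicks facets. This toy version keeps everything in memory (as Endeca does, per the talk); the records and field names are invented for illustration:

```python
from collections import Counter

# Hypothetical in-memory catalog records.
records = [
    {"title": "Intro to Botany", "format": "book", "language": "eng"},
    {"title": "Botany Atlas",    "format": "map",  "language": "eng"},
    {"title": "Botánica básica", "format": "book", "language": "spa"},
]

def facet_counts(records, facets):
    """For each facet field, count how many records carry each value."""
    return {f: Counter(r[f] for r in records) for f in facets}

def narrow(records, **filters):
    """Apply facet selections, e.g. narrow(records, format='book')."""
    return [r for r in records if all(r[f] == v for f, v in filters.items())]

counts = facet_counts(records, ["format", "language"])
books = narrow(records, format="book")
```

Recomputing `facet_counts` on each narrowed result set is what produces the familiar "format: book (2), map (1)" sidebar; doing that quickly over 1.2 million records is where the in-RAM design earns its keep.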
Finally, David Seaman announced that he will be stepping down as the Director of the DLF. This is sad news, and we will miss him, but fortunately he’ll be around through the next Forum in Boston.
[Originally posted April 18th, 2006 — the last in our recently republished series. A special thanks to Roy Tennant for locating this!]
I just received this email:
100 years ago today, on April 18, 1906, a monumental earthquake hit the city of San Francisco. The effects of the earthquake and the subsequent fire that swept the city are commemorated in a significant online resource created for the centennial, which draws on historic materials such as photographs, albums, diaries, letters, oral histories, reports and other documents housed in some of the most important special collections in California.
The 1906 San Francisco Earthquake and Fire Digital Collection includes approximately 14,000 images and 7,000 pages of full-text-searchable documents related to the history of the 1906 earthquake and fire in San Francisco and other affected areas. The website has an introduction to the collection, an online exhibit including short films and audio, a 360-degree panorama of the city, a searchable interactive map of the city, and search and browse capabilities to help users explore this rich content.
The 1906 San Francisco Earthquake and Fire Digital Collection includes selected holdings from the archives and special collections of The Bancroft Library, University of California, Berkeley; the California Historical Society, San Francisco; the California State Library, Sacramento; Stanford University Libraries – Special Collections & Archives, Stanford University, Stanford; The Huntington Library, San Marino; and The Society of California Pioneers, San Francisco.
The 1906 San Francisco Earthquake and Fire Digital Collection: http://bancroft.berkeley.edu/collections/earthquakeandfire/
When I left UC Berkeley five years ago, in April 2001 (has it been that long already?), this project was just winding up its first year of activities and the library was applying for a second round of funding. For the last week, with the build-up to the centennial, I’ve been reminiscing about my days working on digital library projects (instead of just talking and thinking about them). I’m sorry I missed out on the project, but happy it’s up online. Congratulations to all those institutions, and to all those who did the work.
Those amongst you who read this blog regularly know that I’ve been thinking a fair bit about digital asset management in museums lately – witness my posts here and here. I’ve had another opportunity to clarify my thinking while musing about a talk for AAM 2006 in Boston. I’ll be on a panel called “Preserving Your Digital Assets: Preserving Your Investment,” and I’ve entitled my talk (at least so far) “From Asset Management to Digital Preservation.” My basic conundrum: figure out how to tell the gathered museum crowd that vendor-based Digital Asset Management Systems (DAMS) are worthwhile for many reasons, but those reasons shouldn’t be confused with fully-fledged digital preservation.
Both from the ClearStory survey (for a quick summary, read here) and from a little informal survey of 17 RLG museum members, I know that the notion DAMS = digital preservation prevails (of the 17 respondents to my informal survey in February, 14 indicated just that). To debunk that myth, I turn to the RLG-NARA Audit Checklist for Certifying Digital Repositories, which is a concise articulation of the circumstances enabling long-term preservation. If you hold a DAMS up against the checklist, you’ll soon notice that you’re comparing apples and oranges: a DAMS is a technology, while the attributes of a trusted digital repository are only to a degree about technology – most of the checklist details institutional commitments, policies and frameworks which have to be in place to ensure the long-term survival of a digital file.
But what if you evaluated a DAMS in its institutional environment? Couldn’t the DAMS be the technological aspect of the trusted digital repository, if everything else (the policies etc.) fell into place? I put that question to Robin Dale (our resident checklist and certification expert), and she agreed that while we couldn’t rule out that possibility in theory, in practice an institution running a vendor-based DAMS would have a hard time answering any number of the questions in the checklist, because it wouldn’t be privy to the inner workings of its black-box solution.
It would be an interesting exercise to take the checklist and apply it to a museum with a DAMS implementation (any willing self-auditors out there?). In the meantime, I’ll stubbornly maintain my prejudice that a DAMS is a technology and therefore by definition only part of a digital preservation solution. Beyond that, even as a technology, these systems are geared more towards pumping assets around an institution in the here and now than towards maintaining them for the next generation. The way I see it, a trusted digital repository aims to preserve an asset beyond the lifetime of the current technological environment, while a DAMS provides access to assets for many uses over the lifetime of the current technological environment. Aren’t both fine, dandy and eminently worthwhile endeavors? They even build nicely on one another (more on this, maybe, in another post). Just don’t confuse them.
[Originally posted April 17th, 2006 -- No 2 in our recently republished series]
Okay, it’s been a while. I haven’t been avoiding you, I’ve just been super busy.
I’ll give some other highlights from DLF soon, but wanted to mention that I was on a panel with Rick Prelinger (Internet Archive-Open Content Alliance-Prelinger Archive) and Robin Chandler (California Digital Library). Rick gave an introduction to the Open Content Alliance, I went through the various workgroups that will push forward the work agenda, and Robin went through what it means to be a content contribution partner.
Even though UC’s digitization is being covered by external funding, there is still a ton of other work to be done. Robin described the team of people working across the University of California system doing selection, coordination, and an amazing array of other tasks — right down to having to plan for and install bathrooms for those doing the scanning. Participating in the OCA has meant a lot of work on the part of UC, not paid for by external funding.
So why do this? UC wants to make its collections more broadly accessible for their campus constituents, but also for the community. UC looks forward to drawing from other OCA collections to fill holes in their collections. Robin also pointed out the various ways that OCA participation allows UC to consider new business models.
On the fun side, for three nights in a row I went to the Congress Avenue Bridge to see the bat flight. This is amazing, and if you are going to be in Austin between now and December, I urge you to go check it out. I’ll be attending the Rare Books and Manuscripts Section preconference to ALA in June, which will also be in Austin, and I plan to be down at the bridge watching the bats head off for a night of eating bugs. More information about Austin’s bat colony can be found here, at the Bat Conservation International website.
[Originally posted 04/17/2006 - No 1 in our "re-live the moment" series.]
I’ve oracled about it before, and now it’s finally here: the joint Museum Computer Network (MCN) and AAM Media & Technology blog musematic.net is open for business! As a member of both boards, I’ve been intimately involved in setting up this blog, so excuse my giddy glee at being able to announce its launch. Similar to hangingtogether, musematic provides varied entertainment by hosting the opinions of a number of bloggers. The cast is quite diverse and impressive, so expect posts on far-ranging topics in the field of museum technology.
Already, there are a good half-a-dozen posts giving a good sense of the voices you’ll hear – Holly Witchey (Cleveland Museum of Art) struck twice, with a post on ethical dilemmas in museums (somehow she manages to drag Madonna into this) and a quite hilarious piece on her first adventure in secondlife.com. Nik Honeysett (J. Paul Getty Trust) has mocked up an interview with “Bob the IT Guy,” which I predict is destined to become a classic. (Nik, I hope Bob will make future appearances on the blog!) For those who need it, Richard Urban (University of Illinois, Urbana-Champaign) gives a little intro to blogging, reading blogs and the museum blogosphere, and Paul Marty has an equally useful post on website evaluation. While these last two bloggers represent a more academic point of view, Amalyah Keshet (Israel Museum, Jerusalem) brings an international perspective and an abiding interest in copyright. Peter Samis (San Francisco Museum of Modern Art) needs no introduction. Paul Glenshaw (self-described as a “recovering museum employee” and “Dad to avid young museum-goers”) rounds out the field, and wrote the most recent post on sustaining technology in museum galleries.
I know I’ll be reading, what can I say. Together with Rob Lancefield (Davison Art Center, Wesleyan University), I’ll stay involved behind the scenes to help hold the blog together and motivate should motivation be needed. Please stop by, subscribe, and don’t forget to leave comments to spur on this talented cast of (mostly) fledgling bloggers!
LIShost had some problems, and consequently, HangingTogether lost some posts (fortunately, we’re not missing in action altogether — thanks Blake!). I was able to grab some of the missing posts from a cache, but if any of you readers have a copy of my post from 4/18 (A walk down memory lane), I’d love to get a copy of it, so I don’t need to reconstruct it.
Prepare yourselves to relive a few recent blog postings. May they be enjoyable the second time around!
Hi. My name is Anne and I’m addicted to access to archives.
I have had this passion since the first time I started to work in an archive, processing a collection of original materials. These records had never been seen by anyone except their creators, and even they probably never looked at the material as a whole, as a story, as a history waiting to be told. I loved everything about working on that collection – organizing it, typing up the finding aid, re-typing the finding aid, and finally completing it and putting it in a folder where it would stay until someone, somehow might learn about this great collection and would appreciate all my care and devotion to making it easy to use. To this day, I’m still not sure if anyone ever did use it.
Several years now into a long career as an archivist, I am still passionately engaged in making archives accessible – though I’m not doing it one finding aid at a time. I’m doing it with thousands of finding aids and bibliographic records that archivists have painstakingly created in hopes of making their treasured collections accessible to researchers. And what’s exciting is that current technology allows us to do this in an unprecedented fashion. And what’s even more exciting is that I can now see how great the demand for this information has become.
When we launched ArchiveGrid.org last month, we had the opportunity to open the doors to an incredible wealth of information about archives dispersed across the country and around the world. One private funder, who understood the value and the vision of our work to aggregate information about dispersed archives and build the best access system we could, also allowed us to test the power of and demand for this information by supporting a three-month free access period. We were sure that this would be a way of attracting more users to this important resource.
What we could not anticipate was the enormous response that started on day one. With our first announcements of the new service out, we carefully tracked how many people were finding the service, visiting, re-visiting, and telling their friends and colleagues about it, and we watched the blogosphere pick it up and run with it. Within the first three days of the service, the traffic exceeded our expectations for the full three months. Over the course of last month, 182,000 visits were made to ArchiveGrid, an average of 5,800 visits a day. More than 806,000 pages of information have been viewed, and the numbers continue to rise. I’d say there’s a demand out there, all right.
Now we have the challenge of keeping the system open. Who funds this sort of thing? Who is passionate enough about archives to make sure that this kind of access can continue? The mantra running through most of our feedback is something like – this is fabulous – but what happens after May?
I’m looking for angels, and also for real live funders, to help support my addiction.
In a previous post, I hinted at the collaborative work I’m engaged in with a number of natural history institutions. Our working group ominously titled itself RAVNS, or Resources Available in Natural Sciences, and we’ve had conference calls for a good 18 months now. Building on a European Union project called BioCASE (Biological Collections Access Service for Europe), the stated goal of the RAVNS is to create an XML Schema for describing collections in natural history institutions. As Neil Thomson of the Natural History Museum, London, writes:
It is intended primarily as a lightweight resource description standard that is specific to natural history and lies between general resource discovery standards such as Dublin Core (DC) and rich collection description standards such as the Encoded Archival Description (EAD). However, it should be possible to extract a Dublin Core record from an NCD record for use with general resource discovery systems or, going in the other direction, to use an NCD record as the seed for a much richer collection description as and when time allows.
I pulled this succinct positioning of the fledgling specification from the Taxonomic Database Working Group (TDWG) Natural Collections Description webpage, where you can also download a draft [link to .xsd file] of the schema. As the RAVNS got more and more serious about their work, TDWG expressed an interest in making the XML Schema we were working on part of their ecology of standards (excuse the pun), and they have established a working group to move things forward on their end. Since Neil Thomson chairs both groups (he actually co-chairs the RLG group together with Carol Butler from the National Museum of Natural History), we’re all happily pulling on one string. You’ll hear more about this effort after a big meeting in June, which will bring together representatives from the Global Biodiversity Information Facility (GBIF), TDWG, the RAVNS and some folks with collection description smarts who’ll be able to give some impartial input. Graciously funded by the Gordon and Betty Moore Foundation, via GBIF, I should hasten to add. Stay tuned!
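The crosswalk Neil Thomson describes — extracting a flat Dublin Core record from a richer NCD collection description — can be sketched as a simple element mapping. To be clear, the XML below is invented for illustration; the real NCD draft schema (the .xsd linked above) has its own element names and namespaces, and the mapping table here is a hypothetical one, not the official crosswalk:

```python
import xml.etree.ElementTree as ET

# A made-up, simplified collection-description record; the real NCD
# schema uses different element names and namespaces.
NCD_RECORD = """
<collection>
  <name>Lepidoptera of the Sierra Nevada</name>
  <description>Pinned specimens collected 1950-1980.</description>
  <institution>Example Natural History Museum</institution>
  <extent unit="specimens">12000</extent>
  <taxonomicCoverage>Lepidoptera</taxonomicCoverage>
</collection>
"""

# Hypothetical crosswalk from the rich record down to Dublin Core.
NCD_TO_DC = {
    "name": "dc:title",
    "description": "dc:description",
    "institution": "dc:publisher",
    "taxonomicCoverage": "dc:subject",
}

def extract_dublin_core(xml_text):
    """Pull a minimal DC record out of a richer collection description."""
    root = ET.fromstring(xml_text)
    dc = {}
    for ncd_el, dc_el in NCD_TO_DC.items():
        node = root.find(ncd_el)
        if node is not None and node.text:
            dc[dc_el] = node.text.strip()
    return dc

dc_record = extract_dublin_core(NCD_RECORD)
```

Going "down" to DC is easy because information is only discarded; going the other direction, as the quote notes, the DC record can only seed a richer description that a human then fills out.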