Archive for the 'digitization' Category

Pat the Elephant

Friday, July 23rd, 2010 by Constance

There is a well-known fable about blind men with contrasting views on the anatomy of an elephant, each having examined a separate piece of the beast and independently concluded that it is either very like a spear, or a fan, or a snake, etc.  Even in combination their observations fail to provide a very good picture of what an elephant looks like as a whole.  The story was popularized in a poem by John Godfrey Saxe which is cited in a surprisingly wide variety of publications, from early childhood education manuals, to scientific and medical reports, to vocational guides and, more predictably, collections of 19th-century verse.  I know this because a search of the HathiTrust digital library for a distinctive phrase from the poem’s conclusion, “prate about an elephant not one of them has seen,” finds more than 140 matches in these places.

Blind searching in large digital text repositories like the HathiTrust or Google Books provides an intriguing but incomplete view of the mass-digitized book corpus.  Frequently cited statistics like “12 million books” in GBS, “5 million books” or “one million public domain books” in Hathi don’t really tell us much about the anatomy of the mammoth.  Pat the elephant…what do you find?  A lot of curious sensory experiences that don’t add up.

When it comes to anatomizing elephants, all parts are not created equal.  Georges Cuvier, who famously reconstructed skeletons on the basis of a tooth or a toe, knew this.  Cuvier confidently and correctly distinguished Indian and African elephant species based on characteristic differences in jawbones; he ‘discovered’ the woolly mammoth based on a close examination of incomplete fossil remains.

I’m inclined to think that counting books (or volumes) is about as useful in characterizing the mass-digitized corpus as counting vertebrae in the catacombs.  It tells us something about how much is there, but not much about who, or what, is there.

Happily, there is an abundance of bibliographic metadata describing the content from which the mass-digitized corpus was sourced that can be used (like a fossilized tooth or a toe) to assign some generic, or I suppose specific, characteristics to the elephant in the room.  Over the past year, OCLC Research has been working on a project with Hathi and some other interested libraries to begin characterizing the enormous, vaguely familiar (snake? spear? tree?) yet altogether revolutionary (woolly!) mammoth created through the digitization of legacy print collections.

We’ve posted some empirical data on the subject and library distribution of titles in the Hathi digital repository here.  

I think it provides a useful complement to the enchanting and progressively revealing fan-dance of class numbers here.

More to come.

Focus and reframe: rights and unpublished materials

Wednesday, March 10th, 2010 by Merrilee

I’m using this blog posting to wrap together a bunch of ideas I’ll be presenting at a meeting tomorrow, Undue Diligence: Seeking Low-risk Strategies for Making Collections of Unpublished Materials More Accessible.

Mark Greene and Dennis Meissner helped to reframe processing modern archival collections in More Product, Less Process. Similarly, Shifting Gears helped to recast digitization from special collections. The purpose of Undue Diligence is to help professionals to look anew at rights issues around unpublished materials, specifically with regard to digitization of those materials, particularly 20th and 21st century collections.

The RLG Partnership exists to identify shared problem spaces, and to reduce pain and effort in those areas. With increasing expectations that our holdings will be made digitally accessible, assessing rights (copyright, along with privacy rights, and potentially sensitive materials) within archival collections is one of those points of pain. The prospect of analyzing items within archival collections is so painful, in fact, that many institutions avoid digitizing collections that were created in the last 70 to 100 years. While this is a very safe practice, it does little to advance broad and democratic access to collections in our care.

The RLG Partnership likewise dodged the copyright bullet in 2007 when we held our forum, Digitization Matters (from which Shifting Gears was born). We ruled copyright out of scope. While reframing the conversation around digitization — from preservation to access, from quality to quantity — did help move the conversation on digitization forward, it did little for those institutions who have major collections relating to … the Great Depression, World Wars I and II, the Korean, Vietnam, and Gulf wars, the civil rights movement, the free speech movement… the list goes on and on. This is a small slice of topics that are studied by researchers, taught in classrooms, and of interest to citizens everywhere.

In 2008, we published a short paper called Copyright Investigation Summary Report, which looked at then-current practices around copyright with both published and unpublished materials. Here, we learned that most copyright investigations were undertaken in response to permission requests and almost never in support of digitization. Work was high effort and low return. “We say no a lot,” said one interviewee. Having conducted the interviews, I was pretty depressed by what I heard, which was a tale of professionals paralyzed by potential risks, and of collections shackled.

One of the proposed outcomes of the paper was to “…further explore community practice and issues around unpublished materials held in special collections and archives.” We did so by sponsoring the meeting that led to the SAA Orphan Works Statement of Best Practices, which was published in 2009. This document provides good guidance for institutions to conduct a “reasonable search,” but does not frame rights assessment in a risk management strategy.

The risk of perceived harm in digitizing a collection is quite variable, based on factors like content, purpose of creation, and date of creation. We believe that, in addition to standards for conducting a reasonable search, the community needs to reframe the issues of rights and risks collectively, and also to approach rights assessment the way archivists approach everything else: at a collection or series level and not at an item level.

We are holding this event, with a star-studded cast of presenters, to help set the stage for an important conversation, which is the development of what we are calling a set of “well intentioned practices.” We hope that this will have two effects. The first is that archivists will not need to reinvent the wheel, and can draw from community practices to identify lower risk collections of high research interest. The second is that institutions will digitize collections more freely. Even if institutions consider digitizing two out of ten collections, as opposed to one out of ten collections, access to collections will double!

We will follow up with subsequent blog postings both to report on the content of Undue Diligence and also to report on outcomes.

Many thanks to the advisory group who both helped to shape this event and our program of work in this area.

If you wish to follow the event on Twitter, follow #UndueD. I’ve also set up a Twapper Keeper for the event.

Europeana at the Halfway Mark

Monday, December 7th, 2009 by Ricky

For the recent LIBER/EBLIDA workshop on digitization at the Koninklijke Bibliotheek in The Hague, I was asked to provide a view on Europeana from the US perspective. Of course, I neither speak for the US nor do I have inside information about Europeana, but I’d been following it from afar and had read just about everything I could get my hands on, so I gamely took the challenge. [Only someone as bloodied by digital paper cuts as I would dare to take on Europeana.] I wasn’t bombarded with rotten tomatoes, courgettes, and aubergines, so I guess it went OK. My remarks are now available in Volume 19 (2009), No. 2 of the LIBER QUARTERLY.