My former RLG colleague and current OCLC software development manager, Judith Bush, is doing some nice blogging about the JCDL 2009 ("Designing tomorrow, preserving the past, today") sessions here. Check it out.
At the end of last week, I had a chance to read the just-issued white paper, Approaches to Managing and Collecting Born-Digital Literary Materials for Scholarly Use (Matthew G. Kirschenbaum et al). (To get to the paper, follow the link and scroll to “download.”)
My first reaction was: wow, the NEH certainly got a lot for its money. For $11,708, this project packed a lot in. The project started with a core group of institutions with a common interest: preserving and giving access to (the second part is important) "the born-digital documents and records of contemporary authorship." These records are usually a "hybrid" consisting of both electronic and print outputs. The original group included the University of Maryland, UT Austin, and Emory University, but by the end of the project the group had expanded to bring in viewpoints from the Library of Congress, Stanford University, the University of Maine, Yale University, the New York Public Library, the British Library, and the University of Oxford.
I liked two things about this report. The first is the sense that there was a real exchange of ideas, of institutions wanting to learn from one another (rather than develop their own ways of doing things). The second was that the project engaged practitioners, primarily authors, in a deep way, trying to understand the way that documents are created and used. I was happy to see further work with both scholars and authors listed under next steps; I think a deep understanding of how these documents are created, how they function, and how they will be used is essential for understanding preservation and access needs.
I’ll be attending an advisory board meeting for the Future Arch project at the Bodleian (Oxford) in September, and I found this report an excellent primer on many of the issues surrounding the preservation and use of hybrid collections. It introduced some ideas I hadn’t considered or taken seriously before, such as the materiality of the creation surround.
Conversations about dealing with digital collections in special collections are often marked by hand-wringing (and remarks like "that will be after I retire"), so it’s great to see the community rolling up its sleeves and getting to work.
A little over a year ago, I inherited a project that didn’t have much more than a name: "Explore and understand the place of large digital text aggregations in scholarship and research."
I had several discussions with my colleagues about what this project might turn out to be. We had several ideas:
– Create a shared understanding of the expectations that researchers and students bring to their interactions with large-scale text aggregations on the web and the requirements for making these collections fit for scholarly use.
– Convene an invitational meeting of those already engaged in large-scale digitization efforts to establish a common understanding of scholarly use cases and the core requirements for library-sourced research services.
– Identify service capabilities (bookmarking, annotation, citation management, etc.) that are required to support scholarly use of text aggregations.
– Assemble a text archive for prototyping and analysis.
– Investigate the needs of scholars (via focus groups?)
– Experiment with the metadata we get from OCLC’s e-Content Synchronization service to see how we can characterize the contents of book aggregations.
– Experiment with full-text functionality we might be able to offer a) on a specific aggregation, b) across aggregations.
What we were exploring went beyond finding and using a single document. It was about identifying works from many silos to incorporate into a local environment. And it was about performing actions against an index (or multiple indexes) of aggregated digitized works. We could investigate how scholars would work with the range of book text archives, starting with use case scenarios of the types of queries (e.g., in areas such as linguistic analysis, lexical frequency, translation studies, edition comparisons, things like occurrence of geographic place names in fiction, and coincidence of events – like being able to explore how a race riot affected neighborhood population dynamics).
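Queries of this sort are easy to sketch in code. Here is a toy illustration (the titles, snippets, and place names below are invented placeholders, not real collection data) of counting geographic place-name occurrences across an aggregation of digitized texts, assuming simple whitespace tokenization:

```python
from collections import Counter

# Invented toy "aggregation" of digitized texts, keyed by title.
aggregation = {
    "Novel A": "the riot spread through the harlem streets while harlem slept",
    "Novel B": "she left chicago for harlem before the riot began",
    "Novel C": "a quiet evening in chicago",
}

def place_name_occurrences(texts, place_names):
    """Count occurrences of each place name in each text of an aggregation."""
    results = {}
    for title, text in texts.items():
        counts = Counter(text.lower().split())
        results[title] = {name: counts[name.lower()] for name in place_names}
    return results

hits = place_name_occurrences(aggregation, ["Harlem", "Chicago"])
```

A real service would of course work against a prebuilt index rather than raw strings, but the shape of the question, which texts mention which places and how often, is the same.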
The title of this post is my homage to another famous Belgian.
I have been posting from the 9th International Bielefeld Conference in Germany. In yesterday’s closing keynote, Herbert Van de Sompel gave a most unusual presentation. Preparing, on his return to the Los Alamos National Laboratory, for a six-month sabbatical, he used the occasion to review the work he and his various teams have done over the past 10 years or so, and bravely assessed the success or otherwise of the various major initiatives in which he has been involved: SFX, OpenURL, OAI-PMH, OAI-ORE and MESUR (not for the acronymically faint-hearted). Incidentally, the 10-year boundary was as much accident as design. With the exception of one slide (pictured) showing his various project clusters, he had not prepared a new presentation, but instead paced around in front of a succession of old ones, some looking pretty dated, displayed in fabulous detail on the gigantic screen in the Bielefeld Convention Centre main hall. With a plea for more work on digital preservation, he stated that he had discovered that those PowerPoint presentations which were more than 10 years old were no longer readable.
The SFX development work, done at the University of Ghent, has resulted in some 1,700 SFX servers installed worldwide, which link, at a conservative estimate, to some 3 million items every day. Less successful, in his view, was the OpenURL NISO standard. It took three years to achieve, and, despite his ambitious intentions at the time, is still used almost exclusively for journal article linking. Reflecting on this, he remarked that the library community finds it hard to get its standards adopted outwith the library realm.
Herbert was also ambivalent about OAI-PMH. The systemic change predicted at the time of its development has not happened, and may never happen. He remarked that ‘Discovery today is defined by Google’, and in that context PMH did not do a good job because it is based on metadata. Ranking is based on who points at you (see my earlier post on the Webometrics ranking). ‘No one points at metadata records’. But it still provides a good means of synchronising XML-formatted metadata between databases.
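That synchronisation role is visible in the protocol itself: an OAI-PMH harvest is just an HTTP GET with a `verb` parameter, returning XML records carrying Dublin Core metadata. A minimal sketch (the repository base URL and the sample record are invented for illustration; real repositories publish their own endpoints):

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical repository endpoint, for illustration only.
BASE_URL = "https://repository.example.org/oai"

# A ListRecords request is an HTTP GET with query parameters.
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
request_url = BASE_URL + "?" + urlencode(params)

# A fabricated fragment of the kind of XML a repository would return.
SAMPLE_RESPONSE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <metadata>
    <dc:title>An Example Paper</dc:title>
    <dc:creator>A. Scholar</dc:creator>
  </metadata>
</record>"""

root = ET.fromstring(SAMPLE_RESPONSE)
ns = {"dc": "http://purl.org/dc/elements/1.1/"}
title = root.find(".//dc:title", ns).text
```

The point of Herbert’s critique stands out here: what travels over the wire is metadata about the object, not the object itself, which is why web-scale ranking engines have little use for it.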
He feels that we are moving on from a central concern with journal articles in any case. "What do we care about the literature any more? It’s all about the data (and let’s make sure that the data does not go the way of the literature!)" He offered some reflections on institutional repositories in passing. They are not ends in themselves (though often seem to be). There is a difference between their typical application in the US and in Europe. European libraries use them more for storing traditional academic papers: versions of the articles which appear in peer-reviewed journals. In the US, there is a tendency to use them for "all that other stuff". They are relatively unpopulated due to the fact that authors find it hard to care once they have had the paper accepted by their intended journal. But the other problem is workflow. Most repositories require deposit procedures which are outwith faculty workflows. Worse, content is being deposited by faculty all over the web: on YouTube’s SciTV, on blogs, in flickr. They have no time left for less attractive hubs. We need a button with the simplicity and embeddedness of the SFX resolver button to be present in these environments before we will truly optimise harvesting of content into the repository. There is a challenge …
The ORE work learned lessons from PMH. PMH did not address web architecture primitives. That was why Google rejected the protocol. It did not fit with their URI-crawling world view. ORE therefore used the architecture of the web as the platform for interoperability.
As for the MESUR project, directed by his compatriot Johan Bollen, Herbert described it as ‘phenomenal’. MESUR took the view that citations as a measure of impact were appropriate for the paper-based world, but now we should assess network-based metrics (the best known of which is Google’s PageRank). A billion usage events were collected to test the hypothesis that network metrics contain valuable information on impact. The hypothesis, he believes, was proved correct. There is structure there, and the ability to derive usable metrics. Indeed, the correlations produced by MESUR reached the fairly radical conclusion that the citation analysis data we have been using for decades is an outlier when compared with network-based methods.
Overall then, more plus points than negatives. And not only was his audience not inclined to criticise, but he was urged to stay and complete his presentation even though it ran over his allotted time by about 20 minutes at the end of an intensive day. How many people in our profession could discuss their work with reference to so many iconic projects? He concluded with a simple message – which he had come to see clearly as he prepared this review: we do what we do in order to optimise the time of researchers. Some recent studies, such as the UK Research Information Network’s Activities, costs and funding flows in scholarly communications (discussed earlier in the conference by Michael Jubb, Director of RIN), and the more recent JISC report, Economic Implications of Alternative Scholarly Publishing Models: Exploring the costs and benefits, express researcher time in cash terms. It amounts to billions of pounds each year.
How much money has been saved and so made available for further research by the projects developed and overseen by Herbert and his colleagues? There is optimisation to be proud of.
The National Library of New Zealand (Te Puna Mātauranga o Aotearoa in Māori, and an RLG Partner) has obviously been busy. Last week they joined the Flickr Commons, and they have already reported some impressive use statistics. But today (well, yesterday in Kiwi time) came an even bigger announcement.
Digital New Zealand, “a nation-wide project to help make New Zealand digital content easier to find, share and use was launched at the National Library of New Zealand on 3 December 2008.” The incredible array of collections made available through this one interface would be news enough for many libraries. But the joy doesn’t stop there.
The project welcomes additional content contributors, and stands ready to provide advice and assistance to help them do so. Visitors are offered an opportunity to create a tailored search of the site and drop the resulting widget onto any web page they like, or to use the special search page that is created for them right on the Digital New Zealand site.
If a visitor doesn’t wish to create a tailored web widget, there is already a library of existing widgets from which to choose. And for the true technorati, there is the developer section, which provides a simple way for software developers to get a key to use the application programming interface (API) of the site. If all of this isn’t enough to knock your socks off, stay tuned.
The “Memory Maker” is a web-based way to mix and match video clips into your own cinematic production. I kid you not. Try it out. You can add audio or music to add your own special touches. I doubt that any movie miracles will be made here, but the level of interactivity is completely off the charts. To get the full measure of this, you simply must see this movie.
So by now you must think surely I am done singing the praises of Te Puna Mātauranga o Aotearoa, but I’m not. There’s still more. Like I said, they’ve obviously been busy. The last thing I want to highlight is their National Digital Heritage Archive. Long in the works through a partnership with Ex Libris, this preservation system went live on November 4. “The National Digital Heritage Archive (NDHA),” states the web site, “is the National Library’s technical and business solution to preserve and provide long-term public access to its digital heritage collections.” The NLNZ was the flagship partner with Ex Libris, and the product is based on the Open Archival Information System (OAIS) model and conforms to trusted digital repository (TDR) requirements (which came out of joint RLG-OCLC work before the two organizations joined).
This is an incredible array of new initiatives by any measure, and a tribute to the leadership of Penny Carnaby, Chief Executive and National Librarian, and John Truesdale, Director of the National Digital Library, and of course many others who were instrumental in accomplishing all of this work. For my part, it’s hard to believe that it was only a bit more than a year ago that I was talking with Penny and John in a Melbourne bar after participating in a National and State Libraries Australasia strategic planning meeting. They have much to celebrate, as do we, since they are doing much from which we can learn. I simply can’t wait to see what comes next.
I didn’t come to Austin expecting to get an archival jolt from a digital artists’ book. I’ve been at the Ransom Center this weekend attending a conference on literary archives and writers’ papers, “Creating a Usable Past.” I had never seen William Gibson’s 1992 artists’ book, one evidently well-known on the Internet. The cataloging notes say Agrippa has some photosensitive engravings and a disk holding the poem, “which may be displayed on a computer screen only once, and then is irretrievably encrypted.” Matt Kirschenbaum, professor at MITH, hacked the code of Agrippa and played it for us on a Mac emulator. Matt tells us his work will be up on the web in six weeks or so.
I was having something akin to Ted Bishop’s experience with the symptoms of archive fever. Ted is a Virginia Woolf scholar. In Riding with Rilke he describes the “jolt” of reading Woolf’s suicide letter. Yesterday morning the audience at the august Ransom Center was reading Agrippa on the big screen. The Mac emulator made it feel a bit like I was reading it in 1992. Back in 1992 I don’t think I knew what an artists’ book was.
Three of UT’s undergraduates have been blogging the conference at flairforarchives.
I’ve just read the minutes from a recent meeting of the Lot 49 group, which was formed to address issues related to moving image digitization. [Here's a link to notes about the inaugural meeting in July 2007.] The need to be in Dublin, OH last week precluded my being there, but reading the minutes has led me to reflect on how motion and sound fit into Jen’s and my diatribe, Shifting Gears: Gearing Up to Get Into the Flow (about digitizing special collections for access).
Our major premise is that, in cases where we will preserve the original, we ought to think about digitization for access rather than for preservation. In this way, we can get more special collections digitized and accessible, thereby increasing the demand, and hopefully the funding, for our collections. The alternative, investing in time-consuming, expensive processes, risks special collections becoming marginalized amid the vast quantity of books online.
By using the phrase “special collections” we meant to draw attention to digitization of non-book materials, but we hadn’t given a lot of thought specifically to motion and sound. One way in which motion and sound are different than other non-book formats is that the delivery of access copies requires a significantly compressed file, usually sacrificing a lot of quality. Another difference is that the premise that we would most often be preserving the original doesn’t always apply to motion and sound media.
The first objective, always, is stabilization of the content, then provision of access. With motion and audio, sometimes the original is digital (e.g., much current audio) and we can derive an access copy from it. If the original is in a stable analog format (e.g., preservation-quality film), then we can digitize for access. If the original is unstable and needs to be reformatted, there are two possibilities: a) when the best option is to reformat onto another analog medium (e.g., going from nitrate to safety film), we would subsequently create a digital access copy, or b) when the best reformatting option is digital (e.g., going from magnetic tape to digital audio), we’ll want to retain all the quality possible when digitizing, and then derive an access copy.
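The branching logic above can be summarized in a short sketch (illustrative only; the labels and function name are mine, not a formal preservation workflow):

```python
def access_copy_path(original_format, stable, best_reformat=None):
    """Sketch of the decision logic for deriving access copies.

    original_format: "digital" or "analog"
    stable:          whether the original is in a stable format
    best_reformat:   "analog" or "digital", when reformatting is needed
    """
    if original_format == "digital":
        # e.g., much current audio: derive directly.
        return ["derive access copy from digital original"]
    if stable:
        # e.g., preservation-quality film: digitize for access only.
        return ["digitize for access"]
    if best_reformat == "analog":
        # e.g., nitrate to safety film, then a digital access copy.
        return ["reformat to stable analog medium",
                "create digital access copy"]
    # e.g., magnetic tape to digital audio: keep all the quality you can.
    return ["digitize at highest feasible quality",
            "derive access copy"]
```

The sketch makes the underlying principle easy to see: stabilization always comes first, and the access copy is derived from whatever the stabilized form turns out to be.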
But let’s not get ahead of ourselves, a lot of motion and sound in our collections hasn’t even been cataloged. [Maybe the next round of CLIR/Mellon Hidden Collections grant funding should be inundated with proposals to describe hidden motion and sound collections.] Until we have a good sense of the nature and size of the problem, we won’t be effective in addressing it. [And if you have any ideas about how to survey backlogs, get in touch with Merrilee, who is launching a project to assess archival backlog survey methods.]
First describe ‘em, then stabilize ‘em, and then by all means, make them accessible.
Because it’s Friday (and because I have a cold!), this is just a round-up of bits I’ve been meaning to blog about. They are piling up, and I figure it’s better to get out even a little bit on each rather than try to find the time to blog about each one in depth.
Jennifer and I did our webinar yesterday (Assessing the impact of special collections) — I blogged about this earlier this month. I wanted to let you know that our slides are in Slideshare. The webinar itself will be posted later in the month, after some vacations. I’ll have more to say about the discussion, and the results of the poll later. For those of you who took the poll, thanks very much!
There was an interesting story on NPR about reCAPTCHA. Last summer, I blogged about our use of reCAPTCHA for validating the comments on this blog. Over time, these small efforts have added up: something like 1.3 billion words, enough text to fill more than 17,600 books. Beneficiaries have been the New York Times and the Internet Archive / Open Content Alliance. So comment away; you are helping to do great things.
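As a quick sanity check on those figures (illustrative arithmetic only, using the numbers quoted above):

```python
# Figures quoted in the NPR story as relayed above.
total_words = 1_300_000_000   # ~1.3 billion words transcribed via reCAPTCHA
books = 17_600                # "more than 17,600 books"

# Implied average book length, which comes out to roughly 74,000 words,
# consistent with the length of a typical book.
words_per_book = total_words / books
```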
Finally, I was sad to hear that the Patry Copyright Blog was shutting down, but even more disturbed that it subsequently vanished. Fortunately, the “voice of the people” was heeded, and the blog was mostly restored. It’s a reminder of the fleeting nature of information on the web and the importance of preserving valuable resources.
For the last 5 years, my friend Mary Elings from the Bancroft Library and I have made a trip to upstate New York in the summer to teach a one-week intensive 3-credit graduate class for the iSchool at Syracuse University (IST677: Digitization in libraries, archives and museums). Every year, preparing for and teaching the class invites us to pause and reflect on how much has changed in just 12 months in the field of digitizing collections. While the main pillars in the outline of our from-the-cradle-to-the-grave syllabus remain, what we say about each topic changes considerably from year to year. Not surprisingly, we increasingly feel that we need both to impart current practice and to emphasize new thinking in the field which challenges business as usual.
Luckily enough, Mary and I as a tag-team are well poised to take on that challenge. Mary predominantly reflects the local point of view of the professional who has to get things done in the here-and-now, while I predominantly reflect a global point of view of somebody who can afford to think about how things should adapt to the realities of our networked information economy. When we quibble, it isn’t just a disagreement, but an educational moment illustrating the times we live in; when it all comes together, it should read like the old bumper sticker as an exhortation to “Think global, act local,” as Mary pointed out to the students.
What follows is an impressionistic glance at the areas where I see a shift in what we talk about in class; in most instances, these are trends which I think will become more pronounced as we continue to teach the class.
In this way, the class acknowledges that established work processes are likely to continue while we explore new ways of serving our audiences. It is a balancing act professionals must manage, whether new to or veterans of the library profession. By bringing these issues to our class, we hope to encourage our students to think of libraries, archives and museums as a field rife with possibilities for those with creative minds!
We’re still on the LAM! At this point, we have received guidance from thought leaders, conducted phone conversations with interested RLG program partners, and visited 5 sites to hold comprehensive library, archive and museum workshops with participants from all constituencies.
Our site visits were (in chronological order) at the Smithsonian and Yale (workshop blog), the Victoria & Albert Museum, U of Edinburgh and Princeton. These campus (or campus-like) organizations all harbor various libraries, archives and museums, and are at various stages of collaboration (all the way to administrative integration). The workshops were aimed both at surfacing existing models and at deepening the working relationships among the different units. We did not confine ourselves to digital issues, but allowed participants to take the discussion of collaboration in whatever direction they felt was most fruitful, including brick-and-mortar considerations.
While we are in the very beginning stages of work on the final report, here are a few representative findings which I think you may find reflected in it:
We’ll also report out on the projects the sites have committed themselves to as an outcome of the workshops. I don’t want to let the cat out of the bag too much (although an earlier LAM posting contains some project details), so you’ll have to wait for the report for the full picture! However, it’s interesting to note that there was a remarkable similarity in the overall ambitions articulated at the workshop sites.
At most institutions, we found that a single search across all institutional resources, both for the benefit of the public and the staff, is a major aspiration (and inspired Ricky’s recent post on cross-collection searching), closely followed by a sense that a more compelling body of digitized material needs to be provided, as well as the means of managing those materials for the long term in a pan-institutional trusted digital repository. Most of the sites also grappled with questions of how to better harness user knowledge and contributions, as well as the place of LAM collections in an information landscape dominated by online search engines and social networking sites.
I hope this little teaser posting gives you a good idea of what sorts of insights you can expect to glean from the forthcoming report, which will be written by our consultant Diane Zorich and should be posted here as part of the PAR report series in early August. We’ll also soon make available the agenda from the day-long workshop as well as the scene-setting PowerPoint presentation we used. We hope other institutions may be tempted to hold their own workshops, inspired by the successful template we’ve developed.