Interesting ideas do not always a project make

A little over a year ago, I inherited a project that didn’t have much more than a name: “Explore and understand the place of large digital text aggregations in scholarship and research.”

I had several discussions with my colleagues about what this project might turn out to be. We had several ideas:

– Create a shared understanding of the expectations that researchers and students bring to their interactions with large-scale text aggregations on the web and the requirements for making these collections fit for scholarly use.

– Convene an invitational meeting of those already engaged in large-scale digitization efforts to establish a common understanding of scholarly use cases and the core requirements for library-sourced research services.

– Identify service capabilities (bookmarking, annotation, citation management, etc.) that are required to support scholarly use of text aggregations.

– Assemble a text archive for prototyping and analysis.

– Investigate the needs of scholars (via focus groups?).

– Experiment with the metadata we get from OCLC’s e-Content Synchronization service to see how we can characterize the contents of book aggregations.

– Experiment with full-text functionality we might be able to offer a) on a specific aggregation and b) across aggregations.

What we were exploring went beyond finding and using a single document. It was about identifying works from many silos to incorporate into a local environment. And it was about performing actions against an index (or multiple indexes) of aggregated digitized works. We could investigate how scholars would work with the range of book text archives, starting with use case scenarios for the types of queries they might run: linguistic analysis, lexical frequency, translation studies, edition comparisons, the occurrence of geographic place names in fiction, and the coincidence of events (for example, exploring how a race riot affected neighborhood population dynamics).

I proposed we develop the capacity to mine text archives in order to characterize them. Could we make an Identities-like representation of an archive: what time periods are represented, geographic coverage, languages, topics covered, publication date ranges, how complete it is, and maybe its overlap with other aggregations? We could also consider data mining tools for scholars (metasearch-like tools that would know how to form various types of queries for each aggregation).

If we had records for all these books in WorldCat, what could we do? See what has been digitized system-wide? Could we use these records in an attempt to assess the impact on originals vis-à-vis use of the digital versions? Could we FRBRize to cluster editions?

What about helping libraries determine what they have to contribute? For a given institution: what can I contribute (to Google, to OCA) that hasn’t already been scanned (compare the e-Content Synchronization records to my holdings, maybe including some copyright metrics), or what can I, and I alone, uniquely contribute? What things in my collections weren’t scanned (say by Google) due to overlap (and therefore aren’t in my copy of digital texts, but can still be accessed and potentially obtained)?
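To make the holdings comparison a little more concrete, here is a minimal sketch of the idea, assuming each title can be reduced to a single shared identifier. The identifiers, function names, and sample data below are all made up for illustration; they are not drawn from the e-Content Synchronization service or from any real WorldCat interface.

```python
def contribution_candidates(my_holdings, already_digitized):
    """Titles I hold that do not yet appear in any digitized aggregation."""
    return set(my_holdings) - set(already_digitized)


def unique_contributions(my_holdings, already_digitized, other_holdings):
    """Undigitized titles that none of the other compared libraries hold."""
    held_elsewhere = set().union(*other_holdings)
    return contribution_candidates(my_holdings, already_digitized) - held_elsewhere


# Hypothetical record identifiers, invented for this example.
my_holdings = {"rec001", "rec002", "rec003", "rec004"}
already_digitized = {"rec002", "rec099"}
other_holdings = [{"rec001", "rec002"}, {"rec004", "rec050"}]

candidates = contribution_candidates(my_holdings, already_digitized)
unique = unique_contributions(my_holdings, already_digitized, other_holdings)

print(sorted(candidates))
print(sorted(unique))
```

The hard part in practice would be matching records across sources (editions, reprints, inconsistent metadata), which this sketch sidesteps entirely by assuming clean identifiers.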

After consulting with many experts in the field, I was still scrambling to find a project. Each expert seemed to know of more projects in this area and sent me to the people involved in them, who knew of more projects. The more I learned, the more I realized the things I wanted to do were either already being done (or were impossible). There were no suggestions for needed activity. There was much interest, but no clear idea of what the RLG Partnership could do to make progress.

Due to OCLC e-Content Synchronization plans, which focus on the large book aggregations, and due to OCLC plans to incorporate content from library digitization projects in WorldCat, our focus was directed to the smaller text archives. These are existing archives of digitized books, often scholar-created, such as the Walt Whitman Archive, Women Writers Project, William Blake Archive, The Rossetti Archive, the Melville Electronic Library, the Perseus Library, and so forth. I researched a lot of these text archives. On many sites, it was hard to even tell how many books had been digitized. Some had no metadata, just a list of books. Some had nonfunctioning search functions. Some looked very neglected. Should we attempt to gather them all? Should we create profiles of the known digital book archives (collecting scope, number of titles, whether they offer images, texts, or markup, host, functionality, how to cite…)? What form would the profiles take? (Use the NISO collection description spec?) Where would the profiles go?

The scholars who create these archives are unlikely to catalog the books in them; how can we encourage librarians to ensure that there’s a representation of the digital books in WorldCat? And could we encourage the archives to provide permanent links to the books? This would offer a way for users to discover all books that are available online. Libraries could use these records to guide their ongoing selection for digitization and could link from their OPACs to digital representations.

At the 2008 RLG Annual Meeting we had a breakout session where we discussed the “Scholarly Use of Text Aggregation” program. While people had widely varying ideas of what this could include, many possibilities were discussed and dismissed, either because they were in areas over which we had no control or because others were addressing them. The attendees thought we should make sure that those who run eText centers and those who support scholarly use of digitized texts are kept aware of the resources coming out of Research and Grid services (like Identities, terminologies, VIAF). We were also tasked to track (via participating partners) Bamboo and similar efforts to see if there’s a right way and a right time for RLG contribution.

Here are some of the related efforts I identified:

– Project Bamboo – a multi-institutional, interdisciplinary, and inter-organizational effort that brings together researchers in arts and humanities, computer scientists, information scientists, librarians, and campus information technologists in a Mellon-funded project to “build shared services to support new scholarship” in the humanities.
– Brown University, Scholarly Technology Group – involvement in and engagement with faculty research projects in the digital humanities.
– HASTAC – A consortium of humanists, artists, scientists, and engineers, of leading researchers and nonprofit research institutions, HASTAC is committed to new forms of collaboration across communities and disciplines fostered by creative uses of technology.
– IMLS Next Generation Digital Federations: Adding Value through Collection Evaluation, Metadata Relations and Strategic Scaling (PI is Carole Palmer, UIUC)
– Michigan’s School of Information (Paul Conway) – NSF grant to focus on end-user assessments of large-scale digitized collections, specifically those built from special collections and archives.
– MITH (Maryland Institute for Technology in the Humanities), University of Maryland – visualization tool for digital text collections.
– Monk Consortium (Metadata Offer New Knowledge) – discovering and analyzing patterns in texts.
– The Nines at Miami University (OH) and COLLEX at UVA – enabling scholarly action on a distributed corpus of texts and images, plus infrastructure (tools for selecting, annotating, sharing).
– The Nora Project: Humanities Text Mining – The goal of the Nora project is to produce software for discovering, visualizing, and exploring significant patterns across large collections of full-text humanities resources in existing digital libraries.
– OCUL and JISC large-scale e-book usage studies
– Pathways: augmenting interoperability across scholarly repositories – Los Alamos and Cornell (Mellon, NSF)
– Stoa Consortium for Electronic Publication in the Humanities, University of Kentucky – dissemination of news and announcements, mainly via the gateway blog; discussion of best practices via discussion groups and white papers; and publication of experimental online projects. Linked closely to the Perseus project.
– Wordhoard – an application for the close reading and scholarly analysis of deeply tagged texts.

So we’ll be watching those good efforts — and watching for that right way and right time for us to be involved.

And we’re always open to interesting ideas….

One Comment on “Interesting ideas do not always a project make”

  1. Hi Ricky,

    I just saw your blog post the other day. I agree this is fascinating terrain, if poorly defined, and my research colleagues here at Ithaka and I are definitely keeping our eyes on this.

    A couple of other initiatives beyond the ones you mentioned in your post that might seem to be slightly related –

    JSTOR’s Data for Research program to expose the JSTOR corpus to text analysis (see

    The JISC/NEH/NSF/SSHRC program Digging into Data (see


