Archive for October, 2006

Mass Digitization and the Collective Collection

Monday, October 30th, 2006 by Constance

Barrie Howard, over at DLF, recently did us a good turn by helping to distribute a short survey that we compiled as part of our session on “Mass Digitization and the Collective Collection” at the upcoming Fall Forum in Boston. From the start, we felt this session should take the form of a moderated “Q & A”, rather than a panel discussion, chiefly because there are (still) so many unanswered questions about library partnerships in mass digitization swirling around — almost two years after the Google Book Search project got started — but also because there is never enough time for audience participation at these events.

Most of the current Google Library partners and many of the contributors to the Open Content Alliance have strong ties to RLG, so we’ve benefited from lots of interactions with institutions that have jumped headlong into these projects. It wasn’t hard to come up with a starter set of questions for our panelists (esteemed colleagues from the California Digital Library, University of Toronto, New York Public Library and JISC) — indeed, most of the really important questions (are we building a universal library?) have been framed repeatedly, if only as rhetorical sallies. What has been difficult is prioritizing the questions for a 90-minute session; suddenly, our “starter set” of 10 tough questions was far too long, and the criteria for selecting the really critical ones, far too subjective. In the end, we decided to let the vox populi have the last word. In this embattled election season, we put our questions out for a vote.

The responses have just started to come in, but already a distinct pattern has emerged. Of the twelve questions on our short-list, four keep rising to the top. One concerns network coordination of mass digitization — is it possible? should it be regional? national? multi-national? Another concerns responsibility for preserving the outputs of mass digitization as an aggregate collection. The question that interests me most right now focuses on the rights and privileges that inhere in library-contributed content — have we collectively secured the rights necessary to ensure that scholarly use of these collections will be possible? A recent posting by Dan Hazen suggests that a collective effort to track and manage the aggregate outputs of mass digitization would serve us well. (What ho, RDM?) Dan’s posting and our early survey results simply confirm what Lorcan has been saying about the collective collection for a year or more: collaboration is the name of the game.

ICDAT Taiwan wrap-up

Monday, October 30th, 2006 by Günter

The aforementioned Jerry and I spent a good bit of time together in Taipei: after long days at the conference, we’d unwind by visiting Longshan Temple, or cruising one of the famous nightmarkets in Taipei (the picture above is from Shilin). We also served together on the wrap-up panel of the conference, which sparked some interesting discussions. I asked the audience how their library, archive and museum community interacted in the all-inclusive NDAP project, and by and large, the answer seemed to be that the approach of creating collaborations around content types (similar to what I suggested in my talk) really worked to rally the different domains around a common goal. Currently, NDAP has 13 “themes” (what I’d call strategic aggregations based on content type), 12 of which are listed here. Number 13 would be their digital video project.

When our moderator Simon C. Lin (Academia Sinica) turned the tables on the panelists (mainly from Europe and the US), asking us to comment on what we thought of the state of NDAP, the assembled Taiwanese attendees must have beamed with pride — Wayne Hodgins (AutoDesk) encouraged NDAP to take an international leadership role in discussions around digital libraries, while Jerry commented that he’d like to see research in areas such as descriptive metadata for archaeological digs (see the talk by Yu-Yun Lin) feed back into international standards efforts. I lent my voice as well, complimenting NDAP on their exemplary effort, and expressed the hope that after learning so much about their access strategy, we’d soon hear more about their thoughts on long-term digital retention of the massive amounts of assets generated (a nod to Jerry’s talk about METS and PREMIS). I think we all felt humbled by what we had learned from and about them, and could only hope that they felt the same about the lot of us whom they’d flown in from the other side of the planet!

It appears that the next major phase of their project will include creating more international collaborations, and if you happen to be at MCN Pasadena in November, you’ll be able to learn more from the NDAP delegation itself during the Taiwan Special Interest Group meeting.

More on ICDAT 2006

Friday, October 27th, 2006 by Günter

Jerry McDonough, another invited speaker at the ICDAT, kindly commented that my talk sparked three new ideas for research projects in his mind. I highlighted what I like to call “parallel descriptive technologies” in libraries, archives and museums – each community can now claim to have created a complete suite of standards for describing and disseminating content. However, if you look closely, what has really happened is that each community has defined an optimal way of describing one specific type of content (objects of material culture, bibliographic materials, archival collections).

The main argument I advanced was that rather than think about descriptive practice as confined to certain types of institutions, we should think of descriptive practice as guided by the materials at hand. People would think it rather odd if a museum used CDWA / CCO to describe the books in its library, while nobody takes offense if a library uses MARC (or its more XML-savvy sidekick MODS) / AACR2 (RAD) to describe objects of material culture. If we want to build more cohesive aggregations of content, I’d submit that libraries, archives and museums will have to agree on the same suite of standards for the same types of materials. The cohesion achieved through this discipline would also serve users well. It has worked for books – now let’s make it work for objects of material culture.

I also argued that the main sticking point in all of this isn’t data structures such as CDWA or MARC, but data content standards and vocabularies. One reasonable data structure can be mapped to another reasonable data structure, but crosswalks don’t achieve interoperability if the parties involved use different conventions for arriving at data content such as personal names or dates, or different controlled vocabularies to tell them whether the object in question is an “andiron” or a “firedog” (to use a time-honored example remembered from a Murtha Baca talk). And, to come back to Jerry, here’s where one of his research ideas comes into play: he contended that since data content standards such as CCO, AACR2 (RAD) and DACS are rules-based, the output created by applying them should be susceptible to computational transformation. An intriguing idea, don’t you think? And now remind me: what were your other two research ideas, Jerry?
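The two-layer problem described above — structural crosswalks versus data content conventions — can be sketched in a few lines of code. This is a toy illustration only: the field names, the crosswalk table and the two-entry stand-in for a controlled vocabulary like the AAT are all invented for the example, not drawn from any real standard.

```python
# Toy illustration: a structural crosswalk alone doesn't buy interoperability.
# Two descriptions of the same object, from systems with different data
# structures AND different vocabulary conventions (all field names invented).
museum_record = {"objectName": "firedog", "creationDate": "ca. 1750"}
library_record = {"245a": "Andiron", "260c": "[1750?]"}

# Layer 1: structural crosswalk -- maps field names to a common schema.
CROSSWALK = {"objectName": "title", "245a": "title",
             "creationDate": "date", "260c": "date"}

def crosswalk(record):
    """Rename fields according to the crosswalk table."""
    return {CROSSWALK[k]: v for k, v in record.items()}

a, b = crosswalk(museum_record), crosswalk(library_record)
# The fields now line up, but the values still disagree:
assert a["title"] != b["title"].lower()

# Layer 2: shared data content rules -- normalize values against a
# controlled vocabulary (a two-entry stand-in for something like the AAT).
PREFERRED_TERM = {"firedog": "andiron", "andiron": "andiron"}

def normalize(record):
    """Replace the title with the vocabulary's preferred term."""
    out = dict(record)
    out["title"] = PREFERRED_TERM.get(out["title"].lower(), out["title"])
    return out

# Only after both layers do the two descriptions converge:
assert normalize(a)["title"] == normalize(b)["title"] == "andiron"
```

The point of the sketch is that the crosswalk step is mechanical, while the vocabulary step requires agreement on content rules — which is exactly why mapping CDWA to MARC is the easy part.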

Buying our way out of the copyright dilemma?

Thursday, October 26th, 2006 by Merrilee

A post on BoingBoing led me to this email from Wikipedia founder Jimmy Wales. In it, Wales asks for:

…some examples of works you would like to see made free, works that we are not doing a good job of generating free replacements for, works that could in theory be purchased and freed.

Dream big. Imagine there existed a budget of $100 million to purchase copyrights to be made available under a free license. What would you like to see purchased and released under a free license?

The ensuing conversation gets pretty free-ranging, with people drifting off-topic to say what else they’d like to do with $100M (say, digitizing all out-of-copyright works, or creating and maintaining specialized encyclopedias). Then there’s a backlash — why would we reward copyright holders with money, when they are in essence locking up intellectual heritage? Why not use the money to fund lobbying efforts to change copyright law instead? And so on.

The question as initially posed is very interesting. If this opportunity were given to the research library community, and if we had a very short time frame in which to answer, how would we respond, and what criteria would we use? Circulation data? Holdings data? Audience level? Has the work been digitized?

Copyright was identified as one of the top challenges for discovery and delivery in our RLG Partners workshop in September, and will most likely be discussed in our upcoming symposium in March.

Summary notes from the workshop are now up, as is a draft agenda for the symposium (still subject to change, based on input from a program committee!).

ICDAT 2006 Taipei, Taiwan

Wednesday, October 25th, 2006 by Günter

Having recently returned from the International Conference on Digital Archive Technologies in Taiwan, I am still struck by both the event and the location. A note about Taipei: it has shopping malls of a size and number to make any upstanding American city greenback-green with envy, and they contrast rather strikingly with the Buddhist, Taoist and Confucian temples sprouting out of every nook and cranny of the city. Add to that the diversity of Asian cuisines offered up for sampling in upscale venues as well as on the street of the legendary Taipei nightmarkets, and a very hearty first introduction to Asia (as this was for me) is achieved.

More to the point of the event and the topic of this blog: if you don’t turn green with envy at the mention of massive shopping malls (and bookstores, may I add!), you’ll probably turn green once you learn more about the National Digital Archives Program (NDAP) of Taiwan, the organizing body of the conference. Through massive funding from the Taiwanese government (US$10–20 million per year from 2002 on), NDAP has been able to solder together the nation’s major libraries, archives and museums into a reasonably unified digitization machine. I think in the US we can only dream of a similarly focused vision of bringing together all digitized cultural resources under one description and access framework.

The sophistication of their approach to descriptive practice impressed me during a talk [pdf] given by Shu-Jiun (Sophy) Chen at MCN 2005, and in Taiwan I learned that Sophy’s team (Metadata Architecture & Application Team) is now extending its reach into the sphere of Learning Objects. NDAP has also made great strides in digitizing video, as witnessed by an introduction to their research and tool-building efforts by Chih-Yi Chiu. As a matter of fact, digital video turned out to be one of the focal points of the conference. Richard Wright from the BBC eloquently described the conundrum audio-visual archives face: even as they turn to digitization to escape the deterioration of analog tape, they now find themselves in an even tighter race against obsolescence with their digital content. According to Wright, analog formats were dependable for a few decades, while the first generation of digital video had a life-span of 10 years at the BBC. Another invited speaker, Pasquale Savino (National Research Council of Italy), introduced the European project ECHO to provide access to historical documentary films, with interesting insights into techniques for automated indexing of key frames, object recognition, text extraction from audio and subtitles, etc., as well as a data model [pdf] which makes use of FRBR to describe audio-visual materials.

Another cluster of talks focused on web archiving. Most fascinating: the keynote by Hsinchun Chen (Artificial Intelligence Lab, U of Arizona) described how he uses web archiving techniques to document the activities of terrorists. Striking fact: according to Chen, 80% of anti-terrorism intelligence information is public, i.e. available in chat-rooms, on websites, in online videos, etc. The tools created by the lab enable gathering the information on websites (reasonably dynamic) as well as forums (extremely dynamic), plus the subsequent automatic statistical analysis of the content, for example to correlate the level of aggression in language with the actual real-life violence produced by a given group. The Artificial Intelligence Lab has gathered 1.2 terabytes of information on terrorist groups using these techniques.

There’s more to tell, but I’ve also got a full inbox of e-mail to read! I’ll write a little more about the conference later on in the week…

It’s a bird, it’s a satellite, it’s …

Monday, October 23rd, 2006 by Constance

. . . another high-flying digitization effort: Alouette Canada. The project, which officially launched in June 2006, recently gained a project director, the affable Mr. Brian Bell, former chair of the Canadian Initiative on Digital Libraries (CIDL). I had a long and rewarding conversation with Brian last week, in which he revealed the inner workings of the marvelous machine that will keep this distributed digital library aloft. Alouette will build on the success of the Our Ontario prototype, which provides cross-collection searching of digital content from libraries, archives and museums throughout the province. The Alouette prototype (still under wraps) includes some of the best-loved features of services like Flickr — enabling user interaction with content — as well as dynamic subject-indexing in word clouds and even the increasingly popular Google map mash-up.

Billed as an “open digitization initiative,” Alouette shares some common features with the Open Content Alliance (OCA) — not altogether surprising, since founding members of the Canadian project (including the Universities of Toronto and Alberta) are also contributing content to the OCA’s open library. Both projects are heating up this month, right on the heels of announcements from Google Book Search and (separately) Cornell University. Members of the OCA met last Friday in San Francisco to discuss the initiative’s progress to date. Peter Brantley at CDL has helpfully posted presentations from the meeting, including one by my new OCLC colleague Bill Carney. Bill’s working on a nifty project to synchronize data-flows from the OCA with WorldCat to ensure that all that wonderful digitized content benefits from the discovery environment of WorldCat. I’d love to think that we could get the shimmering content from Alouette — also the name of a lovely river in BC — into the flow (a trope long-favored by my new boss, Lorcan Dempsey).

What really sets Alouette apart from other large-scale efforts in the mass digitization arena, I think, is its commitment to enabling smaller, specialized research collections (like historical societies and museums) to participate in the virtual land-rush and secure a little habitat of their own. I suspect this impulse is deeply rooted in the origins of the Ontario Digital Library, which has a good deal in common with community information and referral initiatives like 2-1-1.

Will Alouette Canada generate the same kind of excitement and national pride as its space-age namesake? Will it achieve its vision of “harness[ing] the will and energy of every library, archive, gallery, museum, historical society or institute of record to create a comprehensive collection of digital resources for the benefit of its citizens”? Only time will tell — in the meantime, I’ll be keeping my eyes trained on the night skies, hoping for a ray of Northern light.

Web archiving for the election, anyone?

Thursday, October 19th, 2006 by Merrilee

I recently got a notice from our friends at the Internet Archive that their Election 2006 web archive is open for business. You can nominate a web site (or sites) that you think should be crawled in order to help document the upcoming election. Go to the Internet Archive’s site (and scroll down a little; the bit on the election crawl is “below the fold,” at least for me). If your preparations to vote involve doing a little online research, take a few moments to throw some URLs over the fence to the Internet Archive.

The results will be available via Archive-It in a few weeks.

Non-English Access Recommendations – Whacha think?

Wednesday, October 18th, 2006 by Karen

ALA’s Association for Library Collections and Technical Services (ALCTS) Task Force on non-English Access issued its report and recommendations on October 10 for public comment. I enjoyed working with many long-time colleagues from RLG partner institutions and area studies groups. Isn’t it a timely report when, as Merrilee mentioned, the US just passed the 300 million population mark, attributed to the rise in immigrants? And those immigrants read non-English language materials. (Hey! In my hometown of San Francisco, 45% of all residents speak a language other than English at home, including mine.)

Comments are requested by December 1. No comments on either the report or the recommendations, so far. How about you? Aren’t we all interested in internationalization and globalization and outreach to the worldwide community? Doesn’t that involve non-English access?

If you don’t have the time to read all 70 pages of the report (or the few I contributed), please read the executive summary and recommendations and submit some comments.

A day in the life…

Monday, October 16th, 2006 by Merrilee

Two interesting projects, both (mostly) taking place outside of “our” community.

The Yahoo! Time Capsule: Between October 10 and November 8, people can submit photos, writings, etc. to document the 30-day time period. I took a quick peek at the site today, and the thing that struck me was how international it is — most of the photos that I looked at seemed to be from Central and South America, with some Asian postings as well, when I’d expected it to be mostly Americans posting and contributing.

When the project is completed, it will be “sealed and entrusted” (whatever that means) to the Smithsonian Folkways project. This will present an interesting challenge for digital preservation. I completely agree with Jeanne over at Spellbound that the statement (on the overview page) “This is the first time that digital data will be gathered and preserved for historical purposes,” is completely nutty. I don’t see this as at all the same as the efforts of the Internet Archive, but it’s a silly statement nonetheless.

No mention of American Archives Month (which Anne highlighted earlier this month) in relationship to the Yahoo! Time Capsule.

Tomorrow, the UK-based History Matters, One Day in History “mass blog” will go live — UK residents are invited to contribute blog entries to document a completely ordinary day. This collection will be contributed to the British Library, as part of the Web Archive in Modern British Collections. Since it will be part of the web archive, it’s a little clearer to me how the collection will be wrapped and stored (as ARC files, I would assume).

Incidentally, tomorrow the US population is supposed to reach 300 million people. Even though it’s impossible to tell who that person will be (an immigrant coming into the country or someone born tomorrow, who can say?), you can keep an eye on this and the world population at the U.S. Census’ Population Clocks page.

Archivist items

Wednesday, October 11th, 2006 by Merrilee

Kind of random, but I think these two go nicely together:

John Battelle reveals a desire to archive advertising.

Meanwhile, this cartoon refers to a new malady, “Archivaholism.”

Could the two be related?

Speaking of archives, there’s a newish blog maintained by Mark Matienzo called ArchivesBlog, “A collection of blogs by and for archivists.” Mark has been nice enough to include HangingTogether in the mix.