The most thought-provoking presentation I attended at ALA Annual was given by Candi Yano, a professor of operations research at UC Berkeley. Yano was reporting to the ALCTS Chief Collection Development Officers on the results of a study commissioned by Ithaka to determine the “optimal overlap” of library holdings to assure long-term preservation of print collections in a range of different circumstances (circulating collections vs. dark archives) and according to variable time and loss-probability thresholds. It was an impressive piece of work, and I’m sure the final version of the research article will be widely cited, after (sigh) it works its way through the labyrinth of peer review and journal publication cycles.
Here’s a snapshot of the minimum number of copies required to ensure preservation of at least one copy over a fifty- or one-hundred-year time horizon (snipped from a handout circulated at the meeting):
Long story short: Yano calculated that as many as 15 copies of any given title would need to be retained to achieve near certainty of preservation (of at least one copy) over a one hundred year period, given an annual (accidental) loss rate of .005. Half of one percent seems like a pretty low loss rate, but one can easily see that over a century it would amount to perceptible erosion, especially in the largest collections. If you double the probable loss rate to .010 — the rate that UC Berkeley uses for insurance purposes — the required number of copies jumps to 30. In a worst case scenario, with a high failure rate (.050), a two hundred year time horizon, and a relatively low probability of single-copy preservation (.999), the optimal overlap rises to a staggering 197,064 copies.
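Yano’s exact model wasn’t spelled out in the handout, but the figures are consistent with a simple independent-loss calculation: if each copy is lost with probability p in any given year, a single copy survives T years with probability (1 − p)^T, and the chance that all n copies are lost is (1 − (1 − p)^T)^n. A minimal sketch in Python, assuming independent losses and a constant annual loss rate (the function name and the failure-probability threshold are my own; with a tolerated failure probability of one in a million, it reproduces the 15-copy figure, and the worst-case scenario lands within a copy or so of 197,064, depending on rounding conventions):

```python
import math

def copies_needed(annual_loss_rate, years, max_failure_prob):
    """Minimum number of independently held copies so that the probability
    of losing every copy within `years` is at most `max_failure_prob`.

    Assumes each copy is lost independently, with the same constant
    annual loss rate, and that losses are never replaced.
    """
    # Probability that one copy survives the whole horizon.
    p_survives = (1 - annual_loss_rate) ** years
    # Probability that one copy is lost at some point during the horizon.
    p_lost = 1 - p_survives
    # Solve p_lost ** n <= max_failure_prob for the smallest integer n.
    return math.ceil(math.log(max_failure_prob) / math.log(p_lost))

# The .005 annual loss rate over one hundred years, tolerating a
# one-in-a-million chance of total loss:
print(copies_needed(0.005, 100, 1e-6))   # → 15
# Doubling the loss rate to .010 roughly doubles the requirement.
print(copies_needed(0.010, 100, 1e-6))
# Worst case: .050 loss rate, two hundred years, .999 preservation target.
print(copies_needed(0.050, 200, 0.001))  # on the order of 197,000 copies
```

The striking nonlinearity in the worst case comes from the exponent: at a 5% annual loss rate, a single copy has only about a 1-in-28,500 chance of surviving two centuries, so each additional copy buys very little additional assurance.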
Yano reasonably proposed that a hybrid preservation approach could be used to lower the threshold of recommended duplication; if a small number of copies are held in a dark archive, for example, the number of duplicate copies that must be retained in circulating collections to achieve a robust preservation guarantee is significantly smaller. Thus, if just 2 copies were removed from circulation and committed to optimal storage environments, a further 10 copies held elsewhere in the library system would be sufficient to ensure survival of at least one copy, one hundred years hence.
Yano presented an elegant matrix in which duplication thresholds were set against a sliding scale of probable loss rates: in a “lossy” circulation environment, one would need to increase the total deposits in dark archives; if the loss rate in the archives is high, one would need to retain many more copies in the aggregate circulating collection. If the loss rates are relatively high in both environments (Yano’s probable loss rates maxed out at .005 for the “locked” copies and .015 for titles in general circulation), the required number of backup copies could be as high as 36, again provided that 2 copies are secured in a dark archive.
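The hybrid logic extends the same independent-loss model to two pools with different loss rates: total loss requires every archived copy and every circulating copy to fail, so the two failure probabilities multiply. A sketch under the same assumptions as before (the dark-archive loss rate of .001 in the example is my own hypothetical, since the handout excerpt didn’t state one; with it, the function happens to reproduce the 2-plus-10 scenario above):

```python
import math

def circulating_copies_needed(archive_copies, archive_rate,
                              circ_rate, years, max_failure_prob):
    """Circulating copies required, given `archive_copies` held in a dark
    archive, so that the probability of losing every copy within `years`
    is at most `max_failure_prob`. Assumes independent losses in both pools.
    """
    # Probability that every archived copy is lost over the horizon.
    fail_archive = (1 - (1 - archive_rate) ** years) ** archive_copies
    if fail_archive <= max_failure_prob:
        return 0  # the archive alone meets the target
    # Probability that one circulating copy is lost over the horizon.
    fail_circ_copy = 1 - (1 - circ_rate) ** years
    # Solve fail_archive * fail_circ_copy ** n <= max_failure_prob.
    return math.ceil(math.log(max_failure_prob / fail_archive)
                     / math.log(fail_circ_copy))

# 2 dark-archive copies at a hypothetical .001 loss rate, backed by
# circulating copies at .005, over one hundred years:
print(circulating_copies_needed(2, 0.001, 0.005, 100, 1e-6))  # → 10
# With no archived copies, the requirement reverts to 15.
print(circulating_copies_needed(0, 0.001, 0.005, 100, 1e-6))  # → 15
```

This makes the trade-off in Yano’s matrix concrete: because the failure probabilities multiply, a couple of well-protected copies do the work of many loosely held ones, while raising either pool’s loss rate pushes the burden onto the other pool.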
This is an excerpt from the matrix of required overlap in a hybrid preservation environment:
The smart people in the CCDO assembly raised all kinds of challenging questions about these findings, many of them focused on the primary motivation behind print preservation: are we mostly concerned about preserving print collections to enable future use of content in its original format, or ensuring that we can re-build digital archives like JSTOR? University of California staff were quick to point out that the costs associated with validating dark archive deposits are staggering, especially for long serial runs – how reasonable is it to assume that the community can support this kind of activity at scale? (UC and Harvard currently provide dark archives for the JSTOR journal back-file content.) Further, the statistical model is built on the assumption that “back up” copies in circulation will have been pre-validated as suitable replacements for the archived titles – a practice that would entail further direct costs for holding institutions, without any tangible benefit. One can imagine schemes (exchange payments, sponsorship, reciprocal cost avoidance programs) that would support this distributed and hybrid approach to print preservation, but the kinds of cooperative agreements that would be necessary to sustain them will require a fundamental re-assessment of institutional access and ownership claims.
Near the close of the CCDO session, Vicky Reich – self-proclaimed “LOCKSS lady” – strode to the microphone and voiced her thanks to Ithaka for sponsoring this work and to Yano for executing it. I think everyone in the room shared her admiration for Yano’s analysis, though I suspect few felt as personally vindicated by its findings. To me, the implications of this long-awaited work raised new concerns about the decreasing rates of duplication in system-wide library holdings and the inherent fragility of a system that relies on tacit assumptions about institutional retention and access commitments. At last reckoning, as much as 40% of books in the WorldCat database represent titles held by a single institution. The average number of holdings per record (for books) stands at about 13. Of course, in some cases, a single holding may include many copies. An ongoing study by my colleague Ed O’Neill finds that the average level of duplication, measured at the item or copy level, is fewer than 5 copies for a major statewide academic book collection.
In sum, the evidence in hand suggests that there is substantially less duplication in aggregate holdings than is required to achieve the preservation guarantees obtained in Yano’s model. Given unrelenting space pressures on library print collections, and decreasing circulation rates, it seems imperative that libraries – research libraries, in particular – take immediate action to establish a common understanding of our respective (and collective) preservation goals and identify the core requirements for managing this highly distributed, thinly duplicated resource as a single, shared collection. This is an area where RLG partners are poised to take action.
Kudos to Ithaka (Roger Schonfeld, in particular) and Professor Yano for making us all smarter about how probability pertains to distributed collection management. Their presentation at ALA was one of the best attempts I’ve seen to model aggregate library holdings as a system.