There’s been a lot of attention to geologic subsidence of late, what with all the sinkholes opening up in Florida, Louisiana and other places. Here in California, we are more often concerned with the gradual change in ground level due to the draining of aquifers that support large-scale farming. From year to year, the difference in ground level may be nearly imperceptible but over the space of a few decades the landscape has been radically transformed.
The subsidence metaphor was on my mind recently, as I was looking over some data compiled by my colleague Thom Hickey, documenting the usage of headings (subjects and names) in WorldCat. OCLC Research has done quite a lot of work exploring new approaches to managing subject and name authorities, notably in VIAF and FAST. I was interested to see how Thom’s data might be used to measure change — uplift and subsidence — in the library landscape. By computing the frequency with which FAST and VIAF headings occur in institutional collections cataloged in WorldCat, one can identify which libraries hold the most materials related to particular topics, places and people. And this in turn provides a measure of the relative distinctiveness of library collections, judged not in terms of the ‘rarity’ of holdings but rather by the concentration of related content.
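To make the idea concrete, here is a minimal sketch of that kind of tally. It is not Thom's actual method or data: the `holdings` records, the placeholder library names and the `rank_centers` helper are all hypothetical, and I am assuming each record simply pairs a holding library with one heading assigned to a title it holds.

```python
from collections import Counter, defaultdict

# Hypothetical toy records (not real WorldCat data): one (library, heading)
# pair per title held. Library names are placeholders.
holdings = [
    ("Library A", "Marine biology"),
    ("Library A", "Marine biology"),
    ("Library B", "Marine biology"),
    ("Library B", "Russian periodicals"),
    ("Library C", "Russian periodicals"),
    ("Library C", "Russian periodicals"),
    ("Library C", "Marine biology"),
]


def rank_centers(records):
    """Return {heading: [(library, title_count), ...]}, largest count first."""
    counts = defaultdict(Counter)
    for library, heading in records:
        # Tally how many titles each library holds under each heading.
        counts[heading][library] += 1
    # most_common() sorts each heading's libraries by descending count.
    return {heading: tally.most_common() for heading, tally in counts.items()}


centers = rank_centers(holdings)
for heading, ranking in centers.items():
    print(heading, ranking)
```

The first library listed under each heading is the 'center' for that heading in this toy sense: the collection with the greatest concentration of related titles, regardless of how rare any individual title is.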
It seemed to me that Thom’s data might have something interesting to say about how the emergence of large-scale digitized book aggregations (HathiTrust, Google Books, etc.) is altering the library environment. It stands to reason that as these large hubs begin to consolidate content sourced from libraries (and, in Google’s case, publishers), they will displace traditional library ‘centers of excellence’ in some subject areas. Those who remember the DLF Aquifer project will recall that the initial prototype was designed to pool digitized resources in a given subject area (initially American History, later narrowed to Abraham Lincoln and the US Civil War). In the very large aggregations of HathiTrust and Google Books, subject specialization has emerged more gradually. There has not been much public attention to measuring the scope of subject-based collections within those aggregations, nor to benchmarking them against existing institutional holdings.*
The FAST and VIAF ‘centers’ data provide evidence of both subsidence and uplift in the current collections environment: that is, shifts in centers of excellence as measured by the scope of subject-based holdings. The ‘re-leveling’ that has been wrought in just a few years of large-scale digitization is already significant. Digital aggregations have, by design or accident, emerged as important subject repositories that rival and even outrank some of the largest institutional libraries in WorldCat.
For instance, HathiTrust, an organization not yet five years old, already holds the greatest concentration of titles on the topic of marine biology, surpassing the Library of Congress as well as two major research universities with world-class oceanography programs.
In the case of marine biology, the difference between the number of titles held by HathiTrust and the Library of Congress is not very large: fewer than 200 titles. But in other instances, the relative subsidence of traditional centers of excellence is more dramatic. For instance, Google Books substantially outranks several major research libraries in holdings related to Russian periodicals (journals, newspapers and the like).
This represents an important change in the library system, with monumental old hubs being progressively overshadowed by new collections that are produced not by the slow accretion of library acquisitions but by large-scale digitization and (re)aggregation. It provides a compelling illustration of how Web-scale content aggregations are altering the library operating environment. In the case of HathiTrust especially, this disruption can (and I think should) be seen as a positive change: it enables libraries to rethink traditional, institution-scale collection management and stewardship — a topic we examined in our Cloud-sourcing Research Collections report some years ago.
Using Thom’s ‘centers’ data, we can identify hundreds of topics and identities for which HathiTrust offers better coverage than any other library in WorldCat. Here are a few topics in which the Digital Library distinguishes itself:
And a few of the personal names for which its coverage is unrivaled:
Interestingly, the other top-ranked collections (by size) for these same subjects and identities are not always the source of HathiTrust’s richness. One might have anticipated that Hathi’s leadership was simply a by-product of aggregating content from existing centers of excellence, but in fact Hathi has developed unexpected strengths by aggregating at a very large scale from a diverse pool of contributors. For example, Harvard University and the University of Michigan each hold sizable collections of works by the poet Jean Ingelow; yet, the richness of Hathi’s Ingelow collection is mostly due to contributions from campus libraries in the University of California system.
The FAST and VIAF ‘centers’ data provide a fascinating new vantage point on the changing collections landscape. We’ll be looking at ways to integrate it into ongoing research projects, including the mega-regions work, where we hope it can help us detect regional collecting trends that might inform shared stewardship priorities.
*Note: HathiTrust provides nice visualizations and a list of subject areas in the Digital Library, based on Library of Congress classification numbers. These provide a good overview of subject-based coverage but without reference to comparable coverage in other libraries. It is generally known that Google is selective in choosing library partners, but I’m not aware of any public documentation related to a specific collection development strategy. Their aim, famously, is to provide comprehensive coverage of the world’s books, not to develop excellence in any given subject area.