There is a well-known fable about blind men with contrasting views on the anatomy of an elephant, each having examined a separate piece of the beast and independently concluded that it is very like a spear, or a fan, or a snake, etc. Even in combination their observations fail to provide a very good picture of what an elephant looks like as a whole. The story was popularized in a poem by John Godfrey Saxe which is cited in a surprisingly wide variety of publications, from early childhood education manuals, to scientific and medical reports, to vocational guides and, more predictably, collections of nineteenth-century verse. I know this because a search in the HathiTrust digital library on a distinctive phrase from the poem’s conclusion, “prate about an elephant not one of them has seen,” finds more than 140 matches in these places.
Blind searching in large digital text repositories like the HathiTrust or Google Books provides an intriguing but incomplete view of the mass-digitized book corpus. Frequently cited statistics like “12 million books” in Google Books, or “5 million books” and “one million public domain books” in Hathi, don’t really tell us much about the anatomy of the mammoth. Pat the elephant… what do you find? A lot of curious sensory experiences that don’t add up.
When it comes to anatomizing elephants, not all parts are created equal. Georges Cuvier, who famously reconstructed skeletons on the basis of a tooth or a toe, knew this. Cuvier confidently and correctly distinguished Indian and African elephant species based on characteristic differences in jawbones; he ‘discovered’ the woolly mammoth based on a close examination of incomplete fossil remains.
I’m inclined to think that counting books (or volumes) is about as useful in characterizing the mass-digitized corpus as counting vertebrae in the catacombs. It tells us something about how much is there, but not much about who, or what, is there.
Happily, there is an abundance of bibliographic metadata describing the content from which the mass-digitized corpus was sourced that can be used (like a fossilized tooth or a toe) to assign some generic, or I suppose specific, characteristics to the elephant in the room. Over the past year, OCLC Research has been working on a project with Hathi and some other interested libraries to begin characterizing the enormous, vaguely familiar (snake? spear? tree?) yet altogether revolutionary (woolly!) mammoth created through the digitization of legacy print collections.
We’ve posted some empirical data on the subject and library distribution of titles in the Hathi digital repository here.
I think it provides a useful complement to the enchanting and progressively revealing fan-dance of class numbers here.
More to come.