This is another installment in my continuing series of eclectic, peripatetic, and yes, let’s just say it: “pathetic” data investigations. The most recent identified the top countries of publication for WorldCat records. For whatever reason, I got it into my head to determine which English words appear the most in the main title of WorldCat items.
Clearly there are at least two ways to go about this: a) a formal, well-designed, highly replicable and ultimately near-perfect investigation, or b) a slapdash, fast, seat-of-your-pants investigation of questionable merit. When given such a choice, I find the latter completely irresistible. So I took part of my day today and did exactly that.
Since I already had code on our research cluster (affectionately named “Gravel”) that could extract a specific subfield, I powered it up and sucked out all of the 245 $a fields from WorldCat. As part of that process, I extracted only unique strings. The sharp ones among you have likely noticed a couple of flaws already: 1) I was too lazy to filter based on language, and 2) I was too careless to normalize the title strings.
Flaws have never stopped me before, so I blazed on as if nothing was amiss. Then I threw that monster file onto another computer where I didn’t have to worry about interfering with any of the actually useful work that my colleagues were doing on Gravel (you’re welcome). There I wrote a special-purpose Perl script to take each title string, split it into individual words, lowercase them, and count up the occurrences. I dabbled in creating a “stop-words” list of useless words like “a” and “an” and “and” and “the” (ad infinitum) but that quickly began looking like a rabbit hole. As I was only really interested in identifying the top 30 or so words, I figured my human eyeball would be sufficient to trap those in the end. Likewise with the foreign words.
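For the curious, the split, lowercase, and count routine can be sketched in a few lines. This is not the Perl script described above. It is a rough Python equivalent, and the function name, the sample titles, and the tiny stop-word list are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; the real list was never finished.
STOP_WORDS = {"a", "an", "and", "the", "of", "in"}


def count_title_words(titles, stop_words=STOP_WORDS):
    """Split each title into words, lowercase them, and tally occurrences."""
    counts = Counter()
    for title in titles:
        # Lowercase first, then grab runs of letters; an apostrophe is
        # allowed inside a word so "year's" stays in one piece.
        words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", title.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts


# Made-up sample titles standing in for the extracted 245 $a strings.
sample_titles = [
    "The history of England /",
    "A history of the world",
    "Report of the committee",
]

counts = count_title_words(sample_titles)
top = counts.most_common(30)  # the "top 30 or so" left for the human eyeball
```

A `Counter` keeps the tally in memory, which is fine for a sketch like this. Whether that holds up against every unique title in WorldCat depends on the machine you throw it at.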
That was really about it. Well, except for all the time I spent on Facebook waiting for the operations to complete. Did I say that out loud?
Anyway, without further ado (thank god) here are the top occurring meaningful English title words in WorldCat:
OK, now move along, nothing to see here.
Roy Tennant works on projects related to improving the technological infrastructure of libraries, museums, and archives.