This is another installment in my continuing series of eclectic, peripatetic, and yes, let’s just say it: “pathetic” data investigations. The most recent identified the top countries of publication for WorldCat records. For whatever reason, I got it into my head to determine which English words appear the most in the main title of WorldCat items.
Clearly there are at least two ways to go about this: a) a formal, well-designed, highly replicable and ultimately near-perfect investigation, or b) a slapdash, fast, seat-of-your-pants investigation of questionable merit. When given such a choice, I find the latter completely irresistible. So I took part of my day today and did exactly that.
Since I already had code on our research cluster affectionately named “Gravel” that could extract a specific subfield, I powered it up and sucked out all of the 245 $a fields from WorldCat. As part of that process, I extracted only unique strings. The sharp ones among you have likely noticed a couple of flaws already: 1) I was too lazy to filter based on language, and 2) I was too careless to normalize the title strings.
Flaws have never stopped me before, so I blazed on as if nothing was amiss. Then I threw that monster file onto another computer where I didn’t have to worry about interfering with any of the actually useful work that my colleagues were doing on Gravel (you’re welcome). There I wrote a special-purpose Perl script to take each title string, split it into individual words, lowercase them, and count up the occurrences. I dabbled in creating a “stop-words” list of useless words like “a” and “an” and “and” and “the” (ad infinitum) but that quickly began looking like a rabbit hole. As I was only really interested in identifying the top 30 or so words I figured my human eyeball would be sufficient to trap those in the end. Likewise with the foreign words.
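For the curious, the split-lowercase-count step looks roughly like the sketch below. To be clear, this is not the original Perl script: it is a Python stand-in, and the function name `count_title_words`, the four-word `STOP_WORDS` set, and the sample titles are all made up for illustration. It also assumes a "word" is just a run of letters, which is about as careless as the rest of this exercise.

```python
import re
from collections import Counter

# Hypothetical stop-word list: only the four examples named in the post.
STOP_WORDS = {"a", "an", "and", "the"}

# Assumption: a "word" is a run of letters, optionally joined by apostrophes.
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def count_title_words(titles, stop_words=STOP_WORDS):
    """Split each title into words, lowercase them, and tally occurrences."""
    counts = Counter()
    for title in titles:
        words = WORD_PATTERN.findall(title.lower())
        # Skip stop words; everything else gets counted.
        counts.update(word for word in words if word not in stop_words)
    return counts


if __name__ == "__main__":
    # Made-up sample titles, not WorldCat data.
    sample_titles = [
        "A new history of the county",
        "Report on the New Water Plan",
        "The history of water",
    ]
    # Print in the same "count word" layout as the list in this post.
    for word, count in count_title_words(sample_titles).most_common(5):
        print(f"{count} {word}")
```

Since `titles` can be any iterable of strings, passing an open file handle with one title per line would work for a file far too big to hold in memory. Anything the stop-word list misses still has to be weeded out by eyeball.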
That was really about it. Well, except for all the time I spent on Facebook waiting for the operations to complete. Did I say that out loud?
Anyway, without further ado (thank god) here are the top occurring meaningful English title words in WorldCat:
2020380 new
1853252 report
1431184 study
1159042 development
1069940 analysis
1004554 history
978681 county
968097 international
929294 state
890928 guide
856935 system
789983 education
778732 school
756569 united
748894 national
736474 management
706559 social
700137 book
688993 states
688328 studies
687695 general
687665 american
679083 systems
678582 public
677286 water
671552 research
666407 life
661707 health
645966 plan
644212 world
642100 effects
OK, now move along, nothing to see here.
Roy Tennant works on projects related to improving the technological infrastructure of libraries, museums, and archives.