As I’ve described in a series of posts recently (“Adventures in Hadoop”, four so far), I’ve been having fun on our new compute cluster. Well, maybe “fun” isn’t exactly the right term for diving into the depths of the MARC format, but hey, librarians have to get their kicks somehow.
Anyway, I’ve been doing some work that will eventually see the light of day, but for now I want to report on one small finding: the top subject areas in WorldCat. But first, let me be very clear about my methodology so that no one draws the wrong conclusions.
What I’ve done is to use our Hadoop infrastructure to look at every occurrence of the 650 MARC field and set aside and count the contents of every $a subfield. What this means is that if a record in WorldCat has these subjects:
World War, 1939-1945 — Naval operations.
World War, 1939-1945 — Aerial operations.
World War, 1939-1945 — Pacific Ocean.
Then “World War, 1939-1945”, being the contents of the $a subfield, is counted three times. Therefore, the figures below are not the number of titles with that top-level topic, but the number of times it occurs in WorldCat as a whole. It should also be noted that this is across all formats. Here are the top 20:
807860 English language
739051 World War, 1939-1945
696769 Women
608170 Popular music
583375 Education
558876 Science
522882 Music
512224 Agriculture
433770 Art
403742 Law
397194 Indians of North America
379298 Jews
361501 Architecture
354640 Geology
345761 Railroads
343079 Geschichte.
321255 Roads
313043 World War, 1914-1918
305187 African Americans
293148 City planning
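For anyone who wants to see the counting rule spelled out, here is a minimal sketch in Python. It is not the Hadoop job I actually ran; the in-memory record layout (a list of tag and subfield pairs) and the function name count_650a are made up purely for illustration. The point is only that every 650 $a is tallied once per occurrence, so a record with three 650 fields sharing the same $a adds three to that heading's count.

```python
from collections import Counter

# Hypothetical, simplified stand-in for parsed MARC records (an assumption,
# not a real MARC library): each record is a list of (tag, subfields) pairs,
# and subfields is a list of (code, value) pairs.
records = [
    [
        ("245", [("a", "An example title")]),
        ("650", [("a", "World War, 1939-1945"), ("x", "Naval operations.")]),
        ("650", [("a", "World War, 1939-1945"), ("x", "Aerial operations.")]),
        ("650", [("a", "World War, 1939-1945"), ("z", "Pacific Ocean.")]),
    ],
    [
        ("650", [("a", "Women"), ("x", "History.")]),
    ],
]


def count_650a(records):
    """Count every occurrence of a $a subfield inside a 650 field."""
    counts = Counter()
    for record in records:
        for tag, subfields in record:
            if tag != "650":
                continue  # only topical subject fields are of interest
            for code, value in subfields:
                if code == "a":
                    # Counted per occurrence, not per record, and with no
                    # normalization of punctuation or case.
                    counts[value] += 1
    return counts


counts = count_650a(records)
for heading, n in counts.most_common(20):
    print(n, heading)
```

In a MapReduce setting, the inner loop would roughly correspond to a map step emitting (heading, 1) pairs and the Counter to a reduce step summing them, though the details of a real job would differ.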