Top Topics in WorldCat

As I’ve described in a series of posts recently (“Adventures in Hadoop”, four so far), I’ve been having fun on our new compute cluster. Well, maybe “fun” isn’t exactly the right term for diving into the depths of the MARC format, but hey, librarians have to get their kicks somehow.

Anyway, I’ve been doing some work that will eventually see the light of day, but for now I want to report on one small finding: the top subject areas in WorldCat. But first let me be very clear about my methodology so that no one draws the wrong conclusions from the numbers.

What I’ve done is use our Hadoop infrastructure to look at every occurrence of the MARC 650 field (topical subject headings) and count the contents of every $a subfield. What this means is that if a record in WorldCat has these subjects:

World War, 1939-1945 — Naval operations.
World War, 1939-1945 — Aerial operations.
World War, 1939-1945 — Pacific Ocean.

Then “World War, 1939-1945”, being the contents of the $a subfield, is counted three times. Therefore, the figures below are not the number of titles with that top-level topic, but the number of times it occurs in WorldCat as a whole. Note also that the counts are across all formats. Here are the top 20:

807860	English language
739051	World War, 1939-1945
696769	Women
608170	Popular music
583375	Education
558876	Science
522882	Music
512224	Agriculture
433770	Art
403742	Law
397194	Indians of North America
379298	Jews
361501	Architecture
354640	Geology
345761	Railroads
343079	Geschichte.
321255	Roads
313043	World War, 1914-1918
305187	African Americans
293148	City planning

“Geschichte” is German for “History”. It will be interesting to see how this list changes as we add more non-U.S. records to the database.
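
For anyone who wants to see the counting rule spelled out, here is a minimal Python sketch of the logic described above. It is not the Hadoop job I actually ran. The in-memory record layout (a list of `(tag, subfields)` pairs), the function name `count_650a`, and the $x/$z subdivision codes in the sample record are assumptions made purely for illustration. The only normalization is stripping surrounding whitespace, so a heading with a trailing period (like “Geschichte.”) would be counted separately from one without.

```python
from collections import Counter


def count_650a(records):
    """Count every $a subfield found in a 650 field, across all records.

    Each record is assumed to be a list of (tag, subfields) pairs, where
    subfields is a list of (code, value) pairs. This layout is hypothetical.
    """
    counts = Counter()
    for record in records:
        for tag, subfields in record:
            if tag != "650":
                continue
            for code, value in subfields:
                if code == "a":
                    # One tally per occurrence, not one per record.
                    counts[value.strip()] += 1
    return counts


# A single made-up record carrying the three subjects from the example.
records = [
    [
        ("245", [("a", "Example title")]),
        ("650", [("a", "World War, 1939-1945"), ("x", "Naval operations.")]),
        ("650", [("a", "World War, 1939-1945"), ("x", "Aerial operations.")]),
        ("650", [("a", "World War, 1939-1945"), ("z", "Pacific Ocean.")]),
    ],
]

counts = count_650a(records)

# Print in the same "count<TAB>heading" shape as the list above.
for heading, n in counts.most_common(20):
    print(f"{n}\t{heading}")
```

Because the tally is per occurrence rather than per record, this single sample record contributes three to “World War, 1939-1945”, which is exactly why the figures above should not be read as title counts.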