I’m a data geek. I just love processing data in various ways to see what I can find out. So recently I decided to look into the countries of publication as recorded in the 300+ million MARC records in WorldCat. Just for kicks I did some processing of the 260 $a subfield, which is the “Place of publication, distribution, etc.” as it appears on the piece, or noted in various other ways if it doesn’t.
As you might imagine, what results from such an investigation is a complete dog’s breakfast, with a large variety of punctuation marks, typographical errors, imaginative spellings, and just plain junk. No, it is much better to parse bytes 15-17 of the 008 field, which at least are supposed to only contain values from this list maintained by the Library of Congress. Progress.
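For the curious, pulling those three bytes out might look something like the sketch below. It is an illustration only, not the program that was actually run against WorldCat. The function name `place_code` and the sample 008 string are made up for the example, and it assumes the 008 field has already been extracted from the record as a plain string.

```python
def place_code(field_008: str) -> str:
    """Return the place-of-publication code from 008/15-17, or "" if absent."""
    # MARC character positions count from zero, so 15-17 is the slice [15:18].
    if len(field_008) < 18:
        return ""  # field too short to carry a place code
    # Two-letter codes are padded with a blank, so strip the padding.
    return field_008[15:18].strip().lower()


# A made-up 40-character 008 with "cau" (California) in positions 15-17.
SAMPLE_008 = "850101s1985    " + "cau" + " " * 17 + "eng d"

print(place_code(SAMPLE_008))  # cau
```

Real records can also carry fill characters or outright garbage in these positions, which this sketch does not try to handle.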
That is, until one discovers that this “Code List for Countries” is not exactly that. If you happen to be in a certain select part of the world (mostly the United States, Canada, and Australia), you can also select state or province-specific codes. So before I used this table to translate the codes for actual countries I first had to translate the table, so that the code for “California” translated instead to “United States”. Progress.
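Translating the table could be sketched roughly as follows. The handful of entries here are a small illustrative sample and should be checked against the Library of Congress list before being trusted; the names `RAW_TABLE`, `PARENT`, and `roll_up` are invented for this example and are not taken from the original program.

```python
# A few entries as they might appear in the code list (sample only).
RAW_TABLE = {
    "cau": "California",
    "nyu": "New York (State)",
    "onc": "Ontario",
    "xxu": "United States",
    "xxc": "Canada",
    "fr": "France",
}

# Hand-built mapping from state/province codes to their country's code.
PARENT = {"cau": "xxu", "nyu": "xxu", "onc": "xxc"}


def roll_up(raw: dict, parent: dict) -> dict:
    """Return a copy of raw where subnational codes carry the country name."""
    return {code: raw[parent.get(code, code)] for code in raw}


COUNTRY_TABLE = roll_up(RAW_TABLE, PARENT)
print(COUNTRY_TABLE["cau"])  # United States
```

Codes with no entry in `PARENT` keep their own name, so actual country codes pass through untouched.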
Oh, and then countries have this tiresome tendency to change over time. The Soviet Union broke up. Czechoslovakia split into two. And don’t even get me started about the hot mess that used to fall under the general term of “Micronesia”. So I had to make some executive (and no doubt indefensible) decisions about how to deal with those. By and large, if I could identify some geography (e.g., Uzbekistan) that had a former life that could also be identified (e.g., Uzbek S.S.R.), I translated them both into the current entity. But lord only knows how many items that don’t have this distinction end up being miscounted. But progress of some sort nonetheless.
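One way to picture the "former life" decision is as a second lookup that folds a discontinued code into its current one before any counting happens. This is a hypothetical sketch: `FORMER_TO_CURRENT` and `current_code` are invented names, and the single entry shown (the Uzbek S.S.R. code `uzr` folding into the Uzbekistan code `uz`) should be verified against the code list.

```python
# Discontinued code -> current code, only where there is one clear successor.
FORMER_TO_CURRENT = {
    "uzr": "uz",  # Uzbek S.S.R. -> Uzbekistan
}


def current_code(code: str) -> str:
    """Fold a discontinued code into its successor; leave other codes alone."""
    return FORMER_TO_CURRENT.get(code, code)


print(current_code("uzr"))  # uz
```

Entities that split into several successors, such as the Soviet Union or Czechoslovakia as a whole, have no single target, so this sketch leaves them unmapped. That is exactly where the miscounting creeps in.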
Oh, and places like “West Berlin” got their own code. How quaint. But now I’m just whining.
In the end I had the table translated into my twisted view of reality and could run my program against the entirety of WorldCat, parsing out the precious three bytes from the 008 and running my undoubtedly flawed translation on the result. I just love that “Unknown” came out on top. Somehow, after this journey, it seemed fitting.
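The overall run can be pictured as a simple tally. The sketch below is a toy stand-in rather than the real program: it assumes the 008 fields arrive as plain strings, uses a tiny made-up translation table, and treats any code it does not recognize as "Unknown", which is an assumption about how that bucket was filled. The helper `make_008` exists only to build sample data.

```python
from collections import Counter

# Tiny made-up translation table; the real one covers the whole code list.
TOY_TABLE = {"cau": "United States", "xxu": "United States", "fr": "France"}


def make_008(code: str) -> str:
    """Build a fake 40-character 008 with the given place code (sample data)."""
    return "850101s1985    " + code.ljust(3) + " " * 17 + "eng d"


def tally_countries(fields_008, table):
    """Count records per country using positions 15-17 of each 008 string."""
    counts = Counter()
    for field in fields_008:
        code = field[15:18].strip() if len(field) >= 18 else ""
        counts[table.get(code, "Unknown")] += 1  # assumption: unmapped -> Unknown
    return counts


def format_top(counts, n=25):
    """Return the top n entries as '1,234 Country' lines."""
    return [f"{total:,} {name}" for name, total in counts.most_common(n)]


SAMPLE_RECORDS = [make_008(c) for c in ("cau", "xxu", "fr", "")]
for line in format_top(tally_countries(SAMPLE_RECORDS, TOY_TABLE)):
    print(line)
```

Streaming the records through a `Counter` keeps memory use proportional to the number of distinct countries rather than the number of records, which matters at WorldCat scale.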
With no further ado, here are the top 25 “countries” of publication from the records in WorldCat:
74,330,023 Unknown
52,460,566 United States
34,014,675 Germany
24,374,828 United Kingdom
21,009,805 France
9,142,988 Japan
8,706,853 China
7,950,373 Spain
6,649,599 Italy
6,312,625 Netherlands
6,142,256 Canada
5,641,525 Switzerland
3,725,639 Russia
3,516,374 Australia
3,310,194 Poland
2,923,655 Denmark
2,739,910 Sweden
2,219,850 India
1,996,800 Slovenia
1,936,800 Austria
1,612,948 Belgium
1,518,478 Israel
1,514,824 Brazil
1,412,034 Mexico
1,197,454 Finland
The full list is here. Knock yourself out. I sure did.
Roy Tennant works on projects related to improving the technological infrastructure of libraries, museums, and archives.