Countries of Publication in WorldCat

I’m a data geek. I just love processing data in various ways to see what I can find out. So recently I decided to look into the countries of publication as recorded in the 300+ million MARC records in WorldCat. Just for kicks I did some processing of the 260 $a subfield, which is the “Place of publication, distribution, etc.” as it appears on the piece, or as noted in various other ways if it doesn’t.

As you might imagine, what results from such an investigation is a complete dog’s breakfast, with a large variety of punctuation marks, typographical errors, imaginative spellings, and just plain junk. No, it is much better to parse bytes 15-17 of the 008 field, which at least are supposed to contain only values from the MARC Code List for Countries maintained by the Library of Congress. Progress.
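For anyone who wants to try this at home, pulling those three bytes out is the easy part. Here is a minimal sketch in Python, assuming you already have the 008 field as a plain 40-character string. The function name `place_code` and the sample field are made up for illustration, not taken from my actual program.

```python
def place_code(field_008: str) -> str:
    """Return the three characters at positions 15-17 of an 008 field.

    Positions are counted from zero, so this is the slice [15:18].
    Two-letter codes keep their trailing blank.
    """
    if len(field_008) < 18:
        return ""  # field too short to hold a place code
    return field_008[15:18]


# A made-up 008 field, assembled piece by piece so the offsets are easy to check.
sample_008 = (
    "850101"    # 00-05: date entered on file
    + "s"       # 06: type of date
    + "1985"    # 07-10: date 1
    + "    "    # 11-14: date 2 (blank)
    + "cau"     # 15-17: place of publication (California)
    + " " * 17  # 18-34: material-specific positions, left blank here
    + "eng"     # 35-37: language
    + " "       # 38: modified record
    + "d"       # 39: cataloging source
)

print(repr(place_code(sample_008)))  # prints 'cau'
```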

That is, until one discovers that this “Code List for Countries” is not exactly that. If you happen to be in a certain select part of the world (mostly the United States, Canada, and Australia), you can also select state or province-specific codes. So before I used this table to translate the codes for actual countries I first had to translate the table, so that the code for “California” translated instead to “United States”. Progress.
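The translation of the table amounts to a lookup that runs before the country lookup. A rough sketch of the idea follows, with a handful of codes I am confident about from the MARC list. The two dictionaries and the `country_name` function are hypothetical names, and falling back to “Unknown” for anything unrecognized is an assumption made for the example.

```python
# State or province code -> code of the country it belongs to.
# Illustrative subset only; the real list is much longer.
SUBDIVISION_TO_COUNTRY_CODE = {
    "cau": "xxu",  # California -> United States
    "nyu": "xxu",  # New York (State) -> United States
    "onc": "xxc",  # Ontario -> Canada
    "xna": "at",   # New South Wales -> Australia
}

# Country code -> display name. Again, a tiny subset.
COUNTRY_NAMES = {
    "xxu": "United States",
    "xxc": "Canada",
    "at": "Australia",
    "fr": "France",
}


def country_name(code: str) -> str:
    """Map a raw 008/15-17 value to a country name, rolling up subdivisions."""
    code = code.strip().lower()                        # drop the padding blank
    code = SUBDIVISION_TO_COUNTRY_CODE.get(code, code) # state -> country, if any
    return COUNTRY_NAMES.get(code, "Unknown")          # assumed fallback label


print(country_name("cau"))  # prints United States
print(country_name("fr "))  # prints France
```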

Oh, and then countries have this tiresome tendency to change over time. The Soviet Union broke up. Czechoslovakia split into two. And don’t even get me started about the hot mess that used to fall under the general term of “Micronesia”. So I had to make some executive (and no doubt indefensible) decisions about how to deal with those. By and large, if I could identify some geography (e.g., Uzbekistan) that had a former life that could also be identified (e.g., Uzbek S.S.R.), I translated them both into the current entity. But lord only knows how many items that don’t have this distinction end up being miscounted. But progress of some sort nonetheless.
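In code, that kind of executive decision looks something like the sketch below: a second table that maps a discontinued code onto the current one before anything gets counted. The entries shown are illustrative and `SUPERSEDED_CODES` and `current_code` are hypothetical names. Note that a code like “ur” (Soviet Union) has no single successor, so in this sketch it simply passes through unchanged, which is exactly where the miscounting creeps in.

```python
# Discontinued code -> current code. Illustrative entries only.
SUPERSEDED_CODES = {
    "uzr": "uz",  # Uzbek S.S.R. -> Uzbekistan
    "kzr": "kz",  # Kazakh S.S.R. -> Kazakhstan
}


def current_code(code: str) -> str:
    """Return the current code for a place, folding in its former identity."""
    code = code.strip().lower()
    return SUPERSEDED_CODES.get(code, code)


for raw in ("uzr", "uz ", "ur "):
    print(repr(raw), "->", repr(current_code(raw)))
```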

Oh, and places like “West Berlin” got their own code. How quaint. But now I’m just whining.

In the end I had the table translated into my twisted view of reality and could run my program against the entirety of WorldCat, parsing out the precious three bytes from the 008 and running my undoubtedly flawed translation on the result. I just love that “Unknown” came out on top. Somehow, after this journey, it seemed fitting.
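The counting step itself is nothing fancy. Here is a minimal sketch using Python's `collections.Counter`, assuming the country names have already been worked out for each record. The `tally` and `format_rows` helpers and the toy input are invented for the example, and running this over all of WorldCat would of course need something sturdier than a list in memory.

```python
from collections import Counter


def tally(names):
    """Count country names and return (name, count) pairs, largest first."""
    return Counter(names).most_common()


def format_rows(rows, top=25):
    """Format the top rows as right-aligned counts followed by the name."""
    return [f"{count:>10,}  {name}" for name, count in rows[:top]]


# Toy input standing in for one translated country name per record.
toy_names = ["Unknown", "France", "Unknown", "Japan", "France", "Unknown"]

for line in format_rows(tally(toy_names)):
    print(line)
```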

With no further ado, here are the top 25 “countries” of publication from the records in WorldCat:

74,330,023  Unknown
52,460,566  United States
34,014,675  Germany
24,374,828  United Kingdom
21,009,805  France
 9,142,988  Japan
 8,706,853  China
 7,950,373  Spain
 6,649,599  Italy
 6,312,625  Netherlands
 6,142,256  Canada
 5,641,525  Switzerland
 3,725,639  Russia
 3,516,374  Australia
 3,310,194  Poland
 2,923,655  Denmark
 2,739,910  Sweden
 2,219,850  India
 1,996,800  Slovenia
 1,936,800  Austria
 1,612,948  Belgium
 1,518,478  Israel
 1,514,824  Brazil
 1,412,034  Mexico
 1,197,454  Finland

The full list is here. Knock yourself out. I sure did.
