ISBNs in WorldCat

Recently a question came up on the BIBFRAME list about ISBNs, and how many of them were in MARC records. This is just the kind of question that OCLC Research is uniquely placed to answer, so I quickly wrote some simple Perl code to run as a Hadoop streaming job to find out.
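A minimal sketch of that kind of tally follows, in Python rather than the Perl actually used. It assumes, purely for illustration, that each input line holds one record serialized as MARC-in-JSON; the post does not say how the WorldCat records were stored, and the function names `count_020a`, `tally`, and `report` are hypothetical.

```python
import json
from collections import Counter


def count_020a(record):
    """Return the number of 020 $a subfields in one MARC-in-JSON record."""
    total = 0
    for field in record.get("fields", []):
        data = field.get("020")
        # Control fields map to plain strings; data fields map to dicts.
        if not isinstance(data, dict):
            continue
        for subfield in data.get("subfields", []):
            if "a" in subfield:
                total += 1
    return total


def tally(lines):
    """Map each '020 $a per record' value to the number of records having it."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        counts[count_020a(json.loads(line))] += 1
    return counts


def report(counts):
    """Format rows as: records, 020 $a per record, percent of all records."""
    total = sum(counts.values())
    return [
        f"{records}\t{per_record}\t{100 * records / total:.2f}%"
        for per_record, records in counts.most_common()
    ]
```

Run locally, `print("\n".join(report(tally(sys.stdin))))` (after `import sys`) would print one row per distinct count. In a real Hadoop streaming job the mapper would more likely emit each per-record count as a key and leave the summing to a reducer.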

It was remarkably quick and easy to find out, although I had to edit and re-run the code when I discovered a flaw in my logic. This is, sadly, all too frequently the case. But not too much later I had my result:

Records      020 $a per Record   Percent of WorldCat
230444194    0                   77.71%
55668178     2                   18.77%
4766652      1                   1.61%
3708352      4                   1.25%
616623       3                   0.21%
411230       6                   0.14%
125715       8                   0.04%
65796        5                   0.02%
45304        10                  0.02%
30155        12                  0.01%

These are all of the occurrences of a 020 $a in WorldCat as of 1 May 2013. [Added for clarification: the prior sentence describes exactly what is being counted. That is, I am not (yet) distinguishing 10-digit from 13-digit ISBNs; therefore, many of the records with 2 ISBNs may in fact simply have both versions of the same one.] A few observations:

  • Many items in WorldCat were published before the invention of the ISBN.
  • Many items in WorldCat are not ISBN-appropriate (e.g., unpublished materials).
  • ISBNs are therefore problematic as identifiers except for a narrow slice of materials (mainly printed books since the mid-60s).

A much better identifier for many purposes is, I assert, the OCLC number.



  1. This is a very odd result. I conjecture that you are looking at both 10- and 13-digit ISBNs, and that you count 13-digit ISBNs as different from the corresponding 10-digit ISBNs. Those should not be considered different. I predict that if you look at 13-digit ISBNs only, the distribution would be more normal.

  2. Right, I was quite clear about the limitations of this analysis: “These are all of the occurrences of a 020 $a in WorldCat as of 1 May 2013.” In other words, I did not care whether an ISBN was 10 or 13 digits, because that really wasn’t the point of the analysis for me; the point was how few records had an ISBN at all. But I could redo the analysis, which would likely show, as I think you are suggesting, that the records with 2 ISBNs really have one, in both its 10-digit and 13-digit forms.

  3. So the reason there are so few singleton ISBNs is probably that the missing member of the 10/13 pair is automatically generated somewhere. The singletons are probably 979-* ISBNs or records that have not seen the autogeneration process.

  4. There are a fair number of reused ISBNs, too, I’m afraid (same ISBN pulls up 2 or more different titles, though usually from the same publisher).
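
The 10-digit versus 13-digit point raised in the comments can be illustrated with a short sketch. It assumes an 020 $a value begins with the ISBN and may be followed by a qualifier such as "(pbk.)", and it converts 10-digit ISBNs to their 978-prefixed 13-digit form using the standard ISBN-13 check-digit rule (alternating weights 1 and 3). The function names are hypothetical, and check digits on the input are not validated. ISBNs that begin with 979 have no 10-digit equivalent and pass through unchanged.

```python
import re


def to_isbn13(value):
    """Normalize an 020 $a value to a 13-digit ISBN string, or None."""
    # Take the leading run of digits, hyphens, and X; ignore any qualifier.
    match = re.match(r"[0-9Xx-]+", value.strip())
    if not match:
        return None
    isbn = match.group(0).replace("-", "").upper()
    if len(isbn) == 13 and isbn.isdigit():
        return isbn
    if len(isbn) == 10 and isbn[:9].isdigit():
        # Drop the old check digit, prefix 978, and recompute the check digit.
        body = "978" + isbn[:9]
        weighted = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(body))
        return body + str((10 - weighted % 10) % 10)
    return None


def distinct_isbns(values):
    """Collapse the 020 $a values of one record to distinct 13-digit ISBNs."""
    return {isbn for isbn in map(to_isbn13, values) if isbn is not None}
```

With this normalization, a record carrying both `0306406152` and `978-0-306-40615-7` would count as having one ISBN rather than two.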
