The Most Edited Book Records in WorldCat

February 7th, 2014 by Roy

In my last post I identified the most edited records in WorldCat, which, no surprise, were all serials. Someone who read the post asked about this information by format (e.g., books, maps, scores, etc.). I doubt that I will get to all of the various formats, but I decided to take a look at books.

Unlike serials, for which I noted those that had 60 or more edits, for books I had to lower the threshold to 40 to get any at all (the most edited item had 58 edits). So here are the book records which have been edited more than 39 times in WorldCat (in no particular order):

An inevitable conclusion from the above seems to be that the more libraries that hold a book the more likely a cataloger will be to touch the record for it, which would explain how Harry Potter and the Hunger Games books made it on the list.


The Most Edited Records in WorldCat

January 30th, 2014 by Roy

Recently I’ve been doing a large pile of data processing jobs that has me working in cycles of 20 minutes or so. In other words, I do some edits, kick off a job on our compute cluster (fondly named “Gravel” — don’t ask) and about 20 minutes later I do roughly the same thing. Yeah, I know, you’re thinking “why doesn’t he automate it?”. And I would, except that this is a shared resource and rather than kicking off my monster list of jobs that could keep the cluster running from now until…well…a long while from now I think it’s better to introduce some variability in load.

All of that is a long introduction to how I came to discover the most edited records in WorldCat. To fill in those 20 minute blocks I took up some “mini investigations” that do not take as long to perform.

For one such investigation I looked into how often WorldCat Records have been edited and by whom. I will be blogging about this in an upcoming post, but a small slice of this investigation was a closer look at the records that have been edited a lot. Since we keep track of the cataloging symbol of every institution that has edited a record, these can stack up for records that require updates on a regular basis — in other words, serials.

All of the records for these serials were edited more than 60 times over their life in WorldCat, and in no particular order:

Take a bow, serials catalogers, you’ve clearly earned your pay.


The Most Used English Title Words in WorldCat

January 3rd, 2014 by Roy

This is another installment in my continuing series of eclectic, peripatetic, and yes, let’s just say it: “pathetic” data investigations. The most recent identified the top countries of publication for WorldCat records. For whatever reason, I got it into my head to determine which English words appear the most in the main title of WorldCat items.

Clearly there are at least two ways to go about this: a) a formal, well-designed, highly replicable and ultimately near perfect investigation, or b) a slapdash, fast, seat-of-your-pants investigation of questionable merit. When given such a choice, I find the latter completely irresistible. So I took part of my day today and did exactly that.

Since I already had code on our research cluster affectionately named “Gravel” that could extract a specific subfield, I powered it up and sucked out all of the 245 $a fields from WorldCat. As part of that process, I extracted only unique strings. The sharp ones among you have likely noticed a couple flaws already: 1) I was too lazy to filter based on language, and 2) I was too careless to normalize the title strings.

Flaws have never stopped me before, so I blazed on as if nothing was amiss. Then I threw that monster file onto another computer where I didn’t have to worry about interfering with any of the actually useful work that my colleagues were doing on Gravel (you’re welcome). There I wrote a special-purpose Perl script to take each title string, split it into individual words, lowercase them, and count up the occurrences. I dabbled in creating a “stop-words” list of useless words like “a” and “an” and “and” and “the” (ad infinitum) but that quickly began looking like a rabbit hole. As I was only really interested in identifying the top 30 or so words I figured my human eyeball would be sufficient to trap those in the end. Likewise with the foreign words.
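For the curious, the split, lowercase, and count step looks roughly like this. This is a minimal sketch in Python rather than the original Perl script, which is not shown in this post. The stop-word list, the punctuation trimming, and the sample titles are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical stop-word list; the list in the post was never finished.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "on"}


def count_title_words(titles, stop_words=STOP_WORDS):
    """Split each title on whitespace, lowercase the words, and tally them."""
    counts = Counter()
    for title in titles:
        for word in title.lower().split():
            # Assumption: trim the punctuation that tends to trail a 245 $a
            # (for example " :" or " /"). The post did no normalization.
            word = word.strip(".,:;/!?()[]\"'")
            if word and word not in stop_words:
                counts[word] += 1
    return counts


# Made-up sample titles, held in a set to mimic keeping only unique strings.
sample_titles = {
    "A new history of the county /",
    "Report on water development :",
    "New report",
}

counts = count_title_words(sample_titles)
for word, n in counts.most_common(3):
    print(n, word)
```

A `Counter` keeps the tally in memory, which is fine for a sample but would need some thought for hundreds of millions of title strings.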

That was really about it. Well, except for all the time I spent on Facebook waiting for the operations to complete. Did I say that out loud?

Anyway, without further ado (thank god) here are the top occurring meaningful English title words in WorldCat:

2020380 new
1853252 report
1431184 study
1159042 development
1069940 analysis
1004554 history
978681 county
968097 international
929294 state
890928 guide
856935 system
789983 education
778732 school
756569 united
748894 national
736474 management
706559 social
700137 book
688993 states
688328 studies
687695 general
687665 american
679083 systems
678582 public
677286 water
671552 research
666407 life
661707 health
645966 plan
644212 world
642100 effects

OK, now move along, nothing to see here.


Countries of Publication in WorldCat

December 10th, 2013 by Roy

I’m a data geek. I just love processing data in various ways to see what I can find out. So recently I decided to look into the countries of publication as recorded in the 300+ million MARC records in WorldCat. Just for kicks I did some processing of the 260 $a subfield, which is the “Place of publication, distribution, etc.” as it appears on the piece, or noted in various other ways if it doesn’t.

As you might imagine, what results from such an investigation is a complete dog’s breakfast, with a large variety of punctuation marks, typographical errors, imaginative spellings, and just plain junk. No, it is much better to parse bytes 15-17 of the 008 field, which at least are supposed to only contain values from this list maintained by the Library of Congress. Progress.

That is, until one discovers that this “Code List for Countries” is not exactly that. If you happen to be in a certain select part of the world (mostly the United States, Canada, and Australia), you can also select state or province-specific codes. So before I used this table to translate the codes for actual countries I first had to translate the table, so that the code for “California” translated instead to “United States”. Progress.

Oh, and then countries have this tiresome tendency to change over time. The Soviet Union broke up. Czechoslovakia split into two. And don’t even get me started about the hot mess that used to fall under the general term of “Micronesia”. So I had to make some executive (and no doubt indefensible) decisions about how to deal with those. By and large, if I could identify some geography (e.g., Uzbekistan) that had a former life that could also be identified (e.g., Uzbek S.S.R.), I translated them both into the current entity. But lord only knows how many items that don’t have this distinction end up being miscounted. But progress of some sort nonetheless.

Oh, and places like “West Berlin” got their own code. How quaint. But now I’m just whining.

In the end I had the table translated into my twisted view of reality and could run my program against the entirety of WorldCat, parsing out the precious three bytes from the 008 and running my undoubtedly flawed translation on the result. I just love that “Unknown” came out on top. Somehow, after this journey, it seemed fitting.
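Here is a rough sketch of that parse-and-translate step, in Python. It is not my actual program. The two lookup tables hold only a handful of codes from the MARC Code List for Countries, the `make_008` helper is hypothetical and exists only to build test input, and mapping any unrecognized code to “Unknown” is an assumption.

```python
from collections import Counter

# Tiny illustrative subset of the MARC country codes; the real list is far longer.
COUNTRY_NAMES = {
    "xxu": "United States",
    "xxc": "Canada",
    "xxk": "United Kingdom",
    "gw": "Germany",
    "fr": "France",
    "uz": "Uzbekistan",
}

# Roll-ups of the kind described above: sub-national and former
# entities are translated to a current country code.
ROLLUPS = {
    "cau": "xxu",  # California -> United States
    "nyu": "xxu",  # New York -> United States
    "onc": "xxc",  # Ontario -> Canada
    "uzr": "uz",   # Uzbek S.S.R. -> Uzbekistan
}


def country_from_008(field_008):
    """Read bytes 15-17 of an 008 field and translate them to a country name."""
    code = field_008[15:18].strip()
    code = ROLLUPS.get(code, code)
    # Assumption: anything not in the table is counted as "Unknown".
    return COUNTRY_NAMES.get(code, "Unknown")


def make_008(place):
    """Hypothetical helper: build a 40-character 008 with the given place code."""
    return "131210s2013    " + place.ljust(3) + " " * 17 + "eng d"


sample_fields = [make_008(p) for p in ("cau", "nyu", "gw", "uzr", "xx")]
tally = Counter(country_from_008(f) for f in sample_fields)
for country, n in tally.most_common():
    print(n, country)
```

Keeping the roll-ups in their own table, separate from the names, makes the executive decisions easy to find and argue about later.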

With no further ado, here are the top 25 “countries” of publication from the records in WorldCat:

74,330,023  Unknown
52,460,566  United States
34,014,675  Germany
24,374,828  United Kingdom
21,009,805  France
 9,142,988  Japan
 8,706,853  China
 7,950,373  Spain
 6,649,599  Italy
 6,312,625  Netherlands
 6,142,256  Canada
 5,641,525  Switzerland
 3,725,639  Russia
 3,516,374  Australia
 3,310,194  Poland
 2,923,655  Denmark
 2,739,910  Sweden
 2,219,850  India
 1,996,800  Slovenia
 1,936,800  Austria
 1,612,948  Belgium
 1,518,478  Israel
 1,514,824  Brazil
 1,412,034  Mexico
 1,197,454  Finland

The full list is here. Knock yourself out. I sure did.


Conversations about “Starting the Conversation”

December 6th, 2013 by Ricky

One of the best parts of my job is working with OCLC Research Library Partner staff on working groups. In this case we never got together face-to-face, but managed to put together a pretty good report, Starting the Conversation: University-wide Research Data Management Policy. Though we started out with a conference call, all the work took place via email and shared documents. The working group consisted of:
Dan Tsang, chair — University of California, Irvine
Anna Clements — University of St. Andrews
Joy Davidson — DCC, University of Glasgow
Mike Furlough — Pennsylvania State University
Amy Nurnberger — Columbia University
Sally Rumsey — University of Oxford
Anna Shadbolt — University of Melbourne
Claire Stewart — Northwestern University
Beth Warner — Ohio State University
Perry Willett — California Digital Library
I supplied the bones, they filled in some of the sections, and I polished it up.
Working in OCLC Research, we try to stay on top of the literature and we hear a lot about application, but there’s nothing like being in the thick of it, so it’s really great to have the expert input of those actually working in research libraries.


Visualizations of MARC Usage

December 2nd, 2013 by Roy

As part of my work to reveal exactly how the MARC standard has been used over the several decades it has existed (available at “MARC Usage in WorldCat”), I’ve always wanted to produce visualizations of the data. Recently, with essential help from my colleagues JD Shipengrover and Jeremy Browning, I was able to do exactly that.

After trying various graphical depictions of the data, we finally settled on an interactive “starburst” view of the data. The initial view provides a high-level summary of how often various tags have been used within particular formats. The interactive part allows you to “drill down” a level into a more detailed view.

We are providing two views of the data: from the point of view of the formats being described (that is, the top-level is comprised of the various formats — books, journals, etc.), and from the point of view of the tags (that is, the top-level is comprised of the various MARC tags).

If you have any ideas about a visualization you would like to see, let me know.


Learning Commons: well-made in Japan

November 27th, 2013 by Jim

Last week, during a very hectic, very interesting visit to research libraries in Japan, I had the good fortune to tour the new (April 2013) Learning Commons at Doshisha University. It is not a library-managed facility, but the library helps to staff it along with other Student Support Services staff. The facility itself is as good an implementation as I’ve seen anywhere, including the new facilities at North Carolina State University’s new library.

The Doshisha University Learning Commons brochure

The Commons itself is a multi-story structure constructed adjacent to the library and connected to the library at various levels. As a consequence students can move very freely from the collections and quiet of the traditional library to the group study, presentation, production and technology areas of the learning commons. There are plenty of visible but unobtrusive staff available to the students. People in red jackets offer technology support, those in blue offer peer instruction and guidance, those in yellow handle media production, and each floor has a desk staffed by a librarian.

There are no fixed furnishings in the entire facility. Everything can be moved. As an experiment they left one group study space with two tables without rollers. That space is the least frequently used in the building. I was impressed with the energy of the staff and the enthusiasm of the students. The facility’s location, bordering one of the busiest streets in Kyoto, purposely serves to advertise the learning environment of this private university. The big study and computing rooms are lined up along picture windows that face out onto this boulevard, ensuring that Kyoto citizens know that Doshisha is a good place to learn.

Check out some photos taken during my walk-through in this Flickr set. Look for the Global Village sign that designates an area where no Japanese is to be spoken.

P.S. After the original post my colleagues at Doshisha advised me that an English language version of their Learning Commons brochure is available (.pdf).


Harvesting Book Metadata From Wikipedia to Wikidata

November 27th, 2013 by Max

Infoboxes were for a long time Wikipedia’s way of storing data, and Wikidata is set to replace that technology, with added bonuses like inter-language sharing. To get to that promise, one first step is for Infoboxes to be harvested into Wikidata. I have started by harvesting Infobox Book in the 9 biggest Wikipedia languages that share the template: English, Italian, French, Spanish, Russian, Polish, Portuguese, Swedish, and Japanese.

The point of harvesting Infobox Book specifically is that the Wikidata citation guidelines for books specify that the library FRBR concept should be used, so I wanted to build out infrastructure to that end. FRBR is about describing bibliographic records at many different levels, and here’s an example of what this kind of citation would look like in Wikidata:

With that in mind, let’s have a look at the data. Our entry point is the set of Wikipedia pages that use Infobox Book (a “transclusion” in Wikipedia parlance) in the 9 aforementioned languages. This measure is only an approximation and does not completely reflect how many Wikipedia topics are about books in a language, for three reasons. The first is that the conception of what counts as a book is not strictly enforced on Wikipedia. An article could be about a physical item or an amorphous work idea, or sometimes the inclusion of an Infobox Book template is only a nod to a book, as in the French article on this racing pigeon. The second is that not all articles about a book necessarily contain a transclusion of Infobox Book. And thirdly, some specialised book infoboxes have been developed and are used instead, like Infobox Doctor Who Book.

In this next chart we look at the total Infobox Book transclusions, the total articles of a language, and the ratio between the two. Despite large variation in absolute numbers, the percentage of book articles in a Wikipedia is somewhere between 0.1% and 1% of all articles. Italians affirm themselves as the most bibliophilic. We’ll also see later on how their practice of labelling genre differs from the others.


Infobox Book Transclusion Counts By Language

Language  Infobox Book Transclusions  Total Articles (000s)  Percentage of Total Articles
en                             30582                   4432                         0.690
es                              3534                   1057                         0.334
sv                              3023                   1598                         0.189
pl                              2782                   1005                         0.277
pt                              1975                    803                         0.246
ru                              1865                   1061                         0.176
it                             10788                   1082                         0.997
ja                              1446                    886                         0.163
fr                              7935                   1441                         0.551
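
The percentage column is simply transclusions divided by total articles. As a quick check, here is a small Python sketch that recomputes it from the two count columns in the table above. The function name is my own, and rounding to three decimal places is an assumption based on how the figures are displayed.

```python
# (language, Infobox Book transclusions, total articles in thousands),
# copied from the table above.
ROWS = [
    ("en", 30582, 4432),
    ("es", 3534, 1057),
    ("sv", 3023, 1598),
    ("pl", 2782, 1005),
    ("pt", 1975, 803),
    ("ru", 1865, 1061),
    ("it", 10788, 1082),
    ("ja", 1446, 886),
    ("fr", 7935, 1441),
]


def percent_of_articles(transclusions, articles_thousands):
    """Share of a Wikipedia's articles that transclude Infobox Book, in percent."""
    return round(100 * transclusions / (articles_thousands * 1000), 3)


percentages = {lang: percent_of_articles(t, a) for lang, t, a in ROWS}
most_bookish = max(percentages, key=percentages.get)

for lang, pct in sorted(percentages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang} {pct:.3f}")
```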


In each Infobox I crawled for the properties that were most used across all languages and whose values were either string identifiers or links to other Wikipedia pages. When a value is a link to another Wikipedia page, for instance a link to the page of the author, that is useful because once it is harvested, Wikidata can store the author property as a link to another Wikidata item. This is desirable, as in Wikidata we seek to build a wiki of relations.
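
To make the distinction between the two kinds of value concrete, here is a minimal Python sketch that classifies a raw infobox value as either a link or a plain string. It is an illustration only, not the harvesting code itself. The regular expression covers just the basic `[[Target]]` and `[[Target|label]]` wikilink forms, and the function name is hypothetical.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]],
# capturing only the target page title.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def parse_infobox_value(value):
    """Return ("link", page_title) for a wikilink, else ("string", text)."""
    match = WIKILINK.search(value)
    if match:
        return ("link", match.group(1).strip())
    return ("string", value.strip())


print(parse_infobox_value("[[Suzanne Collins|Collins]]"))
print(parse_infobox_value(" 978-0-306-40615-7 "))
```

A value classified as a link still has to be resolved to the Wikidata item for that page before it can be stored as a relation, and that step is not shown here.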

Here is a graph of the properties that were found, showing which were added to Wikidata and which were already in the database.

Properties Harvest

So as you can see there are now over 30,000 relations between books and their authors and illustrators in Wikidata, as well as the original language and genres of the books. In addition knowing which book is which from a disambiguation perspective is made easier by the inclusion of over 50,000 identifiers.

One difficulty I encountered was that even though ISBNs are recorded in Infobox Book, the type of ISBN (10 or 13) is not distinguished. Wikidata does however distinguish them, and so as I was sorting these ISBNs I thought it would be sage to also verify them. OCLC runs an API called xID for this very purpose. While using xID it also struck me that the OCLC control number could be returned for a given ISBN. As Wikidata is rapidly evolving into a hub of identifiers, I included those in what I pushed to Wikidata. During this harvest, then, I also inserted an additional 10,117 OCNs (not pictured above).
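
Sorting ISBNs into 10s and 13s can also be done locally using the standard check-digit rules. The Python sketch below does that. Note that it is a stand-in for illustration and does not call xID, so it can only say that a string is a well-formed ISBN, not that the ISBN was ever assigned to a book. The function names are my own.

```python
DIGITS = set("0123456789")


def clean_isbn(raw):
    """Drop hyphens and spaces and uppercase a trailing 'x'."""
    return raw.replace("-", "").replace(" ", "").upper()


def is_valid_isbn10(isbn):
    """ISBN-10: weights 10 down to 1, sum divisible by 11, 'X' means 10."""
    if len(isbn) != 10 or not set(isbn[:9]) <= DIGITS:
        return False
    if isbn[9] not in DIGITS and isbn[9] != "X":
        return False
    values = [int(c) for c in isbn[:9]] + [10 if isbn[9] == "X" else int(isbn[9])]
    return sum((10 - i) * v for i, v in enumerate(values)) % 11 == 0


def is_valid_isbn13(isbn):
    """ISBN-13: alternating weights 1 and 3, sum divisible by 10."""
    if len(isbn) != 13 or not set(isbn) <= DIGITS:
        return False
    return sum((1 if i % 2 == 0 else 3) * int(c) for i, c in enumerate(isbn)) % 10 == 0


def classify_isbn(raw):
    """Return "ISBN-10", "ISBN-13", or None if the check digit fails."""
    isbn = clean_isbn(raw)
    if is_valid_isbn10(isbn):
        return "ISBN-10"
    if is_valid_isbn13(isbn):
        return "ISBN-13"
    return None


for candidate in ("0-306-40615-2", "978-0-306-40615-7", "0-306-40615-3"):
    print(candidate, classify_isbn(candidate))
```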

As I mentioned, it’s not just boring, nameless identifiers that we want to eventually integrate into all the Wikipedia pages via Wikidata. I inspected genre data as well to see how much cross-cultural benefit we’d receive by doing these sorts of harvests. Below are the Top 10 genres found in Infobox Book by each language. The text shown is the English label of the Wikidata item for each link found in a local Infobox. I’ve also outlined those genres which are unique. So you can see that Swedes care a bit more about choir books and the Japanese have a bent towards police drama.


Infobox Book Top 10s

What first jumped out at me is how inconsistently the idea of genre is used. In some ways it’s used to describe the content’s emotion and focus, like “science fiction” or “horror”. Other times it’s used to describe form, like “novel”. In fact only the Italians are really consistent, as their top ten mostly discusses form, in “novel”, “essay”, “short story”, “poetry”, “anthology”, “autobiography”, “novella”, “dialogue”, and “poem”.

Another problem between languages is that the genres often mismatch because they point to only slightly different articles. That is, we see appearances from the Wikidata items for “fantasy”, “fantasy literature”, “Fantastique”, and “high fantasy”. (By the way, you can draw your own conclusions about the demographics of Wikipedia editors when this much fantasy lit pervades the results.)

A conclusion that can be drawn from all this is that there is still some work to be done on negotiating cultural differences on Wikidata. Wikidata has made a lot of connections between Wikipedia articles in different languages, but not all of those merges are clean. The French conflate a pigeon and a book about a pigeon, and it’s linked to languages that discuss only the pigeon. Meanwhile, how the Italians interpret “genre” is a different, though not necessarily incompatible, notion from that of the others. There are probably some discussions still to be had before Infoboxes completely switch over to using Wikidata data, but we are at least one step closer to that goal.


Metadata for digital objects

November 26th, 2013 by Karen

That was the topic discussed recently by OCLC Research Library Partners metadata managers. It was initiated by Jonathan LeBreton of Temple, who noted the questions staff raised when describing voluminous image collections such as: Do we share the metadata even if it would swamp results? What context can be provided economically? What are others doing both in terms of data schemas and where the metadata is shared?

The discussion revolved around these themes:

Challenges in addressing the sheer volume of digital materials.  Managers are making decisions based on staffing, subject expertise, collection’s importance and funding. It was suggested that some metadata could be extracted from the technical metadata, such as dates and location. We discussed the possibility of crowd-sourcing metadata creation, although experience to date is that a few volunteers are responsible for most contributions, and the successful examples tend to be for transcription, editing OCR’d text, and categorizations. (The At a Glance: Sites that Support Social Metadata chart indicates the ones that enhance data either through improved description or subject access.) The context must matter to people for them to volunteer their efforts. (See the OCLC Report, Social Metadata for Libraries, Archives and Museums: Executive Summary.) With the anticipated increase of born-digital and other digitized materials, there’s a greater need for batch and bulk processing.

Grappling with born-digital materials. Libraries are receiving the digital equivalents of personal papers and using the Forensic Toolkit to “process” these digital collections. Preservation and rights management, in addition to description, are important components, and no commercially available system yet addresses these needs. The Association of Research Libraries is working with the Society of American Archivists to customize its Digital Archives Specialist (DAS) Program so that ARL library staff can develop the requisite skills for managing born-digital materials. OCLC Research has produced several reports in conjunction with its Demystifying Born Digital program of work.

Concerns about “siloization”, or proliferation of “boutique” collections, using different metadata schema. Metadata is being created in different native systems within an institution, metadata that is often not loaded into a central catalog or even accessible in the local discovery layer. User-created metadata in institutional repositories may be OAI harvested by OCLC and thus may appear in WorldCat even if not visible in the institution’s local discovery tool. Managers grapple with whether to spend resources on updating such metadata before it is exposed for harvesting.  Another challenge is deciding what to include in which discovery layer, and what should be silo’d.  The numerous repositories within an institution can result in complex metadata flows for discovery, as illustrated by UC San Diego’s Prezi diagram. Some institutions map their various metadata schema to MODS (Metadata Object Description Schema), but all non-MARC metadata is converted to MARC when loaded into WorldCat.

What are the “essential elements” to provide access across collections? We posited that librarians have been discussing “core” or “essential” metadata elements for decades, starting with Dublin Core and the Program for Cooperative Cataloging’s “BIBCO Standard Record”. Librarians have been entering metadata for the system it was designed for, but the data ultimately moves to another system. Library metadata is no longer confined to a single system: it may be exposed to search engines and viewed alongside lots of non-library metadata.

The Library of Congress’ Bibliographic Framework Initiative  portends a future where all metadata will be “non-MARC” and we will rely more on linked data URIs in place of metadata text strings.  How can we use the promise of that future to get to where we need to be?


First Scholars’ Contributions to VIAF: Greek!

November 25th, 2013 by Karen

Perseus logo in VIAF Cluster

Contributors to the Virtual International Authority File (VIAF) have generally been national libraries and other library agencies. We have just loaded into VIAF the first set of personal names from a scholarly resource, the Perseus Catalog hosted by Tufts University, an OCLC Research Library Partner. The Perseus Catalog aims to provide access to at least one online edition of every major Latin and Greek author from antiquity to 600 CE. Adding the Greek, Arabic and other script forms of names in the Perseus Catalog enriches existing VIAF clusters that previously lacked them.

This addition represents a milestone in our Scholars’ Contributions to VIAF activity. We anticipate mutual benefits from our collaboration with scholars. Scholars benefit from using VIAF URIs as persistent identifiers for the names in their own databases, linked data applications and scholarly discourse, which helps to disambiguate names in multinational collaborations, and from using VIAF as a means to disseminate scholarly research on names beyond their own communities. Both scholarly societies and libraries benefit from enriching VIAF with name authority data which would not otherwise be contributed by national libraries.

As noted in an earlier blog post, Irreconcilable differences? Name authority control & humanities scholarship,  OCLC Research discovered key issues important to scholars that didn’t mesh well with library practices represented in name authority files due to differences in intended audiences, disciplinary norms and metadata needs. However, if scholars do use the Library of Congress’ Metadata Authority Description schema, or MADS, as the Perseus Catalog does, we can add their files to VIAF much more easily.

Adding these scholarly files can demonstrate the benefits of tapping scholarly expertise to enhance and add to name authorities represented in VIAF. We have already seen an increase in the number of “alternate name forms” associated with VIAF clusters that include the Perseus Catalog’s contributions, including scripts not previously represented. We look forward to more such enhancements from other scholars’ contributions.

