November, 2012

VIAFbot Debriefing

Wednesday, November 28th, 2012 by Max

Shortly after reaching the quarter-million-edit milestone, VIAFbot finished linking Wikipedia biography articles to VIAF.org. Examining the bot’s logs reveals telling statistics about the landscape of authorities on Wikipedia. We now know how much linked authority data is on Wikipedia, its composition, and the similarities between languages.

First, let’s understand the flow of the bot’s job. With VIAFbot I sought to reciprocate the links from VIAF.org to Wikipedia, which were algorithmically matched by name, important dates, and selected works. Therefore it started by visiting all the Wikipedia links that existed on VIAF.org. Note that, owing to the delay between when the links were created and now, some of the pages had been deleted or merged (Fig. 1, orange region). For the rest of the set-up it utilized German Wikipedia, which has focused a lot on its authorities data. VIAFbot also loaded the German Wikipedia article equivalent to each of our English matches wherever one was available, following the “interwiki link” in Wikipedia parlance.

Next, VIAFbot searched for the equivalent structured-data templates, Authority control on English Wikipedia and Normdaten on German Wikipedia, to see what preexisting authorities data those pages held. German Wikipedia shone with 92,253 Normdaten templates (Fig. 1, purple region), of which 74,864 had the VIAF parameter filled (Fig. 1, pink region), compared to English Wikipedia’s mere 9,034 templates with 770 VIAF IDs.

Figure 1.

The program then compared the VIAF IDs supplied by English Wikipedia, German Wikipedia, and VIAF.org, although all three sources were not always present. As long as the sources that were present didn’t conflict, VIAFbot wrote the VIAF ID to the English Wikipedia page. If a conflict was found, the bot noted it on Wikipedia for human inspection, along with which sources conflicted. One telling statistic was how often the different sources disagreed with one another. The disagreement rates were surprisingly similar, but German Wikipedia seemed to disagree marginally less with VIAF.org, at 11.3% compared to English Wikipedia’s 15.9% (Fig. 2).

Figure 2.
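The compare-then-write-or-flag step described above can be sketched roughly as follows. This is a minimal illustration and not VIAFbot’s actual code: the function name `reconcile`, the source labels, and the example IDs are all hypothetical, and it assumes a missing source is represented as `None`.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple


def reconcile(ids: Dict[str, Optional[str]]) -> Tuple[str, Optional[str], List[Tuple[str, str]]]:
    """Decide what to do with the VIAF IDs offered by each source.

    `ids` maps a (hypothetical) source label to the VIAF ID that source
    supplied, or None when the source had nothing for this person.
    Returns (action, viaf_id, conflicting_pairs).
    """
    # Keep only the sources that actually supplied an ID.
    present = {name: vid for name, vid in ids.items() if vid}
    if not present:
        return ("skip", None, [])

    # Record every pair of sources that disagree, so a human can see
    # exactly which sources conflicted.
    conflicts = [
        (a, b) for a, b in combinations(present, 2) if present[a] != present[b]
    ]
    if conflicts:
        return ("flag", None, conflicts)

    # All present sources agree (or only one source exists): write the ID.
    viaf_id = next(iter(present.values()))
    return ("write", viaf_id, [])


# Made-up IDs, purely for illustration.
print(reconcile({"enwiki": None, "dewiki": "123", "viaf": "123"}))
print(reconcile({"enwiki": "111", "dewiki": "123", "viaf": "123"}))
```

In the second call the English Wikipedia ID disagrees with both other sources, so the sketch flags the page and reports the two conflicting pairs rather than writing anything.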

In the cases without disagreement, of which there were 254,678, some errors of a different variety were still found. Even though there was no disagreement among the sources, and probably mostly in the instances in which VIAF.org was the only source, the wrong VIAF number was sometimes written. Some very dedicated Wikipedians took to reporting these errors, and VIAF will incorporate those corrections. That is the power of crowdsourcing refining algorithmic accuracy.

The question remains of how much these links are being used. Google Analytics on the VIAF.org site can help answer that. German Wikipedia was the largest Wikipedia referrer as late as September 2012. VIAFbot started editing in October, and the effect was immediately tangible: English Wikipedia soon gained pole position, and total referrals then doubled (Fig. 3). It must be said, though, that this level of viewership may not be sustained as the “curiosity clicks” of Wikipedians being notified of changes through their watchlists start to fade.

Figure 3. Referral traffic to VIAF.org

Still, don’t doubt the usefulness of the project. For instance, we received this email from John Myers of Union College in Schenectady, NY:

“I had an Arabic name to enter into a record as part of a note, and I wasn’t confident about the diacritics.  So, I look in the authority file to temporarily download it, copy the form of the name, and then move on.  Couldn’t find the name in OCLC.  Look in Wikipedia under his common name – bingo.  Even better, Wikipedia has a link to VIAF, double bingo!  With the authorized form from VIAF, I could readily find the record in OCLC (I was tempted to copy the name form directly from VIAF, but didn’t want to push my luck.)  The miracles of an interconnected bibliographic dataverse!”

VIAFbot had written the link for ‘Aziz ‘Ali al-Misri only a few days prior.

The principal benefit of VIAFbot is the interconnected structure. Recognizing this, other Wikipedias (Italian and Swedish) have been in contact and asked for the same on their wikis. Yet to truly be interconnected, the next step forward is to integrate VIAF IDs not into any one Wikipedia, but into the forthcoming Wikidata, a central database for all Wikipedias across languages. Fortuitously, the pywikidata bot framework is stabilizing, and I’m in need of a new project now.

Without confusion,

Max Klein (@notconfusing)








Top Corporate Names in WorldCat

Tuesday, November 20th, 2012 by Roy

As I explained earlier, I have been doing some investigations into how MARC has been used over the last several decades. Curious about the contents of the 110 $a (corporate names), I extracted and counted them; the top 30 headings are listed below. Keep in mind a few things, however:

  • Entities can be put together in different ways. For example, “Great Britain”, “England and Wales”, and “Scotland” all appear in the list.
  • My process (as presently constituted) is simplistic. Therefore, both “Canada.” and “CANADA.” are counted separately.
  • Slight variations in headings produce different entries. For example, “Santa Fe River Baptist Association (Fla.)” and “Santa Fe River Baptist Association.” are counted separately.
  • Typos produce different entries.
Eventually I will make the entire list available. If you’re really eager, email me.
1417046	United States.
587986	Great Britain.
358417	France.
206591	Canada.
176754	Geological Survey (U.S.)
101421	California.
98397	Michigan.
79615	Australia.
78175	Catholic Church.
64390	New York (State).
57037	New Zealand.
48218	Sotheby's (Firm)
46196	Hôtel Drouot.
45853	Québec (Province).
44812	New South Wales.
44022	England and Wales.
43469	Massachusetts.
41914	Pennsylvania.
41560	Christie, Manson & Woods.
41292	Église catholique.
39517	Ontario.
36636	Scotland.
36234	Illinois.
34691	United Nations.
31121	India.
31011	Agence de presse Meurisse.
29958	Cornell University.
29648	Church of England.
29073	Japan.
28675	Victoria.
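
The case and punctuation caveats noted above could be softened with a light normalization pass before counting. The following is only a sketch of that idea, not the process used to produce the list: the helper names `normalize_heading` and `count_headings` are hypothetical, and folding case and dropping a trailing period is an assumption about what counts as “the same” heading.

```python
from collections import Counter
from typing import Iterable


def normalize_heading(heading: str) -> str:
    """Collapse whitespace, drop a trailing period, and fold case.

    This is deliberately conservative: a qualifier such as "(Fla.)" is
    kept, so qualified and unqualified forms still count separately.
    """
    collapsed = " ".join(heading.split())
    return collapsed.rstrip(".").casefold()


def count_headings(headings: Iterable[str]) -> Counter:
    """Count 110 $a values after normalization."""
    return Counter(normalize_heading(h) for h in headings)


sample = [
    "Canada.",
    "CANADA.",
    "Santa Fe River Baptist Association (Fla.)",
    "Santa Fe River Baptist Association.",
]
counts = count_headings(sample)
print(counts.most_common())
```

With this sample, “Canada.” and “CANADA.” merge into one entry, while the two Santa Fe River Baptist Association forms stay distinct because the qualifier differs. Typos and differently constructed entities would need more than string normalization.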

Top Topics in WorldCat

Wednesday, November 7th, 2012 by Roy

As I’ve described in a series of posts recently (“Adventures in Hadoop”, four so far), I’ve been having fun on our new compute cluster. Well, maybe “fun” isn’t exactly the right term for diving into the depths of the MARC format, but hey, librarians have to get their kicks somehow.

Anyway, I’ve been doing some work that will eventually see the light of day but for now I want to report on one small finding — the top subject areas in WorldCat. But first let me be very clear about my methodology so that incorrect assumptions are not made.

What I’ve done is to use our Hadoop infrastructure to look at every occurrence of the 650 MARC field and set aside and count the contents of every $a subfield. What this means is that if a record in WorldCat has these subjects:

World War, 1939-1945 — Naval operations.
World War, 1939-1945 — Aerial operations.
World War, 1939-1945 — Pacific Ocean.

Then “World War, 1939-1945”, being the contents of the $a subfield, is counted three times. Therefore, the figures below are not the number of titles with that top-level topic, but the number of times it occurs in WorldCat as a whole. It should also be noted that this is across all formats. Here are the top 20:

807860	English language
739051	World War, 1939-1945
696769	Women
608170	Popular music
583375	Education
558876	Science
522882	Music
512224	Agriculture
433770	Art
403742	Law
397194	Indians of North America
379298	Jews
361501	Architecture
354640	Geology
345761	Railroads
343079	Geschichte.
321255	Roads
313043	World War, 1914-1918
305187	African Americans
293148	City planning
“Geschichte” is German for “History”. It will be interesting to see how this list changes as we add more non-U.S. records to the database.
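
To make the counting method concrete, here is a small single-process sketch of the difference between occurrences of a 650 $a value and the number of titles carrying it. It is not the Hadoop job itself: the in-memory record layout (a dict of tag to a list of fields, each field a list of subfield pairs) and the function names are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Field = List[Tuple[str, str]]    # (subfield code, value) pairs
Record = Dict[str, List[Field]]  # MARC tag -> repeated fields

# One hypothetical record with the three subjects from the example.
records: List[Record] = [
    {
        "650": [
            [("a", "World War, 1939-1945"), ("x", "Naval operations.")],
            [("a", "World War, 1939-1945"), ("x", "Aerial operations.")],
            [("a", "World War, 1939-1945"), ("z", "Pacific Ocean.")],
        ]
    }
]


def count_occurrences(recs: List[Record], tag: str = "650") -> Counter:
    """Count every $a subfield in every occurrence of `tag`."""
    counts: Counter = Counter()
    for rec in recs:
        for field in rec.get(tag, []):
            for code, value in field:
                if code == "a":
                    counts[value] += 1
    return counts


def count_titles(recs: List[Record], tag: str = "650") -> Counter:
    """Count each $a value at most once per record."""
    counts: Counter = Counter()
    for rec in recs:
        seen = {v for field in rec.get(tag, []) for c, v in field if c == "a"}
        counts.update(seen)
    return counts


occurrences = count_occurrences(records)
titles = count_titles(records)
print(occurrences["World War, 1939-1945"], titles["World War, 1939-1945"])
```

The figures in the list correspond to the first function: the single record contributes three occurrences but only one title.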

The Flipped Library

Monday, November 5th, 2012 by Jim

My colleague, Lorcan Dempsey, did a very nice synthesis of “MOOCs, Libraries, OCLC” for the OCLC Board of Trustees this morning. Given the massive attention and the surge of interest in MOOCs (witness that the article “Year of the MOOC” in the New York Times has stayed on the most-emailed list since it was published on 2 November 2012), he was asked to provide an overview and some foundational information so the trustees could have a preliminary discussion about the implications for libraries. Perhaps he will turn this into a piece for more general publication.

One of the things he drew out was the ways in which MOOCs are forcing an exploration of the scale, shape and costs of pedagogy, prompting new thinking about assessment, and creating environments that can facilitate and take advantage of predictive and adaptive analytics. In talking about the shape of pedagogy he pointed out the ways in which MOOCs are consciously capitalizing on social technologies, gamification techniques, virtual laboratories and peer learning. MOOCs might become the vehicle that institutionalizes the ‘flipped classroom’ as the norm.

I wasn’t very familiar with the ‘flipped classroom’ concept. I’d only come across it in reading about the Khan Academy. Teachers were assigning the Khan modular lectures as homework and then using the classroom time for personal tutoring, independent problem solving, inquiry-based activities, project-based learning and peer interaction. I now understand that the flipped classroom is a much more broadly established approach and that the Khan Academy example is just one specific manifestation of it. I found these three brief blog posts from leading proponents of the approach in secondary education to be very helpful.

As the trustee discussion proceeded, Betsy Wilson, Dean of Libraries at the University of Washington, seized on the flipped classroom observation, saying that this is what libraries had been doing over the last ten years.

Everybody was already operating a flipped library.

I thought it was a spot-on analogy and very descriptive of where academic libraries have been heading. Consider that the current academic library no longer requires students and faculty to come to the library for their information seeking and consumption. It delivers materials online to the user’s preferred environment when they need the information, in ways that support time-shifted consumption and repeated encounters. The library building is being re-imagined around support for independent study, collaborative work and group interactions, and library services are being re-invented around support for the processes of learning and research rather than collections.

The phrase ‘flipped library’ is a very nice way to capture what’s going on. I’m going to start using it. I don’t know if it will gain traction. The phrase ‘flipped classroom’ seems to have gained widespread use because it had an accompanying catch phrase – “Moving from sage on the stage to guide on the side.” What’s the equivalent catch phrase for the flipped library? If you’ve got a candidate please share.

The flipped library in the photo is the Wyoming Branch of the Free Library at 231 East Wyoming Avenue, Philadelphia, PA 19120. It was opened October 30, 1930 and was the last library funded by Carnegie.