Archive for January, 2013

Adventures in Hadoop, #5: String Searching vs. Record Parsing

Friday, January 25th, 2013 by Roy

In a previous post I described how I could now string search WorldCat using the Research compute cluster and Hadoop. This means I can find any string of characters anywhere in a MARC record and output the OCLC numbers of matches and/or the entire record, all within a few minutes. Keep in mind we are now talking about nearly 300 million records.

String searching is not, by any means, the only way we use Hadoop. It is actually more common for us to use code (typically written in Java, Python, or Perl) to parse the records and output portions for further processing. But since I had been using such code for simply identifying records of interest, I began to wonder which method of processing was faster.

In a short experiment I quickly proved that string searching was about three times faster than my previous code for simply identifying and outputting records. This is, I believe, because the code I had been using would parse each record before determining whether it met my criteria. That one extra step added enough overhead that a job took 15 minutes (in one test) rather than 5.

This likely means that in cases where relatively few records would match your criteria, you would still be better off extracting the records by string searching and then running your extraction code against them off-cluster. For example, if I wanted to pull out the 245 fields of, say, about 1,000 records, I’d likely be better off extracting the records I wanted by string searching and then processing that file directly, without using Hadoop.
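The contrast can be sketched as two hypothetical Hadoop-streaming-style mappers in Python; the field separator and record layout below are simplified stand-ins for real MARC, not the code actually used on the cluster:

```python
import re

def search_mapper(records, pattern):
    """Fast path: emit a record only if its raw text matches the pattern.
    Nothing is parsed unless there is a hit."""
    regex = re.compile(pattern)
    return [rec for rec in records if regex.search(rec)]

def parse_then_filter_mapper(records, pattern):
    """Slow path: split every record into fields first, then test each field.
    The unconditional parse is pure overhead for non-matching records."""
    regex = re.compile(pattern)
    matches = []
    for rec in records:
        fields = rec.split("\x1e")  # parse every record, match or not
        if any(regex.search(field) for field in fields):
            matches.append(rec)
    return matches
```

Both return the same matches; the second pays a per-record parsing cost even for records that will be thrown away, which is roughly the 15-minutes-versus-5 difference described above.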

One last permutation, however: if your process identifies 1,000 records in some situations and several million in others, having one process through which all operations flow is more efficient than maintaining two or more separate processes.

And so it goes. Another day and another adventure in Hadoop.

MOOCs and Libraries: a look at the landscape

Wednesday, January 23rd, 2013 by Merrilee

Unless you’ve been hiding under a rock, you know that MOOCs have been causing a bit of a stir in the academic sector. In the last year, MOOCs have exploded, from a handful of early innovators to dozens of elite institutions partnering with organizations like Coursera, edX, and, in the UK, the Open University-led FutureLearn venture. The reasons for this are many, well documented, and highly debated. Instead of reviewing what you can read elsewhere, I’d like to focus on the relationship between MOOCs and libraries. Here’s what I was curious about: What is the connection between MOOCs and libraries? What’s happening now, and where are the opportunities?

To answer my question, I reached out to members of the OCLC Research Library Partnership. This group comprises 20 of 32 Coursera institutions, 3 of 6 edX institutions, and 4 of 12 FutureLearn institutions. I was fortunate to have either an email exchange or (even better) a phone call with nearly everyone I contacted. This information from those in the trenches has been invaluable. In these exchanges, I asked my basic questions: What are you doing now? What do you think the next steps are? As expected, a number of themes have emerged, along with a wide variety of attitudes (from white-knuckle fear to excitement, and everything in between :-)). Below is a summary of what I’ve learned so far.

FutureLearn has not quite fully launched yet, but the libraries at those institutions are planning to work with one another (good news). Within edX, the librarians have also formed an informal network (more good news). Within the larger Coursera network of institutions, there is no similar alliance of librarians.

Here are some of the themes that have emerged:

  • On the content side, most institutions are engaging in some sort of copyright or licensing negotiations, or are ensuring that materials used in courses are cleared for use in that context (this does not necessarily add up to making materials open access). At some institutions, this is a time-consuming (and obviously not scalable) activity. For many institutions, this is really the only point of contact with MOOCs.
  • In that vein, I spoke to a few people who are cautiously optimistic about MOOC implementation being a great opportunity to have an impactful conversation about open access publications or learning objects with faculty.
  • Most of those I spoke with acknowledged that MOOCs could be a great opportunity to rethink teaching on campus. MOOCs provide a sandbox for experimentation: a place to test what works and what doesn’t, and an environment where findings can be driven back into the next iteration. This can be done, in part, through the collection and analysis of data, which fits with the current emphasis in libraries (and elsewhere) on data collection and assessment.
  • Along with this, there’s an opportunity for libraries to think anew about library instruction and the role that library research plays in a MOOC or “flipped” environment.
  • There are also opportunities for partnerships. Some libraries may use the MOOC experiment as an opportunity to work with other units on campus, and to draw attention to what the library brings to the campus “team.” This is also an opportunity to work with faculty and instructors in new ways (or for a new reason). At a time when academic libraries are casting about for ways to recast the research services they offer, it may also be a good time to reframe teaching support.
  • I did these interviews as background for an event we’ve been planning together with the University of Pennsylvania Libraries (and which I’m pleased to announce!) “MOOCs and Libraries: Massive Opportunity or Overwhelming Challenge?” March 18-19. We’re still shaping the program and confirming speakers, but if you check out the event page, you will see the various themes we’ll be covering.

Do you have other ideas? Want to be part of the conversation? Leave a comment here, send me an email, or tweet using the hashtag #mooclib. I look forward to hearing from you!

Trust in Digital Repositories – best IDCC conference paper

Thursday, January 17th, 2013 by Jim

I am delighted that a paper titled “Trust in Digital Repositories,” co-authored by my OCLC Research colleague Ixchel Faniel, was given the best conference paper award at the just-concluded International Digital Curation Conference in Amsterdam. Okay, she had help. Her co-authors are Elizabeth Yakel and Adam Kriesberg (University of Michigan School of Information) and Ayoung Yoon (University of North Carolina School of Information and Library Science).

We can’t link to the paper because it hasn’t been published yet. However, you will find the presentation slides embedded in the conference program that I linked to above.

The work described in the presentation looked at whether the actions stipulated as key to the audit and certification of trustworthy digital repositories were actually instrumental in creating trust in the designated community of users. In plain language: we said do these things and you should be trusted. Are those really the things that influence repository users’ judgement about trustworthiness? And does that judgement differ by disciplinary affiliation?

I’m not going to spoil it. What do you think?

This work was based on the Trustworthy Repositories Audit and Certification checklist that OCLC Research published about five years ago. The Digital Curation Centre itself has a nice page on the development of the certification checklist, which goes back quite a long way. The Research Libraries Group had a lot to do with its origins thanks to my former colleague, Robin Dale.

It pleases me that this work has bridged organizations and colleagues. Shout out to Robin. Congratulations to Ixchel.

Wikipedia Analytics Engine

Monday, January 14th, 2013 by Max

Wikipedia has its own data structure in templates with parameters. If you are not familiar with Wikipedia templates, an example is “infoboxes,” which show up as fixed-format tables in the top right-hand corner of articles. Templates, and the metadata they contain, have been exploited for research in the past, but I’ve wanted to create a toolchain that would connect Wikipedia data and library data. I also wanted to include a few more features than the standard Wikipedia statistics engines: (a) working over all pages in a MediaWiki dump to analyze the differences between pages that do and don’t include certain templates, (b) taking into account what I term subparameters of templates, and (c) doing it all in a multithreaded way. Here is an early look at some analysis which may shed light on the notion of systemic biases in Wikipedia.


Of all the biases Wikipedia is accused of, “recentism” has seemed to me one of the more subtle. To investigate, I wanted to compare the shape of the curve of global population to that of the birth dates of biography articles on Wikipedia. For data, I looked in templates: specifically English Wikipedia’s {{Persondata}} for the parameter DATE OF BIRTH, and German Wikipedia’s {{Personendaten}} for the parameter GEBURTSDATUM. For the comparison to global population I used UN data. In both cases you can see that the Wikipedia curves sit below global population until about 1800 and outpace it in growth thereafter. These more exponential curves corroborate the claim that Wikipedia covers recent events more heavily. Curiously, both Wikipedia lines peak at about 1988 and then all but disappear. If you want a biography article on Wikipedia, apparently it helps to be 25 years old.

Occurrences of Birth Dates in English and German Wikipedia Compared to Global Population
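A minimal sketch of this kind of template mining, assuming wikitext pages containing {{Persondata}} with a DATE OF BIRTH parameter; the regex and the flat list of page texts are illustrative assumptions, not the actual toolchain:

```python
import re
from collections import Counter

# Matches a DATE OF BIRTH parameter inside a {{Persondata}} template and
# captures a four-digit year, e.g. "| DATE OF BIRTH = 12 June, 1964".
BIRTH_YEAR = re.compile(r"DATE OF BIRTH\s*=\s*[^|}]*?(\d{4})")

def birth_year_histogram(pages):
    """Tally birth years across a collection of wikitext page bodies."""
    years = Counter()
    for text in pages:
        match = BIRTH_YEAR.search(text)
        if match:
            years[int(match.group(1))] += 1
    return years
```

Plotting the resulting counter against the UN population series is what produces curves like the one above.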

Simple Metrics

This is quite a simple analysis. One of the chief benefits of working with OCLC is that there is a lot of bibliographic data to play with, so let’s marry the two sources: Wikipedia template data and OCLC data. For this section I queried all the Wikipedia pages from December 2012 for all the citation templates, and extracted all the ISBNs and OCLC numbers.

One way to characterize the cited books is audience level, derived from WorldCat holdings data. Audience level is expressed as “a decimal between 0.01 (juvenile books) and 1.00 (scholarly research works).” Taking a simple mean of audience level across all citations gives 0.47 on English Wikipedia; in German it’s 0.44. If we plot the histograms of each, we get moderately normal curves that actually tend to skew left.
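The averaging step itself is trivial; a minimal sketch, with made-up audience-level values standing in for the WorldCat-derived data:

```python
def mean_audience_level(levels):
    """Mean audience level across cited works, on the WorldCat scale of
    0.01 (juvenile books) to 1.00 (scholarly research works)."""
    if not levels:
        raise ValueError("no audience levels to average")
    return sum(levels) / len(levels)
```

A result like 0.47 sits just below the scale’s midpoint, consistent with the mild leftward skew of the histograms.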

Audience level histograms for English and German Wikipedia

Is Wikipedia stuffed with incomprehensibly dense knowledge? Maybe, but its citations aren’t necessarily.

Subject Analysis

Another bias claim lodged against Wikipedia is that its content is heavily concentrated in certain subjects. Is the same true of its citations? Every Wikipedia article can have any number of ISBNs or OCLC numbers (see figure below). In FRBR terms, these identifiers relate to manifestations, so using WorldCat they were clustered into works at the expression level. And every work is about any number of subjects. Here I used the FAST subject headings, a faceted version of the Library of Congress Subject Headings.

Subject Analysis Procedure for Wikipedia

Then I totaled the number of citations on Wikipedia within each subject, creating a list of subjects with their respective citation frequencies. Using that list, here is a word-cloud visualization of Wikipedia’s 100 most cited subjects, inferred from the subjects assigned to the works cited.
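Assuming hypothetical lookup tables for the WorldCat manifestation-to-work clustering and the FAST subject assignments (the real data lives behind OCLC services, not plain dicts), the tally could look like:

```python
from collections import Counter

def subject_citation_counts(article_isbns, isbn_to_work, work_to_subjects):
    """Roll per-article ISBN citations up to FAST subject frequencies.

    article_isbns:    {article title: [cited ISBNs]}
    isbn_to_work:     manifestation -> work mapping (WorldCat clustering)
    work_to_subjects: work -> [FAST subject headings]
    """
    counts = Counter()
    for isbns in article_isbns.values():
        for isbn in isbns:
            work = isbn_to_work.get(isbn)
            if work is None:
                continue  # ISBN not resolvable to a work
            for subject in work_to_subjects.get(work, []):
                counts[subject] += 1
    return counts
```

The 100 most common keys of that counter, scaled by frequency, are what the word cloud below visualizes.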

A word cloud of the FAST subject headings of the most cited books in English Wikipedia

There is a large preponderance of subjects confirming the subcultures Wikipedia is noted for: Politics, Military History, Religion, Math and Physics, Comics and Video Games, and Mycology. At least if they are going to be overrepresented in general, they should be well cited.

Below is the same algorithm applied to a different Wikipedia – can you guess the language? It’s quite funny to see courts, administrative agencies, and executive departments with such prominence.


That should give just a glimpse of the range of avenues of inquiry opened up by being able to deeply search and connect Wikipedia template parameters with library data. Any special requests for specific queries?

Wikily yours,


twitter: notconfusing