Archive for the 'Software' Category

Another Step on the Road to Developer Support Nirvana

Monday, March 10th, 2014 by Roy

Today we released a brand spanking new web site for library coders. It has some cool features including a new API Explorer that will make it a lot easier for software developers to understand and use our application program interfaces (APIs). But seen from a broader perspective, this is just another way station on a journey we began some years ago to enable our member libraries to have full machine access to our services.

When I joined OCLC in May 2007, I immediately began collaborating with my colleagues in charge of these efforts, as I knew many library developers and had been active in the Code4Lib community. As a part of this effort, we flew in some well-known library coders to our headquarters in Dublin, OH, to pick their brains about the kinds of things they would like to see us do, which helped us to form a strategy for ongoing engagement.

From there we hired Karen Coombs, a well-known library coder from the University of Houston, to lead our engagement efforts. Under Karen’s leadership we engaged with the community in a series of events we began calling hackathons, although we soon switched to calling them “mashathons” in response to the pejorative connotation the term “hack” carries in Europe. In those events we brought library developers together for a day or two of intense learning and open development. The output of those events began populating our Gallery of applications and code libraries.

Karen also dug into the difficult, but very necessary, work of documenting our APIs more thoroughly and consistently. Her yeoman work in this regard gave us a more consistent, easier-to-understand set of documentation that we continue to build upon and improve.

When Karen was moved into another area of work within OCLC to better use her awesome coding ability, Shelley Hostetler was hired to carry on this important work.

I think you will find this latest web site release even easier to understand and navigate. One essential difference is that it is now much easier to get started, since we have better integrated information about, and access to, API key requesting and management where a key is required (some services do not require one).

Although this new site offers a great deal to developers who want to know how to use our growing array of web services, we recognize it is but another step along the road to developer nirvana. So check it out and let us know how we can continue to improve. As always, we’re listening!


OCLC Exposes Bibliographic Works Data as Linked Open Data

Tuesday, February 25th, 2014 by Roy

Today in Cape Town, South Africa, at the OCLC Europe, Middle East and Africa Regional Council (EMEARC) Meeting, my colleagues Richard Wallis and Ted Fons made an announcement that should make all library coders and data geeks leap to their feet. I certainly did, and I work here. However, viewed from our perspective this is simply another step along a road that we set out on some time ago. More on that later, but first to the big news:

  1. We have established “work records” for bibliographic records in WorldCat, which bring together the sometimes numerous manifestations of a work into one logical entity.
  2. We are exposing these records as linked open data on the web, with permanent identifiers that can be used by other linked data aggregations.
  3. We have provided a human readable interface to these records, to enable and encourage understanding and use of this data.

Let me dive into these one by one, although the link above to Richard’s post also has some great explanations.

One of the issues we have as librarians is to somehow relate all the various printings of a work. Think of Treasure Island, for example. Can you imagine how many times that has been published? It hardly seems helpful, from an end-user perspective, to display screen upon screen of different versions of the same work. Therefore, identifying which works are related can have a tremendous beneficial impact on the end user experience. We have now done that important work.

But we also want to enable others to use these associations in powerful new ways by exposing the data as linked (and linkable) open data on the web. To do this, we are exposing a variety of serializations of this data: Turtle, N-Triples, JSON-LD, RDF/XML, and HTML. When looking at the data, please keep in mind that this is an evolutionary process. There are possible linkages not yet enabled in the data that will be enabled later. See Richard’s blog post for more information on this. The license that applies to this data is the Open Data Commons Attribution license, or ODC-BY.
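To give a sense of what consuming one of those serializations might look like, here is a minimal sketch in Python. It is not an official example: it assumes the work URIs honor standard HTTP content negotiation via the Accept header, and it uses the example work identifier that appears later in this post.

# A minimal sketch, assuming the work URIs support standard HTTP content
# negotiation for the serializations listed above.
import urllib.request

WORK_URI = "http://worldcat.org/entity/work/id/12477503"  # example ID from this post

ACCEPT_TYPES = {
    "turtle": "text/turtle",
    "ntriples": "application/n-triples",
    "jsonld": "application/ld+json",
    "rdfxml": "application/rdf+xml",
}

def fetch_work(uri, flavor="turtle"):
    """Request one serialization of a work description via content negotiation."""
    req = urllib.request.Request(uri, headers={"Accept": ACCEPT_TYPES[flavor]})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

if __name__ == "__main__":
    print(fetch_work(WORK_URI, "jsonld")[:500])  # peek at the first 500 characters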

Although it is expected that the true use of this data will be by software applications and other linked data aggregations, we also believe it is important for humans to be able to see the data in an easy-to-understand way. Thus we are providing the data through a Linked Data Explorer interface. You will likely be wondering how you can obtain a work ID for a specific item, which Richard explains:

How do I get a work id for my resources? – Today, there is one way. If you use the OCLC xISBN, xOCLCNum web services you will find as part of the data returned a work id (e.g. owi=”owi12477503”). By stripping off the ‘owi’ you can easily create the relevant work URI: http://worldcat.org/entity/work/id/12477503

In a very few weeks, once the next update to the WorldCat linked data has been processed, you will find that links to works will be embedded in the already published linked data. For example, you will find the following in the data for OCLC number 53474380:

schema:exampleOfWork http://worldcat.org/entity/work/id/12477503
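As a tiny illustration of the quoted recipe, the owi-to-URI step is just string manipulation:

# A minimal sketch of the recipe quoted above: take the owi value returned by
# the xISBN/xOCLCNum services, strip the "owi" prefix, and build the work URI.
WORK_URI_BASE = "http://worldcat.org/entity/work/id/"

def work_uri_from_owi(owi_value):
    """Convert e.g. 'owi12477503' into 'http://worldcat.org/entity/work/id/12477503'."""
    work_id = owi_value[len("owi"):] if owi_value.startswith("owi") else owi_value
    return WORK_URI_BASE + work_id

assert work_uri_from_owi("owi12477503") == "http://worldcat.org/entity/work/id/12477503"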

As you can see, although today is a major milestone in our work to make the WorldCat data aggregation more useful and usable to libraries and others around the world, there is more to come. We have more work to do to make it as usable as we want it to be and we fully expect there will be things we will need to fix or change along the way. And we want you to tell us what those things are. But today is a big day in our ongoing journey to a future of actionable data on the web for all to use.

MARCEdit Integrates WorldCat Capabilities

Thursday, October 31st, 2013 by Roy

As recently announced by Terry Reese, his program MARCEdit now includes a great set of new capabilities for users of WorldCat. Made possible by the recent release of the WorldCat Metadata API from OCLC, these are just a few of the things you can now do directly from MARCEdit:

  • Set Batch Holdings in OCLC.
  • Batch upload/edit records into WorldCat.
  • Search WorldCat directly from within MARCEdit.

This is just the kind of integration that our web services now make available for software of all kinds. By providing an application program interface (API) that enables not just search and display of records, but also updating and creating records, we are exposing the full range of WorldCat metadata capabilities to virtually any software developer.

We have long said that by enabling developers to use our services at a deeper level we would enable new kinds of services that we could not develop ourselves. Now we are seeing exactly that. Kudos to Terry Reese for building new capabilities into an already stellar application.

Pinch Me. I’m Dreaming.

Thursday, May 30th, 2013 by Roy

I was a library assistant at a community college in California when the microcomputer Commodore PET was released. At the time I was starting to think about going to college to get a library degree. To be clear, I first needed to get my B.A. as I had not gone to college out of high school (I was actually a high school dropout, but that’s a long story). Since libraries were all about information, and computers excelled in processing information, it seemed obvious to me that I needed to learn how to use computers.

That began my lifelong relationship with computers, since I wrote my very first software program on that Commodore PET, stored on a cassette tape, in the early 1980s. It was a library orientation tutorial.

From there, I minored in computer science and majored in geography at Humboldt State University. As an assistant to a geography professor there I wrote statistical analysis programs in FORTRAN and took classes in COBOL and Pascal. These were batch systems, where you submitted your job and waited for it to be run at the whim of the operators. The largest computer I ever used at Humboldt in the 80s is now far eclipsed in both power and storage by the phone in my pocket.

At UC Berkeley, where I received my MLS in 1986 and worked for 15 years, I made the acquaintance of UNIX and online time sharing computing. What a breakthrough that was. You could write and run a program immediately and get instant feedback. For a lousy code jockey like me, it was heaven. I didn’t have to wait to find out I had left out a semicolon.

So fast forward to today. Earlier today I kicked off a computing job on the OCLC Research compute cluster that took 4 minutes and 19 seconds to complete. At first that sounds like a long time, until you find out what it did. The job located 129 MARC records out of nearly 300 million that had the text string “rdacontnet”. It did this without indexes or databases. 300 million MARC records are sitting out on spinning disks somewhere in Ohio and from California I searched every byte of those records for a specific text string and it completed in less than 5 minutes. Pinch me. I’m dreaming.
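For readers curious what such a job looks like in practice, here is a rough sketch of a map-only Hadoop Streaming “grep” written in Python. It is not the actual job I ran: the one-record-per-line layout and the pattern for pulling out the OCLC number are assumptions made purely for illustration.

#!/usr/bin/env python
# mapper.py -- a sketch of a Hadoop Streaming string search over MARC records.
# Assumptions (not from this post): records are stored one per line in a text
# serialization, and the OCLC number can be found with a simple pattern.
import re
import sys

TARGET = "rdacontnet"                                  # the string searched for above
OCLC_NUM = re.compile(r'"oclc_number"\s*:\s*"?(\d+)')  # hypothetical field layout

for line in sys.stdin:
    if TARGET in line:                                 # plain substring test, no parsing
        match = OCLC_NUM.search(line)
        print(match.group(1) if match else line.rstrip())

On the cluster this would run map-only (no reducer), with the input pointed at the record dump; the exact invocation depends on how the cluster and the dump are set up.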


Photo courtesy of Don DeBold, Creative Commons License Attribution 2.0 Generic

Adventures in Hadoop, #5: String Searching vs. Record Parsing

Friday, January 25th, 2013 by Roy

In a previous post I described how I could now string search WorldCat using the Research compute cluster and Hadoop. This means I can find any string of characters anywhere in a MARC record and output the OCLC numbers of matches and/or the entire record, all within a few minutes. Keep in mind we are now talking about nearly 300 million records.

String searching is not, by any means, the only way we use Hadoop. It is actually more common for us to use code (typically written in Java, Python, or Perl) to parse the records and output portions for further processing. But since I had been using such code for simply identifying records of interest, I began to wonder which method of processing was faster.

In a short experiment I quickly proved that string searching was about three times faster for simple identification and output of records than the code I had previously been using. This is because, I believe, the code I had been using would parse each record before determining whether it met my criteria. That one extra step added so much overhead that the process took 15 minutes (in one test) rather than 5.

This likely means that in cases where relatively few records would match your criteria, you would still be better off extracting the records by string searching and then running your extraction code against them off-cluster. For example, if I wanted to pull out the 245 fields of, say, about 1,000 records, I’d likely be better off extracting the records I wanted by string searching and then processing that file directly without using Hadoop.
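As a concrete sketch of that off-cluster second pass (not my actual code), suppose the matched records had been written out as binary MARC to a hypothetical file called matches.mrc; a few lines of Python using the pymarc library would pull out the 245s:

# Parse the small extracted file locally, once string searching on the cluster
# has whittled nearly 300 million records down to a thousand or so.
# Assumes a hypothetical matches.mrc file and that pymarc is installed.
from pymarc import MARCReader

with open("matches.mrc", "rb") as fh:
    for record in MARCReader(fh):
        if record is None:        # pymarc yields None for records it cannot parse
            continue
        for field in record.get_fields("245"):
            print(field.value())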

One last permutation, however: if your process identifies 1,000 records in some situations and several million in others, having one process through which all operations flow is more efficient than maintaining two or more separate processes.

And so it goes. Another day and another adventure in Hadoop.

Wikipedia Analytics Engine

Monday, January 14th, 2013 by Max

Wikipedia has its own data structure in templates with parameters. If you are not familiar with Wikipedia templates, an example is “infoboxes,” which show up as fixed-format tables in the top right-hand corner of articles. Templates, and the metadata they contain, have been exploited for research in the past, but I’ve wanted to create a toolchain that would connect Wikipedia data and library data. I also wanted to include a few more features than the standard Wikipedia statistics engines: (a) working over all pages in a MediaWiki dump to analyze the differences between pages that do and don’t include certain templates, (b) taking into account what I term subparameters of templates, and (c) doing it all in a multithreaded way. Here is an early look at some analysis which may shed light on the notion of systemic biases in Wikipedia.

Birthdates

Of all the biases Wikipedia is accused of, “recentism” has seemed to me one of the more subtle. To investigate, I wanted to compare the shape of the curve of global population to that of the birthdates of biography articles on Wikipedia. For the data, I looked in templates: specifically English Wikipedia’s {{Persondata}} for the parameter DATE OF BIRTH, and German Wikipedia’s {{Personendaten}} for the parameter GEBURTSDATUM. For the comparison with global population I used UN data. In both cases you can see that the Wikipedia curves are below global population until about 1800, and outpace population growth thereafter. These more exponential curves corroborate the claim that Wikipedia leans toward covering more recent events more heavily. Curiously, both Wikipedia lines peak at about 1988 and then all but disappear. If you want a biography article on Wikipedia, apparently it helps to be 25 years old.

Occurrences of Birth Dates in English and German Wikipedia Compared to Global Population
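To give a flavor of the extraction step (this is not the toolchain described above, just an illustration using the third-party mwparserfromhell library), pulling DATE OF BIRTH values out of {{Persondata}} might look like this:

# A sketch of template-parameter extraction from page wikitext, using the
# third-party mwparserfromhell library (not necessarily what was used here).
import mwparserfromhell

def birth_dates(wikitext):
    """Yield DATE OF BIRTH values from {{Persondata}} templates in one page."""
    for template in mwparserfromhell.parse(wikitext).filter_templates():
        name = str(template.name).strip().lower()
        if name == "persondata" and template.has("DATE OF BIRTH"):
            yield str(template.get("DATE OF BIRTH").value).strip()

sample = "{{Persondata|NAME=Stevenson, Robert Louis|DATE OF BIRTH=13 November 1850}}"
print(list(birth_dates(sample)))   # ['13 November 1850']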

Simple Metrics

This is quite a simple analysis. One of the chief benefits of working with OCLC is that there is a lot of bibliographic data to play with, so let’s marry the two sources: Wikipedia template data and OCLC data. For this section I queried all the Wikipedia pages from December 2012 for all the citation templates, and extracted all the ISBNs and OCLC numbers.

One way to characterize the cited books is audience level, derived from WorldCat holdings data. Audience level is expressed as “a decimal between 0.01 (juvenile books) and 1.00 (scholarly research works).” Taking the simple mean of audience level across all citations gives 0.47 on English Wikipedia; in German it’s 0.44. If we plot the histograms of each, we get moderately normal curves that actually tend to skew left.

Audience level histograms for English and German Wikipedia citations
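The summary statistics themselves are straightforward. A toy sketch, with made-up sample values standing in for the real citation data:

# A toy sketch of the summary statistics above: given the audience level
# (0.01-1.00) of each cited work, compute the mean and a coarse histogram.
# The sample values below are made up purely for illustration.
from statistics import mean
from collections import Counter

audience_levels = [0.21, 0.38, 0.44, 0.47, 0.52, 0.61, 0.47, 0.35]  # placeholder data

print(f"mean audience level: {mean(audience_levels):.2f}")

histogram = Counter(round(level, 1) for level in audience_levels)   # 0.1-wide bins
for bin_value in sorted(histogram):
    print(f"{bin_value:.1f}: {'#' * histogram[bin_value]}")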

Is Wikipedia stuffed with incomprehensibly dense knowledge? Maybe, but its citations aren’t necessarily.

Subject Analysis

Another bias claim lodged against Wikipedia is that content is heavily concentrated in certain subjects. Is the same true of its citations? Every Wikipedia article can have any number of ISBNs or OCLC numbers (see figure below). In FRBR terms, these identifiers relate to manifestations, so using WorldCat they were clustered into works at the expression level. And every work is about any number of subjects. Here I used the FAST subject headings, which are a faceted version of the Library of Congress Subject Headings.


Subject Analysis Procedure for Wikipedia

Then I totaled the number of citations on Wikipedia within each subject, creating a list of subjects with their respective citation frequencies. Using that list, below is a word-cloud visualization of Wikipedia’s 100 most cited subjects, inferred through the subjects assigned to the works cited.


A word cloud of the FAST subject headings of the most cited books in English Wikipedia
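The tallying step behind the word cloud is conceptually simple; here is a sketch with made-up input, assuming the identifier-to-work-to-FAST-heading mapping has already been done:

# A sketch of the tallying step: given the FAST headings of each cited work,
# count citations per subject and keep the 100 most cited for the word cloud.
# The input below is invented; the real mapping came from WorldCat data.
from collections import Counter

citation_subjects = [                       # one entry per citation
    ["World War (1939-1945)", "Military history"],
    ["Mycology"],
    ["Military history", "Politics and government"],
]

subject_counts = Counter()
for headings in citation_subjects:
    subject_counts.update(headings)

for subject, count in subject_counts.most_common(100):
    print(f"{count:4d}  {subject}")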

There is a large preponderance of subjects that confirm the subcultures Wikipedia is noted for being biased towards: politics, military history, religion, math and physics, comics and video games, and mycology. At least if they are going to be overrepresented in general, they should be well cited.

Below is the same algorithm applied to a different Wikipedia – can you guess the language? It is quite funny to see courts, administrative agencies, and executive departments with such prominence.

A word cloud of the FAST subject headings of the most cited books in another Wikipedia (dewiki)

That should give just a glimpse of the range of avenues of inquiry opened up by being able to deeply search and connect Wikipedia template parameters with library data. Any special requests for specific queries?

Wikily yours,

Max

twitter: notconfusing

VIAFbot Debriefing

Wednesday, November 28th, 2012 by Max

Shortly after reaching the quarter-million-edits milestone, VIAFbot finished linking Wikipedia biography articles to VIAF.org. Examining the bot’s logs reveals telling statistics about the landscape of authorities on Wikipedia. We can now know how much linked authority data is on Wikipedia, its composition, and the similarities between languages.

First, let’s understand the flow of the bot’s job. With VIAFbot I sought to reciprocate the links from VIAF.org to Wikipedia, which were algorithmically matched by name, important dates, and selected works. It therefore started by visiting all the Wikipedia links that existed on VIAF.org. Note that, owing to the delay between when the links were created and now, some of the pages had been deleted or merged (Fig. 1, orange region). For the rest of the set-up it utilized German Wikipedia, which has focused a lot on its authorities data. VIAFbot also loaded all available German Wikipedia articles equivalent to our English matches, the “interwiki links” in Wikipedia parlance.

Next, VIAFbot searched for the equivalent structured-data templates, Authority control and Normdaten, to see what preexisting authorities data those pages held. German Wikipedia shone with 92,253 Normdaten templates (Fig. 1, purple region), 74,864 of which had the VIAF parameter filled (Fig. 1, pink region), compared to English Wikipedia’s mere 9,034 templates with 770 VIAF IDs.

Figure 1.

The program then compared the VIAF IDs supplied by English Wikipedia, German Wikipedia, and VIAF.org, although all three sources were not always present. Where the sources did not conflict, VIAFbot wrote the VIAF ID to the English Wikipedia page. If a conflict was found, the bot noted it on Wikipedia for human inspection, along with which sources conflicted. One telling statistic was how often the different sources disagreed with one another. These disagreement rates were surprisingly similar, though German Wikipedia seemed to disagree marginally less with VIAF.org, at 11.3% compared to English’s 15.9% (Fig. 2).

Figure 2.
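A condensed sketch of that reconciliation logic (not VIAFbot’s actual source code) might look like this:

# A condensed sketch of the reconciliation described above, not VIAFbot's
# actual code: gather whichever VIAF IDs are present, write the ID when the
# available sources agree, and flag disagreements for human inspection.
def reconcile(enwiki_id, dewiki_id, viaf_org_id):
    """Return ('write', id), ('conflict', ids) or ('skip', None)."""
    present = {i for i in (enwiki_id, dewiki_id, viaf_org_id) if i}
    if not present:
        return ("skip", None)            # nothing to work with
    if len(present) == 1:
        return ("write", present.pop())  # all available sources agree
    return ("conflict", present)         # note for human inspection

assert reconcile("12345", None, "12345") == ("write", "12345")
assert reconcile("12345", "67890", "12345")[0] == "conflict"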

In the non-disagreement cases, of which there were 254,678, some errors of a different variety were still found. Even though there was no disagreement among the sources (probably in the instances in which only the VIAF.org source was present), the wrong VIAF number was sometimes written. Some very dedicated Wikipedians took to reporting these errors, and VIAF.org will incorporate those corrections. That is the power of crowdsourcing refining algorithmic accuracy.

The question still remains: how much are these links being used? Google Analytics on the VIAF.org site can help answer that. German Wikipedia was the largest Wikipedia referrer to VIAF.org as late as September 2012. VIAFbot started editing in October, and the effect was immediately tangible: English Wikipedia soon gained pole position and then doubled total referrals (Fig. 3). It must be said, though, that this level of viewership may not be sustained as the “curiosity clicks” of Wikipedians notified of changes through their watchlists start to fade.

Figure 3. Referral traffic to VIAF.org.

Still, don’t doubt the usefulness of the project. For instance, we received this email from John Myers of Union College in Schenectady, NY:

 “I had an Arabic name to enter into a record as part of a note, and I wasn’t confident about the diacritics.  So, I look in the authority file to temporarily download it, copy the form of the name, and then move on.  Couldn’t find the name in OCLC.  Look in Wikipedia under his common name – bingo.  Even better, Wikipedia has a link to VIAF, double bingo!  With the authorized form from VIAF, I could readily find the record in OCLC (I was tempted to copy the name form directly from VIAF, but didn’t want to push my luck.)  The miracles of an interconnected bibliographic dataverse!”

VIAFbot had written the link for ‘Aziz ‘Aku ak-Misri only a few days prior.

The principal benefit of VIAFbot is the interconnected structure. Recognizing this, other Wikipedias (Italian and Swedish) have been in contact and asked for the same on their wikis. Yet to truly be interconnected, the next step forward is to integrate VIAF IDs not into any one Wikipedia but into the forthcoming Wikidata, a central database for all Wikipedias across languages. Fortuitously, the pywikidata bot framework is stabilizing, and I’m in need of a new project now.

Without confusion,

Max Klein (@notconfusing)


Adventures in Hadoop, #3: String Searching WorldCat

Tuesday, September 25th, 2012 by Roy

OK, I admit it. Ever since I joined OCLC over five years ago I’ve harbored a dream. My dream was to one day string search WorldCat. What that means is to have the ability to find any random string of characters anywhere in any MARC field within the over 250 million records that comprise this huge union catalog. Unix geeks call it “regular expressions” or “regex” for short, and often use the Unix command “grep” to string search files. Now I admit it, this is a very geeky dream, but it’s mine and I’m not giving it up.

Luckily, I don’t have to. In fact, just the other day, thanks to a colleague, I actually did it.
