Archive for October, 2012

Registering researchers in authority files

Monday, October 29th, 2012 by Karen

Last month we launched a new task group of OCLC Research Library Partner staff and others who are involved in uniquely identifying authors and researchers, using identifiers that can be shared in a linked data environment.

We were spurred by institutions’ need to uniquely identify all their researchers in order to measure their scholarly output, a factor in reputation and ranking. Yet national authority files cover researchers only partially. They do not include authors who write only journal articles, or researchers who don’t publish but create or contribute to data sets and other research activities.

We see a number of activities in this “name space” with potential overlap, including: the International Standard Name Identifier (ISNI), the Virtual International Authority File (VIAF), Open Researcher and Contributor ID (ORCID), the Dutch Digital Author Identifier system (DAI), the Names Project in the UK, the Program for Cooperative Cataloging’s NACO program, researcher profile systems such as VIVO, and Current Research Information Systems (CRIS).

The Registering Researchers in Authority Files Task Group will document the benefits of researcher identification; significant challenges; trade-offs among the current approaches; and mechanisms for linking approaches and data. We are starting with use case scenarios, for example:

  • Researchers who want to identify others in their field
  • Institutions that need to collate the intellectual output of their researchers
  • Funders who want to track the outputs for awarded grants
  • Services providing persistent identifiers for researchers that need to disambiguate names and ensure correct attributions

We hope that our report will help address all of the above needs and suggest approaches for linking data from different sources in a coherent way. Details on this activity and the task group roster (which includes experts from the Netherlands, the United Kingdom, and the United States) are on our new Registering Researchers in Authority Files activity page on the OCLC Research website.

If there are systems or “name authority hubs” you want to make sure we look at, please let us know with a comment below.

 

Elusive Quality

Thursday, October 25th, 2012 by Ricky

We talk a lot about data curation, but rarely about data quality. How do researchers determine if a dataset is appropriate for their intended purposes? They may need to know how the data was gathered (sometimes including the sensor equipment used and how it was calibrated), the degree of accuracy of the data, what null elements mean, what subsequent changes have been made to the data, and all sorts of provenance information.

The University of North Carolina invited about 20 people from a variety of communities to an NSF-funded workshop titled Curating for Quality: Ensuring Data Quality to Enable New Science. The final report has just been published. Its appendices contain the white papers that were prepared in advance of the workshop, including one that Brian Lavoie and I wrote, The Economics of Data Integrity, which is on page 53 of the report.

The most useful outcomes of the workshop came from the group’s brainstorming of projects that would advance the discussion. We settled on eight that seemed actionable and fleshed them out a bit. We were encouraged to pursue the projects that moved us, either by working informally with like-minded individuals or by making a proposal to NSF. There’s no reason, however, that anyone couldn’t take up any of these ideas.

For those of you in a hurry, the Conclusion and Call to Action on pages 17 and 18 of the report sum up the issues quite nicely.

Open Access Wikipedia Challenge on P2PU

Tuesday, October 23rd, 2012 by Max

It’s been traditional recently to hold Wikipedia Loves Libraries events during Open Access Week, and I fully support the practice. What’s also been traditional, and what I wanted to change, is the editathon format for those events. I racked my brain and consulted with other Wikipedia Loves Libraries volunteers about experimental ways to run trainings and celebrations, and we came up with the Open Access Wikipedia Challenge. The challenge is to embed media harvested from Open Access journals in Wikipedia, and we created a special edition barnstar for completing it. The challenge is totally friendly to newbies and librarians: it includes six screencast tutorial videos, totaling over an hour, that explain every detail from account creation to Wikipedia’s transclusion, and each module has waypoint challenges. At the time of this writing, nine challengers have already accepted.

Below is the introductory video, which is hosted on YouTube; the challenge itself is on P2PU.

Max Klein
twitter: @notconfusing 

Adventures in Hadoop #4: A Trivial Mechanism to Review Results

Wednesday, October 17th, 2012 by Roy

As I’ve been learning more about how to use Hadoop via streaming, I discovered that I frequently needed an easy way to review records identified by a particular process. For example, my colleague Karen Smith-Yoshimura has recently been wanting to locate MARC records that have particular characteristics. She provides me with the set of characteristics she wants to use as a filter, and then I edit some existing Python code to perform the filter and find the records. In some cases as few as 8,000 records are pulled out from the more than 250 million records that now comprise WorldCat.

But then those records need to be viewed in some way. At the end of my process I gather up the output records into a file that consists of one line per record. In some cases I only output the OCLC number, but in other cases the entire record will be output following the OCLC number. Even in those cases, however, the line always begins with the OCLC number. That enables me to set up a simple process for reviewing the output.
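
The filter-and-output step described above could be sketched roughly as follows. This is only an illustration, not the actual code: it assumes each input line is an OCLC number, a tab, and then the record text, and the names filter_lines and the sample predicate are hypothetical stand-ins for whatever characteristics are being filtered on.

```python
def filter_lines(lines, predicate, include_record=True):
    """Yield one output line per matching record.

    Assumed input format (not confirmed by the post): OCLC number,
    a tab, then the record text. Every output line begins with the
    OCLC number so that it is easy to review later.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        oclc_number, _, record = line.partition("\t")
        if not predicate(record):
            continue
        if include_record:
            yield oclc_number + "\t" + record
        else:
            yield oclc_number


# Hypothetical predicate: keep records that mention a given string.
def mentions_maps(record):
    return "maps" in record.lower()


# In a Hadoop streaming mapper the lines would come from sys.stdin:
#     for out in filter_lines(sys.stdin, mentions_maps):
#         print(out)
sample = ["111\tAtlas of Maps\n", "222\tCookbook\n"]
for out in filter_lines(sample, mentions_maps):
    print(out)
```

Keeping the filter as a plain function that takes any iterable of lines makes it easy to try on a small sample before running it against the full data set.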

To do this I wrote a simple CGI program that finds all of the files ending in “.txt” in a certain directory and lists them for the user to select. When the user clicks on a particular filename, the program parses the file, taking the OCLC number from each line and setting up a couple of links. One link is to the raw MARC record as it is stored in our HBase WorldCat table, and the other link takes the user to the record in WorldCat.org. I also send along a parameter that enables OCLC staff to see the XML or BER version of the record in WorldCat.org. Therefore, reviewing the records that are output by a Hadoop job is as simple as dropping the output file into a directory and going to that directory with a web browser. A few clicks are all it takes.
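
For the curious, the core of such a review page might look something like the sketch below. It is not the actual CGI program: the function names are made up, the raw-record address is a placeholder for an internal service, and the WorldCat.org link pattern is an assumption. The staff-only parameter is left out because its name isn't given here.

```python
import glob
import os
import re

# Assumed public link pattern for a record on WorldCat.org.
WORLDCAT_URL = "https://www.worldcat.org/oclc/{}"
# Placeholder for the internal raw-MARC lookup (hypothetical address).
RAW_RECORD_URL = "http://hbase.example.org/worldcat/{}"


def list_output_files(directory):
    """Return the names of the .txt files in the directory, sorted."""
    paths = glob.glob(os.path.join(directory, "*.txt"))
    return sorted(os.path.basename(path) for path in paths)


def oclc_number(line):
    """Return the run of digits that begins the line, or None."""
    match = re.match(r"\s*(\d+)", line)
    return match.group(1) if match else None


def review_links(lines):
    """Build an HTML list with two links for each output line."""
    rows = []
    for line in lines:
        number = oclc_number(line)
        if number is None:
            continue  # ignore lines that don't start with a number
        rows.append(
            '<li>{0}: <a href="{1}">raw MARC</a> | '
            '<a href="{2}">WorldCat.org</a></li>'.format(
                number,
                RAW_RECORD_URL.format(number),
                WORLDCAT_URL.format(number),
            )
        )
    return "<ul>\n" + "\n".join(rows) + "\n</ul>"


print(review_links(["12345\t(record text)", "67890"]))
```

Because every output line starts with the OCLC number, the same parser works whether the Hadoop job emitted just the numbers or the full records.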

Wikipedia Loves Libraries: how can you participate?

Monday, October 15th, 2012 by Merrilee

Over the summer, our Wikipedian in Residence, Max, did two webinars that gave librarians a glimpse behind the curtain of Wikipedia. One of the things he highlighted in those webinars was Wikipedia Loves Libraries, a Wikipedia-conceived initiative to bring Wikipedians and libraries (and archives) closer together. We were heartened to learn that at least two of the planned events (at the Multnomah County Library on October 27th and in West Hollywood on November 17th) were inspired, at least in part, by our webinars! There are also events planned at the New York Public Library, Princeton University, the Smithsonian, George Washington University, Indiana University, and elsewhere; you can check out the full list here.

What about you? Are you interested in hosting an event and partnering with local Wikipedians? There is a handy form to get you started, and lots of good models online. And if you want some handholding or have questions, don’t hesitate to get in touch.

You can also watch the webinars if you are intrigued.

Cousins: The Bookworm and Wikignome

Wednesday, October 3rd, 2012 by Max

As we all know, the best you can hope for from a meeting is not a conclusion, but a chuckle at a statistical oddity. When OCLC’s Top Library Loans List came out, just such a positive meeting was had. Upon glancing at the pulp fiction (see chart below), I wondered whether Wikipedia editors were also driven by such trivia. I turned to Python, R, and article edit histories to find out.

The top 10 list is as follows:

  1. The Hunger Games by Suzanne Collins
  2. Catching Fire by Suzanne Collins
  3. Mockingjay by Suzanne Collins
  4. Fifty Shades of Grey by E.L. James
  5. A Game of Thrones by George R. R. Martin
  6. The Help by Kathryn Stockett
  7. Thinking, Fast and Slow by Daniel Kahneman
  8. Steve Jobs by Walter Isaacson
  9. Quiet: The Power of Introverts in a World That Can’t Stop Thinking by Susan Cain
  10. A Dance with Dragons by George R. R. Martin

Now let’s take a tour through Wikipedia’s history to get a feeling for the editors’ affinities towards these monographs:

We can tell that there isn’t a lot of similarity between the novels, except that they’ve all experienced small peaks within the last year or so. That isn’t surprising, because the list in question is for the most requested inter-library loans for the period of a year starting July 11th, 2011. So let’s take a look at how actively edited these books were in that time frame.
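
Counting edits in that time frame boils down to something like the sketch below. It is not the project's actual code: it assumes the revision timestamps have already been fetched and are in the ISO 8601 form that the MediaWiki API returns, it treats the window as running up to (but not including) July 11th, 2012, and the article names are placeholders.

```python
from datetime import datetime

# The loan list covers a year starting July 11th, 2011.
WINDOW_START = datetime(2011, 7, 11)
WINDOW_END = datetime(2012, 7, 11)  # assumed exclusive end of the window


def parse_timestamp(stamp):
    """Parse a timestamp such as '2011-07-11T08:30:00Z'."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")


def edits_in_window(timestamps, start=WINDOW_START, end=WINDOW_END):
    """Count the revisions made at or after start and before end."""
    return sum(1 for stamp in timestamps
               if start <= parse_timestamp(stamp) < end)


def rank_by_edits(histories, start=WINDOW_START, end=WINDOW_END):
    """Order article titles from most to fewest edits in the window."""
    counts = {title: edits_in_window(stamps, start, end)
              for title, stamps in histories.items()}
    return sorted(counts, key=counts.get, reverse=True)


# Placeholder histories, made up purely to show the call.
histories = {
    "Article A": ["2011-08-01T00:00:00Z"],
    "Article B": ["2011-08-01T00:00:00Z", "2012-02-01T00:00:00Z"],
    "Article C": ["2010-01-01T00:00:00Z"],
}
print(rank_by_edits(histories))
```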

Besides the fact that the relationship here looks a bit exponential, as we’d expect of crowdsourced material, there is another curious correlation afoot. The ordering of the monographs by edits is remarkably similar to the ranking by loan requests. Keeping the by-edits ordering and then charting the loan positions, we get something reassuringly linear.

In fact the Top 6 are exactly predicted. If you were going to use quick-sort inversion counting analysis to compare the closeness of the two lists, I believe you would get the low count of 2 (correct me if I’m wrong). This indicates that there is a possible correlation between book demand in the library and Wikipedia editor interest online. So librarians take note: when deciding on your stock, pre-empt the rush and look to Wikipedia. The Wikignomes have a psychic connection with the bookworms.
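
If you want to check an inversion count yourself, here is a minimal sketch. It counts inversions with a merge sort, which is the usual way to do it, rather than a quick-sort; the function names are my own and the two short lists are made up for illustration, not the loan and edit data.

```python
def count_inversions(seq):
    """Return (sorted copy of seq, number of out-of-order pairs)."""
    items = list(seq)
    if len(items) < 2:
        return items, 0
    mid = len(items) // 2
    left, left_count = count_inversions(items[:mid])
    right, right_count = count_inversions(items[mid:])
    merged = []
    inversions = left_count + right_count
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            # right[j] jumps ahead of every item still waiting in left
            merged.append(right[j])
            j += 1
            inversions += len(left) - i
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inversions


def list_distance(order_a, order_b):
    """Count the pairs of items that the two orderings disagree on."""
    position = {item: index for index, item in enumerate(order_b)}
    ranks = [position[item] for item in order_a]
    return count_inversions(ranks)[1]


# Made-up orderings of five items, not the real data.
by_loans = ["A", "B", "C", "D", "E"]
by_edits = ["A", "B", "D", "C", "E"]
print(list_distance(by_loans, by_edits))
```

A count of 0 means the two orderings agree completely, and the maximum for n items is n(n-1)/2, reached when one list is the reverse of the other.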

Not confusingly yours,

twitter.com/notconfusing

Max Klein, Wikipedian in Residence

P.S. The code to look at and graph Wikipedia articles is a small project I’ve open sourced, and it is available on GitHub. I’ve also built in the functionality to pull stats from a Wikipedia category, which allows for such fun as looking at the entire edit histories of all the Pulitzer Prize winners.