Trust in Digital Repositories – best IDCC conference paper

January 17th, 2013 by Jim

I am delighted that a paper titled “Trust in Digital Repositories” co-authored by my OCLC Research colleague, Ixchel Faniel, was given the best conference paper award at the just-concluded International Data Curation Conference in Amsterdam. Okay, she had help. Co-authors are Elizabeth Yakel (University of Michigan School of Information) with Adam Kriesberg (UMSI) and Ayoung Yoon (University of North Carolina School of Information and Library Science).

We can’t link to the paper because it hasn’t been published yet. However, you will find the presentation slides embedded in the conference program that I linked to above.

The work described in the presentation looked at whether the actions stipulated as key to the audit and certification of trustworthy digital repositories were actually instrumental in creating trust in the designated community of users. Plain language – we said do these things and you should be trusted. Are those really the things that influence the repository users’ judgement about trustworthiness? And does that judgement differ by disciplinary affiliation?

I’m not going to spoil it. What do you think?

This work was based on the Trustworthy Repositories Audit and Certification checklist that OCLC Research published about five years ago. The Digital Curation Centre itself has a nice page on the development of the certification checklist, which goes back quite a long way. The Research Libraries Group had a lot to do with its origins, thanks to my former colleague, Robin Dale.

It pleases me that this work has bridged organizations and colleagues. Shout out to Robin. Congratulations to Ixchel.

Wikipedia Analytics Engine

January 14th, 2013 by Max

Wikipedia has its own data structure in templates with parameters. If you are not familiar with Wikipedia templates, an example is “infoboxes,” which show up as fixed-format tables in the top right-hand corner of articles. Templates, and the metadata they contain, have been exploited for research in the past, but I’ve wanted to create a toolchain that would connect Wikipedia data and library data. I also wanted to be able to include a few more features than the standard Wikipedia statistics engines offer. For instance: (a) work over all pages in a MediaWiki dump to analyze the differences between pages that do and don’t include certain templates, (b) take into account what I term subparameters of templates, and (c) do it all in a multithreaded way. Here is an early look at some analysis which may shed light on the notion of systemic biases in Wikipedia.
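To make the idea of templates with parameters concrete, here is a minimal Python sketch that pulls named parameters out of a template call in raw wikitext. It is an illustration only, not the toolchain described above: the function names are hypothetical, the sample text is made up, and it ignores subparameters, triple-brace arguments, and multithreading.

```python
import re


def find_templates(wikitext, name):
    """Return the inner text of each {{name ...}} call, honoring nested braces."""
    results = []
    pattern = re.compile(r"\{\{\s*" + re.escape(name) + r"\s*[|}\n]")
    for match in pattern.finditer(wikitext):
        depth, i = 0, match.start()
        while i < len(wikitext) - 1:
            pair = wikitext[i:i + 2]
            if pair == "{{":
                depth += 1
                i += 2
            elif pair == "}}":
                depth -= 1
                i += 2
                if depth == 0:
                    # Strip the outer {{ and }} from the matched call.
                    results.append(wikitext[match.start() + 2:i - 2])
                    break
            else:
                i += 1
    return results


def split_params(inner):
    """Split on '|' characters that are not inside nested {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(inner):
        pair = inner[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif inner[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(inner[i])
            i += 1
    parts.append("".join(current))
    return parts


def parse_template(inner):
    """Map parameter names (or positions, for unnamed ones) to values."""
    params, positional = {}, 0
    for part in split_params(inner)[1:]:  # element 0 is the template name
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip()] = value.strip()
        else:
            positional += 1
            params[str(positional)] = part.strip()
    return params


# Made-up sample in the style of an English Wikipedia Persondata block.
sample = """{{Persondata
| NAME = Lovelace, Ada
| DATE OF BIRTH = 10 December 1815
| SHORT DESCRIPTION = Mathematician; see [[Analytical Engine|engine]]
}}"""

persondata = parse_template(find_templates(sample, "Persondata")[0])
print(persondata["DATE OF BIRTH"])
```

Tracking brace and bracket depth is what keeps the pipe inside the wiki link from being mistaken for a parameter separator.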

Birthdates

Of all the biases Wikipedia is accused of, “recentism” has seemed to me one of the more subtle. To investigate, I wanted to compare the shape of the curve of global population to that of the birthdates in biography articles on Wikipedia. For data, I looked in templates, specifically English Wikipedia’s {{Persondata}} for the parameter DATE OF BIRTH, and German Wikipedia’s {{Personendaten}} for the parameter GEBURTSDATUM. For the global population comparison I used UN data. In both cases you can see that the Wikipedia curves are below global population until about 1800, and outpace population in growth thereafter. These more exponential curves corroborate the claim that Wikipedia leans toward covering more recent events more heavily. Curiously, both Wikipedia lines peak at about 1988 and then all but disappear. If you want a biography article on Wikipedia, apparently it helps to be 25 years old.

Occurrences of Birth Dates in English and German Wikipedia Compared to Global Population
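A comparison like this needs the free-text birth dates reduced to years and then binned. The sketch below shows one way that could be done in Python. It is an assumption-laden illustration rather than the method actually used: the year-matching rule is deliberately crude, BC dates are simply skipped, and the sample values are invented.

```python
import re
from collections import Counter

# First standalone 3- or 4-digit number in the string is treated as the year.
YEAR_RE = re.compile(r"\b(\d{3,4})\b")


def birth_year(value):
    """Return a year parsed from a free-text date, or None if none is found."""
    if "BC" in value.upper():
        return None  # simplification: BC dates are ignored in this sketch
    match = YEAR_RE.search(value)
    return int(match.group(1)) if match else None


def births_per_century(values):
    """Count parsed birth years by the century they fall in (e.g. 1800)."""
    counts = Counter()
    for value in values:
        year = birth_year(value)
        if year is not None:
            counts[(year // 100) * 100] += 1
    return counts


def as_shares(counts):
    """Convert raw counts to fractions so curves of different size compare."""
    total = sum(counts.values())
    return {century: counts[century] / total for century in sorted(counts)}


# Invented DATE OF BIRTH values, including ones that cannot be parsed.
sample_values = ["10 December 1815", "1820", "c. 980", "2 March 1988", "unknown", "44 BC"]

century_counts = births_per_century(sample_values)
century_shares = as_shares(century_counts)
print(century_shares)
```

Working with shares rather than raw counts is what lets the shape of a Wikipedia curve be set against a population curve of a very different magnitude.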

Simple Metrics

This is quite a simple analysis. One of the chief benefits of working with OCLC is that there is a lot of bibliographic data to play with, so let’s marry the two sources: Wikipedia template data and OCLC data. For this section I queried all the Wikipedia pages from December 2012 for all the citation templates, and extracted all the ISBNs and OCLC numbers.
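As a rough illustration of that extraction step, the following Python sketch finds `isbn` and `oclc` parameters in wikitext and keeps only ISBNs that pass the standard ISBN-10 or ISBN-13 check-digit test. The regular expressions and function names are assumptions made for this example, and the OCLC number in the sample is invented.

```python
import re

# Assumed parameter spellings; real citation templates vary.
ISBN_PARAM = re.compile(r"\|\s*isbn\s*=\s*([0-9Xx\- ]+)", re.IGNORECASE)
OCLC_PARAM = re.compile(r"\|\s*oclc\s*=\s*(\d+)", re.IGNORECASE)


def normalize_isbn(raw):
    """Drop hyphens and spaces, and uppercase a trailing check character."""
    return re.sub(r"[^0-9Xx]", "", raw).upper()


def is_valid_isbn(isbn):
    """Check an already-normalized ISBN-10 or ISBN-13 against its check digit."""
    if re.fullmatch(r"[0-9]{9}[0-9X]", isbn):
        total = sum((10 - i) * (10 if ch == "X" else int(ch))
                    for i, ch in enumerate(isbn))
        return total % 11 == 0
    if re.fullmatch(r"[0-9]{13}", isbn):
        total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(isbn))
        return total % 10 == 0
    return False


def extract_identifiers(wikitext):
    """Collect valid ISBNs and all OCLC numbers found in citation parameters."""
    isbns = [normalize_isbn(raw) for raw in ISBN_PARAM.findall(wikitext)]
    return {
        "isbn": [isbn for isbn in isbns if is_valid_isbn(isbn)],
        "oclc": OCLC_PARAM.findall(wikitext),
    }


# Made-up wikitext; the third ISBN deliberately fails its checksum.
sample = (
    "{{cite book |title=Example |isbn=0-306-40615-2 |oclc=12345}} "
    "{{cite book |isbn=978-0-306-40615-7}} "
    "{{cite book |isbn=1234567890}}"
)

identifiers = extract_identifiers(sample)
print(identifiers)
```

Validating the check digit is a cheap way to discard mistyped ISBNs before trying to match them against a catalog.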

One way to characterize the cited books is audience level, derived from WorldCat holdings data. Audience level is expressed as “a decimal between 0.01 (juvenile books) and 1.00 (scholarly research works).” Taking the simple mean of audience level across all citations gives 0.47 on English Wikipedia. In German it’s 0.44. If we plot the histograms of each, we get moderately normal curves that actually even tend to skew left.

Audience Level English Audience Level German
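The mean and histogram above are simple to compute once each citation has an audience level attached. Here is a small Python sketch using only the standard library; the sample values are invented for illustration and are not WorldCat data.

```python
from statistics import mean


def histogram(levels, bins=10):
    """Count values in equal-width bins over 0.00-1.00; 1.00 goes in the top bin."""
    counts = [0] * bins
    for level in levels:
        # Work in whole hundredths to avoid floating-point edge cases.
        index = min(round(level * 100) * bins // 100, bins - 1)
        counts[index] += 1
    return counts


# Invented audience levels, one per cited book.
sample_levels = [0.12, 0.35, 0.41, 0.47, 0.52, 0.58, 0.66, 1.00]

sample_mean = mean(sample_levels)
sample_hist = histogram(sample_levels)

print(f"mean audience level: {sample_mean:.2f}")
for i, count in enumerate(sample_hist):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} {'#' * count}")
```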

Is Wikipedia stuffed with incomprehensibly dense knowledge? Maybe, but its citations aren’t necessarily.

Subject Analysis

Another bias claim lodged against Wikipedia is that content is heavily concentrated towards certain subjects. Is the same true for its citations? Every Wikipedia article could have any number of ISBNs or OCLC numbers (see figure below). In FRBR terms, these identifiers relate to manifestations, so using WorldCat they were clustered into works, at the expression level. And every work is about any number of subjects. Here I used the FAST subject headings, which are a faceted version of Library of Congress Subject Headings.

Subject Analysis Procedure for Wikipedia

Then I totaled the number of citations on Wikipedia within each subject, creating a list of subjects with their respective citation frequency. Using that list, here is a word-cloud visualization of Wikipedia’s 100 most cited subjects, inferred through the subjects assigned to the works cited.

A word cloud of the FAST Subject Headings of the most cited books in English Wikipedia
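The procedure behind the word cloud (article, to cited identifiers, to works, to subjects, to totals) can be sketched in a few lines of Python. Everything in this example is hypothetical: the identifiers, work IDs, and lookup tables stand in for WorldCat and FAST lookups, and it assumes every citation counts once, even when two citations resolve to the same work.

```python
from collections import Counter

# Hypothetical stand-ins for data that would come from Wikipedia and WorldCat.
citations_by_article = {
    "Article A": ["isbn:111", "oclc:222"],
    "Article B": ["isbn:333"],
}
work_of = {"isbn:111": "work-1", "oclc:222": "work-1", "isbn:333": "work-2"}
subjects_of = {
    "work-1": ["Military history", "Politics"],
    "work-2": ["Politics", "Mycology"],
}


def subject_frequencies(citations_by_article, work_of, subjects_of):
    """Total the citations falling under each subject heading."""
    counts = Counter()
    for identifiers in citations_by_article.values():
        for identifier in identifiers:
            work = work_of.get(identifier)
            if work is None:
                continue  # identifier could not be clustered into a work
            counts.update(subjects_of.get(work, []))
    return counts


frequencies = subject_frequencies(citations_by_article, work_of, subjects_of)
print(frequencies.most_common(2))
```

A list like `frequencies.most_common(100)` is the kind of input a word-cloud tool would take, with the count setting the size of each heading.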

There is a large preponderance of subjects that confirm the subcultures Wikipedia is noted for being biased toward: Politics, Military History, Religion, Math and Physics, Comics and Video Games, and Mycology. At least if they are going to be overrepresented in general, they should be well cited.

Below is the same algorithm applied to a different Wikipedia – can you guess the language?  Quite funny to see courts, administrative agencies, and executive departments with such prominence.

dewiki-fast-word-cloud

That should give just a glimpse of the range of avenues of inquiry available from being able to deeply search and connect Wikipedia template parameters with library data. Any special requests for specific queries?

Wikily yours,

Max

twitter: notconfusing

OCLC Research 2012: Welcome new colleagues!

December 31st, 2012 by Merrilee

This is the final posting in a short series, looking back on just some of what’s happened in OCLC Research during 2012.

I think that 2012 must have been a banner year for new colleagues in OCLC Research, or maybe it just seems that way. I’ve already mentioned Max, but here are a few more.

We started off the year by welcoming Titia van der Werf. Titia works in our Leiden office, and focusses much of her attention on European partners and projects. She is also a welcome addition to the Mobilizing Unique Materials team.

Our European team was further bolstered by Shenghui Wang, who joined us in May. Like Titia, Shenghui works out of our Leiden offices. The focus of her work is on text and data mining, deepening our strengths in this area.

We are lucky to have one of OCLC’s Diversity Fellows working with us this year: Julianna Barrera-Gomez (based in our Dublin office) is working with Lynn Silipigni Connaway and Ixchel Faniel on a variety of projects. We’re fortunate to have these talented young people working with us during their time at OCLC!

And speaking of new ideas, we had two colleagues who joined us in September and October for long visits. Takashi Shimada joined us as an OCLC Research Fellow. Taka (as he graciously allows us to call him) came to us from Keio University and spent time both in San Mateo and Dublin learning about activities within the OCLC Research Library Partnership, and helped us gain a better understanding and appreciation of the issues faced by Japanese research libraries. Simone Kortekaas from Utrecht University spent three weeks of her sabbatical in our Dublin offices, both learning and sharing. We welcome visits like this (whether long or short) because they help us know how our work can make an impact in a real world setting.

As we close out 2012, we look forward to 2013 and all we’ll learn during the coming year. We’ll be sharing it here with you, so stay tuned. We wish you a happy, productive, and peaceful year!

OCLC Research 2012: ArchiveGrid

December 27th, 2012 by Merrilee

This is the fourth posting in a short series, looking back on just some of what we’ve done in the last year.

ArchiveGrid is both a discovery system for an aggregation of archival collection descriptions, and a research sandbox, where we can experiment with both tightly and loosely structured data, and also try out interface design and emerging technologies. The ArchiveGrid team has done a lot in 2012; here are just some highlights.

Although connections to ArchiveGrid from smartphones and tablets make up a relatively small percentage of overall use (currently about 11%), it is double what it was a year ago and is expected to grow. So we developed a new ArchiveGrid web interface that used responsive web design principles, letting the system adapt to a wide range of devices. The new interface was developed and tested over the summer, demonstrated at RBMS and SAA, and launched in October.

Around 75 new contributors of mostly EAD, but also PDF and HTML finding aids, joined ArchiveGrid this year, helping grow the index to a record 1.8 million collection descriptions from WorldCat and from crawler sites that institutions host. In February, the Northwest Digital Archives gave researchers another access point to noted Pacific Northwest archival and manuscript collections by contributing to ArchiveGrid its aggregated finding aids from 36 colleges, universities, libraries, museums, and historical societies in Oregon, Washington, Alaska, Idaho, and Montana. In March, shortly before St. Patrick’s Day, National University of Ireland – Galway joined ArchiveGrid as our first contributor from Ireland, with 164 finding aids harvested and indexed.

The team gathered data via a survey that went out in spring to archives and special collections researchers. The purpose of the survey was to update our findings from previous user studies about these researchers. We also wanted to find out how Web 2.0 technology had changed how archives and special collections research is done. Surprisingly, we spotted a shift in who archives and special collections researchers are, with “unaffiliated scholars” – those who are not genealogists, faculty, or graduate students – making up nearly a quarter of the total number of survey respondents. We also noted a smaller role than expected for social media in archives and special collections research, and a simultaneous need for archivists and librarians to embed themselves online where the researchers are and give help that most researchers say they trust. Ellen Ast presented results from the survey at the June RBMS meeting in San Diego. Look for more next year!

The ArchiveGrid team has been busy promoting ArchiveGrid in various venues: at SAA and at regional conferences for archives professionals who may not attend SAA. I presented on ArchiveGrid at the Society of Southwest Archivists / Council of Intermountain Archivists meeting in May (via Skype, which was a new experience for me!) and Bruce Washburn led a 90-minute discussion about ArchiveGrid at the Mid-Atlantic Regional Archives Conference in October, leading to a flurry of new and potential ArchiveGrid contributors.

Capping the year, OCLC Research secured an intern, Marc Bron, for 2013 who will develop a WorldCat and ArchiveGrid data mapping system in order to improve name-based discovery. Bron is a doctoral student from the Netherlands and will work in the San Mateo research office.

Led by Ellen Ast, the ArchiveGrid team launched a companion blog at the beginning of the year as a new venue for project team members to write about ArchiveGrid, our research activities around archival research and discovery, and developments in archives and special collections. The blog tracks new contributors and index growth, announces system developments, explains how we build and maintain our system, summarizes activity at conferences attended, highlights collections, and notes current events relevant to our target audience: archives and special collections practitioners, users, and aficionados. If you want to continue to follow ArchiveGrid in action, keep up with us all year round by following our blog!

OCLC Research 2012: Happy Holidays!

December 21st, 2012 by Merrilee

Taking a break from our end of year summary, which will continue next week after Christmas. Until then, happy holidays from us to you!

OCLC Research 2012: Born Digital

December 20th, 2012 by Merrilee

This is the third posting in a miniseries of blog postings, looking back on what we’ve done in the last year. More to come!

One of the findings from our 2010 survey of special collections and archives in the US and Canada was that dealing with “born-digital” materials is one of the most challenging issues facing special collections. This is nothing new, but we realized that it was time to move past the “deer in the headlights” phase we seem to be in and move towards practical solutions based on emerging practice.

This year, Ricky Erway teamed up with Jackie Dooley and a crackerjack team of experts to push forward on born-digital solutions. The result is our Demystifying Born Digital project area, and two reports: You’ve Got to Walk Before You Can Run: First Steps for Managing Born-Digital Content Received on Physical Media, and Swatting the Long Tail of Digital Media: A Call for Collaboration.

You’ve Got to Walk is a gem of a report, informed by the group of practitioners who advise the Demystifying project. Its simple advice is encouraging, and practical. When we took a big stack of copies to the Society of American Archivists meeting, they were snapped up. This paper inspired the Jump In initiative — SAA’s Manuscripts Repositories section put out a challenge for archivists to take the Jump In pledge and take some of those first steps outlined in the report. Results will be discussed at next year’s meeting in August. We are of course delighted that this report has inspired action and look forward to hearing about the outcomes.

Swatting the Long Tail is a call for action more than it is a report. It calls for collaboration on transferring digital content from unstable physical media, and challenges the community to come up with an ecology of service providers.

More reports are in the works, and we’re looking forward to seeing what other action our work encourages, as well as what inspiration we can take from the community.

OCLC Research 2012: and the winner is…

December 19th, 2012 by Merrilee

We are doing a mini series of blog postings to reflect on some of our accomplishments in 2012. This posting is the second in the series.

Each year, OCLC Research staff are honored in various ways. This year is no exception and in fact we seem to have had a bumper crop.

In March, Ixchel Faniel won the iConference Award for her paper “Managing Fixity and Fluidity in Data Repositories.” The paper was co-authored with University of Michigan School of Information Professor Elizabeth Yakel and two doctoral students, Morgan Daniels and Kathleen Fear. This is one of the many contributions that Ixchel is making to help us understand data repositories and digital curation.

In May, colleagues Lynn Silipigni Connaway and Patrick Confer won RUSA’s 2012 Reference Service Press Award for their article “‘Are We Getting Warmer?’: Query Clarification in Live Chat Virtual Reference.” Lynn and Patrick co-authored the article with research colleagues Marie L. Radford, Susanna Sabolcsi-Boros, and Hannah Kwon of Rutgers, the State University of New Jersey.

You can hold your applause for Lynn, because in November she won the ALISE/Bohdan S. Wynar Research Paper Competition for her article “Not dead yet! A longitudinal study of query type and ready reference accuracy in live chat and IM reference,” to be published in Library & Information Science. Lynn and Marie have done a lot to improve our understanding of chat reference (and in my opinion have done much to underscore the value of basic customer service in libraries).

In October, our colleague Jeff Young was honored as the 2012 Kent State University SLIS Alumnus of the Year, an award given to a graduate who has made a significant contribution to the profession. Jeff was selected because of his important work using Linked Data to increase the presence and discoverability of library data and materials on the web. One of these days, Jeff should get a special award for helping to explain linked data to his colleagues, but we haven’t gotten our act together yet.

Research colleagues also continue to be your “friends in high places”: Lynn was elected to the ASIS&T Board of Directors; Brian Lavoie was elected to the Dryad Data Repository Board of Directors; Eric Childress was invited to join the NISO Content and Collection Management Committee; and of course Jackie Dooley began her term as president of the Society of American Archivists (we still do get to see Jackie from time to time, although most of her blogging these days is over at Off the Record).

Finally, OCLC Research received an award of a different kind — funding! In June, JISC extended funding for the project “Visitors and Residents: What Motivates Engagement with the Digital Information Environment?”. On our end, the work is being led by none other than Lynn Silipigni Connaway, who is working with David S. White from the University of Oxford. This project helps expand our transnational knowledge base about students and technology.

Congratulations to everyone, and best wishes for continued success in the new year.

OCLC Research 2012: Wikipedia and Libraries

December 18th, 2012 by Merrilee

At the end of 2012, we are doing a mini series of blog postings to reflect on some of the year’s high points. This posting is the first in the series. Watch for updates!

2012 has been a great year for me, because I’ve had the privilege of seeing a project I’ve been passionate about for some time come to life: exploring the connection between Wikipedia and Libraries. Around this time last year I began making connections with the Wikipedia GLAM community, and exploring the idea of OCLC Research hosting a Wikipedian in Residence. We were fortunate enough to receive organizational support for this idea and, with help from folks in the Wikipedia community, to craft a position description and bring Max Klein into our team in OCLC Research. Having Max working with us has been terrific and not just because of his Wikipedia skills.

Since we’ve had Max on board, we have attended Wikimania, held not one but two Wikipedia Loves Libraries events, hosted two successful webinars attended by more than 500 librarians, and made countless videos (okay, I counted them up and there are at least 8). And then there was the Open Access Wikipedia Challenge on P2PU. Oh, and VIAFbot, which brought authority control templates and VIAF links to thousands of articles on the English language Wikipedia.

Earlier this month, I presented a breakout session at CNI (along with Sara Snyder, from the Archives of American Art) on the connection between Wikipedia and Libraries. The session was well attended but more importantly, there was a lot of interest and excitement about the connection between Wikipedia and libraries. I’m very pleased that Max’s term has been extended, so he can help us explore some of those possibilities. So as we close out a successful and productive year, I look forward to another year of highlights in this area.

Want to know more? View all the HangingTogether blog posts on this topic!

Managing print books: A mega-problem?

December 12th, 2012 by Constance

This research note was co-authored by Brian Lavoie  and Constance Malpas.

Opportunity cost seems to be the watchword for print book collections these days. The staff, physical space, and other resources consumed by print-centric collections and services are badly needed to support new priorities in library services, such as deeper user engagement and closer alignment with changing research and learning practices. In the face of evidence of declining print book usage, combined with an ever-expanding array of digital alternatives, it is not difficult to imagine a future where “bookless” libraries are the norm.

But this may be premature. Few libraries are prepared to pack up their print books and send them to off-site high-density storage. On several highly-publicized occasions, plans to reduce local print book inventory have met vigorous opposition – witness the recent firestorm at the New York Public Library. In short, print collections pose a dilemma for libraries: they are assets too valuable to dispose of, yet sinking in priority vis-à-vis other aspects of the library service portfolio. The phrase “managing down print”, increasingly common in print management discussions, neatly captures the dueling imperatives: the need to allocate resources away from managing print book collections, but to do so in a gradual, orderly way. So the search is on for the golden mean: a viable print management strategy that can at once leverage more value out of the legacy print investment, and lower maintenance costs. This question is far from settled, but the contours of the solution are becoming apparent. First, future print management strategies are likely to be collaborative, with print books increasingly viewed as a shared asset to be managed cooperatively. Second, the scale of cooperation receiving the most attention, in terms of both planned and implemented solutions, is at the regional level.

This is not to suggest that the rest is a mere matter of detail: for example, the policy and technical infrastructures needed to support a regional strategy for cooperative print management are still in early stages of development. In the meantime, we can speculate on what a network of cooperatively-managed regional print book collections might look like. The OCLC Research report Print Management at “Mega-scale”: A Regional Perspective on Print Book Collections in North America explores a new geography of print book collections based on the concept of mega-regions. Mega-regions are geographical areas defined on the basis of economic integration and other forms of interdependence. The mega-regions framework has the benefit of basing regional boundaries on a substantive underpinning of shared traditions, mutual interests, and the needs of a common constituency.

In the report, we combine WorldCat data with an operationalization of the mega-region concept by urbanist Richard Florida to produce a network of twelve mega-regional print book collections – i.e., the collective print book holdings of all libraries in each region – corresponding to the twelve North American mega-regions identified by Florida (see figure below). We explore the salient characteristics of the mega-regional collections individually and as a group, and synthesize these characteristics into a set of stylized facts. The stylized facts are then used to explore the implications of a regionally-based, cooperative print strategy across a wide spectrum of issues, including access, management, and preservation.

Viewing print book collections as a cooperatively-managed regional resource yields benefits on both the supply side and the demand side. On the demand side, aggregating the print holdings of many institutions into a single collective collection creates a resource of greater scope and depth than any single local collection. Exposing this collective collection to users around the region – or even beyond – may amplify or even create demand for print books that experience little or no local use. On the supply side, regional coordination could streamline print management and reduce costs. Opportunities emerge for collaboration and coordination in collecting and retention decisions – for example, by diminishing excessive duplication and sharing collecting priorities across many institutions.

While our application of the mega-regions framework to print management is speculative, evidence does suggest that the organization of library stewardship is being reconfigured on a new supra-institutional, regional basis. The Western Regional Storage Trust, a cooperative effort to archive print journals in many Western (and even Midwestern) US libraries, is one among many examples. Some of these initiatives, like the CIC Shared Print Archive or the ASERL Print Journal Archive, have the potential – if not the explicit intent – to deliver benefit at mega-regional scale: CIC member libraries are distributed across the expansive CHI-PITTS region and ASERL’s membership is concentrated in CHAR-LANTA. It will be interesting to see if these natural experiments in redistributing print preservation responsibilities across broad geographies result in a richer collective resource, undergirded by a robust federation of preservation commitments, or a differently fragmented set of regional collections.

In the coming year, we’ll have an opportunity to extend our mega-regions analysis by taking a demand-side view of the North American print book collection. We’ll be working with partner libraries in the CIC (notably the Ohio State University) to examine how inter-lending data might be combined with supply-side holdings data to inform a regional print management strategy for retrospective monographic collections in CHI-PITTS. Here’s a thumbnail sketch of the regional resource, excerpted from our project proposal:

In aggregate, the print book resource held in CHI-PITTS libraries amounts to more than 40% of print book titles in North America. About 16% of these titles are unique to the region, i.e. not duplicated in any of the other eleven mega region collections. The remainder constitutes a significant preservation “backstop” for other North American libraries: 50-92% of titles held by other individual mega-regions are duplicated in CHI-PITTS libraries. Thus, investments in the preservation of print books in the CHI-PITTS region can deliver significant benefit to libraries throughout North America. Conversely, there are relatively few regional collections that duplicate a significant share of the CHI-PITTS collection, which means that the burden of print preservation responsibilities (and investments) will be largely shouldered by institutions within the region. Since less than a fifth of the print books in the region are held by academic research libraries – traditionally viewed as the institutions with the greatest stake in print preservation – it seems apparent that networks like the CIC will have an important role to play in rationalizing regional print preservation priorities and investment.

The CIC is an interesting test case for this sort of project, since all libraries in the consortium are partners in the HathiTrust Digital Library, a shared digital repository. By our reckoning, a third or more of the titles held by CIC member libraries are already “backed up” by digital preservation copies in HathiTrust.  Yet from a regional perspective, the situation is strikingly different:  we estimate that less than a fifth of the print books in CHI-PITTS are duplicated by HathiTrust. The collective preservation burden therefore remains significant even in a region with comparatively robust cooperative library infrastructure.

In regions where shared library infrastructure is less developed or less integrated, the challenges may be even greater.  Take Southern California, for example.  We estimate that the regional print book resource in the SO-CAL mega-region amounts to just under 10 million titles with about 40 million library holdings (i.e. holdings set by libraries in the region).  While much smaller in size than the CHI-PITTS collection, the SO-CAL collection represents an important regional asset and a significant stewardship concern for academic libraries in the area.  As elsewhere, these libraries are individually and collectively reassessing the opportunity costs of managing local print inventory and considering “above the institution” solutions.  Not surprisingly, smaller academic libraries look to larger research-intensive institutions as partners in the preservation enterprise and potential providers of shared infrastructure.

The University of California system, with five large research libraries and a high-density storage facility in the SO-CAL region, is an obvious focus of attention. But the infrastructure developed to support a statewide research university system with a global brand cannot simply be extended to serve all other libraries in the region. There is no shared governance model for the regional library resource, which is distributed across hundreds of public and private institutions. And there is no business model currently in place that would enable libraries to opt in to “preservation by proxy” arrangements. Yet progress is being made. A group of library leaders from academic libraries and consortia in and around Southern California will meet later this week to begin what is certain to be a long conversation about a regional print management strategy. Bob Kieft, a long-time supporter of (and sometime agitator for) collaborative collection management, has organized the meeting, which will be hosted by UCLA. It’s impossible to predict what the outcomes of the discussion might be – there is certainly no recipe for success in regional print management – but it is unquestionably an important first step in addressing what is increasingly a “mega” problem.

VIAFbot Debriefing

November 28th, 2012 by Max

Shortly after reaching the 1/4 million edits milestone, VIAFbot finished linking Wikipedia biography articles to VIAF.org. Examining the bot’s logs reveals telling statistics about the landscape of Authorities on Wikipedia. We can now know how much linked authority data is on Wikipedia, its composition, and the similarities between languages.

First, let’s understand the flow of the bot’s job. With VIAFbot I sought to reciprocate the links from VIAF.org to Wikipedia, which were algorithmically matched by name, important dates, and selected works. Therefore it started by visiting all the Wikipedia links that existed on VIAF.org. Note that owing to the delay between when the links were created and now, some of the pages had been deleted or merged (Fig. 1, orange region). For the rest of the set-up it utilized German Wikipedia, which has focused a lot on its authorities data. VIAFbot also loaded all available German Wikipedia articles equivalent to our English matches, the “interwiki link” in Wikipedia parlance.

Next, VIAFbot searched for the equivalent structured-data templates, Authority control and Normdaten, to see what preexisting authorities data those pages held. German Wikipedia shone with 92,253 Normdaten templates (Fig. 1, purple region), of which 74,864 had the VIAF parameter filled (Fig. 1, pink region), compared to English Wikipedia’s mere 9,034 templates with 770 VIAF IDs.

Figure 1.

The program then compared the VIAF IDs supplied by English Wikipedia, German Wikipedia, and VIAF.org, although all three sources were not always present. As long as the sources that were present didn’t conflict, VIAFbot wrote the VIAF ID to the English Wikipedia page. If a conflict was found, the bot noted it for human inspection on Wikipedia, along with which sources conflicted. One telling statistic was how often the different sources disagreed with one another. These disagreement rates were surprisingly similar, but German Wikipedia seemed to disagree marginally less with VIAF.org, at 11.3% compared to English Wikipedia’s 15.9% (Fig. 2).

Figure 2.
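The decision rule described above can be written down compactly. This Python sketch is a reading of the prose, not VIAFbot’s actual code: the function name, the return format, and the VIAF IDs in the usage lines are all made up for illustration.

```python
def reconcile(enwiki=None, dewiki=None, viaf_org=None):
    """Decide what to do with up to three candidate VIAF IDs for one article.

    Returns (action, viaf_id, sources) where action is 'write' when every
    available source agrees, 'flag' when they conflict, and 'skip' when no
    source supplied an ID.
    """
    candidates = {"enwiki": enwiki, "dewiki": dewiki, "viaf.org": viaf_org}
    present = {name: vid for name, vid in candidates.items() if vid is not None}
    if not present:
        return ("skip", None, [])
    distinct = set(present.values())
    if len(distinct) == 1:
        return ("write", distinct.pop(), sorted(present))
    # Conflict: report which sources were involved for human inspection.
    return ("flag", None, sorted(present))


# Invented IDs: agreement, a lone source, and a conflict.
agree = reconcile(enwiki="1001", dewiki="1001", viaf_org="1001")
lone = reconcile(viaf_org="1002")
conflict = reconcile(dewiki="1003", viaf_org="1004")
print(agree, lone, conflict, sep="\n")
```

Note that a lone source is written without any cross-check, which fits the later observation that some wrong IDs slipped through in cases with no disagreement.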

In the noncontroversial non-disagreement cases, of which there were 254,678, there were still some errors of a different variety. Even though there was no disagreement among the sources (and probably in the instances in which VIAF.org was the only source), the wrong VIAF number was sometimes written. Some very dedicated Wikipedians took to reporting these errors, and VIAF.org will incorporate those corrections. That is the power of crowdsourcing to refine algorithmic accuracy.

The question still remains: how much are these links being used? Google Analytics on the VIAF.org site can help answer that. German Wikipedia was the largest referrer to VIAF.org as late as September 2012. VIAFbot started editing in October, and the effect was immediately tangible, with English Wikipedia soon gaining pole position and then doubling total referrals (Fig. 3). It must be said, though, that this level of viewership may not be sustained as the “curiosity clicks” of Wikipedians being notified of changes through their watchlists start to fade.

Figure 3. Referral traffic to VIAF.org.

Still, don’t doubt the usefulness of the project. For instance we received this email from John Myers of Union College in  Schenectady NY,

“I had an Arabic name to enter into a record as part of a note, and I wasn’t confident about the diacritics. So, I look in the authority file to temporarily download it, copy the form of the name, and then move on. Couldn’t find the name in OCLC. Look in Wikipedia under his common name – bingo. Even better, Wikipedia has a link to VIAF, double bingo! With the authorized form from VIAF, I could readily find the record in OCLC (I was tempted to copy the name form directly from VIAF, but didn’t want to push my luck.) The miracles of an interconnected bibliographic dataverse!”

VIAFbot had written the link for ‘Aziz ‘Aku ak-Misri only a few days prior.

The principal benefit of VIAFbot is the interconnected structure. Recognizing this, other Wikipedias (Italian and Swedish) have been in contact and asked for the same on their wikis. Yet to truly be interconnected, the next step forward is to integrate VIAF IDs not into any one Wikipedia, but into the forthcoming Wikidata, a central database for all Wikipedias across languages. Fortuitously, the pywikidata bot framework is stabilizing, and I’m in need of a new project now.

Without confusion,

Max Klein (@notconfusing)