Archive for the 'Renovating Descriptive Practice' Category

ISBNs in WorldCat

Thursday, May 23rd, 2013 by Roy

Recently a question came up on the BIBFRAME list about ISBNs, and how many of them were in MARC records. This is just the kind of question that OCLC Research is uniquely placed to answer, so I quickly wrote some simple Perl code to run as a Hadoop streaming job to find out.

It was remarkably quick and easy to find out, although I had to edit and re-run the code when I discovered a flaw in my logic. This is, sadly, all too frequently the case. But not too much later I had my result:

Records      ISBNs (020 $a) per Record   Percent of WorldCat
230444194    0                           77.71%
55668178     2                           18.77%
4766652      1                           1.61%
3708352      4                           1.25%
616623       3                           0.21%
411230       6                           0.14%
125715       8                           0.04%
65796        5                           0.02%
45304        10                          0.02%
30155        12                          0.01%

These are all of the occurrences of a 020 $a in WorldCat as of 1 May 2013 [Added for clarification: the prior sentence describes exactly what is being counted. That is, I am not (yet) examining ISBNs for 10-digit vs. 13-digit; therefore, many of the records with 2 ISBNs may in fact simply have both versions].  A few observations:

  • Many items in WorldCat were published before the invention of the ISBN.
  • Many items in WorldCat are not ISBN-appropriate (e.g., unpublished materials).
  • ISBNs are therefore problematic as identifiers except for a narrow slice of materials (mainly printed books since the mid-60s).

A much better identifier for many purposes is, I assert, the OCLC number.
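
For readers who want to see the shape of such a count, here is a minimal sketch in Python. It is not the Perl code that ran on the cluster. The record shape (a list of `(tag, subfields)` pairs) and every function name are assumptions made for illustration. The second half shows how a record with two 020 $a values can collapse to a single ISBN once the 10-digit form is converted to its 13-digit equivalent, which is the caveat raised in the clarification above.

```python
from collections import Counter

# Hypothetical record shape: a list of (tag, subfields) pairs, where
# subfields is a list of (code, value) pairs. Real MARC parsing is omitted.

def count_020a(record):
    """Return how many 020 $a subfields a record carries."""
    return sum(
        1
        for tag, subfields in record
        if tag == "020"
        for code, _value in subfields
        if code == "a"
    )

def histogram(records):
    """Map 'number of 020 $a per record' to 'number of records'."""
    return Counter(count_020a(record) for record in records)

def isbn10_to_isbn13(isbn10):
    """Convert a 10-digit ISBN to 13 digits: 978 prefix plus a new check digit."""
    core = "978" + isbn10[:9]
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)

def distinct_isbns(record):
    """Count ISBNs in a record after folding 10-digit forms into 13-digit ones."""
    seen = set()
    for tag, subfields in record:
        if tag != "020":
            continue
        for code, value in subfields:
            if code != "a":
                continue
            parts = value.split()  # drop any trailing qualifier such as "(pbk.)"
            if not parts:
                continue
            number = parts[0].replace("-", "").upper()
            if len(number) == 10 and number[:9].isdigit():
                number = isbn10_to_isbn13(number)
            seen.add(number)
    return len(seen)

# Two sample records: one without an ISBN, one with both forms of the same ISBN.
records = [
    [("245", [("a", "A title without an ISBN")])],
    [("020", [("a", "0306406152")]), ("020", [("a", "9780306406157 (pbk.)")])],
]
```

With these two sample records, `histogram(records)` puts one record in the 0 bucket and one in the 2 bucket, while `distinct_isbns(records[1])` reports a single ISBN. Malformed values are left as they are; a production job would need to decide how to treat them.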

Irreconcilable differences? Name authority control & humanities scholarship

Wednesday, March 27th, 2013 by Karen

This post is co-authored by David Michelson, Vanderbilt University.

Over the past year OCLC Research has been working with a group of Syriac studies scholars with the goal of tapping their expertise to enrich the Virtual International Authority File (VIAF), by adding Syriac script to existing names and adding new ones. Syriac is a dialect of Aramaic, developed in the kingdom of Mesopotamia in the first century A.D. It flourished in the Persian and Roman Empires, and Syriac texts comprise the third largest surviving corpus of literature from the fourth through seventh centuries, after Greek and Latin. We anticipated that the issues we addressed could then be applied to scholars in other disciplines. We started with the assumption that the scholars could use the Library of Congress’ Metadata Authority Description Schema, or MADS.

We have learned a lot in the process of building a bridge between scholarly interest in names as a subject of historical research and VIAF’s interest in persistent identifiers for each name in authority files. We found that we shared values for name authorities:

  • Scholars and librarians share a mutual appreciation for each other’s work on identifying names appearing in historical research.
  • Many scholarly projects in the digital humanities are already relying on VIAF for authority control and to anchor Linked Open Data. The Syriac scholars pointed us to digital humanities projects (such as the Fihrist, a union catalog of Islamic manuscripts hosted in the UK, and those listed in the Digital Classicist Wiki under “Very Clean URIs”) that have adopted VIAF URIs as the best method for authority control and to link to other data sets.
  • VIAF can provide part of the cyberinfrastructure for digital humanities, a standard way for linking and querying data, a need identified by The American Council of Learned Societies’ national Commission on Cyberinfrastructure.

We discovered two key issues important to scholars that just don’t mesh well with the library practices represented in name authority files, which VIAF aggregates, due to differences in intended audiences, disciplinary norms, and metadata needs:

  • Scholars eschew a “preferred name”. Libraries need to bring together all the variant forms of a name under one form, choosing a “predominant form” if a person writes in more than one language. This approach meets the discovery needs for a specific national or linguistic community. Scholarship is international, and the “preferred name” in one locale will differ from another. Further, the context is crucial for classifying names. For scholars, a “preferred name” needs to also include by whom and for what purpose it is preferred. For example, a Syriac name in use in 600 may be classified as “classical Syriac”; but the same name in use one thousand years later may be classified as a neo-Aramaic dialect. The same Syriac author might have multiple “preferred forms” in multiple languages (Syriac, Arabic, Greek), each used by different or competing cultural communities. This applies to other languages as well. Scholars resist declaring a “preferred form” because it could exclude some historical or cultural perspective. Each form may be “authoritative” depending on the time and place it appears.
  • Scholars need to know the provenance of each form of name. When a name has multiple forms, scholars, especially historians, need to know the provenance of each name, following the citation practices commonly used in their field. Historical and textual scholarship is built on conventions of evidence and values the process of contesting intellectual claims. MADS does not provide the structure for citing these sources or providing the required contextual information. Although library practices require “literary warrant” to justify why one form of name was chosen as the authorized heading or access point, they do not document the context for any of the variant forms. There is not even a field to indicate the language of a name’s form. We can deduce the language of the preferred form only by the source of the authority file. Scholars find little value in name information without provenance data, an equivalent of footnotes.
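
To make the two requirements above concrete, here is a rough sketch of a data structure in which every form of a name carries its own language, period, preferring community, and source citation, and no single form is marked as the preferred one. This is not the MADS schema and not the Syriaca.org data model. The class names, field names, and sample values are all hypothetical placeholders.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class NameForm:
    """One attested form of a name, with the context scholars ask for."""
    text: str          # the name as written
    language: str      # e.g. "classical Syriac", "Arabic", "Greek"
    preferred_by: str  # which community or authority prefers this form
    source: str        # citation for where the form is attested (the "footnote")
    period: str = ""   # when the form was in use, if known

@dataclass
class PersonEntry:
    """A person with any number of equally valid name forms."""
    identifier: str
    forms: list = field(default_factory=list)

    def forms_preferred_by(self, community):
        """Return the forms a given community treats as preferred."""
        return [f for f in self.forms if f.preferred_by == community]

    def languages(self):
        """Return the set of languages in which the name is attested."""
        return {f.language for f in self.forms}

# Placeholder values only; none of these are real citations or identifiers.
entry = PersonEntry(identifier="example-person-1")
entry.forms.append(NameForm(
    text="<name in Syriac script>", language="classical Syriac",
    preferred_by="community A", source="<citation 1>", period="7th century"))
entry.forms.append(NameForm(
    text="<name in Arabic script>", language="Arabic",
    preferred_by="community B", source="<citation 2>"))
```

The design choice worth noting is that "preferred" becomes a property of a form relative to a community, so a query such as `entry.forms_preferred_by("community A")` can return a different answer for each audience.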

The good news is that our collaboration has pointed the way for future interaction between VIAF, the VIAF Council, and the scholarly community:

  • Syriac studies colleagues are building their own Syriaca.org database where they can describe each personal name with the granularity that meets their scholarly requirements. We will work together to create a crosswalk so that OCLC Research can extract the information that fits into a MADS structure, and can still enrich existing VIAF clusters with Syriac and other script forms or add new names. VIAF and Syriaca.org will follow existing protocols for using the http://viaf.org/viaf/sourceID namespace in minting URIs for new names not yet in VIAF.
  • For those who need the additional details, people could click a link to the name in the Syriaca.org database, much as those who want to read a biography of a VIAF name can click on a Wikipedia link, if present. Thus VIAF can still integrate scholars’ expertise and serve scholarly users without needing to overcome the fundamental differences between library and scholarly practices.
  • Syriaca.org will work with OCLC and the VIAF Council to establish a path for other scholarly research organizations to contribute to VIAF.

The screen captures of the current VIAF cluster and a Syriac Reference Portal Demo record for Ephrem below help us imagine how VIAF could be enhanced.

VIAF Cluster

Extract from the Syriac Reference Portal Demo

David Michelson is the assistant professor of early Christianity at Vanderbilt University and director of The Syriac Reference Portal, a joint project among Vanderbilt University, Princeton University, St. Michael’s College Vermont, Texas A&M University, Beth Mardutho the Syriac Institute and other affiliate institutions, funded by the National Endowment for the Humanities and the Andrew W. Mellon Foundation.

“Cataloging Unchained”

Wednesday, February 27th, 2013 by Roy

Lorcan Dempsey (VP of Research at OCLC) has long said that we need to “make our data work harder.” And for years that is exactly what OCLC Research has been doing. So when I was asked to speak on data mining at the OCLC European, Middle East, and African Regional Council Meeting in Strasbourg, France, I knew I would have a lot to talk about. Too much, in fact.

Instead of trying to cover everything we’ve been doing in a whirlwind of slides that no one would remember, I decided to use WorldCat Identities as a “poster child” for the kinds of data mining activities we have been doing recently here at OCLC Research. Then, I described another, related project — the Virtual International Authority File. To bring it all home I mentioned how we’re considering how we might be able to marry these two resources into one “super” identities service.

Consider what it would mean to take an aggregation of library-curated authority records and enhance it with algorithmically-derived data from WorldCat as well as links to other resources about creators such as Wikipedia. This would provide a rich resource of information about creators, all sitting behind authoritative and maintained identifiers that could be used in emerging new bibliographic structures such as those being created by the Library of Congress’ Bibliographic Framework Transition Initiative. The mind reels with the possibilities.

But before I could jump into all this I needed a way to quickly explain why we are doing things like this — and how we are doing them. I decided I needed to make a video. So last week that is exactly what I did, with help from colleagues in Dublin. The result was less than three-and-a-half minutes long, and yet it amply set the stage for what was to come after. Plus, it can have a life of its own.

Take a look yourself, at “Cataloging Unchained”, and let me know what you think in the comments.

Adventures in Hadoop, #5: String Searching vs. Record Parsing

Friday, January 25th, 2013 by Roy

In a previous post I described how I could now string search WorldCat using the Research compute cluster and Hadoop. This means I can find any string of characters anywhere in a MARC record and output the OCLC numbers of matches and/or the entire record, all within a few minutes. Keep in mind we are now talking about nearly 300 million records.

String searching is not, by any means, the only way we use Hadoop. It is actually more common for us to use code (typically written in Java, Python, or Perl) to parse the records and output portions for further processing. But since I had been using such code for simply identifying records of interest, I began to wonder which method of processing was faster.

In a short experiment I quickly proved that string searching was about three times faster for simple identification and output of records than was the code I had previously been using. This is because, I believe, the code I had been using would parse the record before determining if it met my criteria. This one extra step added so much overhead to the process that it would take 15 minutes (in one test) rather than 5.

This likely means that in some cases where relatively few records would match your criteria, you would still be better off extracting the records by string searching and then running your extraction code against them off-cluster. For example, if I wanted to pull out the 245 fields of, say, about 1,000 records, I’d likely be better off extracting the records I wanted by string searching and then processing that file directly without using Hadoop.

One last permutation, however. If your process is one that identifies 1,000 records in some situations and several million in others, having one process through which all operations flow is more efficient than maintaining two or more separate processes.
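
The two-stage approach described above can be sketched roughly as follows in Python: a cheap substring test selects candidate records without parsing them, and only the survivors are handed to the more expensive parser. The record format here (one `tag<TAB>value` line per field) is a toy stand-in rather than real MARC, and the function names are invented for the example.

```python
def string_search(raw_records, needle):
    """Yield raw records that contain the needle, without parsing anything."""
    for raw in raw_records:
        if needle in raw:
            yield raw

def parse(raw):
    """Toy parser for a hypothetical 'tag<TAB>value' line-per-field format."""
    fields = []
    for line in raw.splitlines():
        tag, _sep, value = line.partition("\t")
        fields.append((tag, value))
    return fields

def titles_matching(raw_records, needle):
    """Parse only the records that survived the string search; collect 245s."""
    titles = []
    for raw in string_search(raw_records, needle):
        for tag, value in parse(raw):
            if tag == "245":
                titles.append(value)
    return titles

sample = [
    "001\t1\n245\tHadoop basics",
    "001\t2\n245\tAn unrelated title",
]
matches = titles_matching(sample, "Hadoop")
```

In this sample only the first record is ever parsed. Whether that saves time in practice depends on how selective the search string is, which is the trade-off the post describes.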

And so it goes. Another day and another adventure in Hadoop.

Top Corporate Names in WorldCat

Tuesday, November 20th, 2012 by Roy

As I explained earlier, I have been doing some investigations into how MARC has been used over the last several decades. Curious about the contents of the 110 $a (corporate names), I parsed it and the top 30 headings are listed below. Keep in mind a few things, however:

  • Entities can be put together in different ways. For example, “Great Britain”, “England and Wales”, and “Scotland” all appear in the list.
  • My process (as presently constituted) is simplistic. Therefore, both “Canada.” and “CANADA.” are counted separately.
  • Slight variations in headings produce different entries. For example, “Santa Fe River Baptist Association (Fla.)” and “Santa Fe River Baptist Association.”
  • Typos produce different entries.

Eventually I will make the entire list available. If you’re really eager, email me.

1417046	United States.
587986	Great Britain.
358417	France.
206591	Canada.
176754	Geological Survey (U.S.)
101421	California.
98397	Michigan.
79615	Australia.
78175	Catholic Church.
64390	New York (State).
57037	New Zealand.
48218	Sotheby's (Firm)
46196	Hôtel Drouot.
45853	Québec (Province).
44812	New South Wales.
44022	England and Wales.
43469	Massachusetts.
41914	Pennsylvania.
41560	Christie, Manson & Woods.
41292	Église catholique.
39517	Ontario.
36636	Scotland.
36234	Illinois.
34691	United Nations.
31121	India.
31011	Agence de presse Meurisse.
29958	Cornell University.
29648	Church of England.
29073	Japan.
28675	Victoria.
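
The case and punctuation caveats listed before the table suggest one possible refinement, sketched here in Python. This is not the process that produced the counts above. The normalization rules (collapse whitespace, drop trailing periods, ignore case) and the function names are assumptions chosen for illustration.

```python
from collections import Counter

def normalize_heading(heading):
    """Collapse whitespace, drop trailing periods, and ignore case."""
    collapsed = " ".join(heading.split())
    return collapsed.rstrip(".").casefold()

def count_headings(headings):
    """Count headings after normalization."""
    return Counter(normalize_heading(h) for h in headings)

sample = [
    "Canada.",
    "CANADA.",
    "Santa Fe River Baptist Association (Fla.)",
    "Santa Fe River Baptist Association.",
]
counts = count_headings(sample)
```

Here the two Canada variants merge into one key with a count of 2. The two Santa Fe River Baptist Association headings still count separately, because the “(Fla.)” qualifier is a real difference in the string; merging those, or catching typos, would take fuzzier matching than this sketch attempts.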

Top Topics in WorldCat

Wednesday, November 7th, 2012 by Roy

As I’ve described in a series of posts recently (“Adventures in Hadoop”, four so far), I’ve been having fun on our new compute cluster. Well, maybe “fun” isn’t exactly the right term for diving into the depths of the MARC format, but hey, librarians have to get their kicks somehow.

Anyway, I’ve been doing some work that will eventually see the light of day but for now I want to report on one small finding — the top subject areas in WorldCat. But first let me be very clear about my methodology so that incorrect assumptions are not made.

What I’ve done is to use our Hadoop infrastructure to look at every occurrence of the 650 MARC field and set aside and count the contents of every $a subfield. What this means is that if a record in WorldCat has these subjects:

World War, 1939-1945 — Naval operations.
World War, 1939-1945 — Aerial operations.
World War, 1939-1945 — Pacific Ocean.

Then “World War, 1939-1945”, being the contents of the $a subfield, is counted three times. Therefore, the figures below are not the number of titles with that top-level topic, but the number of times it occurs in WorldCat as a whole. It should also be noted that this is across all formats. Here are the top 20:

807860	English language
739051	World War, 1939-1945
696769	Women
608170	Popular music
583375	Education
558876	Science
522882	Music
512224	Agriculture
433770	Art
403742	Law
397194	Indians of North America
379298	Jews
361501	Architecture
354640	Geology
345761	Railroads
343079	Geschichte.
321255	Roads
313043	World War, 1914-1918
305187	African Americans
293148	City planning

“Geschichte” is German for “History”. It will be interesting to see how this list changes as we add more non-U.S. records to the database.
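
The counting method described above can be sketched in a few lines of Python. This is an illustration of the logic only, not the code that ran on the cluster. The record shape (a list of `(tag, subfields)` pairs), the subdivision codes in the sample, and the function names are assumptions. The sample record mirrors the World War example from the post.

```python
from collections import Counter

def count_650a(records):
    """Count every occurrence of a 650 $a value, including repeats in a record."""
    counts = Counter()
    for record in records:
        for tag, subfields in record:
            if tag != "650":
                continue
            for code, value in subfields:
                if code == "a":
                    counts[value] += 1
    return counts

def records_with_topic(records, topic):
    """Count records carrying the topic at least once, for comparison."""
    return sum(
        1
        for record in records
        if any(
            tag == "650" and ("a", topic) in subfields
            for tag, subfields in record
        )
    )

# One record with three subject headings that share the same $a.
record = [
    ("650", [("a", "World War, 1939-1945"), ("x", "Naval operations.")]),
    ("650", [("a", "World War, 1939-1945"), ("x", "Aerial operations.")]),
    ("650", [("a", "World War, 1939-1945"), ("z", "Pacific Ocean.")]),
]
occurrences = count_650a([record])
```

For this single record, `occurrences["World War, 1939-1945"]` is 3 while `records_with_topic([record], "World War, 1939-1945")` is 1, which is the distinction between occurrences and titles that the methodology note draws.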

Registering researchers in authority files

Monday, October 29th, 2012 by Karen

Last month we launched a new task group of OCLC Research Library Partner staff and others who are involved in uniquely identifying authors and researchers, with identifiers that can be shared in a linked data environment.

We were spurred by institutions’ need to uniquely identify all their researchers to measure their scholarly output, a factor in reputation and ranking. Yet national authority files cover researchers only partially. They do not include authors that write only journal articles, or researchers who don’t publish but create or contribute to data sets and other research activities.

We see a number of activities in this “name space” with potential overlap, including: the International Standard Name Identifier (ISNI), the Virtual International Authority File (VIAF), Open Researcher and Contributor ID (ORCID), the Dutch Digital Author Identifier system (DAI), The Names Project in the UK, the Program for Cooperative Cataloging’s NACO program, researcher profile systems such as VIVO, and Current Research Information Systems (CRIS).

The Registering Researchers in Authority Files Task Group will document the benefits of researcher identification; significant challenges; trade-offs among the current approaches; and mechanisms for linking approaches and data. We are starting with use case scenarios, for example:

  • Researchers who want to identify others in their field
  • Institutions that need to collate the intellectual output of their researchers
  • Funders who want to track the outputs for awarded grants
  • Services providing persistent identifiers for researchers that need to disambiguate names and ensure correct attributions.

We are hoping that our report will help address all of the above needs, and suggest approaches for linking data from different sources in a coherent way. Details on this activity and the task group roster, which includes experts from the Netherlands, the United Kingdom, and the United States, are on our new Registering Researchers in Authority Files activity page on the OCLC Research website.

If there are systems or “name authority hubs” you want to make sure we look at, please let us know with a comment below.

Two Huge Linked Data Announcements

Wednesday, June 20th, 2012 by Roy

This week we have announced two major initiatives that are now providing significant library linked data resources to the world. First was the announcement yesterday that all of the 23rd Edition of the Dewey Decimal Classification has been released on the web as linked data. From the announcement:

All assignable classes from DDC 23, the current full edition of the Dewey Decimal Classification, have been released as Dewey linked data. As was the case for the Abridged Edition 14 data, we define “assignable” as including every schedule number that is not a span or a centered entry, bracketed or optional, with the hierarchical relationships adjusted accordingly. In short, these are numbers that you find attached to many WorldCat records as standard Dewey numbers (in 082 fields), as additional Dewey numbers (in 083 fields), or as number components (in 085 fields).

Second was today’s announcement that we have now added Schema.org descriptive markup, as well as a draft set of library extensions, to all of WorldCat. From the press release:

OCLC is taking the first step toward adding linked data to WorldCat by appending Schema.org descriptive mark-up to WorldCat.org pages. WorldCat.org now offers the largest set of linked bibliographic data on the Web. With the addition of Schema.org mark-up to all book, journal and other bibliographic resources in WorldCat.org, the entire publicly available version of WorldCat is now available for use by intelligent Web crawlers, like Google and Bing, that can make use of this metadata in search indexes and other applications.

For more information, see “Linked Data at OCLC”. Please keep in mind that these efforts are beginning steps. We will be reviewing the feedback we receive and likely making changes as opportunities to improve present themselves. For example, we are working to pull together a group of institutions that can collaborate on establishing a set of extensions to the Schema.org elements. A very beginning draft is available, but it will likely go through many changes as others become more closely involved. We welcome your participation.

Follow-up addendum: We’ve had several folks ask about data dumps relative to the WorldCat.org linked data announcement. Adding Schema.org linked data to WorldCat.org is, for the time being, an experiment that we’re putting out there in order to garner feedback and get some early usage results. We expect our model to change; because of that, we’re not publishing any bulk downloads of the data at this time.

Thick Description: Fingerprints, Sonnets, and Aboutness in Special Collections

Thursday, May 17th, 2012 by Jennifer

Discoverability of special collections has long been a top concern of the OCLC Research Library Partnership.  What works? Break out of the OPAC? Beyond MARC? End run around EAD?

Constance recently started a conversation here in the office about “catablogs.”  She’d seen that NYU’s Chela Weber taught a workshop in New York about how to use a blog as a low-overhead collection management system.  A “catablog” can create searchable, browseable online presentations of collections.

Today the Atlantic posted a short article about the impact of blogging rare books. At St Andrews, Daryl Green’s blog played an unusual role in what are otherwise standard special collections procedures – identifying new acquisitions and raising scholarly and financial support. (Book-nerd disclosure: I’ve been following Daryl’s blog for his 52 weeks of fantastic bindings, but Constance sent me the Atlantic article this morning.)

Ellen’s blogging about collections in ArchiveGrid is driving a healthy amount of traffic to ArchiveGrid itself. This is exactly the kind of research question we wanted to pursue with ArchiveGrid. Bruce has wondered if commentary and interpretation wouldn’t improve discovery and make it easier for a researcher to decide what to pursue.

This has prompted me to revisit The Metadata IS the Interface and user studies of relationships between description and discovery or use. Archivists and librarians contribute to discovery when they discard illusions of neutrality and express their excitement for the materials and their opinions about their significance. MARC and EAD have enhanced our management of collections, but don’t necessarily serve all the needs of our users these days.

Over on the RBMS-ish (rare books and manuscripts) side of our profession, considerable thought has been given recently to richer description – “records more like sonnets,” as the Beinecke’s Ellen Elickson put it. I might borrow a term from the anthropologist Clifford Geertz and call it “thick description.” Michelle Light and Tom Hyry have advocated post-modern colophons and annotations. One of the RBMS hipsters has been arguing it is time to bust out of “the coldness of our description.” Mark Dimunation (Library of Congress) and others have imagined meaty and flexible descriptions of special collections like a wheel: hub and spoke. Merrilee blogged about Mark’s talk:

“Dimunation has been intrigued by James Asher’s call for progressive bibliography in which catalog records are viewed as hubs where information can be linked in, or hung on the core record as necessary. In this way, additional information can accrue over time, and doesn’t necessarily need to be contained in the catalog. Links to information that lives outside the catalog form a virtual vertical file that can document unique characteristics, and help form the fingerprint of an item.”

When I first joined OCLC Research, in the days of Shifting Gears, I thought that I’d wasted the past 10 years of my career building curated web exhibits of boutique collections of rare books, manuscripts and archives. In 2007 we needed to scale up digitization. Now my thinking is coming full circle. Curated blogs and exhibits, combined with the voice of the librarian/archivist, accomplish exactly what we’ve always wanted – to make collections visible and increase their impact.

Yet more social metadata for LAMs

Monday, April 23rd, 2012 by Karen

Today we released Social Metadata for Libraries, Archives, and Museums, Part 3: Recommendations and Readings. This is the last in a series of three reports a 21-member Social Metadata Working Group from five countries produced as the result of our research in 2009 and 2010.

The cultural heritage organizations in the OCLC Research Library Partnership have been eager to expand their reach into user communities and to take advantage of users’ expertise to enrich their descriptive metadata. Social metadata (content contributed by users) is evolving as a way to both augment and recontextualize the content and metadata created by LAMs.

Our first report, Social Metadata for Libraries, Archives, and Museums, Part 1: Site Reviews, provides an environmental scan of sites and third-party hosted social media sites relevant to libraries, archives, and museums. We noted which social media features each site supported, such as tagging, comments, reviews, images, videos, ratings, recommendations, lists, links to related articles, etc.

Our second report, Social Metadata for Libraries, Archives, and Museums, Part 2: Survey Analysis, analyzed the results from a social metadata survey of site managers conducted from October to November 2009. Forty percent of the responses came from outside the United States. More than 70 percent had been offering social media features for two years or less. The vast majority of respondents considered their sites to be successful.

This third report provides eighteen recommendations and an annotated list of all the resources the working group consulted. The key message: “We believe it is riskier to do nothing and become irrelevant to your user communities than to start using social media features.” Among our recommendations:

  • Establish clear objectives and determine what metrics you need to measure success.
  • Leverage the enthusiasm of your user communities to contribute.
  • Look at other sites similar to your own that are already using social media features successfully before you start.
  • Consider using third-party hosted social media sites rather than creating your own.

All three reports total over 300 pages, so we’ve also prepared a much shorter Executive Summary with the highlights from all three reports.

The reports and the recording of our 9 March 2012 Webinar are all available here. We look forward to hearing your feedback – perhaps on our Social Metadata for LAMs Facebook page?

As with many OCLC Research publications, this report was written to help meet the needs of the OCLC Research Library Partnership. The Partnership not only inspires but also underwrites this type of work, so many thanks to the institutions who both contribute to and support our work!