The following post was written by Jeff Young with contributions from Jeff Mixter.
Analyzing MARC data for quality assurance can be a challenging task. The problem becomes exponentially more difficult at scale, when working with an aggregation of MARC records from multiple cataloging contributors. The complexity of MARC, combined with the multitude of cataloging practices, makes indexing and querying MARC data challenging. Libraries are pivoting to linked data structures, seeking to capitalize on the opportunity to create and provide more interoperable data that can help improve the end-user search and discovery experience. While many user-facing value propositions of linked data are by now well characterized, linked data approaches can also be employed to analyze MARC data in ways that are more efficient and streamlined than is typical with other methods. This post takes the skills and tools libraries have adopted for working with linked data, primarily RDF and SPARQL, and applies them to the retrospective analysis of existing MARC data.
Believe it or not, MARC can be indexed using SPARQL. Granted, RDF triplestores cannot index MARC/XML directly, but a literal transformation of MARC/XML into a standard RDF serialization is not that hard, and can prove to be a swift and direct way to retain MARC-originating data values for querying. The use cases may not be immediately obvious, but the analysis below is a starting point that provides evidence of the utility and efficiency of this approach for anyone seeking to try these methods out. To be clear, though, the proposal here is not to manage or exchange data directly in this form; only to enable query access to existing MARC data.
As a warning to the reader: this post delves deeply into RDF and SPARQL and assumes familiarity with those W3C standards. Readers looking to learn more can consult the W3C specifications for RDF and SPARQL, or the very readable book ‘Linked Data for the Perplexed Librarian’.
For reference, the MARC 21 Format for Bibliographic Data is defined by the Library of Congress (https://www.loc.gov/marc/bibliographic/). Although it is possible and even useful to render the MARC standard into machine-readable RDFS/OWL, such a model would reflect MARC structure (records, fields, subfields, tags, codes, indicators) rather than entities and relationships that most modern ontologies strive to represent. The idea of a literal treatment of MARC as an RDF vocabulary is not entirely new. A prior attempt to formalize this idea is available at www.marc21rdf.info. The main difference in our proposal is the tailoring of namespaces and terms to make SPARQL queries simpler and more flexible for people already familiar with MARC.
An advantage of indexing MARC data in a graph-based index is the ability to easily accommodate the sheer number of MARC fields, subfields, indicators, and their possible combinations across a set of MARC records. In a graph-based environment, the vital part is defining each of the MARC elements in an ontology. The data itself can use any number or combination of properties defined in the ontology, and the combination has no effect on the database structure. Relational database management systems (RDBMS) require a well-defined table structure, with primary and foreign keys, which are prone to problems when the schema changes or incoming data does not match the defined schema. Similarly, traditional Lucene databases require mappings for building indexes out of incoming data. As with RDBMS, Lucene-based text indexes have trouble dealing with data that deviates from predictable patterns. Both RDBMS and Lucene indexes excel when they are properly tuned to the data and vice versa, but given the variable nature of MARC data we think a graph-based approach is better suited for data analysis.
The examples below are derived from a WorldCat sample previously used to explore the “Challenges of Multilingualism” for a 2017 “Visualizing the Digital Humanities” workshop. The dataset focused on a set of 10,658 bibliographic records representing various editions, translations and derivative works associated with 5 primary works:
- The Grand Design by Stephen Hawking and Leonard Mlodinow
First published: 2010 in English
- Les mots et les choses: Une archéologie des sciences humaines by Michel Foucault
First published: 1966 in French
- Pêcheur d’Islande by Pierre Loti
First published: 1886 in French
- Principia philosophiae by René Descartes
First published: 1644 in Latin
- Sein und Zeit by Martin Heidegger
First published: 1927 in German
The idea of indexing MARC with SPARQL does not depend on any specific material types but having a coherent set of records creates an interesting opportunity to explore patterns that are obscured by record-based indexing.
Rather than show what the RDF might look like, it might be easier to show some SPARQL queries and a few use cases. The basic structural pattern is this:
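A minimal sketch of this pattern, using placeholder PREFIX URIs (the actual namespace URIs used in the experiment are an assumption here), might look like this:

```sparql
# Placeholder namespace URIs -- the real ones used in the experiment may differ.
PREFIX tag:  <http://example.org/marc/tag/>
PREFIX code: <http://example.org/marc/code/>

# Retrieve the 245 $a (title proper) of every bibliographic record.
SELECT ?record ?title
WHERE {
  ?record tag:bd245 ?field .   # "bd" = bibliographic data
  ?field  code:sa   ?title .   # "s" + subfield code stands in for "$a"
}
ORDER BY ?record
```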
This query looks for “245 $a”, i.e., the Title portion of the data. The “bd” in front of the “245” is borrowed from a convention used by the MARC Format documentation pages to signify “bibliographic data” as opposed to “ad”, which signifies “authority data”. The “s” in front of subfield code “a” is used as a substitute for the “$” convention, which is not a valid character in this context. Although this experiment focused on MARC bibliographic records, having differentiated authority tags would allow both formats to coexist in the same triplestore.
Here is the result of the query:
|<http://worldcat.org/oclc/11478549>||Les principles de la philosophie :|
|<http://worldcat.org/oclc/123043674>||Sonzai to jikan /|
|<http://worldcat.org/oclc/1283522>||Subjekt und Dasein :|
|<http://worldcat.org/oclc/131592706>||Discurso del método /|
|<http://worldcat.org/oclc/1489961>||Discurso do método,|
|<http://worldcat.org/oclc/1625492>||Discourse on method :|
|<http://worldcat.org/oclc/162838102>||Pêcheur d’Islande :|
|<http://worldcat.org/oclc/166693785>||Discours de la méthode :|
|<http://worldcat.org/oclc/173689396>||Reason and responsibility :|
Notice that any field in MARC can be addressed by querying tag:[MARC21 tag] and any subfield can be addressed by querying code:[MARC21 subfield code].
A more targeted use case might be to create labeled hotlinks to bibliographic records associated with an identified creator like René Descartes “(isni)0000000121296144”:
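A sketch of such a query, assuming the ISNI is carried in a 100 (Main Entry--Personal Name) $0 subfield; the field and subfield choice, like the PREFIX URIs, is an assumption:

```sparql
PREFIX tag:  <http://example.org/marc/tag/>
PREFIX code: <http://example.org/marc/code/>

# Titles of records whose 100 field carries Descartes's ISNI in $0.
SELECT ?record ?title
WHERE {
  ?record tag:bd100 ?name .
  ?name   code:s0   "(isni)0000000121296144" .
  ?record tag:bd245 ?titleField .
  ?titleField code:sa ?title .
}
```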
|<http://worldcat.org/oclc/881005686>||Méditations métaphysiques :|
|<http://worldcat.org/oclc/467087058>||Oeuvres philosophiques /|
|<http://worldcat.org/oclc/457691357>||Renati Des-Cartes Principia philosophiae :|
|<http://worldcat.org/oclc/457691408>||Les Principes de la philosophie, escripts en latin par René Descartes, et traduits en françois par un de ses amis [l’abbé Picot].|
|<http://worldcat.org/oclc/457690458>||Renati Des Cartes Specimina philosophiae :|
A more unconventional example might be a federated query to join a written work identified in Wikidata (Q404567) to a list of multilingual manifestations in WorldCat:
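One way to sketch such a federated query (the Wikidata SERVICE endpoint and the wd:/rdfs: namespaces are real; the MARC prefixes remain placeholders, and real MARC strings would likely need punctuation normalization before they match):

```sparql
PREFIX tag:  <http://example.org/marc/tag/>
PREFIX code: <http://example.org/marc/code/>
PREFIX wd:   <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Join Wikidata's multilingual labels for the work (Q404567) against
# local MARC 240 $a (Uniform Title) values, returning each match's 245 $a.
SELECT ?label ?record ?title
WHERE {
  SERVICE <https://query.wikidata.org/sparql> {
    wd:Q404567 rdfs:label ?label .
  }
  ?record tag:bd240 ?uniform .
  ?uniform code:sa  ?uniformTitle .
  FILTER (STR(?uniformTitle) = STR(?label))   # naive string join
  ?record tag:bd245 ?titleField .
  ?titleField code:sa ?title .
}
```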
This query takes the multilingual labels associated with the Wikidata item and then looks to see if any of those labels occur in a MARC 240 $a (Uniform Title). There is no guarantee that this “join” across datasets will produce a result, but it illustrates some new possibilities for connecting data across systems.
Sein und Zeit:
<http://worldcat.org/oclc/750887037> L’Etre et le Temps /
<http://worldcat.org/oclc/812691322> Der Satz vom Grund /
<http://worldcat.org/oclc/875807232> Essere e tempo /
<http://worldcat.org/oclc/751084857> Sein und Zeit /
<http://worldcat.org/oclc/934803400> al- Kainūna wa-‘z-zamān
<http://worldcat.org/oclc/909179013> Martin Heidegger [művei] :
<http://worldcat.org/oclc/940281413> Varat och tiden
<http://worldcat.org/oclc/433669817> Ser y tiempo
<http://worldcat.org/oclc/464627340> Varat och tiden
<http://worldcat.org/oclc/443417201> Bit in čas /
<http://worldcat.org/oclc/749266607> Bycie i czas /
<http://worldcat.org/oclc/58323664> Oleminen ja aika /
<http://worldcat.org/oclc/750885662> El ser y el tiempo /
<http://worldcat.org/oclc/804476823> Sein und Zeit /
<http://worldcat.org/oclc/827722507> Sein und Zeit /
<http://worldcat.org/oclc/475541216> Being and time /
<http://worldcat.org/oclc/750529394> Sein und Zeit /
<http://worldcat.org/oclc/171292830> Væren og tid /
<http://worldcat.org/oclc/668432210> Zijn en tijd /
<http://worldcat.org/oclc/802028378> Der Satz vom Grund /
<http://worldcat.org/oclc/908860161> Lét és idő /
<http://worldcat.org/oclc/749911377> Bycie i czas /
<http://worldcat.org/oclc/750573807> Sein und Zeit.
<http://worldcat.org/oclc/941482600> Vara och tid
<http://worldcat.org/oclc/750573910> Sein und Zeit /
<http://worldcat.org/oclc/749489751> Bycie i czas /
<http://worldcat.org/oclc/668362771> Being and time /
<http://worldcat.org/oclc/823713453> Being and time /
<http://worldcat.org/oclc/909507129> Lét és idő :
<http://worldcat.org/oclc/470946401> Varat och tiden
<http://worldcat.org/oclc/751092215> Sein und Zeit.
<http://worldcat.org/oclc/909382448> El ser y el tiempo /
<http://worldcat.org/oclc/860569409> Sein und Zeit /
<http://worldcat.org/oclc/852973087> Bycie i czas /
<http://worldcat.org/oclc/749546537> Bycie i czas /
Methodology, scale, and infrastructure
The MARC/XML records used in this test were converted into RDF/XML using an XSLT stylesheet. This data was then reserialized into N-Triples for loading into a triplestore. Below is a breakdown of the numbers and types of triples generated:
- MARC/XML records input: 10,658
- Total number of triples: 3,025,776
- Average number of triples per record: 284
The ontology has four primary categories of properties based on the MARC21 guidelines. For readability, each category has been assigned a distinct namespace prefix. Here is a basic description of each category paraphrased from LC’s “Understanding MARC Bibliographic” documentation.
- tag: Each field in a MARC record is associated with a 3-digit number called a “tag”. A tag identifies the kind of data that follows. (446,306 “tag:” triples)
- ind: Some fields are further defined by “indicators”. These are two-character positions that follow each tag. (380,696 “ind:” triples)
- code: Most fields (except tag 001-009) are broken down into “subfields” preceded by a single alpha-numeric “code” character that indicates its meaning. (1,766,556 “code:” triples)
- offset: Fields with tag 001-009 are broken down by character position offsets rather than codes. (432,218 “offset:” triples)
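To show how the offset: category might be queried, here is a hypothetical example that reads the language code from 008 character positions 35-37. The per-position property names (offset:o35 and so on) are an assumption, as are the prefixes:

```sparql
PREFIX tag:    <http://example.org/marc/tag/>
PREFIX offset: <http://example.org/marc/offset/>

# Assemble the three-character language code from the 008 control field.
SELECT ?record (CONCAT(?c35, ?c36, ?c37) AS ?lang)
WHERE {
  ?record tag:bd008 ?cf .
  ?cf offset:o35 ?c35 ;     # position 35
      offset:o36 ?c36 ;     # position 36
      offset:o37 ?c37 .     # position 37
}
```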
We used Blazegraph as our triplestore for this experiment.
One component of this approach that needs further study is its scalability. Our test of roughly three million triples ran fine on a standard laptop, but expanding to hundreds of millions or billions of triples would require more powerful hardware. The total number of triples in this experiment accounts for all the structural components in MARC; some components may be less interesting than others, so the triple count could be trimmed accordingly.
The idea of using SPARQL to analyze bibliographic data has been around for a long time; this 2014 blog post by Leigh Dodds contains some practical queries illustrating the process. The difference in the approach described here is that we have indexed the MARC structure directly, without a lossy mapping to more semantically meaningful terms. The result is fast, customizable analysis of data that is comprehensive and true to its source. The other thing to notice is that this MARC-style RDF is completely unvarnished. If your goal is to find errors and inconsistencies in your MARC data, or to perform analysis across it, this approach can help.
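As one concrete example of the kind of quality-assurance query this enables (a sketch using the same placeholder prefixes as earlier examples), the following finds 245 fields that lack a $a subfield:

```sparql
PREFIX tag:  <http://example.org/marc/tag/>
PREFIX code: <http://example.org/marc/code/>

# QA sketch: records with a 245 field that has no $a (title proper).
SELECT ?record
WHERE {
  ?record tag:bd245 ?field .
  FILTER NOT EXISTS { ?field code:sa ?title }
}
```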
Jeff Mixter works on research and development projects focusing on linked data and digital collections material. He holds Bachelor’s Degrees in History and German from The Ohio State University as well as Master’s Degrees in Library Information Science and Information Architecture/Knowledge Management from Kent State University.