MarcEdit and other tools for batch processing and metadata reconciliation

That was the topic discussed recently by OCLC Research Library Partners metadata managers, initiated by Jennifer Baxmeyer of Princeton University and Sharon Farnel of the University of Alberta. Many OCLC Research Library Partners use MarcEdit and/or other tools such as OpenRefine, scripts (e.g., Python, Ruby or Perl), and macros for metadata reconciliation and batch processing. Metadata managers were interested in learning which tools their peers find most useful for specific tasks and in identifying which tasks are most redundant.

OCLC Research Library Partners metadata managers reported using these tools, ranked from most to least used:

MarcEdit
OpenRefine
Scripts (Python, Ruby, Perl)
Macros
Excel
XSLT
Other

Given MarcEdit’s popularity among the metadata managers, its developer, Terry Reese, participated in our discussions. MarcEdit has a large, global, and active user community, as indicated in its 2017 Usage Snapshot. Terry estimates that about one-third of all users work in non-MARC environments, and two-thirds of the most active users are OCLC members.

OCLC Research Library Partners metadata managers reported using MarcEdit for a wide variety of tasks, among them:

Data transformations
Enhancing vendor records
Building MARC records from spreadsheets
Linked data reconciliation
De-duplication of records within a file
Merging two or more records into one
Z39.50 harvesting
Metadata reconciliation before sending records to other systems
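
As a rough illustration of one task on this list, de-duplication of records within a file, here is a minimal Python sketch. It is not how MarcEdit does it. It assumes the records have already been parsed into plain dictionaries keyed by MARC tag, and that the control number in field 001 is the match point; the function name `dedupe` and the sample data are hypothetical.

```python
def dedupe(records, key_field="001"):
    """Keep the first record seen for each key; drop later repeats."""
    seen = set()
    unique = []
    for record in records:
        # Normalize the match key so stray whitespace does not hide a duplicate.
        key = record.get(key_field, "").strip()
        if key and key in seen:
            continue  # a record with this key was already kept
        if key:
            seen.add(key)
        # Records without a key are kept, since there is nothing to match on.
        unique.append(record)
    return unique


# Hypothetical sample data: tag -> value, one dictionary per record.
records = [
    {"001": "rec-1", "245": "First title"},
    {"001": "rec-2", "245": "Second title"},
    {"001": "rec-1 ", "245": "First title (duplicate)"},
    {"245": "Record with no control number"},
]

unique = dedupe(records)
print(len(records), "records in,", len(unique), "records out")
```

In practice the match point is a local decision: an institution might key on an OCLC number, an ISBN, or a combination of fields rather than the 001.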

Partners value the integration of MarcEdit with other OCLC services, and would welcome more. For example, MarcEdit has a Connexion plug-in so you can batch-edit records within MarcEdit and then upload the edited records back into WorldCat. MarcEdit provides direct integration with WorldCat’s Metadata API, providing support for updating master and local bibliographic records and managing holdings in WorldCat. It provides easy access to OCLC’s Classify, FAST, and VIAF APIs.

OpenRefine is widely used with non-MARC data, both for analyzing data before processing and for reconciling data with external vocabularies. Several noted that OpenRefine is more flexible than MarcEdit but has a much steeper learning curve.

Institutions commonly use multiple tools together. Programming scripts—whether created by metadata specialists or in collaboration with systems or IT staff—can be called from within MarcEdit’s command line tool. This allows mixing different types of automation, while still involving technical services staff who don’t write code. Terry noted he created MarcEdit for catalogers and metadata specialists as an automation tool that did not require programming while providing both application APIs and interfaces that could be used within a wide variety of workflows.

In late 2017, MarcEdit 7 was released. Among its many new features is a lightweight clustering function, which provides a powerful new way to find relationships between data without introducing a steep learning curve. Terry hopes that this new functionality encourages metadata managers to explore tools like OpenRefine. He finds OpenRefine particularly useful for working with “large clusters” to identify the metadata that needs reconciliation (for example, all personal names in the 100, 600 and 700 MARC fields). Terry created a YouTube video on Importing and Exporting data between MarcEdit and OpenRefine. For a full list of Terry’s YouTube tutorials see his MarcEdit Playlist. People who want to share experiences with MarcEdit or learn tips from the large MarcEdit community may subscribe to the MARCEDIT-L listserv, generously hosted by George Mason University and championed by George Mason’s Ian Fairclough.
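
To give a sense of what clustering does with name headings, here is a minimal Python sketch of the "fingerprint" key-collision approach that OpenRefine documents: normalize each string to a key, then group the strings whose keys collide. Whether MarcEdit's clustering works the same way internally is an assumption; the function names and the sample headings are made up for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a heading to a normalized key: no accents, case, punctuation, or word order."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^\w\s]", "", ascii_only.lower().strip())
    # Sort and de-duplicate tokens so "Twain, Mark" and "Mark Twain" collide.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values sharing a fingerprint; keep only groups with more than one variant."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


# Hypothetical personal-name strings, as might be pulled from 100, 600 and 700 fields.
names = [
    "Twain, Mark",
    "Mark Twain",
    "twain, mark.",
    "Brontë, Charlotte",
    "Bronte, Charlotte",
    "Austen, Jane",
]

clusters = cluster(names)
for key, variants in clusters.items():
    print(key, "->", variants)
```

Each cluster is only a candidate for reconciliation. A person still has to confirm that the variants refer to the same entity before the headings are merged.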

We all found it difficult to envision what batch editing would look like in a linked data or BIBFRAME environment. Most linked data datasets that our discussants access are for viewing only, rather than for re-use or manipulation. Early experimenters with linked data have found it difficult to work with the linked data datasets directly, instead downloading the data onto local servers. To use linked data tools, real-time data is needed. Infrequent updates of linked data datasets present a problem for linked data experimenters.

Obtaining persistent identifiers makes it much easier to do authority work, obviating the need to compare text strings. But transforming MARC data into entities results in MARC authority records expressed as linked data – is that a limitation?

Karen Smith-Yoshimura

Karen Smith-Yoshimura, senior program officer, worked on topics related to creating and managing metadata with a focus on large research libraries and multilingual requirements. Karen retired from OCLC in November 2020.