Hanging Together

the OCLC Research blog

Metadata

Кириллица в WorldCat

April 15, 2020 - by Merrilee Proffitt

This post was written by OCLC staff members Jenny Toves, Bryan Baldus, and Mary Haessig. The title of the post is “Cyrillic alphabet in WorldCat”.

Karen Smith-Yoshimura had an idea: since the Latin transliteration of Cyrillic is a 1:1 mapping for Russian*, we should add Cyrillic to WorldCat Russian records that don’t have it.

Hand-drawn Cyrillic alphabet / Getty Images

Last month Karen blogged about Cyrillicizing WorldCat Russian records, and given the great interest expressed in this project by those in the OCLC membership with Russian collections, my team wanted to share more technical information about the process.

Expert catalogers at OCLC immediately knew that Cyrillicizing Russian would not be simple! Our first step was to identify a set of records that were likely to contain Russian and only Russian, with no Ukrainian words for instance. Our criteria included the following: the language of the item must be Russian, the place of publication must be Russia, there can’t be an 041 field with any language other than Russian, and there can’t be any 880 fields in the record already. We identified 1.6 million records. The first prototype was quite optimistic and set about converting tags 245, 246, 260, 264, 362, 490, 500 and 505 to Cyrillic and storing the result in 880s. Right away we encountered characters that aren’t in the Library of Congress Russian romanization table, so we decided to skip any subfield with an unexpected character.
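To make the skip rule concrete, here is a minimal Python sketch of reverse transliteration. It is not the code used in the project: the table is a small reverse mapping based on the Library of Congress romanization conventions, and the names REVERSE_TABLE, PASS_THROUGH and to_cyrillic are made up for illustration. The list of characters passed through unchanged is also an assumption.

```python
import unicodedata

# Combining ligature halves used in romanized digraphs such as i︠a︡.
LIG_L, LIG_R = "\ufe20", "\ufe21"

# Illustrative reverse table (Latin -> Cyrillic); an assumption, not the
# project's actual data.
REVERSE_TABLE = {
    "shch": "щ",
    "t" + LIG_L + "s" + LIG_R: "ц",
    "i" + LIG_L + "u" + LIG_R: "ю",
    "i" + LIG_L + "a" + LIG_R: "я",
    "zh": "ж", "kh": "х", "ch": "ч", "sh": "ш",
    "a": "а", "b": "б", "v": "в", "g": "г", "d": "д", "e": "е",
    "z": "з", "i": "и", "k": "к", "l": "л", "m": "м", "n": "н",
    "o": "о", "p": "п", "r": "р", "s": "с", "t": "т", "u": "у",
    "f": "ф", "y": "ы",
    "\u00eb": "ё",  # ë
    "\u0117": "э",  # ė
    "\u012d": "й",  # ĭ
    "\u02b9": "ь",  # ʹ (soft sign)
    "\u02ba": "ъ",  # ʺ (hard sign)
}
MAX_KEY = max(len(key) for key in REVERSE_TABLE)

# Characters copied through unchanged (assumed list).
PASS_THROUGH = set(" .,:;/-()[]0123456789")


def to_cyrillic(latin):
    """Return the Cyrillic form, or None if any character is unexpected."""
    text = unicodedata.normalize("NFC", latin)
    out = []
    i = 0
    while i < len(text):
        # Try the longest possible table entry first (greedy match).
        for size in range(min(MAX_KEY, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            cyr = REVERSE_TABLE.get(chunk.lower())
            if cyr is not None:
                out.append(cyr.upper() if chunk[0].isupper() else cyr)
                i += size
                break
        else:
            if text[i] not in PASS_THROUGH:
                return None  # unexpected character: skip this subfield
            out.append(text[i])
            i += 1
    return "".join(out)
```

Calling to_cyrillic on a subfield value returns either the converted text or None, so a caller can simply leave out any subfield that comes back as None.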

That is where we discovered (I say ‘we’ but the catalogers weren’t surprised in the least!) that many words had errors in their diacritics. Diacritics were missing, placed over the wrong letters, or the wrong diacritic was used. Mary Haessig (Contract Cataloging, OCLC) and Peter Fletcher (Cataloging and Metadata, UCLA) spent much time reviewing generated Cyrillic, spotting wrong words, and then turning the errors into common patterns that would fix a large number of problems with a single rule. An iterative cycle of data analytics and human review resulted in rules and vocabulary lists that allowed us to correct Latin fields in 915k records in preparation for transliteration back to Cyrillic. One example rule targets the word ‘dlia’: occurrences of dlia were converted to dli︠a︡ about 18 thousand times. This is a case of a missing ligature on a common word.
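A rule like the ‘dlia’ fix can be pictured as a whole-word substitution. The Python sketch below is only an illustration under that assumption. The RULES list and the apply_rules function are hypothetical names, and the real rules and vocabulary lists were certainly richer than a single regular expression.

```python
import re

# Combining ligature halves that were missing from the bad data.
LIG_L, LIG_R = "\ufe20", "\ufe21"

# Hypothetical rule list: (whole-word pattern, corrected romanization).
RULES = [
    (re.compile(r"\bdlia\b"), "dli" + LIG_L + "a" + LIG_R),
]


def apply_rules(text):
    """Apply every rule and return (corrected text, number of fixes)."""
    total = 0
    for pattern, fixed in RULES:
        text, count = pattern.subn(fixed, text)
        total += count
    return text, total
```

Because the corrected form has a ligature half between the i and the a, running the rule a second time changes nothing. This sketch matches lowercase only, so a capitalized ‘Dlia’ would need its own rule.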

The review process was still turning up words that weren’t right, and the return on investment was shrinking. We stopped finding patterns that fixed more than a handful of records. We decided to halt the review process and find a way to characterize which records were most likely correct. Data analytics to the rescue! Existing WorldCat records with Cyrillic text were mined to create a dictionary of known words. If an enhanced record did not create any words that were currently unknown to WorldCat, the update was kept. This allowed us to update 958k records representing 3.7M holdings.
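The known-word check can be sketched in a few lines of Python. This is a simplified illustration, not the analytics that actually ran: the function names are invented, a word is assumed to be any run of characters from the basic Cyrillic Unicode block, and comparison is assumed to be case-insensitive.

```python
import re

# Assumption: a "word" is a run of characters from the basic Cyrillic block.
CYRILLIC_WORD = re.compile(r"[\u0400-\u04FF]+")


def words(text):
    """Return the set of lowercased Cyrillic words in a field."""
    return {word.lower() for word in CYRILLIC_WORD.findall(text)}


def build_dictionary(existing_fields):
    """Collect every word seen in existing Cyrillic fields."""
    known = set()
    for field in existing_fields:
        known |= words(field)
    return known


def unknown_words(generated_fields, known):
    """Return generated words that are not in the dictionary."""
    found = set()
    for field in generated_fields:
        found |= words(field)
    return found - known


def keep_update(generated_fields, known):
    """Keep a record only if it introduces no unknown words."""
    return not unknown_words(generated_fields, known)
```

The design is deliberately conservative: one unknown word is enough to drop the whole record, which trades coverage for confidence in what does get loaded.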

The next challenge was getting the enhanced records back into WorldCat, so I turned to my colleagues in Metadata Quality. For a project like this in the past, WorldCat Metadata Quality staff would typically use Connexion client macros to update the records with the Cyrillic data. Each Connexion client instance would go through a file of 10,000 OCNs over the course of approximately 8-24 hours, depending on the nature of the records involved. With 968,000 records, that would mean processing 97 files. Even with 8 Connexion client instances running simultaneously, processing the records with a macro would take approximately 4 full days.
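The macro estimate can be checked with a little arithmetic. The Python below uses only the figures quoted above. The one added assumption is that the 8 instances work through the files in lockstep rounds, which is a simplification.

```python
import math

records = 968_000
ocns_per_file = 10_000
instances = 8
hours_per_file_fast, hours_per_file_slow = 8, 24

# Number of 10,000-OCN files needed to cover all records.
files = math.ceil(records / ocns_per_file)

# Assumed model: instances run in rounds, one file per instance per round.
rounds = math.ceil(files / instances)

fast_days = rounds * hours_per_file_fast / 24
slow_days = rounds * hours_per_file_slow / 24

print(files, rounds, round(fast_days, 1), slow_days)
```

Under this model the "approximately 4 full days" figure corresponds to the fast end of the range, where every file takes about 8 hours. Files that take closer to 24 hours would stretch the job considerably.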

As an alternative, Bryan Baldus (Metadata Policy, OCLC) pointed me at the WorldCat Metadata API and Karen Coombs (API Strategy, OCLC). I wrote a tiny little routine in Python that runs on the Research Hadoop cluster and processes the updates in parallel. Most of the records went in easily. Some did not pass validation. Some are PCC records (either BIBCO or CONSER), in other words, records that my authorization cannot touch. A few had been touched since my snapshot. The validation errors got sent to the “QC macro” and the PCC records went to Bryan. He was able to successfully process the BIBCO records. The remaining CONSER records, approximately 10,000 of the total set of records, could not be updated via the API and will need to be carefully updated via a Connexion client macro. A 30-hour job over a weekend completed the bulk of the updates. Success!
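The shape of that routine (send updates in parallel, then sort the failures into piles for follow-up) can be sketched as below. This is not the actual routine and makes no real WorldCat Metadata API calls. The send_update callable and the three exception classes are hypothetical placeholders, and a local thread pool stands in for the Hadoop cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


# Hypothetical failure types; a real client would signal these differently.
class ValidationError(Exception):
    pass


class NotAuthorized(Exception):
    pass


class StaleRecord(Exception):
    pass


def classify(send_update, ocn, record):
    """Try one update and report which pile the record belongs in."""
    try:
        send_update(ocn, record)
        return ocn, "updated"
    except ValidationError:
        return ocn, "validation_error"  # goes to the QC macro
    except NotAuthorized:
        return ocn, "pcc"  # needs someone with PCC authorization
    except StaleRecord:
        return ocn, "changed_since_snapshot"


def run_updates(send_update, records, workers=8):
    """Send all updates in parallel; return {outcome: [ocn, ...]}."""
    buckets = defaultdict(list)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(classify, send_update, ocn, record)
            for ocn, record in records.items()
        ]
        for future in futures:
            ocn, outcome = future.result()
            buckets[outcome].append(ocn)
    return dict(buckets)
```

Passing send_update in as a parameter keeps the sorting logic testable without network access. Any function that raises one of the placeholder exceptions can play the part of the API.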

What about the records that got dropped because they contain words new to WorldCat? Some of them are legitimate words. Some of them contain the types of errors discussed above. None of the remaining errors occurs with a satisfying frequency. New methods of data analytics will be needed to get an acceptable return on investment for the time needed to review the data and to write the rules that would fix enough of the remaining records.

An example of why the remaining records are hard is this record:

=LDR  00000cam a  00000Ka
>001  ocn870999477
…
>245  10$6880-01$aTeplotekhnicheskie i zvukoizoliat︠s︡ionnye kachestva ograzhdeniĭ domov povyshennoĭ ėtazhnosti /$cV.R. Khlevchuk ; E.T. Artykpaev.
>260    $6880-02$aMoskva :$bStroiizdat,$c1979.
…
>880  10$6245-01/(N$aТеплотехнические и звукоизолиационные качества ограждений домов повышенной этажности /$cВ.Р. Хлевчук ; Е.Т. Артыкпаев.
>880    $6260-02/(N$aМосква :$bСтроииздат,$c1979.

The 2 Cyrillic fields generated a total of 7 unknown words (in bold). One of the words is missing a diacritic, 2 are names, and the others are correct but not currently in WorldCat. Fixing that single record requires fixing 7 unknown words, none of which occurs in more than a handful of other records, and never with this exact batch of other unknown words. So you can see how it can take a fair amount of work just to get a single record approved, much less move the total fixed count up by enough to notice.

I think the use of analytics to find likely problems in the data and to group problems into similar patterns to reduce reviewer time and increase programmer productivity is a huge win for improving the quality of WorldCat in general—not just for adding Cyrillic.

*Except for the hard and soft signs, which have no case in Latin but do have case in Cyrillic.

Merrilee Proffitt

Merrilee Proffitt is Senior Manager and provides project management skills and expert support to institutions within the OCLC Research Library Partnership.

oclc.org/research/people/proffitt.html

OCLC Research

Hanging Together is the blog of OCLC Research. Learn more about OCLC Research on our website.

© 2020 OCLC