This post was written by OCLC staff members Jenny Toves, Bryan Baldus, and Mary Haessig. The title of the post is “Cyrillic alphabet in WorldCat”.
Karen Smith-Yoshimura had an idea: since the Latin transliteration of Cyrillic is a 1:1 relationship for Russian*, we should add Cyrillic to WorldCat Russian records that don’t have it.
Last month Karen blogged about Cyrillicizing WorldCat Russian records, and given the great interest expressed in this project by those in the OCLC membership with Russian collections, my team wanted to share more technical information about the process.
Expert catalogers at OCLC immediately knew Cyrillicizing Russian would not be simple! Our first step was to identify a set of records that were likely to contain Russian and only Russian, with no Ukrainian words, for instance. Our criteria: the language of the item must be Russian, the place of publication must be Russia, there can be no 041 field with any language other than Russian, and the record must not already contain 880 fields. This identified 1.6 million records. The first prototype was quite optimistic and set about converting tags 245, 246, 260, 264, 362, 490, 500 and 505 to Cyrillic and storing the result in 880s. Right away we encountered characters that aren’t in the Library of Congress Russian romanization table, so we decided to skip any subfield with an unexpected character.
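To make the "skip on unexpected character" idea concrete, here is a minimal sketch in Python. It is not the production code: the table is my own abbreviated reading of the Library of Congress Russian romanization table, and the names `TABLE` and `to_cyrillic` are made up for illustration. It matches the longest Latin sequence first and returns `None` when it meets a letter or diacritic it does not recognize, which is the signal to skip that subfield.

```python
import unicodedata

LIG_L, LIG_R = "\uFE20", "\uFE21"  # the two halves of the ligature mark

# Illustrative Latin -> Cyrillic table (lowercase keys only); an assumption,
# not the table used in the actual project.
TABLE = {
    "shch": "щ", "zh": "ж", "kh": "х", "ch": "ч", "sh": "ш",
    "t" + LIG_L + "s" + LIG_R: "ц",
    "i" + LIG_L + "u" + LIG_R: "ю",
    "i" + LIG_L + "a" + LIG_R: "я",
    "a": "а", "b": "б", "v": "в", "g": "г", "d": "д", "e": "е",
    "z": "з", "i": "и", "k": "к", "l": "л", "m": "м", "n": "н",
    "o": "о", "p": "п", "r": "р", "s": "с", "t": "т", "u": "у",
    "f": "ф", "y": "ы",
    "\u00EB": "ё", "\u012D": "й", "\u0117": "э",
    "\u02B9": "ь", "\u02BA": "ъ",  # soft and hard sign (always lowercase here)
}
MAX_KEY = max(len(key) for key in TABLE)


def to_cyrillic(text):
    """Return the Cyrillic form of text, or None if a character is unexpected."""
    # Compose letters and diacritics so that i + breve becomes one character.
    text = unicodedata.normalize("NFC", text)
    out = []
    i = 0
    while i < len(text):
        # Try the longest possible Latin sequence first ("shch" before "sh").
        for size in range(MAX_KEY, 0, -1):
            chunk = text[i:i + size]
            cyr = TABLE.get(chunk.lower())
            if cyr is not None:
                out.append(cyr.upper() if chunk[0].isupper() else cyr)
                i += len(chunk)
                break
        else:
            ch = text[i]
            # Letters and combining marks outside the table are "unexpected".
            if ch.isalpha() or unicodedata.category(ch).startswith("M"):
                return None
            out.append(ch)  # spaces, digits and punctuation pass through
            i += 1
    return "".join(out)
```

Under these assumptions, `to_cyrillic("Moskva")` gives `Москва`, while a subfield containing a letter such as `q` comes back as `None` and is left alone.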
That is where we discovered (I say ‘we’ but the catalogers weren’t surprised in the least!) that many words had errors in their diacritics. Diacritics were missing, placed over the wrong letters, or of the wrong kind. Mary Haessig (Contract Cataloging, OCLC) and Peter Fletcher (Cataloging and Metadata, UCLA) spent much time reviewing generated Cyrillic, spotting wrong words and then turning the errors into common patterns that would fix a large number of problems with a single rule. An iterative cycle of data analytics and human review resulted in rules and vocabulary lists that allowed us to correct Latin fields in 915k records in preparation for transliteration back to Cyrillic. One example of a rule involves the word ‘dlia’, a common word that was missing its ligature: occurrences of dlia were converted to dli︠a︡ about 18 thousand times.
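A rule like the ‘dlia’ fix can be sketched as a whole-word replacement. The snippet below is only an illustration: the one-entry `WORD_FIXES` list and the function name are hypothetical, and the real rules and vocabulary lists were far larger.

```python
import re

LIG_L, LIG_R = "\uFE20", "\uFE21"  # the two halves of the ligature mark

# Hypothetical vocabulary list: wrong romanized form -> corrected form.
WORD_FIXES = {"dlia": "dli" + LIG_L + "a" + LIG_R}

# Match listed words only as whole words, regardless of case.
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, WORD_FIXES)) + r")\b", re.IGNORECASE
)


def apply_word_fixes(text):
    """Return (corrected_text, number_of_replacements)."""
    def repl(match):
        word = match.group(0)
        fixed = WORD_FIXES[word.lower()]
        # Keep an initial capital if the original word had one.
        return fixed[0].upper() + fixed[1:] if word[0].isupper() else fixed

    return PATTERN.subn(repl, text)
```

Counting replacements per rule, as `subn` does here, is one simple way to see which rules pay for the review time they cost.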
The review process was still turning up words that weren’t right, and the return on investment was shrinking. We stopped finding patterns that fixed more than a handful of records. The decision was made to halt the review process and try to find a way to characterize which records were most likely correct. Data analytics to the rescue! Existing WorldCat records with Cyrillic text were mined to create a dictionary of known words. If an enhanced record did not create any words currently unknown to WorldCat, the update was kept. This allowed us to update 958k records representing 3.7M holdings.
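The dictionary check is simple to sketch. The function names below are made up for illustration, and the tokenizer (runs of letters, lowercased) is an assumption about how "words" might be split, not a description of the actual job.

```python
import re

# A "word" here is any run of letters; digits and punctuation separate words.
WORD_RE = re.compile(r"[^\W\d_]+")


def words(text):
    """Lowercased set of words found in a piece of text."""
    return {w.lower() for w in WORD_RE.findall(text)}


def build_dictionary(existing_cyrillic_fields):
    """Collect every word seen in Cyrillic fields that already exist."""
    known = set()
    for field in existing_cyrillic_fields:
        known |= words(field)
    return known


def unknown_words(generated_fields, known):
    """Words in the generated Cyrillic fields that the dictionary lacks."""
    found = set()
    for field in generated_fields:
        found |= words(field)
    return found - known


def keep_update(generated_fields, known):
    """Keep a record's update only if it introduces no unknown words."""
    return not unknown_words(generated_fields, known)
```

With a dictionary built from a field such as `Москва : Стройиздат, 1979.`, a generated field containing `Строииздат` would be flagged, because that spelling has never been seen.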
The next challenge was getting the enhanced records back into WorldCat, so I turned to my colleagues in Metadata Quality. For a project like this in the past, WorldCat Metadata Quality staff would typically use Connexion client macros to update the records with the Cyrillic data. Each Connexion client instance would go through a file of 10,000 OCNs over the course of approximately 8-24 hours, depending on the nature of the records involved. With 968,000 records, that would mean processing 97 files. Even if 8 Connexion client instances were run simultaneously, processing the records with a macro would take at least 4 full days.
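As a rough check on that estimate, the arithmetic can be written out. This sketch assumes the files spread evenly across the instances and that every file takes either the fastest or the slowest time quoted; the function name is made up for illustration.

```python
import math


def estimate_days(records, per_file=10_000, instances=8, hours_per_file=(8, 24)):
    """Return (number_of_files, (best_case_days, worst_case_days))."""
    files = math.ceil(records / per_file)
    # Total file-hours divided across instances, then converted to days.
    days = tuple(files * hours / instances / 24 for hours in hours_per_file)
    return files, days


files, (best, worst) = estimate_days(968_000)
print(files, round(best, 1), round(worst, 1))
```

Under those assumptions the 97 files come to about 4 days if every file takes 8 hours, and about 12 days if every file takes 24.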
As an alternative, Bryan Baldus (Metadata Policy, OCLC) pointed me at the WorldCat Metadata API and Karen Coombs (API Strategy, OCLC). I wrote a tiny little routine in Python that runs on the Research Hadoop cluster and processes the updates in parallel. Most of the records went in easily. Some did not pass validation. Some are PCC records, either BIBCO or CONSER, which my authorization cannot touch. A few had been touched since my snapshot. The validation errors got sent to the “QC macro” and the PCC records went to Bryan. He was able to successfully process the BIBCO records. The remaining CONSER records, approximately 10,000 of the total set, could not be updated via the API and will need to be carefully updated via a Connexion client macro. A 30-hour job over a weekend completed the bulk of the updates. Success!
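The shape of that job (push updates in parallel, then sort the outcomes into piles for follow-up) can be sketched as below. Everything here is a stand-in: a thread pool replaces the Hadoop cluster, and `push_update` is a hypothetical callable supplied by the caller, not a call from the WorldCat Metadata API. The outcome labels are likewise made up for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_updates(records, push_update, workers=8):
    """Apply push_update to each record in parallel and group OCNs by outcome.

    push_update(record) is assumed to return a short outcome label such as
    "updated", "validation_error", "pcc" or "stale".
    """
    records = list(records)
    buckets = defaultdict(list)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input records.
        for record, outcome in zip(records, pool.map(push_update, records)):
            buckets[outcome].append(record["ocn"])
    return dict(buckets)


def fake_push(record):
    """Stand-in for the real update call, used only to demonstrate routing."""
    if record.get("pcc"):
        return "pcc"                 # needs someone with PCC authorization
    if record.get("modified_after_snapshot"):
        return "stale"               # record changed since the snapshot
    if not record.get("valid", True):
        return "validation_error"    # candidate for the QC macro
    return "updated"
```

Grouping by outcome rather than stopping at the first failure lets the bulk of the records go through while the exceptions are handed to whoever can deal with them.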
What about the records that got dropped because they contain words new to WorldCat? Some of those words are legitimate. Some have the types of errors discussed above. None of the remaining errors occurs with a satisfying frequency. New methods of data analytics will be needed to get an acceptable return on investment for the time needed to review the data and to write the rules that would fix enough of the remaining records.
An example of why the remaining records are hard is this record:
=LDR 00000cam a 00000Ka
001 ocn870999477
…
245 10$6880-01$aTeplotekhnicheskie i zvukoizoliat︠s︡ionnye kachestva ograzhdeniĭ domov povyshennoĭ ėtazhnosti /$cV.R. Khlevchuk ; E.T. Artykpaev.
260 $6880-02$aMoskva :$bStroiizdat,$c1979.
…
880 10$6245-01/(N$aТеплотехнические и звукоизолиационные качества ограждений домов повышенной этажности /$cВ.Р. Хлевчук ; Е.Т. Артыкпаев.
880 $6260-02/(N$aМосква :$bСтроииздат,$c1979.
The two Cyrillic fields generated a total of 7 unknown words (in bold). One of the words is missing a diacritic, two are names, and the others are correct but not currently in WorldCat. Fixing that single record requires fixing 7 unknown words, none of which occurs in more than a handful of other records, and never alongside this exact set of other unknown words. So you can see how it can take a fair amount of work just to get a single record approved, much less move the total fixed count up by enough to notice.
I think the use of analytics to find likely problems in the data and to group problems into similar patterns to reduce reviewer time and increase programmer productivity is a huge win for improving the quality of WorldCat in general—not just for adding Cyrillic.
*Except for the hard and soft signs, which have no case in Latin but do have case in Cyrillic.
Merrilee Proffitt is Senior Manager. She provides community development skills and expert support to institutions within the OCLC Research Library Partnership.