As I’ve been learning more about how to use Hadoop via streaming, I discovered that I frequently needed an easy way to review records identified by a particular process. For example, my colleague Karen Smith-Yoshimura has recently been wanting to locate MARC records that have particular characteristics. She provides me with the set of characteristics she wants to use as a filter, and then I edit some existing Python code to apply the filter and find the records. In some cases as few as 8,000 records are pulled out from the more than 250 million records that now comprise WorldCat.
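The filtering step can be sketched as a minimal Hadoop streaming mapper. This is an illustration, not the actual code: it assumes one record per input line, keyed by OCLC number and tab-separated from the record text, and `has_characteristics()` is a hypothetical stand-in for whatever criteria are being filtered on.

```python
#!/usr/bin/env python
# Sketch of a Hadoop streaming mapper that passes through only the
# records matching some set of characteristics (hypothetical example).
import sys


def has_characteristics(record):
    # Hypothetical criterion: the record text mentions an 043 field.
    return "043" in record


def filter_records(lines):
    """Yield the input lines whose record part passes the filter."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Each line: OCLC number, a tab, then the record itself.
        oclc_number, _, record = line.partition("\t")
        if has_characteristics(record):
            yield line


if __name__ == "__main__":
    # Hadoop streaming feeds records on stdin and collects stdout.
    for line in filter_records(sys.stdin):
        print(line)
```

Hadoop streaming simply pipes input splits to this script on stdin and gathers whatever it prints, which is what makes Python a convenient choice for this kind of ad hoc filtering.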
But then those records need to be viewed in some way. At the end of my process I gather up the output records into a file that consists of one line per record. In some cases I output only the OCLC number, but in other cases the entire record is output following the OCLC number. Even in those cases, however, the line always begins with the OCLC number. That enables me to set up a simple process for reviewing the output.
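The output convention described above is simple enough to capture in a few lines. This is a sketch of my own devising, assuming a tab separates the OCLC number from the record when both are present:

```python
def format_output_line(oclc_number, record=None):
    """Format one output line: the OCLC number always comes first,
    followed (optionally) by the full record after a tab."""
    if record is None:
        return oclc_number
    return "%s\t%s" % (oclc_number, record)
```

Because every line leads with the OCLC number, any downstream tool can recover the key with a single split, regardless of whether the record follows.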
To do this I wrote a simple CGI program that finds all of the files ending in “.txt” in a certain directory and lists them for the user to select. When the user clicks on a particular filename, the program parses the file, taking the OCLC number and setting up a couple of links. One link is to the raw MARC record as it is stored in our HBase WorldCat table, and the other link takes the user to the record in WorldCat.org. I also send along a parameter that enables OCLC staff to see the XML or BER version of the record in WorldCat.org. Therefore, reviewing the records that are output by a Hadoop job is as simple as dropping the output file into a directory and going to that directory with a web browser. A few clicks is all it takes.
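The core of that CGI program can be sketched as two small functions: one to find the output files, one to turn a line into links. The HBase viewer URL below is a hypothetical placeholder for the internal service, and the staff parameter name is likewise made up; only the WorldCat.org permalink pattern is the public one.

```python
# Sketch of the file-listing and link-building steps of the CGI program.
# The internal HBase viewer URL and the staff parameter are hypothetical.
import glob
import os


def list_output_files(directory):
    """Find all '.txt' files in the drop directory for the user to pick from."""
    return sorted(glob.glob(os.path.join(directory, "*.txt")))


def links_for_line(line, staff=False):
    """Return (hbase_url, worldcat_url) for one output line.

    The line always begins with the OCLC number, whether or not the
    full record follows it."""
    oclc_number = line.split()[0]
    hbase_url = "http://internal.example.org/hbase/worldcat/%s" % oclc_number
    worldcat_url = "http://www.worldcat.org/oclc/%s" % oclc_number
    if staff:
        # Hypothetical parameter letting OCLC staff see the XML/BER version.
        worldcat_url += "?view=staff"
    return hbase_url, worldcat_url
```

With those two pieces, the rest of the program is just emitting an HTML page of filenames and, once a file is chosen, a table of links, one row per record.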