Adventures in Hadoop, #3: String Searching WorldCat

OK, I admit it. Ever since I joined OCLC over five years ago I’ve harbored a dream: to one day string search WorldCat. What that means is having the ability to find any random string of characters anywhere in any MARC field within the more than 250 million records that comprise this huge union catalog. Unix geeks call such patterns “regular expressions,” or “regex” for short, and often use the Unix command “grep” to string search files. Now, I admit this is a very geeky dream, but it’s mine and I’m not giving it up.

Luckily, I don’t have to. In fact, just the other day, thanks to a colleague, I actually did it.

What made it possible was our new research cluster running Hadoop (described here) and some code from a colleague that I was able to tweak and run. The code runs on all 40 server nodes simultaneously. When it’s done, all I have to do is combine the 40 individual outputs into one file, and I have every record that matched my search string. I even created a mechanism where I can drop that file into a web-accessible directory; when I click on the filename, a script parses out the OCLC numbers and links each one to both the WorldCat.org record and the raw version pulled from HBase, making the records easily browseable for review.
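The post doesn’t show the post-processing step, but it could look something like the following sketch. The HDFS output path, the `extract_links` helper, and the “(OCoLC)NNN” number format are assumptions for illustration, not the actual layout of the records:

```shell
# Merge the per-node outputs into a single file (run on the cluster):
# hadoop fs -cat findit/part-* > matches.txt

# Pull the OCLC numbers out of the matched records and turn them into
# WorldCat.org links. The "(OCoLC)NNN" form is an assumption here.
extract_links() {
  grep -o '(OCoLC)[0-9][0-9]*' "$1" |
    sed 's/^(OCoLC)//' |
    sort -u |
    while read -r num; do
      echo "http://www.worldcat.org/oclc/${num}"
    done
}
```

Pointing a small web script at the merged file and calling something like `extract_links` is one way the click-to-browse mechanism described above could be wired up.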

My first string search of WorldCat was, naturally, “Tennant, Roy”. That netted 118 records, although I doubt they are all me. Perhaps it’s obvious, but when you do an operation like this you want your string to be fairly distinctive. I don’t even want to think about what might happen should I string search “the”, for example. But for some purposes there is simply no substitute.

How long did it take? 2 minutes and 22 seconds, and that was while another job was running. I’ve since performed other string searches that completed in under a minute with the same load.

So color me ecstatic. You need to find MARC records that have text in a field that isn’t indexed? No problem. You need to find text that might be in any number of different MARC fields? I’ve got your back. You want to fix any record that misspells a particular word? Piece of cake. The World(Cat) is my oyster. Now stand back lest I get some of it on you.


Note: For those wanting to see the code, it is really just two shell scripts with some preset environment variables, run simultaneously via Hadoop across the cluster:

“findit.sh”:

set -e

INPUT=/user/toves/July2012/worldcat1
JOBNAME=findit
OPTIONS="-D mapred.compress.map.output=true -D mapred.job.name=${JOBNAME}"
/drive6/clink/scripts/cleanoutput.sh findit
hadoop jar $HJAR $OPTIONS -input ${INPUT} \
-output findit \
-mapper megrep.sh -reducer /bin/cat -numReduceTasks 0 \
-file megrep.sh

“megrep.sh”:

grep -s 'STRING TO FIND'
exit 0
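Since grep accepts full regular expressions, the mapper need not search for a literal string. Here is a sketch of a regex variant; the pattern is only an illustration, not one from the post, and the sample input is piped in so the sketch is self-contained (on the cluster, grep simply reads the mapper’s stdin, and the script would end with “exit 0” so non-matching mappers don’t fail the job):

```shell
# megrep.sh variant using an extended regular expression: match the
# name with or without the comma after the surname.
pattern='Tennant,? *Roy'
printf 'Tennant, Roy\nTennant Roy\nSmith, Bob\n' | grep -E -s "$pattern"
```

This prints the two matching lines and skips “Smith, Bob”.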


About Roy Tennant

Roy Tennant works on projects related to improving the technological infrastructure of libraries, museums, and archives.

One Comment

  1. Oh. Wow. Geeky, perhaps, but like anyone who has ever tried raw string searching in MARC, 2:22 for 250 million records is incomprehensibly fast.

    I’m astonished that this item hasn’t been deluged in admiring comments, but I suspect that the number of people who appreciate the wondrousness of this achievement is declining rapidly as age thins out the pack. But I for one offer my heartfelt applause. Now if only I could get my hands on my very own Hadoop cluster …
