In my first adventure I introduced Hadoop and the computer cluster on which we will be running it as we perform various data mining procedures. Now I will begin to get “down and dirty” with the technologies and some of the ways we are using them.
Since I am just beginning to learn about Hadoop, HDFS (the Hadoop file system), HBase (the Hadoop database), and all kinds of other related technologies (I can’t wait to get to Pig, myself), I will likely be starting with simple “hello world!” style tasks. If you’re a programmer but don’t know anything about Hadoop, these will likely be right up your alley. If you’re already familiar with Hadoop you might be fascinated to see how badly I mangle the explanations. So here goes…
On Gravel, our Research cluster, we have a full copy of WorldCat loaded into HBase. HBase shines at random access, and the row key for each WorldCat record is the OCLC number, so nothing could be simpler than fetching a specific record from HBase. Or so you would think. But like anything, there are several ways to do it:
- HBase shell: From the command line you can start up shell access to HBase and work with records in a particular table. WorldCat is one big table in HBase, and we have a number of other tables available as well. A simple command like "get 'TABLENAME','ROWKEY'" will fetch a particular row, and since the row key is the OCLC number, fetching a WorldCat record is as simple as "get 'Worldcat','234'" for the item with 234 as the OCLC number.
- Stargate REST Server: A very simple REST query to localhost can also fetch records, via the Stargate REST server. For example, the request http://localhost:PORT/TABLENAME/ROW (or, using the values above, http://localhost:PORT/Worldcat/234) will retrieve that row from the specified table. However, the record comes back with its data encoded in Base64, so it must be decoded before it is usable (see the sketch after this list).
- Thrift API: A cross-language API that claims to be more lightweight than REST for many operations. That is absolutely everything I know about it.
- Java code: Hadoop expects interaction via Java, so the best, most “native” way to write code against the Hadoop family of technologies is in Java. If you prefer not to write Java, you will need to use Hadoop’s “streaming” facility instead.
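To make the REST option concrete, here is a minimal Perl sketch of the raw fetch. The values are hypothetical: 8080 is the default port for the HBase REST (Stargate) server, and the table and row are the ones from the examples above. What it prints is the XML wrapper with the record still Base64-encoded:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

# Hypothetical values: 8080 is the default Stargate port;
# adjust host, table, and row key to match your setup.
my $url = 'http://localhost:8080/Worldcat/234';

# Fetch the row. Stargate answers with an XML wrapper whose
# cell values are Base64-encoded.
my $xml = get($url)
    or die "Could not fetch $url\n";
print $xml;
```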
As a simple “Hello World!” style exercise, I set out to use the REST server access above and write a small CGI script that presents a web form, accepts an OCLC number, and retrieves the record for that number from HBase. Getting the record was the easy part; then you have to separate the data from the XML wrapper and decode the Base64. I got stuck on this, and my colleague Bruce Washburn came to my rescue (and on a weekend, no less). By Monday it was working like a charm.
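The unwrapping step looks something like the sketch below. This is not the actual script, just a minimal reconstruction: it reads the Stargate XML on standard input, pulls out the cell value with a regular expression (assuming the record lives in a single cell), and decodes it with MIME::Base64:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MIME::Base64;

# Slurp the Stargate XML response from standard input.
my $xml = do { local $/; <STDIN> };

# Pull out the Base64-encoded cell value. A real XML parser
# would be more robust than this regular expression.
my ($encoded) = $xml =~ m{<Cell[^>]*>(.*?)</Cell>}s
    or die "No Cell element found in response\n";

# Decode the cell value to recover the actual record.
print decode_base64($encoded);
```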
To make the response easier to deal with, I also put in line breaks between XML tags; otherwise the bulk of the record comes back as a single line. All told it’s 60 lines including documentation and blank lines, plus a few external subroutines and the MIME::Base64 and LWP::Simple Perl modules.
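That line-break step is essentially a single substitution; as a standalone filter it might look like this (again a sketch, not the actual script):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Put a newline between adjacent XML tags so the record
# doesn't print as one enormous line.
my $record = do { local $/; <STDIN> };
$record =~ s/></>\n</g;
print $record;
```

Chained together, the three sketches cover the whole round trip (with hypothetical file names): perl fetch.pl | perl decode.pl | perl breaklines.pl fetches the row, decodes the cell, and breaks the lines.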
Bottom line: I got my feet wet with a small but potentially useful exercise, and as a side benefit I now have code to easily retrieve and decode a record from HBase that I can reuse as a subroutine in future programs. Not bad for a start.
Roy Tennant works on projects related to improving the technological infrastructure of libraries, museums, and archives.
Sorry for the delay in getting back to you, Peter. So…you want me to be a laughingstock by exposing my crappy code, eh? But seriously, I’ll try to get it up once I get back in the office. Feel free to poke me.
Roy, how about throwing your code up on GitHub? I’d like to watch over your shoulder, and seeing the code would help me get a sense of how steep the learning curve will be if I get a chance to try Hadoop out.