What is it with geeks and wacky names, anyway? Despite what it sounds like, Hadoop is neither a rarely glimpsed mammal from Tasmania nor a children’s board game. Nope, rather it is a family of technologies that implements various aspects of Google’s MapReduce algorithm-cum-programming model, which is optimized for processing huge data sets.
Since WorldCat is nothing if not a huge data set, it seems only natural that we would be using MapReduce technologies to data mine WorldCat, and in fact we have — for years. But what is new to us is making the transition to the Hadoop family of technologies that implements MapReduce in Java and is tuned specifically for running on clusters of computers, as we have in Research.
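To make the idea a bit more concrete, here is the classic “word count” example written against Hadoop’s standard Java MapReduce API. This is just a minimal sketch of how a MapReduce job is put together, not anything we actually run against WorldCat: the map step emits a (word, 1) pair for every word it sees, and the reduce step sums those pairs for each word.

```java
// Minimal sketch of a Hadoop MapReduce job (the classic word count),
// using the standard org.apache.hadoop.mapreduce API.
// Illustrative only -- not OCLC code and not a WorldCat workload.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The "map" step: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // The "reduce" step: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The appeal for a data set the size of WorldCat is that the map and reduce steps run in parallel across however many compute nodes the cluster has, which brings me to the hardware.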
As I stumble along the learning trail with my colleagues (in particular, Bruce Washburn here in the San Mateo office), I hope to write about some of my experiences here — not so much to educate as to entertain through laughter. You see, I tend to learn new technologies just well enough to create havoc, and that can be quite entertaining if not instructive about what not to do. But more on that later.
For today’s post I will introduce you to “gravel,” which is replacing our former compute cluster that had been dubbed “pebbles” (don’t ask). It looks like this:
- 1 “head” (control) node – two 6-core 3.1 GHz processors, 64 GB RAM, 24 TB hard disk
- 40 “compute” (processing) nodes – each with two 4-core 2.6 GHz processors, 32 GB RAM, 6 TB hard disk
For those of you following along at home, overall we have well over a terabyte of RAM (64 GB + 40 × 32 GB ≈ 1.3 TB) and about a quarter of a petabyte of disk storage (24 TB + 40 × 6 TB = 264 TB). Needless to say, even with multiple copies of WorldCat we have plenty of headroom for data mining processes.
So hardware isn’t much of a limitation right now, but it will take me more work to get up to speed on the software side. I hope to document some of those antics here over the coming weeks. It may be about as pretty as watching sausage being made (which, as an Indiana farm boy with German ancestry, I actually have seen, and smelled!), but I’m hoping it will be humorous if not also informative. You be the judge.
Roy Tennant works on projects related to improving the technological infrastructure of libraries, museums, and archives.