Hadoop, there it is.

I bought a Hadoop book a few days ago. It isn’t like I have the time or data to implement Hadoop anywhere right now or anything. It’s really just something for me to geek out on while I’m avoiding doing the things that I should be doing.

What is Hadoop? Hadoop is a software framework for crunching very large data sets using lots of computers. Hadoop is what Twitter uses to pick out what’s trending from the multiple terabytes of data it produces daily.

For a necessarily oversimplified, non-Hadoop example of how the Hadoop process works and why it’s useful, think of a computer with one processor chugging away at a relatively simple problem. Think of it as the algorithmic equivalent of trying to count the number of people in a small, mixed-density municipality. If the number of people were small enough, you’d probably be able to get away with sending one person door-to-door to get head counts from all the households in the city limits. Your counter could add up the people as she went, or she could wait until everyone had been counted and add up the results then. If your counter doesn’t waste any time chatting with people, so that the time spent counting per household is small, then you’d likely find that most of your counter’s time was spent traveling between houses.

Even though most of the counter’s time won’t be spent counting at all, this solo-counter process would likely be an acceptable way to count small populations. What if the population you wanted to count was much larger, say on the order of hundreds of millions of households? With populations this large, it wouldn’t matter how quick your counter is; she’d still have to walk between hundreds of millions of households. Even if your counter could somehow count the number of people in a house instantaneously just by looking at the front door, the time she spent walking would likely be unacceptably long. Here, the time spent summing the data is dwarfed by the time spent getting the data to the person doing the summing.

For larger populations, a faster method might utilize many counters at the same time. Each one would have their own area to cover, and each area could be chosen to take approximately the same amount of time. When counters finished counting their assigned area, they would move on to another area and continue counting. Once the counting was done, determining the population size would be a relatively simple matter of adding up the household totals from each counter.

In Hadoop, this idea is implemented algorithmically using several computers via a process called MapReduce. During the mapping portion, each computer (counter) produces keys (corresponding to areas or households) “mapped” to values (the number of people counted in an area or in a household, depending on how the key is defined). All these key-value pairs are then “reduced” via a specified function (summation here) to a count of total population.
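To make the map and reduce steps a little more concrete, here is a toy sketch of the head count in plain Python. It is not Hadoop code and doesn’t use any Hadoop API; the survey data, the area names, and the helpers `map_area`, `shuffle`, and `reduce_counts` are all made up for illustration, and everything runs on one machine rather than being spread across many.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical survey data: each counter's area holds (household_id, people) records.
areas = {
    "north": [("n1", 3), ("n2", 1), ("n3", 4)],
    "south": [("s1", 2), ("s2", 5)],
    "east": [("e1", 2), ("e2", 2), ("e3", 1)],
}


def map_area(area_name, households):
    """Map step: one counter emits an (area, people) pair for each household."""
    return [(area_name, people) for _household_id, people in households]


def shuffle(pairs):
    """Group the emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(values):
    """Reduce step: collapse a collection of counts into one total by summation."""
    return reduce(lambda a, b: a + b, values, 0)


# Each area could be mapped by a different counter; here they simply run in turn.
mapped = [pair for name, households in areas.items()
          for pair in map_area(name, households)]

per_area = {key: reduce_counts(values) for key, values in shuffle(mapped).items()}
total_population = reduce_counts(per_area.values())

print(per_area)
print(total_population)
```

Keying on the area rather than the household is just one choice; keying on the household would produce more pairs and leave more of the adding to the reduce step.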

It makes intuitive sense. Having many people count a lot of things can be quicker than having one person count a lot of things, but more isn’t always better. For instance, imagine counting the small municipality with a large number of counters. If each counter was assigned 10 houses and you had 100 households, you’d need 10 counters. You’d have to come up with procedures to assign counters to households while ensuring that households weren’t counted multiple times. Each counter would require a separate clipboard, vest, map, etc. There’d be 10 sets of tax forms to fill out and 10 mileage reimbursement requests. If too many counters were added, the end process might take less time, but at a significant cost in efficiency and cash. If you weren’t careful, you might inadvertently use so many counters that you spend more time and resources allocating counters and waiting for them to finish than you would have spent waiting for one counter to do her thing (imagine assigning 90 people to count 100 households). Clearly, there is some data-size threshold that determines when distributed computation schemes like Hadoop are a better alternative to single-computer solutions.
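The trade-off can be put into rough numbers with a back-of-the-envelope model. Everything in it is an assumption of mine rather than anything measured: 5 minutes to count a household, 30 minutes of paperwork per counter, and a single coordinator who handles that paperwork one counter at a time before anyone starts. The function `total_time` is a hypothetical helper, not part of Hadoop.

```python
import math


def total_time(households, counters,
               minutes_per_household=5.0, setup_minutes_per_counter=30.0):
    """Estimated minutes until the count is done, under the assumed costs.

    Assumes setup is handled serially (one counter at a time), after which
    all counters work in parallel and the job ends when the busiest finishes.
    """
    setup = setup_minutes_per_counter * counters
    busiest_share = math.ceil(households / counters)
    return setup + minutes_per_household * busiest_share


# Compare a small town with a very large population under the same assumptions.
for households in (100, 1_000_000):
    for counters in (1, 10, 90):
        minutes = total_time(households, counters)
        print(f"{households:>9} households, {counters:>2} counters: {minutes:,.0f} min")
```

With these made-up costs, 90 counters take longer than a single counter on 100 households, but finish the million-household count far sooner than one counter could. Where the break-even point falls depends entirely on the assumed overhead.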

I haven’t yet had the need to perform any Hadoop-scale analysis. Perhaps when I have some free time, I’ll get an account at Amazon Web Services and hunt down some rogue terabytes for analysis. Or maybe I’ll get lucky and land a job where analysis on the terabyte scale is routine. A guy can dream, right?
