Hadoop, there it is.

I bought a Hadoop book a few days ago. It isn’t like I have the time or data to implement Hadoop anywhere right now or anything. It’s really just something for me to geek out on while I’m avoiding doing the things that I should be doing.

What is Hadoop? Hadoop is a software framework for crunching very large data sets using lots of computers. Hadoop is what Twitter uses to pick out what’s trending from the multiple terabytes of data it produces daily.

For a necessarily oversimplified, non-Hadoop example of how the Hadoop process works and why it’s useful, think of a computer with one processor chugging away at a relatively simple problem. Think of it as the algorithmic equivalent of the problem of trying to count the number of people in a small, mixed-density municipality. If the number of people was small enough, you’d probably be able to get away with sending one person door-to-door to get head counts from all the households in the city limits. Your counter could add up the people as she went or she could wait until everyone had been counted and add the results then. If your counter doesn’t waste any time chatting with people, so that the time spent counting per household is small, then you’d likely find that most of your counter’s time was spent traveling between houses.

Even though most of the counter’s time won’t be spent counting at all, this solo-counter process would likely be an acceptable way to count small populations. What if the population you wanted to count was much larger, say on the order of hundreds of millions of households? With populations this large, it wouldn’t matter how quick your counter is; she’d still have to walk between hundreds of millions of households. Even if your counter could somehow count the number of people in a house instantaneously just by looking at the front door, the time she spent walking would likely be unacceptably long. Here, the time spent summing the data is dominated by the time spent getting data to the person doing the summing.

For larger populations, a faster method might utilize many counters at the same time. Each one would have their own area to cover, and each area could be chosen to take approximately the same amount of time. When counters finished counting their assigned area, they would move on to another area and continue counting. Once the counting was done, determining the population size would be a relatively simple matter of adding up the household totals from each counter.

In Hadoop, this idea is implemented algorithmically using several computers via a process called MapReduce. During the mapping portion, each computer (counter) produces keys (corresponding to areas or households) “mapped” to values (the number of people counted in an area, or a household depending on how the key is defined). All these key-value pairs are then “reduced” via a specified function (summation here) to a count of total population.
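
To make the counting analogy concrete, here is a minimal sketch of the map, group, and reduce steps in plain Python. It runs on one machine and does not use Hadoop or its API; the household data, the area names, and the function names (`map_phase`, `shuffle`, `reduce_phase`) are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical head counts: (area, number of people in one household).
households = [
    ("north", 3), ("north", 2), ("south", 4),
    ("east", 1), ("south", 5), ("east", 2),
]

def map_phase(records):
    """Each counter emits one (key, value) pair per household visited."""
    for area, people in records:
        yield area, people

def shuffle(pairs):
    """Group the values by key so each key's values can be reduced together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce each key's list of values with the chosen function (summation)."""
    return {key: sum(values) for key, values in grouped.items()}

area_totals = reduce_phase(shuffle(map_phase(households)))
total_population = sum(area_totals.values())

print(area_totals)
print(total_population)
```

In a real cluster the mapping and reducing would be spread across many machines, with the grouping step moving data between them. The sketch only shows the shape of the computation.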

It makes intuitive sense. Having many people count a lot of things can be quicker than having one person count a lot of things, but more isn’t always better. For instance, imagine counting the small municipality with a large number of counters. If each counter was assigned 10 houses and you had 100 households you’d need ten counters. You’d have to come up with procedures to assign counters to households while ensuring that households weren’t counted multiple times. Each counter would require a separate clipboard, vest, map, etc. There’d be 10 sets of tax forms to fill out and ten mileage reimbursement requests. If too many counters were added, the end process might take less time, but at a significant cost in efficiency and cash. If you weren’t careful, you might inadvertently use so many counters that you spend more time and resources allocating counters and waiting for them to finish than you would have spent waiting for one counter to do her thing (imagine assigning 90 people to count 100 households). Clearly, there is some data size threshold that determines when distributed computation schemes like Hadoop are a better alternative to single computer solutions.
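
The trade-off can be sketched with a toy cost model. Everything here is an assumption picked for illustration: each counter costs a fixed 30 minutes of setup (clipboard, vest, paperwork) handled one counter at a time, each household takes 5 minutes to count, and the counting itself is split evenly and done in parallel. The numbers are not measurements of anything.

```python
def total_time(households, counters,
               minutes_per_household=5.0,
               setup_minutes_per_counter=30.0):
    """Toy model: serial setup for every counter, then parallel counting."""
    setup = counters * setup_minutes_per_counter
    counting = (households / counters) * minutes_per_household
    return setup + counting

def best_counter_count(households, max_counters=100):
    """Try every team size up to max_counters and keep the fastest one."""
    return min(range(1, max_counters + 1),
               key=lambda c: total_time(households, c))

for n in (1, 4, 10, 90):
    print(n, "counters:", round(total_time(100, n), 1), "minutes")

print("best team size for 100 households:", best_counter_count(100))
```

Under these made-up costs, a handful of counters beats one counter for 100 households, while 90 counters is far slower than sending one person out alone. Change the setup cost or the number of households and the sweet spot moves, which is the threshold idea in miniature.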

I haven’t yet had the need to perform any Hadoop-scale analysis. Perhaps when I have some free time, I’ll get an account at Amazon Web Services and hunt down some rogue terabytes for analysis. Or maybe I’ll get lucky and land a job where analysis on the terabyte scale is routine. A guy can dream, right?

Free time

I don’t have a lot of free time. Between the family, the school, the assistantships and the internship, my calendar is packed. Being in grad school means making a commitment to always having something else you should be doing. For every task I’m assigned there is an unknown, finite amount of time I have to spare. That Bayes assignment from two weeks ago? I had maybe six hours for it. I’m glad it only took two or three. The Clinical Trials assignment that’s due on Thursday? I can probably make eight hours for that, but they have to be spread out in two-hour chunks and I hope it will only take three or four. My family? I’ve got 7-9 hours per weekday for them. Grading papers? Two hours.

I’m generally good at doing the things I should be doing. Frequently, I can juggle tasks well enough to find time to do things that I don’t have to do. This process usually involves a complex, high-level form of procrastination and a firm commitment to fuller living through sleep deprivation. Sleep is one of the few responsibilities I can neglect without disappointing myself or people who rely on me, so choosing to sleep less is usually an easy choice. This is clearly less than ideal, but that’s how it is. There are things I want more than the feeling of being well rested.

Lately, in between time spent wandering virtual countrysides saving virtual people from virtual world-devouring dragon-gods, I’ve been spending my free time trying to fill gaps in my statistical computing abilities. I just finished The Art of R Programming. Not a page turner, but interesting nonetheless. I like R and I’m competent enough to be able to figure out how to use it to do all the things I need it to do. However, most of my R skills have been forged via trial and error over the glowing coals of various blogs, mailing lists, and CRAN pages. I picked up the book because I felt it would be beneficial to build a more solid foundation under my slightly-out-of-code shack of R knowledge. “The Art of R Programming” fit the bill quite nicely. I highly recommend it to anyone looking for a fairly comprehensive, broad view of R as a programming language.

Eventually, I will have to stop ignoring the fact that I need to write a rather large paper as part of earning a master’s degree. I’m excited about the project on which the paper will be based. I think it will be fairly engrossing and I’m putting it off because I’m not ready for it to sweep me away yet. The basic idea is this: you have data on a bunch of people from a loosely connected social network (think students at a school). You have demographic info, BMI, daily exercise levels, etc. These people were also asked to name 5 of their friends and you have this data too. You want to use this data to estimate the effects on the probability that a person is obese from the level of that person’s association with other obese people. Models exist for these types of relationships. However, problems can arise when you have people whose data are missing. For my thesis, I’m going to be implementing a couple different methods for imputing missing social network data for the purposes of estimating network effects. I think it will be a nice combination of coding, simulation and head scratching. I’m still at the idea stage. I think once the ideas bouncing around my head reach a critical mass, the real work will begin.

Personal Data

As a field of study, Statistics is essentially the study of how to make sense of data. It is full of formal, quantitative methods for dealing with uncertainty. There are fundamental questions that a successful analysis should touch upon. What do we know (the data)? What do we expect (our assumptions)? What should we expect given what we know (what does the data look like)? How does what we know affect the validity of the assumptions about what we expect (do the data seem excessively implausible given our assumptions)? The goal typically takes the form of setting up a fall guy (or gal) named Null Hypothesis and checking to see whether you can knock him (or her) down with your data. Traditionally, the null hypothesis is the enemy. If you are a drug company, the null hypothesis is that your drug is worthless (compared to the current standard of treatment).

On a structurally similar but less formal level, we’re all statisticians. We all compute our own personal statistics on a daily basis, using incredibly biased data derived from our muddled thoughts and our 5 senses. For instance, when I look myself over before I leave the house in the morning, I implicitly create an average rating of my appearance based on any number of specific observable data points. How’s my hair (when I have hair)? Are my teeth still crooked? Is my forehead doing that flaky, dry skin thing? Is my shirt suitably ironed? Does this stomach fat make it look like I have stomach fat?

All these things are weighted by importance and aggregated to create an overall “this is how I look” score. I then take this score and compare it to a null hypothesis about my appearance. My personal appearance null is usually “I look good” and I’m perfectly happy not knocking it down at all. Choosing a null for your personal appearance is a complicated process that incorporates your current state of mind and a whole slew of conscious and unconscious thoughts based on a lifetime’s worth of accumulated experience.

So my perception of my appearance is my data. I use it to create an overall “this is how I look” score, which is then compared to what I’d expect the score to be if I looked okay (my null). This process is very sensitive to outliers. So if I looked like this guy, except that I had a giant, bulbous blister on my forehead, that blister would likely dominate my “this is how I look” score despite being a small, atypical portion of an otherwise magnificent package and I’d probably reject the null hypothesis that I look good.

The results of my personal appearance test have ramifications. Whether I decide that I look okay or not will partially determine how I interact with the world around me. It will inform my interactions with other people. During these interactions, I will be subconsciously gathering and aggregating more data and performing more informal statistical tests: Is this person listening to me? Do they understand what I’m talking about, or am I completely incoherent? Was that joke I told actually funny or was that laugh just a polite acknowledgment of attempted comedy? For each of these questions, I’d compare the person’s behavior (the data) with how I’d expect them to behave (the null) and draw conclusions accordingly.

If I feel especially dashing, then perhaps I’m slightly more apt to feel confident in my interactions with other people and this confidence can result in more positive ad hoc estimates of how well these interactions are going. These ad hoc estimates are data points that go into a running calculation of how well my day is going.

At the end of the day, I have all these internal data points describing the quality of my day. These too are then aggregated into a relative rating describing the overall quality of my day. If the average quality of my day seems especially good or bad (compared to my expectations), I might spend some time going through the data trying to figure out whether there was a specific event that made my day especially good or bad. Or, I’ll probably be asleep within moments of lying down.

I bet you do this too.

A lot of the time, this process is completely subconscious. One interesting thing you can do is try and figure out what your null hypotheses are and how they got to be that way. It can be enlightening to see how much of an effect things that happened ages ago can have in providing context for the decisions you make now.

Bruce Lee

I must be getting old. I used to be able to drink coffee at any hour of the day and still get to sleep whenever I needed to. 5 years ago, I could have a cup at 9 pm and be asleep by 11 pm. Not any more. Too bad, because I like coffee like I used to like cigarettes.

Because I like coffee like I used to like cigarettes, 11 pm this evening found me lying in bed, wide awake, with way too much of my conscious mind focused on the unbearable noiselessness of it all. And who should happen to whirl through the vast, resonance-prone soundstage that was my coffee-addled brain, but one Bruce Lee. And if you’ve ever seen (heard) Bruce in action, you probably correctly guessed that he wasn’t quiet about it either.

So, in the interests of wearing out the jittery hamster that is my conscious mind by running it through the wheel in the corner of my brain, here are a few things that I remember about Bruce Lee from when I was a youngster that may or may not actually be true.

Bruce Lee was ahead of his time. Way back in the 60s he was already at least intuitively aware of ideas which as recently as last year were the subject of a best-selling book. Lee understood that when it comes to winning a fist fight, intuition plays a key role. It matters less whether one has fancy belts than it does that one can step aside at the appropriate time or land a solid punch on time and in the right place like a pallet in a UPS commercial (I don’t know that Lee ever yelled “LOGISTICS” as he landed a punch, but given his awesomeness, I’d bet that the idea of doing so occurred to him at least once). To win a fight you need to operate with the split-second timing of the finely honed intuition that is the “fast thinking” alluded to in the title of Kahneman’s book.

I haven’t made it all the way through Thinking, Fast and Slow. What with all the glitz and glamor of graduate school and familihood, who has the time? I have read the introduction for the book AND I think I totally heard an interview with Kahneman on MPR once, which is pretty much exactly the amount of familiarity required to hold a water cooler conversation about the subject. Here goes the water cooler-level summary. Kahneman bifurcates thought itself into two types. Fast thought is generally our default mode. It’s like an autopilot system with enough awareness built in to know when it’s in over its head. Slow thought is what happens when fast thought bows out. Slow thought tells you why you really need to quit smoking sometime after fast thought led you outside, lit your cigarette and reveled in the warm, glorious feeling of chemically-sated addiction.

Slow thought will also lose you a fight. Here’s how: if you are about to fight someone and you aren’t comfortable fighting, the first thought that will occur to your fast thought cogitation system is “Uh oh. Can I run here, because, dude, that guy looks pi-issed?” and if running isn’t an option, its next thought is “This situation is tense and unfamiliar. I’m not quite sure what to do here. Uh, I better think about this more.” Then slow thought takes over and you get punched. Because when you have to think about how to respond to a fist flying at your head, you’ve pretty much already gotten punched.

Lee recognized that this dynamic existed even for highly trained martial artists. This is because martial arts training in Bruce Lee’s time rarely effectively simulated the street fighting experience. Lee’s idea: I should expose myself to as many different fighting styles as possible, figure out the circumstances under which they are effective and then use this knowledge to gain such a comprehensive, visceral understanding of how to fight well that no matter the circumstances, I will be able to rely solely on my fast-acting intuition rather than my slower, more deliberative mode of thought.

So he pursued that idea, and yea, it was good. Because Bruce Lee had a mind like a hungry octopus angling for fishies made out of pure fact. He was driven by his own curiosity and enjoyed connecting disparate ideas in innovative ways. He wanted to figure out how to kick anyone’s ass without even having to think about kicking their ass while he was kicking their ass. Hence, Jeet Kune Do, which roughly translates as “Hey, how about instead of rigidly clinging to one way of fighting, you learn about and capitalize on the strengths of a variety of styles”. Its brilliance is in its lack of specificity and its adaptability. Lee saw it as more of a philosophy than a fighting style.

I had the pleasure of taking 6 months of Jeet Kune Do at a certain local martial arts academy. My stint ended when I broke a foot doing something completely unrelated to training. Then, after my foot healed, I popped a rib doing something completely related to training. Then I fell out of the habit of going to class and eventually cancelled my membership. It was fun while it lasted and if I ever have the time, I’d love to go back.

Final fact before bedtime: Bruce Lee also starred in Fist of Fury, which was later remade with Jet Li as Fist of Legend. Both of these movies are better than Enter the Dragon.