Personal Data

As a field of study, Statistics is essentially the study of how to make sense of data. It is full of formal, quantitative methods for dealing with uncertainty. There are fundamental questions that a successful analysis should touch upon. What do we know (the data)? What do we expect (our assumptions)? What should we expect given what we know (what does the data look like)? How does what we know affect the validity of the assumptions about what we expect (do the data seem excessively implausible given our assumptions)? The goal typically takes the form of setting up a fall guy (or gal) named Null Hypothesis and checking to see whether you can knock him (or her) down with your data. Traditionally, the null hypothesis is the enemy. If you are a drug company, the null hypothesis is that your drug is worthless (compared to the current standard of treatment).

On a structurally similar but less formal level, we’re all statisticians. We all compute our own personal statistics on a daily basis, using incredibly biased data derived from our muddled thoughts and our 5 senses. For instance, when I look myself over before I leave the house in the morning, I implicitly create an average rating of my appearance based on any number of specific observable data points. How’s my hair (when I have hair)? Are my teeth still crooked? Is my forehead doing that flaky, dry skin thing? Is my shirt suitably ironed? Does this stomach fat make it look like I have stomach fat?

All these things are weighted by importance and aggregated to create an overall “this is how I look” score. I then take this score and compare it to a null hypothesis about my appearance. My personal appearance null is usually “I look good” and I’m perfectly happy not knocking it down at all. Choosing a null for your personal appearance is a complicated process that incorporates your current state of mind and a whole slew of conscious and unconscious thoughts based on a lifetime’s worth of accumulated experience.

So my perception of my appearance is my data. I use it to create an overall “this is how I look” score, which is then compared to what I’d expect the score to be if I looked good (my null). This process is very sensitive to outliers. So if I looked like this guy, except that I had a giant, bulbous blister on my forehead, that blister would likely dominate my “this is how I look” score despite being a small, atypical portion of an otherwise magnificent package, and I’d probably reject the null hypothesis that I look good.

The results of my personal appearance test have ramifications. Whether I decide that I look okay or not will partially determine how I interact with the world around me. It will inform my interactions with other people. During these interactions, I will be subconsciously gathering and aggregating more data and performing more informal statistical tests: Is this person listening to me? Do they understand what I’m talking about, or am I completely incoherent? Was that joke I told actually funny or was that laugh just a polite acknowledgment of attempted comedy? For each of these questions, I’d compare the person’s behavior (the data) with how I’d expect them to behave (the null) and draw conclusions accordingly.

If I feel especially dashing, then perhaps I’m slightly more apt to feel confident in my interactions with other people and this confidence can result in more positive ad hoc estimates of how well these interactions are going. These ad hoc estimates are data points that go into a running calculation of how well my day is going.

At the end of the day, I have all these internal data points describing the quality of my day. These too are then aggregated into a relative rating describing the overall quality of my day. If the average quality of my day seems especially good or bad (compared to my expectations), I might spend some time going through the data trying to figure out whether there was a specific event that made my day especially good or bad. Or, I’ll probably be asleep within moments of lying down.

I bet you do this too.

A lot of the time, this process is completely subconscious. One interesting thing you can do is try and figure out what your null hypotheses are and how they got to be that way. It can be enlightening to see how much of an effect things that happened ages ago can have in providing context for the decisions you make now.

Lies, Damn Lies and Statistics: A Primer

There is a classic quote that goes something like this: “There are three types of lies: lies, damn lies, and statistics.” This quote is often attributed to Mark Twain, even though he attributed it to Benjamin Disraeli, who may or may not have actually said it. Perhaps there should be a fourth type of lie concerning misattributed quotations? Regardless of where it came from, it’s more a prescription for dealing with conceptually difficult information than a statement about statistics. In fact, if you dig deep enough into what this quote seems to actually mean, you’ll see that it’s a statistical statement at its core.

First, what is a statistic? A statistic is some sort of quantitative summary of a set of data. Think average. Think standard deviation. Think kurtosis. Then forget about kurtosis. It’ll probably never matter to you. Summaries of data are useful because if you can reasonably make certain assumptions about the data, a few summary statistics are all you need to convert that data into something meaningful. If IQs are normally distributed, then indeed, half of the population is of below average intelligence. Seems simple enough. Problems arise, however, when these statistics are misinterpreted. For instance, say you happen to be studying a population where intelligence is exponentially distributed (use your imagination). Then half the population’s IQ is below about 0.69 x the population average, and roughly 63% of the population is below average. Statistics are properties of a sample – they are a group-scale phenomenon. It is entirely possible for no member of a population to be average; hence, the average family has something like 2.3 children. Average families can have fractional children. With real families, the convention is to count children with whole numbers.
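If you'd like to check the "below average" arithmetic yourself, here is a minimal sketch in Python. The IQ-style scale (mean 100, standard deviation 15) and the helper name exponential_cdf are my own illustrative choices, not anything taken from a real data set. It relies only on two textbook facts: a Normal distribution's median equals its mean, and an exponential distribution with mean m has CDF 1 - exp(-x / m).

```python
import math
from statistics import NormalDist

# Normal case: the median equals the mean, so exactly half the
# population sits below average. The scale (100, 15) is an assumption.
iq = NormalDist(mu=100, sigma=15)
share_below_mean_normal = iq.cdf(100)


def exponential_cdf(x, mean):
    """Share of an exponential population (with the given mean) below x."""
    return 1.0 - math.exp(-x / mean)


mean = 100.0  # hypothetical population average

# The exponential median is ln(2) times the mean.
median_over_mean = math.log(2)
share_below_median = exponential_cdf(median_over_mean * mean, mean)

# The share of the population below the mean is 1 - 1/e.
share_below_mean_exp = exponential_cdf(mean, mean)

print(f"Normal: share below the mean       = {share_below_mean_normal:.3f}")
print(f"Exponential: median / mean         = {median_over_mean:.3f}")
print(f"Exponential: share below the mean  = {share_below_mean_exp:.3f}")
```

The point of the sketch is just that "half the population is below average" is a statement about the median, and it only matches the mean when the distribution is symmetric.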

Summary statistics by themselves sometimes aren’t all that interesting. If one is trying to determine whether a real difference exists between two different measurable phenomena, summary statistics will likely only get you halfway there. Imagine trying to compare stats for Adrian Peterson and Walter Payton to determine who is the better running back. You might compute their career gameday rushing yards and compare. You could compare Peterson’s stats from his first 50 games, with Payton’s stats from his first 50 games. What else might you need to account for to make a comparison of these numbers appropriate? Well, it’s hard to run well behind a horrible offensive line, so one could attempt to take offensive line quality into account. Offensive philosophy is also important. It might be difficult to get rushing yards if your playcaller leans heavily towards passing plays despite your running back’s brilliance. Here it can get tricky. What’s the best way to quantify O-line quality? What about offensive pass-heaviness? I’m not saying it can’t be done, just that a proper analysis of the data could get fairly complex fairly quickly. If you wanted to go the easier route, you could just assume a priori that none of those other things matter.

In order to make sense of the next part, an explanation of terminology may come in handy. A population is a group of things. A sample is a portion of a population. A probability is the ratio of the number of ways a particular event can happen to the number of ways all possible relevant events could happen (assuming each of those ways is equally likely). The probability of flipping a head on a fair coin is the number of heads on a coin divided by the number of sides on the coin (for simplicity’s sake, you’d probably assume that the outer rim of the coin can be neglected). A probability distribution is an expression of how likely various values are to show up in a sample from a population. In other words, it provides a map from possible values to the probability associated with those values. Take a six-sided die. Assuming fairness, each side has a 1/6 probability of coming up per roll. This can be represented as a distribution. The set of possible outcomes in a distribution is called its support. In the case of a die, the support consists of the numbers 1, 2, 3, 4, 5, 6. Each one of these values is mapped to a corresponding probability; here all of the probabilities are 1/6. A loaded die would have the same support but different probabilities. Loaded or not, all of these probabilities need to add up to one.
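The die example translates almost word for word into code. What follows is a small sketch, not a canonical representation: a distribution is just a dictionary from each value in the support to its probability. The loaded die's weights and the helper names (support, total_probability, probability_of) are made up for illustration.

```python
from fractions import Fraction

# A fair die: every face in the support maps to probability 1/6.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

# A loaded die: same support, different probabilities.
# These particular weights are hypothetical.
loaded_die = {
    1: Fraction(1, 10),
    2: Fraction(1, 10),
    3: Fraction(1, 10),
    4: Fraction(1, 10),
    5: Fraction(1, 10),
    6: Fraction(1, 2),
}


def support(dist):
    """The possible outcomes: every value with nonzero probability."""
    return sorted(value for value, p in dist.items() if p > 0)


def total_probability(dist):
    """Loaded or not, this has to come out to exactly one."""
    return sum(dist.values())


def probability_of(dist, event):
    """Probability that the outcome lands in the given set of values."""
    return sum(p for value, p in dist.items() if value in event)


evens = {2, 4, 6}
print(support(fair_die), total_probability(fair_die))
print(support(loaded_die), total_probability(loaded_die))
print(probability_of(fair_die, evens), probability_of(loaded_die, evens))
```

Using Fraction instead of floating point keeps the "adds up to one" check exact rather than approximately true.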

Back to football. It turns out that for the first 50 games of their careers, Peterson and Payton have pretty similar average yards per game (AP – 99.28 ypg, WP – 94.08 ypg). Is the difference between these two numbers, given the spread of the data, large enough to be due to an actual difference in ability? Statistics provides a means of answering this question, provided you’re willing to make an assumption. The next assumption, which is where capital S Statistics really comes in, appears when you assume some sort of underlying distribution for your data. The nature of the assumption is this: each player has one true mean yards per game, and any weekly divergence between their actual yardage and this theoretical yardage is due solely to chance (a Bayesian would assume that the true mean yards per game was itself a random variable). This assumption should be based on sound reasoning and a thorough inspection of the data. If the data looks exotic enough, you can even assume that you don’t know anything about the underlying distribution. This is called the Rumsfeld distribution, and it’s known for being known to be unknown (not really, see nonparametric statistics). With regard to running backs, you might assume that for each game, each player’s total rushing yards comes from a Normal distribution centered around some average value. This would give you information about how large you’d expect differences in their 50-game yards per game average to be. Then, to determine the better player, you’d compare your data using some formal statistical test based on this assumption.

So, let’s try it out and see where it takes us. Data is available here: All-Day vs. Sweetness. As an aside, I must say, Payton wins on the nickname front.

Note the following plot. It shows a normalized histogram for each player overlaid with corresponding smoothed density curves. These plots show two different ways of representing the distributions of these data. Either way, the data line up fairly closely, and indeed, a t-test suggests that there isn’t a statistically significant difference between their performances.

Yard distributions

The t-test assumes that the underlying observations are from a Normal distribution. Do the data bear this assumption out? The next two plots show comparisons between each player’s set of data and a simulated Normal distribution based on the mean and standard deviation of each player’s data. The simulated Normal curves are in black. Both sets are approximately symmetric and each seems to have a maximum value in the neighborhood of its Normal companion. Both have similar means (AP – 99.28, WP – 94.08) and standard deviations (53.36, 56.39). AP’s yards are a little tighter around their mean than the Normal and both players have a little bump corresponding to their single-game rushing yard records. For the purposes of this analysis, normality seems a reasonable, if not entirely correct, assumption to me.

AP's distribution

WP's distribution
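The t-test mentioned above can be roughly reproduced from the summary numbers alone (means of 99.28 and 94.08, standard deviations of 53.36 and 56.39, 50 games each). The sketch below is my own reconstruction, not necessarily the exact test behind the result quoted earlier. It computes the Welch form of the t statistic, which gives the same statistic as the pooled-variance form when both samples are the same size. To stay inside the standard library, it approximates the p-value with a Normal distribution instead of a t distribution, which is a simplification I'm assuming is harmless with roughly 100 observations. It also treats each game as an independent observation, which is an assumption too.

```python
import math


def welch_t_from_summary(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """t statistic for a difference in means, from summary statistics."""
    standard_error = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    return (mean_a - mean_b) / standard_error


def two_sided_p_normal_approx(t):
    """Two-sided p-value using a Normal approximation to the t distribution.

    This is an approximation chosen for simplicity, not an exact t-test.
    """
    return math.erfc(abs(t) / math.sqrt(2))


# Summary statistics quoted in the text: first 50 games for each player.
t_stat = welch_t_from_summary(99.28, 53.36, 50, 94.08, 56.39, 50)
p_value = two_sided_p_normal_approx(t_stat)

print(f"t = {t_stat:.2f}, approximate two-sided p = {p_value:.2f}")
```

The t statistic comes out a bit under 0.5, nowhere near the rough threshold of 2 usually needed for significance at the 5% level, which agrees with the conclusion that the two sets of numbers can't be told apart.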

So there you have it. Statistical evidence that Adrian Peterson and Walter Payton are indistinguishably talented football players. Interchangeable. Except that’s not really what the statistics say at all. Let’s list some of the things that went into this analysis.

1) We assumed that our subset of data was representative of our real quantity of interest.
2) We assumed that our subset of data was drawn from a particular probability distribution.
3) We compared the data in the context of what we’d expect it to look like if it were actually from our assumed probability distribution and found no reason to believe that either set came from a different distribution than the one we assumed.

Sports statistics aren’t my strong suit, but I’m pretty sure there’s more to this “who’s the better running back?” question than yards per game. What about yards per carry? Or yards conditional on offensive philosophy and offensive line quality? If we were talking Tim Tebow we’d have to include God effects in our model somehow. Yards per game might on its own be sufficient to characterize running back talent, but we don’t know that. This analysis didn’t even try to take other factors into account. So assumption 1) is probably wrong. This is huge. In an effort to answer one question, we actually answered a different, possibly irrelevant question. Assumption 2) seems close enough to true, but at this point, does that even matter?

Statistical information is usually presented in a results-first format. This makes sense, since the results are typically the goal. The problem is that the results rarely tell the whole story, and seemingly rarer still do the articles surrounding those results. So you end up with headlines that blare alarmist claims with little context and articles that present data with nary a critical question. For instance, what were the demographics of the sampled population in the linked article? Are these kids performing worse than previous generations? How frequently do adults misuse “there”, “their” and “they’re”? Is it possible that the apparently high proportion of parents who think their children are less advanced than previous generations is more due to parental self-doubt and recall bias than anything real? Who is OnePoll? Do they have an agenda?

Don’t get me wrong. If the UK is anything like the US, then they have serious, complicated systemic problems with their education system including widespread racial, cultural and geographic disparities with the net result being a lot of kids who don’t necessarily know the things they ought to. However, these results as presented give a less than complete picture of what is actually going on.

With that being said, I don’t doubt the statistics. I suspect that there’s a very remote chance that they (the numbers, not the interpretation) aren’t being presented accurately here. Even in that event, the sin is the responsibility of a person and not a data set. Statistics don’t lie. Ever. People use statistics to lie. They use statistics to lie because the average person doesn’t know enough about statistics to see the holes in the types of flawed appeals to statistics that frequently get trotted out in support of flawed ideas.

The Twain/Disraeli/unknown quote is a warning, imploring people to automatically doubt the integrity of anyone whose argument relies heavily on numbers. I can’t really argue with the sentiment. A successful lie frequently depends on the liar’s ability to control information and how that information is perceived. Take this fact and pair it with the very human inability to correctly parse statistical information and you’ve got a viable path to skulduggery. Gallant presents his data in context with all the appropriate caveats. Goofus deprives his audience of its ability to interpret that data by presenting only the most favorable portions of it and neglecting to mention any flaws that may have occurred with data collection and analysis.