There is a classic quote that goes something like this: “There are three types of lies: lies, damn lies, and statistics.” This quote is often attributed to Mark Twain, even though he attributed it to Benjamin Disraeli, who may or may not have actually said it. Perhaps there should be a fourth type of lie concerning misattributed quotations? Regardless of where it came from, it’s more a prescription for dealing with conceptually difficult information than a statement about statistics. In fact, if you dig deep enough into what this quote seems to actually mean, you’ll see that it’s a statistical statement at its core.
First, what is a statistic? A statistic is some sort of quantitative summary of a set of data. Think average. Think standard deviation. Think kurtosis. Then forget about kurtosis. It’ll probably never matter to you. Summaries of data are useful because if you can reasonably make certain assumptions about the data, a few summary statistics are all you need to convert that data into something meaningful. If IQ is normally distributed, then indeed, half of the population is of below average intelligence. Seems simple enough. Problems arise, however, when these statistics are misinterpreted. For instance, if you happen to be studying a population where intelligence is exponentially distributed (use your imagination), then half the population’s IQ is below ln(2) ≈ 0.69 times the population average. Statistics are properties of a sample – they are a group-scale phenomenon. It is entirely possible for no member of a population to be average; hence, the average family has something like 2.3 children. Average families can have fractional children. With real families, the convention is to count children with whole numbers.
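The exponential-IQ thought experiment above can be sketched in a few lines. The numbers here are made up for illustration; the only real content is the ln(2) relationship between the median and mean of an exponential distribution.

```python
import math
import random
import statistics

random.seed(42)
mean_iq = 100  # hypothetical population average

# Draw a large sample from an exponential distribution with that mean.
sample = [random.expovariate(1 / mean_iq) for _ in range(200_000)]

sample_mean = statistics.fmean(sample)
sample_median = statistics.median(sample)

# For an exponential distribution, median = ln(2) * mean, so half the
# population sits below roughly 69% of the average.
print(sample_mean)                  # close to 100
print(sample_median / sample_mean)  # close to math.log(2), about 0.693
```

The mean and the median answer different questions, and "half the population is below average" is only true when the two coincide, as they do for symmetric distributions like the Normal.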
Summary statistics by themselves sometimes aren’t all that interesting. If one is trying to determine whether a real difference exists between two different measurable phenomena, summary statistics will likely only get you halfway there. Imagine trying to compare stats for Adrian Peterson and Walter Payton to determine who is the better running back. You might compute their career rushing yards per game and compare. You could compare Peterson’s stats from his first 50 games with Payton’s stats from his first 50 games. What else might you need to account for to make a comparison of these numbers appropriate? Well, it’s hard to run well behind a horrible offensive line, so one could attempt to take offensive line quality into account. Offensive philosophy is also important. It might be difficult to get rushing yards if your playcaller leans heavily towards passing plays despite your running back’s brilliance. Here it can get tricky. What’s the best way to quantify O-line quality? What about offensive pass-heaviness? I’m not saying it can’t be done, just that a proper analysis of the data could get fairly complex fairly quickly. If you wanted to go the easier route, you could just assume a priori that none of those other things matter.
In order to make sense of the next part, an explanation of terminology may come in handy. A population is a group of things. A sample is a portion of a population. A probability is the ratio of the number of ways a particular event can happen to the number of ways all possible relevant events could happen. The probability of flipping a head on a fair coin is the number of heads on a coin divided by the number of sides on the coin (for simplicity’s sake, you’d probably assume that the outer rim of the coin can be neglected). A probability distribution is an expression of how likely various values are to show up in a sample from a population. In other words, it provides a map from possible values to the probability associated with those values. Take a six-sided die. Assuming fairness, each side has a 1/6 probability of coming up per roll. This can be represented as a distribution. The range of possible outcomes in a distribution is called its support. In the case of a die, the support consists of the numbers 1, 2, 3, 4, 5, 6. Each one of these values is mapped to a corresponding probability; here all of the probabilities are 1/6. A loaded die would have the same support but different probabilities. Loaded or not, all of these probabilities need to add up to one.
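The die example above is concrete enough to write down directly: a distribution is just a map from outcomes (the support) to probabilities. The loaded die’s weights below are invented for illustration.

```python
from fractions import Fraction

# A fair die: each outcome in the support {1,...,6} maps to probability 1/6.
fair_die = {side: Fraction(1, 6) for side in range(1, 7)}

# A hypothetical loaded die: same support, different probabilities.
loaded_die = {1: Fraction(1, 12), 2: Fraction(1, 12), 3: Fraction(1, 6),
              4: Fraction(1, 6), 5: Fraction(1, 4), 6: Fraction(1, 4)}

# Loaded or not, the probabilities must sum to one,
# and both dice share an identical support.
assert sum(fair_die.values()) == 1
assert sum(loaded_die.values()) == 1
assert fair_die.keys() == loaded_die.keys()
```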
Back to football. It turns out that for the first 50 games of their careers, Peterson and Payton have pretty similar average yards per game (AP – 99.28 ypg, WP – 94.08 ypg). Is the difference between these two numbers, given the spread of the data, large enough to be due to an actual difference in ability? Statistics provides a means of answering this question, provided you’re willing to make an assumption. The key assumption, which is where capital-S Statistics really comes in, appears when you assume some sort of underlying distribution for your data. The nature of the assumption is this: each player has one true mean yards per game, and any weekly divergence between their actual yardage and this theoretical yardage is due solely to chance (a Bayesian would assume that the true mean yards per game was itself a random variable). This assumption should be based on sound reasoning and a thorough inspection of the data. If the data look exotic enough, you can even assume that you don’t know anything about the underlying distribution. This is called the Rumsfeld distribution, and it’s known for being known to be unknown (not really – see nonparametric statistics). With regards to running backs, you might assume that for each game, each player’s total rushing yards comes from a Normal distribution centered around some average value. This would give you information about how large you’d expect differences in their 50-game yards per game averages to be. Then, to determine the better player, you’d compare your data using some formal statistical test based on this assumption.
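A sketch of what that formal test looks like in practice. The means and standard deviations below are the ones quoted in this post; the per-game yardages themselves are simulated stand-ins drawn from the assumed Normal distributions, not the real game logs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated 50-game yardage logs under the Normality assumption,
# using the sample means and standard deviations quoted in the text.
peterson = rng.normal(loc=99.28, scale=53.36, size=50)
payton = rng.normal(loc=94.08, scale=56.39, size=50)

# Two-sample t-test: is the gap between the sample means large
# relative to the spread we'd expect from chance alone?
result = stats.ttest_ind(peterson, payton)
print(result.statistic, result.pvalue)
```

A large p-value here means the observed gap is comfortably within what chance would produce under the assumptions, not that the two players are proven equal.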
Note the following plot. It shows a normalized histogram for each player overlaid with corresponding smoothed density curves. These plots show two different ways of representing the distributions of these data. Either way, the data line up fairly closely, and indeed, a t-test suggests that there isn’t a statistically significant difference between their performances.
The t-test assumes that the underlying observations are from a Normal distribution. Do the data bear this assumption out? The next two plots show comparisons between each player’s set of data and a simulated Normal distribution based on the mean and standard deviation of each player’s data. The simulated Normal curves are in black. Both sets are approximately symmetric and each seems to have a maximum value in the neighborhood of its Normal companion. Both have similar means (AP – 99.28, WP – 94.08) and standard deviations (53.36, 56.39). AP’s yards are a little tighter around their mean than the Normal, and both players have a little bump corresponding to their single-game rushing yard records. For the purposes of this analysis, normality seems a reasonable, if not entirely correct, assumption to me.
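A formal complement to eyeballing the histograms is a Shapiro–Wilk test of normality. As before, the data here are simulated stand-ins (the real game logs aren’t reproduced in this post), drawn from the Normal distribution the analysis assumes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated 50-game yardage log, standing in for the real data.
yards = rng.normal(loc=99.28, scale=53.36, size=50)

# Shapiro-Wilk tests the null hypothesis that the sample came
# from a Normal distribution.
stat, p = stats.shapiro(yards)

# A large p-value means the test found no evidence against normality --
# which, as with the t-test, is not the same as proof of normality.
print(stat, p)
```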
So there you have it. Statistical evidence that Adrian Peterson and Walter Payton are indistinguishably talented football players. Interchangeable. Except that’s not really what the statistics say at all. Let’s list some of the things that went into this analysis.
1) We assumed that our subset of data was representative of our real quantity of interest.
2) We assumed that our subset of data was drawn from a particular probability distribution.
3) We compared the data in the context of what we’d expect it to look like if it were actually from our assumed probability distribution and found no reason to believe that either set came from a different distribution than the one we assumed.
Sports statistics aren’t my strong suit, but I’m pretty sure there’s more to this “who’s the better running back?” question than yards per game. What about yards per carry? Or yards conditional on offensive philosophy and offensive line quality? If we were talking Tim Tebow we’d have to include God effects in our model somehow. Yards per game might on its own be sufficient to characterize running back talent, but we don’t know that. This analysis didn’t even try to take other factors into account. So assumption 1) is probably wrong. This is huge. In an effort to answer one question, we actually answered a different, possibly irrelevant question. Assumption 2) seems close enough to true, but at this point, does that even matter?
Statistical information is usually presented in a results-first format. This makes sense, since the results are typically the goal. The problem is that the results rarely tell the whole story, and the articles surrounding those results tell it even more rarely. So you end up with headlines that blare alarmist claims with little context and articles that present data with nary a critical question. For instance, what were the demographics of the sampled population in the linked article? Are these kids performing worse than previous generations? How frequently do adults misuse “there”, “their” and “they’re”? Is it possible that the apparently high proportion of parents who think their children are less advanced than previous generations is more due to parental self-doubt and recall bias than anything real? Who is OnePoll? Do they have an agenda?
Don’t get me wrong. If the UK is anything like the US, then they have serious, complicated systemic problems with their education system including widespread racial, cultural and geographic disparities with the net result being a lot of kids who don’t necessarily know the things they ought to. However, these results as presented give a less than complete picture of what is actually going on.
With that being said, I don’t doubt the statistics. I suppose there’s a very remote chance that they (the numbers, not the interpretation) aren’t being presented accurately here, but even in that event, the sin is the responsibility of a person and not a data set. Statistics don’t lie. Ever. People use statistics to lie. They use statistics to lie because the average person doesn’t know enough about statistics to see the holes in the types of flawed appeals to statistics that frequently get trotted out in support of flawed ideas.
The Twain/Disraeli/unknown quote is a warning, imploring people to automatically doubt the integrity of anyone whose argument relies heavily on numbers. I can’t really argue with the sentiment. A successful lie frequently depends on the liar’s ability to control information and how that information is perceived. Take this fact and pair it with the very human inability to correctly parse statistical information and you’ve got a viable path to skulduggery. Gallant presents his data in context with all the appropriate caveats. Goofus deprives his audience of its ability to interpret that data by presenting only the most favorable portions of it and neglecting to mention any flaws that may have occurred in data collection and analysis.