<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>LOL of Large Numbers</title>
	<atom:link href="http://loloflargenumbers.com/blog/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://loloflargenumbers.com/blog</link>
	<description>Asymptotically Effervescent</description>
	<lastBuildDate>Sun, 10 Mar 2013 15:48:05 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>A Year&#8217;s Worth of Birth, or, How to Impute Babies</title>
		<link>http://loloflargenumbers.com/blog/?p=184</link>
		<comments>http://loloflargenumbers.com/blog/?p=184#comments</comments>
		<pubDate>Sun, 10 Mar 2013 15:48:05 +0000</pubDate>
		<dc:creator>Kristopher Kapphahn</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Birth]]></category>
		<category><![CDATA[Birth Map]]></category>
		<category><![CDATA[Map]]></category>
		<category><![CDATA[Natality]]></category>
		<category><![CDATA[NCHS]]></category>

		<guid isPermaLink="false">http://loloflargenumbers.com/blog/?p=184</guid>
		<description><![CDATA[This is what 4 million births looks like, assuming that you&#8217;re floating high above the continental United States and that each birth appears as a green dot as it occurs and then changes into a blue dot for all subsequent days. It looks best in full screen HD. The video uses data from 2004. On [...]]]></description>
				<content:encoded><![CDATA[<p>This is what 4 million births looks like, assuming that you&#8217;re floating high above the continental United States and that each birth appears as a green dot as it occurs and then changes into a blue dot for all subsequent days. It looks best in full screen HD.</p>
<p><span class='embed-youtube' style='text-align:center; display: block;'><iframe class='youtube-player' type='text/html' width='640' height='390' src='http://www.youtube.com/embed/s7N2ZQZ4mBA?version=3&#038;rel=1&#038;fs=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;wmode=transparent' frameborder='0'></iframe></span></p>
<p>The video uses data from 2004. On each day babies were born and each green dot corresponds to a birth. When the video moves on to the next day, all births from the previous day turn blue and remain blue for the rest of the year. This isn&#8217;t an exact representation of where or when these births actually happened, but I think that it&#8217;s a fairly plausible approximation.</p>
<p><strong>NCHS DATA</strong></p>
<p>The National Center for Health Statistics (NCHS) has <a href="http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm">data</a>. A lot of it. Millions and millions of data points waiting &#8211; aching &#8211; for someone to turn them into meaning. They have data on birth, death, air quality, and other subjects denoted by cryptic acronyms.</p>
<p>One such data set is the 2004 natality data set. Each observation in this file corresponds to a single birth, and for 2004 in the US, there were over 4.1 million babies born (2.4 million people died &#8211; see 2004 mortality data set).</p>
<p>The natality data go back to 1968. I chose 2004 because the 2004 data set is the last one that includes geographic information. After 2004, accessing data with geographic information requires more comprehensive vetting.</p>
<p>The data corresponding to each birth is comprehensive, with fields for mother&#8217;s age, place of delivery, attendant, parents&#8217; education, natal comorbidities, etc. However, in order to maintain the privacy of the people represented in the data, a few variables are omitted or conditionally set to uninformative values. For instance, while each record comes with information about the state and county where the birth occurred, for births which occurred in counties with populations of fewer than 100,000 people, the county field is set to a useless &#8216;999&#8217; value. The data also omit any information about the day on which a birth occurred. There is good reason for leaving these details out; it would be fairly easy to match a person to their record in the data if they lived in a less populous county and you knew even cursory information about their demographic info and birth date.</p>
<p>But my goal here wasn&#8217;t to track down rural women and confront them with my knowledge of their child&#8217;s apgar score or whether the use of forceps was attempted during labor. I just wanted to plot the location of the birth on a map because I thought it might make for an interesting visual.</p>
<p><strong>Filling in the Blanks</strong></p>
<p>My plan was to take each birth, match it up with longitude and latitude coordinates based on the county of occurrence, then use these coordinates to plot the birth&#8217;s location on a map of the US for each day of the year. Simple, right?</p>
<p>Maybe not. Getting the raw data was easy. A right click was all it took. The data come compressed, and expand to a 5+ gigabyte, fixed width text file. This is a file that&#8217;s just asking for SAS, but SAS is for large organizations and people who don&#8217;t want to produce attractive pictures with their primary software package. Since I am neither of those things (but I will admit to appreciating the very specific advantages of SAS when I am working on behalf of a large organization), I wrote a Ruby script to parse the raw data into a MySQL database and then used R to pull month-wide chunks of the database as needed.</p>
<p>The plots were generated in R using the maps library and ggplot2.</p>
<p>I got around the missing date information in the data by assuming that births were equally likely on any day of a month and assigning them thusly. This isn&#8217;t strictly true, as the probability of being born on Saturday or Sunday is substantially lower than the probability of being born any other day of the week. I thought about taking this fact into account when I handed out birth dates, but opted not to because it wouldn&#8217;t add much to the visualization.</p>
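<p>For the curious, the date imputation amounts to something like the sketch below. It is written in Python as a stand-in (the actual work was done in R), and the function name <code>impute_birth_day</code> and the seed are made up for illustration.</p>

```python
import calendar
import random

def impute_birth_day(year, month, rng):
    """Pick a day of the given month uniformly at random."""
    # monthrange returns (weekday of first day, number of days in month)
    days_in_month = calendar.monthrange(year, month)[1]
    return rng.randint(1, days_in_month)  # inclusive on both ends

rng = random.Random(2004)  # arbitrary seed so the sketch is repeatable
february_days = [impute_birth_day(2004, 2, rng) for _ in range(10000)]
print(min(february_days), max(february_days))
```

<p>Every day of the month gets the same probability here, so the weekend dip mentioned above is deliberately ignored.</p>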
<p>I initially generated a video like the one above using the data as it came (with randomly assigned birth dates) and ended up with vast swaths of the west showing no births at all. It didn&#8217;t look plausible. It is true that population density is pretty small in the west, but no births at all wasn&#8217;t going to work. So what&#8217;s a fella to do? I had to come up with a reasonable way to place the people from counties with fewer than 100,000 that wouldn&#8217;t appear too conspicuously wrong.</p>
<p>I&#8217;ve done a bit of work with decennial census data, so I was aware of the massive amount of data that the census folks generate. At first, I thought it would be sufficient for my purposes to get a list of counties with fewer than 100,000 people for each state and then randomly place rural people in one of these counties. This approach worked okay, but was still problematic because the longitude and latitude coordinates for each county in the census data were located at a county&#8217;s most populous city. For larger rural counties, the resulting birth patterns still looked implausibly sparse because all the births assigned to that county would be concentrated in one corner, which left the rest of the county looking like a dead zone.</p>
<p>I eventually settled on using census tract data. Each rural birth is assigned a birth location in a census tract somewhere in its state, with births being distributed to census tracts with probabilities proportional to a tract&#8217;s population. To ensure that rural births didn&#8217;t end up in densely populated urban areas, they were only allowed to be assigned to census tracts located in counties whose total county population was less than 100,000. So a person who gave birth in Otter Tail County (pop ~60,000) could be assigned to a census tract in any of the 70 or so counties in Minnesota whose population was less than 100,000. This strategy assumes that all rural census tracts have the same proportion of women who are going to give birth, which is an assumption I didn&#8217;t at all try to verify.</p>
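<p>A minimal sketch of that population-weighted placement, again in Python rather than the R I actually used. The tract records and their populations below are invented for illustration; only the 100,000 cutoff comes from the data.</p>

```python
import random

# Hypothetical records: (tract_id, county, county_population, tract_population)
TRACTS = [
    ("T1", "Otter Tail", 60000, 4000),
    ("T2", "Otter Tail", 60000, 2500),
    ("T3", "Becker", 32000, 3000),
    ("T4", "Hennepin", 1100000, 5000),
]
RURAL_CUTOFF = 100000  # counties below this are masked in the natality file

def place_birth(is_rural, tracts, rng):
    """Pick a tract for one birth, weighted by tract population."""
    # Rural births may only land in tracts from small counties, and
    # births from large counties only in tracts from large counties.
    pool = [t for t in tracts if (t[2] < RURAL_CUTOFF) == is_rural]
    weights = [t[3] for t in pool]
    return rng.choices(pool, weights=weights, k=1)[0]

rng = random.Random(42)
print(place_birth(True, TRACTS, rng)[0])
```

<p>In the real thing the pool would also be restricted to tracts in the same state as the birth.</p>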
<p>Because the births in the most populous counties were located by county, and census tracts are generally more granular than county, I set the code up to randomly assign births in populous counties to census tracts from populous counties within the same state with probability proportional to census tract population. So a person who gave birth in Hennepin County (pop. &gt;&gt; 100,000) could be assigned to any census tract in MN in a county with population greater than 100,000.</p>
<p>So the births shown in the movie above are, strictly speaking, not at all what a time-lapse map of actual births in 2004 looked like; it is very unlikely that a birth shown in the video actually occurred on the day shown at the place shown. However, given the gaps in the data, I think it&#8217;s a reasonable representation of what the video might have looked like had it been produced using the complete data.</p>
]]></content:encoded>
			<wfw:commentRss>http://loloflargenumbers.com/blog/?feed=rss2&#038;p=184</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Barry Gibbs Sampler</title>
		<link>http://loloflargenumbers.com/blog/?p=165</link>
		<comments>http://loloflargenumbers.com/blog/?p=165#comments</comments>
		<pubDate>Sat, 24 Mar 2012 05:36:58 +0000</pubDate>
		<dc:creator>Kristopher Kapphahn</dc:creator>
				<category><![CDATA[bayes]]></category>
		<category><![CDATA[BeeGees]]></category>
		<category><![CDATA[changepoint]]></category>
		<category><![CDATA[JAGS]]></category>

		<guid isPermaLink="false">http://loloflargenumbers.com/blog/?p=165</guid>
		<description><![CDATA[There are two main schools of thought in the realm of statistical analysis (sorry Nihilists, though your attitude toward uncertainty is admirable in its boldness, it&#8217;s no way to go about making decisions). The Frequentist school of thought is what is taught to most entry-level statisticians, because it seems to work well for a whole [...]]]></description>
				<content:encoded><![CDATA[<p>There are two main schools of thought in the realm of statistical analysis (sorry Nihilists, though your attitude toward uncertainty is admirable in its boldness, it&#8217;s no way to go about making decisions). The Frequentist school of thought is what is taught to most entry-level statisticians, because it seems to work well for a whole slew of things and the things it doesn&#8217;t work well for, well, what you don&#8217;t know can&#8217;t hurt you, right?</p>
<p>Frequentists treat their estimands as unchanging constants. Typically Frequentist methods seek to produce an interval that can, with a high level of confidence, be expected to contain a parameter of interest. So if a Frequentist were attempting to estimate the average length of a wallaby ear, they would measure a bunch of wallaby ears, assume that there is one true average wallaby ear length and use information contained in their sample of wallaby ear lengths along with a distributional assumption about all wallaby ear lengths to infer an interval that should contain the true mean wallaby ear length 95% of the time. This is a bit counterintuitive. The natural inclination is to believe that there is a 95% chance that the true mean is within the interval produced via this process. However, this isn&#8217;t how it works. A Frequentist will tell you that the mean wallaby ear length is what it is with a probability of one and is what it is not with probability of zero. They might further explain that their method can generate an interval containing the one true mean wallaby ear length with any ol&#8217; arbitrary level of certainty, and that for some mysterious reason, 95% typically defines the line between &#8220;certain enough&#8221; and &#8220;we&#8217;ll never get this published&#8221;.</p>
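<p>One way to see what the 95% refers to is to simulate many wallaby studies and count how often the interval catches the truth. The sketch below is in Python and every number in it (true mean, standard deviation, sample size) is made up; it also assumes the standard deviation is known, which keeps the interval formula simple.</p>

```python
import math
import random

TRUE_MEAN = 7.0  # hypothetical true mean ear length
SIGMA = 1.0      # hypothetical, and assumed known
N = 30           # ears measured per study
Z95 = 1.96       # normal quantile for a 95% interval

def interval_from_sample(rng):
    """Run one study and return its 95% confidence interval."""
    sample = [rng.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    xbar = sum(sample) / N
    half_width = Z95 * SIGMA / math.sqrt(N)
    return xbar - half_width, xbar + half_width

def coverage(studies, rng):
    """Fraction of simulated studies whose interval contains the truth."""
    hits = 0
    for _ in range(studies):
        low, high = interval_from_sample(rng)
        if low <= TRUE_MEAN <= high:
            hits += 1
    return hits / studies

print(coverage(4000, random.Random(1)))
```

<p>The true mean never moves; the intervals do. The printed fraction should sit near 0.95, which is a statement about the procedure rather than about any single interval.</p>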
<p>A Bayesian would treat the mean wallaby ear length itself as a random variable, make a distributional assumption about mean wallaby ear length, then measure a bunch of wallaby ears and combine the information from this sample with the information from the prior assumed mean wallaby ear distribution to produce a posterior distribution for mean wallaby ear length. The result of the Bayesian method is a probability distribution for mean wallaby ear length which a wallaby earmuff maker might then use to properly fill out their inventory for winter in the Southern Hemisphere (though they might have better luck using the raw data as a guide instead of the distribution of the mean). If you&#8217;ve noticed the simplicity in interpreting the results of the Bayesian method, you&#8217;re not alone. It is a lot nicer to be able to claim knowledge of the distribution of mean wallaby ear lengths. Plus, is it really really realistic for Frequentists to assume that there is one constant mean wallaby ear length? Aren&#8217;t wallaby populations constantly changing with the natural cycle of birth, growth and death? Would this constant change then imply a constantly changing mean wallaby ear length? Isn&#8217;t treating estimands as random more accurate?</p>
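<p>For a concrete version of that recipe, here is a small Python sketch of the textbook normal-normal update, where the prior on the mean is normal and the ear lengths are normal with a known standard deviation. The prior, the standard deviation and the five measurements are all invented for illustration.</p>

```python
import math

def normal_posterior(prior_mean, prior_sd, data_sd, data):
    """Posterior mean and sd of a normal mean under a normal prior."""
    n = len(data)
    xbar = sum(data) / n
    # Precisions (1 / variance) add; the posterior mean is a
    # precision-weighted average of the prior mean and sample mean.
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / data_sd ** 2
    post_precision = prior_precision + data_precision
    post_mean = (prior_mean * prior_precision + xbar * data_precision) / post_precision
    return post_mean, math.sqrt(1.0 / post_precision)

ears = [7.1, 6.8, 7.4, 7.0, 6.7]  # hypothetical measurements
post_mean, post_sd = normal_posterior(6.0, 2.0, 1.0, ears)
print(round(post_mean, 3), round(post_sd, 3))
```

<p>The posterior mean lands between the prior guess and the sample mean, and the posterior is narrower than the prior, which is the data doing its job.</p>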
<p>A Frequentist might respond to this question by pointing out that there isn&#8217;t necessarily an objective way to specify the distribution of the mean wallaby ear length, and that attempting to do so reduces the credibility of the analysis. Better to be consistently slightly wrong than inconsistently slightly wrong. And then the Bayesian would say &#8220;Pish posh to you!&#8221; and the Frequentist would respond &#8220;Well! Aren&#8217;t you a devilish little scoundrel!&#8221; and then they&#8217;d retire to their offices, where each one would write long, passionate letters to the editors of various statistical journals containing elaborate arguments about the foolishness of the other. </p>
<p>Bayesian and Frequentist methods both have advantages and drawbacks. I&#8217;m not going to pretend to be enough of an expert in the field to elucidate them with authority.</p>
<p>I&#8217;m taking a Bayesian Analysis course this semester. It involves a lot of coding, which is nice. For the current homework assignment, I have to write an algorithm which implements a Gibbs Sampler on a set of coal mine accident data. The story behind the data is that coal mines are dangerous and that sometimes governments pass laws to make coal mines less dangerous. If these laws work, accident counts drop, and people who would have otherwise died in coal mine accidents live long enough to die of black lung. Pip pip cheerio! These data have a changepoint, which just means that at some point, a change occurred in the mechanism driving the phenomena being studied. If you want to read about these data and a Bayesian method for finding the changepoint, get your JSTOR login info out and <a href="http://www.jstor.org/stable/2347570" title="bayesian changepoint" target="_blank">click here</a> (or google &#8220;Hierarchical Bayesian Analysis of Changepoint Problems&#8221; and find a wayward pdf). </p>
<p>One reason Bayesian methods are useful is that they leave you with a probability distribution, AKA the posterior. If you can specify your models in the right way, you can produce a closed form expression for your posterior. This is how Bayesian analysis had to be done before computers were cheap and easy. Frequently, it isn&#8217;t possible to produce a closed form for your posterior and so various sampling techniques are used in conjunction with your data and model to simulate the posterior. One of these is an algorithm known as the Gibbs Sampler. Unfortunately, the Gibbs Sampler has nothing to do with the Gibb brothers of BeeGees fame. But seeing as how I just wrote a Gibbs Sampler for my homework, I thought it might be informative to use it to analyze their yearly output of singles as a set of data with a changepoint.</p>
<p>First, a plot of the number of BeeGees singles from 1963-2001 (via <a href="http://en.wikipedia.org/wiki/Bee_Gees_discography" title="BeeGees" target="_blank">Wikipedia</a>)</p>
<p><img src="http://loloflargenumbers.com/blog/wp-content/uploads/2012/03/gibbsraw.png" alt="BeeGees Singles Plot" /></p>
<p>Seems like they had fairly high output until about 1980, then less output until, after the death of one of the Gees in 2003, they retired the BeeGee name. Running these data through a simplification (I specified the changepoint a priori) of the model in the Carlin paper I linked to above yields the following plot.</p>
<p><img src="http://loloflargenumbers.com/blog/wp-content/uploads/2012/03/thetalambdarplot.png" alt="BeeGees Results" /></p>
<p>The model assumes that the number of singles released per year is a Poisson-distributed random variable. The curves in the above plot correspond to the number of expected singles in a given year. The lambda curve, which is centered around 1.0, corresponds to the expected number of singles per year following 1980. The theta curve, centered around 3.1, corresponds to the expected number of singles per year between 1963 and 1980. There is essentially zero overlap between these curves, indicating a pretty clear change in BeeGees single productivity in the year 1980. Wikipedia blames a backlash against disco. Disco Stu was not amused.</p>
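<p>To give a feel for where curves like these come from, here is a stripped-down Python sketch. It is not my homework code: it drops the hierarchical layer of the Carlin model and puts a fixed Gamma prior on each rate, in which case the posteriors can be drawn from directly. The counts are made up to resemble the plot, not copied from the discography, and the prior values are arbitrary.</p>

```python
import random

# Made-up yearly counts for illustration only (39 years, 1963-2001).
counts = [3, 4, 2, 5, 3, 3, 4, 2, 3, 4, 3, 2, 4, 3, 3, 2, 4, 3,
          1, 0, 1, 2, 1, 0, 1, 1, 2, 0, 1, 1, 0, 2, 1, 1, 0, 1, 1, 2, 1]
K = 18           # changepoint fixed a priori: the first K years use theta
A, B = 1.0, 1.0  # hypothetical Gamma(shape, rate) prior for both rates

def draw_rate(ys, rng):
    """One posterior draw of a Poisson rate under a Gamma(A, B) prior."""
    shape = A + sum(ys)
    rate = B + len(ys)
    # random.gammavariate takes (shape, scale), and scale = 1 / rate
    return rng.gammavariate(shape, 1.0 / rate)

rng = random.Random(0)
theta = [draw_rate(counts[:K], rng) for _ in range(5000)]
lam = [draw_rate(counts[K:], rng) for _ in range(5000)]
print(sum(theta) / len(theta), sum(lam) / len(lam))
```

<p>Histograms of <code>theta</code> and <code>lam</code> would give two humps like the ones plotted above, one for each side of the changepoint.</p>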
<p>Bayesian analysis is flexible too. For instance, this model can be refit with the changepoint as a parameter, resulting in the following posterior. </p>
<p><img src="http://loloflargenumbers.com/blog/wp-content/uploads/2012/03/JAGSKplot.png" alt="BeeGees Changepoint" /></p>
<p>The posterior distribution for the changepoint has a mode of 18, which corresponds to 1981. According to this analysis, the year which most likely divides two levels of single production for the BeeGees is 1981.  This second analysis used JAGS instead of my Gibbs Sampler code because that&#8217;s how I did the homework and I&#8217;m too much of a tired and lazy blogger to rewrite my sampler to incorporate the changepoint.</p>
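<p>For anyone who would rather see the sampler than take JAGS&#8217;s word for it, a bare-bones Gibbs sampler with the changepoint as a parameter might look like the Python sketch below. The same caveats apply: the counts are made up, the Gamma priors are fixed rather than hierarchical, the changepoint gets a uniform prior, and none of this is the code behind the plot above.</p>

```python
import math
import random
from collections import Counter

# Made-up yearly counts for illustration only (39 years).
counts = [3, 4, 2, 5, 3, 3, 4, 2, 3, 4, 3, 2, 4, 3, 3, 2, 4, 3,
          1, 0, 1, 2, 1, 0, 1, 1, 2, 0, 1, 1, 0, 2, 1, 1, 0, 1, 1, 2, 1]
A, B = 1.0, 1.0  # hypothetical Gamma(shape, rate) prior for both rates

def gibbs(ys, iters, rng):
    """Alternate draws of theta, lambda and the changepoint k."""
    n = len(ys)
    prefix = [0]  # prefix[j] = total count in the first j years
    for y in ys:
        prefix.append(prefix[-1] + y)
    total = prefix[-1]
    ks = list(range(1, n))  # k = number of years governed by theta
    k = n // 2
    draws = []
    for _ in range(iters):
        # Rates given k: conjugate Gamma updates on each segment.
        theta = rng.gammavariate(A + prefix[k], 1.0 / (B + k))
        lam = rng.gammavariate(A + total - prefix[k], 1.0 / (B + n - k))
        # k given rates: Poisson log-likelihood for every candidate split.
        logw = [prefix[j] * math.log(theta) - j * theta
                + (total - prefix[j]) * math.log(lam) - (n - j) * lam
                for j in ks]
        top = max(logw)  # subtract the max before exponentiating
        weights = [math.exp(w - top) for w in logw]
        k = rng.choices(ks, weights=weights)[0]
        draws.append((theta, lam, k))
    return draws

draws = gibbs(counts, 3000, random.Random(0))
kept = draws[500:]  # discard early draws as burn-in
mode_k = Counter(k for _, _, k in kept).most_common(1)[0][0]
print(mode_k)
```

<p>Tallying the sampled values of <code>k</code> gives a bar chart in the spirit of the changepoint posterior shown above.</p>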
]]></content:encoded>
			<wfw:commentRss>http://loloflargenumbers.com/blog/?feed=rss2&#038;p=165</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Hadoop, there it is.</title>
		<link>http://loloflargenumbers.com/blog/?p=157</link>
		<comments>http://loloflargenumbers.com/blog/?p=157#comments</comments>
		<pubDate>Mon, 27 Feb 2012 22:18:14 +0000</pubDate>
		<dc:creator>Kristopher Kapphahn</dc:creator>
				<category><![CDATA[hadoop]]></category>
		<category><![CDATA[terabytes]]></category>

		<guid isPermaLink="false">http://loloflargenumbers.com/blog/?p=157</guid>
		<description><![CDATA[I bought a Hadoop book a few days ago. It isn&#8217;t like I have the time or data to implement Hadoop anywhere right now or anything. It&#8217;s really just something for me to geek out on while I&#8217;m avoiding doing the things that I should be doing. What is Hadoop? Hadoop is a software framework [...]]]></description>
				<content:encoded><![CDATA[<p>I bought a <a href="http://shop.oreilly.com/product/0636920021773.do" title="Hadoop: The Definitive Guide" target="_blank">Hadoop</a> book a few days ago. It isn&#8217;t like I have the time or data to implement Hadoop anywhere right now or anything. It&#8217;s really just something for me to geek out on while I&#8217;m avoiding doing the things that I should be doing.</p>
<p>What is Hadoop? Hadoop is a software framework for crunching very large data sets using lots of computers. Hadoop is what Twitter uses to pick out what&#8217;s trending from the multiple terabytes of data it produces daily.</p>
<p>For a necessarily oversimplified, non-Hadoop example of how the Hadoop process works and why it&#8217;s useful, think of a computer with one processor chugging away at a relatively simple problem. Think of it as the algorithmic equivalent of the problem of trying to count the number of people in a small, mixed-density municipality. If the number of people was small enough, you&#8217;d probably be able to get away with sending one person door-to-door to get head counts from all the households in the city limits. Your counter could add up the people as she went or she could wait until everyone had been counted and add the results then. If your counter doesn&#8217;t waste any time chatting with people, so that the time spent counting per household is small, then you&#8217;d likely find that most of your counter&#8217;s time was spent traveling between houses. </p>
<p>Even though most of the counter&#8217;s time won&#8217;t be spent counting at all, this solo-counter process would likely be an acceptable way to count small populations. What if the population you wanted to count was much larger, say on the order of hundreds of millions of households? With populations this large, it wouldn&#8217;t matter how quick your counter is, she&#8217;d still have to walk between hundreds of millions of households. Even if your counter could somehow count the number of people in a house instantaneously just by looking at the front door, the time she spent walking would likely be unacceptably long. Here, the time spent summing the data is dominated by the time spent getting data to the person doing the summing.</p>
<p>For larger populations, a faster method might utilize many counters at the same time. Each one would have their own area to cover, and each area could be chosen to take approximately the same amount of time. When counters finished counting their assigned area, they would move on to another area and continue counting. Once the counting was done, determining the population size would be a relatively simple matter of adding up the household totals from each counter.</p>
<p>In Hadoop, this idea is implemented algorithmically using several computers via a process called MapReduce. During the mapping portion, each computer (counter) produces keys (corresponding to areas or households) &#8220;mapped&#8221; to values (the number of people counted in an area, or a household depending on how the key is defined). All these key-value pairs are then &#8220;reduced&#8221; via a specified function (summation here) to a count of total population.</p>
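<p>Here is the same idea as a toy Python sketch. This is plain Python running on one machine, not the Hadoop API, and the household records are invented; the point is only to show the map, group and reduce steps in order.</p>

```python
from collections import defaultdict
from functools import reduce

# Hypothetical head counts: (area, household, people in household)
households = [
    ("north", "h1", 3), ("north", "h2", 2), ("south", "h3", 4),
    ("south", "h4", 1), ("east", "h5", 5), ("north", "h6", 2),
]

def map_phase(chunk):
    """One counter turns her records into (key, value) pairs."""
    return [(area, people) for area, _house, people in chunk]

def shuffle(pairs):
    """Group values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Collapse each key's values with the reduce function (a sum here)."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in grouped.items()}

# Two counters each take half of the households.
chunks = [households[:3], households[3:]]
pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
area_totals = reduce_phase(shuffle(pairs))
population = sum(area_totals.values())
print(area_totals, population)
```

<p>In a real cluster the chunks would live on different machines and the grouping step would move data between them, which is where a lot of the engineering effort goes.</p>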
<p>It makes intuitive sense. Having many people count a lot of things can be quicker than having one person count a lot of things, but more isn&#8217;t always better. For instance, imagine counting the small municipality with a large number of counters. If each counter was assigned 10 houses and you had 100 households you&#8217;d need ten counters. You&#8217;d have to come up with procedures to assign counters to households while ensuring that households weren&#8217;t counted multiple times. Each counter would require a separate clipboard, vest, map, etc. There&#8217;d be 10 sets of tax forms to fill out and ten mileage reimbursement requests. If too many counters were added, the end process might take less time, but at a significant cost in efficiency and cash. If you weren&#8217;t careful, you might inadvertently use so many counters that you spend more time and resources allocating counters and waiting for them to finish than you would have spent waiting for one counter to do her thing (imagine assigning 90 people to count 100 households). Clearly, there is some data size threshold that determines when distributed computation schemes like Hadoop are a better alternative to single computer solutions.</p>
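<p>A back-of-the-envelope way to see that tradeoff is a toy cost model in which the counting work is split evenly across counters while each extra counter adds a fixed chunk of coordination time. The Python sketch below assumes exactly that, and the minutes are invented; real clusters are messier.</p>

```python
def total_minutes(counters, households=100, per_house=10.0, overhead=25.0):
    """Toy model: shared counting time plus per-counter coordination time."""
    counting = households * per_house / counters
    coordinating = overhead * counters
    return counting + coordinating

def best_counter_count(max_counters=100):
    """Number of counters that minimizes the toy model's total time."""
    return min(range(1, max_counters + 1), key=total_minutes)

print(total_minutes(1), total_minutes(10), total_minutes(90))
print(best_counter_count())
```

<p>With these made-up numbers a handful of counters beats a solo counter, while ninety counters for a hundred households is slower than sending one person, which is the overshoot described above.</p>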
<p>I haven&#8217;t yet had the need to perform any Hadoop-scale analysis. Perhaps when I have some free time, I&#8217;ll get an account at Amazon Web Services and hunt down some rogue terabytes for analysis. Or maybe I&#8217;ll get lucky and land a job where analysis on the terabyte scale is routine. A guy can dream, right?</p>
]]></content:encoded>
			<wfw:commentRss>http://loloflargenumbers.com/blog/?feed=rss2&#038;p=157</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Free time</title>
		<link>http://loloflargenumbers.com/blog/?p=153</link>
		<comments>http://loloflargenumbers.com/blog/?p=153#comments</comments>
		<pubDate>Tue, 21 Feb 2012 06:58:03 +0000</pubDate>
		<dc:creator>Kristopher Kapphahn</dc:creator>
				<category><![CDATA[business]]></category>
		<category><![CDATA[que hace tiempo]]></category>

		<guid isPermaLink="false">http://loloflargenumbers.com/blog/?p=153</guid>
		<description><![CDATA[I don&#8217;t have a lot of free time. Between the family, the school, the assistantships and the internship, my calendar is packed. Being in grad school means making a commitment to always having something else you should be doing. For every task I&#8217;m assigned there is an unknown, finite amount of time I have to [...]]]></description>
				<content:encoded><![CDATA[<p>I don&#8217;t have a lot of free time. Between the family, the school, the assistantships and the internship, my calendar is packed. Being in grad school means making a commitment to always having something else you <em>should</em> be doing. For every task I&#8217;m assigned there is an unknown, finite amount of time I have to spare. That Bayes assignment from two weeks ago? I&#8217;ve got maybe six hours for it. I&#8217;m glad it only took two or three. The Clinical Trials assignment that&#8217;s due on Thursday? I can probably make eight hours for that, but they have to be spread out in two hour chunks and I hope it will only take three or four. My family? I&#8217;ve got 7-9 hours per weekday for them. Grading papers? Two hours. </p>
<p>I&#8217;m generally good at doing the things I should be doing. Frequently, I can juggle tasks well enough to find time to do things that I don&#8217;t have to do. This process usually involves a complex, high-level form of procrastination and a firm commitment to fuller living through sleep deprivation. Sleep is one of the few responsibilities I can neglect without disappointing myself or people who rely on me, so choosing to sleep less is usually an easy choice. This is clearly less than ideal, but that&#8217;s how it is. There are things I want more than the feeling of being well rested.</p>
<p>Lately, in between time spent wandering virtual countrysides saving virtual people from virtual world-devouring dragon-gods, I&#8217;ve been spending my free time trying to fill gaps in my statistical computing abilities. I just finished <a href="http://shop.oreilly.com/product/9781593273842.do" title="The Art of R Programming" target="_blank">The Art of R Programming</a>. Not a page turner, but interesting nonetheless. I like R and I&#8217;m competent enough to be able to figure out how to use it to do all the things I need it to do. However, most of my R skills have been forged via trial and error over the glowing coals of various blogs, mailing lists, and <a href="http://cran.r-project.org/" title="CRAN" target="_blank">CRAN</a> pages. I picked up the book because I felt it would be beneficial to build a more solid foundation under my slightly-out-of-code shack of R knowledge. &#8220;The Art of R Programming&#8221; fit the bill quite nicely. I highly recommend it to anyone looking for a fairly comprehensive, broad view of R as a programming language.</p>
<p>Eventually, I will have to stop ignoring the fact that I need to write a rather large paper as part of earning a master&#8217;s degree. I&#8217;m excited about the project on which the paper will be based. I think it will be fairly engrossing and I&#8217;m putting it off because I&#8217;m not ready for it to sweep me away yet. The basic idea is this: you have data on a bunch of people from a loosely connected social network (think students at a school). You have demographic info, BMI, daily exercise levels, etc. These people were also asked to name 5 of their friends and you have this data too. You want to use this data to estimate the effects on the probability that a person is obese from the level of that person&#8217;s association with other obese people. Models exist for these types of relationships. However, problems can arise when you have people whose data are missing. For my thesis, I&#8217;m going to be implementing a couple different methods for imputing missing social network data for the purposes of estimating network effects. I think it will be a nice combination of coding, simulation and head scratching. I&#8217;m still at the idea stage. I think once the ideas bouncing around my head reach a critical mass, the real work will begin.</p>
]]></content:encoded>
			<wfw:commentRss>http://loloflargenumbers.com/blog/?feed=rss2&#038;p=153</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Personal Data</title>
		<link>http://loloflargenumbers.com/blog/?p=107</link>
		<comments>http://loloflargenumbers.com/blog/?p=107#comments</comments>
		<pubDate>Tue, 07 Feb 2012 20:20:45 +0000</pubDate>
		<dc:creator>Kristopher Kapphahn</dc:creator>
				<category><![CDATA[feelings!]]></category>
		<category><![CDATA[measurement]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://loloflargenumbers.com/blog/?p=107</guid>
		<description><![CDATA[As a field of study, Statistics is essentially the study of how to make sense of data. It is full of formal, quantitative methods for dealing with uncertainty. There are fundamental questions that a successful analysis should touch upon. What do we know (the data)? What do we expect (our assumptions)? What should we expect [...]]]></description>
				<content:encoded><![CDATA[<p>As a field of study, Statistics is essentially the study of how to make sense of data. It is full of formal, quantitative methods for dealing with uncertainty. There are fundamental questions that a successful analysis should touch upon. What do we know (the data)? What do we expect (our assumptions)? What should we expect given what we know (what does the data look like)? How does what we know affect the validity of the assumptions about what we expect (do the data seem excessively implausible given our assumptions)? The goal typically takes the form of setting up a fall guy (or gal) named Null Hypothesis and checking to see whether you can knock him (or her) down with your data. Traditionally, the null hypothesis is the enemy. If you are a drug company, the null hypothesis is that your drug is worthless (compared to the current standard of treatment).</p>
<p>On a structurally similar but less formal level, we&#8217;re all statisticians. We all compute our own personal statistics on a daily basis, using incredibly biased data derived from our muddled thoughts and our 5 senses. For instance, when I look myself over before I leave the house in the morning, I implicitly create an average rating of my appearance based on any number of specific observable data points. How&#8217;s my hair (when I have hair)? Are my teeth still crooked? Is my forehead doing that flaky, dry skin thing? Is my shirt suitably ironed? Does this stomach fat make it look like I have stomach fat? </p>
<p>All these things are weighted by importance and aggregated to create an overall &#8220;this is how I look&#8221; score. I then take this score and compare it to a null hypothesis about my appearance. My personal appearance null is usually &#8220;I look good&#8221; and I&#8217;m perfectly happy not knocking it down at all. Choosing a null for your personal appearance is a complicated process that incorporates your current state of mind and a whole slew of conscious and unconscious thoughts based on a lifetime&#8217;s worth of accumulated experience. </p>
<p>So my perception of my appearance is my data. I use it to create an overall &#8220;this is how I look&#8221; score, which is then compared to what I&#8217;d expect the score to be if I looked okay (my null). This process is very sensitive to outliers. So if I looked like <a href="http://biostatisticsryangosling.tumblr.com/" title="Biostatistics Ryan Gosling" target="_blank">this guy</a>, except that I had a giant, bulbous blister on my forehead, that blister would likely dominate my &#8220;this is how I look&#8221; score despite being a small, atypical portion of an otherwise magnificent package and I&#8217;d probably reject the null hypothesis that I look good. </p>
<p>The results of my personal appearance test have ramifications. Whether I decide that I look okay or not will partially determine how I interact with the world around me. It will inform my interactions with other people. During these interactions, I will be subconsciously gathering and aggregating more data and performing more informal statistical tests: Is this person listening to me? Do they understand what I&#8217;m talking about, or am I completely incoherent? Was that joke I told actually funny or was that laugh just a polite acknowledgment of attempted comedy? For each of these questions, I&#8217;d compare the person&#8217;s behavior (the data) with how I&#8217;d expect them to behave (the null) and draw conclusions accordingly.</p>
<p>If I feel especially dashing, then perhaps I&#8217;m slightly more apt to feel confident in my interactions with other people and this confidence can result in more positive ad hoc estimates of how well these interactions are going. These ad hoc estimates are data points that go into a running calculation of how well my day is going.</p>
<p>At the end of the day, I have all these internal data points describing the quality of my day. These too are then aggregated into a relative rating describing the overall quality of my day. If the average quality of my day seems especially good or bad (compared to my expectations), I might spend some time going through the data trying to figure out whether there was a specific event that made my day especially good or bad. Or, I&#8217;ll probably be asleep within moments of lying down.</p>
<p>I bet you do this too. </p>
<p>A lot of the time, this process is completely subconscious. One interesting thing you can do is try and figure out what your null hypotheses are and how they got to be that way. It can be enlightening to see how much of an effect things that happened ages ago can have in providing context for the decisions you make now.</p>
]]></content:encoded>
			<wfw:commentRss>http://loloflargenumbers.com/blog/?feed=rss2&#038;p=107</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bruce Lee</title>
		<link>http://loloflargenumbers.com/blog/?p=134</link>
		<comments>http://loloflargenumbers.com/blog/?p=134#comments</comments>
		<pubDate>Tue, 07 Feb 2012 07:26:55 +0000</pubDate>
		<dc:creator>Kristopher Kapphahn</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://loloflargenumbers.com/blog/?p=134</guid>
		<description><![CDATA[I must be getting old. I used to be able to drink coffee at any hour of the day and still get to sleep whenever I needed to. 5 years ago, I could have a cup at 9 pm and be asleep by 11 pm. Not any more. Too bad, because I like coffee like [...]]]></description>
				<content:encoded><![CDATA[<p>I must be getting old. I used to be able to drink coffee at any hour of the day and still get to sleep whenever I needed to. 5 years ago, I could have a cup at 9 pm and be asleep by 11 pm. Not any more. Too bad, because I like coffee like I used to like cigarettes. </p>
<p>Because I like coffee like I used to like cigarettes, 11 pm this evening found me lying in bed, wide awake, with way too much of my conscious mind focused on the unbearable noiselessness of it all. And who should happen to whirl through the vast, resonance-prone soundstage that was my coffee-addled brain, but one Bruce Lee. And if you&#8217;ve ever seen (heard) Bruce in action, you probably correctly guessed that he wasn&#8217;t quiet about it either.</p>
<p>So, in the interests of wearing out the jittery hamster that is my conscious mind by running it through the wheel in the corner of my brain, here are a few things that I remember about Bruce Lee from when I was a youngster that may or may not actually be true.</p>
<p>Bruce Lee was ahead of his time. Way back in the 60s he was already at least intuitively aware of ideas which as recently as last year were the subject of a <a href="http://www.nytimes.com/2011/11/27/books/review/thinking-fast-and-slow-by-daniel-kahneman-book-review.html?pagewanted=all" title="Thinking Fast and Slow" target="_blank">best-selling book</a>. Lee understood that when it comes to winning a fist fight, intuition plays a key role. It matters less whether one has fancy belts than it does that one can step aside at the appropriate time or land a solid punch on time and in the right place like a pallet in a UPS commercial (I don&#8217;t know that Lee ever yelled &#8220;LOGISTICS&#8221; as he landed a punch, but given his awesomeness, I&#8217;d bet that the idea of doing so occurred to him at least once). To win a fight you need to operate with the split-second timing of the finely honed intuition that is the &#8220;fast thinking&#8221; alluded to in the title of Kahneman&#8217;s book. </p>
<p>I haven&#8217;t made it all the way through Thinking, Fast and Slow. What with all the glitz and glamor of graduate school and familihood, who has the time? I have read the introduction for the book AND I think I totally heard an interview with Kahneman on MPR once, which is pretty much exactly the amount of familiarity required to hold a water cooler conversation about the subject. Here goes the water cooler-level summary. Kahneman bifurcates thought itself into two types. Fast thought is generally our default mode. It&#8217;s like an autopilot system with enough awareness built in to know when it&#8217;s in over its head. Slow thought is what happens when fast thought bows out. Slow thought tells you why you really need to quit smoking sometime after fast thought led you outside, lit your cigarette and relished the warm, glorious feeling of chemically-sated addiction.</p>
<p>Slow thought will also lose you a fight. Here&#8217;s how: if you are about to fight someone and you aren&#8217;t comfortable fighting, the first thought that will occur to your fast thought cogitation system is &#8220;Uh oh. Can I run here, because, dude, that guy looks pi-issed?&#8221; and if running isn&#8217;t an option, its next thought is &#8220;This situation is tense and unfamiliar. I&#8217;m not quite sure what to do here. Uh, I better think about this more.&#8221; Then slow thought takes over and you get punched. Because when you have to think about how to respond to a fist flying at your head, you&#8217;ve pretty much already gotten punched.</p>
<p>Lee recognized that this dynamic existed even for highly trained martial artists. This is because martial arts training in Bruce Lee&#8217;s time rarely effectively simulated the street fighting experience. Lee&#8217;s idea: I should expose myself to as many different fighting styles as possible, figure out the circumstances under which they are effective and then use this knowledge to gain such a comprehensive, visceral understanding of how to fight well that no matter the circumstances, I will be able to rely solely on my fast-acting intuition rather than my slower, more deliberative mode of thought. </p>
<p>So he pursued that idea, and yea, it was good. Because Bruce Lee had a mind like a hungry octopus angling for fishies made out of pure fact. He was driven by his own curiosity and enjoyed connecting disparate ideas in innovative ways. He wanted to figure out how to kick anyone&#8217;s ass without even having to think about kicking their ass while he was kicking their ass. Hence, Jeet Kune Do, which roughly translates as &#8220;Hey, how about instead of rigidly clinging to one way of fighting, you learn about and capitalize on the strengths of a variety of styles&#8221;. Its brilliance is in its lack of specificity and its adaptability. Lee saw it as more of a philosophy than a fighting style. </p>
<p>I had the pleasure of taking 6 months of Jeet Kune Do at a certain local martial arts academy. My stint ended when I broke a foot doing something completely unrelated to training. Then, after my foot healed, I popped a rib doing something completely related to training. Then I fell out of the habit of going to class and eventually cancelled my membership. It was fun while it lasted and if I ever have the time, I&#8217;d love to go back.</p>
<p>Final fact before bedtime: Bruce Lee also starred in Fist of Fury, which was later remade with Jet Li as Fist of Legend. Both of these movies are better than Enter the Dragon.</p>
]]></content:encoded>
			<wfw:commentRss>http://loloflargenumbers.com/blog/?feed=rss2&#038;p=134</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Lies, Damn Lies and Statistics: A Primer</title>
		<link>http://loloflargenumbers.com/blog/?p=92</link>
		<comments>http://loloflargenumbers.com/blog/?p=92#comments</comments>
		<pubDate>Sun, 29 Jan 2012 17:00:16 +0000</pubDate>
		<dc:creator>Kristopher Kapphahn</dc:creator>
				<category><![CDATA[lies]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://loloflargenumbers.com/blog/?p=92</guid>
		<description><![CDATA[There is a classic quote that goes something like this: &#8220;There are three types of lies: lies, damn lies, and statistics.&#8221; This quote is often attributed to Mark Twain, even though he attributed it to Benjamin Disraeli, who may or may not have actually said it. Perhaps there should be a fourth type of lie [...]]]></description>
				<content:encoded><![CDATA[<p>There is a classic quote that goes something like this: &#8220;There are three types of lies: lies, damn lies, and statistics.&#8221; This quote is often attributed to Mark Twain, even though he attributed it to Benjamin Disraeli, who may or may not have actually said it. Perhaps there should be a fourth type of lie concerning misattributed quotations? Regardless of where it came from, it&#8217;s more a prescription for dealing with conceptually difficult information than a statement about statistics. In fact, if you dig deep enough into what this quote seems to actually mean, you&#8217;ll see that it&#8217;s a statistical statement at its core.</p>
<p>First, what is a statistic? A statistic is some sort of quantitative summary of a set of data. Think average. Think standard deviation. Think kurtosis. Then forget about kurtosis. It&#8217;ll probably never matter to you. Summaries of data are useful because if you can reasonably make certain assumptions about the data, a few summary statistics are all you need to convert that data into something meaningful. If IQs are normally distributed, then indeed, half of the population is of below average intelligence. Seems simple enough. Problems arise, however, when these statistics are misinterpreted. For instance, say you happen to be studying a population where intelligence is exponentially distributed (use your imagination). Then half the population&#8217;s IQ is below about 0.69 x the population average, and roughly 63% of the population is below average. Statistics are properties of a sample &#8211; they are group-scale phenomena. It is entirely possible for no member of a population to be average; hence, the average family has something like 2.3 children. Average families can have fractional children. With real families, the convention is to count children with whole numbers.</p>
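<p>A quick simulation is one way to sanity-check claims like these. The sketch below (in Python, with an arbitrary seed, an arbitrary sample size and an assumed population average of 100) draws an exponentially distributed &#8220;IQ&#8221; sample and compares its median to its mean. For an exponential distribution the median works out to ln 2, or about 0.69, times the mean, while about 63% of the population lands below the mean.</p>

```python
import math
import random
import statistics

rng = random.Random(42)   # arbitrary seed so the run is repeatable
mean_iq = 100.0           # assumed population average, purely illustrative

# expovariate takes a rate, which is 1 / mean for an exponential distribution.
sample = [rng.expovariate(1.0 / mean_iq) for _ in range(100_000)]

sample_mean = statistics.fmean(sample)
sample_median = statistics.median(sample)

# Ratio of median to mean; theory says ln(2) for an exponential distribution.
ratio = sample_median / sample_mean

# Share of the sample that falls below the sample mean; theory says 1 - 1/e.
share_below_mean = sum(1 for x in sample if sample_mean > x) / len(sample)

print(f"median / mean    = {ratio:.3f} (theory: {math.log(2):.3f})")
print(f"share below mean = {share_below_mean:.3f} (theory: {1 - math.exp(-1):.3f})")
```

<p>Run the same check on a Normal sample and both numbers should come out close to 1 and 0.5 respectively, which is why &#8220;half the population is below average&#8221; only holds for symmetric distributions.</p>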
<p>Summary statistics by themselves sometimes aren&#8217;t all that interesting. If one is trying to determine whether a real difference exists between two different measurable phenomena, summary statistics will likely only get you halfway there. Imagine trying to compare stats for Adrian Peterson and Walter Payton to determine who is the better running back. You might compute their career gameday rushing yards and compare. You could compare Peterson&#8217;s stats from his first 50 games, with Payton&#8217;s stats from his first 50 games. What else might you need to account for to make a comparison of these numbers appropriate? Well, it&#8217;s hard to run well behind a horrible offensive line, so one could attempt to take offensive line quality into account. Offensive philosophy is also important. It might be difficult to get rushing yards if your playcaller leans heavily towards passing plays despite your running back&#8217;s brilliance. Here it can get tricky. What&#8217;s the best way to quantify O-line quality? What about offensive pass-heaviness? I&#8217;m not saying it can&#8217;t be done, just that a proper analysis of the data could get fairly complex fairly quickly. If you wanted to go the easier route, you could just assume a priori that none of those other things matter. </p>
<p>In order to make sense of the next part, an explanation of terminology may come in handy. A population is a group of things. A sample is a portion of a population. A probability is the ratio of the number of ways a particular event can happen to the number of ways all possible relevant events could happen (when every one of those ways is equally likely). The probability of flipping a head on a fair coin is the number of heads on a coin divided by the number of sides on the coin (for simplicity&#8217;s sake, you&#8217;d probably assume that the outer rim of the coin can be neglected). A probability distribution is an expression of how likely various values are to show up in a sample from a population. In other words, it provides a map from possible values to the probability associated with those values. Take a six-sided die. Assuming fairness, each side has a 1/6 probability of coming up per roll. This can be represented as a distribution. The set of possible outcomes in a distribution is called its support. In the case of a die, the support consists of the numbers 1, 2, 3, 4, 5, 6. Each one of these values is mapped to a corresponding probability; here all of the probabilities are 1/6. A loaded die would have the same support but different probabilities. Loaded or not, all of these probabilities need to add up to one.</p>
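<p>The die example translates almost directly into code. Here is a minimal Python sketch that stores a distribution as a map from each value in the support to its probability. The loaded die&#8217;s probabilities are made up for illustration, and the helper names (<code>support</code>, <code>is_valid</code>, <code>expected_value</code>) are hypothetical.</p>

```python
from fractions import Fraction

# A fair die: every face in the support maps to probability 1/6.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

# A hypothetical loaded die: same support, different probabilities.
loaded_die = {
    1: Fraction(1, 10), 2: Fraction(1, 10), 3: Fraction(1, 10),
    4: Fraction(1, 10), 5: Fraction(1, 10), 6: Fraction(1, 2),
}

def support(dist):
    """The set of possible outcomes, returned in sorted order."""
    return sorted(dist)

def is_valid(dist):
    """Probabilities must be non-negative and add up to exactly one."""
    return all(p >= 0 for p in dist.values()) and sum(dist.values()) == 1

def expected_value(dist):
    """Long-run average roll: each face weighted by its probability."""
    return sum(face * p for face, p in dist.items())

for name, die in (("fair", fair_die), ("loaded", loaded_die)):
    print(name, support(die), is_valid(die), expected_value(die))
```

<p>Exact fractions are used instead of floats so that the &#8220;adds up to one&#8221; check is not thrown off by rounding.</p>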
<p>Back to football. It turns out that for the first 50 games of their careers, Peterson and Payton have pretty similar average yards per game (AP &#8211; 99.28 ypg, WP &#8211; 94.08 ypg). Is the difference between these two numbers, given the spread of the data, large enough to be due to an actual difference in ability? Statistics provides a means of answering this question, provided you&#8217;re willing to make an assumption. The next assumption, which is where capital S Statistics really comes in, appears when you assume some sort of underlying distribution for your data. The nature of the assumption is this: each player has one true mean yards per game and any weekly divergence between their actual yardage and this theoretical yardage is due solely to chance (a Bayesian would assume that the true mean yards per game was itself a random variable). This assumption <em>should</em> be based on sound reasoning and a thorough inspection of the data. If the data looks exotic enough, you can even assume that you don&#8217;t know anything about the underlying distribution. This is called the Rumsfeld distribution, and it&#8217;s known for being known to be unknown (not really, see <a href="http://www.statsoft.com/textbook/nonparametric-statistics/" title="Nonparametric Statistics" target="_blank">nonparametric statistics</a>). With regard to running backs, you might assume that for each game, each player&#8217;s total rushing yards comes from a Normal distribution centered around some average value. This would give you information about how large you&#8217;d expect differences in their 50-game yards per game average to be. Then, to determine the better player you&#8217;d compare your data using some formal statistical test based on this assumption.</p>
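<p>To get a feel for how large a gap in 50-game averages chance alone can produce, here is a small Python simulation. Everything in it is an assumption made for illustration: two imaginary running backs share the same true mean (97 yards per game) and the same game-to-game standard deviation (55 yards), and each game is an independent draw from a Normal distribution. The seed and the number of simulated careers are arbitrary.</p>

```python
import random
import statistics

def simulated_gap(rng, true_mean, sd, games):
    """Gap in average yards per game between two identical imaginary backs."""
    back_a = [rng.gauss(true_mean, sd) for _ in range(games)]
    back_b = [rng.gauss(true_mean, sd) for _ in range(games)]
    return statistics.fmean(back_a) - statistics.fmean(back_b)

rng = random.Random(2012)        # arbitrary seed for repeatability
observed_gap = 99.28 - 94.08     # AP minus WP, from the averages quoted above

# Assumed values: true_mean=97 and sd=55 are illustrative, not estimates.
gaps = [simulated_gap(rng, true_mean=97.0, sd=55.0, games=50) for _ in range(2000)]

# How often does pure chance produce a gap at least as big as the observed one?
share_at_least_as_big = sum(1 for g in gaps if abs(g) >= observed_gap) / len(gaps)

print(f"Spread of chance-only gaps (sd): {statistics.stdev(gaps):.1f} yards per game")
print(f"Share of gaps of {observed_gap:.2f} yards or more: {share_at_least_as_big:.2f}")
```

<p>If a large share of the simulated gaps are at least as big as the observed one, then a difference of about five yards per game is not much evidence that one back is better than the other.</p>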
<p>So, let&#8217;s try it out and see where it takes us. Data is available here: <a href="http://www.pro-football-reference.com/players/P/PeteAd01/gamelog//" title="Adrian Peterson Stats" target="_blank">All-Day</a> vs. <a href="http://www.pro-football-reference.com/players/P/PaytWa00/gamelog//" title="Walter Payton Stats" target="_blank">Sweetness</a>. As an aside, I must say, Payton wins on the nickname front. </p>
<p>Note the following plot. It shows a normalized histogram for each player overlaid with corresponding smoothed density curves. These plots show two different ways of representing the distributions of these data. Either way, the data line up fairly closely, and indeed, a t-test suggests that there isn&#8217;t a statistically significant difference between their performances.</p>
<p><img src="http://loloflargenumbers.com/blog/wp-content/uploads/2012/01/yards.png" alt="Yard distributions" /></p>
<p>The t-test assumes that the underlying observations are from a Normal distribution. Do the data bear this assumption out? The next two plots show comparisons between each player&#8217;s set of data and a simulated Normal distribution based on the mean and standard deviation of each player&#8217;s data. The simulated Normal curves are in black. Both sets are approximately symmetric and each seems to have a maximum value in the neighborhood of its Normal companion. Both have similar means (AP &#8211; 99.28, WP &#8211; 94.08) and standard deviations (53.36, 56.39). AP&#8217;s yards are a little tighter around their mean than the Normal and both players have a little bump corresponding to their single-game rushing yard records. For the purposes of this analysis, normality seems a reasonable, if not entirely correct, assumption to me.</p>
<p><img src="http://loloflargenumbers.com/blog/wp-content/uploads/2012/01/adnormal.png" alt="AP's distribution" /></p>
<p><img src="http://loloflargenumbers.com/blog/wp-content/uploads/2012/01/wpnormal.png" alt="WP's distributions" /></p>
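<p>The t-test can be roughly reconstructed from the summary numbers alone. The Python sketch below assumes 50 games per player, treats the quoted standard deviations as sample standard deviations, and uses the unequal-variance (Welch) form of the statistic. With equal sample sizes this gives the same t value as the pooled version. The p-value uses a Normal approximation to the t distribution, which is only approximate at roughly 98 degrees of freedom. This is not necessarily the exact test that was run on the game logs.</p>

```python
import math
from statistics import NormalDist

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sample t statistic computed from summary statistics."""
    standard_error = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    return (mean_a - mean_b) / standard_error

# Means and standard deviations as quoted in the post; n=50 games each.
t_stat = welch_t(99.28, 53.36, 50, 94.08, 56.39, 50)

# Two-sided p-value via a Normal approximation (an assumption, not exact).
approx_p = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(f"t statistic: {t_stat:.3f}")
print(f"approximate two-sided p-value: {approx_p:.2f}")
```

<p>A t statistic well under 2 in absolute value is the usual rule-of-thumb signal that the gap between the two averages is comfortably within what chance would produce.</p>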
<p>So there you have it. Statistical evidence that Adrian Peterson and Walter Payton are indistinguishably talented football players. Interchangeable. Except that&#8217;s not really what the statistics say at all. Let&#8217;s list some of the things that went into this analysis. </p>
<p>1) We assumed that our subset of data was representative of our real quantity of interest.<br />
2) We assumed that our subset of data was drawn from a particular probability distribution.<br />
3) We compared the data in the context of what we&#8217;d expect it to look like if it were actually from our assumed probability distribution and found no reason to believe that either set came from a different distribution than the one we assumed.</p>
<p>Sports statistics aren&#8217;t my strong suit, but I&#8217;m pretty sure there&#8217;s more to this &#8220;who&#8217;s the better running back?&#8221; question than yards per game. What about yards per carry? Or yards conditional on offensive philosophy and offensive line quality? If we were talking Tim Tebow we&#8217;d have to include God effects in our model somehow. Yards per game might on its own be sufficient to characterize running back talent, but we don&#8217;t know that. This analysis didn&#8217;t even try to take other factors into account. So assumption 1) is probably wrong. This is huge. In an effort to answer one question, we actually answered a different, possibly irrelevant question. Assumption 2) seems close enough to true, but at this point, does that even matter?</p>
<p>Statistical information is usually presented in a results-first format. This makes sense, since the results are typically the goal. The problem is that the results rarely tell the whole story and seemingly rarer still do the articles surrounding those results. So you end up with headlines that blare <a href="http://www.mirror.co.uk/news/top-stories/2012/01/23/schoolkids-can-t-add-up-or-spell-survey-finds-115875-23712794/" title="Tripe" target="_blank">alarmist claims</a> with little context and articles that present data with nary a critical question. For instance, what were the demographics of the sampled population in the linked article? Are these kids performing worse than previous generations? How frequently do adults misuse &#8220;there&#8221;, &#8220;their&#8221; and &#8220;they&#8217;re&#8221;? Is it possible that the apparently high proportion of parents who think their children are less advanced than previous generations is more due to parental self-doubt and recall bias than anything real? Who is OnePoll? Do they have an agenda? </p>
<p>Don&#8217;t get me wrong. If the UK is anything like the US, then they have serious, complicated systemic problems with their education system including widespread racial, cultural and geographic disparities with the net result being a lot of kids who don&#8217;t necessarily know the things they ought to. However, these results as presented give a less than complete picture of what is actually going on.</p>
<p>With that being said, I don&#8217;t doubt the statistics. I suspect that there&#8217;s a very remote chance that they (the numbers, not the interpretation) aren&#8217;t being presented accurately here. Even in that event, the sin is the responsibility of a person and not a data set. Statistics don&#8217;t lie. Ever. People use statistics to lie. They use statistics to lie because the average person doesn&#8217;t know enough about statistics to see the holes in the types of flawed appeals to statistics that frequently get trotted out in support of flawed ideas.</p>
<p>The Twain/Disraeli/unknown quote is a warning, imploring people to automatically doubt the integrity of anyone whose argument relies heavily on numbers. I can&#8217;t really argue with the sentiment. A successful lie frequently depends on the liar&#8217;s ability to control information and how that information is perceived. Take this fact and pair it with the very human inability to correctly parse statistical information and you&#8217;ve got a viable path to skulduggery. Gallant presents his data in context with all the appropriate caveats. Goofus deprives his audience of its ability to interpret that data by presenting only the most favorable portions of it and neglects to mention any flaws that may have occurred with data collection and analysis.</p>
]]></content:encoded>
			<wfw:commentRss>http://loloflargenumbers.com/blog/?feed=rss2&#038;p=92</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Like wildfire</title>
		<link>http://loloflargenumbers.com/blog/?p=74</link>
		<comments>http://loloflargenumbers.com/blog/?p=74#comments</comments>
		<pubDate>Thu, 26 Jan 2012 15:30:25 +0000</pubDate>
		<dc:creator>Kristopher Kapphahn</dc:creator>
				<category><![CDATA[epidemic]]></category>

		<guid isPermaLink="false">http://loloflargenumbers.com/blog/?p=74</guid>
		<description><![CDATA[When I first started to feel comfortable with R, I decided to try to write an epidemic simulator. There were at least two reasons I wanted to do this. First, epidemics are interesting. Epidemics are just one of the many ways in which luck expresses its complete apathy towards all the divisions we create between [...]]]></description>
				<content:encoded><![CDATA[<p>When I first started to feel comfortable with R, I decided to try to write an epidemic simulator. There were at least two reasons I wanted to do this. First, epidemics are <a href="http://www.theghostmap.com/" title="The Ghost Map" target="_blank">interesting</a>. Epidemics are just one of the many ways in which luck expresses its complete apathy towards all the divisions we create between each other. Cholera doesn&#8217;t care who you are. If you drink well water that&#8217;s been contaminated by cholera-infested human waste, and you happen to be in love (or in London) in the time of cholera, you&#8217;ll probably be dead in a few days. </p>
<p>Roughly speaking, an epidemic is an uncharacteristic or unexpected increase in the prevalence of a specific disease over a short span of time, short being relative. If the time span is too long, your epidemic is really just an adjustment of the status quo. That&#8217;s why you don&#8217;t hear people lamenting the current epidemic of car accidents. The car accident epidemic that started right around the same time automobiles came into widespread use has lasted so long that it&#8217;s no longer novel. </p>
<p>The concept of an epidemic gets even more interesting when you remove disease from it. For instance, take the definition given above and replace the word &#8220;disease&#8221; with &#8220;idea&#8221; or &#8220;fashion accessory&#8221; (I&#8217;m looking at you, trucker hats). How do mechanisms of disease transfer compare with mechanisms of idea transfer? Are water cooler conversations the mental analogue to sharing the same cup at that water cooler? Are there specific personality traits that make some people more susceptible to certain ideas in the same way that genetics and environmental exposures can make a person more susceptible to certain diseases? Can the recent surges of support for various GOP presidential candidates be explained in terms of epidemics? Does the realization that candidate x is not Mitt Romney suddenly make a person more susceptible to the condition of supporting candidate x? I&#8217;m sure the handsomely dressed, chain smoking folks in marketing have this all figured out already. </p>
<p>The second reason to write an epidemic simulation: I could use the process to teach myself how to code in R. Practice makes perfect, right? And I like writing code. To me, writing code is like setting up dominoes or some sort of text-based Rube Goldberg machine. You write a set of instructions: do this, then do that, then, if this is like that, then you be all like &#8220;nuh uh&#8221;. Then you execute those instructions, and if your code is right, things happen exactly how you expect them to.  </p>
<p>So I wrote it. It took some work and some late nights, but <a href="http://blog.lib.umn.edu/enge/sphere11/2011/04/simulations.html" title="Blog for original simulation" target="_blank">I did it</a>. I ran a few simulations, and felt briefly satisfied. Eventually, I used it to generate data for a final project in a class I was taking. Then I forgot about it for a while.</p>
<p>I recently decided to revisit my code after working my way partly through <a href="http://shop.oreilly.com/product/9781593273842.do" title="The Art of R Programming" target="_blank">The Art of R Programming</a>. I was sitting on the bleachers in my daughter&#8217;s gymnastics class one day. I was going back and forth between reading the book and watching her show off her cartwheels and balancing skills, and I realized that the way I had originally written the epidemic simulator was all wrong. Though the simulation worked, it was written in a generic way, which is another way of saying that I wasn&#8217;t taking advantage of the particular strengths of the R language. For instance, my original code was overly reliant on ye olden &#8220;for loop&#8221;. For those who are unfamiliar, a for loop is a way of doing something over and over again, but slightly different each time. When you go into work and stay there for eight hours, that&#8217;s kind of like executing a for loop. Each hour (or minute, or work period in between social media updates) is one iteration of the loop. In code, this might be expressed as</p>
<p><code><br />
for (hour in 1:8) {<br />
 work(hour)  # work() stands in for whatever you do in a given hour<br />
}<br />
</code></p>
<p>The version I started with had 28 for loops. The for loop is a pretty standard piece of equipment in other languages. In R, the for loop is anathema, a tool of the unwashed. They&#8217;re kind of ugly and they tend to slow things down. The current version has exactly one for loop, and it&#8217;s the main simulation loop. I like this loop. It&#8217;s where all the magic happens. The new version is a lot quicker, too. Above, I linked to a <a href="http://blog.lib.umn.edu/enge/sphere11/2011/04/simulations.html" title="Blog for original simulation" target="_blank">blog</a> I wrote about the earlier version of the simulation. The blog also has a still-accurate explanation of how this simulation works. In that version, running the simulation on a 250,000 person population with 5 initial infections took about 4.5 hours to reach the 95th iteration. The new version reached 95 iterations in 53 minutes. This speed increase is due in part to the reduction in loops. I also improved the disease transmission and assignment algorithms so that they&#8217;re much less time intensive.</p>
<p>Here&#8217;s a clip of an epidemic spreading through a 2500 person population:<br />
<a href="http://loloflargenumbers.com/blog/?p=74"><em>Click here to view the embedded video.</em></a></p>
<p>In this population, the probability of getting sick conditional on exposure and previously having gotten sick is less than one in a thousand, so the disease quickly burns itself out as people become immune. In a more accurate population, interaction and disease transmission would occur between cells that aren&#8217;t adjacent. I started implementing this process, by assigning each member of the population a random group number, with the idea that people with the same group number would have a possibility of interacting and transmitting disease in each iteration, but this functionality isn&#8217;t a huge priority at this point. Maybe in a few months.</p>
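<p>For readers who want to see the general shape of a simulation like this, here is a heavily simplified Python sketch. It is not the R simulator described above. The 50 by 50 grid (2500 people), the rule that only the four adjacent cells can be infected, the transmission probability of 0.5, and the rule that a person is sick for exactly one iteration and then permanently immune are all assumptions chosen to keep the example short.</p>

```python
import random

SUSCEPTIBLE, INFECTED, IMMUNE = "S", "I", "R"
NEIGHBORS = ((-1, 0), (1, 0), (0, -1), (0, 1))  # adjacent cells only

def count_states(grid):
    """Tally how many people are in each state."""
    flat = [cell for row in grid for cell in row]
    return {state: flat.count(state) for state in (SUSCEPTIBLE, INFECTED, IMMUNE)}

def step(grid, p_transmit, rng):
    """Advance one iteration: infected cells may infect neighbors, then recover."""
    size = len(grid)
    new_grid = [row[:] for row in grid]
    for r in range(size):
        for c in range(size):
            if grid[r][c] != INFECTED:
                continue
            for dr, dc in NEIGHBORS:
                rr, cc = r + dr, c + dc
                inside = rr in range(size) and cc in range(size)
                if inside and grid[rr][cc] == SUSCEPTIBLE and p_transmit > rng.random():
                    new_grid[rr][cc] = INFECTED
            # Assumed: sick for one iteration, then immune for good.
            new_grid[r][c] = IMMUNE
    return new_grid

def simulate(size=50, initial_infections=5, p_transmit=0.5, max_iterations=95, seed=1):
    """Run the epidemic and return the state counts after each iteration."""
    rng = random.Random(seed)
    grid = [[SUSCEPTIBLE] * size for _ in range(size)]
    cells = [(r, c) for r in range(size) for c in range(size)]
    for r, c in rng.sample(cells, initial_infections):
        grid[r][c] = INFECTED
    history = [count_states(grid)]
    for _ in range(max_iterations):  # the main simulation loop
        if history[-1][INFECTED] == 0:
            break  # the disease has burned itself out
        grid = step(grid, p_transmit, rng)
        history.append(count_states(grid))
    return history

history = simulate()
print(f"Iterations run: {len(history) - 1}")
print(f"Final counts: {history[-1]}")
```

<p>Because immunity is permanent in this sketch, the immune count can only grow, and the outbreak ends once no infected cell has a susceptible neighbor left to reach.</p>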
]]></content:encoded>
			<wfw:commentRss>http://loloflargenumbers.com/blog/?feed=rss2&#038;p=74</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bikes for days</title>
		<link>http://loloflargenumbers.com/blog/?p=49</link>
		<comments>http://loloflargenumbers.com/blog/?p=49#comments</comments>
		<pubDate>Fri, 20 Jan 2012 06:29:39 +0000</pubDate>
		<dc:creator>Kristopher Kapphahn</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://loloflargenumbers.com/blog/?p=49</guid>
		<description><![CDATA[About halfway through my first semester of statistics school, I started to get antsy. I was learning all these interesting ways of making sense of data. How to chop it. How to plot it. How to model it. These are interesting and exciting things, if you are me. My anxiousness was rooted in the fact [...]]]></description>
				<content:encoded><![CDATA[<p>About halfway through my first semester of statistics school, I started to get antsy. I was learning all these interesting ways of making sense of data. How to chop it. How to plot it. How to model it. These are interesting and exciting things, if you are me. My anxiousness was rooted in the fact that I wanted to use my new statistical knowledge on novel data. In statistics school, in applied courses, you deal with real data in nearly every assignment, but unless that data happens to fall within your particular sphere of interest, the fruits of your analytical labor are often boring. Or less than boring. They&#8217;re beside the point. I don&#8217;t really care whether changes in wallaby ear length over time are correlated with how large those wallabies are at one year old. Obviously somebody cares about wallaby growth, because someone took the time to chase a bunch of wallabies around and measure their fuzzy little ears and someone else analyzed that data for a <a href="http://www.statsci.org/data/oz/wallaby.html" title="school project" target="_blank">school project</a>. The point of the assignments in these classes is to hone statistical modelling skills, with the idea being that what one models is less important than one&#8217;s ability to model it correctly. The analysis of interesting data is accidental and when it happens, you enjoy it while you can.</p>
<p>So I went looking for more interesting data. It&#8217;s a good time to be a number cruncher. Home computers are powerful enough to handle small-to-medium-scale analysis problems relatively quickly. Also, the internet makes the sharing of data incredibly easy. Did you know that there is a data set containing all the pertinent information for <a href="http://labrosa.ee.columbia.edu/millionsong/" title="million song dataset" target="_blank">one million songs</a>? It&#8217;s 280 GB, all yours if you want it. Plus, the digitization of everything allows for ample opportunities for data collection. Data is frequently the offal of activities completely tangential to data generation. Online retailers collect data ostensibly as an essential part of showing you products, taking your money and sending those products to you. All those terabytes of data have proven useful for other reasons, primarily as a way for advertisers to train algorithms designed to separate you from your money. </p>
<p>One website I found was <a href="http://www.readwriteweb.com/archives/where_to_find_open_data_on_the.php" title="Where to find open data on the web" target="_blank">here</a>, which is a few years old but still seems to have some useful links. Another good place to look for data: the government. NOAA has tons of meteorological data available, much of it free, or free if you happen to be using the internet on a university campus. The census has a massive amount of data too. You can even look up which internment camp you&#8217;re going to be assigned to when FEMA stages a coup following the return of the Mayan sun god next winter. Hahaha. *nervously glances over at Kinich Ahau figurine in the corner* The <a href="http://www.health.state.mn.us/macros/topics/stats.html" title="MDH Website" target="_blank">Minnesota Department of Health</a> has data too. If you ever want to spend an hour or two comparing cesarean and vaginal birth rates by immigration status and age, go there.</p>
<p>I actually found what I was looking for on the <a href="http://www.ci.minneapolis.mn.us/bicycles/" title="City of Minneapolis" target="_blank">City of Minneapolis&#8217; website</a>. The city had been collecting bicycle count data on its Midtown Greenway bike path every fifteen minutes for several years. That&#8217;s a lot of data. I thought it might be interesting to run the data through a few models and see what happened, so I sent an email to one of the people responsible for collecting the data to see if I could get my hands on it. He was agreeable, and after a week or so of sporadic, hot, three-way email exchanges, I had myself 100 text files of bicycle count data. I grabbed monthly meteorological data from NOAA&#8217;s labyrinthine website and started the process of compiling the count data and folding in the weather data. This was as exciting as it sounds.</p>
<p>It seems natural to assume that weather has a significant effect on bicycle ridership. As I write this, winter has found its way back to Minnesota. There were significantly fewer cyclists on campus today. Also likely a predictor: time of day. There are fewer people riding bicycles at 2am than there are at 2pm, and also, since many people ride their bicycles to work, there are fewer people on their bicycles at noon than there are at 8am. The following charts, made using only data from the East River Road sensor, illustrate how these assumptions compare with reality as measured by the City of Minneapolis.</p>
<p>The first plot shows hourly bicycle counts by time of day. There are a lot of dots here, so the data is both jittered and drawn slightly transparent. Two things pop out. First, there seems to be a clear relationship between time of day and the number of people riding their bicycles. Not too many people are riding in the wee hours. Ridership increases as the day ages and eventually declines as people go home in the evenings. Also, note the round ball of fluff between 100 and 320 riders and between 9am and 5pm. There seem to be a lot of people riding at this time of day, but fewer people than might be expected based on the surrounding areas in the plot. What does it mean?</p>
<p><img src="http://loloflargenumbers.com/blog/wp-content/uploads/2012/01/eastriverscatter.png" alt="scattplot of rider data" /></p>
<p>It&#8217;s the weekend! The next plot is identical to the previous, except for the color scheme. The blue dots show counts from Saturdays and Sundays. The red dots show weekdays. Ridership patterns are consistent with what one might expect of a cycling population that is weighted towards commuters. On weekends, 9 to 5ers are free to ride all day, and apparently many of them do. The highest dot, the one that seems like it might be an outlier, represents the hour from 9:30am to 10:30am on Saturday, June 5th, 2010, a windless, rainless hour where the air was 67 degrees and relative humidity was at a probably somewhat uncomfortable 77%.</p>
<p><img src="http://loloflargenumbers.com/blog/wp-content/uploads/2012/01/eastriverscatter-weekend.png" alt="scatter plot for east river road by weekend" /></p>
<p>The second plot shows riders per hour plotted against hourly air temperature. Because time of day seems like an informative variable, the data are further broken up by hour into 14 lines (wee hours of the morning were omitted because they were essentially zero and added more clutter than information to the plot). Each line represents a sort of average of the number of cyclists counted at a specific time for all of the days in the data set. So the violet line with the highest peak is a smoothed representation of all of the bicycle counts for 5pm in the data. This plot is an expression of an idea that most people grasp intuitively: people are more inclined to ride their bicycles when it&#8217;s warm out. Or, when it&#8217;s zero out, not a lot of people ride their bicycles. As the weather gets warmer more people ride. This trend continues until things get too warm, and then the number of cyclists drops off. It appears the overall approximate ideal cycling temperature is around 80 or so degrees. I myself prefer 50s and 60s.</p>
<p><img src="http://loloflargenumbers.com/blog/wp-content/uploads/2012/01/eastrivertemp.png" alt="Riders per hour by hourly air temperature" /></p>
<p>So weather and time of day have an effect on ridership, as well as whether it&#8217;s the weekend or not. In the interests of getting more statistic-ey, the data were fit to a Poisson regression model using hourly air temperature, hourly relative humidity, and whether it was the weekend or not as predictors. Time of day was also included, though some trickery was used to get around the fact that the distribution of rider counts had more than one peak when conditioning on time of day. Time was recoded as the smaller of two absolute differences: the one between each data point&#8217;s time and 7am, and the one between that time and 5pm. Essentially, this new time variable represents the temporal distance between when a measurement was taken and the closest rush hour. It isn&#8217;t necessarily the most elegant solution, but it fit better than the untransformed variable.</p>
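<p>The recoding of time described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the function name <code>hours_from_rush_hour</code> and the choice to express time as decimal hours on a 24-hour clock are assumptions, and the sketch makes no attempt to wrap around midnight.</p>

```python
def hours_from_rush_hour(hour, rush_hours=(7.0, 17.0)):
    """Return the distance, in hours, from `hour` to the nearest rush hour.

    `hour` is a decimal hour on a 24-hour clock (9.5 means 9:30am).
    The defaults are the 7am and 5pm reference times mentioned in the post.
    No wrap-around at midnight is attempted (an assumption of this sketch).
    """
    return min(abs(hour - rush) for rush in rush_hours)


# Print the recoded value for a handful of times of day.
for hour in (2.0, 7.0, 9.5, 12.0, 20.0):
    print(hour, hours_from_rush_hour(hour))
```

<p>Under this coding, noon sits five hours from both peaks, so it gets the largest value of any time between the two rush hours.</p>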
<p>Given the large size of the data set, all estimates were statistically significant. The model suggests that each one degree increase in temperature corresponds to a 4% expected increase in the number of riders per hour. So, all other things being equal, the difference between 50 degrees and 80 degrees should roughly triple the number of riders. A unit increase in relative humidity corresponds to a decrease in riders of about 2%. The weekend effect was huge compared to weather effects, with weekends being associated with a 40% increase in hourly rider counts.</p>
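<p>Percentage effects like these multiply rather than add, assuming the model uses the usual log link for Poisson regression. Here is a small Python sketch of that arithmetic, using only the rounded percentages quoted above. The helper name <code>rate_ratio</code> is made up for illustration, and the ten-point humidity change is just an example input.</p>

```python
import math


def rate_ratio(pct_change_per_unit, units):
    """Multiplicative change in expected riders after `units` one-unit steps.

    `pct_change_per_unit` is a proportion: 0.04 means +4% per unit.
    """
    return (1.0 + pct_change_per_unit) ** units


# 50 to 80 degrees is 30 one-degree steps at +4% each.
print(round(rate_ratio(0.04, 30), 2))

# Ten points of relative humidity at -2% each (hypothetical example input).
print(round(rate_ratio(-0.02, 10), 2))

# The same 4% effect expressed as a coefficient on the log scale.
print(round(math.log(1.04), 4))
```

<p>The first value comes out a little above 3, which is where the tripling comes from.</p>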
<p>Unfortunately, the coding for time yielded odd results. The model suggests that ridership decreases by 99% for every hour removed from rush hour. That doesn&#8217;t seem right. A plot of weekday riders vs the recoded time variable suggests that perhaps the times chosen as references for rush hour aren&#8217;t resulting in a clean &#8220;peak-merging&#8221; in the data. Perhaps I&#8217;ll revisit this problem at a later date.</p>
<p>So there you go. All that to get at what I already suspected, that people ride their bicycles more during rush hour and that they tend to ride their bicycles more when the weather is nice. Sometimes it can be kind of nice when you take the long, slightly unwieldy and unfamiliar route and still manage to end up where you expected.</p>
]]></content:encoded>
			<wfw:commentRss>http://loloflargenumbers.com/blog/?feed=rss2&#038;p=49</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Internships and Measurements</title>
		<link>http://loloflargenumbers.com/blog/?p=19</link>
		<comments>http://loloflargenumbers.com/blog/?p=19#comments</comments>
		<pubDate>Tue, 17 Jan 2012 20:00:16 +0000</pubDate>
		<dc:creator>Kristopher Kapphahn</dc:creator>
				<category><![CDATA[error]]></category>
		<category><![CDATA[measurement]]></category>

		<guid isPermaLink="false">http://loloflargenumbers.com/blog/?p=19</guid>
		<description><![CDATA[My first nonretail job was as a research assistant. I was finishing up a bachelor&#8217;s degree in mechanical engineering right about the same time that the financial sector was figuring out that subprime mortgage-backed securities might not be as sound as they first appeared. I felt fairly lucky to have scored an internship as a [...]]]></description>
				<content:encoded><![CDATA[<p>My first nonretail job was as a research assistant. I was finishing up a bachelor&#8217;s degree in mechanical engineering right about the same time that the financial sector was figuring out that subprime mortgage-backed securities might not be as sound as they first appeared. I felt fairly lucky to have scored an internship as a research assistant prior to my senior year at an energy-focused nonprofit. It paid $14/hr, which was more than I had ever made before.</p>
<p>I would primarily be working on projects whose aim was to quantify secondhand smoke exposure in various settings. The research was funded by an organization whose sole purpose was to spend 3% of Minnesota&#8217;s tobacco settlement fund on ameliorating the effects of tobacco smoke. My bosses were extremely busy people, and my role was essentially to fill in the gaps between their obligations and their time-constrained abilities. The work primarily took the form of equipment babysitting: offloading data, applying grease, and checking that dates and times were synchronized. Occasionally, my bosses&#8217; requirements resulted in odd tasks, like estimating the outer surface area of an oddly shaped five-story building or testing building code interpretation software. I remember several instances of cooking and eating various delicacies in a sealed room while our equipment measured the particulates generated by my culinary skillz.</p>
<p>After I graduated, I kept the job, lost the intern title and added field work to my job description. This posed a whole new set of challenges for me. The field work consisted of hauling around large, heavy, sturdy plastic cases full of 7 or so different devices with a total retail value of around $18,000. We’d take an ominous, occasionally humming, intermittently beeping box to a randomly chosen (from a pool of agreeable participants) apartment and leave it there for a week. One thing I learned about apartments: all apartments built after 1980 are essentially the same. They might have different layouts, but they all have off-white carpet and smell like your neighbor&#8217;s dinner every night around 6:30. You pay for location, not quality. I also learned that conspiracy theorists love a captive audience and that other people&#8217;s lives are occasionally fairly depressing: I got to do an install and removal <a href="http://www.startribune.com/local/stpaul/136985218.html">here</a>, though when I went, the tenant I was visiting only mentioned the bedbug infestation. Incidentally, bedbugs can live for a year and a half without eating, and one way to rid a big box of fancy equipment of bedbugs is to leave it in an unheated garage while the daily highs drop below zero for a week.</p>
<p>The participants in this study were nonsmokers who had reported smelling smoke in their apartments on a survey that my organization had distributed. The purpose of the box, which was left in place for a week, was to collect a sort of indoor air quality fingerprint for that person&#8217;s apartment, the idea being that we could then use the data to estimate nonsmokers’ level of secondhand exposure. The boxes collected data on temperature, humidity, and CO<sub>2</sub> concentration. We collected three different sets of airborne particulate concentration data and week-averaged polycyclic aromatic hydrocarbon concentration. Participants were also required to fill out a log of daily activities. We measured the hell out of everything we could feasibly measure.</p>
<p>Measuring things is easy. Measuring things in an accurate and consistent manner gets really difficult really quick. Throughout my short career as a research assistant studying secondhand smoke, and also during every lab of my undergraduate experience (in Biostatistics you&#8217;re typically just given the data and the trick is to reformat it so that it can be appropriately fit to a statistical model), the trickiness of proper data collection was constantly in the back of my mind. There is always some distinction between what you want to measure and what you&#8217;re actually measuring. For instance, when you take real-time measurements of airborne particulate concentration in a restaurant, you are probably actually measuring varying levels of voltage induced in a photodiode from the concentration-dependent scattering of a laser as it&#8217;s pointed through a sample of air from that restaurant. You&#8217;re not even necessarily measuring tobacco smoke, because there are other constituents of restaurant air that scatter laser light identically to tobacco smoke, like ambient air pollution from outside and byproducts of cooking. In order to estimate the actual concentration of pollutants due to tobacco smoke, you need to estimate the proportion of your laser scatter that is solely due to tobacco smoke. Estimates of this proportion are available in the literature and this method seems to be acceptable for publication purposes.</p>
<p>Even if your measurement scheme is all nailed down, there can still be complications that induce bias. The measurement mechanism outlined above is essentially the mechanism utilized by a device called a <a href="http://www.tsi.com/sidepak-personal-aerosol-monitor-am510/">Sidepak</a>. For a device to work as it is supposed to, it must be deployed correctly. You must know how your equipment can fail to measure what you think it’s measuring. One example: if you want to properly estimate the overall concentration of secondhand smoke particulates in a restaurant using a Sidepak, you shouldn&#8217;t sit near anyone who is smoking, because you&#8217;ll then be measuring the concentration of secondhand smoke particulates around that person and not the whole room, and your resulting data will be biased towards higher concentrations.</p>
<p>The process of deriving meaningful information via measurement requires at least three things. First, you must know what you want to know &#8211; you must have your goal clearly defined. You have to be able to determine what data is sufficient to answer your question of interest. If your goal is misspecified here you&#8217;ll likely end up answering a question you didn&#8217;t intend to and that question will likely be a completely uninteresting one. Second, you must know how to measure what you need to measure to be able to estimate what you want to know. If you can&#8217;t actually relate your desired estimate to collectable data then you shouldn&#8217;t collect data. Maybe you could simulate your data instead. Third, you must measure in a way that ensures the least amount of measurement error possible. For instance, if your goal is to estimate the number of people who are committing vote fraud and your data collection method is to seek the expertise of knowledgeable people (not necessarily my first choice, but probably not a worthless exercise if done right), don&#8217;t rely solely on politicians for your information.</p>
<p>This last example, while kind of a joke, shows that mistakes in measurement aren&#8217;t solely problems in the sciences. Indeed, the human brain is very susceptible to failing all three criteria listed above resulting in erroneous beliefs based on biased measures of irrelevant information. For more concrete examples, see any PAC-funded political ad.</p>
]]></content:encoded>
			<wfw:commentRss>http://loloflargenumbers.com/blog/?feed=rss2&#038;p=19</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
