This is what 4 million births looks like, assuming that you’re floating high above the continental United States and that each birth appears as a green dot as it occurs and then changes into a blue dot for all subsequent days. It looks best in full screen HD.
The video uses data from 2004. Each green dot corresponds to a birth on the day being shown. When the video moves on to the next day, all births from the previous day turn blue and remain blue for the rest of the year. This isn’t an exact representation of where or when these births actually happened, but I think that it’s a fairly plausible approximation.
The National Center for Health Statistics (NCHS) has data. A lot of it. Millions and millions of data points waiting – aching – for someone to turn them into meaning. They have data on birth, death, air quality, and other subjects denoted by cryptic acronyms.
One such data set is the 2004 natality data set. Each observation in this file corresponds to a single birth, and for 2004 in the US, there were over 4.1 million babies born (2.4 million people died – see 2004 mortality data set).
The natality data go back to 1968. I chose 2004 because the 2004 data set is the last one that includes geographic information. After 2004, accessing data with geographic information requires more comprehensive vetting.
The data corresponding to each birth is comprehensive, with fields for mother’s age, place of delivery, attendant, parents’ education, natal comorbidities, etc. However, in order to maintain the privacy of the people represented in the data, a few variables are omitted or conditionally set to uninformative values. For instance, while each record comes with information about the state and county where the birth occurred, for births which occurred in counties with a population of less than 100,000 people, the county field is set to a useless ‘999’ value. The data also omit any information about the day on which a birth occurred. There is good reason for leaving these details out; it would be fairly easy to match a person to their record in the data if they lived in a less populous county and you knew even cursory information about their demographic info and birth date.
But my goal here wasn’t to track down rural women and confront them with my knowledge of their child’s apgar score or whether the use of forceps was attempted during labor. I just wanted to plot the location of the birth on a map because I thought it might make for an interesting visual.
Filling in the Blanks
My plan was to take each birth, match it up with longitude and latitude coordinates based on the county of occurrence, then use these coordinates to plot the birth’s location on a map of the US for each day of the year. Simple, right?
Maybe not. Getting the raw data was easy. A right click was all it took. The data come compressed, and expand to a 5+ gigabyte, fixed width text file. This is a file that’s just asking for SAS, but SAS is for large organizations and people who don’t want to produce attractive pictures with their primary software package. Since I am neither of those things (but I will admit to appreciating the very specific advantages of SAS when I am working on behalf of a large organization), I wrote a Ruby script to parse the raw data into a MySQL database and then used R to pull month-wide chunks of database as needed.
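For illustration, here is a minimal Python sketch of slicing a fixed-width record into named fields. It is not the Ruby script described above. The field names, column offsets, and sample lines are all hypothetical; the real positions have to come from the NCHS documentation that accompanies the file.

```python
from typing import Dict, Iterable, Iterator, Tuple

# HYPOTHETICAL layout: field name -> (start, end) column offsets,
# 0-based with end exclusive. These are NOT the real NCHS positions.
LAYOUT: Dict[str, Tuple[int, int]] = {
    "birth_month": (0, 2),
    "state": (2, 4),
    "county": (4, 7),
}


def parse_record(line: str, layout: Dict[str, Tuple[int, int]] = LAYOUT) -> Dict[str, str]:
    """Slice one fixed-width line into a dict of stripped string fields."""
    return {name: line[start:end].strip() for name, (start, end) in layout.items()}


def parse_lines(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Lazily parse lines so a multi-gigabyte file never sits in memory."""
    for line in lines:
        line = line.rstrip("\n")
        if line:  # skip blank lines
            yield parse_record(line)


# Made-up sample lines that follow the hypothetical layout above.
sample = ["07MN999\n", "12MN053\n"]
records = list(parse_lines(sample))
```

Because `parse_lines` is a generator, it can be fed an open file handle directly and the records can be written to a database in batches.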
The plots were generated in R using the maps library and ggplot2.
I got around the missing date information in the data by assuming that births were equally likely on any day of a month and assigning them accordingly. This isn’t strictly true, as the probability of being born on Saturday or Sunday is substantially lower than the probability of being born any other day of the week. I thought about taking this fact into account when I handed out birth dates, but opted not to because it wouldn’t add much to the visualization.
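The uniform assignment described above can be sketched in a few lines of Python. This assumes each record carries a birth month (the helper name `assign_birth_day` is made up for this example), and it deliberately ignores the weekday effect, just as the paragraph says.

```python
import calendar
import random


def assign_birth_day(year: int, month: int, rng: random.Random) -> int:
    """Pick a day of the month uniformly at random (1..number of days)."""
    days_in_month = calendar.monthrange(year, month)[1]
    return rng.randint(1, days_in_month)  # both endpoints are inclusive


# Seeded generator so the same video can be regenerated.
rng = random.Random(2004)

# Example: hand out days to 1,000 births recorded in February 2004.
days = [assign_birth_day(2004, 2, rng) for _ in range(1000)]
```

Accounting for the weekend dip would mean replacing `randint` with a weighted draw over the days of the month, with weights that depend on the day of the week.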
I initially generated a video like the one above using the data as it came (with randomly assigned birth dates) and ended up with vast swaths of the west showing no births at all. It didn’t look plausible. It is true that population density is pretty low in the west, but no births at all wasn’t going to work. So what’s a fella to do? I had to come up with a reasonable way to place the people from counties with fewer than 100,000 people that wouldn’t appear too conspicuously wrong.
I’ve done a bit of work with decennial census data, so I was aware of the massive amount of data that the census folks generate. At first, I thought it would be sufficient for my purposes to get a list of counties with fewer than 100,000 people for each state and then randomly place rural people in one of these counties. This approach worked okay, but was still problematic because the longitude and latitude coordinates for each county in the census data were located at a county’s most populous city. For larger rural counties, the resulting birth patterns still looked implausibly sparse because all the births assigned to that county would be concentrated in one corner, which left the rest of the county looking like a dead zone.
I eventually settled on using census tract data. Each rural birth is assigned a birth location in a census tract somewhere in its state, with births being distributed to census tracts with probabilities proportional to a tract’s population. To ensure that rural births didn’t end up in densely populated urban areas, they were only allowed to be assigned to census tracts located in counties whose total county population was less than 100,000. So a person who gave birth in Otter Tail County (pop ~60,000) could be assigned to a census tract in any of the 70 or so counties in Minnesota whose population was less than 100,000. This strategy assumes that all rural census tracts have the same proportion of women who are going to give birth, which is an assumption I didn’t at all try to verify.
Because the births in the most populous counties were located by county, and census tracts are generally more granular than county, I set the code up to randomly assign births in populous counties to census tracts from populous counties within the same state with probability proportional to census tract population. So a person who gave birth in Hennepin County (pop. >> 100,000) could be assigned to any census tract in MN in a county with population greater than 100,000.
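The tract assignment described in the last two paragraphs amounts to a population-weighted random draw within a stratum (state, rural or populous). Here is a minimal Python sketch of that idea, not the original R code. The tract records, county names, and populations are invented for the example, and treating a county of exactly 100,000 as populous is an assumption.

```python
import random
from collections import defaultdict

THRESHOLD = 100_000  # county population cutoff described in the text


def build_tract_pools(tracts, county_pop):
    """Group tracts by (state, is_rural), where is_rural reflects county size."""
    pools = defaultdict(list)
    for tract in tracts:
        is_rural = county_pop[(tract["state"], tract["county"])] < THRESHOLD
        pools[(tract["state"], is_rural)].append(tract)
    return pools


def assign_tract(state, county_code, pools, rng):
    """Draw a tract with probability proportional to tract population.

    A county code of '999' marks a suppressed (rural) county; any other
    code is treated as a populous county.
    """
    is_rural = county_code == "999"
    candidates = pools[(state, is_rural)]
    weights = [tract["pop"] for tract in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# INVENTED data: county populations and tracts with centroid coordinates.
county_pop = {("MN", "A"): 60_000, ("MN", "B"): 20_000, ("MN", "C"): 1_100_000}
tracts = [
    {"state": "MN", "county": "A", "pop": 4_000, "lon": -95.7, "lat": 46.4},
    {"state": "MN", "county": "B", "pop": 2_500, "lon": -94.1, "lat": 47.9},
    {"state": "MN", "county": "C", "pop": 5_200, "lon": -93.3, "lat": 45.0},
    {"state": "MN", "county": "C", "pop": 3_100, "lon": -93.4, "lat": 44.9},
]

pools = build_tract_pools(tracts, county_pop)
rng = random.Random(0)
rural_tract = assign_tract("MN", "999", pools, rng)  # tract in county A or B
urban_tract = assign_tract("MN", "C", pools, rng)    # one of county C's tracts
```

The coordinates of the chosen tract then become the dot's position on the map. A state with no tracts in a given stratum would need special handling, since the draw fails on an empty candidate list.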
So the births shown in the movie above are, strictly speaking, not at all what a time-lapse map of actual births in 2004 looked like; it is very unlikely that a birth shown in the video actually occurred on the day shown at the place shown. However, given the gaps in the data, I think it’s a reasonable representation of what the video might have looked like had it been produced using the complete data.