Choropleths with Alaska and Hawaii in R

The following chart shows a choropleth of mean Apgar score for each state in the continental United States for 2004. Data are from the 2004 national natality data file. California and Texas apparently didn’t report Apgar scores.

Mean Apgar Score, US, 2004
Mean Apgar Score, US, 2004

Maps like this are fairly easy to make in R. As with most R-ey things, there are many routes available. My process generally involves using the maps package to import a dataframe with polygon info to draw the shapes and then merge in some numerical values of interest. A few lines of ggplot-specific syntax and a custom theme and BAM! (as they say) you have your map. There are two municipally significant problems with this process: Alaska and Hawaii aren’t readily available as part of the default US polygon data in maps. Hawaii and Alaska are the two newest states, and so we could probably get by leaving them out in the spirit of good natured rookie ribbing (maybe we’ll include them by default once they’ve been in the Union for more than 60 years)?
Because I’ve never had to produce a choropleth for someone who cares about explicitly including Alaska or Hawaii, which is to say I’ve never been motivated or compensated to produce a choropleth by anyone who isn’t also me, the inability to represent AK and HI hasn’t been a deal breaker. However, my failure thus far to produce maps with AK and HI irks me. I get irked. Alaska and Hawaii are, you know, part of the country, and should probably be represented in all self respecting national maps. And as a person who wasn’t born into R scripting yesterday, I ought to know how to make a map with Hawaii in it. Personal pride and all that.
It turns out that I’m not the first person who has wanted to do this. Google (or Bing, if you’re on the “I Binged before it was cool” track) suggest that this problem has left many a tired, frustrated r-blogger in its wake. There are several “How do I do this” mailing list entries (answer: R is open source, figure it out yourself. You’re welcome for the helpful suggestions, by the way). There are a few “Here’s how I started to solve this problem, but now I’m annoyed and frustrated and to hell with Alaska and Hawaii anyway” blog posts. And there are one or two “Here’s how you do it. Maybe. But you have to use these packages and some black box functions and …” It turns out, it’s kind of a pain in the ass.
Here’s how I did it.
First, you need a shape file. This is a file that contains all the info required to draw a picture of some geographic entity. Search for “arcgis us shapefile” and one of the first links that pops up should belong to www.arcgis.com. Click on it and it will probably take you to a page like the one here. Click Open, then Download. Unzip the file you just downloaded, then use the readShapePoly function in the maptools package to read the states.shp file into R:

us50_shp <- readShapePoly("path to states.shp")

This will return a spatialpolygonsdataframe object chock full of things that I'm sure are important, but that I don't care about.

us50_df <- as.data.frame(us50_shp)

strips away extraneous info and leaves information on state, FIPS code and something called DRAWSEQ which I'm just going to assume plays a role in a chart drawing algorithm.
To pull out the data used to draw the states, the sp2tmap function, also from maptools is useful. It returns a three column data frame with a column for ID ("_ID"), a column for longitude ("_X") and a column for latitude ("_Y"). I like to rename these columns to conform to my preference for avoiding all caps variable names that start with underscores.

us50_points <- sp2tmap(us50_shp)
names(us50_points) <- c("id", "x", "y")

To which ID does the _ID column returned by sp2tmap refer? If you guessed DRAWSEQ, you're both bold and correct. Knowing this, a simple merge will combine the general information in us50_df with the data required to draw the states in us50_points:

us50 <- merge(x = us50_df, y = us50_points, by.x = "DRAWSEQ", by.y = "id")

Now, plot it
ggplot(data = us50, aes(x=x, y=y, group = DRAWSEQ)) + geom_polygon(color = "black", fill = "white")
and you get something ugly:
US No Canada
US No Canada

This is how the world actually looks, sort of. But there's a lot of wasted empty space. It is evident why maps of the U.S. tend to place Alaska and Hawaii underneath California. To do this with these data, the data must be rearranged. The continental US will remain unchanged, while Alaska and Hawaii will be separately scaled and moved to their more traditional positions. The following code splits up the United States.

cont_us <- us50[us50$STATE_ABBR != "HI" & us50$STATE_ABBR != "AK", ]
ak <- us50[us50$STATE_ABBR == "AK", ]
hi <- us50[us50$STATE_ABBR == "HI", ]

Everybody remembers linear algebra where you take a square and multiply the coordinates of its vertices by a matrix and the result is a rhombus that's somewhere else. So I won't review. The following functions are useful for a similar process. The first function, centerState takes an object with x and y coordinates and centers it at the origin. The second, scaleState, takes an object with x and y coordinates, centers it at the origin, scales it according to the product of scale_factor and scale_matrix, then shifts it according to x_shift and y_shift.

centerState <- function(.df) {
.df$x <- .df$x - (diff(range(.df$x, na.rm = T))/2 + min(.df$x, na.rm = T))
.df$y <- .df$y - (diff(range(.df$y, na.rm = T))/2 + min(.df$y, na.rm = T))
return(.df)
}


scaleState <- function(.df, scale_matrix, scale_factor, x_shift, y_shift) {
.df <- centerState(.df)
coords <- t(cbind(.df$x, .df$y))
scaled_coord <- t(scale_factor*scale_matrix %*% coords)

.df$x <- scaled_coord[,1] + x_shift
.df$y <- scaled_coord[,2] + y_shift
return(.df)
}

The following snippet of code shows these functions in action. I defined scale_mat so that the vertical dimension scales to a different extent than the horizontal because I think it looks better. Also, Hawaii is small, so the scale_factor I used expands it. The transformed states are shifted to the left and up from the origin so that they end up tucked comfortably underneath California. The final line combines the transformed state data with the continental United States.

scale_mat <- matrix(c(1,0,0,1.25), ncol = 2, byrow = T)
ak_scale <- scaleState(ak, scale_mat, 0.4, x_shift = -120, y_shift = 25)
hi_scale <- scaleState(hi, scale_mat, 1.5, x_shift = -107, y_shift = 25)
all_us <- rbind(cont_us, ak_scale, hi_scale)

Now, to plot. Here's the Apgar data plotted on the whole US.

All US Apgar
All US Apgar

Not the prettiest map. Alaska could stand to be smaller and Hawaii might look better larger. The algorithm that I used to project the map makes Alaska look even more out of place. But it works. If I ever use this process for work, I'll probably tweak the scaling a bit more and maybe add a parameter and functionality to scaleState to allow for manual rotation.

I've compiled the code into one chunk below. Also included but not mentioned in the above tutorial is a function (mapproject from the mapproj package) for transforming the rectangular coordinates of the map data so that the final plot appears rounded. This code is also available here: github.com/kikapp/hiakoro

library(maptools)
library(mapproj)
library(ggplot2)

centerState <- function(.df) {
.df$x <- .df$x - (diff(range(.df$x, na.rm = T))/2 + min(.df$x, na.rm = T))
.df$y <- .df$y - (diff(range(.df$y, na.rm = T))/2 + min(.df$y, na.rm = T))
return(.df)
}

scaleState <- function(.df, scale_matrix, scale_factor, x_shift, y_shift) {
.df <- centerState(.df)
coords <- t(cbind(.df$x, .df$y))
scaled_coord <- t(scale_factor*scale_matrix %*% coords)

.df$x <- scaled_coord[,1] + x_shift
.df$y <- scaled_coord[,2] + y_shift
return(.df)
}
us50_shp <- readShapePoly("your path to states.shp")
us50_df <- as.data.frame(us50_shp)

us50_points <- sp2tmap(us50_shp)
names(us50_points) <- c("id", "x", "y")

us50 <- merge(x = us50_df, y = us50_points, by.x = "DRAWSEQ", by.y = "id")

ggplot(data = us50, aes(x=x, y=y, group = DRAWSEQ)) + geom_polygon(color = "black", fill = "white")

cont_us <- us50[us50$STATE_ABBR != "HI" & us50$STATE_ABBR != "AK", ]
ak <- us50[us50$STATE_ABBR == "AK", ]
hi <- us50[us50$STATE_ABBR == "HI", ]

scale_mat <- matrix(c(1,0,0,1.25), ncol = 2, byrow = T)
ak_scale <- scaleState(ak, scale_mat, 0.4, x_shift = -120, y_shift = 25)
hi_scale <- scaleState(hi, scale_mat, 1.5, x_shift = -107, y_shift = 25)

all_us <- rbind(cont_us, ak_scale, hi_scale)
proj_type <- "azequalarea"
projected <- mapproject(x = all_us$x, y = all_us$y, projection=proj_type)
all_us$x_proj <- projected[["x"]]
all_us$y_proj <- projected[["y"]]

ggplot(data = all_us, aes(x=x_proj, y=y_proj, group = DRAWSEQ)) + geom_polygon(color = "black", fill = "white")

13 Comments

  1. Thanks!

    The color comes in via the fill argument in the geom_polygon() function. In the posted code, I don’t reference the actual data I used to create the choropleths that are shown; the code should just produce a white map with black borders.

    If you wanted to generate a color-coded choropleth, you’d use a data frame identical to the all_us data frame that also had an additional numeric variable corresponding to the data you want to plot.

  2. Ari Lamstein

    Hi Kristopher,

    This is a great post. I recently released an R package called choroplethr that attempts to make it easier to make choropleths in R:

    blog post: http://tech.truliablog.com/2014/01/15/the-choroplethr-package-for-r/

    github: https://github.com/trulia/choroplethr

    I spend quite a bit of time trying to incorporate Hawaii and Alaska into the maps, but couldn’t find a satisfactory way to do so.

    I would like to use your method in my project, and attribute you in the source code with a link back to this post. Can I receive your permission to do so?

    Thanks.

    Ari

  3. Hi Kristopher,

    This is a great post, thank you very much. However, there is a little issue: If you have an external longitude / latitude data, it becomes necessary to scale those values as well so they can find their right paces on this map (instead of being printed on where Alaska would have been).

    Do you have any quick suggestions for that? :)

    As an example, this point needs to be on Honoulu:

    lat: 21.331
    lon: -158.037

    Thanks again,
    Best wishes.

    • Meren,

      Glad you like the post. Thanks!

      The easiest way that I found to do something like that would be to create an extra row in the hi data.frame before you run it through the scaleState function, then split it out afterwards.


      hi_points < - data.frame(y=21.331, x=-158.037, DRAWSEQ = NA, STATE_NAME = "Hawaii",
      STATE_FIPS = 15, SUB_REGION = "point", STATE_ABBR = "HI")
      hi_scale <- scaleState(rbind(hi, hi_points), scale_mat, 1.5, x_shift = -107, y_shift = 25)

      hi_p <- hi_scale[hi_scale$SUB_REGION == "point",]
      hi_scale <- hi_scale[hi_scale$SUB_REGION != "point",]

      Now, hi_p should contain the correct coordinates to plot. I believe (but haven't tested) that an analogous procedure could be used if you want to project that point with the mapproject function.

  4. Thank you very much for the insight, Kristopher.

    I took your advice, and did something like this to simply edit my data frame (df) that contains all cities and associated info with them:

    for(sample in df[df$state == "AK", ]$samples){
    ak_point <- data.frame(y=df[df$samples == sample, ]$lat,
    x=df[df$samples == sample, ]$lon,
    DRAWSEQ = NA, STATE_NAME = "Alaska",
    STATE_FIPS = 15, SUB_REGION = "point", STATE_ABBR = "AK")
    sp <- scaleState(rbind(ak, ak_point), scale_mat, 0.4, x_shift = -120, y_shift = 25)
    df[df$samples == sample, ]$lat <- sp[sp$SUB_REGION == 'point', ]$y
    df[df$samples == sample, ]$lon <- sp[sp$SUB_REGION == 'point', ]$x
    }

    And finally I got my map that looks much better: http://i.imgur.com/2Hrf6Qh.png :)

    Thanks again!

  5. Jeff Miller

    This is an excellent post: clear, to the point, and the code actually works as described.

    I like the choroplethr package, but this is nice because it gets HI and AK.

    I’d like to extend this to include the Canadian provinces as well as the states. Any pointers on how to do that?

    Thanks!

    • Jeff,

      Thanks for the kind words.

      I think if you can acquire a shape file that has the Canadian provinces, you should be able to convert it via the same process I used to convert the us_shp_50 object.

      I would hope (but really have no clue) that then you could simply use an rbind to combine the US object and the Canadian object. You might also have to change the DRAWSEQ in one of these data frames so that don’t duplicate DRAWSEQs across data frames.

  6. Jeff Miller

    Kristopher, thanks for the quick reply. I’ll give that a try. It would be great if something as simple as rbind worked in this context.

    I actually have one further question, on how to color the states different colors.

    When I used choroplethr on the lower 48, I first made a data.frame with two columns.

    teslawh <- data.frame(region = row.names(zlocale), value = zlocale$wtdAvg)

    The first few rows look like this:

    region value
    2 AK 384
    3 AR 311
    4 AZ 304
    6 CA 328

    (The values are the average for the state watt hours/mile of the Tesla Model S. Cold states have higher wh/m ).

    Then, to make a map in which states are colored according to which quantile of "value" they fall into, I just call

    choroplethr(teslawh, "state" , num_buckets = 3 )

    This will produce a map with three colors.

    I'm new to ggplot (I'm reading Wickham's book now), so I'm not sure how to modify your function call to do something similar.
    I know how to produce the quantiles of "value", I just don't know how to modify the call to ggplot to make the colors it displays for each stats correspond to the appropriate quantile.

    I have the feeling that I need to modify

    geom_polygon(color = "black", fill = "white")

    and to set "fill" to something related to
    the value column in teslawh, but I don't know the proper syntax.

    Any suggestions you have would be a great help! Thanks in advance,

    Jeff

Leave a Reply

Your email address will not be published. Required fields are marked *