Maps

{maps}: the older way to make maps with R

The tx data set comes from the {maps} package, which is an R package that contains similarly formatted data sets for many regions of the globe.

A short list of the datasets saved in maps includes: france, italy, nz, usa, world, and world2, along with county and state. These last two map the US at the county and state levels. To learn more about maps, run help(package = maps).

map_data()

You do not need to access the {maps} package to use its data. {ggplot2} provides the function map_data() which fetches maps from the maps package and returns them in a format that {ggplot2} can plot.

To use map_data() give it the name of a dataset to retrieve. You can retrieve a subset of the data by providing an optional region argument. For example, I can use this code to retrieve a map of Florida from state, which is the dataset that contains all 50 US states.

fl <- map_data("state", region = "florida")

ggplot(fl) +
  geom_polygon(mapping = aes(x = long, y = lat))

Alter the code to retrieve and plot your home state (Try Idaho if you are outside of the US). Notice the capitalization.

id <- map_data("state", region = "idaho")

ggplot(id) +
  geom_polygon(mapping = aes(x = long, y = lat))

Good job! This is how I collected the tx dataset. Let’s look at what would happen if you do not specify a region in map_data().

state

If you do not specify a region, map_data() will retrieve the entire data set, in this case state.

us <- map_data("state")

In practice, you will often have to retrieve an entire dataset at least once to learn what region names to use with map_data(). The names will be stored in the region column of the dataset.

Hmmm

The code below retrieves and plots the entire state data set, but something goes wrong. What?

us <- map_data("state")

ggplot(us) +
  geom_polygon(mapping = aes(x = long, y = lat))

Multiple polygons

In this case, our data is not out of order, but it contains more than one polygon: it contains 50 polygons—one for each state.

By default, geom_polygon() tries to plot a single polygon, which causes it to connect multiple polygons in weird ways.

groups

Which aesthetic can you use to plot multiple polygons? In the code below, map the group aesthetic to the group variable in the state dataset. This variable contains all of the grouping information needed to make a coherent map. Then rerun the code.

ggplot(us) +
  geom_polygon(mapping = aes(x = long, y = lat, group = group))

Good job! You’ve mastered one method for drawing maps. This approach is older, though, and far less flexible than the more modern approach, which uses the {sf} package.

{sf}, or “simple features”

The {sf} package (“sf” = “simple features”) is the new de facto system for doing anything with geographic data with R.

Fully exploring {sf} here goes beyond the scope of these tutorials, but there are a ton of resources online about it:

We’ll just explore it a little bit here.

Why {sf}?

The data that comes from map_data() contains columns for latitude and longitude, which represent geographic data. However, geographic data is a lot more complex than just x/y coordinates.

  1. The latitude and longitude coordinates in datasets like tx and us that we’ve been using are just individual rows. You could accidentally filter out some rows and remove portions of the border. It would be nice if the coordinates for a state stayed together in one object
  2. As we’ve seen, we can use geom_polygon() or geom_path() to connect the dots between latitude and longitude coordinates, but those rows must be in the correct order to work. {ggplot2} has geom_map(), which is like geom_polygon() and geom_path(), but doesn’t require that the order is correct, but it’s tricky to incorporate other data into it. Suppose you want to fill each state by some other variable, like crime rates or median income—you need to somehow get that data into the map and display in the state polygons.
  3. Any time you take a section of a globe and flatten it into a two-dimensional map, you have to decide how to project the roundness into something flat, which changes the shapes of the geographic elements of a map. You probably noticed that the maps we’ve created so far with geom_polygon() and geom_path() look squished and incorrect. We’re not using map projections to ensure they’re flattened correctly or treated as actual map coordinates.

{sf} saves the day

The {sf} package solves all these problems.

  1. Geographic coordinates in {sf}-enabled datasets aren’t stored as separate latitude and longitude columns. Instead, they’re kept in a special column named geometry. This makes it so you can have one row per unit of analysis (country, state, county, province, etc.) instead of thousands of rows.
  2. Since there’s one row per unit of analysis, it’s a lot easier to merge in other data.
  3. The geom_sf() geom uses the special geometry column to plot the coordinates.
  4. The coord_sf() layer ensures that the map uses specific projections (and lets you change to different projections) so that your maps don’t get distorted.

Let’s see how this works.

{sf}-enabled data

The us data we created with map_data() won’t work with {sf}, since it’s just a bunch of x/y coordinates. But we can load the us_states dataset from the {spData} package:

library(spData)
us_states
Simple feature collection with 49 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -124.7042 ymin: 24.55868 xmax: -66.9824 ymax: 49.38436
Geodetic CRS:  NAD83
First 10 features:
   GEOID        NAME   REGION             AREA total_pop_10 total_pop_15
1     01     Alabama    South 133709.27 [km^2]      4712651      4830620
2     04     Arizona     West 295281.25 [km^2]      6246816      6641928
3     08    Colorado     West 269573.06 [km^2]      4887061      5278906
4     09 Connecticut Norteast  12976.59 [km^2]      3545837      3593222
5     12     Florida    South 151052.01 [km^2]     18511620     19645772
6     13     Georgia    South 152725.21 [km^2]      9468815     10006693
7     16       Idaho     West 216512.66 [km^2]      1526797      1616547
8     18     Indiana  Midwest  93648.40 [km^2]      6417398      6568645
9     20      Kansas  Midwest 213037.08 [km^2]      2809329      2892987
10    22   Louisiana    South 122345.76 [km^2]      4429940      4625253
                         geometry
1  MULTIPOLYGON (((-88.20006 3...
2  MULTIPOLYGON (((-114.7196 3...
3  MULTIPOLYGON (((-109.0501 4...
4  MULTIPOLYGON (((-73.48731 4...
5  MULTIPOLYGON (((-81.81169 2...
6  MULTIPOLYGON (((-85.60516 3...
7  MULTIPOLYGON (((-116.916 45...
8  MULTIPOLYGON (((-87.52404 4...
9  MULTIPOLYGON (((-102.0517 4...
10 MULTIPOLYGON (((-92.01783 2...

Notice how it looks different from the tx and us data. There’s one row for each state, and the geographic data for each state is stored in the geometry column

geom_sf()

That geometry column is special. When we use it with geom_sf(), {ggplot2} will automatically convert it into polygons. If geometry contains points, like city coordinates, geom_sf() will plot them as points; if geometry contains lines, like roads, geom_sf() will plot them as lines.

Try running this to see what happens:

That’s really neat! We didn’t need to worry about the order of the rows in the dataset, and the map isn’t distorted.

Manipulating {sf}-enabled data

Because {sf}-enabled datasets are tibbles, all our {dplyr} verbs work on them. For example, we can use filter() to select and plot just Texas:

Plotting data on maps

R comes with a data set named USArrests that we can use to make a choropleth map. A choropleth map uses the color of each region in the plot to display some value associated with the region.

In our case we will use the UrbanPop variable of USAarrests which records how urbanized each state was in 1973. UrbanPop is the percent of the population who lived within a city.

USArrests
# A tibble: 50 × 4
   Murder Assault UrbanPop  Rape
    <dbl>   <int>    <int> <dbl>
 1   13.2     236       58  21.2
 2   10       263       48  44.5
 3    8.1     294       80  31  
 4    8.8     190       50  19.5
 5    9       276       91  40.6
 6    7.9     204       78  38.7
 7    3.3     110       77  11.1
 8    5.9     238       72  15.8
 9   15.4     335       80  31.9
10   17.4     211       60  25.8
# ℹ 40 more rows

Data wrangling

You can use geom_sf() to create choropleth maps. To use geom_sf(), we need to merge or join USArrests with a dataset that contains geographic information for each state.

We need to combine USArrests with us_states to make this work. Joining datasets goes beyond the scope of this tutorial—I’ll just show how it’s done here. We’ll use left_join() to add all the columns from USArrests to us_states wherever the two datasets share a state name.

This means the two datasets need to share a state name, and right now, this isn’t the case. us_states has a NAME column, and USArrests hides its state names outside of the dataset in the row names (instead of in a column). In contrast, us uses a column of lower case state names.

This converts the rownames in USArrests to a column named NAME and then combines it with us_states:

USArrests2 <- USArrests |> 
  rownames_to_column("NAME")

us_states_merged <- us_states |> 
  left_join(USArrests2, by = join_by(NAME))

us_states_merged
Simple feature collection with 49 features and 10 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -124.7042 ymin: 24.55868 xmax: -66.9824 ymax: 49.38436
Geodetic CRS:  NAD83
First 10 features:
   GEOID        NAME   REGION             AREA total_pop_10 total_pop_15 Murder
1     01     Alabama    South 133709.27 [km^2]      4712651      4830620   13.2
2     04     Arizona     West 295281.25 [km^2]      6246816      6641928    8.1
3     08    Colorado     West 269573.06 [km^2]      4887061      5278906    7.9
4     09 Connecticut Norteast  12976.59 [km^2]      3545837      3593222    3.3
5     12     Florida    South 151052.01 [km^2]     18511620     19645772   15.4
6     13     Georgia    South 152725.21 [km^2]      9468815     10006693   17.4
7     16       Idaho     West 216512.66 [km^2]      1526797      1616547    2.6
8     18     Indiana  Midwest  93648.40 [km^2]      6417398      6568645    7.2
9     20      Kansas  Midwest 213037.08 [km^2]      2809329      2892987    6.0
10    22   Louisiana    South 122345.76 [km^2]      4429940      4625253   15.4
   Assault UrbanPop Rape                       geometry
1      236       58 21.2 MULTIPOLYGON (((-88.20006 3...
2      294       80 31.0 MULTIPOLYGON (((-114.7196 3...
3      204       78 38.7 MULTIPOLYGON (((-109.0501 4...
4      110       77 11.1 MULTIPOLYGON (((-73.48731 4...
5      335       80 31.9 MULTIPOLYGON (((-81.81169 2...
6      211       60 25.8 MULTIPOLYGON (((-85.60516 3...
7      120       54 14.2 MULTIPOLYGON (((-116.916 45...
8      113       65 21.0 MULTIPOLYGON (((-87.52404 4...
9      115       66 18.0 MULTIPOLYGON (((-102.0517 4...
10     249       66 22.2 MULTIPOLYGON (((-92.01783 2...

We’ve now added four new columns to us_states, including UrbanPop.

geom_sf() and aesthetics

Now that we have UrbanPop in our data, we can use it as an aesthetic. Try adding aes(fill = UrbanPop) to geom_sf() here:

ggplot(us_states_merged) +
  geom_sf(aes(fill = UrbanPop))

Congratulations! You’ve used geom_map() to make your first choropleth plot! To test your understanding, alter the code to display the Murder, Assault, or Rape variables.

Projections

Taking a three-dimensional globe and flattening it to a two-dimensional surface means that map shapes will always be different, depending on which way you flatten them. You can control how things are flattened using different map projections. There are standard indexes of more than 4,000 of these projections at epsg.io.

Here are some common ones:

  • EPSG:3395: Mercator projection for the world
  • ESRI:54008: Sinusoidal projection for the world
  • ESRI:54009: Mollweide projection for the world
  • ESRI:54030: Robinson projection for the world
  • EPSG:4326: WGS 84: DOD GPS coordinates (standard −180 to 180 system)
  • EPSG:4269: NAD 83: Relatively common projection for North America
  • ESRI:102003: Albers projection specifically for the contiguous United States

Changing projections

By default, geom_sf() will display the map using whatever projection system is embedded in the geometry column. In the case of us_states and us_states_merged, this is https://en.wikipedia.org/wiki/North_American_Datum or EPSG:4269. You can tell by looking at the Geodetic CRS line near the top of the output from us_states_merged:

us_states_merged
Simple feature collection with 49 features and 10 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -124.7042 ymin: 24.55868 xmax: -66.9824 ymax: 49.38436
Geodetic CRS:  NAD83
First 10 features:
   GEOID        NAME   REGION             AREA total_pop_10 total_pop_15 Murder
1     01     Alabama    South 133709.27 [km^2]      4712651      4830620   13.2
2     04     Arizona     West 295281.25 [km^2]      6246816      6641928    8.1
3     08    Colorado     West 269573.06 [km^2]      4887061      5278906    7.9
4     09 Connecticut Norteast  12976.59 [km^2]      3545837      3593222    3.3
5     12     Florida    South 151052.01 [km^2]     18511620     19645772   15.4
6     13     Georgia    South 152725.21 [km^2]      9468815     10006693   17.4
7     16       Idaho     West 216512.66 [km^2]      1526797      1616547    2.6
8     18     Indiana  Midwest  93648.40 [km^2]      6417398      6568645    7.2
9     20      Kansas  Midwest 213037.08 [km^2]      2809329      2892987    6.0
10    22   Louisiana    South 122345.76 [km^2]      4429940      4625253   15.4
   Assault UrbanPop Rape                       geometry
1      236       58 21.2 MULTIPOLYGON (((-88.20006 3...
2      294       80 31.0 MULTIPOLYGON (((-114.7196 3...
3      204       78 38.7 MULTIPOLYGON (((-109.0501 4...
4      110       77 11.1 MULTIPOLYGON (((-73.48731 4...
5      335       80 31.9 MULTIPOLYGON (((-81.81169 2...
6      211       60 25.8 MULTIPOLYGON (((-85.60516 3...
7      120       54 14.2 MULTIPOLYGON (((-116.916 45...
8      113       65 21.0 MULTIPOLYGON (((-87.52404 4...
9      115       66 18.0 MULTIPOLYGON (((-102.0517 4...
10     249       66 22.2 MULTIPOLYGON (((-92.01783 2...

To use a different projection, set the crs argument of coord_sf() to a projection ID from epsg.io.

For example, here’s our US map with an Albers projection, which is specifically designed for North America:

Recap

You can now make all of the plots recommended in the Exploratory Data Analysis tutorial. The next tutorial in this primer will teach you several strategies for dealing with overplotting, a problem that can occur when you have large data or low resolution data.