<- map_data("state", region = "florida")
fl
ggplot(fl) +
geom_polygon(mapping = aes(x = long, y = lat))
Maps
{maps}: the older way to make maps with R
The tx
data set comes from the {maps} package, which is an R package that contains similarly formatted data sets for many regions of the globe.
A short list of the datasets saved in maps includes: france
, italy
, nz
, usa
, world
, and world2
, along with county
and state
. These last two map the US at the county and state levels. To learn more about maps, run help(package = maps)
.
map_data()
You do not need to access the {maps} package to use its data. {ggplot2} provides the function map_data()
which fetches maps from the maps package and returns them in a format that {ggplot2} can plot.
To use map_data()
give it the name of a dataset to retrieve. You can retrieve a subset of the data by providing an optional region
argument. For example, I can use this code to retrieve a map of Florida from state
, which is the dataset that contains all 50 US states.
Alter the code to retrieve and plot your home state (Try Idaho if you are outside of the US). Notice the capitalization.
<- map_data("state", region = "idaho")
id
ggplot(id) +
geom_polygon(mapping = aes(x = long, y = lat))
Good job! This is how I collected the tx
dataset. Let’s look at what would happen if you do not specify a region in map_data()
.
state
If you do not specify a region, map_data()
will retrieve the entire data set, in this case state
.
<- map_data("state") us
In practice, you will often have to retrieve an entire dataset at least once to learn what region names to use with map_data()
. The names will be stored in the region
column of the dataset.
Hmmm
The code below retrieves and plots the entire state data set, but something goes wrong. What?
<- map_data("state")
us
ggplot(us) +
geom_polygon(mapping = aes(x = long, y = lat))
Multiple polygons
In this case, our data is not out of order, but it contains more than one polygon: it contains 50 polygons—one for each state.
By default, geom_polygon()
tries to plot a single polygon, which causes it to connect multiple polygons in weird ways.
groups
Which aesthetic can you use to plot multiple polygons? In the code below, map the group
aesthetic to the group
variable in the state
dataset. This variable contains all of the grouping information needed to make a coherent map. Then rerun the code.
ggplot(us) +
geom_polygon(mapping = aes(x = long, y = lat, group = group))
Good job! You’ve mastered one method for drawing maps. This approach is older, though, and far less flexible than the more modern approach, which uses the {sf} package.
{sf}, or “simple features”
The {sf} package (“sf” = “simple features”) is the new de facto system for doing anything with geographic data with R.
Fully exploring {sf} here goes beyond the scope of these tutorials, but there are a ton of resources online about it:
- {sf} documentation
- Chapter 6 in the 3rd edition of ggplot2: Elegant Graphics for Data Analysis
- Lightning quick overview of {sf} and shapefiles
- Spatial Data Science
- Geocomputation with R
We’ll just explore it a little bit here.
Why {sf}?
The data that comes from map_data()
contains columns for latitude and longitude, which represent geographic data. However, geographic data is a lot more complex than just x/y coordinates.
- The latitude and longitude coordinates in datasets like
tx
andus
that we’ve been using are just individual rows. You could accidentally filter out some rows and remove portions of the border. It would be nice if the coordinates for a state stayed together in one object - As we’ve seen, we can use
geom_polygon()
orgeom_path()
to connect the dots between latitude and longitude coordinates, but those rows must be in the correct order to work. {ggplot2} hasgeom_map()
, which is likegeom_polygon()
andgeom_path()
, but doesn’t require that the order is correct, but it’s tricky to incorporate other data into it. Suppose you want to fill each state by some other variable, like crime rates or median income—you need to somehow get that data into the map and display in the state polygons. - Any time you take a section of a globe and flatten it into a two-dimensional map, you have to decide how to project the roundness into something flat, which changes the shapes of the geographic elements of a map. You probably noticed that the maps we’ve created so far with
geom_polygon()
andgeom_path()
look squished and incorrect. We’re not using map projections to ensure they’re flattened correctly or treated as actual map coordinates.
{sf} saves the day
The {sf} package solves all these problems.
- Geographic coordinates in {sf}-enabled datasets aren’t stored as separate latitude and longitude columns. Instead, they’re kept in a special column named
geometry
. This makes it so you can have one row per unit of analysis (country, state, county, province, etc.) instead of thousands of rows. - Since there’s one row per unit of analysis, it’s a lot easier to merge in other data.
- The
geom_sf()
geom uses the specialgeometry
column to plot the coordinates. - The
coord_sf()
layer ensures that the map uses specific projections (and lets you change to different projections) so that your maps don’t get distorted.
Let’s see how this works.
{sf}-enabled data
The us
data we created with map_data()
won’t work with {sf}, since it’s just a bunch of x/y coordinates. But we can load the us_states
dataset from the {spData} package:
library(spData)
us_states
Simple feature collection with 49 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -124.7042 ymin: 24.55868 xmax: -66.9824 ymax: 49.38436
Geodetic CRS: NAD83
First 10 features:
GEOID NAME REGION AREA total_pop_10 total_pop_15
1 01 Alabama South 133709.27 [km^2] 4712651 4830620
2 04 Arizona West 295281.25 [km^2] 6246816 6641928
3 08 Colorado West 269573.06 [km^2] 4887061 5278906
4 09 Connecticut Norteast 12976.59 [km^2] 3545837 3593222
5 12 Florida South 151052.01 [km^2] 18511620 19645772
6 13 Georgia South 152725.21 [km^2] 9468815 10006693
7 16 Idaho West 216512.66 [km^2] 1526797 1616547
8 18 Indiana Midwest 93648.40 [km^2] 6417398 6568645
9 20 Kansas Midwest 213037.08 [km^2] 2809329 2892987
10 22 Louisiana South 122345.76 [km^2] 4429940 4625253
geometry
1 MULTIPOLYGON (((-88.20006 3...
2 MULTIPOLYGON (((-114.7196 3...
3 MULTIPOLYGON (((-109.0501 4...
4 MULTIPOLYGON (((-73.48731 4...
5 MULTIPOLYGON (((-81.81169 2...
6 MULTIPOLYGON (((-85.60516 3...
7 MULTIPOLYGON (((-116.916 45...
8 MULTIPOLYGON (((-87.52404 4...
9 MULTIPOLYGON (((-102.0517 4...
10 MULTIPOLYGON (((-92.01783 2...
Notice how it looks different from the tx
and us
data. There’s one row for each state, and the geographic data for each state is stored in the geometry
column
geom_sf()
That geometry
column is special. When we use it with geom_sf()
, {ggplot2} will automatically convert it into polygons. If geometry
contains points, like city coordinates, geom_sf()
will plot them as points; if geometry
contains lines, like roads, geom_sf()
will plot them as lines.
Try running this to see what happens:
That’s really neat! We didn’t need to worry about the order of the rows in the dataset, and the map isn’t distorted.
Manipulating {sf}-enabled data
Because {sf}-enabled datasets are tibbles, all our {dplyr} verbs work on them. For example, we can use filter()
to select and plot just Texas:
Plotting data on maps
R comes with a data set named USArrests
that we can use to make a choropleth map. A choropleth map uses the color of each region in the plot to display some value associated with the region.
In our case we will use the UrbanPop
variable of USAarrests
which records how urbanized each state was in 1973. UrbanPop
is the percent of the population who lived within a city.
USArrests
# A tibble: 50 × 4
Murder Assault UrbanPop Rape
<dbl> <int> <int> <dbl>
1 13.2 236 58 21.2
2 10 263 48 44.5
3 8.1 294 80 31
4 8.8 190 50 19.5
5 9 276 91 40.6
6 7.9 204 78 38.7
7 3.3 110 77 11.1
8 5.9 238 72 15.8
9 15.4 335 80 31.9
10 17.4 211 60 25.8
# ℹ 40 more rows
Data wrangling
You can use geom_sf()
to create choropleth maps. To use geom_sf()
, we need to merge or join USArrests
with a dataset that contains geographic information for each state.
We need to combine USArrests
with us_states
to make this work. Joining datasets goes beyond the scope of this tutorial—I’ll just show how it’s done here. We’ll use left_join()
to add all the columns from USArrests
to us_states
wherever the two datasets share a state name.
This means the two datasets need to share a state name, and right now, this isn’t the case. us_states
has a NAME
column, and USArrests
hides its state names outside of the dataset in the row names (instead of in a column). In contrast, us
uses a column of lower case state names.
This converts the rownames in USArrests
to a column named NAME
and then combines it with us_states
:
<- USArrests |>
USArrests2 rownames_to_column("NAME")
<- us_states |>
us_states_merged left_join(USArrests2, by = join_by(NAME))
us_states_merged
Simple feature collection with 49 features and 10 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -124.7042 ymin: 24.55868 xmax: -66.9824 ymax: 49.38436
Geodetic CRS: NAD83
First 10 features:
GEOID NAME REGION AREA total_pop_10 total_pop_15 Murder
1 01 Alabama South 133709.27 [km^2] 4712651 4830620 13.2
2 04 Arizona West 295281.25 [km^2] 6246816 6641928 8.1
3 08 Colorado West 269573.06 [km^2] 4887061 5278906 7.9
4 09 Connecticut Norteast 12976.59 [km^2] 3545837 3593222 3.3
5 12 Florida South 151052.01 [km^2] 18511620 19645772 15.4
6 13 Georgia South 152725.21 [km^2] 9468815 10006693 17.4
7 16 Idaho West 216512.66 [km^2] 1526797 1616547 2.6
8 18 Indiana Midwest 93648.40 [km^2] 6417398 6568645 7.2
9 20 Kansas Midwest 213037.08 [km^2] 2809329 2892987 6.0
10 22 Louisiana South 122345.76 [km^2] 4429940 4625253 15.4
Assault UrbanPop Rape geometry
1 236 58 21.2 MULTIPOLYGON (((-88.20006 3...
2 294 80 31.0 MULTIPOLYGON (((-114.7196 3...
3 204 78 38.7 MULTIPOLYGON (((-109.0501 4...
4 110 77 11.1 MULTIPOLYGON (((-73.48731 4...
5 335 80 31.9 MULTIPOLYGON (((-81.81169 2...
6 211 60 25.8 MULTIPOLYGON (((-85.60516 3...
7 120 54 14.2 MULTIPOLYGON (((-116.916 45...
8 113 65 21.0 MULTIPOLYGON (((-87.52404 4...
9 115 66 18.0 MULTIPOLYGON (((-102.0517 4...
10 249 66 22.2 MULTIPOLYGON (((-92.01783 2...
We’ve now added four new columns to us_states
, including UrbanPop
.
geom_sf()
and aesthetics
Now that we have UrbanPop
in our data, we can use it as an aesthetic. Try adding aes(fill = UrbanPop)
to geom_sf()
here:
ggplot(us_states_merged) +
geom_sf(aes(fill = UrbanPop))
Congratulations! You’ve used geom_map() to make your first choropleth plot! To test your understanding, alter the code to display the Murder
, Assault
, or Rape
variables.
Projections
Taking a three-dimensional globe and flattening it to a two-dimensional surface means that map shapes will always be different, depending on which way you flatten them. You can control how things are flattened using different map projections. There are standard indexes of more than 4,000 of these projections at epsg.io.
Here are some common ones:
- EPSG:3395: Mercator projection for the world
- ESRI:54008: Sinusoidal projection for the world
- ESRI:54009: Mollweide projection for the world
- ESRI:54030: Robinson projection for the world
- EPSG:4326: WGS 84: DOD GPS coordinates (standard −180 to 180 system)
- EPSG:4269: NAD 83: Relatively common projection for North America
- ESRI:102003: Albers projection specifically for the contiguous United States
Changing projections
By default, geom_sf()
will display the map using whatever projection system is embedded in the geometry
column. In the case of us_states
and us_states_merged
, this is https://en.wikipedia.org/wiki/North_American_Datum or EPSG:4269. You can tell by looking at the Geodetic CRS
line near the top of the output from us_states_merged
:
us_states_merged
Simple feature collection with 49 features and 10 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -124.7042 ymin: 24.55868 xmax: -66.9824 ymax: 49.38436
Geodetic CRS: NAD83
First 10 features:
GEOID NAME REGION AREA total_pop_10 total_pop_15 Murder
1 01 Alabama South 133709.27 [km^2] 4712651 4830620 13.2
2 04 Arizona West 295281.25 [km^2] 6246816 6641928 8.1
3 08 Colorado West 269573.06 [km^2] 4887061 5278906 7.9
4 09 Connecticut Norteast 12976.59 [km^2] 3545837 3593222 3.3
5 12 Florida South 151052.01 [km^2] 18511620 19645772 15.4
6 13 Georgia South 152725.21 [km^2] 9468815 10006693 17.4
7 16 Idaho West 216512.66 [km^2] 1526797 1616547 2.6
8 18 Indiana Midwest 93648.40 [km^2] 6417398 6568645 7.2
9 20 Kansas Midwest 213037.08 [km^2] 2809329 2892987 6.0
10 22 Louisiana South 122345.76 [km^2] 4429940 4625253 15.4
Assault UrbanPop Rape geometry
1 236 58 21.2 MULTIPOLYGON (((-88.20006 3...
2 294 80 31.0 MULTIPOLYGON (((-114.7196 3...
3 204 78 38.7 MULTIPOLYGON (((-109.0501 4...
4 110 77 11.1 MULTIPOLYGON (((-73.48731 4...
5 335 80 31.9 MULTIPOLYGON (((-81.81169 2...
6 211 60 25.8 MULTIPOLYGON (((-85.60516 3...
7 120 54 14.2 MULTIPOLYGON (((-116.916 45...
8 113 65 21.0 MULTIPOLYGON (((-87.52404 4...
9 115 66 18.0 MULTIPOLYGON (((-102.0517 4...
10 249 66 22.2 MULTIPOLYGON (((-92.01783 2...
To use a different projection, set the crs
argument of coord_sf()
to a projection ID from epsg.io.
For example, here’s our US map with an Albers projection, which is specifically designed for North America:
Recap
You can now make all of the plots recommended in the Exploratory Data Analysis tutorial. The next tutorial in this primer will teach you several strategies for dealing with overplotting, a problem that can occur when you have large data or low resolution data.