cases
# A tibble: 3 × 4
Country `2011` `2012` `2013`
<chr> <dbl> <dbl> <dbl>
1 FR 7000 6900 7000
2 DE 5800 6000 6200
3 US 15000 14000 13000
“Tidy data sets are all alike; but every messy data set is messy in its own way.”
— Hadley Wickham
How you tidy an untidy data set will depend on the initial configuration of the data. For example, consider the cases
data set below.
# A tibble: 3 × 4
Country `2011` `2012` `2013`
<chr> <dbl> <dbl> <dbl>
1 FR 7000 6900 7000
2 DE 5800 6000 6200
3 US 15000 14000 13000
What are the variables in cases
?
cases
This video uses older function names:
gather()
is now pivot_longer()
spread()
is now pivot_wider()
Watch this video (you’ll need to click on the link; the video’s privacy settings make it so it can’t be embedded here)
pivot_longer()
You can use the pivot_longer()
function in the {tidyr} package to convert wide data to long data. Notice that pivot_longer()
returns a tidy copy of the dataset, but does not alter the original dataset. If you wish to use this copy later, you’ll need to save it somewhere.
# A tibble: 9 × 3
Country year n
<chr> <chr> <dbl>
1 FR 2011 7000
2 FR 2012 6900
3 FR 2013 7000
4 DE 2011 5800
5 DE 2012 6000
6 DE 2013 6200
7 US 2011 15000
8 US 2012 14000
9 US 2013 13000
Let’s take a closer look at the pivot_longer()
syntax.
pivot_longer()
syntaxHere’s the same call written without the pipe operator, which makes the syntax easier to see.
To use pivot_longer()
, pass it the name of a data set to reshape followed by which columns to pivot longer, the name of a new variable that will contain on the names of these columns as values, and the name of another new variables that will contain the values from these columns:
cols
argument contains the name of the columns to pivot into longer format.names_to
argument is a string specifying the name of the new column to create from the data stored in the column names of the dataset to be reshaped.values_to
argument is a string specifying the name of the new column to create from the data stored in cell values.Any unspecified columns will remain in the dataset, their contents repeated as often as necessary to duplicate each relationship in the original untidy data set.
In a tidy data set, you will find variable names in the column names of the data set. The values will appear in the cells of the columns. Here we organize the year information originally stored across multiple columns in the dataset into a single column called year. This arrangement reduces duplication.
Sometimes you will also find pairs of names and values listed beside each other in two separate columns, as in table2
. Here the type
column lists the names that are associated with the count
column. This layout is sometimes called “narrow” data.
The pivot functions rely on this notion of names and values to describe what should go where. In pivot_longer()
the names_to
argument describes the new column that contains the values that previously appeared where a tidy data frame’s variable names would go. The values_to
argument describes the new column that contains the values that previously appeared in the value positions, e.g. in cells.
table4a
Now that you’ve seen pivot_longer()
in action, try using it to tidy table4a
:
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
The result should contain three columns: country
, year
, and cases
. Begin by modifying our code below.
Good job!
So far, we have listed which columns to pivot by naming them one at a time and combining them using the c()
function, but this isn’t necessary. pivot_longer()
also recognizes column names as well as all of the select()
helpers that you learned about in Isolating Data with {dplyr}. So, for example, these expressions would all do the same thing:
Notice that 1999 and 2000 are numbers. When you directly call column names that are numbers, you need to surround the names with backticks (otherwise pivot_longer()
would think you mean the 1999th and 2000th columns). Use ?select_helpers
to open a help page that lists the select helpers.
table4b
Use pivot_longer()
and the -
helper to tidy table4b
into a dataset with three columns: country
, year
, and population
.
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 19987071 20595360
2 Brazil 172006362 174504898
3 China 1272915272 1280428583
Good job! Together the tidy versions of table4a
and table4b
repeat the information in table1.
In Join Data Sets you will learn how to combine them back into a single data set.
If you looked closely at your results in the previous exercises, you may have noticed something odd: the new year
column is a character vector. You can tell because R displays <chr>
beneath the column name. The names_transform
and values_transform
arguments take a list of functions that define how to transform the newly created column defining the desired type of each newly created column.
For example, names_transform = list(week = as.integer)
would convert a character variable called week
to an integer.
Good Job! Now <int>
appears under the year column, which means that R has stored the years as integers instead of character strings. Integers are one of R’s two numeric data types, along with doubles.
pivot_longer()
cases
, table4a
, and table4b
are all rectangular tables:
Rectangular tables are a simple form of wide data. But you will also encounter more complicated examples of wide data. For example, it is common for researchers to place one subject per row. In this case, you might see several columns of identifying information followed by a set of columns that list repeated measurements of the same variable. cases2
emulates such a data set.
# A tibble: 3 × 6
city country continent `2011` `2012` `2013`
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Paris FR Europe 7000 6900 7000
2 Berlin DE Europe 5800 6000 6200
3 Chicago US North America 15000 14000 13000
To tidy this data, you would want to keep the first three columns as they are. Can you tidy this data with pivot_longer()
? Yes, and you already know how. Think about the problem and then tidy cases2
into a data set with five columns: city
, country
, continent
, year
(as an integer), and cases
.
Great job! Now let’s look at how to tidy another common type of untidy data.