group_by() takes a data frame and then the names of one or more columns in the data frame. It returns a copy of the data frame that has been “grouped” into sets of rows that share identical combinations of values in the specified columns.
group_by() in action
For example, the result below is grouped into rows that have the same combination of year and sex values: boys in 1880 are treated as one group, girls in 1880 as another group and so on.
babynames |>
  group_by(year, sex)
# A tibble: 1,924,665 × 5
# Groups:   year, sex [276]
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
 3  1880 F     Emma       2003 0.0205
 4  1880 F     Elizabeth  1939 0.0199
 5  1880 F     Minnie     1746 0.0179
 6  1880 F     Margaret   1578 0.0162
 7  1880 F     Ida        1472 0.0151
 8  1880 F     Alice      1414 0.0145
 9  1880 F     Bertha     1320 0.0135
10  1880 F     Sarah      1288 0.0132
# ℹ 1,924,655 more rows
Using group_by()
By itself, group_by() doesn’t do much. It assigns grouping criteria that are stored as metadata alongside the original data set. If your data set is a tibble, as above, R will tell you that the data is grouped at the top of the tibble display. In all other respects, the data looks the same.
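To see that grouping only adds metadata, you can inspect a grouped tibble directly. The sketch below uses a small hypothetical data set called toy (not babynames) with the same kind of columns, and assumes {dplyr} is installed.

```r
library(dplyr)

# Hypothetical toy data, shaped like babynames but much smaller
toy <- tibble(
  year = c(1880, 1880, 1880, 1881, 1881, 1881),
  sex  = c("F", "F", "M", "F", "M", "M"),
  name = c("Mary", "Anna", "John", "Mary", "John", "William"),
  n    = c(10L, 5L, 8L, 12L, 9L, 7L)
)

grouped <- toy |>
  group_by(year, sex)

is_grouped_df(grouped)       # TRUE: the tibble now carries grouping metadata
group_vars(grouped)          # "year" "sex"
n_groups(grouped)            # 4 distinct year/sex combinations

# The rows and values themselves are untouched
nrow(grouped) == nrow(toy)   # TRUE
identical(grouped$n, toy$n)  # TRUE
```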
However, when you apply a {dplyr} function like summarize() to grouped data, {dplyr} will execute the function in a groupwise manner. Instead of computing a single summary for the entire data set, {dplyr} will compute individual summaries for each group and return them as a single data frame. The data frame will contain the summary columns as well as the columns in the grouping criteria, which makes the result decipherable.
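Here is a minimal sketch of that difference on a small hypothetical data set (toy, invented for illustration). The same summarize() call is run once on ungrouped data and once after group_by(year, sex).

```r
library(dplyr)

# Hypothetical toy data, shaped like babynames but much smaller
toy <- tibble(
  year = c(1880, 1880, 1880, 1881, 1881, 1881),
  sex  = c("F", "F", "M", "F", "M", "M"),
  name = c("Mary", "Anna", "John", "Mary", "John", "William"),
  n    = c(10L, 5L, 8L, 12L, 9L, 7L)
)

# Ungrouped: a single summary row for the whole data set
overall <- toy |>
  summarize(total = sum(n))

# Grouped: one summary row per combination of year and sex,
# with the grouping columns kept alongside the summary column
by_group <- toy |>
  group_by(year, sex) |>
  summarize(total = sum(n))

overall
by_group
```

With this toy data, overall has one row and by_group has four rows, one for each year and sex pair.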
Watch these animations to see what happens with one group:
To understand exactly what group_by() is doing, remove the line group_by(year, sex) |> from the code above and rerun it. How do the results change?
This animation should help with the intuition of grouping by two columns:
Ungrouping 1
If you apply summarize() to grouped data, summarize() will return data that is grouped in a similar, but not identical, fashion. summarize() will remove the last variable in the grouping criteria, which creates a data frame that is grouped at a higher level. For example, the result below was computed from a data frame that is grouped by year and sex, but it is grouped only by year.
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# A tibble: 276 × 3
# Groups:   year [138]
    year sex    total
   <dbl> <chr>  <int>
 1  1880 F      90993
 2  1880 M     110491
 3  1881 F      91953
 4  1881 M     100743
 5  1882 F     107847
 6  1882 M     113686
 7  1883 F     112319
 8  1883 M     104627
 9  1884 F     129020
10  1884 M     114442
# ℹ 266 more rows
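The message above mentions the `.groups` argument of summarize(). The sketch below, on a hypothetical toy data set, checks which grouping variables remain under the default behavior and under two other `.groups` values. It assumes a version of {dplyr} that supports `.groups` (1.0.0 or later).

```r
library(dplyr)

# Hypothetical toy data, shaped like babynames but much smaller
toy <- tibble(
  year = c(1880, 1880, 1880, 1881, 1881, 1881),
  sex  = c("F", "F", "M", "F", "M", "M"),
  name = c("Mary", "Anna", "John", "Mary", "John", "William"),
  n    = c(10L, 5L, 8L, 12L, 9L, 7L)
)

by_year_sex <- toy |>
  group_by(year, sex)

# Default: the last grouping variable (sex) is dropped
default_result <- by_year_sex |>
  summarize(total = sum(n))
group_vars(default_result)  # "year"

# .groups = "drop": the result is not grouped at all
dropped <- by_year_sex |>
  summarize(total = sum(n), .groups = "drop")
group_vars(dropped)         # character(0)

# .groups = "keep": the result keeps the original grouping
kept <- by_year_sex |>
  summarize(total = sum(n), .groups = "keep")
group_vars(kept)            # "year" "sex"
```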
Ungrouping 2
If only one grouping variable is left in the grouping criteria, summarize() will return an ungrouped data set. This feature lets you progressively “unwrap” a grouped data set:
If we add another summarize() to our pipe:
Our data set will first be grouped by year and sex.
Then it will be summarized into a data set grouped by year (i.e., the result above).
Then it will be summarized into a final data set that is not grouped.
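The steps above can be sketched on a small hypothetical data set (toy, invented for illustration). Each summarize() call peels off one level of grouping.

```r
library(dplyr)

# Hypothetical toy data, shaped like babynames but much smaller
toy <- tibble(
  year = c(1880, 1880, 1880, 1881, 1881, 1881),
  sex  = c("F", "F", "M", "F", "M", "M"),
  name = c("Mary", "Anna", "John", "Mary", "John", "William"),
  n    = c(10L, 5L, 8L, 12L, 9L, 7L)
)

# First summarize(): one row per year and sex, still grouped by year
step1 <- toy |>
  group_by(year, sex) |>
  summarize(total = sum(n))
group_vars(step1)  # "year"

# Second summarize(): one row per year, no longer grouped
step2 <- step1 |>
  summarize(total = sum(total))
group_vars(step2)  # character(0)

# A further summarize() on ungrouped data collapses everything to one row
step3 <- step2 |>
  summarize(total = sum(total))
```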
If you wish to manually remove the grouping criteria from a data set, you can do so with ungroup().
babynames |>
  group_by(year, sex) |>
  ungroup()
# A tibble: 1,924,665 × 5
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
 3  1880 F     Emma       2003 0.0205
 4  1880 F     Elizabeth  1939 0.0199
 5  1880 F     Minnie     1746 0.0179
 6  1880 F     Margaret   1578 0.0162
 7  1880 F     Ida        1472 0.0151
 8  1880 F     Alice      1414 0.0145
 9  1880 F     Bertha     1320 0.0135
10  1880 F     Sarah      1288 0.0132
# ℹ 1,924,655 more rows
Ungrouping 3
You can also override the current grouping information with a new call to group_by().
babynames |>
  group_by(year, sex) |>
  group_by(name)
# A tibble: 1,924,665 × 5
# Groups:   name [97,310]
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
 3  1880 F     Emma       2003 0.0205
 4  1880 F     Elizabeth  1939 0.0199
 5  1880 F     Minnie     1746 0.0179
 6  1880 F     Margaret   1578 0.0162
 7  1880 F     Ida        1472 0.0151
 8  1880 F     Alice      1414 0.0145
 9  1880 F     Bertha     1320 0.0135
10  1880 F     Sarah      1288 0.0132
# ℹ 1,924,655 more rows
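If you would rather add to the existing grouping than replace it, group_by() also has an `.add` argument. The sketch below compares the two behaviors on a hypothetical toy data set, assuming {dplyr} 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data, shaped like babynames but much smaller
toy <- tibble(
  year = c(1880, 1880, 1880, 1881, 1881, 1881),
  sex  = c("F", "F", "M", "F", "M", "M"),
  name = c("Mary", "Anna", "John", "Mary", "John", "William"),
  n    = c(10L, 5L, 8L, 12L, 9L, 7L)
)

# Default: the second group_by() replaces the first grouping
regrouped <- toy |>
  group_by(year, sex) |>
  group_by(name)
group_vars(regrouped)  # "name"

# .add = TRUE: the new variable is appended to the existing grouping
added <- toy |>
  group_by(year, sex) |>
  group_by(name, .add = TRUE)
group_vars(added)      # "year" "sex" "name"
```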
That’s it. Between group_by(), summarize(), and ungroup(), you have a toolkit for taking groupwise summaries of your data at various levels of grouping.
The most popular names by total children
You now know enough to calculate the most popular names by total children (it may take some strategizing, but you can do it!).
In the code chunk below, use group_by(), summarize(), and arrange() to display the ten most popular names. Compute popularity as the total number of children of a single sex given a name. In other words, the total number of boys named “Kelly” should be computed separately from the total number of girls named “Kelly”.
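One possible strategy, sketched on a hypothetical toy data set rather than babynames: group by name and sex so that each name and sex pair gets its own total, then sort the totals in descending order. The names toy and popular are made up for this illustration.

```r
library(dplyr)

# Hypothetical toy data in which "Kelly" appears for both sexes
toy <- tibble(
  year = c(1880, 1880, 1880, 1881, 1881, 1881),
  sex  = c("F", "M", "M", "F", "M", "F"),
  name = c("Kelly", "Kelly", "John", "Kelly", "John", "Mary"),
  n    = c(3L, 2L, 8L, 4L, 9L, 12L)
)

popular <- toy |>
  group_by(name, sex) |>                        # Kelly (F) and Kelly (M) are separate groups
  summarize(total = sum(n), .groups = "drop") |>
  arrange(desc(total)) |>                       # largest totals first
  head(10)                                      # keep at most ten rows

popular
```

In this toy data, girls named Kelly total 7 and boys named Kelly total 2, so the two appear as separate rows.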
The history of the most popular names by total children
Let’s examine how the popularity of popular names has changed over time. To help us, I’ve made top_10, which is a version of babynames that is trimmed down to just the ten most popular names from above.
top_10
# A tibble: 1,380 × 5
    year sex   name        n    prop
   <dbl> <chr> <chr>   <int>   <dbl>
 1  1880 F     Mary     7065 0.0724
 2  1880 M     John     9655 0.0815
 3  1880 M     William  9532 0.0805
 4  1880 M     James    5927 0.0501
 5  1880 M     Charles  5348 0.0452
 6  1880 M     Joseph   2632 0.0222
 7  1880 M     Robert   2415 0.0204
 8  1880 M     David     869 0.00734
 9  1880 M     Richard   728 0.00615
10  1880 M     Michael   354 0.00299
# ℹ 1,370 more rows
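The code that built top_10 is not shown here. As a rough sketch of how a table like it could be made, the example below finds the top name and sex pairs in a hypothetical toy data set and then keeps only the matching rows with semi_join(). The names toy, top_names, and top_2 are invented for this illustration, and the approach is an assumption rather than the method actually used.

```r
library(dplyr)

# Hypothetical toy data, shaped like babynames but much smaller
toy <- tibble(
  year = c(1880, 1880, 1880, 1881, 1881, 1881),
  sex  = c("F", "M", "M", "F", "M", "F"),
  name = c("Kelly", "Kelly", "John", "Kelly", "John", "Mary"),
  n    = c(3L, 2L, 8L, 4L, 9L, 12L)
)

# Find the two name/sex pairs with the largest totals
top_names <- toy |>
  group_by(name, sex) |>
  summarize(total = sum(n), .groups = "drop") |>
  slice_max(total, n = 2)

# Keep only the rows of toy whose name and sex appear in top_names
top_2 <- toy |>
  semi_join(top_names, by = c("name", "sex"))

top_2
```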
Exercise: Proportions for popular names
Use the code block below to plot a line graph of prop vs year for each name in top_10. Be sure to color the lines by name to make the graph interpretable.
top_10 |>
  ggplot() +
  geom_line(aes(x = year, y = prop, color = name))
Exercise: Total children for popular names
Now use top_10 to plot n vs year for each of the names. How are the plots different? Why might that be? How does this affect our decision to use total children as a measure of popularity?