group_by() and summarize()

group_by() takes a data frame and then the names of one or more columns in the data frame. It returns a copy of the data frame that has been “grouped” into sets of rows that share identical combinations of values in the specified columns.

group_by() in action

For example, the result below is grouped into rows that have the same combination of year and sex values: boys in 1880 are treated as one group, girls in 1880 as another group, and so on.

babynames |>
  group_by(year, sex)
# A tibble: 1,924,665 × 5
# Groups:   year, sex [276]
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
 3  1880 F     Emma       2003 0.0205
 4  1880 F     Elizabeth  1939 0.0199
 5  1880 F     Minnie     1746 0.0179
 6  1880 F     Margaret   1578 0.0162
 7  1880 F     Ida        1472 0.0151
 8  1880 F     Alice      1414 0.0145
 9  1880 F     Bertha     1320 0.0135
10  1880 F     Sarah      1288 0.0132
# ℹ 1,924,655 more rows

Using group_by()

By itself, group_by() doesn’t do much. It assigns grouping criteria that are stored as metadata alongside the original data set. If your data set is a tibble, as above, R will tell you that the data is grouped at the top of the tibble display. In all other respects, the data looks the same.
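One way to see the grouping metadata directly is sketched below. It uses a small made-up tibble (toy, not part of babynames) so that it runs without the full data set. group_vars() and n_groups() are dplyr helpers for inspecting a grouped data frame.

```r
library(dplyr)

# Hypothetical toy data standing in for babynames (values are invented)
toy <- tibble(
  year = c(1880, 1880, 1880, 1881, 1881, 1881),
  sex  = c("F", "F", "M", "F", "M", "M"),
  name = c("Mary", "Anna", "John", "Mary", "John", "William"),
  n    = c(10L, 5L, 8L, 12L, 9L, 4L)
)

grouped <- toy |>
  group_by(year, sex)

group_vars(grouped)  # names of the grouping columns: "year" "sex"
n_groups(grouped)    # number of distinct year/sex combinations: 4
nrow(grouped)        # the rows themselves are untouched: 6
```

The rows and columns are the same as in toy; only the metadata differs.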

However, when you apply a dplyr function like summarize() to grouped data, dplyr will execute the function in a groupwise manner. Instead of computing a single summary for the entire data set, dplyr will compute individual summaries for each group and return them as a single data frame. The data frame will contain the summary columns as well as the columns in the grouping criteria, which makes the result decipherable.
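As a minimal sketch of the difference, the code below summarizes a small made-up tibble (toy, a hypothetical stand-in for babynames) once without grouping and once grouped by year.

```r
library(dplyr)

# Hypothetical toy data standing in for babynames (values are invented)
toy <- tibble(
  year = c(1880, 1880, 1880, 1881, 1881, 1881),
  sex  = c("F", "F", "M", "F", "M", "M"),
  n    = c(10L, 5L, 8L, 12L, 9L, 4L)
)

# Ungrouped: a single summary row for the whole data set
overall <- toy |>
  summarize(total = sum(n))

# Grouped: one summary row per year, with the year column kept in the result
by_year <- toy |>
  group_by(year) |>
  summarize(total = sum(n))

overall  # one row, total = 48
by_year  # two rows: 1880 -> 23, 1881 -> 25
```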

(Animation: what happens when the data is grouped by one column.)

Let’s see what happens when we group with two columns:

To understand exactly what group_by() is doing, remove the line group_by(year, sex) |> from the code above and rerun it. How do the results change?

(Animation: the intuition behind grouping by two columns.)

Ungrouping 1

If you apply summarize() to grouped data, summarize() will return data that is grouped in a similar but not identical fashion. By default, summarize() removes the last variable in the grouping criteria, which creates a data frame that is grouped at a higher level. For example, this summarize() statement receives a data frame that is grouped by year and sex, but it returns a data frame that is grouped only by year.

babynames |>
  group_by(year, sex) |> 
  summarize(total = sum(n))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# A tibble: 276 × 3
# Groups:   year [138]
    year sex    total
   <dbl> <chr>  <int>
 1  1880 F      90993
 2  1880 M     110491
 3  1881 F      91953
 4  1881 M     100743
 5  1882 F     107847
 6  1882 M     113686
 7  1883 F     112319
 8  1883 M     104627
 9  1884 F     129020
10  1884 M     114442
# ℹ 266 more rows
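The message in the output above mentions the `.groups` argument. The sketch below, run on a made-up tibble (toy, hypothetical), shows two of its documented values: "drop_last", which states the default behaviour explicitly and silences the message, and "drop", which returns ungrouped data.

```r
library(dplyr)

# Hypothetical toy data standing in for babynames (values are invented)
toy <- tibble(
  year = c(1880, 1880, 1880, 1881, 1881, 1881),
  sex  = c("F", "F", "M", "F", "M", "M"),
  n    = c(10L, 5L, 8L, 12L, 9L, 4L)
)

# Same result as the default here, but stated explicitly, so no message
by_year_sex <- toy |>
  group_by(year, sex) |>
  summarize(total = sum(n), .groups = "drop_last")

group_vars(by_year_sex)  # "year"

# Drop every level of grouping in one step
dropped <- toy |>
  group_by(year, sex) |>
  summarize(total = sum(n), .groups = "drop")

group_vars(dropped)      # character(0)
```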

Ungrouping 2

If only one grouping variable is left in the grouping criteria, summarize() will return an ungrouped data set. This feature lets you progressively “unwrap” a grouped data set:

If we add another summarize() to our pipe:

  1. Our data set will first be grouped by year and sex.
  2. Then it will be summarized into a data set grouped by year (i.e., the result above).
  3. Then it will be summarized into a final data set that is not grouped.

babynames |>
  group_by(year, sex) |> 
  summarize(total = sum(n)) |> 
  summarize(total = sum(total))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# A tibble: 138 × 2
    year  total
   <dbl>  <int>
 1  1880 201484
 2  1881 192696
 3  1882 221533
 4  1883 216946
 5  1884 243462
 6  1885 240854
 7  1886 255317
 8  1887 247394
 9  1888 299473
10  1889 288946
# ℹ 128 more rows
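To check the “unwrapping” step by step, one option is to store each stage and inspect its grouping variables. This is only a sketch on a made-up tibble (toy, hypothetical); step1, step2 and step3 are illustrative names.

```r
library(dplyr)

# Hypothetical toy data standing in for babynames (values are invented)
toy <- tibble(
  year = c(1880, 1880, 1880, 1881, 1881, 1881),
  sex  = c("F", "F", "M", "F", "M", "M"),
  n    = c(10L, 5L, 8L, 12L, 9L, 4L)
)

step1 <- toy |> group_by(year, sex)                  # grouped by year and sex
step2 <- step1 |> summarize(total = sum(n))          # grouped by year
step3 <- step2 |> summarize(total = sum(total))      # not grouped

group_vars(step1)  # "year" "sex"
group_vars(step2)  # "year"
group_vars(step3)  # character(0)
```

Each summarize() peels off one level, so the number of rows shrinks from 6 to 4 to 2 in this toy example.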

Ungrouping 3

If you wish to manually remove the grouping criteria from a data set, you can do so with ungroup().

babynames |>
  group_by(year, sex) |> 
  ungroup()
# A tibble: 1,924,665 × 5
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
 3  1880 F     Emma       2003 0.0205
 4  1880 F     Elizabeth  1939 0.0199
 5  1880 F     Minnie     1746 0.0179
 6  1880 F     Margaret   1578 0.0162
 7  1880 F     Ida        1472 0.0151
 8  1880 F     Alice      1414 0.0145
 9  1880 F     Bertha     1320 0.0135
10  1880 F     Sarah      1288 0.0132
# ℹ 1,924,655 more rows
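A quick way to confirm that ungroup() removed the grouping is dplyr's is_grouped_df(). The sketch below uses a made-up tibble (toy, hypothetical) and also shows that a later summarize() then treats the data as a single set of rows.

```r
library(dplyr)

# Hypothetical toy data standing in for babynames (values are invented)
toy <- tibble(
  year = c(1880, 1880, 1880, 1881, 1881, 1881),
  sex  = c("F", "F", "M", "F", "M", "M"),
  n    = c(10L, 5L, 8L, 12L, 9L, 4L)
)

grouped <- toy |> group_by(year, sex)

is_grouped_df(grouped)           # TRUE
is_grouped_df(ungroup(grouped))  # FALSE

# After ungroup(), summarize() returns one row for the whole data set
ungrouped_total <- grouped |>
  ungroup() |>
  summarize(total = sum(n))

ungrouped_total  # one row, total = 48
```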

Ungrouping 4

You can also override the current grouping information with a new call to group_by().

babynames |>
  group_by(year, sex) |> 
  group_by(name)
# A tibble: 1,924,665 × 5
# Groups:   name [97,310]
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
 3  1880 F     Emma       2003 0.0205
 4  1880 F     Elizabeth  1939 0.0199
 5  1880 F     Minnie     1746 0.0179
 6  1880 F     Margaret   1578 0.0162
 7  1880 F     Ida        1472 0.0151
 8  1880 F     Alice      1414 0.0145
 9  1880 F     Bertha     1320 0.0135
10  1880 F     Sarah      1288 0.0132
# ℹ 1,924,655 more rows
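If you want to add to the existing grouping rather than replace it, group_by() has an `.add` argument. The sketch below assumes dplyr 1.0.0 or later and uses a made-up tibble (toy, hypothetical) to contrast the two behaviours.

```r
library(dplyr)

# Hypothetical toy data standing in for babynames (values are invented)
toy <- tibble(
  year = c(1880, 1880, 1880, 1881, 1881, 1881),
  sex  = c("F", "F", "M", "F", "M", "M"),
  name = c("Mary", "Anna", "John", "Mary", "John", "William"),
  n    = c(10L, 5L, 8L, 12L, 9L, 4L)
)

# Default: the second group_by() replaces the first
replaced <- toy |>
  group_by(year, sex) |>
  group_by(name)

# With .add = TRUE: the new column is appended to the existing grouping
added <- toy |>
  group_by(year, sex) |>
  group_by(name, .add = TRUE)

group_vars(replaced)  # "name"
group_vars(added)     # "year" "sex" "name"
```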

That’s it. Between group_by(), summarize(), and ungroup(), you have a toolkit for taking groupwise summaries of your data at various levels of grouping.

