`group_by()` and `summarize()`

group_by() takes a data frame and then the names of one or more columns in the data frame. It returns a copy of the data frame that has been “grouped” into sets of rows that share identical combinations of values in the specified columns.

`group_by()` in action

For example, the result below is grouped into rows that have the same combination of year and sex values: boys in 1880 are treated as one group, girls in 1880 as another group and so on.

babynames |>
  group_by(year, sex)

# A tibble: 1,924,665 × 5
# Groups:   year, sex [276]
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
 3  1880 F     Emma       2003 0.0205
 4  1880 F     Elizabeth  1939 0.0199
 5  1880 F     Minnie     1746 0.0179
 6  1880 F     Margaret   1578 0.0162
 7  1880 F     Ida        1472 0.0151
 8  1880 F     Alice      1414 0.0145
 9  1880 F     Bertha     1320 0.0135
10  1880 F     Sarah      1288 0.0132
# ℹ 1,924,655 more rows

Using `group_by()`

By itself, group_by() doesn’t do much. It assigns grouping criteria that are stored as metadata alongside the original data set. If your data set is a tibble, as above, R will tell you at the top of the tibble display that the data is grouped. In all other respects, the data looks the same.
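
To make the metadata concrete, here is a minimal sketch on a small made-up tibble. The `toy` data is hypothetical and only stands in for babynames. The sketch uses dplyr's group_vars(), n_groups(), and is_grouped_df() helpers to inspect the grouping, then checks that the columns and values themselves are untouched.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration (not from babynames)
toy <- tibble(
  year = c(2000, 2000, 2000, 2001, 2001, 2001),
  sex  = c("F", "F", "M", "F", "M", "M"),
  name = c("Ann", "Bea", "Cal", "Ann", "Cal", "Dan"),
  n    = c(10L, 20L, 30L, 40L, 50L, 60L)
)

grouped <- toy |>
  group_by(year, sex)

# The grouping is stored as metadata that these helpers can read
group_vars(grouped)     # names of the grouping columns
n_groups(grouped)       # number of distinct year/sex combinations
is_grouped_df(grouped)  # TRUE for grouped data, FALSE for `toy`

# The data itself is unchanged: same column names, same values
identical(names(grouped), names(toy))
identical(grouped$n, toy$n)
```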

However, when you apply a dplyr function like summarize() to grouped data, dplyr will execute the function in a groupwise manner. Instead of computing a single summary for the entire data set, dplyr will compute individual summaries for each group and return them as a single data frame. The data frame will contain the summary columns as well as the columns in the grouping criteria, which makes the result easy to interpret.
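
As a rough illustration of groupwise execution, the sketch below runs the same summarize() call on ungrouped and grouped versions of a small made-up tibble. The `toy` data and the summary column names `total` and `largest` are assumptions chosen for the example.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration (not from babynames)
toy <- tibble(
  year = c(2000, 2000, 2000, 2001, 2001, 2001),
  sex  = c("F", "F", "M", "F", "M", "M"),
  name = c("Ann", "Bea", "Cal", "Ann", "Cal", "Dan"),
  n    = c(10L, 20L, 30L, 40L, 50L, 60L)
)

# Ungrouped: a single summary row for the entire data set
overall <- toy |>
  summarize(total = sum(n), largest = max(n))

# Grouped by one column: one summary row per year,
# with the grouping column `year` included in the result
by_year <- toy |>
  group_by(year) |>
  summarize(total = sum(n), largest = max(n))

overall
by_year
```

Note that the grouped result carries the `year` column even though summarize() never mentions it.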

Two animations show what happens when you group by a single column: group_by(cat1) and group_by(cat2).

Let’s see what happens when we group with two columns:

Interactive editor

To understand exactly what group_by() is doing, remove the line group_by(year, sex) |> from the code above and rerun it. How do the results change?
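
If you cannot run the editor, the same comparison can be sketched on a small made-up tibble. The `toy` data below is hypothetical, and the summarize() call is an assumed stand-in for the editor's code rather than a copy of it.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration (not from babynames)
toy <- tibble(
  year = c(2000, 2000, 2000, 2001, 2001, 2001),
  sex  = c("F", "F", "M", "F", "M", "M"),
  name = c("Ann", "Bea", "Cal", "Ann", "Cal", "Dan"),
  n    = c(10L, 20L, 30L, 40L, 50L, 60L)
)

# With the group_by() line: one total per year/sex combination
with_groups <- toy |>
  group_by(year, sex) |>
  summarize(total = sum(n))

# Without the group_by() line: a single total for all rows
without_groups <- toy |>
  summarize(total = sum(n))

with_groups
without_groups
```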

An animation illustrates the intuition behind grouping by two columns.

Ungrouping 1

If you apply summarize() to grouped data, summarize() will return data that is grouped in a similar, but not identical fashion. summarize() will remove the last variable in the grouping criteria, which creates a data frame that is grouped at a higher level. For example, this summarize() statement receives a data frame that is grouped by year and sex, but it returns a data frame that is grouped only by year.

babynames |>
  group_by(year, sex) |> 
  summarize(total = sum(n))

`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

# A tibble: 276 × 3
# Groups:   year [138]
    year sex    total
   <dbl> <chr>  <int>
 1  1880 F      90993
 2  1880 M     110491
 3  1881 F      91953
 4  1881 M     100743
 5  1882 F     107847
 6  1882 M     113686
 7  1883 F     112319
 8  1883 M     104627
 9  1884 F     129020
10  1884 M     114442
# ℹ 266 more rows
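
The message above mentions the `.groups` argument of summarize(). The sketch below, on a small made-up tibble (`toy` is hypothetical), compares three of its documented values: "drop_last" (the default behavior described here, stated explicitly to silence the message), "drop", and "keep".

```r
library(dplyr)

# Hypothetical toy data, used only for illustration (not from babynames)
toy <- tibble(
  year = c(2000, 2000, 2000, 2001, 2001, 2001),
  sex  = c("F", "F", "M", "F", "M", "M"),
  name = c("Ann", "Bea", "Cal", "Ann", "Cal", "Dan"),
  n    = c(10L, 20L, 30L, 40L, 50L, 60L)
)

by_year_sex <- toy |>
  group_by(year, sex)

# Drop only the last grouping variable (sex); result is grouped by year
drop_last <- by_year_sex |>
  summarize(total = sum(n), .groups = "drop_last")

# Drop all grouping; result is an ungrouped tibble
dropped <- by_year_sex |>
  summarize(total = sum(n), .groups = "drop")

# Keep the original grouping; result is still grouped by year and sex
kept <- by_year_sex |>
  summarize(total = sum(n), .groups = "keep")

group_vars(drop_last)
group_vars(dropped)
group_vars(kept)
```

The summary values are the same in all three results; only the grouping metadata differs.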

Ungrouping 2

If only one grouping variable is left in the grouping criteria, summarize() will return an ungrouped data set. This feature lets you progressively “unwrap” a grouped data set.

If we add another summarize() to our pipe:

Our data set will first be grouped by year and sex.
Then it will be summarized into a data set grouped by year (i.e., the result above).
Then it will be summarized into a final data set that is not grouped.

babynames |>
  group_by(year, sex) |> 
  summarize(total = sum(n)) |> 
  summarize(total = sum(total))

`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

# A tibble: 138 × 2
    year  total
   <dbl>  <int>
 1  1880 201484
 2  1881 192696
 3  1882 221533
 4  1883 216946
 5  1884 243462
 6  1885 240854
 7  1886 255317
 8  1887 247394
 9  1888 299473
10  1889 288946
# ℹ 128 more rows
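
To trace the unwrapping one step at a time, the sketch below stores each intermediate result from a small made-up tibble (`toy` is hypothetical) and adds a third summarize(), which collapses the ungrouped data to a single row. The names `step1`, `step2`, and `step3` are introduced only for this example.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration (not from babynames)
toy <- tibble(
  year = c(2000, 2000, 2000, 2001, 2001, 2001),
  sex  = c("F", "F", "M", "F", "M", "M"),
  name = c("Ann", "Bea", "Cal", "Ann", "Cal", "Dan"),
  n    = c(10L, 20L, 30L, 40L, 50L, 60L)
)

# Step 1: grouped by year and sex, summarized to one row per year/sex;
# the result is grouped by year only
step1 <- toy |>
  group_by(year, sex) |>
  summarize(total = sum(n), .groups = "drop_last")

# Step 2: one row per year; the result is no longer grouped
step2 <- step1 |>
  summarize(total = sum(total))

# Step 3: summarizing ungrouped data yields a single grand total
step3 <- step2 |>
  summarize(total = sum(total))

group_vars(step1)
group_vars(step2)
step3
```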

Ungrouping 3

If you wish to manually remove the grouping criteria from a data set, you can do so with ungroup().

babynames |>
  group_by(year, sex) |> 
  ungroup()

# A tibble: 1,924,665 × 5
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
 3  1880 F     Emma       2003 0.0205
 4  1880 F     Elizabeth  1939 0.0199
 5  1880 F     Minnie     1746 0.0179
 6  1880 F     Margaret   1578 0.0162
 7  1880 F     Ida        1472 0.0151
 8  1880 F     Alice      1414 0.0145
 9  1880 F     Bertha     1320 0.0135
10  1880 F     Sarah      1288 0.0132
# ℹ 1,924,655 more rows
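
One situation where ungroup() matters is after a summarize() that leaves some grouping behind. The sketch below uses a small made-up tibble (`toy` is hypothetical) to compare a follow-up summary with and without ungroup(). The column name `grand` is an assumption for the example.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration (not from babynames)
toy <- tibble(
  year = c(2000, 2000, 2000, 2001, 2001, 2001),
  sex  = c("F", "F", "M", "F", "M", "M"),
  name = c("Ann", "Bea", "Cal", "Ann", "Cal", "Dan"),
  n    = c(10L, 20L, 30L, 40L, 50L, 60L)
)

# Totals per year/sex; this result is still grouped by year
totals <- toy |>
  group_by(year, sex) |>
  summarize(total = sum(n), .groups = "drop_last")

# Still grouped: the follow-up summary is computed once per year
per_year <- totals |>
  summarize(grand = sum(total))

# Ungrouped first: the follow-up summary covers every row at once
overall <- totals |>
  ungroup() |>
  summarize(grand = sum(total))

per_year
overall
```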

Ungrouping 4

You can also override the current grouping information with a new call to group_by().

babynames |>
  group_by(year, sex) |> 
  group_by(name)

# A tibble: 1,924,665 × 5
# Groups:   name [97,310]
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
 3  1880 F     Emma       2003 0.0205
 4  1880 F     Elizabeth  1939 0.0199
 5  1880 F     Minnie     1746 0.0179
 6  1880 F     Margaret   1578 0.0162
 7  1880 F     Ida        1472 0.0151
 8  1880 F     Alice      1414 0.0145
 9  1880 F     Bertha     1320 0.0135
10  1880 F     Sarah      1288 0.0132
# ℹ 1,924,655 more rows
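
If you want to extend the existing grouping instead of replacing it, group_by() has an `.add` argument. The sketch below contrasts the two behaviors on a small made-up tibble (`toy` is hypothetical); it assumes a dplyr version of 1.0.0 or later, where the argument has this name.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration (not from babynames)
toy <- tibble(
  year = c(2000, 2000, 2000, 2001, 2001, 2001),
  sex  = c("F", "F", "M", "F", "M", "M"),
  name = c("Ann", "Bea", "Cal", "Ann", "Cal", "Dan"),
  n    = c(10L, 20L, 30L, 40L, 50L, 60L)
)

# Default: the second group_by() replaces the first grouping
replaced <- toy |>
  group_by(year, sex) |>
  group_by(name)

# With .add = TRUE: the new variable is appended to the existing grouping
added <- toy |>
  group_by(year, sex) |>
  group_by(name, .add = TRUE)

group_vars(replaced)
group_vars(added)
```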

That’s it. Between group_by(), summarize(), and ungroup(), you have a toolkit for taking groupwise summaries of your data at various levels of grouping.

The most popular names by total children

You now know enough to calculate the most popular names by total children (it may take some strategizing, but you can do it!).

In the code chunk below, use group_by(), summarize(), and arrange() to display the ten most popular names. Compute popularity as the total number of children of a single sex who were given a name. In other words, the total number of boys named “Kelly” should be computed separately from the total number of girls named “Kelly”.

Solution

babynames |>
  group_by(name, sex) |> 
  summarize(total = sum(n)) |> 
  arrange(desc(total))
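
The solution relies on the tibble display showing only the first ten rows. If you want a result that contains just the top rows, one possible variation is sketched below on a small made-up tibble (`toy` is hypothetical, and it keeps the top two names rather than ten because the data is tiny). It drops the leftover grouping with `.groups = "drop"` before calling slice_head(), because slice_head() works group by group on grouped data.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration (not from babynames)
toy <- tibble(
  year = c(2000, 2000, 2000, 2001, 2001, 2001),
  sex  = c("F", "F", "M", "F", "M", "M"),
  name = c("Ann", "Bea", "Cal", "Ann", "Cal", "Dan"),
  n    = c(10L, 20L, 30L, 40L, 50L, 60L)
)

top_names <- toy |>
  group_by(name, sex) |>
  # Total children per name/sex pair; drop all grouping afterwards
  summarize(total = sum(n), .groups = "drop") |>
  # Largest totals first
  arrange(desc(total)) |>
  # Keep only the first two rows of the sorted, ungrouped result
  slice_head(n = 2)

top_names
```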

The history of the most popular names by total children

Let’s examine how the popularity of popular names has changed over time. To help us, I’ve made top_10, which is a version of babynames that is trimmed down to just the ten most popular names from above.

top_10

# A tibble: 1,380 × 5
    year sex   name        n    prop
   <dbl> <chr> <chr>   <int>   <dbl>
 1  1880 F     Mary     7065 0.0724 
 2  1880 M     John     9655 0.0815 
 3  1880 M     William  9532 0.0805 
 4  1880 M     James    5927 0.0501 
 5  1880 M     Charles  5348 0.0452 
 6  1880 M     Joseph   2632 0.0222 
 7  1880 M     Robert   2415 0.0204 
 8  1880 M     David     869 0.00734
 9  1880 M     Richard   728 0.00615
10  1880 M     Michael   354 0.00299
# ℹ 1,370 more rows

Exercise: Proportions for popular names

Use the code block below to plot a line graph of prop vs year for each name in top_10. Be sure to color the lines by name to make the graph interpretable.

Solution

top_10 |>
  ggplot() +
    geom_line(aes(x = year, y = prop, color = name))

Exercise: Total children for popular names

Now use top_10 to plot n vs year for each of the names. How are the plots different? Why might that be? How does this affect our decision to use total children as a measure of popularity?

Solution

top_10 |>
  ggplot() +
    geom_line(aes(x = year, y = n, color = name))

Good job! This graph shows different trends than the one above. Now let’s consider why.
