summarize()

summarize() takes a data frame and uses it to calculate a new data frame of summary statistics.

Syntax

To use summarize(), pass it a data frame and then one or more named arguments. Each named argument should be set to an R expression that generates a single value. summarize() will turn each named argument into a column in the new data frame: the name of the argument becomes the column name, and the value returned by its expression becomes the column contents.

Importantly, the summarize() function is destructive: it collapses a data set into a single row and throws away any columns that we don’t use when summarizing. The original data frame is not modified; only the result that summarize() returns is collapsed.
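As a minimal sketch of the syntax described above, the example below runs summarize() on a small made-up tibble. The toy data frame and its columns are invented for illustration and are not part of babynames; the sketch assumes {dplyr} is installed.

```r
library(dplyr)

# Toy data invented for illustration; not taken from babynames
toy <- tibble(
  name = c("Ann", "Ann", "Bob"),
  year = c(2000, 2001, 2000),
  n    = c(10L, 20L, 5L)
)

# Each named argument becomes one column of the one-row result
result <- summarize(toy, total = sum(n), biggest = max(n))
result

# name and year were not used, so they do not appear in the result;
# toy itself still has all three columns
names(result)
names(toy)
```

Because the columns name and year are not mentioned in any argument, they are dropped from the summary, which is the "destructive" behavior described above.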

Example

I used summarize() earlier to calculate the total number of boys named “Andrew”, but let’s expand that code to also calculate:

  • max: the maximum number of boys named “Andrew” in a single year
  • mean: the mean number of boys named “Andrew” per year

babynames |> 
  filter(name == "Andrew", sex == "M") |> 
  summarize(total = sum(n), max = max(n), mean = mean(n))
# A tibble: 1 × 3
    total   max  mean
    <int> <int> <dbl>
1 1283910 36204 9304.

Don’t let the code above fool you. The first argument of summarize() is always a data frame, but when you use summarize() in a pipe, the first argument is provided by the pipe operator, |>. Here the first argument will be the data frame that is returned by babynames |> filter(name == "Andrew", sex == "M").
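To make the role of the pipe concrete, here is a small sketch that writes the same computation twice: once with |> and once with the data frame passed explicitly as the first argument. The toy tibble is a hypothetical stand-in for babynames, and the sketch assumes {dplyr} and R 4.1 or later (for the native pipe).

```r
library(dplyr)

# Hypothetical stand-in for babynames
toy <- tibble(name = c("Ann", "Ann", "Bob"), n = c(10L, 20L, 5L))

# Piped form: |> supplies the filtered data frame as the first argument
piped <- toy |>
  filter(name == "Ann") |>
  summarize(total = sum(n))

# Nested form: the same data frame is passed explicitly
nested <- summarize(filter(toy, name == "Ann"), total = sum(n))

# Both forms compute the same total
identical(piped$total, nested$total)
```

The nested form reads inside-out, which is one reason the piped form is usually preferred for multi-step code.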

Exercise: summarize()

Use the code chunk below to compute three statistics:

  1. the total number of children who ever had your name
  2. the maximum number of children given your name in a single year
  3. the mean number of children given your name per year

If you cannot think of an R function that would compute each statistic, see the solution below.

babynames |> 
  filter(name == "Andrew", sex == "M") |> 
  summarize(total = sum(n), max = max(n), mean = mean(n))

Summary functions

So far our summarize() examples have relied on sum(), max(), and mean(). But you can use any function in summarize() so long as it meets one criterion: the function must take a vector of values as input and return a single value as output. Functions that do this are known as summary functions, and they are common in the field of descriptive statistics. Some of the most useful summary functions include:

  1. Measures of location: mean(x), median(x), quantile(x, 0.25), min(x), and max(x)
  2. Measures of spread: sd(x), var(x), IQR(x), and mad(x)
  3. Measures of position: first(x), nth(x, 2), and last(x)
  4. Counts: n_distinct(x), and n(), which takes no arguments and returns the size of the current group or data frame.
  5. Counts and proportions of logical values: sum(!is.na(x)), which counts the number of TRUEs returned by a logical test, and mean(y == 0), which returns the proportion of TRUEs returned by a logical test.
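The sketch below applies one function from most of the categories above to a small made-up tibble. The column names x and y and their values are hypothetical, chosen only so that each result is easy to check by hand. Note that functions such as median() and IQR() return NA when their input contains missing values unless you set na.rm = TRUE.

```r
library(dplyr)

# Hypothetical measurements; NA marks a missing value
toy <- tibble(
  x = c(4, 8, NA, 15, 15),
  y = c(0, 1, 0, 2, 0)
)

stats <- toy |>
  summarize(
    middle      = median(x, na.rm = TRUE), # measure of location
    spread      = IQR(x, na.rm = TRUE),    # measure of spread
    second      = nth(x, 2),               # measure of position
    rows        = n(),                     # number of rows
    distinct    = n_distinct(x),           # distinct values; NA counts as one by default
    non_missing = sum(!is.na(x)),          # count of TRUEs
    prop_zero   = mean(y == 0)             # proportion of TRUEs
  )
stats
```

Every expression returns a single value, so the result is a one-row data frame with seven columns.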

Let’s apply some of these summary functions.

Khaleesi challenge

“Khaleesi” is a very modern name that appears to be based on the Game of Thrones TV series, which premiered on April 17, 2011. In the chunk below, filter babynames to just the rows where name == "Khaleesi". Then use summarize() and a summary function to return the first value of year in the data set.

babynames |> 
  filter(name == "Khaleesi") |> 
  summarize(year = first(year))

Distinct name challenge

In the chunk below, use summarize() and a summary function to return a data frame with two columns:

  • A column named n that displays the total number of rows in babynames
  • A column named distinct that displays the number of distinct names in babynames

Will these numbers be different? Why or why not?

babynames |> 
  summarize(n = n(), distinct = n_distinct(name))

Good job! The two numbers are different because most names appear in the data set more than once. They appear once for each year in which they were used.

summarize() by groups?

How can we apply summarize() to find the most popular names in babynames? You’ve seen how to calculate the total number of children that have your name, which is one measure of a name’s popularity:

babynames |> 
  filter(name == "Andrew", sex == "M") |> 
  summarize(total = sum(n))

However, we had to isolate your name from the rest of the data to calculate this number. You could imagine writing a program that goes through each name one at a time and:

  1. keeps only the rows with that name
  2. applies summarize() to those rows

Eventually, the program could combine all of the results back into a single data set. However, you don’t need to write such a program; this is the job of {dplyr}’s group_by() function.
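For comparison, here is a rough sketch of the hand-written program described above, next to the group_by() version. It uses a hypothetical toy tibble rather than babynames, and the loop is only one of several ways such a program could be written.

```r
library(dplyr)

# Hypothetical stand-in for babynames
toy <- tibble(
  name = c("Ann", "Ann", "Bob", "Bob", "Cy"),
  n    = c(10L, 20L, 5L, 7L, 3L)
)

# Manual version: for each name, keep its rows, summarize them,
# then combine the one-row results into a single data frame
pieces <- lapply(unique(toy$name), function(nm) {
  toy |>
    filter(name == nm) |>
    summarize(name = nm, total = sum(n))
})
manual <- bind_rows(pieces)

# Grouped version: group_by() tells summarize() to work one name at a time
grouped <- toy |>
  group_by(name) |>
  summarize(total = sum(n))

manual
grouped
```

Both approaches return one row per name with the same totals; the grouped version is simply shorter and leaves the looping to {dplyr}.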
