summarize()
summarize() takes a data frame and uses it to calculate a new data frame of summary statistics.
To use summarize(), pass it a data frame and then one or more named arguments. Each named argument should be set to an R expression that generates a single value. summarize() turns each named argument into a column in the new data frame: the name of the argument becomes the column name, and the value returned by its expression becomes the column contents.
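As a minimal sketch of how argument names become column names, the snippet below runs summarize() on a small made-up data frame. The `toy` tibble and the column names `total` and `biggest` are hypothetical and chosen only for illustration; they are not part of babynames.

```r
library(dplyr)

# Hypothetical toy data standing in for babynames (all values are made up)
toy <- tibble(
  year = c(2000, 2001, 2002, 2000, 2001),
  sex  = c("M", "M", "M", "F", "F"),
  name = c("Andrew", "Andrew", "Andrew", "Anna", "Anna"),
  n    = c(10L, 20L, 30L, 5L, 15L)
)

# Each named argument is an expression that returns a single value
result <- summarize(toy, total = sum(n), biggest = max(n))

# The argument names are now the column names of the one-row result
names(result)
```

Renaming an argument, for example `biggest` to `max`, changes only the column name in the result, not the value that is computed.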
Importantly, the summarize() function is destructive. It collapses a data set into a single row and throws away any columns that we don’t use when summarizing.
I used summarize() earlier to calculate the total number of boys named “Andrew”, but let’s expand that code to also calculate:
max: the maximum number of boys named “Andrew” in a single year
mean: the mean number of boys named “Andrew” per year
babynames |>
  filter(name == "Andrew", sex == "M") |>
  summarize(total = sum(n), max = max(n), mean = mean(n))
# A tibble: 1 × 3
    total   max  mean
    <int> <int> <dbl>
1 1283910 36204 9304.
Don’t let the code above fool you. The first argument of summarize() is always a data frame, but when you use summarize() in a pipe, the first argument is provided by the pipe operator, |>. Here the first argument will be the data frame that is returned by babynames |> filter(name == "Andrew", sex == "M").
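To make the role of the pipe concrete, here is a small sketch that computes the same summary twice: once with the pipe, and once by passing the filtered data frame as the explicit first argument. The `toy` tibble is hypothetical, made-up data used in place of babynames so the example is self-contained.

```r
library(dplyr)

# Hypothetical toy data standing in for babynames (all values are made up)
toy <- tibble(
  year = c(2000, 2001, 2002, 2000, 2001),
  sex  = c("M", "M", "M", "F", "F"),
  name = c("Andrew", "Andrew", "Andrew", "Anna", "Anna"),
  n    = c(10L, 20L, 30L, 5L, 15L)
)

# Piped form: |> supplies the first argument of filter() and of summarize()
piped <- toy |>
  filter(name == "Andrew", sex == "M") |>
  summarize(total = sum(n))

# Explicit form: the data frame is written out as the first argument
andrews  <- filter(toy, name == "Andrew", sex == "M")
explicit <- summarize(andrews, total = sum(n))

# Both forms compute the same total
piped$total == explicit$total
```

The piped form avoids naming the intermediate data frame (`andrews` here), which is the main reason pipes read more cleanly in longer chains.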
summarize()
Use the code chunk below to compute three statistics:
If you cannot think of an R function that would compute each statistic, click the Solution tab.
So far our summarize() examples have relied on sum(), max(), and mean(). But you can use any function in summarize() so long as it meets one criterion: the function must take a vector of values as input and return a single value as output. Functions that do this are known as summary functions, and they are common in the field of descriptive statistics. Some of the most useful summary functions include:
mean(x), median(x), quantile(x, 0.25), min(x), and max(x)
sd(x), var(x), IQR(x), and mad(x)
first(x), nth(x, 2), and last(x)
n_distinct(x) and n(), which takes no arguments and returns the size of the current group or data frame
sum(!is.na(x)), which counts the number of TRUEs returned by a logical test; mean(y == 0), which returns the proportion of TRUEs returned by a logical test
Let’s apply some of these summary functions.
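The sketch below applies a handful of these summary functions in a single summarize() call. It uses a hypothetical `toy` tibble with made-up values rather than babynames, and the output column names (`middle`, `spread`, and so on) are illustrative choices.

```r
library(dplyr)

# Hypothetical toy data standing in for babynames (all values are made up)
toy <- tibble(
  year = c(2000, 2001, 2002, 2000, 2001),
  sex  = c("M", "M", "M", "F", "F"),
  name = c("Andrew", "Andrew", "Andrew", "Anna", "Anna"),
  n    = c(10L, 20L, 30L, 5L, 15L)
)

stats <- toy |>
  summarize(
    middle   = median(n),      # a measure of location
    spread   = sd(n),          # a measure of spread
    earliest = first(year),    # the first value in the column, in row order
    n_known  = sum(!is.na(n)), # count of TRUEs from a logical test
    prop_big = mean(n > 10)    # proportion of TRUEs from a logical test
  )

stats
```

Every expression here takes a whole column as a vector and returns one value, which is why the result is still a single row.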
“Khaleesi” is a very modern name that appears to be based on the Game of Thrones TV series, which premiered on April 17, 2011. In the chunk below, filter babynames to just the rows where name == "Khaleesi". Then use summarize() and a summary function to return the first value of year in the data set.
In the chunk below, use summarize() and a summary function to return a data frame with two columns:
n, which displays the total number of rows in babynames
distinct, which displays the number of distinct names in babynames
Will these numbers be different? Why or why not?
Good job! The two numbers are different because most names appear in the data set more than once. They appear once for each year in which they were used.
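The same effect can be seen on a small scale. This sketch uses a hypothetical `toy` tibble with made-up values in which each name appears in more than one year, so the row count and the distinct-name count differ.

```r
library(dplyr)

# Hypothetical toy data standing in for babynames (all values are made up)
toy <- tibble(
  year = c(2000, 2001, 2002, 2000, 2001),
  sex  = c("M", "M", "M", "F", "F"),
  name = c("Andrew", "Andrew", "Andrew", "Anna", "Anna"),
  n    = c(10L, 20L, 30L, 5L, 15L)
)

counts <- toy |>
  summarize(
    n        = n(),              # number of rows: 5
    distinct = n_distinct(name)  # number of unique names: 2
  )

counts
```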
summarize() by groups?
How can we apply summarize() to find the most popular names in babynames? You’ve seen how to calculate the total number of children that have your name, which provides one of our measures of popularity:
However, we had to isolate your name from the rest of your data to calculate this number. You could imagine writing a program that goes through each name one at a time and:
Eventually, the program could combine all of the results back into a single data set. However, you don’t need to write such a program; this is the job of {dplyr}’s group_by() function.
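As a rough preview of what that looks like, the sketch below groups a hypothetical `toy` tibble (made-up values, not babynames) by name and sex, then summarizes each group. The `.groups = "drop"` argument is an optional choice made here so that the result is returned ungrouped.

```r
library(dplyr)

# Hypothetical toy data standing in for babynames (all values are made up)
toy <- tibble(
  year = c(2000, 2001, 2002, 2000, 2001),
  sex  = c("M", "M", "M", "F", "F"),
  name = c("Andrew", "Andrew", "Andrew", "Anna", "Anna"),
  n    = c(10L, 20L, 30L, 5L, 15L)
)

# group_by() splits the rows by name and sex; summarize() then returns
# one row per group instead of one row for the whole data frame
by_name <- toy |>
  group_by(name, sex) |>
  summarize(total = sum(n), .groups = "drop")

by_name
```

Here summarize() produces two rows, one per name and sex combination, and the separate results are combined into a single data frame automatically.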