```
|>
babynames filter(name == "Andrew", sex == "M") |>
summarize(total = sum(n), max = max(n), mean = mean(n))
```

```
# A tibble: 1 × 3
total max mean
<int> <int> <dbl>
1 1283910 36204 9304.
```

`summarize()`

`summarize()`

takes a data frame and uses it to calculate a new data frame of summary statistics.

To use `summarize()`

, pass it a data frame and then one or more named arguments. Each named argument should be set to an R expression that generates a single value. Summarise will turn each named argument into a column in the new data frame. The name of each argument will become the column name, and the value returned by the argument will become the column contents.

Importantly, the `summarize()`

function is *destructive*. It collapses a dataset into a single row and throws away any columns that we don’t use when summarizing. Watch this little animation to see what it does:

I used `summarize()`

earlier to calculate the total number of boys named “Andrew”, but let’s expand that code to also calculate

`max`

: the maximum number of boys named “Andrew” in a single year`mean`

: the mean number of boys named “Andrew” per year

```
babynames |>
filter(name == "Andrew", sex == "M") |>
summarize(total = sum(n), max = max(n), mean = mean(n))
```

```
# A tibble: 1 × 3
total max mean
<int> <int> <dbl>
1 1283910 36204 9304.
```

Don’t let the code above fool you. The first argument of `summarize()`

is always a data frame, but when you use `summarize()`

in a pipe, the first argument is provided by the pipe operator, `|>`

. Here the first argument will be the data frame that is returned by `babynames |> filter(name == "Andrew", sex == "M")`

.

`summarize()`

Use the code chunk below to compute three statistics:

- the total number of children who ever had your name
- the maximum number of children given your name in a single year
- the mean number of children given your name per year

If you cannot think of an R function that would compute each statistic, click the Solution tab.

So far our `summarize()`

examples have relied on `sum()`

, `max()`

, and `mean()`

. But you can use any function in `summarize()`

so long as it meets one criteria: the function must take a *vector* of values as input and return a *single* value as output. Functions that do this are known as **summary functions** and they are common in the field of descriptive statistics. Some of the most useful summary functions include:

**Measures of location**:`mean(x)`

,`median(x)`

,`quantile(x, 0.25)`

,`min(x)`

, and`max(x)`

**Measures of spread**:`sd(x)`

,`var(x)`

,`IQR(x)`

, and`mad(x)`

**Measures of position**:`first(x)`

,`nth(x, 2)`

, and`last(x)`

**Counts**:`n_distinct(x)`

and`n()`

, which takes no arguments, and returns the size of the current group or data frame.**Counts and proportions of logical values**:`sum(!is.na(x))`

, which counts the number of`TRUE`

s returned by a logical test;`mean(y == 0)`

, which returns the proportion of`TRUE`

s returned by a logical test.

Let’s apply some of these summary functions. Click Continue to test your understanding.

“Khaleesi” is a very modern name that appears to be based on the *Game of Thrones* TV series, which premiered on April 17, 2011. In the chunk below, filter `babynames`

to just the rows where `name == "Khaleesi"`

. Then use `summarize()`

and a summary function to return the first value of `year`

in the data set.

In the chunk below, use `summarize()`

and a summary function to return a data frame with two columns:

- A column named
`n`

that displays the total number of rows in`babynames`

- A column named
`distinct`

that displays the number of distinct names in`babynames`

Will these numbers be different? Why or why not?

Good job! The two numbers are different because most names appear in the data set more than once. They appear once for each year in which they were used.

`summarize()`

by groups?How can we apply `summarize()`

to find the most popular names in `babynames`

? You’ve seen how to calculate the total number of children that have your name, which provides one of our measures of popularity, i.e. the total number of children that have a name:

However, we had to isolate your name from the rest of your data to calculate this number. You could imagine writing a program that goes through each name one at a time and:

- filters out the rows with just that name
- applies summarize to the rows

Eventually, the program could combine all of the results back into a single data set. However, you don’t need to write such a program; this is the job of {dplyr}’s `group_by()`

function.