# Histograms

### Introduction

Watch this video:

### How to make a histogram

To make a histogram with {ggplot2}, add `geom_histogram()`

to the ggplot2 template. For example, the code below plots a histogram of the `carat`

variable in the `diamonds`

dataset, which comes with {ggplot2}.

### The \(y\) variable

As with `geom_bar()`

, you do not need to give `geom_histogram()`

a \(y\) variable. `geom_histogram()`

will construct its own \(y\) variable by counting the number of observations that fall into each bin on the \(x\) axis. `geom_histogram()`

will then map the counts to the \(y\) axis.

As a result, you can glance at a bar to determine how many observations fall within a bin. Bins with tall bars highlight common values of the \(x\) variable.

### Exercise 1: Interpretation

`binwidth`

By default, {ggplot2} will choose a binwidth for your histogram that results in about 30 bins. You can set the binwidth manually with the `binwidth`

argument, which is interpreted in the units of the x axis:

`bins`

Alternatively, you can set the binwidth with the `bins`

argument which takes the total number of bins to use:

It can be hard to determine what the actual binwidths are when you use `bins`

, since they may not be round numbers.

`boundary`

You can move the bins left and right along the \(x\) axis with the `boundary`

argument. `boundary`

takes an \(x\) value to use as the boundary between two bins ({ggplot2} will align the rest of the bins accordingly):

### Exercise 2: `binwidth`

When you use `geom_histogram()`

, you should always experiment with different binwidths because different size bins reveal different types of information.

To see an example of this, make a histogram of the `carat`

variable in the `diamonds`

dataset. Use a bin size of 0.5 carats. What does the overall shape of the distribution look like?

```
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
```

Good job! The most common diamond size is about 0.5 carats. Larger sizes become progressively less frequent as carat size increases. This accords with general knowledge about diamonds, so you may be prompted to stop exploring the distribution of carat size. But should you?

### Exercise 3: another `binwidth`

Recreate your histogram of `carat`

but this time use a `binwidth`

of 0.1. Does your plot reveal new information? Look closely. Is there more than one peak? Where do the peaks occur?

```
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.1)
```

Good job! The new `binwidth`

reveals a new phenomena: carat sizes like 0.5, 0.75, 1, 1.5, and 2 are much more common than carat sizes that do not fall near a common fraction. Why might this be?

### Exercise 4: another `binwidth`

Recreate your histogram of `carat`

a final time, but this time use a `binwidth`

of 0.01 and set the first boundary to zero. Try to find one new pattern in the results.

```
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.01, boundary = 0)
```

Good job! The new `binwidth`

reveals another phenomena: each peak is very right skewed. In other words, diamonds that are 1.01 carats are much more common than diamonds that are .99 carats. Why would that be?

### Aesthetics

Visually, histograms are very similar to bar charts. As a result, they use the same aesthetics: *alpha*, *color*, *fill*, *linetype*, and *size*.

They also behave in the same odd way when you use the color aesthetic. Do you remember what happens?

### Exercise 5: Putting it all together

Recreate the histogram below.

```
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price, fill = cut), binwidth = 1000, boundary = 0)
```

Good job! Did you ensure that each `binwidth`

is 1000 and that the first boundary is zero?