Histograms

How to make a histogram

To make a histogram with {ggplot2}, add geom_histogram() to the ggplot2 template. For example, the code below plots a histogram of the carat variable in the diamonds dataset, which comes with {ggplot2}.
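A minimal sketch of that call, reconstructed from the description above. The object name p_default is only for illustration, and no binning arguments are set, so {ggplot2} falls back to its defaults and prints a message suggesting that you choose a binwidth.

```r
library(ggplot2)

# Histogram of carat using the default binning
p_default <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat))

# Printing the object draws the plot
p_default
```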

The \(y\) variable

As with geom_bar(), you do not need to give geom_histogram() a \(y\) variable. geom_histogram() will construct its own \(y\) variable by counting the number of observations that fall into each bin on the \(x\) axis. geom_histogram() will then map the counts to the \(y\) axis.

As a result, you can glance at a bar to determine how many observations fall within a bin. Bins with tall bars highlight common values of the \(x\) variable.
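To see the counts that geom_histogram() computes, you can inspect the data that {ggplot2} builds for the layer. The sketch below uses layer_data() for this. The object names p and bin_counts are only for illustration.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat))

# layer_data() returns the data frame computed for the first layer;
# xmin and xmax are the bin edges, count is the height of each bar
bin_counts <- layer_data(p)[, c("xmin", "xmax", "count")]
head(bin_counts)

# Every observation is counted in exactly one bin
sum(bin_counts$count) == nrow(diamonds)
```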

Exercise 1: Interpretation

According to the chart, which is the most common carat size in the data?





binwidth

By default, {ggplot2} chooses a binwidth that splits the range of your data into 30 bins. You can set the binwidth manually with the binwidth argument, which is interpreted in the units of the \(x\) axis:
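For instance, a sketch with bins that are one carat wide. The value 1 is an arbitrary choice for illustration, and p_width is a hypothetical name.

```r
library(ggplot2)

# binwidth is measured in units of the x variable, here carats
p_width <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 1)

p_width
```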

bins

Alternatively, you can set the binwidth indirectly with the bins argument, which takes the total number of bins to use:
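A sketch of the same histogram with the number of bins set directly. The value 10 is arbitrary, and p_bins is a hypothetical name.

```r
library(ggplot2)

# Ask for 10 bins and let ggplot2 work out the binwidth
p_bins <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), bins = 10)

p_bins
```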

It can be hard to determine what the actual binwidths are when you use bins, since they may not be round numbers.
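If you need the actual binwidth, one way to recover it is to subtract the bin edges that {ggplot2} computed. This is a sketch that assumes the xmin and xmax columns returned by layer_data(). The names p_bins and widths are only for illustration.

```r
library(ggplot2)

p_bins <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), bins = 10)

# Each row of the layer data describes one bin
bins_df <- layer_data(p_bins)

# Width of each bin; all bins should share one (usually non-round) width
widths <- bins_df$xmax - bins_df$xmin
unique(round(widths, 4))
```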

boundary

You can move the bins left and right along the \(x\) axis with the boundary argument. boundary takes an \(x\) value to use as the boundary between two bins ({ggplot2} will align the rest of the bins accordingly):
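As a sketch, the code below uses bins that are one carat wide and places a boundary at zero, so the bin edges fall on whole carats. The binwidth value and the name p_boundary are illustrative choices.

```r
library(ggplot2)

# boundary = 0 puts a bin edge at x = 0; the remaining edges
# follow at multiples of the binwidth (1, 2, 3, ...)
p_boundary <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 1, boundary = 0)

p_boundary

# Inspect the left edge of each bin
layer_data(p_boundary)$xmin
```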

Exercise 2: binwidth

When you use geom_histogram(), you should always experiment with different binwidths because different bin sizes reveal different types of information.

To see an example of this, make a histogram of the carat variable in the diamonds dataset. Use a bin size of 0.5 carats. What does the overall shape of the distribution look like?

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

Good job! The most common diamond size is about 0.5 carats. Larger sizes become progressively less frequent as carat size increases. This accords with general knowledge about diamonds, so you may be tempted to stop exploring the distribution of carat size. But should you?

Exercise 3: another binwidth

Recreate your histogram of carat but this time use a binwidth of 0.1. Does your plot reveal new information? Look closely. Is there more than one peak? Where do the peaks occur?

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.1)

Good job! The new binwidth reveals a new phenomenon: carat sizes like 0.5, 0.75, 1, 1.5, and 2 are much more common than carat sizes that do not fall near a common fraction. Why might this be?

Exercise 4: another binwidth

Recreate your histogram of carat a final time, but this time use a binwidth of 0.01 and set the first boundary to zero. Try to find one new pattern in the results.

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.01, boundary = 0)

Good job! The new binwidth reveals another phenomenon: each peak is strongly right-skewed. In other words, diamonds that are 1.01 carats are much more common than diamonds that are 0.99 carats. Why would that be?

Aesthetics

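Visually, histograms are very similar to bar charts. As a result, they use the same aesthetics: alpha, color, fill, linetype, and size. (In {ggplot2} 3.4.0 and later, the width of the bar outline is set with linewidth rather than size.)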

They also behave in the same odd way when you use the color aesthetic. Do you remember what happens?

Which aesthetic would you use to color the interior fill of each bar in a histogram?
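To check your answer, you can compare the two aesthetics side by side. The sketch below sets each one to a constant value outside of aes(). The color "red", the binwidth, and the names p_outline and p_fill are illustrative choices.

```r
library(ggplot2)

# color controls the outline of each bar
p_outline <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5, color = "red")

# fill controls the interior of each bar
p_fill <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5, fill = "red")

p_outline
p_fill
```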



Exercise 5: Putting it all together

Recreate the histogram below.

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price, fill = cut), binwidth = 1000, boundary = 0)

Good job! Did you ensure that the binwidth is 1000 and that the first boundary is zero?
