Histograms are graphs that display the distribution of your continuous data. They are fantastic exploratory tools because they reveal properties about your sample data in ways that summary statistics cannot. For instance, while the mean and standard deviation can numerically summarize your data, histograms bring your sample data to life.
In this blog post, I’ll show you how histograms reveal the shape of the distribution, its central tendency, and the spread of values in your sample data. You’ll also learn how to identify outliers, how histograms relate to probability distribution functions, and why you might need to use hypothesis tests with them.
Histograms, Central Tendency, and Variability
Use histograms when you have continuous measurements and want to understand the distribution of values and look for outliers. These graphs take your continuous measurements and place them into ranges of values known as bins. Each bin has a bar that represents the count or percentage of observations that fall within that bin. Histograms are similar to stem-and-leaf plots.
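To make the idea of bins concrete, here is a minimal sketch in plain Python. The measurements, the bin width of 5, and the starting edge of 40 are all invented for illustration, and `bin_counts` is a hypothetical helper rather than part of any library.

```python
from collections import Counter

def bin_counts(values, bin_width, start):
    """Count how many values fall in each bin [left_edge, left_edge + bin_width)."""
    counts = Counter()
    for v in values:
        k = int((v - start) // bin_width)  # index of the bin that holds v
        counts[k] += 1
    # Key each count by the bin's left edge; empty bins are simply omitted.
    return {start + k * bin_width: counts[k] for k in sorted(counts)}

# Hypothetical measurements, invented for illustration only.
values = [41.2, 44.8, 47.5, 48.1, 49.9, 50.3, 51.0, 52.6, 53.4, 58.7]
counts = bin_counts(values, bin_width=5, start=40)

# Each row is one bar: the bin range, a bar of '#', the count, and the percentage.
for left_edge, n in counts.items():
    pct = 100 * n / len(values)
    print(f"[{left_edge}, {left_edge + 5}): {'#' * n}  {n} ({pct:.0f}%)")
```

Changing `bin_width` or `start` moves values between bars, which is one reason the same data can look different across histograms.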
Download the CSV data file to make most of the histograms in this blog post: Histograms.
In the field of statistics, we often use summary statistics to describe an entire dataset. These statistics use a single number to quantify a characteristic of the sample. For example, a measure of central tendency is a single value that represents the center point or typical value of a dataset, such as the mean. A measure of variability is another type of summary statistic that describes how spread out the values are in your dataset. The standard deviation is a conventional measure of dispersion.
These summary statistics are crucial. How often have you heard that the mean of a group is a particular value? It provides meaningful information. However, these measures are simplifications of the dataset. Graphing the data brings it to life. Generally, I find that using graphs in conjunction with statistics provides the best of both worlds!
Let’s see this in action.
Histograms and the Central Tendency
Use histograms to understand the center of the data. In the histogram below, you can see that the center is near 50. Most values in the dataset will be close to 50, and values further away are rarer. The distribution is roughly symmetric and the values fall between approximately 40 and 64.
A difference in means shifts the distributions horizontally along the X-axis (unless the histogram is rotated). In the histograms below, one group has a mean of 50 while the other has a mean of 65.
Additionally, histograms help you grasp the degree of overlap between groups. In the above histograms, there’s a relatively small amount of overlap.
Histograms and Variability
Suppose you hear that two groups have the same mean of 50. It sounds like they’re practically equivalent. However, after you graph the data, the differences become apparent, as shown below.
The histograms center on the same value of 50, but the spread of values is notably different. The values for group A mostly fall between 40 – 60 while for group B that range is 20 – 90. The mean does not tell the entire story! At a glance, the difference is evident in the histograms.
In short, histograms show you which values are more and less common along with their dispersion. You can’t gain this understanding from the raw list of values. Summary statistics, such as the mean and standard deviation, will get you partway there. But histograms make the data pop!
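If you'd like to reproduce this kind of comparison yourself, the sketch below simulates two groups with the same center but different spreads and prints a rough text histogram for each. The data are simulated under assumed standard deviations of 5 and 15 (not the data behind the graphs in this post), and `bin_percents` is a hypothetical helper.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Simulated groups (assumed values, not the post's data): same center, different spread.
group_a = [random.gauss(50, 5) for _ in range(1000)]
group_b = [random.gauss(50, 15) for _ in range(1000)]

def bin_percents(values, low=0, high=100, width=10):
    """Return {left_edge: percent of values in [left_edge, left_edge + width)}.
    Values outside [low, high) are not counted."""
    return {
        left: 100 * sum(left <= v < left + width for v in values) / len(values)
        for left in range(low, high, width)
    }

for name, group in (("A", group_a), ("B", group_b)):
    print(f"Group {name}: mean={statistics.mean(group):.1f}, sd={statistics.stdev(group):.1f}")
    # Each '#' stands for roughly 2% of the group's values.
    for left, pct in bin_percents(group).items():
        print(f"  {left:3d}-{left + 10:<3d} {'#' * round(pct / 2)}")
```

The two means should print as nearly identical, while the bars for group B spread across far more bins than those for group A.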
Histograms and Skewed Distributions
Histograms are an excellent tool for identifying the shape of your distribution. So far, we’ve been looking at symmetric distributions, such as the normal distribution. However, not all distributions are symmetrical. You might have nonnormal data that are skewed.
The shape of the distribution is a fundamental characteristic of your sample that can determine which measure of central tendency best reflects the center of your data. Relatedly, the shape also impacts your choice between using a parametric or nonparametric hypothesis test. In this manner, histograms are informative about the summary statistics and hypothesis tests that are appropriate for your data.
For skewed distributions, the direction of the skew indicates which way the longer tail extends.
For right-skewed distributions, the long tail extends to the right while most values cluster on the left, as shown below. These are real data from a study I conducted.
Conversely, for left-skewed distributions, the long tail extends to the left while most values cluster on the right.
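A numeric check can back up what the histogram shows. The sketch below uses simulated data (an exponential distribution with an assumed mean of 10, not the study data mentioned above) and a simple moment-based skewness formula. `skewness` is a hypothetical helper, and other skewness definitions exist.

```python
import random
import statistics

def skewness(values):
    """Moment-based skewness: average cubed deviation divided by the sd cubed."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)

random.seed(2)
# Simulated right-skewed data (exponential, mean 10); an assumption for illustration.
right_skewed = [random.expovariate(1 / 10) for _ in range(2000)]
# Mirroring the values flips the long tail to the left.
left_skewed = [100 - v for v in right_skewed]

for name, data in (("right-skewed", right_skewed), ("left-skewed", left_skewed)):
    print(f"{name}: skewness={skewness(data):+.2f}, "
          f"mean={statistics.fmean(data):.1f}, median={statistics.median(data):.1f}")
```

In this simulation the long tail pulls the mean toward it: the mean lands above the median for the right-skewed sample and below it for the left-skewed one.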
Using Histograms to Identify Outliers
Histograms are a handy way to identify outliers. In an instant, you’ll see if there are any unusual values. If you identify potential outliers, investigate them. Are these data entry errors or do they represent observations that occurred under unusual conditions? Or, perhaps they are legitimate observations that accurately describe the variability in the study area.
In a histogram, an outlier appears as an isolated bar, separated from the rest of the distribution by empty bins.
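As a rough sketch of what "isolated" can mean in practice, the code below flags any non-empty bin whose neighboring bins are both empty. That rule, the data, and the helper name `isolated_bins` are assumptions made for illustration. The result is a prompt to investigate, not a formal outlier test.

```python
def isolated_bins(values, bin_width, start):
    """Return the left edges of non-empty bins whose neighboring bins are both empty."""
    occupied = {int((v - start) // bin_width) for v in values}
    isolated = [k for k in sorted(occupied)
                if k - 1 not in occupied and k + 1 not in occupied]
    return [start + k * bin_width for k in isolated]

# Hypothetical measurements; 88 sits far from the other values.
values = [48, 49, 50, 50, 51, 52, 53, 55, 57, 58, 61, 62, 88]
flagged = isolated_bins(values, bin_width=5, start=40)
print(flagged)  # left edges of bars that stand alone
```

With these values, the only bar standing alone is the bin starting at 85, which contains the single value 88. Keep in mind that the answer depends on the bin width: wider bins can absorb an unusual value into the main body of the histogram.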
Identifying Multimodal Distributions with Histograms
A multimodal distribution has more than one peak. It’s easy to miss multimodal distributions when you focus on summary statistics, such as the mean and standard deviation. Consequently, histograms are the best method for detecting multimodal distributions.
Imagine your dataset has the properties shown below.
That looks relatively straightforward, but when you graph it, you see the histogram below.
That bimodal distribution is not quite what you were expecting! This histogram illustrates why you should always graph your data rather than just calculating summary statistics!
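You can recreate this kind of surprise with simulated data. The sketch below mixes two assumed normal distributions centered on 40 and 60 (values invented for illustration, not the dataset graphed in this post), prints the summary statistics, and then prints a text histogram. `count_in` is a hypothetical helper.

```python
import random
import statistics

random.seed(3)
# Simulated mixture (assumed values): half the observations center on 40, half on 60.
data = ([random.gauss(40, 4) for _ in range(500)]
        + [random.gauss(60, 4) for _ in range(500)])

# The summary statistics alone give no hint of two peaks.
print(f"mean={statistics.mean(data):.1f}, sd={statistics.stdev(data):.1f}")

def count_in(values, low, high):
    """Number of values in [low, high)."""
    return sum(low <= v < high for v in values)

# One row per 4-unit bin; each '#' stands for 5 observations.
for left in range(24, 76, 4):
    print(f"{left}-{left + 4}: {'#' * (count_in(data, left, left + 4) // 5)}")
```

The mean falls near 50, which is exactly where the histogram has a valley between its two peaks.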
Related post: Bimodal Distributions: Definition, Examples & Analysis
Using Histograms to Identify Subpopulations
Sometimes these multimodal distributions reflect the actual distribution of the phenomenon that you’re studying. In other words, there are genuinely different peak values in the distribution of one population. However, in other cases, multimodal distributions indicate that you’re combining subpopulations that have different characteristics. Histograms can help confirm the presence of these subpopulations and illustrate how they’re different from each other.
Suppose we’re studying the heights of American citizens. They have a mean height of 168 centimeters with a standard deviation of 9.8 cm. The histogram is below. There appears to be an unusually broad peak in the center, although it’s not quite bimodal.
When we divide the sample by gender, the reason for it becomes clear.
Notice how two narrower distributions have replaced the single broad distribution? The histograms help us learn that gender is an essential categorical variable in studies that involve height. The graphs show that the mean provides more precise estimates when we assess heights by gender. In fact, the mean for the entire population does not equal the mean for either subpopulation. It’s misleading!
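The arithmetic behind that broad peak can be sketched as follows. The subgroup means of 161 and 175 and the shared standard deviation of 7 are hypothetical numbers I picked so that the combined result lands near the overall figures quoted above. They are not values from the height data. `combined_summary` is a hypothetical helper that treats each standard deviation as a population standard deviation.

```python
import math

def combined_summary(groups):
    """groups: list of (n, mean, sd) tuples. Return the (mean, sd) of the pooled data."""
    total = sum(n for n, _, _ in groups)
    mean = sum(n * m for n, m, _ in groups) / total
    # Total variance = within-group variance + variance between the group means.
    var = sum(n * (sd ** 2 + (m - mean) ** 2) for n, m, sd in groups) / total
    return mean, math.sqrt(var)

# Hypothetical subgroups: equal sizes, means 161 and 175, sd 7 within each.
mean, sd = combined_summary([(500, 161, 7), (500, 175, 7)])
print(f"combined mean={mean:.1f}, combined sd={sd:.1f}")
```

The combined mean sits between the two subgroup means, and the combined standard deviation is larger than the spread within either subgroup because it also absorbs the gap between the group means.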
Related post: Dot Plots: Using, Examples, and Interpreting
Using Histograms to Assess the Fit of a Probability Distribution Function
Analysts can overlay a fitted line for a probability distribution function on their histogram. Here’s a quick distinction between the two:
- Histogram: Displays the distribution of values in the sample.
- Fitted distribution line: Displays the probability distribution function for a particular distribution (e.g., normal, Weibull, etc.) that best fits your data.
A histogram graphs your sample data. On the other hand, a fitted distribution line attempts to find the probability distribution function for a population that has the maximum likelihood of producing the distribution that exists in your sample.
While you can use histograms to evaluate how well the distribution curve fits your sample, I do NOT recommend it! If you insist on using a histogram, assess how closely the bars follow the shape of the fitted line. In the graph below, the fitted line for the normal distribution appears to follow the histogram bars adequately. The legend displays the estimated parameter values of the fitted distribution.
Instead of using histograms to determine how well a distribution fits your data, I recommend using a combination of distribution tests and probability plots. Probability plots are special graphs that are specifically designed to display how well probability distribution functions fit samples. To learn more about these other approaches, read my posts about Identifying the Distribution of your Data and Histograms vs. Probability Plots.
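For readers curious about what the bar-versus-curve comparison looks like numerically, here is a minimal sketch using Python's standard library. The sample is simulated from an assumed normal distribution with a mean of 100 and a standard deviation of 15. `NormalDist.from_samples` estimates the parameters from the sample mean and sample standard deviation, which is close to, but not identical to, a maximum likelihood fit. `observed_vs_expected` is a hypothetical helper.

```python
import random
from statistics import NormalDist

random.seed(7)
# Simulated sample (assumed population: mean 100, sd 15).
sample = [random.gauss(100, 15) for _ in range(500)]

# Estimate the normal distribution's parameters from the sample.
fitted = NormalDist.from_samples(sample)
print(f"fitted mean={fitted.mean:.1f}, fitted sd={fitted.stdev:.1f}")

def observed_vs_expected(values, dist, low, high, width):
    """For each bin, pair the observed proportion with the proportion the fitted curve predicts."""
    rows = []
    for left in range(low, high, width):
        observed = sum(left <= v < left + width for v in values) / len(values)
        expected = dist.cdf(left + width) - dist.cdf(left)
        rows.append((left, observed, expected))
    return rows

rows = observed_vs_expected(sample, fitted, low=50, high=150, width=10)
for left, observed, expected in rows:
    print(f"{left:3d}-{left + 10:<3d} observed={observed:.3f} expected={expected:.3f}")
```

Small gaps between the two columns correspond to bars that follow the fitted line closely. This is only an eyeball-style check and does not replace distribution tests or probability plots.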
Related post: Understanding Probability Distributions
Using Histograms to Compare Distributions between Groups
To compare distributions between groups using histograms, you’ll need both a continuous variable and a categorical grouping variable. There are two common ways to display groups in histograms. You can either overlay the groups or graph them in different panels, as shown below.
It can be easier to compare distributions when they’re overlaid, but sometimes they get messy. Histograms in separate panels display each distribution more clearly, but the comparisons and degree of overlap aren’t quite as clear. In the examples above, the paneled distributions are clearly more legible. However, overlaid histograms can work nicely in other cases, as you’ve seen in this blog post. Experiment to find the best approach for your data!
While I think histograms are the best graph for understanding the distribution of values for a single group, they can get muddled with multiple groups. Histograms are usually pretty good for displaying two groups, and up to four groups if you display them in separate panels. If your primary goal is to compare distributions and your histograms are challenging to interpret, consider using boxplots or individual plots. In my opinion, those other plots are better for comparing distributions when you have more groups. But they don’t provide quite as much detail for each distribution as histograms.
Again, experiment and determine which graph works best for your data and goals!
Histograms and Sample Size
As fantastic as histograms are for exploring your data, be aware that sample size is a significant consideration when you need the shape of the histogram to resemble the population distribution. Typically, I recommend that you have a sample size of at least 50 per group for histograms. With fewer than 50 observations, you have too little data to represent the population distribution accurately.
Both histograms below use samples drawn from a population that has a mean of 100 and a standard deviation of 15. These characteristics describe the distribution of IQ scores. However, one histogram uses a sample size of 20 while the other uses a sample size of 100. Notice that I’m using percent on the Y-axis to compare histogram bars between different sample sizes.
That’s a pretty huge difference! It takes a surprisingly large sample size to get a good representation of an entire distribution. When your sample size is less than 20, consider using an individual value plot.
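To see the effect of sample size for yourself, the sketch below draws samples of 20 and 100 from a simulated normal population with a mean of 100 and a standard deviation of 15, then prints each histogram using percentages so the bars are comparable. The seed and the bin width of 10 are arbitrary choices, and `bin_percents` is a hypothetical helper.

```python
import math
import random

random.seed(4)  # arbitrary seed; try others to see how much small samples vary

def bin_percents(values, width=10):
    """Percent of values per bin, keyed by the bin's left edge."""
    counts = {}
    for v in values:
        left = math.floor(v / width) * width
        counts[left] = counts.get(left, 0) + 1
    return {left: 100 * counts[left] / len(values) for left in sorted(counts)}

small = [random.gauss(100, 15) for _ in range(20)]
large = [random.gauss(100, 15) for _ in range(100)]

for label, sample in (("n=20", small), ("n=100", large)):
    print(label)
    # Each '#' stands for roughly 2% of that sample.
    for left, pct in bin_percents(sample).items():
        print(f"  {left:4d}-{left + 10:<4d} {'#' * round(pct / 2)} {pct:.0f}%")
```

Rerunning with different seeds is a quick way to watch the shape of the smaller sample change from draw to draw.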
Using Hypothesis Tests in Conjunction with Histograms
As you’ve seen in this post, histograms can illustrate the distribution of groups as well as differences between groups. However, if you want to use your sample data to draw conclusions about populations, you’ll need to use hypothesis tests. Additionally, be sure that you use a sampling method, such as random sampling, to obtain a sample that reflects the population.
Differences between groups that are visible on histograms can be quirks caused by random sampling error rather than representing real differences between populations. On histograms, random error can manifest itself as apparent differences in central tendency and variability between groups. Additionally, arbitrary graph factors such as the scale of the Y-axis and different bin sizes can overstate the differences.
Hypothesis tests play a critical role in separating the signal (real differences in the population) from the noise (random sampling error). This protective function helps prevent you from mistaking random error for a real effect. If the appropriate hypothesis test is not statistically significant, your sample provides insufficient evidence for concluding that the pattern on your graph represents a real effect at the population level. In other words, you might be looking at noise in the sample.
Hypothesis Tests for Histograms
Use the following hypothesis tests in conjunction with histograms when you are comparing groups:
- 2-sample t-test: Assess the equality of two group means.
- ANOVA: Test the equality of three or more group means.
- Mann-Whitney: Assess the equality of two group medians.
- Kruskal-Wallis and Mood’s Median: Test the equality of three or more group medians.
- Test of Equal Variances: Assess the equality of group variances or standard deviations.
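To illustrate the general idea of testing a difference in means, here is a minimal sketch of a permutation test. Note that this is not one of the tests listed above. I'm swapping it in because it can be written with only Python's standard library, whereas a 2-sample t-test normally comes from a statistics package. The data are simulated with assumed means of 50 and 65, and `permutation_test` is a hypothetical helper.

```python
import random

def permutation_test(a, b, n_resamples=2000, seed=0):
    """Two-sided permutation test for a difference in means. Returns an approximate p-value."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_resamples):
        # Shuffle the group labels and recompute the difference in means.
        rng.shuffle(pooled)
        new_a, new_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(new_a) / len(new_a) - sum(new_b) / len(new_b))
        if diff >= observed:
            extreme += 1
    # Add 1 to both counts so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_resamples + 1)

random.seed(5)
# Simulated groups (assumed population means of 50 and 65, sd 5).
group_1 = [random.gauss(50, 5) for _ in range(30)]
group_2 = [random.gauss(65, 5) for _ in range(30)]

p_value = permutation_test(group_1, group_2)
print(f"p-value = {p_value:.4f}")
```

A small p-value suggests that the shift you see between the two histograms is unlikely to be random sampling error alone. For real analyses, use the test from the list above that matches your data and goals.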
Histograms are a great way to investigate your data. However, when you need to draw inferences about an entire population, be sure to use a representative sampling method and the proper hypothesis test.
Related post: Median: Definition and Uses