Because histograms display the shape and spread of distributions, you might think they’re the best type of graph for determining whether your data are normally distributed. However, I’ll show you how histograms can trick you! Normal probability plots are a better choice for this task and they are easy to use.
Using Histograms to Graph Normal Distributions
First, let’s look at what you expect to see on a histogram when your data follow a normal distribution.
I’ve added the fitted distribution, and it sure seems to fit the data well.
So, what’s wrong with using a histogram to assess normality? Histograms are particularly problematic when you have a small sample size because their appearance depends on the number of data points and the number of bars. When you have fewer than approximately 20 data points, the bars on the histogram don’t adequately display the distribution.
The histogram above uses 100 data points. However, the histograms below use datasets with only 15 observations in each. Can you tell which datasets follow the normal distribution? For comparison, I’ve included the normal distribution curve that provides the best fit for each dataset. Download the CSV dataset to check them yourself: normal_data_examples. The Cs in the graphs below correspond to the columns in the worksheet.
Surprise! All of these datasets follow the normal distribution, but you can’t tell that from the histograms.
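If you’d like to see the bar-count problem for yourself, here is a minimal sketch using only the Python standard library. It draws 15 values from a normal distribution and bins them three different ways. The helper name `bin_counts`, the seed, the mean of 50, and the standard deviation of 10 are all assumptions made for illustration; none of them come from the datasets in this post.

```python
import random

def bin_counts(data, n_bins):
    """Count observations in n_bins equal-width bins spanning the data range."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # Clamp the index so the maximum value lands in the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

rng = random.Random(1)  # arbitrary seed, used only for reproducibility
sample = [rng.gauss(50, 10) for _ in range(15)]  # 15 draws from a normal distribution

# Print a crude text histogram for several bar counts.
for n_bins in (4, 6, 8):
    counts = bin_counts(sample, n_bins)
    bars = " ".join(("#" * c) or "." for c in counts)
    print(f"{n_bins} bins: {bars}")
```

The same 15 values produce a different silhouette for each bar count, which is why a small-sample histogram is hard to judge against a bell curve.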
Using Normal Probability Plots to Graph Normal Distributions
Instead, graph these distributions using normal probability plots, which are also known as normal plots. These plots are simple to use. All you need to do is visually assess whether the data points follow the straight line. If the points track the straight line, your data follow the normal distribution. It’s very straightforward!
I’ll graph the same datasets from the histograms above, this time using normal probability plots. For this type of graph, the best approach is the “fat pencil test.” If you place an imaginary fat pencil over the straight distribution fit line, does it cover the data points? If so, your data are normally distributed. In other words, the data points don’t have to fall right on the line, but they generally need to follow it.
These normal probability plots show that all the datasets follow the normal distribution. This type of graph is also a great way to determine whether residuals from regression analysis are normally distributed.
The graph below shows how nonnormal data can appear in a normal plot. Notice the systematic departures from the straight line.
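To make “systematic departures” concrete, here is a rough sketch that uses idealized right-skewed values (exact exponential quantiles rather than a real sample) and checks on which side of a straight line each point falls. Two assumptions to flag: the cumulative probabilities use the formula (i - 0.5) / n, which is one common convention, and an ordinary least-squares line stands in for the distribution fit line that statistical software draws.

```python
import math
from statistics import NormalDist, mean

n = 15
probs = [(i - 0.5) / n for i in range(1, n + 1)]   # assumed plotting positions
z = [NormalDist().inv_cdf(p) for p in probs]       # normal quantiles for the plot axis
skewed = [-math.log(1 - p) for p in probs]         # idealized exponential "data"

# Least-squares line through the points (a stand-in for the fit line).
mz, ms = mean(z), mean(skewed)
slope = (sum((a - mz) * (b - ms) for a, b in zip(z, skewed))
         / sum((a - mz) ** 2 for a in z))
residuals = [b - (ms + slope * (a - mz)) for a, b in zip(z, skewed)]

# One character per point: "+" if it sits above the line, "-" if below.
print("".join("+" if r > 0 else "-" for r in residuals))
```

The signs come out in runs (above the line at both ends, below it in the middle) rather than scattering randomly. That curved pattern is the kind of systematic departure to look for.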
Here are a few technical notes on how statistical software creates probability plots. Your software estimates the empirical cumulative distribution function for your dataset and then plots each observation’s value against its estimated cumulative probability. The graph transforms the X and Y axes so that the distribution line is straight. If your data follow the distribution, the points will follow that line.
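Those steps can be sketched in a few lines of standard-library Python. Treat this as an illustration under stated assumptions, not a description of what any particular package does. The estimate (i - 0.5) / n for the cumulative probability is one of several conventions, and software packages differ on this. The function names `normal_plot_points` and `straightness` are hypothetical, and the sample is simulated. Instead of rescaling the probability axis, the sketch converts each probability to a standard normal quantile, which straightens the line in the same way.

```python
import random
from statistics import NormalDist, mean

def normal_plot_points(data):
    """Return (z, x) pairs: theoretical normal quantile vs. ordered observation."""
    ordered = sorted(data)
    n = len(ordered)
    points = []
    for i, x in enumerate(ordered, start=1):
        p = (i - 0.5) / n            # estimated cumulative probability (assumed formula)
        z = NormalDist().inv_cdf(p)  # standard normal quantile for that probability
        points.append((z, x))
    return points

def straightness(points):
    """Pearson correlation between the z and x coordinates of the plot."""
    zs = [z for z, _ in points]
    xs = [x for _, x in points]
    mz, mx = mean(zs), mean(xs)
    sxy = sum((z - mz) * (x - mx) for z, x in points)
    szz = sum((z - mz) ** 2 for z in zs)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / (szz * sxx) ** 0.5

rng = random.Random(7)  # arbitrary seed, used only for reproducibility
sample = [rng.gauss(100, 15) for _ in range(15)]  # simulated normal data
points = normal_plot_points(sample)
r = straightness(points)
print(f"correlation between points and a straight line: {r:.3f}")
```

A correlation close to 1 means the points hug a straight line. It is only a numeric companion to the visual check, so it doesn’t replace looking at the plot.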
Normal Probability Plots Can Be Better Than Normality Tests
You can also use normality tests to determine whether your data follow a normal distribution. However, be aware that normality tests are like all other hypothesis tests. As you increase the sample size, their ability to detect small differences increases. With a large enough sample size, these tests can detect minuscule departures from the normal distribution that are meaningless in practice. In this scenario, you can end up with a test that rejects the notion that the data are normally distributed even when they follow the normal distribution closely enough for any practical purpose.
For example, the normal probability plot below displays a dataset with 5000 observations along with the normality test results. The p-value for the test is 0.010, which indicates that the data do not follow the normal distribution. However, the points on the graph clearly follow the distribution fit line. These data follow the normal distribution despite the test results. This is a rare case where statisticians will say you can use the graph over the hypothesis test!
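To see how sample size alone can drive a normality test to reject, here is a small sketch. It uses the Jarque-Bera statistic with its large-sample chi-square approximation, which I chose only because the p-value has a closed form. It is not necessarily the test behind the p-value of 0.010 above. The sample skewness of 0.1 is a hypothetical value that stands for a departure too small to notice on a plot, and the function name `jb_p_value` is my own.

```python
import math

def jb_p_value(n, skewness, excess_kurtosis):
    """Approximate Jarque-Bera p-value for a given sample size and sample shape."""
    jb = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    # Under normality, JB is approximately chi-square with 2 degrees of freedom,
    # and the upper tail of that distribution is exactly exp(-x / 2).
    return math.exp(-jb / 2.0)

# Hold the (hypothetical) shape of the data fixed and vary only the sample size.
p_values = {n: jb_p_value(n, skewness=0.1, excess_kurtosis=0.0)
            for n in (100, 1000, 5000)}
for n, p in p_values.items():
    print(f"n = {n:>5}: p = {p:.3f}")
```

The shape of the data never changes in this sketch, yet the p-value falls steadily and drops below 0.05 by 5000 observations. The approximation is rough for small samples, so read the numbers as an illustration of the trend, not as exact results.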
In this post, I’ve highlighted using normal probability plots with small and large datasets. However, I prefer using them over histograms for datasets of all sizes. For my eyes at least, it is just easier to determine whether the data points follow a straight line than to compare bars on a histogram to a bell-shaped curve.
This post has been about using probability plots to assess normality. However, you can use these plots to evaluate other distributions. To learn more about this, read my post: How to Identify the Distribution of Your Data.