Assessing Normality: Histograms vs. Normal Probability Plots

Because histograms display the shape and spread of distributions, you might think they’re the best type of graph for determining whether your data are normally distributed. However, I’ll show you how histograms can trick you! Normal probability plots are a better choice for this task, and they are easy to use. A normal probability plot is a type of quantile-quantile plot, or Q-Q plot for short, so you’ll often see it called a normal Q-Q plot.

Using Histograms to Graph Normal Distributions

First, let’s look at what you expect to see on a histogram when your data follow a normal distribution.

I’ve added the fitted distribution, and it sure seems to fit the data well.

So, what’s wrong with using a histogram to assess normality? Histograms are particularly problematic when you have a small sample size because their appearance depends on the number of data points and the number of bars. When you have fewer than approximately 20 data points, the bars on the histogram don’t adequately display the distribution.

The histogram above uses 100 data points. However, the histograms below use datasets with only 15 observations in each. Can you tell which datasets follow the normal distribution? For comparison, I’ve included the normal distribution curve that provides the best fit for each dataset. Download the CSV dataset to check them yourself: normal_data_examples. The Cs in the graphs below correspond to the columns in the worksheet.

Histogram that displays data that appears to be nonnormal.

Histogram that appears to display nonnormal data.

Surprise! All of these datasets follow the normal distribution, but you can’t tell that from the histograms.
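
If you’d like to see this effect without downloading anything, here is a minimal Python sketch that uses only the standard library. It draws four samples of 15 observations from a normal distribution and prints a crude text histogram of each. The seed, the mean of 100, the standard deviation of 15, the six bins, and the function names are all illustrative assumptions of mine. They are not the values behind the graphs above.

```python
import random

def bin_counts(data, n_bins=6):
    """Count how many observations fall into each of n_bins equal-width bins."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # The maximum value would index one past the end, so clamp it into the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

def text_histogram(data, n_bins=6):
    """Return a crude text histogram with one row of '#' marks per bin."""
    return "\n".join("|" + "#" * c for c in bin_counts(data, n_bins))

rng = random.Random(1)  # arbitrary seed so the run is repeatable
# Hypothetical data: four samples of 15 points each, all truly normal.
samples = [[rng.gauss(100, 15) for _ in range(15)] for _ in range(4)]

for i, sample in enumerate(samples, start=1):
    print(f"Sample {i} (n = {len(sample)})")
    print(text_histogram(sample))
    print()
```

Try changing n_bins or the seed. With only 15 points per sample, the bar pattern can shift noticeably from one run to the next even though every sample comes from the same normal distribution.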

Using Normal Probability Q-Q Plots to Graph Normal Distributions

Instead, graph these distributions using normal probability Q-Q plots, which are also known as normal plots. These plots are simple to use. All you need to do is visually assess whether the data points follow the straight line. If the points track the straight line, your data follow the normal distribution. It’s very straightforward!

I’ll graph the same datasets from the histograms above but use normal probability plots instead. For this type of graph, the best approach is the “fat pencil test.” If you place an imaginary fat pencil over the straight distribution fit line, does it cover the data points? If so, your data are normally distributed. In other words, the data points don’t have to fall right on the line but generally need to follow it.

Normal probability plot that displays data that are normally distributed.

These normal probability Q-Q plots show that all the datasets follow the normal distribution. This type of graph is also a great way to determine whether residuals from regression analysis are normally distributed.

The graph below shows how nonnormal data can appear in a normal plot. Notice the systematic departures from the straight line.

Here are a few technical notes on how statistical software creates Q-Q plots. Your software estimates a cumulative probability for each observation from its rank in the sorted data (the empirical cumulative distribution function) and then plots each observation’s value against that estimated cumulative probability. The graph transforms the axes so that the fitted distribution line is straight. If your data follow the distribution, the points will follow that line.
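
To make those steps concrete, here is a rough Python sketch that builds the plot coordinates by hand using only the standard library. Software packages differ in how they estimate each point’s cumulative probability. I’m assuming the simple (i - 0.5) / n formula here, and your package may use a different one. The function name, seed, and simulated data are also just illustrative assumptions. Instead of rescaling a probability axis, the sketch converts each probability into the value a fitted normal distribution would predict, which gives the same straight-line picture.

```python
import random
from statistics import NormalDist, mean, stdev

def normal_qq_points(data):
    """Pair each sorted observation with the value a fitted normal would predict."""
    ordered = sorted(data)
    n = len(ordered)
    # Fit a normal distribution using the sample mean and standard deviation.
    fitted = NormalDist(mean(ordered), stdev(ordered))
    points = []
    for i, observed in enumerate(ordered, start=1):
        p = (i - 0.5) / n  # assumed plotting position for the i-th smallest value
        points.append((fitted.inv_cdf(p), observed))
    return points

rng = random.Random(7)  # arbitrary seed
data = [rng.gauss(50, 10) for _ in range(15)]  # hypothetical sample

print("expected  observed")
for expected, observed in normal_qq_points(data):
    print(f"{expected:8.2f}  {observed:8.2f}")
```

If you plot observed against expected, normally distributed data should fall near the line where the two are equal. That line plays the role of the distribution fit line for the fat pencil test.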

Normal Probability Q-Q Plots can be Better Than Normality Tests

You can also use normality tests to determine whether your data follow a normal distribution. However, be aware that normality tests are like all other hypothesis tests. As you increase the sample size, their ability to detect small differences increases. With a large enough sample size, these tests can detect minuscule departures from the normal distribution that are meaningless. In this scenario, the test can reject normality even though the data follow the normal distribution closely enough for any practical purpose.

For example, the normal probability Q-Q plot below displays a dataset with 5000 observations along with the normality test results. The p-value for the test is 0.010, which is below the usual 0.05 significance level, so the test concludes that the data do not follow the normal distribution. However, the points on the graph clearly follow the distribution fit line. For practical purposes, these data follow the normal distribution despite the test results. This is a rare case where statisticians will say you can use the graph over the hypothesis test!
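
You can explore this sample size effect with a small simulation. The sketch below rests on several assumptions of mine. It uses the Jarque-Bera test because its p-value is easy to compute by hand. That is not necessarily the test behind the result above, and its p-value is only approximate for small samples. The data are simulated standard normal values with a tiny quadratic term added to create mild skew. As a stand-in for eyeballing the plot, it reports the correlation between the sorted data and the normal quantiles. All function names, the seed, and the sample sizes are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean

def jarque_bera_p(data):
    """Approximate p-value of the Jarque-Bera normality test (large-sample formula)."""
    n = len(data)
    m = mean(data)
    m2 = sum((x - m) ** 2 for x in data) / n
    m3 = sum((x - m) ** 3 for x in data) / n
    m4 = sum((x - m) ** 4 for x in data) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    jb = n / 6 * (skew ** 2 + excess_kurtosis ** 2 / 4)
    # Under normality, jb is roughly chi-square with 2 df, whose upper tail is exp(-x/2).
    return math.exp(-jb / 2)

def qq_correlation(data):
    """Correlation between sorted data and standard normal quantiles (1 = straight line)."""
    ordered = sorted(data)
    n = len(ordered)
    z = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    mx, mz = mean(ordered), mean(z)
    sxz = sum((x - mx) * (q - mz) for x, q in zip(ordered, z))
    sxx = sum((x - mx) ** 2 for x in ordered)
    szz = sum((q - mz) ** 2 for q in z)
    return sxz / math.sqrt(sxx * szz)

def slightly_skewed(rng, n, a=0.05):
    """Standard normal draws plus a small quadratic term that adds mild skew."""
    return [z + a * (z * z - 1) for z in (rng.gauss(0, 1) for _ in range(n))]

rng = random.Random(42)  # arbitrary seed
for n in (50, 5000, 50000):
    sample = slightly_skewed(rng, n)
    print(f"n = {n:6d}  p-value = {jarque_bera_p(sample):.4f}  "
          f"Q-Q correlation = {qq_correlation(sample):.4f}")
```

The departure from normality is the same at every sample size. What changes as n grows is the test’s ability to detect it, so compare how the p-value and the Q-Q correlation behave across the three rows.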

In this post, I’ve highlighted using normal probability Q-Q plots with small and large datasets. However, I prefer using them over histograms for datasets of all sizes. For my eyes at least, it is just easier to determine whether the data points follow a straight line than to compare bars on a histogram to a bell-shaped curve.

This post has been about using Q-Q plots to assess normality. However, you can use these plots to evaluate other distributions. To learn more about this, read my post: How to Identify the Distribution of Your Data.

Comments

Thomas says

December 4, 2022 at 10:12 am

Dear Jim,

Thanks for very useful information on your home page!
Some questions below if you have the time.
If you sample from a normal distribution (known from the literature or from previous experience with larger sample sizes) but your sample, by chance, is not normal, and there are no outliers by the Grubbs test, can you apply the standard statistics?
Another topic: percentage data are often used, e.g., in analytical chemistry (% main peak), where you have a closed scale of 0-100%. Typically the values fall between 80% and 100%. Is it appropriate to use the normal distribution?
A further topic is the pH scale. If you work in a narrow range, let’s say pH 7.0 to 8.0, can the data be regarded as coming from a normal distribution?

- Jim Frost says
  
  December 4, 2022 at 10:54 pm
  
  Hi Thomas,
  
  If your sample is non-normal but you know for a fact the population is normal, I’d give a very cautious OK for proceeding. Particularly if you have a sample size of at least 30 because normality isn’t crucial for larger samples anyway. However, there are some major caveats to consider.
  
  If you know that your population follows a normal distribution, but your sample does not, particularly if your sample is strongly non-normal, then you know that your sample does not represent the population in at least some characteristics. That should give you pause if you’re using your sample to draw conclusions about the population. Is there some reason why the sample doesn’t look like the population? It could be random sampling error that occurred by chance. Or perhaps there was some error with your sampling, experimental, and/or measurement process?
  
  So, theoretically you might be ok proceeding, but you really should understand why your sample doesn’t look like the population. That’s a red warning flag that something might be amiss. It depends how different the sample looks from the population. If it’s only slightly non-normal, it might not be a big deal. But if it’s strongly non-normal and it should be normal, it becomes a bigger concern.
  
  As for working in narrow ranges, you’ll need to understand empirically what the data look like in those ranges. Technically, the normal distribution has no upper and lower limits. So, if your data have limits, there’s at least a small degree of non-normality right there. But several conditions can make it non-normal enough to be a problem. If those ranges are artificially constrained (e.g., you remove all values outside the range), chances are the data don’t follow a normal distribution. The more constrained they are, the more of a problem it becomes. Additionally, if the mean is closer to one end of the range than the other, the distribution likely skews toward the farther end. For example, if the mean is near 95%, the data are probably left-skewed.
  
  You should examine your data to assess the distribution directly. If they’re not normally distributed, you can either use a non-parametric method or simply collect a large enough sample size so the central limit theorem kicks in and normality isn’t an issue. If you haven’t already, you should read the following posts because they go into more depth on the issues I discuss above.
  
  Identifying the Distribution of Your Data
  Nonparametric vs Parametric Tests
  Central Limit Theorem
  
  I hope that helps!
  
Jereesh K Elias says

February 8, 2022 at 1:35 am

I have a doubt regarding choosing between a parametric and a non-parametric test based on the normality of the data. I have been taught that if the variables follow a non-normal distribution, we should go for a non-parametric test. My doubt is, in case our independent variable is normally distributed and our dependent variable is non-normally distributed, broadly which test should we use? Parametric or non-parametric?

Madhu says

January 18, 2021 at 2:47 pm

Great post!! Hope you continue the great work!

Suruchi says

December 1, 2020 at 10:49 am

Hello Jim
Your posts are very helpful to me.
Please guide me about smooth frequency, how to perform it.
Thanks

Anurag Chakraborty says

June 29, 2019 at 7:29 pm

How do I construct a normal probability plot ?

Cédric ntata says

September 10, 2018 at 11:22 am

I am very happy to learn a bit more and add to my statistics knowledge.