The empirical rule in statistics, also known as the 68-95-99.7 rule, states that for normal distributions, 68% of observed data points will lie inside one standard deviation of the mean, 95% will fall within two standard deviations, and 99.7% will occur within three standard deviations.
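If you want to see where the rounded 68, 95, and 99.7 figures come from, here is a minimal sketch that asks Python's standard library for the exact proportions of a normal distribution within one, two, and three standard deviations. The helper name `share_within` is my own, not part of any library.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k: float) -> float:
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.2%}")
# within 1 SD: 68.27%
# within 2 SD: 95.45%
# within 3 SD: 99.73%
```

The rule simply rounds these values to numbers that are easy to remember.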
Thanks to the empirical rule, the mean and standard deviation become extra valuable when you reasonably expect that your data approximate a normal distribution. Simply knowing these two statistics allows you to calculate probabilities and percentages for various outcomes.
The name of the empirical rule comes from empirical research, which uses observations and measurements of real-world outcomes rather than theory. In other words, empirical means it is grounded in practical reality. The empirical rule takes these recorded outcomes and lets you use them to make forecasts and calculate probabilities.
Statisticians also refer to the empirical rule as the three-sigma rule because nearly all observations occur within three standard deviations of the mean. Statistical control charts apply this rule by setting their upper and lower limits at +/- three standard deviations. In general, this limit serves as a valuable way to identify outliers because 99.7% of all values should fall within it.
The empirical rule graph below displays the standard normal distribution with the ranges and percentages.
In this post, learn how to use the empirical rule, see the formula for calculating the data ranges, and work through examples to solve problems.
Related post: Understanding the Normal Distribution
How to Use the Empirical Rule
Analysts use the empirical rule to predict the probabilities and distributions of the outcomes that they’re studying. It’s a valuable tool because it lets you make predictions using just two easy-to-calculate statistics. First, verify that your data follow a normal distribution at least roughly. If they do, you can start making forecasts by calculating the mean and standard deviation.
Many organizations use the empirical rule as a quality control method because many variables follow the normal distribution at least approximately, and it’s easy to calculate the mean and standard deviation. Similarly, the value-at-risk (VaR) financial risk assessment often assumes that the probabilities for outcomes follow a normal distribution. In short, the empirical rule is a quick and easy prediction method that provides good results when the normality assumption holds.
Additionally, the empirical rule is an easy way to identify outliers. Because 99.7% of all observations should be within three standard deviations of the mean, analysts frequently use the limit of three standard deviations to identify outliers. Investigate observations outside this limit as potential outliers.
Related post: Five Ways to Identify Outliers
The empirical rule is also a simple normality test. Based on the probabilities, you know that 99.7% of all observations should fall within three standard deviations from the mean. Therefore, only 100 – 99.7 = 0.3% should be outside the limit for a normal distribution. If too many values fall outside this limit, your data might not follow a normal distribution. The unrounded figure is about 0.27%, so using the three-sigma limit of the empirical rule, you’d expect about 1 in every 370 observations to exceed the limit. Consequently, if you have 500 observations and 10 (2%) are outside the empirical rule limit, far more than the one or two you’d expect, your data might not be normally distributed.
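Here is a rough sketch of that check in Python. It counts the share of observations beyond three sample standard deviations from the sample mean and prints it next to the roughly 1-in-370 share expected under normality. The function name and both simulated samples are my own assumptions for illustration; they are not real data.

```python
import random
from statistics import mean, stdev

def share_outside_three_sigma(values):
    """Return the fraction of values more than 3 sample SDs from the sample mean."""
    m = mean(values)
    s = stdev(values)
    outside = [v for v in values if abs(v - m) > 3 * s]
    return len(outside) / len(values)

EXPECTED_SHARE = 1 / 370  # about 0.27% under normality

random.seed(1)  # fixed seed so the illustration is repeatable
# Simulated (made-up) samples: one roughly normal, one right-skewed.
normal_sample = [random.gauss(30, 5) for _ in range(5000)]
skewed_sample = [random.expovariate(1 / 30) for _ in range(5000)]

print(f"expected under normality: {EXPECTED_SHARE:.2%}")
print(f"normal sample: {share_outside_three_sigma(normal_sample):.2%}")
print(f"skewed sample: {share_outside_three_sigma(skewed_sample):.2%}")
```

Treat a large gap between the observed and expected shares as a prompt to investigate, not as a formal hypothesis test.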
Outliers vs. Non-Normal Data
As an analyst using the empirical rule, you must distinguish between outliers and non-normal distributions. Both conditions can cause an unusual number of data points to lie outside the three-sigma limit. For example, the observations might be valid but follow a skewed distribution, which can create the appearance of outliers. To sort through this question, you’ll need to evaluate your data carefully, determine how they’re distributed, assess the data points in question, and apply plenty of subject-area knowledge.
Related post: How to Identify the Distribution of Your Data
Empirical Rule Formula
To calculate the data ranges associated with the empirical rule percentages of 68%, 95%, and 99.7%, start by calculating the sample mean (x̅) and standard deviation (s). Then input those values into the formulas below to derive the ranges.
| Data range | Percentage of data in the range |
|---|---|
| x̅ − s to x̅ + s | 68% |
| x̅ − 2s to x̅ + 2s | 95% |
| x̅ − 3s to x̅ + 3s | 99.7% |
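As a minimal sketch, the formulas above could be wrapped in a small Python function that takes raw observations and returns the three ranges. The function name `empirical_rule_ranges` and the sample values are hypothetical, made up only to show the calculation.

```python
from statistics import mean, stdev

def empirical_rule_ranges(values):
    """Return {percentage: (lower, upper)} for 1, 2, and 3 sample SDs around the mean."""
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    return {
        pct: (x_bar - k * s, x_bar + k * s)
        for k, pct in ((1, 68.0), (2, 95.0), (3, 99.7))
    }

# Hypothetical measurements, made up for illustration only.
sample = [28.0, 31.5, 30.2, 26.8, 33.1, 29.4, 30.9, 32.0]
for pct, (low, high) in empirical_rule_ranges(sample).items():
    print(f"{pct}% of data expected between {low:.1f} and {high:.1f}")
```

The ranges are only meaningful if the data are at least roughly normal, so check the distribution before relying on them.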
Example of Using the Empirical Rule
Let’s work through an example problem for the empirical rule. Assume that a pizza restaurant has a mean delivery time of 30 minutes and a standard deviation of 5 minutes, and the data follow the normal distribution.
Using the empirical rule, we can estimate the range in which 68% of delivery times occur by taking the mean and adding and subtracting the standard deviation (30 +/- 5), producing a range of 25-35 minutes.
Use the same process for the other empirical rule percentages by using the 2X and 3X multiples of the standard deviation. 95% of delivery times are between 20-40 minutes (30 +/- 2*5), and 99.7% are between 15-45 minutes (30 +/- 3*5). The chart below illustrates this property graphically.
Using the Empirical Rule to Calculate Other Percentages
However, the empirical rule isn’t limited to just the percentages of 68%, 95%, and 99.7%. Using it creatively, you can figure out other properties. To do that, you’ll need to factor in the properties of the normal distribution. Of particular value are the facts that the normal distribution is symmetrical and centers on the mean.
For example, because the empirical rule states that 95% of the delivery times will be inside the 2X standard deviation range (20-40 minutes), we know 5% will be outside it. Further, the distribution is symmetrical, meaning that 2.5% will be less than 20 minutes and 2.5% will be more than 40 minutes. Because we can predict that 2.5% of delivery times will be longer than 40 minutes, we also know that 97.5% will be less than 40 minutes.
Additionally, use the empirical rule to calculate percentiles for particular values. Suppose we wanted to determine the probability of delivery times less than 35 minutes. Using the empirical rule, we know that 68% will fall between 25-35 minutes. Because the normal distribution is symmetrical, we know that half of this range (34%) falls above the mean, between 30-35 minutes. Additionally, half of the entire distribution (50%) falls below the mean (less than 30 minutes). Consequently, we just add those percentages (50% + 34% = 84%) to determine that 84% of deliveries will occur in less than 35 minutes. Equivalently, 35 minutes is the 84th percentile.
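To double-check this kind of arithmetic, here is a short Python sketch that repeats the two calculations above and compares them with the values from the normal cumulative distribution function in the standard library. The variable names are my own, and the mean and standard deviation are the delivery-time figures from the example.

```python
from statistics import NormalDist

MEAN, SD = 30.0, 5.0  # delivery-time figures from the example

# Empirical-rule arithmetic, using symmetry around the mean.
below_40_rule = 95.0 + (100.0 - 95.0) / 2  # 97.5
below_35_rule = 50.0 + 68.0 / 2            # 84.0

# Values from the normal cumulative distribution function, as percentages.
delivery = NormalDist(mu=MEAN, sigma=SD)
below_40_exact = delivery.cdf(40) * 100
below_35_exact = delivery.cdf(35) * 100

print(f"under 40 min: rule {below_40_rule:.1f}%, exact {below_40_exact:.1f}%")
print(f"under 35 min: rule {below_35_rule:.1f}%, exact {below_35_exact:.1f}%")
# under 40 min: rule 97.5%, exact 97.7%
# under 35 min: rule 84.0%, exact 84.1%
```

The small differences arise because the empirical rule uses rounded percentages.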
Using the empirical rule, can you figure out how to determine the probability of a delivery taking between 35-40 minutes?
Alternatively, you can use z-scores to calculate probabilities and percentiles for data that follow the normal distribution. For more information, read my post, Z-score: Definition, Formula, and Uses.
When your data follow a normal distribution, the empirical rule is a valuable statistical tool. But what do you do when your data are not normally distributed? In that case, use Chebyshev’s Theorem! That method provides results similar to those of the empirical rule but for non-normal data.