Variance is a measure of variability in statistics. It assesses the average squared difference between data values and the mean. Unlike some other statistical measures of variability, it incorporates all data points in its calculations by comparing each value to the mean.
When there is no variability in a sample, all values are the same, and the variance equals zero. As the data values spread out further, variability increases.
For example, these two distributions have the same mean. However, the dataset on the right has greater variability and, hence, a higher variance.
In this post, learn how to calculate both population and sample variance and how to interpret them.
Related post: Measures of Variability
There are two formulas for the variance. The correct formula depends on whether you are working with the entire population or using a sample to estimate the population value. In other words, decide which formula to use depending on whether you are performing descriptive or inferential statistics.
The equations are below, and then I work through an example of finding the variance to help bring it to life.
Population variance formula
Use the population form of the equation when you have values for all members of the group of interest. In this case, you are not using the sample to estimate the population. Instead, you have measured all people or items and need the variance for that specific group. For example, if you have measured test scores for all class members and need to know the value for that class, use the population variance formula.
The population variance formula is the following:
$$\sigma^2 = \frac{\sum_{i=1}^{n} (X_i - \mu)^2}{n}$$
In the population variance formula:
- σ² is the population variance.
- Xᵢ is the ith data point.
- µ is the population mean.
- n is the number of observations.
To find the variance, take a data point, subtract the population mean, and square that difference. Repeat this process for all data points. Then, sum all of those squared values and divide by the number of observations. Hence, it’s the average squared difference.
Statisticians refer to the numerator portion of the variance formula as the sum of squares.
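The steps above can be sketched in a few lines of Python. This is a minimal illustration, not code from the original post: the function name `population_variance` and the three test scores are made up for the example, and the standard library's `statistics.pvariance` is used only as a cross-check.

```python
import statistics


def population_variance(values):
    """Average squared difference from the mean, dividing by n."""
    n = len(values)
    mean = sum(values) / n
    # Numerator of the formula: the sum of squares.
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    return sum_of_squares / n


# Hypothetical scores for every member of a very small class.
scores = [70, 80, 90]

print(population_variance(scores))
# Cross-check against the standard library's population variance.
print(statistics.pvariance(scores))
```

Both calls divide the sum of squares (200 for these scores) by n = 3.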
Sample variance formula
Use the sample variance formula when you’re using a sample to estimate the value for a population. For example, if you have taken a random sample of statistics students, recorded their test scores, and need to use the sample as an estimate for the population of statistics students, use the sample variance formula.
The population formula tends to underestimate variability when you use it with a sample. The sample formula below corrects for that bias.
$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{x})^2}{n - 1}$$
In the sample variance formula:
- s² is the sample variance.
- Xᵢ is the ith data point.
- x̅ is the sample mean.
- n–1 is the degrees of freedom.
The calculation process for samples is very similar to the population method. However, you’re working with a sample instead of a population, and you’re dividing by n–1. This denominator counteracts a bias where samples tend to underestimate the population value.
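To see the bias in action, here is a rough simulation sketch. Everything in it is an assumption made for illustration: the normal population with a true variance of 100, the sample size of 5, the number of trials, and the helper names. It draws many small samples and averages the two estimates. In theory, dividing by n should average about (n – 1)/n times the true variance, while dividing by n – 1 should average close to the true variance.

```python
import random
import statistics


def sum_of_squares(values):
    """Sum of squared differences between each value and the sample mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


def sample_variance(values):
    """Sample variance: sum of squares divided by n - 1."""
    return sum_of_squares(values) / (len(values) - 1)


# Assumed setup: normal population with mean 50 and standard deviation 10,
# so the true population variance is 100.
TRUE_VARIANCE = 100.0
SAMPLE_SIZE = 5
TRIALS = 20_000
rng = random.Random(42)  # fixed seed so the run is repeatable

divide_by_n = []
divide_by_n_minus_1 = []
for _ in range(TRIALS):
    sample = [rng.gauss(50, 10) for _ in range(SAMPLE_SIZE)]
    divide_by_n.append(sum_of_squares(sample) / SAMPLE_SIZE)
    divide_by_n_minus_1.append(sample_variance(sample))

mean_biased = statistics.fmean(divide_by_n)
mean_unbiased = statistics.fmean(divide_by_n_minus_1)

print(f"True variance:          {TRUE_VARIANCE:.1f}")
print(f"Average dividing by n:  {mean_biased:.1f}")
print(f"Average dividing by n-1: {mean_unbiased:.1f}")
```

The exact averages depend on the random draws, but the estimate that divides by n should land noticeably below the true variance.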
Let’s work through an example calculation!
How to Find Variance
Here’s an example of how to calculate the variance using the sample formula. The dataset has 17 observations in the table below. The numbers in parentheses correspond to table columns.
To calculate the statistic, take each data value (1) and subtract the mean (2) to calculate the difference (3), and then square the difference (4).
At the bottom of the worksheet, I sum the squared values and divide that sum by 17 – 1 = 16 because we’re finding the sample value.
The variance for this dataset is 201.
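The same worksheet steps can be sketched in Python. The five values below are hypothetical and are not the 17 observations from the post's table, so the result differs from 201. The printed columns follow the same order: value, mean, difference, and squared difference.

```python
import statistics

# Hypothetical sample; NOT the dataset from the worksheet in the post.
data = [4, 8, 6, 5, 7]

n = len(data)
mean = sum(data) / n

print(f"{'value':>6} {'mean':>6} {'difference':>11} {'squared':>8}")
squared_differences = []
for x in data:
    difference = x - mean              # columns (1) - (2) = (3)
    squared = difference ** 2          # column (4)
    squared_differences.append(squared)
    print(f"{x:>6} {mean:>6.1f} {difference:>11.1f} {squared:>8.1f}")

# Sum column (4), then divide by n - 1 for the sample variance.
total = sum(squared_differences)
worksheet_variance = total / (n - 1)

print(f"Sum of squares: {total:.1f}")
print(f"Sample variance: {worksheet_variance:.1f}")

# Cross-check with the standard library, which also divides by n - 1.
print(statistics.variance(data))
```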
Interpreting the Variance
The variance in statistics is the average squared distance between the data points and the mean. Because it uses squared units rather than the natural data units, the interpretation is less intuitive. Higher values indicate greater variability, but there is no intuitive interpretation for specific values. Despite this drawback, some statistical hypothesis tests use it in their calculations. For example, read about the F-test and ANOVA.
Squaring the differences serves several purposes.
Squaring the differences prevents values above and below the mean from canceling each other out. Consequently, variance is always greater than or equal to zero. It is almost always a positive value because only datasets containing one repeated value (e.g., all values equal 15) have a variance of zero.
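A short sketch makes the canceling effect concrete. The values are hypothetical and chosen so the arithmetic is easy to follow: the raw differences sum to zero, while the squared differences do not.

```python
# Hypothetical values spread evenly around a mean of 15.
data = [10, 15, 20]
mean = sum(data) / len(data)

deviations = [x - mean for x in data]
sum_of_deviations = sum(deviations)              # negatives cancel positives
sum_of_squares = sum(d ** 2 for d in deviations)  # squares cannot cancel

print(deviations)
print(sum_of_deviations)
print(sum_of_squares)

# A dataset with one repeated value has no spread at all.
constant = [15, 15, 15]
constant_mean = sum(constant) / len(constant)
constant_variance = sum((x - constant_mean) ** 2 for x in constant) / len(constant)
print(constant_variance)
```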
Additionally, squaring differences disproportionately increases the impact of data points that are further from the mean. This additional weight mirrors the properties of the normal distribution, where outliers are substantially less likely to occur. In that distribution, the likelihood of a value does not fall off linearly as it moves away from the mean.
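As a rough illustration of that extra weight, the sketch below uses a hypothetical dataset with one extreme value. It compares how much of the total that value contributes when differences are taken as absolute values versus when they are squared.

```python
# Hypothetical dataset with one extreme value (20).
data = [1, 2, 3, 4, 20]
mean = sum(data) / len(data)  # 6.0

absolute = [abs(x - mean) for x in data]
squared = [(x - mean) ** 2 for x in data]

# Share of the total contributed by the extreme value (last item).
abs_share = absolute[-1] / sum(absolute)
sq_share = squared[-1] / sum(squared)

print(f"Share of total absolute difference: {abs_share:.1%}")
print(f"Share of total squared difference:  {sq_share:.1%}")
```

With these numbers, the extreme value accounts for half of the total absolute difference but more than three quarters of the sum of squares.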
If you take the square root of the variance, you obtain the standard deviation, which does use the intuitive natural data units. The mean absolute deviation is another measure of variability that also uses natural units, but its formula does not square the differences.
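Here is a minimal sketch of both measures, using a hypothetical sample. The mean absolute deviation is computed by hand because it simply averages the absolute differences from the mean.

```python
import math
import statistics

# Hypothetical sample values.
data = [4, 8, 6, 5, 7]
mean = sum(data) / len(data)

# Standard deviation: square root of the sample variance (n - 1 denominator).
variance = statistics.variance(data)
standard_deviation = math.sqrt(variance)

# Mean absolute deviation: average absolute difference, no squaring.
mean_absolute_deviation = sum(abs(x - mean) for x in data) / len(data)

print(f"Variance (squared units):  {variance:.3f}")
print(f"Standard deviation:        {standard_deviation:.3f}")
print(f"Mean absolute deviation:   {mean_absolute_deviation:.3f}")
```

Both of the last two results are in the same units as the data, unlike the variance.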
Bob E. says
I think what Jim Frost meant is that there is more variability in the population than in a sample of that population. Therefore, the sample variance with (n) as a denominator underestimates the population variance. However, if you divide by the smaller number (n-1) instead of (n), the resulting variance is larger and is now a good estimate of the population variance.
Darshan Goswami says
Thanks for your explanation. Is there any specific reason behind having n-1 (DF) as the denominator for calculating sample variance? I couldn’t get the logic “to counteract a bias where samples tend to underestimate the population value”. Can you please elaborate?