Analysis of variance (ANOVA) uses F-tests to statistically assess the equality of means when you have three or more groups. In this post, I’ll answer several common questions about the F-test.
- How do F-tests work?
- Why do we analyze variances to test means?
I’ll use concepts and graphs to answer these questions about F-tests in the context of a one-way ANOVA example. I’ll use the same approach that I use to explain how t-tests work. If you need a primer on the basics, read my hypothesis testing overview.
To learn more about ANOVA tests, including the more complex forms, read my ANOVA Overview.
Introducing F-tests and F-statistics!
The term F-test is based on the fact that these tests use the F-statistic to test the hypotheses. An F-statistic is the ratio of two variances, and it is named after Sir Ronald Fisher. Variances measure the dispersion of the data points around the mean. Higher variances occur when the individual data points tend to fall further from the mean.
It’s difficult to interpret variances directly because they are in squared units of the data. If you take the square root of the variance, you obtain the standard deviation, which is easier to interpret because it uses the data units. While variances are hard to interpret directly, some statistical tests use them in their equations.
An F-statistic is the ratio of two variances, or technically, two mean squares. Mean squares are simply variances that account for the degrees of freedom (DF) used to estimate the variance.
Think of it this way. Variances are built from the sum of the squared deviations from the mean. If you have a bigger sample, there are more squared deviations to add up, so the sum becomes larger and larger as you add in more observations. Mean squares divide each sum by its DF, which accounts for the differing numbers of measurements behind each estimate of the variance. Otherwise, the variances are not comparable, and the ratio for the F-statistic is meaningless.
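To make the role of the DF concrete, here is a minimal sketch in Python. The two samples are simulated, and the population mean of 50 and standard deviation of 2 are assumptions chosen only for illustration. It shows the sum of squared deviations growing with sample size while the mean square stays on a comparable scale.

```python
import random
import statistics

def sum_of_squares(values):
    """Sum of squared deviations from the sample mean."""
    mean = statistics.fmean(values)
    return sum((x - mean) ** 2 for x in values)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical samples from the same population (mean 50, SD 2 are assumptions).
small = [rng.gauss(50, 2) for _ in range(10)]
large = [rng.gauss(50, 2) for _ in range(1000)]

ss_small = sum_of_squares(small)
ss_large = sum_of_squares(large)

# Mean square = sum of squares divided by its degrees of freedom (n - 1 here).
ms_small = ss_small / (len(small) - 1)
ms_large = ss_large / (len(large) - 1)

print(f"n=10:   sum of squares = {ss_small:8.1f}, mean square = {ms_small:.2f}")
print(f"n=1000: sum of squares = {ss_large:8.1f}, mean square = {ms_large:.2f}")
```

The sum of squares for the larger sample is far bigger simply because it adds up more terms, while both mean squares estimate the same population variance (4 under these assumptions).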
Given that F-tests evaluate the ratio of two variances, you might think they’re only suitable for determining whether the variances are equal. Actually, they can do that and a lot more! F-tests are surprisingly flexible because you can include different variances in the ratio to test a wide variety of properties. F-tests can compare the fits of different models, test the overall significance in regression models, test specific terms in linear models, and determine whether a set of means are all equal.
The F-test in One-Way ANOVA
We want to determine whether a set of means are all equal. To evaluate this with an F-test, we need to use the proper variances in the ratio. Here’s the F-statistic ratio for one-way ANOVA:
F = between-groups variance / within-groups variance
To see how F-tests work, I’ll go through a one-way ANOVA example. You can download the CSV data file: OneWayExample. The numeric results are below, and I’ll reference them as I illustrate how the test works. This one-way ANOVA assesses the means of four groups.
F-test Numerator: Between-Groups Variance
The one-way ANOVA procedure calculates the average of each of the four groups: 11.203, 8.938, 10.683, and 8.838. The means of these groups spread out around the global mean (9.915) of all 40 data points. The further the groups are from the global mean, the larger the variance in the numerator becomes.
It’s easier to say that the group means are different when they are further apart. That’s pretty self-evident, right? In our F-test, this corresponds to having a higher variance in the numerator.
The dot plot illustrates how this works by comparing two sets of group means. This graph represents each group mean with a dot. The between-group variance increases as the dots spread out.
Looking back at the one-way ANOVA output, which statistic do we use for the between-group variance? The value we use is the adjusted mean square for Factor (Adj MS 14.540). The meaning of this number is not intuitive because it is the sum of the squared distances between each group mean and the global mean, weighted by group size and divided by the factor DF. The relevant point is that this number increases as the group means spread further apart.
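As a rough check on the numerator, the between-groups mean square can be approximated from the four group means quoted above. This sketch assumes equal group sizes of 10 (40 data points across four groups) and uses the rounded means, so it will not match the output to the last digit.

```python
# Group means quoted in the text; equal group sizes of 10 are assumed.
group_means = [11.203, 8.938, 10.683, 8.838]
n_per_group = 10

# With equal group sizes, the global mean is the average of the group means.
global_mean = sum(group_means) / len(group_means)

# Between-groups sum of squares: squared distance of each group mean from the
# global mean, weighted by the number of observations in the group.
ss_between = sum(n_per_group * (m - global_mean) ** 2 for m in group_means)

df_factor = len(group_means) - 1  # 4 groups - 1 = 3
ms_between = ss_between / df_factor

print(f"Global mean: {global_mean:.3f}")
print(f"Between-groups mean square: {ms_between:.3f}")
```

The result lands at about 14.54, in line with the adjusted mean square for Factor, with the small difference coming from rounding in the group means.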
F-test Denominator: Within-Groups Variance
Now we move on to the denominator of the F-test, which factors in the variances within each group. This variance measures the distance between each data point and its group mean. Again, it is the sum of the squared distances divided by the error DF.
This variance is small when the data points within each group are closer to their group mean. As the data points within each group spread out further from their group mean, the within-group variance increases.
The graph compares low within-group variability to high within-group variability. The distributions represent how tightly the data points within each group cluster around the group mean. The F-statistic denominator, or the within-group variance, is higher for the right panel because the data points tend to be further from the group average.
To conclude that the group means are not equal, you want low within-group variance. Why? The within-group variance represents the variance that the model does not explain. Statisticians refer to this as random error. As the error increases, it becomes more likely that the observed differences between group means are caused by the error rather than by actual differences at the population level. Obviously, you want low amounts of error!
Let’s refer to the ANOVA output again. The within-group variance appears in the output as the adjusted mean squares for error (Adj MS for Error): 4.402.
The F-Statistic: Ratio of Between-Groups to Within-Groups Variances
F-statistics are the ratio of two variances that are approximately the same value when the null hypothesis is true, which yields F-statistics near 1.
We looked at the two different variances used in a one-way ANOVA F-test. Now, let’s put them together to see which combinations produce low and high F-statistics. In the graphs, look at how the spread of the group means compares to the spread of the data points within each group.
- Low F-value graph: The group means cluster together more tightly than the within-group variability. The distance between the means is small relative to the random error within each group. You can’t conclude that these groups are truly different at the population level.
- High F-value graph: The group means spread out more than the variability of the data within groups. In this case, it becomes more likely that the observed differences between group means reflect differences at the population level.
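The contrast between the two graphs can be sketched with made-up numbers. In this hypothetical Python example, both datasets have the same three group means (10, 15, and 20), and only the spread within each group changes. The function name and the data are assumptions for illustration, not part of the example study.

```python
def one_way_f(groups):
    """F-value for one-way ANOVA: between-groups MS / within-groups MS."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Numerator: spread of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ms_between = ss_between / (k - 1)

    # Denominator: spread of the data points around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_within = ss_within / (n_total - k)

    return ms_between / ms_within

# Hypothetical data: identical group means, different within-group spread.
tight_groups = [[9, 10, 11], [14, 15, 16], [19, 20, 21]]
noisy_groups = [[5, 10, 15], [10, 15, 20], [15, 20, 25]]

f_tight = one_way_f(tight_groups)
f_noisy = one_way_f(noisy_groups)

print(f"Tight groups: F = {f_tight:.1f}")
print(f"Noisy groups: F = {f_noisy:.1f}")
```

The numerator is the same in both cases, so the drop in the F-value for the noisy groups comes entirely from the larger within-group variance in the denominator.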
How to Calculate our F-value
Going back to our example output, we can use our F-ratio numerator and denominator to calculate our F-value like this:
F = 14.540 / 4.402 = 3.30
To be able to conclude that not all group means are equal, we need a large F-value to reject the null hypothesis. Is ours large enough?
A tricky thing about F-values is that they are unitless statistics, which makes them hard to interpret. Our F-value of 3.30 indicates that the between-groups variance is 3.3 times the size of the within-group variance. Under the null hypothesis, the two variances are approximately equal, which produces an F-value near 1. Is our F-value of 3.30 large enough to reject the null hypothesis?
We don’t know exactly how uncommon our F-value is if the null hypothesis is correct. To interpret individual F-values, we need to place them in a larger context. F-distributions provide this broader context and allow us to calculate probabilities.
How F-tests Use F-distributions to Test Hypotheses
A single F-test produces a single F-value. However, imagine we perform the following process.
First, let’s assume that the null hypothesis is true for the population. At the population level, all four group means are equal. Now, we repeat our study many times by drawing many random samples from this population using the same one-way ANOVA design (four groups with 10 observations per group). Next, we perform one-way ANOVA on each of the samples and plot the distribution of the F-values. This distribution is known as a sampling distribution, which is a type of probability distribution.
Related post: Understanding Probability Distributions
If we follow this procedure, we produce a graph that displays the distribution of F-values for a population where the null hypothesis is true. We use sampling distributions to calculate how likely it is to obtain a sample statistic at least as extreme as ours if the null hypothesis is true. F-tests use the F-distribution.
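The repeated-sampling thought experiment can also be simulated. The sketch below uses NumPy and assumes a normal population with a mean of 10 and a standard deviation of 2. Those values are arbitrary, because the F-value does not depend on the location or scale of the data. It draws 20,000 studies of four groups with 10 observations each and records the F-value from each one.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed so the run is repeatable
n_sims, n_groups, n_per_group = 20_000, 4, 10

# Null hypothesis is true: every group comes from the same population.
# The mean of 10 and SD of 2 are assumptions for illustration only.
data = rng.normal(loc=10, scale=2, size=(n_sims, n_groups, n_per_group))

group_means = data.mean(axis=2)        # one mean per group per simulated study
grand_means = data.mean(axis=(1, 2))   # one global mean per simulated study

# Between-groups and within-groups sums of squares for every simulated study.
ss_between = n_per_group * ((group_means - grand_means[:, None]) ** 2).sum(axis=1)
ss_within = ((data - group_means[:, :, None]) ** 2).sum(axis=(1, 2))

df_between = n_groups - 1                      # 3
df_within = n_groups * (n_per_group - 1)       # 36
f_values = (ss_between / df_between) / (ss_within / df_within)

# Share of simulated studies with an F-value at least as large as 3.30.
share = (f_values >= 3.30).mean()
print(f"Share of simulated F-values >= 3.30: {share:.3f}")
```

A histogram of `f_values` approximates the sampling distribution described above, and the printed share is a simulation-based estimate of how often an F-value of 3.30 or more appears when the null hypothesis is true.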
Fortunately, we don’t need to go to the trouble of collecting numerous random samples to create this graph! Statisticians understand the properties of F-distributions so we can estimate the sampling distribution using the F-distribution and the details of our one-way ANOVA design.
Our goal is to evaluate whether our sample F-value is so rare that it justifies rejecting the null hypothesis for the entire population. We’ll calculate the probability of obtaining an F-value that is at least as high as our study’s value (3.30).
This probability has a name—the P value! A low probability indicates that our sample data are unlikely when the null hypothesis is true.
Graphing the F-test for Our One-Way ANOVA Example
For one-way ANOVA, the degrees of freedom in the numerator and the denominator define the F-distribution for a design. There is a different F-distribution for each study design. I’ll create a probability distribution plot based on the DF indicated in the statistical output example. Our study has 3 DF in the numerator and 36 in the denominator.
The distribution curve displays the likelihood of F-values for a population where the four group means are equal at the population level. I shaded the region that corresponds to F-values greater than or equal to our study’s F-value (3.3). When the null hypothesis is true, F-values fall in this area approximately 3.1% of the time. Using a significance level of 0.05, our sample data are unusual enough to warrant rejecting the null hypothesis. The sample evidence suggests that not all group means are equal.
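The shaded area can be computed directly from the F-distribution. This is a minimal sketch using SciPy's `scipy.stats.f`, with the F-value and DF taken from the example. The 0.05 significance level is the one used above.

```python
from scipy import stats

f_value = 3.30
df_num, df_den = 3, 36   # numerator and denominator degrees of freedom
alpha = 0.05

# Survival function: probability of an F-value at least this large under the null.
p_value = stats.f.sf(f_value, df_num, df_den)

# Critical value: the F-value that cuts off the upper 5% of the distribution.
critical_value = stats.f.ppf(1 - alpha, df_num, df_den)

print(f"P value: {p_value:.3f}")
print(f"Critical F at alpha = {alpha}: {critical_value:.3f}")
print("Reject the null hypothesis" if p_value < alpha else "Fail to reject the null hypothesis")
```

The P value comes out near 0.031, which matches the shaded 3.1% region, and the F-value of 3.30 exceeds the critical value for this design.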
Learn how to interpret P values correctly and avoid a common mistake.
Why We Analyze Variances to Test Means
Let’s return to the question about why we analyze variances to determine whether the group means are different. Focus on the “means are different” aspect. This part explicitly involves the variation of the group means. If there is no variation in the means, they can’t be different, right? Similarly, the larger the differences between the means, the more variation must be present.
ANOVA and F-tests assess the amount of variability between the group means in the context of the variation within groups to determine whether the mean differences are statistically significant. While statistically significant ANOVA results indicate that not all means are equal, they don’t identify which particular differences between pairs of means are significant. To make that determination, you’ll need to use post hoc tests to supplement the ANOVA results.
If you’d like to learn about other hypothesis tests using the same general approach, read:
- How t-Tests Work: 1-Sample, 2-Sample, and Paired t-Tests
- How t-Tests Work: t-Values, t-Distributions, and Probabilities
- How Chi-Squared Tests of Independence Work
To see an alternative to traditional hypothesis testing that does not use probability distributions and test statistics, learn about bootstrapping in statistics!
Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.