Post hoc tests are an integral part of ANOVA. When you use ANOVA to test the equality of at least three group means, statistically significant results indicate that not all of the group means are equal. However, ANOVA results do not identify which particular differences between pairs of means are significant. Use post hoc tests to explore differences between multiple group means while controlling the experiment-wise error rate.
In this post, I’ll explain what post hoc analyses are, describe the critical benefits they provide, and help you choose the correct one for your study. I’ll also show why failing to control the experiment-wise error rate will leave you with severe doubts about your results.
Starting with the ANOVA Omnibus Test
Typically, when you want to determine whether three or more means are different, you’ll perform ANOVA. Statisticians refer to the ANOVA F-test as an omnibus test. Welch’s ANOVA is another type of omnibus test.
An omnibus test provides overall results for your data. Collectively, are the differences between the means statistically significant—Yes or No?
- Null: All group means are equal.
- Alternative: Not all group means are equal.
However, ANOVA test results don’t map out which groups are different from other groups. As you can see from the hypotheses above, if you can reject the null, you only know that not all of the means are equal. Sometimes you really need to know which groups are significantly different from other groups!
Example One-Way ANOVA to Use with Post Hoc Tests
We’ll start with this one-way ANOVA example, and then use it as the basis for illustrating three different post hoc tests throughout this blog post. Imagine we are testing four materials that we’re considering for making a product part. We want to determine whether the mean differences between the strengths of these four materials are statistically significant. We obtain the following one-way ANOVA results. To follow along with this example, download the CSV dataset: PostHocTests.
The p-value of 0.004 indicates that we can reject the null hypothesis and conclude that the four means are not all equal. The Means table at the bottom displays the group means. However, we don’t know which pairs of groups are significantly different.
To compare group means, we need to perform post hoc tests, also known as multiple comparisons. In Latin, post hoc means “after this.” You conduct post hoc analyses after a statistically significant omnibus test (F-test or Welch’s).
Before we get to these group comparisons, you need to learn about the experiment-wise error rate.
Related post: How to Interpret P-values Correctly
What is the Experiment-wise Error Rate?
Post hoc tests perform two vital tasks. Yes, they tell you which group means are significantly different from other group means. Crucially, they also control the experiment-wise, or family-wise, error rate. In this context, experiment-wise, family-wise, and family error rates are all synonyms that I’ll use interchangeably.
What is this experiment-wise error rate? For every hypothesis test you perform, there is a type I error rate, which your significance level (alpha) defines. In other words, there’s a chance that you’ll reject a null hypothesis that is actually true, which is a false positive. When you perform only one test, the type I error rate equals your significance level, which is often 5%. However, as you conduct more and more tests, your chance of getting at least one false positive increases. If you perform enough tests, you’re virtually guaranteed to get a false positive! The error rate for a family of tests is always higher than the error rate for an individual test.
Imagine you’re rolling a pair of dice and rolling two ones (known as snake eyes) represents a type I error. The probability of snake eyes for a single roll is ~2.8% rather than 5%, but you get the idea. If you roll the dice just once, your chances of rolling snake eyes aren’t too bad. However, the more times you roll the dice, the more likely you’ll get two ones. With 25 rolls, snake eyes become more likely than not (50.6%). With enough rolls, it becomes virtually inevitable.
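The dice analogy is easy to check numerically. The sketch below is a minimal illustration, assuming independent rolls of two fair dice; the function name `prob_at_least_one` is my own and not part of any library.

```python
def prob_at_least_one(p_single: float, n_trials: int) -> float:
    """Chance that an event with per-trial probability p_single
    happens at least once in n_trials independent trials."""
    return 1 - (1 - p_single) ** n_trials


# Snake eyes: both dice show a one, so the per-roll chance is 1/6 * 1/6.
SNAKE_EYES = 1 / 36

for rolls in (1, 5, 10, 24, 25, 100):
    chance = prob_at_least_one(SNAKE_EYES, rolls)
    print(f"{rolls:>3} rolls: {chance:.1%} chance of at least one snake eyes")
```

Swap in 0.05 for `SNAKE_EYES` and the same function describes a series of independent hypothesis tests at the usual significance level.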
Family Error Rates in ANOVA
In the ANOVA context, you want to compare the group means. The more groups you have, the more comparison tests you need to perform. For our example ANOVA with four groups (A B C D), we’ll need to make the following six comparisons.
- A – B
- A – C
- A – D
- B – C
- B – D
- C – D
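If you’d rather generate the list of pairwise comparisons than write it out by hand, a short sketch using Python’s standard library looks like this. The group labels come from our example; everything else is illustrative.

```python
from itertools import combinations

groups = ["A", "B", "C", "D"]

# combinations() yields every unordered pair exactly once.
pairs = list(combinations(groups, 2))

for first, second in pairs:
    print(f"{first} - {second}")

print(f"Total comparisons: {len(pairs)}")
```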
Our experiment includes this family of six comparisons. Each comparison represents a roll of the dice for obtaining a false positive. What’s the error rate for six comparisons? Unfortunately, as you’ll see next, the experiment-wise error rate snowballs based on the number of groups in your experiment.
The Experiment-wise Error Rate Quickly Becomes Problematic!
The table below shows how increasing the number of groups in your study causes the number of comparisons to rise, which in turn raises the family-wise error rate. Notice how quickly the quantity of comparisons increases by adding just a few groups! Correspondingly, the experiment-wise error rate rapidly becomes problematic.
The table starts with two groups, and the single comparison between them has an experiment-wise error rate that equals the significance level (0.05). Unfortunately, the family-wise error rate rapidly increases from there!
The formula for the maximum number of comparisons you can make for N groups is: (N*(N-1))/2. The total number of comparisons is the family of comparisons for your experiment when you want to compare all possible pairs of groups (i.e., all pairwise comparisons). Additionally, the formula for calculating the error rate for the entire set of comparisons is 1 – (1 – α)^C. Alpha is your significance level for a single comparison, and C equals the number of comparisons.
The experiment-wise error rate represents the probability of a type I error (false positive) over the total family of comparisons. Our ANOVA example has four groups, which produces six comparisons and a family-wise error rate of 0.26. If you increase the groups to five, the error rate jumps to 40%! When you have 15 groups, you are virtually guaranteed to have a false positive (99.5%)!
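You can rebuild these numbers yourself from the two formulas. The following is a minimal sketch, not output from any statistical package; the helper names are my own. Note that 1 – (1 – α)^C treats the comparisons as independent, which is a simplification because pairwise comparisons share groups.

```python
from math import comb


def n_pairwise(n_groups: int) -> int:
    """Number of all pairwise comparisons: N * (N - 1) / 2."""
    return comb(n_groups, 2)


def family_error_rate(alpha: float, n_comparisons: int) -> float:
    """Family-wise error rate 1 - (1 - alpha)^C, assuming independent tests."""
    return 1 - (1 - alpha) ** n_comparisons


ALPHA = 0.05  # significance level for each individual comparison

print(f"{'Groups':>6} {'Comparisons':>12} {'Family error rate':>18}")
for n_groups in range(2, 16):
    c = n_pairwise(n_groups)
    print(f"{n_groups:>6} {c:>12} {family_error_rate(ALPHA, c):>18.3f}")
```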
Post Hoc Tests Control the Experiment-wise Error Rate
The table succinctly illustrates the problem that post hoc tests resolve. Typically, when performing a statistical analysis, you expect a false positive rate of 5%, or whatever value you set for the significance level. As the table shows, when you increase the number of groups from 2 to 3, the error rate nearly triples from 0.05 to 0.143. And, it quickly worsens from there!
These error rates are too high! Upon seeing a significant difference between groups, you would have severe doubts about whether it was a false positive rather than a real difference.
If you use 2-sample t-tests to compare all group means in your study systematically, you’ll encounter this problem. You’d set the significance level for each test (e.g., 0.05), and then the number of comparisons will determine the experiment-wise error rate, as shown in the table.
Fortunately, post hoc tests use a different approach. For these tests, you set the experiment-wise error rate you want for the entire set of comparisons. Then, the post hoc test calculates the individual significance level that each comparison must use to produce the family-wise error rate you specify.
Understanding how post hoc tests work is much simpler when you see them in action. Let’s get back to our one-way ANOVA example!
Example of Using Tukey’s Method with One-Way ANOVA
For our ANOVA example, we have four groups, which require six comparisons to cover all combinations of groups. We’ll use a post hoc test and specify that the family of six comparisons should collectively produce a family-wise error rate of 0.05. The post hoc test I’ll use is Tukey’s method. There are a variety of post hoc tests you can choose from, but Tukey’s method is the most common when you want to compare all possible group pairings.
There are two ways to present post hoc test results—adjusted p-values and simultaneous confidence intervals. I’ll show both below.
The table below displays the six different comparisons in our study, the difference between group means, and the adjusted p-value for each comparison.
The adjusted p-value identifies the group comparisons that are significantly different while limiting the family error rate to your significance level. Simply compare each adjusted p-value to your significance level. When an adjusted p-value is less than the significance level, the difference between those group means is statistically significant. Because the adjustment accounts for the whole family, we can be confident that this entire set of comparisons collectively has an error rate of 0.05.
In the output above, only the D – B difference is statistically significant while using a family error rate of 0.05. The mean difference between these two groups is 9.5.
Simultaneous Confidence Intervals
The other way to present post hoc test results is by using simultaneous confidence intervals of the differences between means. In an individual test, the hypothesis test results using a significance level of α are consistent with confidence intervals using a confidence level of 1 – α. For example, hypothesis tests with a significance level of 0.05 correspond to 95% confidence intervals.
In post hoc tests, we use a simultaneous confidence level rather than an individual confidence level. The simultaneous confidence level applies to the entire family of comparisons. With a 95% simultaneous confidence level, we can be 95% confident that all intervals in our set of comparisons contain the actual population difference between groups. A 5% experiment-wise error rate corresponds to 95% simultaneous confidence intervals.
Tukey Simultaneous CIs for our One-Way ANOVA Example
Let’s get to the confidence intervals. While the table above displays these CIs numerically, I like the graph below because it allows for a simple visual assessment and it provides more information than the adjusted p-values.
Zero indicates that the group means are equal. When a confidence interval does not contain zero, the difference between that pair of groups is statistically significant. In the chart, only the D – B difference is significant. These CI results match the hypothesis test results in the previous table, and they also convey information that the adjusted p-values do not.
These confidence intervals provide ranges of values that likely contain the actual population difference between pairs of groups. As with all CIs, the width of the interval for the difference reveals the precision of the estimate. Narrower intervals suggest a more precise estimate. And, you can assess whether the full range of values is practically significant. Remember, statistical significance doesn’t necessarily indicate that the results are meaningful in the real world.
When the interval is too wide (imprecise) to be helpful and/or the range includes differences that are not practically significant, you have reason to hesitate before making decisions based on the results.
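If you work in Python rather than a point-and-click package, the sketch below shows one way to get both Tukey adjusted p-values and 95% simultaneous confidence intervals. It assumes SciPy 1.8 or later, which provides `scipy.stats.tukey_hsd`. The measurements are hypothetical values I made up for illustration. They are not the PostHocTests dataset, so the output will not match the results discussed in this post.

```python
from itertools import combinations

from scipy.stats import f_oneway, tukey_hsd

# HYPOTHETICAL strength measurements, invented for this sketch only.
groups = {
    "A": [38.1, 40.2, 36.5, 41.0, 39.4, 37.8],
    "B": [31.9, 34.5, 33.0, 30.8, 35.2, 32.6],
    "C": [35.4, 37.9, 34.1, 38.6, 36.0, 33.7],
    "D": [41.5, 39.8, 43.2, 40.6, 42.9, 38.7],
}
names = list(groups)
samples = [groups[name] for name in names]

# Omnibus test first: are the means different in general?
omnibus = f_oneway(*samples)
print(f"Omnibus F-test p-value: {omnibus.pvalue:.4f}")

# Tukey's method for all pairwise comparisons.
tukey = tukey_hsd(*samples)
intervals = tukey.confidence_interval(confidence_level=0.95)

rows = []
for i, j in combinations(range(len(names)), 2):
    # Index [j, i] is "later group minus earlier group", e.g. B - A.
    rows.append({
        "comparison": f"{names[j]} - {names[i]}",
        "difference": tukey.statistic[j, i],
        "adj_p": tukey.pvalue[j, i],
        "low": intervals.low[j, i],
        "high": intervals.high[j, i],
    })

for row in rows:
    flag = "significant" if row["adj_p"] < 0.05 else "not significant"
    print(f"{row['comparison']}: diff = {row['difference']:6.2f}, "
          f"adjusted p = {row['adj_p']:.4f}, "
          f"95% simultaneous CI = ({row['low']:.2f}, {row['high']:.2f}), {flag}")
```

Each interval that excludes zero should line up with an adjusted p-value below 0.05, which is the agreement between the two presentations described above.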
Post Hoc Tests and the Statistical Power Tradeoff
Post hoc tests are great for controlling the family-wise error rate. Many texts would stop at this point. However, a tradeoff occurs behind the scenes. You need to be aware of it because you might be able to manage it effectively. The tradeoff is the following:
Post hoc tests control the experiment-wise error rate by reducing the statistical power of the comparisons.
Here’s how that works and what it means for your study.
To obtain a lower family error rate, the procedures must lower the significance level for all individual comparisons. For example, to end up with a family error rate of 5% for a set of comparisons, the procedure uses an even lower individual significance level.
As the number of comparisons increases, the post hoc analysis is forced to lower the individual significance level even further. For our six comparisons, Tukey’s method uses an individual significance level of approximately 0.011 to produce the family-wise error rate of 0.05. If our ANOVA required more comparisons, it would be even lower.
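To get a feel for how the individual significance level shrinks, here is a minimal sketch using two simpler corrections, Šidák and Bonferroni. To be clear, this is not Tukey’s calculation. Tukey’s method relies on the studentized range distribution and also depends on the error degrees of freedom, which I don’t reproduce here. For all pairwise comparisons, these simpler corrections tend to be somewhat more conservative than Tukey’s method, but they show the same pattern. The function names are my own.

```python
def sidak_individual_alpha(family_alpha: float, n_comparisons: int) -> float:
    """Solve family_alpha = 1 - (1 - a)^C for the individual level a."""
    return 1 - (1 - family_alpha) ** (1 / n_comparisons)


def bonferroni_individual_alpha(family_alpha: float, n_comparisons: int) -> float:
    """Divide the family error rate evenly across the comparisons."""
    return family_alpha / n_comparisons


FAMILY_ALPHA = 0.05

print(f"{'Comparisons':>11} {'Sidak':>8} {'Bonferroni':>11}")
for c in (1, 3, 4, 6, 10, 15):
    sidak = sidak_individual_alpha(FAMILY_ALPHA, c)
    bonf = bonferroni_individual_alpha(FAMILY_ALPHA, c)
    print(f"{c:>11} {sidak:>8.4f} {bonf:>11.4f}")
```

Compare the rows for three and six comparisons: halving the number of comparisons roughly doubles the individual significance level each test is allowed to use.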
What’s the problem with using a lower individual significance level? Lower significance levels correspond to lower statistical power. If a difference between group means actually exists in the population, a study with lower power is less likely to detect it. You might miss important findings!
Avoiding this power reduction is why many studies use an individual significance level of 0.05 rather than 0.01. Yet, with just four groups, our example post hoc test is forced to use the lower significance level.
Key Takeaway: The more group comparisons you make, the lower the statistical power of those comparisons.
Related post: Understanding Statistical Power
Managing the Power Tradeoff in Post Hoc Tests by Reducing the Number of Comparisons
One way to mitigate this tradeoff is to reduce the number of comparisons. With fewer comparisons, the procedure can use a larger individual error rate to achieve the family error rate that you specify, which increases the statistical power.
Throughout this article, I’ve written about performing all pairwise comparisons—which compares all possible group pairings. While this is the most common approach, the number of comparisons quickly piles up! However, depending on the nature and purpose of your study, you might not need to compare all possible groups.
Your study might need to compare only a subset of all possible comparisons for a variety of reasons. I’ll cover two common reasons and show you which post hoc tests you can use. In the following examples, I’ll display only the confidence interval graphs and not the hypothesis test results. Notice how these other methods make fewer comparisons (3 and 4) for our example dataset than Tukey’s method (6).
While you’re designing your study, it’s crucial that you define in advance the multiple comparisons method that you will use. Don’t try various methods, and then choose the one that produces the most favorable results. That’s data dredging, and it can lead to spurious findings. I’m using multiple post hoc tests on a single dataset to show how they differ, but that’s not an appropriate practice for a real study. Define your methodology in advance, including one post hoc analysis, before analyzing the data, and stick to it!
Key Takeaway: When it’s possible, compare a subset of groups to increase your statistical power.
Example of Using Dunnett’s Method to Compare Treatment Groups to a Control Group
If your study has a control group and several treatment groups, you might need to compare the treatment groups only to the control group.
Use Dunnett’s method when the following are true:
- You know in advance which group (control) you want to compare to all the other groups (treatments).
- You don’t need to compare the treatment groups to each other.
Let’s use Dunnett’s method with our example one-way ANOVA, but we’ll tweak the scenario slightly. Suppose we currently use Material A. We performed this experiment to compare the alternative materials (B, C, and D) to it. Material A will be our control group while the other three are the treatments.
Using Dunnett’s method, we see that only the B – A difference is statistically significant because the interval does not include zero. Using Tukey’s method, this comparison was not significant. The additional power gained by making fewer comparisons came through for us. On the other hand, unlike Tukey’s method, Dunnett’s method does not find that the D – B difference is significant because it doesn’t compare the treatment groups to each other.
Example of Using Hsu’s MCB to Find the Strongest Material
If the purpose of your study is to identify the best group, you might not need to compare all possible groups. Hsu’s Multiple Comparisons to the Best (MCB) identifies the groups that are the best, not significantly different from the best, and significantly different from the best.
Use Hsu’s MCB when you:
- Don’t know in advance which group you want to compare to all the other groups.
- Don’t need to compare groups that are not the best to other groups that are not the best.
- Can define “the best” as either the group with the highest mean or the lowest mean.
Hsu’s MCB compares each group to the group with the best mean (highest or lowest). Using this procedure, you might end up with several groups that are not significantly different from the best group. Keep in mind that the group that is truly best in the entire population might not have the best sample mean due to sampling error. The groups that are not significantly different from the best group could potentially be as good as or even better than the group with the best sample mean.
Simultaneous Confidence Intervals for Hsu’s MCB
For our one-way ANOVA, we want to use the material that produces the strongest parts. Consequently, we’ll use Hsu’s MCB and define the highest mean as the best. We don’t care about all of the other possible comparisons.
Group D is the best group overall because it has the highest mean (41.07). The procedure compares D to all of the other groups. For Hsu’s MCB, a group is significantly better than another group when the confidence interval has zero as an endpoint. From the graph, we can see that Material D is significantly better than B and C. However, the interval for the A – D comparison contains zero, which indicates that A is not significantly different from the best.
Hsu’s MCB determines that the candidates for the best group are A and D. D has the highest sample mean, while A is not significantly different from D. On the other hand, the procedure effectively rules out B and C from being the best.
Recap of Using Multiple Comparison Methods
In this blog post, you’ve seen how the omnibus ANOVA test determines whether means are different in general, but it does not identify specific group differences that are statistically significant.
If you obtain significant ANOVA results, use a post hoc test to explore the mean differences between pairs of groups.
You’ve also learned how controlling the experiment-wise error rate is a crucial function of these post hoc tests. These family error rates grow at a surprising rate!
Finally, if you don’t need to perform all pairwise comparisons, it’s worthwhile to compare only a subset because you’ll retain more statistical power.