Post hoc tests are an integral part of ANOVA. When you use ANOVA to test the equality of at least three group means, statistically significant results indicate that not all of the group means are equal. However, ANOVA results do not identify which particular differences between pairs of means are significant. Use post hoc tests to explore differences between multiple group means while controlling the experiment-wise error rate.

In this post, I’ll explain what post hoc analyses are and the critical benefits they provide, and help you choose the correct one for your study. Additionally, I’ll show why failing to control the experiment-wise error rate will leave you with severe doubts about your results.

## Starting with the ANOVA Omnibus Test

Typically, when you want to determine whether three or more means are different, you’ll perform ANOVA. Statisticians refer to the ANOVA F-test as an omnibus test. Welch’s ANOVA is another type of omnibus test.

An omnibus test provides overall results for your data. Collectively, are the differences between the means statistically significant—Yes or No?

If the p-value from your ANOVA F-test or Welch’s test is less than your significance level, you can reject the null hypothesis.

- Null: All group means are equal.
- Alternative: Not all group means are equal.

However, ANOVA test results don’t map out which groups are different from other groups. As you can see from the hypotheses above, if you can reject the null, you only know that not all of the means are equal. Sometimes you really need to know which groups are significantly different from other groups!

Statisticians consider differences between group means to be an unstandardized effect size because these values indicate the strength of the relationship using values that retain the natural units of the dependent variable. Effect sizes help you understand how important the findings are in a real-world sense. To learn more about unstandardized and standardized effect sizes, read my post about Effect Sizes in Statistics.

To learn more about ANOVA tests, including the more complex forms, read my ANOVA Overview.

**Related posts**: How F-tests Work in ANOVA and Welch’s ANOVA

## Example One-Way ANOVA to Use with Post Hoc Tests

We’ll start with this one-way ANOVA example, and then use it to illustrate three post hoc tests throughout this blog post. Imagine we are testing four materials that we’re considering for making a product part. We want to determine whether the mean differences between the strengths of these four materials are statistically significant. We obtain the following one-way ANOVA results. To follow along with this example, download the CSV dataset: PostHocTests.

The p-value of 0.004 indicates that we can reject the null hypothesis and conclude that the four means are not all equal. The Means table at the bottom displays the group means. However, we don’t know which pairs of groups are significantly different.

To compare group means, we need to perform post hoc tests, also known as multiple comparisons. In Latin, post hoc means “after this.” You conduct post hoc analyses after a statistically significant omnibus test (F-test or Welch’s).

Before we get to these group comparisons, you need to learn about the experiment-wise error rate.

**Related posts**: How to Interpret P-values Correctly and How to do One-Way ANOVA in Excel

## What is the Experiment-wise Error Rate?

Post hoc tests perform two vital tasks. Yes, they tell you which group means are significantly different from other group means. Crucially, they also control the experiment-wise, or familywise, error rate. In this context, experiment-wise, family-wise, and family error rates are all synonyms that I’ll use interchangeably.

What is this experiment-wise error rate? For every hypothesis test you perform, there is a type I error rate, which your significance level (alpha) defines. In other words, there’s a chance that you’ll reject a null hypothesis that is actually true—it’s a false positive. When you perform only one test, the type I error rate equals your significance level, which is often 5%. However, as you conduct more and more tests, your chance of a false positive increases. If you perform enough tests, you’re virtually guaranteed to get a false positive! The error rate for a family of tests is always higher than an individual test.

Imagine you’re rolling a pair of dice, and rolling two ones (known as snake eyes) represents a Type I error. The probability of snake eyes on a single roll is ~2.8% rather than 5%, but you get the idea. If you roll the dice just once, your chances of rolling snake eyes aren’t too bad. However, the more times you roll the dice, the more likely you’ll get two ones. With 25 rolls, snake eyes become more likely than not (~50.6%). With enough rolls, it becomes virtually inevitable.
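Those dice probabilities follow directly from the 1/36 chance of snake eyes on a single roll; here is a quick sketch that verifies them:

```python
# Chance of at least one "snake eyes" (our stand-in for a Type I
# error) across repeated rolls: 1 - (35/36)**n_rolls.
P_SNAKE_EYES = 1 / 36  # ~2.8% on a single roll

def p_at_least_one(n_rolls: int) -> float:
    """Probability of rolling snake eyes at least once in n_rolls rolls."""
    return 1 - (1 - P_SNAKE_EYES) ** n_rolls

for n in (1, 10, 25, 100):
    print(f"{n:3d} rolls: {p_at_least_one(n):.1%}")
```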

**Related posts**: Types of Errors in Hypothesis Testing and Significance Levels and P-values

## Family Error Rates in ANOVA

In the ANOVA context, you want to compare the group means. The more groups you have, the more comparison tests you need to perform. For our example ANOVA with four groups (A B C D), we’ll need to make the following six comparisons.

- A – B
- A – C
- A – D
- B – C
- B – D
- C – D

Our experiment includes this family of six comparisons. Each comparison represents a roll of the dice for obtaining a false positive. What’s the error rate for six comparisons? Unfortunately, as you’ll see next, the experiment-wise error rate snowballs based on the number of groups in your experiment.

## The Experiment-wise Error Rate Quickly Becomes Problematic!

The table below shows how increasing the number of groups in your study causes the number of comparisons to rise, which in turn raises the family-wise error rate. Notice how quickly the quantity of comparisons increases by adding just a few groups! Correspondingly, the experiment-wise error rate rapidly becomes problematic.

The table starts with two groups, and the single comparison between them has an experiment-wise error rate that equals the significance level (0.05). Unfortunately, the family-wise error rate rapidly increases from there!

The formula for the maximum number of comparisons you can make for N groups is: (N*(N-1))/2. The total number of comparisons is the family of comparisons for your experiment when you compare all possible pairs of groups (i.e., all pairwise comparisons). Additionally, the formula for calculating the error rate for the entire set of comparisons is 1 – (1 – α)^C. Alpha is your significance level for a single comparison, and C equals the number of comparisons.

The experiment-wise error rate represents the probability of a type I error (false positive) over the total family of comparisons. Our ANOVA example has four groups, which produces six comparisons and a family-wise error rate of 0.26. If you increase the groups to five, the error rate jumps to 40%! When you have 15 groups, you are virtually guaranteed to have a false positive (99.5%)!
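Here is a short Python sketch that reproduces those numbers from the two formulas above, C = N(N-1)/2 comparisons and a family-wise rate of 1 – (1 – α)^C:

```python
# Reproduce the family-wise error rate table: for N groups there are
# C = N*(N-1)/2 pairwise comparisons, and the experiment-wise error
# rate is 1 - (1 - alpha)**C with alpha = 0.05 per comparison.

def n_comparisons(n_groups: int) -> int:
    return n_groups * (n_groups - 1) // 2

def familywise_error(n_groups: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha) ** n_comparisons(n_groups)

for n in (2, 3, 4, 5, 15):
    print(f"{n:2d} groups, {n_comparisons(n):3d} comparisons: "
          f"family-wise error rate = {familywise_error(n):.3f}")
```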

## Post Hoc Tests Control the Experiment-wise Error Rate

The table succinctly illustrates the problem that post hoc tests resolve. Typically, when performing statistical analysis, you expect a false positive rate of 5%, or whatever value you set for the significance level. As the table shows, when you increase the number of groups from 2 to 3, the error rate nearly triples from 0.05 to 0.143. And, it quickly worsens from there!

These error rates are too high! Upon seeing a significant difference between groups, you would have severe doubts about whether it was a false positive rather than a real difference.

If you use 2-sample t-tests to systematically compare all group means in your study, you’ll encounter this problem. You’d set the significance level for each test (e.g., 0.05), and then the number of comparisons will determine the experiment-wise error rate, as shown in the table.
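You can watch this inflation happen in a quick simulation (an illustrative sketch, not part of the original example): draw four groups from the *same* population, run all six uncorrected 2-sample t-tests, and count the experiments with at least one false positive. The rate lands far above 5% (somewhat below the 0.26 independence-based figure because the six tests share data):

```python
import numpy as np
from itertools import combinations
from scipy.stats import ttest_ind

# Four groups drawn from identical populations, so every
# "significant" pairwise t-test is a false positive.
rng = np.random.default_rng(42)
alpha, n_sims = 0.05, 2000
false_alarms = 0

for _ in range(n_sims):
    groups = [rng.normal(30, 3, 10) for _ in range(4)]
    # Is any of the six uncorrected t-tests "significant"?
    if any(ttest_ind(g1, g2).pvalue < alpha
           for g1, g2 in combinations(groups, 2)):
        false_alarms += 1

print(f"experiment-wise false positive rate: {false_alarms / n_sims:.2f}")
```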

Fortunately, post hoc tests use a different approach. For these tests, you set the experiment-wise error rate you want for the entire set of comparisons. Then, the post hoc test calculates the significance level for all individual comparisons that produces the familywise error rate you specify.

Understanding how post hoc tests work is much simpler when you see them in action. Let’s get back to our one-way ANOVA example!

## Example of Using Tukey’s Method with One-Way ANOVA

For our ANOVA example, we have four groups that require six comparisons to cover all combinations of groups. We’ll use a post hoc test and specify that the family of six comparisons should collectively produce a familywise error rate of 0.05. The post hoc test I’ll use is Tukey’s method. There are a variety of post hoc tests you can choose from, but Tukey’s method is the most common for comparing all possible group pairings.

There are two ways to present post hoc test results—adjusted p-values and simultaneous confidence intervals. I’ll show them both below.

### Adjusted P-values

The table below displays the six different comparisons in our study, the difference between group means, and the adjusted p-value for each comparison.

The adjusted p-value identifies the group comparisons that are significantly different while limiting the family error rate to your significance level. Simply compare the adjusted p-values to your significance level. When adjusted p-values are less than the significance level, the difference between those group means is statistically significant. Importantly, this process controls the family-wise error rate to your significance level. We can be confident that this entire set of comparisons collectively has an error rate of 0.05.

In the output above, only the D – B difference is statistically significant while using a family error rate of 0.05. The mean difference between these two groups is 9.5.
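To try this yourself, SciPy’s `scipy.stats.tukey_hsd` (SciPy 1.8 or later) produces Tukey-adjusted p-values. The data below are hypothetical stand-ins for the four materials, not the post’s actual dataset:

```python
import numpy as np
from scipy.stats import tukey_hsd

# Hypothetical strength samples for materials A-D (illustrative
# numbers only, roughly echoing the example's group means).
rng = np.random.default_rng(0)
a = rng.normal(34, 3, 10)
b = rng.normal(31, 3, 10)
c = rng.normal(36, 3, 10)
d = rng.normal(41, 3, 10)

res = tukey_hsd(a, b, c, d)
print(res)  # table of pairwise mean differences with adjusted p-values
```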

### Simultaneous Confidence Intervals

The other way to present post hoc test results is by using simultaneous confidence intervals of the differences between means. In an individual test, the hypothesis test results using a significance level of α are consistent with confidence intervals using a confidence level of 1 – α. For example, hypothesis tests with a significance level of 0.05 correspond to 95% confidence intervals.

In post hoc tests, we use a simultaneous confidence level rather than an individual confidence level. The simultaneous confidence level applies to the entire family of comparisons. With a 95% simultaneous confidence level, we can be 95% confident that *all* intervals in our set of comparisons contain the actual population differences between groups. A 5% experiment-wise error rate corresponds to 95% simultaneous confidence intervals.

### Tukey Simultaneous CIs for our One-Way ANOVA Example

Let’s get to the confidence intervals. While the table above displays these CIs numerically, I like the graph below because it allows for a simple visual assessment, and it provides more information than the adjusted p-values.

Zero indicates that the group means are equal. When a confidence interval does not contain zero, the difference between that pair of groups is statistically significant. In the chart, only the difference between D – B is significant. These CI results match the hypothesis test results in the previous table. I prefer these CI results because they also provide additional information that the adjusted p-values do not convey.
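In SciPy, `scipy.stats.tukey_hsd` exposes these simultaneous intervals via its `confidence_interval()` method, and the contains-zero check is mechanical (the data below are made-up stand-ins, not the post’s dataset):

```python
import numpy as np
from scipy.stats import tukey_hsd

# Hypothetical strength samples for materials A-D (illustrative
# stand-ins, not the post's actual dataset).
rng = np.random.default_rng(0)
a, b, c, d = (rng.normal(mu, 3, 10) for mu in (34, 31, 36, 41))

res = tukey_hsd(a, b, c, d)
ci = res.confidence_interval(confidence_level=0.95)

# A pair differs significantly when its simultaneous CI excludes zero.
labels = "ABCD"
for i in range(4):
    for j in range(i + 1, 4):
        sig = ci.low[i, j] > 0 or ci.high[i, j] < 0
        print(f"{labels[i]} - {labels[j]}: "
              f"[{ci.low[i, j]:6.2f}, {ci.high[i, j]:6.2f}]"
              f"{'  significant' if sig else ''}")
```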

These confidence intervals provide ranges of values that likely contain the actual population difference between pairs of groups. As with all CIs, the width of the interval for the difference reveals the precision of the estimate. Narrower intervals suggest a more precise estimate. And, you can assess whether the full range of values is practically significant. Remember, statistical significance doesn’t necessarily indicate that the results are meaningful in the real world.

When the interval is too wide (imprecise) to be helpful and/or the range includes differences that are not practically significant, you have reason to hesitate before making decisions based on the results.

## Post Hoc Tests and the Statistical Power Tradeoff

Post hoc tests are great for controlling the family-wise error rate. Many texts would stop at this point. However, a tradeoff occurs behind the scenes. You need to be aware of it because you might be able to manage it effectively. The tradeoff is the following:

Post hoc tests control the experiment-wise error rate by reducing the statistical power of the comparisons.

Here’s how that works and what it means for your study.

To obtain the family error rate you specify, post hoc procedures must lower the significance level for all individual comparisons. For example, to end up with a family error rate of 5% for a set of comparisons, the procedure uses an even lower individual significance level.

As the number of comparisons increases, the post hoc analysis must lower the individual significance level even further. For our six comparisons, Tukey’s method uses an individual significance level of approximately 0.011 to produce the family-wise error rate of 0.05. If our ANOVA required more comparisons, it would be even lower.
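To see where a number like that comes from, you can invert the family error-rate formula from earlier (a Šidák-style calculation; Tukey’s actual adjustment uses the studentized range distribution, which is why the software reports ~0.011 rather than ~0.0085):

```python
# Back out the per-comparison significance level that yields a 0.05
# family-wise error rate for C comparisons: invert 1 - (1 - a)**C = 0.05.
# (Tukey's method derives its level from the studentized range
# distribution, so its value differs slightly from this one.)
family_alpha = 0.05
C = 6  # all pairwise comparisons among four groups

individual_alpha = 1 - (1 - family_alpha) ** (1 / C)
print(f"individual alpha: {individual_alpha:.4f}")  # ~0.0085
```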

What’s the problem with using a lower individual significance level? Lower significance levels correspond to lower statistical power. If a difference between group means actually exists in the population, a study with lower power is less likely to detect it. You might miss important findings!

Avoiding this power reduction is why many studies use an individual significance level of 0.05 rather than 0.01. Unfortunately, with just four groups, our example post hoc test is forced to use the lower significance level.

**Key Takeaway**: The more group comparisons you make, the lower the statistical power of those comparisons.

**Related post**: Understanding Statistical Power

## Managing the Power Tradeoff in Post Hoc Tests by Reducing the Number of Comparisons

One method to mitigate this tradeoff is by reducing the number of comparisons. This reduction allows the procedure to use a larger individual error rate to achieve the family error rate that you specify—which increases the statistical power.

Throughout this article, I’ve written about performing all pairwise comparisons—which compares all possible group pairings. While this is the most common approach, the number of contrasts quickly piles up! However, depending on your study’s purpose, you might not need to compare all possible groups.

Your study might need to compare only a subset of all possible comparisons for a variety of reasons. I’ll cover two common reasons and show you which post hoc tests you can use. In the following examples, I’ll display only the confidence interval graphs and not the hypothesis test results. Notice how these other methods make fewer comparisons (3 and 4) for our example dataset than Tukey’s method (6).

While you’re designing your study, it’s crucial that you define in advance the multiple comparisons method that you will use. Don’t try various methods, and then choose the one that produces the most favorable results. That’s data dredging, and it can lead to spurious findings. I’m using multiple post hoc tests on a single dataset to show how they differ, but that’s not an appropriate practice for a real study. Define your methodology in advance, including one post hoc analysis, before analyzing the data, and stick to it!

**Key Takeaway**: When it’s possible, compare a subset of groups to increase your statistical power.

## Example of Using Dunnett’s Method to Compare Treatment Groups to a Control Group

If your study has a control group and several treatment groups, you might need to compare the treatment groups only to the control group.

Use Dunnett’s method when the following are true:

- Before the study, you know which group (control) you want to compare to all the other groups (treatments).
- You don’t need to compare the treatment groups to each other.

Let’s use Dunnett’s method with our example one-way ANOVA, but we’ll tweak the scenario slightly. Suppose we currently use Material A. We performed this experiment to compare the alternative materials (B, C, and D) to it. Material A will be our control group, while the other three are the treatments.

Using Dunnett’s method, we see that only the B – A difference is statistically significant because the interval does not include zero. Using Tukey’s method, this comparison was not significant. The additional power gained by making fewer comparisons came through for us. On the other hand, unlike Tukey’s method, Dunnett’s method does not find that the D – B difference is significant because it doesn’t compare the treatment groups to each other.

## Example of Using Hsu’s MCB to Find the Strongest Material

If your study’s goal is to identify the best group, you might not need to compare all possible groups. Hsu’s Multiple Comparisons to the Best (MCB) identifies the groups that are the best, insignificantly different from the best, and significantly different from the best.

Use Hsu’s MCB when you:

- Don’t know in advance which group you want to compare to all the other groups.
- Don’t need to compare groups that are not the best to other groups that are not the best.
- Can define “the best” as either the group with the highest mean or the lowest mean.

Hsu’s MCB compares each group to the group with the best mean (highest or lowest). Using this procedure, you might end up with several groups that are not significantly different than the best group. Keep in mind that the group that is truly best in the entire population might not have the best sample mean due to sampling error. The groups that are not significantly different from the best group might be as good as, or even better than, the group with the best sample mean.

### Simultaneous Confidence Intervals for Hsu’s MCB

For our one-way ANOVA, we want to use the material that produces the strongest parts. Consequently, we’ll use Hsu’s MCB and define the highest mean as the best. We don’t care about all of the other possible comparisons.

Group D is the best group overall because it has the highest mean (41.07). The procedure compares D to all of the other groups. For Hsu’s MCB, a group is significantly better than another group when the confidence interval has zero as an endpoint. From the graph, we can see that Material D is significantly better than B and C. However, the interval for the A – D comparison contains zero, which indicates that A is not significantly different from the best.

Hsu’s MCB determines that the candidates for the best group are A and D. D has the highest sample mean and A is not significantly different from D. On the other hand, the procedure effectively rules out B and C from being the best.

## Recap of Using Multiple Comparison Methods

In this blog post, you’ve seen how the omnibus ANOVA test determines whether means are different in general, but it does not identify specific group differences that are statistically significant.

If you obtain significant ANOVA results, use a post hoc test to explore the mean differences between pairs of groups.

You’ve also learned how controlling the experiment-wise error rate is a crucial function of these post hoc tests. These family error rates grow at a surprising rate!

The Bonferroni correction is another method for controlling the family-wise error rate. Learn more about it in What is the Bonferroni Correction and How to Use It.

Finally, if you don’t need to perform all pairwise comparisons, it’s worthwhile comparing only a subset because you’ll retain more statistical power.

If you’re learning about hypothesis testing and like the approach I use in my blog, check out my eBook!

Kalu says

Hi Jim! I found your tutorial very helpful, but I have this question to ask. I did ANOVA using four groups and the result was statistically significant. Then I conducted the Tukey HSD post hoc test, but surprisingly three out of the six pairs had negative p-values. Please, I am lost. I need your help. Thanks.

Jim Frost says

Hi Kalu,

Unfortunately, your statistical software is doing something wrong!

P-values are probabilities. And all probabilities must be between 0 and 1. In other words, negative p-values are impossible!

I don’t know what your software is doing, but it can’t be correct!

Shaimaa barr says

Thanks a lot, your answer was very helpful

Shaimaa barr says

Hi, this article was helpful but I have some questions.

I have four groups to compare with unequal variances but equal sample sizes for all groups. Is it better to use Welch’s ANOVA or the regular ANOVA test? And if I use regular ANOVA, since equal sample sizes make it robust to the violation of equal variances, may I use the Games-Howell post hoc test, or should I use the Tukey test?

In other words, is it necessary to use Welch’s ANOVA in order to use the Games-Howell post hoc test?

Jim Frost says

Hi Shaimaa,

You really should use Welch’s version of ANOVA anytime you have unequal variances. When you have both unequal variances and unequal sample sizes, the regular F-test ANOVA performs particularly poorly! However, even with unequal variances but equal group sizes, you should still use Welch’s ANOVA. There’s really no drawback to using it. In fact, some analysts use Welch’s ANOVA all the time so they don’t even have to worry about variances!

And you can use the Games-Howell post hoc test with Welch’s ANOVA because they both accept unequal variances (i.e., neither method assumes equal variances). There might be other post hoc tests that don’t assume equal variances, but I’m not familiar with them. You can also use Games-Howell even when the variances are equal.

But, if you have unequal variances, I recommend Welch’s ANOVA and Games-Howell.

Idgs says

Hi I have a question, let’s say I didn’t plan any post-hoc test in advance. But after seeing the results, I want to do post hoc tests for the group with the highest mean (similar to Hsu’s MCB). Do you think this is a form of p-hacking? If yes, is it better for me to compare all possible pairs instead to lower type I error possibility? Thanks.

Jim Frost says

Hi, yes, that sounds an awful lot like p-hacking because you’re letting your results guide how you proceed. Even if that wasn’t your intention, you do have to worry about how knowledge of the results affects your decision making going forward.

However, given that you’ve already gone down that road, what I’d do is try to envision what you would’ve chosen at the outset of the study, before you knew any results: if my ANOVA is significant, which post hoc test provides the best information for my study? Focus on that aspect. It can be hard when you already know the results, and that knowledge can influence you, even subconsciously. But come up with a reason for a particular post hoc test, and then perform it. There’s often a most obvious reason for performing a particular test.

If you can’t think of one, then yes, I’d perform all pairwise comparisons.

Kelly Papapavlou says

Dear Jim

How are the simultaneous confidence intervals for each pair difference in Tukey test calculated? I seem to be unable to get the same results as the software

Ram says

Thank you Jim! That was helpful (as usual) You are a stat-saver:)

May I suggest that you extend your impact in the world by providing one to one coaching to researchers especially research students who may be completely lost and would greatly appreciate your help, with their data.

If you have the bandwidth for it, I am pretty sure there will be a lot of takers and it may be something you want to consider! You have the rare skill of simplifying stats!

Once again thanks for everything!

Jim Frost says

You’re very welcome, Ram!

Unfortunately, given time constraints, I don’t typically have time for one-on-one coaching. I do like the idea though, but practically speaking, it would be difficult.

Giulio says

Hi jim

Thanks a lot, this was really insightful.

I’m approaching this ANOVA method, so I’m sorry if my questions seem a bit dumb, but there is just one thing I can’t quite understand: how does Tukey’s test calculate the new level of significance?

Because if we used the usual equation 1-(1-a)^c = 0.05, we would find that when performing 6 comparisons a = 0.0085, but instead you wrote that Tukey’s value is 0.011; so how does it work?

Furthermore, after finding such value, we shall compare our findings to it, and not anymore to the usual 0.05, right?

Jim Frost says

Hi Giulio,

That’s definitely NOT a dumb question. That’s the value that my statistical software reports. Well, it reports an individual confidence level of 98.89%, which equates to an error rate of 1 – 0.9889 = 0.0111. You’re correct that the familywise error rate for 6 comparisons at 0.05 should equate to an individual error rate of about 0.0085. I don’t know how the software is calculating the other value. I looked for details but couldn’t find them. Conversely, an individual error rate of 0.011 for six comparisons equates to a familywise error rate of 0.065. It’s possible there are some rounding errors and/or DF issues going on here. I’m not sure.

I wish I could provide an answer to you about the apparent discrepancy, but I couldn’t find the answer. But you are thinking about this issue correctly! 🙂

Typically, you’ll use the adjusted p-values. Because they’re adjusted you simply compare them to your target familywise error rate (e.g., 0.05). You can also look at the Tukey’s CIs of mean differences and see if they exclude zero. The widths of those CIs have also been adjusted. In other words, the Tukey’s procedure adjusts the p-values/CIs for you.

Ram says

Hi Jim,

You are absolutely incredible and you explain stats in such a simple manner that I actually understand! Thank you!! I have read many of your articles and it all makes sense. Never thought that was possible:)

I have one question with regards to the following:

“Remember, statistical significance doesn’t necessarily indicate that the results are meaningful in the real world.”

I understand this statement. But then how can we find out about the clinical significance (real-world impact) of our findings?

Are there any other tests which we could run or perhaps the only way out is by running additional experiments?

Thanks in advance. Your help is much appreciated.

Jim Frost says

Hi Ram,

Thanks so much! You made my day! I’m also glad that my website has been helpful.

Remember, statistical significance really just means that your data provides enough evidence to conclude that the effect size is not zero. It’s not indicating that it’s large enough to have practical benefits.

To assess practical significance, you need to look at the effect size (preferably a confidence interval) and then apply your subject-area knowledge to determine whether it is practically significant. I’ve written an article about this: Statistical vs. Practical Significance. I recommend checking that out!

Steve N. says

How about this situation? We perform ANOVA with multiple response variables, one omnibus test for each response variable. Ignoring the post-hoc tests for now (assume we don’t run them at all), should we reduce alpha for the omnibus tests in order to maintain a certain family-wise error rate, where a family here is a family of omnibus tests not individual group comparisons?

Jim Frost says

Hi Steve,

Reducing the alpha to maintain a family-wise error rate *is* a post hoc test. That’s basically what post hoc tests do. They use a method of lowering the alpha for each individual test for the very purpose of maintaining the error rate for a family of tests. Post hoc tests are (essentially) the various procedures for reducing the individual alpha levels by an appropriate amount. In this context, “appropriate” is defined by the comparison approach you want to use (e.g., all pairwise vs. treatments against control).

Steve N. says

The question here refers to a situation in which there are two groups that are compared on multiple metrics. Say, one metric is the response variable; another “metric” is to see whether some demographic variable (say, gender) is different between the two groups; then there’s a third which is something else. The first one may be a t-test, the second may be a chi-square test, and the third another t-test. The person who posted wants to know whether the individual test p-values should be compared to .05/n, where n is the number of metrics being measured. I believe the answer is yes. I believe the .05/n refers to a Bonferroni correction.

Jim Frost says

Hi Steve,

I’m not really sure what you’re asking. If you have two groups and just want to compare the means of those two groups, use a 2-sample t-test. For example, if you want to compare the mean of some outcome measure between males and females, the 2-sample t-test will determine whether the mean difference is statistically significant. Because there are only two groups, you don’t need to use a post-hoc test to compare them.

However, if you’re making many such comparisons, even if you’re using different tests, then you might want to use a post hoc test. The 0.05/n does refer to a Bonferroni correction, if the n refers to the number of comparisons. It’s not entirely clear what you want to do, but that’s my best take on it!

Himanshu Thakur says

Dear sir

Greetings

Sir, as you have mentioned, there are various types of post hoc tests, such as Tukey’s HSD, DMRT, and Fisher’s LSD. Which one of them should we perform, and what are the criteria for using these tests?

I have read some literature which says that if the number of treatments we want to compare is larger, then perform Tukey’s HSD, but they have not mentioned what the limit is.

Kindly clarify which test should be used for a particular set of treatments and what the criteria are for the choice of post hoc test.

AA says

Hi Jim,

I have one question, please. While I was running an ANOVA test for my data, which has five different groups (or materials, in the context of the given example), the Welch’s ANOVA test results suggest that there is a significant difference between my groups. However, when I run the Games-Howell test to determine which groups are different from each other, I get the same letter for all my groups. Can you please explain to me what this exactly means?

Thank you in advance.

Jim Frost says

Hi,

The overall test suggests that not all the group means are equal. However, sharing a letter indicates that the G-H test cannot determine which group means are unequal. This situation tends to occur when your overall test is barely significant, just below 0.05 or whatever your significance level is. There’s enough evidence to indicate (barely) that not all the groups are equal but not enough to indicate which groups. It’s really a sign that you need to conduct a study with a larger sample size. The amount of evidence in your data that supports the notion that an effect exists is fairly low.

Shey says

This article was so helpful for me. The way that it is designed is very intuitive and understandable. Thanks a lot!

yanne says

Great information! I was wondering why LSD makes for a better post hoc test as compared to independent t-tests. Does it have to do with degrees of freedom?

Zhirajr says

Hi Jim,

1 – If I’m not wrong about Dunnett’s test, the choice of the control group is free. I ask this because many reviewers would consider only control = placebo/saline groups, etc. In Dunnett CW. A multiple comparison procedure for comparing several treatments with a control. J Am Stat Assoc. 1955;50(272):1096-1121. doi:10.1080/01621459.1955.10501294, it is stated:

“When the experimenter only wishes to make comparisons between one of the means and each of the others,…”

2 – Before the analysis, a critical value for Dunnett’s multiple test has to be chosen. This value has to be set by the researcher according to the chosen alpha, usually 0.01 or 0.05, and the number of groups/comparisons. This is also stated in: Dunnett CW. A multiple comparison procedure for comparing several treatments with a control. J Am Stat Assoc. 1955;50(272):1096-1121, where no further adjustment of the p is made.

In my case, for 0.05 alpha, 28 patients and 4 groups, the critical value is 2.47 according to the tables.

This number has to be compared to the differences in means of each group.

So, if I’m not wrong, the p-value is already set/adjusted before the comparisons (according to the number of groups) in a precalculated table. Is it necessary to further adjust the p-value (the Bonferroni method for familywise error rate control, or Benjamini-Hochberg for false discovery rate control) when it is already set according to the number of comparisons?

z

Jim Frost says

Hi Z,

Yes, you can choose whichever group you want to compare to all the other groups. However, the logical choice is frequently a control group.

The Dunnett procedure adjusts the p-value for the number of comparisons. No further adjustment is necessary.

Yeseul says

Hi, thank you for the great info. I have two variables which will affect the results, and two groups. I ran a two-way ANOVA and got a significant difference between the two groups. However, I am in doubt since the post hoc test (Sidak) doesn’t show any significance in any pairs. In this case, can I claim that my two groups are showing a difference? And in this case, how should I present it in a graph? Thanks in advance!

Jim Frost says

Hi, I’m a bit confused as to what you’re saying–you “got significant difference between two groups” but yet your post hoc test “doesn’t show any significance between any pairs.” Those statements are contradictory.

Instead, do you mean that you got significant results for your two factors but not differences between the groups? If so, that’s a different condition than what you describe. That can happen in some cases. Your analysis contains enough evidence to conclude that the group means are not all equal, yet you don’t have enough evidence to determine which pairs specifically are significantly different.

This usually happens when your factors are borderline significant. In these cases, I recommend additional testing with a larger sample size.

Tazyn Fini says

Hi Jim

I was wondering if you could provide some clarity. I am looking into 3 groups with different sample sizes. I am using one-way ANOVA; some variables are homogeneous and others aren’t. For the variables that are homogeneous, I have used the ANOVA p-value to determine significance and then further use the Bonferroni post hoc significance. For those that aren’t homogeneous, I am using the Welch p-values and Games-Howell post hoc significances. Firstly, is this okay to do? Or should I only be using the Welch test?

Secondly, I have had some issues with some of my ANOVA p-values being significant but then my post hoc values are not. Is this normal, or have I done something wrong?

CK says

Hi Jim, great article! I do have a question regarding the following paragraph :

“Imagine you’re rolling a pair of dice and rolling two ones (known as snake eyes) represents a Type I error. The probability of snake eyes for a single roll is ~2.8% rather than 5%, but you get the idea. If you roll the dice just once, your chances of rolling snake eyes aren’t too bad. However, the more times you roll the dice, the more likely you’ll get two ones. With 25 rolls, snake eyes become more likely than not (50.8%)”

Could you please explain how you calculated the probability of snake eyes with 25 rolls of two dice to be 0.508?

Thanks much!

Jim Frost says

Hi CK,

You bet! It’s actually the same method as I use to calculate the values in the table for the familywise error rate by number of comparisons.

For this example, we’re using a probability for a single roll of 2.8%.

Here’s the general equation:

1 – (1 – prob)^N

Note that (1 – prob)^N = the probability of not rolling snake eyes. So, we need to subtract that from one to obtain the probability of rolling snake eyes.

For our example, prob = 0.028 and N = the number of rolls, 25.

So, 1 – (1 – 0.028)^25 = 1 – 0.49165 = 0.50835
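You can check this arithmetic directly. Here’s a minimal Python sketch (the function name is mine; 0.028 is the rounded single-roll probability used above, and the same formula generates the familywise error rate table by number of comparisons):

```python
def familywise_rate(prob_single, n_trials):
    """Probability of at least one 'hit' (snake eyes, or a Type I error)
    across n_trials independent trials: 1 - (1 - prob)^N."""
    return 1 - (1 - prob_single) ** n_trials

# The snake-eyes example: 2.8% per roll, 25 rolls
print(round(familywise_rate(0.028, 25), 5))   # 0.50835

# Same formula for the familywise error rate, e.g., 6 comparisons at alpha = 0.05
print(round(familywise_rate(0.05, 6), 3))     # 0.265
```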

Anoop says

Dear Jim,

Quick question. I do have your book 🙂

In a pre-post design for an RCT, what is the main effect of group? If I understand it right, it is comparing the average of Group 1 (pre and post) to the average of Group 2 (pre and post). So they are removing the time effect here. So the main effect doesn’t tell us anything useful in a pre-post design for ANOVA, right?

What we care about is the interaction of time * group right?

Thank you for all your work!

Jim Frost says

Hi Anoop!

Yes, that’s correct! Using main effects, you can determine whether there is a difference between the pre and post groups for all subjects. Or, you can see if there’s an overall difference between the treatment and control groups. However, what you really want to know is whether the difference between the pre and post groups for the treatment group is larger than the difference between the pre and post groups for the control group. And that’s why you need to assess the interaction effect rather than the main effect. Basically, you’re assessing whether the difference between the pre and post test depends on the group assignment. As you indicate, that’s the time*group interaction. The p-value for that term indicates whether that is significant. An interaction plot will illustrate the results to clarify what the interaction really means!

I hope that helps!

Rabia B says

Hi Jim.

I hope you are well.

I’m working on analysing the data of a study. I ran the Welch ANOVA because Levene’s test was significant for the factors under study. The problem is that for some of the factors, SPSS gives an error saying that the Welch robust test couldn’t be performed because there is at least one group with zero variance. In this case, is there another test I can do?

Also, the results for the Games-Howell test are very weird… I am getting pages upon pages of output with no clear results.

If you could kindly suggest a test instead of Welch ANOVA I could use, as well as a Post hoc test that would apply, I would be sincerely grateful.

Thank you very much!

Jim Frost says

Hi Rabia,

If a group has zero variance, then it either has only one observation or it has multiple observations that are all equal (within that group). In either case, you can’t analyze that group. I’d consider removing that group from your analysis.

Laura says

Jim, I need help, please! When running a post hoc test (Tukey’s), I only get the p-values and the diff, lower, and upper limits, but I want the whole information, including t-values. The table you posted there is absolutely perfect. Would you mind sharing the R code to get the full pairwise comparisons, including t-values, SEs, etc.?

Jim Frost says

Hi Laura,

Unfortunately, I didn’t use R to create the table. I used Minitab statistical software. However, to calculate the t-value, just take the difference between two groups and divide by the standard error of the difference. That said, the most important aspect in my opinion is the CIs of the differences. Not only do they tell you whether the difference is significant (when they exclude zero) but also the precision of the estimate. Sometimes a difference can be significant, but a wide CI makes it less useful information!
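The arithmetic is just the estimate divided by its standard error. A sketch in Python with made-up numbers (not values from the article’s table):

```python
# Hypothetical row from a pairwise-comparison table
diff = 9.5      # difference between two group means (made-up value)
se_diff = 2.6   # standard error of that difference (made-up value)

# t-value = estimate / its standard error
t_value = diff / se_diff
print(round(t_value, 3))   # 3.654
```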

RABIA NOUSHEEN says

Thank you Jim for replying in detail, your comments are very helpful.

RABIA NOUSHEEN says

Hi Jim

I found this article really very interesting and helpful. I have a question that I want to ask about post hoc tests for GLMs. I analysed binary data and count data using binomial and negative binomial GLMs, respectively. I am doing Tukey pairwise comparisons with a Bonferroni correction for significant parameters; is that correct? Further, how important is it to compare treatments with the control group? I have a control, but if I compare the treatments only, does that make sense? Regarding the control, I am really confused about how to compare it with the treatments. The reason for my confusion is the way I have arranged my data (see below).

Predictors: Time (h), Concentration (particles), Particle type. Response: Mortality (%).

| Time (h) | Concentration (particles) | Particle type | Mortality (%) |
|----------|---------------------------|---------------|---------------|
| 1        | 0                         | A             | 9             |
| 2        | 0                         | A             | 6             |
| 1        | 500                       | A             | 21            |
| 2        | 500                       | A             | 27            |
| 1        | 0                         | B             | 9             |
| 2        | 0                         | B             | 6             |
| 1        | 500                       | B             | 21            |
| 2        | 500                       | B             | 27            |

Concentration zero is actually the control, but it is entered as numeric. I have different particle types, and each one has a different control. How do I make the control vs. treatments comparison in this case? What about comparing only the treatments? If my interaction term, particle type*time, is significant, then the post hoc test gives the comparisons of possible combinations separately for each concentration: for concentration 0, all possible combinations of time and type, then for 500, and so on. I am wondering what the right approach is. Please guide me.

Jim Frost says

Hi Rabia,

Yes, it’s important to use a method such as Bonferroni that controls for multiple comparisons, otherwise the family error rate quickly grows out of hand! You can use Bonferroni or other methods depending on your needs.

As for whether it’s important to compare to the control, I have both a general and specific answer for you.

Generally speaking, the correct approach is to determine, before analyzing the data, which comparisons are important to your study, and to document that plan in advance. That’s driven by the needs of your study, the questions you need answers to, theory, what other studies have done, etc. In other words, identifying the correct comparisons to make depends on a variety of factors. Unfortunately, I don’t know the full context of your study and can’t provide an informed answer. But focus on the questions that you most need to answer. What information/findings do you need to obtain? What comparisons provide those answers? Which comparisons don’t add anything important?

A little more specifically, because this appears to be a designed experiment with a control group, I have a hard time picturing a scenario where you wouldn’t want to compare the treatments to the control. After all, you or the designers must have included the control group as a point of reference. It’s a group that you intentionally add to experiments so you can compare the treatments to it. You don’t really know how well the treatments are working compared to no treatment unless you have a control group. So, I’d imagine you’d really want to compare the treatments to the control group to be sure that there is a statistically significant improvement.

If you want to determine which treatment is best, you’d also want to compare the treatments with each other. There are specialized methods for this, such as Hsu’s MCB.

I don’t fully understand the experimental design with multiple control groups. That might well influence your thinking on these matters, but that’s really getting deep into details with which I’m not familiar! You might well need to consult a statistician at your institution who can look into your study and give it the time it deserves!

Christopher says

Yes please. Do you report the group median in the posthoc (KW)?

Christopher says

How can I possibly report that please?

Jim Frost says

Christopher, I’ve answered that question. If there is something else you need to know about it, please ask more specifically. Thanks!

Christopher says

Thank you very much for the quick response. Since all the groups are significantly different, I was wondering how I can report that if I used DSCF pairwise comparisons (KW).

Jim Frost says

Hi Christopher, I believe you’d just state that all pairs of groups have differences between their medians that are statistically significant.

Jordyn says

Hi Jim,

I am having trouble deciding which is the correct test to run for the one-way ANOVA (LSD, Tukey, Bonferroni), and I need to do it correctly. The question is to run one on social media conditions and body satisfaction scores. There are 5 different options, so it’s asking for a specific group, and the sample sizes are similar (27, 23, 28, 22). I’m not sure if this makes sense. Thank you!

Michael says

Hi Jim,

I wanted to follow up with you and thank you for the great advice. I used Dunnett’s post hoc test in my analysis, as you suggested. Thank you for the wonderful service that you provide.

Regards,

Michael

Jim Frost says

You’re very welcome, Michael. I am so glad to hear that I could be of help! 🙂

Christopher says

Thanks for your explanation. I was wondering, what do you do, and how do you interpret it, if all the results from your post hoc test are significant? I found that in my nonparametric (KW) analysis.

Jim Frost says

Hi Christopher,

If you’re finding that all of the differences between your groups are significant, that indicates that each group has a distinct population mean. Your sample provides sufficient evidence to conclude that none of the group means are equal at the population level.

Himanshu says

Hi sir,

How do we determine which post hoc test among the variety of post hoc tests (such as Duncan’s DMRT, Tukey’s HSD, LSD, etc.) we need to apply to our data?

Jim Frost says

Hi Himanshu,

You’ll need to define what your goals are for the post hoc test. Do you want to compare all possible groups or just some of the groups? I discuss various possibilities in this article.

sam says

Hi, I commented on your regression page, but upon looking at more statistical tests, would it be advisable to do a one-way ANOVA for my experiment? So it would be type of social media as the IV; 3 groups would be formed: the 1st being those who use the new social media site, the 2nd those who do not use social media, and the 3rd being a control of Facebook and Instagram users (to control for their influence on results). Then, based upon my results, I would do a post hoc test with a Bonferroni correction?

Thank you, Sam

Jim Frost says

It’s possible to compare group means like that. You need to decide what question you want to answer. If it’s about group means being different, then ANOVA might be the way to go. If you’re looking at relationships between continuous/ordinal variables, you can go with regression. So, you need to really figure out what you want to answer and what data you’ll have to determine which analysis is best.

I don’t usually recommend the Bonferroni correction because it is relatively conservative. I generally recommend one of the other methods.

Michael says

Hi Jim,

I have a relatively simple 3-sample experiment, but I’m not sure what post hoc test is appropriate. The study design is as follows: A = Control (no disease & no treatment), B = Disease state, C = Disease state with treatment. I’m only interested in two comparisons: B vs. A, to find the changes between disease and control, and C vs. B, to see what changes occur upon treatment. I am NOT interested in C vs. A. It looks like one-way ANOVA is the way to go here, but Tukey doesn’t seem like the correct post hoc test because I don’t want all pairwise comparisons. I’m uncertain whether Dunnett’s is correct because while B is in both comparisons (B vs. A & C vs. B), it is not the “control” as you described in your article. Is there another post hoc test that would be appropriate for my scenario? Would it be best just to perform two individual t-tests (B vs. A & C vs. B)? If this is the way to go, how do I take Type I error into account when determining significance?

Thank you in advance,

Michael

Jim Frost says

Hi Michael,

There’s a fairly simple workaround for this situation. I’d use a Bonferroni correction, which involves dividing your significance level by the number of comparisons (e.g., 0.05 / 2 = 0.025).

Then, perform two 2-sample t-tests for B vs A and C vs B. For those tests, use the Bonferroni corrected significance level. In other words, compare the p-values for your t-tests to 0.025 to determine statistical significance. Piece of cake!

Typically, I don’t recommend using the Bonferroni correction because it’s conservative (meaning it errs on the side of failing to reject the null more frequently than needed). However, in this case, it allows you to reduce the number of comparisons. Tukey’s would have 3 comparisons.
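The workaround can be sketched in Python. This is a minimal sketch with made-up data; the `welch_t` helper is my own and only computes the t statistic — in practice you’d take the p-values from your statistical software and compare them to the corrected alpha:

```python
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / (va / len(a) + vb / len(b)) ** 0.5

alpha = 0.05
n_comparisons = 2                          # B vs A and C vs B only
alpha_corrected = alpha / n_comparisons    # 0.025

# Hypothetical measurements: A = control, B = disease, C = disease + treatment
group_a = [10.1, 9.8, 10.4, 10.0]
group_b = [12.3, 12.9, 11.8, 12.5]
group_c = [10.9, 11.2, 10.6, 11.0]

# Compare each t statistic's p-value (from software or tables) to alpha_corrected
print(round(welch_t(group_b, group_a), 2))   # B vs A
print(round(welch_t(group_c, group_b), 2))   # C vs B
```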

Actually, another thought occurs after I wrote all the above. You could use Dunnett’s if you define Group B as your control group. I know it’s not really the control but that’s ok. It’s just telling the algorithm which groups you want to compare. It would then compare B to both A and C, just as you want! I’d recommend this approach over the Bonferroni method because it should preserve more of your statistical power. But either method would work.

I hope that helps!

emily says

Hi! Can I ask why there is a negative value in a confidence interval? What is the cause of this? Can it be due to a small sample size?

Jim Frost says

Hi Emily,

For post hoc tests, you’re comparing two groups, and typically the difference is just Group 1 Mean – Group 2 Mean. If group 2’s mean is higher than group 1’s, it’ll give you a negative point estimate for the difference due to simple arithmetic.

As for the CI, even if Group 1 has a greater mean than Group 2, which will give you a positive point estimate for the difference, the CI can still include a negative value. That’s due to the fact that the difference between means is close to zero relative to the amount of variability in the data. When the CI includes zero (it has both a positive and a negative endpoint), the difference between means is NOT statistically significant because no difference (a difference of zero) is included in the likely values.

This can happen for several reasons. Yes, it’s possible that a small sample size causes it by producing wide CIs. Noisy data (large variability) also produce wide CIs. Wide CIs are more likely to cross zero. However, it’s also possible that there is no difference between group means at the population level and that’s reflected in your sample and the CIs.

I hope that helps!

Santosa Sandy says

Thank you Jim. You explain it intuitively

Hrysanthi says

Hi Jim,

Is there any way I could not have a significant difference in the beginning, but after the post hoc analysis I do have one?

Jim Frost says

Hi Hrysanthi,

Yes, what you describe is possible. The F-tests and post hoc tests use different methods to determine significance. Consequently, they can occasionally come to different conclusions. These differences usually occur in borderline cases. Your F-test result was probably just not quite significant while your post hoc test was just significant. In these cases, it’s OK to report the significant post hoc results.

Me says

The thing is that I am reporting the results of a pre-test, mid-test, and post-test, so I do not know if switching the groups would be a good idea. What should I do? Is getting negative numbers incorrect, though?

Thank you in advance.

Jim Frost says

Hi, in that case, you’re right, it doesn’t make sense to switch the order. Just be sure to understand what the negative value means so you can explain it. Look at the order of subtraction. For example, if it is X1 – X2 and the result is negative, you know that X2 must be larger than X1. Maybe X2 is your post-test, in which case it indicates that the mean is rising over time. However, if X2 is the pre-test score, you know the mean is decreasing.

Just be sure to understand the order in which things are being subtracted and know that a negative difference means the 2nd value is larger than the first (assuming you’re dealing with all positive values). Incorporate that understanding into your interpretation.

Me says

Hi Jim,

I was wondering if it is normal to get negative numbers using a Bonferroni test (post hoc test). I got negative numbers in the mean difference sections (first two rows) as well as in the “t” section (first two rows). I am actually very confused, as I expected positive numbers.

Thank you in advance.

Jim Frost says

Hi, that’s probably because of the order of the groups in the subtraction. If you’re comparing groups X1 and X2 (X1-X2) and X2 has a larger mean, you’ll obtain a negative value. I’m guessing that’s the case. If you want to obtain a positive value, just switch the order of the groups in your analysis.

Dr. Kelly Pivik says

Hi Jim! I have an extremely basic question. You are showing output here, but it’s not clear how you got that output. I am talking about the One-Way ANOVA: Strength vs. Material and the residual plots for Tukey’s. This is the information I need. Just doing the one-way ANOVA does not give the same output.

Thank you!

Jim Frost says

Hi, I’m not sure what you’re asking for? I performed the one-way ANOVA and then did the follow-up post hoc tests, all as shown in the article. I did not show any residual plots.

Susmita K says

Hi Elise,

Did you figure out how to do this as I am stuck at the same problem 🙁

Could you please advise me? Thanks!

Stephanie says

Thank you for your clear explanations in all of your posts. Can you please explain why post-hoc comparison tests cannot be performed when factors are designated as random (at least in Minitab)?

Linda B says

Hi there, great articles, thanks so much. I am just wondering… with a three-way ANOVA, would you recommend using t-tests with the Bonferroni correction as post hoc tests? OR would you recommend running a two-way ANOVA with the error term syntax from the three-way ANOVA and then applying the Bonferroni correction to the results? I’m finding it really hard to make a call on this!

Moe says

Hi

I have four groups with 7 observations each. ANOVA shows significance with p = 0.032, but Tukey’s does not show any significant difference between any of the groups. What should I do in this situation? Should I use LSD? How do I report the results, where my aim is to compare all the group means with each other? Please advise. Thanks!

Elise MBARGA says

Hi Jim

I am trying to see if there is a relationship between drug concentration and cell proliferation. I have used one-way ANOVA, but I am not sure what post hoc test to use. Could you advise me, please?

Dr. Ritesh Patel says

Very good information given for the post hoc tests. Thank you sir!

Jim Frost says

Thank you!

Rex Cao says

I see, many thanks for your clarification!

Virginia says

When is a post hoc test inappropriate to use with a one-way ANOVA?

Jim Frost says

Hi Virginia,

I’d say the main case for when it’s inappropriate to use a post hoc test with ANOVA is when your data don’t satisfy the assumptions for ANOVA itself. For example, your groups should have roughly equal variances. And, if your sample size is small, the data should be normally distributed.

Some people will say that if the one-way ANOVA is not significant, then you shouldn’t perform a post hoc test. However, others say it is OK in that case. I fall in the group that thinks it’s OK even when the one-way ANOVA is not significant. But be aware there is debate over that point!

Rex Cao says

Hi Jim,

Many thanks for this article, very helpful.

Now I understand why sometimes the p-value of ANOVA shows a significant difference, but a post hoc analysis such as the Student-Newman-Keuls test does not show differentiation between treatments.

I am an agricultural field researcher. I do lots of pesticide field trials, sometimes with up to 15 treatments in my protocol; this increases the error rate significantly.

So, based on what you said, I should not include the untreated control (negative control) when I perform a test such as Tukey’s HSD, because this will create highly skewed data? Is Dunnett’s the only test I should use when I include an untreated control?

Jim Frost says

Hi Rex,

If you have a control group and multiple treatment groups, I’d highly recommend a method like Dunnett’s over Tukey’s as long as you don’t need to compare the treatment groups to each other. You’ll gain more statistical power.

By the way, you’re not “skewing” the data if you have additional comparisons. The additional comparisons reduce the statistical power of the test, which is a different concept. It just means the test is less likely to detect a significant difference in a sample when the difference exists in the population.

Greg Stauffer says

Hi Jim;

Do you have directions for calculating confidence intervals of differences between means to visualize Tukey HSD data? I’d like to produce the Tukey’s simultaneous 95% C.I. graph you have listed above. Also, which one of your books would have this information? Thanx so much!

Souad Karam says

How can I cite this article? What was the publication date?

Jim Frost says

Hi Souad,

Purdue University’s Online Writing Lab (OWL) shows you how to cite webpages. Click the link and scroll down to the section titled, “A Page on a Website.” You don’t use the publication date but rather the date you accessed the URL.

I hope that helps!

BANUVATHY says

Hello Jim

Thank you for the explanation. I performed a paired t-test between two groups (for example, A and B). It showed statistical significance. But when I performed one-way ANOVA with the same groups, including two more groups (A, B, C, D), the post hoc Tukey comparison between A and B shows no statistical significance. Why does that happen? Does it mean A and B are not significantly different? Which test should I rely on: the paired t-test or the post hoc Tukey test? Kindly give me your feedback on this. Thank you.

Jim Frost says

Hi Banuvathy,

To understand why, reread the section in this article titled, “Post Hoc Tests and the Statistical Power Tradeoff.” When you compare more groups, the test loses statistical power. In other words, it becomes less able to detect differences. And, that’s what you’re seeing. When you compare just the two groups, there’s no reduction in power. However, with four groups power is reduced.

Because you have four groups, you need to go by the post hoc test results. However, if you don’t need to compare all possible groups, which Tukey’s method does, then you can consider other post hoc methods. I discuss alternatives in this article as well.

Ahmed Sadaka says

Hi Jim,

Many thanks for this brilliant, easy-to-follow article. I was, however, wondering if the experiment-wise error rate inflation, and thus the need for p-value adjustment, would apply to independent t-tests, say 2 groups (treatment & control) compared on n different variables (e.g., demographics, treatment effects, etc.). If so, would the adjusted p = 0.05/n?

Jim Frost says

Hi Ahmed,

I’m not 100% sure what you’re asking exactly.

If you have just the two groups, and you’re controlling for the other variables you list, say in a regression model, then you don’t need to adjust the p-value because you’re still just comparing two groups.

However, if you’re comparing more than two groups that are based on those other variables, you need the adjustment.

I’m just not quite clear on what your scenario is, whether it’s just two groups but controlling for other variables, or more than two groups based on other variables.

I hope that helps!

Ami Choi says

Hi Jim,

Thanks for the helpful post! Quick question — when you adjust the p value, do you assess the univariate F value with the conventional 0.05 level and then apply the adjusted p value only to the post hoc tests for specific group comparison, OR do you apply the adjusted p value to both?

Jack says

Hi Jim, thanks for the blog post. It was very insightful. I am conducting a study where I’m looking into insider trading abnormal returns, year by year from 2010-2019.

I would like to know whether any year within the time period had particularly abnormal insider trading returns. This will give me 9 different groups, which after reading this seems like a lot. Am I correct in thinking that the best test to use for this analysis will be Tukey’s method?

h0 : Abnormal trading(AB) 2010 = AB2011 = AB2012 = etc to 2019.

h1: not all equal

Also, would you advise that I minimise the number of groups to 3-6 (instead of 9) to increase statistical power?

Thanks

Jack

Jim Frost says

Hi Jack,

That depends on your data and your goals. If you have a reference year and you want to determine whether the other years are different from that year, that would allow you to reduce the number of comparisons. You could use Dunnett’s method, which I detail in this article. But I’m not sure whether you have a reference year. If you want to compare all years against each other, then, yes, use Tukey’s.

Robert Blasko says

Thanks for the reply, Jim! In some cases, the block might be significant. I tried to use Dunnett’s in this model, but it requires designating one control for all the comparisons, which doesn’t fit my situation since I have a control at every site. Now I use a model with site and treatment as fixed and block as random, with block nested within site. I also included the site and treatment interaction. The only problem is that I often get a significant treatment effect, but the interaction term is not significant, so I don’t know the right method to find out at which site the effect occurred. I then tried to run the mixed model at every site separately, but that is quite laborious since I have many variables, and I’m not sure it’s quite the correct way of doing it. Thanks, Rob

Robert Blasko says

Hi Jim, nicely explained! However, I still have trouble assessing what would be the best method for my situation, as I have a slightly more complicated design. I have 5 sites; at each site I have 3 blocks, and each block includes a control plot and a treatment plot. This treatment is basically the same at all sites. So I used a GLM with site and treatment as fixed and block as random factors in my model, and I included the site and treatment interaction too. Now, when I do the post hoc pairwise comparisons for sites, and site*treatment to see at which site the treatment had an effect, I often get results contrary to the ANOVA results, because the number of pairwise comparisons is large. I used Tukey, but I can choose Bonferroni, Fisher’s LSD, or Sidak in my software. How could I increase the power of my post hoc comparisons and still find out where the differences are? Thanks!

Jim Frost says

Hi Robert, is your blocking variable significant? If not, you can consider removing that and thereby having fewer groups to compare. I would use something like Dunnett’s method, which doesn’t compare all possible pairings. Instead, it just compares the treatments to the controls.

sara says

After a significant interaction, follow-up tests were done to explore the exact nature of the interaction, and I found an effect of one independent variable within one level of a second independent variable. So only one simple main effect was significant and the other 2 effects were not. Do I still conduct post hocs, since one simple main effect was significant?

Jim Frost says

Hi Sara, if you want to determine which pairs of groups specifically have significant differences, you’ll need to perform a post hoc test.

erika zafeiraki says

Hey Jim, thank you for the info. However, I have a question regarding post hoc tests.

I have 13 groups (locations) with different sizes (from 2 concentrations in one group to 20 in another). More specifically, I have samples from different places and the concentrations of metals in them. The number of samples is not the same for each location.

Element concentrations were both normally distributed and homogeneous, so I further applied one-way ANOVA, and a statistically significant difference was observed. So, I need to apply a post hoc test, but I don’t know which one. I applied Scheffe, Bonferroni, and LSD… but I am not sure which one is the best. So, it would be really helpful if you could tell me which one to apply.

Thank you in advance

Erik

Jim Frost says

Hi Erika,

That’s an impossible question to answer in general. It depends on the specifics of what you want to learn. Do you want to compare all possible groups to each other, just compare treatment groups to a control group, or just find out which group is best and which groups are not significantly different from the best? It’s your subject-area knowledge combined with what you need to learn that determines which method is best.

You can rule out Fisher’s LSD because that only maintains the Type I error rate at your significance level when you have three groups. Bonferroni compares all possible groups but is known to be conservative. If you want to compare all possible groups, I’d consider Tukey’s method. Although, be aware that with 13 groups that’s 78 comparisons if you compare all groups. That would really lower your statistical power as I describe in this article.

I know less about Scheffé's test than the others. But I gather it's good when you want to make many comparisons and the groups don't have equal sizes. So, Scheffé's test might be a good choice for you if you want to compare all possible pairs.

My recommendation is to determine what you really need to compare to learn what you need to learn. With so many groups, you hopefully don’t need to compare all possible combinations of groups.
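For readers who want to check the comparison count mentioned above: the number of pairwise comparisons grows quadratically with the number of groups. A quick sketch in Python (an editorial illustration, not part of the original reply):

```python
from math import comb

# Number of pairwise comparisons among k groups: C(k, 2) = k * (k - 1) / 2
for k in (3, 6, 13):
    print(f"{k} groups -> {comb(k, 2)} pairwise comparisons")
```

With 13 groups this gives 78 comparisons, which is why comparing every pair costs so much statistical power.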

Agegnehu says

Thanks, Frost!

The way you explain the details in plain English, with practical life experiences, really helps, particularly those of us reasonably far from the discipline of statistics, as is true for me.

Could you take the time to respond to this question, please?

How can the statistical power of post hoc tests be calculated, so we know how much we are underpowered? And what role does sample size play in increasing the statistical power of post hoc tests?

Thanks ! Age

Jim Frost says

Hi Agegnehu,

I've never seen a statistical package that calculates the power of post hoc tests. However, based on the properties of the tests, we know that you lose power with more comparisons for the reasons I describe. One way to get an idea of how much power you're losing is by looking at the individual confidence level for a set of comparisons. For example, in the section about Tukey's method, the output indicates that for the six comparisons, the procedure uses an individual confidence level of 98.89%. Using that individual confidence level for each of the six comparisons collectively yields the 95% joint confidence level.

If you convert that confidence level to the equivalent significance level, you'd see that it's as if you were using an alpha of 1 − 0.9889 = 0.0111 for each comparison. Any time you use a lower significance level, it decreases the power of the test.

I don't know whether you could plug that information, along with other details, into statistical software for, say, a 2-sample t-test to get a valid power estimate for a single comparison. I've never looked into it. But looking at that individual confidence level gives you an idea of how the procedure must effectively lower the significance level more and more for additional comparisons.
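To put rough numbers on that shrinking per-comparison alpha, here is a Šidák-style calculation (an illustration only: Tukey's method actually uses the studentized range distribution, so its exact values differ, e.g., 0.0111 for the six comparisons in the Minitab output):

```python
# Sidak-style per-comparison alpha that would keep the joint error rate at 5%.
# Illustration only -- Tukey's method uses the studentized range distribution
# and yields slightly different values than this simple formula.
joint_alpha = 0.05
for m in (1, 6, 78):
    individual_alpha = 1 - (1 - joint_alpha) ** (1 / m)
    print(f"{m} comparisons -> per-comparison alpha ~ {individual_alpha:.4f}")
```

The pattern is the point: going from 1 to 78 comparisons forces each individual test to use a far smaller alpha, which is exactly where the power loss comes from.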

Seble says

Thanks so much for a detailed response, Jim. That definitely helped.

Seble says

Hi Jim,

In what circumstances would it be acceptable to report a post-hoc multiple comparison while the main effect of ANOVA is not significant?

Jim Frost says

Hi Seble,

It is possible to obtain the situation you describe because the F-test for the main effect and the post hoc tests use different procedures and assess statistical significance differently.

This might be a bit of a controversial area; I'm not sure. In my mind, it is often valid to report post hoc multiple comparisons that are significant even when the main effect is not. However, be sure to report the full circumstances surrounding what is significant and what is not, along with the post hoc details. One caution: be sure you aren't just picking specific groups to compare because the overall main effect is not significant.

If you have a number of groups that are not very different but, say, a couple of groups that appear to have a large difference, it's not valid to intentionally choose a post hoc method that compares just those groups with larger differences. That's cherry-picking your analysis to get the desired outcome, which produces misleading results. Choose your post hoc multiple comparison methodology at the beginning of your study and stick with it. Don't cherry-pick the methodology later.

Another caution: in my experience, when this happens, it's because the overall evidence is weak. You're probably just barely getting significant results, which represents fairly weak evidence. Pay particular attention to the CIs of the differences between group means to get an idea of the precision. CIs are often wide when the evidence is weak.

I hope this helps!

Kumar C says

Hi Jim, I have a question with respect to Tukey's HSD test. I have 3 groups, let's say A, B, and C, and I have to show that exactly one of the groups is significantly different. When I run Tukey's HSD test after ANOVA, I find that A-B, A-C, and B-C are all significantly different. Now how can I arrive at the conclusion that only A, only B, or only C is significantly different?

Jim Frost says

Hi Kumar,

I'm not sure that I understand your question. But if you're using Tukey's test, the number of pairs of groups that are significantly different is determined by the data. You can't tell the test to find just one group. Tukey's will compare all possible group pairings and tell you which ones have differences that are statistically significant. That won't necessarily be just one pair of groups.

You might be thinking of something like Hsu’s test, which I cover in this post. It takes the best group (defined by either the highest or lowest mean) and then compares that group to all other groups. Even then you might well find that the best group is significantly better than multiple other groups.

Wittawat Chantkran says

Dear Jim,

Using post hoc Tukey's HSD, I'm trying to reduce the number of comparisons by comparing A-B-C-D, A-B-E-F, and A-B-G-H separately, where A-B overlaps in every experiment.

Question: Why does the result for the A-B difference not stay the same in each comparison?

In this case, although it reduces the statistical power, should I go back to the simultaneous A-B-C-D-E-F-G-H comparison?

Many thanks,

KK

Chris says

Thanks for this overview. I was wondering why, if you have a significant ANOVA and then run a post hoc test (in this case Tukey), there was no significant difference between any groups? Thanks

Jim Frost says

Hi Chris,

There are two primary reasons.

First, the F-test that ANOVA uses and the post hoc tests are assessing different things, which can lead to differing results. The F-test looks at all the differences between the means and determines whether they are collectively significant. In other words, is that entire set of differences statistically significant? The post hoc tests assess the difference between a specific pair of means. It's entirely possible for the F-test to conclude that the entire set of differences was unlikely to occur if there is no effect, while the post hoc tests don't have sufficient evidence to conclude that the differences between specific pairs of means are statistically significant.

Additionally, with post hoc tests, you need to consider the fact that as the number of comparisons increases, the power of the tests decreases. I explain that in the post, so I won't retype it here. That power decrease doesn't apply to the F-test.
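The family-wise error rate behind that trade-off can be sketched with the standard formula for independent comparisons at alpha = 0.05 (an illustration of the principle, not the exact math of any particular post hoc procedure):

```python
# Probability of at least one false positive across m independent comparisons,
# each tested at alpha = 0.05. This is why post hoc procedures must tighten
# each individual comparison as the number of comparisons grows.
alpha = 0.05
for m in (1, 3, 6, 15):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m} comparisons -> family-wise error rate ~ {fwer:.3f}")
```

At 15 comparisons the uncorrected chance of at least one false positive is already over 50%, which is the "rolling the dice" problem the post describes.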

Alex says

Why is revealing your address required to buy a book? Has anyone tried to buy one, and is there an actual book at the end of the process? 🙂

Jim Frost says

Hi Alex, the system requires an address to calculate taxes. I promise that I don’t do anything with your address. Nothing at all.

Many people have bought both of my ebooks. If you want to see a free sample before buying, go to My Store and choose one of the free samples. No credit cards are required.

Rebekah says

Hello, I was wondering if you could help me with a question I have. What exactly is meant by lower- and higher-order interactions? And are there any examples of this you can give me?

Afnan says

Thank you for the information

But I have a question: if the result of the LSD post hoc test is significant for a negative mean difference, is that OK, or does it mean something different?

Jim Frost says

Hi Afnan,

I recommend that you don’t use Fisher’s LSD. It does not control the family error rate, which as I show in this post, can quickly get out of hand and lead to false positives.

That said, I don't see the negative mean difference as a problem in itself. Just be sure you understand which mean is higher than the other. The difference is simply a subtraction, and a negative value means the larger mean was subtracted from the smaller one. But please don't use Fisher's LSD!

Chloe says

Very helpful!

Kami says

Thanks for this really informative post. I have a question regarding using Benjamini–Hochberg method (BH) as a post-hoc method after ANOVA.

Can we use BH as a post hoc test when we do NOT have many groups (e.g., when the number of pairwise comparisons is less than 10)?

It seems that the BH method for controlling the FDR was developed for large data sets (e.g., genomic) where you have a large number of groups. But is there any limitation to using it with a low number of groups?

What about using it for two-way ANOVA?

Thanks

M N WANA says

That is great

Dr M Kaladhar says

Dear Jim Frost! Really a good analysis, and it helps laymen understand without any ambiguity! Many thanks!!

Jim Frost says

You’re very welcome! I’m happy to hear that it was helpful!

Manohar Lal says

Hi sir,

Good information. Could you add some information about post hoc tests for a strip plot design with three factors?

Kelly Papapavlou says

Thank you!! I went through all calculations steps again and MINITAB uses a different pathway to come up with the same result.

Kelly Papapavlou says

Thank you for the enlightening post. How did you calculate the standard error of the difference for the Tukey simultaneous tests? I tried to repeat the calculation using the formula SE = sqrt(ME/n), where ME is the within-groups variation from the ANOVA table (15.6) and n = 6 is the sample size per group. I get a standard error of 1.61, not 2.28…

On a different issue, what is the individual confidence level??

Thank you!!

Jim Frost says

Hi Kelly,

I received your multiple comments and contacts through the contact me page. As I note on the contact page, it takes some time for the comment approval process, particularly because I’m in a very different time zone than you! Patience please!

I'm using Minitab statistical software to calculate Tukey's test. You can see their Methods and Formulas page for Tukey's method to see how it is calculated in all of its details.

The individual confidence level is the confidence that you have that an individual group comparison falls within that interval. The simultaneous confidence level applies to the entire set of comparisons while the individual level applies to an individual comparison.
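One plausible source of the 1.61-vs-2.28 discrepancy (a sketch assuming the within-groups mean square is 15.6 and n = 6 per group, as in the comment): the standard error of a difference between two independent group means carries the sampling variance of both groups, so with equal group sizes it is sqrt(2 × MS_within / n), not sqrt(MS_within / n).

```python
from math import sqrt

# Assumed values taken from the comment above.
ms_within = 15.6  # within-groups mean square from the ANOVA table
n = 6             # sample size per group

se_mean = sqrt(ms_within / n)                # SE of one group mean, ~1.61
se_diff = sqrt(ms_within * (1 / n + 1 / n))  # SE of the difference, ~2.28
print(f"SE of a single mean: {se_mean:.2f}")
print(f"SE of the difference: {se_diff:.2f}")
```

The second value matches the 2.28 reported in the Tukey output, suggesting the original calculation used the SE of a single mean rather than the SE of a difference.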

Jeremy says

This is so superbly well-written. Hsu's MCB test is new to me. And thanks for demystifying some of the terminology (experiment-wise, family-wise, etc.). It would be nice to add a discussion of Bonferroni, LSD, and the others too.

Janna Beckerman says

This is great. I don’t know why, but I never thought of the Craps analogy. Thank you! And thank you for comparing and contrasting Dunnett’s versus Tukey’s.

Jim Frost says

Thanks, Janna! And, you’re very welcome! The idea of the dice analogy just popped into my head. But, I really love linking additional comparisons to rolling the dice on a false positive. It’s a crapshoot!

Dr Eajaz Dar says

Nice information. I would suggest discussing other post hoc tests, like DMRT and LSD, along with these to draw a clear distinction between them.

Jim Frost says

Thanks for the good suggestion. I felt covering three post hoc tests in one blog post was about the maximum for a reasonably long blog post, but I might need to write another post about it!