Post hoc tests are an integral part of ANOVA. When you use ANOVA to test the equality of at least three group means, statistically significant results indicate that not all of the group means are equal. However, ANOVA results do not identify which particular differences between pairs of means are significant. Use post hoc tests to explore differences between multiple group means while controlling the experiment-wise error rate.

In this post, I’ll explain what post hoc analyses are, describe the critical benefits they provide, and help you choose the correct one for your study. Additionally, I’ll show why failing to control the experiment-wise error rate should give you severe doubts about your results.

## Starting with the ANOVA Omnibus Test

Typically, when you want to determine whether three or more means are different, you’ll perform ANOVA. Statisticians refer to the ANOVA F-test as an omnibus test. Welch’s ANOVA is another type of omnibus test.

An omnibus test provides overall results for your data. Collectively, are the differences between the means statistically significant—Yes or No?

If the p-value from your ANOVA F-test or Welch’s test is less than your significance level, you can reject the null hypothesis.

- Null: All group means are equal.
- Alternative: Not all group means are equal.

However, ANOVA test results don’t map out which groups are different from other groups. As you can see from the hypotheses above, if you can reject the null, you only know that not all of the means are equal. Sometimes you really need to know which groups are significantly different from other groups!

Statisticians consider differences between group means to be an unstandardized effect size because these values indicate the strength of the relationship using values that retain the natural units of the dependent variable. Effect sizes help you understand how important the findings are in a real-world sense. To learn more about unstandardized and standardized effect sizes, read my post about Effect Sizes in Statistics.

**Related posts**: How F-tests Work in ANOVA and Welch’s ANOVA

## Example One-Way ANOVA to Use with Post Hoc Tests

We’ll start with this one-way ANOVA example, and then use it to illustrate three post hoc tests throughout this blog post. Imagine we are testing four materials that we’re considering for making a product part. We want to determine whether the mean differences between the strengths of these four materials are statistically significant. We obtain the following one-way ANOVA results. To follow along with this example, download the CSV dataset: PostHocTests.

The p-value of 0.004 indicates that we can reject the null hypothesis and conclude that the four means are not all equal. The Means table at the bottom displays the group means. However, we don’t know which pairs of groups are significantly different.

To compare group means, we need to perform post hoc tests, also known as multiple comparisons. In Latin, post hoc means “after this.” You conduct post hoc analyses after a statistically significant omnibus test (F-test or Welch’s).

Before we get to these group comparisons, you need to learn about the experiment-wise error rate.

**Related posts**: How to Interpret P-values Correctly and How to do One-Way ANOVA in Excel

## What is the Experiment-wise Error Rate?

Post hoc tests perform two vital tasks. Yes, they tell you which group means are significantly different from other group means. Crucially, they also control the experiment-wise, or family-wise, error rate. In this context, experiment-wise, family-wise, and family error rates are all synonyms that I’ll use interchangeably.

What is this experiment-wise error rate? For every hypothesis test you perform, there is a type I error rate, which your significance level (alpha) defines. In other words, there’s a chance that you’ll reject a null hypothesis that is actually true, which is a false positive. When you perform only one test, the type I error rate equals your significance level, which is often 5%. However, as you conduct more and more tests, your chance of getting at least one false positive increases. If you perform enough tests, you’re virtually guaranteed to get a false positive! The error rate for a family of two or more tests is always higher than the error rate for an individual test.

Imagine you’re rolling a pair of dice and rolling two ones (known as snake eyes) represents a type I error. The probability of snake eyes for a single roll is ~2.8% rather than 5%, but you get the idea. If you roll the dice just once, your chances of rolling snake eyes aren’t too bad. However, the more times you roll the dice, the more likely you’ll get two ones. With 25 rolls, snake eyes become more likely than not (50.6%). With enough rolls, it becomes virtually inevitable.
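To make the dice analogy concrete, here is a small sketch that computes the chance of seeing snake eyes at least once over a number of rolls. It assumes fair dice and independent rolls; the function names are just illustrative.

```python
from fractions import Fraction

# Probability of snake eyes (two ones) on one roll of two fair dice.
P_SNAKE_EYES = Fraction(1, 36)


def prob_at_least_one(rolls: int) -> float:
    """Chance of at least one snake eyes in `rolls` independent rolls."""
    return float(1 - (1 - P_SNAKE_EYES) ** rolls)


def rolls_until_likely(threshold: float = 0.5) -> int:
    """Smallest number of rolls where that chance exceeds `threshold`."""
    rolls = 1
    while prob_at_least_one(rolls) <= threshold:
        rolls += 1
    return rolls


for n in (1, 10, 25, 100):
    print(f"{n:>3} rolls: {prob_at_least_one(n):.3f}")
print("More likely than not after", rolls_until_likely(), "rolls")
```

Each roll plays the role of one hypothesis test, so the same arithmetic applies to a family of tests.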

**Related posts**: Types of Errors in Hypothesis Testing and Significance Levels and P-values

## Family Error Rates in ANOVA

In the ANOVA context, you want to compare the group means. The more groups you have, the more comparison tests you need to perform. For our example ANOVA with four groups (A B C D), we’ll need to make the following six comparisons.

- A – B
- A – C
- A – D
- B – C
- B – D
- C – D

Our experiment includes this family of six comparisons. Each comparison represents a roll of the dice for obtaining a false positive. What’s the error rate for six comparisons? Unfortunately, as you’ll see next, the experiment-wise error rate snowballs based on the number of groups in your experiment.
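If you want to list the family of comparisons programmatically, one way is to generate every unordered pair of group labels. This is only a sketch using Python’s standard library; the group labels come from the example above.

```python
from itertools import combinations

groups = ["A", "B", "C", "D"]

# Every unordered pair of groups is one comparison in the family.
pairs = list(combinations(groups, 2))

for first, second in pairs:
    print(f"{first} - {second}")

print(len(pairs), "comparisons in the family")
```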

## The Experiment-wise Error Rate Quickly Becomes Problematic!

The table below shows how increasing the number of groups in your study causes the number of comparisons to rise, which in turn raises the family-wise error rate. Notice how quickly the quantity of comparisons increases by adding just a few groups! Correspondingly, the experiment-wise error rate rapidly becomes problematic.

The table starts with two groups, and the single comparison between them has an experiment-wise error rate that equals the significance level (0.05). Unfortunately, the family-wise error rate rapidly increases from there!

When you compare all possible pairs of groups (i.e., all pairwise comparisons), the number of comparisons for N groups is N(N – 1)/2. That total is the family of comparisons for your experiment. Additionally, the formula for calculating the error rate for the entire set of comparisons is 1 – (1 – α)^C, where α is your significance level for a single comparison and C is the number of comparisons. This formula treats the comparisons as independent, so read its results as approximations.

The experiment-wise error rate represents the probability of a type I error (false positive) over the total family of comparisons. Our ANOVA example has four groups, which produces six comparisons and a family-wise error rate of 0.26. If you increase the groups to five, the error rate jumps to 40%! When you have 15 groups, you are virtually guaranteed to have a false positive (99.5%)!
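The figures quoted here follow directly from the two formulas. The sketch below recomputes them for a few group counts, assuming a per-comparison significance level of 0.05 and independent comparisons (the same simplification the formula makes).

```python
def n_comparisons(n_groups: int) -> int:
    """Number of pairwise comparisons among n_groups groups: N(N - 1)/2."""
    return n_groups * (n_groups - 1) // 2


def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """1 - (1 - alpha)^C, treating the C comparisons as independent."""
    return 1 - (1 - alpha) ** n_tests


print(f"{'Groups':>6} {'Comparisons':>12} {'Family error rate':>18}")
for groups in (2, 3, 4, 5, 10, 15):
    c = n_comparisons(groups)
    print(f"{groups:>6} {c:>12} {familywise_error_rate(c):>18.3f}")
```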

## Post Hoc Tests Control the Experiment-wise Error Rate

The table succinctly illustrates the problem that post hoc tests resolve. Typically, when performing statistical analysis, you expect a false positive rate of 5%, or whatever value you set for the significance level. As the table shows, when you increase the number of groups from 2 to 3, the error rate nearly triples from 0.05 to 0.143. And, it quickly worsens from there!

These error rates are too high! Upon seeing a significant difference between groups, you would have severe doubts about whether it was a false positive rather than a real difference.

If you use 2-sample t-tests to systematically compare all group means in your study, you’ll encounter this problem. You’d set the significance level for each test (e.g., 0.05), and then the number of comparisons will determine the experiment-wise error rate, as shown in the table.

Fortunately, post hoc tests use a different approach. For these tests, you set the experiment-wise error rate you want for the entire set of comparisons. Then, the post hoc test calculates the significance level for each individual comparison that produces the family-wise error rate you specify.
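To see the idea of working backward from a family error rate, here is a sketch using two simple corrections, Šidák and Bonferroni. These are stand-ins chosen because their arithmetic is easy to follow. They are not how Tukey’s method works internally (it relies on the studentized range distribution), so the values differ from Tukey’s.

```python
def sidak_alpha(family_alpha: float, n_tests: int) -> float:
    """Per-comparison level giving family_alpha over n_tests independent
    comparisons: 1 - (1 - family_alpha)^(1/C)."""
    return 1 - (1 - family_alpha) ** (1 / n_tests)


def bonferroni_alpha(family_alpha: float, n_tests: int) -> float:
    """Simpler and slightly more conservative: family_alpha / C."""
    return family_alpha / n_tests


print(f"{'Comparisons':>11} {'Sidak':>8} {'Bonferroni':>11}")
for c in (1, 3, 6, 10):
    print(f"{c:>11} {sidak_alpha(0.05, c):>8.4f} {bonferroni_alpha(0.05, c):>11.4f}")
```

Plugging the Šidák value back into 1 – (1 – α)^C returns the family error rate you started with, which is the sense in which the procedure "solves for" the individual level.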

Understanding how post hoc tests work is much simpler when you see them in action. Let’s get back to our one-way ANOVA example!

## Example of Using Tukey’s Method with One-Way ANOVA

For our ANOVA example, we have four groups that require six comparisons to cover all combinations of groups. We’ll use a post hoc test and specify that the family of six comparisons should collectively produce a family-wise error rate of 0.05. The post hoc test I’ll use is Tukey’s method. There are a variety of post hoc tests you can choose from, but Tukey’s method is the most common for comparing all possible group pairings.

There are two ways to present post hoc test results—adjusted p-values and simultaneous confidence intervals. I’ll show them both below.

### Adjusted P-values

The table below displays the six different comparisons in our study, the difference between group means, and the adjusted p-value for each comparison.

The adjusted p-value identifies the group comparisons that are significantly different while limiting the family error rate to your significance level. Simply compare the adjusted p-values to your significance level. When an adjusted p-value is less than the significance level, the difference between those group means is statistically significant. Because this process controls the family-wise error rate, we can be confident that the entire set of comparisons collectively has an error rate of 0.05.

In the output above, only the D – B difference is statistically significant while using a family error rate of 0.05. The mean difference between these two groups is 9.5.
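For readers who want to produce this kind of output in code, here is a minimal sketch using SciPy’s `f_oneway` and `tukey_hsd`. The strength values are made up for illustration and are not the PostHocTests dataset, so the numbers will not match the results discussed above.

```python
from itertools import combinations

from scipy.stats import f_oneway, tukey_hsd

# Hypothetical strength measurements (NOT the article's PostHocTests data).
samples = {
    "A": [40, 36, 33, 38, 35, 41],
    "B": [31, 30, 35, 29, 33, 32],
    "C": [36, 34, 37, 33, 38, 32],
    "D": [42, 39, 44, 40, 43, 38],
}
names = list(samples)
data = [samples[name] for name in names]

# Omnibus test first: are the means different at all?
anova = f_oneway(*data)
print(f"One-way ANOVA: F = {anova.statistic:.2f}, p = {anova.pvalue:.4f}")

# Tukey's method for all pairwise comparisons.
tukey = tukey_hsd(*data)
# 95% simultaneous confidence intervals for the differences.
ci = tukey.confidence_interval(confidence_level=0.95)

for i, j in combinations(range(len(names)), 2):
    # Index [j, i] gives (mean of group j) - (mean of group i).
    diff = tukey.statistic[j, i]
    p_adj = tukey.pvalue[j, i]
    low, high = ci.low[j, i], ci.high[j, i]
    verdict = "significant" if p_adj < 0.05 else "not significant"
    print(f"{names[j]} - {names[i]}: diff = {diff:6.2f}, "
          f"adjusted p = {p_adj:.4f}, "
          f"95% CI = ({low:6.2f}, {high:6.2f}), {verdict}")
```

The same result object supplies both presentations: `pvalue` holds the adjusted p-values, and `confidence_interval` returns the simultaneous confidence intervals.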

### Simultaneous Confidence Intervals

The other way to present post hoc test results is by using simultaneous confidence intervals of the differences between means. In an individual test, the hypothesis test results using a significance level of α are consistent with confidence intervals using a confidence level of 1 – α. For example, hypothesis tests with a significance level of 0.05 correspond to 95% confidence intervals.

In post hoc tests, we use a simultaneous confidence level rather than an individual confidence level. The simultaneous confidence level applies to the entire family of comparisons. With a 95% simultaneous confidence level, we can be 95% confident that *all* intervals in our set of comparisons contain the actual population differences between groups. A 5% experiment-wise error rate corresponds to 95% simultaneous confidence intervals.

### Tukey Simultaneous CIs for our One-Way ANOVA Example

Let’s get to the confidence intervals. While the table above displays these CIs numerically, I like the graph below because it allows for a simple visual assessment, and it provides more information than the adjusted p-values.

A difference of zero indicates that the group means are equal. When a confidence interval does not contain zero, the difference between that pair of groups is statistically significant. In the chart, only the difference between D – B is significant. These CI results match the hypothesis test results in the previous table, and they also convey information that the adjusted p-values do not.

These confidence intervals provide ranges of values that likely contain the actual population difference between pairs of groups. As with all CIs, the width of the interval for the difference reveals the precision of the estimate. Narrower intervals suggest a more precise estimate. And, you can assess whether the full range of values is practically significant. Remember, statistical significance doesn’t necessarily indicate that the results are meaningful in the real world.

When the interval is too wide (imprecise) to be helpful and/or the range includes differences that are not practically significant, you have reason to hesitate before making decisions based on the results.
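The reading rules in this section can be written down as a small helper. This is only a sketch: the intervals and the `min_meaningful` threshold below are hypothetical, and deciding what counts as a practically meaningful difference is a subject-area judgment rather than a statistical one.

```python
def assess_interval(low: float, high: float, min_meaningful: float) -> str:
    """Summarize a CI for the difference between two group means.

    min_meaningful is the smallest difference, in the units of the
    response, that would matter in practice (a judgment call).
    """
    if low <= 0 <= high:
        return "not statistically significant (interval contains zero)"
    smallest_plausible = min(abs(low), abs(high))
    if smallest_plausible >= min_meaningful:
        return "significant, and the whole interval is practically meaningful"
    return "significant, but the interval includes differences too small to matter"


# Hypothetical intervals, for illustration only.
for low, high in [(-2.0, 6.0), (4.0, 15.0), (0.5, 18.0)]:
    width = high - low
    print(f"({low}, {high}), width {width}: {assess_interval(low, high, 3.0)}")
```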

## Post Hoc Tests and the Statistical Power Tradeoff

Post hoc tests are great for controlling the family-wise error rate. Many texts would stop at this point. However, a tradeoff occurs behind the scenes. You need to be aware of it because you might be able to manage it effectively. The tradeoff is the following:

Post hoc tests control the experiment-wise error rate by reducing the statistical power of the comparisons.

Here’s how that works and what it means for your study.

To obtain the family error rate you specify, post hoc procedures must lower the significance level for all individual comparisons. For example, to end up with a family error rate of 5% for a set of comparisons, the procedure uses an even lower individual significance level.

As the number of comparisons increases, the post hoc analysis must lower the individual significance level even further. For our six comparisons, Tukey’s method uses an individual significance level of approximately 0.011 to produce the family-wise error rate of 0.05. If our ANOVA required more comparisons, it would be even lower.

What’s the problem with using a lower individual significance level? Lower significance levels correspond to lower statistical power. If a difference between group means actually exists in the population, a study with lower power is less likely to detect it. You might miss important findings!

Avoiding this power reduction is why many studies use an individual significance level of 0.05 rather than 0.01. Unfortunately, with just four groups, our example post hoc test is forced to use the lower significance level.

**Key Takeaway**: The more group comparisons you make, the lower the statistical power of those comparisons.

**Related post**: Understanding Statistical Power

## Managing the Power Tradeoff in Post Hoc Tests by Reducing the Number of Comparisons

One method to mitigate this tradeoff is by reducing the number of comparisons. This reduction allows the procedure to use a larger individual error rate to achieve the family error rate that you specify—which increases the statistical power.

Throughout this article, I’ve written about performing all pairwise comparisons—which compares all possible group pairings. While this is the most common approach, the number of contrasts quickly piles up! However, depending on your study’s purpose, you might not need to compare all possible groups.

Your study might need to compare only a subset of all possible comparisons for a variety of reasons. I’ll cover two common reasons and show you which post hoc tests you can use. In the following examples, I’ll display only the confidence interval graphs and not the hypothesis test results. Notice how these other methods make fewer comparisons (3 and 4) for our example dataset than Tukey’s method (6).

While you’re designing your study, it’s crucial that you define in advance the multiple comparisons method that you will use. Don’t try various methods, and then choose the one that produces the most favorable results. That’s data dredging, and it can lead to spurious findings. I’m using multiple post hoc tests on a single dataset to show how they differ, but that’s not an appropriate practice for a real study. Define your methodology in advance, including one post hoc analysis, before analyzing the data, and stick to it!

**Key Takeaway**: When it’s possible, compare a subset of groups to increase your statistical power.

## Example of Using Dunnett’s Method to Compare Treatment Groups to a Control Group

If your study has a control group and several treatment groups, you might need to compare the treatment groups only to the control group.

Use Dunnett’s method when the following are true:

- Before the study, you know which group (control) you want to compare to all the other groups (treatments).
- You don’t need to compare the treatment groups to each other.

Let’s use Dunnett’s method with our example one-way ANOVA, but we’ll tweak the scenario slightly. Suppose we currently use Material A. We performed this experiment to compare the alternative materials (B, C, and D) to it. Material A will be our control group, while the other three are the treatments.

Using Dunnett’s method, we see that only the B – A difference is statistically significant because the interval does not include zero. Using Tukey’s method, this comparison was not significant. The additional power gained by making fewer comparisons came through for us. On the other hand, unlike Tukey’s method, Dunnett’s method does not find that the D – B difference is significant because it doesn’t compare the treatment groups to each other.
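To see why this family is smaller, the sketch below lists the comparisons each approach makes for our four materials. It only enumerates the comparisons and does not compute Dunnett’s intervals. For the test itself, recent SciPy versions provide `scipy.stats.dunnett`, and the helper names here are illustrative.

```python
from itertools import combinations


def all_pairs(groups):
    """Every pairwise comparison, as in Tukey's method."""
    return list(combinations(groups, 2))


def versus_control(groups, control):
    """Only treatment-versus-control comparisons, as in Dunnett's method."""
    return [(g, control) for g in groups if g != control]


materials = ["A", "B", "C", "D"]

print(len(all_pairs(materials)), "comparisons for all pairs")
dunnett_family = versus_control(materials, control="A")
print(len(dunnett_family), "comparisons against the control:")
for treatment, control in dunnett_family:
    print(f"{treatment} - {control}")
```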

## Example of Using Hsu’s MCB to Find the Strongest Material

If your study’s goal is to identify the best group, you might not need to compare all possible groups. Hsu’s Multiple Comparisons to the Best (MCB) identifies the groups that are the best, not significantly different from the best, and significantly different from the best.

Use Hsu’s MCB when you:

- Don’t know in advance which group you want to compare to all the other groups.
- Don’t need to compare groups that are not the best to other groups that are not the best.
- Can define “the best” as either the group with the highest mean or the lowest mean.

Hsu’s MCB compares each group to the group with the best mean (highest or lowest). Using this procedure, you might end up with several groups that are not significantly different from the best group. Keep in mind that the group that is truly best in the entire population might not have the best sample mean due to sampling error. The groups that are not significantly different from the best group might be as good as, or even better than, the group with the best sample mean.

### Simultaneous Confidence Intervals for Hsu’s MCB

For our one-way ANOVA, we want to use the material that produces the strongest parts. Consequently, we’ll use Hsu’s MCB and define the highest mean as the best. We don’t care about all of the other possible comparisons.

Group D is the best group overall because it has the highest mean (41.07). The procedure compares D to all of the other groups. For Hsu’s MCB, a group is significantly better than another group when the confidence interval has zero as an endpoint. From the graph, we can see that Material D is significantly better than B and C. However, the interval for the A – D comparison extends across zero, which indicates that A is not significantly different from the best.

Hsu’s MCB determines that the candidates for the best group are A and D. D has the highest sample mean and A is not significantly different from D. On the other hand, the procedure effectively rules out B and C from being the best.

## Recap of Using Multiple Comparison Methods

In this blog post, you’ve seen how the omnibus ANOVA test determines whether means are different in general, but it does not identify specific group differences that are statistically significant.

If you obtain significant ANOVA results, use a post hoc test to explore the mean differences between pairs of groups.

You’ve also learned how controlling the experiment-wise error rate is a crucial function of these post hoc tests. These family error rates grow at a surprising rate!

Finally, if you don’t need to perform all pairwise comparisons, it’s worthwhile comparing only a subset because you’ll retain more statistical power.

If you’re learning about hypothesis testing and like the approach I use in my blog, check out my eBook!

Anoop says

Dear Jim,

Quick question. I do have your book 🙂

In a pre-post design for an RCT, what is the main effect of group means? If I understand it right, it is comparing the average of Group 1 (pre and post) to the average of Group 2 (pre and post). So they are removing the time effect here. So the main effect doesn’t tell us anything useful in a pre-post design for ANOVA, right?

What we care about is the interaction of time * group right?

Thank you for all your work!

Jim Frost says

Hi Anoop!

Yes, that’s correct! Using main effects, you can determine whether there is a difference between the pre and post groups for all subjects. Or, you can see if there’s an overall difference between the treatment and control groups. However, what you really want to know is whether the difference between the pre and post groups for the treatment group is larger than the difference between the pre and post groups for the control group. And that’s why you need to assess the interaction effect rather than the main effect. Basically, you’re assessing whether the difference between the pre and post test depends on the group assignment. As you indicate, that’s the time*group interaction. The p-value for that term indicates whether that is significant. An interaction plot will illustrate the results to clarify what the interaction really means!

I hope that helps!

Rabia B says

Hi Jim.

I hope you are well.

I’m working on analysing the data of a study. I ran the Welch ANOVA because Levene’s test was significant for the factors under study. The problem is that for some of the factors SPSS is giving an error saying that the Welch test of robustness couldn’t be performed because there is at least one group with zero variance. In this case, is there another test I can do?

Also, the results for the Games Howell test are very weird… I am getting pages upon pages of results with no clear results.

If you could kindly suggest a test instead of Welch ANOVA I could use, as well as a Post hoc test that would apply, I would be sincerely grateful.

Thank you very much!

Jim Frost says

Hi Rabia,

If a group has zero variance, then it either has only one observation or it has multiple observations that are all equal (within that group). In either case, you can’t analyze that group. I’d consider removing that group from your analysis.

Laura says

Jim, I need help, please! When running a post hoc test (using Tukey’s test), I only get the p-values and the diff, lower, and upper limits, but I want the whole information, including t-values. The table you posted there is absolutely perfect. Would you mind sharing the code in R to get the full pairwise comparisons, including t-values, SEs, etc.?

Jim Frost says

Hi Laura,

Unfortunately, I didn’t use R to create the table. I used Minitab statistical software. However, to calculate the t-value, just take the difference between two groups and divide by the standard error of the difference. That said, the most important aspects, in my opinion, are the CIs of the differences. Not only do they tell you if it’s a significant difference (when they exclude zero) but also the precision of the estimate. Sometimes a difference can be significant but there’s a wide CI, making it not the most useful information!
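As a rough sketch of that calculation (in Python rather than R, and with made-up numbers), the code below assumes the standard error of the difference is built from the pooled error variance (MSE) of the one-way ANOVA. That is an assumption about how a given package computes it, so check your software’s documentation.

```python
from math import sqrt
from statistics import mean


def pooled_mse(groups):
    """Pooled within-group variance (the ANOVA mean square error)."""
    sse = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    df_error = sum(len(g) for g in groups) - len(groups)
    return sse / df_error


def t_value(first, second, mse):
    """Difference in means divided by the standard error of the difference."""
    se_diff = sqrt(mse * (1 / len(first) + 1 / len(second)))
    return (mean(first) - mean(second)) / se_diff


# Hypothetical data for three groups, for illustration only.
a = [10, 12, 14]
b = [20, 22, 24]
c = [15, 17, 19]

mse = pooled_mse([a, b, c])
print(f"MSE = {mse:.3f}")
print(f"t for b - a = {t_value(b, a, mse):.3f}")
```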

RABIA NOUSHEEN says

Thank you Jim for replying in detail, your comments are very helpful.

RABIA NOUSHEEN says

Hi Jim

I found this article really interesting and helpful. I have a question about post hoc tests for GLMs. I analysed binary data and count data using binomial and negative binomial GLMs, respectively. I am doing Tukey pairwise comparisons with a Bonferroni correction for significant parameters. Is that correct? Further, how important is it to compare treatments with the control group? I have a control, but if I compare only the treatments, does that make sense? Regarding the control, I am really confused about how to compare it with the treatments. The reason for my confusion is the way I have arranged my data (see below).

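Predictors: Time (h), Concentration (particles), Particle type. Response: Mortality (%).

| Time (h) | Concentration (particles) | Particle type | Mortality (%) |
| --- | --- | --- | --- |
| 1 | 0 | A | 9 |
| 2 | 0 | A | 6 |
| 1 | 500 | A | 21 |
| 2 | 500 | A | 27 |
| 1 | 0 | B | 9 |
| 2 | 0 | B | 6 |
| 1 | 500 | B | 21 |
| 2 | 500 | B | 27 |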

Concentration zero is actually the control, but it is entered as numeric. I have different particle types, and each one has a different control. How do I make control vs. treatment comparisons in this case? What about comparing only the treatments? If my interaction term particle type*time is significant, then the post hoc test gives the comparisons of possible combinations separately for each concentration: for concentration 0, all possible combinations of time and type, then for 500, and so on. I am wondering what the right approach is. Please guide.

Jim Frost says

Hi Rabia,

Yes, it’s important to use a method such as Bonferroni that controls for multiple comparisons, otherwise the family error rate quickly grows out of hand! You can use Bonferroni or other methods depending on your needs.

As for whether it’s important to compare to the control, I have both a general and specific answer for you.

Generally speaking, the correct approach is to determine, before analyzing the data, which comparisons are important to your study, and to plan and document those comparisons in advance. That’s driven by the needs of your study, the questions you need answers to, theory, what other studies have done, etc. In other words, identifying the correct comparisons to make depends on a variety of factors. Unfortunately, I don’t know the full context of your study and can’t provide an informed answer. But focus on the questions that you most need to answer. What information/findings do you need to obtain? Which comparisons provide those answers? Which comparisons don’t add anything important?

A little more specifically, because this appears to be a designed experiment with a control group, I have a hard time picturing a scenario where you wouldn’t want to compare the treatments to the control. After all, you or the designers must have included the control group as a point of reference. It’s a group that you intentionally add to experiments to which you can compare the treatments. You don’t really know how well the treatments are working compared to no treatment unless you have a control group. So, I’d imagine you’d really want to compare the treatments to the control group to be sure that there is a statistically significant improvement.

If you want to determine which treatment is best, you’d also want to compare the treatments with each other. There are specialized methods for this, such as Hsu’s MCB.

I don’t fully understand the experimental design with multiple control groups. That might well influence your thinking on these matters but that’s really getting in deep with details with which I’m not familiar! You might well need to consult with a statistician at your institution who can look into your study and give it the time it deserves!

Christopher says

Yes please. Do you report the group median in the posthoc (KW)?

Christopher says

How can I possibly report that please?

Jim Frost says

Christopher, I’ve answered that question. If there is something else you need to know about it, please ask more specifically. Thanks!

Christopher says

Thank you very much for the quick response. Since all the groups are significantly different, I was wondering how you can report that if I used a DSCF pairwise comparisons (KW).

Jim Frost says

Hi Christopher, I believe you’d just state that all pairs of groups have differences between their medians that are statistically significant.

Jordyn says

Hi Jim,

I am having trouble deciding what is the correct test to run for the One way ANOVA like (LSD, Tukey, Bonferroni) and need to do it correctly. The question is to run one on social media conditions on body satisfaction scores. There is 5 different other options so it’s asking for a specific group and sample sizes are similar (27,23,28,22). I’m not sure if this makes sense. Thank you!

Michael says

Hi Jim,

I wanted to follow up with you and thank you for the great advice. I used Dunnett’s post hoc test in my analysis as you suggested. Thank you for the wonderful service that you provide.

Regards,

Michael

Jim Frost says

You’re very welcome, Michael. I am so glad to hear that I could be of help! 🙂

Christopher says

Thanks for your explanation. I was wondering, what do you do and how do you interpret it if all the results from your post hoc test are significant? I found that in my non-parametric (KW) analysis.

Jim Frost says

Hi Christopher,

If you’re finding that all of the differences between your groups are significant, that indicates that each group has a distinct population mean. Your sample provides sufficient evidence to conclude that none of the group means are equal at the population level.

Himanshu says

Hi sir,

How do we determine which post hoc test, among the variety available (such as Duncan’s DMRT, Tukey’s HSD, LSD, etc.), to apply to our data?

Jim Frost says

Hi Himanshu,

You’ll need to define what your goals are for the post hoc test. Do you want to compare all possible groups or just some of the groups? I discuss various possibilities in this article.

sam says

Hi, I commented on your regression page, but upon looking at more statistical tests, would it be advisable to do a one-way ANOVA for my experiment? So it would be type of social media as the IV. 3 groups will be formed: 1st being those who use the new social media site, 2nd those who do not use social media, and 3rd being a control of Facebook and Instagram users (to control for their influence on results). Then, based upon my results, I would do a post hoc test with a Bonferroni correction?

Thank you, Sam

Jim Frost says

It’s possible to compare group means like that. You need to decide what question you want to answer. If it’s about group means being different, then ANOVA might be the way to go. If you’re looking at relationships between continuous/ordinal variables, you can go with regression. So, you need to really figure out what you want to answer and what data you’ll have to determine which analysis is best.

I don’t usually recommend the Bonferroni correction because it is relatively conservative. I generally recommend one of the other methods.

Michael says

Hi Jim,

I have a relatively simple 3 sample experiment, but I’m not sure what post hoc test is appropriate. The study design is as follows: A = Control (no disease & no treatment), B = Disease state, C = Disease state with treatment. I’m only interested in two comparisons: BvsA to find the changes between disease vs. control, and CvsB to see what changes occur upon treatment. I am NOT interested in CvsA. It looks like one-way ANOVA is the way to go here, but Tukey doesn’t seem like the correct post hoc test because I don’t want all pairwise comparisons. I’m uncertain if Dunnett’s is correct because while B is in both comparisons (BvsA & CvsB), it is not the “Control” as you described in your article. Is there another post hoc test that would be appropriate for my scenario? Would it be best just to perform two individual t-tests (BvsA & CvsB)? If this is the way to go, how do I take type I error into account when determining the significance?

Thank you in advance,

Michael

Jim Frost says

Hi Michael,

There’s a fairly simple workaround for this situation. I’d use a Bonferroni correction, which involves dividing your significance level by the number of comparisons (e.g., 0.05 / 2 = 0.025).

Then, perform two 2-sample t-tests for B vs A and C vs B. For those tests, use the Bonferroni corrected significance level. In other words, compare the p-values for your t-tests to 0.025 to determine statistical significance. Piece of cake!
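In code, the decision rule looks something like the sketch below. The p-values are hypothetical placeholders standing in for the results of your two t-tests, and the function name is illustrative.

```python
def bonferroni_decisions(p_values: dict, family_alpha: float = 0.05):
    """Compare each p-value to family_alpha divided by the number of tests."""
    threshold = family_alpha / len(p_values)
    decisions = {name: p < threshold for name, p in p_values.items()}
    return threshold, decisions


# Hypothetical p-values from the two planned 2-sample t-tests.
planned = {"B vs A": 0.012, "C vs B": 0.040}

threshold, decisions = bonferroni_decisions(planned)
print(f"Corrected significance level: {threshold}")
for name, significant in decisions.items():
    print(f"{name}: {'significant' if significant else 'not significant'}")
```

Note that a p-value of 0.040 would pass an uncorrected 0.05 cutoff but not the corrected 0.025 cutoff.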

Typically, I don’t recommend using the Bonferroni correction because it’s conservative (meaning it errs on the side of failing to reject the null more frequently than needed). However, in this case, it allows you to reduce the number of comparisons. Tukey’s would have 3 comparisons.

Actually, another thought occurs after I wrote all the above. You could use Dunnett’s if you define Group B as your control group. I know it’s not really the control but that’s ok. It’s just telling the algorithm which groups you want to compare. It would then compare B to both A and C, just as you want! I’d recommend this approach over the Bonferroni method because it should preserve more of your statistical power. But either method would work.

I hope that helps!

emily says

Hi! Can I ask why there is a negative value in a confidence interval? What is the cause of this? Can it be due to a small sample size?

Jim Frost says

Hi Emily,

For post hoc tests, you’re comparing two groups and typically the difference is just Group 1 Mean – Group 2 Mean. If Group 2’s mean is higher than Group 1’s, it’ll give you a negative point estimate for the difference due to simple arithmetic.

As for the CI, even if Group 1 has a greater mean than Group 2, which will give you a positive point estimate for the difference, the CI can still include negative values. That’s due to the fact that the difference between means is close to zero relative to the amount of variability in the data. When the CI includes zero (it has both a positive and a negative endpoint), the difference between means is NOT statistically significant because no difference (a difference of zero) is included in the likely values.

This can happen for several reasons. Yes, it’s possible that a small sample size causes it by producing wide CIs. Noisy data (large variability) also produces wide CIs. Wide CIs are more likely to cross zero. However, it’s also possible that there is no difference between group means at the population level and that’s reflected in your sample and the CIs.
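To make the arithmetic concrete, here is a small Python sketch. The group means and margins of error are invented numbers, and `difference_ci` and `includes_zero` are hypothetical helpers; real margins of error come from your post hoc output.

```python
def difference_ci(mean_1, mean_2, margin_of_error):
    """Return (lower, upper) for the CI of mean_1 - mean_2."""
    diff = mean_1 - mean_2
    return diff - margin_of_error, diff + margin_of_error


def includes_zero(ci):
    """True when the interval has zero between its endpoints."""
    lower, upper = ci
    return lower <= 0 <= upper


# Same positive point estimate (2.0), different precision.
narrow = difference_ci(12.0, 10.0, margin_of_error=1.5)
wide = difference_ci(12.0, 10.0, margin_of_error=3.0)

print(narrow, "includes zero:", includes_zero(narrow))
print(wide, "includes zero:", includes_zero(wide))
```

The point estimate is identical in both cases. Only the wider interval reaches below zero, so only that comparison would be reported as not significant.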

I hope that helps!

Santosa Sandy says

Thank you Jim. You explain it intuitively

Hrysanthi says

Hi Jim,

Is it possible that I do not have a significant difference at first, but after the post hoc analysis I do?

Jim Frost says

Hi Hrysanthi,

Yes, what you describe is possible. The F-tests and post hoc tests use different methods to determine significance. Consequently, they can come to different conclusions occasionally. These differences usually occur in border cases. Your F-test result was probably just not quite significant while your post hoc test was just significant. In these cases, it’s OK to report the significant post hoc results.

Me says

The thing is that I am reporting the results of a pre-test, mid-test and post-test, so I do not know if switching the groups would be a good idea. What should I do? Is getting negative numbers incorrect though?

Thank you in advance.

Jim Frost says

Hi, in that case, you’re right, it doesn’t make sense to switch the order. Just be sure to understand what the negative value means so you can explain it. Look at the order of subtraction. For example, if it is X1 – X2 and the result is negative, you know that X2 must be larger than X1. Maybe X2 is your post-test, in which case it indicates that the mean is rising over time. However, if X2 is the pre-test score, you know the mean is decreasing.

Just be sure to understand the order in which things are being subtracted and know that the 2nd value is larger than the first (assuming you’re dealing with all positive values). Incorporate that understanding into your interpretation.

Me says

Hi Jim,

I was wondering if it is normal to get negative numbers using a Bonferroni test (post hoc test). I got negative numbers in the mean difference section (first two rows) as well as in the “t” section (first two rows). I am actually very confused as I expected positive numbers.

Thank you in advance.

Jim Frost says

Hi, that’s probably because of the order of the groups in the subtraction. If you’re comparing groups X1 and X2 (X1-X2) and X2 has a larger mean, you’ll obtain a negative value. I’m guessing that’s the case. If you want to obtain a positive value, just switch the order of the groups in your analysis.

Dr. Kelly Pivik says

Hi Jim! I have an extremely basic question. You are showing output here, but it’s not clear how you got that output. I am talking about the One-Way ANOVA: Strength vs. Material and the residual plots for Tukey’s. This is the information I need. Just doing the one-way ANOVA does not give the same output.

Thank you!

Jim Frost says

Hi, I’m not sure what you’re asking for? I performed the one-way ANOVA and then did the follow-up post hoc tests, all as shown in the article. I did not show any residual plots.

Susmita K says

Hi Elise,

Did you figure out how to do this as I am stuck at the same problem 🙁

Could you please advise me? Thanks!

Stephanie says

Thank you for your clear explanations in all of your posts. Can you please explain why post-hoc comparison tests cannot be performed when factors are designated as random (at least in Minitab)?

Linda B says

Hi there, great articles, thanks so much. I am just wondering: with a three-way ANOVA, would you recommend using t-tests with the Bonferroni correction as post hoc tests? Or would you recommend running a two-way ANOVA with the error term syntax from the three-way ANOVA and then applying the Bonferroni correction to the results? I’m finding it really hard to make a call on this!

Moe says

Hi

I have four groups with 7 observations each. ANOVA shows significance with p = 0.032, but Tukey’s does not show any significant difference between any of the groups. What should I do in this situation? Should I use LSD? How should I report the results when my aim is to compare all the group means with each other? Please advise. Thanks!

Elise MBARGA says

Hi Jim

I am trying to see if there is a relationship between drug concentration and cell proliferation. I have used one-way ANOVA but I am not sure what post hoc test to use. Could you advise me please?

Dr. Ritesh Patel says

Very good information given for the post hoc tests. Thank you sir!

Jim Frost says

Thank you!

Rex Cao says

I see, many thanks for your clarification!

Virginia says

When is a post hoc test inappropriate to use with a one-way ANOVA?

Jim Frost says

Hi Virginia,

I’d say the main case for when it’s inappropriate to use a post hoc test with ANOVA is when your data don’t satisfy the assumptions for ANOVA itself. For example, your groups should have roughly equal variances. And, if your sample size is small, the data should be normally distributed.

Some people will say if the one-way ANOVA is not significant, then you shouldn’t perform a post hoc test. However, others say it is ok in that case. I fall in the group of thinking it’s ok even when the one-way ANOVA is not significant. But, be aware there is debate over that point!

Rex Cao says

Hi Jim,

Many thanks for this article, very helpful.

Now I understand why sometimes the p-value of ANOVA shows a significant difference, but a post hoc analysis such as the Student-Newman-Keuls test does not show differentiation between treatments.

I am an agricultural field researcher. I do lots of pesticide field trials and sometimes have up to 15 treatments in my protocol, which increases the error rate significantly.

So, based on what you said, I should not include the untreated control (negative control) when I perform a test such as Tukey’s HSD, because this will create highly skewed data? Is Dunnett’s the only test I should use when I include an untreated control?

Jim Frost says

Hi Rex,

If you have a control group and multiple treatment groups, I’d highly recommend a method like Dunnett’s over Tukey’s as long as you don’t need to compare the treatment groups to each other. You’ll gain more statistical power.

By the way, you’re not “skewing” the data if you have additional comparisons. The additional comparisons reduce the statistical power of the test, which is a different concept. It just means the test is less likely to detect a significant difference in a sample when the difference exists in the population.
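One way to see why a control-based method keeps more power is simply to count the comparisons each approach makes. The sketch below assumes `k` groups in total, one of which is the control. The figure of 16 groups (15 treatments plus one untreated control) is an assumption for illustration, and the function names are hypothetical.

```python
from math import comb


def tukey_comparisons(k):
    """All pairwise comparisons among k groups."""
    return comb(k, 2)


def dunnett_comparisons(k):
    """Each non-control group versus the single control group."""
    return k - 1


k = 16  # assumed: 15 treatments plus 1 untreated control
print("All pairwise (Tukey):", tukey_comparisons(k))
print("Versus control (Dunnett):", dunnett_comparisons(k))
```

Under that assumption, comparing every pair means 120 comparisons, while comparing each treatment to the control means only 15.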

Greg Stauffer says

Hi Jim;

Do you have directions for calculating confidence intervals of differences between means to visualize Tukey HSD data? I’d like to produce the Tukey’s Simultaneous 95% C.I. graph you have listed above. Also, which one of your books would have this information? Thanks so much!

Souad Karam says

How can I cite this article? What was the publication date?

Jim Frost says

Hi Souad,

Purdue University’s Online Writing Lab (OWL) shows you how to cite webpages. Click the link and scroll down to the section titled, “A Page on a Website.” You don’t use the publication date but rather the date you accessed the URL.

I hope that helps!

BANUVATHY says

Hello Jim

Thank you for the explanation. I performed a paired t-test between two groups (for example, A and B). It showed statistical significance. But when I performed one-way ANOVA with the same groups plus two more groups (A, B, C, D), the post hoc Tukey comparison between A and B was not statistically significant. Why does that happen? Does it mean A and B are not significantly different? Which test should I rely on, the paired t-test or the post hoc Tukey test? Kindly give me your feedback on this. Thank you.

Jim Frost says

Hi Banuvathy,

To understand why, reread the section in this article titled, “Post Hoc Tests and the Statistical Power Tradeoff.” When you compare more groups, the test loses statistical power. In other words, it becomes less able to detect differences. And, that’s what you’re seeing. When you compare just the two groups, there’s no reduction in power. However, with four groups power is reduced.

Because you have four groups, you need to go by the post hoc test results. However, if you don’t need to compare all possible groups, which Tukey’s method does, then you can consider other post hoc methods. I discuss alternatives in this article as well.

Ahmed Sadaka says

Hi Jim,

Many thanks for this brilliant, easy to follow article. I was however wondering if the experiment-wise error rate inflation, and thus the need for p-value adjustment, would apply to an independent t-test, say 2 groups (treatment & control) compared on n different variables (e.g. demographics, treatment effects, etc.). If so, would the adjusted significance level be 0.05/n?

Jim Frost says

Hi Ahmed,

I’m not 100% sure what you’re asking exactly.

If you have just the two groups, and you’re controlling for the other variables you list, say in a regression model, then you don’t need to adjust the p-value because you’re still just comparing two groups.

However, if you’re comparing more than two groups that are based on those other variables, you need the adjustment.

I’m just not quite clear on what your scenario is, whether it’s just two groups but controlling for other variables, or more than two groups based on other variables.

I hope that helps!

Ami Choi says

Hi Jim,

Thanks for the helpful post! Quick question — when you adjust the p value, do you assess the univariate F value with the conventional 0.05 level and then apply the adjusted p value only to the post hoc tests for specific group comparison, OR do you apply the adjusted p value to both?

Jack says

Hi Jim, thanks for the blog post. It was very insightful. I am conducting a study where I’m looking into insider trading abnormal returns, year by year from 2010-2019.

I would like to know whether any year within the time period had particularly abnormal insider trading returns. This will give me 9 different groups which after reading this seems like a lot. Am I correct in thinking that the best test to use for this analysis will be Tukey’s method?

h0 : Abnormal trading(AB) 2010 = AB2011 = AB2012 = etc to 2019.

h1: not all equal

Also, would you advise that I minimise the number of groups to 3-6 (instead of 9) to increase statistical power?

Thanks

Jack

Jim Frost says

Hi Jack,

That depends on your data and your goals. If you have a reference year and you want to determine whether the other years are different from that year, that would allow you to reduce the number of comparisons. You could use Dunnett’s method, which I detail in this article. But, I’m not sure whether you have a reference year? If you want to compare all years against each other, then, yes, use Tukey’s.

Robert Blasko says

Thanks for the reply Jim! In some cases the block might be significant. I tried to use Dunnett’s in this model, but it requires designating one control for all the comparisons, which doesn’t fit my situation since I have a control at every site. Now I used a model with site and treatment as fixed and block as random, with block nested within site. I also included the site and treatment interaction. The only problem is that I often get a significant treatment effect while the interaction term is not significant, so I don’t know the right method to find out at which site the effect occurred. I then tried to run the mixed model at every site separately, but that is quite laborious since I have many variables, and I’m not sure it’s quite the correct way of doing it. Thanks, Rob

Robert Blasko says

Hi Jim, nicely explained! However, I still have troubles to assess what would be the best method for my situation as I have a slightly more complicated design. I have 5 sites, at each site I have 3 blocks, each block includes a control plot and a treatment plot. This treatment is basically the same at all sites. So I used GLM with site, treatment as fixed and block as random factors in my model and I included the site and treatment interaction too. Now, when I do the post hoc pairwise comparisons for sites, and site*treatment to see at which site the treatment had an effect, I get often contrary results to the ANOVA results, because the number of pairwise comparisons is large. I used Tukey, but I can choose Bonferroni, Fisher LSD, or Sidak in my software. How could I increase the power of my post hoc comparisons and still find out where are the differences? Thanks!

Jim Frost says

Hi Robert, is your blocking variable significant? If not, you can consider removing that and thereby having fewer groups to compare. I would use something like Dunnett’s method, which doesn’t compare all possible pairings. Instead, it just compares the treatments to the controls.

sara says

After a significant interaction, follow-up tests were done to explore the exact nature of the interaction. I found an effect of one independent variable within one level of a second independent variable, so only one effect was significant and the other 2 effects were not. Do I still conduct post hocs, since one simple main effect was significant?

Jim Frost says

Hi Sara, if you want to determine which pairs of groups specifically have significant differences, you’ll need to perform a post hoc test.

erika zafeiraki says

Hey Jim, thank you for the info. However, I have a question regarding post hoc tests.

I have 13 groups (locations) of different sizes (from 2 concentrations in one group to 20 in another group). More specifically, I have samples from different places and the concentration of metals in them. The number of samples is not the same for each location.

Element concentrations were both normally distributed and homogeneous, so I further applied one-way ANOVA and a statistically significant difference was observed. So, I need to apply a post hoc test but I don’t know which one. I applied Scheffe, Bonferroni and LSD, but I am not sure which one is the best. So, it would be really helpful if you could tell me which one to apply.

Thank you in advance

Erik

Jim Frost says

Hi Erika,

That’s an impossible question to answer in general. It depends on the specifics of what you want to learn. Do you want to compare all possible groups to each other, just compare treatment groups to a control group, or just find out which group is best and which groups are not significantly different from the best? It’s your subject area knowledge combined with what you need to learn that determines which method is best.

You can rule out Fisher’s LSD because that only maintains the Type I error rate at your significance level when you have three groups. Bonferroni compares all possible groups but is known to be conservative. If you want to compare all possible groups, I’d consider Tukey’s method. Although, be aware that with 13 groups that’s 78 comparisons if you compare all groups. That would really lower your statistical power as I describe in this article.
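The count of 78 comes from choosing 2 groups out of 13. The sketch below reproduces it and also shows roughly how large the family error rate would become if each comparison were run at 0.05 with no correction. That second figure rests on the simplifying assumption that the comparisons are independent, which pairwise comparisons sharing groups are not, so treat it as a rough guide only.

```python
from math import comb

groups = 13
alpha = 0.05

comparisons = comb(groups, 2)

# Chance of at least one false positive across all comparisons,
# assuming independent tests each run at `alpha` (a simplification).
uncorrected_family_error = 1 - (1 - alpha) ** comparisons

print("Pairwise comparisons:", comparisons)
print(f"Uncorrected family error rate: {uncorrected_family_error:.2f}")
```

With that many uncorrected comparisons, at least one false positive is close to certain, which is why the post hoc procedure has to tighten each individual test so much.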

I know less about Scheffe’s test than the others. But, I gather it’s good when you want to make many comparisons and they don’t have equal sizes. So, Scheffe’s test might be a good choice for you if you want to compare all possible pairs.

My recommendation is to determine what you really need to compare to learn what you need to learn. With so many groups, you hopefully don’t need to compare all possible combinations of groups.

Agegnehu says

Thanks Frost !

The way you put details using plain English and practical life experiences really helps much particularly for those reasonably far from the discipline of statistics as it is true for me.

Could you have time to respond to this question, please?

How can the statistical power of the post hoc tests be calculated to know how much we are underpowered? And what is the role of sample size in increasing the statistical power of the post hoc tests?

Thanks ! Age

Jim Frost says

Hi Agegnehu,

I’ve never seen a statistical package that calculates the power of a post hoc test. However, based on the properties of the test, we know that you lose power with more comparisons for the reasons I describe. One way to get an idea of how much power you’re losing is by looking at the individual confidence level for a set of comparisons. For example, in the section about Tukey’s method, the output indicates that for the six comparisons, the procedure uses an individual confidence level of 98.89%. Using that individual confidence level for each of the six comparisons collectively yields the 95% joint confidence.

If you convert that confidence level to the equivalent significance level, you’d see that it’s as if you were using an alpha of 1 – 0.9889 = 0.0111 for each comparison. Any time you use a lower significance level, it decreases the power of the test.
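The conversion above can be sketched in a few lines of Python. The 98.89% figure is the individual confidence level quoted from the Tukey output. The Bonferroni line is added only as a familiar point of comparison and is not part of that output.

```python
individual_confidence = 0.9889  # from the Tukey output for six comparisons
family_alpha = 0.05
comparisons = 6

# Effective per-comparison significance level under Tukey's method.
tukey_individual_alpha = 1 - individual_confidence

# For comparison: Bonferroni splits the family alpha evenly.
bonferroni_individual_alpha = family_alpha / comparisons

print(f"Tukey individual alpha:      {tukey_individual_alpha:.4f}")
print(f"Bonferroni individual alpha: {bonferroni_individual_alpha:.4f}")
```

Both values are far below 0.05, which is where the power loss comes from. The Tukey value is somewhat less strict than the Bonferroni one.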

I don’t know if you could plug that information, along with other details, into statistical software for say a 2-sample t-test to get a valid power estimate for a single comparison or not. I’ve never looked into it. But, looking at that individual confidence level gives you an idea of how the procedure needs to effectively lower the significance level more and more for additional comparisons.

Seble says

Thanks so much for a detailed response, Jim. That definitely helped.

Seble says

Hi Jim,

In what circumstances would it be acceptable to report a post-hoc multiple comparison while the main effect of ANOVA is not significant?

Jim Frost says

Hi Seble,

It is possible to obtain the situation you describe because the F-test for the main effect and the post hoc tests use different procedures and assess statistical significance differently.

This might be a bit of a controversial area. I’m not sure. In my mind, it is often valid to report post hoc multiple comparisons that are significant even when the main effect is not. However, be sure to report the full circumstances surrounding what is significant and what is not significant, along with the post hoc details. One caution, be sure you aren’t just picking specific groups to compare because the overall main effect is not significant.

If you have a number of groups that are not very different but say a couple of groups that appear to have a large difference, it’s not valid to intentionally choose a post hoc method that compares just those groups with larger differences. That’s cherry picking your analysis to get the desired results, which gives misleading results. Choose your post hoc multiple comparison methodology at the beginning of your study and stick with it. Don’t cherry pick the methodology later.

Another caution, in my experience when this happens, it’s because the overall evidence is weak. You’re probably just barely getting significant results, which represents fairly weak evidence. Pay particular attention to the CIs of the differences between group means to get an idea of the precision. CIs are often wide with weak evidence.

I hope this helps!

Kumar C says

Hi Jim, I have a question with respect to Tukey’s HSD test. I have 3 groups, let’s say A, B, C, and I have to prove that exactly one of the groups is significant. When I run Tukey’s HSD test after ANOVA, I find that A-B, A-C and B-C are all significantly different. Now how can I arrive at the correct conclusion that only A is significant, or only B is significant, or only C is significant?

Jim Frost says

Hi Kumar,

I’m not sure that I understand your question. But, if you’re using Tukey’s test, the number of pairs of groups that are significantly different is determined by the data. You can’t tell the test to find just one group. Tukey’s will compare all possible group pairings and tell you which ones have differences that are statistically significant. That won’t necessarily just be one pair of groups.

You might be thinking of something like Hsu’s test, which I cover in this post. It takes the best group (defined by either the highest or lowest mean) and then compares that group to all other groups. Even then you might well find that the best group is significantly better than multiple other groups.

Wittawat Chantkran says

Dear Jim,

Using post hoc Tukey’s HSD, I’m trying to reduce the number of comparisons by comparing A-B-C-D, A-B-E-F, and, A-B-G-H, separately, when A-B is mutual dataset of each experiment.

However, the result of the A-B difference does not stay the same in each comparison (with exactly the same degrees of freedom). In this case, although it reduces the statistical power, should I go back to the simultaneous A-B-C-D-E-F-G-H comparison?

Many thanks

KK

Wittawat Chantkran says

Dear Jim,

Using post hoc Tukey’s HSD, I’m trying to reduce the number of comparisons by comparing A-B-C-D, A-B-E-F, A-B-G-H, separately, when A-B is overlapped in every experiment.

Question : Why does the result of A-B difference not stay the same in each time of comparison ?

In this case, although it reduces the statistical power, should I go back to simultaneous A-B-C-D-E-F-G-H comparison ?

Many thanks,

KK

Chris says

Thanks for this overview. I was wondering why, if you have a significant ANOVA and then run a post hoc test (in this case Tukey’s), there was no significant difference between any groups? Thanks

Jim Frost says

Hi Chris,

There are two primary reasons.

First, the F-test that ANOVA uses and the post hoc tests are assessing different things, which can lead to differing results. The F-test looks at all the differences between the means and determines whether they are collectively significant. In other words, is that entire set of differences statistically significant? The post hoc tests assess the difference between a specific pair of means. It’s entirely possible for the F-test to conclude that the entire set of differences was unlikely to occur if there is no effect while the post hoc tests don’t have sufficient evidence to conclude that the differences between specific pairs of means are statistically significant.

Additionally, with post hoc tests, you need to consider the fact that as the number of comparisons increases, the power of the tests decreases. I explain that in the post so I won’t retype it here. That power decrease doesn’t apply to the F-test.

Alex says

Why is revealing your address required to buy a book? Has anyone tried to buy one, and is there an actual book at the end of the process? 🙂

Jim Frost says

Hi Alex, the system requires an address to calculate taxes. I promise that I don’t do anything with your address. Nothing at all.

Many people have bought both of my ebooks. If you want to see a free sample before buying, go to My Store and choose one of the free samples. No credit cards are required.

Rebekah says

Hello, I was wondering if you could help me with a question I have. What exactly is meant by lower and higher order interactions? And are there any examples of this you can give me?

Afnan says

Thank you for the information

but I have a question: if the LSD post hoc test result is significant with a negative mean difference, is that OK, or does it mean something different?

Jim Frost says

Hi Afnan,

I recommend that you don’t use Fisher’s LSD. It does not control the family error rate, which as I show in this post, can quickly get out of hand and lead to false positives.

That said, I don’t see the negative mean difference as a problem itself. Just be sure you understand which mean is higher than the other mean. It’s just subtraction, and a negative result means the larger mean is being subtracted from the smaller mean. But, please don’t use Fisher’s LSD!

Chloe says

Very helpful!

Kami says

Thanks for this really informative post. I have a question regarding using Benjamini–Hochberg method (BH) as a post-hoc method after ANOVA.

Can we use BH as a post hoc test when we do NOT have many groups (e.g., the number of pairwise comparisons is less than 10)?

It seems that BH method for controlling FDR is developed for working with large data sets (genomic) when you have a large number of groups. But is there any limitation for using it for low number of groups?

What about using it for two-way-ANOVA?

Thanks

M N WANA says

That is great

Dr M Kaladhar says

Dear Jim Frost! really a good analysis and helps to laymen to understand without any ambiguity! Many thanks!!

Jim Frost says

You’re very welcome! I’m happy to hear that it was helpful!

Manohar Lal says

Hi sir,

Good information, post hoc tests with some information about Strip plot design with three factor

Kelly Papapavlou says

Thank you!! I went through all calculations steps again and MINITAB uses a different pathway to come up with the same result.

Kelly Papapavlou says

Thank you for the enlightening post. How did you calculate the standard error of difference for the Tukey simultaneous tests? I tried to repeat the calculations using the formula SE = sqrt(ME/n), where ME is the ANOVA table variation within the groups (15.6) and n = sample size per group (6). I get a standard error of 1.61, not 2.28.

On a different issue, what is the individual confidence level?

Thank you!!

Jim Frost says

Hi Kelly,

I received your multiple comments and contacts through the contact me page. As I note on the contact page, it takes some time for the comment approval process, particularly because I’m in a very different time zone than you! Patience please!

I’m using Minitab statistical software to calculate Tukey’s test. You can see their Methods and Formulas page for Tukey’s method to see how it is calculated in all of its details.

The individual confidence level is the confidence that you have that an individual group comparison falls within that interval. The simultaneous confidence level applies to the entire set of comparisons while the individual level applies to an individual comparison.
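For readers trying to reproduce the number by hand, one common formula for the standard error of a difference between two group means uses the pooled error mean square with both sample sizes, sqrt(MSE × (1/n1 + 1/n2)), whereas sqrt(MSE / n) is the standard error of a single mean. The sketch below assumes the figures from the comment (an error mean square of about 15.6 and 6 observations per group). Whether this matches the software’s exact pathway should be checked against its documentation.

```python
from math import sqrt


def se_single_mean(mse, n):
    """Standard error of one group mean."""
    return sqrt(mse / n)


def se_difference(mse, n_1, n_2):
    """Standard error of the difference between two group means."""
    return sqrt(mse * (1 / n_1 + 1 / n_2))


mse = 15.6  # assumed error mean square, taken from the comment
n = 6       # assumed observations per group, taken from the comment

print(f"SE of a single mean:  {se_single_mean(mse, n):.2f}")
print(f"SE of the difference: {se_difference(mse, n, n):.2f}")
```

Under those assumptions, the first value matches the 1.61 from the comment and the second matches the 2.28 in the output.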

Jeremy says

This is so superbly well-written. The Hsu’s MCB test is new to me. And thanks for de-mystifying some of the terminology (experiment-wise, family-wise, etc). Would be nice to add a discussion of Bonferroni, LSD, and the others too.

Janna Beckerman says

This is great. I don’t know why, but I never thought of the Craps analogy. Thank you! And thank you for comparing and contrasting Dunnett’s versus Tukey’s.

Jim Frost says

Thanks, Janna! And, you’re very welcome! The idea of the dice analogy just popped into my head. But, I really love linking additional comparisons to rolling the dice on a false positive. It’s a crapshoot!

Dr Eajaz Dar says

Nice information. I would suggest to discuss other post hoc tests like DMRT and LSD along with this, to gain a clear distinction between them.

Jim Frost says

Thanks for the good suggestion. I felt covering three post hoc tests in one blog post was about the maximum for a reasonably long blog post, but I might need to write another post about it!