To determine whether the difference between two means is statistically significant, analysts often compare the confidence intervals for those groups. If those intervals overlap, they conclude that the difference between groups is not statistically significant. If there is no overlap, the difference is significant.

While this visual method of assessing the overlap is easy to perform, regrettably it comes at the cost of reducing your ability to detect differences. Fortunately, there is a simple solution to this problem that lets you keep a quick visual assessment without diminishing the power of your analysis.

In this post, I’ll start by showing you the problem in action and explain why it happens. Then, we’ll proceed to an easy alternative method that avoids this problem.

## Comparing Groups Using Confidence Intervals of Each Group Estimate

For all hypothesis tests and confidence intervals, you are using sample data to make inferences about the properties of population parameters. These parameters can be population means, standard deviations, proportions, and rates. For these examples, I’ll use means, but the same principles apply to the other types of parameters.

**Related posts**: Populations, Parameters, and Samples in Inferential Statistics and Difference between Inferential and Descriptive Statistics

Determining whether confidence intervals overlap is an overly conservative approach for identifying significant differences between groups. It’s true that when confidence intervals don’t overlap, the difference between groups is statistically significant. However, when there is some overlap, the difference might still be significant.

Suppose you’re comparing the mean strength of products from two groups and graph the 95% confidence intervals for the group means, as shown below. Download the CSV dataset that I use throughout this post: DifferenceMeans.

## Jumping to Conclusions

Upon seeing how these intervals overlap, you conclude that the difference between the group means is not statistically significant. After all, if they’re overlapping, they’re not different, right? This conclusion sounds logical, but it’s not necessarily true. In fact, for these data, the 2-sample t-test results are statistically significant with a p-value of 0.044. Despite the overlapping confidence intervals, the difference between these two means is statistically significant.

This example shows how the CI overlapping method fails to reject the null hypothesis more frequently than the corresponding hypothesis test. Using this method decreases the statistical power of your assessment (higher type II error rate), potentially causing you to miss essential findings.

This apparent discrepancy between confidence intervals and hypothesis test results might surprise you. Analysts expect that confidence intervals with a confidence level of (100 – X) percent will always agree with a hypothesis test that uses a significance level of X percent. For example, analysts often pair 95% confidence intervals with tests that use a 5% significance level. That expectation is correct: confidence intervals and hypothesis tests should always agree. So, what is happening in the example above?

**Related posts**: How Hypothesis Tests Work and Two Types of Error in Hypothesis Testing

## Using the Wrong Types of Confidence Intervals

The problem occurs because we are not comparing the correct confidence intervals to the hypothesis test result. The test results apply to the difference between the means while the CIs apply to the estimate of each group’s mean—not the difference between the means. We’re comparing apples to oranges, so it’s not surprising that the results differ.
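A short derivation shows why the overlap check is the stricter of the two. This is a sketch that assumes independent groups and, for simplicity, the same critical value $t^*$ in both comparisons; the symbols $\bar{x}_i$, $SE_i$, and $t^*$ are introduced here for illustration.

$$
\begin{aligned}
\text{group CIs do not overlap} &\iff |\bar{x}_1 - \bar{x}_2| > t^*\,(SE_1 + SE_2) \\
\text{2-sample test rejects} &\iff |\bar{x}_1 - \bar{x}_2| > t^*\,\sqrt{SE_1^2 + SE_2^2}
\end{aligned}
$$

Because $\sqrt{SE_1^2 + SE_2^2} \le SE_1 + SE_2$ for positive standard errors, the intervals stop overlapping only at a larger difference than the test requires. With equal standard errors, the overlap check demands a difference about $2/\sqrt{2} \approx 1.41$ times larger.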

To obtain consistent results, we must use confidence intervals for differences between group means—we’ll get to those CIs shortly.

However, if you’re determined to use CIs of each group to make this determination, there are several possible methods.

Goldstein and Healy (1995) find that for barely non-overlapping intervals to represent a difference between two means that is significant at the 5% level, you should use an 83% confidence interval of the mean for each group. The graph below uses this confidence level for the same dataset as above, and the intervals don’t overlap.
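The 83% figure can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' own derivation: it assumes equal standard errors in both groups and a large-sample normal approximation, and the function name `per_group_level` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def per_group_level(alpha: float = 0.05) -> float:
    """Confidence level for each group's CI so that intervals that just
    touch match a two-sided z-test at `alpha`.

    Assumes equal standard errors (SE) and a normal approximation.
    """
    z = NormalDist()
    # Critical value of the test on the difference (about 1.96 for alpha = 0.05).
    z_test = z.inv_cdf(1 - alpha / 2)
    # Intervals just touch when the difference equals 2 * z_group * SE.
    # The test is borderline when the difference equals z_test * sqrt(2) * SE.
    # Setting the two equal gives z_group = z_test / sqrt(2).
    z_group = z_test / sqrt(2)
    return 2 * z.cdf(z_group) - 1


level = per_group_level(0.05)
print(f"Per-group confidence level: {level:.1%}")
```

Under these assumptions the result lands at roughly 83%, in line with the level quoted above. Unequal standard errors or small samples would shift the value somewhat.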

Cumming & Finch (2005) find that the degree of overlap for two 95% confidence intervals for independent means allows you to estimate the p-value for a 2-sample t-test when sample sizes are greater than 10. When the confidence limit of each CI reaches approximately the midpoint between the point estimate and the limit of the other CI, the p-value is near 0.05. The first graph in this post, with the 95% CIs, approximates this condition, and the p-value is near 0.05. Smaller amounts of overlap correspond to lower p-values. For example, two 95% CIs where the end of one CI just reaches the end of the other CI correspond to a p-value of about 0.01.
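To see how overlap maps to a p-value, here is a rough sketch under idealized conditions: equal standard errors and a normal approximation rather than a t-distribution. The function `p_from_overlap` and its `overlap_fraction` parameter (overlap expressed as a fraction of one margin of error) are hypothetical names used only for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def p_from_overlap(overlap_fraction: float, level: float = 0.95) -> float:
    """Two-sided p-value implied by the overlap between two group CIs.

    `overlap_fraction` is the overlap divided by one margin of error:
    0.0 means the intervals just touch, 0.5 means each limit reaches the
    midpoint between the other mean and its limit.
    Assumes equal standard errors and a normal approximation.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(0.5 + level / 2)
    # Distance between the two means, in units of one group's SE:
    # two margins of error minus the overlapping portion.
    gap_in_se = (2 - overlap_fraction) * z_crit
    # The SE of the difference is sqrt(2) times one group's SE.
    z_stat = gap_in_se / sqrt(2)
    return 2 * (1 - z.cdf(z_stat))


p_half_overlap = p_from_overlap(0.5)
p_just_touching = p_from_overlap(0.0)
print(f"half overlap:  p = {p_half_overlap:.3f}")
print(f"just touching: p = {p_just_touching:.3f}")
```

In this idealized setting both values come out somewhat below 0.05 and 0.01, which fits reading the rule of thumb as a rough, slightly conservative guide rather than an exact conversion.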

To me, these approaches seem kludgy. Using a confidence interval of the difference is an easier solution that even provides additional useful information.

## Assessing Confidence Intervals of the Differences between Groups

Previously, we saw how the apparent disagreement between the group CIs and the 2-sample test results occurs because we used the wrong confidence intervals. Instead, we need a CI for the difference between group means. This type of CI will always agree with the 2-sample t-test—just be sure to use the equivalent combination of confidence level and significance level (e.g., 95% and 5%). We’re now comparing apples to apples!

Using the same dataset as above, the confidence interval below presents a range of values that likely contains the difference between the means for the entire population. The interpretation continues to be a simple visual assessment. Zero represents no difference between the means. Does the interval contain zero? If it does not include zero, the difference is statistically significant because the range excludes no difference. At a glance, we can tell that the difference is statistically significant.
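The check described above can be sketched in code. The example below does not use the post's dataset: the means and standard errors are hypothetical numbers chosen so that the group CIs overlap, and it uses a large-sample normal critical value where a 2-sample t-test would use a t critical value.

```python
from math import sqrt
from statistics import NormalDist


def mean_ci(estimate: float, se: float, level: float = 0.95) -> tuple[float, float]:
    """Normal-approximation CI for an estimate with standard error `se`."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (estimate - z * se, estimate + z * se)


def diff_ci(mean_a: float, se_a: float, mean_b: float, se_b: float,
            level: float = 0.95) -> tuple[float, float]:
    """CI for the difference (B minus A) between two independent means."""
    se_diff = sqrt(se_a**2 + se_b**2)
    return mean_ci(mean_b - mean_a, se_diff, level)


# Hypothetical summary statistics, not taken from the DifferenceMeans data.
ci_a = mean_ci(10.0, 0.75)
ci_b = mean_ci(12.5, 0.75)
ci_diff = diff_ci(10.0, 0.75, 12.5, 0.75)

# Two intervals overlap when the larger lower limit is below the smaller upper limit.
groups_overlap = max(ci_a[0], ci_b[0]) <= min(ci_a[1], ci_b[1])
# The difference is significant when its CI lies entirely on one side of zero.
diff_excludes_zero = ci_diff[0] > 0 or ci_diff[1] < 0

print(f"Group CIs overlap: {groups_overlap}")
print(f"Difference CI excludes zero: {diff_excludes_zero}")
```

With these made-up numbers the two group intervals overlap while the interval for the difference still excludes zero, which is the same pattern as in the example dataset.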

This graph corresponds with the 2-sample t-test results below. Both test the difference between the two means. This output also provides a numerical representation of the CI of the difference [0.06, 4.23].

In addition to providing a simple visual assessment, the confidence interval of the difference presents crucial information that neither the group CIs nor the p-value provides. It answers the question, based on our sample, how large is the difference between the two populations likely to be? Like any estimate, there is a margin of error around the point estimate of the difference. It’s important to factor in this margin of error before acting on findings.

For our example, the point estimate of the mean difference is 2.15, and we can be 95% confident that the population difference falls within the range of 0.06 to 4.23.

**Related posts**: How T-tests Work and How Confidence Intervals Work

## Interpreting Confidence Intervals of the Mean Difference

As with all CIs, the width of the interval for the mean difference reveals the precision of the estimate. Narrower intervals suggest a more precise estimate. And, you can assess whether the full range of values is practically significant. Remember, statistical significance doesn’t necessarily indicate that the results are meaningful in the real world. For more information about this issue, see Practical vs. Statistical Significance.

When the interval is too wide (imprecise) to be helpful and/or the range includes differences that are not practically significant, you have reason to hesitate before making decisions based on the results. These types of CI results indicate that you might not obtain meaningful benefits even though the difference is statistically significant.

There’s no statistical method for answering questions about how precise an estimate must be or how large an effect must be to be practically useful. To use the confidence interval of the difference to answer these questions, you’ll need to apply your subject-area knowledge.

For the example in this post, it’s important to note that the low end of the CI is very close to zero. It will not be surprising if the actual population difference falls close to zero, which might not be practically significant despite the statistically significant result. If you are considering switching to Group B for a stronger product, the mean improvement might be too small to be meaningful.
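One way to make this assessment explicit is to compare the CI limits against the smallest difference you would consider meaningful. The sketch below uses the CI from this post, but the threshold of 1.0 is a hypothetical value; in practice it has to come from subject-area knowledge. The function name and labels are illustrative only.

```python
def assess_difference_ci(lower: float, upper: float, min_meaningful: float) -> str:
    """Classify a CI for a difference against zero and against the
    smallest difference considered practically meaningful."""
    if lower <= 0 <= upper:
        return "not statistically significant"
    # Smallest and largest plausible sizes of the difference, ignoring sign.
    smallest = min(abs(lower), abs(upper))
    largest = max(abs(lower), abs(upper))
    if smallest >= min_meaningful:
        return "significant and meaningful across the whole interval"
    if largest < min_meaningful:
        return "significant but too small to matter"
    return "significant, but the interval includes differences too small to matter"


# CI of the difference from this post; the 1.0 threshold is an assumption.
verdict = assess_difference_ci(0.06, 4.23, min_meaningful=1.0)
print(verdict)
```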

When you’re comparing groups, assess confidence intervals of those differences rather than comparing confidence intervals for each group. This method is simple, and it even provides you with additional valuable information.

### References

Goldstein, Harvey; Healy, Michael J. R. The Graphical Presentation of a Collection of Means, Journal of the Royal Statistical Society, Vol. 158, No. 1 (1995), p. 175-177.

Cumming, Geoff; Finch, Sue. Inference by Eye: Confidence Intervals and How to Read Pictures of Data, American Psychologist, Vol 60(2), Feb-Mar 2005, p. 170-180.

Kaitlin Snider says

Hi Jim!

I read through your post on ANOVA and post hoc tests. I think I’m asking about doing something different than an ANOVA. I’ve detailed an example below – hopefully this question is more clear! Below, I start with some example data, show what happens if I put it through a two-way ANOVA test, discuss why I think the ANOVA isn’t effective for my question, and then show what I am thinking of doing instead (a sort of “difference of differences” test).

Let’s say I have this data (not my real data, set up for the sake of demonstrating the problem):

Group A: WT Control mean 5.67, SD 6.79, N 10

Group B: WT Treatment mean 37.88, SD 20.63, N 10

Group C: Mutant Control mean 13.01, SD 7.07, N 10

Group D: Mutant Treatment mean 29.72, SD 16.51, N 12

For this test, I don't think it makes sense to use just an ANOVA with post hoc tests. I am not interested in simply whether or not Treatment is different from Control *within* genotype (A-B or C-D). Rather, I am interested in whether the SIZE of difference for the mutant is *different* from the size of difference for the WT. This could mean that WT has a large effect of treatment, while mutant has no effect of treatment; the other way around (WT has no effect of treatment, while mutant has a large effect of treatment); or, the more finicky possibility, WT has a large effect of treatment and mutant has a smaller effect of treatment (true in the sample data).

Does this seem reasonable to you? Would you do anything differently for calculating the SD and N for the differences? Are you aware of any published examples that have used this sort of statistics?

Thank you again for your time – your blog is a fantastic resource!

Jim Frost says

Hi Kaitlin,

Alright, I think I get what you’re asking now! 🙂

I edited your comment because it was kind of long. But, the gist of what you’re asking is, how can you statistically test whether the relationship between treatment condition and the outcome (i.e., effect size) changes by genotype? In other words, does the effect size depend on genotype? To do that, you need to include an interaction effect in your model. If you include an interaction between genotype and treatment condition, and that interaction is statistically significant, you can conclude the size of the treatment effect varies depending on genotype.
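In a 2×2 design, the interaction contrast is the difference between the two treatment effects, so it can be illustrated directly from the summary statistics in the comment above. This is only a rough sketch: it uses a large-sample normal critical value and does not pool the variances, whereas a fitted model with an interaction term would pool the error variance and use a t-distribution, so its interval would differ, especially with about 10 observations per group. The function name is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

# (mean, SD, n) for each group, as given in the comment above.
groups = {
    "wt_control": (5.67, 6.79, 10),
    "wt_treatment": (37.88, 20.63, 10),
    "mutant_control": (13.01, 7.07, 10),
    "mutant_treatment": (29.72, 16.51, 12),
}


def difference_of_differences(g: dict, level: float = 0.95):
    """Estimate and approximate CI for (WT effect) minus (mutant effect)."""
    wt_effect = g["wt_treatment"][0] - g["wt_control"][0]
    mutant_effect = g["mutant_treatment"][0] - g["mutant_control"][0]
    estimate = wt_effect - mutant_effect
    # With four independent groups, the variances of the four means add.
    se = sqrt(sum(sd**2 / n for _, sd, n in g.values()))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate, (estimate - z * se, estimate + z * se)


estimate, (lower, upper) = difference_of_differences(groups)
print(f"difference of differences: {estimate:.2f}")
print(f"approximate 95% CI: [{lower:.2f}, {upper:.2f}]")
```

For these illustrative numbers, the approximate interval includes zero, so under this rough calculation the sample would not show that the treatment effect differs by genotype.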

For more information, read my post about understanding interaction effects. If you have any more questions, please don’t hesitate to ask!

I hope that helps!

Kaitlin Snider says

Hey Jim 🙂 Thanks for the great post! I’m trying to see if I can use this for a “differences of differences” test. In the simplest form, I have four groups: two genotypes, and two conditions (“control” and “experiment”). I’m not interested in the raw values; rather, I want to see if the genotype alters the DIFFERENCE caused by experiment (e.g., does WT have a greater effect of experiment than mutant?) Two-way ANOVA kind of works but doesn’t directly ask the question I am asking, so it ends up a bit underpowered. The measurements used are one-and-done, so all four groups are independent – no repeated measures. So, is there a way to get a confidence interval of the mean difference of two mean differences? I can compute the means, but I’m not sure how to handle the standard error or the n. Thanks in advance!!

Jim Frost says

Hi Kaitlin,

Yes, you can do what you describe in conjunction with ANOVA. You’ll need to use post hoc analyses after performing ANOVA. I write all about this in my article about post hoc tests. It’ll work for the scenario you describe. Read that post and if you have further questions, don’t hesitate to ask! But, I think that’ll help you move forward! In a nutshell, these procedures calculate the CIs for differences between multiple group means and, importantly, control the family error rate for the multiple comparisons.

Frank Corotto says

I don’t know who to believe, Geoff Cumming or other sources. Cumming says his rules fall apart when there are multiple comparisons. He makes no mention of other approaches. My copy of Zar tells me what to do. Use a Tukey test backwards to calculate confidence intervals around each difference between two means. See if zero is excluded. Cumming also says confidence intervals can’t be used when there are repeated measures. What about Loftus and Masson (1994), who said use the within-subjects error term, what Zar calls remainder MS, and what about papers that followed up on Loftus and Masson? I’d appreciate any of your insights. Thanks for your time.

Jim Frost says

Hi Frank,

Those references you mention don’t necessarily contradict each other. You should use confidence intervals differently in different situations. I’m not 100% familiar with all the references but here’s how I think they all generally agree. Note as we go how the context changes, which necessitates changes to the CIs.

If you have random samples drawn from several populations and you want to better understand each population’s mean, just use a regular CI.

If you have two populations and you want to understand the difference between their means, use a CI for the difference between means. That’s what I write about in this post.

If you have more than two populations and you need to compare multiple means to each other, you’ll need to use CIs for the differences between means that are generated by post-hoc analyses (a.k.a., multiple comparison methods) to control the family error rate. For example, you might use CIs from Tukey’s test. I write about this in my blog post about using post-hoc tests with ANOVA.

If you’re using repeated measures, that again changes the context. As I understand it, the hypothesis tests for these designs use the within-subjects variance to gain greater power, whereas the CIs were typically calculated using both the within-subjects and between-subjects variances. However, there are different recognized methods for removing the between-subjects variance. I’m not completely familiar with the specifics of this, but I believe your Loftus and Masson citation is one method for doing just that.

So, there’s no contradiction that I can see and no reason to disbelieve any of them. In different contexts, you’ll need to use CIs differently. That’s the standard practice in statistics. Understand the context and use the analyses appropriately!

LUC says

Thanks Jim, great !

One question: how do you plot the CI for the difference between group means with minitab ?

Lawrence Adu-Gyamfi says

Thanks for this Jim, it clarifies a lot for me!

Ricardo Garza-Mendiola says

Great, thanks

Khursheed says

Hllo sir …

I hve no words so that I can say uh Thnks ….ths is the best thing that we can b able to clear our concepts @statistics ..

May God bless u.

Jim Frost says

Hi Khursheed,

Thank you so much for your kind words. They mean a lot to me! I’m glad my website has been helpful!