To determine whether the difference between two means is statistically significant, analysts often compare the confidence intervals for those groups. If those intervals overlap, they conclude that the difference between groups is not statistically significant. If there is no overlap, the difference is significant.

While this visual method of assessing the overlap is easy to perform, it reduces your ability to detect differences. Fortunately, there is a simple alternative that keeps the visual assessment without diminishing the power of your analysis.

In this post, I’ll start by showing you the problem in action and explain why it happens. Then, we’ll proceed to an easy alternative method that avoids this problem.

## Comparing Groups Using Confidence Intervals of Each Group Estimate

For all hypothesis tests and confidence intervals, you are using sample data to make inferences about the properties of population parameters. These parameters can be population means, standard deviations, proportions, and rates. For these examples, I’ll use means, but the same principles apply to the other types of parameters.

**Related posts**: Populations, Parameters, and Samples in Inferential Statistics and Difference between Inferential and Descriptive Statistics

Determining whether confidence intervals overlap is an overly conservative approach for identifying significant differences between groups. It’s true that when confidence intervals don’t overlap, the difference between groups is statistically significant. However, when there is some overlap, the difference might still be significant.

Suppose you’re comparing the mean strength of products from two groups and graph the 95% confidence intervals for the group means, as shown below. Download the CSV dataset that I use throughout this post: DifferenceMeans.

## Jumping to Conclusions

Upon seeing how these intervals overlap, you conclude that the difference between the group means is not statistically significant. After all, if they’re overlapping, they’re not different, right? This conclusion sounds logical, but it’s not necessarily true. In fact, for these data, the 2-sample t-test results are statistically significant with a p-value of 0.044. Despite the overlapping confidence intervals, the difference between these two means is statistically significant.
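To see how this can happen, here is a minimal Python sketch. The summary statistics are hypothetical (they are not the DifferenceMeans data), and it uses a large-sample normal approximation in place of the t-distribution so that it runs with only the standard library. It builds a 95% CI for each group, checks whether they overlap, and then tests the difference directly.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical summary statistics (NOT the DifferenceMeans dataset).
mean_a, sd_a, n_a = 10.0, 5.0, 50
mean_b, sd_b, n_b = 12.1, 5.0, 50

std_normal = NormalDist()
z_crit = std_normal.inv_cdf(0.975)  # about 1.96 for a 95% interval


def group_ci(mean, sd, n):
    """95% CI for a single group's mean (normal approximation)."""
    margin = z_crit * sd / sqrt(n)
    return mean - margin, mean + margin


ci_a = group_ci(mean_a, sd_a, n_a)
ci_b = group_ci(mean_b, sd_b, n_b)
intervals_overlap = ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Two-sample test on the difference, using the unpooled standard error.
se_diff = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
z_stat = (mean_b - mean_a) / se_diff
p_value = 2 * (1 - std_normal.cdf(abs(z_stat)))

print(f"Group A 95% CI: {ci_a[0]:.2f} to {ci_a[1]:.2f}")
print(f"Group B 95% CI: {ci_b[0]:.2f} to {ci_b[1]:.2f}")
print(f"Intervals overlap: {intervals_overlap}")
print(f"p-value for the difference: {p_value:.3f}")
```

With these made-up numbers, the two group intervals overlap, yet the test on the difference gives a p-value below 0.05.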

This example shows how the CI overlapping method fails to reject the null hypothesis more frequently than the corresponding hypothesis test. Using this method decreases the statistical power of your assessment (higher type II error rate), potentially causing you to miss essential findings.
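To put a rough number on the lost power, the sketch below compares the two decision rules under simplifying assumptions: both groups have the same standard error, the normal approximation holds, and the true difference is 2.8 standard errors of the difference (a value picked for illustration so that the ordinary test has roughly 80% power). The function name `power` and the chosen effect size are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()
z_crit = std_normal.inv_cdf(0.975)


def power(true_diff_in_se, threshold):
    """P(|Z| > threshold) when Z ~ Normal(true_diff_in_se, 1).

    Z is the observed difference divided by the standard error
    of the difference.
    """
    upper_tail = 1 - std_normal.cdf(threshold - true_diff_in_se)
    lower_tail = std_normal.cdf(-threshold - true_diff_in_se)
    return upper_tail + lower_tail


# The test on the difference rejects when |Z| > z_crit.
# With equal group SEs, two 95% CIs fail to overlap only when
# |diff| > 2 * z_crit * SE, which is |Z| > sqrt(2) * z_crit.
true_diff = 2.8  # assumed true difference, in SE-of-difference units
power_test = power(true_diff, z_crit)
power_overlap = power(true_diff, sqrt(2) * z_crit)

print(f"Power of the test on the difference: {power_test:.2f}")
print(f"Power of the CI-overlap rule:        {power_overlap:.2f}")
```

Under these assumptions, the test detects the difference about 80% of the time, while the overlap rule detects it only about half the time.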

This apparent discrepancy between confidence intervals and hypothesis test results might surprise you. Analysts expect that confidence intervals with a confidence level of (100 – X)% will always agree with a hypothesis test that uses a significance level of X%. For example, analysts often pair 95% confidence intervals with tests that use a 5% significance level. That expectation is correct: confidence intervals and hypothesis tests should always agree. So, what is happening in the example above?

**Related posts**: How Hypothesis Tests Work and Two Types of Error in Hypothesis Testing

## Using the Wrong Types of Confidence Intervals

The problem occurs because we are not comparing the correct confidence intervals to the hypothesis test result. The test results apply to the difference between the means while the CIs apply to the estimate of each group’s mean—not the difference between the means. We’re comparing apples to oranges, so it’s not surprising that the results differ.
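The mismatch can be written down directly. As a sketch, assume the two groups are independent, write $SE_1$ and $SE_2$ for the standard errors of the two group means, and treat the critical value $t^{*}$ as the same in both comparisons (an approximation, because the degrees of freedom differ slightly):

$$
\text{group CIs do not overlap} \iff |\bar{x}_1 - \bar{x}_2| > t^{*}\,(SE_1 + SE_2)
$$

$$
\text{test on the difference is significant} \iff |\bar{x}_1 - \bar{x}_2| > t^{*}\sqrt{SE_1^2 + SE_2^2}
$$

Because $\sqrt{SE_1^2 + SE_2^2} \le SE_1 + SE_2$, the overlap rule demands a larger difference than the test does. When the two standard errors are equal, it demands a difference about $\sqrt{2} \approx 1.41$ times larger.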

To obtain consistent results, we must use confidence intervals for differences between group means—we’ll get to those CIs shortly.

However, if you’re determined to use CIs of each group to make this determination, there are several possible methods.

Goldstein and Healy (1995) find that for barely non-overlapping intervals to correspond to a difference between two means that is significant at the 5% level, you should use an 83% confidence interval of the mean for each group. The graph below uses this confidence level for the same dataset as above, and the intervals don’t overlap.
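One way to see where a figure near 83% comes from is the following sketch. It is an illustration under a normal approximation, not necessarily the derivation Goldstein and Healy use, and the function name `matching_group_level` and its `se_ratio` parameter are hypothetical. It finds the group-level confidence level at which "just touching" intervals line up with a two-sided test at the 5% level.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()


def matching_group_level(alpha=0.05, se_ratio=1.0):
    """Confidence level for each group's CI at which barely touching
    intervals match a two-sided test at significance level alpha.

    se_ratio is SE_2 / SE_1 (1.0 means equal standard errors).
    """
    z_test = std_normal.inv_cdf(1 - alpha / 2)
    # Touching intervals:  diff = z_group * (SE_1 + SE_2)
    # Test boundary:       diff = z_test * sqrt(SE_1^2 + SE_2^2)
    z_group = z_test * sqrt(1 + se_ratio**2) / (1 + se_ratio)
    return 2 * std_normal.cdf(z_group) - 1


level_equal = matching_group_level()
level_unequal = matching_group_level(se_ratio=2.0)

print(f"Equal standard errors:   {level_equal:.1%}")
print(f"One SE twice the other:  {level_unequal:.1%}")
```

With equal standard errors the result is a little over 83%. The matching level rises as the standard errors become more unequal, so treat 83% as a convenient rule of thumb rather than an exact equivalence.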

Cumming & Finch (2005) find that the degree of overlap between two 95% confidence intervals for independent means allows you to estimate the p-value for a 2-sample t-test when sample sizes are greater than 10. When the confidence limit of each CI reaches approximately the midpoint between the point estimate and the limit of the other CI, the p-value is near 0.05. The first graph in this post, with the 95% CIs, approximates this condition, and the p-value is near 0.05. Lower amounts of overlap correspond to lower p-values. For example, 95% CIs where the end of one CI just reaches the end of the other CI correspond to a p-value of about 0.01.
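The link between overlap and p-value can be sketched numerically. The code below assumes equal standard errors in both groups and a normal approximation, so its results land in the same neighborhood as the rule-of-thumb values rather than matching them exactly. The function name `p_from_overlap` is hypothetical, and overlap is expressed as a fraction of one interval's margin of error.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()


def p_from_overlap(overlap_fraction, level=0.95):
    """Approximate two-sided p-value when two equal-width CIs overlap
    by overlap_fraction of one margin of error (0 means just touching).
    """
    z_ci = std_normal.inv_cdf(0.5 + level / 2)
    # Distance between the two means, in units of one group's SE.
    diff_in_se = (2 - overlap_fraction) * z_ci
    # The SE of the difference is sqrt(2) times one group's SE.
    z_stat = diff_in_se / sqrt(2)
    return 2 * (1 - std_normal.cdf(z_stat))


p_half_overlap = p_from_overlap(0.5)  # each limit reaches the other's midpoint
p_touching = p_from_overlap(0.0)      # ends of the intervals just meet

print(f"Half-margin overlap: p is roughly {p_half_overlap:.3f}")
print(f"Just touching:       p is roughly {p_touching:.3f}")
```

Under these assumptions, the half-margin overlap gives a p-value a little below 0.05, and touching intervals give a p-value below 0.01.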

To me, these approaches seem kludgy. Using a confidence interval of the difference is an easier solution that even provides additional useful information.

## Assessing Confidence Intervals of the Differences between Groups

Previously, we saw how the apparent disagreement between the group CIs and the 2-sample test results occurs because we used the wrong confidence intervals. Instead, we need a CI for the difference between group means. This type of CI will always agree with the 2-sample t-test—just be sure to use the equivalent combination of confidence level and significance level (e.g., 95% and 5%). We’re now comparing apples to apples!

Using the same dataset as above, the confidence interval below presents a range of values that likely contains the difference between the means for the entire population. The interpretation continues to be a simple visual assessment. Zero represents no difference between the means. Does the interval contain zero? If it does not include zero, the difference is statistically significant because the range excludes no difference. At a glance, we can tell that the difference is statistically significant.

This graph corresponds with the 2-sample t-test results below. Both test the difference between the two means. This output also provides a numerical representation of the CI of the difference [0.06, 4.23].

In addition to providing a simple visual assessment, the confidence interval of the difference presents crucial information that neither the group CIs nor the p-value provides. It answers the question, based on our sample, how large is the difference between the two populations likely to be? Like any estimate, there is a margin of error around the point estimate of the difference. It’s important to factor in this margin of error before acting on findings.

For our example, the point estimate of the mean difference is 2.15, and we can be 95% confident that the population difference falls within the range of 0.06 to 4.23.
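If you want to compute this kind of interval yourself, here is a minimal sketch. It uses hypothetical summary statistics rather than the dataset from this post, an unpooled standard error, and a normal approximation in place of the t-distribution, so it will not reproduce the output above. The function name `diff_ci` is an assumption for illustration.

```python
from math import sqrt
from statistics import NormalDist


def diff_ci(mean_a, sd_a, n_a, mean_b, sd_b, n_b, level=0.95):
    """Point estimate and CI for (mean_b - mean_a), normal approximation."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    diff = mean_b - mean_a
    se_diff = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    return diff, diff - z_crit * se_diff, diff + z_crit * se_diff


# Hypothetical inputs: group means 10.0 and 12.1, SD 5.0, n = 50 each.
diff, lower, upper = diff_ci(10.0, 5.0, 50, 12.1, 5.0, 50)

# The difference is significant when the interval excludes zero.
significant = not (lower <= 0 <= upper)

print(f"Estimated difference: {diff:.2f}")
print(f"95% CI of the difference: {lower:.2f} to {upper:.2f}")
print(f"Excludes zero: {significant}")
```

For small samples, swap the normal critical value for the appropriate t critical value from your statistical software.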

**Related posts**: How T-tests Work and How Confidence Intervals Work

## Interpreting Confidence Intervals of the Mean Difference

As with all CIs, the width of the interval for the mean difference reveals the precision of the estimate. Narrower intervals suggest a more precise estimate. And, you can assess whether the full range of values is practically significant. Remember, statistical significance doesn’t necessarily indicate that the results are meaningful in the real world. For more information about this issue, see Practical vs. Statistical Significance.

When the interval is too wide (imprecise) to be helpful and/or the range includes differences that are not practically significant, you have reason to hesitate before making decisions based on the results. These types of CI results indicate that you might not obtain meaningful benefits even though the difference is statistically significant.

There’s no statistical method for answering questions about how precise an estimate must be or how large an effect must be to be practically useful. To use the confidence interval of the difference to answer these questions, you’ll need to apply your subject-area knowledge.

For the example in this post, it’s important to note that the low end of the CI is very close to zero. It will not be surprising if the actual population difference falls close to zero, which might not be practically significant despite the statistically significant result. If you are considering switching to Group B for a stronger product, the mean improvement might be too small to be meaningful.

When you’re comparing groups, assess confidence intervals of those differences rather than comparing confidence intervals for each group. This method is simple, and it even provides you with additional valuable information.

### References

Goldstein, Harvey; Healy, Michael J. R. The Graphical Presentation of a Collection of Means, Journal of the Royal Statistical Society, Vol. 158, No. 1, 1995, p. 175-177.

Cumming, Geoff; Finch, Sue. Inference by Eye: Confidence Intervals and How to Read Pictures of Data, American Psychologist, Vol 60(2), Feb-Mar 2005, p. 170-180.

Paul says

I have another question, referring back to the sample mean hypothesis testing versus CIs. For symmetric distributions, I think their results would always agree. How about for non-symmetric ones, such as Weibull or Beta?

Once again, Jim, thank you so much for replying to my questions.

Jim Frost says

Hi Paul,

You’re very welcome!

There are two issues with what you’re asking about. First, will they agree? Yes, because they’re using the same underlying calculations. The reasons I discuss in the post about why they always agree still apply for these other distributions.

However, because you’re talking about sample means, the central limit theorem comes into play. So, the second question becomes: OK, they agree, but are they valid given the nonnormal distributions? With small sample sizes, the answer is no, neither is valid. However, the answer becomes “yes” for both as sample size increases.

Paul says

Thanks a lot Jim.

I will read further on this topic. It seems that papers are still being published in recent years asking people not to use the default normal SE formulas for sample skewness and kurtosis with unknown distributions.

Jim Frost says

Hi Paul, yes, as I indicated, these CIs assume the data follow a normal distribution and are sensitive to departures from this assumption. Consequently, don’t use them for other distributions. At least for parametric methods. I’m not sure about bootstrapping. However, performing a normality test will give you some of the same information.

Paul says

Hi Jim,

Thank you. I have a question which is related but not exactly comparing means.

Instead of comparing means from two samples, suppose I have just one set of sample data and I want to know if the mean is significantly non-zero. A traditional approach would be a one-sample t-test.

Can I instead use the confidence interval of the sample mean to infer whether the true mean is significantly non-zero?

Intuitively (or graphically), to me it is a yes. However, under the hood they are slightly different.

With the t-test, I first assume the null hypothesis that the mean is zero. Then, based on the distribution implied by that hypothesis (i.e., using the sample SD and a t-distribution with n-1 degrees of freedom), if my sample mean is outside the 95% range, the null hypothesis is rejected, which indicates a non-zero mean.

However, if I am using my “confidence interval method,” conceptually there is no hypothesis at all. It is just, “okay, my estimate roughly spans this region, but zero is so far away, so the true mean is very unlikely to be zero,” and I declare it non-zero.

===========

The reason I am asking this is actually that I have a much deeper question. I have a data set, and I am tasked to answer the question whether the skewness and excess kurtosis are significantly non zero.

Of course I can use some standard analytical formulas for the standard error but actually they may not be very correct for empirical distributions.

So I am thinking of using bootstrapping method to resample the data, recompute the skewness and excess kurtosis, and use the 2.5% and 97.5% percentile of the recomputed numbers to determine their confidence interval. If zero is outside those confidence intervals, I declare the skewness / excess kurtosis being non zero.

Once again my intuition told me it is right to do this, but I am not 100% sure it is a valid approach.

Jim Frost says

Hi Paul,

In terms of using CIs or a hypothesis, that’s a definite yes!

You can definitely use a confidence interval for hypothesis testing purposes. They’re variants of the same underlying methodology. In fact, the results from a hypothesis test with a significance level of 0.05 will always match the corresponding CI with a 95% confidence level. To learn more about this, read my post about how confidence intervals work. A section near the end explains why the two methods always yield consistent results.

However, in terms of CIs for skewness and kurtosis, there’s a complication. You can calculate CIs for these measures but they’re only valid for normally distributed data–at least using parametric methods. I don’t know if the same limitation applies to bootstrapping. Perhaps not.

Fortunately, I think there’s an easy solution for you. If you’re referring to “excess” kurtosis, then it seems like you essentially want to determine whether your data are normally distributed. For normally distributed data, excess kurtosis equals zero and skew equals zero. If that’s the case, just use a normality test! If the results for your normality test are statistically significant, your data are nonnormal, so excess kurtosis and skew can’t both equal zero.

I hope that helps!
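For reference, the percentile bootstrap that Paul describes can be sketched as follows. This only shows the mechanics on simulated, right-skewed data; it does not settle whether the approach is valid for a particular dataset. The function names, the simulated sample, the number of resamples, and the seeds are all assumptions made for illustration.

```python
import random


def skewness(values):
    """Sample skewness: third central moment over (variance ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2**1.5


def bootstrap_percentile_ci(values, statistic, n_resamples=1000,
                            level=0.95, seed=0):
    """Percentile bootstrap CI: resample with replacement, recompute the
    statistic, and read off the lower and upper percentiles."""
    rng = random.Random(seed)
    n = len(values)
    stats = sorted(
        statistic(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower_index = int(tail * n_resamples)
    upper_index = int((1 - tail) * n_resamples) - 1
    return stats[lower_index], stats[upper_index]


# Simulated right-skewed data (exponential), standing in for real data.
data_rng = random.Random(42)
sample = [data_rng.expovariate(1.0) for _ in range(200)]

lower, upper = bootstrap_percentile_ci(sample, skewness)
print(f"Sample skewness: {skewness(sample):.2f}")
print(f"95% percentile bootstrap CI: {lower:.2f} to {upper:.2f}")
print(f"Zero outside the interval: {not (lower <= 0 <= upper)}")
```

The same function works for excess kurtosis if you pass a different statistic.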

Narayan says

Dear sir,

Thank you for the article it is very informative. I had a doubt regarding which test to conduct for my research which involves cross border mergers and acquisitions. I have a sample of 3000 transactions conducted between a specific time period. These transactions were conducted by developed market acquirers (eg firms in countries like Germany, USA, UK, etc) and the targets included both emerging market targets (Firms in countries like India, China, etc) and developed market targets (firms in countries like Germany, USA, UK, etc). I want to test the statistical significance of the difference in the means of the returns between two samples (Developed market acquirers-Developed market targets) and (Developed market acquirers-Emerging market targets) on a yearly basis. The returns for each transaction are in percentage format. While researching on the appropriate test to be conducted in stata, I came across two tests which I think would be relevant for this specific study (Paired t-test and Independent t-test). I have a doubt regarding which test would be the most appropriate. The countries in my sample are the same but the acquirer firms and target firms in each country are different.

Thanks and regards,

Narayan

Peter says

Hi Sir,

Thank you for clear explanation! It definitely helps me to understand statistics. However, when applying it to other research papers, I am having some difficulties to interpret the C.I. differences between groups.

Here an example;

| | Comparison 1 | Comparison 2 |
|---|---|---|
| Group 1 | No Choice & Public RPI | No Choice & Private RPI |
| Group 2 | No Choice & Private RPI | No Choice & No RPI |
| Effect size | 1.0844 | 1.0229 |
| 95% C.I. | 0.3179 and 1.8509 | 0.2618 and 1.7839 |

Disregarding the variables (no/yes choice, public/private/no RPI), I fail to understand what the difference in confidence interval between (0.3179 and 1.8509) and (0.2618 and 1.7839) actually tells me. Can I form any statistical conclusion between these two? Or do I need to use additional information?

Once again, thank you for your help!

Zahra says

I would like to plot CIs for data with, let’s say, 30 groups of 6 samples each, and I would also like to include the process upper and lower control limits to show whether the process has been stable.

Do you think there is a way to do in Minitab?

thanks in advance

Kaitlin Snider says

Hi Jim!

I read through your post on ANOVA and post hoc tests. I think I’m asking about doing something different than an ANOVA. I’ve detailed an example below, and hopefully this question is clearer! Below, I start with some example data, show what happens if I put it through a two-way ANOVA test, discuss why I think the ANOVA isn’t effective for my question, and then show what I am thinking of doing instead (a sort of “difference of differences” test).

Let’s say I have this data (not my real data, set up for the sake of demonstrating the problem):

Group A: WT Control mean 5.67, SD 6.79, N 10

Group B: WT Treatment mean 37.88, SD 20.63, N 10

Group C: Mutant Control mean 13.01, SD 7.07, N 10

Group D: Mutant Treatment mean 29.72, SD 16.51, N 12

For this test, I don't think it makes sense to use just an ANOVA with post hoc tests. I am not interested in simply whether or not Treatment is different from Control *within* genotype (A-B or C-D). Rather, I am interested in whether the SIZE of difference for the mutant is *different* from the size of difference for the WT. This could mean that WT has a large effect of treatment, while mutant has no effect of treatment; the other way around (WT has no effect of treatment, while mutant has a large effect of treatment); or, the more finicky possibility, WT has a large effect of treatment and mutant has a smaller effect of treatment (true in the sample data).

Does this seem reasonable to you? Would you do anything differently for calculating the SD and N for the differences? Are you aware of any published examples that have used this sort of statistics?

Thank you again for your time – your blog is a fantastic resource!

Jim Frost says

Hi Kaitlin,

Alright, I think I get what you’re asking now! 🙂

I edited your comment because it was kind of long. But, the gist of what you’re asking is, how can you statistically test whether the relationship between treatment condition and the outcome (i.e., effect size) changes by genotype? In other words, does the effect size depend on genotype? To do that, you need to include an interaction effect in your model. If you include an interaction between genotype and treatment condition, and that interaction is statistically significant, you can conclude the size of the treatment effect varies depending on genotype.

For more information, read my post about understanding interaction effects. If you have any more questions, please don’t hesitate to ask!

I hope that helps!
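In a 2×2 design like this one, the interaction is the difference of differences. As a rough sketch, the code below computes that contrast and a 95% CI from the example summary statistics in Kaitlin’s comment, assuming four independent groups, unpooled variances, and a normal approximation. With only 10 to 12 observations per group that approximation is crude, so this illustrates the arithmetic and is not a substitute for fitting the model with an interaction term.

```python
from math import sqrt
from statistics import NormalDist

# (mean, SD, n) copied from the example summary statistics above.
wt_control = (5.67, 6.79, 10)
wt_treatment = (37.88, 20.63, 10)
mutant_control = (13.01, 7.07, 10)
mutant_treatment = (29.72, 16.51, 12)


def squared_se(group):
    """Squared standard error of a group mean: SD^2 / n."""
    _, sd, n = group
    return sd**2 / n


# Treatment effect within each genotype, then their difference.
effect_wt = wt_treatment[0] - wt_control[0]
effect_mutant = mutant_treatment[0] - mutant_control[0]
contrast = effect_wt - effect_mutant

# For independent groups, the variances of the four means add.
groups = (wt_control, wt_treatment, mutant_control, mutant_treatment)
se_contrast = sqrt(sum(squared_se(g) for g in groups))

std_normal = NormalDist()
z_crit = std_normal.inv_cdf(0.975)
lower = contrast - z_crit * se_contrast
upper = contrast + z_crit * se_contrast
p_value = 2 * (1 - std_normal.cdf(abs(contrast) / se_contrast))

print(f"Difference of differences: {contrast:.2f}")
print(f"Standard error: {se_contrast:.2f}")
print(f"Approximate 95% CI: {lower:.2f} to {upper:.2f}")
print(f"Approximate p-value: {p_value:.3f}")
```

With these illustrative numbers, the approximate interval includes zero, so this sketch alone would not show that the treatment effect differs by genotype. A t-based interval from a fitted model would be somewhat wider still.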

Kaitlin Snider says

Hey Jim 🙂 Thanks for the great post! I’m trying to see if I can use this for a “differences of differences” test. In the simplest form, I have four groups: two genotypes, and two conditions (“control” and “experiment”). I’m not interested in the raw values; rather, I want to see if the genotype alters the DIFFERENCE caused by experiment (e.g., does WT have a greater effect of experiment than mutant?). Two-way ANOVA kind of works but doesn’t directly ask the question I am asking, so it ends up a bit underpowered. The measurements used are one-and-done, so all four groups are independent, with no repeated measures. So, is there a way to get a confidence interval of the mean difference of two mean differences? I can compute the means, but I’m not sure how to handle the standard error or the n. Thanks in advance!!

Jim Frost says

Hi Kaitlin,

Yes, you can do what you describe in conjunction with ANOVA. You’ll need to use post hoc analyses after performing ANOVA. I write all about this in my article about post hoc tests. It’ll work for the scenario you describe. Read that post and if you have further questions, don’t hesitate to ask! But, I think that’ll help you move forward! In a nutshell, these procedures calculate the CIs for differences between multiple group means and, importantly, control the family error rate for the multiple comparisons.

Frank Corotto says

I don’t know who to believe, Geoff Cumming or other sources. Cumming says his rules fall apart when there are multiple comparisons. He makes no mention of other approaches. My copy of Zar tells me what to do. Use a Tukey test backwards to calculate confidence intervals around each difference between two means. See if zero is excluded. Cumming also says confidence intervals can’t be used when there are repeated measures. What about Loftus and Masson (1994), who said use the within-subjects error term, what Zar calls remainder MS, and what about papers that followed up on Loftus and Masson? I’d appreciate any of your insights. Thanks for your time.

Jim Frost says

Hi Frank,

Those references you mention don’t necessarily contradict each other. You should use confidence intervals differently in different situations. I’m not 100% familiar with all the references but here’s how I think they all generally agree. Note as we go how the context changes, which necessitates changes to the CIs.

If you have random samples drawn from several populations and you want to better understand each population’s mean, just use a regular CI.

If you have two populations and you want to understand the difference between their means, use a CI for the difference between means. That’s what I write about in this post.

If you have more than two populations and you need to compare multiple means to each other, you’ll need to use CIs for the differences between means that are generated by post-hoc analyses (a.k.a., multiple comparison methods) to control the family error rate. For example, you might use CIs from Tukey’s test. I write about this in my blog post about using post-hoc tests with ANOVA.

If you’re using repeated measures, that again changes the context. As I understand it, the hypothesis tests for these designs use the within-subjects variances to gain greater power. However, the CIs were typically calculated using both the within- and between-subjects variances. There are different recognized methods for removing the between-subjects variances. I’m not completely familiar with the specifics of this, but I believe your Loftus and Masson citation is one method for doing just that.

So, there’s no contradiction that I can see and no reason to disbelieve any of them. In different contexts, you’ll need to use CIs differently. That’s the standard practice in statistics. Understand the context and use the analyses appropriately!

LUC says

Thanks Jim, great !

One question: how do you plot the CI for the difference between group means with minitab ?

Lawrence Adu-Gyamfi says

Thanks for this Jim, it clarifies a lot for me!

Ricardo Garza-Mendiola says

Great, thanks

Khursheed says

Hllo sir …

I hve no words so that I can say uh Thnks ….ths is the best thing that we can b able to clear our concepts @statistics ..

May God bless u.

Jim Frost says

Hi Khursheed,

Thank you so much for your kind words. They mean a lot to me! I’m glad my website has been helpful!