To determine whether the difference between two means is statistically significant, analysts often compare the confidence intervals for those groups. If those intervals overlap, they conclude that the difference between groups is not statistically significant. If there is no overlap, the difference is significant.

While this visual method of assessing the overlap is easy to perform, regrettably it comes at the cost of reducing your ability to detect differences. Fortunately, there is a simple solution that lets you keep a quick visual assessment without diminishing the power of your analysis.

In this post, I’ll start by showing you the problem in action and explain why it happens. Then, we’ll proceed to an easy alternative method that avoids this problem.

## Comparing Groups Using Confidence Intervals of each Group Estimate

For all hypothesis tests and confidence intervals, you are using sample data to make inferences about the properties of population parameters. These parameters can be population means, standard deviations, proportions, and rates. For these examples, I’ll use means, but the same principles apply to the other types of parameters.

**Related posts**: Populations, Parameters, and Samples in Inferential Statistics and Difference between Inferential and Descriptive Statistics

Determining whether confidence intervals overlap is an overly conservative approach for identifying significant differences between groups. It’s true that when confidence intervals don’t overlap, the difference between groups is statistically significant. However, when there is some overlap, the difference might still be significant.

Suppose you’re comparing the mean strength of products from two groups and graph the 95% confidence intervals for the group means, as shown below. Download the CSV dataset that I use throughout this post: DifferenceMeans.

**Related post**: Understanding Confidence Intervals

## Jumping to Conclusions

Upon seeing how these intervals overlap, you conclude that the difference between the group means is not statistically significant. After all, if they’re overlapping, they’re not different, right? This conclusion sounds logical, but it’s not necessarily true. In fact, for these data, the 2-sample t-test results are statistically significant with a p-value of 0.044. Despite the overlapping confidence intervals, the difference between these two means is statistically significant.
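To see this effect with numbers, here is a minimal sketch. It uses made-up summary statistics (not the DifferenceMeans data) and a large-sample normal approximation rather than a t-test, so treat the exact values as illustrative only.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical summary statistics (NOT the post's DifferenceMeans data).
mean_a, se_a = 10.0, 0.7   # group mean and standard error of the mean
mean_b, se_b = 12.0, 0.7

z95 = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval

# 95% confidence interval for each group mean
ci_a = (mean_a - z95 * se_a, mean_a + z95 * se_a)
ci_b = (mean_b - z95 * se_b, mean_b + z95 * se_b)
intervals_overlap = ci_a[1] >= ci_b[0] and ci_b[1] >= ci_a[0]

# Two-sided z-test on the difference between the means
se_diff = sqrt(se_a**2 + se_b**2)
z_stat = (mean_b - mean_a) / se_diff
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(f"CI A: ({ci_a[0]:.2f}, {ci_a[1]:.2f})")
print(f"CI B: ({ci_b[0]:.2f}, {ci_b[1]:.2f})")
print(f"Intervals overlap: {intervals_overlap}")
print(f"p-value for the difference: {p_value:.3f}")
```

With these made-up inputs, the two group intervals overlap even though the test on the difference falls below 0.05.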

This example shows how the CI overlapping method fails to reject the null hypothesis more frequently than the corresponding hypothesis test. Using this method decreases the statistical power of your assessment (higher type II error rate), potentially causing you to miss essential findings.

This apparent discrepancy between confidence intervals and hypothesis test results might surprise you. Analysts expect that confidence intervals with a confidence level of (100 – X) will always agree with a hypothesis test that uses a significance level of X percent. For example, analysts often pair 95% confidence intervals with tests that use a 5% significance level. It’s true. Confidence intervals and hypothesis tests should always agree. So, what is happening in the example above?

**Related posts**: How Hypothesis Tests Work and Two Types of Error in Hypothesis Testing

## Using the Wrong Types of Confidence Intervals

The problem occurs because we are not comparing the correct confidence intervals to the hypothesis test result. The test results apply to the difference between the means while the CIs apply to the estimate of each group’s mean—not the difference between the means. We’re comparing apples to oranges, so it’s not surprising that the results differ.

To obtain consistent results, we must use confidence intervals for differences between group means—we’ll get to those CIs shortly.

However, if you’re determined to use CIs of each group to make this determination, there are several possible methods.

Goldstein and Healy (1995) find that if you want barely non-overlapping intervals to correspond to a difference between two means that is significant at the 5% level, you should use an 83% confidence interval of the mean for each group. The graph below uses this confidence level for the same dataset as above, and the intervals don’t overlap.
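One way to see where a figure near 83% can come from is the sketch below. It assumes independent groups and a large-sample normal approximation, and it is not necessarily the exact derivation in the paper. The function name `touching_confidence_level` is my own.

```python
from math import sqrt
from statistics import NormalDist

def touching_confidence_level(se_a, se_b, alpha=0.05):
    """Confidence level at which two group intervals just touch when the
    difference between the means is exactly borderline at `alpha`
    (normal approximation, independent groups)."""
    z_test = NormalDist().inv_cdf(1 - alpha / 2)
    # Intervals just touch when   diff = z_ci * (se_a + se_b).
    # The test is borderline when diff = z_test * sqrt(se_a^2 + se_b^2).
    z_ci = z_test * sqrt(se_a**2 + se_b**2) / (se_a + se_b)
    return 2 * NormalDist().cdf(z_ci) - 1

equal_se_level = touching_confidence_level(1.0, 1.0)
unequal_se_level = touching_confidence_level(1.0, 2.0)

print(f"Equal standard errors:   {equal_se_level:.1%}")
print(f"Unequal standard errors: {unequal_se_level:.1%}")
```

Under these assumptions, equal standard errors give a level of roughly 83%, and the required level rises as the two standard errors become more unequal. That is one reason to treat 83% as a rule of thumb rather than an exact setting.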

Cumming & Finch (2005) find that the degree of overlap for two 95% confidence intervals for independent means allows you to estimate the p-value for a 2-sample t-test when sample sizes are greater than 10. When the confidence limit of each CI reaches approximately the midpoint between the point estimate and the limit of the other CI, the p-value is near 0.05. The first graph in this post, with the 95% CIs, approximates this condition, and the p-value is near 0.05. Lower amounts of overlap correspond to lower p-values. For example, 95% CIs where the end of one CI just reaches the end of the other CI correspond to a p-value of about 0.01.
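The relationship between overlap and p-value can be sketched as follows. This assumes equal standard errors and a normal approximation, so the numbers are rougher than the guidance in the paper, and `p_from_overlap` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def p_from_overlap(overlap_fraction):
    """Approximate two-sided p-value for two independent means with equal
    standard errors, given the overlap of their 95% CIs as a fraction of
    one margin of error (0 = just touching, 0.5 = half overlap)."""
    z95 = NormalDist().inv_cdf(0.975)
    # Work in units of one standard error: margin of error = z95, and the
    # distance between the means = 2 * margin - overlap.
    diff = (2 - overlap_fraction) * z95
    z_stat = diff / sqrt(2)  # SE of the difference is sqrt(2) standard errors
    return 2 * (1 - NormalDist().cdf(z_stat))

p_half_overlap = p_from_overlap(0.5)
p_touching = p_from_overlap(0.0)

print(f"Half overlap:  p = {p_half_overlap:.3f}")
print(f"Just touching: p = {p_touching:.3f}")
```

Under these simplifying assumptions, half overlap lands somewhat below 0.05 and just-touching intervals land below 0.01, which is in line with the rules of thumb above.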

To me, these approaches seem kludgy. Using a confidence interval of the difference is an easier solution that even provides additional useful information.

## Assessing Confidence Intervals of the Differences between Groups

Previously, we saw how the apparent disagreement between the group CIs and the 2-sample test results occurs because we used the wrong confidence intervals. Instead, we need a CI for the difference between group means. This type of CI will always agree with the 2-sample t-test—just be sure to use the equivalent combination of confidence level and significance level (e.g., 95% and 5%). We’re now comparing apples to apples!
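Here is a hedged sketch of that agreement, using only the Python standard library. The summary statistics are hypothetical values chosen to resemble this example, not the actual dataset. The t-distribution helpers (`t_pdf`, `t_cdf`, `t_quantile`) are simple numerical stand-ins for what statistical software provides, and the calculation assumes the unequal-variances (Welch) form of the test.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """CDF via Simpson's rule on [0, |x|] plus symmetry (steps must be even)."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_quantile(prob, df):
    """Quantile for prob > 0.5, found by bisection on the CDF."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical summary statistics, chosen only to resemble the post's example.
n_a, mean_a, sd_a = 20, 10.00, 3.0
n_b, mean_b, sd_b = 20, 12.15, 3.5

var_a, var_b = sd_a**2 / n_a, sd_b**2 / n_b
se_diff = math.sqrt(var_a + var_b)
# Welch-Satterthwaite degrees of freedom (unequal variances)
df = (var_a + var_b) ** 2 / (var_a**2 / (n_a - 1) + var_b**2 / (n_b - 1))

diff = mean_b - mean_a
t_stat = diff / se_diff
p_value = 2 * (1 - t_cdf(abs(t_stat), df))

# 95% CI for the difference between the means
margin = t_quantile(0.975, df) * se_diff
ci_low, ci_high = diff - margin, diff + margin
ci_excludes_zero = ci_low > 0 or ci_high < 0

print(f"difference = {diff:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
print(f"p-value = {p_value:.3f}, CI excludes zero: {ci_excludes_zero}")
```

Because the interval and the test use the same standard error and the same t-distribution, the 95% CI excludes zero exactly when the p-value is below 0.05.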

Using the same dataset as above, the confidence interval below presents a range of values that likely contains the difference between the means for the entire population. The interpretation continues to be a simple visual assessment. Zero represents no difference between the means. Does the interval contain zero? If it does not include zero, the difference is statistically significant because the range excludes no difference. At a glance, we can tell that the difference is statistically significant.

This graph corresponds with the 2-sample t-test results below. Both test the difference between the two means. This output also provides a numerical representation of the CI of the difference [0.06, 4.23].

In addition to providing a simple visual assessment, the confidence interval of the difference presents crucial information that neither the group CIs nor the p-value provides. It answers the question, based on our sample, how large is the difference between the two populations likely to be? Like any estimate, there is a margin of error around the point estimate of the difference. It’s important to factor in this margin of error before acting on findings.

For our example, the point estimate of the mean difference is 2.15, and we can be 95% confident that the population difference falls within the range of 0.06 to 4.23.

**Related posts**: How T-tests Work and How Confidence Intervals Work

## Interpreting Confidence Intervals of the Mean Difference

Statisticians consider differences between group means to be an unstandardized effect size because these values indicate the strength of the effect using values that retain the natural data units. Effect sizes help you understand how important the findings are in a practical sense. To learn more about unstandardized and standardized effect sizes, read my post about Effect Sizes in Statistics.

As with all CIs, the width of the interval for the mean difference reveals the precision of the estimated effect size. Narrower intervals suggest a more precise estimate. And, you can assess whether the full range of values is practically significant. Remember, statistical significance doesn’t necessarily indicate that the results are meaningful in the real world. For more information about this issue, see Practical vs. Statistical Significance.

When the interval is too wide (imprecise) to be helpful and/or the range includes differences that are not practically significant, you have reason to hesitate before making decisions based on the results. These types of CI results indicate that you might not obtain meaningful benefits even though the difference is statistically significant.

There’s no statistical method for answering questions about how precise an estimate must be or how large an effect must be to be practically useful. To use the confidence interval of the difference to answer these questions, you’ll need to apply your subject-area knowledge.

For the example in this post, it’s important to note that the low end of the CI is very close to zero. It will not be surprising if the actual population difference falls close to zero, which might not be practically significant despite the statistically significant result. If you are considering switching to Group B for a stronger product, the mean improvement might be too small to be meaningful.
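If you want to make this kind of check explicit, a small helper like the one below can do it. The interval is the one reported in this post, while the threshold of 1.0 for a "meaningful" difference is purely hypothetical. In practice, that value has to come from your subject-area knowledge.

```python
def assess_difference_ci(ci_low, ci_high, min_meaningful_diff):
    """Summarize a CI of the mean difference against a practical threshold."""
    excludes_zero = ci_low > 0 or ci_high < 0
    # Smallest effect magnitude that the interval still allows
    smallest_plausible = min(abs(ci_low), abs(ci_high)) if excludes_zero else 0.0
    return {
        "statistically_significant": excludes_zero,
        "width": ci_high - ci_low,
        "whole_interval_meaningful": smallest_plausible >= min_meaningful_diff,
    }

# CI of the difference reported in the post; the threshold is hypothetical.
result = assess_difference_ci(0.06, 4.23, min_meaningful_diff=1.0)
print(result)
```

For this interval, the difference is statistically significant, but the interval still includes differences smaller than the assumed threshold.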

When you’re comparing groups, assess confidence intervals of those differences rather than comparing confidence intervals for each group. This method is simple, and it even provides you with additional valuable information.

### References

Goldstein, Harvey; Healy, Michael J. R. The Graphical Presentation of a Collection of Means, Journal of the Royal Statistical Society, Vol. 158, No. 1, 1995, p. 175-177.

Cumming, Geoff; Finch, Sue. Inference by Eye: Confidence Intervals and How to Read Pictures of Data, American Psychologist, Vol. 60, No. 2, 2005, p. 170-180.

Steve says

Sorry Jim if it wasn’t clear, but my post/question was a response to your response on Jan 6, 2020 to Kaitlyn Snyder. The data example I presented is what she presented in her original post.

Your response to her was: “In other words, does the effect size depend on genotype? To do that, you need to include an interaction effect in your model. If you include an interaction between genotype and treatment condition, and that interaction is statistically significant, you can conclude the size of the treatment effect varies depending on genotype.”

Within genotype A: treatment vs. control, genotype B: treatment vs. control. So there are two calculated effect sizes, one for A and one for B. There’s a CI for each effect size. Can we compare those two CIs using the G&H or C&F methods to determine whether the effect sizes are different? Assume I have sufficient data to calculate 83% CIs for the G&H method. In Kaitlyn’s post, she described it as assessing a “difference in differences” (as opposed to a difference), which is exactly a good way to put it.

Jim Frost says

Hi Steve, ah, no I didn’t remember that post from almost three years ago! Now, your original question makes much more sense! And I see how the interaction term comes into play.

You want to determine whether the effect size for the treatment is different between WT and mutants. I guess you can call that a difference of differences, but it seems a little convoluted.

Ideally, you’d use an interaction effect to test it. However, if that’s not feasible . . .

Those two methods apply to CIs of the mean. The effect size CIs are CIs of the difference between the means. I don’t know offhand if it would work. I want to say that it should work as long as the CIs of the differences do not use any form of adjustment to control the familywise error rate. However, I haven’t looked into that specifically. It’s not a normal use case!

Steve says

Thanks, Jim. I’m actually asking about heterogeneity of treatment effects. So, we have the CI of the difference for WT and CI of the difference for Mutant. To answer whether the treatment effects are different and statistically significant, can we use the CI comparison guidelines from Goldstein & Healy or Cumming & Finch? So, it’s a question not of assessing difference of means between two subgroups but rather the difference between those difference of means.

Jim Frost says

Just to clarify. Do you have the CIs for the mean difference between WT and a control group and the mean difference between Mutant and a control group? And you want to determine whether the difference between WT and Mutant is significant?

If that’s the case, the interaction term is irrelevant. You’re using the same model but using a different post hoc comparison method.

At any rate, to use either the G&H or C&F methods, you’ll need the CI for the WT mean and the CI for the Mutant mean. That’s different than the CI of the mean difference.

If you have those two CIs of the mean, you can use the C&F method to determine whether those mean effects are different. This approach is simpler because you don’t need to construct a new CI and just use the existing CIs of the mean.

If you have sufficient data to create the 83% CIs, you could use the G&H method. But that requires a bit more information (you’ll need to know the critical value for the correct degrees of freedom and be able to calculate the standard error of the mean).

Don’t use CIs of the mean difference for either of these methods.

Note that neither the G&H nor the C&F method controls the familywise error rate for multiple comparisons, which you’d probably want to do for this case. The link in this comment talks about that aspect.

Steve says

What if we have just the confidence interval of the treatment effect (difference between treatment and control) for each of the two subgroups (WT and Mutant, in this case), so there’s no dataset to perform a regression with the interaction term. Is it acceptable to use either the Goldstein and Healy rule of 83% confidence intervals or the Cumming & Finch rule of comparing the limit of one CI to the midpoint between the midpoint and limit of the other CI?

Jim Frost says

Hi Steve,

I’m not entirely certain what you want to achieve because you mention both an interaction term but also comparing the difference between means.

If you have a CI of the difference (i.e., treatment effect) you don’t need those other methods to determine statistical significance. As I write in this blog post, the best method is the CI of the difference, which you have! So, just look at the CI of the difference and see if it excludes zero. If it does, the difference between the groups is statistically significant.

As for the interaction term, there’s no way to use the CI of the difference or the other methods to evaluate the interaction term. You’d need the original data and then fit the model with the interaction term in it.

Sebastian says

Dear Jim,

thanks for the clear and instructive clarifications!

I ponder about one practical question: I try to transfer what you described to the situation of deciding whether two measurements are significantly different. When using a measurement apparatus, you usually assume measurements of one and the same phenomenon being normally distributed with a known variance that is the same for different phenomena (e.g. true value + random error). Now, if we want to compare two measurements of two different phenomena and want to decide if the measurement results are significantly different, it seems that we might do this with a 2-sample z-test. However, I wonder whether this is correct or advisable since the sample size is 1 for both samples (i.e. one measurement for each).

Thanks in advance!

Jim Frost says

Hi Sebastian,

I’ve never seen t-tests used for this purpose, but I think it would be correct. Although, it would be a t-test, not a z-test, because you don’t know the population standard deviation. You only have an estimate of the SD from your sample. In this case, each measurement would be an observation. If the results are statistically significant, you’d have enough evidence to favor the hypothesis that the two items have different measurements when factoring in measurement error.

I’m not familiar enough with measurement systems analysis to know whether this is a standard approach. But it seems to be theoretically sound.

Marlene de Varona says

Hi Jim,

I wanted to double check if I am following a good approach, although conservative to my analysis. I follow everything explained and conduct my hypothesis testing for continuous data like you show and I also use differences rather than relying only on comparing the CI. I did not know the tip about the 83% adjustment rather than using 95, so that was useful. Nonetheless I analyze a lot of survey data these days. I hate when people compare survey indicators, survey questions or one survey with another conducted later on and determine there are differences just cause one # is different than the other. I mean like let’s say Q1 has 73% of the people that filled out the survey in agreement and Q2 has 80%. Most people would say that Q2 is better than Q1. But, as we all know, there is a margin of error based on total population and sample surveyed or %participation if everyone was surveyed. Since hypothesis testing is too difficult for the average person to understand and some of these leaders or clients want to know why and not just cause a stats expert said so… I have been using the response +/- the margin of error calculated and if these intervals don’t overlap, then I declare it significant. For example, if the margin of error is 4% then the 73 for Question 1 becomes that 95% of the times the true population response is between 69 to 77% and question 2 is from 76% to 84% so, because they overlap, there is no difference. Of course this example is borderline and this method is conservative but is this statistically sound? I am comparing question responses to determine where we are ok and where we need to focus… then if I repeat the survey, to compare if there has been improvement. I have been doing this for a while now, would you say this is a good approach simple enough to explain or do you have a better suggestion?

Jim Frost says

Hi Marlene,

You really need to use the appropriate hypothesis test to determine whether the differences are statistically significant. That’s always the best advice. It’s not good practice, and can lead you astray, to look at the CIs or margins of error for individual groups or items.

That’s particularly true for survey items and margins of error. Sometimes the surveys only report the maximum MOE, but not the actual MOE. For more information, read my post about Margin of Error.

In your examples, you might need to use the two-sample proportion test. That’ll allow you to compare two proportions to determine whether the difference between them is statistically significant. That method specifically assesses the difference for those proportions and the sample sizes for each item/group.
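As a rough illustration of that test, here is a minimal sketch using the 73% and 80% figures from the question. The sample size of 500 per item is an assumption (it gives a margin of error near 4%), the pooled z-test shown is one common form of the two-sample proportion test, and it treats the two items as independent samples.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(p1, n1, p2, n2):
    """Two-sided pooled z-test for the difference between two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# 73% vs. 80% agreement; n = 500 per item is a hypothetical sample size.
z_stat, p_value = two_proportion_z_test(0.73, 500, 0.80, 500)
print(f"z = {z_stat:.2f}, p-value = {p_value:.3f}")
```

With these assumed sample sizes, the difference is statistically significant even though the two ±4% intervals (69% to 77% and 76% to 84%) overlap.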

Faith says

Thank you very much for this informative post. I would like to ask whether we can conclude that there is a significant difference between the groups if the confidence interval of one of the groups contains the mean of the other group?

Jim Frost says

Hi Faith,

In that case, you know that the difference between group means is NOT significant.

Dafeng Xu says

If two CIs do not overlap, then the two means are significantly different. If the two CIs overlap, it is not necessarily true that the two means are not significant. It is possible that the two CIs have some overlap but the two means are still significantly different from each other.

Tony B says

Hi Jim,

Thank you for all of your clarity and insight that you share so generously on your website. I was wondering if you could help me think through two questions related to:

1. how to assess how meaningful a statistically significant different finding is based on confidence intervals. I am trying to compare interrater reliability coefficients for two different variables. For the first variable, M=.94, SD=.02, 95% CI=.91-.97. For the second variable, M=.83, SD=.04, 95% CI=.80-.88. The difference between the low end of the CI (.91) for the first variable is only .03 off from the high end of the CI (.88) for the second variable, so I was thinking perhaps the statistical difference might not be meaningful. Is there some kind of effect size-type method that could be applied systematically for these types of comparisons?

2. how to address Type I (experiment-wise) error when making multiple comparisons using confidence intervals. I have made 20 comparisons similar to the one described above and found 4 statistically significant (more than expected by chance at p=.05).

I hope my questions make sense, and I am most appreciative of your consideration.

Kind regards,

Tony B

Pernilla Pierre says

Thank you for a very helpful discussion! The degrees of freedom is 37 in your example. According to what I have learnt from my statistics text book, the df should be 38 as the number of observations is 40 and the number of estimated parameters is two as sd for strength A and sd for strength B each are used to estimate the sd for two different populations. Is my text book wrong?

Jim Frost says

Hi Pernilla,

Yes, for a 2-sample t-test where you assume equal variances, the DF = N – 2.

The output is directly from my statistical software. I’ll need to look into that. It’s a more complicated calculation for unequal variances, and perhaps I used that method.

Marjolein says

Hi,

This means that it is only possible for equal sample sizes?

Jim Frost says

Hi Marjolein,

This works for unequal sample sizes as well.

lkacz says

Thanks, Jim. Am I the only one who wonders how you calculated “Assessing Confidence Intervals of the Differences between Groups”?

Jim Frost says

Hi Ikacz,

Well, I have no idea if you’re the only one wondering or not. Fortunately, most statistical software will compute it for you! Here’s the equation.

S is the sample standard deviation for the test statistic.
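The equation itself does not appear in this copy of the reply. As a stand-in, a standard textbook form of the confidence interval for the difference between two independent means is shown below. It assumes the unequal-variances version, and the symbols are my own labels rather than necessarily those in the original equation.

```latex
(\bar{x}_1 - \bar{x}_2) \;\pm\; t_{\alpha/2,\,\nu}\; s,
\qquad
s = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}
```

Here, the x-bars are the sample means, s_1 and s_2 are the sample standard deviations, n_1 and n_2 are the sample sizes, and the t-value is the critical value for the chosen confidence level with ν degrees of freedom.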

lkacz says

I had the same question regarding the 83%. Goldstein mentions 1.39 omega. Is this it?

David says

This analysis and explanation is just fantastic. I have been trying to find a way to confirm that the visual overlapping method I have been using is robust but was doing t-tests on the sample means rather than the difference between the two means. Is there a way to do this with three or more groups? I have F values and p values from ANOVA on three overlapping groups, how can I confirm the F stat is good?

Jim Frost says

Hi David,

For three or more groups, you need to use the Post Hoc tests associated with ANOVA. Click the link to learn more.

Delphine Ducros says

Hi Jim,

Firstly, thank you very much for replying so fast !

I saw the references you cited, but thought they only referred to the specific instances where you cited them, and not to this section ‘Assessing Confidence Intervals of the Differences between Groups’. But indeed, I checked them and they were really useful !! Thank you so much again !!!

Delphine

Jim Frost says

You’re very welcome, Delphine! I’m so glad I could help!

Delphine Ducros says

Hi Jim,

Thank you for this very interesting post !

I have been looking for a proper scientific article that I could cite to mention confidence intervals of the mean difference but I cannot find any that explains what it is. Would you have references on this? That would be really great !

Thank you very much if you can help me on that and have a nice day !

Delphine

Jim Frost says

Hi Delphine, did you notice the references that I put at the bottom of the article? You’ll probably find those helpful.

Collinz says

hi Jim, well done.

Jim, help me clarify on how “error bars” constructed in conjunction with bar charts are different from confidence intervals.

And also, I have very many people erroneously concluding that when the error bars overlap, the differences are present.

And finally, I keep on wondering why the post-hoc tests of ANOVA like Tukey’s HSD and the others which display confidence intervals for the levels of factors don’t use confidence interval for the difference because it seems we still erroneously apply the other concept of comparing apples to oranges.

I will be glad to hear from you.

Jim Frost says

Hi Collinz,

Depending on how they’re calculated, the error bars on bar charts could certainly be equivalent to confidence intervals. You’d have to assess their precise calculations to be sure. If they are equivalent to confidence intervals, be aware that CIs are a form of inferential statistics, which introduces requirements about representative sampling to be valid.

As I show in the post, it is possible to have significant differences even when the bars overlap. It depends on the degree of overlap. However, it’s best to calculate a confidence interval of the difference.

As for post hoc confidence intervals, I guess the type your software displays depends on the software! In my post about post hoc analyses, you can see that the software I use displays CIs of the differences. Again, for this scenario, CIs of the differences are the preferred choice!

Thanks for writing!

Ken says

Hello Jim,

Thank you for your rapid response. I think you provided a good suggestion in your last paragraph.

Best,

Ken

Ken says

Dear Jim,

Very clear and succinct.

My question, how can I, or can I use confidence intervals to judge the likelihood that an individual observation is of one group or another.

A concrete example:

I have two groups, placebo and treatment, N=100 for each group. I know the CI of the means and the CI of the difference of the means.

I do another experiment, N=5 per group and determine mean, variance … and I want to estimate (feel good) that the placebo and treatment groups of the new experiment are in accord with my large baseline data?

And to take it to an individual level, suppose I have a patient (treatment unknown) with a score. Can I place that patient in placebo or treatment groups based only on their score and with what confidence?

Thanx,

Ken

Jim Frost says

Hi Ken,

By going to such a smaller sample size (n=5), you have to expect that the CIs will be much wider. It’s possible that each group’s CI will include the other group’s mean from the first sample. In other words, it might be difficult confirming the initial results with the second sample that is so much smaller. You should be able to use power and sample size analysis to help you out. Using that procedure, enter the estimates from your first sample. Then enter different sample sizes. You want a reasonable power that you can reject the null. Perhaps you’ll find that you only need a sample size of 20. That’s not quite as small as 5 per group but it’s a lot less than 100!

In other words, I doubt 5 will be satisfactory. However, depending on the characteristics of your data/original results, you might find some compromise for the second sample size.

For your question about individuals, keep this in mind. CIs refer to parameters (mean, standard deviation) rather than individuals. So, they won’t help you there. I’m not familiar with an analysis dedicated to that type of question, but one might exist. So, it might be worthwhile to do a search. However, if you estimate the probability distribution function for the control and treatment groups, you can place an individual score in each distribution and calculate probabilities.

For example, you might determine that for a score of X, which is between the treatment and control group, 10% of the treatment group is less than X while 30% of the control group is greater than X. You can see that X is more likely to belong to the control group. In fact, X is 3 times as likely (30%/10%) to be within that portion of the control group’s tail than it is to be in that portion of the treatment group’s tail.

I hope that helps!

Collinz says

Hello, could any reader here be knowing how I can plot the confidence interval for the difference in means using rstudio?

If so help me by letting me know which package to use and the code for it

Thank you.

axelkowald says

Hello Jim,

I got a copy of the article by Goldstein & Healy, but struggle to understand how to calculate an adjusted CI for two groups with different standard errors. How did you calculate the 83% that you mention in your post?

Unfortunately, I cannot calculate a CI for the difference of the means since I only have the means and standard errors (but no information about the sample sizes).

Any hint is welcomed…

Stephanie says

Hi Jim,

I’m sure this question is below your pay grade – but how do I create a chart comparing confidence intervals for the means of two separate groups in SPSS (as you’ve done in your first chart)? I’m basically trying to see if there is overlap in the CIs of mean scores across two conditions as you were discussing.

Thank you so much!

Juvenal Nsengiyumva says

Dear Jim,

Many thanks. I have been helped

Blessings

Juvenal Nsengiyumva says

Dear Jim,

Let me thank you for this website, it is so helpful.

I have a question:

– How does the confidence interval show when research has found something new and unexpected?

Thank you

Jim Frost says

Hi Juvenal,

First, you need to know what the expected is. Typically, that involves subject-area knowledge and research. Then, after calculating the results, if the range that the confidence interval covers excludes the expected results, then your results are unexpected. Of course, what is expected varies greatly by subject area!

Often, we’ll look to determine whether the CI includes or excludes the null hypothesis value. Typically, the null hypothesis represents no effect, which is zero in numeric terms. For example, if you have a control group and a treatment group and the mean difference between them is zero, there is no effect. If the CI includes no effect (zero), then nothing interesting has been found. However, if the CI excludes zero, then the results suggest there is an effect. That’s interesting!

To learn more about this topic, read my post about how confidence intervals work. I go into more detail there!

I hope that helps!

Juvenal N. says

Dear Jim,

Thank you for this web, it helps so much.

I have a question,

“How does the confidence interval show when research has found something new and unexpected?”

Thank you.

Paul says

I have another question. Referring back to the sample mean hypothesis testing versus CIs. For symmetric distributions I think their results would always agree, how about for non symmetric ones, such as Weibull or Beta?

Once again Jim thank you so much for replying to my questions.

Jim Frost says

Hi Paul,

You’re very welcome!

There are two issues with what you’re asking about. First, will they agree? Yes, they’re using the same underlying calculations. The reasons I discuss in the post about why they always agree still apply for these other distributions.

However, because you’re talking about sample means, the central limit theorem comes into play. So, the second question becomes, ok, they agree, but are they valid given the nonnormal distributions? With small sample sizes, the answer is no, neither are valid. However, the answer becomes “yes” for both as sample size increases.

Paul says

Thanks a lot Jim.

I will read further on this topic. It seems like there are still even papers publishing in recent years asking people not to use the default normal SE formulas on sample skewness and kurtosis for unknown distributions.

Jim Frost says

Hi Paul, yes, as I indicated, these CIs assume the data follow a normal distribution and are sensitive to departures from this assumption. Consequently, don’t use them for other distributions. At least for parametric methods. I’m not sure about bootstrapping. However, performing a normality test will give you some of the same information.

Paul says

Hi Jim,

Thank you. I have a question which is related but not exactly comparing means.

Instead of comparing means from two samples, suppose I have just one sample and I want to know if the mean is significantly non-zero. A traditional approach would be a one-sample t-test.

Can I instead use the confidence interval of the sample mean to infer whether the true mean is significantly non-zero?

Intuitively (or graphically), to me it is a yes. However, under the hood they are slightly different.

With the t-test, I first assume the null hypothesis that the mean is zero. Then, based on the distribution implied by that hypothesis (i.e., using the sample SD and a t-distribution with n-1 df), if my sample mean falls outside the 95% range, the null hypothesis is rejected, which indicates a non-zero mean.

However, with my “confidence interval method,” conceptually there is no hypothesis at all. It is just, “Okay, my estimate roughly spans this region, and zero is far outside it, so the true mean is very unlikely to be zero,” and I declare it non-zero.

===========

The reason I am asking this is actually that I have a much deeper question. I have a data set, and I am tasked to answer the question whether the skewness and excess kurtosis are significantly non zero.

Of course, I can use some standard analytical formulas for the standard error, but they may not be accurate for empirical distributions.

So I am thinking of using the bootstrap method: resample the data, recompute the skewness and excess kurtosis, and use the 2.5th and 97.5th percentiles of the recomputed values as their confidence interval. If zero is outside those confidence intervals, I declare the skewness / excess kurtosis to be non-zero.

Once again, my intuition tells me it is right to do this, but I am not 100% sure it is a valid approach.
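The procedure described above can be sketched in a few lines. This is a minimal illustration of a percentile bootstrap, not a claim that the approach is valid for every data set: percentile intervals for higher-moment statistics may have poor coverage in small samples. The helper name `bootstrap_percentile_ci`, the 2,000 resamples, and the simulated exponential data are all assumptions made for the example.

```python
import numpy as np
from scipy import stats


def bootstrap_percentile_ci(data, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample with replacement, recompute, take percentiles."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = data.size
    boot_stats = np.empty(n_boot)
    for i in range(n_boot):
        resample = rng.choice(data, size=n, replace=True)
        boot_stats[i] = statistic(resample)
    lower, upper = np.percentile(boot_stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper


# Made-up right-skewed data standing in for the real data set.
rng = np.random.default_rng(42)
sample = rng.exponential(scale=1.0, size=500)

skew_ci = bootstrap_percentile_ci(sample, stats.skew)
# scipy.stats.kurtosis uses the Fisher definition by default, i.e. excess kurtosis.
kurt_ci = bootstrap_percentile_ci(sample, stats.kurtosis)

print("skewness 95% CI:       ", skew_ci)
print("excess kurtosis 95% CI:", kurt_ci)
print("zero outside skewness CI:", not (skew_ci[0] <= 0 <= skew_ci[1]))
```

The last line applies the decision rule from the comment: declare the statistic non-zero when zero falls outside the interval.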

Jim Frost says

Hi Paul,

In terms of using CIs in place of a hypothesis test, that’s a definite yes!

You can definitely use a confidence interval for hypothesis testing purposes. They’re variants of the same underlying methodology. In fact, the results from a hypothesis test with a significance level of 0.05 will always match the corresponding CI with a 95% confidence level. To learn more about this, read my post about how confidence intervals work. A section near the end explains why the two methods always yield consistent results.
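Here is a minimal sketch of that agreement for the one-sample case, using simulated normal data. The function name `one_sample_test_and_ci`, the sample size of 25, and the chosen true means are assumptions for illustration only. The test rejects at the 0.05 level exactly when the 95% CI excludes the null value, because both compare the same t-statistic to the same critical value.

```python
import numpy as np
from scipy import stats


def one_sample_test_and_ci(data, null_mean=0.0, alpha=0.05):
    """Return the two-sided one-sample t-test p-value and the matching t-based CI."""
    data = np.asarray(data, dtype=float)
    n = data.size
    mean = data.mean()
    sem = data.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    ci = (mean - t_crit * sem, mean + t_crit * sem)
    p_value = stats.ttest_1samp(data, popmean=null_mean).pvalue
    return p_value, ci


rng = np.random.default_rng(1)
for true_mean in (0.0, 0.3, 1.0):
    sample = rng.normal(loc=true_mean, scale=1.0, size=25)
    p_value, (low, high) = one_sample_test_and_ci(sample)
    rejects_null = p_value < 0.05
    ci_excludes_zero = not (low <= 0.0 <= high)
    print(f"true mean {true_mean}: p={p_value:.4f}, CI=({low:.3f}, {high:.3f}), "
          f"test rejects={rejects_null}, CI excludes 0={ci_excludes_zero}")
```

In every row, the last two values should match.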

However, in terms of CIs for skewness and kurtosis, there’s a complication. You can calculate CIs for these measures but they’re only valid for normally distributed data–at least using parametric methods. I don’t know if the same limitation applies to bootstrapping. Perhaps not.

Fortunately, I think there’s an easy solution for you. If you’re referring to “excess” kurtosis, then it seems like you essentially want to determine whether your data are normally distributed. For normally distributed data, excess kurtosis equals zero and skew equals zero. If that’s the case, just use a normality test! If the results for your normality test are statistically significant, your data are nonnormal–excess kurtosis and skew can’t both equal zero.
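As a rough sketch of that suggestion, one option is SciPy's `scipy.stats.normaltest`, the D'Agostino-Pearson omnibus test, which combines the sample skewness and kurtosis into a single statistic. The choice of this particular test and the simulated samples below are assumptions for illustration; other normality tests would serve the same purpose.

```python
import numpy as np
from scipy import stats

# Simulated stand-ins for real data: one normal sample, one right-skewed sample.
rng = np.random.default_rng(7)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=300)
skewed_sample = rng.exponential(scale=1.0, size=300)

for name, sample in (("normal", normal_sample), ("skewed", skewed_sample)):
    result = stats.normaltest(sample)  # omnibus test built from skewness and kurtosis
    verdict = "nonnormal" if result.pvalue < 0.05 else "no evidence against normality"
    print(f"{name}: statistic={result.statistic:.2f}, p={result.pvalue:.4g} -> {verdict}")
```

A significant result says that skewness and excess kurtosis are not both zero, but it does not say which one is responsible.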

I hope that helps!

Narayan says

Dear sir,

Thank you for the article; it is very informative. I have a question about which test to conduct for my research, which involves cross-border mergers and acquisitions. I have a sample of 3000 transactions conducted during a specific time period. These transactions were conducted by developed market acquirers (e.g., firms in countries like Germany, USA, UK, etc.), and the targets included both emerging market targets (firms in countries like India, China, etc.) and developed market targets (firms in countries like Germany, USA, UK, etc.). I want to test the statistical significance of the difference in the mean returns between two samples, (Developed market acquirers-Developed market targets) and (Developed market acquirers-Emerging market targets), on a yearly basis. The returns for each transaction are in percentage format. While researching the appropriate test to conduct in Stata, I came across two tests that I think would be relevant for this study (the paired t-test and the independent t-test). I am unsure which test would be the most appropriate. The countries in my sample are the same, but the acquirer firms and target firms in each country are different.

Thanks and regards,

Narayan

Peter says

Hi Sir,

Thank you for the clear explanation! It definitely helps me to understand statistics. However, when applying it to other research papers, I am having some difficulty interpreting the C.I. differences between groups.

Here is an example:

| Group 1 | No Choice & Public RPI | No Choice & Private RPI |
|---|---|---|
| Group 2 | No Choice & Private RPI | No Choice & No RPI |
| Effect Size | 1.0844 | 1.0229 |
| C.I. (95%) | 0.3179 and 1.8509 | 0.2618 and 1.7839 |

Disregarding the variables (no/yes choice, public/private/no RPI), I fail to understand what the difference in confidence interval between (0.3179 and 1.8509) and (0.2618 and 1.7839) actually tells me. Can I form any statistical conclusion between these two? Or do I need to use additional information?

Once again, thank you for your help!

Zahra says

I would like to plot CIs for data with, let’s say, 30 groups of 6 samples each. I would also like to include the process upper and lower control limits to show whether the process has been stable.

Do you think there is a way to do this in Minitab?

Thanks in advance.

Kaitlin Snider says

Hi Jim!

I read through your post on ANOVA and post hoc tests. I think I’m asking about doing something different than an ANOVA. I’ve detailed an example below – hopefully this question is more clear! Below, I start with some example data, show what happens if I put it through a two-way ANOVA test, discuss why I think the ANOVA isn’t effective for my question, and then show what I am thinking of doing instead (a sort of “difference of differences” test).

Let’s say I have this data (not my real data, set up for the sake of demonstrating the problem):

Group A: WT Control mean 5.67, SD 6.79, N 10

Group B: WT Treatment mean 37.88, SD 20.63, N 10

Group C: Mutant Control mean 13.01, SD 7.07, N 10

Group D: Mutant Treatment mean 29.72, SD 16.51, N 12

For this test, I don't think it makes sense to use just an ANOVA with post hoc tests. I am not interested in simply whether or not Treatment is different from Control *within* genotype (A-B or C-D). Rather, I am interested in whether the SIZE of difference for the mutant is *different* from the size of difference for the WT. This could mean that WT has a large effect of treatment, while mutant has no effect of treatment; the other way around (WT has no effect of treatment, while mutant has a large effect of treatment); or, the more finicky possibility, WT has a large effect of treatment and mutant has a smaller effect of treatment (true in the sample data).

Does this seem reasonable to you? Would you do anything differently for calculating the SD and N for the differences? Are you aware of any published examples that have used this sort of statistics?

Thank you again for your time – your blog is a fantastic resource!

Jim Frost says

Hi Kaitlin,

Alright, I think I get what you’re asking now! 🙂

I edited your comment because it was kind of long. But, the gist of what you’re asking is, how can you statistically test whether the relationship between treatment condition and the outcome (i.e., effect size) changes by genotype? In other words, does the effect size depend on genotype? To do that, you need to include an interaction effect in your model. If you include an interaction between genotype and treatment condition, and that interaction is statistically significant, you can conclude the size of the treatment effect varies depending on genotype.
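In a 2×2 design, the interaction corresponds to the “difference of differences” contrast, so the question about its standard error can be sketched directly from the summary statistics quoted in the comment. This is only an illustrative calculation, not the fitted interaction model described above: it assumes four independent groups, does not pool variances, and uses the Welch-Satterthwaite approximation for the degrees of freedom. The dictionary keys and variable names are made up for the example.

```python
import math
from scipy import stats

# Summary statistics quoted in the comment: (mean, SD, n) for each group.
groups = {
    "wt_control": (5.67, 6.79, 10),
    "wt_treatment": (37.88, 20.63, 10),
    "mutant_control": (13.01, 7.07, 10),
    "mutant_treatment": (29.72, 16.51, 12),
}

# Contrast: (WT treatment - WT control) - (mutant treatment - mutant control).
weights = {
    "wt_control": -1,
    "wt_treatment": 1,
    "mutant_control": 1,
    "mutant_treatment": -1,
}

estimate = sum(weights[g] * groups[g][0] for g in groups)

# Variance of each weighted group mean: w^2 * SD^2 / n (unequal variances assumed).
var_terms = {g: weights[g] ** 2 * groups[g][1] ** 2 / groups[g][2] for g in groups}
se = math.sqrt(sum(var_terms.values()))

# Welch-Satterthwaite approximate degrees of freedom for the contrast.
df = sum(var_terms.values()) ** 2 / sum(
    v ** 2 / (groups[g][2] - 1) for g, v in var_terms.items()
)

t_crit = stats.t.ppf(0.975, df)
ci = (estimate - t_crit * se, estimate + t_crit * se)
p_value = 2 * stats.t.sf(abs(estimate / se), df)

print(f"difference of differences: {estimate:.2f}")
print(f"standard error: {se:.2f}, approx. df: {df:.1f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f}), p = {p_value:.3f}")
```

With the raw data available, fitting a two-way model with a genotype-by-treatment interaction term is the more standard route. The contrast here is mainly useful for seeing where the standard error and degrees of freedom come from.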

For more information, read my post about understanding interaction effects. If you have any more questions, please don’t hesitate to ask!

I hope that helps!

Kaitlin Snider says

Hey Jim 🙂 Thanks for the great post! I’m trying to see if I can use this for a “differences of differences” test. In the simplest form, I have four groups: two genotypes, and two conditions (“control” and “experiment”). I’m not interested in the raw values; rather, I want to see if the genotype alters the DIFFERENCE caused by experiment (e.g., does WT have a greater effect of experiment than mutant?) Two-way ANOVA kind of works but doesn’t directly ask the question I am asking, so it ends up a bit underpowered. The measurements used are one-and-done, so all four groups are independent – no repeated measures. So, is there a way to get a confidence interval of the mean difference of two mean differences? I can compute the means, but I’m not sure how to handle the standard error or the n. Thanks in advance!!

Jim Frost says

Hi Kaitlin,

Yes, you can do what you describe in conjunction with ANOVA. You’ll need to use post hoc analyses after performing ANOVA. I write all about this in my article about post hoc tests. It’ll work for the scenario you describe. Read that post and if you have further questions, don’t hesitate to ask! But, I think that’ll help you move forward! In a nutshell, these procedures calculate the CIs for differences between multiple group means and, importantly, control the family error rate for the multiple comparisons.

Frank Corotto says

I don’t know who to believe, Geoff Cumming or other sources. Cumming says his rules fall apart when there are multiple comparisons. He makes no mention of other approaches. My copy of Zar tells me what to do. Use a Tukey test backwards to calculate confidence intervals around each difference between two means. See if zero is excluded. Cumming also says confidence intervals can’t be used when there are repeated measures. What about Loftus and Masson (1994), who said use the within-subjects error term, what Zar calls remainder MS, and what about papers that followed up on Loftus and Masson? I’d appreciate any of your insights. Thanks for your time.

Jim Frost says

Hi Frank,

Those references you mention don’t necessarily contradict each other. You should use confidence intervals differently in different situations. I’m not 100% familiar with all the references but here’s how I think they all generally agree. Note as we go how the context changes, which necessitates changes to the CIs.

If you have random samples drawn from several populations and you want to better understand each population’s mean, just use a regular CI.

If you have two populations and you want to understand the difference between their means, use a CI for the difference between means. That’s what I write about in this post.

If you have more than two populations and you need to compare multiple means to each other, you’ll need to use CIs for the differences between means that are generated by post-hoc analyses (a.k.a., multiple comparison methods) to control the family error rate. For example, you might use CIs from Tukey’s test. I write about this in my blog post about using post-hoc tests with ANOVA.

If you’re using repeated measures, that again changes the context. As I understand it, the hypothesis tests for these designs use the within-subjects variance to gain greater power. The CIs, however, were typically calculated using both the within-subjects and between-subjects variances. There are different recognized methods for removing the between-subjects variance. I’m not completely familiar with the specifics of this, but I believe your Loftus and Masson citation is one method for doing just that.

So, there’s no contradiction that I can see and no reason to disbelieve any of them. In different contexts, you’ll need to use CIs differently. That’s the standard practice in statistics. Understand the context and use the analyses appropriately!

LUC says

Thanks Jim, great !

One question: how do you plot the CI for the difference between group means with Minitab?

Lawrence Adu-Gyamfi says

Thanks for this Jim, it clarifies a lot for me!

Ricardo Garza-Mendiola says

Great, thanks

Khursheed says

Hello sir,

I have no words to say thanks. This is the best resource for clearing up our concepts in statistics.

May God bless you.

Jim Frost says

Hi Khursheed,

Thank you so much for your kind words. They mean a lot to me! I’m glad my website has been helpful!