A confidence interval is calculated from a sample and provides a range of values that likely contains the unknown value of a population parameter. In this post, I demonstrate how confidence intervals and confidence levels work using graphs and concepts instead of formulas. In the process, you’ll see how confidence intervals are very similar to P values and significance levels.

Read the companion post for this one: How Hypothesis Tests Work: Significance Levels (Alpha) and P-values. In that post, I use the same graphical approach to illustrate why we need hypothesis tests, how significance levels and P values can determine whether a result is statistically significant, and what that actually means.

## How to Interpret Confidence Intervals

You can calculate a confidence interval from a sample to obtain a range for where the population parameter is likely to reside. For example, a confidence interval of [9 11] indicates that the population mean is likely to be between 9 and 11.

Different random samples drawn from the same population are liable to produce slightly different intervals. If you draw many random samples and calculate a confidence interval for each sample, a specific proportion of the intervals contain the population parameter. That proportion is the confidence level.

For example, a 95% confidence level suggests that if you draw 20 random samples from the same population, you’d expect 19 of the confidence intervals to include the population value.

The confidence interval procedure provides meaningful estimates because it produces ranges that usually contain the parameter.
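That "usually contain the parameter" claim is easy to check with a quick simulation. Here's a sketch in Python; the population parameters below are hypothetical, chosen only for illustration, not taken from the fuel cost example:

```python
# Sketch: simulate the meaning of a 95% confidence level.
# Draw many samples from a known population and count how many of the
# resulting t-based confidence intervals contain the true mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_mean, true_sd = 100, 15      # hypothetical population values
n_samples, n = 10_000, 25

covered = 0
for _ in range(n_samples):
    sample = rng.normal(true_mean, true_sd, n)
    sem = stats.sem(sample)       # s / sqrt(n)
    lo, hi = stats.t.interval(0.95, n - 1, loc=sample.mean(), scale=sem)
    covered += lo <= true_mean <= hi

print(f"Coverage: {covered / n_samples:.3f}")   # close to 0.95
```

About 95% of the intervals capture the true mean, which is exactly what the confidence level promises.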

We’ll create a confidence interval for the population mean using the fuel cost example that we’ve been developing. With other types of data, you can create intervals for proportions, frequencies, regression coefficients, and differences between populations.

**Related post**: See how confidence intervals compare to prediction intervals and tolerance intervals.

## Confidence Intervals Indicate the Precision of the Estimate

Confidence intervals include the point estimate for the sample with a margin of error around the point estimate. The point estimate is the most likely value of the parameter and equals the sample value. The margin of error accounts for the amount of doubt involved in estimating the population parameter. The more variability there is in the sample data, the less precise the estimate, which causes the margin of error to extend further out from the point estimate. Confidence intervals help you navigate the uncertainty of how well a sample estimates a value for an entire population.

With this in mind, confidence intervals can help you compare the precision of different estimates. Suppose two different samples estimate the same population parameter with 95% confidence intervals. One interval is [5 15] while the other is [9 11]. The latter confidence interval is narrower, which suggests that it is a more precise estimate.

**Related post**: Sample Statistics Are Always Wrong (to Some Extent)!

## Creating Confidence Intervals Graphically

Let’s delve into how confidence intervals incorporate the margin of error. As in the previous posts, I’ll use the same type of sampling distribution that showed us how hypothesis tests work. This sampling distribution is based on the t-distribution, our sample size, and the variability in our sample. Download the CSV data file: FuelsCosts.

There are two key differences between the sampling distribution graphs for significance levels and confidence intervals. The significance level chart centers on the null value, and we shade the outside 5% of the distribution. Conversely, the confidence interval graph centers on the sample mean, and we shade the center 95% of the distribution.

The shaded range of sample means [267 394] covers 95% of this sampling distribution. This range is the 95% confidence interval for our sample data. We can be 95% confident that the population mean for fuel costs falls between 267 and 394.
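If you want to reproduce this interval, here is a sketch in Python. Only the sample mean (330.6) and sample size (N=25) appear in this post; the sample standard deviation (about 154) is an assumption backed out from the reported interval:

```python
# Sketch: reconstruct the 95% CI for the fuel cost example from summary
# statistics. The sample SD (~154) is an assumption inferred from the
# interval reported in the post; the mean (330.6) and n (25) are given.
import math
from scipy import stats

mean, s, n = 330.6, 154.0, 25
sem = s / math.sqrt(n)                    # standard error of the mean
t_crit = stats.t.ppf(0.975, n - 1)        # two-sided 95% critical value
margin = t_crit * sem
lo, hi = mean - margin, mean + margin
print(f"95% CI: [{lo:.0f} {hi:.0f}]")     # approximately [267 394]
```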

## Confidence Intervals and the Inherent Uncertainty of Using Sample Data

The graph emphasizes the role of uncertainty around the point estimate. This graph centers on our sample mean. If the population mean equals our sample mean, the means of random samples drawn from this population (N=25) will fall within this range 95% of the time.

We don’t really know whether our sample mean is near the population mean. However, we know that the sample mean is an unbiased estimate of the population mean. An unbiased estimate is one that doesn’t tend to be too high or too low. It’s correct on average. Confidence intervals are correct on average because they use sample estimates that are correct on average. Given what we know, the sample mean is the most likely value for the population mean.

Given the sampling distribution, it would not be unusual for other random samples drawn from the same population to have means that fall within the shaded area. In other words, given that we did, in fact, obtain the sample mean of 330.6, it would not be surprising to get other sample means within the shaded range.

If these other sample means would not be unusual, then we must conclude that these other values are also likely candidates for the population mean. There is an inherent uncertainty when you use sample data to make inferences about the entire population. Confidence intervals help you gauge the amount of uncertainty in your sample.

## Confidence Intervals and P Values Always Agree on Statistical Significance

If you want to determine whether your test results are statistically significant, you can use either P values with significance levels or confidence intervals. These two approaches always agree.

The relationship between the confidence level and the significance level for a hypothesis test is as follows:

Confidence level = 1 – Significance level (alpha)

For example, if your significance level is 0.05, the equivalent confidence level is 95%.

Both of the following conditions represent a hypothesis test with statistically significant results:

- The P value is smaller than the significance level.
- The confidence interval excludes the null hypothesis value.

Further, it is always true that when the P value is less than your significance level, the interval excludes the value of the null hypothesis.

In the fuel cost example, our hypothesis test results are statistically significant because the P value (0.03112) is less than the significance level (0.05). Likewise, the 95% confidence interval [267 394] excludes the null hypothesis value (260). Using either method, we draw the same conclusion.
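Here is a sketch of that one-sample t-test computed from summary statistics. The mean (330.6), sample size (25), and null value (260) come from the example; the sample SD of about 154 is an assumption that reproduces the post's numbers:

```python
# Sketch: one-sample t-test for the fuel cost example from summary
# statistics. The sample SD (~154) is an assumption; the mean, n, and
# null hypothesis value come from the post.
import math
from scipy import stats

mean, s, n, null_value = 330.6, 154.0, 25, 260
sem = s / math.sqrt(n)
t_stat = (mean - null_value) / sem
p_value = 2 * stats.t.sf(abs(t_stat), n - 1)   # two-tailed p-value
print(f"t = {t_stat:.3f}, p = {p_value:.5f}")  # p near 0.031
```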

## Why They Always Agree

The P-value and confidence interval results always agree. To understand the basis of this agreement, we need to remember how confidence levels and significance levels function:

- A confidence level determines the distance between the sample mean and the confidence limits.
- A significance level determines the distance between the null hypothesis value and the critical regions.

Both of these concepts specify a distance from the mean to a limit. Surprise! These distances are precisely the same length.

A 1-sample t-test calculates this distance as follows:

The critical t-value * standard error of the mean

Interpreting these statistics goes beyond the scope of this article. But, using this equation, the distance for our fuel cost example is $63.57.

**P Value and significance level approach**: If the sample mean is more than $63.57 from the null hypothesis mean, the sample mean falls within the critical region and the difference is statistically significant.

**Confidence interval approach**: If the null hypothesis mean is more than $63.57 from the sample mean, the interval does not contain this value, and the difference is statistically significant.

Of course, they always agree!

As long as the P values and confidence intervals are generated by the same hypothesis test, and you use an equivalent confidence level and significance level, the two approaches always agree.

**Related post**: Standard Error of the Mean

## I Really Like Confidence Intervals!

In statistics, more emphasis is placed on using P values to determine whether a result is statistically significant. Unfortunately, an effect that is statistically significant might not always be practically significant. For example, a significant effect can be too small to be of any importance in the real world. Learn how to use confidence intervals to compare group means!

You should always consider both the size and precision of the estimated effect. Ideally, an estimated effect is both large enough to be meaningful and sufficiently precise for you to trust. Confidence intervals allow you to assess both of these considerations! Learn more about this distinction in my post about Practical vs. Statistical Significance.

To see an alternative to traditional confidence intervals that does not use probability distributions and test statistics, learn about bootstrapping in statistics! In that post, I create bootstrapped confidence intervals.

Loukas says

Hi Jim,

Thank you very much.

When training a machine learning model using bootstrap, in the end we will have the confidence interval of accuracy.

How can I say that this result is statistically significant?

Do I have to convert the confidence interval to p-values first and if p-value is less than 0.05, then it is statistically significant?

Jim Frost says

Hi Loukas,

As I mention in this article, you determine significance using a confidence interval by assessing whether it excludes the null hypothesis value. When it excludes the null value, your results are statistically significant.

Loukas says

Dear Jim,

Thanks for this post.

I am new to hypothesis testing and would like to ask you how we know that the null hypothesis value is equal to 260.

Thank you.

Kind regards,

Loukas

Jim Frost says

Hi Loukas,

For this example, the null hypothesis is 260 because that is the value from the previous year and they wanted to compare the current year to the previous year. It’s defined as the previous year value because the goal of the study was to determine whether it has changed since last year.

In general, the null hypothesis will often be a meaningful target value for the study based on their knowledge, such as this case. In other cases, they’ll use a value that represents no effect, such as zero.

I hope that helps clarify it!

Victor Talavera says

Hello, Mr. Jim Frost.

Thank you for publishing precise information about statistics, I always read your posts and bought your excellent e-book about regression! I really learn from you.

I got a couple of questions about the confidence level of the confidence intervals. Jacob Cohen, in his article “Things I Have Learned (So Far),” said that, in his experience, the most useful and informative confidence level is 80%; other authors state that if that level is below 90%, it would be very hard to compare across results, as it is uncommon.

My first question is: in exploratory studies with small samples (for example, N=85), if one wishes to generate correlational hypotheses for future research, would it be better to use a lower confidence level? What is the lowest level you would consider acceptable? I ask because of my own research: with a sample size of 85 (non-probabilistic sampling), I know all I can do is generate some hypotheses to be explored in the future, so I would like my confidence intervals to be more informative, because I am not looking to generalize to the population.

My second question is: could you please provide an example of an appropriate way to describe the information about the confidence interval values/limits, beyond the classic “it contains a difference of 0; it contains a ratio of 1”.

I would really appreciate your answers.

Greetings from Peru!

Jim Frost says

Hi Victor,

Thanks so much for your kind words and for supporting my regression ebook! I’m glad it’s been helpful! 🙂

On to your questions!

I haven’t read Cohen’s article, so I don’t understand his rationale. However, I’m extremely dubious of using a confidence level as low as 80%. Lowering the confidence level will create a narrower CI, which looks good. However, it comes at the expense of dramatically increasing the likelihood that the CI won’t contain the correct population value! My position is to leave the confidence level at 95%. Or, possibly lower it to 90%. But, I wouldn’t go further. Your CI will be wider, but that’s OK. It’s reflecting the uncertainty that truly exists in your data. That’s important. The problem with lowering the confidence level is that it makes your results appear more precise than they actually are.

When I think of exploratory research, I think of studies that are looking at tendencies or trends. Is the overall pattern of results consistent with theoretical expectations, and does it justify further research? At that stage, it shouldn’t be about obtaining statistically significant results–at least not as the primary objective. Additionally, exploratory research can help you derive estimated effect sizes, variability, etc. that you can use for power calculations. A smaller, exploratory study can also help you refine your methodology so you don’t waste resources by going straight to a larger study that might not be as refined as it would be after a test run in the smaller study. Consequently, obtaining significant results, or results that look precise when they aren’t, isn’t the top priority.

I know that lowering the confidence level makes your CI look more informative, but that is deceptive! I’d resist that temptation. Maybe go down to 90%. Personally, I would not go lower.

As for the interpretation, CIs indicate the range that a population parameter is likely to fall within. The parameter can be a mean, effect size, ratio, etc. Oftentimes, you as the researcher are hoping the CI excludes an important value. For example, if the CI is for the effect size, you want the CI to exclude zero (no effect). In that case, you can say that there is unlikely to be no effect in the population (i.e., there probably is a non-zero effect in the population). Additionally, the effect size is likely to be within this range. Other times, you might just want to know the range of values itself. For example, if you have a CI for the mean height of a population, it might be valuable on its own to know that the population mean height is likely to fall between X and Y. If you have a specific example of what the CI assesses, I can give you a more specific interpretation.

Additionally, I cover confidence intervals associated with many different types of hypothesis tests in my Hypothesis Testing ebook. You might consider looking in to that!

I hope that helps!

Yvonne says

Hi Jim

I got a very wide 95% CI for the HR of height in the Cox PH model from a very large sample. I already deleted the outliers (defined as 1.5 IQR), but it doesn’t help. Do you know how to resolve it?

Pedro Cunha says

Hello, Jim!

I appreciate the thoughtful and thorough answer you provided. It really helped in crystallizing the topic for me.

If I may ask for a bit more of your time, as long as we are talking about CIs I have another question:

How would you go about constructing a CI for the difference of variances?

I am asking because while creating CIs for the difference of means or proportions is relatively straightforward, I couldn’t find any references for the difference of variances in any of my textbooks (or on the Web for that matter); I did find information regarding CIs for the ratio of variances, but it’s not the same thing.

Could you help me with that?

Thanks a lot!

Pedro Cunha says

Hello, Jim!

I want to start by thanking you for a great post and an overall great blog! Top notch material.

I have a doubt regarding the difference between confidence intervals for a point estimate and confidence intervals for a hypothesis test.

As I understand, if we are using CIs to test a hypothesis, then our point estimate would be whatever the null hypothesis is; conversely, if we are simply constructing a CI to go along with our point estimate, we’d use the point estimate derived from our sample. Am I correct so far?

The reason I am asking is that because while reading from various sources, I’ve never found a distinction between the two cases, and they seem very different to me.

Bottom line, what I am trying to ask is: assuming the null hypothesis is true, shouldn’t the CI be changed?

Thank you very much for your attention!

Jim Frost says

Hi Pedro,

There’s no difference in the math behind the scenes. The real difference is that when you create a confidence interval in conjunction with a hypothesis test, the software ensures that they’re using consistent methodology. For example, the significance level and confidence level will correspond correctly (i.e., alpha = 0.05 and confidence level = 0.95). Additionally, if you perform a two-tailed test, you will obtain a two-sided CI. On the other hand, if you perform a one-tailed test, you will obtain the appropriate upper or lower bound (i.e., one-sided CIs). The software also ensures any other methodological choices you make will match between the hypothesis test and CI, which ensures the results always agree.

You can perform them separately. However, if you don’t match all the methodology options, the results can differ.

As for your question about assuming the null is true. Keep in mind that hypothesis tests create sampling distributions that center on the null hypothesis value. That’s the assumption that the null is true. However, the sampling distributions for CIs center on the sample estimate. So, yes, CIs change that detail because they don’t assume the null is correct. But that’s always true whether you perform the hypothesis test or not.

Thanks for the great questions!

Jaser Ali says

Hi Jim !!

The confidence interval has the sample statistic as the most likely value (the value in the center), while the sampling distribution assumes the null value to be the most likely value (the value in the center). I am a little confused about this. It would be really kind of you if you could show both in the same graph and explain how they are related. How is the distance from the mean to a limit the same for the significance level and the CI?

Jim Frost says

Hi Jaser,

That’s a great question. I think part of your confusion is due to terminology.

The sampling distribution of the means centers on the sample mean. This sampling distribution uses your sample mean as its mean and the standard error of the mean as its standard deviation.

The sampling distribution of the test statistic (t) centers on the null hypothesis value (0). This distribution uses zero as its mean and also uses the SEM for its standard deviation.

They’re two different things and center on different points. But they both incorporate the SEM, which is why they always agree! I do have a section in this post about why that distance is always the same. Look for the section titled, “Why They Always Agree.”

I hope that helps!

Gerard Versluis says

Hi Jim, I’m the proud owner of 2 of your ebooks. There’s one topic though that keeps puzzling me:

If I were to take 9 samples of size 15 in order to estimate the population mean, the SE of the mean would be substantially larger than if I took 1 sample of size 135 (divide the population SD by sqrt(15) or sqrt(135)), whereas the E(x) (or mean of means) would be the same.

Can you please shine a little light on that.

Tx in advance

Gerard

Jim Frost says

Hi Gerard,

Thanks so much for supporting my ebooks. I really appreciate that!! 🙂

So, let’s flip that scenario around. If you know that a single large sample of 135 will produce more precise estimates of the population, why would you collect nine smaller samples? Knowing how statistics works, that’s not a good decision. If you did that in the real world, it would be because there was some practical reason that you could not collect one big sample. Further, it would suggest that you had some reason for not being able to combine them later. For example, if you follow the same random sampling procedure on the same population, use all the same methodology, and sample at the same general time, you might feel comfortable combining them into one larger sample. So, if you couldn’t collect one larger sample and you didn’t feel comfortable combining the smaller ones, it suggests that you have some reason for doubting that they all measure the same thing for the same population. Maybe you had differences in methodology? Or subjective measurements across different personnel? Or maybe you collected the samples at different times and you’re worried that the population changed over time?

So, that’s the real world reason for why a researcher would not combine smaller samples into a larger one.

The mathematical reason that one larger sample is better than a number of smaller samples is that the calculations for the small samples aren’t “aware” of the similar data in the other samples. The formula for the standard error of the mean (SEM) is:

SEM = σ / √n

As you can see, the expected value for the population standard deviation is in the numerator (sigma). As the sample size increases, the numerator remains constant (plus or minus random error) because the expected value for the population parameter does not change. Conversely, the square root of the sample size is in the denominator. As the sample size increases, it produces a larger value in the denominator. So, if the expected value of the numerator is constant but the value of the denominator increases with a larger sample size, you expect the SEM to decrease. Smaller SEMs indicate more precise estimates of the population parameter. For instance, the equations for confidence intervals use the SEM. Hence, for the same population, larger samples tend to produce smaller SEMs, and more precise estimates of the population parameter.
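A minimal sketch of that relationship, using an illustrative (hypothetical) population SD:

```python
# Sketch: with the same population SD, a sample of 135 yields a much
# smaller SEM than a sample of 15. The SD value is illustrative.
import math

sigma = 10.0                      # hypothetical population SD
for n in (15, 135):
    print(f"n = {n:3d}  SEM = {sigma / math.sqrt(n):.3f}")
```

Since 135 = 9 × 15, the SEM for the large sample is exactly sqrt(9) = 3 times smaller.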

I hope that answers your question!

Georg Datler says

Hi Jim,

first of all: Thanks for your effort and your effective way of explaining!

You say that p-values and C.I.s always agree. I agree.

Why does Tim van der Zee claim the opposite?

I’m not enough into statistcs to figure this out.

http://www.timvanderzee.com/not-interpret-confidence-intervals/

Best regards

Georg

Jim Frost says

Hi Georg,

I think he is saying that they do agree–it’s just that people often compare the wrong pair of CIs and p-values. I assume you’re referring to the section “What do overlapping intervals (not) mean?” And he’s correct in what he says. In a 2-sample t-test, it’s not valid to compare the CI for each of the two group means to the test’s p-value because they have different purposes. On the one hand, you have the CIs for each group. On the other hand, you have the p-value for the difference between the two groups. Those are not the same thing, so it’s not surprising when they disagree.

However, if you compare the p-value of the difference between means to a CI of the difference between means, they will always agree. You have to compare apples to apples!
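Here’s a sketch of that apples-to-apples comparison with simulated (hypothetical) data. The p-value for the difference between means always agrees with a CI constructed for that same difference:

```python
# Sketch: the p-value for the *difference* between two group means agrees
# with a CI for that *difference* (not with the groups' individual CIs).
# Data are simulated and purely illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(10, 3, 40)
b = rng.normal(12, 3, 40)

t_stat, p = stats.ttest_ind(a, b)       # pooled-variance 2-sample t-test

# Matching 95% CI for the difference in means (pooled SE, same df).
diff = a.mean() - b.mean()
df = len(a) + len(b) - 2
sp2 = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / df
se = np.sqrt(sp2 * (1 / len(a) + 1 / len(b)))
margin = stats.t.ppf(0.975, df) * se
ci = (diff - margin, diff + margin)

excludes_zero = not (ci[0] <= 0 <= ci[1])
print(p < 0.05, excludes_zero)          # always the same answer
```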

Devansh Malik says

Hey Jim,

First of all, I love all your posts and you really do make people appreciate statistics by explaining it intuitively compared to theoretical approaches I’ve come across in university courses and other online resources. Please continue the fantastic work!!!

At the end, you mentioned how you prefer confidence intervals as they consider both “size and precision of the estimated effect”. I’m confused as to what exactly size and precision mean in this context. I’d appreciate an explanation with reference to specific numbers from the example above.

Second, do p-values lack both size and precision in determination of statistical significance?

Thanks,

Devansh

Jim Frost says

Hi Devansh,

Thanks for the nice comments. I really appreciate them!

I really need to write a post specifically about this issue.

Let’s first assume that we conduct our study and find that the mean cost is 330.6 and that we are testing whether that is different from 260. Further suppose that we perform the hypothesis test and obtain a p-value that is statistically significant. We can reject the null and conclude that the population mean does not equal 260. And we can see our sample estimate is 330.6. So, that’s what we learn using p-values and the sample estimate.

Confidence intervals add to that information. We know that if we were to perform the experiment again, we’d get different results. How different? Is the true population mean likely to be close to 330.6 or further away? CIs help us answer these questions. The 95% CI is [267 394]. The true population value is likely to be within this range. That range spans 127 dollars.

However, let’s suppose we perform the experiment again, but this time use a much larger sample size and obtain a mean of 351 and again a significant p-value. However, thanks to the large sample size, we obtain a 95% CI of [340 362]. Now we know that the population value is likely to fall within this much tighter interval of only 22 dollars. This estimate is much more precise.

Sometimes you can obtain a significant p-value for a result that is too imprecise to be useful. For example, the first CI might be too wide to be useful for what we need to do with our results. Maybe we’re helping people make budgets, and that interval is too wide to allow for practical planning. However, the more precise estimate of the second study allows for better budgetary planning! Determining how much precision is required must be done using subject-area knowledge and focusing on the practical usage of the results. P-values don’t indicate the precision of the estimates in this manner!

I hope this helps clarify this precision issue!