A confidence interval is calculated from a sample and provides a range of values that likely contains the unknown value of a population parameter. In this post, I demonstrate how confidence intervals and confidence levels work using graphs and concepts instead of formulas. In the process, you’ll see how confidence intervals are very similar to P values and significance levels.

Read the companion post for this one: How Hypothesis Tests Work: Significance Levels (Alpha) and P-values. In that post, I use the same graphical approach to illustrate why we need hypothesis tests, how significance levels and P values can determine whether a result is statistically significant, and what that actually means.

## How to Interpret Confidence Intervals

You can calculate a confidence interval from a sample to obtain a range for where the population parameter is likely to reside. For example, a confidence interval of [9 11] indicates that the population mean is likely to be between 9 and 11.

Different random samples drawn from the same population are liable to produce slightly different intervals. If you draw many random samples and calculate a confidence interval for each sample, a specific proportion of the intervals contain the population parameter. That percentage is the confidence level.

For example, a 95% confidence level suggests that if you draw 20 random samples from the same population, you’d expect about 19 of the 20 confidence intervals, on average, to include the population value.

The confidence interval procedure provides meaningful estimates because it produces ranges that usually contain the parameter.
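To make the idea of a confidence level concrete, here is a minimal simulation sketch. It is not part of the fuel cost analysis: the population mean and standard deviation below are hypothetical values chosen only for illustration, and the critical t-value of 2.064 assumes samples of 25 observations (24 degrees of freedom). The sketch draws many random samples, builds a 95% interval from each, and reports the share of intervals that contain the population mean.

```python
import random
import statistics

# Hypothetical population values, chosen only for illustration.
POP_MEAN = 330.0
POP_SD = 150.0
N = 25
T_CRIT = 2.064  # two-sided 95% critical t-value for 24 degrees of freedom


def confidence_interval(sample):
    """Return the (lower, upper) 95% t-interval for the mean of a sample of size N."""
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / len(sample) ** 0.5
    margin = T_CRIT * std_err
    return mean - margin, mean + margin


def coverage(num_samples, seed=1):
    """Fraction of simulated intervals that contain the population mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(N)]
        lower, upper = confidence_interval(sample)
        if lower <= POP_MEAN <= upper:
            hits += 1
    return hits / num_samples


print(f"Share of intervals containing the population mean: {coverage(4000):.3f}")
```

The reported share should land close to 0.95. Any single interval either contains the population mean or it doesn’t; the 95% describes how often the procedure succeeds over many samples.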

We’ll create a confidence interval for the population mean using the fuel cost example that we’ve been developing. With other types of data, you can create intervals for proportions, frequencies, regression coefficients, and differences between populations.

**Related post**: See how confidence intervals compare to prediction intervals and tolerance intervals.

## Confidence Intervals Indicate the Precision of the Estimate

Confidence intervals include the point estimate for the sample with a margin of error around the point estimate. The point estimate is the most likely value of the parameter and equals the sample value. The margin of error accounts for the amount of doubt involved in estimating the population parameter. The more variability there is in the sample data, the less precise the estimate, which causes the margin of error to extend further out from the point estimate. Confidence intervals help you navigate the uncertainty of how well a sample estimates a value for an entire population.

With this in mind, confidence intervals can help you compare the precision of different estimates. Suppose two different samples estimate the same population parameter with 95% confidence intervals. One interval is [5 15] while the other is [9 11]. The latter confidence interval is narrower, which suggests that it is a more precise estimate.
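The link between variability and precision can be sketched in a few lines. This is an illustration under stated assumptions, not a calculation from the fuel cost data: the two standard deviations are hypothetical, the sample size is fixed at 25, and the margin of error is taken to be the critical t-value times the standard error of the mean, where the standard error is the sample standard deviation divided by the square root of the sample size.

```python
import math

T_CRIT = 2.064  # two-sided 95% critical t-value for 24 degrees of freedom


def margin_of_error(sample_sd, n, t_crit=T_CRIT):
    """Margin of error for a mean: critical t-value times the standard error."""
    return t_crit * sample_sd / math.sqrt(n)


# Hypothetical standard deviations for two samples of the same size (n = 25).
for sd in (100.0, 200.0):
    print(f"sd = {sd:5.1f} -> margin of error = {margin_of_error(sd, 25):.2f}")
```

With the sample size held constant, doubling the standard deviation doubles the margin of error, so the interval becomes twice as wide and the estimate is less precise.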

**Related post**: Sample Statistics Are Always Wrong (to Some Extent)!

## Creating Confidence Intervals Graphically

Let’s delve into how confidence intervals incorporate the margin of error. Like the previous posts, I’ll use the same type of sampling distribution that showed us how hypothesis tests work. This sampling distribution is based on the t-distribution, our sample size, and the variability in our sample. Download the CSV data file: FuelsCosts.

There are two key differences between the sampling distribution graphs for significance levels and confidence intervals. The significance level chart centers on the null value, and we shade the outside 5% of the distribution. Conversely, the confidence interval graph centers on the sample mean, and we shade the center 95% of the distribution.

The shaded range of sample means [267 394] covers 95% of this sampling distribution. This range is the 95% confidence interval for our sample data. We can be 95% confident that the population mean for fuel costs falls between 267 and 394.

## Confidence Intervals and the Inherent Uncertainty of Using Sample Data

The graph emphasizes the role of uncertainty around the point estimate. This graph centers on our sample mean. If the population mean equals our sample mean, the means of random samples drawn from this population (N=25) will fall within this range 95% of the time.

We don’t really know whether our sample mean is near the population mean. However, we know that the sample mean is an unbiased estimate of the population mean. An unbiased estimate is one that doesn’t tend to be too high or too low. It’s correct on average. Confidence intervals are correct on average because they use sample estimates that are correct on average. Given what we know, the sample mean is the most likely value for the population mean.

Given the sampling distribution, it would not be unusual for other random samples drawn from the same population to have means that fall within the shaded area. In other words, given that we did, in fact, obtain the sample mean of 330.6, it would not be surprising to get other sample means within the shaded range.

If these other sample means would not be unusual, then we must conclude that these other values are also likely candidates for the population mean. There is an inherent uncertainty when you use sample data to make inferences about the entire population. Confidence intervals help you gauge the amount of uncertainty in your sample.

## Confidence Intervals and P Values Always Agree on Statistical Significance

If you want to determine whether your test results are statistically significant, you can use either P values with significance levels or confidence intervals. These two approaches always agree.

The relationship between the confidence level and the significance level for a hypothesis test is as follows:

Confidence level = 1 – Significance level (alpha)

For example, if your significance level is 0.05, the equivalent confidence level is 95%.

Both of the following conditions represent a hypothesis test with statistically significant results:

- The P value is smaller than the significance level.
- The confidence interval excludes the null hypothesis value.

Further, it is always true that when the P value is less than your significance level, the interval excludes the value of the null hypothesis.

In the fuel cost example, our hypothesis test results are statistically significant because the P value (0.03112) is less than the significance level (0.05). Likewise, the 95% confidence interval [267 394] excludes the null hypothesis value (260). Using either method, we draw the same conclusion.

## Why They Always Agree

The P-value and confidence interval results always agree. To understand the basis of this agreement, we need to remember how confidence levels and significance levels function:

- A confidence level determines the distance between the sample mean and the confidence limits.
- A significance level determines the distance between the null hypothesis value and the critical regions.

Both of these concepts specify a distance from the mean to a limit. Surprise! These distances are precisely the same length.

A 1-sample t-test calculates this distance as follows:

The critical t-value * standard error of the mean

Interpreting these statistics goes beyond the scope of this article. But, using this equation, the distance for our fuel cost example is $63.57.

**P Value and significance level approach**: If the sample mean is more than $63.57 from the null hypothesis mean, the sample mean falls within the critical region and the difference is statistically significant.

**Confidence interval approach**: If the null hypothesis mean is more than $63.57 from the sample mean, the interval does not contain this value, and the difference is statistically significant.

Of course, they always agree!

As long as the P values and confidence intervals are generated by the same hypothesis test, and you use an equivalent confidence level and significance level, the two approaches always agree.
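The agreement can be checked with a short sketch that uses the numbers reported in this post: the sample mean (330.6), the null hypothesis value (260), and the distance of 63.57. The function names are my own, introduced only for illustration. One function applies the critical region rule, the other applies the confidence interval rule, and the loop tries a few additional, hypothetical null values to show that the two rules give the same answer.

```python
SAMPLE_MEAN = 330.6  # sample mean from the fuel cost example
NULL_VALUE = 260.0   # null hypothesis value from the example
MARGIN = 63.57       # critical t-value * standard error, as reported above


def ci_limits(sample_mean, margin):
    """Lower and upper confidence limits around the sample mean."""
    return sample_mean - margin, sample_mean + margin


def significant_by_critical_region(sample_mean, null_value, margin):
    """True when the sample mean lies more than `margin` from the null value."""
    return abs(sample_mean - null_value) > margin


def significant_by_interval(sample_mean, null_value, margin):
    """True when the confidence interval excludes the null value."""
    lower, upper = ci_limits(sample_mean, margin)
    return not (lower <= null_value <= upper)


lower, upper = ci_limits(SAMPLE_MEAN, MARGIN)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")

# 260 is the null value from the post; the other values are hypothetical.
for null in (NULL_VALUE, 300.0, 400.0):
    by_region = significant_by_critical_region(SAMPLE_MEAN, null, MARGIN)
    by_interval = significant_by_interval(SAMPLE_MEAN, null, MARGIN)
    print(f"null = {null:5.1f}: critical region -> {by_region}, interval -> {by_interval}")
```

Both rules compare the same two numbers, the gap between the sample mean and the null value and the distance of 63.57, which is why they cannot disagree.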

## I Really Like Confidence Intervals!

In statistics, more emphasis is placed on using P values to determine whether a result is statistically significant. Unfortunately, an effect that is statistically significant might not always be practically significant. For example, a significant effect can be too small to be of any importance in the real world.

You should always consider both the size and precision of the estimated effect. Ideally, an estimated effect is both large enough to be meaningful and sufficiently precise for you to trust. Confidence intervals allow you to assess both of these considerations! Learn more about this distinction in my post about Practical vs. Statistical Significance.

To see an alternative to traditional confidence intervals that does not use probability distributions and test statistics, learn about bootstrapping in statistics! In that post, I create bootstrapped confidence intervals.

Georg Datler says

November 6, 2018 at 10:26 am

Hi Jim,

first of all: Thanks for your effort and your effective way of explaining!

You say that p-values and C.I.s always agree. I agree.

Why does Tim van der Zee claim the opposite?

I don’t know enough about statistics to figure this out.

http://www.timvanderzee.com/not-interpret-confidence-intervals/

Best regards

Georg

Jim Frost says

November 7, 2018 at 9:31 am

Hi Georg,

I think he is saying that they do agree, just that people often compare the wrong pair of CIs and p-values. I assume you’re referring to the section “What do overlapping intervals (not) mean?” And, he’s correct in what he says. In a 2-sample t-test, it’s not valid to compare the CI for each of the two group means to the test’s p-values because they have different purposes. Consequently, they won’t necessarily agree. However, that’s because you’re comparing results from two different tests/intervals. On the one hand, you have the CIs for each group. On the other hand, you have the p-value for the difference between the two groups. Those are not the same thing and so it’s not surprising that they won’t agree necessarily.

However, if you compare the p-value of the difference between means to a CI of the difference between means, they will always agree. You have to compare apples to apples!

Devansh Malik says

April 14, 2018 at 8:54 pm

Hey Jim,

First of all, I love all your posts and you really do make people appreciate statistics by explaining it intuitively compared to theoretical approaches I’ve come across in university courses and other online resources. Please continue the fantastic work!!!

At the end, you mentioned how you prefer confidence intervals as they consider both “size and precision of the estimated effect”. I’m confused as to what exactly size and precision mean in this context. I’d appreciate an explanation with reference to specific numbers from the example above.

Second, do p-values lack both size and precision in determination of statistical significance?

Thanks,

Devansh

Jim Frost says

April 17, 2018 at 11:41 am

Hi Devansh,

Thanks for the nice comments. I really appreciate them!

I really need to write a post specifically about this issue.

Let’s first assume that we conduct our study and find that the mean cost is 330.6 and that we are testing whether that is different than 260. Further suppose that we perform the hypothesis test and obtain a p-value that is statistically significant. We can reject the null and conclude that the population mean does not equal 260. And we can see our sample estimate is 330.6. So, that’s what we learn using p-values and the sample estimate.

Confidence intervals add to that information. We know that if we were to perform the experiment again, we’d get different results. How different? Is the true population mean likely to be close to 330.6 or further away? CIs help us answer these questions. The 95% CI is [267 394]. The true population value is likely to be within this range. That range spans 127 dollars.

However, let’s suppose we perform the experiment again but this time use a much larger sample size and obtain a mean of 351 and again a significant p-value. However, thanks to the large sample size, we obtain a 95% CI of [340 362]. Now we know that the population value is likely to fall within this much tighter interval of only 22 dollars. This estimate is much more precise.

Sometimes you can obtain a significant p-value for a result that is too imprecise to be useful. For example, the first CI might be too wide to be useful for what we need to do with our results. Maybe we’re helping people make budgets and that range is too wide to allow for practical planning. However, the more precise estimate of the second study allows for better budgetary planning! The determination of how much precision is required must be made using subject-area knowledge and by focusing on the practical usage of the results. P-values don’t indicate the precision of the estimates in this manner!

I hope this helps clarify this precision issue!