Hypothesis testing is a vital process in inferential statistics where the goal is to use sample data to draw conclusions about an entire population. In the testing process, you use significance levels and p-values to determine whether the test results are statistically significant.

You hear about results being statistically significant all of the time. But, what do significance levels, P values, and statistical significance actually represent? Why do we even need to use hypothesis tests in statistics?

In this post, I answer all of these questions. I use graphs and concepts to explain how hypothesis tests function in order to provide a more intuitive explanation. This helps you move on to understanding your statistical results.

## Hypothesis Test Example Scenario

To start, I’ll demonstrate why we need to use hypothesis tests using an example.

A researcher is studying fuel expenditures for families and wants to determine if the monthly cost has changed since last year when the average was $260 per month. The researcher draws a random sample of 25 families and enters their monthly costs for this year into statistical software. You can download the CSV data file: FuelsCosts. Below are the descriptive statistics for this year.

We’ll build on this example to answer the research question and show how hypothesis tests work.

## Descriptive Statistics Alone Won’t Answer the Question

The researcher collected a random sample and found that this year’s sample mean (330.6) is greater than last year’s mean (260). Why perform a hypothesis test at all? We can see that this year’s mean is higher by $70! Isn’t that different?

Regrettably, the situation isn’t as clear as you might think because we’re analyzing a sample instead of the full population. There are huge benefits when working with samples because it is usually impossible to collect data from an entire population. However, the tradeoff for working with a manageable sample is that we need to account for sample error.

The sampling error is the gap between the sample statistic and the population parameter. For our example, the sample statistic is the sample mean, which is 330.6. The population parameter is μ, or mu, which is the average of the entire population. Unfortunately, the value of the population parameter is not only unknown but usually unknowable.

We obtained a sample mean of 330.6. However, it’s conceivable that, due to sampling error, the mean of the population might be only 260. If the researcher drew another random sample, the next sample mean might be closer to 260. It’s impossible to assess this possibility by looking at only the sample mean. Hypothesis testing is a form of inferential statistics that allows us to draw conclusions about an entire population based on a representative sample. We need to use a hypothesis test to determine the likelihood of obtaining our sample mean if the population mean is 260.

**Background information**: The Difference between Descriptive and Inferential Statistics and Populations, Parameters, and Samples in Inferential Statistics

## A Sampling Distribution Determines Whether Our Sample Mean is Unlikely

It is very unlikely for any sample mean to equal the population mean because of sample error. In our case, the sample mean of 330.6 is almost definitely not equal to the population mean for fuel expenditures.

If we could obtain a substantial number of random samples and calculate the sample mean for each sample, we’d observe a broad spectrum of sample means. We’d even be able to graph the distribution of sample means from this process.

This type of distribution is called a sampling distribution. You obtain a sampling distribution by drawing many random samples of the same size from the same population. Why the heck would we do this?

Because sampling distributions allow you to determine the likelihood of obtaining your sample statistic and they’re crucial for performing hypothesis tests.

Luckily, we don’t need to go to the trouble of collecting numerous random samples! We can estimate the sampling distribution using the t-distribution, our sample size, and the variability in our sample.

We want to find out if the average fuel expenditure this year (330.6) is different from last year (260). To answer this question, we’ll graph the sampling distribution based on the assumption that the mean fuel cost for the entire population has not changed and is still 260. In statistics, we call this lack of effect, or no change, the null hypothesis. We use the null hypothesis value as the basis of comparison for our observed sample value.

Sampling distributions and t-distributions are types of probability distributions. Learn more about probability distributions!

## Graphing our Sample Mean in the Context of the Sampling Distribution

The graph below shows which sample means are more likely and less likely if the population mean is 260. We can place our sample mean in this distribution. This larger context helps us see how unlikely our sample mean is if the null hypothesis is true (μ = 260).

The graph displays the estimated distribution of sample means. The most likely values are near 260 because the plot assumes that this is the true population mean. However, given random sampling error, it would not be surprising to observe sample means ranging from 167 to 352. If the population mean is still 260, our observed sample mean (330.6) isn’t the most likely value, but it’s not completely implausible either.

## The Role of Hypothesis Tests

The sampling distribution shows us that we are relatively unlikely to obtain a sample of 330.6 if the population mean is 260. Is our sample mean *so* unlikely that we can reject the notion that the population mean is 260?

In statistics, we call this rejecting the null hypothesis. If we reject the null for our example, the difference between the sample mean (330.6) and 260 is statistically significant. In other words, the sample data favor the hypothesis that the population average does *not* equal 260.

However, look at the sampling distribution chart again. Notice that there is no special location on the curve where you can definitively draw this conclusion. There is only a consistent decrease in the likelihood of observing sample means that are farther from the null hypothesis value. Where do we decide a sample mean is far away enough?

To answer this question, we’ll need more tools—hypothesis tests! The hypothesis testing procedure quantifies the unusualness of our sample with a probability and then compares it to an evidentiary standard. This process allows you to make an objective decision about the strength of the evidence.

We’re going to add the tools we need to make this decision to the graph—significance levels and p-values!

These tools allow us to test these two hypotheses:

- Null hypothesis: The population mean equals the null hypothesis mean (260).
- Alternative hypothesis: The population mean does not equal the null hypothesis mean (260).

**Related post**: Hypothesis Testing Overview

## What are Significance Levels (Alpha)?

A significance level, also known as alpha or α, is an evidentiary standard that a researcher sets before the study. It defines how strongly the sample evidence must contradict the null hypothesis before you can reject the null hypothesis for the entire population. The strength of the evidence is defined by the probability of rejecting a null hypothesis that is true. In other words, it is the probability that you say there is an effect when there is no effect.

For instance, a significance level of 0.05 signifies a 5% risk of deciding that an effect exists when it does not exist.

Lower significance levels require stronger sample evidence to be able to reject the null hypothesis. For example, to be statistically significant at the 0.01 significance level requires more substantial evidence than the 0.05 significance level. However, there is a tradeoff in hypothesis tests. Lower significance levels also reduce the power of a hypothesis test to detect a difference that does exist.

The technical nature of these types of questions can make your head spin. A picture can bring these ideas to life!

## Graphing Significance Levels as Critical Regions

On the probability distribution plot, the significance level defines how far the sample value must be from the null value before we can reject the null. The percentage of the area under the curve that is shaded equals the probability that the sample value will fall in those regions if the null hypothesis is correct.

To represent a significance level of 0.05, I’ll shade 5% of the distribution furthest from the null value.

The two shaded regions in the graph are equidistant from the central value of the null hypothesis. Each region has a probability of 0.025, which sums to our desired total of 0.05. These shaded areas are called the critical region for a two-tailed hypothesis test.

The critical region defines sample values that are improbable enough to warrant rejecting the null hypothesis. If the null hypothesis is correct and the population mean is 260, random samples (n=25) from this population have means that fall in the critical region 5% of the time.

Our sample mean is statistically significant at the 0.05 level because it falls in the critical region.

## Comparing Significance Levels

Let’s redo this hypothesis test using the other common significance level of 0.01 to see how it compares.

This time the sum of the two shaded regions equals our new significance level of 0.01. The mean of our sample does not fall within with the critical region. Consequently, we fail to reject the null hypothesis. We have the same exact sample data, the same difference between the sample mean and the null hypothesis value, but a different test result.

What happened? By specifying a lower significance level, we set a higher bar for the sample evidence. As the graph shows, lower significance levels move the critical regions further away from the null value. Consequently, lower significance levels require more extreme sample means to be statistically significant.

You must set the significance level before conducting a study. You don’t want the temptation of choosing a level after the study that yields significant results. The only reason I compared the two significance levels was to illustrate the effects and explain the differing results.

The graphical version of the 1-sample t-test we created allows us to determine statistical significance without assessing the P value. Typically, you need to compare the P value to the significance level to make this determination.

## What Are P values?

P values are the probability that a sample will have an effect at least as extreme as the effect observed in your sample *if* the null hypothesis is correct.

This tortuous, technical definition for P values can make your head spin. Let’s graph it!

First, we need to calculate the effect that is present in our sample. The effect is the distance between the sample value and null value: 330.6 – 260 = 70.6. Next, I’ll shade the regions on both sides of the distribution that are at least as far away as 70.6 from the null (260 +/- 70.6). This process graphs the probability of observing a sample mean at least as extreme as our sample mean.

The total probability of the two shaded regions is 0.03112. If the null hypothesis value (260) is true and you drew many random samples, you’d expect sample means to fall in the shaded regions about 3.1% of the time. In other words, you will observe sample effects at least as large as 70.6 about 3.1% of the time if the null is true. That’s the P value!

## Using P values and Significance Levels Together

If your P value is less than or equal to your alpha level, reject the null hypothesis.

The P value results are consistent with our graphical representation. The P value of 0.03112 is significant at the alpha level of 0.05 but not 0.01. Again, in practice, you pick one significance level before the experiment and stick with it!

Using the significance level of 0.05, the sample effect is statistically significant. Our data support the alternative hypothesis, which states that the population mean doesn’t equal 260. We can conclude that mean fuel expenditures have increased since last year.

P values are very frequently misinterpreted as the probability of rejecting a null hypothesis that is actually true. This interpretation is wrong! To understand why, please read my post: How to Interpret P-values Correctly.

## Discussion about Statistically Significant Results

Hypothesis tests determine whether your sample data provide sufficient evidence to reject the null hypothesis for the entire population. To perform this test, the procedure compares your sample statistic to the null value and determines whether it is sufficiently rare. “Sufficiently rare” is defined in a hypothesis test by:

- Assuming that the null hypothesis is true—the graphs center on the null value.
- The significance (alpha) level—how far out from the null value is the critical region?
- The sample statistic—is it within the critical region?

There is no special significance level that correctly determines which studies have real population effects 100% of the time. The traditional significance levels of 0.05 and 0.01 are attempts to manage the tradeoff between having a low probability of rejecting a true null hypothesis and having adequate power to detect an effect if one actually exists.

The significance level is the rate at which you incorrectly reject null hypotheses that are actually true (type I error). For example, for all studies that use a significance level of 0.05 and the null hypothesis is correct, you can expect 5% of them to have sample statistics that fall in the critical region. When this error occurs, you aren’t aware that the null hypothesis is correct, but you’ll reject it because the p-value is less than 0.05.

This error does not indicate that the researcher made a mistake. As the graphs show, you can observe extreme sample statistics due to sample error alone. It’s the luck of the draw!

**Related post**: Types of Errors in Hypothesis Testing

Hypothesis tests are crucial when you want to use sample data to make conclusions about a population because these tests account for sample error. Using significance levels and P values to determine when to reject the null hypothesis improves the probability that you will draw the correct conclusion.

Keep in mind that statistical significance doesn’t necessarily mean that the effect is important in a practical, real-world sense. For more information, read my post about Practical vs. Statistical Significance.

If you like this post, read the companion post: How Hypothesis Tests Work: Confidence Intervals and Confidence Levels.

You can also read my posts about how t-tests work and how the F-test works in ANOVA.

Khalid Kamal Rumi says

Dear Sir, please confirm if there is an inadvertent mistake in interpretation as, “We can conclude that mean fuel expenditures have increased since last year.”

Our null hypothesis is =260. If found significant, it implies two possibilities – both increase and decrease.

Please let us know if we are mistaken here.

Many Thanks!

Jim Frost says

Hi Khalid, the null hypothesis as it is defined for this test represents the mean monthly expenditure for the previous year (260). The mean expenditure for the current year is 330.6 whereas it was 260 for the previous year. Consequently, the mean has increased from 260 to 330.7 over the course of a year. The p-value indicates that this increase is statistically significant. This finding does not suggest both an increase and a decrease–just an increase. Keep in mind that a significant result prompts us to reject the null hypothesis. So, we reject the null that the mean equals 260.

Let’s explore the other possible findings to be sure that this makes sense. Suppose the sample mean had been closer to 260 and the p-value was greater than the significance level, those results would indicate that the results were not statistically significant. The conclusion that we’d draw is that we have insufficient evidence to conclude that mean fuel expenditures have changed since the previous year.

If the sample mean was less than the null hypothesis (260) and if the p-value is statistically significant, we’d concluded that mean fuel expenditures have decreased and that this decrease is statistically significant.

When you interpret the results, you have to be sure to understand what the null hypothesis represents. In this case, it represents the mean monthly expenditure for the previous year and we’re comparing this year’s mean to it–hence our sample suggests an increase.

Georgi Georgiev says

Hi Jim,

A quick question regarding your use of two-tailed critical regions in the article above: why? I mean, what is a real-world scenario that would warrant a two-tailed test of any kind (z, t, etc.)? And if there are none, why keep using the two-tailed scenario as an example, instead of the one-tailed which is both more intuitive and applicable to most if not all practical situations. Just curious, as one person attempting to educate people on stats to another (my take on the one vs. two-tailed tests can be seen here: http://blog.analytics-toolkit.com/2017/one-tailed-two-tailed-tests-significance-ab-testing/ )

Thanks,

Georgi

Jim Frost says

Hi Georgi,

There’s the appropriate time and place for both one-tailed and two-tailed tests. I plan to write a post on this issue specifically, so I’ll keep my comments here brief.

So much of statistics is context sensitive. People often want concrete rules for how to do things in statistics but that’s often hard to provide because the answer depends on the context, goals, etc. The question of whether to use a one-tailed or two-tailed test falls firmly in this category of it depends.

I did read the article you wrote. I’ll say that I can see how in the context of A/B testing specifically there might be a propensity to use one-tailed tests. You only care about improvements. There’s probably not too much downside in only caring about one direction. In fact, in a post where I compare different tests and different options, I suggest using a one-tailed test for a similar type of casing involving defects. So, I’m onboard with the idea of using one-tailed tests when they’re appropriate. However, I do think that two-tailed tests should be considered the default choice and that you need good reasons to move to a one-tailed test. Again, your A/B testing area

mightsupply those reasons on a regular basis, but I can’t make that a blanket statement for all research areas.I think your article mischaracterizes some of the pros and cons of both types of tests. Just a couple of for instances. In a two-tailed test, you don’t have to take the same action regardless of which direction the results are significant (example below). And, yes, you

candetermine the direction of the effect in a two-tailed test. You simply look at the estimated effect. Is it positive or negative?On the other hand, I do agree that one-tailed tests don’t increase the overall Type I error. However, there is a big caveat for that. In a two-tailed test, the Type I error rate is evenly split in both tails. For a one-tailed test, the overall Type I error rate does not change, but the Type I errors are redistributed so they all occur in the direction that you are interested in rather than being split between the positive and negative directions. In other words, you’ll have twice as many Type I errors in the specific direction that you’re interested in. That’s not good.

My big concerns with one-tailed tests are that it makes it easier to obtain the results that you want to obtain. And, all of the Type I errors (false positives) are in that direction too. It’s just not a good combination.

To answer your question about when you might want to use two-tailed tests, there are plenty of reasons. For one, you might want to avoid the situation I describe above. Additionally, in a lot of scientific research, the researchers truly are interested in detecting effects in either direction for the sake of science. Even in cases with a practical application, you might want to learn about effects in either direction.

For example, I was involved in a research study that looked at the effects of an exercise intervention on bone density. The idea was that it might be a good way to prevent osteoporosis. I used a two-tailed test. Obviously, we’re hoping that there was positive effect. However, we’d be very interested in knowing whether there was a negative effect too. And, this illustrates how you can have different actions based on both directions. If there was a positive effect, you can recommend that as a good approach and try to promote its use. If there’s a negative effect, you’d issue a warning to not do that intervention. You have the potential for learning both what is good and what is bad. The extra false-positives would’ve cause problems because we’d think that there’d be health benefits for participants when those benefits don’t actually exist. Also, if we had performed only a one-tailed test and didn’t obtain significant results, we’d learn that it wasn’t a positive effect, but we would not know whether it was actually detrimental or not.

Here’s when I’d say it’s OK to use a one-tailed test. Consider a one-tailed test when you’re in situation where you truly only need to know whether an effect exists in one direction, and the extra Type I errors in that direction are an acceptable risk (false positives don’t cause problems), and there’s no benefit in determining whether an effect exists in the other direction. Those conditions really restrict when one-tailed tests are the best choice. Again, those restrictions might not be relevant for your specific field, but as for the usage of statistics as a whole, they’re absolutely crucial to consider.

On the other hand, according to this article, two-tailed tests might be important in A/B testing!