Significance levels in statistics are a crucial component of hypothesis testing. However, unlike other values in your statistical output, the significance level is not something that statistical software calculates. Instead, you choose the significance level. Have you ever wondered why?

In this post, I’ll explain the significance level conceptually, why you choose its value, and how to choose a good value. Statisticians also refer to the significance level as alpha (α).

First, it’s crucial to remember that hypothesis tests are inferential procedures. These tests determine whether your sample evidence is strong enough to suggest that an effect exists in an entire population. Suppose you’re comparing the means of two groups. Your sample data show that there is a difference between those means. Does the sample difference represent a difference between the two populations? Or, is that difference likely due to random sampling error? That’s where hypothesis tests come in!

Your sample data provide evidence for an effect. The significance level is a measure of how strong the sample evidence must be before determining the results are statistically significant. Because we’re talking about evidence, let’s look at a courtroom analogy.

**Related posts**: Hypothesis Test Overview and Difference between Descriptive and Inferential Statistics

## Evidentiary Standards in the Courtroom

Criminal cases and civil cases vary greatly, but both require a minimum amount of evidence to convince a judge or jury that a claim against the defendant is true. Prosecutors in criminal cases must prove the defendant is guilty “beyond a reasonable doubt,” whereas plaintiffs in a civil case must present a “preponderance of the evidence.” These terms are evidentiary standards that reflect the amount of evidence that civil and criminal cases require.

For civil cases, most scholars define a preponderance of evidence as meaning that at least 51% of the evidence shown supports the plaintiff’s claim. However, criminal cases are more severe and require stronger evidence, which must go beyond a reasonable doubt. Most scholars define that evidentiary standard as being 90%, 95%, or even 99% sure that the defendant is guilty.

In statistics, the significance level is the evidentiary standard. For researchers to successfully make the case that the effect exists in the population, the sample must contain a sufficient amount of evidence.

In court cases, you have evidentiary standards because you don’t want to convict innocent people.

In hypothesis tests, we have the significance level because we don’t want to claim that an effect or relationship exists when it does not exist.

## Significance Levels as an Evidentiary Standard

In statistics, the significance level defines the strength of evidence in probabilistic terms. Specifically, alpha represents the probability that tests will produce statistically significant results when the null hypothesis is correct. Rejecting a true null hypothesis is a Type I error. And, the significance level equals the Type I error rate. You can think of this error rate as the probability of a false positive. The test results lead you to believe that an effect exists when it actually does not exist.

Obviously, when the null hypothesis is correct, we want a low probability that hypothesis tests will produce statistically significant results. For example, if alpha is 0.05, your analysis has a 5% chance of producing a significant result when the null hypothesis is correct.
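You can see this property in action with a quick simulation. The sketch below (an illustration assuming Python with NumPy and SciPy; the population mean, standard deviation, and sample sizes are made-up values) draws both groups from the same population, so the null hypothesis is true by construction. Roughly 5% of the tests should still come out significant at an alpha of 0.05, purely from random sampling error.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_tests = 10_000

# Both samples come from the same population, so the null hypothesis is true.
false_positives = 0
for _ in range(n_tests):
    group_a = rng.normal(loc=100, scale=15, size=30)
    group_b = rng.normal(loc=100, scale=15, size=30)
    _, p_value = stats.ttest_ind(group_a, group_b)
    if p_value <= alpha:
        false_positives += 1

# The observed false positive rate should hover around alpha (0.05).
print(false_positives / n_tests)
```

Every one of those "significant" results is a false positive, which is exactly why alpha is the Type I error rate.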

Just as the evidentiary standard varies by the type of court case, you can set the significance level for a hypothesis test depending on the consequences of a false positive. By changing alpha, you increase or decrease the amount of evidence you require in the sample to conclude that the effect exists in the population.

## Changing Significance Levels

Because 0.05 is the standard alpha, we’ll start by adjusting away from that value. Typically, you’ll need a good reason to change the significance level to something other than 0.05. Also, note the inverse relationship between alpha and the amount of required evidence. For instance, increasing the significance level from 0.05 to 0.10 lowers the evidentiary standard. Conversely, decreasing it from 0.05 to 0.01 increases the standard. Let’s look at why you would consider changing alpha and how it affects your hypothesis test.

### Increasing the Significance Level

Imagine you’re testing the strength of party balloons. You’ll use the test results to determine which brand of balloons to buy. A false positive here leads you to buy balloons that are not stronger. The drawbacks of a false positive are very low. Consequently, you could consider lessening the amount of evidence required by changing the significance level to 0.10. Because this change decreases the amount of required evidence, it makes your test more sensitive to detecting differences, but it also increases the chance of a false positive from 5% to 10%.

### Decreasing the Significance Level

Conversely, imagine you’re testing the strength of fabric for hot air balloons. A false positive here is very risky because lives are on the line! You want to be very confident that the material from one manufacturer is stronger than the other. In this case, you should increase the amount of evidence required by changing alpha to 0.01. Because this change increases the amount of required evidence, it makes your test less sensitive to detecting differences, but it decreases the chance of a false positive from 5% to 1%.

It’s all about the tradeoff between sensitivity and false positives!
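One way to see this tradeoff numerically is through a power calculation. The sketch below (assuming Python with the statsmodels package; the effect size of 0.5 and the sample size of 30 per group are made-up example values) computes the power of a two-sample t-test at the three alpha levels discussed above. Raising alpha buys you more power to detect a real difference, at the price of a higher false positive rate.

```python
from statsmodels.stats.power import TTestIndPower

# Power of a two-sample t-test for a medium effect (Cohen's d = 0.5)
# with 30 observations per group, at three common significance levels.
analysis = TTestIndPower()
for alpha in (0.01, 0.05, 0.10):
    power = analysis.solve_power(effect_size=0.5, nobs1=30, alpha=alpha)
    print(f"alpha = {alpha:.2f} -> power = {power:.2f}")
```

As alpha increases from 0.01 to 0.10, the computed power rises, which mirrors the balloon examples: a stricter standard protects against false positives but makes the test less sensitive.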

In conclusion, a significance level of 0.05 is the most common. However, it’s the analyst’s responsibility to determine how much evidence to require for concluding that an effect exists. How problematic is a false positive? There is no single correct answer for all circumstances. Consequently, you need to choose the significance level!

While the significance level indicates the amount of evidence that you require, the p-value represents the strength of the evidence that exists in your sample. When your p-value is less than or equal to the significance level, the strength of the sample evidence meets or exceeds your evidentiary standard for rejecting the null hypothesis and concluding that the effect exists.
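The decision rule itself is simple enough to express in a few lines of code. In this sketch (the p-value of 0.03 is an arbitrary example), notice that the same p-value can meet one evidentiary standard while failing a stricter one:

```python
def decide(p_value, alpha=0.05):
    """Compare the sample evidence (p-value) to the evidentiary standard (alpha)."""
    if p_value <= alpha:
        return "Reject the null hypothesis: the effect is statistically significant."
    return "Fail to reject the null hypothesis: the evidence is insufficient."

print(decide(0.03))               # significant at the 0.05 standard
print(decide(0.03, alpha=0.01))   # not significant at the stricter 0.01 standard
```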

While this post looks at significance levels from a conceptual standpoint, learn about the significance level and p-values using a graphical representation of how hypothesis tests work. Additionally, my post about the types of errors in hypothesis testing takes a deeper look at both Type I and Type II errors, and the tradeoffs between them.

Robin Oyando says

Hi Jim,

Thank you for your educative piece on significance levels in statistics. Please comment on my question: I understand (I hope rightly so) that the 1% level, relative to the 5% significance level, is a stricter level of significance and hence allows very little room for committing a Type I error. Would it therefore be right to conclude that if you fail to reject the null hypothesis at the 1% level of significance, then you will always fail to reject it at the 5% level of significance too? If not, please elaborate on how the chosen significance levels (1%, 5%, and 10%) relate to each other in rejecting or failing to reject the null hypothesis. Many thanks

Jim Frost says

Hi Robin,

Yes, you’re correct that a 1% level is stricter. The significance level is the Type I error rate. So, a lower significance level (e.g., 1%) has, by definition, a lower Type I error rate. And, yes, it is possible to reject at one level, say 5%, and not reject at a lower level (1%). I show an example of this in my post about p-values and significance levels. It’s important to choose your significance level before conducting the study and then stick with it. Don’t change the significance level to obtain significant results.

Madhav says

Hi, thanks for the wonderful content. I would appreciate your help in understanding this better. Imagining (tried to avoid the scary word hypothetical) a scenario with p at 0.07 and α at 0.05. Will that still mean: “If the medicine has no effect in the population as a whole, 7% of studies will obtain the effect observed in your sample, or larger, because of random sample error”? And in that case, what would be the extrapolated (although non-advisable) error rate, similar to the example quoted below?

“While the exact error rate varies based on different assumptions, the values below use run-of-the-mill assumptions…”

Regards

Jim Frost says

Hi Madhav,

Yes, that would be the correct way to interpret a p-value of 0.07.

As for the error rate based on run of the mill assumptions, I don’t know for sure. I don’t have the article handy. But, I’d guess around 30%. However, it really depends on the prevalence of true effects. That’s essentially the overall probability of the alternative being true at the beginning of your study. And, that’s hard to say, but you can look at the significance of similar studies. But, you’d need to know how many were significant and not significant. Usually, we only hear about the significant studies. Bayesian statistics uses that approach. If your p-value is 0.07 but the alternative hypothesis is unlikely, the error rate could be much higher. So, take it with a grain of salt. The key point is that a p-value of near 0.05 (plus a little bit or minus a little bit) really is not strong evidence. 0.07 is a bit weaker. You really shouldn’t try to take it much further than that.

Jae Kim says

Hi

Interesting conversations on the choice of significance level.

May I introduce my paper: Choosing the level of significance: decision theoretic approach

https://onlinelibrary.wiley.com/doi/abs/10.1111/abac.12172

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2652773

I also have an R package, OptSig, freely available from

https://cran.r-project.org/web/packages/OptSig/index.html

You may choose the 1%, 5%, or 10% level based on risk and power, but these levels have no theoretical justification at all and are still completely arbitrary. My paper proposes that the optimal level be obtained by minimizing expected loss.

Jerry says

When you say changing your alpha from .05 to .10 decreases the required evidence, that is literally true. It decreases the sample size required in order to detect the difference as significant; and/or increases the power of the study to detect this difference as statistically significant. I like to think in terms of all three: power, alpha, and sample size related to your effect size. The free software G*Power does a good job of determining these. That software also lets you graph, for example, statistical power as a function of sample size, but it isn’t always a smooth plot even though the axes are continuously and evenly scaled. For some statistical tests, that line is jerky. I think it has something to do with the shape of the distribution curve of something used in the calculation, but I’m embarrassed to say that I can’t recall what that is.

Vishnu Vinjamuri says

Hi Jim,

If the concept of choosing the confidence level flexibly is integrated into Six Sigma, that will be even better and will lead to higher refinement of the outcome. I hope I am thinking in the right direction.

Stan Alekman says

I consider “detectable difference” when designing a hypothesis study for inference.

If p is greater than 0.10, any difference has been overwhelmed by the noise.

If p is smaller than 0.10, there is a potential difference.

If p is smaller than 0.01, there is a detectable difference that will convince a skeptic.

Jim Frost says

Hi Stan,

As always, it’s great to hear from you!

In terms of p-values and the strength of the evidence they represent against the null hypothesis, I generally agree with your summary.

Personally, I consider p-values around 0.05 to represent a potential difference. The strength of evidence from a single study near 0.05 isn’t particularly strong by itself–but it probably warrants follow up. That’s based on simulation studies, which I discuss towards the end of my post about interpreting p-values and a different post about an empirical reproducibility study.

I agree that for p-values less than 0.01, the evidence is getting stronger. While a very low p-value doesn’t guarantee that the effect size is practically significant, it is pretty strong evidence.

Ratnadeep Sharma says

I think there is something wrong in the description of the Type II error. “Rejecting a true null hypothesis is a Type I error” would be a better explanation. Correct me if I am wrong.

Jim Frost says

Hi Ratnadeep, Yes, you’re entirely correct! Thanks for pointing that out. I accidentally flipped the numbers around. I’ve made the correction.