Significance levels in statistics are a crucial component of hypothesis testing. However, unlike other values in your statistical output, the significance level is not something that statistical software calculates. Instead, you choose the significance level. Have you ever wondered why?
In this post, I’ll explain the significance level conceptually, why you choose its value, and how to choose a good value. Statisticians also refer to the significance level as alpha (α).
First, it’s crucial to remember that hypothesis tests are inferential procedures. These tests determine whether your sample evidence is strong enough to suggest that an effect exists in an entire population. Suppose you’re comparing the means of two groups. Your sample data show that there is a difference between those means. Does the sample difference represent a difference between the two populations? Or, is that difference likely due to random sampling error? That’s where hypothesis tests come in!
Your sample data provide evidence for an effect. The significance level is a measure of how strong the sample evidence must be before determining the results are statistically significant. Because we’re talking about evidence, let’s look at a courtroom analogy.
Related posts: Hypothesis Test Overview and Difference between Descriptive and Inferential Statistics
Evidentiary Standards in the Courtroom
Criminal cases and civil cases vary greatly, but both require a minimum amount of evidence to convince a judge or jury that a claim against the defendant is true. Prosecutors in criminal cases must prove the defendant is guilty “beyond a reasonable doubt,” whereas plaintiffs in civil cases must present a “preponderance of the evidence.” These terms are evidentiary standards that reflect the amount of evidence that civil and criminal cases require.
For civil cases, most scholars define a preponderance of the evidence as meaning that at least 51% of the evidence presented supports the plaintiff’s claim. Criminal cases are more severe and require stronger evidence, which must go beyond a reasonable doubt. Most scholars define that evidentiary standard as being 90%, 95%, or even 99% sure that the defendant is guilty.
In statistics, the significance level is the evidentiary standard. For researchers to successfully make the case that the effect exists in the population, the sample must contain a sufficient amount of evidence.
In court cases, you have evidentiary standards because you don’t want to convict innocent people.
In hypothesis tests, we have the significance level because we don’t want to claim that an effect or relationship exists when it does not exist.
Significance Levels as an Evidentiary Standard
In statistics, the significance level defines the strength of evidence in probabilistic terms. Specifically, alpha represents the probability that tests will produce statistically significant results when the null hypothesis is correct. Rejecting a true null hypothesis is a Type I error. And, the significance level equals the Type I error rate. You can think of this error rate as the probability of a false positive. The test results lead you to believe that an effect exists when it actually does not exist.
Obviously, when the null hypothesis is correct, we want a low probability that hypothesis tests will produce statistically significant results. For example, if alpha is 0.05, your analysis has a 5% chance of producing a significant result when the null hypothesis is correct.
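To make that concrete, here is a minimal simulation sketch (all of the numbers are hypothetical): it repeatedly draws two samples from the same population, so the null hypothesis is true, and counts how often a t-test at alpha = 0.05 comes out significant anyway.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_simulations = 10_000

false_positives = 0
for _ in range(n_simulations):
    # Both groups come from the same population, so the null hypothesis is true.
    group_a = rng.normal(loc=100, scale=15, size=30)
    group_b = rng.normal(loc=100, scale=15, size=30)
    if stats.ttest_ind(group_a, group_b).pvalue <= alpha:
        false_positives += 1

# The observed false positive rate hovers around alpha (~0.05).
print(f"False positive rate: {false_positives / n_simulations:.3f}")
```

Rerunning the sketch with alpha = 0.10 roughly doubles the rate of false positives.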
Just as the evidentiary standard varies by the type of court case, you can set the significance level for a hypothesis test depending on the consequences of a false positive. By changing alpha, you increase or decrease the amount of evidence you require in the sample to conclude that the effect exists in the population.
Learn more about Statistical Significance: Definition & Meaning.
Changing Significance Levels
Because 0.05 is the standard alpha, we’ll start by adjusting away from that value. Typically, you’ll need a good reason to change the significance level to something other than 0.05. Also, note the inverse relationship between alpha and the amount of required evidence. For instance, increasing the significance level from 0.05 to 0.10 lowers the evidentiary standard. Conversely, decreasing it from 0.05 to 0.01 increases the standard. Let’s look at why you would consider changing alpha and how it affects your hypothesis test.
Increasing the Significance Level
Imagine you’re testing the strength of party balloons. You’ll use the test results to determine which brand of balloons to buy. A false positive here leads you to buy balloons that are not stronger. The drawbacks of a false positive are very low. Consequently, you could consider lessening the amount of evidence required by changing the significance level to 0.10. Because this change decreases the amount of required evidence, it makes your test more sensitive to detecting differences, but it also increases the chance of a false positive from 5% to 10%.
Decreasing the Significance Level
Conversely, imagine you’re testing the strength of fabric for hot air balloons. A false positive here is very risky because lives are on the line! You want to be very confident that the material from one manufacturer is stronger than the other. In this case, you should increase the amount of evidence required by changing alpha to 0.01. Because this change increases the amount of required evidence, it makes your test less sensitive to detecting differences, but it decreases the chance of a false positive from 5% to 1%.
It’s all about the tradeoff between sensitivity and false positives!
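One way to see that tradeoff numerically is a power calculation. The sketch below uses statsmodels; the effect size and sample size are assumed placeholder values, not numbers from the balloon examples.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Assumed scenario: standardized effect size of 0.5 and 50 observations per group.
for alpha in (0.01, 0.05, 0.10):
    power = analysis.power(effect_size=0.5, nobs1=50, alpha=alpha)
    print(f"alpha = {alpha:.2f}  false positive rate = {alpha:.0%}  power = {power:.2f}")
```

As alpha increases, the power (sensitivity) rises, but so does the false positive rate.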
In conclusion, a significance level of 0.05 is the most common. However, it’s the analyst’s responsibility to determine how much evidence to require for concluding that an effect exists. How problematic is a false positive? There is no single correct answer for all circumstances. Consequently, you need to choose the significance level!
While the significance level indicates the amount of evidence that you require, the p-value represents the strength of the evidence that exists in your sample. When your p-value is less than or equal to the significance level, the strength of the sample evidence meets or exceeds your evidentiary standard for rejecting the null hypothesis and concluding that the effect exists.
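In code, that comparison is a simple decision rule. Here is a minimal sketch with hypothetical measurements and a two-sample t-test; any test that produces a p-value works the same way.

```python
from scipy import stats

alpha = 0.05  # chosen before collecting the data

group_a = [102, 98, 105, 110, 99, 104, 101, 107]  # hypothetical measurements
group_b = [95, 97, 93, 100, 96, 94, 98, 92]

p_value = stats.ttest_ind(group_a, group_b).pvalue
if p_value <= alpha:
    print(f"p = {p_value:.4f} <= alpha: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} > alpha: fail to reject the null hypothesis")
```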
While this post looks at significance levels from a conceptual standpoint, learn about the significance level and p-values using a graphical representation of how hypothesis tests work. Additionally, my post about the types of errors in hypothesis testing takes a deeper look at both Type I and Type II errors, and the tradeoffs between them.
Hi Jim! Greetings of the day!
Can we use a 95% CI along with 10% absolute precision in order to calculate the sample size?
For example: A two-stage cluster sampling technique was executed to select the herd (primary unit) and individual dairy cattle (secondary unit). The number of herds required for the study was determined using the cluster formula described by Ferrari et al. (2016), with assumed expected prevalences of BVD and IBR of 11.7% (Asmare et al., 2013) and 81.8% (Sibhat et al., 2018), respectively, and 10% absolute precision with a 95% confidence interval (CI).
Is this correct?
I actually did it to reduce my sample size and minimize the logistics.
Hope you will answer soon!
With regards
Hi Jim,
Before I read this post, it was hard for me to understand why we need significance levels and p-values.
Thanks for your friendly guide!
My old understanding of significance levels was that they are the maximum probabilities of Type I errors. With Type I errors being the most unwanted, I usually think of significance levels as measures of how aggressive we are. The higher the significance level, the more tolerance there is for a Type I error, and the more likely we are to reject the null hypothesis by mistake, because we are being aggressive.
But your evidentiary-standard interpretation also seems reasonable.
What do you think?
Look forward to your reply.
Hi Yang,
They are slippery concepts for sure!
The significance level = the Type I error rate.
You’ve got the right idea. I guess what I’d add is: why is a higher significance level more aggressive? It’s because we’re willing to make a decision based on weaker evidence. I like to think of the significance level as an evidentiary standard. How much sample evidence do you need to decide an effect exists in the population? If you are willing to accept weaker evidence, then of course it’ll be easier for you to conclude that an effect exists, and it’ll also be more likely that you’re wrong.
That’s what you’re saying, but thinking about it in terms of the amount of evidence required to draw that conclusion helps clarify the effects of raising or lowering the significance level. If you require stronger evidence (lower significance level), it’s harder to reject the null, but when you do, you’re more likely to be correct. If you require weaker evidence (higher significance level), it’s easier to reject the null, but it increases the chances that you’ll be incorrect.
I hope that helps!
Hi, Jim! Thank you so much for the article. Do you have literature or references that explain the reasons for choosing a confidence level of 90% in quantitative or social research? My lecturer told me I have to include theoretical reasons from the literature for my selection of a 90% confidence level in quantitative research. Thank you. Best regards.
Hello Sir Jim!
If I’m going to decrease my significance level alpha from 0.10 to 0.05, do I have to re-compute my sample size? Note: the study has already been conducted, and my panel suddenly wants my alpha level to be less than 0.1. You’ve said that if I lower my alpha level, the analysis will have lower statistical power. Does that mean the results will be questionable?
Hi,
Using a significance level of 0.10 is unusual. I’m not surprised they want to lower it!
I’m not sure at what point you’re at for your study. Are you at the planning stages and haven’t collected data yet? If so, yes, you should do a power analysis to estimate a good sample size. You’ll need to include the significance level in the power analysis. Using a lower significance level will cause your sample size to increase to maintain a constant level of statistical power.
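As a rough sketch of that power analysis (the effect size and target power below are assumed placeholders, not values from your study), statsmodels can solve for the per-group sample size at each significance level:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Assumed inputs: standardized effect size of 0.5 and a target power of 80%.
for alpha in (0.10, 0.05):
    n = analysis.solve_power(effect_size=0.5, alpha=alpha, power=0.80)
    print(f"alpha = {alpha:.2f}: about {n:.0f} observations per group")
# Lowering alpha from 0.10 to 0.05 raises the required sample size
# (from roughly 50 to roughly 64 per group in this scenario).
```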
If you already have your data and are just deciding your significance level (which you should’ve done before collecting any data) before analyzing the data, lowering the significance level will reduce the statistical power of the analysis. However, it doesn’t make the results questionable. You are balancing a set of risks. Specifically, you’re reducing the risk of a Type I error but increasing the risk of a Type II error. That’s all a normal part of statistical analysis. Read my post about The Types of Error in Hypothesis Testing to understand that tradeoff. It’s a balancing act!
Using a significance level of 0.05 is standard and almost always a good decision. Don’t worry about it messing up your results. 🙂
Hi Jim and Others,
I find the discussions on choosing the significance level interesting, and I would like to inform you of my recent work on this issue:
1. Kim, J. H., Choi, I., 2021, Choosing the Level of Significance: A Decision-Theoretic Approach, Abacus: A Journal of Accounting, Finance and Business Studies, 57(1), 27-71.
2. Kim, J. H., 2020, Decision-theoretic hypothesis testing: A primer with R package OptSig, The American Statistician, 74(4), 370-379.
If you have any questions, please feel free to contact me.
Jae Kim
Yup, this helps a lot! Thank you, Sir Jim!
You’re very welcome! 🙂
Would changing the alpha level in a study that has already been conducted affect the sample size?
Or can you simply change alpha from 0.1 to 0.05 as if nothing happened?
What steps should I consider with this kind of scenario? Thank you!
Hi Maverick,
You’re free to change the significance level (alpha); however, you should have good reasons, as I discuss in this article. There are implications for your choice. If you increase alpha, your analysis has more statistical power to detect findings, but you’ll also have more false positives. On the other hand, if you lower alpha, your analysis has lower statistical power, but there will be fewer false positives. Read this post to learn about all of that in more detail.
As for sample size, well, there are several factors involved in determining a good sample size. But, if you increase alpha to 0.10, you could use a smaller sample size to detect a specific effect size while maintaining statistical power. However, as I mention, you do that at the risk of increasing false positives. And, generally speaking, 0.10 is considered too high and would often not be taken seriously because it represents a weak standard of evidence. Again, there are implications to such a decision.
I hope that helps!
Jim, thank you for your Q&A on statistical p-value questions. I know that significance levels are set by the statistician. My question is whether a p-value of, say, 0.103, when rounded to the second decimal place, is 0.10 and significant at the 10% level. Would you agree with this, i.e., would the rounding work in this example?
Hi Gary,
There are really two issues related to your question, even though you’re asking about only one of them. Let me start with the one you’re not asking about.
Is a significance level of 0.10 ok to use? It’s not the standard level of 0.05. If you were to use a significance level of 0.05, then a p-value of 0.049 would be significant. However, a p-value in that range does not really provide strong evidence that an effect exists in the population. In other words, there’s a relatively high chance of a false positive even in that p-value region. For more details about that issue, I recommend reading my post about interpreting p-values. Pay particular attention to the table near the end.
You can imagine that if a p-value around 0.049 is weak evidence, then a p-value near 0.10, plus or minus a little bit, is extremely weak! I’d only use a significance level of 0.10 if there’s mainly just an upside to detecting an effect but no downside if it’s a false positive. Be aware that while all studies that are significant at the 0.10 significance level will have a false positive rate of 10%, a study with a p-value near the cutoff value will have a higher false positive rate than that (Read my link above).
Now, on to your question. If you’re already using a significance level that high (allowing weak evidence), there’s probably little difference between 0.103 and 0.10. You’ve already accepted the high chance of a false positive. So, in practice, there’s very little difference. However, you might well run into strong opinions about the matter. Some statisticians will say, “No way! It’s a sharp cutoff!” However, I have seen wording about “nearly significant” and “almost significant” even in peer-reviewed journals, and yours would fit that. That said, I’m guessing your study is not for a peer-reviewed journal because journals typically don’t accept significance levels of 0.10.
So, for your specific case, if you’ve made a considered decision that a significance level of 0.10 is appropriate (see what I wrote above), then I don’t see a problem with 0.103. Just be aware that you already have a relatively high chance of a false positive.
Finally, I hope that you didn’t choose the significance level based on the p-value that you obtained. You should choose the significance level before you conduct your study, based on the pros and cons of Type I and Type II errors. Cherry-picking a significance level based on your results can cause problems!
I hope this helps!
Hi Jim (if I may),
It’s been a while since I worked with many of these statistical concepts, and reading through your brief guides has been really helpful! I think they’ll get me back up to speed in no time.
– Martijn
Hi Martijn,
I’m so happy to hear that my website has been helpful in getting you up to speed! 🙂 You might consider my Introduction to Statistics ebook (now also in print) for an even more thorough introduction! A free sample is available in My Store.
Happy reading!
Jim
Thanks for the reply
I’d just like to know: if we have a p-value greater than 0.05, let’s say p equal to 0.35, we fail to reject the null hypothesis. Does this mean we failed to reject the null hypothesis at a (1 − 0.35) = 65% confidence level? Is a 65% confidence level significant?
Hi, as I mentioned, the confidence level is something that you set at the beginning of the study when you determine what significance level you will use. You do not change the confidence level based on your results.
In your example, you’re choosing a significance level of 0.05, which corresponds to using a confidence level of 95%. Those values are now fixed for your study. You don’t change them based on the results.
Then you analyze the data and your p-value of 0.35 is not significant. If you look at the CI with a confidence level of 95%, you’ll notice that it contains the null hypothesis value for your test. When the CI contains the null hypothesis value, that’s another way of determining that your results are not significant. If you use the corresponding p-values and CIs, those results will always agree. Read my article about confidence intervals to learn about that.
You don’t determine the confidence level at which you failed to reject the null. Just report the exact p-value with your findings to present that type of information.
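Here’s a quick numeric sketch of that agreement (the data are hypothetical): a one-sample t-test against a null value of 5.0, next to a 95% confidence interval computed from the t distribution.

```python
import numpy as np
from scipy import stats

data = np.array([5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 5.1, 4.7])  # hypothetical sample
null_value = 5.0

p_value = stats.ttest_1samp(data, popmean=null_value).pvalue

# 95% confidence interval for the mean: mean +/- t critical value * standard error
margin = stats.t.ppf(0.975, df=len(data) - 1) * stats.sem(data)
ci_low, ci_high = data.mean() - margin, data.mean() + margin

print(f"p-value = {p_value:.3f}")
print(f"95% CI  = ({ci_low:.3f}, {ci_high:.3f})")
# The p-value exceeds 0.05 exactly when the interval contains the null value (5.0).
```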
Hello ,
In the case when p > 0.05, we fail to reject the null hypothesis. So what will the confidence level be? And what are the chances of getting opposite results?
Hi Tulajaram,
You set the confidence level so it equals 1 – significance level. So, if you use a significance level of 0.05, then you use a confidence level of 1 – 0.05 = 0.95. In this way, the confidence level results will match your hypothesis test results.
I’m not sure what you mean by “getting opposite results”?
Hi Jim,
Thank you for your educative piece on significance levels in statistics. Please comment on my question: I understand (I hope rightly so) that the 1% level, relative to the 5% significance level, ‘is a stricter level of significance’ and hence allows very little room for committing a Type I error. Would it therefore be right to conclude that if you fail to reject the null hypothesis at the 1% level of significance, then you will always fail to reject it at the 5% level of significance too? If not, please elaborate on how the chosen significance levels (1%, 5%, and 10%) relate to each other in rejecting or failing to reject the null hypothesis. Many thanks
Hi Robin,
Yes, you’re correct that a 1% level is stricter. The significance level is the Type I error rate. So, a lower significance level (e.g., 1%) has, by definition, a lower Type I error rate. And, yes, it is possible to reject at one level, say 5%, and not reject at a lower level (1%). I show an example of this in my post about p-values and significance levels. It’s important to choose your significance level before conducting the study and then stick with it. Don’t change the significance level to obtain significant results.
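Here’s a tiny illustration of that point: take a hypothetical p-value of 0.03 and judge it against both standards.

```python
p_value = 0.03  # hypothetical result

print(p_value <= 0.05)  # True: reject the null at the 5% significance level
print(p_value <= 0.01)  # False: fail to reject at the stricter 1% level
```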
Hi, thanks for the wonderful content. I would appreciate it if you could help me understand better. Imagine (I tried to avoid the scary word “hypothetical”) a scenario with p at 0.07 and α at 0.05. Will that still mean: “If the medicine has no effect in the population as a whole, 7% of studies will obtain the effect observed in your sample, or larger, because of random sample error”? And in that case, what would be the extrapolated (although not advisable) error rate, similar to the example quoted below?
“While the exact error rate varies based on different assumptions, the values below use run-of-the-mill assumptions…”
Regards
Hi Madhav,
Yes, that would be the correct way to interpret a p-value of 0.07.
As for the error rate based on run-of-the-mill assumptions, I don’t know for sure. I don’t have the article handy, but I’d guess around 30%. However, it really depends on the prevalence of true effects. That’s essentially the overall probability of the alternative hypothesis being true at the beginning of your study. That’s hard to say, but you can look at the significance of similar studies. You’d need to know how many were significant and not significant, though, and usually we only hear about the significant studies. Bayesian statistics uses that approach. If your p-value is 0.07 but the alternative hypothesis is unlikely, the error rate could be much higher. So, take it with a grain of salt. The key point is that a p-value near 0.05 (plus or minus a little bit) really is not strong evidence, and 0.07 is a bit weaker. You really shouldn’t try to take it much further than that.
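To make that dependence on the prevalence of true effects concrete, here’s a hedged sketch (every input is an assumption chosen for illustration, not a value from any study): the chance that a significant result is a false positive, computed from alpha, power, and the prior probability of a real effect.

```python
def false_positive_risk(alpha: float, power: float, prior: float) -> float:
    """P(null is true | result is significant), under the assumed inputs."""
    false_pos = alpha * (1 - prior)  # significant results when the null is true
    true_pos = power * prior         # significant results when the effect is real
    return false_pos / (false_pos + true_pos)

# With a 50% prior of a real effect and 80% power at alpha = 0.05:
print(f"{false_positive_risk(alpha=0.05, power=0.80, prior=0.5):.0%}")  # ~6%

# If a real effect is unlikely (10% prior), the risk climbs sharply:
print(f"{false_positive_risk(alpha=0.05, power=0.80, prior=0.1):.0%}")  # ~36%
```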
Hi
Interesting conversations on the choice of significance level.
May I introduce my paper: Choosing the level of significance: decision theoretic approach
https://onlinelibrary.wiley.com/doi/abs/10.1111/abac.12172
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2652773
I also have R package OptSig, freely available from
https://cran.r-project.org/web/packages/OptSig/index.html
You may choose the 1%, 5%, or 10% level based on risk and power, but these levels have no theoretical justification at all and are still completely arbitrary. My paper proposes obtaining the optimal level by minimizing expected loss.
When you say changing your alpha from .05 to .10 decreases the required evidence, that is literally true. It decreases the sample size required to detect the difference as significant, and/or increases the power of the study to detect this difference as statistically significant. I like to think in terms of all three: power, alpha, and sample size in relation to your effect size. The free software G*Power does a good job of determining these. That software also lets you graph, for example, statistical power as a function of sample size, but the plot isn’t always smooth even though the axes are continuously and evenly scaled. For some statistical tests, that line is jerky. I think it has something to do with the shape of the distribution of something used in the calculation, but I’m embarrassed to say that I can’t recall what that is.
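For anyone without G*Power, here is a rough Python equivalent of that kind of plot; the effect size and alpha are assumed values, and this is only a sketch, not G*Power’s own method. (For the t-test the curve comes out smooth; the jerkiness described above typically appears with tests built on discrete distributions, such as exact tests.)

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
sample_sizes = np.arange(10, 201, 5)

# Power of a two-sample t-test at each sample size (assumed d = 0.5, alpha = 0.05).
powers = [analysis.power(effect_size=0.5, nobs1=n, alpha=0.05) for n in sample_sizes]

plt.plot(sample_sizes, powers)
plt.xlabel("Sample size per group")
plt.ylabel("Statistical power")
plt.title("Power vs. sample size (d = 0.5, alpha = 0.05)")
plt.show()
```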
Hi Jim,
If the concept of choosing the confidence level flexibly were integrated into Six Sigma, that would be even better and would lead to higher refinement of the outcome. I hope I am thinking in the right direction.
I consider “detectable difference” when designing a hypothesis study for inference.
If p is greater than 0.10, any difference has been overwhelmed by the noise.
If p is smaller than 0.10, there is a potential difference.
If p is smaller than 0.01, there is a detectable difference that will convince a skeptic.
Hi Stan,
As always, it’s great to hear from you!
In terms of p-values and the strength of the evidence they represent against the null hypothesis, I generally agree with your summary.
Personally, I consider p-values around 0.05 to represent a potential difference. The strength of evidence from a single study near 0.05 isn’t particularly strong by itself, but it probably warrants follow-up. That’s based on simulation studies, which I discuss towards the end of my post about interpreting p-values, and a different post about an empirical reproducibility study.
I agree that for p-values less than 0.01, the evidence is getting stronger. While a very low p-value doesn’t guarantee that the effect size is practically significant, it is pretty strong evidence.
I think there is something wrong in the Type II error description. “Rejecting a true null hypothesis is a Type I error” would be a better explanation. Correct me if I am wrong.
Hi Ratnadeep, Yes, you’re entirely correct! Thanks for pointing that out. I accidentally flipped the numbers around. I’ve made the correction.