Significance levels in statistics are a crucial component of hypothesis testing. However, unlike other values in your statistical output, the significance level is not something that statistical software calculates. Instead, you choose the significance level. Have you ever wondered why?
First, it’s crucial to remember that hypothesis tests are inferential procedures. These tests determine whether your sample evidence is strong enough to suggest that an effect exists in an entire population. Suppose you’re comparing the means of two groups. Your sample data show that there is a difference between those means. Does the sample difference represent a difference between the two populations? Or, is that difference likely due to random sampling error? That’s where hypothesis tests come in!
Your sample data provide evidence for an effect. The significance level is a measure of how strong the sample evidence must be before you can conclude that the results are statistically significant. Because we’re talking about evidence, let’s look at a courtroom analogy.
Evidentiary Standards in the Courtroom
Criminal cases and civil cases vary greatly, but both require a minimum amount of evidence to convince a judge or jury that a claim against the defendant has been proven. Prosecutors in criminal cases must prove the defendant is guilty “beyond a reasonable doubt,” whereas plaintiffs in a civil case must present a “preponderance of the evidence.” These terms are evidentiary standards that reflect the amount of evidence that civil and criminal cases require.
For civil cases, a preponderance of the evidence means that the plaintiff’s claim is more likely true than not, which most scholars describe as at least 51% of the evidence supporting the claim. However, criminal cases carry more severe consequences and require stronger evidence, which must go beyond a reasonable doubt. Most scholars define that evidentiary standard as being 90%, 95%, or even 99% sure that the defendant is guilty.
In statistics, the significance level is the evidentiary standard. For researchers to successfully make the case that the effect exists in the population, the sample must contain a sufficient amount of evidence.
In court cases, you have evidentiary standards because you don’t want to convict innocent people.
In hypothesis tests, we have the significance level because we don’t want to claim that an effect or relationship exists when it does not exist.
Significance Levels as an Evidentiary Standard
In statistics, the significance level defines the strength of evidence in probabilistic terms. Specifically, the significance level, also known as alpha (α), is the probability that a test will produce statistically significant results when the null hypothesis is correct. Rejecting a true null hypothesis is a type I error, so the significance level equals the type I error rate. You can think of this error rate as the probability of a false positive: the test results lead you to believe that an effect exists when it actually does not.
Obviously, when the null hypothesis is correct, we want a low probability that hypothesis tests will produce statistically significant results. For example, if alpha is 0.05, your analysis has a 5% chance of producing a significant result when the null hypothesis is correct.
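To see that 5% figure in action, here is a minimal simulation sketch. It assumes a simple two-sample z-test with a known standard deviation of 1, and the function names, sample size, and number of repetitions are illustrative choices rather than anything prescribed by the discussion above. Both groups are drawn from the same population, so the null hypothesis is true and every significant result is a false positive.

```python
import random
from statistics import NormalDist, mean


def two_sample_z_pvalue(a, b, sigma=1.0):
    """Two-sided p-value for a difference in means, assuming a known common sigma."""
    se = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(alpha, n_tests=5_000, n=30, seed=1):
    """Fraction of tests that are significant when the null hypothesis is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        # Both samples come from the same population (mean 0, sd 1),
        # so any observed difference is due to random sampling error.
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        if two_sample_z_pvalue(a, b) <= alpha:
            hits += 1
    return hits / n_tests


for alpha in (0.01, 0.05, 0.10):
    rate = false_positive_rate(alpha)
    print(f"alpha = {alpha:.2f}: observed false positive rate = {rate:.3f}")
```

Under these assumptions, the observed false positive rate should land close to whichever alpha you choose, which is exactly what it means for the significance level to equal the type I error rate.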
Just as the evidentiary standard varies by the type of court case, you can set the significance level for a hypothesis test depending on the consequences of a false positive. By changing alpha, you increase or decrease the amount of evidence you require in the sample to conclude that the effect exists in the population.
Changing Significance Levels
Because 0.05 is the standard alpha, we’ll start by adjusting away from that value. Typically, you’ll need a good reason to change the significance level to something other than 0.05. Also, note the inverse relationship between alpha and the amount of required evidence. For instance, increasing the significance level from 0.05 to 0.10 lowers the evidentiary standard. Conversely, decreasing it from 0.05 to 0.01 increases the standard. Let’s look at why you would consider changing alpha and how it affects your hypothesis test.
Increasing the Significance Level
Imagine you’re testing the strength of party balloons. You’ll use the test results to determine which brand of balloons to buy. A false positive here leads you to buy balloons that are not actually stronger, so the cost of a false positive is very low. Consequently, you could consider reducing the amount of evidence required by changing the significance level to 0.10. Because this change decreases the amount of required evidence, it makes your test more sensitive to detecting differences, but it also increases the chance of a false positive from 5% to 10%.
Decreasing the Significance Level
Conversely, imagine you’re testing the strength of fabric for hot air balloons. A false positive here is very risky because lives are on the line! You want to be very confident that the material from one manufacturer is stronger than the material from the other. In this case, you should increase the amount of evidence required by changing alpha to 0.01. Because this change increases the amount of required evidence, it makes your test less sensitive to detecting differences, but it decreases the chance of a false positive from 5% to 1%.
It’s all about the tradeoff between sensitivity and false positives!
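A rough way to put numbers on that tradeoff is to compute the test’s sensitivity (its statistical power) at each significance level. The sketch below again assumes a two-sided, two-sample z-test with a known standard deviation; the true difference of 0.5, the sample size of 30 per group, and the function name are hypothetical values picked only for illustration.

```python
from statistics import NormalDist


def power_two_sample_z(alpha, effect=0.5, sigma=1.0, n=30):
    """Probability of a significant result for a given true difference in means.

    Assumes a two-sided, two-sample z-test with known sigma and n per group.
    """
    std_normal = NormalDist()
    se = sigma * (2 / n) ** 0.5                  # standard error of the difference
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # critical value for this alpha
    shift = effect / se                          # true difference in standard-error units
    # Chance that the test statistic falls in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


for alpha in (0.01, 0.05, 0.10):
    print(f"alpha = {alpha:.2f}: "
          f"false positive rate = {alpha:.2f}, "
          f"sensitivity (power) = {power_two_sample_z(alpha):.3f}")
```

As alpha rises, both columns rise together: you detect real differences more often, but you also accept more false positives. Setting `effect=0` makes the power equal alpha, because a significant result can then only be a false positive.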
In conclusion, a significance level of 0.05 is the most common. However, it’s the analyst’s responsibility to determine how much evidence to require for concluding that an effect exists. How problematic is a false positive? There is no single correct answer for all circumstances. Consequently, you need to choose the significance level!
While the significance level indicates the amount of evidence that you require, the p-value represents the strength of the evidence that exists in your sample. When your p-value is less than or equal to the significance level, the strength of the sample evidence meets or exceeds your evidentiary standard for rejecting the null hypothesis and concluding that the effect exists.
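The decision rule itself is just a comparison, as this small sketch shows. The p-value of 0.03 is a made-up number for illustration, and `is_significant` is a hypothetical helper name rather than a function from any statistics package.

```python
def is_significant(p_value, alpha=0.05):
    """Return True when the sample evidence meets or exceeds the evidentiary standard."""
    return p_value <= alpha


p_value = 0.03  # hypothetical p-value from some hypothesis test

for alpha in (0.10, 0.05, 0.01):
    verdict = "reject the null" if is_significant(p_value, alpha) else "fail to reject the null"
    print(f"alpha = {alpha:.2f}: {verdict}")
```

The same sample evidence clears the bar at 0.10 and 0.05 but not at 0.01, which is why you need to choose alpha before you look at the results.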
This post looks at significance levels from a conceptual standpoint, but you can also learn about significance levels and p-values using a graphical representation of how hypothesis tests work. Additionally, my post about the types of errors in hypothesis testing takes a deeper look at both Type I and Type II errors and the tradeoffs between them.