What is Statistical Significance?
Statistical significance is the goal for most researchers analyzing data. But what does statistically significant mean? Why and when is it important to consider? How do p-values fit in with statistical significance? I’ll answer all these questions in this blog post!
Evaluate statistical significance when using a sample to estimate an effect in a population. It helps you determine whether your findings reflect chance or an actual effect of a variable of interest.
Statistical significance indicates that an effect you observe in a sample is unlikely to be the product of chance. For statistically significant results, you can conclude that an effect you observe in a sample also exists in the population.
Let’s dig into statistical significance more deeply!
Why We Need to Assess Statistical Significance
In most research studies, the investigators evaluate an effect of some sort. It can be the effectiveness of a new medication, the strength of a product, the relationship between variables, etc. There is some benefit or relationship that they hope to find. Learn about Effect Sizes in Statistics.
When using a sample to estimate an effect, researchers need to evaluate statistical significance.
Researchers typically use representative samples in experiments because measuring an entire population is almost always impractical. Even though they’re using a sample, they really want to determine if that effect exists in the whole population. Discovering an effect that exists in only the relatively small group of study participants doesn’t help move science forward!
While samples are manageable, they introduce sampling error because you’re not appraising the whole population.
When you draw a random sample from a population, sampling error is the difference between a sample estimate and the population value (known as a parameter). This difference occurs by chance, literally the luck of the draw, and it is unavoidable when working with samples. It is this sampling error that creates the need to assess statistical significance.
Unfortunately, sampling error virtually guarantees that a sample estimate of the effect won’t equal the correct population parameter exactly. In fact, if you were to draw many random samples from a population and perform the same experiment, you’d get different results each time due to sampling error.
Because the results change depending on the sample you draw, you need a way to account for sampling error. After all, you might see a nice effect in one sample but not another. Statistical significance helps you evaluate the potential role of sampling error in your results.
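To make the idea concrete, here is a minimal simulation sketch of sampling error. The population values (mean 100, standard deviation 15), the sample size, and the function name are assumptions chosen purely for illustration. Each random sample yields a slightly different estimate of the same population mean.

```python
import random
import statistics

def sample_means(pop_mean, pop_sd, n, draws, seed=0):
    """Draw `draws` random samples of size n and return each sample's mean."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(pop_mean, pop_sd) for _ in range(n))
        for _ in range(draws)
    ]

# Hypothetical population: mean 100, standard deviation 15 (assumed values).
means = sample_means(pop_mean=100, pop_sd=15, n=25, draws=5)
for m in means:
    # Sampling error = sample estimate minus the population parameter.
    print(f"sample mean = {m:6.2f}, sampling error = {m - 100:+.2f}")
```

None of the sample means should land exactly on 100, even though every sample comes from the same population.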
Related posts: Random Sampling, Representative Samples, and Sampling Error
What Does Statistically Significant Mean?
Sampling error forces us to consider statistical significance. When you draw a random sample from a population, there is always a chance that sampling error created the observed effect.
Imagine a hypothetical study for a medicine that we know is entirely useless. In other words, the effect size is zero. No difference exists at the population level between subjects who take the medicine and those who don’t. However, thanks to sampling error, we’re bound to see some difference between those two groups in a sample even though the medicine has no effect.
How do we know whether the sample estimate reflects sampling error or a true effect?
Statistical significance tells us that the sample effect is unlikely to be a mirage caused by sampling error. When we have statistically significant results, we conclude it is an actual effect existing in the population.
Definition of Statistically Significant
The definition of statistically significant is that the sample effect is unlikely to be caused by chance (i.e., sampling error). In other words, what we see in the sample likely reflects an effect or relationship that exists in the population.
More technically, statistical significance relates to the following conditional probability.
If there is no effect at the population level, sampling error is unlikely to have produced your sample results.
The flip side of statistical significance is non-significant results. This condition indicates you can’t conclude the sample effect exists in the population. It would not be surprising if sampling error created the appearance of an effect in your sample by chance alone. If that is the case, the benefit you see in your sample does not exist in the population.
Statistical Significance Example
Suppose you perform an experiment with a new drug that purportedly increases intelligence. The treatment group takes the drug, while the control group does not. After the experiment, you find that the treatment group has an average IQ that is 10 points higher than the control group. That’s your sample estimate for the treatment effect.
That looks great!
However, random chance might have selected a sample for the treatment group that, by sheer luck, got better results. In other words, the difference might not be due to the medicine but sampling error instead.
If you perform the experiment again, would you get similarly great results?
By performing the correct statistical analysis, you can determine the likelihood of obtaining your sample results if the medication has no effect.
For this study, statistically significant results indicate that the sample effect of 10 IQ points was unlikely to occur if there was no effect in the population. Therefore, you can be confident that the drug has a real effect.
Conversely, non-statistically significant results indicate the sample effect might be sampling error rather than an actual effect. Random chance could have conspired to create the illusion of an effect in your sample. If the medicine’s effect does not exist in the population, it won’t have the benefits we expect based on our sample results.
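One way to estimate that likelihood is a permutation test, sketched below. This is only one possible analysis, not necessarily the one a given study would use, and the IQ scores are made-up numbers chosen so the group means differ by 10 points. The test repeatedly shuffles the group labels, which mimics a world where the drug does nothing, and counts how often chance alone produces a difference at least as large as the observed one.

```python
import random
import statistics

def permutation_p_value(treatment, control, reshuffles=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = statistics.fmean(treatment) - statistics.fmean(control)
    pooled = list(treatment) + list(control)
    n_treat = len(treatment)
    extreme = 0
    for _ in range(reshuffles):
        # Shuffling the labels simulates "no effect in the population".
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:n_treat]) - statistics.fmean(pooled[n_treat:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add 1 to numerator and denominator so the estimate is never exactly 0.
    return (extreme + 1) / (reshuffles + 1)

# Made-up IQ scores for illustration only (means: 112 and 102).
treatment = [112, 105, 118, 109, 121, 103, 115, 110, 108, 119]
control = [101, 96, 108, 99, 111, 94, 104, 100, 97, 110]

p_value = permutation_p_value(treatment, control)
print(f"estimated p-value: {p_value:.4f}")
```

For these invented scores, the estimated p-value comes out well below 0.05, so a 10-point gap would be surprising if the drug had no effect. With noisier scores or fewer subjects, the same 10-point gap could easily be non-significant.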
That’s why determining statistical significance is crucial.
P-values and Statistical Significance
Statistical significance occurs in the context of hypothesis testing. These analyses are a form of inferential statistics—using a sample to estimate the properties of a population. As I discussed previously, we want to determine whether the sample effect also exists in the population. Or is the effect a mirage produced by chance during the random sampling process?
Hypothesis testing provides two tools for evaluating statistical significance: p-values and the significance level.
P-values: The probability of obtaining a sample effect at least as large as the one you observed if there is no effect in the population.
Significance level: An evidentiary standard that researchers select as the threshold for statistical significance.
P-values indicate the strength of the evidence against sampling error producing the sample effect. And the significance level defines how strong the evidence must be to reject the notion that the sample effect is a mirage caused by sampling error. Therefore, you need to compare your p-value to the significance level to determine statistical significance.
Your results are statistically significant if the p-value is ≤ your significance level. For example, if your p-value is 0.01 and your significance level is 0.05, your results are statistically significant.
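The comparison can be written out as a short sketch. The function names are hypothetical, and the p-value calculation assumes a test statistic that follows a standard normal distribution (a z-test), which is a simplification. Other tests compute the p-value differently, but the final comparison against the significance level works the same way.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """P-value for a z statistic: chance of a result at least this extreme under no effect."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def is_significant(p_value, alpha=0.05):
    """Results are statistically significant when the p-value is <= the significance level."""
    return p_value <= alpha

# The example from the text: p-value of 0.01 against a significance level of 0.05.
print(is_significant(0.01, alpha=0.05))

# A z statistic of 1.0 gives a p-value far above 0.05, so it is not significant.
print(is_significant(two_sided_p_value(1.0), alpha=0.05))
```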
Learn more about the rationale behind How Hypothesis Tests Work.
Related posts: Interpreting P-values and the Significance Level.
Caution: Statistical Significance ≠ Practical Significance
Finally, it might seem logical to think that statistically significant results indicate you have a large effect that is meaningful in a practical, real-world sense. However, that is not necessarily true.
Effect size is only one factor in assessing statistical significance. Sample size and data variability are two others. You can obtain statistically significant results when you have a negligible effect but have a large sample size and low variability. In this situation, the effect likely exists but doesn’t amount to much in the real world.
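A small calculation illustrates how sample size drives this. The sketch below assumes a two-sample z-test with a known standard deviation of 15 in both groups and a hypothetical effect of just 0.1 IQ points. All of these values are invented for illustration.

```python
from statistics import NormalDist

def z_test_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference between two group means (known sd)."""
    standard_error = sd * (2 / n_per_group) ** 0.5
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical negligible effect: 0.1 IQ points, with sd = 15 in both groups.
p_small_sample = z_test_p_value(mean_diff=0.1, sd=15, n_per_group=50)
p_huge_sample = z_test_p_value(mean_diff=0.1, sd=15, n_per_group=1_000_000)

print(f"n = 50 per group:        p = {p_small_sample:.3f}")
print(f"n = 1,000,000 per group: p = {p_huge_sample:.2g}")
```

The effect is identical in both cases. Only the sample size changes, yet the huge sample produces a tiny p-value while the small one does not come close to significance.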
Use subject-area knowledge to assess the practical significance of your findings.
To learn more about this distinction and how to assess it, read my post about Practical vs. Statistical Significance.
Quick question: I ran an A/B test and checked the confidence level for the data, which was ~79%. This was after running the test for two weeks. I went back and looked a week later and the confidence level was 11%. Could this mean I hit confidence sometime during the week before I returned and looked at my data? If so, would this 11% number represent a confidence on a bigger population? I'm trying to figure out how a confidence level can drop so much after running a test for so long. I'm not sure if there is any theory or concept that could explain this.
Jim Frost says
When you run a hypothesis test and use it to create a confidence interval, you set the confidence level at a specific value, such as 95%. Your software should not be changing the confidence level over time. So, I don’t know what is happening. I have no theories other than your software is doing something it shouldn’t do.
I’m not sure if you’re doing the following or not. But observing incoming data until you reach statistical significance is a form of cherry picking. This practice tends to produce significant results even when it’s not warranted. I know it’s tempting because it’s exciting when you see significant results. However, it won’t be as exciting when you use the results of this cherry-picking method to implement changes and the improvements fail to materialize.
Instead, pick a fixed amount of time to run the test, and then assess the results at that point.
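A rough simulation sketch shows why repeated checking is a problem. It assumes a simplified setup that is not taken from any particular A/B testing tool: both arms draw from the same normal distribution (so the true effect is zero), a z-test is run after every batch of new data, and the batch size and number of looks are arbitrary choices.

```python
import random
from statistics import NormalDist

def false_positive_rates(simulations=500, looks=20, batch=20, alpha=0.05, seed=0):
    """Compare 'significant at any look' with 'significant at the planned end'."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(simulations):
        sum_a = sum_b = 0.0
        n = 0
        flagged_while_peeking = False
        for _ in range(looks):
            # Both arms come from the same distribution: the true effect is zero.
            sum_a += sum(rng.gauss(0, 1) for _ in range(batch))
            sum_b += sum(rng.gauss(0, 1) for _ in range(batch))
            n += batch
            # z statistic for a difference in means with known sd = 1.
            z = (sum_a / n - sum_b / n) / (2 / n) ** 0.5
            if abs(z) > z_crit:
                flagged_while_peeking = True
        peeking_hits += flagged_while_peeking
        # After the loop, z is the statistic at the planned end of the test.
        fixed_hits += abs(z) > z_crit
    return peeking_hits / simulations, fixed_hits / simulations

peeking_rate, fixed_rate = false_positive_rates()
print(f"significant at some look:        {peeking_rate:.2f}")
print(f"significant at the planned end:  {fixed_rate:.2f}")
```

Even though no real effect exists, checking after every batch flags a "significant" result far more often than the 5% rate the significance level is supposed to allow, while the single check at the planned end stays close to it.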
Funsho Olukade says
Thank you Prof for this post. One area that my students struggle with is understanding the relationship between confidence level (CI) and the p-value vs significance level. Any intuitive explanation on this?
Jim Frost says
Yes, I’ve written a post that covers that exactly! 🙂
Hypothesis Testing and Confidence Intervals