You’ve just performed a hypothesis test and your results are statistically significant. Hurray! These results are important, right? Not so fast. Statistical significance does not necessarily mean that the results are practically significant in a real-world sense of importance.

In this blog post, I’ll talk about the differences between practical significance and statistical significance, and how to determine if your results are meaningful in the real world.

## Statistical Significance

The hypothesis testing procedure determines whether the sample results that you obtain are likely if you assume the null hypothesis is correct for the population. If the results are sufficiently improbable under that assumption, then you can reject the null hypothesis and conclude that an effect exists. In other words, the strength of the evidence in your sample has passed your defined threshold of the significance level (alpha). Your results are statistically significant.

You use p-values to determine statistical significance in hypothesis tests such as t-tests, ANOVA, and regression coefficients among many others. Consequently, it might seem logical that p-values and statistical significance relate to importance. However, that is false because conditions other than large effect sizes can produce tiny p-values.

Hypothesis tests with small effect sizes can produce very low p-values when you have a large sample size and/or the data have low variability. Consequently, effect sizes that are trivial in the practical sense can be highly statistically significant.

Here’s how small effect sizes can still produce tiny p-values:

**You have a very large sample size.** As the sample size increases, the hypothesis test gains greater statistical power to detect small effects. With a large enough sample size, the hypothesis test can detect an effect that is so minuscule that it is meaningless in a practical sense.

**The sample variability is very low.** When your sample data have low variability, hypothesis tests can produce more precise estimates of the population’s effect. This precision allows the test to detect tiny effects.
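To see both conditions at work, here is a minimal sketch. It assumes a two-sample z-test with equal group sizes and a shared, known standard deviation, and every number in it is hypothetical. The function name `two_sample_p_value` is made up for this illustration. The effect is fixed at a trivial 0.5 points, and only the sample size or the variability changes.

```python
from math import erfc, sqrt

def two_sample_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in means.

    Assumes a z-test with equal group sizes and a common, known SD.
    """
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) without rounding to zero for large z.
    return erfc(abs(z) / sqrt(2))

# Hypothetical scenarios: the same 0.5-point difference each time.
scenarios = [
    ("baseline", 10, 100),
    ("large sample", 10, 10_000),
    ("low variability", 1, 100),
]
for label, sd, n in scenarios:
    p = two_sample_p_value(0.5, sd, n)
    print(f"{label:<16} sd={sd:>2}  n per group={n:>6,}  p={p:.4g}")
```

In the baseline scenario the p-value is nowhere near significance. With 100 times the sample size, or a tenth of the variability, the same 0.5-point effect becomes highly significant.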

Statistical significance indicates only that you have sufficient evidence to conclude that an effect exists. It is a mathematical definition that does not know anything about the subject area and what constitutes an important effect.

**Related posts**: How Hypothesis Tests Work and How to Interpret P-values Correctly

## Practical Significance

While statistical significance relates to whether an effect exists, practical significance refers to the magnitude of the effect. However, no statistical test can tell you whether the effect is large enough to be important in your field of study. Instead, you need to apply your subject area knowledge and expertise to determine whether the effect is big enough to be meaningful in the real world. In other words, is it large enough to care about?

How do you do this? I find that it is helpful to identify the smallest effect size that still has some practical significance. Again, this process requires that you use your knowledge of the subject to make this determination. If your study’s effect size is greater than this smallest meaningful effect, your results are practically significant.

For example, suppose you are evaluating a training program by comparing the test scores of program participants to those who study on their own. Further, we decide that the difference between these two groups must be at least five points to represent a practically meaningful effect size. An effect of 4 points or less is too small to care about.

After performing the study, the analysis finds a statistically significant difference between the two groups. Participants in the training program score an average of 3 points higher on a 100-point test. While these results are statistically significant, the 3-point difference is less than our 5-point threshold. Consequently, our study provides evidence that this effect exists, but it is too small to be meaningful in the real world. The time and money that participants spend on the training program are not worth an average improvement of only 3 points.

Not all statistically significant differences are interesting!

## Use Confidence Intervals to Determine Practical Significance

That sounds pretty straightforward. Unfortunately, there is one small complication. The effect size in your study is only an estimate because it is based on a sample. Thanks to sampling error, there is a margin of error around the estimated effect.

We need a method to determine whether the estimated effect is still practically significant when you factor in this margin of error. Enter confidence intervals!

A confidence interval is a range of values that likely contains the population value. I’ve written about confidence intervals extensively elsewhere, so I’ll keep it short here. The crucial idea is that confidence intervals incorporate the margin of error by creating a range around the estimated effect. The population value is likely to fall within that range. Your task is to determine whether all, some, or none of that range represents practically significant effects.
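As a rough illustration of how the margin of error turns a point estimate into a range, the sketch below builds a confidence interval for a difference in means from summary statistics. It assumes a normal approximation, which is reasonable for larger samples (a t-based interval would be the usual choice for small ones). The function name `mean_difference_ci` and all of the input numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def mean_difference_ci(mean_a, mean_b, sd_a, sd_b, n_a, n_b, confidence=0.95):
    """Approximate CI for (mean_a - mean_b) using the normal distribution."""
    estimate = mean_a - mean_b
    standard_error = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin_of_error = z * standard_error
    return estimate - margin_of_error, estimate + margin_of_error

# Hypothetical summary statistics: a 9-point estimated effect.
low, high = mean_difference_ci(79, 70, 15, 15, 50, 50)
print(f"Estimated effect: 9, 95% CI: [{low:.2f}, {high:.2f}]")
```

The estimated effect sits in the middle of the interval, and the width of the interval reflects how much sampling error surrounds it.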

**Related posts**: How Confidence Intervals Work

## Example of Using Confidence Intervals for Practical Significance

Suppose we conduct two studies on the training program described above. Both studies are statistically significant and produce an estimated effect of 9. These effects look good because they’re both greater than our smallest meaningful effect size of 5. However, these estimates don’t incorporate the margin of error. The confidence intervals (CIs) for the two studies provide that crucial information: Study A’s CI is [3, 15], and Study B’s CI is [7, 11].

Study A’s CI extends from values that are too small to be meaningful (<5) to those that are large enough to be meaningful. Even though the study is statistically significant and the estimated effect is 9, the CI creates doubt about whether the actual population effect is large enough to be meaningful. The CI tells us that if we implement the program on a larger scale, we might produce only an average 3-point increase! We can’t be sure about practical significance after we include the margin of error around the estimate.

On the other hand, the CI for Study B contains only meaningful effect sizes. We can be more confident that the population effect size is large enough for us to care about!

I really like confidence intervals because you can use them to determine both statistical significance (if they exclude zero) and practical significance. Confidence intervals focus on the size of the effect and the uncertainty around the estimate rather than just whether the effect exists.
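The checks described here can be sketched in a few lines. The function name `assess_interval` is hypothetical, and the sketch assumes that a positive effect is the beneficial direction. The intervals are the ones reported for the two studies, [3, 15] for Study A and [7, 11] for Study B, and the threshold is our smallest meaningful effect of 5.

```python
def assess_interval(ci_low, ci_high, smallest_meaningful_effect):
    """Classify a confidence interval for statistical and practical significance.

    Assumes larger positive effects are better.
    """
    # Statistically significant if the interval excludes zero.
    statistically_significant = ci_low > 0 or ci_high < 0

    # How much of the interval represents practically meaningful effects?
    if ci_low >= smallest_meaningful_effect:
        practically_meaningful = "all"
    elif ci_high < smallest_meaningful_effect:
        practically_meaningful = "none"
    else:
        practically_meaningful = "some"
    return statistically_significant, practically_meaningful

studies = {"Study A": (3, 15), "Study B": (7, 11)}
for name, (low, high) in studies.items():
    significant, meaningful = assess_interval(low, high, smallest_meaningful_effect=5)
    print(f"{name}: statistically significant={significant}, "
          f"meaningful portion of CI={meaningful}")
```

Both studies exclude zero, but only Study B's interval lies entirely above the 5-point threshold.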

In closing, statistical significance indicates that your sample provides sufficient evidence to conclude that the effect exists in the population. Practical significance asks whether that effect is large enough to care about. Use statistical analyses to determine statistical significance and subject-area expertise to assess practical significance.

Henry Gutierrez says

Hi Jim,

We have an assignment and we need to speak to Practical Significance and Statistical Significance. I believe we’ve done a good job on the statistical, but could use some advice on the verbiage of practical significance. Should this speak to more real world examples of effectiveness?

Here is our verbiage:

Two key data points, suspensions and chronic absenteeism, were archived and retrieved from Sanger Unified School District’s Aeries Student Information System and used for analysis of program effectiveness. Aeries Student Information System is a comprehensive program that incorporates multiple technologies to support the diverse needs of educators (Aeries SIS, n.d.). Timelines restrict the ability to pull data beyond the three years after the initial implementation of PBIS. Discipline and attendance data is tracked in Aeries for each pupil in Kindergarten through 5th grade at the school site and district office.

The first set of data retrieved was the recorded number of suspensions in the school year. The data examines suspension rates three years prior to PBIS implementation and for the three subsequent years post PBIS implementation. School administration provided suspension totals for each school year prior to PBIS implementation, along with suspension totals from the three years post-implementation. The second data set retrieved from the Aeries student information system was chronic absenteeism, which is defined as ten or more absences by an individual student during a school year. Utilizing Aeries data to track chronic absences in the given school year allowed a comparison of annual absences pre and post PBIS implementation. The absences for each student were recorded in Aeries and the total absences from the end of three school years were compared to the totals from the three subsequent school years.

Data Analysis

Practical Significance

A Cohen’s (1977) d value was calculated to determine the effect size of the decrease in suspensions for each year of pre-PBIS implementation and post-PBIS implementation. Cohen’s d allows for the comparison of the change in mean (average) scores to determine the magnitude of practical growth. A Cohen’s d value of 0.2 is considered a small effect size, a d of 0.5 is considered a medium effect size and 0.8 is considered a large effect size. The decrease in suspensions for this program evaluation indicated an effect size of 3.38, which is considered highly significant.

A Cohen’s (1977) d value was also calculated to determine the effect size of the decrease in chronic absenteeism for each year of pre-PBIS implementation and post-PBIS implementation. Cohen’s d allows for the comparison of the change in mean (average) scores to determine the magnitude of practical growth. A Cohen’s d value of 0.2 is considered a small effect size, a d of 0.5 is considered a medium effect size and 0.8 is considered a large effect size.

With the Cohen’s d value related to effect sizes, the decrease in chronic absenteeism for this program evaluation indicated an effect size of 3.21, which is considered highly significant.

Statistical Significance

Independent-samples t-test.

The Statistical Package for the Social Sciences software was used to conduct an independent-samples t-test to compare pre-intervention (N = 3) and post-intervention (N = 3) in terms of yearly schoolwide suspensions. There was a significant difference in the suspensions for pre-intervention (M=25, SD=2.646) and post-intervention (M=4.7, SD=8.1) conditions; t (4)=4.14, p = .014. These results suggest that Positive Behavioral Interventions and Supports does have an effect on suspensions.

The Statistical Package for the Social Sciences software was also used to conduct a second independent-samples t-test to compare pre-intervention (N = 3) and post-intervention (N = 3) in terms of yearly schoolwide chronic absenteeism. There was a significant difference in the number of chronically absent students for pre-intervention (M=68.0, SD=8.33) and post-intervention (M=39.0, SD=9.29) conditions; t (4)=3.93, p = .017. These results suggest that Positive Behavioral Interventions and Supports does have an effect on chronic absenteeism.

L says

What if I have a small (and non-random) sample size (n=55) for a training/intervention program to reduce sexual assault on campus and few of my results are significant? But the change in the DV appears meaningful to us. Could I make the argument that because we have such low statistical power (and an IV that has 3 categories), we see few statistically significant results but we believe the results are practically significant?

Jim Frost says

Hi,

You really only have strong enough evidence to support the statistically significant IVs in your model. If an IV is not significant, you don’t have enough evidence to conclude that there is a relationship between the IV and DV in the population.

I think you’re saying that the coefficient for the IV appears to be large enough to be practically meaningful but it’s not significant. What that indicates is that the relationship exists in your *sample* to a degree that looks important. However, given the variability in your data (i.e., the noise), the effect (i.e., the signal) is not strong enough to conclude that it exists in the *population*. That relationship in your sample might exist only because of random sampling error. The evidence is not strong enough to conclude that it also exists in the population.

In other words, you can’t really say that the results are not statistically significant but they are practically significant so they’re still important. Ultimately, what you’re observing is not distinguishable from random variability in your data.

Regarding the low power, that could be one reason for not obtaining significant results. However, don’t assume that having a larger sample size would necessarily produce significant results. Instead, it’s possible that the sample effect you observed would go away with a large sample. It’s impossible to say. However, the good news is that your insignificant results don’t prove that there is no effect. Instead, it’s just insufficient evidence for concluding that there is an effect. That’s a very different conclusion.

I’d say that the jury is still out about the intervention program. It’s probably worth a larger study. You might consider this as a pilot study that shows a bit of promise, but it’s not conclusive. You might recommend more research.

I hope this helps!

philoinme says

I think you got my question right. While the concept of power is typically used to calculate sample size and experimental design, I wanted to know if it could be used the other way round.

But I didn’t get why did you treat my question as if it has two possibilities (1st para: applying power calculation on HT and 2nd para: If the historical data provides good estimates, power analysis might work fine). Generally, power calculation might have been applied on HT (which is not the case as per your answer) and HTs are in turn applied on historical data, isn’t it?

Sylvia says

“the CI tells us that if we implement the program on a larger scale, we might produce only an average 3-point increase!”

How can we tell this from the above example?

Jim Frost says

Hi Sylvia,

Thanks for the great question!

That statement refers to the CI from Study A, which is [3, 15]. Three is the lower limit for that CI. None of the values within the CI are considered unusual, so we would not be surprised if the actual population mean fell anywhere in that range. Consequently, it would not be a surprise if the true population effect was 3. It would also not be a surprise if it was 15 or anywhere in between. The CI highlights the uncertainty associated with a point estimate. We should be wary of the benefits of the program because the CI includes values that we consider to be practically insignificant.

There’s more certainty behind the estimate for Study B, and we can be fairly certain that, even incorporating the uncertainty, the true population mean is greater than 3 because Study B’s CI extends from 7 to 11.

philoinme says

Hi Jim,

Thanks for the quick response.

It’s standard that effect size and power are used to back calculate the sample size. But for instance, there is some historical data which was collected and through some statistical tests power was calculated. Now can’t we use this power relatively compared to most commonly taken figure (80%- used in back calculation) for decision making or further course of action?

I understand that we can’t calculate effect size but according to the subject matter, we know what we want and I am wondering after we know what we want, we can just calculate power and proceed or not proceed with what we want to do?

My main concern is after knowing statistical significance, why can’t we use power calculation to quantify the practical significance as most of the times when it comes to historical data, sample size is already there. This is the case where it’s not about future experiment but about the past occurrences.

Jim Frost says

Hi,

I’m not quite clear on what you’re asking?

Are you asking about a power calculation that uses estimates from a hypothesis test that you performed? The problem with that approach is that if you use it on significant test results, you’ll obtain very high power. However, if you use it on a test that wasn’t significant, you’ll obtain lower power. It’s how the math works out. Consequently, in that case, you don’t want to say the test missed significance due to low power because you’ll always obtain a low power for that situation.

However, if you have other data (e.g., historical data) that provides good estimates to input into a power analysis, you might be able to determine whether a different study had low power. In that case, you’re looking at a kind of historical consensus about effect sizes and variability to assess a particular study. If you trust the historical estimate, that’s probably ok. You just don’t want to use the results from a hypothesis test to assess the power of that same hypothesis test. But, I’m not sure if that’s what you’re asking?
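Here is a minimal sketch of why "it's how the math works out." It assumes a two-sided z-test at alpha = 0.05 and treats the observed effect as if it were the true effect, which is what a power calculation based on the same test's results does. The function name `observed_power` is hypothetical. Under those assumptions, the resulting power depends only on the p-value.

```python
from statistics import NormalDist

def observed_power(p_value, alpha=0.05):
    """Power obtained by plugging the observed effect back in as the true effect.

    Assumes a two-sided z-test.
    """
    std_normal = NormalDist()
    z_observed = std_normal.inv_cdf(1 - p_value / 2)   # |z| implied by the p-value
    z_critical = std_normal.inv_cdf(1 - alpha / 2)     # rejection cutoff
    # Probability of landing in either rejection region if the true z were z_observed.
    return (std_normal.cdf(z_observed - z_critical)
            + std_normal.cdf(-z_observed - z_critical))

for p in [0.001, 0.01, 0.05, 0.20, 0.50]:
    print(f"p = {p:<5}  observed power = {observed_power(p):.2f}")
```

In this setup, a result that lands exactly at p = 0.05 always yields a power of about 50%, significant results always yield more, and nonsignificant results always yield less. The calculation adds no information beyond the p-value itself.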

philoinme says

Isn’t the power calculation (1 – beta) useful for calculating the practical effect size?

Jim Frost says

Hi,

It’s the other way around. You use the smallest effect size that is practically significant in the power calculations to estimate a reasonable sample size. You can read about it in my post about power and sample size calculations.

Think about it this way. You can’t mathematically calculate what effect size is considered practically significant. Instead, that is based on subject-area expertise and real-world concerns. However, after you know how large an effect must be to be considered practically significant, you can use that knowledge to calculate a sample size that will detect an effect of that size with a sufficient amount of power.
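As a rough sketch of that last step, the code below uses the common normal-approximation formula for comparing two group means. The 5-point smallest meaningful effect comes from the training program example, while the standard deviation of 10 is purely an assumption for illustration. The function name `sample_size_per_group` is hypothetical, and dedicated power software would typically use the t-distribution and give a slightly larger answer.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(smallest_meaningful_effect, sd, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Uses n = 2 * ((z_alpha + z_power) * sd / effect)^2 (normal approximation).
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * sd / smallest_meaningful_effect) ** 2
    return ceil(n)

# 5-point threshold from the example; SD of 10 is an assumed value.
n_needed = sample_size_per_group(smallest_meaningful_effect=5, sd=10)
print(f"Participants needed per group: {n_needed}")
```

Notice that the effect size is an input that comes from subject-area knowledge. The power calculation consumes it and does not produce it.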