You’ve just performed a hypothesis test and your results are statistically significant. Hurray! These results are important, right? Not so fast. Statistical significance does not necessarily mean that the results are practically significant in a real-world sense of importance.

In this blog post, I’ll talk about the differences between practical significance and statistical significance, and how to determine if your results are meaningful in the real world.

## Statistical Significance

The hypothesis testing procedure determines whether the sample results that you obtain are likely if you assume the null hypothesis is correct for the population. If the results are sufficiently improbable under that assumption, then you can reject the null hypothesis and conclude that an effect exists. In other words, the strength of the evidence in your sample has passed your defined threshold of the significance level (alpha). Your results are statistically significant.

Learn more about Statistical Significance: Definition & Meaning.

You use p-values to determine statistical significance in hypothesis tests such as t-tests, ANOVA, and regression coefficients among many others. Consequently, it might seem logical that p-values and statistical significance relate to importance. However, that is false because conditions other than large effect sizes can produce tiny p-values.

Hypothesis tests with small effect sizes can produce very low p-values when you have a large sample size and/or the data have low variability. Consequently, effect sizes that are trivial in the practical sense can be highly statistically significant.

Here’s how small effect sizes can still produce tiny p-values:

**You have a very large sample size.** As the sample size increases, the hypothesis test gains greater statistical power to detect small effects. With a large enough sample size, the hypothesis test can detect an effect that is so minuscule that it is meaningless in a practical sense.

**The sample variability is very low.** When your sample data have low variability, hypothesis tests can produce more precise estimates of the population’s effect. This precision allows the test to detect tiny effects.
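To see both mechanisms at work, here is a minimal Python sketch. It assumes a two-sided, two-sample z-test with equal group sizes and a known, common standard deviation, which is a simplification of the t-tests you would typically run. The function name and all of the numbers are hypothetical and chosen only for illustration.

```python
import math

def two_sample_z_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in means (z-test approximation).

    Assumes equal group sizes and a known, common standard deviation.
    """
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # erfc(|z| / sqrt(2)) is the two-sided tail area of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical effect: a 0.5-point difference on a 100-point test.
# The effect never changes; only the sample size grows.
for n in (50, 500, 5_000, 50_000):
    p = two_sample_z_p_value(mean_diff=0.5, sd=10.0, n_per_group=n)
    print(f"n per group = {n:>6}: p = {p:.3g}")

# Same tiny effect with n = 500, but much lower variability.
print(two_sample_z_p_value(mean_diff=0.5, sd=2.0, n_per_group=500))
```

The half-point difference is the same in every row, yet the p-value drops from clearly nonsignificant to minuscule as the sample grows or the variability shrinks.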

Statistical significance indicates only that you have sufficient evidence to conclude that an effect exists. It is a mathematical definition that does not know anything about the subject area and what constitutes an important effect.

**Related posts**: How Hypothesis Tests Work and How to Interpret P-values Correctly

## Practical Significance

While statistical significance relates to whether an effect exists, practical significance refers to the magnitude of the effect. However, no statistical test can tell you whether the effect is large enough to be important in your field of study. Instead, you need to apply your subject area knowledge and expertise to determine whether the effect is big enough to be meaningful in the real world. In other words, is it large enough to care about?

How do you do this? I find that it is helpful to identify the smallest effect size that still has some practical significance. Again, this process requires that you use your knowledge of the subject to make this determination. If your study’s effect size is greater than this smallest meaningful effect, your results are practically significant.

For example, suppose you are evaluating a training program by comparing the test scores of program participants to those who study on their own. Further, suppose we decide that the difference between these two groups must be at least 5 points to represent a practically meaningful effect size. Any effect smaller than 5 points is too small to care about.

After performing the study, the analysis finds a statistically significant difference between the two groups. Participants in the training program score an average of 3 points higher on a 100-point test. While these results are statistically significant, the 3-point difference is less than our 5-point threshold. Consequently, our study provides evidence that this effect exists, but it is too small to be meaningful in the real world. The time and money that participants spend on the training program are not worth an average improvement of only 3 points.

Not all statistically significant differences are interesting!

**Related post**: Effect Sizes in Statistics

## Use Confidence Intervals to Determine Practical Significance

That sounds pretty straightforward. Unfortunately, there is one small complication. The effect size in your study is only an estimate because it is based on a sample. Thanks to sampling error, there is a margin of error around the estimated effect.

We need a method to determine whether the estimated effect is still practically significant when you factor in this margin of error. Enter confidence intervals!

A confidence interval is a range of values that likely contains the population value. I’ve written about confidence intervals extensively elsewhere, so I’ll keep it short here. The crucial idea is that confidence intervals incorporate the margin of error by creating a range around the estimated effect. The population value is likely to fall within that range. Your task is to determine whether all, some, or none of that range represents practically significant effects.
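As a rough illustration of how the margin of error turns a point estimate into a range, here is a minimal sketch for the difference between two group means. It uses the normal approximation (a t-based interval would be slightly wider for small samples), and the function name and summary statistics are hypothetical.

```python
import math
from statistics import NormalDist

def mean_difference_ci(mean_a, mean_b, sd_a, sd_b, n_a, n_b, confidence=0.95):
    """Approximate CI for the difference in means (mean_a - mean_b).

    Uses the normal approximation rather than the t-distribution.
    """
    diff = mean_a - mean_b
    standard_error = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    margin_of_error = z_crit * standard_error
    return diff - margin_of_error, diff + margin_of_error

# Hypothetical summary statistics for a program group and a self-study group.
lower, upper = mean_difference_ci(
    mean_a=84.0, mean_b=75.0, sd_a=12.0, sd_b=12.0, n_a=40, n_b=40
)
print(f"Estimated effect: 9.0, 95% CI: [{lower:.1f}, {upper:.1f}]")
```

The interval is centered on the estimated effect, and its width depends on both the variability and the sample sizes.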

**Related posts**: How Confidence Intervals Work

## Example of Using Confidence Intervals for Practical Significance

Suppose we conduct two studies on the training program described above. Both studies are statistically significant and produce an estimated effect of 9 points. These effects look good because they’re both greater than our smallest meaningful effect size of 5. However, these estimates don’t incorporate the margin of error. The confidence intervals (CIs) provide that crucial information: Study A’s CI is [3, 15], while Study B’s CI is [7, 11].

Study A’s CI extends from values that are too small to be meaningful (<5) to those that are large enough to be meaningful. Even though the study is statistically significant and the estimated effect is 9, the CI creates doubt about whether the actual population effect is large enough to be meaningful. The CI tells us that if we implement the program on a larger scale, we might produce only an average 3-point increase! We can’t be sure about practical significance after we include the margin of error around the estimate.

On the other hand, the CI for Study B contains only meaningful effect sizes. We can be more confident that the population effect size is large enough for us to care about!
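The comparison can be written as a small helper. This is only a sketch: it assumes a positive effect where larger is better, uses the CI limits of [3, 15] for Study A and [7, 11] for Study B, and the function name is hypothetical.

```python
def practical_significance(ci_lower, ci_upper, smallest_meaningful_effect):
    """Classify a CI for a positive effect against a practical threshold."""
    if ci_lower >= smallest_meaningful_effect:
        return "all of the CI is practically significant"
    if ci_upper < smallest_meaningful_effect:
        return "none of the CI is practically significant"
    return "the CI includes both trivial and meaningful effects"

THRESHOLD = 5  # smallest meaningful effect in the training program example
studies = {"Study A": (3, 15), "Study B": (7, 11)}

for name, (lower, upper) in studies.items():
    print(f"{name}: {practical_significance(lower, upper, THRESHOLD)}")
```

The threshold itself still has to come from subject-area knowledge; the code only automates the comparison.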

I really like confidence intervals because you can use them to determine both statistical significance (if they exclude zero) and practical significance. Confidence intervals focus on the size of the effect and the uncertainty around the estimate rather than just whether the effect exists.

In closing, statistical significance indicates that your sample provides sufficient evidence to conclude that the effect exists in the population. Practical significance asks whether that effect is large enough to care about. Use statistical analyses to determine statistical significance and subject-area expertise to assess practical significance.

Mike Daw says

Hi Jim,

Thanks for taking the time to respond so promptly. Regarding your suggestion that I was doing something wrong, I don’t think I was. Here’s why…

Just to clarify, I was finding median diff. CIs that include zero for analyses that also yielded significant p-values using SPSS, which uses the Hodges-Lehmann formula. (Note that R uses the same method and was also giving these results.) So I couldn’t really be doing something wrong, unless SPSS and R are wrong too.

So, doing further digging since I wrote to you, I came across a paper that suggests that this method of calculating CIs (i.e., median difference), “[does not] coincide with the test statistic of the Mann–Whitney test, so the interpretation of the test results and the confidence intervals may be importantly different … [and] if the two distributions [in a test] do not have the same shape, it is easy to find examples where the confidence interval for the median and the Mann–Whitney test results strongly disagree” (Perme & Manevski, 2019, p. 3755).

There may also be a confounding issue with my data, which involves almost all integer values. Hence my medians were also almost always integers. I suspect that some of the zero median differences I was finding would have been more accurately zero-point-somethings rounded to zero. (I realise that that sentence may not be strictly correct, but hopefully you get what I mean.)

Having discussed this issue with my examiner who is reviewing my PhD thesis amendments, we agreed that I would instead show CIs around r effect size values, calculated using another helpful website, http://vassarstats.net/rho.html. These CIs don’t include zeros for my significant results and so serve the purpose of showing practical significance for my findings without contradicting my significant results.

I hope this is helpful to you and anyone else who may have a similar issue…

Thanks for sharing your stats knowledge on your various blogs. Generous people like you really help people like me who are still very much on a stats learning curve!

Best wishes,

Mike

Ref: Perme, M. P., & Manevski, D. (2019). Confidence intervals for the Mann–Whitney test. Statistical Methods in Medical Research, 28(12), 3755–3768. https://doi.org/10.1177/0962280218814556

Mike Daw says

Hi Jim, I really appreciate this post. However, I’m still struggling with interpreting my results. I recently had my PhD viva and was asked to amend my survey results by providing CIs. I’m using Mann-Whitney tests and whilst I’m getting significant (and sometimes highly significant) differences between my populations, my CIs sometimes include zero (for differences in the medians). What does that mean for my hypothesis testing?! Can I say that my hypothesis is confirmed, but only tentatively, or weakly? Or that I need to exercise caution for those results? Or that there’s little practical meaning? (But surely that’s provided by my effect size, which I’m also reporting.) I would be grateful for your views… Best wishes, Mike

Jim Frost says

Hi Mike,

When you’re looking at a CI of a difference, such as the difference between two medians in your case, the value of zero indicates no difference. When your CI includes zero, it indicates you can’t reject the hypothesis that there is no difference (i.e., you fail to reject the null). And that’s because zero is included in the likely range of population values for the difference between medians. Assuming you’re using the correct and consistent methodologies, your CIs should include zero when your p-value is greater than your significance level. If you’re getting a mismatch (e.g., CI includes zero but your p-value is less than your significance level), you’re doing something wrong.

Now, let’s assume your CI doesn’t include zero and correspondingly your p-value is statistically significant. This situation tells you that you have sufficient evidence in your sample to conclude that the effect (difference between medians) also exists in the population. The CI gives you a range of likely values for the effect size. As I write in this post, you’ll need to use your subject-area knowledge to determine whether it suggests a practically significant effect. If the CI includes values near zero, it’s possible that an effect exists in the population but it’s so small that it’s trivial in a practical sense. Perhaps the other end of the CI, further away from zero, is a value that would be practically significant. In that case, the effect might be practically significant, but you don’t know for sure because the CI also includes near-zero possibilities.

Keep in mind that the effect size point estimate has a margin of error around it because you’re working with a sample. You don’t know that it’s exactly that value. It’s just an estimate. The CI tells you the range the population value is likely to fall in, which is why it provides crucial information. You might also want to read about the CI of the Difference article that I wrote. It’s about differences between means but it also applies to medians.

I hope that helps!

waka waka says

in my comparative study what I faced is, that the difference in mean pretest scores of the two groups became statistically significant

Jim Frost says

That would suggest pre-existing differences between the groups, which complicates matters.

Auburne says

Hi,

I am taking a Stat II class and I had a horrible instructor in Stat I. I do not feel like I learned anything from Stat I, which makes this class unbearable. My research idea has to do with children in CPS custody and the mental health diagnoses they are given during the process. I know my four variables: 1) removal from family, 2) mental and physical harm to the child based on the removal, 3) CPS instability, and 4) mental health diagnosis. But I am unclear as to how to get my statistical values in order to run SPSS. Can you explain? I feel so dumb.

Jim Frost says

Hi Auburne,

Oh no, I’m so sorry to hear about the horrible instructor. One thing I’ll say is that you should never feel dumb because of the instructor. I know that’s much easier said than done. My daughter is a great writer, but thanks to a poor teacher, she felt dumb writing, like she just wasn’t able to get it. But, I could tell it was the instructor rather than my daughter. I know how frustrating it can be. So, hang in there. Stick with it. Try to get help from others. And just know that it does NOT mean you’re dumb!

As for your research, I’m having a hard time determining what you’re hoping to show and the types of data. Are you hoping to use some variables to explain others? What’s the research question you’re trying to answer?

I’ve written a blog post about how to design a study that includes statistical analysis. That might help get you started. Although, I’m not sure how much of the process you need to do for your class.

I hope that helps some. With more information, I might be able to help more.

Danijela says

Hi, I have estimated a logistic model. I wonder how the interpretation regarding practical significance can be used here. The variable I am concerned about has an odds ratio of 0.90 and a coefficient equal to -0.07. The confidence interval for the coefficient is [-2.28, 2.13]. If I used the magnitude of the odds ratio, I would say that this is a practically big effect because a 1-unit increase in my independent variable is associated with a 0.90 odds decrease in my dependent variable.

Thank you in advance.

Jim Frost says

Hi Danijela,

For a binary logistic model, the odds ratio is a great measure of the effect. Odds ratios greater than 1 indicate that the probability of the event occurring increases as the predictor increases. Ratios less than 1 indicate that the event’s probability declines as the predictor increases. The confidence interval for the odds ratio is actually more important here than the CI for the coefficient. An odds ratio of 1 indicates that the probability of the event occurring doesn’t tend to change with changes in the predictor. If the CI for the odds ratio excludes 1, then your results are statistically significant.
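Because a logistic regression coefficient is on the log-odds scale, exponentiating the coefficient and its CI limits gives the odds ratio and its CI. Here is a minimal sketch using the coefficient and interval quoted in the question; the function name is hypothetical, and the interval’s confidence level is whatever the original software used.

```python
import math

def odds_ratio_summary(coef, ci_lower, ci_upper):
    """Convert a logistic coefficient and its CI from log-odds to odds ratios."""
    odds_ratio = math.exp(coef)
    or_ci = (math.exp(ci_lower), math.exp(ci_upper))
    # Statistically significant only if the odds ratio CI excludes 1.
    excludes_one = or_ci[0] > 1.0 or or_ci[1] < 1.0
    return odds_ratio, or_ci, excludes_one

# Coefficient and CI limits as quoted in the question.
odds_ratio, (low, high), significant = odds_ratio_summary(-0.07, -2.28, 2.13)
print(f"OR = {odds_ratio:.2f}, CI = [{low:.2f}, {high:.2f}], excludes 1: {significant}")
```

With those inputs, the CI for the odds ratio runs from roughly 0.10 to 8.41. It comfortably includes 1, so the first question of whether an effect exists at all is not yet settled.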

However, these are all assessments of statistical significance. Statistically significant results indicate that an effect exists. To determine practical significance, you’ll need to use subject-area expertise to decide whether this effect is important in the real world. I can’t answer that because I don’t know the subject area. Sometimes an effect exists but it is so small that it doesn’t really matter in a practical sense. Consequently, you’ll need to determine whether that effect size is large enough to matter.

I hope that helps!

Mansouri says

Hi Jim,

Suppose that we have two groups whose members each select from N items. For example, the table below shows the number of people from each group who selected each item.

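| Item | A | B | C | D | E | F | G | H | K | L |
|------|---|---|---|---|---|---|---|---|---|---|
| G1 | 3 | 4 | 52 | 633 | 1000 | 0 | 6 | 915 | 890 | 500 |
| G2 | 1 | 19 | 17 | 333 | 1000 | 25 | 6 | 110 | 951 | 510 |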

Population of group 1: 1000

Population of group 2: 1050

Is there any test that determines the differences in which items are statistically/practically important/significant? Thanks very much.

adelaglez says

How do you compute CI? :))

Jim Frost says

That completely depends on the context! Is it for a mean using t-tests or ANOVA? In a regression analysis? Or for a proportion or count? Typically, your statistical software will include the relevant CIs.

Troy says

Thanks. I have been stressing over this for quite some time. I should have asked sooner.

Troy says

Thanks, Jim. I have two variables that are highly correlated. I have been advised (instructed) to hold on to both of them for exploration (stepwise). Neither is my variable of interest, but both are important to address in my study. Should I delete one or the other to evaluate my variable of interest (hypothesis testing) and then reintroduce it for the exploratory analysis? Am I overthinking this?

Troy

Jim Frost says

Hi Troy,

Check the VIFs for all your IVs. If it’s just the two that you were required to leave in, then you don’t really have a problem other than you can’t really trust the coefficients and p-values for those two IVs. The problems associated with multicollinearity only affect the highly correlated IVs. If the other IVs don’t have high VIFs, then you don’t have reason to suspect that multicollinearity is affecting them. In that case, you would be controlling for these two presumably important variables and able to trust the results for the other IVs.

But, check the VIFs for all the IVs to confirm that is the case!
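If you want to see what a VIF measures, here is a minimal sketch for the simplest case of exactly two predictors, where each predictor’s VIF reduces to 1 / (1 − r²) and r is the correlation between the two. With more IVs, you would regress each IV on all of the others and use that R² instead, which is what statistical software does for you. The data and function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two IVs."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r**2)

# Hypothetical predictor values.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1]    # nearly a copy of x1
x3 = [2.0, -1.0, 1.5, -0.5, 0.5, 1.0]  # only weakly related to x1

print("VIF for x1 with x2:", round(vif_two_predictors(x1, x2), 1))
print("VIF for x1 with x3:", round(vif_two_predictors(x1, x3), 2))
```

A VIF near 1 means the predictor shares almost no variance with the other one, while larger values mean its coefficient estimate is increasingly unstable.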

Troy says

Jim,

How does this look with a regression? Specifically, I have conducted a simultaneous entry regression (SPSS) with 10 IVs. I conducted a stepwise analysis with the same 10. SPSS eliminated 4, leaving only 6 in the most inclusive model. Three of the eliminated variables were actually significant in the first analysis. I suppose that they were eliminated because they (being correlated with other variables) did not increase the predictability of the model. I am confused as to why the previously insignificant IVs became significant in the stepwise analysis. In fact, 2 of them were the first variables to be included. The first included variable had a standardized beta similar to the other significant variables (in the simultaneous analysis), but the other two were much lower. What else could cause these variables to be significant in the stepwise analysis, but not when simultaneously entered?

Troy

Jim Frost says

Hi Troy,

There’s a variety of possible reasons, but the first thing I’d look for is multicollinearity. When that is present, you often see what you describe: the significance of the IVs can vary greatly depending on the precise combination of other variables in the model. Read my post about multicollinearity and check the VIFs! That’s a good place to start.

Henry Gutierrez says

Hi Jim,

We have an assignment and we need to speak to Practical Significance and Statistical Significance. I believe we’ve done a good job on the statistical, but could use some advice on the verbiage of practical significance. Should this speak to more real world examples of effectiveness?

Here is our verbiage:

Two key data points, suspensions and chronic absenteeism, were archived and retrieved from Sanger Unified School District’s Aeries Student Information System and used for analysis of program effectiveness. Aeries Student Information System is a comprehensive program that incorporates multiple technologies to support the diverse needs of educators (Aeries SIS, n.d.). Timelines restrict the ability to pull data beyond the three years after the initial implementation of PBIS. Discipline and attendance data is tracked in Aeries for each pupil in Kindergarten through 5th grade at the school site and district office.

The first set of data retrieved was the recorded number of suspensions in the school year. The data examines suspension rates three years prior to PBIS implementation and for the three subsequent years post PBIS implementation. School administration provided suspension totals for each school year prior to PBIS implementation, along with suspension totals from the three years post-implementation. The second data set retrieved from the Aeries student information system was chronic absenteeism, which is defined as ten or more absences by an individual student during a school year. Utilizing Aeries data to track chronic absences in the given school year allowed a comparison of annual absences pre and post PBIS implementation. The absences for each student were recorded in Aeries and the total absences from the end of three school years were compared to the totals from the three subsequent school years.

Data Analysis

Practical Significance

A Cohen’s (1977) d value was calculated to determine the effect size of the decrease in suspensions for each year of pre-PBIS implementation and post-PBIS implementation. Cohen’s d allows for the comparison of the change in mean (average) scores to determine the magnitude of practical growth. A Cohen’s d value of 0.2 is considered a small effect size, a d of 0.5 is considered a medium effect size and 0.8 is considered a large effect size. The decrease in suspensions for this program evaluation indicated an effect size of 3.38, which is considered highly significant.

A Cohen’s (1977) d value was also calculated to determine the effect size of the decrease in chronic absenteeism for each year of pre-PBIS implementation and post-PBIS implementation. Cohen’s d allows for the comparison of the change in mean (average) scores to determine the magnitude of practical growth. A Cohen’s d value of 0.2 is considered a small effect size, a d of 0.5 is considered a medium effect size and 0.8 is considered a large effect size.

With the Cohen’s d value related to effect sizes, the decrease in chronic absenteeism for this program evaluation indicated an effect size of 3.21, which is considered highly significant.

Statistical Significance

Independent-samples t-test.

The Statistical Package for the Social Sciences software was used to conduct an independent-samples t-test to compare pre-intervention (N = 3) and post-intervention (N = 3) in terms of yearly schoolwide suspensions. There was a significant difference in the suspensions for pre-intervention (M=25, SD=2.646) and post-intervention (M=4.7, SD=8.1) conditions; t (4)=4.14, p = .014. These results suggest that Positive Behavioral Interventions and Supports does have an effect on suspensions.

The Statistical Package for the Social Sciences software was also used to conduct a second independent-samples t-test to compare pre-intervention (N = 3) and post-intervention (N = 3) in terms of yearly schoolwide chronic absenteeism. There was a significant difference in the number of chronically absent students for pre-intervention (M=68.0, SD=8.33) and post-intervention (M=39.0, SD=9.29) conditions; t (4)=3.93, p = .017. These results suggest that Positive Behavioral Interventions and Supports does have an effect on chronic absenteeism.

L says

What if I have a small (and non-random) sample size (n=55) for a training/intervention program to reduce sexual assault on campus and few of my results are significant? But the change in the DV appears meaningful to us. Could I make the argument that because we have such low statistical power (and an IV that has 3 categories), we see few statistically significant results but we believe the results are practically significant?

Jim Frost says

Hi,

You really only have strong enough evidence to support the statistically significant IVs in your model. If an IV is not significant, you don’t have enough evidence to conclude that there is a relationship between the IV and DV in the population.

I think you’re saying that the coefficient for the IV appears to be large enough to be practically meaningful but it’s not significant. What that indicates is that the relationship exists in your *sample* to a degree that looks important. However, given the variability in your data (i.e., the noise), the effect (i.e., the signal) is not strong enough to conclude that it exists in the *population*. That relationship in your sample might exist only because of random sampling error. The evidence is not strong enough to conclude that it also exists in the population.

In other words, you can’t really say that the results are not statistically significant but they are practically significant so they’re still important. Ultimately, what you’re observing is not distinguishable from random variability in your data.

Regarding the low power, that could be one reason for not obtaining significant results. However, don’t assume that having a larger sample size would necessarily produce significant results. Instead, it’s possible that the sample effect you observed would go away with a larger sample. It’s impossible to say. However, the good news is that your nonsignificant results don’t prove that there is no effect. Instead, it’s just insufficient evidence for concluding that there is an effect. That’s a very different conclusion.

I’d say that the jury is still out about the intervention program. It’s probably worth a larger study. You might consider this as a pilot study that shows a bit of promise, but it’s not conclusive. You might recommend more research.

I hope this helps!

philoinme says

I think you got my question right. While the concept of power is typically used to calculate sample size and experimental design, I wanted to know if it could be used the other way round.

But I didn’t get why you treated my question as if it has two possibilities (1st para: applying power calculation on HT; 2nd para: if the historical data provides good estimates, power analysis might work fine). Generally, power calculation might have been applied on HT (which is not the case as per your answer) and HTs are in turn applied on historical data, isn’t it?

Sylvia says

“the CI tells us that if we implement the program on a larger scale, we might produce only an average 3-point increase!”

How can we tell this from the above example?

Jim Frost says

Hi Sylvia,

Thanks for the great question!

That statement refers to the CI from Study A, which is [3, 15]. Three is the lower limit for that CI. None of the values within the CI are considered unusual, so we would not be surprised if the actual population mean fell somewhere in that range. Consequently, it would not be a surprise if the true population effect was 3. It would also not be a surprise if it was 15 or anywhere else in between. The CI highlights the uncertainty associated with a point estimate. We should be wary of the benefits of the program because the CI includes values that we consider to be practically insignificant.

There’s more certainty behind the estimate for Study B, and we can be fairly certain that, even incorporating the uncertainty, the true population mean is greater than 3 because Study B’s CI extends from 7 to 11.

philoinme says

Hi Jim,

Thanks for the quick response.

It’s standard that effect size and power are used to back calculate the sample size. But for instance, there is some historical data which was collected and through some statistical tests power was calculated. Now can’t we use this power relatively compared to most commonly taken figure (80%- used in back calculation) for decision making or further course of action?

I understand that we can’t calculate effect size but according to the subject matter, we know what we want and I am wondering after we know what we want, we can just calculate power and proceed or not proceed with what we want to do?

My main concern is after knowing statistical significance, why can’t we use power calculation to quantify the practical significance as most of the times when it comes to historical data, sample size is already there. This is the case where it’s not about future experiment but about the past occurrences.

Jim Frost says

Hi,

I’m not quite clear on what you’re asking?

Are you asking about using a power calculation based on estimates from a hypothesis test that you performed? The problem with that approach is if you use it on significant test results, you’ll obtain very high power. However, if you use it on a test that wasn’t significant, you’ll obtain lower power. It’s how the math works out. However, in that case, you don’t want to say the test missed the significance due to low power because you’ll always obtain a low power for that situation.
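To show how the math works out, here is a minimal sketch for a two-sided z-test, where the “observed” power is computed by treating the sample effect as if it were the true effect. Under that assumption, the observed power is a function of the p-value and alpha alone. The function name is hypothetical, and the z-test is a simplification of whatever test was actually used.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def observed_power(p_value, alpha=0.05):
    """Post-hoc power of a two-sided z-test, computed from its own p-value.

    Expects 0 < p_value <= 1.
    """
    z_observed = _STD_NORMAL.inv_cdf(1.0 - p_value / 2.0)
    z_critical = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    # Probability of rejecting the null if the true effect equalled the observed one.
    return (_STD_NORMAL.cdf(z_observed - z_critical)
            + _STD_NORMAL.cdf(-z_observed - z_critical))

for p in (0.001, 0.01, 0.05, 0.20, 0.50):
    print(f"p = {p:<5} -> observed power = {observed_power(p):.2f}")
```

A result sitting right at p = 0.05 comes out at about 50% observed power, and anything less significant comes out lower, so this calculation adds no information beyond the p-value itself.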

However, if you have other data (e.g., historical data) that provides good estimates to input into a power analysis, you might be able to determine whether a different study had low power. In that case, you’re looking at a kind of historical consensus about effect sizes and variability to assess a particular study. If you trust the historical estimate, that’s probably ok. You just don’t want to use the results from a hypothesis test to assess the power of that same hypothesis test. But, I’m not sure if that’s what you’re asking?

philoinme says

Isn’t the power calculation (1 – beta) useful for calculating the practical effect size?

Jim Frost says

Hi,

It’s the other way around. You use the smallest effect size that is practically significant in the power calculations to estimate a reasonable sample size. You can read about it in my post about power and sample size calculations.

Think about it this way. You can’t mathematically calculate what effect size is considered practically significant. Instead, that is based on subject-area expertise and real-world concerns. However, after you know how large an effect must be to be considered practically significant, you can use that knowledge to calculate a sample size that will detect an effect of that size with a sufficient amount of power.
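Here is a minimal sketch of that last step for comparing two group means. It uses the common normal-approximation formula n = 2((z₁₋α/₂ + z_power) · σ / δ)² per group, where δ is the smallest practically significant effect. The 5-point effect comes from the training program example in the post, while the standard deviation of 10 and the function name are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_group(smallest_meaningful_effect, sd, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Normal approximation; assumes equal group sizes and a common SD.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    n = 2.0 * ((z_alpha + z_power) * sd / smallest_meaningful_effect) ** 2
    return math.ceil(n)

# Smallest meaningful effect of 5 points; the SD of 10 is an assumed value.
print(sample_size_per_group(smallest_meaningful_effect=5.0, sd=10.0))
```

With these inputs the calculation gives about 63 participants per group. Halving the smallest meaningful effect roughly quadruples the required sample size, which is why the subject-area decision has to come first.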