Effect sizes in statistics quantify the differences between group means and the relationships between variables. While analysts often focus on statistical significance using p-values, effect sizes determine the practical importance of the findings.

In experiments and other studies, analysts typically assess relationships between variables. Effect sizes represent the magnitude of a relationship between variables. For example, you might want to know whether average health outcomes differ between the control group and a treatment group receiving a new medicine. Or, you might want to determine whether processing temperatures relate to a product’s strength.

Effect sizes tell you whether these relationships are strong or weak. Do these variables have a large or negligible impact on the outcome? The experimental medicine might improve health outcomes, but is it a trivial or substantial improvement? This type of information is crucial in determining whether the effect is meaningful in real-world applications.

Effect sizes come in two general flavors, unstandardized and standardized. Depending on your field, you might be more familiar with one or the other.

In this post, you’ll learn about both unstandardized and standardized effect sizes. Specifically, we’ll look at the following effect sizes:

- **Unstandardized**: Mean differences between groups and regression coefficients.
- **Standardized**: Correlation coefficients, Cohen’s d, eta squared, and omega squared.

Finally, I close the post by explaining the difference between statistical significance and effect sizes, and why you need to consider both.

## Unstandardized Effect Sizes

Unstandardized effect sizes use the natural units of the data. Using the raw data units can be convenient when you intuitively understand those units. This is often the case with tangible concepts, such as weight, money, temperature, etc.

Let’s look at two common types of unstandardized effect sizes, the mean difference between groups and regression coefficients.

### Mean Differences between Groups

This one is simple. Just subtract the group means to calculate the unstandardized effect size:

Difference Between Group Means = Group 1 Mean – Group 2 Mean

Group 1 and Group 2 can be the treatment and control groups, the posttest and pretest measurements, two different types of treatments, and so on.

For example, imagine we’re developing a weight loss pill. The control group loses an average of 5 kg while the treatment group loses an average of 15 kg during the study. The effect size is 15 – 5 = 10 kg. That’s the mean difference between the two groups.

Because you are only subtracting means, the units remain the natural data units. In the example, we’re using kilograms. Consequently, the effect size is 10 kg.
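Here is a minimal Python sketch of that subtraction. The per-person values are hypothetical and were chosen only so that the group means match the 5 kg and 15 kg in the example; the function name is likewise just illustrative.

```python
from statistics import mean

# Hypothetical weight loss (kg) per participant. The values are made up
# so that the group means equal 5 kg (control) and 15 kg (treatment).
control_loss_kg = [3.0, 5.0, 4.0, 7.0, 6.0]
treatment_loss_kg = [13.0, 15.0, 17.0, 14.0, 16.0]


def mean_difference(group_1, group_2):
    """Unstandardized effect size: Group 1 mean minus Group 2 mean."""
    return mean(group_1) - mean(group_2)


effect_kg = mean_difference(treatment_loss_kg, control_loss_kg)
print(f"Mean difference: {effect_kg} kg")  # Mean difference: 10.0 kg
```

The result stays in kilograms because nothing in the calculation rescales the data.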

**Related post**: Post Hoc Tests in ANOVA to Assess Differences between Means

### Regression Coefficients

Regression coefficients are an effect size that indicates the relationship between variables. These coefficients use the units of your model’s dependent variable.

For example, suppose you fit a regression model with years of experience as an independent variable and income in U.S. dollars as the dependent variable. The model estimates a coefficient for years of experience of, say, 867. This value indicates that for every one-year increase in experience, income increases by an average of $867.

That value is the effect size for the relationship between years of experience and income. It is an unstandardized effect size because it uses the natural units of the dependent variable, U.S. dollars.
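To make the interpretation concrete, here is a small Python sketch that fits a simple least-squares line by hand. The data are hypothetical and deliberately constructed so that the fitted slope comes out to 867 with an intercept of 40,000; real data would not be this tidy.

```python
from statistics import mean

# Hypothetical data, constructed so the least-squares slope is 867.
years_experience = [0, 5, 10, 15, 20]
income_usd = [40_600, 43_435, 49_270, 52_105, 57_940]


def fit_simple_regression(x, y):
    """Return (slope, intercept) of the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = fit_simple_regression(years_experience, income_usd)

# The slope is in units of the dependent variable (dollars) per one unit
# of the independent variable (one year of experience).
print(f"Each additional year is associated with ${slope:,.0f} more income")
```

Because the slope carries the units "dollars per year," it is an unstandardized effect size.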

**Related post**: How to Interpret Regression Coefficients and their P-values

## Standardized Effect Sizes

Standardized effect sizes do not use the original data units. Instead, they are unitless, allowing you to compare results between studies and variables that use different units.

Additionally, standardized effect sizes are useful for experiments where the original units are not inherently meaningful or potentially confusing to your readers. For example, think back to the years of experience and income example. That study reported its results in U.S. dollars, or insert your local currency for that example. As a measurement unit, your currency is inherently meaningful to you. You understand what the magnitude of the value represents.

Conversely, many psychology studies use inventories to assess personality characteristics. Those inventory units are not inherently meaningful. For example, it might not be self-evident whether a 10-point difference on a specific inventory represents a small or large effect. Even if you know the answer because it’s your specialty, your readers might not!

However, by standardizing the effect size and removing the data units, the effect’s magnitude becomes apparent. You can compare it to other findings and you don’t need to be familiar with the original units to understand the results.

Consider using standardized effect sizes for comparisons between studies and different variables, or when the original units are not intuitively meaningful. Meta-analyses often use standardized effect sizes from many studies to summarize a set of findings.

Let’s examine several common standardized effect sizes, including correlation coefficients, Cohen’s d, eta squared, and omega squared.

### Correlation coefficients

You might not think of correlation coefficients as standardized effect sizes, but they are a standardized alternative to regression coefficients. Correlation does not use the original data units and all values fall between -1 and +1. You can use them to compare the strengths of the relationships between different pairs of variables because they use a standardized scale.

In the regression coefficient example, recall that the coefficient of 867 represents the mean change of the dependent variable in U.S. dollars. You could instead report the correlation between experience and income.

To understand the potential strength of correlation coefficients, consider different studies that find correlations between height and weight, processing temperature and product strength, and hours of sunlight and depression scores. These studies assess relationships between entirely different types of variables that use different measurement units.

Now imagine these pairs of variables all have the same correlation coefficient. Even though the pairs are highly dissimilar, you know that the strengths of the relationships are equal. Or, if one had a higher correlation, you’d quickly see that it has a stronger relationship. The diverse nature of the variables is not a problem at all because correlation coefficients are standardized!

Instead of correlation coefficients, you can also use standardized regression coefficients for the same reasons.
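The following Python sketch illustrates why correlation counts as standardized. It uses hypothetical experience and income values, computes the slope and Pearson's r by hand, and then re-expresses income in thousands of dollars. The helper names are illustrative, not from any particular library.

```python
from math import sqrt
from statistics import mean, stdev


def sums_of_products(x, y):
    """Return (Sxy, Sxx, Syy), the centered sums of products."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy, sxx, syy


def ols_slope(x, y):
    """Unstandardized regression coefficient for a one-predictor model."""
    sxy, sxx, _ = sums_of_products(x, y)
    return sxy / sxx


def pearson_r(x, y):
    """Pearson correlation coefficient (unitless)."""
    sxy, sxx, syy = sums_of_products(x, y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data for illustration only.
years = [1, 3, 5, 8, 12, 20]
income_usd = [42_000, 41_500, 45_900, 46_200, 51_300, 57_000]
income_kusd = [v / 1000 for v in income_usd]  # same incomes, in $1,000s

slope_usd = ols_slope(years, income_usd)
slope_kusd = ols_slope(years, income_kusd)
r_usd = pearson_r(years, income_usd)
r_kusd = pearson_r(years, income_kusd)

# With a single predictor, the standardized regression coefficient
# (slope rescaled by the two standard deviations) equals r.
standardized_slope = slope_usd * stdev(years) / stdev(income_usd)

print(f"Slope in dollars:       {slope_usd:.2f}")
print(f"Slope in $1,000s:       {slope_kusd:.5f}")
print(f"r (dollars / $1,000s):  {r_usd:.4f} / {r_kusd:.4f}")
print(f"Standardized slope:     {standardized_slope:.4f}")
```

Changing the units changes the regression coefficient by a factor of 1,000, while the correlation stays the same.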

**Related post**: Interpreting Correlation Coefficients and Spearman’s Rank Order Correlation Explained

### Cohen’s d

Cohen’s d is a standardized effect size for differences between group means. For the unstandardized effect size, you just subtract the group means. To standardize it, divide that difference by the standard deviation. It’s an appropriate effect size to report with t-test and ANOVA results.

The numerator is simply the unstandardized effect size, which you divide by the standard deviation. The standard deviation is either the pooled standard deviation for both groups or the standard deviation of the control group alone. Because both parts of the fraction use the same units, the division process cancels them out and produces a unitless result.

Cohen’s d represents the effect size by indicating how large the unstandardized effect is relative to the data’s variability. Think of it as a signal-to-noise ratio. A large Cohen’s d means the effect (signal) is large relative to the variability (noise). A *d* of 1 indicates that the effect is the same magnitude as the variability. A *d* of 2 signifies that the effect is twice the size of the variability, and so on.

For example, if the unstandardized effect size is 10 and the standard deviation is 2, Cohen’s d is an impressive 5. However, if you have the same effect size of 10 and the standard deviation is also 10, Cohen’s d is a much less impressive 1. The effect is on par with the variability in the data.
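As a rough sketch, the calculation might look like this in Python. The first helper reproduces the two summary-statistic examples above. The second uses the pooled standard deviation on raw data; the weight loss values are hypothetical, and both function names are just for illustration.

```python
from math import sqrt
from statistics import mean, variance


def d_from_summary(mean_difference, standard_deviation):
    """Cohen's d from an unstandardized effect and a standard deviation."""
    return mean_difference / standard_deviation


def cohens_d(group_1, group_2):
    """Cohen's d using the pooled standard deviation of the two groups."""
    n_1, n_2 = len(group_1), len(group_2)
    pooled_variance = (
        (n_1 - 1) * variance(group_1) + (n_2 - 1) * variance(group_2)
    ) / (n_1 + n_2 - 2)
    return (mean(group_1) - mean(group_2)) / sqrt(pooled_variance)


# The two examples from the text.
print(d_from_summary(10, 2))   # 5.0
print(d_from_summary(10, 10))  # 1.0

# Hypothetical raw weight loss data (kg), for illustration only.
control_loss_kg = [3.0, 5.0, 4.0, 7.0, 6.0]
treatment_loss_kg = [13.0, 15.0, 17.0, 14.0, 16.0]
d_weight_loss = cohens_d(treatment_loss_kg, control_loss_kg)
print(f"Cohen's d for the hypothetical data: {d_weight_loss:.2f}")
```

Swapping the pooled variance for the control group's variance alone would give the other variant mentioned above.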

As you gain experience in your field of study, you’ll learn which effect sizes are considered small, medium, and large. Cohen suggested that values of 0.2, 0.5, and 0.8 represent small, medium, and large effects. However, these values don’t apply to all subject areas. Instead, build up a familiarity with Cohen’s d values in your subject area.

Learn more about Cohen’s d.

### Eta Squared and Omega Squared

Eta Squared and the related Omega Squared are standardized effect sizes that indicate the percentage of the variance that each categorical variable in an ANOVA model explains. Values can range from 0 to 100%. These effect sizes are similar to R-squared, which represents the percentage of the variance that all variables in the model collectively explain.

Each categorical variable has a value that indicates the percentage of the variance that it explains. Like R-squared, eta squared and omega squared are intuitive measures that you can use to compare variable effect sizes between models.

The difference between eta squared and omega squared is that omega squared adjusts for bias present in eta squared, particularly for small samples. Typically, statisticians prefer omega squared because it is a less biased estimator.
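For a one-way, between-subjects ANOVA, both statistics come from the usual sums of squares. The Python sketch below assumes that simple design and uses the standard textbook formulas, eta squared = SS_between / SS_total and omega squared = (SS_between – df_between × MS_within) / (SS_total + MS_within). The three groups are hypothetical.

```python
from statistics import mean


def one_way_effect_sizes(groups):
    """Return (eta squared, omega squared) for a one-way ANOVA layout."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    ss_total = ss_between + ss_within

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    ms_within = ss_within / df_within

    eta_squared = ss_between / ss_total
    omega_squared = (ss_between - df_between * ms_within) / (ss_total + ms_within)
    return eta_squared, omega_squared


# Hypothetical data: three groups of four observations each.
groups = [[4, 5, 6, 5], [6, 7, 8, 7], [9, 8, 10, 9]]
eta_sq, omega_sq = one_way_effect_sizes(groups)

print(f"Eta squared:   {eta_sq:.1%}")
print(f"Omega squared: {omega_sq:.1%}")
```

With these made-up numbers, eta squared is about 84% and omega squared is about 79%. The adjustment pulls the estimate down, and the gap tends to be larger when samples are small.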

**Related post**: How to Interpret R-squared

## Effect Sizes and Statistical Significance

Historically, statistical results were all about statistical significance. Statistical significance was the goal. However, that emphasis has changed over time. Analysts have increasingly reported effect sizes to show that their findings are important in the real world.

What is the difference between these two concepts?

After performing a hypothesis test, statistically significant results indicate that your sample provides sufficient evidence to conclude that an effect exists in the population. Specifically, statistical significance suggests that the population effect is unlikely to equal zero.

That’s a good start. It helps rule out random sampling error as the culprit for an apparent effect in your sample.

While the word “significant” makes the results sound important, it doesn’t necessarily mean the effect size is meaningful in the real world. Again, it suggests only a non-zero effect, which includes trivial findings.

If you have a large sample size and/or a low amount of variability in your data, hypothesis tests can produce significant p-values for trivial effects.
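A quick Python sketch shows how this happens. It holds a trivial effect fixed (a hypothetical mean difference of 0.5 with a standard deviation of 10, so Cohen's d is 0.05) and changes only the sample size. For simplicity it assumes equal group sizes and uses a large-sample z-test approximation rather than a t-test.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(mean_diff, sd, n_per_group):
    """Approximate two-sided p-value for a difference between two means.

    Assumes equal group sizes, a common standard deviation, and a normal
    approximation (an illustrative simplification, not a full t-test).
    """
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


mean_diff = 0.5  # hypothetical raw effect
sd = 10.0        # hypothetical standard deviation
d = mean_diff / sd

p_small_sample = two_sided_p_value(mean_diff, sd, n_per_group=100)
p_large_sample = two_sided_p_value(mean_diff, sd, n_per_group=100_000)

print(f"Cohen's d = {d:.2f} in both cases")
print(f"n = 100 per group:     p = {p_small_sample:.3f}")
print(f"n = 100,000 per group: p = {p_large_sample:.3g}")
```

The effect size never changes, yet the p-value goes from clearly non-significant to essentially zero as the sample grows.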

Conversely, effect sizes indicate the magnitudes of those effects. By assessing the effect size, you can determine whether the effect is meaningful in the real world or trivial with no practical importance.

In a nutshell, here’s the difference:

- **Statistical significance**: After accounting for random sampling error, your sample suggests that a non-zero effect exists in the population.
- **Effect sizes**: The magnitude of the effect. It answers questions about how much or how well the treatment works. Are the relationships strong or weak?

**Related posts**: How Hypothesis Tests Work (Significance Levels), and How to Interpret P-values

## Consider both Effect Size and Statistical Significance!

It’s essential to use both statistics together. After all, you can have a sizeable effect apparent in your sample that is not significant. In that case, random sampling error might be creating the appearance of an effect in the sample even though no effect exists in the population.

When your results are statistically significant, assess the effect size to determine whether it is practically important.

To get bonus points from me, interpret the effect size with confidence intervals to evaluate the estimate’s precision.
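As a minimal sketch of that idea, the Python below computes an approximate 95% confidence interval for a mean difference from summary statistics. All of the numbers are hypothetical, the function name is illustrative, and the interval uses a large-sample normal approximation; with small samples, a t-based interval would be somewhat wider.

```python
from math import sqrt
from statistics import NormalDist


def mean_difference_ci(mean_diff, sd_1, n_1, sd_2, n_2, confidence=0.95):
    """Approximate CI for a difference between two independent means.

    Uses a normal (large-sample) approximation, which is an assumption
    made here to keep the example short.
    """
    standard_error = sqrt(sd_1**2 / n_1 + sd_2**2 / n_2)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_crit * standard_error
    return mean_diff - margin, mean_diff + margin


# Same hypothetical effect (10 units) and variability, different sample sizes.
low_small, high_small = mean_difference_ci(10.0, 25.0, 30, 25.0, 30)
low_large, high_large = mean_difference_ci(10.0, 25.0, 1000, 25.0, 1000)

print(f"n = 30 per group:   ({low_small:.2f}, {high_small:.2f})")
print(f"n = 1000 per group: ({low_large:.2f}, {high_large:.2f})")
```

With the smaller samples, the interval includes zero and stretches to more than double the point estimate, so the estimate is imprecise. With the larger samples, the interval is tight around 10.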

For additional information on this topic, including more about the role of confidence intervals in this process, read my post about Practical versus Statistical Significance.


## Comments

Stan Alekman says

Jim, I always calculate and report the confidence interval of the effect size in addition to statistical significance because it is a random variable. I have done studies where the effect size is impressive but the CI is so wide that the point value is much less impressive. Replication is needed.

I have seen reports of effect size expressed as standard deviation units. What meaning or interpretation can I give to these values?

Thank you.

Stan Alekman

Maria Fionda says

Hello Jim,

First, thank you for the helpful explanation! I’m wondering if you can help me understand how statistical significance (p-value), effect size *and* odds ratios work to tell the story, then. For example, the chi-square results of two categorical variables, Student Group (A vs B) and Re-enrollment (yes vs no):

| | Yes | No |
|---|---|---|
| Group A | 20 | 7 |
| Group B | 1416 | 1276 |

Chi-square (1, N=2719) = 4.95, p = 0.0261

Phi-coefficient (effect size): 0.0426

Odds ratio: 2.59 = Group A re-enrollment rate is 159% higher

It seems like the effect size is so negligible (i.e. it is saying that that significant difference is driven only by the large sample size but not because there is a real difference in the rates of re-enrollment between the two groups) but then the odds ratio makes it look like there is indeed a large difference. What story are these numbers telling?

I’m not sure if you offer paid consultation services. I tried clicking around the website to see if I could request this but all I could find was your suggestion that we ask our question on the article most closely related. I figured either this one or your chi-square article (https://statisticsbyjim.com/hypothesis-testing/chi-square-test-independence-example/) would be the most appropriate ones.

My sincere apologies if this question is beyond the intended scope of your services.

Tess says

Hi Jim, firstly, I want to say thank you for your incredibly helpful blogs and your dedication to making statistics more accessible. I appreciate that my question is probably rather vague, but I wondered if you had some general advice about appraising effect sizes in published research literature (my subject area is psychology). I have noticed reviews and meta-analyses discussing the likelihood of studies being underpowered in relation to effect sizes. I have a general understanding of this area, but I am always left feeling a bit stuck when I see comments in papers such as:

“There were no differences between the clusters in terms of depression, anxiety or somatoform dissociation, although the effect sizes (0.72, 0.70 and 0.41 respectively) suggest that this was a product of low power.”

I find myself thinking:

– How did the authors glance at these results and come up with such conclusions?

– Are they saying that if the study was better powered, they would likely find a significant difference with the mentioned effect sizes?

I would appreciate your input on this and any general guidance you might have for interpreting effect sizes in published literature.

Jim Frost says

Hi Tess,

You’re very welcome and I’m so glad that my website has been helpful!

It’s great that you’re trusting your instincts with your doubts about that type of assertion. No, those authors should not be stating something like that. The truth is that when you obtain insignificant results, you don’t know whether you’d get significant results if you had a larger sample size or not. While I’m not exactly sure what their thought process was, I have an idea.

If those effect sizes are Cohen’s d, then they represent medium to large effect sizes. And, they’re probably thinking, well, if they’re medium to large but they’re not significant, they would be significant if we had gathered a larger sample. Basically, these authors are assuming that if they had a larger sample size, they would continue to obtain the same size effects, which would become significant. That’s not how it works! If you could just do that, you’d never need a larger sample size. Just get a small one and extrapolate out for a large one.

The reality is that when you have a small sample size (underpowered), the estimated effect sizes are more unstable. That means they can swing around wildly and produce unusual values. It’s easier to get unusual results with a small sample than a large sample. However, a large sample won’t necessarily follow the same pattern because a large sample smooths out those erratic swings in the estimated effect sizes.

In fact, I think one of the largest benefits of p-values is when you actually get a large, insignificant p-value. It’s a protection against the very thing the authors have mistakenly concluded. The insignificant p-value warns you that while you have an apparent effect, given the variability in the sample and the size of the sample, there’s a large chance that the apparent effect in the sample represents random sample error and NOT a true effect in the population. For more information, read a post that I’ve written about Can High P-values Be Meaningful.

Additionally, it’s a known problem that underpowered studies tend to produce inflated effect sizes. For more information, read my post about Low Power Tests Exaggerate Effect Sizes.

Evgeniya says

Maybe I should use a different effect size? In my regression model, the dependent variable and explanatory variables of interest are quantitative, but the regression coefficients are not meaningful.

Evgeniya says

Hello Jim, thanks for a very helpful and comprehensive article. But I have one question: can I use omega squared after regression to measure effect sizes if my independent variables of interest are quantitative?

Kenneth Tuttle Wilhelm says

Evening Jim,

Thank you for the reply. Currently, the common replication of studies, in our curriculum (IB), is limited to t-tests, and Mann-Whitney, maybe the occasional ANOVA.

I rarely see anyone looking at skew or kurtosis. So this means no one is looking at the sample distribution and whether it’s normal or not. Which then forces most into the non-parametric end. And other things like an automatic rejection of parametric analysis of data from Likert scales, as an example.

Psychology, of course, is such a multi-factor research area that I believe it makes sense that, when considering the characteristics of a sample population, there will be instances where means may be similar (CLT), and yet there may be wide variability present in an experimental group. Hence, my question about whether post-hoc or preplanned analysis is more appropriate.

The issue is that students replicating prior research may only be looking at means, as that’s all the original research focused on, or at least reported. So it’s only after collecting the data, and observing the descriptive stats, that a variance is noticed, and one becomes interested in checking the variance.

In my particular environment, the school population is very multi-cultural (over 60% of students are from foreign countries), and even the surrounding general population is multi-ethnic. So if results show one IV-DV relationship (possibly a control) with an IV-DV relationship (experimental) showing a visual difference between variances, to me, it makes sense to at least consider the difference as having some informative value.

At a minimum, analysis of the variances, allows for students to demonstrate further critical thinking, with a bit of extra stats results added to the discussion and recommendations.

Cheers

Kenneth Wilhelm says

Hello Jim,

You used the word ‘…historically…’, which is my segue into a couple of questions.

When designing an experiment that relies on, let’s say, a two-sample t-test, should researchers indicate in their procedures that they’ll also analyse the differences in variance? Or would this be something that it’s acceptable to do post hoc when observing that, while the means are not different, the graph and numbers give a hint that the variances may be statistically different?

I not infrequently notice means not being significant, maybe leaning towards it, but short of the alpha, while the standard deviations of two groups do appear (visually) to be different. I’m a teacher of Psychology in an international school, and the curriculum programme requires students to replicate studies. So I’m looking at design, analysis, and results on a regular basis.

My own grad stats class was 28 years ago, so I’m more than rusty. I’ve completed reading through your Hypothesis Testing book, lots of stuff forgotten, and quite a bit that’s new. Now moving on to the Regression Analysis. (And joining the ASA as a teacher to boot)

Jim Frost says

Hi Kenneth,

I’m so glad that you’re finding my Hypothesis Testing book to be helpful! That sounds like a great requirement for students! I wrote about the lack of replication and the relationship to p-values in the field of Psychology.

Ideally, the researchers would note in their plans that differences between the means and/or standard deviations would be important and, consequently, plan from the beginning to test both.

I’m assuming that differences in the standard deviation would be a meaningful finding? If so, that’s a good reason. However, if you’re just assessing the difference in variability only because the difference between the means was not significant but you’re hoping to find something, that’s a different matter.

I don’t see anything wrong with doing the extra analysis based on observing the results. I’d include that in the discussion for transparency. And really consider whether that is an important finding or whether the only reason you’re looking at the variance is because the mean difference was not significant. You might also consider why, if a difference in variability is important, you didn’t think of it earlier. If it’s a worthwhile aspect to study, work it into the plan from the beginning!

There is a danger of adding analyses on at the end–to keep performing analyses until something pops out that is significant. That’s a form of cherry picking. I assume that’s not the case you’re describing. That you’re just adding the variability?

Best of luck on your exciting journey!

Stan Alekman says

Hi Jim. Regarding unstandardized effect sizes, one can compute the confidence interval of the mean difference between groups for an impression of the uncertainty in the observed effect size. This can no doubt be done for standardized effect sizes as well, although I have never done it.

Jim Frost says

Hi Stan,

You are absolutely correct as usual, both for unstandardized and standardized effect sizes! I include that as a bonus point right near the end.

Lucas says

Hello Jim, how are you? Very nice article, like your others. Jim, I was wondering if it is possible to get the printed versions of your books. Thank you so much.

Regards from Uruguay.

Lucas

Jim Frost says

Hi Lucas!

I have a soft spot for Uruguay because my wife is from there!

Yes, you can definitely get print versions of all my books. You can order them from Amazon. In My Webstore, I provide Amazon links for multiple countries. You can also order them from other online retailers, and some physical bookstores can order them for you.

antonius suhartomo says

Thanks, Jim, for this explanation. Coming from an electrical engineering background, I got something new.

Jim Frost says

You’re very welcome, Antonius! I’m always glad to hear when someone gets something new out of it! 🙂

Marty Shudak says

Jim, thank you for such an intuitive description of effect sizes. I may be misreading this part but under the heading: Mean Differences between Groups, the last sentence; shouldn’t the effect size be 10?

Jim Frost says

Hi Marty, thanks and yes, you’re absolutely correct about that! I fixed it. Higher math like that is challenging! 😉