Effect sizes in statistics quantify the differences between group means and the relationships between variables. While analysts often focus on statistical significance using p-values, effect sizes determine the practical importance of the findings.

In experiments and other studies, analysts typically assess relationships between variables. Effect sizes represent the magnitude of a relationship between variables. For example, you might want to know whether average health outcomes differ between the control group and a treatment group receiving a new medicine. Or, you might want to determine whether processing temperatures relate to a product’s strength.

Effect sizes tell you whether these relationships are strong or weak. Do these variables have a large or negligible impact on the outcome? The experimental medicine might improve health outcomes, but is it a trivial or substantial improvement? This type of information is crucial in determining whether the effect is meaningful in real-world applications.

Effect sizes come in two general flavors, unstandardized and standardized. Depending on your field, you might be more familiar with one or the other.

In this post, you’ll learn about both unstandardized and standardized effect sizes. Specifically, we’ll look at the following effect sizes:

- **Unstandardized**: Mean differences between groups and regression coefficients.
- **Standardized**: Correlation coefficients, Cohen’s d, eta squared, and omega squared.

Finally, I close the post by explaining the difference between statistical significance and effect sizes, and why you need to consider both.

## Unstandardized Effect Sizes

Unstandardized effect sizes use the natural units of the data. Using the raw data units can be convenient when you intuitively understand those units. This is often the case with tangible concepts, such as weight, money, temperature, etc.

Let’s look at two common types of unstandardized effect sizes, the mean difference between groups and regression coefficients.

### Mean Differences between Groups

This one is simple. Just subtract group means to calculate the unstandardized effect size:

Difference Between Group Means = Group 1 Mean – Group 2 Mean

Groups 1 and 2 can be the treatment and control groups, posttest and pretest measurements, two different types of treatments, and so on.

For example, imagine we’re developing a weight loss pill. The control group loses an average of 5 kg while the treatment group loses an average of 15 kg during the study. The effect size is 15 – 5 = 10 kg. That’s the mean difference between the two groups.

Because you are only subtracting means, the units remain the natural data units. In the example, we’re using kilograms. Consequently, the effect size is 10 kg.
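To make the subtraction concrete, here is a minimal Python sketch. The individual weight-loss values are hypothetical numbers I made up so that the group means match the example (5 kg and 15 kg); the function name `mean_difference` is just for illustration.

```python
from statistics import mean

# Hypothetical weight loss in kg per participant (illustrative data only,
# chosen so the group means are 5 kg and 15 kg as in the example).
control = [4.0, 6.0, 5.0, 5.5, 4.5]
treatment = [14.0, 16.0, 15.0, 15.5, 14.5]


def mean_difference(group1, group2):
    """Unstandardized effect size: Group 1 mean minus Group 2 mean."""
    return mean(group1) - mean(group2)


effect_kg = mean_difference(treatment, control)
print(f"Mean difference: {effect_kg:.1f} kg")
```

The result stays in kilograms because the calculation only subtracts one mean from the other.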

**Related post**: Post Hoc Tests in ANOVA to Assess Differences between Means

### Regression Coefficients

Regression coefficients are an effect size that indicates the relationship between variables. These coefficients use the units of your model’s dependent variable.

For example, suppose you fit a regression model with years of experience as an independent variable and income in U.S. dollars as the dependent variable. The model estimates a coefficient for years of experience of, say, 867. This value indicates that for every one-year increase in experience, income increases by an average of $867.

That value is the effect size for the relationship between years of experience and income. It is an unstandardized effect size because it uses the natural units of the dependent variable, U.S. dollars.
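As a rough sketch of where such a coefficient comes from, the code below fits a one-predictor least-squares line by hand. The five data points are hypothetical and were constructed so the slope comes out to 867; real data would not be this tidy. The helper name `ols_slope_intercept` is my own.

```python
from statistics import mean


def ols_slope_intercept(x, y):
    """Ordinary least squares fit for a single predictor: y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical data, built so that the fitted slope is 867 dollars per year.
years = [1, 2, 3, 4, 5]
income_usd = [40967, 41534, 42801, 43268, 44435]

slope_usd, intercept_usd = ols_slope_intercept(years, income_usd)
print(f"Each extra year of experience: ${slope_usd:,.0f} on average")

# The coefficient is tied to the units of the dependent variable.
# Measuring income in thousands of dollars rescales the coefficient too.
income_thousands = [v / 1000 for v in income_usd]
slope_thousands, _ = ols_slope_intercept(years, income_thousands)
print(f"Same relationship in thousands of dollars: {slope_thousands:.3f}")
```

The second fit describes the same relationship, yet the coefficient changes because the units changed. That unit dependence is what makes the coefficient an unstandardized effect size.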

**Related post**: How to Interpret Regression Coefficients and their P-values

## Standardized Effect Sizes

Standardized effect sizes do not use the original data units. Instead, they are unitless, allowing you to compare results between studies and variables that use different units.

Additionally, standardized effect sizes are useful for experiments where the original units are not inherently meaningful or are potentially confusing to your readers. For example, think back to the years of experience and income example. That study reported its results in U.S. dollars, though you can substitute your local currency. As a measurement unit, your currency is inherently meaningful to you. You understand what the magnitude of the value represents.

Conversely, many psychology studies use inventories to assess personality characteristics. Those inventory units are not inherently meaningful. For example, it might not be self-evident whether a 10-point difference on a specific inventory represents a small or large effect. Even if you know the answer because it’s your specialty, your readers might not!

However, by standardizing the effect size and removing the data units, the effect’s magnitude becomes apparent. You can compare it to other findings and you don’t need to be familiar with the original units to understand the results.

Consider using standardized effect sizes for comparisons between studies and different variables, or when the original units are not intuitively meaningful. Meta-analyses often use standardized effect sizes from many studies to summarize a set of findings.

Let’s examine several common standardized effect sizes, including correlation coefficients, Cohen’s d, eta squared, and omega squared.

### Correlation coefficients

You might not think of correlation coefficients as standardized effect sizes, but they are a standardized alternative to regression coefficients. Correlation does not use the original data units and all values fall between -1 and +1. You can use them to compare the strengths of the relationships between different pairs of variables because they use a standardized scale.

In the regression coefficient example, recall that the coefficient of 867 represents the mean change of the dependent variable in U.S. dollars. You could instead report the correlation between experience and income.

To understand the potential strength of correlation coefficients, consider different studies that find correlations between height and weight, processing temperature and product strength, and hours of sunlight and depression scores. These studies assess relationships between entirely different types of variables that use different measurement units.

Now imagine these pairs of variables all have the same correlation coefficient. Even though the pairs are highly dissimilar, you know that the strengths of the relationships are equal. Or, if one had a higher correlation, you’d quickly see that it has a stronger relationship. The diverse nature of the variables is not a problem at all because correlation coefficients are standardized!

Instead of correlation coefficients, you can also use standardized regression coefficients for the same reasons.
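Here is a small Python sketch of that unitless property. The experience and income numbers are hypothetical, and the helper names are my own. It computes Pearson’s r for income measured in dollars and in thousands of dollars, then compares r to the standardized slope from a one-predictor regression, where the two are mathematically equal.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def standardized_slope(x, y):
    """Regression slope after converting both variables to z-scores."""
    x_bar, y_bar, sd_x, sd_y = mean(x), mean(y), stdev(x), stdev(y)
    zx = [(xi - x_bar) / sd_x for xi in x]
    zy = [(yi - y_bar) / sd_y for yi in y]
    # z-scores have mean zero, so the slope simplifies to this ratio.
    return sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)


# Hypothetical data for illustration only.
years = [1, 2, 3, 4, 5]
income_usd = [40967, 41534, 42801, 43268, 44435]
income_thousands = [v / 1000 for v in income_usd]

r_usd = pearson_r(years, income_usd)
r_thousands = pearson_r(years, income_thousands)
beta = standardized_slope(years, income_usd)

print(f"r with income in dollars:   {r_usd:.4f}")
print(f"r with income in thousands: {r_thousands:.4f}")
print(f"standardized slope:         {beta:.4f}")
```

Changing the measurement units leaves r untouched, which is why you can compare correlations across very different pairs of variables.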

**Related post**: Interpreting Correlation Coefficients and Spearman’s Rank Order Correlation Explained

### Cohen’s d

Cohen’s d is a standardized effect size for differences between group means. For the unstandardized effect size, you just subtract the group means. To standardize it, divide that difference by the standard deviation. It’s an appropriate effect size to report with t-test and ANOVA results.

The numerator is simply the unstandardized effect size, which you divide by the standard deviation. The standard deviation is either the pooled standard deviation for both groups or the standard deviation of the control group. Because both parts of the fraction use the same units, the division process cancels them out and produces a unitless result.

Cohen’s d represents the effect size by indicating how large the unstandardized effect is relative to the data’s variability. Think of it as a signal-to-noise ratio. A large Cohen’s d means the effect (signal) is large relative to the variability (noise). A *d* of 1 indicates that the effect is the same magnitude as the variability. A *d* of 2 signifies that the effect is twice the size of the variability, and so on.

For example, if the unstandardized effect size is 10 and the standard deviation is 2, Cohen’s d is an impressive 5. However, if you have the same effect size of 10 and the standard deviation is also 10, Cohen’s d is a much less impressive 1. The effect is on par with the variability in the data.
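The sketch below reproduces those two cases in Python using the pooled standard deviation. The scores are hypothetical values I picked so that the mean difference is 10 in both cases while the standard deviation is 2 in one and 10 in the other. The function name `cohens_d` is my own, not a library call.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group1, group2):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    pooled_variance = (
        (n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)
    ) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_variance)


# Hypothetical scores: mean difference of 10, standard deviation of 2.
d_low_variability = cohens_d([13, 15, 17], [3, 5, 7])

# Same mean difference of 10, but a standard deviation of 10.
d_high_variability = cohens_d([5, 15, 25], [-5, 5, 15])

print(f"d with SD = 2:  {d_low_variability:.1f}")
print(f"d with SD = 10: {d_high_variability:.1f}")
```

The unstandardized effect is identical in both cases. Only the noise changes, and Cohen’s d drops from 5 to 1 as a result.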

As you gain experience in your field of study, you’ll learn which effect sizes are considered small, medium, and large. Cohen suggested that values of 0.2, 0.5, and 0.8 represent small, medium, and large effects. However, these values don’t apply to all subject areas. Instead, build up a familiarity with Cohen’s d values in your subject area.

### Eta Squared and Omega Squared

Eta Squared and the related Omega Squared are standardized effect sizes that indicate the percentage of the variance that each categorical variable in an ANOVA model explains. Values can range from 0 to 100%. These effect sizes are similar to R-squared, which represents the percentage of the variance that all variables in the model collectively explain.

Each categorical variable has a value that indicates the percentage of the variance that it explains. Like R-squared, eta squared and omega squared are intuitive measures that you can use to compare variable effect sizes between models.

The difference between eta squared and omega squared is that omega squared adjusts for bias present in eta squared, particularly for small samples. Typically, statisticians prefer omega squared because it is a less biased estimator.
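For a one-way ANOVA with a single categorical variable, both statistics can be computed from the sums of squares. The Python sketch below uses the usual one-way, between-subjects formulas and three tiny hypothetical groups. Treat it as an illustration under those assumptions, not a general routine for multi-factor models.

```python
from statistics import mean


def eta_and_omega_squared(groups):
    """Eta squared and omega squared for a one-way, between-subjects ANOVA."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    group_means = [mean(group) for group in groups]

    ss_between = sum(
        len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, group_means)
    )
    ss_within = sum(
        (value - m) ** 2 for group, m in zip(groups, group_means) for value in group
    )
    ss_total = ss_between + ss_within

    df_between = len(groups) - 1
    ms_within = ss_within / (len(all_values) - len(groups))

    eta_sq = ss_between / ss_total
    # Omega squared subtracts the variation expected from chance alone.
    omega_sq = (ss_between - df_between * ms_within) / (ss_total + ms_within)
    return eta_sq, omega_sq


# Hypothetical scores for three groups (illustrative data only).
groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
eta_sq, omega_sq = eta_and_omega_squared(groups)

print(f"Eta squared:   {eta_sq:.1%}")
print(f"Omega squared: {omega_sq:.1%}")
```

With only three observations per group, omega squared comes out noticeably lower than eta squared (roughly 31% versus 50%), which shows the small-sample adjustment at work.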

**Related post**: How to Interpret R-squared

## Effect Sizes and Statistical Significance

Historically, statistical results were all about statistical significance. Statistical significance was the goal. However, that emphasis has changed over time. Analysts have increasingly reported effect sizes to show that their findings are important in the real world.

What is the difference between these two concepts?

After performing a hypothesis test, statistically significant results indicate that your sample provides sufficient evidence to conclude that an effect exists in the population. Specifically, statistical significance suggests that the population effect is unlikely to equal zero.

That’s a good start. It helps rule out random sampling error as the culprit for an apparent effect in your sample.

While the word “significant” makes the results sound important, it doesn’t necessarily mean the effect size is meaningful in the real world. Again, it suggests only a non-zero effect, which includes trivial findings.

If you have a large sample size and/or a low amount of variability in your data, hypothesis tests can produce significant p-values for trivial effects.
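To see how sample size alone can push a trivial effect past the significance threshold, consider this Python sketch. It uses a normal (z) approximation with equal group sizes and a known, common standard deviation, which is a simplification of a real t-test. The mean difference of 0.1 and standard deviation of 10 are hypothetical, and the function name is my own.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(mean_diff, sd, n_per_group):
    """Approximate two-sided p-value for a difference between two group means.

    Assumes equal group sizes, a known common SD, and a normal approximation.
    """
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical trivial effect: Cohen's d = 0.1 / 10 = 0.01.
mean_diff, sd = 0.1, 10.0

p_small_sample = two_sided_p(mean_diff, sd, n_per_group=100)
p_large_sample = two_sided_p(mean_diff, sd, n_per_group=1_000_000)

print(f"n = 100 per group:       p = {p_small_sample:.3f}")
print(f"n = 1,000,000 per group: p = {p_large_sample:.2e}")
```

The effect size is the same tiny value in both runs. Only the sample size differs, yet the second p-value falls far below 0.05.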

Conversely, effect sizes indicate the magnitudes of those effects. By assessing the effect size, you can determine whether the effect is meaningful in the real world or trivial with no practical importance.

In a nutshell, here’s the difference:

- **Statistical significance**: After accounting for random sampling error, your sample suggests that a non-zero effect exists in the population.
- **Effect sizes**: The magnitude of the effect. It answers questions about how much or how well the treatment works. Are the relationships strong or weak?

**Related posts**: How Hypothesis Tests Work (Significance Levels), and How to Interpret P-values

## Consider both Effect Size and Statistical Significance!

It’s essential to use both statistics together. After all, you can have a sizeable effect apparent in your sample that is not significant. In that case, random sampling error might be creating the appearance of an effect in the sample when no effect exists in the population.

When your results are statistically significant, assess the effect size to determine whether it is practically important.

To get bonus points from me, interpret the effect size with confidence intervals to evaluate the estimate’s precision.
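As a minimal sketch of that idea, the Python code below puts an approximate 95% confidence interval around a mean difference. It uses a normal critical value for simplicity. With small samples like these, a t critical value would be more appropriate and would give a somewhat wider interval. The data are hypothetical and the function name is my own.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def mean_difference_ci(group1, group2, confidence=0.95):
    """Mean difference with an approximate CI (normal approximation, unequal variances)."""
    diff = mean(group1) - mean(group2)
    standard_error = sqrt(
        variance(group1) / len(group1) + variance(group2) / len(group2)
    )
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * standard_error, diff + z * standard_error


# Hypothetical weight loss in kg (illustrative data only).
treatment = [14.0, 16.0, 15.0, 15.5, 14.5, 13.0, 17.0, 15.0]
control = [4.0, 6.0, 5.0, 5.5, 4.5, 3.0, 7.0, 5.0]

diff, lower, upper = mean_difference_ci(treatment, control)
print(f"Mean difference: {diff:.1f} kg, 95% CI [{lower:.1f}, {upper:.1f}] kg")
```

A narrow interval indicates a precise estimate of the effect size. If the whole interval sits in the range you consider practically important, you can be more confident the effect matters in the real world.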

For additional information on this topic, including more about the role of confidence intervals in this process, read my post about Practical versus Statistical Significance.

Kenneth Tuttle Wilhelm says

Evening Jim,

Thank you for the reply. Currently, the common replication of studies, in our curriculum (IB), is limited to t-tests, and Mann-Whitney, maybe the occasional ANOVA.

I rarely see anyone looking at skew or kurtosis. So this means no one is looking at the sample distribution and whether it’s normal or not. Which then forces most into the non-parametric end. And other things like an automatic rejection of parametric analysis of data from Likert scales, as an example.

Psychology of course, being such a multi-factor research area, I believe that it makes sense that when considering the characteristics of a sample population, there will be instances where means may be similar (CLT), and yet there may be wide variability present in an experimental group. Hence, my question about whether post-hoc or preplanned analysis is more appropriate.

The issue is that students replicating prior research may only be looking at means, as that’s all the original research focused on, or at least reported. So it’s only after collecting the data, and observing the descriptive stats, that a variance is noticed, and one becomes interested in checking the variance.

In my particular environment, the school population is very multi-cultural (over 60% of students are from foreign countries), and even the surrounding general population is multi-ethnic. So if results show one IV-DV relationship (possibly a control) with an IV-DV relationship (experimental) showing a visual difference between variances, to me, it makes sense to at least consider the difference as having some informative value.

At a minimum, analysis of the variances, allows for students to demonstrate further critical thinking, with a bit of extra stats results added to the discussion and recommendations.

Cheers

Kenneth Wilhelm says

Hello Jim,

You used the word ‘…historically…’, which is my segue into a couple of questions.

When designing an experiment that is relying on, let’s say, a two-sample t-test, should researchers indicate in their procedures that they’ll also analyse the differences in variance? Or would this be something that it’s acceptable to do post hoc when observing that while the means are not different, the graph and numbers give a hint that the variances may be statistically different?

I, not too infrequently notice means not being significant, maybe leaning towards it, but short of the alpha, while the Standard Deviations of two groups do appear (visually) to be different. I’m a teacher of Psychology in an international school, and the curriculum programme requires students to replicate studies. So I’m looking at design, analysis, and results on a regular basis.

My own grad stats class was 28 years ago, so I’m more than rusty. I’ve completed reading through your Hypothesis Testing book, lots of stuff forgotten, and quite a bit that’s new. Now moving on to the Regression Analysis. (And joining the ASA as a teacher to boot)

Jim Frost says

Hi Kenneth,

I’m so glad that you’re finding my Hypothesis Testing book to be helpful! That sounds like a great requirement for students! I wrote about the lack of replication and the relationship to p-values in the field of Psychology.

Ideally, the researchers would note in their plans that differences between the means and/or standard deviations would be important and, consequently, plan from the beginning to test both.

I’m assuming that differences in the standard deviation would be a meaningful finding? If so, that’s a good reason. However, if you’re just assessing the difference in variability only because the difference between the means was not significant but you’re hoping to find something, that’s a different matter.

I don’t see anything wrong with doing the extra analysis based on observing the results. I’d include that in the discussion for transparency. And really consider whether that is an important finding or whether the only reason you’re looking at the variance is because the mean difference was not significant. You might also consider why, if a difference in variability is important, you didn’t think of it earlier. If it’s a worthwhile aspect to study, work it into the plan from the beginning!

There is a danger of adding analyses on at the end–to keep performing analyses until something pops out that is significant. That’s a form of cherry picking. I assume that’s not the case you’re describing. That you’re just adding the variability?

Best of luck on your exciting journey!

Stan Alekman says

Hi Jim. Regarding unstandardized effect sizes, one can compute the confidence interval of the mean difference between groups for an impression of the uncertainty in the observed effect size. This can no doubt be done for standardized effect sizes as well, although I have never done it.

Jim Frost says

Hi Stan,

You are absolutely correct as usual, both for unstandardized and standardized effect sizes! I include that as a bonus point right near the end.

Lucas says

Hello Jim, how are you? Very nice article, like your others. Jim, I was wondering if it is possible to get the printed versions of your books. Thank you so much.

Regards from Uruguay.

Lucas

Jim Frost says

Hi Lucas!

I have a soft spot for Uruguay because my wife is from there!

Yes, you can definitely get print versions of all my books. You can order them from Amazon. In My Webstore, I provide Amazon links for multiple countries. You can also order them from other online retailers, and some physical bookstores can order them for you.

antonius suhartomo says

Thanks Jim for this explanation. Coming from an electrical engineering background, I got something new.

Jim Frost says

You’re very welcome, Antonius! I’m always glad to hear when someone gets something new out of it! 🙂

Marty Shudak says

Jim, thank you for such an intuitive description of effect sizes. I may be misreading this part but under the heading: Mean Differences between Groups, the last sentence; shouldn’t the effect size be 10?

Jim Frost says

Hi Marty, thanks and yes, you’re absolutely correct about that! I fixed it. Higher math like that is challenging! 😉