If your study has low statistical power, it will exaggerate the effect size. What?!

Statistical power is the ability of a hypothesis test to detect an effect that exists in the population. Clearly, a high-powered study is a good thing just for being able to identify these effects. Low power reduces your chances of discovering real findings. However, many analysts don’t realize that low power also inflates the effect size.

In this post, I show why this unexpected relationship between power and exaggerated effect sizes exists. I’ll also tie it to other issues, such as the bias of effects published in journals and other matters about statistical power. I think this post will be eye-opening and thought-provoking! As always, I’ll use many graphs rather than equations.

## Hypothetical Study Scenario

To illustrate how this effect size inflation works, I’ll simulate a study and conduct it many times at three power levels.

Imagine that we’re studying a fictitious medication that promises to increase your intelligence (IQ). Our experiment has two groups: a control group that doesn’t take the pill and a treatment group that takes it. Then, each group takes the same IQ test and we compare the results. The effect size is the difference between group means.

Because we’re simulating these studies, we can control the effect size and other properties of the population. I’ll set the effect size at 10 IQ points and define the two populations as follows:

**Control group**: Normal distribution with a mean of 100 and a standard deviation of 15.

**Treatment group**: Normal distribution with a mean of 110 and a standard deviation of 15.

I calculated the sample sizes I need to produce statistical power of 0.3, 0.55, and 0.8. The first two values represent low power studies, while the third value is a standard target value. The output below shows the power analysis results.
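For readers who want to reproduce a power analysis like this one, here is a minimal sketch in Python. It is not the calculation from my statistical software. It assumes a two-tailed, pooled-variance 2-sample t-test with equal group sizes and uses SciPy’s noncentral t-distribution. The function names are my own, and the sample sizes it prints are what this sketch finds, which may differ slightly from other software.

```python
import math
from scipy import stats

def two_sample_power(n_per_group, diff=10.0, sd=15.0, alpha=0.05):
    """Power of a two-tailed, pooled-variance 2-sample t-test (equal group sizes)."""
    df = 2 * n_per_group - 2
    se = sd * math.sqrt(2.0 / n_per_group)   # standard error of the difference in means
    ncp = diff / se                          # noncentrality parameter under the true effect
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # upper critical t-value
    # Probability that the test statistic falls in either rejection region
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

def smallest_n_for_power(target, **kwargs):
    """Smallest per-group sample size whose power reaches the target."""
    n = 2
    while two_sample_power(n, **kwargs) < target:
        n += 1
    return n

for target in (0.30, 0.55, 0.80):
    n = smallest_n_for_power(target)
    print(f"target power {target:.2f}: n per group = {n}, "
          f"actual power = {two_sample_power(n):.3f}")
```

Because sample sizes are whole numbers, the actual power usually lands a little above the target rather than exactly on it.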

Now, I’ll use statistical software to draw 50 random samples from the above populations for each of the sample sizes in the power analysis. Finally, I perform 2-sample t-tests on all datasets. I use two-tailed tests with a significance level of 0.05. The following discussion explains the results of the 50 2-sample t-tests for each of the three power levels.
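If you’d like to try this yourself, the sketch below simulates one design in Python. It assumes NumPy and SciPy, a pooled-variance t-test, and an arbitrary random seed, so it illustrates the procedure rather than reproducing my exact datasets. Your counts and averages will differ from mine because the samples are random.

```python
import numpy as np
from scipy import stats

def simulate_studies(n_per_group, n_studies=50, control_mean=100.0,
                     treatment_mean=110.0, sd=15.0, alpha=0.05, seed=1):
    """Simulate repeated two-group studies; return effects and a significance flag."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated study
    control = rng.normal(control_mean, sd, size=(n_studies, n_per_group))
    treatment = rng.normal(treatment_mean, sd, size=(n_studies, n_per_group))
    effects = treatment.mean(axis=1) - control.mean(axis=1)
    # Two-tailed 2-sample t-test per study (pooled variance is SciPy's default)
    p_values = stats.ttest_ind(treatment, control, axis=1).pvalue
    return effects, p_values < alpha

# 11 observations per group is the lowest-power design in this post
effects, significant = simulate_studies(n_per_group=11)
print(f"significant: {significant.sum()} of {significant.size}")
if significant.any():
    sig = effects[significant]
    print(f"mean significant effect: {sig.mean():.2f} "
          f"(range {sig.min():.2f} to {sig.max():.2f})")
```

Change `seed` to see how much the results move around with only 50 studies.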

**Related post**: Learn How t-Tests Work and Significance Levels and P-values

## Findings and Estimated Effect Sizes for Very Low Power (0.3)

We know that the correct effect size for these analyses is 10. Let’s see what the 2-sample hypothesis tests reveal for the 50 datasets that have a power of 0.3. Given this power, we expect to detect the effect 30% of the time. We’d expect that percentage if we perform the test an infinite number of times. However, we conducted it only 50 times, so there’s a margin of error around the percentage of significant studies.

Of the 50 tests with the lowest statistical power, 13 (26%) are statistically significant. The average effect size is 17.05 IQ points, and the range extends from 12.01 to 21.45. Not only is the average effect too high, but the entire range of effects is greater than the actual effect. The graph below displays the distribution of statistically significant findings with a reference line for the real effect (10).

Now, let’s look at the results for the other two levels of statistical power.

### Power = 0.55

Of the 50 tests, 34 (68%) are statistically significant. The average effect size is 13.50, and the range extends from 7.71 to 19.04. A large majority of the effects are greater than 10.

### Power = 0.8

Of the 50 tests, 41 (82%) are statistically significant. The average effect size is 11.92, and the range extends from 6.30 to 19.43. The real effect size is moving closer to the center of the distribution.

## Relationship between Statistical Power and Effect Size

As the power level increases, the percentage of detections increases and the exaggeration of the effect size decreases. Both are good things and they have a common cause. The graph below displays the exaggeration factor (mean significant effect / actual effect) by power. No exaggeration occurs at a value of one.

We saw that the effects all have a positive bias amongst the statistically significant studies. Typically, when you perform studies, you, other researchers, and peer-reviewed journals pay attention only to statistically significant findings. After all, non-significant results indicate that your sample provides insufficient evidence to conclude that the effect exists in the population. But, let’s look at all the outcomes, both significant and non-significant, for the three power levels.

For each of the three power levels, all 50 tests have an average effect near the correct value of 10. Additionally, the effects are approximately symmetrically distributed around 10.

**Clue**: When we assess both the significant and non-significant studies together, the estimated effect is an unbiased estimator of the actual population effect. However, when we evaluate only the statistically significant effects, the estimated effect is a biased estimator.
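You can check this clue with a much larger simulation than my 50 studies per design. The sketch below is an illustration under assumptions, not my original analysis. It uses NumPy and SciPy with a pooled-variance t-test, and the per-group sizes of 22 and 37 are stand-ins I chose for the two higher-powered designs (only the 11 per group comes from this post).

```python
import numpy as np
from scipy import stats

TRUE_EFFECT = 10.0

def effect_summary(n_per_group, n_studies=20_000, sd=15.0, alpha=0.05, seed=0):
    """Compare the mean effect across all studies with the mean across significant ones."""
    rng = np.random.default_rng(seed)
    control = rng.normal(100.0, sd, size=(n_studies, n_per_group))
    treatment = rng.normal(100.0 + TRUE_EFFECT, sd, size=(n_studies, n_per_group))
    effects = treatment.mean(axis=1) - control.mean(axis=1)
    significant = stats.ttest_ind(treatment, control, axis=1).pvalue < alpha
    return {
        "share_significant": significant.mean(),         # estimates the power
        "mean_all": effects.mean(),                      # no filtering
        "mean_significant": effects[significant].mean()  # filtered by p-value
    }

# 11 per group is from the post; 22 and 37 are assumed stand-ins
for n in (11, 22, 37):
    s = effect_summary(n)
    exaggeration = s["mean_significant"] / TRUE_EFFECT
    print(f"n = {n:2d}: share significant = {s['share_significant']:.2f}, "
          f"mean of all = {s['mean_all']:.2f}, "
          f"mean of significant = {s['mean_significant']:.2f}, "
          f"exaggeration factor = {exaggeration:.2f}")
```

The mean of all studies should sit close to 10 for every design, while the mean of the significant studies should move toward 10 as the sample size grows.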

Let’s graph the distributions of the significant and non-significant results to explore this clue.

These graphs show that the hypothesis test classifies the most extreme effect sizes as being statistically significant. As the power level increases, effect sizes need to be less extreme to be significant. That’s precisely how hypothesis tests are supposed to work!

## How Low Statistical Power Biases the Estimates

We noticed that the full distributions of both significant and non-significant test results approximately center on the correct effect size. However, that was not true when we looked only at the significant findings, which produced biased estimates.

For a value to be the mean of a symmetric distribution, it must have roughly an equal number of values above it and below. In this case, we know the correct effect size is 10. However, statistical power affects how extreme an estimated effect size must be for the test to classify it as statistically significant. The non-significant findings are systematically on the low end of the distribution. By filtering the results by statistical significance, you exclude these smaller effects when calculating the mean, which biases the mean upwards.

For a statistical power of 0.3, given the conditions of the experiment, the test can detect an effect size only when it is greater than or equal to 13.39 IQ points. I calculated this value by multiplying the critical t-value by the standard error of the difference between means (2.093 * 6.396). Note that 13.39 is greater than the correct effect size of 10. From that value alone, you know the significant effects will be biased upwards. You need an unusually high sample effect to obtain statistical significance. This high critical value severely truncates the full distribution of results, which eliminates the lower estimates (i.e., < 10) that would otherwise pull the average down to the correct value. It also explains why all the significant sample estimates are greater than 10. Consequently, the average of the significant effects is biased too high.
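The arithmetic behind that threshold can be sketched in a few lines of Python. This assumes SciPy and a pooled-variance test with 2n − 2 degrees of freedom, and the function name is my own. With 11 observations per group, that assumption gives 20 degrees of freedom and a critical t-value of about 2.086, while the 2.093 quoted above is the value for 19 degrees of freedom, so the sketch’s threshold comes out slightly lower than 13.39.

```python
import math
from scipy import stats

def min_significant_difference(n_per_group, sd=15.0, alpha=0.05):
    """Smallest difference in means that a two-tailed 2-sample t-test calls significant,
    using the population SD in place of each sample's own estimate."""
    df = 2 * n_per_group - 2                 # pooled-variance degrees of freedom (assumed)
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # upper critical t-value
    se = sd * math.sqrt(2.0 / n_per_group)   # standard error of the difference in means
    return t_crit, se, t_crit * se

t_crit, se, threshold = min_significant_difference(11)
print(f"critical t = {t_crit:.3f}, SE = {se:.3f}, threshold = {threshold:.2f}")
# The post's own arithmetic, using the quoted critical value
print(f"2.093 * {se:.3f} = {2.093 * se:.2f}")
```

Either way, the threshold is well above the true effect of 10, which is the point that matters here.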

The minimum detectable effect sizes for powers of 0.55 and 0.8 are 9.14 and 6.95, respectively. These higher-powered tests can detect less extreme effect sizes. However, they still truncate the lower end of the distribution, which biases the effects upward. The only way to avoid the bias is to have a statistical power of 100%, which includes all the test results. However, hypothesis tests never obtain 100% power. Fortunately, when you’re near 80% power, the bias is relatively small.

## Graphical Representation Using Probability Distribution Plots

The charts below show how this works using probability distribution plots for a power of 0.3 and 0.8. The red distribution on the left is the t-distribution that the 2-sample t-test uses to determine statistical significance given the design. The red numbers at the top correspond to the t-distribution. They pinpoint the null hypothesis value (difference = 0) and the upper critical value (CV) for a two-tailed test. I don’t display the lower critical value. For convenience, I’ve converted the t-values you usually see with t-distributions to the real data units.

The blue distribution on the right is the expected distribution of differences between means given the properties of the two populations. The blue number locates the mean IQ difference associated with the correct effect (10), which occurs at the peak of the blue curve.

Here are some essential aspects to notice.

Notice how the critical value line truncates the distribution of differences in the blue curves. All differences on the blue curve to the left of the red dashed CV line will not be significant. This portion of the blue curve represents false negatives, which statisticians refer to as Type II errors. Consequently, those smaller effect sizes won’t be included in the calculations for the mean of the significant effect sizes. The average of the significant effects must therefore be higher than the critical value. Truncation diminishes as power increases from 0.3 to 0.8. The amount of truncation determines the degree of bias in the estimated effect amongst statistically significant results.

The area under the blue curve to the right of the CV line represents the power of the test because those differences will be significant. For 0.3 statistical power, 30% of the area under the blue curve is to the right of the CV line. For 0.8 power, 80% is to the right of the CV line.

The higher the proportion of the blue curve that is to the right of the CV line, the less bias exists in the significant effects and the greater the statistical power.
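The two ideas in these plots, power as the area to the right of the CV line and bias as the mean of what remains after truncation, can be approximated with Python’s standard library alone. This is a rough sketch under stated assumptions. It treats the blue curve as a normal distribution centered on 10 with the standard error from earlier in this post (6.396), holds the critical value fixed at 13.39, and ignores the lower critical value. The function name is my own.

```python
from statistics import NormalDist

def power_and_truncated_mean(true_effect, se, critical_value):
    """Area of the 'blue curve' right of the CV line and the mean of that region."""
    blue = NormalDist(mu=true_effect, sigma=se)
    power = 1.0 - blue.cdf(critical_value)   # share of differences that are significant
    z = (critical_value - true_effect) / se  # CV in standard-error units from the true effect
    # Mean of a normal distribution truncated from below at the critical value
    mean_significant = true_effect + se * NormalDist().pdf(z) / power
    return power, mean_significant

power, mean_sig = power_and_truncated_mean(true_effect=10.0, se=6.396, critical_value=13.39)
print(f"approximate power: {power:.2f}")
print(f"approximate mean of significant effects: {mean_sig:.2f}")
```

Moving the critical value to the left in this sketch raises the power and pulls the truncated mean back toward the true effect, which mirrors what the plots show.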

## Discussion

I hope this illustration was an eye-opener. I suspect the fact that low power studies inflate effect sizes is underappreciated. Typically, analysts think the biggest danger of low power studies is missing an effect, which is a real possibility. However, when an analyst obtains significant results with low power, they are relieved, but they don’t realize it inflates the effect size!

Particular fields are more prone to conducting studies with small sample sizes and low power. For example, psychology studies routinely use small samples. Unsurprisingly, psychology has had problems with exaggerated effect sizes. Alternatively, researchers might use a small pilot study to start, which could produce inflated effect sizes.

You might have heard that effect sizes in journal articles are biased because editors publish only significant studies. While the articles in a journal won’t all neatly have the same power, the same principle applies. By restricting publication by p-values, journals exclude the smaller estimates. Imagine you’re researching a subject area and you find all published articles about a particular effect. You might think that by averaging them together, you’ll get a reasonable estimate. That’s not necessarily the case! Think back to the first set of graphs that displayed only the significant results. Those were biased. It wasn’t until we added in all the smaller, non-significant effects that the average effect was close to the actual effect.

Finally, calculating power for this simulation was easy because I knew the correct values to enter into the power calculations. However, for a study in the real world, it can be difficult. Consequently, you might not always realize you have a low power study. Additionally, consider that the smallest studies in these simulations had 22 subjects split between two groups, which produced a statistical power of 0.3. If this were a real study, I bet the researchers would not realize it had such low power. When in doubt, err on the side of larger sample sizes. And, do your best to be realistic with the power calculations!

Kushal says

Hello Jim,

First of all, let me thank you for this blog. This is really helpful. I understand the overall idea; however, I have some questions which I am unable to get answers myself. Could you please help me with the following questions?

1) I am unable to understand what the following two excerpts indicate together. If, for 0.3 statistical power, the test can detect an effect size only when the effect is above 13.39, then how is it that the graph of statistically significant effects (statistical power of 0.3) includes differences of less than 13.39? Relevant excerpts from the blog are as follows:

“For a statistical power of 0.3, given the conditions of the experiment, the test can detect an effect size only when it is greater than or equal to 13.39 IQ points.”

“Of the 50 tests with the lowest statistical power, 13 (26%) are statistically significant. The average effect size is 17.05 IQ points, and the range extends from 12.01 to 21.45.”

2) I, to a certain extent, understand the relation between significance level, statistical power, effect size, variance, and sample size. My understanding was that meeting the minimum sample size requirement is sufficient. However, if I understand correctly, for each of the tests (statistical power of 0.3, 0.55, and 0.8), the sample size met the minimum requirement. How is it that even after meeting the minimum sample size requirement, there is so much error in the results produced? Could you please help me understand?

3) This question may be too basic. Kindly pardon my ignorance. At some places, the blog mentions that we have performed 50 tests. My understanding is that we performed a single test (or 3 tests, one for each of the 3 statistical powers?) including 50 samples. Could you please help clear my thought process? Relevant excerpt from the blog is as follows:

“We’d expect that percentage if we perform the test an infinite number of times. However, we conducted it only 50 times, so there’s a margin of error around the percentage of significant studies.”

Waiting for your response.

Regards,

Kushal Jain

Jim Frost says

Hi Kushal,

Those are great questions. Some answers!

1) I calculated that value of 13.39 using the population parameters and entering them into the correct equation. Bear in mind that the samples will have estimates of the parameters, which will vary randomly around the true parameters. If you take some combinations of those parameters estimates and plug them into the equation, you’ll obtain smaller detectable values in some cases. So, the smaller significant effects exist because of the random variation in the sample estimates of the parameters.

2) The sample sizes I used produced nearly the exact power specified. For example, 11 observations per group produces 31% power. Usually when you perform power analysis, you only have estimates of the effect size and, consequently, you have an estimate of the power. In this case, because I’m controlling the population parameters and effect size, it’s the true power. That’s the great thing about simulation studies! And the results largely followed expectations given the power for each design. Power is the ability to detect an effect that exists. We know it exists. For example, in the design with 30% power, it detected the effect 26% of the time. Almost exactly the expected amount. Same with the other designs and their power levels. This happens because we’re working with samples. An effect can exist in the population but, thanks to the vagaries of a specific sample, you might not be able to detect the effect. Power analysis helps you understand how likely you are to detect the effect if the effect exists in the population.

3) I described the process in the blog but I’ll summarize here. For the design for each power level, I drew sufficient random samples to conduct the 2-sample t-test 50 times. For example, with the 30% power design, I drew 50 random samples for the two groups with 11 observations per group using the specified population parameters. So, it’s one TYPE of hypothesis test (2-sample t-test) but I’ve collected 50 random samples and I’m performing that test 50 times for each power level.

I hope that helps clarify things!

karen says

Hi Jim

What would you recommend if a study/project can’t afford a large sample size (because of budget constraints) to ensure a certain power (0.8) to detect a small effect size between the baseline (before the intervention starts) and endline (after the intervention) measurements? Thanks, karen

Jim Frost says

Hi Karen,

I know that budget constraints are frequently a problem for studies which can make obtaining a sufficiently large sample size difficult. I wish I could answer your question concretely, but the sample size required to obtain a power of 0.8 depends on how large your effect size is compared to the variability in your data. I’d recommend coming up with the best estimates for the effect size and variability for your study and then using a power analysis to estimate a good sample size to obtain a power of 0.8.

Read this post to learn more about power and sample size analysis.

Also, if you need a statistical tool to calculate power and sample for your study, I recommend G*Power, which is freeware.

I hope this helps!

Jerry says

Jim, I ‘get’ that averaging all the effect size in published studies (which are almost always statistically significant) will give you an average effect size that is higher than if you had included non-significant studies. But I don’t see how low-powered, low-n studies would exaggerate the effect size. The effect size is whatever it comes out to be, which depends on the n, the power level, and the alpha level. If I do a study with low power (let’s say because of a low n), then my effect size may be small, medium, or large, but all will likely be statistically non-significant if my power is low. Wouldn’t it be more accurate to say that publication bias results in artificially higher effect sizes being published, and that the effect size found in any single study you conduct will NOT be affected by the power (although the significance of the results will be) ?

Researchers should also take into account the sample sizes when reviewing studies to glean a typical effect size from the literature. If I see a study that has an effect size in line with other published studies, but is not statistically significant, I still think they may be on the right track and I look at the n to see if it’s under-powered to detect that difference as significant. Many small studies in medicine suffer from small sample sizes and their results appear to not be meaningful, when in fact they are meaningful when one looks beyond just the p-value. I once found a study where having just one more patient with a positive test result would have pushed it from being non-significant to being statistically significant.

Jim Frost says

Hi Jerry,

The average effect over many studies shows the expected value of the effect you’d obtain if you performed the study once–plus or minus the error. The average is the expected value. When the average equals the true effect, it’s unbiased. On the whole, you’d expect one study’s estimate to equal the true effect–again plus or minus error.

However, when the average is too high or too low, the estimate is biased. It indicates that you’d expect to obtain an estimate that is, in these examples, too high. You write, “the effect size is whatever it comes out to be.” Well, seeing how this works and the high averages, we know that the expected value for “whatever it comes out to be” is greater than 10 in most cases. It’s biased high. From these examples, we can see that it is possible that when the expected value is too high, some studies are close to or even below the correct value. However, they are the exceptions. Most studies are too high.

In the real world, you’d perform the study once and probably not know what the true effect size is. However, if you can figure you have low power, the expected value for the effect size is greater than the correct value. As with these examples, an individual study might be right at the correct value, but they’re the exceptions. In other words, it’s more likely than not that your study is overestimating the size of the effect.

In a nutshell, these high averages indicate that all of the individual studies included in the average, as a whole, are systematically too high. The effect sizes in individual studies are affected, which is why the mean is high. The low power filters out the smaller effects, leaving only the unusually large effects being statistically significant. You won’t get those smaller effects because the test can’t detect them.

One final caution about the second part of your comment. You have to be careful when “looking beyond just the p-value.” P-values are a critical protection against misinterpreting random error (noise) as an effect (signal). I write about it in my post: Can high p-values be meaningful? I also write about how studies with lower p-values are more likely to replicate. P-values aren’t perfect but they do provide important information.

Curtis says

How did you calculate the statistical powers of .30, .55 and .80?

Jim Frost says

Hi Curtis,

I used the power and sample size calculations in my statistical software. In my case, I’m using Minitab, but other software should have something equivalent. For this example, I know for a fact what the difference and standard deviations are because I’ve defined the populations for this simulation. Then, I entered the power levels I wanted it to calculate. The power analysis output in this post tells me how large the sample sizes need to be to produce the power levels that I entered.

Typically, you don’t know the difference and standard deviation for the population. If you’re lucky, you’ll have estimates of the population standard deviation. Frequently, you’ll enter the smallest effect size that is meaningful in a practical sense.

For more information about this process, read my post about power and sample size analysis. In that post, I show and explain this process in more detail.

I hope this helps!