Low Power Tests Exaggerate Effect Sizes

If your study has low statistical power, it will exaggerate the effect size. What?!

Statistical power is the ability of a hypothesis test to detect an effect that exists in the population. Clearly, a high-powered study is a good thing just for being able to identify these effects. Low power reduces your chances of discovering real findings. However, many analysts don’t realize that low power also inflates the effect size. Learn more about Statistical Power.

In this post, I show how this unexpected relationship between power and exaggerated effect sizes exists. I’ll also tie it to other issues, such as the bias of effects published in journals and other matters about statistical power. I think this post will be eye-opening and thought provoking! As always, I’ll use many graphs rather than equations.

Hypothetical Study Scenario

To illustrate how this effect size inflation works, I’ll simulate a study and conduct it many times at three power levels.

Imagine that we’re studying a fictitious medication that promises to increase your intelligence (IQ). Our experiment has two groups—a control group that doesn’t take the pill and the treatment group which takes it. Then, each group takes the same IQ test and we compare the results. The effect size is the difference between group means.

Because we’re simulating these studies, we can control the effect size and other properties of the population. I’ll set the effect size at 10 IQ points and define the two populations as follows:

Control group: Normal distribution with a mean of 100 and a standard deviation of 15.
Treatment group: Normal distribution with a mean of 110 and a standard deviation of 15.

I calculated the sample sizes I need to produce statistical power of 0.3, 0.55, and 0.8. The first two values represent low power studies, while the third value is a standard target value. The output below shows the power analysis results.

Now, I’ll use statistical software to draw 50 random samples from the above populations for each of the sample sizes in the power analysis. Finally, I perform 2-sample t-tests on all datasets. I use two-tailed tests with a significance level of 0.05. The following discussion explains the results of the 50 2-sample t-tests for each of the three power levels.
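The procedure above can be sketched in Python roughly as follows. This is a minimal sketch, not the code behind this post (which used statistical software). The function name `simulate_studies`, the seed, and the hard-coded critical t-value of 2.086 (the two-tailed 0.05 table value for 20 degrees of freedom, which corresponds to 11 subjects per group) are assumptions made for illustration.

```python
import math
import random
import statistics

def simulate_studies(n_per_group=11, reps=50, t_crit=2.086, seed=1):
    """Run `reps` simulated two-group studies with pooled-variance t-tests.

    Returns (all_effects, significant_effects), where each effect is the
    treatment mean minus the control mean in IQ points.
    t_crit is the two-tailed 0.05 critical value for df = 2 * n_per_group - 2;
    the default of 2.086 only applies to n_per_group = 11 (df = 20).
    """
    rng = random.Random(seed)
    all_effects, significant = [], []
    for _ in range(reps):
        control = [rng.gauss(100, 15) for _ in range(n_per_group)]
        treatment = [rng.gauss(110, 15) for _ in range(n_per_group)]
        diff = statistics.mean(treatment) - statistics.mean(control)
        # Pooled variance (equal group sizes, so a simple average works).
        pooled_var = (statistics.variance(control) + statistics.variance(treatment)) / 2
        se = math.sqrt(pooled_var * 2 / n_per_group)
        all_effects.append(diff)
        if abs(diff / se) >= t_crit:
            significant.append(diff)
    return all_effects, significant

all_effects, significant = simulate_studies()
print(f"significant: {len(significant)} of {len(all_effects)}")
print(f"mean of all effects: {statistics.mean(all_effects):.2f}")
if significant:
    print(f"mean of significant effects: {statistics.mean(significant):.2f}")
```

Because the samples are random, the counts and means will differ from the ones reported in this post, but the pattern should be similar: the mean of all effects lands near 10 while the mean of the significant effects lands well above it.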

Findings and Estimated Effect Sizes for Very Low Power (0.3)

We know that the correct effect size for these analyses is 10. Let’s see what the 2-sample hypothesis tests reveal for the 50 datasets that have a power of 0.3. Given this power, we expect to detect the effect 30% of the time. We’d expect that percentage if we perform the test an infinite number of times. However, we conducted it only 50 times, so there’s a margin of error around the percentage of significant studies.
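To get a feel for that margin of error, here is a rough sketch using the normal approximation for a proportion. The helper name `proportion_margin` is made up for this illustration, and the approximation is only a rule of thumb with 50 replications.

```python
import math

def proportion_margin(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

power = 0.30   # expected proportion of significant tests
reps = 50      # number of simulated studies
margin = proportion_margin(power, reps)
print(f"expected: {power:.0%} +/- {margin:.1%}")
print(f"rough range: {power - margin:.1%} to {power + margin:.1%}")
```

Under these assumptions the margin is roughly 13 percentage points, so a fairly wide range of significant-study percentages would be unsurprising with only 50 tests.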


Of the 50 tests with the lowest statistical power, 13 (26%) are statistically significant. The average effect size is 17.05 IQ points, and the range extends from 12.01 to 21.45. Not only is the average effect too high, but the entire range of effects is greater than the actual effect. The graph below displays the distribution of statistically significant findings with a reference line for the real effect (10).

Now, let’s look at the results for the other two levels of statistical power.

Power = 0.55

Of the 50 tests, 34 (68%) are statistically significant. The average effect size is 13.50, and the range extends from 7.71 to 19.04. A large majority of the effects are greater than 10.

Power = 0.8

Of the 50 tests, 41 (82%) are statistically significant. The average effect size is 11.92, and the range extends from 6.30 to 19.43. The real effect size is moving closer to the center of the distribution.

Relationship between Statistical Power and Effect Size

As the power level increases, the percentage of detections increases and the exaggeration of the effect size decreases. Both are good things and they have a common cause. The graph below displays the exaggeration factor (mean significant effect / actual effect) by power. No exaggeration occurs at a value of one.
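The exaggeration factor is simple arithmetic. As a quick sketch, using the mean significant effects reported above for each power level (the variable names are just for illustration):

```python
true_effect = 10.0
# Mean significant effect at each power level, as reported above.
mean_significant = {0.30: 17.05, 0.55: 13.50, 0.80: 11.92}

# Exaggeration factor = mean significant effect / actual effect.
exaggeration = {power: mean / true_effect for power, mean in mean_significant.items()}
for power, factor in exaggeration.items():
    print(f"power {power:.2f}: exaggeration factor {factor:.3f}")
```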

We saw that the effects all have a positive bias amongst the statistically significant studies. Typically, when you perform studies, you, other researchers, and peer-reviewed journals pay attention only to statistically significant findings. After all, insignificant results indicate that your sample provides insufficient evidence to conclude that the effect exists in the population. But, let’s look at all the outcomes, both significant and non-significant, for the three power levels.

For each of the three power levels, all 50 tests have an average effect near the correct value of 10. Additionally, the effects are approximately symmetrically distributed around 10.

Clue: When we assess both the significant and non-significant studies together, the estimated effect is an unbiased estimator of the actual population effect. However, when we evaluate only the statistically significant effects, the estimated effect is a biased estimator.

Let’s graph the distributions of significant and non-significant results to explore this clue.

These graphs show that the hypothesis test classifies the most extreme effect sizes as being statistically significant. As the power level increases, effect sizes need to be less extreme to be significant. That’s precisely how hypothesis tests are supposed to work!

How Low Statistical Power Biases the Estimates

We noticed that the full distributions of both significant and non-significant test results approximately center on the correct effect size. However, that was not true when we looked at the significant findings alone, which produced biased estimates.

For a value to be the mean of a symmetric distribution, it must have roughly an equal number of values above it and below. In this case, we know the correct effect size is 10. However, statistical power affects how extreme an estimated effect size must be for the test to classify it as statistically significant. The non-significant findings are systematically on the low end of the distribution. By filtering the results by statistical significance, you exclude these smaller effects when calculating the mean, which biases the mean upwards.

For a statistical power of 0.3, given the conditions of the experiment, the test can detect an effect size only when it is greater than or equal to 13.39 IQ points. I calculated this value by multiplying the critical t-value by the standard error of the difference between means (2.093 * 6.396). Note that 13.39 is greater than the correct effect size of 10. From that value alone, you know the significant effects will be biased upwards. You need an unusually high sample effect to obtain statistical significance. This high critical value severely truncates the full distribution of results, which eliminates the lower estimates (i.e., < 10) that would otherwise pull the average down to the correct value. It also explains why all the significant sample estimates are greater than 10. Consequently, the average of the significant effects is biased too high.
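The calculation can be reproduced with a few lines of Python. This sketch assumes 11 subjects per group (the smallest design in this post) and hard-codes the two-tailed 0.05 t-table value for 2 * 11 - 2 = 20 degrees of freedom, which is 2.086. The 2.093 quoted above is the table value for 19 degrees of freedom, so this version lands slightly lower, at about 13.34. Either way, the threshold sits well above the true effect of 10.

```python
import math

sd = 15.0          # population standard deviation in each group
n_per_group = 11   # assumed sample size per group for power of about 0.3
t_crit = 2.086     # two-tailed 0.05 t-table value for df = 2 * 11 - 2 = 20

# Standard error of the difference between two independent means.
se_diff = math.sqrt(sd**2 / n_per_group + sd**2 / n_per_group)
min_significant_effect = t_crit * se_diff

print(f"standard error: {se_diff:.3f}")
print(f"smallest significant difference: {min_significant_effect:.2f}")
```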

The minimum detectable effect sizes for powers of 0.55 and 0.8 are 9.14 and 6.95, respectively. These higher-powered tests can detect less extreme effect sizes. However, they still truncate the lower end of the distribution, which biases the effects upward. The only way to avoid the bias is to have a statistical power of 100%, which includes all the test results. However, hypothesis tests never obtain 100% power. Fortunately, when you’re near 80% power, the bias is relatively small.

Graphical Representation Using Probability Distribution Plots

The charts below show how this works using probability distribution plots for a power of 0.3 and 0.8. The red distribution on the left is the t-distribution that the 2-sample t-test uses to determine statistical significance given the design. The red numbers at the top correspond to the t-distribution. They pinpoint the null hypothesis value (difference = 0) and the upper critical value (CV) for a two-tailed test. I don’t display the lower critical value. For convenience, I’ve converted the t-values you usually see with t-distributions to the real data units.

The blue distribution on the right is the expected distribution of differences between means given the properties of the two populations. The blue number locates the mean IQ difference associated with the correct effect (10), which occurs at the peak of the blue curve.

Here are some essential aspects to notice.

Notice how the critical value line truncates the distribution of differences in the blue curves. All differences on the blue curve to the left of the red dashed CV line will not be significant. This portion of the blue curve represents false negatives, which statisticians refer to as Type II errors. Consequently, those smaller effect sizes won’t be included in the calculations for the mean of the significant effect sizes. The average effect must be higher than the critical value. Truncation diminishes as power increases from 0.3 to 0.8. The amount of truncation determines the degree of bias in the estimated effect amongst statistically significant results.

The area under the blue curve to the right of the CV line represents the power of the test because those differences will be significant. For 0.3 statistical power, 30% of the area under the blue curve is to the right of the CV line. For 0.8 power, 80% is to the right of the CV line.

The higher the proportion of the blue curve that is to the right of the CV line, the less bias exists in the significant effects and the greater the statistical power.
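The truncation idea can also be sketched numerically. The snippet below treats the blue curve as a normal distribution centered on the true effect, which is a simplifying assumption: it ignores the lower critical value and treats the standard error as known rather than estimated. The function name `power_and_mean_significant` is hypothetical, and the inputs are the standard error (6.396) and critical value (13.39) quoted earlier for the 0.3 power design.

```python
from statistics import NormalDist

def power_and_mean_significant(true_effect, se, critical_value):
    """Normal-approximation sketch of the upper-tail truncation.

    Returns (share of the sampling distribution above the critical value,
    mean of the estimates above the critical value).
    Ignores the lower critical value and treats se as known.
    """
    std_normal = NormalDist()
    a = (critical_value - true_effect) / se
    upper_share = 1 - std_normal.cdf(a)
    # Mean of a normal distribution truncated below at the critical value.
    mean_above = true_effect + se * std_normal.pdf(a) / upper_share
    return upper_share, mean_above

share, mean_sig = power_and_mean_significant(10, 6.396, 13.39)
print(f"share above CV: {share:.2f}")
print(f"mean of estimates above CV: {mean_sig:.2f}")
```

Under these assumptions, the share above the critical value comes out close to 0.3 and the mean of the surviving estimates lands around 17, in line with the simulated average of 17.05. Moving the critical value closer to or below the true effect shrinks the gap between that mean and 10.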

Discussion

I hope this illustration was an eye-opener. I suspect the fact that low power studies inflate effect sizes is underappreciated. Typically, analysts think the biggest danger of low power studies is missing an effect, which is a real possibility. However, when an analyst obtains significant results with low power, they are relieved, but they don’t realize it inflates the effect size!

Particular fields are more prone to conducting studies with small sample sizes and low power. For example, psychology studies routinely use small samples. Unsurprisingly, psychology has had problems with exaggerated effect sizes. Alternatively, researchers might use a small pilot study to start, which could produce inflated effect sizes.

You might have heard that effect sizes in journal articles are biased because editors publish only significant studies. While the articles in a journal won’t all neatly have the same power, the same principle applies. By restricting publication by p-values, journals exclude the smaller estimates. Imagine you’re researching a subject area and you find all published articles about a particular effect. You might think that by averaging them together, you’ll get a reasonable estimate. That’s not necessarily the case! Think back to the first set of graphs that displayed only the significant results. Those were biased. It wasn’t until we added in all the smaller, non-significant effects that the average effect was close to the actual effect.

Finally, calculating power for this simulation was easy because I knew the correct values to enter into the power calculations. However, for a study in the real world, it can be difficult. Consequently, you might not always realize you have a low power study. Additionally, consider that the smallest studies in these simulations had 22 subjects split between two groups, which produced a statistical power of 0.3. If this were a real study, I bet the researchers would not realize it had such low power. When in doubt, err on the side of larger sample sizes. And, do your best to be realistic with the power calculations!

Comments

Scott says

February 21, 2025 at 5:27 pm

Hey Jim. This is a very thought-provoking post. Thanks!

Just to check my sanity, in your illustrative calculation:

For a statistical power of 0.3, given the conditions of the experiment, the test can detect an effect size only when it is greater than or equal to 13.39 IQ points. I calculated this value using the critical t-value and multiplying it by the standard error of the differences between means (2.093 * 6.396).

… it looks to me like you read off the critical value for df=19 when I think you meant df = 20 = 2*(11-1). Right?

I’m looking at your t-table: https://statisticsbyjim.com/hypothesis-testing/t-distribution-table/

Steve says

September 2, 2022 at 3:09 am

Say you generate 95% confidence intervals for each of your 50 samples/datasets. Then, when you take just the statistically significant ones, we know that many of them (100% of them in your simulation scenario using power .30) will have a biased midpoint because the midpoint is simply the effect size for each of the samples. Is it correct to extend the argument of biased estimates to confidence intervals as well?

- Jim Frost says
  
  September 5, 2022 at 2:16 am
  
  Hi Steve,
  
  That’s a great question. That’s correct, the CIs are also biased too high.
  
  For each effect size, the CIs are formed by adding and subtracting the same margin of error to the effect size. Hence, the CIs are symmetrical around the effect size. Consequently, the biased effect sizes for the significant results cause their CIs to also be biased too high. I’ve graphed all the CIs for the significant results for the power = 0.30 dataset I used throughout this post.
  
  Interestingly, all the CIs contain the true effect size of 10. However, notice how the red reference line for the effect size is always closer to the bottom end of the intervals. That’s the bias. When you have unbiased results, you’d expect the interval midpoints to fall above and below the true effect size equally. Instead, our CIs are all shifted upwards relative to the true effect.
  
  One thing to note is that these CIs are wider than the CIs for the datasets with higher power. The extra width means the estimates are less precise, but they also help the CIs include the true effect size despite the bias. However, I’d expect that in the long run these 95% confidence intervals will have a lower than 95% rate of including the true effect size due to the bias. In other words, random chance seems to have operated in our favor by having all 13 of these CIs contain the true effect.
  
Tess says

February 4, 2022 at 9:25 am

Hi Jim,

Hope this reaches you well. I wonder if you had some pointers on assessing the quality of published research in relation to power? I am currently doing a narrative review of psychological literature and most of the studies I have included do not report a power analysis. One of the quality checklists I am using to assess the quality of research asks the following:

“Was the study sufficiently powered to detect an intervention effect”

I am really struggling to answer that question as it is not clear in the published studies how they arrived at the sample size etc.

Any guidance or advice would be hugely helpful!

With gratitude,

Tess

- Jim Frost says
  
  February 5, 2022 at 11:15 pm
  
  Hi Tess,
  
  Assessing the statistical power of a study after it was completed is known as a post hoc power analysis. These analyses can be difficult or impossible to perform correctly. It depends on the information that you can gather about the subject area.
  
  Statistical power calculations involve taking an estimate of the effect size and variability in the population and factoring in your sample size to determine the probability that your study will detect that effect if it actually exists. That’s the statistical power and it’s based on estimates of the population effect and variability. So, you never really know the true power of a study.
  
  Assessing it after a study is very difficult. Here’s one thing you shouldn’t do. Don’t take a study’s data (estimated effect size and variability) and enter it into the power calculations. That just doesn’t work. If the study obtained significant results, such a power analysis will always indicate high power even if that wasn’t the case. Conversely, a study with nonsignificant findings will always produce low post hoc power results regardless of the actual power.
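To see why plugging a study’s own results into the power calculation is circular, here is a minimal sketch. It uses a two-sided z-test as a simplifying assumption (not the t-tests from the post), and the function name `observed_power` is made up for this illustration. The point is that the post hoc power is just a re-expression of the observed test statistic, and therefore of the p-value.

```python
from statistics import NormalDist

def observed_power(z_observed, alpha=0.05):
    """Post hoc power of a two-sided z-test when the observed effect
    is plugged in as if it were the true effect."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    upper = 1 - std_normal.cdf(z_crit - z_observed)
    lower = std_normal.cdf(-z_crit - z_observed)
    return upper + lower

for z in (1.0, 1.96, 3.0):
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    print(f"z = {z:.2f}, p = {p_value:.3f}, observed power = {observed_power(z):.2f}")
```

In this simplified setting, a result sitting right at the significance threshold gives an observed power of about 50%. Significant results always land above that and non-significant results always land below it, no matter what the true power was.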
  
  I think the only way to get a good sense of a study’s power is if the experts in the study area have a very clear sense of what the true effect and variability are. Perhaps it’s an area that’s been studied many times and multiple studies have converged on a good estimate. In that case, you can use that information to inform your post hoc power analysis. However, in those cases, you don’t usually need to perform that kind of post hoc analyses because the research question has been asked and then answered consistently multiple times!
  
  Alternatively, there might be studies that just patently have a very small sample size. If a study’s n=10 or less, it’s probably underpowered regardless of what they’re studying!
  
  I’m not surprised it’s a difficult task. Unfortunately, I don’t have a good answer. If you must assess their power, try to find reasonable estimates of the effect size and variability that aren’t based on the specific study you’re assessing. Unfortunately, that can be easier said than done in many cases. And be aware that a substantial amount of the literature is opposed to post hoc power analysis.
  
  Here’s what one peer reviewed article concludes about post hoc power analysis after they perform a simulation study.
  
  “Power analysis is an indispensable component of planning clinical research studies. However, when used to indicate power for outcomes already observed, it is not only conceptually flawed but also analytically misleading. Our simulation results show that such power analyses do not indicate true power for detecting statistical significance, since post hoc power estimates are generally variable in the range of practical interest and can be very different from the true power.”
  
  Zhang Y, Hedo R, Rivera A, et al., Post hoc power analysis: is it an informative and meaningful analysis? General Psychiatry 2019;32:e100069. doi: 10.1136/gpsych-2019-100069
  
Jamie says

March 13, 2021 at 2:32 pm

Hello Jim,

What about the reverse? I have an independent samples t-test (2 tailed). My sample was 4764 with 1149 in one group and 3615 in the other with homogeneity being met. My effect size was .002. The power of my test was only .05! Why was it so low with such a large group? I assume it is because the effect size was so small? There was not statistical significance.

- Jim Frost says
  
  March 14, 2021 at 5:29 pm
  
  Hi Jamie,
  
  In a statistical sense, an effect size is only large or small in comparison to the variability in the data. Think of it as a signal-to-noise ratio. So, an effect of 0.002 can be small or large depending on the variability. However, given your large sample size, I’d have to think that it is a small effect in relation to the variability. One caution, though: it’s not recommended to take your hypothesis test results and feed them into a power analysis. If your results are significant, you’ll always obtain power analysis results indicating that you had a lot of power. Conversely, if your test results were not significant, you’ll always find that you had very low power.
  
  In this post, I know the true power because I set the parameters. But, if you were to take the results of the non-significant tests, it would show low power. The converse holds for the significant tests. In other words, it’ll exaggerate the power either high or low. In your case, the power analysis is likely exaggerating the low power of your analysis.
  
  But, in your case, with such a large sample size, you have to start thinking that maybe there is no effect or a tiny effect! Or, you had a lot of noise or other problems with your data.
  
Kushal says

January 8, 2021 at 10:56 pm

Hello Jim,

First of all, let me thank you for this blog. This is really helpful. I understand the overall idea; however, I have some questions which I am unable to get answers myself. Could you please help me with the following questions?

1) Relevant excerpts from the blog are as follows:

“For a statistical power of 0.3, given the conditions of the experiment, the test can detect an effect size only when it is greater than or equal to 13.39 IQ points.”

“Of the 50 tests with the lowest statistical power, 13 (26%) are statistically significant. The average effect size is 17.05 IQ points, and the range extends from 12.01 to 21.45.”

I am unable to understand what the above two excerpts indicate. In other words, if for 0.3 statistical power the test can detect an effect size only when the effect is above 13.39, then how is it that the graph of statistically significant effects (statistical power of 0.3) includes differences of less than 13.39?

2) I, to a certain extent, understand the relation between significance level, statistical power, effect size, variance, and sample size. My understanding was that meeting the minimum sample size requirement is sufficient. However, if I understand correctly, for each of the tests (statistical power of 0.3, 0.55, and 0.8), the sample size exceeded the minimum requirement. How is it that even after meeting the minimum sample size requirement, there is so much error in the results produced? Could you please help me understand?

3) This question may be too basic. Kindly pardon my ignorance. At some places, the blog mentions that we have performed 50 tests. My understanding is that we performed a single test (or 3 tests, one for each of the 3 statistical powers?) including 50 samples. Could you please help clear my thought process? Relevant excerpt from the blog is as follows:

“We’d expect that percentage if we perform the test an infinite number of times. However, we conducted it only 50 times, so there’s a margin of error around the percentage of significant studies.”

Waiting for your response.

Regards,
Kushal Jain

- Jim Frost says
  
  January 8, 2021 at 11:47 pm
  
  Hi Kushal,
  
  Those are great questions. Some answers!
  
  1) I calculated that value of 13.39 using the population parameters and entering them into the correct equation. Bear in mind that the samples will have estimates of the parameters, which will vary randomly around the true parameters. If you take some combinations of those parameter estimates and plug them into the equation, you’ll obtain smaller detectable values in some cases. So, the smaller significant effects exist because of the random variation in the sample estimates of the parameters.
  
  2) The sample sizes I used produced nearly the exact power specified. For example, 11 observations per group produces 31% power. Usually when you perform power analysis, you only have estimates of the effect size and, consequently, you have an estimate of the power. In this case, because I’m controlling the population parameters and effect size, it’s the true power. That’s the great thing about simulation studies! And the results largely followed expectations given the power for each design. Power is the ability to detect an effect that exists. We know it exists. For example, in the design with 30% power, it detected the effect 26% of the time. Almost exactly the expected amount. Same with the other designs and their power levels. The reason this happens is because we’re working with samples. An effect can exist in the population but thanks to the vagaries of a specific sample, you might not be able to detect the effect. Power analysis helps you understand how likely you are to detect the effect if the effect exists in the population.
  
  3) I described the process in the blog but I’ll summarize here. For the design for each power level, I drew sufficient random samples to conduct the 2-sample t-test 50 times. For example, with the 30% power design, I drew 50 random samples for the two groups with 11 observations per group using the specified population parameters. So, it’s one TYPE of hypothesis test (2-sample t-test) but I’ve collected 50 random samples and I’m performing that test 50 times for each power level.
  
  I hope that helps clarify things!
  
karen says

December 30, 2019 at 3:54 pm

Hi Jim

What would you recommend if a study/project can’t afford a large sample size (because of budget constraints) to ensure a certain power (0.8) to detect a small effect size between baseline (before the intervention starts) and endline (after the intervention) measurements? Thanks, karen

- Jim Frost says
  
  December 30, 2019 at 4:08 pm
  
  Hi Karen,
  
  I know that budget constraints are frequently a problem for studies which can make obtaining a sufficiently large sample size difficult. I wish I could answer your question concretely, but the sample size required to obtain a power of 0.8 depends on how large your effect size is compared to the variability in your data. I’d recommend coming up with the best estimates for the effect size and variability for your study and then using a power analysis to estimate a good sample size to obtain a power of 0.8.
  
  Read this post to learn more about power and sample size analysis.
  
  Also, if you need a statistical tool to calculate power and sample size for your study, I recommend G*Power, which is freeware.
  
  I hope this helps!
  
Jerry says

August 13, 2019 at 1:35 pm

Jim, I ‘get’ that averaging all the effect size in published studies (which are almost always statistically significant) will give you an average effect size that is higher than if you had included non-significant studies. But I don’t see how low-powered, low-n studies would exaggerate the effect size. The effect size is whatever it comes out to be, which depends on the n, the power level, and the alpha level. If I do a study with low power (let’s say because of a low n), then my effect size may be small, medium, or large, but all will likely be statistically non-significant if my power is low. Wouldn’t it be more accurate to say that publication bias results in artificially higher effect sizes being published, and that the effect size found in any single study you conduct will NOT be affected by the power (although the significance of the results will be) ?

Researchers should also take into account the sample sizes when reviewing studies to glean a typical effect size from the literature. If I see a study that has an effect size in line with other published studies, but is not statistically significant, I still think they may be on the right track and I look at the n to see if it’s under-powered to detect that difference as significant. Many small studies in medicine suffer from small sample sizes and their results appear to not be meaningful, when in fact they are meaningful when one looks beyond just the p-value. I once found a study where having just one more patient with a positive test result would have pushed it from being non-significant to being statistically significant.

- Jim Frost says
  
  August 13, 2019 at 2:16 pm
  
  Hi Jerry,
  
  The average effect over many studies shows the expected value of the effect you’d obtain if you performed the study once–plus or minus the error. The average is the expected value. When the average equals the true effect, it’s unbiased. On the whole, you’d expect one study’s estimate to equal the true effect–again plus or minus error.
  
  However, when the average is too high or too low, the estimate is biased. It indicates that you’d expect to obtain an estimate that is, in these examples, too high. You write, “the effect size is whatever it comes out to be.” Well, seeing how this works and the high averages, we know that the expected value for “whatever it comes out to be” is greater than 10 in most cases. It’s biased high. From these examples, we can see that it is possible that when the expected value is too high, some studies are close to or even below the correct value. However, they are the exceptions. Most studies are too high.
  
  In the real world, you’d perform the study once and probably not know what the true effect size is. However, if you can figure you have low power, the expected value for the effect size is greater than the correct value. As with these examples, an individual study might be right at the correct value, but they’re the exceptions. In other words, it’s more likely than not that your study is overestimating the size of the effect.
  
  In a nutshell, these high averages indicate that all of the individual studies included in the average, as a whole, are systematically too high. The effect sizes in individual studies are affected, which is why the mean is high. The low power filters out the smaller effects, leaving only the unusually large effects being statistically significant. You won’t get those smaller effects because the test can’t detect them.
  
  One final caution about the second part of your comment. You have to be careful when “looking beyond just the p-value.” P-values are a critical protection against misinterpreting random error (noise) as an effect (signal). I write about it in my post: Can high p-values be meaningful? Additionally, I also write about how studies with lower p-values are more likely to replicate. P-values aren’t perfect but they do provide important information.
  
Curtis says

August 12, 2019 at 1:29 pm

How did you calculate the statistical powers of .30, .55 and .80?

- Jim Frost says
  
  August 12, 2019 at 2:38 pm
  
  Hi Curtis,
  
  I used the power and sample size calculations in my statistical software. In my case, I’m using Minitab, but other software should have something equivalent. For this example, I know for a fact what the difference and standard deviations are because I’ve defined the populations for this simulation. Then, I entered the power levels I wanted it to calculate. The power analysis output in this post tells me how large the sample sizes need to be to produce the power levels that I entered.
  
  Typically, you don’t know the difference and standard deviation for the population. If you’re lucky, you’ll have estimates of the population standard deviation. Frequently, you’ll enter the smallest effect size that is meaningful in a practical sense.
  
  For more information about this process, read my post about power and sample size analysis. In that post, I show and explain this process in more detail.
  
  I hope this helps!
  