Statistical power is the ability of a hypothesis test to detect an effect that exists in the population. Clearly, a high-powered study is a good thing simply because it can identify these effects. Low power reduces your chances of discovering real findings. However, many analysts don’t realize that low power also inflates the estimated effect size among the results that do reach statistical significance.
In this post, I show how this unexpected relationship between power and exaggerated effect sizes exists. I’ll also tie it to other issues, such as the bias of effects published in journals and other matters about statistical power. I think this post will be eye-opening and thought provoking! As always, I’ll use many graphs rather than equations.
Hypothetical Study Scenario
To illustrate how this effect size inflation works, I’ll simulate a study and conduct it many times at three power levels.
Imagine that we’re studying a fictitious medication that promises to increase your intelligence (IQ). Our experiment has two groups: a control group that doesn’t take the pill and a treatment group that takes it. Then, each group takes the same IQ test and we compare the results. The effect size is the difference between group means.
Because we’re simulating these studies, we can control the effect size and other properties of the population. I’ll set the effect size at 10 IQ points and define the two populations as follows:
- Control group: Normal distribution with a mean of 100 and a standard deviation of 15.
- Treatment group: Normal distribution with a mean of 110 and a standard deviation of 15.
I calculated the sample sizes I need to produce statistical power of 0.3, 0.55, and 0.8. The first two values represent low power studies, while the third value is a standard target value. The output below shows the power analysis results.
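As a rough cross-check on a power analysis like this one, here is a minimal sketch that uses the normal approximation to a two-tailed, two-sample test. The function name `approx_n_per_group` is hypothetical, and the approximation is an assumption on my part, not the calculation the statistical software performs. An exact t-based calculation returns slightly larger sample sizes, so treat these numbers as a lower bound.

```python
import math
from statistics import NormalDist

def approx_n_per_group(power, effect=10.0, sd=15.0, alpha=0.05):
    """Normal-approximation sample size per group for a two-tailed, two-sample test.

    This is an approximation; t-based power software gives slightly larger n.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical z for the two-tailed test
    z_power = z.inv_cdf(power)           # z-score matching the target power
    return math.ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)

for target in (0.30, 0.55, 0.80):
    print(f"power {target:.2f}: about {approx_n_per_group(target)} per group")
```

The pattern to notice is how quickly the required sample size grows as the target power rises.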
Now, I’ll use statistical software to draw 50 random samples from the above populations for each of the sample sizes in the power analysis. Finally, I perform 2-sample t-tests on all datasets. I use two-tailed tests with a significance level of 0.05. The following discussion explains the results of the 50 2-sample t-tests for each of the three power levels.
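The simulation procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the software I used: the helper names `pooled_t` and `run_studies` are hypothetical, the test is a pooled-variance t-test computed by hand, and significance is decided by comparing |t| to a critical value supplied by the caller. The example uses the smallest design in this post (22 subjects in total, so 11 per group) and the critical t-value of 2.093 quoted for it.

```python
import math
import random
import statistics

def pooled_t(control, treatment):
    """Return (difference in means, pooled two-sample t statistic)."""
    n1, n2 = len(control), len(treatment)
    diff = statistics.mean(treatment) - statistics.mean(control)
    pooled_var = ((n1 - 1) * statistics.variance(control)
                  + (n2 - 1) * statistics.variance(treatment)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return diff, diff / se

def run_studies(n_per_group, t_crit, n_studies=50, seed=1):
    """Simulate n_studies experiments; return a list of (effect, is_significant)."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_studies):
        # Populations as defined above: N(100, 15) and N(110, 15).
        control = [rng.gauss(100, 15) for _ in range(n_per_group)]
        treatment = [rng.gauss(110, 15) for _ in range(n_per_group)]
        diff, t = pooled_t(control, treatment)
        results.append((diff, abs(t) >= t_crit))  # two-tailed decision
    return results

# Assumed inputs: 11 per group and t_crit = 2.093, both taken from this post.
results = run_studies(n_per_group=11, t_crit=2.093)
significant = [d for d, sig in results if sig]
print(f"{len(significant)} of {len(results)} studies significant")
```

Because the draws are random, the count of significant studies depends on the seed and will not match my 50 datasets exactly.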
Findings and Estimated Effect Sizes for Very Low Power (0.3)
We know that the correct effect size for these analyses is 10. Let’s see what the 2-sample hypothesis tests reveal for the 50 datasets that have a power of 0.3. Given this power, we expect to detect the effect 30% of the time. We’d expect that percentage if we perform the test an infinite number of times. However, we conducted it only 50 times, so there’s a margin of error around the percentage of significant studies.
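To get a feel for that margin of error, here is a small sketch that treats the number of significant tests as a binomial count and uses a normal-approximation 95% interval. The interval method is my assumption for illustration; other interval formulas would give slightly different bounds.

```python
import math

power = 0.30      # expected proportion of significant tests
n_studies = 50    # number of simulated studies per power level

# Standard error of a proportion and a normal-approximation 95% interval.
se = math.sqrt(power * (1 - power) / n_studies)
low, high = power - 1.96 * se, power + 1.96 * se
print(f"SE = {se:.3f}; roughly {low:.0%} to {high:.0%} of tests significant")
```

With only 50 studies, anything from about 17% to 43% significant is consistent with a true power of 0.3.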
Of the 50 tests with the lowest statistical power, 13 (26%) are statistically significant. The average effect size is 17.05 IQ points, and the range extends from 12.01 to 21.45. Not only is the average effect too high, but the entire range of effects is greater than the actual effect. The graph below displays the distribution of statistically significant findings with a reference line for the real effect (10).
Now, let’s look at the results for the other two levels of statistical power.
Power = 0.55
Of the 50 tests, 34 (68%) are statistically significant. The average effect size is 13.50, and the range extends from 7.71 to 19.04. A large majority of the effects are greater than 10.
Power = 0.8
Of the 50 tests, 41 (82%) are statistically significant. The average effect size is 11.92, and the range extends from 6.30 to 19.43. The real effect size is moving closer to the center of the distribution.
Relationship between Statistical Power and Effect Size
As the power level increases, the percentage of detections increases and the exaggeration of the effect size decreases. Both are good things and they have a common cause. The graph below displays the exaggeration factor (mean significant effect / actual effect) by power. No exaggeration occurs at a value of one.
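The exaggeration factor is simple arithmetic. A quick sketch using the mean significant effects reported above for each power level:

```python
true_effect = 10.0

# Mean of the statistically significant effects at each power level,
# as reported in the sections above.
mean_significant = {0.30: 17.05, 0.55: 13.50, 0.80: 11.92}

exaggeration = {power: mean / true_effect
                for power, mean in mean_significant.items()}

for power, factor in exaggeration.items():
    print(f"power {power:.2f}: exaggeration factor {factor:.3f}")
```

The factors fall from roughly 1.7 at a power of 0.3 to about 1.2 at a power of 0.8.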
We saw that the effects all have a positive bias amongst the statistically significant studies. Typically, when you perform studies, you, other researchers, and peer-reviewed journals pay attention only to statistically significant findings. After all, non-significant results indicate that your sample provides insufficient evidence to conclude that the effect exists in the population. But, let’s look at all the outcomes, both significant and non-significant, for the three power levels.
For each of the three power levels, all 50 tests have an average effect near the correct value of 10. Additionally, the effects are approximately symmetrically distributed around 10.
Clue: When we assess both the significant and non-significant studies together, the estimated effect is an unbiased estimator of the actual population effect. However, when we evaluate only the statistically significant effects, the estimated effect is a biased estimator.
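Here is a minimal sketch of that clue. As a simplification, it draws estimated effects directly from their approximate sampling distribution (normal, centered on the true effect) instead of simulating raw IQ scores and t-tests. The standard error of 6.396 and the cutoff of 13.39 are the values this post quotes for the 0.3 power design; treating the cutoff as fixed is an assumption of the sketch.

```python
import random
import statistics

TRUE_EFFECT = 10.0
SE = 6.396        # standard error of the difference (power 0.3 design)
CUTOFF = 13.39    # smallest difference treated as significant (assumed fixed)

rng = random.Random(42)
# Simplification: sample the estimated effects directly rather than raw data.
effects = [rng.gauss(TRUE_EFFECT, SE) for _ in range(100_000)]
significant = [d for d in effects if abs(d) >= CUTOFF]  # two-tailed filter

mean_all = statistics.mean(effects)
mean_significant = statistics.mean(significant)
print(f"all studies:      mean effect {mean_all:.2f}")
print(f"significant only: mean effect {mean_significant:.2f}")
```

The mean of all simulated effects sits close to 10, while the mean of the filtered subset lands well above it.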
Let’s graph the distributions of the significant and non-significant results to explore this clue.
These graphs show that the hypothesis test classifies the most extreme effect sizes as being statistically significant. As the power level increases, effect sizes need to be less extreme to be significant. That’s precisely how hypothesis tests are supposed to work!
How Low Statistical Power Biases the Estimates
We noticed that the full distributions of both significant and non-significant test results approximately center on the correct effect size. However, that was not true for the significant findings alone, whose estimates were biased upward.
For a value to be the mean of a symmetric distribution, it must have roughly an equal number of values above it and below. In this case, we know the correct effect size is 10. However, statistical power affects how extreme an estimated effect size must be for the test to classify it as statistically significant. The non-significant findings are systematically on the low end of the distribution. By filtering the results by statistical significance, you exclude these smaller effects when calculating the mean, which biases the mean upwards.
For a statistical power of 0.3, given the conditions of the experiment, the test can detect an effect only when the sample difference is greater than or equal to about 13.39 IQ points. I calculated this value by multiplying the critical t-value by the standard error of the difference between means (2.093 * 6.396). The threshold is approximate because each test estimates the standard error from its own sample, which is why a few significant estimates, such as the minimum of 12.01, fall slightly below it. Note that 13.39 is greater than the correct effect size of 10. From that value alone, you know the significant effects will be biased upwards. You need an unusually high sample effect to obtain statistical significance. This high critical value severely truncates the full distribution of results, which eliminates the lower estimates (i.e., < 10) that would otherwise pull the average down to the correct value. It also explains why all the significant sample estimates are greater than 10. Consequently, the mean of the significant effects is biased too high.
The minimum detectable effect sizes for powers of 0.55 and 0.8 are 9.14 and 6.95, respectively. These higher-powered tests can detect less extreme effect sizes. However, they still truncate the lower end of the distribution, which biases the effects upward. The only way to avoid the bias is to have a statistical power of 100%, which would include all the test results. However, hypothesis tests never obtain 100% power. Fortunately, when you’re near 80% power, the bias is relatively small.
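The arithmetic behind the 13.39 threshold can be reproduced with a short sketch. It assumes 11 subjects per group (the 22-subject design split evenly) and uses the population standard deviation of 15 to compute the standard error.

```python
import math

sd = 15.0           # population standard deviation of IQ scores
n_per_group = 11    # assumption: 22 subjects split evenly between two groups
t_crit = 2.093      # critical t-value quoted in the text

# Standard error of the difference between two independent means.
se_diff = sd * math.sqrt(2 / n_per_group)
min_significant_diff = t_crit * se_diff
print(f"SE = {se_diff:.3f}, smallest significant difference = {min_significant_diff:.2f}")
# prints: SE = 6.396, smallest significant difference = 13.39
```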
Graphical Representation Using Probability Distribution Plots
The charts below show how this works using probability distribution plots for a power of 0.3 and 0.8. The red distribution on the left is the t-distribution that the 2-sample t-test uses to determine statistical significance given the design. The red numbers at the top correspond to the t-distribution. They pinpoint the null hypothesis value (difference = 0) and the upper critical value (CV) for a two-tailed test. I don’t display the lower critical value. For convenience, I’ve converted the t-values you usually see with t-distributions to the real data units.
The blue distribution on the right is the expected distribution of differences between means given the properties of the two populations. The blue number locates the mean IQ difference associated with the correct effect (10), which occurs at the peak of the blue curve.
Here are some essential things to notice.
Notice how the critical value line truncates the distribution of differences in the blue curves. All differences on the blue curve to the left of the red dashed CV line will not be significant. This portion of the blue curve represents false negatives, which statisticians refer to as Type II errors. Consequently, those smaller effect sizes won’t be included in the calculations for the mean of the significant effect sizes. The mean of the significant effects must therefore be higher than the critical value. Truncation diminishes as power increases from 0.3 to 0.8. The amount of truncation determines the degree of bias in the estimated effect amongst statistically significant results.
The area under the blue curve to the right of the CV line represents the power of the test because those differences will be significant. For 0.3 statistical power, 30% of the area under the blue curve is to the right of the CV line. For 0.8 power, 80% is to the right of the CV line.
The higher the proportion of the blue curve that is to the right of the CV line, the less bias exists in the significant effects and the greater the statistical power.
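These points can be checked numerically with a minimal sketch that approximates the blue curve by a normal distribution centered on the true effect. The function name is hypothetical, the tiny lower rejection region is ignored, and the standard error for the 0.8 design assumes 37 subjects per group. That figure is my inference from the critical difference of 6.95, not a number stated in this post.

```python
from statistics import NormalDist

TRUE_EFFECT = 10.0

def power_and_mean_significant(se, cutoff):
    """Normal approximation to the blue curve: N(TRUE_EFFECT, se).

    Returns (area to the right of the cutoff, mean of that region).
    The tiny lower-tail rejection region is ignored.
    """
    std_normal = NormalDist()
    a = (cutoff - TRUE_EFFECT) / se
    area_right = 1 - std_normal.cdf(a)
    # Mean of a normal distribution truncated below at the cutoff.
    mean_right = TRUE_EFFECT + se * std_normal.pdf(a) / area_right
    return area_right, mean_right

# (standard error, critical difference) pairs. The 0.3 values are quoted in
# this post; the 0.8 standard error assumes 37 subjects per group (inferred).
designs = {
    "power 0.3": (6.396, 13.39),
    "power 0.8": (15 * (2 / 37) ** 0.5, 6.95),
}
for label, (se, cutoff) in designs.items():
    area, mean_sig = power_and_mean_significant(se, cutoff)
    print(f"{label}: area right of CV = {area:.2f}, "
          f"mean significant effect = {mean_sig:.2f}")
```

Under these assumptions, the area to the right of the CV line comes out close to the nominal power for each design, and the expected mean of the significant effects moves toward 10 as the truncated region shrinks.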
I hope this illustration was an eye-opener. I suspect the fact that low power studies inflate effect sizes is underappreciated. Typically, analysts think the biggest danger of low power studies is missing an effect, which is a real possibility. However, when analysts obtain significant results with low power, they may be relieved without realizing that the low power has inflated the estimated effect size!
Some fields are more prone to conducting studies with small sample sizes and low power. For example, psychology studies routinely use small samples. Unsurprisingly, psychology has had problems with exaggerated effect sizes. Similarly, researchers in any field might start with a small pilot study, which could produce inflated effect sizes.
You might have heard that effect sizes in journal articles are biased because editors publish only significant studies. While the articles in a journal won’t all neatly have the same power, the same principle applies. By restricting publication by p-values, journals exclude the smaller estimates. Imagine you’re researching a subject area and you find all published articles about a particular effect. You might think that by averaging them together, you’ll get a reasonable estimate. That’s not necessarily the case! Think back to the first set of graphs that displayed only the significant results. Those were biased. It wasn’t until we added in all the smaller, non-significant effects that the average effect was close to the actual effect.
Finally, calculating power for this simulation was easy because I knew the correct values to enter into the power calculations. However, for a study in the real world, it can be difficult. Consequently, you might not always realize you have a low power study. Additionally, consider that the smallest studies in these simulations had 22 subjects split between two groups, which produced a statistical power of 0.3. If this were a real study, I bet the researchers would not realize it had such low power. When in doubt, err on the side of larger sample sizes. And, do your best to be realistic with the power calculations!