Determining a good sample size for a study is always an important issue. After all, using the wrong sample size can doom your study from the start. Fortunately, power analysis can find the answer for you. Power analysis combines statistical analysis, subject-area knowledge, and your requirements to help you derive the optimal sample size for your study.

Statistical power in a hypothesis test is the probability that the test will detect an effect that actually exists. As you’ll see in this post, both under-powered and over-powered studies are problematic. Let’s learn how to find a good sample size for your study!

When you perform hypothesis testing, there is a lot of preplanning you must do before collecting any data. This planning includes identifying the data you will gather, how you will collect it, and how you will measure it, among many other details. A crucial part of the planning is determining how much data you need to collect. I’ll show you how to estimate the sample size for your study.

Before we get to estimating sample size requirements, let’s review the factors that influence statistical significance. This process will help you see the value of formally going through a power and sample size analysis rather than guessing.

**Related post**: 5 Steps for Conducting Scientific Studies with Statistical Analyses

## Factors Involved in Statistical Significance

Look at the chart below and identify which study found a real treatment effect and which one didn’t. Within each study, the difference between the treatment group and the control group is the sample estimate of the effect size.

Did either study obtain significant results? The estimated effects in both studies can represent either a real effect or random sampling error. You don’t have enough information to make that determination. Hypothesis tests incorporate these considerations to determine whether the results are statistically significant.

- **Effect size**: The larger the effect size, the less likely it is to be random error. It’s clear that Study A exhibits a more substantial effect in the sample, but that’s insufficient by itself.
- **Sample size**: Larger sample sizes allow hypothesis tests to detect smaller effects. If Study B’s sample size is large enough, its more modest effect can be statistically significant.
- **Variability**: When your sample data have greater variability, random sampling error is more likely to produce considerable differences between the experimental groups even when there is no real effect. If the sample data in Study A have sufficient variability, random error might be responsible for the large difference.

Hypothesis testing takes all of this information and uses it to calculate the p-value—which you use to determine statistical significance. The key takeaway is that the statistical significance of any effect depends collectively on the size of the effect, the sample size, and the variability present in the sample data. Consequently, you cannot determine a good sample size in a vacuum because the three factors are intertwined.
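To make that interplay concrete, here is a minimal sketch that computes a two-sided p-value for a difference between two group means. It assumes equal group sizes, a known common standard deviation, and a normal (z) approximation rather than the t-test your statistical software would run. The function name and all of the numbers are hypothetical; they are not the values behind Study A or Study B.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_p_value(difference, sd, n_per_group):
    """Two-sided p-value for a difference in means (normal approximation).

    Assumes equal group sizes and a known, common standard deviation.
    """
    standard_error = sd * sqrt(2 / n_per_group)
    z = abs(difference) / standard_error
    return 2 * (1 - NormalDist().cdf(z))


# Hypothetical inputs: the same observed difference with different
# sample sizes and amounts of variability.
for sd, n in [(10, 10), (10, 50), (20, 50)]:
    p = two_sample_p_value(difference=5, sd=sd, n_per_group=n)
    print(f"sd={sd:>2}, n per group={n:>2}: p = {p:.3f}")
```

With these made-up inputs, the same difference of 5 falls below the 0.05 threshold only in the middle case, where the sample is larger than in the first case and the variability is lower than in the third.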

**Related post**: How Hypothesis Tests Work

## Statistical Power of a Hypothesis Test

Because we’re talking about determining the sample size for a study that has not been performed yet, you need to learn about a fourth consideration: statistical power. Statistical power is the probability that a hypothesis test correctly concludes that an effect exists in the population when it truly does. In other words, the test correctly rejects a false null hypothesis. Consequently, power is the complement of the Type II error rate (β), which is the probability of failing to reject a false null hypothesis: Power = 1 – β. The power of the test depends on the other three factors.

For example, if your study has 80% power, it has an 80% chance of detecting an effect that exists. Let this point be a reminder that when you work with samples, nothing is guaranteed! When an effect actually exists in the population, your study might not detect it because you are working with a sample. Samples contain sampling error, which can occasionally cause a random sample to misrepresent the population.
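One way to see what an 80% chance means is to simulate many studies in which the effect really exists and count how often the test detects it. The sketch below does that under simplifying assumptions: normally distributed data, a known standard deviation, and a two-sided z-test at a significance level of 0.05. The design values (a true difference of 1 standard deviation, 16 observations per group) are hypothetical and were picked to give a power near 0.8.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def simulate_power(difference, sd, n_per_group, alpha=0.05, n_studies=5000, seed=1):
    """Fraction of simulated studies whose two-sided z-test rejects the null."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = sd * sqrt(2 / n_per_group)
    detections = 0
    for _ in range(n_studies):
        control = [rng.gauss(0, sd) for _ in range(n_per_group)]
        treatment = [rng.gauss(difference, sd) for _ in range(n_per_group)]
        z = (fmean(treatment) - fmean(control)) / standard_error
        if abs(z) > z_crit:
            detections += 1
    return detections / n_studies


# Hypothetical design: true difference of 1 SD with 16 observations per group.
print(f"Simulated power: {simulate_power(difference=1, sd=1, n_per_group=16):.2f}")
```

Roughly one in five of these simulated studies misses an effect that is really there, purely because of sampling error.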

**Related post**: Types of Errors in Hypothesis Testing

## Goals of a Power and Sample Size Analysis

Power analysis involves taking these four considerations, adding subject-area knowledge, and managing tradeoffs to settle on a sample size. During this process, you must heavily rely on your expertise to provide reasonable estimates of the input values.

Power analysis helps you manage an essential tradeoff. As you increase the sample size, the hypothesis test gains a greater ability to detect small effects. This situation sounds great. However, larger sample sizes cost more money. And, there is a point where an effect becomes so minuscule that it is meaningless in a practical sense.

You don’t want to collect a large and expensive sample only to be able to detect an effect that is too small to be useful! Nor do you want an underpowered study that has a low probability of detecting an important effect. Your goal is to collect a sample that is large enough to have sufficient statistical power to detect a meaningful effect, but not so large that it is wasteful.

As you’ll see in the upcoming examples, the analyst provides numeric values that correspond to “sufficient power” and “meaningful effect.” These values allow you to tailor the analysis to your needs.

All of these details might sound complicated, but a statistical power analysis helps you manage them. In fact, going through this procedure forces you to focus on the relevant information. Typically, you specify three of the four factors discussed above and your statistical software calculates the remaining value. For instance, if you specify the smallest effect size that is practically significant, variability, and power, the software calculates the required sample size.
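For a two-sided comparison of two group means with equal group sizes, a common textbook approximation shows how these quantities are tied together. It is based on the normal distribution, so treat it as a sketch rather than the exact calculation your software performs. The symbols are introduced here for illustration: $n$ is the sample size per group, $\sigma$ is the common standard deviation, $\delta$ is the smallest meaningful difference, $\alpha$ is the significance level, and $1-\beta$ is the power.

$$
n \approx \frac{2\,\sigma^{2}\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}}{\delta^{2}}
$$

With $\alpha$ fixed, choosing any three of $n$, $\sigma$, $\delta$, and power determines the fourth.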

Let’s work through some examples in different scenarios to bring this to life.

## 2-Sample t-Test Power Analysis for Sample Size

Suppose we’re conducting a 2-sample t-test to determine which of two materials is stronger. If one type of material is significantly stronger than the other, we’ll use that material in our process. Furthermore, we’ve tested these materials in a pilot study, which provides background knowledge to draw from.

In a power and sample size analysis, statistical software presents you with a dialog box something like the following:

We’ll go through these fields one-by-one. First off, we will leave **Sample sizes** blank because we want the software to calculate this value.

### Differences

Differences is often a confusing value to enter. Do not enter your guess for the difference between the two types of material. Instead, use your expertise to identify the smallest difference that is still meaningful for your application. In other words, you consider smaller differences to be inconsequential. It would not be worthwhile to expend resources to detect them.

By choosing this value carefully, you tailor the experiment so that it has a reasonable chance of detecting useful differences while allowing smaller, non-useful differences to remain potentially undetected. This value helps prevent us from collecting an unnecessarily large sample.

For our example, we’ll enter 5 because smaller differences are unimportant for our process.

### Power values

Power values is where we specify the probability that the statistical hypothesis test detects the difference in the sample if that difference really exists in the population. This field is where you define the “reasonable chance” that I mentioned earlier. If you hold the other input values constant and increase the power of the test, the required sample size also increases. The proper value to enter in this field depends on norms in your study area or industry. Common power values are 0.8 and 0.9.

We’ll enter a power of 0.9 so that the 2-sample t-test has a 90% chance of detecting a difference of 5.

### Standard deviation

Standard deviation is the field where we enter the data variability. We need to enter an estimate of the common standard deviation for the strengths of the two types of material. These estimates are typically based on pilot studies and historical research data. Inputting better estimates of the variability will produce more reliable power analysis results. Consequently, you should strive to improve these estimates over time as you perform additional studies and testing. Providing good estimates of the standard deviation is often the most difficult part of a power and sample size analysis.

For our example, we’ll assume that the two types of material have a standard deviation of 4 units of strength. After we click OK, we see the results.

**Related post**: Measures of Variability

## Interpreting the Statistical Power Analysis and Sample Size Results

Statistical power and sample size analysis provides both numeric and graphical results, as shown below.

The text output indicates that we need 15 samples per group (total of 30) to have a 90% chance of detecting a difference of 5 units.
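As a rough cross-check on that number, here is a minimal sketch of the sample size calculation using the normal approximation. The significance level of 0.05 is my assumption because it isn’t one of the inputs listed above, and the function name is hypothetical. Statistical software bases this calculation on the t-distribution instead, so expect small discrepancies.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(difference, sd, power, alpha=0.05):
    """Per-group n for a two-sided, two-sample comparison (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / difference) ** 2)


# Inputs from the example: difference = 5, standard deviation = 4, power = 0.9.
print(sample_size_per_group(difference=5, sd=4, power=0.9))  # prints 14

# Holding everything else constant, a lower power target needs fewer samples.
print(sample_size_per_group(difference=5, sd=4, power=0.8))
```

The approximation returns 14 per group, one fewer than the 15 in the text output. The gap is most likely because the t-test must also account for estimating the standard deviation from the data.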

The dot on the Power Curve corresponds to the information in the text output. However, by studying the entire graph, we can learn additional information about how statistical power varies by the difference. If we start at the dot and move down the curve to a difference of 2.5, we learn that the test has a power of approximately 0.4 (40%). This power is too low. However, we indicated that differences less than 5 were not practically significant to our process. Consequently, having low power to detect a difference of 2.5 is not problematic.

Conversely, follow the curve up from the dot and notice how power quickly increases to nearly 100% before we reach a difference of 6. This design satisfies the process requirements while using a manageable sample size of 15 per group.
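If you want to trace points along a curve like this yourself, the sketch below approximates power at several differences for the 15-per-group design. It uses the normal approximation with an assumed significance level of 0.05 and a hypothetical function name, so the values will differ slightly from the t-based curve.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sided(difference, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test (normal approximation)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = abs(difference) / (sd * sqrt(2 / n_per_group))
    # Probability of landing in either rejection region given the true shift.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


# Design from the example: 15 per group, standard deviation of 4.
for difference in (2.5, 4, 5, 6):
    print(f"difference={difference}: power ~ {power_two_sided(difference, 4, 15):.2f}")
```

Under this approximation, power is about 0.4 at a difference of 2.5 and above 0.9 at a difference of 5, which matches the shape described above.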

## Other Power Analysis Options

Now, let’s explore a few more options that are available for power analysis. This time we’ll use a one-tailed test and have the software calculate a value other than sample size.

Suppose we are again comparing the strengths of two types of material. However, in this scenario, we are currently using one kind of material and are considering switching to another. We will change to the new material only if it is stronger than our current material. Again, the smallest difference in strength that is meaningful to our process is 5 units. The standard deviation in this study is now 7. Further, let’s assume that our company uses a standard sample size of 20, and we need approval to increase it to 40. Because the standard deviation (7) is larger than the smallest meaningful difference (5), we might need a larger sample.

In this scenario, the test needs to determine only whether the new material is stronger than the current material. Consequently, we can use a one-tailed test. This type of test provides greater statistical power to determine whether the new material is stronger than the old material, but no power to determine if the current material is stronger than the new—which is acceptable given the dictates of the new scenario.

In this analysis, we’ll enter the two potential values for Sample sizes and leave Power values blank. The software will estimate the power of the test for detecting a difference of 5 for designs with both 20 and 40 samples per group.

We fill in the dialog box as follows:

And, in **Options**, we choose the following one-tailed test:

## Interpreting the Power and Sample Size Results

The statistical output indicates that a design with 20 samples per group (a total of 40) has a ~72% chance of detecting a difference of 5. Generally, this power is considered to be too low. However, a design with 40 samples per group (80 total) achieves a power of ~94%, which is almost always acceptable. Hopefully, the power analysis convinces management to approve the larger sample size.
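Here is a minimal sketch of the same comparison using the normal approximation for a one-tailed test. The significance level of 0.05 is an assumption and the function name is hypothetical. Because software uses the t-distribution, the results should be close to the output above but not identical.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided(difference, sd, n_per_group, alpha=0.05):
    """Approximate power of a one-tailed, two-sample test (normal approximation)."""
    norm = NormalDist()
    shift = difference / (sd * sqrt(2 / n_per_group))
    return norm.cdf(shift - norm.inv_cdf(1 - alpha))


# Scenario from the example: difference = 5, standard deviation = 7.
for n in (20, 40):
    print(f"n per group={n}: power ~ {power_one_sided(5, 7, n):.2f}")
```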

Assess the Power Curve graph to see how the power varies by the difference. For example, the curve for the sample size of 20 indicates that the smaller design does not achieve 90% power until the difference is approximately 6.5. If increasing the sample size is genuinely cost prohibitive, perhaps accepting 90% power for a difference of 6.5, rather than 5, is acceptable. Use your process knowledge to make this type of determination.
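You can also turn the calculation around and solve for the difference instead of the power. The sketch below does this for the one-tailed test, again with the normal approximation, an assumed significance level of 0.05, and a hypothetical function name.

```python
from math import sqrt
from statistics import NormalDist


def detectable_difference(sd, n_per_group, power, alpha=0.05):
    """Smallest difference a one-tailed, two-sample test detects with the given power."""
    norm = NormalDist()
    standard_error = sd * sqrt(2 / n_per_group)
    return (norm.inv_cdf(1 - alpha) + norm.inv_cdf(power)) * standard_error


# Scenario from the example: standard deviation = 7, target power = 0.9.
for n in (20, 40):
    print(f"n per group={n}: detectable difference ~ {detectable_difference(7, n, 0.9):.1f}")
```

For 20 per group, the approximation gives a difference of about 6.5, in line with the curve.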

## Use Power Analysis for Sample Size Estimation For All Studies

Throughout this post, we’ve been looking at continuous data, and using the 2-sample t-test specifically. For continuous data, you can also use power analysis to assess sample sizes for ANOVA and DOE designs. Additionally, there are hypothesis tests for other types of data, such as proportions tests (binomial data) and rates of occurrence (Poisson data). These tests have their own corresponding power and sample size analyses.

In general, when you move away from continuous data to these other types of data, your sample size requirements increase. Each type also has its own intricacies. For instance, in a proportions test, the sample size you need to detect a particular difference depends on where the proportions fall between 0 and 1, because the variability of a proportion, p(1 − p), changes with p. A fixed absolute difference generally requires the largest sample when the proportions are near the middle (0.5), whereas a fixed relative difference requires larger samples as the proportion approaches 0. Many factors can affect the optimal sample size. Power analysis helps you navigate these concerns.
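To see how the baseline proportion enters the calculation, here is a minimal sketch for a two-sided test comparing two proportions. It relies on the normal approximation with unpooled variances, which becomes unreliable when expected counts are small. The function name, the baselines, and the absolute difference of 0.05 are all hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_two_proportions(p1, p2, power=0.9, alpha=0.05):
    """Per-group n for a two-sided, two-proportion test (normal approximation)."""
    norm = NormalDist()
    z_total = norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(z_total ** 2 * variance_sum / (p1 - p2) ** 2)


# Hypothetical baselines, each compared against baseline + 0.05.
for baseline in (0.05, 0.25, 0.50, 0.75, 0.90):
    n = n_per_group_two_proportions(baseline, baseline + 0.05)
    print(f"baseline={baseline:.2f}: n per group = {n}")
```

Because p(1 − p) peaks at 0.5, this approximation asks for the most data near the middle when the absolute difference is held fixed. Swap in a fixed relative difference (for example, p2 = 1.1 × p1) to watch the requirement climb as the baseline approaches 0.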

After reading this post, I hope you see how power analysis combines statistical analyses, subject-area knowledge, and your requirements to help you derive the optimal sample size for your specific needs. If you don’t perform this analysis, you risk running a study that either is likely to miss an important effect or uses an exorbitantly large sample. I’ve written a post about a Mythbusters experiment that had no chance of detecting an effect because they guessed a sample size instead of performing a power analysis.

In this post, I’ve focused on how power affects your test’s ability to detect a real effect. However, low-power tests also tend to exaggerate effect sizes when they do produce significant results!

Finally, experimentation is an iterative process. As you conduct more studies in an area, you’ll develop better estimates to input into power and sample size analyses and gain a clearer picture of how to proceed.

Jaser Ali says

Hi sir,

I just wanted to understand whether the confidence interval and power are the same.

Tom says

Thanks for your explanation, Jim.

Tom says

Jim,

I would like to design a test for the following problem

(under the assumption that the Poisson distribution applies):

Samples from a population can be either defective or not (e.g. some technical component from a production)

Out of a random sample of N, there should be at most k defective occurrences, with a 95% probability (e.g. N = 100’000, k = 30).

I would like to design a test for this (testing this Hypothesis) with a sample size N1 (different from N).

What should my limit on k1 (defective occurrences from the sample of N1) be?

Such that I can say that with a 95% confidence, there will be at most k occurrences out of N samples.

E.g. N2 = 20’000. k1 = ???

Any hints how to tackle this problem?

Many thanks in advance

Tom

Jim Frost says

Hi Tom,

To me, it sounds like you need to use the binomial distribution rather than the Poisson distribution. You use the binomial distribution when you have binary data and you know the probability of an event and the number of trials. That sounds like your scenario!

In the graph below, I illustrate a binomial distribution where we assume the defect rate is 0.001 and the sample size is 100,000. I had the software shade the upper and lower ~2.5% of the tails. 95% of the outcomes should fall within the middle.

If you have sample data, you can use the Proportions hypothesis test, which is based on the binomial distribution. If you have a single sample, use the Proportions test to determine whether your sample is significantly different from a target probability and to construct a confidence interval.

I hope this helps!

Gavin Austin says

Hi Jim,

Thanks very much for putting together this very helpful and informative page. I just have a quick question about statistical power: it’s been surprisingly difficult for me to locate an answer to it in the literature.

I want to calculate the sample size required in order to reach a certain level of a priori statistical power in my experiment. My question is about what ‘sample size’ means in this type of calculation. Does it mean the number of participants or the number of data points? If there is one data point per participant, then these numbers will obviously be the same. However, I’m using a mixed-effects logistic regression model in which there are multiple data points nested within each participant. (Each participant produces multiple ‘yes/no’ responses.)

It would seem odd if the calculation of a priori statistical power did not differentiate between whether each participant produces one response or multiple responses.

Gavin

Khalid says

Thank you so much sir for the lucid explanation. Really appreciate your kind help. Many Thanks!

Khalid says

Dear sir,

When I search online for sample size determination, I predominantly see mention of the margin of error formula for its calculation.

At other places, like your website, I see the use of effect size, desired power, etc. for the same calculation.

I’m struggling to reconcile between these 2 approaches. Is there a link between the two?

I wish to determine sample size for testing a hypothesis with sufficient power, say 80% or 90%. Please guide me.

Jim Frost says

Hi Khalid, a margin of error (MOE) quantifies the amount of random sampling error in the estimation of a parameter, such as the mean or proportion. MOEs represent the uncertainty about how well the sample estimates from a study represent the true population value and are related to confidence intervals. In a confidence interval, the margin of error is the distance between the sample estimate and each endpoint of the CI.

Margins of error are commonly used for surveys. For example, suppose a survey finds that 75% of the respondents like the product, with a MOE of 3 percent. This result indicates that we can be 95% confident that 75% +/- 3% (or 72-78%) of the population like the product.

If you conduct a study, you can estimate the sample size that you need to achieve a specific margin of error. The narrower the MOE, the more precise the estimate. If you have requirements about the precision of the estimates, then you might need to estimate the margin of error based on different sample sizes. This is simply one form of power and sample size analysis where the focus is on how sample sizes relate to the margin of error.
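As a rough sketch of that kind of calculation, the function below estimates the sample size needed for a proportion’s margin of error. It assumes simple random sampling and the normal approximation for a proportion, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_for_margin_of_error(proportion, margin, confidence=0.95):
    """Sample size so a proportion's confidence interval has the given half-width."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * proportion * (1 - proportion) / margin ** 2)


# Survey example: expected proportion 0.75, margin of error 0.03.
print(n_for_margin_of_error(0.75, 0.03))  # prints 801
```

Halving the margin of error roughly quadruples the required sample size, because the margin appears squared in the denominator.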

However, if you need to calculate power to detect an effect, use the methods I describe in this post.

In summary, determine what your requirements are and use the corresponding analysis. Do you need to estimate a sample size that produces a level of precision that you specify for the estimates? Or, do you need to estimate a sample size that produces an amount of power to detect an effect of a specific size? Of course, these are related questions, and it comes down to what you want to input as your criteria.

I hope this helps!

增大 says

I benefited a great deal from this, and it gave me a lot to think about!

Ashwini says

Jim,

Thank you so much for this very intuitive article on sample size.

Thank you,

Ashwini

Jim Frost says

Hi Ashwini, you’re very welcome! I’m glad it was helpful!

hellen turyahabwa says

Thank you. This was very helpful.

Jim Frost says

You’re very welcome, Hellen! I’m glad you found it to be helpful!

George says

Thanks for your answer Jim. I was indeed aware of this tool, which is great for demonstration. I think I’ll stick to it.

Saheeda Mujaffar says

Awaiting your book!

Jim Frost says

Thanks! If all goes well, the first one should be out in September 2018!

George says

Once again, a nice demonstration. Thanks Jim.

I was wondering which software you used in your examples. Is it, perhaps, R or G*Power? And, would you have any suggestions on an (online/offline) tool that can be used in class?

Thanks!

Jim Frost says

Hi George, thank you very much! I’m glad it was helpful! I used Minitab for the examples, but I would imagine that most statistical software packages have similar features.

I found this interactive tool for displaying how power, alpha, effect size, etc. are related. Perhaps this is what you’re looking for?

Dharmendra Dubey says

Thanks for the information. Please explain how to calculate the sample size for a case-control study when different studies report different prevalences for different parameters.

Khursheed Ahmad Ganaie says

Thanks, sir.

I want to salute you, but you are too far away.

Sir, please send me some articles on probability distributions.

Much kindness.