Determining a good sample size for a study is always an important issue. After all, using the wrong sample size can doom your study from the start. Fortunately, power analysis can find the answer for you. Power analysis combines statistical analysis, subject-area knowledge, and your requirements to help you derive the optimal sample size for your study.
Statistical power in a hypothesis test is the probability that the test will detect an effect that actually exists. As you’ll see in this post, both under-powered and over-powered studies are problematic. Let’s learn how to find a good sample size for your study!
When you perform hypothesis testing, there is a lot of preplanning you must do before collecting any data. This planning includes identifying the data you will gather, how you will collect it, and how you will measure it among many other details. A crucial part of the planning is determining how much data you need to collect. I’ll show you how to estimate the sample size for your study.
Before we get to estimating sample size requirements, let’s review the factors that influence statistical significance. This process will help you see the value of formally going through a power and sample size analysis rather than guessing.
Factors Involved in Statistical Significance
Look at the chart below and identify which study found a real treatment effect and which one didn’t. Within each study, the difference between the treatment group and the control group is the sample estimate of the effect size.
Did either study obtain significant results? The estimated effects in both studies can represent either a real effect or random sample error. You don’t have enough information to make that determination. Hypothesis tests incorporate these considerations to determine whether the results are statistically significant.
- Effect size: The larger the effect size, the less likely it is to be random error. It’s clear that Study A exhibits a more substantial effect in the sample—but that’s insufficient by itself.
- Sample size: Larger sample sizes allow hypothesis tests to detect smaller effects. If Study B’s sample size is large enough, its more modest effect can be statistically significant.
- Variability: When your sample data have greater variability, random sampling error is more likely to produce considerable differences between the experimental groups even when there is no real effect. If the sample data in Study A have sufficient variability, random error might be responsible for the large difference.
Hypothesis testing takes all of this information and uses it to calculate the p-value—which you use to determine statistical significance. The key takeaway is that the statistical significance of any effect depends collectively on the size of the effect, the sample size, and the variability present in the sample data. Consequently, you cannot determine a good sample size in a vacuum because the three factors are intertwined.
Related post: How Hypothesis Tests Work
Statistical Power of a Hypothesis Test
Because we’re talking about determining the sample size for a study that has not been performed yet, you need to learn about a fourth consideration—statistical power. Statistical power is the probability that a hypothesis test correctly infers that a sample effect exists in the population. In other words, the test correctly rejects a false null hypothesis. Consequently, power is inversely related to a Type II error. Power = 1 – β. The power of the test depends on the other three factors.
For example, if your study has 80% power, it has an 80% chance of detecting an effect that exists. Let this point be a reminder that when you work with samples, nothing is guaranteed! When an effect actually exists in the population, your study might not detect it because you are working with a sample. Samples contain sample error, which can occasionally cause a random sample to misrepresent the population.
Related post: Types of Errors in Hypothesis Testing
Goals of a Power and Sample Size Analysis
Power analysis involves taking these three considerations, adding subject-area knowledge, and managing tradeoffs to settle on a sample size. During this process, you must rely heavily on your expertise to provide reasonable estimates of the input values.
Power analysis helps you manage an essential tradeoff. As you increase the sample size, the hypothesis test gains a greater ability to detect small effects. This situation sounds great. However, larger sample sizes cost more money. And, there is a point where an effect becomes so minuscule that it is meaningless in a practical sense.
You don’t want to collect a large and expensive sample only to be able to detect an effect that is too small to be useful! Nor do you want an underpowered study that has a low probability of detecting an important effect. Your goal is to collect a large enough sample to have sufficient power to detect a meaningful effect—but not too large to be wasteful.
As you’ll see in the upcoming examples, the analyst provides numeric values that correspond to “a good chance” and “meaningful effect.” These values allow you to tailor the analysis to your needs.
All of these details might sound complicated, but a statistical power analysis helps you manage them. In fact, going through this procedure forces you to focus on the relevant information. Typically, you specify three of the four factors discussed above and your statistical software calculates the remaining value. For instance, if you specify the smallest effect size that is practically significant, variability, and power, the software calculates the required sample size.
Let’s work through some examples in different scenarios to bring this to life.
2-Sample t-Test Power Analysis for Sample Size
Suppose we’re conducting a 2-sample t-test to determine which of two materials is stronger. If one type of material is significantly stronger than the other, we’ll use that material in our process. Furthermore, we’ve tested these materials in a pilot study, which provides background knowledge for the estimates.
In a power and sample size analysis, statistical software presents you with a dialog box something like the following:
We’ll go through these fields one-by-one. First off, we will leave Sample sizes blank because we want the software to calculate this value.
Differences is often a confusing value to enter. Do not enter your guess for the difference between the two types of material. Instead, use your expertise to identify the smallest difference that is still meaningful for your application. In other words, you consider smaller differences to be inconsequential. It would not be worthwhile to expend resources to detect them.
By choosing this value carefully, you tailor the experiment so that it has a reasonable chance of detecting useful differences while allowing smaller, non-useful differences to remain potentially undetected. This value helps prevent us from collecting an unnecessarily large sample.
For our example, we’ll enter 5 because smaller differences are unimportant for our process.
Power values is where we specify the probability that the statistical hypothesis test detects the difference in the sample if that difference exists in the population. This field is where you define the “reasonable chance” that I mentioned earlier. If you hold the other input values constant and increase the test’s power, the required sample size also increases. The proper value to enter in this field depends on norms in your study area or industry. Common power values are 0.8 and 0.9.
We’ll enter a power of 0.9 so that the 2-sample t-test has a 90% chance of detecting a difference of 5.
Standard deviation is the field where we enter the data variability. We need to enter an estimate for the standard deviation of material strength. Analysts frequently base these estimates on pilot studies and historical research data. Inputting better variability estimates will produce more reliable power analysis results. Consequently, you should strive to improve these estimates over time as you perform additional studies and testing. Providing good estimates of the standard deviation is often the most difficult part of a power and sample size analysis.
For our example, we’ll assume that the two types of material have a standard deviation of 4 units of strength. After we click OK, we see the results.
Related post: Measures of Variability
Interpreting the Statistical Power Analysis and Sample Size Results
Statistical power and sample size analysis provides both numeric and graphical results, as shown below.
The text output indicates that we need 15 samples per group (total of 30) to have a 90% chance of detecting a difference of 5 units.
The dot on the Power Curve corresponds to the information in the text output. However, by studying the entire graph, we can learn additional information about how statistical power varies by the difference. If we start at the dot and move down the curve to a difference of 2.5, we learn that the test has a power of approximately 0.4 (40%). This power is too low. However, we indicated that differences less than 5 were not practically significant to our process. Consequently, having low power to detect a difference of 2.5 is not problematic.
Conversely, follow the curve up from the dot and notice how power quickly increases to nearly 100% before we reach a difference of 6. This design satisfies the process requirements while using a manageable sample size of 15 per group.
Other Power Analysis Options
Now, let’s explore a few more options that are available for power analysis. This time we’ll use a one-tailed test and have the software calculate a value other than sample size.
Suppose we are again comparing the strengths of two types of material. However, in this scenario, we are currently using one kind of material and are considering switching to another. We will change to the new material only if it is stronger than our current material. Again, the smallest difference in strength that is meaningful to our process is 5 units. The standard deviation in this study is now 7. Further, let’s assume that our company uses a standard sample size of 20, and we need approval to increase it to 40. Because the standard deviation (7) is larger than the smallest meaningful difference (5), we might need a larger sample.
In this scenario, the test needs to determine only whether the new material is stronger than the current material. Consequently, we can use a one-tailed test. This type of test provides greater statistical power to determine whether the new material is stronger than the old material, but no power to determine if the current material is stronger than the new—which is acceptable given the dictates of the new scenario.
In this analysis, we’ll enter the two potential values for Sample sizes and leave Power values blank. The software will estimate the power of the test for detecting a difference of 5 for designs with both 20 and 40 samples per group.
We fill in the dialog box as follows:
And, in Options, we choose the following one-tailed test:
Interpreting the Power and Sample Size Results
The statistical output indicates that a design with 20 samples per group (a total of 40) has a ~72% chance of detecting a difference of 5. Generally, this power is considered to be too low. However, a design with 40 samples per group (80 total) achieves a power of ~94%, which is almost always acceptable. Hopefully, the power analysis convinces management to approve the larger sample size.
Assess the Power Curve graph to see how the power varies by the difference. For example, the curve for the sample size of 20 indicates that the smaller design does not achieve 90% power until the difference is approximately 6.5. If increasing the sample size is genuinely cost prohibitive, perhaps accepting 90% power for a difference of 6.5, rather than 5, is acceptable. Use your process knowledge to make this type of determination.
Use Power Analysis for Sample Size Estimation For All Studies
Throughout this post, we’ve been looking at continuous data, and using the 2-sample t-test specifically. For continuous data, you can also use power analysis to assess sample sizes for ANOVA and DOE designs. Additionally, there are hypothesis tests for other types of data, such as proportions tests (binomial data) and rates of occurrence (Poisson data). These tests have their own corresponding power and sample analyses.
In general, when you move away from continuous data to these other types of data, your sample size requirements increase. And, there are unique intricacies in each. For instance, in a proportions test, you need a relatively larger sample size to detect a difference when your proportion is closer 0 or 1 than if it is in the middle (0.5). Many factors can affect the optimal sample size. Power analysis helps you navigate these concerns.
After reading this post, I hope you see how power analysis combines statistical analyses, subject-area knowledge, and your requirements to help you derive the optimal sample size for your specific needs. If you don’t perform this analysis, you risk performing a study that is either likely to miss an important effect or have an exorbitantly large sample size. I’ve written a post about a Mythbusters experiment that had no chance of detecting an effect because they guessed a sample size instead of performing a power analysis.
In this post, I’ve focused on how power affects your test’s ability to detect a real effect. However, low power tests also exaggerate effect sizes!
Finally, experimentation is an iterative process. As you conduct more studies in an area, you’ll develop better estimates to input into power and sample size analyses and gain a clearer picture of how to proceed.