Despite the popular notion to the contrary, understanding the results of your statistical hypothesis test is not as simple as determining only whether your P value is less than your significance level. In this post, I present additional considerations that help you assess and minimize the possibility of being fooled by false positives and other misleading results.
The False Positive Problem
When a statistical hypothesis test produces significant results, there is always a chance that it is a false positive. In this context, a false positive occurs when you obtain a statistically significant P value, and you unknowingly reject a null hypothesis that is actually true. You conclude that an effect exists in the population when it actually does not exist.
In a previous post about how to interpret P values correctly, I showed how a common misconception about interpreting P values produces the illusion of substantially more evidence against the null hypothesis than is justified. For example, a P value near 0.05 often has a false positive error rate of between 23% and 50%. These higher-than-expected false positive rates create serious doubts about trusting results that are statistically significant.
From a strictly scientific point of view, high false positive rates are problematic because they produce misleading results. The problem is practical, too: if you are using a hypothesis test to improve a product or process, you won’t obtain the benefits that you expect if the test result is a false positive. That can cost you a lot of money!
Let’s delve into the tips. These tips will help you develop a deeper understanding of your test results. I’ll use a real AIDS vaccine study conducted in Thailand to work through these considerations. The study obtained a P value of 0.039, which sounds great. Hurray, the vaccine works! However, after reading through this blog post, you might think differently.
Tip 1: Smaller P values are Better
Analysts often view statistical results as being either significant or not. The focus is on whether the P value is less than the significance level because statistically significant results are so highly prized. Unfortunately, that’s an oversimplification because no special significance level correctly determines which studies have real population effects 100% of the time. Instead, we need to focus on understanding the relationship between false positive rates and P values.
Simulation studies find that lower false positive rates are associated with lower P values. For example, a P value close to 0.05 often has an error rate of 23% to 50%. However, a P value of 0.0027 often has an error rate around 4.5%. That error rate is close to the rate that is often erroneously ascribed to a P value of 0.05.
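To get a feel for how a simulation study of this kind works, here is a minimal sketch. It is not the simulation behind the figures quoted above, and every setting in it is an assumption made for illustration: half of the simulated studies have a true null hypothesis (`prior_null=0.5`), and studies with a real effect have a test statistic centered on `effect_z=2.8`. The function name `simulate_false_positive_rate` is hypothetical.

```python
import random
from statistics import NormalDist


def simulate_false_positive_rate(p_low, p_high, prior_null=0.5, effect_z=2.8,
                                 n_studies=200_000, seed=1):
    """Among simulated studies whose two-sided P value lands in [p_low, p_high),
    return the fraction in which the null hypothesis was actually true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    true_null_hits = 0
    total_hits = 0
    for _ in range(n_studies):
        # Assumption: each study has a true null with probability prior_null.
        null_is_true = rng.random() < prior_null
        # Test statistic: centered on 0 under the null, on effect_z otherwise.
        z = rng.gauss(0.0 if null_is_true else effect_z, 1.0)
        p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
        if p_low <= p_value < p_high:
            total_hits += 1
            true_null_hits += null_is_true
    return true_null_hits / total_hits if total_hits else float("nan")


near_005 = simulate_false_positive_rate(0.04, 0.05)
near_0027 = simulate_false_positive_rate(0.002, 0.0035)
print(f"P between 0.04 and 0.05:     {near_005:.1%} came from a true null")
print(f"P between 0.002 and 0.0035:  {near_0027:.1%} came from a true null")
```

The exact percentages depend heavily on the assumed share of true nulls and the assumed effect size, so treat the output as a ballpark. The pattern to look for is that the window of smaller P values contains a much smaller share of false positives.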
Lower P values indicate stronger evidence against the null hypothesis and a lower probability of a false positive. You can’t hang your hat on a single study that produces a P value near 0.05. Your P value needs to be close to 0.002 before you can start to get excited over the statistical results from a single study.
It’s important to note that there is no directly calculable relationship between P values and the false positive rate. However, simulation studies and the Bayesian approach can produce ballpark estimates of the false positive rate. On the empirical side of things, studies with lower P values have higher reproducibility rates in follow-up studies.
To help avoid misleading results, you should consider the exact value of the P value. Using the binary approach of a yes or no determination of statistical significance is too simplistic.
The AIDS vaccine study has a P value of 0.039. Based on the information above, we should be cautious about this result.
Tip 2: Replication is Crucial
In the previous tip, I referred to results from a single study. Realistically, you need to replicate statistically significant results several times before you can have confidence in the conclusions.
Given the high pressure to obtain significant P values, a single P value is often considered to be conclusive evidence. However, Ronald Fisher developed P values with the notion that they are just one part of the scientific process that includes experimentation, analysis, and replication.
“A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.” –Ronald Fisher
The false positive rates that are associated with a single study that has a P value between 0.01 and 0.05 are likely to be too high to be acceptable. In these cases, you need repeated experimentation with consistently significant results to be confident that the alternative hypothesis is correct.
Along these lines, you can think of P values from independent studies as probabilities that you can multiply. For example, if two independent studies both have P values of 0.05, the probability of obtaining two results at least that extreme when the null hypothesis is true is 0.05 × 0.05 = 0.0025. Keep in mind that this product is a rough guide rather than a formal combined P value. If you use this approach, you can’t cherry pick the best studies. You need to include all studies in a series of relevant studies whether they are significant or not.
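If you want a formal combined P value rather than a raw product, one standard option is Fisher's method. Note that this is a different technique from the simple multiplication described above, and it assumes the studies are independent. The sketch below uses only the Python standard library; the function name `fisher_combined_p` is hypothetical.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent P values with Fisher's method.

    The statistic -2 * sum(ln p) follows a chi-square distribution with 2k
    degrees of freedom under the null hypothesis. For even degrees of freedom
    the upper tail has the closed form used below.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # the statistic divided by 2
    tail_sum = sum(half_stat ** i / math.factorial(i) for i in range(k))
    return math.exp(-half_stat) * tail_sum


p_values = [0.05, 0.05]
print(f"raw product:     {math.prod(p_values):.4f}")
print(f"Fisher combined: {fisher_combined_p(p_values):.4f}")
```

For two studies with P values of 0.05, the combined P value comes out noticeably larger than the raw product of 0.0025, although it is still well below 0.05. Either way, the point stands: consistent results across independent studies are stronger evidence than any one study alone.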
You should consider results from a study in conjunction with other similar studies. It is extremely unlikely that a single study can prove that the alternative hypothesis is true with any confidence. So, don’t expect it to!
For the AIDS vaccine study, the Thai study is the first AIDS vaccination study to produce statistically significant results. It has not been replicated yet, so we need to be wary of misleading results. This vaccine has not built up a track record of significant results.
Tip 3: The Effect Size is Important
The high pressure to obtain statistically significant P values draws attention away from both the effect size and the precision of the estimate. You can have statistically significant test results even when effect sizes are too small to be practically significant. Additionally, a significant P value does not indicate that the effect size has been estimated with high precision.
You should not think about statistical significance strictly from a binary perspective. Instead, consider whether the effect size is large enough to be practically significant and whether the estimate is sufficiently precise.
Unfortunately, the confidence interval for the effectiveness of the AIDS vaccine extends from 1% to 52%. The vaccine might work anywhere from almost none of the time to about half the time. The confidence interval reveals that the estimated effect size is both small and imprecise. In this case, the P value provides a misleading idea about what the data show.
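To see how an interval like this arises, here is a rough sketch that computes vaccine efficacy (one minus the relative risk) with a simple Wald interval on the log relative risk. The case counts are illustrative assumptions, not figures from this post, chosen so that the interval comes out broadly similar to the one quoted above. The study itself may well have used a different method, and `vaccine_efficacy_ci` is a hypothetical helper.

```python
import math
from statistics import NormalDist


def vaccine_efficacy_ci(cases_vaccine, n_vaccine, cases_placebo, n_placebo,
                        confidence=0.95):
    """Return (estimate, lower, upper) for efficacy = 1 - relative risk,
    using a Wald interval on the log relative risk."""
    relative_risk = (cases_vaccine / n_vaccine) / (cases_placebo / n_placebo)
    se_log_rr = math.sqrt(1 / cases_vaccine - 1 / n_vaccine
                          + 1 / cases_placebo - 1 / n_placebo)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    rr_low = math.exp(math.log(relative_risk) - z * se_log_rr)
    rr_high = math.exp(math.log(relative_risk) + z * se_log_rr)
    # A high relative risk means low efficacy, so the bounds swap.
    return 1 - relative_risk, 1 - rr_high, 1 - rr_low


# Illustrative counts only (an assumption, not taken from this post).
estimate, lower, upper = vaccine_efficacy_ci(51, 8197, 74, 8198)
print(f"efficacy {estimate:.0%}, 95% CI {lower:.0%} to {upper:.0%}")
```

Even with thousands of participants per group, the interval is wide because the number of infections is small. That is why a significant P value can coexist with a very imprecise estimate of the effect.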
Related post: Practical vs. Statistical Significance
Tip 4: The Plausibility of the Alternative Hypothesis Matters
As we evaluate P values in statistical hypothesis tests, there is a tendency to think that similar P values across studies provide similar support for the alternative hypothesis. For example, a P value of 0.04 in one study seems to provide the same amount of evidence as a P value of 0.04 in another study. However, simulation studies show that the plausibility of the study’s alternative hypothesis greatly affects the false positive rate.
For instance, with a P value of 0.05, a highly plausible alternative hypothesis is associated with a false positive rate of at least 12% while an implausible alternative has a rate of at least 76%! In other words, if you are studying an implausible alternative hypothesis and you obtain a significant P value, there is a greater probability that the alternative hypothesis is not actually correct.
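One way to explore how plausibility changes the picture is a Bayesian lower bound on the false positive rate. The sketch below uses the `-e * p * ln(p)` bound on the Bayes factor, which is an assumption on my part about the calibration to use. It will not reproduce the percentages quoted above exactly, because those depend on the assumptions of the studies that produced them. The function name and the prior probabilities are illustrative.

```python
import math


def min_false_positive_prob(p_value, prior_null):
    """Lower bound on P(null is true | observed P value), using the
    -e * p * ln(p) bound on the Bayes factor (valid for p < 1/e)."""
    if not 0 < p_value < 1 / math.e:
        raise ValueError("bound only applies for 0 < p < 1/e")
    if not 0 < prior_null < 1:
        raise ValueError("prior_null must be strictly between 0 and 1")
    bayes_factor_bound = -math.e * p_value * math.log(p_value)
    prior_odds_null = prior_null / (1 - prior_null)
    posterior_odds_null = prior_odds_null * bayes_factor_bound
    return posterior_odds_null / (1 + posterior_odds_null)


# Assumed prior probabilities that the null is true: a plausible alternative
# (0.33), an even split (0.5), and an implausible alternative (0.9).
for prior_null in (0.33, 0.5, 0.9):
    bound = min_false_positive_prob(0.05, prior_null)
    print(f"prior P(null) = {prior_null:.2f} -> false positive rate of at least {bound:.0%}")
```

The same P value of 0.05 yields very different lower bounds depending on how plausible the alternative hypothesis was before the study. The prior is something you have to supply from subject-area knowledge; the P value cannot provide it.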
Extraordinary claims require extraordinary evidence—consider the plausibility of the alternative hypothesis in conjunction with the P value. A significant P value doesn’t absolve us of using our common sense while interpreting the results. If you hear about a startling study that produces unprecedented results, don’t let that initial significant P value sway you too much. Wait until the other studies replicate the findings before you trust them!
No studies of other AIDS vaccines have provided sufficient evidence to reject the null hypothesis. This pattern suggests that the alternative hypothesis is unlikely to be true for the Thai study. In this scenario, we can expect false positive rates of around 75%! That makes replicating the results with other studies crucial (see Tip 2).
Tip 5: Use Your Expertise
You must apply your subject area expertise to all facets of statistical hypothesis testing to avoid misleading results. Investigators should use their expertise to evaluate the validity of the experimental design, proposed mechanisms behind the effect, practical significance of the effect, the plausibility of the alternative hypothesis, and so on. Expertise transforms statistical results from numbers into meaningful findings that you can trust.
Evaluating the Hypothesis Test Results for the AIDS Vaccine Study
In this post, we looked at an AIDS vaccine study that had statistically significant results. However, we saw the following:
- The P value of 0.039 is not compelling evidence by itself.
- The vaccine does not have a proven track record of significant results.
- The confidence interval indicates that the estimated effect size is both small and imprecise.
- Studies of other AIDS vaccines have not had significant results, which suggests that the alternative hypothesis in the Thai study is implausible.
Taken together, the additional considerations should make us cautious about potentially misleading results. In other words, we shouldn’t pop open a bottle of champagne and start mass producing the vaccine yet. We need to wait and see if other studies will replicate these results. We also need to keep an eye on the effect size in future studies to determine whether the vaccine’s effectiveness is practically significant.
Now that’s a much more thorough assessment than simply noting that the P value is statistically significant!