Despite the popular notion to the contrary, understanding the results of your statistical hypothesis test is not as simple as determining only whether your P value is less than your significance level. In this post, I present additional considerations that help you assess and minimize the possibility of being fooled by false positives and other misleading results.
The False Positive Problem
When a statistical hypothesis test produces significant results, there is always a chance that it is a false positive. In this context, a false positive occurs when you obtain a statistically significant P value, and you unknowingly reject a null hypothesis that is actually true. You conclude that an effect exists in the population when it actually does not exist.
In a previous post about how to interpret P values correctly, I showed how a common misconception about interpreting P values produces the illusion of substantially more evidence against the null hypothesis than is justified. For example, a P value near 0.05 often has a false positive error rate of between 23-50%. These greater than expected false positive rates create serious doubts about trusting results that are statistically significant.
From a strictly scientific point of view, high false positive rates are problematic because they produce misleading results. The problem is just as real in applied settings: if you are using a hypothesis test to improve a product or process, you won’t obtain the benefits that you expect if the test results are a false positive. That can cost you a lot of money!
Let’s delve into the tips. These tips will help you develop a deeper understanding of your test results. I’ll use a real AIDS vaccine study conducted in Thailand to work through these considerations. The study obtained a P value of 0.039, which sounds great. Hurray, the vaccine works! However, after reading through this blog post, you might think differently.
Related Posts: How Hypothesis Tests Work and Why Are P Values Misinterpreted So Frequently?
Tip 1: Smaller P values are Better
Analysts often view statistical results as being either significant or not. The focus is on whether the P value is less than the significance level because statistically significant results are so highly prized. Unfortunately, that’s an oversimplification because no special significance level correctly determines which studies have real population effects 100% of the time. Instead, we need to focus on understanding the relationship between false positive rates and P values.
Simulation studies find that lower false positive rates are associated with lower P values. For example, a P value close to 0.05 often has an error rate of 25-50%. However, a P value of 0.0027 often has an error rate around 4.5%. That error rate is close to the rate that is often erroneously ascribed to a P value of 0.05.
Lower P values indicate stronger evidence against the null hypothesis and a lower probability of a false positive. You can’t hang your hat on a single study that produces a P value near 0.05. Your P value needs to be close to 0.002 before you can start to get excited over the statistical results from a single study.
It’s important to note that there is no directly calculable relationship between P values and the false positive rate. However, simulation studies and the Bayesian approach can produce ballpark estimates of the false positive rate. On the empirical side of things, studies with lower P values have higher reproducibility rates in follow-up studies.
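To make the idea of a simulation study concrete, here is a minimal sketch of one. It is not the simulation behind the figures quoted above. It assumes a two-sided z-test, that half of all studies test a real effect (`prior_real=0.5`), and that real effects are large enough for roughly 80% power (`effect_z=2.8`). All names and parameter values are my own illustrative choices, and changing the assumptions changes the answer.

```python
import random
from statistics import NormalDist


def false_positive_share(p_low, p_high, prior_real=0.5, effect_z=2.8,
                         n_studies=200_000, seed=1):
    """Among simulated studies whose two-sided P value lands in
    [p_low, p_high], return the share in which the null was actually true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    null_hits = 0
    total_hits = 0
    for _ in range(n_studies):
        # Assumption: each study tests a real effect with probability prior_real.
        effect_is_real = rng.random() < prior_real
        # The z statistic is centered on 0 under the null, on effect_z otherwise.
        z = rng.gauss(effect_z if effect_is_real else 0.0, 1.0)
        p = 2.0 * (1.0 - std_normal.cdf(abs(z)))
        if p_low <= p <= p_high:
            total_hits += 1
            if not effect_is_real:
                null_hits += 1
    return null_hits / total_hits if total_hits else float("nan")


near_05 = false_positive_share(0.04, 0.05)
near_003 = false_positive_share(0.002, 0.003)
print(f"P between 0.04 and 0.05:   {near_05:.0%} false positives")
print(f"P between 0.002 and 0.003: {near_003:.0%} false positives")
```

Under these assumptions, results with P values just under 0.05 are false positives far more often than 5% of the time, while results with much smaller P values rarely are. Lowering `prior_real` pushes both shares up.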
To help avoid misleading results, you should consider the exact value of the P value. Using the binary approach of a yes or no determination of statistical significance is too simplistic.
The AIDS vaccine study has a P value of 0.039. Based on the information above, we should be cautious of this result.
Typically, you’re hoping for low p-values. However, even high p-values have benefits!
Tip 2: Replication is Crucial
In the previous tip, I referred to results from a single study. Realistically, you need to replicate statistically significant results several times before you can have confidence in the conclusions.
Given the high pressure to obtain significant P values, a single P value is often considered to be conclusive evidence. However, Ronald Fisher developed P values with the notion that they are just one part of the scientific process that includes experimentation, analysis, and replication.
A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance. –Ronald Fisher
The false positive rates that are associated with a single study that has a P value between 0.01 and 0.05 are likely to be too high to be acceptable. In these cases, you need repeated experimentation with consistently significant results to be confident that the alternative hypothesis is correct.
Along these lines, you can think of P values as probabilities that you can multiply. For example, if two independent studies both have P values of 0.05 and the null hypothesis is true, the probability that both studies would produce results at least that extreme is 0.05 × 0.05 = 0.0025. If you use this approach, you can’t cherry pick the best studies. You need to include all studies in a series of relevant studies whether they are significant or not.
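A caveat worth illustrating: the raw product of P values is a useful intuition, but it is not itself a P value for the combined evidence. One standard way to combine independent P values is Fisher’s method, which I am swapping in here as a comparison. The sketch below uses only the Python standard library, and the function name is my own.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent P values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom when every null is true. For even degrees
    of freedom the upper-tail probability has the closed form
    exp(-X/2) * sum over i < k of (X/2)^i / i!.
    """
    k = len(p_values)
    if k == 0 or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("need at least one P value in (0, 1]")
    half_x = -sum(math.log(p) for p in p_values)
    tail_terms = sum(half_x ** i / math.factorial(i) for i in range(k))
    return math.exp(-half_x) * tail_terms


combined = fisher_combined_p([0.05, 0.05])
print(f"simple product:  {0.05 * 0.05:.4f}")
print(f"Fisher combined: {combined:.4f}")
```

For two studies at 0.05, the combined P value comes out noticeably larger than the simple product, although it is still much stronger evidence than either study alone. The same rule applies here: every relevant study goes into the list, significant or not.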
You should consider results from a study in conjunction with other similar studies. It is extremely unlikely that a single study can prove that the alternative hypothesis is true with any confidence. So, don’t expect it to!
For the AIDS vaccine study, the Thai study is the first AIDS vaccination study to produce statistically significant results. It has not been replicated yet, so we need to be wary of misleading results. This vaccine has not built up a track record of significant results.
Related post: What is the Relationship Between Reproducibility and P-values?
Tip 3: The Effect Size is Important
The high pressure to obtain statistically significant P values draws attention away from both the effect size and the precision of the estimate. You can have statistically significant test results even when effect sizes are too small to be practically significant. Additionally, a significant P value does not indicate that the effect size has been estimated with high precision.
To place a greater emphasis on effect size and precision, use confidence intervals. A confidence interval is a range that likely contains the actual effect size for the entire population.
You should not think about statistical significance strictly from a binary perspective. Instead, consider whether the effect size is large enough to be practically significant and if the estimate is sufficiently precise.
Unfortunately, the confidence interval for the effectiveness of the AIDS vaccine extends from 1% to 52%. The vaccine might work anywhere from almost none of the time to about half the time. The confidence interval reveals that the estimated effect size is both small and imprecise. In this case, the P value provides a misleading idea about what the data show.
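To show how an interval like this arises, here is a rough sketch that computes vaccine efficacy (one minus the risk ratio) with a Wald-type confidence interval on the log risk ratio. The infection counts and group sizes are illustrative assumptions that I chose to give an interval of similar width. They are not taken from this post, and this simple method is not necessarily the one the study’s authors used.

```python
import math
from statistics import NormalDist


def vaccine_efficacy_ci(cases_vaccine, n_vaccine, cases_placebo, n_placebo,
                        confidence=0.95):
    """Return (efficacy, lower, upper), where efficacy = 1 - risk ratio and
    the interval is a Wald interval built on the log risk ratio."""
    risk_ratio = (cases_vaccine / n_vaccine) / (cases_placebo / n_placebo)
    se_log_rr = math.sqrt(1 / cases_vaccine - 1 / n_vaccine
                          + 1 / cases_placebo - 1 / n_placebo)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    rr_low = math.exp(math.log(risk_ratio) - z * se_log_rr)
    rr_high = math.exp(math.log(risk_ratio) + z * se_log_rr)
    # A high risk ratio means low efficacy, so the bounds swap places.
    return 1 - risk_ratio, 1 - rr_high, 1 - rr_low


# Illustrative counts only (assumed for this example, not from the post).
efficacy, lower, upper = vaccine_efficacy_ci(51, 8197, 74, 8198)
print(f"estimated efficacy: {efficacy:.1%}")
print(f"95% CI: {lower:.1%} to {upper:.1%}")
```

With so few infections in each group, the interval stays wide even though thousands of people are enrolled. The lower bound sitting just above zero is the confidence-interval counterpart of a P value just under 0.05.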
Related post: Practical vs. Statistical Significance
Tip 4: The Plausibility of the Alternative Hypothesis Matters
As we evaluate P values in statistical hypothesis tests, there is a tendency to think that similar P values across studies provide similar support for the alternative hypothesis. For example, a P value of 0.04 in one study seems to provide the same amount of evidence as a P value of 0.04 in another study. However, simulation studies show that the plausibility of the study’s alternative hypothesis greatly affects the false positive rate.
For instance, with a P value of 0.05, a highly plausible alternative hypothesis is associated with a false positive rate of at least 12% while an implausible alternative has a rate of at least 76%! In other words, if you are studying an implausible alternative hypothesis and you obtain a significant P value, there is a greater probability that the alternative hypothesis is not actually correct.
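One way to see why plausibility matters is a back-of-the-envelope Bayesian calculation. The sketch below uses a commonly cited lower bound on the Bayes factor in favor of the null, -e × p × ln(p), attributed to Sellke, Bayarri, and Berger. I am not claiming it is the exact calculation behind the percentages quoted above. The prior probabilities are illustrative assumptions, and the function name is my own.

```python
import math


def min_false_positive_prob(p_value, prior_null):
    """Lower bound on the probability that the null is true given the data,
    using the -e * p * ln(p) bound on the Bayes factor (valid for p < 1/e)."""
    if not 0.0 < p_value < 1.0 / math.e:
        raise ValueError("bound requires 0 < p_value < 1/e")
    if not 0.0 < prior_null < 1.0:
        raise ValueError("prior_null must be strictly between 0 and 1")
    min_bayes_factor = -math.e * p_value * math.log(p_value)
    prior_odds_null = prior_null / (1.0 - prior_null)
    posterior_odds_null = prior_odds_null * min_bayes_factor
    return posterior_odds_null / (1.0 + posterior_odds_null)


# Assumed priors: plausible, toss-up, and implausible alternative hypotheses.
for prior_null in (0.25, 0.50, 0.90):
    bound = min_false_positive_prob(0.05, prior_null)
    print(f"prior P(null) = {prior_null:.2f} -> false positive rate of at least {bound:.0%}")
```

The same P value of 0.05 gives a very different minimum false positive rate depending on how likely the null was before the study. Because this is only a lower bound, the actual rate can be higher.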
Extraordinary claims require extraordinary evidence—consider the plausibility of the alternative hypothesis in conjunction with the P value. A significant P value doesn’t absolve us of using our common sense while interpreting the results. If you hear about a startling study that produces unprecedented results, don’t let that initial significant P value sway you too much. Wait until the other studies replicate the findings before you trust them!
No studies of other AIDS vaccines have provided sufficient evidence to reject the null hypothesis. This pattern suggests that the alternative hypothesis is unlikely to be true for the Thai study. In this scenario, we can expect false positive rates of around 75%! That makes replicating the results with other studies crucial (see Tip 2).
For more detailed information about the importance of assessing the plausibility of the alternative hypothesis, read my post about P-values, Error Rates, and False Positives!
Tip 5: Use Your Expertise
You must apply your subject area expertise to all facets of statistical hypothesis testing to avoid misleading results. Investigators should use their expertise to evaluate the validity of the experimental design, proposed mechanisms behind the effect, practical significance of the effect, the plausibility of the alternative hypothesis, and so on. Expertise transforms statistical results from numbers into meaningful findings that you can trust.
Evaluating the Hypothesis Test Results for the AIDS Vaccine Study
In this post, we looked at an AIDS vaccine study that had statistically significant results. However, we saw the following:
- The P value of 0.039 is not compelling evidence by itself.
- The vaccine does not have a proven track record of significant results.
- The confidence interval indicates that the estimated effect size is both small and imprecise.
- Studies of other AIDS vaccines have not had significant results, which suggests that the alternative hypothesis in the Thai study is implausible.
Taken together, the additional considerations should make us cautious about potentially misleading results. In other words, we shouldn’t pop open a bottle of champagne and start mass producing the vaccine yet. We need to wait and see if other studies will replicate these results. We also need to keep an eye on the effect size in future studies to determine whether the vaccine’s effectiveness is practically significant.
Now that’s a much more thorough assessment than simply noting that the P value is statistically significant!
Can you help me with this question
The p.value in statistical test results indicates:
b) The probability of having committed a type 1 error.
c) The probability of having committed a type 2 error.
d) The probability of data being accurate and valid
Jim Frost says
The correct answer is None of the Above. The p-value does not indicate any of those options. To learn what the p-value does mean, read my post about interpreting p-values.
Zack Su says
This only explains the relationship between p-values and false positive rates. Could you also provide the source of the simulation studies?
Hi Jim, thank you for your helpful explanations. I want just to ask you further explanation about P and F values.
1) Considering both is a must to determine significance?
2) Sometimes the model reads significant/ lower P value/ while it is not the same in the case of Type III ANOVA/ for the treatment. How we can interpret such a case? significant/not?
Jeron R. says
This is an awesome explanation, Jim! It’s provided a lot of clarity. I look forward to diving into more of your blog posts!
Jim Frost says
You’re very welcome, Jeron! I’m glad that was helpful. Sometimes with this topic, I feel like the more I try explaining the more convoluted it becomes!
Jeron R. says
I just stumbled upon your articles regarding p-values. I have found them highly informative, and I wish they would teach these nuances in my college statistics courses. They have also taken me down a rabbit hole of learning about the differences between Bayesians and Frequentists.
Through my reading of other papers on the history of p-values and alphas, there is still one seemingly simple concept I can’t seem to wrap my mind around.
From my understanding, if we set alpha (let’s assume at 0.05), and alpha is the probability of making a type 1 error, we are saying we are okay with making a type 1 error 5% of the time. Let’s say I get a p-value of 0.04. This would fall into the critical region, and I would say the probability of getting a value this extreme is in fact the p-value. While I understand the p-value is not the probability of a type 1 error, how is it reasoned that a p-value of 0.04 can be associated with a false positive rate (type 1 error) of 25%-50% when I explicitly set alpha at 0.05 because I wanted to limit my type 1 errors to 5% of the time?
Essentially, my question boils down to this: how can setting alpha at 0.05 lead to a conclusion that has a 25%-50% chance of being a type 1 error?
I hope this question makes sense.
Jim Frost says
I always found p-values and significance to be such an interesting topic–and so counter-intuitive!
To answer your question, there’s a simple distinction you need to make. When you’re talking significance levels and the Type I error rate, you’re talking about an entire class of studies. You can’t apply that to individual studies. Let’s say we use a significance level of 0.05. Then, we know that among the studies that use that significance level and where the null is true, 5% will have statistically significant results. Imagine we could magically know that a set of 100 studies were testing a null hypothesis that is true. Normally, you wouldn’t know that for sure, but for this example we do. If we proceeded to conduct those 100 experiments and use a significance level of 0.05, then we’d expect about 5% of those studies to have p-values that are statistically significant.
So, significance levels and the type I error rate apply to a class of studies rather than individual studies.
However, once you get to p-values, you’re getting to individual studies. You want to know what’s the probability that this particular study with such and such p-value is a false positive. Frequentist methods can’t answer that. One way to understand this is to understand how probabilities are calculated as the area under the curve of the probability distribution for the test statistic. An individual p-value doesn’t produce an area. You need a range of p-values, which you get when you use significance levels. For more details, see my post on how t-tests work.
Another way to think of it is to consider that p-values that fall between 0 and 0.05 will collectively produce a false positive rate of 5%. However, if you look at smaller chunks of the range, it’s not distributed evenly. For example, p-values near 0.001 will have much, much lower false positive rates, while p-values near 0.05 will have much higher false positive rates. Altogether, they will average to 5%. So, you’d expect higher percentages near 0.05. However, there is no mathematical relationship to determine what it is for any given study. It depends on the prior probabilities, which you often don’t know.
I hope this helps!
This should be a required part of every intro to statistics class! It’s amazing how I took 2 stats classes and never learned these things. Great stuff, Jim! Thank you for making this post.
Jim Frost says
Mohamed Jama says
hello dear Jim
If you started tutorial videos for statistical analysis tools and software, that would be very, very interesting and helpful!!
# think about creating a YouTube channel
thanks a lot
Jim Frost says
Thank you so much for writing! I do plan to create video courses at some point. Unfortunately, it probably won’t be for a year or so, but it’s in my plans!
Manisha K. says
Reading your series of posts about interpreting p-values has provided me with a surprising amount of clarity about this fundamental but rather misunderstood subject. Thank you very much!
Jim Frost says
You’re very welcome! I think p-values and statistical significance are fascinating concepts. They’re so misunderstood and really get to the heart of how we learn from data!