P values are commonly misinterpreted. It’s a very slippery concept that requires a lot of background knowledge to understand. Not surprisingly, I’ve received many questions about P values in statistical hypothesis testing over the years. However, one question stands out. Why are P value misinterpretations so prevalent? I answer that question in this blog post, and help you avoid making the same mistakes.
The Correct Way to Interpret P Values
First, I need to be sure that we’re all on the right page when it comes to interpreting P values. If we’re not, the rest of this blog post won’t make sense!
P values are the probability of observing a sample statistic at least as extreme as the one you actually observed, assuming that the null hypothesis is true. That’s a pretty convoluted but technically correct definition, and I’ll come back to it later on!
In a nutshell, P value calculations assume that the null hypothesis is true and use that assumption to determine the likelihood of obtaining your observed sample data. P values answer the question, “Are your sample data unusual if the null hypothesis is true?”
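To make that idea concrete, here’s a minimal simulation sketch (the numbers are hypothetical, not from the post): assume the null hypothesis is true, generate many samples under that assumption, and count how often a result at least as extreme as the observed one appears.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: H0 says the population mean is 0 (known sd = 1).
# Suppose we observed a sample of n = 25 with a mean of 0.5.
n, observed_mean = 25, 0.5

# Assume H0 is true and simulate the sampling distribution of the mean.
null_means = rng.normal(loc=0.0, scale=1.0, size=(100_000, n)).mean(axis=1)

# Two-sided p-value: the fraction of null-generated sample means at least
# as far from 0 as the mean we actually observed.
p_value = np.mean(np.abs(null_means) >= observed_mean)
print(f"simulated p-value ≈ {p_value:.4f}")  # analytically, 2 * P(Z > 2.5) ≈ 0.0124
```

Notice that nothing in this calculation tells you the probability that the null hypothesis itself is true; the null’s truth is assumed from the start.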
Here’s a quick way to tell if you are misinterpreting P values in hypothesis tests. If you interpret P values as the probability that the null hypothesis is true or the probability that rejecting the null hypothesis is a mistake, those are incorrect interpretations. In fact, these are the most common misinterpretations of P values that I’m addressing specifically in this post. Why are they so common?
Historical Events Made P Values Confusing
The problem of misinterpreting P values has existed for nearly a century. The origins go back to two rival camps in the early days of hypothesis testing. On one side, we have Ronald Fisher with his measures of evidence approach (P values). And, on the other side, we have Jerzy Neyman and Egon Pearson with their error rate method (alpha). Fisher believed that you could use sample data to learn about a population. However, Neyman and Pearson thought that you couldn’t learn from individual studies but only a long series of hypothesis tests.
Textbook publishers and statistics courses have squished together these two incompatible approaches. Today, the familiar hypothesis testing procedure of comparing P values to alphas seems to fit together perfectly. However, they’re based on irreconcilable methods.
Much can be said about this forced merger. For the topic of this blog post, an important outcome is that P values became associated with the Type I error rate, which is incorrect. A P value is NOT an error rate, but alpha IS an error rate. By directly comparing the two values in a hypothesis test, it’s easy to think they’re both error rates. This misconception leads to the most common misinterpretations of P values.
Fisher spent decades of his life trying to clarify the misunderstanding, but to no avail. If you want to read more about the unification of the two schools of thought, I highly recommend this article.
P Values Don’t Provide the Answers that We Really Want
Let’s be honest. The common misinterpretations are what we really want to learn from hypothesis testing. We’d love to learn the probability that a hypothesis is correct. That would be nice. Unfortunately, hypothesis testing doesn’t provide that type of information. Instead, we obtain the likelihood of our observation. How likely is our sample if the null hypothesis is true? That’s just not as helpful.
Think about this logically. Hypothesis tests use data from one sample exclusively. There is no outside reference to anything else in the world. You can’t use a single sample to determine whether it represents the population. There’s no basis to draw conclusions like that. Consequently, it’s not possible to evaluate the probabilities associated with any hypotheses. To do that, we’d need a larger perspective than a single sample can provide. As an aside, Bayesian statistics attempt to construct this broader framework of probabilities.
P values can’t provide answers to what we really want to know. However, there is a persistent temptation to interpret them in this manner anyway. Always remember, if you start to think of P values as the probability of a hypothesis, you’re barking up the wrong tree!
P Values Have a Tortuous Definition
As I showed earlier, the correct definition of P values is pretty convoluted. It is the probability of observing the data that you actually did observe in the hypothetical context that the null hypothesis is true. Huh? And, there is weird wording about being at least as extreme as your observation. It’s just not intuitive. In fact, it takes a lot of studying to truly understand it all.
Unfortunately, the incorrect interpretations sound so much simpler than the correct interpretation. There is no straightforward and accurate definition of P values that can compete against the simpler sounding misinterpretations. In fact, not even scientists can explain P values! And, so the errors continue.
Is Misinterpreting P values Really a Problem?
Let’s recap. Historical circumstances have linked P values and the Type I error rate incorrectly. We have a natural inclination to want P values to tell us more than they are able. There is no technically correct definition of P value that rolls off the tongue. There’s nothing available to thwart the simpler and more tempting misinterpretation. It’s no surprise that this problem persists! Even Fisher couldn’t fix it!
You might be asking, “Is this really a problem, or is it just semantics?” Make no mistake; the correct and incorrect interpretations are very different. If you believe that a P value of 0.04 indicates that there is only a 4% chance that the null hypothesis is correct, you’re in for a big surprise! It’s often around 26%!
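The post doesn’t show the calculation behind a figure like 26%, but one well-known result that lands near it is the Sellke–Bayarri–Berger lower bound on the false positive risk. A minimal sketch, assuming 1:1 prior odds between the null and the alternative (an assumption not stated in the post):

```python
import math

def min_false_positive_risk(p):
    """Sellke-Bayarri-Berger lower bound on P(H0 true | observed p-value),
    assuming 1:1 prior odds between the hypotheses. Valid for p < 1/e."""
    bf_bound = -math.e * p * math.log(p)  # lower bound on the Bayes factor for H0
    return bf_bound / (1 + bf_bound)

print(f"{min_false_positive_risk(0.04):.3f}")  # ≈ 0.259, i.e. roughly 26%
print(f"{min_false_positive_risk(0.05):.3f}")  # ≈ 0.289
```

Under these assumptions, a p-value of 0.04 means the probability that the null hypothesis is true is at least about 26%, far higher than the 4% many people assume.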
To better understand the correct interpretation of P values, I’ve written three blog posts that focus on interpreting and using P values. The first post describes the correct and incorrect ways to interpret P values. It goes into detail about the substantial problems associated with the incorrect interpretations. The second post uses concepts and graphs to explain how P values and significance levels work. I find that these charts are a lot easier to understand than convoluted definitions! Finally, the third post provides tips to avoid being fooled by false positives and other misleading results.
- How to Correctly Interpret P Values
- How Hypothesis Tests Work: Significance Levels and P Values
- Five P value Tips to Avoid Being Fooled
Thanks for your series of posts about statistics and p-values; they have helped me a lot.
After reading your posts, this is my interpretation of the p-value, and I am not sure whether it is right.
Can you help me figure it out?
Regarding your statement, “If you interpret P values as the probability that the null hypothesis is true…”, I think about this mistake using Bayes’ theorem:
(Assume the test statistic is denoted T and its realization is denoted t.)
(1) p-value : P(T > t | H0 is true)
(2) null hypothesis is true : P(H0 is true)
And their relation is :
(3) p-value = P(T > t | H0 is true) = P(H0 is true | T > t) * P(T > t) / P(H0 is true)
Obviously, “p-value” and “the null hypothesis is true” are different concepts, even though they seem intuitively similar.
And the reason we must mention the test statistic when we interpret a p-value is that the p-value is affected by the choice of test statistic.
For example, consider a hypothesis test of the population mean, where the population is normally distributed with unknown mean and variance equal to 1.
The null hypothesis is: population mean = 0.
If we use the sample mean (sample size = 25) as the test statistic and its realization is 0.5, then the p-value is P(sample mean > 0.5 | population mean = 0) = 0.0062.
But if we choose another test statistic, for example, the mean of the first two samples ((X1 + X2) / 2), and its realization is also 0.5,
then the p-value is P((X1 + X2) / 2 > 0.5 | population mean = 0) = 0.2398.
So we must consider the test statistic when we interpret the p-value.
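Those two p-values can be double-checked numerically. Here is a quick sketch using only the Python standard library (note the second one works out to about 0.2398):

```python
from math import erfc, sqrt

def normal_sf(x, sd):
    """P(N(0, sd^2) > x), computed via the complementary error function."""
    return 0.5 * erfc(x / (sd * sqrt(2)))

# Test statistic 1: mean of all 25 observations (SE = 1/sqrt(25) = 0.2).
p1 = normal_sf(0.5, 1 / sqrt(25))
print(f"P(sample mean > 0.5 | mu = 0)  = {p1:.4f}")  # 0.0062

# Test statistic 2: mean of the first two observations only (SE = 1/sqrt(2)).
p2 = normal_sf(0.5, 1 / sqrt(2))
print(f"P((X1 + X2)/2 > 0.5 | mu = 0)  = {p2:.4f}")  # 0.2398
```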
Am I making any mistakes here, or is there anything I haven’t considered?
p.s. English is not my native language. I’ve tried my best to describe my idea, and I hope I’ve described it clearly. If anything is still unclear, I am very sorry about it.
Jim Frost says
There are several things to keep in mind about Frequentist methodology. First, there is no relationship between p-values and the probability the null is true. It simply doesn’t exist. Second, and relatedly, the probability that the null is true is either 0 or 1, but you don’t know which value it is. Note that this is different from how Bayesian methodologies handle probabilities.
Regarding reporting both the test statistic and the p-value, that’s definitely OK to do, but it’s not required. Typically, you can’t interpret the test statistic unless it has a particularly high or low value. The p-value has a direct relationship to the test statistic, and it is more interpretable. In a sense, you can think of the p-value as a probability that fully takes into account the test statistic value you obtained and the design of your experiment (which determines the probability distribution of your test statistic). Consequently, it’s OK to report the p-value by itself because it incorporates all those factors. The value of the test statistic actually provides less information than the p-value because it does not factor in those design considerations.
For transparency reasons, you should report the methodology you’re using, which could include the test statistics. However, for interpreting the p-value you don’t need to know the value of the test statistic. For more information about that process of test statistics –> p-values, I highly recommend getting my book about Hypothesis Testing where I describe that in detail for various tests.
Thanks for writing! Your English is good! I hope my explanation helps clarify things!
Thank you for the informative article! Would you say that a p-value of 0.05 gives you 95% confidence in your results, say a mean difference between groups? For instance, let’s say I find a raw mean difference of 12 points on a test between group A and group B with a p-value of 0.1; can I then report this mean difference with 80% confidence? I don’t think this is the case, as this is what CIs are for (although please correct me), but I am struggling to verbalize why a p-value does not translate to a percentage of confidence in a particular finding. It would be great if you could explain this.
Also, I have found your blog extremely helpful in increasing my understanding of statistics. Thank you for all your work!
Jim Frost says
I’m happy to hear that my blog has helped you with statistics!
Yes, there is a very strong connection between hypothesis tests and confidence intervals. However, in addition to p-values, you also need to incorporate the significance level and the confidence level. What I write below, I describe in much more detail in my post about How Confidence Intervals Work. Read that to get even more details.
When you compare the correct pair of p-values and confidence intervals while using corresponding significance levels and confidence levels, they’ll always agree. Occasionally, there are apparent contradictions between the two, but that usually occurs because an analyst compares the wrong CIs to a p-value, which I discuss in my post about using confidence intervals to compare means.
All p-values are based on the value declared in the null hypothesis that corresponds to no effect or relationship. That value is often zero, which I’ll use for the following explanation.
If you use a significance level of X% and a confidence level of (100-X)% (e.g., a significance level of 5% corresponds to a 95% confidence level), the hypothesis test and CI results will always agree in the following manner:
Let’s go to your example of a p-value of 0.1 for a mean difference between two groups. Suppose you use a significance level of 0.1 (higher than typical). If you then construct a 90% confidence interval, a CI endpoint will fall exactly on zero. Both results provide a consistent conclusion. However, if you use the more typical 5% significance level and 95% confidence level, you’d fail to reject the null hypothesis in the hypothesis test and the CI would include zero. Again, a consistent conclusion.
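Here’s a small sketch of that correspondence using the example numbers (a z-based construction with an illustrative standard error chosen purely to show the algebra): when the two-sided p-value is exactly 0.10, the endpoint of the 90% confidence interval lands exactly on zero.

```python
from math import erfc, sqrt

# Illustrative numbers (the SE is invented): a mean difference of 12 with a
# standard error chosen so the two-sided z-test gives exactly p = 0.10.
diff = 12.0
z_crit_90 = 1.6448536269514722  # two-sided 10% critical value of N(0, 1)
se = diff / z_crit_90           # SE that puts z exactly at the critical value

# Two-sided p-value for the observed difference.
z = diff / se
p = erfc(z / sqrt(2))           # equals 2 * P(Z > z)
print(f"p = {p:.3f}")           # 0.100

# 90% CI: diff +/- z_crit * SE -> the lower endpoint lands exactly on zero.
lower = diff - z_crit_90 * se
upper = diff + z_crit_90 * se
print(f"90% CI: ({lower:.3f}, {upper:.3f})")  # (0.000, 24.000)
```

Shrink the confidence level below 90% (i.e., raise the significance level above 0.1) and the interval excludes zero; widen it to 95% and the interval includes zero, matching the failure to reject at the 5% level.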
I hope this helps!