P values are commonly misinterpreted. They’re a slippery concept that requires a lot of background knowledge to understand. Not surprisingly, I’ve received many questions about P values in statistical hypothesis testing over the years. However, one question stands out. *Why* are P value misinterpretations so prevalent? I answer that question in this blog post, and help you avoid making the same mistakes.

## The Correct Way to Interpret P Values

First, I need to be sure that we’re all on the same page when it comes to interpreting P values. If we’re not, the rest of this blog post won’t make sense!

A P value is the probability of observing a sample statistic that is at least as different from the null hypothesis value as your sample statistic, assuming that the null hypothesis is true. That’s a pretty convoluted but technically correct definition, and I’ll come back to it later on!

In a nutshell, P value calculations assume that the null hypothesis is true and use that assumption to determine the likelihood of obtaining your observed sample data. P values answer the question, “Are your sample data unusual if the null hypothesis is true?”
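To make the definition concrete, here is a minimal sketch in Python. The scenario is hypothetical and not from any real study: the null hypothesis says a coin is fair, and the sample shows 60 heads in 100 flips. The function name and the choice to measure "at least as different" as distance from the expected count are my own simplifications, which work for a symmetric null like this one.

```python
from math import comb

def two_sided_binomial_p_value(heads, flips, null_prob=0.5):
    """Probability, assuming the null is true, of a head count at least
    as far from the null expectation as the observed count."""
    expected = flips * null_prob
    observed_distance = abs(heads - expected)
    p_value = 0.0
    for k in range(flips + 1):
        # Add up every outcome that is at least as extreme as the sample.
        if abs(k - expected) >= observed_distance:
            p_value += comb(flips, k) * null_prob**k * (1 - null_prob)**(flips - k)
    return p_value

# Hypothetical sample: 60 heads in 100 flips, null hypothesis = fair coin.
p = two_sided_binomial_p_value(heads=60, flips=100)
print(f"P value: {p:.4f}")
```

Notice that the null hypothesis is baked into every step of the calculation. The result describes how unusual the sample is under that assumption. It says nothing about how likely the assumption itself is.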

Here’s a quick way to tell if you are misinterpreting P values in hypothesis tests. If you interpret P values as the probability that the null hypothesis is true or the probability that rejecting the null hypothesis is a mistake, those are *incorrect* interpretations. In fact, these are the most common misinterpretations of P values that I’m addressing specifically in this post. Why are they so common?

## Historical Events Made P Values Confusing

The problem of misinterpreting P values has existed for nearly a century. The origins go back to two rival camps in the early days of hypothesis testing. On one side, we have Ronald Fisher with his measures of evidence approach (P values). And, on the other side, we have Jerzy Neyman and Egon Pearson with their error rate method (alpha). Fisher believed that you could use sample data to learn about a population. However, Neyman and Pearson thought that you couldn’t learn from individual studies but only a long series of hypothesis tests.

Textbook publishers and statistics courses have squished together these two incompatible approaches. Today, the familiar hypothesis testing procedure of comparing P values to alphas seems to fit together perfectly. However, they’re based on irreconcilable methods.

Much can be said about this forced merger. For the topic of this blog post, an important outcome is that P values became associated with the Type I error rate, which is incorrect. A P value is NOT an error rate, but alpha IS an error rate. By directly comparing the two values in a hypothesis test, it’s easy to think they’re both error rates. This misconception leads to the most common misinterpretations of P values.
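A small simulation can illustrate the difference. The sketch below assumes a simple setup of my own choosing: thousands of hypothetical studies in which the null hypothesis is true by construction, each analyzed with a two-sided z-test with a known standard deviation. The function names and parameter values are illustrative only.

```python
import random
from statistics import NormalDist, mean

def z_test_p_value(sample, null_mean=0.0, sigma=1.0):
    """Two-sided z-test P value for a sample mean, with sigma assumed known."""
    z = (mean(sample) - null_mean) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_null_studies(n_studies=5000, n=20, alpha=0.05, seed=1):
    """Simulate studies where the null is true and record each P value."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_studies):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # null is true here
        p_values.append(z_test_p_value(sample))
    # Long-run share of true nulls that get rejected: the Type I error rate.
    type_i_rate = sum(p < alpha for p in p_values) / n_studies
    return p_values, type_i_rate

p_values, type_i_rate = simulate_null_studies()
print(f"Share of true nulls rejected at alpha = 0.05: {type_i_rate:.3f}")
print(f"Smallest and largest P value: {min(p_values):.4f}, {max(p_values):.4f}")
```

Alpha is a property of the procedure: across many studies with a true null, the rejection rate settles near the alpha you chose. Each P value, on the other hand, belongs to one sample and bounces around from study to study. It describes that sample, not a long-run error rate.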

Fisher spent decades of his life trying to clarify the misunderstanding but to no effect. If you want to read more about the unification of the two schools of thought, I highly recommend this article.

## P Values Don’t Provide the Answers that We Really Want

Let’s be honest. The common misinterpretations are what we really want to learn from hypothesis testing. We’d love to learn the probability that a hypothesis is correct. That *would* be nice. Unfortunately, hypothesis testing doesn’t provide that type of information. Instead, we obtain the likelihood of our *observation*. How likely is our sample if the null hypothesis is true? That’s just not as helpful.

Think about this logically. Hypothesis tests use data from one sample exclusively. There is no outside reference to anything else in the world. You can’t use a single sample to determine whether it represents the population. There’s no basis to draw conclusions like that. Consequently, it’s not possible to evaluate the probabilities associated with any hypotheses. To do that, we’d need a larger perspective than a single sample can provide. As an aside, Bayesian statistics attempt to construct this broader framework of probabilities.

P values can’t provide answers to what we really want to know. However, there is a persistent temptation to interpret them in this manner anyway. Always remember, if you start to think of P values as the probability of a hypothesis, you’re barking up the wrong tree!

## P Values Have a Torturous Definition

As I showed earlier, the correct definition of P values is pretty convoluted. It is the probability of observing the data that you actually did observe in the hypothetical context that the null hypothesis is true. Huh? And, there is weird wording about being at least as extreme as your observation. It’s just not intuitive. In fact, it takes a lot of studying to truly understand it all.

Unfortunately, the incorrect interpretations sound so much simpler than the correct interpretation. There is no straightforward *and* accurate definition of P values that can compete against the simpler sounding misinterpretations. In fact, not even scientists can explain P values! And, so the errors continue.

## Is Misinterpreting P Values Really a Problem?

Let’s recap. Historical circumstances have linked P values and the Type I error rate incorrectly. We have a natural inclination to want P values to tell us more than they are able. There is no technically correct definition of P value that rolls off the tongue. There’s nothing available to thwart the simpler and more tempting misinterpretation. It’s no surprise that this problem persists! Even Fisher couldn’t fix it!

You might be asking, “Is this really a problem, or is it just semantics?” Make no mistake; the correct and incorrect interpretations are very different. If you believe that a P value of 0.04 indicates that there is only a 4% chance that the null hypothesis is correct, you’re in for a big surprise! It’s often around 26%!
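To see how a figure in that range can arise, here is a hedged sketch based on one well-known calibration, the Sellke, Bayarri, and Berger lower bound on the Bayes factor, −e·p·ln(p). I am assuming 50:50 prior odds between the null and alternative hypotheses, and I am not claiming this is the exact calculation behind the number above. The function name is hypothetical.

```python
import math

def min_posterior_null_probability(p_value, prior_null=0.5):
    """Lower bound on P(null is true | data) using the -e * p * ln(p)
    bound on the Bayes factor. The bound applies only for p < 1/e."""
    if not 0 < p_value < 1 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    # Smallest Bayes factor (null vs. alternative) compatible with this P value.
    min_bayes_factor = -math.e * p_value * math.log(p_value)
    prior_odds = prior_null / (1 - prior_null)  # assumed prior, not from data
    posterior_odds = prior_odds * min_bayes_factor
    return posterior_odds / (1 + posterior_odds)

for p in (0.05, 0.04, 0.01):
    bound = min_posterior_null_probability(p)
    print(f"P value {p}: null is true with probability of at least {bound:.3f}")
```

Under these assumptions, a P value of 0.04 corresponds to a probability of at least roughly 0.26 that the null hypothesis is true. Note that the answer requires a prior probability for the null, which is exactly the outside information that a P value alone does not contain.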

To better understand the correct interpretation of P values, I’ve written three blog posts that focus on interpreting and using P values. The first post describes the correct and incorrect ways to interpret P values. It goes into detail about the substantial problems associated with the incorrect interpretations. The second post uses concepts and graphs to explain how P values and significance levels work. I find that these charts are a lot easier to understand than convoluted definitions! Finally, the third post provides tips to avoid being fooled by false positives and other misleading results.

James says

April 2, 2019 at 4:52 am

Hi Jim,

Thank you for the informative article! Would you say that a p-value of 0.05 gives you 95% confidence in your results, say a mean difference between groups? For instance, let’s say I find a raw mean difference of 12 points on a test between group A and group B with a p-value of 0.1, can I then report this mean difference with 80% confidence? I don’t think this is the case as this is what CIs are for (although please correct me), but I am struggling to verbalize why a p-value does not translate to a % confidence in a particular finding. It would be great if you could explain this.

Also, I have found your blog extremely helpful in increasing my understanding of statistics. Thank you for all your work!

James

Jim Frost says

April 2, 2019 at 10:18 am

Hi James,

I’m happy to hear that my blog has helped you with statistics!

Yes, there is a very strong connection between hypothesis tests and confidence intervals. However, in addition to p-values, you also need to incorporate the significance level and the confidence level. What I write below, I describe in much more detail in my post about How Confidence Intervals Work. Read that to get even more details.

When you compare the correct pair of p-values and confidence intervals while using corresponding significance levels and confidence levels, they’ll always agree. Occasionally, there are apparent contradictions between the two, but that usually occurs because an analyst compares the wrong CIs to a p-value, which I discuss in my post about using confidence intervals to compare means.

All p-values are based on the value declared in the null hypothesis that corresponds to no effect or relationship. That value is often zero, which I’ll use for the following explanation.

If you use a significance level of X% and a confidence level of (100-X)% (e.g., a significance level of 0.05 corresponds to a 95% confidence level), the hypothesis test and CI results will always agree: the test is statistically significant exactly when the confidence interval excludes the null hypothesis value.

Let’s go to your example of a p-value of 0.1 for a mean difference between two groups. Suppose you use a significance level of 0.1 (higher than typical). If you then construct a 90% confidence interval, a CI endpoint will fall exactly on zero. Both results provide a consistent conclusion. However, if you use the more typical 5% significance level and 95% confidence level, you’d fail to reject the null hypothesis in the hypothesis test and the CI would include zero. Again, a consistent conclusion.
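Here is a minimal sketch of that example in Python. It assumes a normal (z) approximation, and the standard error is a hypothetical value that I back-solved so the 12-point difference has a p-value of exactly 0.1. A real analysis would typically use a t-test and the standard error from your own data.

```python
from statistics import NormalDist

def two_sided_p(estimate, se, null_value=0.0):
    """Two-sided p-value for an estimate under a normal approximation."""
    z = abs(estimate - null_value) / se
    return 2 * (1 - NormalDist().cdf(z))

def confidence_interval(estimate, se, confidence):
    """Normal-approximation confidence interval for the estimate."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return estimate - z_crit * se, estimate + z_crit * se

diff = 12.0
# Hypothetical standard error, chosen so that the p-value is 0.10.
se = diff / NormalDist().inv_cdf(0.95)

p = two_sided_p(diff, se)
ci90 = confidence_interval(diff, se, 0.90)
ci95 = confidence_interval(diff, se, 0.95)

print(f"p-value: {p:.3f}")
print(f"90% CI: ({ci90[0]:.2f}, {ci90[1]:.2f})")
print(f"95% CI: ({ci95[0]:.2f}, {ci95[1]:.2f})")
```

The lower end of the 90% CI lands on zero, matching a p-value that equals the 0.1 significance level. The 95% CI extends below zero, matching a non-significant result at the 0.05 level.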

I hope this helps!