What do P values mean? P values tell you whether your hypothesis test results are statistically significant. Statisticians use them all over the place. You’ll find P values in t-tests, distribution tests, ANOVA, and regression analysis. P values have become so important that they’ve taken on a life of their own. They can determine which studies are published, which projects receive funding, and which university faculty members become tenured!

Ironically, despite being so influential, P values are misinterpreted very frequently. What *is* the correct interpretation of P values? What do P values *really* mean? That’s the topic of this post!

P values are a slippery concept. Don’t worry. I’ll explain P values using an intuitive, concept-based approach so you can avoid a very common misinterpretation that can cause you serious problems.

## What Is the Null Hypothesis?

You need to understand the null hypothesis before you can understand P values.

In all hypothesis tests, the researchers are testing an effect of some sort. The effect can be the effectiveness of a new vaccination, the durability of a new product, and so on. There is some benefit or difference that the researchers hope to identify.

However, it’s possible that there really is no effect or no difference between the experimental groups. In statistics, we call this lack of an effect the null hypothesis. When you assess the results of a hypothesis test, you can think of the null hypothesis as the devil’s advocate position, or the position you take for the sake of argument.

To understand this idea, imagine a hypothetical study for medication that we know is completely useless. In other words, the null hypothesis is true. There is no difference at the population level between subjects who take the medication and subjects who don’t.

Even though the null hypothesis is correct, we’ll most likely see an effect in the sample because of random sample error. It’s an incredibly rare occurrence for the sample data to exactly match the actual population value. Therefore, the position you take for the sake of argument (devil’s advocate) is that random sample error produces the observed sample effect rather than it being a true effect.
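To make this concrete, here is a minimal Python sketch of the useless-medication study. The outcome scale (mean 100, standard deviation 15), the group size of 50, and the function name `simulate_null_study` are illustrative assumptions, not values from a real study. Both groups are drawn from the same population, so any difference between them is purely random sample error.

```python
import random
from statistics import fmean

def simulate_null_study(rng, n_per_group=50):
    """Run one study in which the medication truly has no effect."""
    # Both groups come from the SAME population (assumed mean 100, SD 15).
    treated = [rng.gauss(100, 15) for _ in range(n_per_group)]
    control = [rng.gauss(100, 15) for _ in range(n_per_group)]
    # Observed sample effect: difference between the group means.
    return fmean(treated) - fmean(control)

rng = random.Random(1)  # fixed seed so the sketch is repeatable
effects = [simulate_null_study(rng) for _ in range(5)]
for i, effect in enumerate(effects, start=1):
    print(f"study {i}: observed difference = {effect:+.2f}")
```

Each simulated study reports a nonzero difference even though the true population difference is exactly zero.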

## What Are P values?

P-values indicate the believability of the devil’s advocate case that the null hypothesis is true given the sample data. They gauge how consistent your sample statistics are with the null hypothesis. Specifically, if the null hypothesis is correct, what is the probability of obtaining an effect at least as large as the one in your sample?

- High P-values: Your sample results are consistent with a null hypothesis that is true.
- Low P-values: Your sample results are not consistent with a null hypothesis that is true.

If your P value is small enough, you can conclude that your sample is so incompatible with the null hypothesis that you can reject the null for the entire population.

## How Do You Interpret P values?

Here is the technical definition of the P value:

P values are the probability of observing a sample statistic that is at least as extreme as your sample statistic when you assume that the null hypothesis is true.

Let’s go back to our hypothetical medication study. Suppose the hypothesis test generates a P value of 0.03. You’d interpret this P-value as follows:

If the medicine has no effect in the population as a whole, 3% of studies will obtain the effect observed in your sample, or larger, because of random sample error.
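You can check this interpretation with a small simulation. The sketch below assumes a two-sided, two-sample z-test with a known standard deviation of 15 and 50 subjects per group. Those numbers and the helper names are hypothetical choices for illustration. It picks the observed effect that corresponds to a P value of 0.03, then counts how often studies of a truly useless medication produce an effect at least that large.

```python
import random
from statistics import NormalDist, fmean

SIGMA = 15.0       # assumed known population standard deviation
N_PER_GROUP = 50   # assumed number of subjects per group
SE = SIGMA * (2 / N_PER_GROUP) ** 0.5  # standard error of the mean difference

def two_sided_p(observed_diff):
    """P value for a two-sample z-test with known sigma (assumed test)."""
    z = abs(observed_diff) / SE
    return 2 * (1 - NormalDist().cdf(z))

def null_study_diff(rng):
    """Difference in means from one study where the null is true."""
    treated = [rng.gauss(0, SIGMA) for _ in range(N_PER_GROUP)]
    control = [rng.gauss(0, SIGMA) for _ in range(N_PER_GROUP)]
    return fmean(treated) - fmean(control)

# Choose the observed effect whose two-sided P value is 0.03.
observed_diff = NormalDist().inv_cdf(1 - 0.03 / 2) * SE
p_value = two_sided_p(observed_diff)

# Repeat the study many times with the null true and count the
# studies whose effect is at least as extreme as the observed one.
rng = random.Random(42)
n_studies = 10_000
as_extreme = sum(
    abs(null_study_diff(rng)) >= observed_diff for _ in range(n_studies)
)
fraction = as_extreme / n_studies

print(f"analytic P value:                 {p_value:.3f}")
print(f"share of null studies as extreme: {fraction:.3f}")
```

The two numbers should roughly agree. The P value describes how often random sample error alone produces an effect this large when the null is true. It says nothing about how often the null is true.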

How probable are your sample data if the null hypothesis is correct? That’s the only question that P values answer. This restriction segues to a very frequent and problematic misinterpretation.

**Related post**: Understanding P values can be easier using a graphical approach: How Hypothesis Tests Work: Significance Levels and P-values.

## P values Are *NOT* an Error Rate

Unfortunately, P values are frequently misinterpreted. A common mistake is the notion that they represent the likelihood of rejecting a null hypothesis that is actually true (Type I error). The idea that P values are the probability of making a mistake is WRONG! You can read a blog post I wrote to learn *why* P values are misinterpreted so frequently.

You can’t use P values to directly calculate the error rate for several reasons.

First, P value calculations assume that the null hypothesis is correct. Thus, from the P value’s point of view, the null hypothesis is 100% true. Remember, P values assume that the null is true and the observed sample effect is caused by random sample error.

Second, P values tell you how consistent your sample data are with a null hypothesis that is true. However, when your data are very inconsistent with the null hypothesis, P values can’t determine which of the following two possibilities is more probable:

- The null hypothesis is true, but your sample is unusual due to random sample error.
- The null hypothesis is false.

To figure out which option is true, you must apply expert knowledge of the study area and, very importantly, assess the results of similar studies.

Going back to our medication study, let’s highlight the correct and incorrect way to interpret the P value of 0.03:

**Correct:** Assuming the medication has zero effect in the population, you’d obtain the sample effect, or larger, in 3% of studies because of random sample error.

**Incorrect:** There’s a 3% chance of making a mistake by rejecting the null hypothesis.

Yes, I realize that the incorrect definition seems more straightforward, and that’s probably why it is so common. But, using this definition gives you a false sense of security, as I’ll show you next.

**Related posts**: See a graphical illustration of how t-tests and the F-test in ANOVA produce P values.

## What Is the True Error Rate?

The difference between the correct and incorrect interpretation is not just a matter of wording. There is a fundamental difference in the amount of evidence against the null hypothesis that each definition implies.

The P value for our medication study is 0.03. If you interpret that P value as a 3% chance of making a mistake by rejecting the null hypothesis, you’d feel like you’re on pretty safe ground. However, after reading this post, you should realize that P values are not an error rate and you can’t interpret them this way.

So, if the P value is not the error rate for our study, what is the error rate? Hint: It’s higher!

As I explained earlier, you can’t directly calculate an error rate based on a P value, at least not using the frequentist approach that produces P values. However, you can estimate error rates associated with P values by using the Bayesian approach and simulation studies.

Sellke et al.* have done this. While the exact error rate varies based on different assumptions, the values below use run-of-the-mill assumptions.

| P value | Probability of rejecting a null hypothesis that is true |
|---|---|
| 0.05 | At least 23% (and typically close to 50%) |
| 0.01 | At least 7% (and typically close to 15%) |

These higher error rates probably surprise you! Regrettably, the common misconception that P values are the error rate produces the false impression of considerably more evidence against the null hypothesis than is warranted. A single study with a P value around 0.05 does not provide substantial evidence that the sample effect actually exists in the population.
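A toy simulation shows why the error rate can be so much higher than the P value. This is not Sellke et al.’s calculation, and it will not reproduce their exact figures. It is a rough sketch under assumptions I’ve made up for illustration: half of all studies test a true null, real effects give about 80% power (an expected z-statistic of 2.8), and “a P value around 0.05” means a P value between 0.04 and 0.05. The function name and all parameters are hypothetical.

```python
import random
from statistics import NormalDist

def simulate_true_null_share(n_studies=200_000, prior_null=0.5,
                             effect_z=2.8, p_window=(0.04, 0.05), seed=0):
    """Among studies whose P value falls in p_window, return the
    share that actually came from a true null hypothesis."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    from_null = 0
    from_real = 0
    for _ in range(n_studies):
        null_is_true = rng.random() < prior_null
        # z-statistic: centered on 0 if the null is true,
        # centered on effect_z if the effect is real.
        z = rng.gauss(0.0 if null_is_true else effect_z, 1.0)
        p = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided P value
        if p_window[0] <= p < p_window[1]:
            if null_is_true:
                from_null += 1
            else:
                from_real += 1
    return from_null / (from_null + from_real)

share = simulate_true_null_share()
print(f"Share of P values near 0.05 that are true nulls: {share:.0%}")
```

Under these particular assumptions, the share of true nulls among results with P values near 0.05 comes out well above 5%. Lowering the assumed power or raising the assumed proportion of true nulls pushes it higher still.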

These estimated error rates emphasize the need to have lower P values and replicate studies that confirm the initial results before you can safely conclude that an effect exists at the population level. In fact, studies with lower P values have higher reproducibility rates in follow-up studies.

Now that you know how to interpret P values correctly, check out my Five P Value Tips to Avoid Being Fooled by False Positives and Other Misleading Results!

### Reference

*Thomas Sellke, M. J. Bayarri, and James O. Berger, Calibration of p-values for Testing Precise Null Hypotheses, The American Statistician, February 2001, Vol. 55, No. 1.

Sean Saunders says

Jim, thanks for this post, but perhaps you could clarify something for me: assuming that H0 is true, if we set an alpha=0.05 level of significance and get a p-value less than that as the result of our sample data, wouldn’t that indicate, since less than 5% of samples would have such an effect due to random sample error, that there is only a 5% chance of getting such a sample, and thus, a 5% chance of rejecting the null hypothesis incorrectly? What am I missing here? Almost every stats book I’ve ever read has presented the concept this way (a type 1 error is even called an alpha-error!) Thanks for your feedback!

Jim Frost says

Hi Sean, thanks for your comment. Yes, you’re absolutely correct. The significance level (alpha) is the type I error rate. It’s the probability that you will reject the null hypothesis when it is true. However, the p-value is not an error rate. It’s a bit confusing because you compare one to the other.

In the post above, I provide a link to a post where I explain significance levels and p-values using graphs. I think it’s much easier to understand that way. I’ll explain below, but check that post out too.

Both alpha and p-values refer to regions on a probability distribution plot. You need an area under the curve to calculate probabilities. You can calculate probabilities for regions, but not a specific value.

That works fine for alpha. If the null is true, you expect sample values to fall in the critical regions X% of the time, where X is the significance level that you specify. For p-values, the problems occur when you want to know the error rate for your specific study. You can’t do that for a single value from an individual study because you need an area under the curve.

The best you can say for p-values is: if the null is true, then you’d expect X% of studies to have an effect at least as large as the one in your study. X = your P-value. Notice the “at least as large.” That’s needed to produce the range of values for an area under the curve. It also means you can’t apply the percentage to your specific study. You can apply it only to the entire range of theoretical studies that have an effect at least as large as yours. That range collectively has an error rate that equals the p-value, but not your study alone.

Another thing to consider is that, within the range defined by the p-value, your study provides the weakest results because it defines the point closest to the null. So, the overall error rate for the range is largely based on theoretical studies that provide stronger evidence than your actual study!

In a similar fashion, if you reject the null for your study using an alpha = 0.05, you know that all studies in the critical region have a Type I error rate = 0.05. Again, this applies to the entire range of studies and not yours alone.
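Here is a rough sketch of the alpha side of this in Python. It assumes a two-sided z-test and simulates only studies where the null is true. The function name and settings are hypothetical. Across the whole set of such studies, the share that lands in the critical region should be close to alpha.

```python
import random
from statistics import NormalDist

def type_i_error_rate(alpha=0.05, n_studies=100_000, seed=7):
    """Share of true-null studies that reject the null at the given alpha."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    rejections = 0
    for _ in range(n_studies):
        z = rng.gauss(0.0, 1.0)               # null is true: z ~ N(0, 1)
        p = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided P value
        if p < alpha:
            rejections += 1
    return rejections / n_studies

rate = type_i_error_rate()
print(f"Rejection rate when the null is true: {rate:.3f}")
```

The rate describes the entire collection of studies, which is why alpha works as an error rate while the p-value from one study does not.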

I hope this all makes sense. Again, read the other post and it’s easier to see with graphs.

naseer says

What should be the nature of the relationship of p values (especially Bonferroni corrected) with the Cohen’s d values for the same set of data?