P values determine whether your hypothesis test results are statistically significant. They appear throughout statistics: you'll find P values in t-tests, distribution tests, ANOVA, and regression analysis. P values have become so important that they've taken on a life of their own. They can determine which studies are published, which projects receive funding, and which university faculty members earn tenure!

Ironically, despite being so influential, P values are misinterpreted very frequently. What *is* the correct interpretation of P values? What do P values *really* mean? That’s the topic of this post!

P values are a slippery concept. Don't worry. I'll explain P values using an intuitive, concept-based approach so you can avoid a very common misinterpretation that can cause you serious problems.

## What Is the Null Hypothesis?

P values are directly connected to the null hypothesis. So, we need to cover that first!

In all hypothesis tests, the researchers are testing an effect of some sort. The effect can be the effectiveness of a new vaccine, the durability of a new product, and so on. There is some benefit or difference that the researchers hope to identify.

However, it’s possible that there really is no effect or no difference between the experimental groups. In statistics, we call this lack of an effect the null hypothesis. When you assess the results of a hypothesis test, you can think of the null hypothesis as the devil’s advocate position, or the position you take for the sake of argument.

To understand this idea, imagine a hypothetical study of a medication that we know is completely useless. In other words, the null hypothesis is true. There is no difference at the population level between subjects who take the medication and subjects who don't.

Even though the null hypothesis is correct, we’ll most likely see an effect in the sample because of random sample error. It’s an incredibly rare occurrence for the sample data to exactly match the actual population value. Therefore, the position you take for the sake of argument (devil’s advocate) is that random sample error produces the observed sample effect rather than it being a true effect.
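To see this concretely, here's a quick simulation sketch (the group size, population mean, and standard deviation below are arbitrary choices for illustration, not from any real study). Both groups are drawn from the same population, so the null hypothesis is true by construction, yet virtually no study observes an effect of exactly zero:

```python
import random
import statistics

random.seed(42)

def simulate_study(n_per_group=30):
    """One hypothetical study of a useless medication: both groups are
    drawn from the SAME population, so the null hypothesis is true."""
    treatment = [random.gauss(100, 15) for _ in range(n_per_group)]
    control = [random.gauss(100, 15) for _ in range(n_per_group)]
    return statistics.mean(treatment) - statistics.mean(control)

# Even with zero true effect, random sample error produces a nonzero
# observed difference in essentially every study.
effects = [simulate_study() for _ in range(1000)]
exactly_zero = sum(1 for e in effects if e == 0.0)
print(f"Studies with an observed effect of exactly zero: {exactly_zero} / 1000")
print(f"Median absolute observed effect: {statistics.median(map(abs, effects)):.2f}")
```

Every simulated study shows *some* effect, even though the true effect is zero. That's random sample error at work, and it's exactly the devil's advocate position the null hypothesis represents.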

## What Are P values?

P-values indicate the believability of the devil’s advocate case that the null hypothesis is true given the sample data. They gauge how consistent your sample statistics are with the null hypothesis. Specifically, if the null hypothesis is correct, what is the probability of obtaining an effect at least as large as the one in your sample?

- High P-values: Your sample results are consistent with a null hypothesis that is true.
- Low P-values: Your sample results are not consistent with a null hypothesis that is true.

If your P value is small enough, you can conclude that your sample is so incompatible with the null hypothesis that you can reject the null for the entire population. P-values are an integral part of inferential statistics because they help you use your sample to draw conclusions about a population.

**Background information**: Difference between Descriptive and Inferential Statistics, and Populations, Parameters, and Samples in Inferential Statistics

## How Do You Interpret P values?

Here is the technical definition of P values:

P values are the probability of observing a sample statistic that is at least as extreme as your sample statistic when you assume that the null hypothesis is true.

Let’s go back to our hypothetical medication study. Suppose the hypothesis test generates a P value of 0.03. You’d interpret this P-value as follows:

If the medicine has no effect in the population as a whole, 3% of studies will obtain the effect observed in your sample, or larger, because of random sample error.
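You can see where a number like that comes from with a simulation sketch (the observed effect size and group parameters below are hypothetical values I chose for illustration): generate many studies in which the null hypothesis is true, and count how often random sample error alone produces an effect at least as large as the one observed.

```python
import random
import statistics

random.seed(1)

def null_study_effect(n=30, mu=100, sigma=15):
    """Observed mean difference between two groups drawn from the same
    population, i.e., a study in which the null hypothesis is true."""
    a = [random.gauss(mu, sigma) for _ in range(n)]
    b = [random.gauss(mu, sigma) for _ in range(n)]
    return statistics.mean(a) - statistics.mean(b)

observed_effect = 8.5  # hypothetical effect from "our" study

# Simulate many null-true studies and find the share whose effect is
# at least as extreme as ours. That share approximates the p-value.
null_effects = [null_study_effect() for _ in range(10_000)]
p_value = sum(1 for e in null_effects if abs(e) >= observed_effect) / len(null_effects)
print(f"Approximate two-sided p-value: {p_value:.3f}")
```

With these particular assumptions, the simulated p-value lands near the article's 0.03: roughly 3% of null-true studies produce an effect at least that large purely through random sample error.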

How probable are your sample data if the null hypothesis is correct? That’s the only question that P values answer. This restriction segues to a very frequent and problematic misinterpretation.

**Related post**: Understanding P values can be easier using a graphical approach: How Hypothesis Tests Work: Significance Levels and P-values.

## P values Are *NOT* an Error Rate

Unfortunately, P values are frequently misinterpreted. A common mistake is the notion that they represent the likelihood of rejecting a null hypothesis that is actually true (Type I error). The idea that P values are the probability of making a mistake is WRONG! You can read a blog post I wrote to learn *why* P values are misinterpreted so frequently.

You can’t use P values to directly calculate the error rate for several reasons.

First, P value calculations assume that the null hypothesis is correct. Thus, from the P value’s point of view, the null hypothesis is 100% true. Remember, P values assume that the null is true and the observed sample effect is caused by random sample error.

Second, P values tell you how consistent your sample data are with a null hypothesis that is true. However, when your data are very inconsistent with the null hypothesis, P values can’t determine which of the following two possibilities is more probable:

- The null hypothesis is true, but your sample is unusual due to random sample error.
- The null hypothesis is false.

To figure out which option is true, you must apply expert knowledge of the study area and, very importantly, assess the results of similar studies.

Going back to our medication study, let’s highlight the correct and incorrect way to interpret the P value of 0.03:

**Correct:** Assuming the medication has zero effect in the population, you'd obtain the sample effect, or larger, in 3% of studies because of random sample error.

**Incorrect:** There's a 3% chance of making a mistake by rejecting the null hypothesis.

Yes, I realize that the incorrect definition seems more straightforward, and that’s probably why it is so common. But, using this definition gives you a false sense of security, as I’ll show you next.

**Related posts**: See a graphical illustration of how t-tests and the F-test in ANOVA produce P values.

## What Is the True Error Rate?

The difference between the correct and incorrect interpretation is not just a matter of wording. There is a fundamental difference in the amount of evidence against the null hypothesis that each definition implies.

The P value for our medication study is 0.03. If you interpret that P value as a 3% chance of making a mistake by rejecting the null hypothesis, you’d feel like you’re on pretty safe ground. However, after reading this post, you should realize that P values are not an error rate and you can’t interpret them this way.

So, if the P value is not the error rate for our study, what is the error rate? Hint: It’s higher!

As I explained earlier, you can’t directly calculate an error rate based on a P value, at least not using the frequentist approach that produces P values. However, you can estimate error rates associated with P values by using the Bayesian approach and simulation studies.

Sellke et al.* have done this. While the exact error rate varies based on different assumptions, the values below use run-of-the-mill assumptions.

| P value | Probability of rejecting a null hypothesis that is true |
| ------- | ------------------------------------------------------- |
| 0.05    | At least 23% (and typically close to 50%)               |
| 0.01    | At least 7% (and typically close to 15%)                |
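One calibration behind numbers like these is the −e·p·ln(p) lower bound from the Sellke et al. paper. The sketch below computes the implied lower bound on the false-positive probability under the assumption of equal prior odds on the null and alternative; it yields roughly 29% for p = 0.05 and 11% for p = 0.01, consistent with the "at least" figures in the table (the table's exact values rest on additional assumptions).

```python
import math

def sellke_bound(p):
    """Lower bound on the probability that a rejection at p-value p is a
    false positive, using the -e * p * ln(p) calibration from Sellke,
    Bayarri, and Berger (2001). Valid for 0 < p < 1/e; assumes equal
    prior odds on the null and the alternative."""
    if not 0 < p < 1 / math.e:
        raise ValueError("calibration applies only for 0 < p < 1/e")
    bayes_factor_bound = -math.e * p * math.log(p)  # bound on odds favoring H0
    return 1 / (1 + 1 / bayes_factor_bound)

for p in (0.05, 0.01):
    print(f"p = {p}: false-positive probability >= {sellke_bound(p):.0%}")
```

Even this lower bound sits far above the p-value itself, which is the whole point: a p-value near 0.05 carries much less evidence against the null than the "3% error rate" misreading suggests.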

These higher error rates probably surprise you! Regrettably, the common misconception that P values are the error rate produces the false impression of considerably more evidence against the null hypothesis than is warranted. A single study with a P value around 0.05 does not provide substantial evidence that the sample effect actually exists in the population.

These estimated error rates emphasize the need to have lower P values and replicate studies that confirm the initial results before you can safely conclude that an effect exists at the population level. In fact, studies with lower P values have higher reproducibility rates in follow-up studies. Learn about the Types of Errors in Hypothesis Testing.

Now that you know how to interpret P values correctly, check out my Five P Value Tips to Avoid Being Fooled by False Positives and Other Misleading Results!

Typically, you’re hoping for low p-values, but even high p-values have benefits!

### Reference

*Thomas Sellke, M. J. Bayarri, and James O. Berger, Calibration of p Values for Testing Precise Null Hypotheses, *The American Statistician*, February 2001, Vol. 55, No. 1

YIHENEW says

Thank you. You give me good insight

David says

Awesome read! How would sample size affect the True Error rate? I would assume since p-values tend to become smaller as sample size increases, that would also effectively reduce the True Error rate since you are more confident about the population (assuming True Error means type I and type II errors).

Jim Frost says

Hi David, Thanks, and I’m glad you enjoyed the article!

There are two types of errors in hypothesis testing. So, let’s see how changing the sample size affects them. You might want to read my article about Type I and Type II Errors in Hypothesis Testing.

There are three basic components for calculating p-values: the effect size, the variability in the data, and the sample size. For the sake of discussion, let's hold the effect size and the variability constant and just increase the sample size. In that case, you would expect the p-values to decrease. Frequentists will cringe at this, but lower p-values are associated with lower false discovery rates (Type I errors). Additionally, increasing the sample size while holding the other two factors constant will increase the power of your test. Power is just (1 – Type II error rate). So, you'd expect the Type II errors (false negatives) to decrease. Increasing the sample size is good all around because it lowers both types of error *for a single study*! I explain the italicized text later!

However, there are a couple of important caveats for the above. Of course, as I point out in this article, you can't calculate any error rates from the p-value using the frequentist approach. There's no direct mapping from p-values to an error rate. You can use simulation studies and the Bayesian approach to estimate the false positive rate from the p-value. However, this requires an estimate of the a priori probability that the alternative hypothesis is correct. That information might be hard to obtain. After all, you're conducting the study because you don't know! Additionally, it's always difficult to calculate the Type II error rate. So, while you can say that increasing the sample size should reduce both Type I and Type II errors, you don't really know what they are! By the way, in a related vein, you might want to read how P-values correlate with the reproducibility of scientific studies.

Let's return to the Frequentist approach because there's another side of things that isn't obvious. In contrast with the earlier example for an individual study, the Frequentist approach talks about Type I errors not for an individual study but for a class of studies that use the same significance level. A result is statistically significant when the p-value is less than the significance level. The significance level equals the Type I error rate for all studies that use a particular significance level. For example, 5% of all studies that use a significance level of 0.05 should be false positives. Of course, when you see significant test results, you don't know for sure which ones are real effects and which ones are false discoveries.

Let's now hold the other two factors constant but *reduce* the sample size. Let's reduce it enough so that you have low power for detecting an effect. As your statistical power decreases, your test is less likely to detect real effects when they exist (the Type II error rate increases). However, the hypothesis test controls, or holds constant, the Type I error rate at your significance level. That's built into the test. If you have a low-power hypothesis test, the test's ability to detect a real effect is low, but its false positive rate remains the same. Consequently, when you obtain statistically significant results from a test with low power, you need to be wary because they're relatively likely to be false positives and less likely to represent real effects.

That's probably more than what you wanted, but it's a fascinating topic!
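That low-power caveat is easy to demonstrate with a simulation sketch (the 50/50 mix of real-effect and null studies, the effect size, and the group sizes below are all assumptions I chose for illustration): hold alpha at roughly 0.05 and shrink the sample, and the share of significant results that are false positives climbs.

```python
import math
import random
import statistics

random.seed(7)

def is_significant(true_effect, n, sigma=15):
    """One two-group study; 'significant' if |z| > 1.96 (alpha ~ 0.05)."""
    a = [random.gauss(true_effect, sigma) for _ in range(n)]
    b = [random.gauss(0.0, sigma) for _ in range(n)]
    z = (statistics.mean(a) - statistics.mean(b)) / (sigma * math.sqrt(2 / n))
    return abs(z) > 1.96

def false_discovery_share(n, studies=4000, effect=5.0):
    """Assume half the studies test a real effect and half a null effect.
    Returns the share of SIGNIFICANT results that are false positives."""
    false_pos = sum(is_significant(0.0, n) for _ in range(studies // 2))
    true_pos = sum(is_significant(effect, n) for _ in range(studies // 2))
    return false_pos / (false_pos + true_pos)

print(f"Low power  (n = 10):  {false_discovery_share(10):.0%} of significant results are false positives")
print(f"High power (n = 150): {false_discovery_share(150):.0%} of significant results are false positives")
```

The Type I error rate per null study stays fixed at about 5% in both cases; what changes is that a low-power test catches so few real effects that the false positives make up a much larger share of the significant results.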

Tetyana says

Dear Jim, thank you very much for you posts!

Does it mean that after I have obtained some small p-value, I have to do some other tests?

Jim Frost says

Hi Tetyana,

After you obtain a small p-value, you can reject the null hypothesis. You don’t necessarily need to perform other tests. I just want analysts to avoid a common misinterpretation. Obtaining a statistically significant result is still a good thing, but you have to keep in mind what it really represents.

Ahmad Allam says

Thank you.

Ahmad Allam says

Thank you very much. That's reassuring. Appreciated.

How could I record this result in a scientific manuscript?

Jim Frost says

Hi Ahmad,

I think it’s perfectly acceptable to report such a small p-value using the scientific notation that is in your output. The other option would be to report it as a regular value by moving the decimal point 16 places to the left, but that takes up so much more room. So, I’d use scientific notation. It’s there to save space for extremely small and large values depending on the context.

Ahmad Allam says

Hi Jim. Thanks for this valuable post. If you can help me with something: I got this result (6.79974E-16). What does that mean?

Appreciated.

Jim Frost says

Hi Ahmad,

That is called scientific notation. The E-16 in it indicates that you need to move the decimal point 16 digits to the left. That’s a very small value. Therefore, you have a very significant p-value!
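As a quick illustration (a hypothetical Python snippet using the value from the comment above):

```python
# Scientific notation: E-16 means "move the decimal point 16 places left."
p = 6.79974e-16
print(f"{p:.4e}")   # the value in scientific notation
print(f"{p:.20f}")  # the same value written out as a regular decimal
print(p < 0.05)     # far below any conventional significance level
```

Both printed forms are the same number; scientific notation just saves space for extremely small (or large) values.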

Pamela Marcum says

What an awesome post! Should be required reading for all STEM students.

Jim Frost says

Thanks, Pamela. That means a lot to me!

Amit Kumar Sahoo says

Thanks, Jim, for your response. I think I got it.

Amit Kumar Sahoo says

Hi Jim,

Thanks for the post. I'm a little confused by the statement below:

“If the medicine has no effect in the population as a whole, 3% of studies will obtain the effect observed in your sample, or larger, because of random sample error.”

Now, as per the definition:

“P-values indicate the believability of the devil’s advocate case that the null hypothesis is true given the sample data. ”

So doesn't that mean a higher P value means accepting the alternate hypothesis, since there's a higher probability of the alternate happening when the null is true? I'm not able to get my head wrapped around this concept.

Amit

Jim Frost says

Hi Amit,

Great question! So, the first thing to realize is that the null and alternative hypotheses are mutually exclusive. If the probability of the alternative being true is higher, then the probability of the null must be lower.

However, the p-value doesn’t indicate the probability of either hypothesis being true. This is a very common misconception. Anytime you start linking p-values to the probability that a hypothesis is true, you know you’re going in the wrong direction!

P-values represent the probability of obtaining the effect observed in your sample, or more extreme, if the null hypothesis is true. It’s a probability of obtaining your data assuming the null is true. Consequently, a low p-value indicates that you were unlikely to obtain the sample data that was collected if the null is true. In this manner, lower p-values represent stronger evidence against the null hypothesis. Lower p-values indicate that your data are less compatible with the null hypothesis.

I think this is easier to understand graphically. I have a link in this post to another post How Hypothesis Tests Work: Significance Levels and P-values. This post shows how it works with graphs. I’d recommend taking a look at it.

I hope this helps!

Khursheed statistics says

Hello sir, hope you are doing fine.

I have no words; you have cleared up so many statistics concepts for me, and I am really happy with whatever you are uploading.

Awesome!

Jim Frost says

Hi Khursheed, I’m so happy to hear that you found this post to be helpful. Thanks for the encouraging words. They mean a lot to me!

naseer says

What should be the nature of the relationship of p values (especially Bonferroni corrected) with the Cohen’s d values for the same set of data?

Sean Saunders says

Jim, thanks for this post, but perhaps you could clarify something for me: assuming that H0 is true, if we set an alpha=0.05 level of significance and get a p-value less than that as the result of our sample data, wouldn’t that indicate, since less than 5% of samples would have such an effect due to random sample error, that there is only a 5% chance of getting such a sample, and thus, a 5% chance of rejecting the null hypothesis incorrectly? What am I missing here? Almost every stats book I’ve ever read has presented the concept this way (a type 1 error is even called an alpha-error!) Thanks for your feedback!

Jim Frost says

Hi Sean, thanks for your comment. Yes, you’re absolutely correct. The significance level (alpha) is the type I error rate. It’s the probability that you will reject the null hypothesis when it is true. However, the p-value is not an error rate. It’s a bit confusing because you compare one to the other.

In the post above, I provide a link to a post where I explain significance levels and p-values using graphs. I think it’s much easier to understand that way. I’ll explain below, but check that post out too.

Both alpha and p-values refer to regions on a probability distribution plot. You need an area under the curve to calculate probabilities. You can calculate probabilities for regions, but not a specific value.

That works fine for alpha. If the null is true, you expect sample values to fall in the critical regions X% of times based on the significance level that you specify. For p-values, the problems occur when you want to know the error rate for your specific study. You can’t do that for a single value from an individual study because you need an area under the curve.

The best you can say for p-values is: if the null is true, then you'd expect X% of studies to have an effect at least as large as the one in your study, where X = your P-value. Notice the "at least as large." That's needed to produce the range of values for an area under the curve. It also means you can't apply the percentage to your specific study. You can apply it only to the entire range of theoretical studies that have an effect at least as large as yours. That range collectively has an error rate that equals the p-value, but not your study alone.

Another thing to consider is that, within the range defined by the p-value, your study provides the weakest results because it defines the point closest to the null. So, the overall error rate for the range is largely based on theoretical studies that provide stronger evidence than your actual study!

In a similar fashion, if you reject the null for your study using an alpha = 0.05, you know that all studies in the critical region have a Type I error rate = 0.05. Again, this applies to the entire range of studies and not yours alone.

I hope this all makes sense. Again, read the other post and it’s easier to see with graphs.
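For readers who like to see the "area under the curve" idea in code, here's a small sketch using the standard normal distribution (the z-scores below are arbitrary examples): the two-sided p-value is just the tail area beyond the observed statistic, which is why it describes a whole range of results rather than your single study.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard-normal test statistic: the total
    tail area at least as extreme as |z| on both sides of the curve."""
    return math.erfc(abs(z) / math.sqrt(2))

# The significance level marks off fixed critical regions in advance;
# the p-value is the area beyond the statistic you actually observed.
for z in (1.0, 1.96, 2.58):
    print(f"z = {z}: two-sided p ~= {two_sided_p_from_z(z):.3f}")
```

Note that z = 1.96 recovers the familiar p ≈ 0.05 boundary: the critical region for alpha = 0.05 and the tail area for that statistic coincide exactly at the cutoff.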