P values determine whether your hypothesis test results are statistically significant. Statistics use them all over the place. You’ll find P values in t-tests, distribution tests, ANOVA, and regression analysis. P values have become so important that they’ve taken on a life of their own. They can determine which studies are published, which projects receive funding, and which university faculty members become tenured!

Ironically, despite being so influential, P values are misinterpreted very frequently. What *is* the correct interpretation of P values? What do P values *really* mean? That’s the topic of this post!

P values are a slippery concept, but don’t worry. I’ll explain P values using an intuitive, concept-based approach so you can avoid a widespread misinterpretation that can cause serious problems.

## What Is the Null Hypothesis?

P values are directly connected to the null hypothesis. So, we need to cover that first!

In all hypothesis tests, the researchers are testing an effect of some sort. The effect can be the effectiveness of a new vaccination, the durability of a new product, and so on. There is some benefit or difference that the researchers hope to identify.

However, it’s possible that there actually is no effect or no difference between the experimental groups. In statistics, we call this lack of an effect the null hypothesis. When you assess the results of a hypothesis test, you can think of the null hypothesis as the devil’s advocate position, or the position you take for the sake of argument.

To understand this idea, imagine a hypothetical study for medication that we know is entirely useless. In other words, the null hypothesis is true. There is no difference at the population level between subjects who take the medication and subjects who don’t.

Despite the null being accurate, you will likely observe an effect in the sample data due to random sampling error. It is improbable that samples will ever exactly equal the null hypothesis value. Therefore, the position you take for the sake of argument (devil’s advocate) is that random sample error produces the observed sample effect rather than it being an actual effect.

## What Are P values?

P-values indicate the believability of the devil’s advocate case that the null hypothesis is correct given the sample data. They gauge how consistent your sample statistics are with the null hypothesis. Specifically, if the null hypothesis is right, what is the probability of obtaining an effect at least as large as the one in your sample?

- High P-values: Your sample results are consistent with a true null hypothesis.
- Low P-values: Your sample results are not consistent with a true null hypothesis.

If your P value is small enough, you can conclude that your sample is so incompatible with the null hypothesis that you can reject the null for the entire population. P-values are an integral part of inferential statistics because they help you use your sample to draw conclusions about a population.

**Background information**: Difference between Descriptive and Inferential Statistics; Populations, Parameters, and Samples in Inferential Statistics

## How Do You Interpret P values?

Here is the technical definition of P values:

P values are the probability of observing a sample statistic that is at least as extreme as your sample statistic when you assume that the null hypothesis is true.
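In symbols (T is the test statistic, t_obs the value your sample produced; the two-sided form is shown as an illustration):

```latex
p = \Pr\left( |T| \ge |t_{\mathrm{obs}}| \;\middle|\; H_0 \text{ is true} \right)
```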

Let’s go back to our hypothetical medication study. Suppose the hypothesis test generates a P value of 0.03. You’d interpret this P-value as follows:

If the medicine has no effect in the population as a whole, 3% of studies will obtain the effect observed in your sample, or larger, because of random sample error.
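This interpretation can be sketched as a simulation. Everything below is hypothetical: the group size, the population values, and the observed 6.5-point difference are made-up numbers chosen so the answer lands near the article’s 0.03.

```python
import random

random.seed(1)

# Hypothetical null-is-true world: both groups come from the same population,
# so any observed difference is pure random sampling error.
def group_mean(n):
    return sum(random.gauss(100, 15) for _ in range(n)) / n

n_per_group = 50
observed_diff = 6.5  # the effect our single real study happened to see

# Re-run the "study" many times under a true null and count how often
# sampling error alone produces an effect at least that large.
n_studies = 20_000
extreme = sum(
    abs(group_mean(n_per_group) - group_mean(n_per_group)) >= observed_diff
    for _ in range(n_studies)
)
print(f"Share of true-null studies with an effect >= {observed_diff}: "
      f"{extreme / n_studies:.3f}")
```

With these made-up numbers, the printed share comes out near 0.03, which is exactly what a p-value of 0.03 claims: in about 3% of true-null studies, random sampling error alone produces an effect at least this large.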

How probable are your sample data if the null hypothesis is correct? That’s the only question P values answer. This restriction segues into a very persistent and problematic misinterpretation.

**Related posts**: Understanding P values can be easier with a graphical approach: see How Hypothesis Tests Work: Significance Levels and P-values to learn about significance levels from a conceptual standpoint.

## P values Are *NOT* an Error Rate

Unfortunately, P values are frequently misinterpreted. A common mistake is that they represent the likelihood of rejecting a null hypothesis that is actually true (Type I error). The idea that P values are the probability of making a mistake is WRONG! You can read a blog post I wrote to learn *why* P values are misinterpreted so frequently.

You can’t use P values to directly calculate the error rate for several reasons.

First, P value calculations assume that the null hypothesis is correct. Thus, from the P value’s point of view, the null hypothesis is 100% true. Remember, P values assume that the null is true, and sampling error caused the observed sample effect.

Second, P values tell you how consistent your sample data are with a true null hypothesis. However, when your data are very inconsistent with the null hypothesis, P values can’t determine which of the following two possibilities is more probable:

- The null hypothesis is true, but your sample is unusual due to random sampling error.
- The null hypothesis is false.

To figure out which option is right, you must apply expert knowledge of the study area and, very importantly, assess the results of similar studies.

Going back to our medication study, let’s highlight the correct and incorrect way to interpret the P value of 0.03:

**Correct:** Assuming the medication has zero effect in the population, you’d obtain the sample effect, or larger, in 3% of studies because of random sample error.

**Incorrect:** There’s a 3% chance of making a mistake by rejecting the null hypothesis.

Yes, I realize that the incorrect definition seems more straightforward, and that’s why it is so common. Unfortunately, using this definition gives you a false sense of security, as I’ll show you next.

**Related posts**: See a graphical illustration of how t-tests and the F-test in ANOVA produce P values.

## What Is the True Error Rate?

The difference between the correct and incorrect interpretation is not just a matter of wording. There is a fundamental difference in the amount of evidence against the null hypothesis that each definition implies.

The P value for our medication study is 0.03. If you interpret that P value as a 3% chance of making a mistake by rejecting the null hypothesis, you’d feel like you’re on pretty safe ground. However, after reading this post, you should realize that P values are not an error rate, and you can’t interpret them this way.

If the P value is not the error rate for our study, what is the error rate? Hint: It’s higher!

As I explained earlier, you can’t directly calculate an error rate based on a P value, at least not using the frequentist approach that produces P values. However, you can estimate error rates associated with P values by using the Bayesian approach and simulation studies.

Sellke et al.* have done this. While the exact error rate varies based on different assumptions, the values below use run-of-the-mill assumptions.

| P value | Probability of rejecting a true null hypothesis |
| --- | --- |
| 0.05 | At least 23% (and typically close to 50%) |
| 0.01 | At least 7% (and typically close to 15%) |

These higher error rates probably surprise you! Regrettably, the common misconception that P values are the error rate produces the false impression of considerably more evidence against the null hypothesis than is warranted. A single study with a P value around 0.05 does not provide substantial evidence that the sample effect exists in the population. For more information about how these false positive rates are calculated, read my post about P-values, Error Rates, and False Positives.
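A rough sense of where numbers like these come from: simulate a world that mixes true and false nulls, then look only at studies whose p-value lands just under 0.05. The setup below is purely illustrative — the 50/50 split between true and false nulls and the modest effect size are my assumptions, not Sellke et al.’s model.

```python
import math
import random

random.seed(2)

def two_sided_p(z):
    # Two-sided p-value for a z statistic under a standard normal null.
    return math.erfc(abs(z) / math.sqrt(2))

n_studies = 200_000
null_near = effect_near = 0
for _ in range(n_studies):
    if random.random() < 0.5:
        z = random.gauss(0.0, 1.0)   # null is true: no real effect
        is_null = True
    else:
        z = random.gauss(1.5, 1.0)   # modest real effect (assumed size)
        is_null = False
    if 0.04 < two_sided_p(z) < 0.05:  # studies landing just under p = 0.05
        if is_null:
            null_near += 1
        else:
            effect_near += 1

share = null_near / (null_near + effect_near)
print(f"Among studies with p just under 0.05, share with a true null: {share:.2f}")
```

With these assumptions, roughly a quarter of the studies landing just under p = 0.05 come from true nulls — in the ballpark of the table’s “at least 23%.” Making true nulls more common than 50%, or the real effects smaller, pushes that share toward 50%.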

These estimated error rates emphasize the need to have lower P values and replicate studies that confirm the initial results before you can safely conclude that an effect exists at the population level. Additionally, studies with smaller P values have higher reproducibility rates in follow-up studies. Learn about the Types of Errors in Hypothesis Testing.

Now that you know how to interpret P values correctly, check out my Five P Value Tips to Avoid Being Fooled by False Positives and Other Misleading Results!

Typically, you’re hoping for low p-values, but even high p-values have benefits!

### Reference

*Thomas Sellke, M. J. Bayarri, and James O. Berger, “Calibration of p Values for Testing Precise Null Hypotheses,” *The American Statistician*, February 2001, Vol. 55, No. 1

NAMAN JAIN says

Hey

Can you please tell how to calculate P value mathematically in regression

Christian says

Wow, thank you for this brilliant article, Jim!

If I get a p-value of 0.03, would it be correct to say: “If the H0 is true, there is only a probability of 3% of observing this data. Hence, we reject the null hypothesis.”

Is this statement correct, and is there any other credible way of bringing the word ‘probability’ into the interpretation?

Thank you very much mate!

Cheers,

Christian

Jim Frost says

Hi Christian,

That’s very close to 100% correct! The only thing I’d add is “this data *or more extreme*.” But you’re right on with the idea. Most people don’t get it that close! There’s really no other way to work probability into this context. In fact, I’ll often tell people that if they’re using probability in relation to anything other than their data/effect size, that’s a sign that they’re barking up the wrong tree.

Thanks for writing!

Monique Ekundayo says

Thanks

Monique Ekundayo says

Hi Jim,

I conducted a mediation analysis (Baron and Kenny) and my p-value from a Strobel Test came back negative? What does a negative p-value signify?

Jim Frost says

Hi Monique, I’ll assume that you’re actually asking about the Sobel test (there is no Strobel test that I’m aware of). I don’t know why you got a negative p-value. That should not occur. There might be a problem with the code or application you’re using.

vik says

Thanks

vik says

Hi Jim,

I enjoy reading your blogs. I purchased two of your books. I have learnt more from these books than from textbooks written by other people. I have a question about interpretation of significance level and p-value – two statements from your book come across as contradictory (to me).

On page 11 of your “hypothesis testing” book, these statements concerning interpretation of significance level are made :

(1) In other words it is the probability that you say there is an effect when there is no effect. For instance, a significance level of 0.05 signifies a 5% risk of deciding that an effect exists when it does not exist.

On page 77 the following statement is made about interpretation of pvalue :

(2) A common mistake is that they represent the likelihood of rejecting a null hypothesis that is actually true (Type I error). The idea that p-values are the probability of making a mistake is wrong !

I find statements (1) and (2) contradictory because of the following. In making the decision about whether to reject the null hypothesis one compares the p-value to the significance level. (If pvalue is lower than the preset significance level one rejects the null hypothesis). It is possible to compare two quantities only if they have the same interpretation (same units in problems in the area of physics). Therefore the interpretation of significance level and pvalue should be the same ! For example if pvalue turns about to be 0.04, we reject the null hypothesis since 0.04 is lower than 0.05. If 0.05 significance level implies 5% risk of (incorrectly ) rejecting a true null hypothesis then a pval of 0.04 should be interpreted as a 4% risk of (incorrectly ) rejecting a true null hypothesis ?

What am I missing here ?

Thanks.

Jim Frost says

Hi Vik,

Thanks so much for supporting my books!

This issue is very confusing. You might find it surprising, but there are no contradictory statements in what I wrote!

Keep in mind that your 1 and 2 statements are about the significance level and p-values, respectively. So, they’re about different concepts and, hence, it’s not surprising that different conditions apply.

For significance levels (alpha), it is appropriate to say that if you use a significance level of 0.05, then for all studies that use that significance level, you’d expect 5% of them to be positive when the null hypothesis is true. Importantly, significance levels apply to a range of p-values. Also, note that stating that you have a 5% false-positive rate when the null is true is entirely different than applying an error rate probability to the null hypothesis itself.

We’re not saying there’s a 5% chance that the test results for an individual study are incorrectly saying that the null is false when it is actually true. We’re saying that in cases where the null is true, 5% of studies that use a significance level of 0.05 will get false positives. Unfortunately, we’re never sure when the null is true or not. We just know the error rate for when it is true. In other words, it’s based on the assumption that the null is true.
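That long-run behavior is easy to check with a simulation sketch (a z-test stands in for any hypothesis test here; the numbers are illustrative):

```python
import math
import random

random.seed(3)

def two_sided_p(z):
    # Two-sided p-value for a z statistic under a standard normal null.
    return math.erfc(abs(z) / math.sqrt(2))

# Simulate many studies in which the null hypothesis really is true.
n_studies = 50_000
significant = sum(
    two_sided_p(random.gauss(0.0, 1.0)) < 0.05 for _ in range(n_studies)
)
print(f"Share of true-null studies declared significant: "
      f"{significant / n_studies:.3f}")
```

The printed share sits near 0.05: alpha describes this whole class of true-null studies, not the probability that any single study’s conclusion is wrong.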

Your second statement is about the p-value. That’s the probability for a specific study rather than a class of studies. It’s the probability of obtaining the observed results, or more extreme, under the assumption that the null is true.

So, alpha applies to a class of studies (have p-values within a range and the null is true), whereas p-values apply to a specific study. For both, it’s under the assumption that the null is true and does not indicate the probability related to any hypothesis.

Let’s get to your example with a p-value of 0.04 and we’re using a significance level of 0.05. The correct interpretation for the p-value is that you have a 4% chance of observing the results you obtained, or more extreme, if the null is true. For the significance level, your study is significant. Consequently, it is in the class of studies that obtain significant results using an alpha of 0.05. In that class, 5% of the studies will produce significant results when the null is true. However, we don’t know whether the null is true or not for your study. Additionally, we can’t use those results to determine the probability of whether the null is true.

Specifically, it is NOT accurate to say that a p-value of 0.04 represents a 4% risk of incorrectly rejecting the null. That’s the common misconception I warn about!

I hope that helps clarify! It is a tricky area. Just remember that any time you start to think that either p-values or the significance level allow you to apply a probability to the null hypothesis, you’re barking up the wrong tree. Both assume that the null is true. Please note in my hypothesis testing book my illustrations of sampling distributions of the various tests statistics. All of those are based on the assumption that the null is true. From those distributions, we can apply the significance level and derive p-values. So, they’re incorporating the underlying assumption that the null is true.

Anisha Kumar says

Hi,

When writing the interpretation do we set it up as “Assuming the null is true, there is a 3% chance of getting null hypothesis or the alternative?

I do not necessarily understand if the p-value is bigger than alpha why we fail to reject the null hypothesis.

Yechezkal Gutfreund says

Would this be a fair statement?

With an alpha of 0.05, If one repeats the sample enough times, the mean percent of Type I errors will approach 5%? (since type I errors do assume a true null hypothesis). However, we cannot say that about an individual test and it’s P-value.

Jim Frost says

Hi, that is sort of correct. More correct would be to say that if you repeat an experiment on a population where the null is true, you’d expect 5% (using alpha = 0.05) of the studies to be statistically significant (false positives). However, if the null is false, you can’t have a false positive! So, keep in mind that what you write is true only when the null is true.

And, right, using frequentist methodology, you can’t use the p-value (or anything else) to determine the probability that an individual study is a false positive.

Yechezkal Gutfreund says

I hope you don’t mind me continuing the conversation here, if not tell me.

Hopefully, I am also helping you in giving a clue where the mental blocks are.

I believe I get the distinction between P values and alpha (I would not conflate them). As I understand it now, P-Values are sample specific, point values, Alphas are related to a parameterized test statistic (PDF) that captures the results of repeated iterations of taking samples from the population. If that is wrong, then I need to be corrected before going any further.

What I did not grok, and which probably should be emphasized in the post, is that Alphas are still assuming null is true. Also I read in you posts that alpha === error rate, I was taking this as Type I error rate. It seems that was a false reading.

For the moment I am not interested in Type II errors and distinguishing them from Type I (False Positives). So what I would like to see in the blog post is why alpha is different from Type I errors, and why Bayesian simulation is needed to get a better handle on Type I errors.

And yes, this section (below) of your comment is also probably a great place of a blog article, since it would be great with a worked out example and a chart showing exactly how this disparity can happen.

“`

So, yes, you can be 95% confident that the CI contains the true parameter, but you might be in the 5% portion and not know it. And, it comes down to the probability that null is false. If it’s likely that the null is false then you’re more likely to be in the 5%. When the null is more likely to be correct, you’re more likely in the 95% portion. I can see a lot minds blowing now!

“`

Jim Frost says

Hi,

Yes, that’s right about p-values and alpha. P-values are specific to a particular study. Using the frequentist methodology, there is no way to translate from a p-value to an error rate for a hypothesis. Alpha applies to a class of studies and it IS an error rate. It is the Type I error rate.

You had it right earlier that alpha = the Type I error rate. Alpha is the probability that your test produces significant results when the null is true. And Type I errors are when you reject a null hypothesis that is true. Hence, alpha and the Type I error rate are the same thing.

Think back to the plots that show the sampling distributions of the test statistics. Again, these graphs show the distribution of the test statistic assuming the null is true. To determine probabilities, you need an area under the curve. The significance level (alpha) is simply shading the outer 5% (usually) portion of the curve. The test statistic will fall in those portions 5% of the time when the null is true. You can’t get a probability for an individual value of the test statistic because that doesn’t produce an area under the curve.

Yechezkal Gutfreund says

Well, that is a good answer at the definitional level, i.e. that is the propability of the effect, with the assumption that the null hypothesis is true. OK, but what I am trying to do with my clogged block-head is wrap my mind about this. (I am 1/2 way through the hypothesis testing book, and yes, the diagrams help but not yet on this).

Here is another way I am struggling with this. Ok, granted that the P-Value is disconnected with error rate, but in your book you mention that alpha the same thing as the Type I error rate.

So if my alpha is 0.05 and my P-value is 0.03, why am I not at a 95% confidence level? As you say in this post , Sellke et al.* using simulation show that the actual error rate is probably closer to 50%. Huh? Should I not be at least 95% confident there is no Type I error?

Now, I have a hunch this all has to do with the fact that after the alternative hypothesis is accepted, there is some conditional probability (Bayes strikes again). But I am trying to ground this in intuition, and that is why I think a worked example of how we go from 0.05 to 0.5

That is why I am looking for an example worked out with graphs that identify where the “additional” source of Type I errors is occurring.

Jim Frost says

Hi Yechezkal,

I highlight the definition because it’ll point in the right direction when you’re starting out. If you ever start thinking that it’s the probability of a hypothesis, you know you’re barking up the wrong tree!

As you look at the graphs, keep in mind that they show the sampling distributions of the test statistic. These distributions assume that the null is true. Hence, the peak occurs at the null hypothesis value. You then place the test statistic for your sample into that distribution. That whole process shows how the null being true is baked right into the calculations. The distributions apply to a class of studies, those with the same characteristics as yours. The test statistics is specific to your test. You’ll see that distinction between class of study and your specific study again in a moment.

You raise a good point about alpha. And the fact that you’re comparing alpha (which is an error rate) to the p-value (not an error rate) definitely adds to the confusion. I write about this and other reasons for why p-values are misinterpreted so frequently. (There’s some historical reasons at play among other things.)

The significance level (alpha) is an error rate. However, it is an error rate for when the null hypothesis is true. This error rate applies to all studies where the null is true and have the same alpha. For example, if you use an alpha of 0.05 and you have 100 studies where the null is true, you’d expect five of them to have significant results. The key point is that the error rate for alpha applies to a class of studies (null is true, same alpha).

On the other hand, p-values apply to a specific study. Furthermore, while you know alpha, you don’t know whether the null is true. Not for sure. So, if you obtain significant results, is it because the effect exists in the population, or is it a Type I error (false positive)? You just don’t know.

So, when you obtain a significant p-value and calculate a 95% confidence interval, those results will agree. However, you still don’t know the probability that the null is true or not. So, yes, you can be 95% confident that the CI contains the true parameter, but you might be in the 5% portion and not know it. And, it comes down to the probability that the null is false. If it’s likely that the null is false, then you’re more likely to be in the 5%. When the null is more likely to be correct, you’re more likely to be in the 95% portion. I can see a lot of minds blowing now!

I will be writing a blog post on this, so I’m not going to explain it all here. It’s just too much for the comments section. P-values and CIs are part of the frequentist tradition in statistics. Under this view, there is no probability that the null is true or false. It’s either true or false but you don’t know. You can’t calculate the probability using frequentist methods. You know that if the null is true, then there’s a 5% chance of obtaining significant results anyway. However, there is no way to calculate the probability of the null being true so there’s no way to convert it into an error rate.

However, using simulations and Bayesian methodology, you can get to the point of estimating error rates for p-values . . . sort of in some cases. Some Frequentists don’t like this because it is going outside their methodology, but it sheds light on the real strength of the evidence for different p-values. And, the conclusions of the simulation studies and Bayesian methodology are consistent with attempts to reproduce significant results in experiments. P-values predict the likelihood of reproducing significant results.

So, stay tuned for that blog post! I’ll make it my next one. If you’re on my email list, you’ll receive an email when I publish it. If not, add yourself to the email list by looking in the right margin of my website and scroll partway down. You’ll see a box to enter your email to receive notifications of new blog posts.

Yechezkal Gutfreund says

Jim, I am a Ph.D. in Computer science. I really like your approach to teaching this, I have always struggled with getting an intuition into stats. But I am still mentally blocked on why the P-value is not the same as the error rate.

You state:

The null hypothesis is false. — I get this, that is part of the definition and assumption of p, but I still don’t see how it effects the error rate.

Later on you state (and I can accept it on authority, but not on intuition):

> Sellke et al.* have done this. While the exact error rate varies based on different assumptions, the values below use run-of-the-mill assumptions.
>
> | P value | Probability of rejecting a true null hypothesis |
> | --- | --- |
> | 0.05 | At least 23% (and typically close to 50%) |
> | 0.01 | At least 7% (and typically close to 15%) |
>
> These higher error rates probably surprise you!

Well yes, it does surprise me. Can I be somewhat chutzpanik and ask you to create an numerical example problem or two, that has low p-values (e.g. 0.05) and error rates of 15%-50%, then show what are the factors (from the example ) that lead to the higher error rate?

I have also read that if the significance level I am seeking (and yes, I grok that is different than p-value) is 0.05 if you do enough experiments, the error rate will approach the alpha (significance level)

If that could be also part of the example, I think folks would grasp this better from a real world example than from declarative statements?

Do you this would be worth a blog post to attach to this one? Tell me if that is true, and ping me if you do such a thing.

What I am working on are modeling and simulations of military battles with new equipment. I am looking at how many times I need to run a stochastic simulation (since causalities will be different each time) till I get definitive statement that this new equipment leads to less causalities.

Jim Frost says

Hi Yechezkal,

The p-value is a conditional probability based on the assumption that the null is true. However, what is it a probability of? It’s the probability of observing an effect size at least as large as the one you observed. That probability has nothing to do with whether the null is true or false. So, keep that in mind. It’s a probability of seeing an effect size. There’s nothing in the definition about it being a probability related to one of the hypotheses! That’s why it’s not an error rate! Then map onto that the conditional assumption that the null is true.

I think it’s easier to understand graphically. So, check it out in the context of how hypothesis tests work where I graphically show how p-values and significance levels work.

I will write a post about how this works and the factors involved. It’s an interesting area to study. Bayesian and simulation studies have looked at this using their different methodologies and have come up with similar answers. Look for the post either later this year or early 2021!

Thanks for writing and the great suggestion!

Sam Mohsen says

Thank you for the article. I have always struggled to correctly interpret the p-value.

I have two sets of data (readings for process durations conducted using different approaches). I have used graphical representation and they two sets seems very similar. However, I want to apply the t-test and examine if they are really similar or not. I have two questions:

A) Should I use the whole datasets when conducting the t-test and examining the p-value? I have more than 10k in both datasets. Or should I “randomly” select a sample from these 10k records I have?

B) If, let’s say I got a t-test = 2.5 and a p-value of 0.000045 (a very small), what does that mean? Does it mean that the two datasets are actually different? (meaning that I reject the null-hypothesis that I assume the are similar). Is there a better interpretation?

Jim Frost says

Hi Sam,

This is a great question.

First, you should use the full dataset. There’s generally little reason to throw out data unless you question the data themselves. If you think the data are good, then keep it!

The “problem” with a large dataset is that it gives hypothesis tests a lot of statistical power. Having a lot of power gives the test the ability to detect very small effects. Imagine that there is a trivial difference between the means of the two populations. A test with a very large sample size can detect this trivial difference and produce a very small, significant p-value. That might occur in your case because you have 10k observations in both groups. However, I put “problem” in quotes because it’s not actually a problem: there are methods for determining whether a statistically significant result is also practically significant.
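Here’s a sketch of that power effect using a simple two-sample z-test. The numbers are made up for illustration — the “trivial” true difference of 0.05 standard deviations and the group sizes aren’t from your data.

```python
import math
import random

random.seed(4)

def study_p(n_per_group, true_diff, sd=1.0):
    # Two-sample z-test p-value for a small but real difference in means.
    a = [random.gauss(0.0, sd) for _ in range(n_per_group)]
    b = [random.gauss(true_diff, sd) for _ in range(n_per_group)]
    diff = sum(b) / n_per_group - sum(a) / n_per_group
    se = sd * math.sqrt(2 / n_per_group)
    return math.erfc(abs(diff / se) / math.sqrt(2))

# A trivial real difference of 0.05 standard deviations:
print("p with 50 per group:     ", round(study_p(50, 0.05), 3))
print("p with 10,000 per group: ", round(study_p(10_000, 0.05), 3))
```

Run it a few times: with 50 per group the tiny difference rarely reaches significance, while with 10,000 per group it almost always does — the effect didn’t get any more meaningful, only easier to detect.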

I point out in various places that a significant p-value does not automatically indicate that the results are practically meaningful in the real world. In your example with a p-value of 0.000045, it indicates that the evidence supports the hypothesis that an effect exists at the population level. However, the p-value by itself does not indicate that the effect is necessarily meaningful in a practical sense. You should always take the extra step of assessing the precision and magnitude of that effect and the real-world implications regardless of your sample size. I write about this process in my post about practical versus statistical significance.

I also write about it in my post with 5 tips to avoid being misled by p-values.

I’d read those posts and follow those tips. Pay particular attention to the parts about assessing CIs of the estimated effect. In your case, the populations are probably different, but now you need to determine whether that difference is meaningful in a real-world sense.

I hope that helps!

Eugene Demidenko says

Unfortunately, the correct interpretation of the p-value is not valuable and is not informative for making judgements on the strength of the null hypothesis. Many people forget that the p-value strongly depends on the sample size: the larger n the smaller p (E. Demidenko. The p-value you can’t buy, 2016). The correct interpretation of the p-value is the proportion of samples from future samples of the same size that have the p-value less than the original one, if the null hypothesis is true. That is why I claim that the p-value is not informative but people try to overemphasize it. Use d-value — it has more sense.

Jim Frost says

Hi Eugene,

I’d agree that p-values are confusing and don’t answer the question that many people think it does. However, I’m afraid I have to disagree that it is not informative. It measures the strength of the evidence against the null hypothesis. As such, it is informative.

Sample size does affect p-values, but only when an effect is present. When the null hypothesis is true for the population, p-values do not tend to decrease as the sample size grows. So, it’s not accurate to say “the larger the n, the smaller the p.” Sometimes yes. Sometimes no. I think you’re referring to the potential problem that huge samples can detect minuscule effects that aren’t important in the real world. I write about this in my post about practical significance vs statistical significance.

I’m guessing that when you say “d-value,” you’re talking about Cohen’s d, a measure of the relative effect size and not the d-value in microbiology that is the decimal reduction time! Cohen’s d indicates the effect size relative to the pooled standard deviation. It can be informative when you’re assessing differences between means. But, it doesn’t help you with other types of parameters. I’d suggest that you need to evaluate confidence intervals. They indicate the likely effect size while incorporating a margin of error. You can also use them to determine statistical significance. Unlike Cohen’s d, you can assess confidence intervals for all types of parameters, such as means, proportions, and counts. In short, CIs help you assess practical significance, the precision of the estimate, and statistical significance. As I write in my blog posts, I really like confidence intervals!

Your definition of the p-value isn’t quite correct. P values are the probability of observing a sample statistic that is at least as extreme as your sample statistic when you assume that the null hypothesis is true.

Are p-values informative? Yes, they are. As I show towards the end of this post, lower p-values are associated with lower false discovery rates. Additionally, a replication study found that lower p-values in the initial study were associated with a higher chance that the follow-up replication study is also statistically significant. Read my post about Relationship Between the Reproducibility of Experimental Results and P-values.

High p-values can help prevent you from jumping to conclusions!

And, finally, I present tips for how to use p-values and avoid misleading results.

I hope that helps clarify p-values!

Gudelli Prashanth Reddy says

Hello Sir.. If the p value is 0.03, and it means that 3% of studies would show the sample effect due to random error, what does that mean?

Can you please extend the explanation from there.

Why do we call it a statistically significant value, sir? What are we inferring here?

Jim Frost says

Hi Gudelli,

For what a p-value of 0.03 means, just use the information I provide in this article. In fact, I give the correct interpretation for a p-value of 0.03 in this article! Scroll down until you see the green Correct in the text. That’s what you’re looking for. If there’s a more specific point that’s not clear, please let me know. But there’s no point in me repeating what is already written in the article.

As for statistical significance, that indicates that an effect/relationship you observe in a random sample is likely to exist in the population from which you drew the sample. It helps you rule out the possibility that the effect was just random sampling error. Remember, just because you see an effect in a sample does not mean it necessarily exists in the population.

Ben (@Ben38622544) says

Hi Jim,

Hope you’re well.

When calculating our z scores, we obviously use (score-mean)/SD.

Say I have 50 years of annual climate data (1951-2000) – 1 mean for each year – do I have to use the mean and standard deviation of all this data? Or can I use the mean & SD of 1951-1980, for example? (That is, (1999 mean minus mean of the 1951-1980 means)/SD of the 1951-1980 data)

Of course this may well prompt more statistically significant points between 1981 and 2000.

However, is this reasonable practice in data science or is this overmanipulation/an absolute no no.

Thank you in advance for your help! Hope you have a good day!

Ben

Jim Frost says

Hi Ben,

A normal distribution, for which you calculate Z-scores, involves a series of independent and identically distributed events. I just wrote a post about that concept. Time series data don’t qualify because they’re not independent events–one point in time is correlated to another point in time. And, if there is a trend in temperatures, they’re not identically distributed. In a technical sense, it wouldn’t be best practice to calculate Z-scores for that type of data. If you’re just calculating them to find outliers, the requirements aren’t so stringent. However, be aware that a trend in the data would increase the variability, which decreases the Z-scores because the SD is in the denominator. If you were to use shorter timeframes, there might not be noticeable trends in the data.

Typically, what you’d want to do is fit a time series model to the data and then look for deviations from the model’s expected values (i.e., large residuals).
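A rough sketch of that idea, using invented data (the series, trend, and injected anomaly below are all made up for illustration): fit a simple linear trend and flag years with unusually large residuals, rather than z-scoring the raw trending values.

```python
# Hypothetical annual series: a steady warming trend plus one anomalous year.
# Z-scoring the raw values would be distorted by the trend; detrending first
# and examining the residuals isolates the genuinely unusual observation.
import statistics

years = list(range(1951, 2001))
temps = [14.0 + 0.02 * (y - 1951) for y in years]  # invented linear trend
temps[30] += 1.5  # inject one anomalous year (1981) for illustration

# ordinary least-squares line through (years, temps)
mx, my = statistics.mean(years), statistics.mean(temps)
slope = (sum((x - mx) * (t - my) for x, t in zip(years, temps))
         / sum((x - mx) ** 2 for x in years))
intercept = my - slope * mx

# residuals from the fitted trend; flag anything beyond 3 residual SDs
residuals = [t - (intercept + slope * x) for x, t in zip(years, temps)]
sd = statistics.stdev(residuals)
outliers = [y for y, r in zip(years, residuals) if abs(r) > 3 * sd]
print(outliers)  # [1981] -- only the injected anomaly stands out
```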

I hope this helps!

Manuel Leos says

What is the difference between the p-value, as given by Excel or a statistics program such as R, and the alpha level? What is the relation to the critical value? Why does this matter?

Jim Frost says

Hi Manuel,

If you’re performing the same test with the same options, there should be no differences between Excel and statistical programs. However, I do notice that Excel sometimes has limited methodology options compared to statistical packages, which means their calculations might not always match up.

As for p-values and critical values (regions), I write about that in a post about p-values and significance levels. Read that article, and if you have more questions on that topic, please post them there!

Aakash Sharma says

My p value from a one-way ANOVA is 1.09377645180021E-12. What does it mean? Is it significant?

Jim Frost says

Hi Aakash, your p-value is written using scientific notation. Scientific notation is a convenient way to represent very large and very small numbers. In your case, it represents a very small p-value. Yes, it’s significant!

The minus 12 indicates that you need to move the decimal point 12 places to the left. Your p-value is much smaller than any reasonable significance level and, therefore, represents statistically significant results. You can reject the null hypothesis for your ANOVA.
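As a quick sanity check you can run yourself (a small Python sketch, not tied to any particular package): scientific notation parses directly as a number, so you can compare it straight to your significance level.

```python
# A p-value written in scientific notation is just a very small number.
p = float("1.09377645180021E-12")

print(p < 0.05)    # True -> statistically significant
print(f"{p:.15f}")  # 0.000000000001094 (decimal point moved 12 places left)
```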

I hope this helps!

Zaki Khadija says

Hello, thank you very much for your explanations. I have studied the significance of the correlation between several quantitative variables using software, but I would like to know how to calculate the p-value manually, in order to understand its principle.

On the other hand, what does the p-value mean technically? I find it difficult to define this parameter practically in my field of environmental chemistry.

Cordially

Lizette Franks says

Hi! Thanks so much! This clarifies the difference very much. I’m analyzing and writing reports about nutrition-related literature. Two of the studies are prospective cohort studies with several covariates. The topic is the egg/dietary cholesterol relationship with cardiovascular disease. You probably know that nutrition research is like a roller coaster 🙂 So I encountered new terms for the statistical analyses used in these types of studies that explore non-linear associations. The Rao-Scott chi-square test, Cox proportional hazards models, and restricted cubic splines are terms that I’ve learned recently. I love your blog; it’s helping me A LOT to understand and clarify basic and more advanced statistical concepts. I have bookmarked it and will be using it a lot!

Lizette

Jim Frost says

Hi Lizette, I often describe statistics as an adventure because it’s a process that leads to discoveries but it is filled with trials and tribulations! It sounds like you’re having an adventure! And, of course, we like having our “cool” terms in statistics! I don’t have blog posts on the procedures you mention, at least not yet.

I’m so glad my blog has been helpful in your journey! Thanks for taking the time to write. I really appreciate it!! 🙂

Lizette says

Hi, I’m trying to understand what “p linear” and “p non linear trend” mean. I have only taken basic statistics and I’m working on reviewing nutrition related research articles. thanks so much!

Jim Frost says

Hi Lizette,

The context matters, and I’m not sure what kind of analysis this is from. I’ve heard of those p-values in the context of time series analysis. In that scenario, these p-values help you determine whether the time series has a constant rate of change over time (p linear) or a variable rate of change over time (p nonlinear). The meaning of a linear trend is easy to understand because it represents a constant rate of change. Nonlinear trends are more nuanced because you might have a greater rate of change earlier, later, or in the middle. It’s not consistent throughout. You can also learn more from the combinations of the two p-values.

If the linear p-value is significant but nonlinear is not significant, you have a nice consistent rate of change (increase or decrease) over time.

If both p-values are significant, it would suggest a variable rate of change but one that has a consistent direction over time.

If neither p-value is significant, it suggests that the variable does not systematically tend to increase or decrease over time.

If the nonlinear p-value is significant but not the linear p-value, it suggests you have variable rates of change in the short term but in the long run there is no systematic increase or decrease in the variable.

I hope that helps!

Natalie says

How do you interpret a p-value that is displayed as P = 1.5 x 10^-19?

Jim Frost says

Hi Natalie,

That p-value is written using scientific notation. Scientific notation is a convenient way to represent very large and very small numbers. In your case, it represents a very small p-value. Yes, it’s significant!

The minus 19 indicates that you need to move the decimal point 19 places to the left.

Your p-value is much smaller than any reasonable significance level and, therefore, represents statistically significant results. You can reject the null hypothesis for whichever hypothesis test you are performing.

I hope this helps!

Shaheen Majeed Wani says

I am getting a P value of 0.351. Can you please explain it.

Amrita bhat says

My p value is 6.18694E-23. What does it mean? Is it significant?

Jim Frost says

Hi Amrita,

That p-value is written using scientific notation. Scientific notation is a convenient way to represent very large and very small numbers. In your case, it represents a very small p-value. Yes, it’s significant!

The number after the E specifies the direction and number of places to move the decimal point. For example, the negative 23 in “E-23” indicates you need to move the decimal point 23 places to the left. On the other hand, positive values indicate that you need to shift the decimal point to the right.

Your p-value is much smaller than any reasonable significance level and, therefore, represents statistically significant results. You can reject the null hypothesis for whichever hypothesis test you are performing.

I hope this helps!

Alfonso says

Thanks so much for your answer Jim!

Indeed I think we want to reach the same conclusion, but I’d like to see the results of the wrong approach to further cement my understanding since I’m not an expert in statistics (“seeing is believing!”)

In other words I entirely agree that it’s WRONG to keep testing the pvalue as the experiment runs. But how can I prove it empirically? My idea was to show to me and others that all tests that should fail to reject the null hypothesis (my list of A/A tests described above), can reject it if left to run long enough (in other words, every A/A test will have p < 0.05 if it runs long enough).

Is this statement correct? if not, why not?

Thank you!

alfonso says

Jim,

after reading again, I have one more question:

“If the medicine has no effect in the population as a whole, 3% of studies will obtain the effect observed in your sample, or larger, because of random sample error.”

– in online testing, is it correct to say that we cannot have sampling error? Since we always compare (for a limited time) the entire population in A and the entire population in B? If yes, how does that affect the interpretation of pvalue?

Jim Frost says

Hi again Alfonso,

It’s still a sample. While you might have tested the entire population for a limited time, you are presumably applying the results outside of that time span. That’s also why you need to be careful about the time frame you choose. Is it representative? You can only apply the results beyond the sample (i.e., to other times) if your sample is representative of those other times. If you collect data all day Sunday, you might be capturing different users than you’d get during the week. If that’s true, you wouldn’t be able to generalize your Sunday sample to other days. Same if you were to collect data during only a specific time of day.

In your context, you’re still collecting a sample whose results you want to generalize beyond the sample. So, you’d need to use hypothesis testing for that reason. You should also ensure that your sampling method collects a representative sample.

I hope this helps! I’m glad that you’re hooked and reading!!

alfonso says

Dear Jim,

A colleague just shared your blog with me and after 2 posts I’m hooked. I will read more today.

I use ttest and pvalues in the domain of web and app A/B testing and I’ve read everything I could find online but I still wasn’t sure I understood.

I built an A/A simulator in Python and I got a lot more statistically significant results than 5%, so I’m confused.

Just for clarity, I call an A/A test a randomised experiment where both series use the same success rate in %.

Even after reading your article alpha and pvalue are still somehow overlapping for me. I’ll keep reading your article to further clarify.

I have 3 questions that I hope you can answer:

– what would the graph look like of plotting the pvalue of 20 A/A tests over time? I would expect the pvalue to swing widely in the beginning and then stay firmly above 0.05 and every so often a test would go to statistical significance for a while and then come back above 0.05. I would expect 1 or max 2 statistically significant experiments *at any given point in time* (this is crucial in my understanding) after a sample size big enough has been reached

– is it true that if I keep collecting samples every single A/A test will eventually turn statistically significant even if just briefly?

– given that I will run hundreds or thousands of tests, is there an accepted standard way to build my analysis framework to guarantee a 5% false positive rate? I was thinking all I needed was to set the sample size at the start to avoid falling into the trap I asked about in the previous question, but now I’m not so sure anymore. (I use a well known tool online to calculate my sample size based on base conversion rate and observable absolute or relative difference)

I will keep reading but if you talk about any of this in details in any other article I would be grateful if you could share the link and if you haven’t covered these topics I hope you might do so in the future.

Jim Frost says

Hi Alfonso,

I know enough about the context of online A/B testing to know that it is often fairly different from how we’d teach using hypothesis tests in statistics.

For statistical studies, you’d select a sample size, collect your sample, perform the experiment, and analyze the results just once. You wouldn’t keep collecting data and performing the hypothesis test with each new observation. The risk with continuing to perform the analysis as you collect the data is that, yes, you are very likely to get an anomalous significant result at some point. I don’t recommend the process you describe of plotting p-values over time. Pick a sample size, stick with it, and calculate the results only at the end.

Also, be aware that different types of users might be online at different times and days of the week due to things like age, employment status, and time zone. Use a sampling plan that gets a good representative sample. Otherwise your results might apply only to a subset.

If you follow the standard rule of collecting the sample of a set size and analyzing the data once at the end, then your false positives should equal your significance level. If you’re checking the p-values repeatedly or keep testing until you get a significant p-value, that will dramatically increase your false positive rate.
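A small simulation sketch along these lines (an A/A setup with an invented 10% conversion rate and a two-proportion z-test; the helper function, batch sizes, and counts are all illustrative): analyzing once at a fixed sample size keeps the false positive rate near alpha, while checking the p-value after every batch and stopping at the first p < 0.05 inflates it dramatically.

```python
# Hypothetical A/A simulation: both arms share the SAME conversion rate,
# so every "significant" result is a false positive. Compare one analysis
# at the final sample size against peeking after every batch.
import math
import random

def two_prop_p(succ_a, succ_b, n):
    """Two-sided two-proportion z-test p-value (normal approximation)."""
    pool = (succ_a + succ_b) / (2 * n)
    se = math.sqrt(2 * pool * (1 - pool) / n)
    if se == 0:
        return 1.0
    z = abs(succ_a - succ_b) / (n * se)
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

random.seed(42)
RATE, BATCH, CHECKS, SIMS = 0.10, 100, 30, 400

fixed = peeking = 0
for _ in range(SIMS):
    a = b = n = 0
    stopped_early = False
    for _ in range(CHECKS):
        a += sum(random.random() < RATE for _ in range(BATCH))
        b += sum(random.random() < RATE for _ in range(BATCH))
        n += BATCH
        if two_prop_p(a, b, n) < 0.05:
            stopped_early = True  # a peeker would declare a winner here
    fixed += two_prop_p(a, b, n) < 0.05  # single look, at the end
    peeking += stopped_early

print(f"fixed-n false positive rate: {fixed / SIMS:.3f}")   # near 0.05
print(f"peeking false positive rate: {peeking / SIMS:.3f}")  # much higher
```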

Finally, I’ve heard that some A/B testing uses one tailed testing. Don’t do that! For more details, read my post about when you should use one-tailed testing.

Damodar Suar says

I have read the comments. I am not a specialist in statistics, but I use statistics in my research. Let us come to the application of p, at least to t and r. In each case a study is conducted and the results are significant at the 5% level (p <= .05). A t test assesses the mean difference in the wage rate of females in two locations, X and Y (the means indicate Y has a higher value); r indicates the relation between depression and low exam marks among students. In each case the sample size is 100. It may be understood that in research, we test a directional alternate hypothesis, not the null (which is obviously the no-difference or no-relation hypothesis, the opposite of the alternate hypothesis). Taking the p into account, how will we give a convincing interpretation or linguistic expression so that a non-expert can understand it? Linking it to false positives and error may not be understood by a common man. Please reply. Does it mean the following in the context of t and r, respectively?

t – There is a 95% chance that the wages in location Y are higher than in location X and a 5% chance that the difference is not there.

r – The relation between anxiety and low exam marks holds good in 95% of cases and does not hold good in 5% of cases.

Jim Frost says

Hi Damodar,

Many of the answers to your questions are right here in this post. So, you might want to reread it.

P-values represent the strength of the evidence that your sample provides against the null hypothesis. That’s why you use the p-value in conjunction with the significance level to determine whether you can reject the null. Hypothesis testing is all about the null hypothesis and whether you can reject it or fail to reject it.

Coming up with an easy to understand definition of p-values is difficult if not impossible. That’s unfortunate because it makes it difficult to interpret correctly. Read my post, why are p-values misinterpreted so frequently for more on that.

As for your interpretation, those are the common misconceptions that I refer to in this post. So, please reread the sections where I talk about the common misconceptions! P-values are NOT the probability that either of the hypotheses are correct!

P-values are the probability of obtaining the observed results, or more extreme, if the null hypothesis is correct.

Himani Narula says

Hi Jim,

Thanks for the prompt reply. I have a fair understanding now. Please tell me if I am wrong when I say that for a statistically significant result, if my null hypothesis were true, I would expect the measure under consideration to be at least as large as the one observed in my study.

I came to this conclusion by comparing my p value with alpha: if the p value lies in the critical region, we reject the null hypothesis, and vice versa. Now that you have stated that for a single study we can’t say the false positive error rate is alpha, how are we comparing alpha and the p value to draw conclusions?

Jim Frost says

Hi again Himani!

If you have read it already, read my post about p-values and significance levels. I think that will answer many of your questions.

A statistically significant result indicates that IF the null hypothesis is true, you’d be unlikely to obtain the results that your study actually obtained. Statistical significance and p-values relate to the probability of obtaining your observed data IF the null is true. Always note that the probability is based on the assumption that the null is true.

You can think of the significance level as an evidentiary standard. It describes how strong the evidence must be for you to be able to conclude using your sample that an effect exists in the population. The strength of the evidence is defined in terms of how probable is your observed data if the null is true.

The p-value represents the strength of your sample evidence against the null. Lower p-values represent stronger evidence. Like the significance level, the p-value is stated in terms of the likelihood of your sample evidence if the null is true. For example, a p-value of 0.03 indicates that the sample effect you observe, or more extreme, had a 3% chance of occurring if the null is true.

So, the significance level indicates how strong the evidence must be while the p-value indicates how strong your sample evidence actually is. If your sample evidence is stronger than the evidentiary standard, you can conclude that the effect exists in the population. In other words, when the p-value is less than or equal to the significance level, you have statistically significant results, you can reject the null, and conclude that the effect exists in the population.

Please do read the other post if you haven’t already because I show how this works graphically and I think it’s easier to understand in that format!

Himani Narula says

Hi Jim,

Your blog has been of great help. It would be great if you could explain a bit further how alpha (false positive) is different from the false positive rate (0.23) mentioned in the post, and the role of simulation in this case.

Big help!

Thank you

Jim Frost says

Hi Himani,

Thanks for writing with the excellent question. I can see how these two errors sound kind of similar, but they’re actually very different!

The Type I error rate is the probability of rejecting the null hypothesis when it is actually true. It is a probability that applies to a class of studies. For an alpha of 0.05, it applies to studies with that alpha level and where the null is true. You can say that 5% of all studies that have a true null will have statistically significant results when alpha = 0.05. However, you cannot apply that probability to a single study. For example, for a statistically significant study at the alpha = 0.05 level, you CANNOT state that there is a 5% chance that the null is true. You cannot obtain that probability for a single study using alpha, p-values, etc. with Frequentist methodologies. The reason you can’t apply it to a single study is that you don’t know whether the null is true or false, and the Type I error rate only applies when the null is true.

The error rates based on simulation studies and Bayesian methodology can be applied to individual studies, at least in theory. However, to get precise probabilities you’ll need information that you often won’t have. Using these methodologies, you can take the p-value of an individual study and estimate the probability that the particular study is a false positive. That said, I don’t want you to get too wrapped up in mapping p-values to false positive rates. You’ll need to know the prior probability, which is often unknown. However, the gist is that the common misinterpretation of p-values underestimates the chance of a false positive. Also, a p-value near 0.05 often represents weak evidence even though it is statistically significant.
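A sketch of the arithmetic behind that point (the prior and power values below are invented for illustration): applying Bayes’ theorem to all studies that reach significance shows how the false positive probability depends on the prior probability of a real effect, not just on alpha. Note that this simple version pools all significant results; the rate for p-values sitting right around 0.05, as discussed in the post, is higher still.

```python
# Among significant studies, what fraction are false positives?
# By Bayes' theorem, with illustrative alpha and power values:
def false_discovery_rate(prior, alpha=0.05, power=0.80):
    """P(null is true | significant result) given the prior and power."""
    false_pos = alpha * (1 - prior)  # true-null studies that reach significance
    true_pos = power * prior         # real effects that are detected
    return false_pos / (false_pos + true_pos)

for prior in (0.5, 0.25, 0.1):
    print(f"prior = {prior}: FDR = {false_discovery_rate(prior):.3f}")
# prior = 0.5: FDR = 0.059
# prior = 0.25: FDR = 0.158
# prior = 0.1: FDR = 0.360
```

Notice that when real effects are rare (prior = 0.1), over a third of significant results are false positives, even with alpha fixed at 0.05.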

I hope this clarifies matters!

Edward says

Hi Jim,

Thanks for helpful posts. I have been browsing your blog for some time now and I gained a lot.

One quick question:

What happens if the null hypothesis is rejected based on the t-test, but we can’t do so by looking at the p-value?

I know one is derived from the other statistic. But which one should we look at first in order to say something about the null hypothesis: the t-statistic or the p-value in the t-test?

The same applies to ANOVA as well.

Which one do we look at first? Whether the Significance F is less than the F statistic, or the p-value alone?

Jim Frost says

Hi Edward,

You can either reject the null hypothesis by determining whether the test statistic (t, F, chi-square, etc.) falls into the critical region or by comparing the p-value to the significance level. These two methods will always agree. If the test statistic falls within the critical region, then the p-value is less than or equal to the significance level.

Because the two methods are 100% consistent, you can use either one to evaluate statistical significance. You don’t need to use both methods, except maybe when you’re learning about how it all works. Personally, I find it easiest just to look at the p-value.
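A quick sketch of that equivalence using a z-test (chosen only because Python’s standard library ships the normal distribution; the same logic holds for t, F, and chi-square):

```python
# "Statistic falls in the critical region" and "p <= alpha" are the same
# decision, because both come from the same sampling distribution.
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()
critical_value = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96, two-sided

for z in (0.5, 1.5, 2.5, 4.0):
    p = 2 * (1 - std_normal.cdf(abs(z)))
    in_critical_region = abs(z) >= critical_value
    significant = p <= alpha
    assert in_critical_region == significant  # the two methods always agree
    print(f"z = {z}: p = {p:.4f}, significant = {significant}")
```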

To see how both methods work, read my posts about how hypothesis tests work, how t-tests work, and how the F-test works in one-way ANOVA.

I hope this helps!

Yash Guleria says

Hi Jim,

I have 3 p values .. 0, 2E-12 and 3.2E-316.

I don’t know what is wrong, but how do I interpret these values?

Jim Frost says

Hi Yash,

Those p-values are written using scientific notation. Scientific notation is a convenient way to represent very large and very small numbers. In your case, these represent very small p-values.

The number after the E specifies the direction and number of places to move the decimal point. For example, the negative 12 in “E-12” indicates you need to move the decimal point 12 places to the left. On the other hand, positive values indicate that you need to shift the decimal point to the right.

These values are smaller than any reasonable significance level and, therefore, represent statistically significant results. You can reject the null hypothesis for whichever hypothesis test you are performing.

I hope this helps!

adamson okunmwendia says

you are good jim. you are the best

Trent says

Jim, thank you so, so much for your patience and help over the past week. I think I can finally say that I get it. Not easy to keep everything straight, but your simplistic breakdown in your most recent post really helped to clear everything up. Even though I previously read about p-values and type I errors from your other blog posts, I guess I needed to re-hear/re-think those tricky concepts in a variety of different ways to finally absorb them. I finally feel comfortable enough to share these cool insights with my research peers, and I’ll point them to your blog for extra stats goodies!

Thank you so much, again. I’m slowly making my way through your blog (trying to balance grad school at the same time); I look forward to your other posts!

aloha

trent

P.S. Please do email me about the notification issue, I don’t believe I received an email from you yet. Your blog has really helped me get a better grasp of stats (I found your blog from your chocolate vs mustard analogy for interaction analyses, that was brilliant!), and so I’d be more than happy to help with the notification issue in any way I can.

Jim Frost says

Hi Trent,

You’re very welcome! P-values are a very confusing concept. Somewhere in one of my posts, I have a link to an external article that shows how even scientists have a hard time describing what they are! They’re not intuitive. And, when you conduct a study, p-values really aren’t exactly what you want them to be. You want them to be the probability that the null is true. That would be the most helpful. Unfortunately, they’re not that–and they can’t be that. I’m not sure if you read it, but I’ve written a post about why p-values are so easy to misunderstand. Despite these difficulties, p-values provide valuable information. In fact, as I write in an article, there’s a relationship between p-values and the reproducibility of studies.

Just a couple more p-value posts to read, if you’re interested and haven’t read them already!

Best of luck with grad school! I’m sure you’ll do great!

By the way, I did email you. If you haven’t received it, that’s odd! I will try again from a different email address over the weekend.

A says

Jim… I cannot explain how many videos I have watched and articles I have read to try and understand this and you just cleared it all up with this. Saved my life. Thank you, thank you, thank you.

Jim Frost says

You’re very welcome! Presenting statistics in a clear manner is my goal for the website. So, it makes my day when I hear that my articles have actually helped people! Thanks for writing!

Trent says

Hi Jim, thank you so much for your reply! I’m sorry I wasn’t able to check back in until now. It seems that I still haven’t been able to connect the final pieces of the puzzle, based on your response to: “Thus, for a sample statistic assessed by a large group of similar studies, a P<0.05 would translate to a Type I error rate of <5%."

This is where I'm getting stuck:

Prior to a study, researchers typically set their significance level (alpha level) to 0.05. Researchers will then compare their p-value to the alpha level of 0.05 to determine if their results are statistically significant. If P<0.05, then the results are statistically significant at an alpha level of 0.05, which by extension means that the results have a 5% or lower probability of being a false positive (since the alpha level was set to 0.05, and alpha level = probability of a false positive), right? If this is all true, then a P<0.05 for a study with a significance level of 0.05 does not have a false positive probability of 23% (and typically close to 50%)… it has a 5% or lower probability of being a false positive.

That said, based on your article, I know I'm messing up my logic somewhere, but I can't figure out where…

aloha

trent

P.S. I double checked my gmail spam & trash folders and there were no notification emails of any of your replies.

Jim Frost says

Hi Trent,

I’m going to send you an email soon about the notification issue. So, be on the lookout for that.

I think part of the confusion is over the issue of single studies versus a group of studies. Or, relatedly, a single p-value versus a range of p-values. Alpha is a range of p-values and applies to a group of studies. All studies (the group) that have p-values less than or equal to 0.05 (range of p-values) have a Type I error rate of 0.05. That error rate applies to the group of studies. You can’t apply it to a single study (i.e., a single p-value).

A single p-value for a single study is not that type of error rate at all. It represents the probability of obtaining your sample if the null is true. In other words, the p-value calculations begin with the assumption that the null is true. Therefore, you cannot use the p-value to determine the probability that the null (or alternative) hypothesis is true. In other words, you can’t map p-values to the false positive rate.

So, when you say “If P<0.05, then the results are statistically significant at an alpha level of 0.05, which by extension means that the results have a 5% or lower probability of being a false positive (since the alpha level was set to 0.05, and alpha level = probability of a false positive)," that's not true. For one thing, the p-value assumes the null *is* true. For another, the group of studies as a whole has an error rate of 0.05, but you don't know the error rate for an individual study. Additionally, you just don't know whether the null is true or false. The error rate only applies to studies where the null is true. And, the p-value calculations assume the null is true. But, you don't know for sure whether it is true or not for any given study.

Let’s go back to what I said about p-values being the “devil’s advocate” argument. For any treatment effect that you observe in sample data, you can make the argument that the effect is simply random sampling error rather than a true effect. The p-value essentially says, “OK, let’s assume the null is true. How likely were we to observe these results in that case?” If the probability is low, you were unlikely to obtain that sample if the null is true. It pokes a hole in the devil’s advocate argument. It’s important to remember that p-values are a probability related to obtaining your data assuming the null is true and *not* a probability that the null is true. You’re trying to equate p-values to the probability of the null being true–which is not possible with the Frequentist approach.

Trent says

Hi Jim,

Thank you for your reply. The two other articles you linked were really helpful. I think I’m almost there with understanding the whole picture. May I clarify my current understanding with you?

Alpha applies to a group of similar studies, thus we can’t directly translate the p-value of a single study to the Type I error rate for a given hypothesis. However, using simulation studies or Bayesian methods, we can estimate the Type I error rate–from a single study–for a P=0.05 sample statistic to 23% (and typically close to 50%).

That said, in order to estimate the Type I error rate directly using alpha (and P-values), we need to see the results from a group of similar studies (ie meta-analysis). Thus, for a sample statistic assessed by a large group of similar studies, a P<0.05 would translate to a Type I error rate of <5%.

How did I do?

aloha

trent

P.S. I'm unsure how the "Notify me of new comments via email" function is supposed to work on your blog, but it didn't notify me via email of your reply. So I had no idea that you replied to my comment until I checked back on this post.

Jim Frost says

Hi Trent,

I’m glad the other articles were helpful! There’s actually quite a bit to understand about p-values. It’s possible to come up with a brief definition, but even a brief definition implies a thorough knowledge of underlying concepts! I will look into the Notify function. It should email you. I’ll hunt around in the settings to be sure, but I believe it is set up to send emails. Is there a chance it went to your junk folder?

Yes! That’s *very* close! Just a couple of minor quibbles and clarifications. I wouldn’t say that you use simulation and Bayesian methods to estimate the Type I error rate. That rate is specific to the hypothesis testing framework, and it applies to a group of similar studies. Alpha = the Type I error rate, and both apply to a group of studies. Simulation studies and Bayesian methods can help you take a P-value from an individual study and estimate the probability of a false discovery (or false positive). P-values relate to individual studies, and the probability of a false positive applies to that individual study. So, we’ve moved from probabilities for a group of studies (alpha/Type I error) to the probability of a false positive for an individual study. To make that shift from a group to an individual study, we must switch methodologies because the Frequentist method cannot calculate the false discovery rate for a single study.

An important note: for simulation studies or Bayesian methodology to estimate the false discovery rate, you need additional information beyond just the sample data. You need an estimate of the probability that the alternative hypothesis is true at the beginning of the study. This is known as the prior probability in Bayesian circles. To develop this probability, you already need to know and incorporate external information into the calculations. This information can come from a group of similar studies, as you mention. This probability, along with the P-value, affects the false discovery rate. That’s why there is a range of values for any given P-value; there is no direct, fixed mapping of p-values to the false discovery rate. A criticism of the prior probability is that it is only an estimate. Presumably, the researchers are performing a study because they’re not sure whether the alternative is true or not.
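As a sketch of the kind of calculation involved (the prior and power values below are hypothetical illustrations, not from any real study), Bayes' rule shows how strongly the false discovery rate for significant results depends on the prior probability:

```python
def false_discovery_rate(prior_alt, alpha=0.05, power=0.80):
    """P(null is true | significant result), given a prior probability
    that the alternative is true, a significance level, and power.
    All inputs here are illustrative assumptions."""
    false_pos = alpha * (1 - prior_alt)   # null true AND significant
    true_pos = power * prior_alt          # alternative true AND significant
    return false_pos / (false_pos + true_pos)

for prior in (0.1, 0.5, 0.9):
    fdr = false_discovery_rate(prior)
    print(f"prior P(alternative) = {prior}: false discovery rate = {fdr:.3f}")
```

The same significance level yields very different false discovery rates as the prior moves, which is exactly why there is no fixed mapping from a p-value to an error rate.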

It’s not clear to me what you mean in your sentence, “Thus, for a sample statistic assessed by a large group of similar studies, a P<0.05 would translate to a Type I error rate of <5%." I'll assume you're referring to a p-value from a meta-analysis. In that case, it still depends on the prior probability. If the prior probability is very high, the false discovery rate will be low. Conversely, if the prior probability is low, the false discovery rate will be higher. You can't state a general rule like the one in your sentence.

Thanks for writing with the interesting questions!

Trent says

Hi Jim, wonderful post! A lot to chew on. May I clarify a point of confusion?

I’ve been taught that alpha is the probability of committing a Type I error. In addition, studies typically set alpha to 0.05, and beta to 0.20 (giving a power of 0.8).

Based on your article, this must be false. A true statement should read:

“Studies typically set the P-VALUE cut-off to 0.05, and beta to 0.20 (giving a power of 0.8).”

Logically following, this means that alpha is generally not set to anything. And for a study with a p-value cut-off of 0.05, the alpha would actually be about 0.23 (and typically close to 0.50).

Is my understanding correct?

aloha

trent

Jim Frost says

Hi Trent,

It’s correct that alpha (aka the significance level) represents the probability of a Type I error. Hypothesis tests are designed so that researchers can set that value. However, it’s not possible to set beta. You can estimate beta using a power analysis. Power is just 1 – beta. However, power analyses produce estimates, not something you’re technically setting like you do with alpha. I write more about this in my post about Type I and Type II errors.

I definitely understand your confusion regarding p-values and alpha. The important thing to keep in mind is that alpha really applies to a class of studies. Of all studies that use an alpha of 0.05 and the null is true, you’d expect to obtain significant results (i.e., a false positive) in 5% of those cases.

P-values represent the strength of the evidence against the null for an individual study. You can state it as being the probability of obtaining the observed outcome, or more extreme, if the null is true. However, you can’t state that it is the probability of the null being true. It’s the probability of the outcome if you assume the null is true (which you don’t really know for sure). Not the probability of whether the null is true.

I think, based on what you write, you might be confusing that issue (re: alpha actually being 0.23). Both P-values and alpha relate to cases where the null is true, which you don’t know. The false positive error rates, which I think you’re getting at and which I write about at the end, deal with the probability of the null being true. In the former, you’re assuming the null is true, while in the latter, you’re calculating the probability of whether it is true. Using the Frequentist approach (p-values, alpha), you cannot calculate the probability of the null being true. However, you can do that using simulation studies and sometimes using Bayesian methods.

I always think this is a bit easier to understand using graphs and so highly recommend reading my post about p-values and the significance level, which primarily uses graphs.

I hope this helps!

YIHENEW says

Thank you. You give me good insight

David says

Awesome read! How would sample size affect the True Error rate? I would assume since p-values tend to become smaller as sample size increases, that would also effectively reduce the True Error rate since you are more confident about the population (assuming True Error means type I and type II errors).

Jim Frost says

Hi David, Thanks, and I’m glad you enjoyed the article!

There are two types of errors in hypothesis testing. So, let’s see how changing the sample size affects them. You might want to read my article about Type I and Type II Errors in Hypothesis Testing.

There are three basic components for calculating p-values: the effect size, the variability in the data, and the sample size. For the sake of discussion, let’s hold the effect size and the variability constant and just increase the sample size. In that case, you would expect the p-values to decrease. Frequentists will cringe at this, but lower p-values are associated with lower false discovery rates (Type I errors). Additionally, increasing the sample size while holding the other two factors constant will increase the power of your test. Power is just (1 – Type II error rate). So, you’d expect the Type II errors (false negatives) to decrease. Increasing the sample size is good all around because it lowers both types of error *for a single study*! I explain the italicized text later!

However, a couple of important caveats for the above. Of course, as I point out in this article, you can’t calculate any error rates from the p-value using the Frequentist approach. There’s no direct mapping from p-values to an error rate. You can use simulation studies and the Bayesian approach to estimate the false positive rate from the p-value. However, this requires an estimate of the a priori probability that the alternative hypothesis is correct. That information might be hard to obtain. After all, you’re conducting the study because you don’t know. Additionally, it’s always difficult to calculate the Type II error rate. So, while you can say that increasing the sample size should reduce both Type I and Type II errors, you don’t really know what they are! By the way, in a related vein, you might want to read how P-values correlate with the reproducibility of scientific studies.

Let’s return to Frequentist approach because there’s another side of things that isn’t obvious. In contrast with the earlier example for an individual study, the Frequentist approach talks about the Type I errors not for an individual study but for a class of studies that use the same significance level. A result is statistically significant when the p-value is less than the significance level. The significance level equals the Type I Error for all studies that use a particular significance level. For example, 5% of all studies that use a significance level of 0.05 should be false positives. Of course, when you see significant test results, you don’t know for sure which ones are real effects and which ones are false discoveries.

Let’s now hold the other two factors constant but *reduce* the sample size. Let’s reduce it enough so that you have low power for detecting an effect. As your statistical power decreases, your test is less likely to detect real effects when they exist (the Type II error rate increases). However, the hypothesis test controls, or holds constant, the Type I error rate at your significance level. That’s built into the test. If you have a low-power hypothesis test, the test’s ability to detect a real effect is low, but its false positive rate remains the same. Consequently, when you obtain statistically significant results from a test with low power, you need to be wary because the result is relatively likely to be a false positive and less likely to represent a real effect.

That’s probably more than what you wanted, but it’s a fascinating topic!
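Both effects show up in a quick simulation sketch (pure Python, using a hypothetical one-sample z-test; the effect size of 0.3 and the sample sizes are made up for illustration): power climbs as the sample size grows, while the false positive rate stays pinned near alpha.

```python
import math
import random

random.seed(1)

def reject_null(n, true_mean, alpha=0.05):
    """One simulated study: two-sided z-test on n draws from N(true_mean, 1)."""
    sample = [random.gauss(true_mean, 1) for _ in range(n)]
    z = (sum(sample) / n) * math.sqrt(n)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p < alpha

trials = 5000
results = {}
for n in (10, 100):
    power = sum(reject_null(n, 0.3) for _ in range(trials)) / trials  # real effect exists
    fp = sum(reject_null(n, 0.0) for _ in range(trials)) / trials     # null is true
    results[n] = (power, fp)
    print(f"n={n}: power ~ {power:.2f}, false positive rate ~ {fp:.2f}")
```

The false positive rate hovers around 0.05 at both sample sizes because the test holds it at alpha by construction; only the power (1 minus the Type II error rate) responds to the larger sample.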

Tetyana says

Dear Jim, thank you very much for you posts!

Does it mean that after I have obtained some small p-value, I have to do some other tests?

Jim Frost says

Hi Tetyana,

After you obtain a small p-value, you can reject the null hypothesis. You don’t necessarily need to perform other tests. I just want analysts to avoid a common misinterpretation. Obtaining a statistically significant result is still a good thing, but you have to keep in mind what it really represents.

Ahmad Allam says

Thank you.

Ahmad Allam says

Thank you very much. That was very reassuring. Appreciated.

How could I record this result in a scientific manuscript?

Jim Frost says

Hi Ahmad,

I think it’s perfectly acceptable to report such a small p-value using the scientific notation that is in your output. The other option would be to report it as a regular value by moving the decimal point 16 places to the left, but that takes up so much more room. So, I’d use scientific notation. It’s there to save space for extremely small and large values depending on the context.

Ahmad Allam says

Hi Jim. Thanks for this value post. But if you can help me on that, I got this result (6.79974E-16) ??? What that mean?

Appreciated.

Jim Frost says

Hi Ahmad,

That is called scientific notation. The E-16 in it indicates that you need to move the decimal point 16 digits to the left. That’s a very small value. Therefore, you have a very significant p-value!
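For anyone unfamiliar with the notation, here is a tiny sketch (using the value from the comment above) of how E-notation maps to an ordinary decimal:

```python
p = 6.79974e-16    # the reported p-value, in E (scientific) notation

# E-16 means: shift the decimal point 16 places to the left.
print(f"{p:.21f}")   # prints 0.000000000000000679974
print(p < 0.05)      # prints True: far below any conventional alpha
```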

Pamela Marcum says

What an awesome post! Should be required reading for all STEM students.

Jim Frost says

Thanks, Pamela. That means a lot to me!

Amit Kumar Sahoo says

Thanks, Jim, for your response. I think I got it.

Amit Kumar Sahoo says

Hi Jim,

Thanks for the post. Am little confused with the statement below

“If the medicine has no effect in the population as a whole, 3% of studies will obtain the effect observed in your sample, or larger, because of random sample error.”

Now as per defination

“P-values indicate the believability of the devil’s advocate case that the null hypothesis is true given the sample data. ”

So doesn’t that mean a higher P value means accepting the alternative hypothesis, since there is a higher probability of the alternative happening when the null is true? I’m not able to get my head wrapped around this concept.

Amit

Jim Frost says

Hi Amit,

Great question! So, the first thing to realize is that the null and alternative hypotheses are mutually exclusive. If the probability of the alternative being true is higher, then the probability of the null must be lower.

However, the p-value doesn’t indicate the probability of either hypothesis being true. This is a very common misconception. Anytime you start linking p-values to the probability that a hypothesis is true, you know you’re going in the wrong direction!

P-values represent the probability of obtaining the effect observed in your sample, or more extreme, if the null hypothesis is true. It’s a probability of obtaining your data assuming the null is true. Consequently, a low p-value indicates that you were unlikely to obtain the sample data that was collected if the null is true. In this manner, lower p-values represent stronger evidence against the null hypothesis. Lower p-values indicate that your data are less compatible with the null hypothesis.

I think this is easier to understand graphically. I have a link in this post to another post How Hypothesis Tests Work: Significance Levels and P-values. This post shows how it works with graphs. I’d recommend taking a look at it.

I hope this helps!

Khursheed statistics says

Hello sir, hope you are well.

I have no words for how many statistics concepts you have cleared up for me. I am really happy, and whatever you are uploading is awesome!

Jim Frost says

Hi Khursheed, I’m so happy to hear that you found this post to be helpful. Thanks for the encouraging words. They mean a lot to me!

naseer says

What should be the nature of the relationship of p values (especially Bonferroni corrected) with the Cohen’s d values for the same set of data?

Sean Saunders says

Jim, thanks for this post, but perhaps you could clarify something for me: assuming that H0 is true, if we set an alpha=0.05 level of significance and get a p-value less than that as the result of our sample data, wouldn’t that indicate, since less than 5% of samples would have such an effect due to random sample error, that there is only a 5% chance of getting such a sample, and thus, a 5% chance of rejecting the null hypothesis incorrectly? What am I missing here? Almost every stats book I’ve ever read has presented the concept this way (a type 1 error is even called an alpha-error!) Thanks for your feedback!

Jim Frost says

Hi Sean, thanks for your comment. Yes, you’re absolutely correct. The significance level (alpha) is the type I error rate. It’s the probability that you will reject the null hypothesis when it is true. However, the p-value is not an error rate. It’s a bit confusing because you compare one to the other.

In the post above, I provide a link to a post where I explain significance levels and p-values using graphs. I think it’s much easier to understand that way. I’ll explain below, but check that post out too.

Both alpha and p-values refer to regions on a probability distribution plot. You need an area under the curve to calculate probabilities. You can calculate probabilities for regions, but not a specific value.

That works fine for alpha. If the null is true, you expect sample values to fall in the critical regions X% of times based on the significance level that you specify. For p-values, the problems occur when you want to know the error rate for your specific study. You can’t do that for a single value from an individual study because you need an area under the curve.

The best you can say for p-values is: if the null is true, then you’d expect X% of studies to have an effect at least as large as the one in your study, where X = your P-value. Notice the “at least as large.” That’s needed to produce the range of values for an area under the curve. It also means you can’t apply the percentage to your specific study. You can apply it only to the entire range of theoretical studies that have an effect at least as large as yours. That range collectively has an error rate that equals the p-value, but not your study alone.
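The “at least as large” idea can be checked with a sketch (pure Python; the observed z statistic of 2.17 is hypothetical): the analytic p-value matches the share of simulated null-true studies whose effect is at least as extreme as the observed one.

```python
import math
import random

random.seed(7)

observed_z = 2.17   # a hypothetical study's z statistic

# Analytic two-sided p-value: area under the curve beyond |observed_z|.
p = 2 * (1 - 0.5 * (1 + math.erf(observed_z / math.sqrt(2))))

# Simulate many null-true studies and count effects at least as large.
trials = 200000
at_least_as_large = sum(
    abs(random.gauss(0, 1)) >= observed_z for _ in range(trials)
) / trials

print(f"analytic p-value: {p:.4f}, simulated share: {at_least_as_large:.4f}")
```

The two numbers agree because a p-value is an area, which is a statement about that entire range of more-extreme theoretical studies rather than about your single study by itself.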

Another thing to consider is that, within the range defined by the p-value, your study provides the weakest results because it defines the point closest to the null. So, the overall error rate for the range is largely based on theoretical studies that provide stronger evidence than your actual study!

In a similar fashion, if you reject the null for your study using an alpha = 0.05, you know that all studies in the critical region have a Type I error rate = 0.05. Again, this applies to the entire range of studies and not yours alone.

I hope this all makes sense. Again, read the other post and it’s easier to see with graphs.