In my post about how to interpret p-values, I emphasize that p-values are not an error rate. The number one misinterpretation of p-values is that they are the probability of the null hypothesis being correct.
The correct interpretation is that p-values indicate the probability of observing your sample data, or more extreme data, assuming the null hypothesis is true. If you don’t solidly grasp that correct interpretation, please take a moment to read that post first.
Hopefully, that’s clear.
Unfortunately, one part of that blog post confuses some readers. In that post, I explain how p-values are not a probability, or error rate, of a hypothesis. I then show how that misinterpretation is dangerous because it overstates the evidence against the null hypothesis.
For example, suppose you incorrectly interpret a p-value of 0.05 as a 5% chance that the null hypothesis is correct. You’ll feel that rejecting the null is appropriate because it is unlikely to be true. However, I later present findings that show a p-value of 0.05 relates to a false-positive rate of “at least 23% (and typically close to 50%).”
The logical question is, if p-values aren’t an error rate, how can you report those higher false positive rates (an error rate)? That’s a reasonable question and it’s the topic of this post!
A Quick Note about This Post
This post might be a bit of a mind-bender. P-values are already confusing! And in this post, we look at p-values from a different angle, using another branch of statistics and methodology. I’ve hesitated to write this post because it feels like a deep, dark rabbit hole!
However, the ideas from this exploration of p-values have strongly influenced how I view and use p-values. While I’m writing this post after other posts and an entire book chapter about p-values, the line of reasoning I present here strongly influenced how I wrote that earlier content. Buckle up!
Frequentist Statistics
Before calculating the false positive rate, you need to understand frequentist statistics, also known as frequentist inference. Frequentist statistics are what you learned, or are learning, in your Introduction to Statistics course. This methodology is a type of inferential statistics containing the familiar hypothesis testing framework where you compare your p-values to the significance level to determine statistical significance. It also includes using confidence intervals to estimate effects.
Frequentist inference focuses on frequencies that make it possible to use samples to draw conclusions about entire populations. The frequencies in question are the sampling distributions of test statistics. That goes beyond the scope of this post but click the related posts links below for the details.
Frequentist methodology treats population parameters, such as the population mean (µ), as fixed but unknown characteristics. There are no probabilities associated with them. The null and alternative hypotheses are statements about population parameters. Consequently, frequentists can’t say that there is such and such probability that the null hypothesis is correct. It either is correct or incorrect, but you don’t know the answer. The relevant point here is that when you stick strictly to frequentist statistics, there is no way to calculate the probability that a hypothesis is correct.
Related posts: How Hypothesis Tests Work, How t-Tests Work, How F-tests Work in ANOVA, and How the Chi-Squared Test of Independence Works
Why Can’t Frequentists Calculate those Probabilities?
There are mathematical reasons for that but let’s look at it intuitively. In frequentist inference, you take a single, random sample and draw conclusions about the population. The procedure does not use other information from the outside world or other studies. It’s all based on that single sample with no broader context.
In that setting, it’s just not possible to know the probability that a hypothesis is correct without incorporating other information. There’s no way to tell whether your sample is unusual or representative. Frequentist methods have no way to include such information and, therefore, cannot calculate the probability that a hypothesis is correct.
However, Bayesian statistics and simulation studies include additional information. Those are large areas of study, so I’ll only discuss the points relevant to our discussion.
Bayesian Statistics
Bayesian statistics can incorporate an entire framework of evidence that resides outside the sample. Does the overall fact pattern support a particular hypothesis? Does the larger picture indicate that a hypothesis is more likely to be correct before starting your study? This additional information helps you calculate probabilities for a hypothesis because it’s not limited to a single sample.
Simulation Studies
When you perform a study in the real world, you do it just once. However, simulation studies allow statisticians to perform simulated studies thousands of times while changing the conditions. Importantly, you know the correct results, enabling you to calculate error rates, such as the false positive rate.
Using frequentist methods, you can’t calculate error rates for hypotheses. There is no way to take a p-value and convert it to an error rate. It’s just not possible with the math behind frequentist statistics. However, by incorporating Bayesian and simulation methods, we can estimate error rates for p-values.
Simulation Studies and False Positives
In my post about interpreting p-values, I quote the results from Sellke et al. They used a Bayesian approach. But let’s start with simulation studies and see how they can help us understand the false positive rate. For this, we’ll look at the work of David Colquhoun, a professor in biostatistics, who lays it out here.
Factors that influence the false-positive rate include the following:
- Prevalence of real effects (higher is good)
- Power (higher is good)
- Significance level (lower is good)
“Good” indicates the conditions under which hypothesis tests are less likely to produce false positives. Click the links to learn more about each concept. The prevalence of real effects indicates the probability that an effect exists in the population before conducting your study. More on that later!
Let’s see how to calculate the false positive rate for a particular set of conditions. Our scenario uses the following conditions:
- Prevalence of real effects = 0.1
- Significance level (alpha) = 0.05
- Power = 80%
We’ll “perform” 1000 hypothesis tests under these conditions.
In this scenario, 100 of the 1000 tests involve a real effect (prevalence = 0.1), and the other 900 test a true null. With 80% power, the 100 real effects yield 80 true positives. With a significance level of 0.05, the 900 true nulls yield 0.05 × 900 = 45 false positives. The total number of positive test results is 45 + 80 = 125. However, 45 of those positives are false. Consequently, the false positive rate is:

45 / 125 = 0.36

Mathematically, calculate the false positive rate using the following:

False positive rate = α(1 – P(real)) / [α(1 – P(real)) + Power × P(real)]

Where alpha (α) is your significance level, P(real) is the prevalence of real effects, and Power is the statistical power of the test.
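As a quick check on that arithmetic, here is a minimal sketch in Python. The function name and defaults are my own, not part of Colquhoun’s work; it simply restates the calculation above.

```python
# Sketch of the false positive rate arithmetic for the 1,000-test scenario.
def false_positive_rate(alpha, power, prevalence, n_tests=1000):
    """Expected share of significant results that are false positives."""
    real_effects = n_tests * prevalence        # tests where an effect truly exists
    true_nulls = n_tests * (1 - prevalence)    # tests where the null is true
    true_positives = power * real_effects      # 0.80 * 100 = 80
    false_positives = alpha * true_nulls       # 0.05 * 900 = 45
    return false_positives / (false_positives + true_positives)

print(false_positive_rate(alpha=0.05, power=0.80, prevalence=0.1))  # 0.36
```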
Simulation Studies for P-values
The previous example and calculation incorporate the significance level to derive the false positive rate. However, we’re interested in p-values. That’s where the simulation studies come in!
Using simulation methodology, Colquhoun runs studies many times and sets the values of the parameters above. He then focuses on the simulated studies that produce p-values between 0.045 and 0.05 and evaluates how many are false positives. For these studies, he estimates a false positive rate of at least 26%. The 26% error rate assumes the prevalence of real effects is 0.5, and power is 80%. Decreasing the prevalence to 0.1 causes the false positive rate to jump to 76%. Yikes!
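To make that concrete, here is a rough sketch of the kind of simulation Colquhoun describes. The test type, sample sizes, and effect size are my own assumptions, chosen so the test has roughly 80% power; it is not his code, and the exact percentage it prints will vary with those choices and the prevalence you set.

```python
# Rough sketch of a Colquhoun-style simulation (my assumptions, not his code):
# run many two-sample t-tests where a known fraction of comparisons involve a
# real effect, then look only at the "just significant" results (p in [0.045, 0.05]).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_per_group, prevalence = 100_000, 16, 0.5
effect = 1.0  # standardized effect chosen to give roughly 80% power

has_effect = rng.random(n_sims) < prevalence
group1 = rng.normal(0.0, 1.0, size=(n_per_group, n_sims))
group2 = rng.normal(np.where(has_effect, effect, 0.0), 1.0,
                    size=(n_per_group, n_sims))
p_values = stats.ttest_ind(group1, group2, axis=0).pvalue

just_significant = (p_values >= 0.045) & (p_values <= 0.05)
false_positive_share = np.mean(~has_effect[just_significant])
print(f"False positives among p in [0.045, 0.05]: {false_positive_share:.0%}")
```

With a prevalence of 0.5, the printed share should land in the same general neighborhood as the figure quoted above; lowering the prevalence in the script pushes it much higher.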
Let’s examine the prevalence of real effects more closely. As you saw, it can dramatically influence the error rate!
P-Values and the Bayesian Prior Probability
The property that Colquhoun names the prevalence of real effects (P(real)) is what the Bayesian approach refers to as the prior probability. It is the proportion of similar studies in which the effect is present, that is, where the alternative hypothesis is correct. The researchers don’t know this, of course, but sometimes you have an idea. You can think of it as the plausibility of the alternative hypothesis.
When your alternative hypothesis is implausible, or similar studies have rarely found an effect, the prior probability (P(real)) is low. For instance, a prevalence of 0.1 signifies that 10% of comparable alternative hypotheses were correct, while 90% of the null hypotheses were accurate (1 – 0.1 = 0.9). In this case, the alternative hypothesis is unusual, untested, or otherwise unlikely to be correct.
When your alternative hypothesis is consistent with current theory, has a recognized process for producing the effect, or prior studies have already found significant results, the prior probability is higher. For instance, a prevalence of 0.90 suggests that the alternative is correct 90% of the time, while the null is right only 10% of the time. Your alternative hypothesis is plausible.
When the prior probability is 0.5, you have a 50/50 chance that either the null or alternative hypothesis is correct at the beginning of the study.
You never know this prior probability for sure, but theory, previous studies, and other information can give you clues. For this blog post, I’ll assess prior probabilities to see how they impact our interpretation of P values. Specifically, I’ll focus on the likelihood that the null hypothesis is correct (1 – P(real)) at the start of the study. When you have a high probability that the null is right, your alternative hypothesis is unlikely.
Moving from the Prior Probability to the Posterior Probability
From a Bayesian perspective, studies begin with varying probabilities that the null hypothesis is correct, depending on the plausibility of the alternative hypothesis. This prior probability affects the likelihood that the null is true at the end of the study, which is the posterior probability.
If P(real) = 0.9, there is only a 10% probability that the null is correct at the start. Therefore, the chance that the hypothesis test rejects a true null at the end of the study cannot be greater than 10%. However, if the study begins with a 90% probability that the null is right, the likelihood of rejecting a true null escalates because there are more true nulls.
The following table uses Colquhoun and Sellke et al.’s calculations. Lower prior probabilities are associated with lower posterior probabilities. Additionally, notice how the likelihood that the null is correct decreases from the prior probability to the posterior probability. The precise value of the p-value affects the size of that decrease. Smaller p-values cause a larger decline. Finally, the posterior probability is also the false positive rate in this context because of the following:
- the low p-values cause the hypothesis test to reject the null.
- the posterior probability indicates the likelihood that the null is correct even though the hypothesis test rejected it.
| Prior probability of a true null (1 – P(real)) | Study p-value | Posterior probability of a true null (false positive rate) |
|---|---|---|
| 0.5 | 0.05 | 0.289 |
| 0.5 | 0.01 | 0.110 |
| 0.5 | 0.001 | 0.018 |
| 0.33 | 0.05 | 0.12 |
| 0.9 | 0.05 | 0.76 |
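For the rows with a 0.5 prior probability, the posterior probabilities follow from Sellke et al.’s –e·p·ln(p) bound on the Bayes factor. Here is a minimal sketch of that calculation in Python (my own translation of their formula; the 0.33 and 0.9 rows draw on Colquhoun’s simulations, so this function won’t reproduce them exactly).

```python
# Minimal sketch of the Sellke et al. -e*p*ln(p) calibration (my translation).
import math

def posterior_prob_true_null(p_value, prior_null=0.5):
    """Lower bound on the probability that the null is true, given the p-value."""
    # Bound on the Bayes factor in favor of the null; valid for p < 1/e.
    bayes_factor = -math.e * p_value * math.log(p_value)
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

for p in (0.05, 0.01, 0.001):
    # Roughly 0.29, 0.11, and 0.018: the 0.5-prior rows above, within rounding.
    print(p, round(posterior_prob_true_null(p), 3))
```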
Safely Using P-values
Many combinations of factors affect the likelihood of rejecting a true null. Don’t try to remember these combinations and false-positive rates. When conducting a study, you probably will have only a vague sense of the prior probability that your null is true! Or maybe no sense of that probability at all!
Just keep these two big takeaways in mind:
- A single study that produces statistically significant test results might provide only weak evidence that the null is false, especially when the p-value is close to 0.05.
- Different studies can produce the same p-value but have vastly different false-positive rates. You need to understand the plausibility of the alternative hypothesis.
Carl Sagan’s quote embodies the second point, “Extraordinary claims require extraordinary evidence.”
Suppose a new study has surprising results that astound scientists. It even has a significant p-value! Don’t trust the alternative hypothesis until another study replicates the results! As the last row of the table shows, a study with an implausible alternative hypothesis and a significant p-value can still have an error rate of 76%!
I can hear some of you wondering: OK, both Bayesian methodology and simulation studies support these points about p-values, but what about empirical research? Does this happen in the real world? Yes. A study that examines the reproducibility of results from real experiments supports it all. Read my post about p-values and the reproducibility of experimental results.
I know this post might make p-values seem more confusing. But don’t worry! I have another post that provides simple recommendations to help you navigate P values. Read my post: Five P-value Tips to Avoid Being Fooled by False Positives.
Hi Jim, thank you for this explanation. I have one question. It is probably a dumb question, but I am going to ask it anyway…
Suppose I define the alpha as 5%. Does this mean that I have decided to reject the null hypothesis if p<0.05? Or when I define alpha as 5% I could use another threshold for the p-value?
Hi Carolina,
Yes, that’s correct! Technically, you reject the null if the p-value is less than or equal to 0.05 when you use an alpha of 0.05. So, basically what you said, but it’s less than or equal to.
I found this blogpost by googling for “significance false positive rate”. I noticed that what you call “false positive rate” is apparently called “false discovery rate” elsewhere. According to Wikipedia, the false positive rate is the number of false positives (FP) divided by the number of negatives (TN + FP). So FP is _not_ divided by the number of positives (TP + FP); doing this, you would get (according to Wikipedia) just the “false discovery rate”.
https://en.wikipedia.org/wiki/False_positive_rate
https://en.wikipedia.org/wiki/False_discovery_rate
Now I fully understand that the p value is not the same as the false discovery rate, as you correctly show. But how is the p value related to the false positive rate (defined as FP/(TN + FP))?
Hi Andreas,
The False Discovery Rate (FDR) and the False Positive Rate (FPR) are synonymous in this context. In statistics, one concept will sometimes have several different names. For example, alpha, the significance level, and the Type I error rate all mean the same thing!
As you have found, analysts from different backgrounds will sometimes use these terms differently. It does make it a bit confusing! That’s why it’s good practice to include the calculations, as I do in this post.
Thanks for writing!
Many moons ago, when I was a junior electrical engineer, I wrote a white paper (for the US Navy). At the time, there was a big push to inject all sorts of Built-In Test (BIT) and Built-in Test Electronics (BITE) into avionics (i.e., aircraft weapon systems). The rapid pace of miniaturization of electronics made this a very attractive idea. In the paper I recommended we should slow down and inject BIT/E very judiciously, mainly for the reasons illustrated in your post.
Specifically, if the actual failure rate of a weapon system is very low (i.e., the Prevalence of Real Effects is very small), and the Significance Level is too large, we will get a very high False Positive rate, which will result in the “pulling” of numerous “black boxes” for repair that don’t require maintenance. (BTW, this is what, in fact, happened. The incidence of “No Fault Found” on systems sent in for repair has gone up drastically.)
And the Bayesian logic illustrated above is why certain medical diagnostic tests aren’t (or shouldn’t be) given to the general public: The prevalence in the general population is too low. The tests must be reserved for a sub-group of persons who are high risk for disease.
Cheers,
Joe
Hi Joe,
Thanks so much for your insightful comment! These issues have real-world implications and I appreciate you sharing your experiences with us. Whenever anyone analyzes data, it’s crucial to know the underlying processes and subject area to understand correctly the implications, particularly when basing decisions on the analysis!
Hello Jim, I have been binge reading the blogs/articles written by you. They are very helpful. I have a question related to prevalence. Is the concept of prevalence applicable to all scenarios and end goals (for which the analysis is performed), similar to the way alpha and beta are? For example, in the example that relates to the change in per capita income (from 260 to 330), my understanding is that prevalence does not hold true. Is that correct? If not, how should I interpret/understand prevalence in that example? Your inputs will be helpful.
Hi Kushal,
In this context, the prevalence is the probability that the effect exists in the population. You’d need to be able to come up with some probability that the per capita income has changed from 260 to 330. I think coming up with a good estimate can often be difficult. It becomes easier as a track record develops. Is that size change typical or unusual in previous years? Does it fit other economic observations? Etc. Coming up with a rough estimate can help you evaluate p-values.
Thank you so much, Jim. This was even better than what I expected when I asked you to explain Sellke et al. I am going to suggest to all my fellow (Data) Scientists that this be a must read.
Thanks, Steven! I appreciate the kind words and sharing!
Looking forward to that.
Hi Jim,
This is a nice post. The language is not just elementary; it also makes complex concepts intuitively easier to grasp. I have read about these concepts several times in many textbooks, but for the first time I have a better understanding of the practical application behind these previously difficult topics.
Thank you,
Emmanuel
Thanks a lot, Jim. It would be even better if you covered this in the context of panel data.
Jim, thank you. As always, so informative, and you are constantly challenging me with different ways of approaching concepts. Do you know of any studies that apply this approach to COVID testing? I’m thinking about the recent news from Elon Musk in which he said he had 4 tests done on the same day, with the same test and the same health professional. Two came back positive and two negative. Is there a substantial error rate on these tests?
Dear Sir
My question is that I have a dependent variable, say X, and a variable of interest, Y, with some control variables (Z).
Now when I run the following regressions:
1) X at time t, Y & Z at t-1
2) X at time t, Y at t-1 & Z at t
3) X at time t, Y & Z at t
the sign of my variable of interest changes (and its significance too). If there is no theory to guide me with respect to the lag specification of the variable of interest and the control variables, which of the above models should I use? What is the general principle?
Hi Vishnu,
A good method for identifying lags to include is to use the cross correlation function (CCF). It helps find lags of one time series that can predict the current value of the time series you’re interested in. You can also use the autocorrelation function (ACF) and partial autocorrelation function (PACF) to identify lags within a single time series. These functions simply look for correlations between observations of a time series that are separated by k time units. The CCF looks between two different time series, while the ACF and PACF look within one time series.
I don’t currently have posts about these topics but they’re on my list!
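In the meantime, here is a minimal sketch of what those checks can look like in Python with statsmodels. The series are placeholders I generated so the example runs on its own; with your data, you’d pass in your actual X and Y.

```python
# Placeholder example: y depends on the previous period's value of x.
import numpy as np
from statsmodels.tsa.stattools import acf, ccf, pacf

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.empty(200)
y[0] = rng.normal()
y[1:] = 0.8 * x[:-1] + rng.normal(size=199) * 0.5  # y[t] driven by x[t-1]

print(ccf(y, x)[:4])     # cross-correlations; the spike at lag 1 flags x[t-1] as a predictor of y
print(acf(y, nlags=4))   # autocorrelations of y with its own past values
print(pacf(y, nlags=4))  # partial autocorrelations of y
```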
Hi Jim,
Thanks so much for your great post. It’s always been tremendously helpful.
I have one simple question about the difference between a significance level and a false positive rate.
I have read your comment in one of your p-value posts: “When you’re talking significance levels and the Type I error rate, you’re talking about an entire class of studies. You can’t apply that to individual studies.”
But, in this post, we simulated a test 1000 times, and in my humble opinion, it seemed like we treated 1000 tests as a kind of “a class of studies.” However, the false positive rate, 0.36, is still pretty different from the initial significance level setup, 0.05.
I think this is a silly question, but could you please kindly clarify this?
Thanks!
Hi Mavis,
That’s a great question. And there are myriad details like that which are crucial to understand. That’s why it’s such a deep, dark rabbit hole!
What you’re asking about gets to the heart of a major difference between Frequentist and Bayesian statistics.
Using Frequentist methodology, there’s no probability associated with the null hypothesis. It’s true or not true, but you don’t know which. The significance level is part of the Frequentist methodology. So, it can’t calculate a probability about whether the null is true. Instead, the significance level assumes the null hypothesis is true and goes from there. The significance level indicates the probability of the hypothesis test producing significant results when the null is true. So, you don’t know whether the null is true or not, but you do know that IF it is true, your test is unlikely to be significant. Think of the significance level as a conditional probability based on the null being true.
Compare that to the Bayesian approach, where you can have probabilities associated with the null hypothesis. The example I work through is akin to the Bayesian approach because we’re stating that the null has a 90% chance of being correct and a 10% chance of being incorrect. That’s a different scenario than Frequentist methodology where you assume the null is true. That’s why the numbers are different because they’re assessing different scenarios and assumptions.
In a nutshell, yes, the 1000 tests can be a class of studies but this class includes cases where the null is both true and false at some assumed proportion. For significance levels, the class of studies contains only studies where the null hypothesis is true (e.g., 5% of all studies where the null is true).
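To put numbers on that distinction using the 1000-test scenario from the post, here is a quick sketch (the variable names are my own):

```python
# 1,000 tests with prevalence 0.1, alpha 0.05, and 80% power.
true_nulls, real_effects = 900, 100
false_positives = 0.05 * true_nulls       # 45 significant results despite a true null
true_positives = 0.80 * real_effects      # 80 significant results with a real effect

# Significance level: conditioned on the null being true.
print(false_positives / true_nulls)                           # 0.05
# False positive rate: conditioned on obtaining a significant result.
print(false_positives / (false_positives + true_positives))   # 0.36
```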
I hope that clarifies that point!
Idea!
It is not necessary to use the notation α for the threshold (critical) value of the random variable
p̃_v = Pr(T̃ ≤ –|t| | H_0) + Pr(T̃ ≥ +|t| | H_0)
and call it the significance level. A different notation, for instance p_crit, should be used for it.
There is no direct relationship between the observed p-value (p_val) and the probability of the null hypothesis P(H_0 | data), just as there is no direct relationship between the critical p-value p_crit and the significance level α (the probability of a type I error)!
Hi Nikita,
I don’t follow your comment. Is this just your preference for the notation or something more? Alpha is the usual notation for this concept.
Very informative and useful. Thank you
You’re very welcome! I’m glad it was helpful!