In my post about how to interpret p-values, I emphasize that p-values are not an error rate. The number one misinterpretation of p-values is that they are the probability of the null hypothesis being correct.
The correct interpretation is that p-values indicate the probability of observing your sample data, or data more extreme, assuming the null hypothesis is true. If you don’t solidly grasp that correct interpretation, please take a moment to read that post first.
Hopefully, that’s clear.
Unfortunately, one part of that blog post confuses some readers. In that post, I explain how p-values are not a probability, or error rate, of a hypothesis. I then show how that misinterpretation is dangerous because it overstates the evidence against the null hypothesis.
For example, suppose you incorrectly interpret a p-value of 0.05 as a 5% chance that the null hypothesis is correct. You’ll feel that rejecting the null is appropriate because it is unlikely to be true. However, I later present findings that show a p-value of 0.05 relates to a false-positive rate of “at least 23% (and typically close to 50%).”
The logical question is, if p-values aren’t an error rate, how can you report those higher false positive rates (an error rate)? That’s a reasonable question, and it’s the topic of this post!
A Quick Note about This Post
This post might be a bit of a mind-bender. P-values are already confusing! And in this post, we look at p-values through the lens of a different branch of statistics and a different methodology. I’ve hesitated to write this post because it feels like a deep, dark rabbit hole!
However, the ideas from this exploration of p-values have strongly influenced how I view and use p-values. While I’m writing this post after other posts and an entire book chapter about p-values, the line of reasoning I present here strongly influenced how I wrote that earlier content. Buckle up!
Before calculating the false positive rate, you need to understand frequentist statistics, also known as frequentist inference. Frequentist statistics are what you learned, or are learning, in your Introduction to Statistics course. This methodology is a type of inferential statistics containing the familiar hypothesis testing framework where you compare your p-values to the significance level to determine statistical significance. It also includes using confidence intervals to estimate effects.
Frequentist inference focuses on frequencies that make it possible to use samples to draw conclusions about entire populations. The frequencies in question are the sampling distributions of test statistics. That goes beyond the scope of this post but click the related posts links below for the details.
Frequentist methodology treats population parameters, such as the population mean (µ), as fixed but unknown characteristics. There are no probabilities associated with them. The null and alternative hypotheses are statements about population parameters. Consequently, frequentists can’t say that there is such and such probability that the null hypothesis is correct. It either is correct or incorrect, but you don’t know the answer. The relevant point here is that when you stick strictly to frequentist statistics, there is no way to calculate the probability that a hypothesis is correct.
Why Can’t Frequentists Calculate those Probabilities?
There are mathematical reasons for that but let’s look at it intuitively. In frequentist inference, you take a single, random sample and draw conclusions about the population. The procedure does not use other information from the outside world or other studies. It’s all based on that single sample with no broader context.
In that setting, it’s just not possible to know the probability that a hypothesis is correct without incorporating other information. There’s no way to tell whether your sample is unusual or representative. Frequentist methods have no way to include such information and, therefore, cannot calculate the probability that a hypothesis is correct.
However, Bayesian statistics and simulation studies include additional information. Those are large areas of study, so I’ll only discuss the points relevant to our discussion.
Bayesian statistics can incorporate an entire framework of evidence that resides outside the sample. Does the overall fact pattern support a particular hypothesis? Does the larger picture indicate that a hypothesis is more likely to be correct before starting your study? This additional information helps you calculate probabilities for a hypothesis because it’s not limited to a single sample.
When you perform a study in the real world, you do it just once. However, simulation studies allow statisticians to perform simulated studies thousands of times while changing the conditions. Importantly, you know the correct results, enabling you to calculate error rates, such as the false positive rate.
Using frequentist methods, you can’t calculate error rates for hypotheses. There is no way to take a p-value and convert it to an error rate. It’s just not possible with the math behind frequentist statistics. However, by incorporating Bayesian and simulation methods, we can estimate error rates for p-values.
Simulation Studies and False Positives
In my post about interpreting p-values, I quote the results from Sellke et al., who used a Bayesian approach. But let’s start with simulation studies and see how they can help us understand the false positive rate. For this, we’ll look at the work of David Colquhoun, a professor of pharmacology, who lays it out here.
Factors that influence the false-positive rate include the following:
- Prevalence of real effects (higher is good)
- Power (higher is good)
- Significance level (lower is good)
“Good” indicates the conditions under which hypothesis tests are less likely to produce false positives. Click the links to learn more about each concept. The prevalence of real effects indicates the probability that an effect exists in the population before conducting your study. More on that later!
Let’s see how to calculate the false positive rate for a particular set of conditions. Our scenario uses the following conditions:
- Prevalence of real effects = 0.1
- Significance level (alpha) = 0.05
- Power = 80%
We’ll “perform” 1000 hypothesis tests under these conditions.
In this scenario, 100 of the 1000 studies have a real effect (prevalence = 0.1), while 900 do not. With 80% power, the 100 real effects produce 80 true positives. Applying the 0.05 significance level to the 900 true nulls produces 45 false positives (900 × 0.05). Consequently, the total number of positive test results is 45 + 80 = 125. However, 45 of those positives are false. The false positive rate is 45 / 125 = 36%.
Mathematically, calculate the false positive rate using the following:

False Positive Rate = [α × (1 − P(real))] / [α × (1 − P(real)) + Power × P(real)]

Where alpha (α) is your significance level and P(real) is the prevalence of real effects. Plugging in our scenario’s values: (0.05 × 0.9) / (0.05 × 0.9 + 0.8 × 0.1) = 0.045 / 0.125 = 0.36.
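That calculation is easy to sketch in a few lines of Python (a minimal illustration; the function name is mine, not from Colquhoun’s paper):

```python
def false_positive_rate(alpha, power, p_real):
    """Expected fraction of significant results that are false positives,
    given the significance level, statistical power, and the prevalence
    of real effects."""
    false_pos = alpha * (1 - p_real)   # true nulls that still reject
    true_pos = power * p_real          # real effects that correctly reject
    return false_pos / (false_pos + true_pos)

# The scenario from above: prevalence 0.1, alpha 0.05, power 80%
print(round(false_positive_rate(alpha=0.05, power=0.80, p_real=0.1), 2))  # → 0.36
```

Try plugging in other prevalences to see how strongly this one input drives the error rate.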
Simulation Studies for P-values
The previous example and calculation incorporate the significance level to derive the false positive rate. However, we’re interested in p-values. That’s where the simulation studies come in!
Using simulation methodology, Colquhoun runs studies many times and sets the values of the parameters above. He then focuses on the simulated studies that produce p-values between 0.045 and 0.05 and evaluates how many are false positives. For these studies, he estimates a false positive rate of at least 26%. The 26% error rate assumes the prevalence of real effects is 0.5, and power is 80%. Decreasing the prevalence to 0.1 causes the false positive rate to jump to 76%. Yikes!
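You can get a feel for Colquhoun’s approach with a small Monte Carlo sketch. This is a simplified stand-in for his t-test simulations, not his actual code: it assumes a two-sided z-test with known variance, where an effect of 2.80 standard errors yields 80% power at a significance level of 0.05. It then asks, among simulated studies with p-values between 0.045 and 0.05, what fraction are false positives?

```python
import math
import random

random.seed(42)

def two_sided_p(z):
    """Two-sided p-value for a z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Effect size (in standard-error units) giving 80% power at alpha = 0.05:
# 1.96 (critical value) + 0.84 (80th normal percentile) ≈ 2.80
EFFECT = 2.80
P_REAL = 0.5            # prevalence of real effects
N_STUDIES = 1_000_000

null_hits = sig_hits = 0
for _ in range(N_STUDIES):
    is_real = random.random() < P_REAL
    z = random.gauss(EFFECT if is_real else 0.0, 1.0)
    p = two_sided_p(z)
    if 0.045 <= p <= 0.05:      # focus on "just significant" results
        sig_hits += 1
        if not is_real:
            null_hits += 1

print(f"False positive rate among p in [0.045, 0.05]: {null_hits / sig_hits:.2f}")
```

With these assumptions, the fraction comes out to roughly a quarter, broadly in line with Colquhoun’s estimate of at least 26% for a prevalence of 0.5 and 80% power.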
Let’s examine the prevalence of real effects more closely. As you saw, it can dramatically influence the error rate!
P-Values and the Bayesian Prior Probability
The property that Colquhoun calls the prevalence of real effects (P(real)) is what the Bayesian approach refers to as the prior probability. It is the proportion of similar studies in which the effect is present. In other words, the proportion where the alternative hypothesis is correct. The researchers don’t know this, of course, but sometimes you have an idea. You can think of it as the plausibility of the alternative hypothesis.
When your alternative hypothesis is implausible, or similar studies have rarely found an effect, the prior probability (P(real)) is low. For instance, a prevalence of 0.1 signifies that 10% of comparable alternative hypotheses were correct, while 90% of the null hypotheses were accurate (1 – 0.1 = 0.9). In this case, the alternative hypothesis is unusual, untested, or otherwise unlikely to be correct.
When your alternative hypothesis is consistent with current theory, has a recognized process for producing the effect, or prior studies have already found significant results, the prior probability is higher. For instance, a prevalence of 0.90 suggests that the alternative is correct 90% of the time, while the null is right only 10% of the time. Your alternative hypothesis is plausible.
When the prior probability is 0.5, you have a 50/50 chance that either the null or alternative hypothesis is correct at the beginning of the study.
You never know this prior probability for sure, but theory, previous studies, and other information can give you clues. For this blog post, I’ll assess prior probabilities to see how they impact our interpretation of p-values. Specifically, I’ll focus on the likelihood that the null hypothesis is correct (1 – P(real)) at the start of the study. When you have a high probability that the null is right, your alternative hypothesis is unlikely.
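Sellke et al. derive a lower bound on the Bayes factor in favor of the null, −e × p × ln(p), which you can combine with a prior probability to get a minimum posterior probability that the null is true. Here’s a sketch of that calculation (the function name is mine; the bound is theirs):

```python
import math

def min_posterior_null(p_value, prior_null):
    """Lower bound on the posterior probability that the null is true,
    using the Sellke-Bayarri-Berger bound on the Bayes factor:
    BF >= -e * p * ln(p), valid for p < 1/e."""
    bf = -math.e * p_value * math.log(p_value)  # Bayes factor bound (null vs. alternative)
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = prior_odds * bf
    return posterior_odds / (1 + posterior_odds)

# A "significant" p-value of 0.05 with 50/50 prior odds:
print(round(min_posterior_null(0.05, 0.5), 2))  # → 0.29
```

Even with even prior odds, a p-value of 0.05 leaves the null with at least a ~29% posterior probability under this bound. Colquhoun’s simulation-based figures differ a little because the two approaches make different assumptions, but they tell the same story.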
Moving from the Prior Probability to the Posterior Probability
From a Bayesian perspective, studies begin with varying probabilities that the null hypothesis is correct, depending on the alternative hypothesis’s plausibility. This prior probability affects the likelihood that the null is correct at the end of the study, known as the posterior probability.
If P(real) = 0.9, there is only a 10% probability that the null is correct at the start. Therefore, the chance that the study ends by rejecting a true null cannot be greater than 10%. However, if the study begins with a 90% probability that the null is right, the likelihood of rejecting a true null escalates because there are many more true nulls to reject.
The following table uses Colquhoun and Sellke et al.’s calculations. Lower prior probabilities are associated with lower posterior probabilities. Additionally, notice how the likelihood that the null is correct decreases from the prior probability to the posterior probability. The precise value of the p-value affects the size of that decrease. Smaller p-values cause a larger decline. Finally, the posterior probability is also the false positive rate in this context because of the following:
- the low p-values cause the hypothesis test to reject the null.
- the posterior probability indicates the likelihood that the null is correct even though the hypothesis test rejected it.
|Prior probability of true null (1 – P(real))|Posterior probability of true null (false positive rate)|
|---|---|
|0.5|26%|
|0.9|76%|

Both rows assume a p-value near 0.05.
Safely Using P-values
Many combinations of factors affect the likelihood of rejecting a true null. Don’t try to remember these combinations and false-positive rates. When conducting a study, you probably will have only a vague sense of the prior probability that your null is true! Or maybe no sense of that probability at all!
Just keep these two big takeaways in mind:
- A single study that produces statistically significant test results often provides only weak evidence that the null is false, especially when the p-value is close to 0.05.
- Different studies can produce the same p-value but have vastly different false-positive rates. You need to understand the plausibility of the alternative hypothesis.
Carl Sagan’s quote embodies the second point, “Extraordinary claims require extraordinary evidence.”
Suppose a new study has surprising results that astound scientists. It even has a significant p-value! Don’t trust the alternative hypothesis until another study replicates the results! As the last row of the table shows, a study with an implausible alternative hypothesis and a significant p-value can still have an error rate of 76%!
I can hear some of you wondering: OK, both Bayesian methodology and simulation studies support these points about p-values, but what about empirical research? Does this happen in the real world? A study that looks at the reproducibility of results from real experiments supports it all. Read my post about p-values and the reproducibility of experimental results.
I know this post might make p-values seem more confusing. But don’t worry! I have another post that provides simple recommendations to help you navigate p-values. Read my post: Five P-value Tips to Avoid Being Fooled by False Positives.