In my post about how to interpret p-values, I emphasize that p-values are not an error rate. The number one misinterpretation of p-values is that they are the probability of the null hypothesis being correct.

The correct interpretation is that p-values indicate the probability of observing your sample data, or more extreme, when you assume the null hypothesis is true. If you don’t solidly grasp that correct interpretation, please take a moment to read that post first.

Hopefully, that’s clear.

Unfortunately, one part of that blog post confuses some readers. In that post, I explain how p-values are not a probability, or error rate, of a hypothesis. I then show how that misinterpretation is dangerous because it overstates the evidence against the null hypothesis.

For example, suppose you incorrectly interpret a p-value of 0.05 as a 5% chance that the null hypothesis is correct. You’ll feel that rejecting the null is appropriate because it is unlikely to be true. However, I later present findings that show a p-value of 0.05 relates to a false-positive rate of “at least 23% (and typically close to 50%).”

The logical question is, if p-values aren’t an error rate, how can you report those higher false positive rates (an error rate)? That’s a reasonable question and it’s the topic of this post!

## A Quick Note about This Post

This post might be a bit of a mind-bender. P-values are already confusing! And in this post, we look at p-values differently using a different branch of statistics and methodology. I’ve hesitated to write this post because it feels like a deep, dark rabbit hole!

However, the ideas from this exploration of p-values have strongly influenced how I view and use p-values. While I’m writing this post after other posts and an entire book chapter about p-values, the line of reasoning I present here strongly influenced how I wrote that earlier content. Buckle up!

## Frequentist Statistics

Before calculating the false positive rate, you need to understand frequentist statistics, also known as frequentist inference. Frequentist statistics are what you learned, or are learning, in your Introduction to Statistics course. This methodology is a type of inferential statistics containing the familiar hypothesis testing framework where you compare your p-values to the significance level to determine statistical significance. It also includes using confidence intervals to estimate effects.

Frequentist inference focuses on frequencies that make it possible to use samples to draw conclusions about entire populations. The frequencies in question are the sampling distributions of test statistics. That goes beyond the scope of this post but click the related posts links below for the details.

Frequentist methodology treats population parameters, such as the population mean (µ), as fixed but unknown characteristics. There are no probabilities associated with them. The null and alternative hypotheses are statements about population parameters. Consequently, frequentists can’t say that there is such and such probability that the null hypothesis is correct. It either is correct or incorrect, but you don’t know the answer. The relevant point here is that when you stick strictly to frequentist statistics, there is no way to calculate the probability that a hypothesis is correct.

**Related posts**: How Hypothesis Tests Work, How t-Tests Work, How F-tests Work in ANOVA, and How the Chi-Squared Test of Independence Works

### Why Can’t Frequentists Calculate those Probabilities?

There are mathematical reasons for that but let’s look at it intuitively. In frequentist inference, you take a single, random sample and draw conclusions about the population. The procedure does not use other information from the outside world or other studies. It’s all based on that single sample with no broader context.

In that setting, it’s just not possible to know the probability that a hypothesis is correct without incorporating other information. There’s no way to tell whether your sample is unusual or representative. Frequentist methods have no way to include such information and, therefore, cannot calculate the probability that a hypothesis is correct.

However, Bayesian statistics and simulation studies include additional information. Those are large areas of study, so I’ll only discuss the points relevant to our discussion.

### Bayesian Statistics

Bayesian statistics can incorporate an entire framework of evidence that resides outside the sample. Does the overall fact pattern support a particular hypothesis? Does the larger picture indicate that a hypothesis is more likely to be correct before starting your study? This additional information helps you calculate probabilities for a hypothesis because it’s not limited to a single sample.

### Simulation Studies

When you perform a study in the real world, you do it just once. However, simulation studies allow statisticians to perform simulated studies thousands of times while changing the conditions. Importantly, you know the correct results, enabling you to calculate error rates, such as the false positive rate.

Using frequentist methods, you can’t calculate error rates for hypotheses. There is no way to take a p-value and convert it to an error rate. It’s just not possible with the math behind frequentist statistics. However, by incorporating Bayesian and simulation methods, we can estimate error rates for p-values.

## Simulation Studies and False Positives

In my post about interpreting p-values, I quote the results from Sellke et al. They used a Bayesian approach. But let’s start with simulation studies and see how they can help us understand the false positive rate. For this, we’ll look at the work of David Colquhoun, a professor in biostatistics, who lays it out here.

Factors that influence the false-positive rate include the following:

- Prevalence of real effects (higher is good)
- Power (higher is good)
- Significance level (lower is good)

“Good” indicates the conditions under which hypothesis tests are less likely to produce false positives. Click the links to learn more about each concept. The prevalence of real effects indicates the probability that an effect exists in the population before conducting your study. More on that later!

Let’s see how to calculate the false positive rate for a particular set of conditions. Our scenario uses the following conditions:

- Prevalence of real effects = 0.1
- Significance level (alpha) = 0.05
- Power = 80%

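We’ll “perform” 1000 hypothesis tests under these conditions. With a prevalence of 0.1, 100 of those tests involve a real effect and 900 involve a true null. At 80% power, the tests detect 80 of the 100 real effects (true positives). At a significance level of 0.05, they incorrectly reject 45 of the 900 true nulls (false positives).

In this scenario, the total number of positive test results is 45 + 80 = 125. However, 45 of those positives are false. Consequently, the false positive rate is:

$$\frac{45}{125} = 0.36$$

Mathematically, calculate the false positive rate using the following:

$$\text{False positive rate} = \frac{\alpha \, (1 - P(\text{real}))}{\alpha \, (1 - P(\text{real})) + \text{Power} \times P(\text{real})}$$

Where alpha is your significance level and P(real) is the prevalence of real effects.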
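To make the arithmetic concrete, here is a minimal Python sketch of the tally for this scenario. It assumes the false positive rate is the share of significant results that come from true nulls, which is how the 45-out-of-125 figure above is formed. The function and variable names are illustrative and do not come from Colquhoun’s work.

```python
def false_positive_rate(alpha: float, power: float, prevalence: float) -> float:
    """Share of significant results that come from true nulls."""
    false_positives = alpha * (1 - prevalence)   # true nulls rejected by chance
    true_positives = power * prevalence          # real effects that the test detects
    return false_positives / (false_positives + true_positives)


# Scenario from the text: 1000 tests, prevalence 0.1, alpha 0.05, power 80%.
n_tests = 1000
alpha, power, prevalence = 0.05, 0.80, 0.10

false_pos = n_tests * (1 - prevalence) * alpha   # 900 true nulls * 0.05
true_pos = n_tests * prevalence * power          # 100 real effects * 0.80

print(f"false positives: {false_pos:.0f}, true positives: {true_pos:.0f}")
print(f"false positive rate: {false_positive_rate(alpha, power, prevalence):.2f}")
```

Running it reproduces the 45 false positives, 80 true positives, and the 0.36 rate. Changing `prevalence` is a quick way to see how strongly that one input drives the result.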

### Simulation studies for P-values

The previous example and calculation incorporate the significance level to derive the false positive rate. However, we’re interested in p-values. That’s where the simulation studies come in!

Using simulation methodology, Colquhoun runs studies many times and sets the values of the parameters above. He then focuses on the simulated studies that produce p-values between 0.045 and 0.05 and evaluates how many are false positives. For these studies, he estimates a false positive rate of at least 26%. The 26% error rate assumes the prevalence of real effects is 0.5, and power is 80%. Decreasing the prevalence to 0.1 causes the false positive rate to jump to 76%. Yikes!
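The idea behind this kind of simulation can be sketched in a few lines of Python. This is not Colquhoun’s code. It is a simplified stand-in that assumes a two-sided z-test with known variance, picks the effect size so that power is approximately 80% at alpha = 0.05, and uses hypothetical names throughout. Because of those assumptions, the estimates will not match his figures exactly.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def simulate_band_fpr(prevalence, power=0.80, alpha=0.05,
                      p_low=0.045, p_high=0.05,
                      n_studies=200_000, seed=1):
    """Estimate the share of true nulls among studies whose p-value lands in a narrow band."""
    rng = random.Random(seed)
    # Assumed effect size: the shift in the z statistic that gives roughly the
    # requested power for a two-sided z-test at the given alpha.
    shift = STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)

    false_hits = 0   # true null, p-value inside the band
    true_hits = 0    # real effect, p-value inside the band
    for _ in range(n_studies):
        effect_is_real = rng.random() < prevalence
        z = rng.gauss(shift if effect_is_real else 0.0, 1.0)
        p_value = 2 * (1 - STD_NORMAL.cdf(abs(z)))   # two-sided p-value
        if p_low <= p_value <= p_high:
            if effect_is_real:
                true_hits += 1
            else:
                false_hits += 1
    return false_hits / (false_hits + true_hits)


if __name__ == "__main__":
    for prevalence in (0.5, 0.1):
        rate = simulate_band_fpr(prevalence)
        print(f"prevalence {prevalence}: estimated false positive rate {rate:.2f}")
```

Because the simulation knows which studies had a real effect, it can count how many of the “just significant” results were false positives. Under these simplified assumptions, the estimates should land in the same neighborhood as the figures quoted above, with a much higher rate at a prevalence of 0.1 than at 0.5.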

Let’s examine the prevalence of real effects more closely. As you saw, it can dramatically influence the error rate!

## P-Values and the Bayesian Prior Probability

The property that Colquhoun names the prevalence of real effects (P(real)) is what the Bayesian approach refers to as the prior probability. It is the proportion of studies where a similar effect is present. In other words, the alternative hypothesis is correct. The researchers don’t know this, of course, but sometimes you have an idea. You can think of it as the plausibility of the alternative hypothesis.

When your alternative hypothesis is implausible, or similar studies have rarely found an effect, the prior probability (P(real)) is low. For instance, a prevalence of 0.1 signifies that 10% of comparable alternative hypotheses were correct, while 90% of the null hypotheses were accurate (1 – 0.1 = 0.9). In this case, the alternative hypothesis is unusual, untested, or otherwise unlikely to be correct.

When your alternative hypothesis is consistent with current theory, has a recognized process for producing the effect, or prior studies have already found significant results, the prior probability is higher. For instance, a prevalence of 0.90 suggests that the alternative is correct 90% of the time, while the null is right only 10% of the time. Your alternative hypothesis is plausible.

When the prior probability is 0.5, you have a 50/50 chance that either the null or alternative hypothesis is correct at the beginning of the study.

You never know this prior probability for sure, but theory, previous studies, and other information can give you clues. For this blog post, I’ll assess prior probabilities to see how they impact our interpretation of p-values. Specifically, I’ll focus on the likelihood that the null hypothesis is correct (1 – P(real)) at the start of the study. When you have a high probability that the null is right, your alternative hypothesis is unlikely.

## Moving from the Prior Probability to the Posterior Probability

From a Bayesian perspective, studies begin with varying probabilities that the null hypothesis is correct, depending on the alternative hypothesis’s plausibility. This prior probability affects the likelihood the null is valid at the end of the study, the posterior probability.

If P(real) = 0.9, there is only a 10% probability that the null is correct at the start. Therefore, the chance that the hypothesis test rejects a true null at the end of the study cannot be greater than 10%. However, if the study begins with a 90% probability that the null is right, the likelihood of rejecting a true null escalates because there are more true nulls.

The following table uses Colquhoun and Sellke *et al.*’s calculations. Lower prior probabilities are associated with lower posterior probabilities. Additionally, notice how the likelihood that the null is correct decreases from the prior probability to the posterior probability. The precise value of the p-value affects the size of that decrease. Smaller p-values cause a larger decline. Finally, the posterior probability is also the false positive rate in this context because of the following:

- the low p-values cause the hypothesis test to reject the null.
- the posterior probability indicates the likelihood that the null is correct even though the hypothesis test rejected it.

| Prior probability of true null, 1 – P(real) | Study p-value | Posterior probability of true null (false positive rate) |
|---|---|---|
| 0.5 | 0.05 | 0.289 |
| 0.5 | 0.01 | 0.110 |
| 0.5 | 0.001 | 0.018 |
| 0.33 | 0.05 | 0.12 |
| 0.9 | 0.05 | 0.76 |
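As a rough illustration of how a prior and a p-value combine into a posterior, the sketch below uses the lower bound on the Bayes factor associated with Sellke et al., −e · p · ln(p), which applies for p < 1/e. Treat it as an assumption that this bound is what lies behind the rows with a prior of 0.5. The remaining rows appear to come from other calculations, so this function is not expected to reproduce them. The function name is hypothetical.

```python
import math


def min_posterior_null(prior_null: float, p_value: float) -> float:
    """Lower bound on P(null | data) using the -e * p * ln(p) Bayes factor bound."""
    if not 0 < prior_null < 1:
        raise ValueError("prior_null must be strictly between 0 and 1")
    if not 0 < p_value < 1 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    bayes_factor = -math.e * p_value * math.log(p_value)   # bound on evidence for the null
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)


for p in (0.05, 0.01, 0.001):
    print(f"prior 0.5, p = {p}: posterior >= {min_posterior_null(0.5, p):.3f}")
```

With a prior of 0.5, the results come out close to the first three rows of the table. Because the function returns a lower bound, the actual probability that the null is true can be higher.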

## Safely Using P-values

Many combinations of factors affect the likelihood of rejecting a true null. Don’t try to remember these combinations and false-positive rates. When conducting a study, you probably will have only a vague sense of the prior probability that your null is true! Or maybe no sense of that probability at all!

Just keep these two big takeaways in mind:

- A single study that produces statistically significant test results can provide weak evidence that the null is false, especially when the p-value is close to 0.05.
- Different studies can produce the same p-value but have vastly different false-positive rates. You need to understand the plausibility of the alternative hypothesis.

Carl Sagan’s quote embodies the second point, “Extraordinary claims require extraordinary evidence.”

Suppose a new study has surprising results that astound scientists. It even has a significant p-value! Don’t trust the alternative hypothesis until another study replicates the results! As the last row of the table shows, a study with an implausible alternative hypothesis and a significant p-value can still have an error rate of 76%!

I can hear some of you wondering. Ok, both Bayesian methodology and simulation studies support these points about p-values. But what about empirical research? Does this happen in the real world? A study that looks at the reproducibility of results from real experiments supports it all. Read my post about p-values and the reproducibility of experimental results.

I know this post might make p-values seem more confusing. But don’t worry! I have another post that provides simple recommendations to help you navigate p-values. Read my post: Five P-value Tips to Avoid Being Fooled by False Positives.

Jeremy says

Hi Jim, thank you for this very important work of explanation. I have fallen into the rabbit hole of the relationship between p-values and error rates because of some literature review I have been doing in sports science.

In this field, researchers often use ANOVA to compare the effect of different training regimens on certain physical ability metrics such as endurance. To test endurance, they come up with tests for which they often don’t evaluate the test-retest reliability. My initial inquiry was: how often can an ANOVA incorrectly detect a difference with p <= 0.05 as a function of test-retest reliability (measured using an ICC)? In other words, how is the error rate affected by measurement (un)reliability?

I ended up finding a paper by Westfall and Yarkoni (2016) on the effect of reliability on controlling for confounding variables, but I don't think this translates to my inquiry.

That is how I ended up reading your blog posts on p-values which have been very illuminating. However I believe the work you shared doesn't take into account measurement reliability. Would you happen to have some thoughts or references to share on the impact of measurement reliability on the rate of false positives (type I error rate) in ANOVA?

Thank you very much.

Jim Frost says

Hi Jeremy,

All hypothesis tests, including ANOVA, assume that measurement error is small compared to the sampling error. If you can’t make that assumption, it raises questions about the results. Hypothesis testing does not account for measurement error, just sampling error.

I don’t know of a way to factor in measurement error to the results. It’s not standard practice. Ideally, the researchers would have conducted an assessment of their measurements to make that determination. Unfortunately, I don’t have references on hand. But, if you have concerns about the data’s reliability, that is potentially a legitimate problem and I’d encourage you to look into it more. Sorry I can’t be more helpful with a reference though.

Hugo Alonso says

Hello Jim

Yes, but I am not looking for the error rate after the simulation is done. I need a way to control the error rate before the algorithm runs. And intuitively there must be a way to do it with a threshold p-value on which you base the decision. The lower the p-value threshold, the better for the error rate. I am looking for a way to calculate the function that gives the error rate from this “beforehand chosen p-value threshold”.

The only way I can think of for now is to run the simulation with different critical values, observe the error rate, and interpolate points to get a continuous function. So I was hoping that you had a better idea.

Hugo Alonso says

Hello Jim.

Thanks for this blog. I am not a mathematician, just a computer scientist. Thus I may misunderstand, but you seem to say that we can’t compute the error rate from the p-value.

My problem is as follows. I have a set of inputs that follow random distributions. By design, all the distributions are equal except for one that has a bigger mean (very little difference). All have the same variance.

I am trying to find the quickest way (in number of tries) to isolate this particular input with a user-given probability x.

One of my approaches is based on critical p-values over the difference between the best set of data and all the others. I stop when the difference reaches a predefined p-value. I was really surprised by the difference between the error rate and the p-value I observed: p-value = 0.0025 => 0.14 error rate. This is why I came here to try to understand. It’s clear now, thanks to you, that this is to be expected, but I still can’t grasp that there is no way to link the two values when you control every parameter.

Since I am doing simulation, I control every parameter. The prevalence of effect is one. This is really bugging me (I find it counterintuitive) that I can’t control x with p-values, but I can using another interval-of-trust technique. Especially because the p-value method goes a bit faster:

pvalues: (x = 0.859: numberOfTry=127.881)

intervals: (x = 0.864: numberOfTry=134.649)

So my question is:

Is there really nothing to do to anticipate the error rate from the critical p-value for my specific use case? Do you have a recommendation on the best way to resolve my problem?

PS: the interval technique finishes when both sets of data interval and all the other data interval become disjoint.

Jim Frost says

Hi Hugo,

Please understand that when I say you can’t link p-values to error rates, I’m referring to real studies using real data. In those cases, you only get one sample and you don’t know (and can’t control) the population parameters.

However, when you’re performing simulation studies, you certainly do control the population parameters and can draw repeated samples from the populations as you define them. In those cases, yes, you can certainly know the error rates because you know all the necessary information. However, in real world studies, you don’t have all that necessary info. That’s a huge difference!

Owen Byer says

Thanks again, Jim. I have 3 comments:

First, you said that the probabilities of the following four events do not sum to 1. I think they DO sum to 1 — it is just that two of them will have probability zero, because, as you said, either the null is true or it is false. So, point taken.

1. Reject a true null hypothesis.

2. Reject a false null hypothesis.

3. Fail to reject a true null hypothesis.

4. Fail to reject a false null hypothesis.

Second, I guess I still don’t understand the definition of a Type I error rate, if you say it is hard to determine. I completely understand that the error rate is not equal to the P-value, even though it is in fact a probability — but how is that probability defined? Given what you have written, I don’t see how it is different than alpha.

Finally, I was talking about these ideas with a friend, and he referred me to this interesting article. Evidently I am not alone in thinking that Type I errors don’t occur. See page 1000.

https://www.sjsu.edu/faculty/gerstman/misc/Cohen1994.pdf

The author makes a point that I had never seen before. We are all familiar with this logic:

If A, then B; it follows that if B isn’t true, we assume A isn’t true.

In our context, if the null is true, we won’t get this data; we got this data, so the null is false.

He then points out how this isn’t quite right, and it is more accurate to say:

If the null is true, we probably don’t get this data. We then conclude that if we got this data, the null is probably false.

But this is very bad logic, as shown in this example:

If a person is an American, he is probably not a member of Congress. Since this person is a member of Congress, he is probably not an American.

Such logic falls into the same trap of thinking that Prob(getting this sample data, given that the null is true) is equal to Prob(the null is true, given this sample data).

Jim Frost says

Hi Owen,

We’re getting to the point where we’re going around in circles a bit. If you have questions after this reply, please use my contact me form. I’ll try not to be too repetitive below because I’ve addressed several of these points already.

I suppose you could say that all four should sum to 1. However, only two of them will be valid for any given test. In my list below, only 1 & 2 or 3 & 4 will be valid possibilities for a given test. And, again, you should be listing them in a logical order like the following where you correctly group complementary pairs. The order you use doesn’t emphasize the natural pairings.

1. Reject a true null: error rate = α

2. Failing to reject a true null: correct decision rate = 1 – α.

3. Failing to reject a false null: error rate = β

4. Reject a false null: correct decision rate = 1 – β (aka statistical power)

While you could say the invalid pair has a probability that sums to zero, it doesn’t really make sense to consider, say, the probability of rejecting a true null for a test where the null is false. Of course, you don’t know the answer to that, but in theory that’s the case. But, if you want to consider one pair to have a probability of zero and the other pair to have a probability of 1, I suppose that works. Maybe it even clarifies how one pair is invalid.

I focus on the interpretation of p-values. Click the link to read. I specifically cover what the probability represents. And read the following for graphical comparison between significance levels and p-values.

I’ve already covered in detail in my previous replies why it’s not a problem if type I errors don’t exist. I have heard of this thinking before, but I don’t buy it personally. It’s easy enough to imagine a totally ineffective treatment where both populations are by definition the same. But, even if you assume that there is always some minimal effect, it’s not a problem for all the reasons I explained before. Then it just becomes a case of having a large enough sample size to detect *meaningful* effects and to produce a sufficiently precise confidence interval. That’s already built into the power analysis process. So, even if you’re right, it’s not a problem.

I do want to address your logic example. I actually addressed this idea in a previous reply. Yes, that is bad logic. And hypothesis testing specifically addresses that. That’s why when your results are not significant, we say that you “fail to reject the null.” You are NOT accepting the null. A non-significant hypothesis test isn’t proving that there is no effect (i.e., not proving the null is true). Instead, it’s saying that you have insufficient evidence to conclude that an effect exists in the population. Similar to your logic example, that is NOT the same as saying there is no effect.

I’ve written a post about that topic exactly. I included it in a previous reply, and I suggest you read it this time! 🙂 Failing to Reject the Null Hypothesis.

Owen Byer says

Wait, one more post. Perhaps I just had an epiphany.

By the error rate of “rejecting a true null”, do you mean the probability that the null is true, given that we rejected it? And this is what can be as high as 0.23 when P = .05?

This is in contrast to the probability of a Type I error, alpha, which is the probability of rejecting a null, given that it is true?

If this is what is meant, then my confusion is removed, and it explains why the error rate and alpha are not equal: they are different conditional probabilities. Of course these two probabilities are related to each other via Bayes’ Theorem.

By the way, if this is correct, then I change my initial objection from Type I errors hardly ever occurring to claim that the error rate is almost always 0, since the null is hardly ever true, unless we have some error tolerance built into the statement of the null :).

Jim Frost says

Type I errors can only occur when the null is true by definition. You’re rejecting a null that is true. That’s an error and can, obviously, only occur when the null is true. When the null is false, you can’t reject it incorrectly.

The p-value error rate is also the same idea. You can only incorrectly reject the null when the null is true. So, yes, both cases are conditional on a true null. You can’t incorrectly reject a false null.

As I write in my other reply, The type I error rate equals the significance level and applies to a range of p-values for a class of studies. For individual p-values from a single study, you need to use other methodologies just to estimate the false positive error rate.

Owen Byer says

Maybe we should continue, if you are willing, to do this via private email. I feel I have hijacked your thread here! So, I’ll just give one last response.

My point was that those four scenarios partition the space of outcomes from experiments, so all four should add up to 1, and it doesn’t matter what order we list them.

If we want to look at the probability that we make an error, in my list we can add them:

P(error) = P(Case 1) + P(Case 4). In your list, you have written them as conditional probabilities, so they can’t be added. The probability of making an error is not α + β. This is why when Type I and Type II errors are discussed, I think they should ALWAYS be described as conditional probabilities. To me, saying “rejecting a true null” is too likely to be interpreted as “rejecting and true null” rather than “rejecting | true null”.

I’ve read your other pieces, and want to make sure I understand something. Above, you say the Type I error rate is simply α. However, in the article, you say the Type I error rate can be as high as 23% when P = 0.05. Does this just mean that a priori the error rate is α, but after you take your sample and get P=.05, you have new information, and the error rate has now climbed to 0.23?

Thanks again.

Jim Frost says

Hi Owen,

I think this is a good discussion that others will benefit from. That’s why I always prefer discussion in the comments rather than via email!

But that’s not correct thinking that those four scenarios should sum to 1. Perhaps we need to teach that better. However, the null hypothesis is either true or false. We don’t know the answer, but we do know that it’s one or the other. When the null is false, there is no chance of a false negative. And when the null is true, there is no chance of a false positive. I show two distribution curves in my post about the types of error. In actuality, only one of those curves exist, we just don’t know which one. As you say, they are conditional probabilities. Although, I think that’s baked right into the names as I’ve mentioned, but I can see the need to emphasize it.

Getting to your questions about the error, there are a few complications! For one thing, the type I error rate equals the significance level (α), which applies to a *range* of p-values. Using frequentist methodology, there is no way to obtain an error rate for a *single* p-value from a study. However, using other methodologies, such as Bayesian statistics and simulation studies, you can estimate error rates for individual p-values. You do need to make some assumptions but it’s possible to come up with ballpark figures. And when I talk about error rates as high as 23% for a p-value of 0.05, it’s using those other methodologies. That’s why I consider a p-value around 0.05 (either above or below) to be fairly weak evidence on its own. I think I use an a priori probability of 0.5 for whether the null is true for the 23%. Obviously, the false positive rate will be higher when that probability is higher.

But there was no reason to have assumed that a p-value of 0.05 should produce an error rate of 0.05 to begin with. That’s the common misinterpretation I discuss in my article about interpreting p-values. Many people link p-values to that type of error rate, but it’s just not true. And my point is that using conservative a priori probabilities, you can see that the true error rate is typically higher.

Again, the Type I error rate equals the significance level, not an individual p-value.

Owen D Byer says

You wrote: “I still don’t quite understand what you’re saying about the vagueness of the Type I error rate. The type I error rate is the probability of rejecting a true null hypothesis. Therefore, by definition we’re talking about cases where the null hypothesis is correct.”

This is what I meant. There are four non-overlapping possibilities, each with its own probability.

1. Reject a true null hypothesis.

2. Reject a false null hypothesis.

3. Fail to reject a true null hypothesis.

4. Fail to reject a false null hypothesis.

It would be reasonable for one to conclude that the sum of these four probabilities is 1.

However, when you say that the Type 1 error rate is the probability of rejecting a true null hypothesis, you actually mean the sum of the probabilities in 1 and 3 equals 1, and that the error rate is P(1)/( P(1) + P(3) ).

Jim Frost says

Hi Owen,

I guess if you write the list in that particular order, you’d need to sum non-adjacent items. Consequently, I wouldn’t list them in that order. It’s more logical to group them by whether the null hypothesis is true or not rather than by the rejection decision. But I do agree that we need to be clear when teaching this subject!

1. Reject a true null: error rate = α

2. Failing to reject a true null: correct decision rate = 1 – α.

3. Failing to reject a false null: error rate = β

4. Reject a false null: correct decision rate = 1 – β (aka statistical power)

For more information on this topic, read my post about the two types of error in hypothesis testing. In that post, I put these in a table and I also show them on sampling distributions.

Owen D Byer says

I completely resonate with what you say here. In fact, I’ve long thought that hypotheses should actually have an error tolerance built into them that somehow includes the effect size that is considered negligible. For example, it should be stated as an interval: mu = 100 +/- 1, if all values of mu in that range would be considered indistinguishable in any practical sense for the given context. Of course, this would make the calculation of the P-value a bit more complicated, and one would have to assume some type of distribution of the values of the parameter (probably normal or uniform) within the interval, but with technology this wouldn’t be a problem. I have never actually taken the step to see what effect such an approach would have on the P-values. Maybe none.

Finally, I didn’t mean to imply that I think the definition of a Type I error is vague — I agree it is well-defined. What I meant is that I think that when the probability of a Type I error is discussed, we could all do a better job of clarifying that the sample space is all experiments for which the null is true. (Of course, that gets me back to my earlier issue, because I think the sample space is so small!)

Thank you again for your responses. I want to read some of your other articles. I’m a mathematician who teaches statistics, and this is all very helpful to me.

Jim Frost says

Hi Owen,

There is actually a standard way of doing just that. It involves using confidence intervals to evaluate both the magnitude and precision of the estimated effect. For more details, read my post about practical vs statistical significance. The nice thing is that the CI approach is entirely consistent with p-values.

I still don’t quite understand what you’re saying about the vagueness of the Type I error rate. The type I error rate is the probability of rejecting a true null hypothesis. Therefore, by definition we’re talking about cases where the null hypothesis is correct.

And, even if that sample space is small, it’s not really a problem.

Thanks for the interesting discussion!

Owen Byer says

Thank you for the comment. That is a good point about the possibility of the null hypothesis being true with an equal sign for two-sample tests when considering the effect of a bogus drug. I guess I was mostly thinking of one-sample tests with a fixed standard in the null.

Having said that, in your example, yes, it is easy to believe in a theoretically worthless treatment. In practice, if every subject of the population were tested (i.e., our sample is the population), an effect would likely always be observed, however small it is. In this case, then, it seems we probably need to define exactly what we mean when we refer to a population. To make my case (that the null is never true), I would define the population as an actual group of subjects who could conceivably be tested, not as an idealized theoretical group of all possible subjects. It seems the logic is backwards to say “The treatment is worthless, so the parameter must exactly equal zero.”

On my other point, I realize that “probability of rejecting a null hypothesis that is true” is the usual definition. But I find this to be vague, because it can logically be interpreted by students as the probability of the intersection of two events: (1) Rejecting the null, and (2) The null is true. That is very different from the conditional probability of Rejecting the null given that the null is true.

I do realize these comments of mine are a bit pedantic. However, they have troubled me for some time, so I appreciate having your ear for a moment!

Jim Frost says

Hi Owen,

For the sake of discussion, let’s go beyond the question of whether the null can be true exactly or not but ponder only those cases where it’s not exactly true but close. We’ll assume those cases exist to one degree or another even if we’re not sure how often.

In those cases, it’s still not a problem. If the null is always false to some degree, then you don’t need to worry about Type I errors because that deals with true nulls. Instead, you’re worrying about Type II errors (failing to reject a false null) because that is applicable to false nulls. An effect exists but the test is not catching it. That sounds like a problem, but it isn’t necessarily. If the true population effect exists but is trivial, it’s not a problem if you fail to detect it. When you fail to reject the null in that case, you’re not missing out on an important finding.

In fact, when you perform a power analysis before a test, you need to know the minimum effect size that is not trivial. This process helps you obtain a large enough sample so you have a reasonable chance of detecting an effect that you’d consider important if it exists. (It also prevents you from obtaining such a large sample size that you’ll detect a trivial effect.) In this scenario, you just want to have a reasonable chance of detecting an effect that is important. If you fail to reject the null in this case, it doesn’t matter in a practical sense whether the null is true or minimally false. And remember, failing to reject the null doesn’t mean you’re proving the null is true. You can read my article about why we use the convoluted wording of failing to reject the null.

So, in the scenario you describe, you wouldn’t worry about type I errors, only type II. And in that context, you want to detect important effects but it’s fine to fail to detect trivial effects. And that comes down to power analysis. I probably made that as clear as mud, but I hope you get my point.

To learn more about how and why a power analysis builds in the idea of a practically significant effect size, read my post about power analysis.
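To show how the minimum important effect size drives the sample size, here is a minimal Python sketch. It uses the textbook normal-approximation formula for a two-sided comparison of two group means, with the effect size expressed in standard deviation units. The function name `n_per_group` and the example effect sizes are assumptions for illustration, and software based on the t-distribution will give slightly larger answers.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison of means.

    effect_size is the smallest important difference in standard deviation units.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Hypothetical minimum important effects: small, medium, large
for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))
```

The smaller the effect you consider important, the larger the sample you need. Setting the effect size to the smallest non-trivial value keeps the study from being either underpowered or needlessly large.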

Finally, I don’t think the definition of a type I error is vague at all (or type II). They’re very specific. “It’s an error if you reject a null hypothesis that is true.” That statement is true by definition and has very precise meaning in the context of a hypothesis test where you define the null hypothesis. It’s certainly true that students can misinterpret that but that’s a point of education rather than a vague definition.

It is an interesting discussion!

Owen Byer says

Could you clarify what you mean by the error rate? I think you said it is the conditional probability of Rejecting the null, given that the null is true? However, the null hypothesis is actually NEVER true if when we write = we really mean equal. It might be very close to being true, or it might be true to the level of precision with which we can measure, but it won’t actually be true. (In the same way that no matter how many decimals someone gives me for the value of the number pi, the value they give will still not actually equal pi.) However, in our hypotheses, we do not stipulate the level of accuracy for which we need to agree that two numbers are equal.

So, my question: How does it make sense to talk about the conditional probability of an event when the underlying condition never happens?

Jim Frost says

Hi Owen,

That’s correct that the error rate, more specifically, the Type I error rate, is the probability of rejecting a null hypothesis that is true. However, I’d disagree that the null hypothesis is never true when using an equal sign. For example, imagine that you’re testing a medication that truly is worthless. It has no effect whatsoever. If you perform an experiment with a treatment and control group, the null hypothesis is that the outcomes for the treatment group equal those for the control group. If the medication truly has zero effect, then at the population level, the outcomes should be equal. Of course, your sample means are unlikely to be exactly equal due to random sampling error.

However, I would agree that there are many cases where, using the medication example, it has some effect but not a practically meaningful effect. In that case, the null hypothesis is not correct. But that’s not a problem. If you reject the null hypothesis when the treatment group is marginally better than the control group, it’s not an error. The hypothesis test made the correct decision by rejecting the null.

At that point, it becomes a distinction between statistical significance and practical significance (i.e., importance in the real world).

So, what you’re asking about is a concern, but a different type of concern than what you mention. The null hypothesis using equals is just fine. The real concern is whether the effect is practically significant after you reject the null.

Carolina Musso says

Hi Jim, thank you for this explanation. I have one question. It is probably a dumb question, but I am going to ask it anyway…

Suppose I define the alpha as 5%. Does this mean that I have decided to reject the null hypothesis if p<0.05? Or when I define alpha as 5% I could use another threshold for the p-value?

Jim Frost says

Hi Carolina,

Yes, that’s correct! Technically, you reject the null if the p-value is less than or equal to 0.05 when you use an alpha of 0.05. So, basically what you said, but it’s less than or equal to.

Andreas says

I found this blogpost by googling for “significance false positive rate”. I noticed that what you call “false positive rate” is apparently called “false discovery rate” elsewhere. According to Wikipedia, the false positive rate is the number of false positives (FP) divided by the number of negatives (TN + FP). So FP is _not_ divided by the number of positives (TP + FP); doing this, you would get (according to Wikipedia) just the “false discovery rate”.

https://en.wikipedia.org/wiki/False_positive_rate

https://en.wikipedia.org/wiki/False_discovery_rate

Now I fully understand that the p value is not the same as the false discovery rate, as you correctly show. But how is the p value related to the false positive rate (defined as FP/(TN + FP))?

Jim Frost says

Hi Andreas,

The False Discovery Rate (FDR) and the False Positive Rate (FPR) are synonymous in this context. In statistics, one concept will sometimes have several different names. For example, alpha, the significance level, and the Type I error rate all mean the same thing!

As you have found, analysts from different backgrounds will sometimes use these terms differently. It does make it a bit confusing! That’s why it’s good practice to include the calculations, as I do in this post.
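For readers who want to see the two calculations Andreas describes side by side, here is a small Python sketch. The counts are hypothetical: 1,000 tests, a true null in 90% of them, a 0.05 significance level, and an assumed power of 80%. The function names are illustrative, not standard library calls.

```python
def fp_over_negatives(fp, tn):
    """FP / (TN + FP): share of true nulls that were wrongly rejected."""
    return fp / (tn + fp)

def fp_over_positives(fp, tp):
    """FP / (TP + FP): share of significant results that are false positives."""
    return fp / (tp + fp)

# Hypothetical counts for 1,000 tests (assumed, for illustration only)
true_nulls, real_effects = 900, 100
fp = round(true_nulls * 0.05)    # 45 true nulls rejected at alpha = 0.05
tn = true_nulls - fp             # 855 true nulls correctly not rejected
tp = round(real_effects * 0.80)  # 80 real effects detected at 80% power

print(fp_over_negatives(fp, tn))  # 0.05
print(fp_over_positives(fp, tp))  # 0.36
```

With these assumed inputs, dividing by the number of true nulls reproduces the significance level, while dividing by the number of significant results gives the much higher figure. Stating the denominator removes any ambiguity about which quantity a given label refers to.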

Thanks for writing!

Joseph Lombardi says

Many moons ago, when I was a junior electrical engineer, I wrote a white paper (for the US Navy). At the time, there was a big push to inject all sorts of Built-In Test (BIT) and Built-in Test Electronics (BITE) into avionics (i.e., aircraft weapon systems). The rapid pace of miniaturization of electronics made this a very attractive idea. In the paper I recommended we should slow down and inject BIT/E very judiciously, mainly for the reasons illustrated in your post.

Specifically, if the actual failure rate of a weapon system is very low (i.e., the Prevalence of Real Effects is very small), and the Significance Level is too large, we will get a very high False Positive rate, which will result in the “pulling” of numerous “black boxes” for repair that don’t require maintenance. (BTW, this is what, in fact, happened. The incidence of “No Fault Found” on systems sent in for repair has gone up drastically.)

And the Bayesian logic illustrated above is why certain medical diagnostic tests aren’t (or shouldn’t be) given to the general public: The prevalence in the general population is too low. The tests must be reserved for a sub-group of persons who are high risk for disease.

Cheers,

Joe

Jim Frost says

Hi Joe,

Thanks so much for your insightful comment! These issues have real-world implications and I appreciate you sharing your experiences with us. Whenever anyone analyzes data, it’s crucial to know the underlying processes and subject area to understand correctly the implications, particularly when basing decisions on the analysis!
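Joe's point about low prevalence can be made concrete with Bayes' theorem. The sketch below computes the probability that a positive result reflects a real condition (the positive predictive value) at several prevalence levels. The sensitivity, specificity, and prevalence values are hypothetical and do not describe any particular diagnostic or built-in test.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(condition present | positive result) via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical test: 99% sensitivity, 95% specificity
for prevalence in (0.001, 0.01, 0.20):
    ppv = positive_predictive_value(prevalence, 0.99, 0.95)
    print(f"prevalence {prevalence:.3f} -> PPV {ppv:.3f}")
```

Even with a seemingly accurate test, most positives are false alarms when the condition is rare, which is the same pattern as a low prevalence of real effects inflating the false positive rate among significant results.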

Kushal says

Hello Jim, I have been binge reading the blogs/articles written by you. They are very helpful. I have a question related to prevalence. Is the concept of prevalence applicable to all scenarios and end goals (for which the analysis is performed), the way alpha and beta are? For example, in the example that is related to the change in per capita income (from 260 to 330), my understanding is that prevalence does not hold true. Is that correct? If not, how should I interpret/understand prevalence in that example? Your inputs will be helpful.

Jim Frost says

Hi Kushal,

In this context, the prevalence is the probability that the effect exists in the population. You’d need to be able to come up with some probability that the per capita income has changed from 260 to 330. I think coming up with a good estimate can often be difficult. It becomes easier as a track record develops. Is that size change typical or unusual in previous years? Does it fit other economic observations? Etc. Coming up with a rough estimate can help you evaluate p-values.

Steven H. Gutfreund says

Thank you so much Jim. This was even better than what I expected when I asked you to explain: Sellke et al. I am going to suggest to all my fellow (Data) Scientists that this be a must read.

Jim Frost says

Thanks, Steven! I appreciate the kind words and sharing!

Steven H. Gutfreund says

Looking forward to that.

Emmanuel UGOCHUKWU Ndukwu says

Hi Jim,

This is a nice post. The language is not just elementary, it also made complex concepts intuitively easier to grasp. I have read these concepts several times in many textbooks, for the first time I have a better understanding of the lay application behind the erstwhile difficult topics.

Thank you,

Emmanuel

vishnu kramesh says

Thanks a lot Jim. It will be better, if you take this in the context of Panel data

Marty Shudak says

Jim, thank you. As always, so informative and you are constantly challenging me with different ways of approaching concepts. Have you seen, or do you know of, any studies that apply this approach to COVID testing? I’m thinking about recent news from Elon Musk in which he said he had 4 tests done in the same day, same test, same health professional. Two came back positive and two negative. Is there a substantial error rate on these tests?

vishnu kramesh says

Dear Sir

My question is that I have a dep variable say X and a variable of interest Y with some control variables(Z)

Now when I run following regressions

1) X at time t , Y & Z at t-1

2) X at time t , Y at t-1 & Z at t

3) X at time t , Y & Z at t

The sign of my variable of interest changes (significance too). If there is no theory to guide me with respect to the lag specification of the variable of interest and control variables, which one of the above models should I use? What is the general principle?

Jim Frost says

Hi Vishnu,

A good method for identifying lags to include is to use the cross correlation function (CCF). This helps find lags of one time series that can predict the current value of your time series of interest. You can also use the autocorrelation function (ACF) and partial autocorrelation function (PACF) to identify lags within one time series. These functions simply look for correlations between observations of a time series that are separated by k time units. CCF is between different sets of time series data while ACF and PACF are within one set of time series data.
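To illustrate the idea of correlating observations that are k time units apart, here is a simplified Python sketch. It applies a plain Pearson correlation to the overlapping portions of two series, which is an assumption made for clarity. Statistical packages normalize the CCF and ACF somewhat differently, so treat this as a teaching aid rather than a replacement for them. The series and function names are hypothetical.

```python
from math import sqrt

def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def lagged_correlation(x, y, k):
    """Correlate x[t] with y[t + k] for a lag k >= 0 (k time units apart)."""
    if k == 0:
        return pearson(x, y)
    return pearson(x[:-k], y[k:])

# Hypothetical series: y is x delayed by two periods
x = [3, 7, 1, 9, 4, 8, 2, 6, 5, 10, 0, 7]
y = [0, 0] + x[:-2]

for k in range(4):
    print(k, round(lagged_correlation(x, y, k), 3))
```

Passing two different series gives a cross correlation at lag k, and passing the same series twice gives an autocorrelation. The lag with the strongest correlation (here, the two-period delay built into `y`) is a candidate for inclusion in the model.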

I don’t currently have posts about these topics but they’re on my list!

Mavis says

Hi Jim,

Thanks so much for your great post. It’s always been tremendously helpful.

I have one simple question about the difference between a significance level and a false positive rate.

I have read your comment in one of your p-value posts: “When you’re talking significance levels and the Type I error rate, you’re talking about an entire class of studies. You can’t apply that to individual studies.”

But, in this post, we simulated a test 1000 times, and in my humble opinion, it seemed like we treated 1000 tests as a kind of “a class of studies.” However, the false positive rate, 0.36, is still pretty different from the initial significance level setup, 0.05.

I think this is a silly question, but could you please kindly clarify this?

Thanks!

Jim Frost says

Hi Mavis,

That’s a great question. And there’s a myriad of details like that which are crucial to understand. That’s why it’s such a deep, dark rabbit hole!

What you’re asking about gets to the heart of a major difference between Frequentist and Bayesian statistics.

Using Frequentist methodology, there’s no probability associated with the null hypothesis. It’s true or not true but you don’t know. The significance level is part of the Frequentist methodology. So, it can’t calculate a probability about whether the null is true. Instead, the significance level assumes the null hypothesis is true and goes from there. The significance level indicates the probability of the hypothesis test producing significant results when the null is true. So, you don’t know whether the null is true or not, but you do know that IF it is true, your test is unlikely to be significant. Think of the significance level as a conditional probability based on the null being true.

Compare that to the Bayesian approach, where you can have probabilities associated with the null hypothesis. The example I work through is akin to the Bayesian approach because we’re stating that the null has a 90% chance of being correct and a 10% chance of being incorrect. That’s a different scenario than Frequentist methodology where you assume the null is true. That’s why the numbers are different because they’re assessing different scenarios and assumptions.

In a nutshell, yes, the 1000 tests can be a class of studies but this class includes cases where the null is both true and false at some assumed proportion. For significance levels, the class of studies contains only studies where the null hypothesis is true (e.g., 5% of all studies where the null is true).
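A quick simulation can make the two classes of studies concrete. This Python sketch generates many hypothetical studies in which the null is true at an assumed proportion, then reports the share of significant results among true-null studies only and the share of false positives among all significant results. The 80% power, the proportions tried, the seed, and the function name `simulate` are assumptions for illustration.

```python
import random

def simulate(n_studies=200_000, p_null_true=0.90, alpha=0.05, power=0.80, seed=1):
    """Simulate study outcomes; return (Type I rate, false share of positives)."""
    rng = random.Random(seed)
    null_true_count = false_pos = true_pos = 0
    for _ in range(n_studies):
        null_true = rng.random() < p_null_true
        # A true null yields significance with probability alpha;
        # a real effect yields significance with probability equal to the power.
        significant = rng.random() < (alpha if null_true else power)
        if null_true:
            null_true_count += 1
            false_pos += significant
        else:
            true_pos += significant
    type_i_rate = false_pos / null_true_count         # class: true-null studies only
    false_share = false_pos / (false_pos + true_pos)  # class: all significant studies
    return type_i_rate, false_share

# Try two assumed proportions of true nulls
results = {p: simulate(p_null_true=p) for p in (0.50, 0.90)}
for p, (type_i_rate, false_share) in results.items():
    print(f"P(null true)={p:.2f}  Type I rate={type_i_rate:.3f}  "
          f"false share of positives={false_share:.3f}")
```

The first number stays near the significance level whatever proportion of nulls you assume, because it conditions on the null being true. The second number moves with that assumed proportion, which is why it needs the Bayesian-style input and why it can land far from 0.05.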

I hope that clarifies that point!

Nikita N Khromov-Borisov says

Idea!

It is not necessary to use the notation α for the threshold (critical) value of the random variable

P̃_v = Pr[(T̃ ≤ -|t| | H_0) + (T̃ ≥ +|t| | H_0)]

and call it the significance level. A different notation, for instance p_crit, should be used for it.

There is no direct relationship between the observed p-value (p_val) and the probability of the null hypothesis P(H_0 | data), just as there is no direct relationship between the critical p-value p_crit and the significance level α (the probability of a type I error)!

Jim Frost says

Hi Nikita,

I don’t follow your comment. Is this just your preference for the notation or something more? Alpha is the usual notation for this concept.

Ramamurthy Pasupathy says

Very informative and useful. Thank you

Jim Frost says

You’re very welcome! I’m glad it was helpful!