What is the Relationship Between the Reproducibility of Experimental Results and P Values?

The ability to reproduce experimental results should be related to P values. After all, both of these statistical concepts have similar foundations.

P values help you separate the signal of population level effects from the noise in sample data.
Reproducible results support the notion that the findings can be generalized to the population rather than applying only to a specific sample.

So, P values are related to reproducibility in theory. But, does this relationship exist in the real world? In this blog post, I present the findings of an exciting study that answers this question!

Let’s cover the basics of replication and reproducibility quickly. Replication of a research study refers to repeating a study by using the same procedures but with a different sample. The researchers want to see if the replicate study reproduces the original findings.

Clearly, if the replicate study obtains similar findings, you can have more confidence in the results. If an effect exists in the population, it should be apparent in most random samples drawn from that population. Failure to reproduce the results raises the possibility that the original study was a fluke based on the vagaries of the sample or some other problem.

I explain how to interpret P values correctly in a different post. Of particular importance to this discussion is the fact that P values are frequently misinterpreted. Often, a P value of 0.05 is misread as a 5% chance of a false positive. That probability seems like a safe bet. Unfortunately, the actual probability is often between 25% and 50%! These estimates are based on simulation studies and Bayesian analyses. For more information about those studies, read my post P-values, Error Rates, and False Positive.
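To see why the false positive rate can sit well above 5%, here is a minimal simulation sketch. It is not taken from the studies mentioned above. It assumes, purely for illustration, that half of all tested hypotheses involve a real effect, that studies of real effects have roughly 80% power, and that each study reduces to a two-sided z-test. The function name and all parameter values are hypothetical.

```python
import random
from statistics import NormalDist


def false_positive_share(n_studies=200_000, prior_true=0.5, power=0.8,
                         alpha=0.05, p_window=(0.04, 0.05), seed=1):
    """Among studies whose P value lands in p_window, return the share
    that come from a true null hypothesis (i.e., false positives)."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Effect size, in standard-error units, that gives approximately the
    # requested power for a two-sided z-test (assumed study design).
    effect = std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)

    null_hits = 0
    total_hits = 0
    for _ in range(n_studies):
        effect_is_real = rng.random() < prior_true
        z = rng.gauss(effect if effect_is_real else 0.0, 1.0)
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        if p_window[0] <= p_value <= p_window[1]:
            total_hits += 1
            if not effect_is_real:
                null_hits += 1
    return null_hits / total_hits


print(f"Share of false positives near P = 0.05: {false_positive_share():.0%}")
```

Under these assumptions, the share of false positives among results with P values just under 0.05 comes out far higher than 5%, and it climbs further if real effects are rarer (try prior_true=0.1). The exact figure depends entirely on the assumed prior and power.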

The article we explore in this post shines a nice empirical light on this matter.

Estimating the Reproducibility Rate

The researchers involved with a study published in August 2015, Estimating the reproducibility of psychological science, wanted to estimate the reproducibility rate and to identify predictors for successfully reproducing experimental results in psychological studies. However, there is a shortage of replication studies available to analyze. Sadly, the shortage exists because it is easier for authors to publish new results than to replicate prior studies.

Because of this shortage, the group of 300 researchers first had to conduct their own replication studies. They identified 100 psychology studies that had statistically significant findings and had been published in three top psychology journals. Then, the research group replicated these 100 studies. After finishing the follow-up studies, they calculated the reproducibility rate and looked for predictors of success. To do this, they compared the results of each replicate study to the corresponding original study.

The researchers found that only 36 of the 100 replicate studies were statistically significant. That’s a 36% reproducibility rate. This finding sent shock waves through the field of psychology!
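Because the rate rests on 100 studies, it carries some sampling uncertainty. The sketch below computes a Wilson score interval around 36 out of 100. It assumes the 100 studies can be treated as a random sample of comparable studies, which is an assumption made here for illustration and not a claim from the paper. The helper name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(36, 100)
print(f"Reproducibility rate: {36 / 100:.0%}, 95% interval: {low:.1%} to {high:.1%}")
```

Under that assumption, the interval runs from roughly 27% to 46%. Even the upper end is less than half, so the conclusion that most of these findings did not replicate does not hinge on sampling error.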

My view of this low reproducibility rate is that science isn’t a neat, linear process. It can be messy. For science, we take relatively small samples and attempt to model the complexities of the real world. When you’re working with samples, false positives are an unavoidable part of the process. Of course it’s going to take repeated experimentation to determine which results represent real findings rather than random noise in the data. I’ve written elsewhere that you shouldn’t expect a single study to prove anything conclusively. You need to do the replication studies.

Replication studies are a great way to establish the generalizability of the results, which statisticians call external validity. To learn more, read my post about internal and external validity.

P Values and Reproducibility Rates

The researchers evaluated a variety of measures to see if they could predict the probability that a follow-up study would reproduce the original results. These potential predictors included professional traits of the original researchers, hypotheses, methodology, and strength of evidence measures, such as the P value.

Most measures don’t predict the reproducibility rates. However, P values are good predictors! The chart below shows how lower P values in the original studies are associated with higher reproducibility rates in the replicate studies.

Clearly, P values provide vital information about which studies have findings that are more likely to be reproduced in the future.
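The same pattern shows up in a simple simulation. The sketch below is an illustration, not a reanalysis of the study's data. It assumes a hypothetical mix in which half of the original studies test a nonexistent effect and the rest test real effects of varying strength, and it counts a replication as successful when the replicate is significant at 0.05 in the same direction as the original. All names and parameter values are assumptions.

```python
import random
from statistics import NormalDist


def replication_rate_by_p(n_studies=200_000, prior_true=0.5, seed=7):
    """Replication rate of significant original studies, grouped by the
    original study's P value."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(0.975)  # two-sided 0.05 cutoff
    # Each bin stores [successful replications, studies replicated].
    bins = {
        "p < 0.001": [0, 0],
        "0.001 <= p < 0.01": [0, 0],
        "0.01 <= p < 0.05": [0, 0],
    }
    for _ in range(n_studies):
        # Assumed mix: no effect, or a true effect between 1 and 4
        # standard errors.
        effect = rng.uniform(1.0, 4.0) if rng.random() < prior_true else 0.0
        z_orig = rng.gauss(effect, 1.0)
        p_orig = 2 * (1 - std_normal.cdf(abs(z_orig)))
        if p_orig >= 0.05:
            continue  # only significant originals are replicated
        # Replicate: same procedure and effect, fresh random sample.
        z_rep = rng.gauss(effect, 1.0)
        z_rep_same_direction = z_rep if z_orig > 0 else -z_rep
        if p_orig < 0.001:
            label = "p < 0.001"
        elif p_orig < 0.01:
            label = "0.001 <= p < 0.01"
        else:
            label = "0.01 <= p < 0.05"
        bins[label][1] += 1
        if z_rep_same_direction > z_crit:
            bins[label][0] += 1
    return {label: wins / total for label, (wins, total) in bins.items()}


for label, rate in replication_rate_by_p().items():
    print(f"Original {label}: {rate:.0%} of replicates significant")
```

In this setup, original studies with lower P values replicate more often, because a very low P value is more likely to come from a real and reasonably strong effect. The specific rates depend on the assumed mix of effects, so only the direction of the trend should be read from it.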

The reproducibility study reinforces what I write in Five P Value Tips to Avoid Being Fooled.

Knowing the precise P value is important—just reporting statistical significance is insufficient.
A P value near 0.05 isn’t worth much by itself.
Replication is crucial.

Collectively, the tips in that post help you distinguish statistical findings that apply to the entire population from those that merely reflect the quirks of a sample. In other words, they help you answer the question: can you generalize the results from the sample to the entire population?

The low reproducibility rate confirms how critical it is to replicate studies before treating a finding as experimentally established. I vehemently oppose the “one and done” practice that makes it easier to publish new studies than to publish replication studies. In my view, replication studies are just as important as the original studies. After all, if it weren’t for the 100 replication studies in this analysis, we’d have the wrong impression about 64% of the original studies!

Finally, the low reproducibility rate might indicate the presence of p-hacking in the original studies. Learn more about What is P-Hacking: Methods & Best Practices.

Comments

J Verden says

February 14, 2022 at 2:46 am

Hi Jim
Another MCQ I could use your help on that relates to your post here on P-value and reproducibility:
In a trial that was conducted the P value was calculated as 0.05 exactly. What are the chances that if the trial was completed in exactly the same way, that the P value will be greater than 0.05?
a. 1%
b. 10%
c. 50%
d. 95%
e. 99%

Eric Tucker says

May 13, 2021 at 12:44 pm

Thanks for the great article Jim! I stumbled upon it pursuing a question that I don’t see answered specifically in here but I suspect you are able to answer… In repeated studies, say with a p-value of 0.01 observed in each, would it be appropriate to multiply these p-values together [(0.01)*(0.01) = 0.0001] to represent the overall significance observed through replication? After all, this would be consistent with the probability of observing the sample data in experiment 1 and then again in experiment 2, assuming the null hypothesis is true and the samples are therefore independent of each other, yes?

- Jim Frost says
  
  May 14, 2021 at 12:46 am
  
  Hi Eric,
  
  I actually write about this in an article titled Five P-value Tips. One thing to be careful about is that you’d need to use all p-values for a set of similar experiments. Not just the significant p-values. That can be tricky if you’re just using journal articles that publish the significant results!
  
Trent says

October 28, 2019 at 11:56 pm

Thank you for the reply as always, Jim. That makes sense to me.

Trent says

October 28, 2019 at 2:52 pm

Thank you for the insight, Jim! So, to clarify:

In the table above, a P<0.02 has a reproducibility rate of ~40% for psychology research.
If I were doing medical research on very concrete exposures and outcomes (eg "Will drug A reduce cancer mortality?"), a P<0.02 would have a reproducibility rate of less than 40%?

- Jim Frost says
  
  October 28, 2019 at 2:59 pm
  
  Hey Trent, I’m trying to think about how to word this with the proper caveats. My suspicion is that more concrete studies would have a higher (i.e., better) reproducibility rate. Of course, I don’t know how much higher. And, I certainly wouldn’t bet people’s lives on my hunch. But I do think psychology will have lower reproducibility than other fields given the inherently unpredictable and uncertain nature of human psychology!
  
  Amongst other criteria, I believe the FDA requires at least three significant studies before approving new medicine. I don’t think they ever approve anything based on a single study–which is a good thing.
  
Trent says

October 26, 2019 at 11:06 pm

This is a great article! I’m surprised by how much lower the reproducibility rate is compared to what I originally believed. I need to recheck all my subconscious assumptions now!

- Jim Frost says
  
  October 28, 2019 at 10:03 am
  
  Hi Trent,
  
  There are several things going on here. If you think back to the Interpret P value post and the error rates I show in the table near the end, the lower reproducibility rates shouldn’t be too surprising. Additionally, these are psychology studies and they tend to measure less concrete things. Subject areas that measure more concrete aspects with high precision should have lower p-values and, hence, higher reproducibility. There’s inherently more uncertainty when studying humans!
  
  It is important to factor this into everyone’s thinking. The importance of replication has been lost. Many people think that a single conclusive study is sufficient to “prove” something. Replication is crucial!
  