The ability to reproduce experimental results should be related to P values. After all, both of these statistical concepts have similar foundations.
- P values help you separate the signal of population-level effects from the noise in sample data.
- Reproducible results support the notion that the findings can be generalized to the population rather than applying only to a specific sample.
So, P values are related to reproducibility in theory. But, does this relationship exist in the real world? In this blog post, I present the findings of an exciting study that answers this question!
Let’s quickly cover the basics of replication and reproducibility. Replicating a research study means repeating it using the same procedures but with a different sample. The researchers want to see whether the replicate study reproduces the original findings.
Clearly, if the replicate study obtains similar findings, you can have more confidence in the results. If an effect exists in the population, it should be apparent in most random samples drawn from that population. Failure to reproduce the results raises the possibility that the original study was a fluke based on the vagaries of the sample or some other problem.
I explain how to interpret P values correctly in a different post. Of particular importance to this discussion is the fact that P values are frequently misinterpreted. Often, a P value of 0.05 is misinterpreted as a 5% chance of a false positive. That probability seems like a safe bet. Unfortunately, the actual probability is often between 25% and 50%! These probabilities are based on simulation studies and Bayesian analyses. For more information about those studies, read my post P-values, Error Rates, and False Positives.
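To see where numbers like these can come from, here is a minimal Monte Carlo sketch. It is my own illustration, not the simulations the post cites, and the parameters are assumptions: half of all tested effects are real, and studies have roughly 80% power. It then asks, among results that land just under p = 0.05, what fraction came from studies where no effect actually exists?

```python
# Hypothetical simulation (assumed parameters, not the cited studies):
# the false positive risk among results that land just under p = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies = 500_000
prior_real = 0.5      # assumption: half of tested effects are real
alt_shift = 2.8       # mean z under the alternative (~80% power at alpha = 0.05)

is_real = rng.random(n_studies) < prior_real
z = rng.normal(np.where(is_real, alt_shift, 0.0), 1.0)
p = 2 * stats.norm.sf(np.abs(z))          # two-sided p-values

# Keep only the "just significant" results, p in (0.04, 0.05]
just_sig = (p > 0.04) & (p <= 0.05)
false_positive_risk = np.mean(~is_real[just_sig])
print(f"False positive risk near p = 0.05: {false_positive_risk:.0%}")
```

With these assumptions the risk comes out near 25–30%, far above the naive 5%, and it climbs further if real effects are rarer or power is lower.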
The article we explore in this post shines a nice empirical light on this matter.
Estimating the Reproducibility Rate
In August 2015, the researchers behind the study Estimating the reproducibility of psychological science set out to estimate the reproducibility rate and to identify predictors of successfully reproducing experimental results in psychological studies. However, there is a shortage of replication studies available to analyze. Sadly, the shortage exists because it is easier for authors to publish new results than to replicate prior studies.
Because of this shortage, the group of 300 researchers first had to conduct their own replication studies. They identified 100 psychology studies that had statistically significant findings and had been published in three top psychology journals. Then, the research group replicated these 100 studies. After finishing the follow-up studies, they calculated the reproducibility rate and looked for predictors of success. To do this, they compared the results of each replicate study to the corresponding original study.
The researchers found that only 36 of the 100 replicate studies were statistically significant. That’s a 36% reproducibility rate. This finding sent shock waves through the field of psychology!
My view of this low reproducibility rate is that science isn’t a neat, linear process. It can be messy. For science, we take relatively small samples and attempt to model the complexities of the real world. When you’re working with samples, false positives are an unavoidable part of the process. Of course it’s going to take repeated experimentation to determine which results represent real findings rather than random noise in the data. I’ve written elsewhere that you shouldn’t expect a single study to prove anything conclusively. You need to do the replication studies.
P Values and Reproducibility Rates
The researchers evaluated a variety of measures to see whether they could predict the probability that a follow-up study would reproduce the original results. These potential predictors included professional traits of the original researchers, hypotheses, methodology, and strength-of-evidence measures, such as the P value.
Most measures didn’t predict the reproducibility rates. However, P values were good predictors! The chart below shows that lower P values in the original studies are associated with higher reproducibility rates in the replicate studies.
Clearly, P values provide vital information about which studies have findings that are more likely to be reproduced in the future.
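This pattern can be reproduced qualitatively with a toy simulation. The sketch below is my own, with assumed effect sizes, power, and base rates rather than the study's data: studies with smaller original p-values are more often driven by real effects, so replications of them succeed more often.

```python
# Toy simulation (assumed parameters, not the study's data): why smaller
# original p-values predict higher replication rates.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 500_000
is_real = rng.random(n) < 0.5            # assumption: half of tested effects are real
shift = np.where(is_real, 2.8, 0.0)      # mean z under each hypothesis (~80% power)

z_orig = rng.normal(shift, 1.0)
p_orig = 2 * stats.norm.sf(np.abs(z_orig))

# Replicate every study with the same design and an independent sample
z_rep = rng.normal(shift, 1.0)
p_rep = 2 * stats.norm.sf(np.abs(z_rep))
replicated = (p_rep < 0.05) & (np.sign(z_rep) == np.sign(z_orig))

# Replication rate by the original study's p-value bin
rates = {}
for lo, hi in [(0.0, 0.001), (0.001, 0.01), (0.01, 0.05)]:
    bin_mask = (p_orig >= lo) & (p_orig < hi)
    rates[(lo, hi)] = np.mean(replicated[bin_mask])
    print(f"original p in [{lo}, {hi}): replication rate {rates[(lo, hi)]:.0%}")
```

In this sketch, originally significant results with p below 0.001 replicate most often, while results that barely cleared 0.05 replicate least often, mirroring the direction of the study's finding.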
The reproducibility study reinforces what I write in Five P Value Tips to Avoid Being Fooled.
- Knowing the precise P value is important—just reporting statistical significance is insufficient.
- A P value near 0.05 isn’t worth much by itself.
- Replication is crucial.
Collectively, the tips in that post help you distinguish statistical findings that apply to the entire population from those that merely reflect the quirks of a sample. In other words, they help you answer the question: can you generalize the results from the sample to the entire population?
The low reproducibility rate confirms how critical it is to replicate studies before accepting a finding as experimentally established. I vehemently oppose the “one and done” practice that makes it easier to publish new studies than to publish replication studies. In my view, replication studies are just as important as the original studies. After all, if it weren’t for the 100 replication studies in this analysis, we’d have the wrong impression about 64% of the original studies!
Finally, the low reproducibility rate might indicate the presence of p-hacking in the original studies. Learn more in my post What is P-Hacking: Methods & Best Practices.