P-hacking is a set of statistical decisions and methodology choices during research that artificially produce statistically significant results. These decisions increase the probability of false positives, where the study indicates an effect exists when it actually does not. P-hacking is also known as data dredging, data fishing, and data snooping.
P-hacking is the manipulation of data analysis until it produces statistically significant results, compromising the truthfulness of the findings. This problematic practice undermines the integrity of scientific research.
It occurs because high-impact journals strongly favor statistically significant results in today’s scientific landscape. For researchers, publishing in these prestigious outlets is a career-boosting achievement. However, this prestige comes with pressure that can tempt researchers towards the perilous path of p-hacking.
P Hacking History
The term p-hacking was born during a crisis within the scientific community. Scientists were struggling as some of their landmark findings were failing to replicate. A large-scale research project that repeated 100 previously significant studies found that a shocking 64% were not significant the second time. The failure to replicate roughly two-thirds of the significant results suggests that many of the original findings may have been false positives. Read my look at the Reproducibility Study.
With growing unease, investigators tried to find the causes of the false positives. The suspect? Some deeply ingrained research practices. As the plot thickened, it became clear that p-hacking was a major player in this unfolding replication crisis.
All studies have a false positive rate. That’s the probability of concluding an effect or relationship exists (i.e., significant results) when it does not. Statisticians refer to that as the Type I error rate in hypothesis testing. When you do everything correctly, the error rate equals your significance level (e.g., 0.05 or 5%).
P-hacking jacks up that false positive rate, sometimes drastically! False positive studies tend not to reproduce significant results when scientists repeat them—explaining the replication crisis. Clearly, the 64% reproducibility failure in the aforementioned study is much greater than expected!
The takeaway is that p-hacking’s detrimental effects are real and not theoretical. Scientists have already noticed its impact in the literature with the replication crisis. Additionally, studies have found an overabundance of p-values just below 0.05 (see the p-curve article in the references). These results suggest that researchers have tweaked their studies until they get their p-values just below the standard significance threshold of 0.05.
Learn more about Type I and Type II Error in Hypothesis Testing.
Origin and Debate around the Term P-Hacking
Simonsohn, Nelson, and Simmons are authors of landmark studies in p-hacking and introduced the term at a psychology conference in 2012. They wanted to create a memorable name for the set of practices. While catchy, the term has sparked a debate among statisticians. Critics argue that the word “hacking” implies it refers only to intentionally deceptive manipulation. In actuality, the word applies to both unintentional and intentional cases.
Unintentional P-Hacking: Many researchers don’t fully realize they’re p-hacking. With so many ways to analyze data (imagine wandering through a maze of paths), it’s easy to veer toward biased decisions unknowingly. It’s like convincing ourselves that the shorter, easier route is the ‘right’ one, even when it’s not.
Intentional P-Hacking: P-hacking can be willful manipulation using an iterative trial-and-error method that homes in on significant results. Here, researchers knowingly twist their analysis to create outcomes they desire. It’s akin to deliberately changing the evidence at a crime scene to create a misleading narrative.
Whether done knowingly or not, p-hacking clouds the truth and jeopardizes the pursuit of knowledge. Let’s look at how p-hacking occurs and the best practices for avoiding it.
P Hacking Methods
P-hacking covers a wide range of methods. The difficulty is that all studies require researchers to make numerous decisions about data collection, variable manipulation, analysis techniques, and reporting the results. Making the correct choices is crucial to producing valid results.
Some of the following methods can be legitimate decisions when done correctly. P-hacking refers to cases where the researchers make poor choices that produce unwarranted statistically significant results.
Research has identified the following as the most common p-hacking methods, though there are numerous others.
Learn more about Statistical Significance: Definition & Meaning.
Optional Stopping
P-hackers might stop collecting data once they achieve a significant result, ignoring the need for a predetermined sample size.
Optional stopping, or as some call it, ‘data peeking,’ is when a researcher keeps testing their hypothesis as they gather data. The moment they hit a significant result, they stop collecting data. It’s like prematurely declaring victory in a game before it’s officially over.
This premature termination of data collection is a p-hacking technique that can inflate Type I errors.
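To get a feel for how much peeking matters, here is a minimal simulation sketch. It is not taken from the studies cited in this post. It assumes a simple z-test with a known standard deviation of 1, data where no effect exists, and a researcher who checks the p-value after every 10 observations up to a planned maximum of 100. The function names and settings are my own choices for illustration.

```python
import random
from statistics import NormalDist


def z_test_p_value(sample, sigma=1.0):
    """Two-sided p-value for H0: mean = 0, assuming a known standard deviation."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(peek, n_sims=2000, max_n=100, step=10, alpha=0.05, seed=1):
    """Share of null simulations that end up 'significant'.

    peek=True  : test after every `step` observations, stop at the first p < alpha.
    peek=False : test once, at the planned sample size max_n.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        data = []
        significant = False
        while len(data) < max_n:
            # The true mean is 0, so any significant result is a false positive.
            data.extend(rng.gauss(0, 1) for _ in range(step))
            if peek and z_test_p_value(data) < alpha:
                significant = True
                break
        if not peek:
            significant = z_test_p_value(data) < alpha
        hits += significant
    return hits / n_sims


fixed_rate = false_positive_rate(peek=False)
peeking_rate = false_positive_rate(peek=True)
print(f"fixed sample size: {fixed_rate:.3f}")
print(f"optional stopping: {peeking_rate:.3f}")
```

Under these assumptions, the fixed-sample rate should land near the 0.05 significance level, while the optional-stopping rate should come out well above it. The exact figures depend on the number of looks and the random seed.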
Outlier Removal
Outliers can significantly impact your data. P-hackers might choose to remove outliers based on whether it helps them achieve significance. The decision about outliers ideally should be based on theoretical grounds about the variable and measurement issues relating to specific observations. The outliers’ impact on the p-value should not be a factor at all.
‘Data trimming’ is a common form of p-hacking where researchers selectively exclude outliers. There’s room for bending the rules with so many ways to identify outliers (39 common methods!). Plus, reporting on how researchers handle outliers is often sketchy, making it easier to hide data trimming. Some studies even fail to mention it, leading to unexplained differences in sample sizes and degrees of freedom. So, watch out for the elusive outliers!
Variable Manipulation
Researchers frequently need to manipulate their variables for various legitimate reasons. But p-hackers will make changes to produce statistical significance.
In the p-hacking realm, analysts might slice, subgroup, or subset their data in ways that produce significance when the original arrangement does not. For example, combining comparison groups, recoding a continuous variable into a discrete variable, and looking at only a subset of the sample can produce statistical significance that wouldn’t exist otherwise. In regression analysis, unnecessarily transforming the independent and dependent variables can create unwarranted significance.
According to Stefan and Schönbrodt, one of the most common forms of p-hacking is changing the primary outcome variable while the study is occurring. Researchers peek at the data and then change their primary outcome to a variable that seems more likely to achieve significance. For instance, a diabetes medicine study starts by tracking blood glucose levels but changes to another outcome measure after six months because it is more likely to produce statistically significant results.
This ‘moving of the goalposts’ is a classic example of p-hacking, altering the study’s outcome variable mid-stream to achieve statistical significance.
In the worst cases, researchers use a trial-and-error approach, manipulating variables until the analysis produces statistical significance. Changing the study design and analysis to chase significance inflates the false positive rate.
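The subsetting problem can be sketched with a small simulation. This is an illustrative setup of my own, not the method from any cited study. It assumes a z-test with a known standard deviation of 1, an outcome with no real effect, and a researcher who reports whichever of three analyses looks best: the full sample, subgroup A only, or subgroup B only.

```python
import random
from statistics import NormalDist


def z_test_p_value(sample, sigma=1.0):
    """Two-sided p-value for H0: mean = 0, assuming a known standard deviation."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_subgroup_search(n_sims=4000, n=100, alpha=0.05, seed=2):
    """Return (planned_rate, flexible_rate) of false positives under the null."""
    rng = random.Random(seed)
    planned_hits = 0
    flexible_hits = 0
    for _ in range(n_sims):
        # Null data: the outcome has mean 0 and is unrelated to group membership.
        group_a = [rng.gauss(0, 1) for _ in range(n // 2)]
        group_b = [rng.gauss(0, 1) for _ in range(n // 2)]
        p_planned = z_test_p_value(group_a + group_b)
        # The flexible analyst keeps the smallest p-value of the three analyses.
        p_best = min(p_planned, z_test_p_value(group_a), z_test_p_value(group_b))
        planned_hits += p_planned < alpha
        flexible_hits += p_best < alpha
    return planned_hits / n_sims, flexible_hits / n_sims


planned_rate, flexible_rate = simulate_subgroup_search()
print(f"planned full-sample test: {planned_rate:.3f}")
print(f"best of three analyses:   {flexible_rate:.3f}")
```

Even with only two extra looks at the data, the flexible approach should produce noticeably more false positives than the single planned test. Adding more subgroups, recodings, or alternative outcomes pushes the rate higher still.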
Excessive Hypothesis Testing & Multiple Comparisons
When researchers perform many hypothesis tests, they increase the likelihood of stumbling upon a statistically significant result purely by chance. A single hypothesis test has a false positive rate (Type I error rate) equal to the significance level (e.g., 0.05). For a set of hypothesis tests, the family-wise error rate increases with each additional test.
It’s like flipping a coin many times; you will get a string of heads sooner or later. But remember, this doesn’t imply that the coin is biased, just as a significant result amid numerous tests doesn’t necessarily signify a meaningful finding.
Similarly, the more groups that researchers compare, the higher the chances of finding a significant result purely by chance. Correcting for multiple comparisons is essential to maintain the integrity of the results.
Additionally, p-hackers might run multiple variations of the same analysis, try similar analyses, relax assumptions, and alter little things each time—like the control variables or subsets of data used. They continue this process until they stumble upon a significant result.
Researchers need to limit the testing they perform during a study and use the proper corrections for multiple comparisons and hypothesis tests.
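Here is a short sketch of the arithmetic behind the family error rate, along with two standard corrections, Bonferroni and Holm. The formula assumes the tests are independent and that every null hypothesis is true. The p-values are made up for illustration, and the function names are my own.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent tests
    when every null hypothesis is true: 1 - (1 - alpha)^n."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_reject(p_values, alpha=0.05):
    """Bonferroni: reject only when p <= alpha / m, where m is the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: compare the sorted p-values to alpha/m, alpha/(m-1), ...
    and stop at the first one that fails."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject


for m in (1, 5, 10, 20):
    print(f"{m:>2} tests -> family-wise error rate {family_wise_error_rate(m):.3f}")

# Hypothetical p-values from five tests in one study (made up for illustration).
p_values = [0.001, 0.012, 0.03, 0.04, 0.20]
unadjusted = [p < 0.05 for p in p_values]
bonferroni = bonferroni_reject(p_values)
holm = holm_reject(p_values)
print("unadjusted:", unadjusted)
print("bonferroni:", bonferroni)
print("holm:      ", holm)
```

With 20 independent tests at a 0.05 significance level, the chance of at least one false positive is about 64%. In the made-up example, four of the five results look significant before correction, but only one survives Bonferroni and two survive Holm.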
Excessive Model Fitting
This problem is like excessive hypothesis testing but relates to fitting many different regression models. P-hackers can experiment with numerous statistical models until they find one that delivers the desired results. This process becomes problematic when model selection is driven by statistical significance rather than by how appropriate the model is for the data and research question.
While it’s essential to control for confounding variables by including them in the model, it can be a double-edged sword. Deciding which variables to control for can be twisted into another form of p-hacking, especially if researchers base the decision on chasing statistical significance rather than on theoretical and subject-area reasons.
If you fit many models and use statistical significance to guide you, you can produce models that “explain” relationships in randomly generated data. To see this in action, read my post about the Dangers of Using Data Mining to Select Models.
This sneaky p-hacking technique isn’t limited to regression analyses; it can happen anytime there’s an option to pick and choose variables.
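To illustrate picking variables by significance, here is a simplified sketch. It screens candidate predictors one at a time by their correlation with the outcome rather than fitting full regression models, and it uses the Fisher z-transformation as an approximate significance test. Every variable is pure random noise, so any "significant" predictor is spurious. The setup and names are assumptions of mine for illustration.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_p_value(x, y):
    """Approximate two-sided p-value for H0: no correlation (Fisher z-transformation)."""
    z = math.atanh(pearson_r(x, y)) * math.sqrt(len(x) - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def share_with_spurious_predictor(n_sims=300, n_obs=50, n_predictors=20,
                                  alpha=0.05, seed=3):
    """Share of all-noise datasets where at least one predictor looks significant."""
    rng = random.Random(seed)
    found = 0
    for _ in range(n_sims):
        y = [rng.gauss(0, 1) for _ in range(n_obs)]
        # Screen every noise predictor and keep the smallest p-value.
        best_p = min(
            correlation_p_value([rng.gauss(0, 1) for _ in range(n_obs)], y)
            for _ in range(n_predictors)
        )
        found += best_p < alpha
    return found / n_sims


spurious_share = share_with_spurious_predictor()
print(f"datasets with a 'significant' noise predictor: {spurious_share:.2f}")
```

With 20 noise predictors to choose from, a majority of the simulated datasets should contain at least one that passes the 0.05 threshold. A model built around that variable would "explain" a relationship that does not exist.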
Read more about How to Specify Regression Models for tips on how to do it correctly.
Selective Reporting of Results
This p-hacking method involves cherry-picking the outcomes and hypothesis tests for reporting while failing to discuss nonsignificant results and changes in the study design. This method creates a false impression of the results’ strength by overemphasizing the significance and downplaying the weaknesses and nonsignificant findings.
For example, a study might measure many different outcomes and find a significant result for only one. Or they conduct many hypothesis tests and only present the few that give statistically significant results, conveniently leaving out the ones that don’t. This approach is akin to showing a highlight reel without the unimpressive plays.
If a study mainly finds nonsignificant results, you’d have good reason to question its few significant findings. However, if the reporting doesn’t discuss the slew of nonsignificant conclusions, you won’t know the proper context for evaluating the results.
Additionally, a series of nonsignificant findings followed by a significant result is a red flag for the trial-and-error methods I describe above.
Best Practices to Avoid P Hacking
As we delve deeper into p-hacking methods, it becomes increasingly clear how easy it is to veer into these practices, intentionally or not. It underlines the importance of sound statistical training and an unwavering commitment to scientific integrity. Aim to tell the story of the data as it is, not as we’d like it to be.
P-hacking can quietly erode the foundations of scientific research. But don’t despair. Here are some best practices to keep you on the right path.
Develop a Clear Research Plan
Create a detailed plan before conducting the research. It should include your hypotheses, data collection methods, and analyses. This clear roadmap helps keep you from sliding into the p-hacking trial-and-error approach of manipulating variables and varying the analysis until you get significant results.
Pre-Register Your Studies
Publicly specify your research plan before conducting the study. This approach further reduces the temptation to deviate based on interim findings. And it signals to other researchers that they can take your research more seriously. You can preregister studies at places like the Center for Open Science (cos.io).
Transparency
Report all your steps, even the not-so-successful ones. Honesty is your best ally in research. This transparency includes defining comparison groups in advance, reporting all variables, all conditions, all data exclusions, all tests, and all measures.
Education and Training
Many p-hacked studies stem from not understanding the pitfalls rather than malicious deception. Ensure you have a strong understanding of statistical principles and maintain an awareness of the pitfalls of p-hacking. Continuous learning is an essential tool in any researcher’s kit. It’s one of the many reasons I think understanding statistics is vital.
Ultimately, remember that every decision made during statistical analysis affects the results. P-hacking might not always be a deliberate act of deception. It can often stem from a lack of understanding of statistical principles.
Adhering to these best practices can keep our research robust and our findings credible. Avoiding p-hacking isn’t just about securing valid results; it’s about preserving the integrity of the scientific process.
Stay ethical and keep crunching those numbers responsibly!
P Hacking References
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (April 24, 2013). P-Curve: A Key to the File Drawer. Journal of Experimental Psychology: General.
Stefan, A. M., & Schönbrodt, F. D. (2023). Big little lies: a compendium and simulation of p-hacking strategies. R. Soc. open sci.