Can high p-values be helpful? What do high p-values mean?
Typically, when you perform a hypothesis test, you want to obtain low p-values that are statistically significant. Low p-values are sexy. They represent exciting findings and can help you get articles published.
However, you might be surprised to learn that higher p-values, the ones that are not statistically significant, are also valuable. In this post, I’ll show you the potential value of a p-value that is greater than 0.05, or whatever significance level you’re using.
The Role of Hypothesis Testing and P-Values
I’ve written about hypothesis testing and interpreting p-values in many other blog posts. I’ll summarize them for this blog post, but please read the related posts for more details.
Hypothesis testing is a form of inferential statistics. You want to use your sample data to draw conclusions about the entire population. When you collect a random sample, you might observe an effect within the sample, such as a difference between group means. But, does that effect exist in the population? Or, is it just random error in the sample?
For example, suppose you’re comparing two teaching methods and want to determine whether one method produces higher mean test scores. In your sample data, you see that the mean for Method A is greater than Method B. However, random samples contain random error, which makes your sample means very unlikely to equal the population means precisely. Unfortunately, the difference between the sample means of two teaching methods can represent either an effect that exists in the population or random error in your sample.
This point is where p-values and significance levels come in. Typically, you want p-values that are less than your significance levels (e.g., 0.05) because it indicates your sample evidence is strong enough to conclude that Method A is better than Method B for the entire population. Teaching method appears to have a real effect. Exciting stuff!
Related posts:
Higher P-Values and Their Importance
However, for this post, I’ll go in the opposite direction and try to help you appreciate higher, insignificant p-values! These are cases where you cannot conclude that an effect exists in the population. For the teaching method example above, a higher p-value indicates that we have insufficient evidence to conclude that one teaching method is better than the other.
Let’s graphically illustrate three different hypothetical studies about teaching methods in the plots below. Which of the following three studies have statistically significant results? The difference between the two groups is the effect size for each study. Here’s the CSV data file: studies.
All three studies appear to have differences between their sample means. However, even if the population means are exactly equal, the sample means are unlikely to be equal. We need to filter out the signal (real differences) from the noise (random error). That’s where hypothesis tests play a role.
The table displays the p-values from the 2-sample t-tests for the three studies.
Study | Effect Size | P-value |
1 | 6.01 | 0.116 |
2 | 9.97 | 0.140 |
3 | 1.94 | 0.042 |
Surprise! Only the graph with the smaller difference between means is statistically significant!
The key takeaway here is that you can use graphs to illustrate experimental results, but you must use hypothesis tests to draw conclusions about effects in the population. Don’t jump to conclusions because the patterns in your graph might represent random error!
P-values Greater Than the Significance Level
A crucial point to remember is that the effect size that you see in the graphs is only one of several factors that influence statistical significance. These factors include the following:
- Effect size: Larger effect sizes are less likely to represent random error. However, by itself, the effect size is insufficient.
- Sample size: Larger sample sizes allow hypothesis tests to detect smaller effects.
- Variability: When your sample data are more variable, random sampling error is more likely to produce substantial differences between groups even when no effect exists in the population.
You can have a large effect size, but if your sample size is small and/or the variability in your sample is high, random error can produce large differences between the groups. High p-values help identify cases where random error is a likely culprit for differences between groups in your sample.
Studies one and two, which are not significant, show the protective function of high p-values in action. For these studies, the differences in the graphs above might be random error even though it appears like there is a real difference. It’s tempting to jump to conclusions and shout to the world that Method A is better, “Everyone, start teaching using Method A!”
However, the higher p-values for the first two studies indicate that our sample evidence is not strong enough to reject the notion that we’re observing random sample error. If it is random error, Method A isn’t truly producing better results than Method B. Instead, the luck of the draw created a sample where subjects in the Method A group were, by chance, able to score higher for some reason other than teaching method, such as a greater inherent ability. In fact, if you perform the study again, it would not be surprising if the difference vanished or even went the other direction!
What High P-Values Mean and Don’t Mean
One thing to note, a high p-value does not prove that your groups are equal or that there is no effect. High p-values indicate that your evidence is not strong enough to suggest an effect exists in the population. An effect might exist but it’s possible that the effect size is too small, the sample size is too small, or there is too much variability for the hypothesis test to detect it.
While you might not like obtaining results that are not statistically significant, these results can stop you from jumping to conclusions and making decisions based on random noise in your data! High p-values help prevent costly mistakes. After all, if you base decisions on random error, you won’t gain the benefits you expect. This protection against jumping to conclusions applies to studies about teaching methods, medication effectiveness, product strength, and so on.
High p-values can be a valuable caution against making rash decisions or drawing conclusions based on differences that look important but might be random error!
If I get a result like the following after performing the non-parametric Kruskal-Wallis test (31 observations, 7 groups that I compared) and calculating the epsilon-squared effect size (e2). I get high p-values and high e2 values. Is this another example for random errors?
variables obs.tot obs.groups df statistic pvalue e2
variable1 31 7 6 12.5 0.0514 0.42
variable2 31 7 6 11.6 0.0723 0.39
variable3 31 7 6 8.0 0.2391 0.27
variable4 31 7 6 6.8 0.3411 0.23
Thanks!
All the best,
Aenna
Hi Aenna,
Yes, it looks like your p-values are all greater than 0.05. I can’t tell if you have 31 observations per group, or split between groups. If you have closer to 7 observations per group, that’s a very small sample size the test will have low power. In other words, it would be difficult for it to detect an effect even when it exists. That could explain why you have some apparently strong effect sizes but not statistical significance. Your test results might represent the protection that high p-values provide that I refer to in this post. Those strong effect sizes might be random error.
I hope this helps!
Dear Jim ! . I am a fan of your posts. I only request you to provide a sequence ( which should be read first and which is last to gain more knowledge). It will help beginners like me to get maximum benefits from these posts.
Thank you
Hi Asutosh!
I’m happy to hear that you like my blog posts! I will soon be writing an ebook that discusses all of the topics and much more in order. I’ve just released an ebook for regression analysis that does just that. Next up is a more introductory to statistics ebook. Stay tuned!
Thanks for reading!
Thanks for the valuable insights and perspective in interpreting the p-Value
Hi Jim, it is good practice to identify a difference between the two group means that is important before the study is implemented – the difference to detect. Then a sample size can be calculated that is sufficient to detect that difference or greater, given the background noise (standard deviation). Also, the chosen significance level is not an absolute criterion for reliable or meaningful decision making. Is p-value 0.05 really different than 0.06? The selection of significance level alpha is often a choice between .05 or .01. While the difference between the two is clear, p-values .04 to .06, and .01 to .02 may not be meaningful. Regards, Stan.
Hi Stan,
I 100% agree with your points. It’s always good to perform a power and sample size analysis before you conduct a study. Otherwise, your study might not have a good chance of detecting a meaningful effect. And, small differences between p-values don’t represent substantially differing amounts of evidence against the null hypothesis. Cheers!
Informative, many of us won’t jump to the conclusions after observing smaller p-values. Sir i request you to write something about degrees of freedom, df concept sometimes gets quite ambiguous
Hi, you’re in luck! I’ve already written about degrees of freedom. Enjoy!