Can high p-values be helpful? What do high p-values mean?

Typically, when you perform a hypothesis test, you want to obtain low p-values that are statistically significant. Low p-values are sexy. They represent exciting findings and can help you get articles published.

However, you might be surprised to learn that higher p-values, the ones that are not statistically significant, are also valuable. In this post, I’ll show you the potential value of a p-value that is greater than 0.05, or whatever significance level you’re using.

## The Role of Hypothesis Testing and P-Values

I’ve written about hypothesis testing and interpreting p-values in many other blog posts. I’ll summarize them for this blog post, but please read the related posts for more details.

Hypothesis testing is a form of inferential statistics. You want to use your sample data to draw conclusions about the entire population. When you collect a random sample, you might observe an effect within the sample, such as a difference between group means. But, does that effect exist in the population? Or, is it just random error in the sample?

For example, suppose you’re comparing two teaching methods and want to determine whether one method produces higher mean test scores. In your sample data, you see that the mean for Method A is greater than Method B. However, random samples contain random error, which makes your sample means very unlikely to equal the population means precisely. Unfortunately, the difference between the sample means of two teaching methods can represent either an effect that exists in the population or random error in your sample.

This point is where p-values and significance levels come in. Typically, you want p-values that are less than your significance level (e.g., 0.05) because that indicates your sample evidence is strong enough to conclude that Method A is better than Method B for the entire population. Teaching method appears to have a real effect. Exciting stuff!
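To make this concrete, here's a small simulation of that kind of test. This is a sketch, not the post's actual data: the score distributions, group sizes, and random seed are invented for illustration.

```python
# Illustrative only: simulated test scores for two hypothetical teaching methods.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
method_a = rng.normal(loc=78, scale=10, size=30)  # simulated Method A scores
method_b = rng.normal(loc=72, scale=10, size=30)  # simulated Method B scores

# 2-sample t-test: is the difference in sample means strong enough evidence
# of a difference in the population means?
t_stat, p_value = stats.ttest_ind(method_a, method_b)
print(f"Sample mean difference: {method_a.mean() - method_b.mean():.2f}")
print(f"p-value: {p_value:.4f}")
# p < significance level -> evidence of a real difference;
# p >= significance level -> insufficient evidence.
```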

## Higher P-Values and Their Importance

However, for this post, I’ll go in the opposite direction and try to help you appreciate higher, insignificant p-values! These are cases where you cannot conclude that an effect exists in the population. For the teaching method example above, a higher p-value indicates that we have insufficient evidence to conclude that one teaching method is better than the other.

Let’s graphically illustrate three different hypothetical studies about teaching methods in the plots below. Which of the three studies has statistically significant results? In each study, the difference between the two group means is the effect size.

All three studies appear to have differences between their sample means. However, even if the population means are exactly equal, the sample means are unlikely to be equal. We need to filter out the signal (real differences) from the noise (random error). That’s where hypothesis tests play a role.

The table displays the p-values from the 2-sample t-tests for the three studies.

| Study | Effect Size | P-value |
|-------|-------------|---------|
| 1     | 6.01        | 0.116   |
| 2     | 9.97        | 0.140   |
| 3     | 1.94        | 0.042   |

Surprise! Only the study with the smallest difference between means is statistically significant!

The key takeaway here is that you can use graphs to illustrate experimental results, but you must use hypothesis tests to draw conclusions about effects in the population. Don’t jump to conclusions because the patterns in your graph might represent random error!

## P-values Greater Than the Significance Level

A crucial point to remember is that the effect size that you see in the graphs is only one of several factors that influence statistical significance. These factors include the following:

- **Effect size**: Larger effect sizes are less likely to represent random error. However, by itself, the effect size is insufficient.
- **Sample size**: Larger sample sizes allow hypothesis tests to detect smaller effects.
- **Variability**: When your sample data are more variable, random sampling error is more likely to produce substantial differences between groups even when no effect exists in the population.

You can have a large effect size, but if your sample size is small and/or the variability in your sample is high, random error can produce large differences between the groups. High p-values help identify cases where random error is a likely culprit for differences between groups in your sample.
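A quick simulation can show how these factors interact. The numbers below are invented for illustration: a large difference measured with a tiny, noisy sample typically yields a high p-value, while a small difference measured with a large, low-variability sample can be clearly significant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Large true difference (10 points), but tiny samples and high variability.
small_noisy_a = rng.normal(60, 25, size=5)
small_noisy_b = rng.normal(50, 25, size=5)

# Small true difference (2 points), but large samples and low variability.
large_clean_a = rng.normal(52, 3, size=200)
large_clean_b = rng.normal(50, 3, size=200)

_, p_noisy = stats.ttest_ind(small_noisy_a, small_noisy_b)
_, p_clean = stats.ttest_ind(large_clean_a, large_clean_b)
print(f"Large effect, small/noisy sample: p = {p_noisy:.3f}")
print(f"Small effect, large/clean sample: p = {p_clean:.3f}")
```

The effect size alone tells you nothing about significance; it must be judged against the sample size and the variability.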

Studies one and two, which are not significant, show the protective function of high p-values in action. For these studies, the differences in the graphs above might be random error even though it looks like there is a real difference. It’s tempting to jump to conclusions and shout to the world, “Everyone, start teaching using Method A!”

However, the higher p-values for the first two studies indicate that our sample evidence is not strong enough to reject the notion that we’re observing random sample error. If it is random error, Method A isn’t truly producing better results than Method B. Instead, the luck of the draw created a sample where subjects in the Method A group were, by chance, able to score higher for some reason other than teaching method, such as a greater inherent ability. In fact, if you perform the study again, it would not be surprising if the difference vanished or even went the other direction!

## What High P-Values Mean and Don’t Mean

One crucial point: a high p-value does not prove that your groups are equal or that there is no effect. High p-values indicate only that your evidence is not strong enough to suggest an effect exists in the population. An effect might exist, but the effect size might be too small, the sample size too small, or the variability too high for the hypothesis test to detect it.
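One way to see this is with a quick power simulation (illustrative; the population means, standard deviation, and group size are invented): even when a real effect exists in the population, an underpowered study produces high p-values most of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# The populations genuinely differ (means 52 vs. 50), but with only 10
# subjects per group and a standard deviation of 5, the test is underpowered.
n_sims, n_per_group = 1000, 10
misses = 0
for _ in range(n_sims):
    a = rng.normal(52, 5, n_per_group)
    b = rng.normal(50, 5, n_per_group)
    if stats.ttest_ind(a, b).pvalue >= 0.05:
        misses += 1  # real effect, but not detected

print(f"Real effect missed (p >= 0.05) in {100 * misses / n_sims:.0f}% of simulated studies")
```

Each of those misses is a high p-value obtained while an effect truly exists, which is exactly why a high p-value cannot prove "no effect."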

While you might not like obtaining results that are not statistically significant, these results can stop you from jumping to conclusions and making decisions based on random noise in your data! High p-values help prevent costly mistakes. After all, if you base decisions on random error, you won’t gain the benefits you expect. This protection against jumping to conclusions applies to studies about teaching methods, medication effectiveness, product strength, and so on.

High p-values can be a valuable caution against making rash decisions or drawing conclusions based on differences that look important but might be random error!

keletso says

Hello Jim,

I’m unfamiliar with stats and don’t have enough time to go through the syllabus, so your blog has literally been a lifeline.

I found this particular post because I’m faced with yet another dilemma. I’m testing the effects of two different pharmaceutical drugs on the same experimental model. One has an 80% effect and the other has a negative value (which is most likely due to deterioration of the drug and other factors). When I did a t-test on these two results, they turned out to have a p-value higher than 0.05.

My thought process says this is impossible, but Excel just gave it to me.

I also calculated the SEM, and it was quite low. Would it be correct to assume that, since the SEM is low, the chances of this p-value being random are also low?

Would you have any advice for me? I performed a two-tailed test for two samples with unequal variances.

Jim Frost says

Hi,

It’s definitely possible to get high p-values (non-significant results) even when you have an apparent effect, like you do. First off, the p-value incorporates the SEM, so that’s factored in. Given the little bit I know of your data, I’d say that while you have an apparent effect, your sample doesn’t provide enough evidence to say it exists in the population. It might just be random sampling error, and there is no population effect. Or, there might be a population effect, but you have too small a sample and/or too much variability in your data to detect it.

Keep in mind that insignificant results don’t prove that an effect doesn’t exist in the population. It just means you don’t have strong enough evidence to say it does. I explain that in more depth in my post about failing to reject the null hypothesis. The discussion in that post applies to your results.

I hope that helps!

Stan Alekman says

Jim, I think of random as meaning the absence of structure in a data set, which is what runs tests estimate. Yes, we can estimate random error in a model if the model is a true replica of the population it is purported to represent.

Stan Alekman says

Jim, I think of random as a process, not as a thing.

Jim Frost says

Hi Stan, random is a bit of a slippery term in statistics. Sometimes it’s a process. Sometimes it’s a thing you can measure and assess, such as the amount of random error in a model. Sometimes you need the process (e.g., random sampling) in order to obtain the thing (e.g., random error).

Stan Alekman says

I work in the pharmaceutical and chemical area so samples are physical. When I collect a sample, I perform a runs test to check for absence of randomness and an I/mR test for statistical process control. Sample sizes cannot be small to do this.

I’m reminded of a sentence I read some time ago, “Randomness is too important to be left to chance alone.”

Jim Frost says

I love that expression, Stan! Randomness can be surprisingly difficult to obtain. Whenever I talk about obtaining random samples, I often clarify that while “random” might sound like a synonym of haphazard, it’s often difficult to obtain a truly random sample.

Collinz says

Thanks Jim for the good work done in simplifying statistics.

“It’s really true that random chance in the sampling process can influence even the best research projects.”

But now I wonder whether there can be possible ways of modeling random chance so that we can have reliable results.

In other words how random is random???? Because there might be need to cater for the unexpected randomness.

Otherwise happy 2021.

Jim Frost says

Hi Collinz,

Thanks for the kind words! And Happy New Year!

To supplement Stan Alekman’s wise words, I have written a post about using control charts in conjunction with hypothesis tests, of which the IMR chart he mentions is one type.

I’d also add that hypothesis tests do incorporate random sampling error into their calculations. Suppose you have measures of a characteristic in two random samples, Sample A and Sample B. Sample A has a higher mean than Sample B. If you had measured the entire populations, you’d know whether Population A’s mean is truly higher. However, because you’re dealing with samples, you don’t know for sure. It’s possible Population A does have a higher mean than Population B. However, it’s also possible that Population B has a higher mean and that, by chance, you obtained an unusual sample from either or both populations.
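That sampling-error idea is easy to demonstrate with a simulation (the population values below are invented for illustration): draw repeated samples from two populations with identical means and see how often, and by how much, the sample means differ purely by chance.

```python
import numpy as np

rng = np.random.default_rng(7)

# Two populations with IDENTICAL means (100): any difference between the
# sample means is pure random sampling error.
n_sims = 1000
diffs = []
for _ in range(n_sims):
    a = rng.normal(100, 15, 20).mean()  # sample mean from Population A
    b = rng.normal(100, 15, 20).mean()  # sample mean from Population B
    diffs.append(a - b)

diffs = np.array(diffs)
print(f"Largest chance difference observed: {np.abs(diffs).max():.1f}")
print(f"Proportion of samples where A > B: {(diffs > 0).mean():.0%}")
```

Even with no population difference at all, sample A "wins" about half the time, sometimes by a sizable margin, which is exactly what the hypothesis test has to account for.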

I write about this process in my post about how hypothesis testing works, where I discuss random error in detail and how it’s incorporated. For random error in the context of regression analysis, read my post about assessing residual plots. These plots help ensure that your model’s error is appropriately random!

Chester says

Hi Jim,

I conducted a replication experiment and aimed to have 90% power to detect 75% of the original effect, but I wasn’t able to recruit enough participants and had 83% power in the end. The p-value was large (.28), and the effect size (Cohen’s d) was small: 0.09 vs. 0.26. I’m trying to interpret how much the lack of power affected my inability to detect an effect. Would it be correct to say that a high p-value and small effect size suggest that power alone is unlikely to account for the lack of effect? I don’t want to conclude with only “can’t really make conclusions due to lack of power”; I’d rather offer a more nuanced view but am struggling to articulate it.

Thanks in advance!

Chester

Jim Frost says

Hi Chester,

It sounds like you’re trying to determine whether the effect exists and you didn’t detect it due to a lack of power or the effect doesn’t exist. It’s not possible to do that with just the two studies. It’s possible that if you had more subjects, the results would’ve been significant. On the other hand, it’s also possible that the results would’ve stayed not significant. There’s just no way to know for sure.

The statistically correct conclusion to draw is that your sample data provides insufficient evidence to conclude that the effect exists in the population. This interpretation is worded very carefully. The results are not evidence that the effect does NOT exist. Just that the sample doesn’t provide evidence that it does exist. For more information, read my post about failing to reject the null hypothesis where I explain in more detail. That’ll show you how to interpret and think about your results.

Also, consider that failing to replicate results is not unusual. You might be interested in reading my post where I look at the relationship between p-values and the reproducibility of the results.

I hope this helps!

Stan Aleeman says

The p-value speaks to the null hypothesis. We either accept or reject the null based on the p-value at the chosen significance level. It does not address the alternate hypothesis. Rejecting the null does not mean accepting the alternate hypothesis.

BC says

Dear Jim,

Thank you for helping us who are struggling with statistics with your articles and answers 🙂 I would appreciate your help with following case:

I calculated Spearman’s correlation for several different combinations of variables. In some cases, the correlation is low, for example 0.15 (which would mean that the variables are not correlated), but the sig. is lower than 0.05 (which would mean that the correlation is significant, n=225). If H0 is “there is no correlation” and H1 is “there is a correlation between the variables,” how should I interpret the results? If I look at the sig., I would need to reject H0 and conclude that there is a correlation between the variables, but if I look at Spearman’s coefficient, the conclusion should be that there is no correlation. At the same time, I have a case (a different sample, n=10, not the same as in my first example) where the correlation is 0.50, but the sig. is higher than 0.05. Does that mean I should accept H0 (there is no correlation), even if the correlation is 0.50?

I see from your article that the reason is probably the size of the sample, but I’m not sure how to interpret the results and whether the results can remain as they are or I have to do some new, additional testing?

Thank you a lot in advance!

Jim Frost says

Hi BC

This appears to be a matter of statistical power and how small of an effect the test can detect. Keep in mind that when you have a smaller sample size, the margin of error around the sample estimate is relatively large.

A correlation of 0.15 indicates that there is a very weak relationship between that pair of variables in your sample; it doesn’t mean there’s no relationship. The statistical significance indicates that you have sufficient evidence to conclude that this weak relationship exists in the population from which you drew your sample. So, one statistic measures the strength (very weak) of the relationship in the sample, while the p-value indicates how much evidence you have that the sample relationship exists in the population. They’re two different, but related, things. 225 is a large sample size that gives the test a large amount of statistical power, which allows it to detect even very small effects. This sample size allows the test to determine that while 0.15 is very small, it likely does not equal zero even after accounting for random sampling error.

For the correlation of 0.5, you have a much smaller sample size, which reduces the statistical power and, hence, the ability to detect an effect. After accounting for random sampling error, you can’t confidently say that the correlation is different from zero in the population, even though 0.50 is a larger sample correlation.
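For readers who want to check this kind of result themselves, here’s a sketch that computes the p-value for a correlation from just r and n, using the standard t approximation t = r·sqrt((n−2)/(1−r²)). Note the caveats: SciPy’s `spearmanr` computes this directly from raw data; this summary-based version is exact for Pearson’s r under normality and only an approximation for Spearman’s rho, and the function name is mine.

```python
import math
from scipy import stats

def corr_p_value(r, n):
    """Two-sided p-value for H0: correlation = 0, via the t approximation
    t = r * sqrt((n - 2) / (1 - r**2)) with n - 2 degrees of freedom."""
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    return 2 * stats.t.sf(abs(t), df=n - 2)

# Weak correlation, large sample: significant at the 0.05 level.
print(f"r = 0.15, n = 225: p = {corr_p_value(0.15, 225):.4f}")
# Stronger correlation, tiny sample: not significant.
print(f"r = 0.50, n = 10:  p = {corr_p_value(0.50, 10):.4f}")
```

This reproduces the pattern in the question above: significance depends on the sample size, not just the size of the correlation.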

I hope that helps clarify it!

Andrew says

Interesting article.

I remember reading an article years ago that sometimes a high p-value can be used to determine that an experiment was “too good to be true”. I believe it was in the context of a Chi-square test. Have you ever encountered or discussed this case?

Thanks for the great articles.

Jim Frost says

Hi Andrew,

I’m not sure exactly what the article meant by “too good to be true.” In this post, I explain how random sample error can create the appearance of effects that don’t exist in the population. If some of these faux effects are surprising and “too good to be true,” then p-values can reveal it. But, I’m not sure if that is the type of too good to be true results the article is referencing.

ca says

Hi, can I just ask: I’m working through some data where we’re looking at quality of life improvements and whether they improved with two interventions (yes or no), for which I did a chi-squared test. Then we used the actual changes in the quality of life values to do linear regression. The chi-squared p-value was over the 0.05 significance level, whereas the linear regression model gave a below-0.05 significance level. I’m just unsure why I would get those different results?

Jim Frost says

Hi,

It sounds like you need to use binary logistic regression because of the binary dependent variable (Yes, No). You can use categorical variables for the independent variables to represent the interventions. You don’t want to use a chi-squared test for what you want to do.

For more information, read my post about choosing the correct type of regression analysis and look for binary logistic regression in that post.

Dinh-Phuong Duong says

Hi Jim,

Maybe the answer is in the article and I somehow missed it due to not being so familiar with statistics and its terms.

Is there a meaning to really high p-values like 0.78, or would you treat/interpret such a value the same as a p-value of 0.10 (both above an alpha level of 0.05)?

What would be the cause of really high p-values (0.78)? I guess a small sample size? So far I couldn’t find literature that explains really high p-values. The only thing that seems to be important is whether or not they are above 0.05.

All the best

Phuong

Jim Frost says

Hi Phuong,

To start, p-values are the probability of obtaining the observed results, or more extreme, if the null hypothesis is true.

So, the easiest way to think about both low and high p-values is in the context of the null hypothesis, which usually reflects no effect, difference, or relationship. For this discussion, imagine that we’re comparing two group means and that the null hypothesis is that the difference between the means is zero (i.e., there’s no difference between groups).

Suppose you collect a random sample for both groups and perform a 2-sample t-test.

If the difference between the means is:

- Close to zero (near the null hypothesis value), your data are consistent with the null hypothesis, and you’ll obtain a higher p-value.
- Far from zero, your data are inconsistent with the null hypothesis, and you’ll obtain a lower p-value.

There are several factors that determine what counts as near and far from the null hypothesis value. One is sample size. Larger samples are likely to produce means that are close to the true population mean. So, if the difference between means is zero at the population level and you have a larger sample size, your observed difference is more likely to be close to zero. The other factor is the variability in the data. When the true difference is zero but you have larger variability in your data, it won’t be surprising to obtain sample differences further from zero even though the true difference is zero.

Putting all of this together, there are three key reasons for large p-values:

- No effect exists in the population, and your sample reflects that.
- An effect exists in the population, but your sample size is too small to detect it.
- An effect exists in the population, but the variability in your data is too high to detect it.

In terms of interpreting 0.1 versus 0.78, technically you can say that you are more likely to observe data that produce a p-value of 0.78 when the null is true. However, in a practical sense, both p-values indicate that your results are not statistically significant. Your data do not provide sufficient evidence to conclude that an effect/difference/relationship exists. It’s also important to note that a non-significant p-value does not indicate that the two means are equal (using our example). That is one possibility on the list above. But it could also reflect a small sample size and/or highly variable data.

I hope that helps!

Jonathan Franklin says

Hi Jim

This article and several others on this website were really helpful for understanding statistics, especially the concept of high p-values. But at the same time, I am beginning to doubt my results. I am not so familiar with statistics, since I am a computer science student. I hope it is alright to ask this question here.

I am currently working towards the end of a thesis. I compared the user experience of Android apps and Progressive Web Apps (PWAs, basically websites which act like normal apps), and the research question was whether PWAs can keep up with Android apps in terms of UX.

For measuring UX, I used an existing questionnaire (UEQ), which calculates different scales of UX.

To find out if a significant difference exists, I performed t-tests for each scale.

Every calculated p-value was way above the alpha level of 0.05, so I failed to reject the null hypothesis for every scale. At the beginning I thought everything was fine -> no statistical difference found -> PWAs can keep up.

But now I am finding out in this article, a high p-value does not indicate that the groups are equal or that there is no effect. Other sources say “no statistical significance” just means that the Null hypothesis cannot be rejected and “anything is possible”.

Question: Are the results of the questionnaire of any use to answer my research question or do I have to say “anything is possible”?

I performed a qualitative study as well to answer the research question, which concluded that both apps offer the same UX, but now I am doubting whether I can relate it to the results of the quantitative questionnaire.

Originally I thought you could use the Null hypothesis as a result but I guess I was wrong.

A number of research papers, which had a similar topic and t-Test, just said “no statistical significance has been found, therefore we conclude that both types of Apps were possibly able to ……..”, which doesn’t seem to be mathematically correct.

Best regards

Jonathan

Jim Frost says

Hi Jonathan,

Yes, the article you’ve read is correct. In a hypothesis test, when you get a low p-value, you can reject the null hypothesis. Most hypothesis tests are setup so that rejecting the null indicates that there is a difference between groups. In most cases, the researchers are hoping to find a difference because that represents an exciting find of some sort. So, the burden of proof is on them to find sufficient amounts of high quality data that suggests that the exciting difference does indeed exist.

It’s important to note that when you have a high p-value, you fail to reject the null. You’re not accepting the null. That’s an entirely different thing. Here’s why.

Let’s say you’re comparing two groups and you’re hoping to show a difference, the more typical scenario for a hypothesis test. However, you collect a small sample, and thanks to poor measurements, the data are noisy (high variability). You’ll obtain a high p-value. Darn it, so much for your exciting findings! But given the low quality of your data, it’s not surprising, right? A small sample size and noisy data produce high p-values very easily. It doesn’t mean the groups are equal; it means you have poor data that don’t show the groups are different. To get your exciting differences, you need to work hard and put the time and resources into collecting a sufficient amount of high-quality data.

Your scenario is the opposite of that. You’re hoping to find that the groups are equal. That would be an important finding. So, you need a test set up such that you must work hard to obtain proof that the groups are equal. If you use a typical hypothesis test, you could just do a sloppy job, collect a small sample of noisy data, get the inevitable high p-value, and claim that you’ve proven they’re equal. But you shouldn’t use poor-quality data to make any claims. So, you need a hypothesis test that makes you work hard, collecting a sufficient amount of high-quality data, to prove that the groups are equal. Think of it in terms of where the burden of proof lies. In your case, the burden of proof lies on proving they’re equal. Poor-quality data will not help you here!

Fortunately, statisticians have created these tests. I’ll need to write about them someday! For now, you can read about equivalence tests in this Wikipedia article.

In a nutshell, equivalence tests flip things around. The null hypothesis is that the groups are different. The alternative is that they’re the same. The burden of proof falls on obtaining good-quality data that suggest equivalency. You’ll work to obtain that low, statistically significant p-value. In these tests, you need to define what is functionally equivalent. After all, thanks to random error, the groups are unlikely to be exactly the same in your sample data even if they are in the full population.
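As a rough sketch of how the two one-sided tests (TOST) procedure for equivalence works: test the observed difference against each edge of a pre-declared equivalence margin, and take the larger of the two one-sided p-values. This is an illustration assuming equal variances, not a substitute for a vetted implementation; the function name and margins are mine.

```python
import math
from scipy import stats

def tost_ind(x1, x2, low, high):
    """TOST p-value for equivalence of two independent group means,
    assuming equal variances. (low, high) is the equivalence margin:
    differences inside it count as functionally equivalent.
    p < alpha -> conclude equivalence."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    # Pooled variance and standard error of the mean difference.
    v1 = sum((x - m1) ** 2 for x in x1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in x2) / (n2 - 1)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    df = n1 + n2 - 2
    diff = m1 - m2
    # One one-sided t-test against each margin.
    p_lower = stats.t.sf((diff - low) / se, df)    # H0: diff <= low
    p_upper = stats.t.cdf((diff - high) / se, df)  # H0: diff >= high
    return max(p_lower, p_upper)
```

Notice the burden-of-proof property: with a tiny, noisy sample, the TOST p-value stays high even when the sample means match, so you cannot demonstrate equivalence with poor data.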

I hope this helps!

Aenna says

What if I get a result like the following after performing the non-parametric Kruskal-Wallis test (31 observations, 7 groups compared) and calculating the epsilon-squared effect size (e2)? I get high p-values and high e2 values. Is this another example of random error?

| variable  | obs.tot | obs.groups | df | statistic | p-value | e2   |
|-----------|---------|------------|----|-----------|---------|------|
| variable1 | 31      | 7          | 6  | 12.5      | 0.0514  | 0.42 |
| variable2 | 31      | 7          | 6  | 11.6      | 0.0723  | 0.39 |
| variable3 | 31      | 7          | 6  | 8.0       | 0.2391  | 0.27 |
| variable4 | 31      | 7          | 6  | 6.8       | 0.3411  | 0.23 |

Thanks!

All the best,

Aenna

Jim Frost says

Hi Aenna,

Yes, it looks like your p-values are all greater than 0.05. I can’t tell if you have 31 observations per group or 31 split between the groups. If you have only a handful of observations per group, that’s a very small sample size, and the test will have low power. In other words, it would be difficult for it to detect an effect even when one exists. That could explain why you have some apparently strong effect sizes but no statistical significance. Your test results might represent the protection that high p-values provide, which I refer to in this post. Those strong effect sizes might be random error.

I hope this helps!

Asutosh says

Dear Jim! I am a fan of your posts. I only request that you provide a sequence (which should be read first and which last to gain more knowledge). It will help beginners like me get the maximum benefit from these posts.

Thank you

Jim Frost says

Hi Asutosh!

I’m happy to hear that you like my blog posts! I will soon be writing an ebook that covers all of these topics, and much more, in order. I’ve just released an ebook for regression analysis that does just that. Next up is an introductory statistics ebook. Stay tuned!

Thanks for reading!

alifffirdaus MY says

Thanks for the valuable insights and perspective in interpreting the p-Value

Stanley Alekman says

Hi Jim, it is good practice to identify a difference between the two group means that is important before the study is implemented: the difference to detect. Then a sample size can be calculated that is sufficient to detect that difference or greater, given the background noise (standard deviation). Also, the chosen significance level is not an absolute criterion for reliable or meaningful decision making. Is a p-value of 0.05 really different from 0.06? The selection of significance level alpha is often a choice between .05 and .01. While the difference between those two is clear, the differences between p-values of .04 and .06, or .01 and .02, may not be meaningful. Regards, Stan.

Jim Frost says

Hi Stan,

I 100% agree with your points. It’s always good to perform a power and sample size analysis before you conduct a study. Otherwise, your study might not have a good chance of detecting a meaningful effect. And, small differences between p-values don’t represent substantially differing amounts of evidence against the null hypothesis. Cheers!

qayoom khachoo says

Informative. Many of us won’t jump to conclusions after observing smaller p-values now. Sir, I request that you write something about degrees of freedom; the df concept sometimes gets quite ambiguous.

Jim Frost says

Hi, you’re in luck! I’ve already written about degrees of freedom. Enjoy!