How high does R-squared need to be in regression analysis? That seems to be an eternal question.
Previously, I explained how to interpret R-squared. I showed how the interpretation of R2 is not always straightforward. A low R-squared isn’t always a problem, and a high R-squared doesn’t automatically indicate that you have a good model.
So, how high should R-squared be? The definitive answer is . . . it depends. You’ll need some patience because my assertion is that this question is the wrong question. In this post, I reveal why it is the wrong question and which questions you should ask instead.
Related post: Five Reasons Why Your R-squared can be Too High
How High Does R-squared Need to Be Is the Wrong Question
How high does R-squared need to be? If you think about it, there is only one correct answer. R-squared should accurately reflect the percentage of the dependent variable variation that the linear model explains. Your R2 should not be any higher or lower than this value.
The correct R2 value depends on your study area. Different research questions have different amounts of variability that are inherently unexplainable. Case in point: humans are hard to predict. Any study that attempts to predict human behavior will tend to have R-squared values less than 50%. However, if you analyze a physical process and have very good measurements, you might expect R-squared values over 90%. There is no one-size-fits-all answer for how high R-squared should be.
Consequently, the answer to “how high does R-squared need to be?” is that it depends on the amount of variability that is actually explainable. Clearly, your R-squared should not be greater than that amount, yet inflated R2 values do occur in regression. To see if your R-squared is in the right ballpark, compare your R2 to those from other studies.
Chasing a high R2 value can produce an inflated value and a misleading model. Read my post about adjusted R-squared and predicted R-squared to see how these statistics can help you avoid these problems.
Related post: Model Specification: Choosing the Correct Regression Model
Define Your Objectives for the Regression Model
When you wonder if the R-squared is high enough, it’s probably because you want to know if the regression model satisfies your objectives. Given your requirements, does the model meet your needs? Therefore, you need to define your objectives before proceeding.
To determine whether a model meets your objectives, you’ll need to ask different questions because R2 doesn’t address this issue. The correct questions depend on whether your primary goal for the model is:
- To understand the relationships between the independent variables and dependent variable. Or,
- To predict the dependent variable.
R-squared and Understanding the Relationships between the Variables
If your primary goal is to understand the relationships between the variables in your model, the answer to how high R-squared needs to be is very simple. For this objective, R2 is irrelevant.
This statement might surprise you. However, the interpretation of the significant relationships in a regression model does not change regardless of whether your R2 is 15% or 85%! The regression coefficients define the relationship between each independent variable and the dependent variable. The interpretation of the coefficients doesn’t change based on the value of R-squared.
Suppose we have a statistically significant coefficient that equals 2. This coefficient indicates that the mean of the dependent variable increases by 2 for every one-unit increase in the independent variable irrespective of the R2 value.
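To see this in action, here’s a minimal simulation sketch in Python (made-up data, using numpy and scipy, which are my tool choices here rather than anything from a real study). Both datasets share a true slope of 2 but differ in the amount of noise, and the fitted coefficient tells the same story either way:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)

# Same true relationship (slope = 2) in both datasets; only the amount
# of unexplainable noise differs.
y_low_noise = 2 * x + rng.normal(0, 1, 200)
y_high_noise = 2 * x + rng.normal(0, 10, 200)

for label, y in [("low noise", y_low_noise), ("high noise", y_high_noise)]:
    fit = linregress(x, y)
    print(f"{label}: slope = {fit.slope:.2f}, R-squared = {fit.rvalue ** 2:.2f}")

# Typical output: the slope is close to 2 in both fits, but R-squared
# drops from about 0.97 to about 0.25. The coefficient still means the
# same thing: y rises by about 2 for each one-unit increase in x.
```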
Related post: See a graphical illustration of why the interpretation of coefficients does not depend on R-squared.
The question about how high R-squared needs to be doesn’t make sense in this context because it doesn’t matter. A small R2 doesn’t nullify or change the interpretation of the coefficient for an independent variable that is statistically significant.
Instead of wondering if your R-squared value is high enough, you should ask the following questions to ensure that you can trust your results:
- Do I have a sound basis for my model?
- Can I trust my data?
- Do the residual plots look good?
- Do the results fit theory?
- How do I interpret the regression coefficients and P-values?
R-squared and Predicting the Dependent Variable
On the other hand, if your primary goal is to use your regression model to predict the value of the dependent variable, R-squared is a consideration.
Predictions are more complex than just the single predicted value. Predictions include a margin of error. More precise predictions have a smaller amount of error.
R2 is relevant in this context because it is a measure of the error. Lower R2 values correspond to models with more error, which in turn produces predictions that are less precise. In other words, if your R2 is too low, your predictions will be too imprecise to be useful.
A low R-squared can be an indicator of imprecise predictions. However, R2 doesn’t tell you directly whether the predictions are sufficiently precise for your requirements.
We need a direct measure of precision that uses the units of the dependent variable. That’s why asking, “How high does R-squared need to be?” still is not the correct question.
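One such measure is the standard error of the regression (S), which I mention again at the end of this post. As a rough sketch with simulated data (my own made-up numbers), S estimates the typical residual size in the dependent variable’s own units:

```python
import numpy as np

# Simulated data: the noise standard deviation is 3, in y's units.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 * x + rng.normal(0, 3, 100)

# Fit a straight line and compute the residuals.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# S = sqrt(SSE / (n - number of model parameters)); 2 parameters here.
n = len(y)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))
print(f"S = {s:.2f} (same units as the dependent variable)")  # about 3
```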
Instead, you should ask the questions above plus the following question:
- Are the prediction intervals precise enough for my requirements?
Using Prediction Intervals to Assess Precision
Most statistical software can calculate prediction intervals, and they are easy to use.
A prediction interval is a range where a single new observation is likely to fall given values of the independent variable(s) that you specify. These ranges incorporate the margin of error around the predicted value. If the prediction intervals are too wide, the predictions don’t provide useful information. Narrow prediction intervals represent more precise predictions.
In my post about using regression analysis to make predictions, I present the model displayed in the graph. This model uses BMI to predict the percentage of body fat. The 95% prediction interval for a BMI of 18 is 16-30% body fat. We can be 95% confident that an individual with a BMI of 18 will fall within this range.
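If you want to try this yourself, here’s a minimal sketch using statsmodels in Python. The numbers below are simulated stand-ins, not the actual BMI dataset from that post:

```python
import numpy as np
import statsmodels.api as sm

# Simulated stand-in data: BMI predicting %body fat (NOT the original
# dataset, just made-up numbers with a similar flavor).
rng = np.random.default_rng(1)
bmi = rng.uniform(16, 40, 150)
body_fat = 5 + 0.9 * bmi + rng.normal(0, 3.5, 150)

model = sm.OLS(body_fat, sm.add_constant(bmi)).fit()

# 95% prediction interval for a single new person with BMI = 18.
new_x = sm.add_constant(np.array([18.0]), has_constant="add")
frame = model.get_prediction(new_x).summary_frame(alpha=0.05)
print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])
# obs_ci_lower / obs_ci_upper bound where a single new observation is
# likely to fall (the prediction interval). mean_ci_lower / mean_ci_upper
# give the narrower confidence interval for the mean instead.
```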
At this point, you need to use client requirements, spec limits, and subject area knowledge to determine whether the prediction intervals are narrow enough to represent meaningful predictions. By assessing the prediction intervals, you are evaluating the precision of the model directly rather than relying on an arbitrary cut-off value for R-squared.
I’m not a medical expert, but I’d guess that the 14-point range of 16-30% is too wide to provide meaningful information. If this is true, our regression model is too imprecise to be useful.
Related posts: Understand Precision in Applied Regression to Avoid Costly Mistakes and Confidence Intervals vs Prediction Intervals vs Tolerance Intervals.
R-squared Is Overrated!
Asking “How high does R-squared need to be?” is usually not the correct question to ask. You probably want to know if the regression model can meet your needs. To this end, there are better questions that you should ask.
R-squared gets all of the attention for assessing the goodness-of-fit. It seems like a simple statistic to interpret. However, evaluating the fit involves more than just this single statistic. You need to use subject area knowledge, residual plots, coefficients, and prediction intervals if you’re making predictions.
However, R-squared does have some good uses. For one thing, compare your R2 value to values from similar studies. If your R2 is markedly higher or lower, you should investigate because there might be a problem.
Be sure to read my post about the standard error of the regression (S), which is a different type of goodness-of-fit measure that is more useful when you need to make predictions.
If you’re learning regression, check out my Regression Tutorial!
Mark Solomons says
Hi Jim
I appreciate you making this field accessible to the Average Joe like me. I have a really basic question. Let’s say I’ve done a study with a treatment that increases my dependent variable by 20%. That’s a pretty good outcome. Just assume also that it has a low p-value below 0.05, so I assume it translates to a broader population. Then I run through a simple linear regression and get an R^2 of 0.15. Ugh (but maybe not unexpected). If I add some control variables, I can maybe get the R^2 up to 0.25. How do I interpret this? I like the treatment effect, but the R^2 values are pretty shabby. Can I assume that I could get a 20% treatment effect across the broader population but I can only explain 15-25% with my treatment and control variables? Or do I assume I should only get a 20% x 15-25% treatment effect across the population? Sorry if this is a stupid question.
-M
Jim Frost says
Hi Mark,
I have written a post that answers this question exactly: What does it mean when you have a regression model with statistically significant variables but a low R-squared?
Read that post and then if you have unanswered questions, put them in the comments section in that post.
Moses Fernandes says
Hi Jim,
I am a bit new to panel data regression. I was trying to apply fixed and random effects models to my dataset. After running the regression for my panel data set, I found two issues:
1. When the Hausman test indicated that the random effects model is appropriate, my R-squared is very low (sometimes below 8%), but when the Hausman test indicates the fixed effects model is appropriate, the R-squared is fine. However, having read your book, I have understood that the R-squared is not very important if my aim is only to find out which variables are significant. So, is having that low of an R-squared OK? Or does it need to be higher? If yes, is there a rule of thumb that I can follow?
2. The second issue I am facing is that my Durbin-Watson statistic is less than 1, which points to positive autocorrelation. I can’t add more variables to my study because I am studying the significance of these specific variables only. If I add a lag value of my dependent variable as an independent variable, it does correct for the autocorrelation, but now there is a risk of potential endogeneity. Please suggest a possible course of action to remedy the issue.
Thanking you in anticipation.
Regards,
Moses.
Jim Frost says
Hi Moses,
Yes, you’re correct that having a high R-squared isn’t crucial when you have statistically significant independent variables, and your main goal is understanding the relationship between the variables. So, your low R-squared isn’t necessarily a problem. However, there are several considerations.
One, you should see what R-squared values similar studies have obtained. Are they similar or different? If they’re noticeably higher, it could indicate there’s something wrong with your study or analysis. At the very least, it is a potential critique by reviewers. So, you’ll want to look at similar studies to understand how yours fits in.
Two, be sure that the low R-squared isn’t occurring because of some assumption violation. Assumption violations can distort the results. So, if you’re primarily interested in the relationships between the variables and the low R-squared reflects a violation that distorts those relationships, that IS a problem because you can’t trust the results for those relationships.
That brings me to your second question because it appears there are some potential assumption violations.
That sounds like a tricky model to fit! The autocorrelation will reduce the precision of the regression coefficients while the endogeneity will bias them.
One potential solution for that is adding an instrumental variable. I haven’t written about this type of variable (yet) but click the link to go to the Wikipedia article. Instrumental variables are basically explanatory variables that correlate with the explanatory endogenous variable (the lagged value in your case) but do not correlate with the error term. That’ll involve using your subject area knowledge to identify an appropriate instrumental variable. The Wikipedia article provides some examples.
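To sketch the basic idea with made-up variable names (not your data), manual two-stage least squares looks like the following. Note that this is only illustrative; a dedicated IV routine is preferable because the second-stage standard errors below are not corrected:

```python
import numpy as np
import statsmodels.api as sm

# Made-up data: y is the outcome, x_endog is the endogenous regressor
# (think of your lagged dependent variable), and z is an instrument that
# correlates with x_endog but not with the error term u.
rng = np.random.default_rng(7)
n = 500
z = rng.normal(size=n)
u = rng.normal(size=n)                       # shared error creates endogeneity
x_endog = 0.8 * z + 0.5 * u + rng.normal(size=n)
y = 2.0 * x_endog + u + rng.normal(size=n)   # true coefficient is 2.0

# Stage 1: regress the endogenous variable on the instrument.
x_hat = sm.OLS(x_endog, sm.add_constant(z)).fit().fittedvalues

# Stage 2: regress the outcome on the stage-1 fitted values.
naive = sm.OLS(y, sm.add_constant(x_endog)).fit()
tsls = sm.OLS(y, sm.add_constant(x_hat)).fit()
print(f"Naive OLS estimate: {naive.params[1]:.2f}")  # biased upward
print(f"2SLS estimate:      {tsls.params[1]:.2f}")   # close to 2.0
# Caveat: the stage-2 standard errors are NOT valid as printed here;
# use a dedicated IV routine for proper inference.
```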
Finally, even when you only want to test specific variables, you might need to add other variables. Your focus can remain primarily on the ones you’re testing even when you add the other variables. The problem is that when you don’t add other variables that should be in the model, their omission can distort the results for the variables that you specifically want to evaluate! The additional explanatory variables can statistically control for biases that would otherwise distort your primary explanatory variables of interest. So, don’t let the fact that you’re only interested in specific variables prevent you from adding other explanatory variables as needed.
I hope that helps!
Yushuf Sharker says
Hello Jim,
Nice R^2 discussion. Regarding “compare your R2 to those from other studies”, is R2 really comparable bluntly across studies? I appreciate your thoughts.
Jim Frost says
Hi Yushuf,
Yes, in fact, it’s designed to be comparable between studies, even those using different units for the dependent variable. R-squared is a standardized effect size. That means it doesn’t use any natural units, and it fits within a specific range, 0 to 100%. Because it doesn’t use the natural DV units and every study falls somewhere in that range, you can compare different studies. In this regard, it’s similar to correlation coefficients.
In contrast are effect sizes measured using the natural data units. Those aren’t necessarily comparable between studies when the various studies use different units for their outcome measure. These include means, mean differences, and standard error of the regression.
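Here’s a quick illustrative sketch with simulated data: rescaling the dependent variable (say, kilograms to pounds) leaves R-squared untouched, while a units-based measure like the standard error of the regression changes along with the units:

```python
import numpy as np
from scipy.stats import linregress

# Simulated data: the same outcome recorded in two different units.
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
y_kg = 2 * x + rng.normal(0, 3, 100)   # outcome in kilograms
y_lb = y_kg * 2.20462                  # identical outcome in pounds

for units, y in [("kg", y_kg), ("lb", y_lb)]:
    fit = linregress(x, y)
    residuals = y - (fit.intercept + fit.slope * x)
    s = np.sqrt(np.sum(residuals ** 2) / (len(y) - 2))
    print(f"{units}: R-squared = {fit.rvalue ** 2:.3f}, S = {s:.2f}")
# R-squared is identical in both rows; S scales with the units chosen.
```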
For more information, read my post about effect sizes, which compares standardized to non-standardized effect sizes. And read about R-squared vs. Standard Error of the Regression, which gets into the same type of issues but in the regression-specific context.
Jerick Galindo Gingatan says
Hello sir! Do you have a citation for this? “Lower R2 values correspond to models with more error, which in turn produces predictions that are less precise. In other words, if your R2 is too low, your predictions will be too imprecise to be useful.”
Jim Frost says
Hi Jerick,
This is a general property of linear models just due to how they work and their underlying calculations. Most any textbook should cover it. I always refer to Applied Linear Statistical Models by Neter et al.
Arnola says
Hi! Thank you for this post. I was wondering if you have a citation for “If your primary goal is to understand the relationships between the variables in your model, the answer to how high R-squared needs to be is very simple. For this objective, R2 is irrelevant.”? Thank you in advance!
Jim Frost says
Hi Arnola,
That’s a fundamental property of linear models. I’d imagine most textbooks would explain that. I always use Applied Linear Statistical Models by Neter et al.
Kaitlyn says
OK, thanks so much for the reply!
Kaitlyn Suski says
Hi. I was wondering if you have a citation for the assertion “Any study that attempts to predict human behavior will tend to have R-squared values less than 50%. ” Thanks!
Jim Frost says
Hi Kaitlyn,
Unfortunately, I don’t have a citation on hand. I have read that in the literature, but don’t remember where. I’ve also observed this occurring in the many studies I’ve read over the years. It’s just hard to predict human behavior compared to, say, a physical process. That shows up in the R-squared! Keep in mind that this is not a hard and fast rule. It’s a tendency. But, a strong one.
kembhootha says
Mr. Frost, thank you for these articles. I have generally taken a math-first approach to all of the statistical learning methods, and after hours of sweat, the intuition and logical conclusions made their appearance. I have been using your articles to supplement my own learning, and I have found them to be incredibly enlightening for identifying patterns, intuitive thinking, etc. Thank you for your efforts.
avianto nugroho says
Hi Jim,
Thank you very much for your posts; they are very helpful, although I am still trying to figure every single theory out.
I am Avi, a master’s student who is currently writing my master’s thesis.
So, I am analysing my data using a GAM. To this point, I have come up with several models and done model selection. As a result, I got a model that I think (though I’m still not sure) is the best one, considering that the best model is the one with the lowest AIC.
My question is, which one do I have to choose: the highest R-squared or the lowest AIC? Or something in between?
Kindly give me some advice on this case. Likewise, I think you should consider writing a post concerning GAM modelling or similar models, particularly their operation in R, because I would say that your blog is simpler and more understandable for beginners in statistics.
Thank you again. You’re just cool.
All the best,
Avi