Previously, I explained how to interpret R-squared. I showed how the interpretation of R2 is not always straightforward. A low R-squared isn’t always a problem, and a high R-squared doesn’t automatically indicate that you have a good model.
So, how high should R-squared be? The definitive answer is . . . it depends. You’ll need some patience because my assertion is that this question is the wrong question. In this post, I reveal why it is the wrong question and which questions you should ask instead.
Related post: Five Reasons Why Your R-squared can be Too High
How High Does R-squared Need to be is the Wrong Question
How high does R-squared need to be? If you think about it, there is only one correct answer. R-squared should accurately reflect the percentage of the dependent variable variation that the linear model explains. Your R2 should not be any higher or lower than this value.
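To make "the percentage of the dependent variable variation that the linear model explains" concrete, here is a minimal sketch of the usual calculation: one minus the ratio of the residual sum of squares to the total sum of squares. The function name and the four data points are made up for illustration; they are not from any model in this post.

```python
def r_squared(observed, fitted):
    """Share of the variation in `observed` that the fitted values account for."""
    mean_y = sum(observed) / len(observed)
    # Total variation of the dependent variable around its mean.
    ss_total = sum((y - mean_y) ** 2 for y in observed)
    # Variation left over after the model's fitted values are subtracted.
    ss_residual = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    return 1.0 - ss_residual / ss_total


# Hypothetical observed values and fitted values from some linear model.
observed = [2.0, 4.0, 6.0, 8.0]
fitted = [2.5, 3.5, 6.5, 7.5]

r2 = r_squared(observed, fitted)
print(f"R-squared: {r2:.1%}")
```

With these invented numbers, the total sum of squares is 20 and the residual sum of squares is 1, so the model accounts for 95% of the variation.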
The correct R2 value depends on your study area. Different research questions have different amounts of variability that are inherently unexplainable. Case in point, humans are hard to predict. Any study that attempts to predict human behavior will tend to have R-squared values less than 50%. However, if you analyze a physical process and have very good measurements, you might expect R-squared values over 90%. There is no one-size-fits-all answer for how high R-squared should be.
Consequently, the answer to “how high does R-squared need to be?” is that it depends on the amount of variability that is actually explainable. Clearly, your R-squared should not be greater than the amount of variability that is actually explainable—which can happen in regression. To see if your R-squared is in the right ballpark, compare your R2 to those from other studies.
Chasing a high R2 value can produce an inflated value and a misleading model. Read my post about adjusted R-squared and predicted R-squared to see how these statistics can help you avoid these problems.
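As a rough illustration of why a chased R-squared can mislead, the sketch below applies the standard adjusted R-squared formula, which penalizes a model for the number of predictors it uses. The function name, the sample size of 21, and the R-squared of 50% are all hypothetical.

```python
def adjusted_r_squared(r2, n_observations, n_predictors):
    """Adjust R-squared for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_observations - 1) / (n_observations - n_predictors - 1)


# Hypothetical case: the same R-squared of 50% from 21 observations,
# reached with 1 predictor versus 10 predictors.
lean = adjusted_r_squared(0.50, n_observations=21, n_predictors=1)
bloated = adjusted_r_squared(0.50, n_observations=21, n_predictors=10)

print(f"Adjusted R-squared with 1 predictor:   {lean:.1%}")
print(f"Adjusted R-squared with 10 predictors: {bloated:.1%}")
```

In this made-up case, the adjusted value falls from about 47% to 0% once ten predictors are needed to reach the same R-squared. That drop suggests the extra terms are not earning their keep.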
Define Your Objectives for the Regression Model
When you wonder if the R-squared is high enough, it’s probably because you want to know if the regression model satisfies your objectives. Given your requirements, does the model meet your needs? Therefore, you need to define your objectives before proceeding.
To determine whether a model meets your objectives, you’ll need to ask different questions because R2 doesn’t address this issue. The correct questions depend on whether your primary goal for the model is:
- To understand the relationships between the independent variables and dependent variable. Or,
- To predict the dependent variable.
R-squared and Understanding the Relationships between the Variables
If your primary goal is to understand the relationships between the variables in your model, the answer to how high R-squared needs to be is very simple. For this objective, R2 is irrelevant.
This statement might surprise you. However, the interpretation of the significant relationships in a regression model does not change regardless of whether your R2 is 15% or 85%! The regression coefficients define the relationship between each independent variable and the dependent variable. The interpretation of the coefficients doesn’t change based on the value of R-squared.
Suppose we have a statistically significant coefficient that equals 2. This coefficient indicates that the mean of the dependent variable increases by 2 for every one-unit increase in the independent variable irrespective of the R2 value.
The question about how high R-squared needs to be doesn’t make sense in this context because it doesn’t matter. A small R2 doesn’t nullify or change the interpretation of the coefficient for an independent variable that is statistically significant.
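If you'd like to see this for yourself, here is a small simulation sketch. It assumes a made-up relationship with a true slope of 2 and fits it twice, once with little noise and once with a lot. The data, the seed, and the helper name `fit_line` are my own inventions for illustration.

```python
import random


def fit_line(x, y):
    """Least squares fit with one predictor; returns (intercept, slope, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, 1.0 - ss_residual / ss_total


rng = random.Random(1)  # fixed seed so the sketch is repeatable
x = [rng.uniform(0, 10) for _ in range(2000)]

# Same true relationship (slope of 2), very different amounts of unexplained noise.
y_quiet = [5 + 2 * xi + rng.gauss(0, 2.4) for xi in x]
y_noisy = [5 + 2 * xi + rng.gauss(0, 13.7) for xi in x]

_, slope_quiet, r2_quiet = fit_line(x, y_quiet)
_, slope_noisy, r2_noisy = fit_line(x, y_noisy)

print(f"Low noise:  slope = {slope_quiet:.2f}, R-squared = {r2_quiet:.0%}")
print(f"High noise: slope = {slope_noisy:.2f}, R-squared = {r2_noisy:.0%}")
```

Both fits should recover a slope close to 2, even though one R-squared lands near 85% and the other near 15%. The noisier data give a less precise estimate of the slope, but its interpretation is the same.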
Instead of wondering if your R-squared value is high enough, you should ask the following questions to ensure that you can trust your results:
- Do I have a sound basis for my model?
- Can I trust my data?
- Do the residual plots look good?
- Do the results fit theory?
- How do I interpret the regression coefficients and P-values?
R-squared and Predicting the Dependent Variable
On the other hand, if your primary goal is to use your regression model to predict the value of the dependent variable, R-squared is a consideration.
Predictions are more complex than just the single predicted value. Predictions include a margin of error. More precise predictions have a smaller amount of error.
R2 is relevant in this context because it reflects how much unexplained error remains in the model. For a given dependent variable, lower R2 values correspond to models with more error, which in turn produces predictions that are less precise. In other words, if your R2 is too low, your predictions will be too imprecise to be useful.
A low R-squared can be an indicator of imprecise predictions. However, R2 doesn’t tell you directly whether the predictions are sufficiently precise for your requirements.
We need a direct measure of precision that uses the units of the dependent variable. That’s why asking, “How high does R-squared need to be?” still is not the correct question.
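Here is a small sketch of why a unitless percentage can't answer the precision question on its own. It fits the same pattern twice, once as generated and once with every response value multiplied by ten, and reports both R-squared and the standard error of the regression (the typical size of a residual, in the units of the dependent variable). The standard error of the regression is my choice of illustration here, and the data are simulated.

```python
import math
import random


def fit_summary(x, y):
    """Simple linear regression; returns (r_squared, standard_error_of_regression)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r2 = 1.0 - ss_residual / ss_total
    # Two parameters (intercept and slope) are estimated, hence n - 2.
    s = math.sqrt(ss_residual / (n - 2))
    return r2, s


rng = random.Random(7)  # fixed seed so the sketch is repeatable
x = [rng.uniform(0, 10) for _ in range(200)]
y_small = [3 + 2 * xi + rng.gauss(0, 1) for xi in x]  # hypothetical response
y_large = [10 * yi for yi in y_small]  # same pattern, ten times the spread

r2_small, s_small = fit_summary(x, y_small)
r2_large, s_large = fit_summary(x, y_large)

print(f"Original: R-squared = {r2_small:.1%}, typical error = {s_small:.2f} units")
print(f"Rescaled: R-squared = {r2_large:.1%}, typical error = {s_large:.2f} units")
```

The two fits share the same R-squared, yet the typical prediction error differs by a factor of ten. Whether either error is acceptable depends on your requirements, which R-squared knows nothing about.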
Instead, you should ask the questions above plus the following question:
- Are the prediction intervals precise enough for my requirements?
Using Prediction Intervals to Assess Precision
Most statistical software can calculate prediction intervals, and they are easy to use.
A prediction interval is a range where a single new observation is likely to fall given values of the independent variable(s) that you specify. These ranges incorporate the margin of error around the predicted value. If the prediction intervals are too wide, the predictions don’t provide useful information. Narrow prediction intervals represent more precise predictions.
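For readers who want to see what the software is doing, below is a minimal sketch of a prediction interval for a regression with one independent variable. It makes two assumptions worth flagging. First, it uses a normal quantile as a large-sample stand-in for the t quantile that statistical software would normally use. Second, the data are simulated and are not the BMI data from my other post. The function name `prediction_interval` is hypothetical.

```python
import math
import random
from statistics import NormalDist


def prediction_interval(x, y, x_new, level=0.95):
    """Approximate prediction interval for one new observation at x_new."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(ss_residual / (n - 2))  # standard error of the regression

    # Assumption: a normal quantile approximates the t quantile for large n.
    z = NormalDist().inv_cdf(0.5 + level / 2)

    predicted = intercept + slope * x_new
    # The leading 1 accounts for the scatter of a single new observation;
    # the other two terms account for uncertainty in the fitted line.
    margin = z * s * math.sqrt(1 + 1 / n + (x_new - mean_x) ** 2 / sxx)
    return predicted - margin, predicted, predicted + margin


rng = random.Random(42)  # fixed seed so the sketch is repeatable
x = [rng.uniform(15, 35) for _ in range(500)]
y = [1.5 * xi - 5 + rng.gauss(0, 4) for xi in x]  # simulated, not real data

low, predicted, high = prediction_interval(x, y, x_new=18)
print(f"Predicted value: {predicted:.1f}")
print(f"Approximate 95% prediction interval: {low:.1f} to {high:.1f}")
```

Notice that the interval width is driven mostly by the standard error of the regression. With small samples, use the t quantile from your statistical software instead of the normal approximation, because the normal quantile will make the interval slightly too narrow.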
In my post about using regression analysis to make predictions, I present the model displayed in the graph. This model uses BMI to predict the percentage of body fat. The 95% prediction interval for a BMI of 18 is 16-30% body fat. We can be 95% confident that the body fat percentage of an individual with a BMI of 18 will fall within this range.
At this point, you need to use client requirements, spec limits, and subject area knowledge to determine whether the prediction intervals are narrow enough to represent meaningful predictions. By assessing the prediction intervals, you are evaluating the precision of the model directly rather than relying on an arbitrary cut-off value for R-squared.
I’m not a medical expert, but I’d guess that the 14-percentage-point range of 16-30% is too wide to provide meaningful information. If this is true, our regression model is too imprecise to be useful.
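The check itself is simple arithmetic once you have a requirement to compare against. In the sketch below, the interval limits come from the BMI example, while the maximum acceptable width of 10 percentage points is a purely hypothetical requirement that I made up to show the comparison.

```python
def is_precise_enough(lower, upper, max_width):
    """True when the prediction interval is no wider than the requirement allows."""
    return (upper - lower) <= max_width


# 95% prediction interval for a BMI of 18, in percent body fat.
lower, upper = 16.0, 30.0

# Hypothetical requirement; in practice this comes from client requirements,
# spec limits, or subject area knowledge.
max_width = 10.0

width = upper - lower
meets_requirement = is_precise_enough(lower, upper, max_width)
print(f"Interval width: {width:.0f} percentage points")
print(f"Precise enough for the requirement: {meets_requirement}")
```

Under that invented requirement, the 14-point interval fails the check. A different requirement could just as easily pass it, which is exactly why the decision belongs to subject area knowledge rather than to an R-squared cut-off.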
Related posts: Understand Precision in Applied Regression to Avoid Costly Mistakes and Confidence Intervals vs Prediction Intervals vs Tolerance Intervals.
R-squared Is Overrated!
Asking “How high does R-squared need to be?” is usually not the correct question to ask. You probably want to know if the regression model can meet your needs. To this end, there are better questions that you should ask.
R-squared gets all of the attention for assessing the goodness-of-fit. It seems like a simple statistic to interpret. However, evaluating the fit involves more than just this single statistic. You need to use subject area knowledge, residual plots, and coefficients, plus prediction intervals if you’re making predictions.
However, R-squared does have some good uses. For one thing, compare your R2 value to values from similar studies. If your R2 is markedly higher or lower, you should investigate because there might be a problem.
If you’re learning regression, check out my Regression Tutorial!