If your regression model contains independent variables that are statistically significant, a reasonably high R-squared value makes sense. The statistical significance indicates that changes in the independent variables correlate with shifts in the dependent variable. Correspondingly, the good R-squared value signifies that your model explains a good proportion of the variability in the dependent variable.
That seems like a logical combination, right?
However, what if your model has independent variables that are statistically significant but a low R-squared value? This combination indicates that the independent variables are correlated with the dependent variable, but they do not explain much of the variability in the dependent variable. Huh?
Over the years, I’ve had many questions about how to interpret this combination. Some people have wondered whether the significant variables are meaningful. Do these results even make sense? Yes, they do!
In this post, I show how to interpret regression models that have significant independent variables but a low R-squared. To do this, I’ll compare regression models with low and high R-squared values so you can really grasp the similarities and differences and what it all means.
Comparing Regression Models with Low and High R-squared Values
Like many concepts in statistics, it’s so much easier to understand this one using graphs. In fact, research finds that charts are crucial to convey certain information about regression models accurately.
Consequently, I’ll use fitted line plots to illustrate the concepts for models with one independent variable. However, these interpretations remain valid for multiple regression.
Let’s consider two regression models that assess the relationship between Input and Output. In both models, Input is statistically significant. The equations for these models are below:
- Output1 = 44.53 + 2.024*Input
- Output2 = 44.86 + 2.134*Input
These two regression equations are almost exactly equal. If you saw only the equations, you’d think the models are very similar. Now consider that the R-squared for the Output1 model is 14.7% and for Output2 it is 86.5%. The models aren’t as similar as they first appear.
Graphs can really bring the differences to life. Let’s see what these models and data actually look like. In the two graphs below, the scales are the same to make the comparison easier. You can download the CSV data file: HighLowRsquaredData.
Whoa! Did you expect that much of a difference?
To understand how to interpret a regression model with significant independent variables but a low R-squared, we’ll compare the similarities and the differences between these two models.
Regression Model Similarities
The models are similar in the following ways:
- The equations are nearly equal: Output = 44 + 2 * Input
- Input is significant with a p-value < 0.001
Additionally, the regression lines in both plots provide an unbiased fit to the upward trend in both datasets. They have the same upward slope of 2.
Interpreting a regression coefficient that is statistically significant does not change based on the R-squared value. Both graphs show that if you move to the right on the x-axis by one unit of Input, Output increases on the y-axis by an average of two units. This mean change in output is the same for both models even though the R-squared values are different.
Furthermore, if you enter the same Input value in the two equations, you’ll obtain approximately equal predicted values for Output. For example, an Input of 10 produces predicted values of 66.2 and 64.8. These values represent the predicted mean value of the dependent variable.
Regression Model Differences
The similarities all focus on the mean—the mean change and the mean predicted value. However, the biggest difference between the two models is the variability around those means. In fact, I’d guess that the difference in variability is the first thing about the plots that grabbed your attention. Understanding this topic boils down to grasping the separate concepts of central tendency and variability, and how they relate to the distribution of data points around the fitted line.
While the regression coefficients and predicted values focus on the mean, R-squared measures the scatter of the data around the regression lines. That’s why the two R-squared values are so different. For a given dataset, higher variability around the regression line produces a lower R-squared value.
Take a look at the chart with the low R-squared. Even these relatively noisy data have a significant trend. You can see that as the Input value increases, the Output value also increases. This statistically significant relationship between the variables tells us that knowing the value of Input provides information about the value of Output. The difference between the models is the spread of the data points around the predicted mean at any given location along the regression line.
Be sure to keep the low R-squared graph in mind if you need to comprehend a model that has significant independent variables but a low R-squared!
While the two models produce mean predictions that are almost the same, the variability (i.e., the precision) around the predictions is different. I’ll show you how to assess precision using prediction intervals. This method is particularly useful when you have more than one independent variable and can’t graph the models to see the spread of data around the regression line.
Using Prediction Intervals to See the Variability
A prediction interval is a range where a single new observation is likely to fall given values of the independent variables that you specify. Narrower prediction intervals represent more precise predictions. Most statistical software can calculate prediction intervals.
The statistical output below displays the fitted values and prediction intervals that are associated with an Input value of 10 for both models. The first output is for the model with the low R-squared.
As I mentioned earlier, the mean predicted values (i.e., the fit) are nearly equal. However, the prediction intervals are very different because they incorporate the variability. The high variability/low R-squared model has a prediction interval of approximately -500 to 630. That’s over 1100 units!
On the other hand, the low variability/high R-squared model has a much narrower prediction interval of roughly -30 to 160, about 190 units.
After seeing the variability in the data, the differing levels of precision should make sense.
Key Points about Low R-squared Values
Let’s go over the key points.
- Regression coefficients and fitted values represent means.
- R-squared and prediction intervals represent variability.
- You interpret the coefficients for significant variables the same way regardless of the R-squared value.
- Low R-squared values can warn of imprecise predictions.
What can be done about that low R-squared value? That’s the next question I usually hear in this context. Often, the first thought is to add more variables to the model to increase R-squared.
Related post: How High Does R-squared Need to Be?
If you can find legitimate predictors, that can work in some cases. However, for every study area there is an inherent amount of unexplainable variability. For instance, studies that attempt to predict human behavior generally have R-squared values less than 50%. People are hard to predict. You can force a regression model to go past this point but it comes at the cost of misleading regression coefficients, p-values, and R-squared.
Adjusted R-squared and predicted R-squared are tools that help you avoid this problem.
If you are mainly interested in understanding the relationships between the variables, the good news is that a low R-squared does not negate the importance of any significant variables. Even with a low R-squared, statistically significant P-values continue to identify relationships and coefficients have the same interpretation. Generally, you have no additional cause to discount these findings.
For more information about choosing the correct regression model, see my post about model specification.