Does your regression model have a low R-squared? That seems like a problem—but it might not be. Learn what a low R-squared does and does not mean for your model.

If your regression model contains independent variables that are statistically significant, a reasonably high R-squared value makes sense. The statistical significance indicates that changes in the independent variables correlate with shifts in the dependent variable. Correspondingly, the good R-squared value signifies that your model explains a good proportion of the variability in the dependent variable.

That seems like a logical combination, right?

However, what if your model has independent variables that are statistically significant but a *low* R-squared value? This combination indicates that the independent variables are correlated with the dependent variable, but they do not explain much of the variability in the dependent variable. Huh?

Over the years, I’ve had many questions about how to interpret this combination. Some people have wondered whether the significant variables are meaningful. Do these results even make sense? Yes, they do!

In this post, I show how to interpret regression models that have significant independent variables but a low R-squared. To do this, I’ll compare regression models with low and high R-squared values so you can really grasp the similarities and differences and what it all means.

**Related post**: When Should I Use Regression Analysis?

## Comparing Regression Models with Low and High R-squared Values

Like many concepts in statistics, it’s so much easier to understand this one using graphs. In fact, research finds that charts are crucial to convey certain information about regression models accurately.

Consequently, I’ll use fitted line plots to illustrate the concepts for models with one independent variable. However, these interpretations remain valid for multiple regression.

Let’s consider two regression models that assess the relationship between Input and Output. In both models, Input is statistically significant. The equations for these models are below:

- Output1 = 44.53 + 2.024*Input
- Output2 = 44.86 + 2.134*Input

These two regression equations are almost exactly equal. If you saw only the equations, you’d think the models are very similar. Now consider that the R-squared for the Output1 model is 14.7% and for Output2 it is 86.5%. The models aren’t as similar as they first appear.

Graphs can really bring the differences to life. Let’s see what these models and data actually look like. In the two graphs below, the scales are the same to make the comparison easier. You can download the CSV data file: HighLowRsquaredData.

Whoa! Did you expect that much of a difference?

To understand how to interpret a regression model with significant independent variables but a low R-squared, we’ll compare the similarities and the differences between these two models.

## Regression Model Similarities

The models are similar in the following ways:

- The equations are nearly equal: Output = 44 + 2 * Input
- Input is significant with a p-value < 0.001

Additionally, the regression lines in both plots provide an unbiased fit to the upward trend in both datasets. They have nearly the same upward slope of about 2.

The interpretation of a statistically significant regression coefficient does not change based on the R-squared value. Both graphs show that if you move to the right on the x-axis by one unit of Input, Output increases on the y-axis by an average of about two units. This mean change in Output is essentially the same for both models even though the R-squared values are different.

Furthermore, if you enter the same Input value in the two equations, you’ll obtain approximately equal predicted values for Output. For example, an Input of 10 produces predicted values of about 64.8 for Output1 and 66.2 for Output2. These values represent the predicted *mean* value of the dependent variable.
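If you'd like to check that arithmetic yourself, here is a minimal Python sketch that plugs an Input value into both fitted equations. Only the coefficients come from the equations in this post; the names `MODELS` and `predict_mean` are made up for illustration.

```python
# Fitted equations quoted in the post, stored as (intercept, slope).
MODELS = {
    "Output1": (44.53, 2.024),
    "Output2": (44.86, 2.134),
}


def predict_mean(intercept: float, slope: float, input_value: float) -> float:
    """Return the predicted mean of Output for a given Input."""
    return intercept + slope * input_value


# Predicted mean Output for an Input of 10 under each model.
predictions = {
    name: predict_mean(b0, b1, 10) for name, (b0, b1) in MODELS.items()
}

for name, fit in predictions.items():
    print(f"{name}: {fit:.1f}")
```

Notice that nothing in this calculation involves R-squared. The fitted value depends only on the coefficients.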

## Regression Model Differences

The similarities all focus on the mean—the mean change and the mean predicted value. However, the biggest difference between the two models is the *variability* around those means. In fact, I’d guess that the difference in variability is the first thing about the plots that grabbed your attention. Understanding this topic boils down to grasping the separate concepts of central tendency and variability, and how they relate to the distribution of data points around the fitted line.

While the regression coefficients and predicted values focus on the mean, R-squared measures the scatter of the data around the regression lines. That’s why the two R-squared values are so different. For a given dataset, higher variability around the regression line produces a lower R-squared value.
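A small simulation can make this concrete. The sketch below does not use the dataset from this post. It assumes a true line of Output = 44 + 2 * Input and two hypothetical error standard deviations (10 and 70) that I picked only for illustration. It fits a line to each simulated dataset and computes R-squared as 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(x, y):
    """R-squared = 1 - SS_residual / SS_total for the fitted line."""
    intercept, slope = fit_line(x, y)
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


rng = random.Random(1)  # fixed seed so the sketch is repeatable
x = [rng.uniform(0, 50) for _ in range(500)]
noise = [rng.gauss(0, 1) for _ in x]  # one shared set of standard-normal errors

# Same assumed line for both datasets; only the spread of the errors differs.
y_low_noise = [44 + 2 * xi + 10 * e for xi, e in zip(x, noise)]
y_high_noise = [44 + 2 * xi + 70 * e for xi, e in zip(x, noise)]

slope_low = fit_line(x, y_low_noise)[1]
slope_high = fit_line(x, y_high_noise)[1]
r2_low_noise = r_squared(x, y_low_noise)
r2_high_noise = r_squared(x, y_high_noise)

print(f"small scatter: slope = {slope_low:.2f}, R-squared = {r2_low_noise:.3f}")
print(f"large scatter: slope = {slope_high:.2f}, R-squared = {r2_high_noise:.3f}")
```

Both fits should recover a slope near 2, while the dataset with the larger scatter ends up with the much lower R-squared.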

Take a look at the chart with the low R-squared. Even these relatively noisy data have a significant trend. You can see that as the Input value increases, the Output value also increases. This statistically significant relationship between the variables tells us that knowing the value of Input provides information about the value of Output. The difference between the models is the spread of the data points around the predicted mean at any given location along the regression line.

Be sure to keep the low R-squared graph in mind if you need to comprehend a model that has significant independent variables but a low R-squared!

While the two models produce mean predictions that are almost the same, the variability (i.e., the precision) around the predictions is different. I’ll show you how to assess precision using prediction intervals. This method is particularly useful when you have more than one independent variable and can’t graph the models to see the spread of data around the regression line.

## Using Prediction Intervals to See the Variability

A prediction interval is a range where a single new observation is likely to fall given values of the independent variables that you specify. Narrower prediction intervals represent more precise predictions. Most statistical software can calculate prediction intervals.
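For a model with one independent variable, you can sketch the calculation by hand. The Python example below is an illustration under stated assumptions, not the output of any particular statistical package: the data are simulated with hypothetical noise levels, and the function uses a normal critical value in place of the t critical value, which is a large-sample simplification. The function name `prediction_interval` is my own.

```python
import math
import random
from statistics import NormalDist


def prediction_interval(x, y, x_new, level=0.95):
    """Approximate prediction interval for one new observation at x_new.

    Assumption: a normal critical value stands in for the t critical value,
    which is reasonable only when the sample is fairly large.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(ss_res / (n - 2))  # standard error of the regression
    fit = intercept + slope * x_new
    # Standard error for a single new observation, not for the mean response.
    se_pred = s * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / sxx)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return fit, fit - z * se_pred, fit + z * se_pred


rng = random.Random(2)  # fixed seed so the sketch is repeatable
x = [rng.uniform(0, 50) for _ in range(300)]
errors = [rng.gauss(0, 1) for _ in x]

# Hypothetical data: same assumed line, error spread of 10 versus 70.
y_precise = [44 + 2 * xi + 10 * e for xi, e in zip(x, errors)]
y_noisy = [44 + 2 * xi + 70 * e for xi, e in zip(x, errors)]

fit_p, low_p, high_p = prediction_interval(x, y_precise, x_new=10)
fit_n, low_n, high_n = prediction_interval(x, y_noisy, x_new=10)

print(f"small scatter: fit {fit_p:.1f}, PI ({low_p:.1f}, {high_p:.1f})")
print(f"large scatter: fit {fit_n:.1f}, PI ({low_n:.1f}, {high_n:.1f})")
```

The interval width is driven by the standard error of the regression, so the noisier dataset produces a much wider interval around its fitted value.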

**Related posts**: Making Predictions with Regression Analysis and Understand Precision in Applied Regression to Avoid Costly Mistakes

The statistical output below displays the fitted values and prediction intervals that are associated with an Input value of 10 for both models. The first output is for the model with the low R-squared.

As I mentioned earlier, the mean predicted values (i.e., the fit) are nearly equal. However, the prediction intervals are very different because they incorporate the variability. The high variability/low R-squared model has a prediction interval of approximately -500 to 630. That’s over 1100 units!

On the other hand, the low variability/high R-squared model has a much narrower prediction interval of roughly -30 to 160, about 190 units.

After seeing the variability in the data, the differing levels of precision should make sense.

## Key Points about Low R-squared Values

Let’s go over the key points.

- Regression coefficients and fitted values represent means.
- R-squared and prediction intervals represent variability.
- You interpret the coefficients for significant variables the same way regardless of the R-squared value.
- Low R-squared values can warn of imprecise predictions.

What can be done about that low R-squared value? That’s the next question I usually hear in this context. Often, the first thought is to add more variables to the model to increase R-squared.

**Related post**: How High Does R-squared Need to Be?

If you can find legitimate predictors, that can work in some cases. However, for every study area there is an inherent amount of unexplainable variability. For instance, studies that attempt to predict human behavior generally have R-squared values less than 50%. People are hard to predict. You can force a regression model to go past this point but it comes at the cost of misleading regression coefficients, p-values, and R-squared.

Adjusted R-squared and predicted R-squared are tools that help you avoid this problem.
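To show what these two statistics measure, here is a rough Python sketch for a model with one independent variable. It uses the standard textbook definitions: adjusted R-squared rescales the unexplained share by the degrees of freedom, and predicted R-squared replaces the residual sum of squares with PRESS, the sum of squared leave-one-out prediction errors. The simulated data and the function name `fit_summary` are assumptions for illustration only.

```python
import random


def fit_summary(x, y):
    """R-squared, adjusted R-squared, and predicted R-squared, one predictor."""
    n, k = len(x), 1
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(e ** 2 for e in residuals)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    # Leverage of each point; e / (1 - h) is its leave-one-out prediction error.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    press = sum((e / (1 - h)) ** 2 for e, h in zip(residuals, leverage))
    r2 = 1 - ss_res / ss_tot
    adjusted = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    predicted = 1 - press / ss_tot
    return r2, adjusted, predicted


rng = random.Random(3)  # fixed seed so the sketch is repeatable
x = [rng.uniform(0, 50) for _ in range(40)]
y = [44 + 2 * xi + rng.gauss(0, 70) for xi in x]  # hypothetical noisy data

r2, adjusted, predicted = fit_summary(x, y)
print(f"R-squared:           {r2:.3f}")
print(f"adjusted R-squared:  {adjusted:.3f}")
print(f"predicted R-squared: {predicted:.3f}")
```

Both statistics can never exceed the ordinary R-squared for the same model. A large gap between R-squared and predicted R-squared is a hint that the model fits the sample better than it predicts new observations.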

If you are mainly interested in understanding the relationships between the variables, the good news is that a low R-squared does not negate the importance of any significant variables. Even with a low R-squared, statistically significant P-values continue to identify relationships and coefficients have the same interpretation. Generally, you have no additional cause to discount these findings.

For more information about choosing the correct regression model, see my post about model specification.

pradip says

June 14, 2019 at 2:46 am

Sir, how do we know the overall F-test is significant when the independent variable is not significant?

Jim Frost says

June 14, 2019 at 2:44 pm

Hi Pradip,

I’m not 100% sure that I understand what you’re asking. However, you can determine whether the overall F-test is significant by comparing its p-value to your significance level. If its p-value is less than your significance level, the overall F-test is significant. You can read more in my post about the overall F-test.

I hope this helps!

Mostafa Haj Hamidi says

June 13, 2019 at 4:39 am

I thank you for your efforts and I wish you a healthy health

Mackenzie says

June 7, 2019 at 4:17 pm

Hi Jim,

Thank you for the quick response and helpful discussion. I really appreciate it!

Mackenzie says

June 6, 2019 at 4:04 pm

Hi Jim,

How would you interpret a model that is significant (p < .001) but has a low R-squared (16%), where only one of the seven predictors is significant (β = .23, p < .01)?

Jim Frost says

June 7, 2019 at 10:19 am

Hi Mackenzie,

Because only one predictor is significant, you have enough evidence to conclude that there is a relationship between only that predictor and the response (DV). The low R-squared indicates that there is a lot of variability around the fitted line.

Look at the images of the fitted line plots for the two models in this blog post. Your model more closely resembles the plot for the low R-squared model. You can see that there is a trend, but the distances between the data points and the line are greater.

A low R-squared like that is a problem only when you’re trying to produce accurate predictions. But, if you just want to understand the relationships between variables, you still have that relationship, which is statistically significant.

I hope this helps!

Waseem Majeed says

April 2, 2019 at 3:50 pm

How can we calculate or say that our regression model is significant?

Jim Frost says

April 2, 2019 at 4:01 pm

Hi Waseem,

You can assess the coefficient p-values and the F-test of overall significance. I’ve written blog posts on both, which you should read.

Also, consider buying my ebook about regression analysis for a more thorough look!

I hope that helps!

lesley l says

March 29, 2019 at 3:27 pm

Hi, when reporting results from a linear regression with a t-test where R-squared has a low value and there is high variability about the line of best fit (the x-axis is temperature, the y-axis is change in species richness), how do I report these results?

I am not sure which numbers to report and how to make sense of them once the test has been run. Do I include the slope of the line? The t-stat? How do I incorporate the standard error into reporting statistical evidence?

I have the coefficients for the intercept and the x variable, plus the SE and t-stat values, but I don’t know how to use them in my report.

Hari Poudel says

March 14, 2019 at 1:39 am

Dear Jim,

I need to ask a follow-up question to the above explanation. I ran a logistic regression in which one of the IVs is significant across various models. The other IVs are not significant. I also ran a postestimation model -test and it also gave me similar kinds of relationships. What would be a better way to explain such relationships? Thank you for your time and help.

Jim Frost says

March 15, 2019 at 4:54 pm

Hi Hari,

It sounds like your sample only provides sufficient evidence to suggest that the one IV has a relationship with the DV in the population. For your other IVs, the sample does not provide sufficient evidence to conclude that those relationships exist in the population. Whatever sample effects you’re seeing for the non-significant IVs might well be random error rather than representing a relationship that exists in the population.

Pranesh Debnath says

March 11, 2019 at 10:46 am

Hello Jim, nice to see your explanation!

I am giving the following data. Kindly clarify my question…

Multiple R=0.339958

R-Square=0.115571

Adj. R-Square=0.109458

Standard Error=0.221261

Observation (sample size)=875

No. of Independent variables=6

My question: Is there any mathematical relationship between “Standard Error” and the other information given above, such as R-Square and Adj. R-Square?

Jim Frost says

March 11, 2019 at 4:40 pm

Hi Pranesh,

I’d say that the standard error of the regression is probably the most closely related to adjusted R-squared. Both goodness-of-fit measures use the adjusted sums of squares, which incorporates the error degrees of freedom. However, it would also be related to R-squared even though it does not make that adjustment. In general, as R-squared and, particularly, adjusted R-squared increase for a particular dataset, the standard error tends to decrease.
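As a rough check on the numbers quoted in the question, the Python sketch below applies the standard adjusted R-squared formula, assuming the model includes an intercept. It also uses the identity that adjusted R-squared equals 1 minus the squared standard error of the regression divided by the sample variance of the DV, which lets you back out the standard deviation of the DV that those numbers imply. The variable names are mine, and the implied standard deviation is a derived quantity, not something reported in the comment.

```python
import math

# Values quoted in the question above.
r_squared = 0.115571
n_obs = 875
n_predictors = 6
std_error = 0.221261  # standard error of the regression (S)

# Adjusted R-squared rescales the unexplained share by the degrees of freedom.
# Assumption: the model includes an intercept.
adjusted = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)
print(f"adjusted R-squared: {adjusted:.6f}")

# Because adjusted R-squared = 1 - S^2 / var(y), the reported S implies
# a sample standard deviation for the dependent variable.
implied_sd_y = std_error / math.sqrt(1 - adjusted)
print(f"implied SD of the DV: {implied_sd_y:.4f}")
```

The computed adjusted R-squared should agree with the reported 0.109458 up to rounding in the inputs. The second calculation shows why S and adjusted R-squared move in opposite directions for a given dataset: with the variance of the DV fixed, a smaller S means a larger adjusted R-squared.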

Jerry says

January 9, 2019 at 11:56 am

This is indeed an excellent explanation of R-squared with good links to definitions and other articles. Thank you! I’d like to know more about prediction intervals though. Is a prediction interval the range of all output values that occur in the data from the specified input value? And can SPSS produce prediction intervals? Thanks.

Jim Frost says

January 9, 2019 at 3:47 pm

Hi Jerry,

Prediction intervals aren’t based on the range of output given the input but rather the mean square error, which is a measure of the variability of the residuals. The mean square error is small when the observed values are close to the fitted values–which in turn produces tighter prediction intervals.

As for SPSS, I don’t regularly use that software. However, I checked around and you can create prediction intervals for simple regression at least. I’m not sure about multiple regression. SPSS refers to them as individual confidence intervals in the regression context. You can create them in SPSS scatterplots. Go to the Chart Editor dialog for scatterplots and click the Fit Line tab. Under Confidence Intervals, choose Individual.

I hope that helps!

Chris Spadi says

December 20, 2018 at 7:45 pm

Hi Jim,

I made scatter plots where x is the year (spanning over 70 years) and Y is the count (fires) for each month in a specific region. My R2 values range between .22 to .62. Showing a positive trend.

Things like increased drought, rising temperature, decreased precipitation, and wind factor into each year.

Is a R2 value range of .22 to .62 necessarily bad in this case?

Thanks,

Chris

Jim Frost says

December 21, 2018 at 10:31 am

Hi Chris, it sounds like in addition to year, you have other independent (X) variables? And, the R-squared varies based on the specific combination of those variables you include in the model?

I’ve noticed that statisticians sometimes want to categorize the strength of the results based on things like the value of R-squared. A well-known one even created ranges of R-squared values and attributed labels such as strong, moderate, and weak.

I don’t think that’s the way to go. Some study areas are inherently more unpredictable than others. When an area is inherently more unpredictable, you’d expect lower R-squared values as a matter of course. You have to adjust your idea of what constitutes a higher R-squared based on the subject area.

So, that all said, those R-squared values aren’t necessarily bad. However, you’ll need to use subject area knowledge and compare your results to similar studies to make that determination. As long as you have significant predictors, your model is supplying important information about the factors that increase fires. One potential downside occurs if you want to use your model to make predictions rather than just understanding the role of underlying factors. Models with R-squared values in the range you mention often produce imprecise predictions (wide prediction intervals).

Best of luck with your analysis!

Ivan says

December 6, 2018 at 1:26 am

Hello Jim,

I am trying to predict house prices based on a number of basic house characteristics. R-sq is 0.52, but most of the independent variables have a p-value between 0.2 and 0.99. Does it mean they are all insignificant? How do I know which ones to include and which ones to eliminate? Thank you very much

Jim Frost says

December 6, 2018 at 11:25 am

Hi Ivan,

Yes, those variables are not statistically significant. You’d often remove them from the model. For more information, read my post about specifying the correct regression model.

Keryn says

October 20, 2018 at 10:16 pm

Hi Jim, what if my regression model is not significant (R² = .011, adjusted R² = .005, F(3, 574) = 2.031, p = .108), but the beta value for one of the IVs is (just) significant (beta = .082, p = .048)? How do I interpret this?

Jim Frost says

October 21, 2018 at 1:16 am

Hi Keryn,

It can happen that the overall test of significance doesn’t match whether there are any significant independent variables, such as in your model. These tests usually agree. If you have a significant IV, you usually obtain a significant overall test of significance. But, when you have borderline results, like in your model, they can disagree. For more information, read my post about the overall F-test of significance.

As for having a significant variable but a very low R-squared, interpret it exactly as I describe in this post. The relationship between the IV and the DV is statistically significant. In other words, knowing the value of the IV provides some information about the DV. However, there is a lot of variability around the fitted values that your model doesn’t explain.

A couple of suggestions for you. Given the extremely low R-squared value, I’d double check the residual plots to be sure that they look good. If you see a non-random pattern in the residual plots, you might be able to improve your model.

Also, given the borderline nature of the p-value for your IV, combined with the non-significant overall test and low R-squared, you should consider these results as suggestive or preliminary.

Best of luck with your analysis!

Alvinn-emmanuel Yao says

April 28, 2018 at 11:32 am

Hello Sir,

I have a question. If we have a low R-squared value but the p-value of the F-test is significant, is the model a good one?

Jim Frost says

April 28, 2018 at 2:39 pm

Hi,

The first thing you should check is to determine whether you have an independent variable (IV) that has a significant p-value. If you do, then I’d say that the model is probably a good one (always assuming that your residuals look good). You can link an effect to a specific IV or IVs.

If you don’t have an IV that is significant but the overall F-test is significant, it gets a little tricky. In this case, you don’t have sufficient evidence to conclude that any particular IV has a statistically significant relationship with the dependent variable (DV). However, all the IVs in the model taken together have a significant relationship with the DV. Unfortunately, that relationship, as measured by the low R-squared, is fairly weak. You’re pretty much at the minimum limits of useful knowledge in this scenario. You can’t pinpoint the effect to specific IVs and it’s a weak effect to boot. I’d say that a study like this potentially provides evidence that some effect is present but you’d need additional, larger studies to really learn something useful.

So, first step is to look at the p-values for the IVs to see which general scenario I describe above your model fits in. Then, take it from there!

I hope this helps!

Thomas Mkhabela says

April 18, 2018 at 2:59 am

Very precise, I really appreciate the answers.

Jim Frost says

April 18, 2018 at 10:26 am

Hi Thomas, Thanks! I’m glad it was helpful!

Mark Davis says

April 16, 2018 at 9:29 am

Very clear and concise article. You did a great job helping me figure out the right questions to ask!

Abhisek Guha says

April 9, 2018 at 2:14 pm

Excellent concept clearing article.

Jim Frost says

April 9, 2018 at 3:53 pm

Thank you, Abhisek!