Does your regression model have a low R-squared? That seems like a problem—but it might not be. Learn what a low R-squared does and does not mean for your model.

If your regression model contains independent variables that are statistically significant, a reasonably high R-squared value makes sense because it measures goodness-of-fit. The statistical significance indicates that changes in the independent variables correlate with shifts in the dependent variable. Correspondingly, the good R-squared value signifies that your model explains a good proportion of the variability in the dependent variable.

That seems like a logical combination, right?

However, what if your model has independent variables that are statistically significant but a *low* R-squared value? This combination indicates that the independent variables are correlated with the dependent variable, but they do not explain much of the variability in the dependent variable. Huh?

Over the years, I’ve had many questions about how to interpret this combination. Some people have wondered whether the significant variables are meaningful. Do these results even make sense? Yes, they do!

In this post, I show how to interpret regression models that have significant independent variables but a low R-squared. To do this, I’ll compare regression models with low and high R-squared values so you can really grasp the similarities and differences and what it all means.

**Related post**: When Should I Use Regression Analysis?

## Comparing Regression Models with Low and High R-squared Values

Like many concepts in statistics, it’s so much easier to understand this one using graphs. In fact, research finds that charts are crucial to convey certain information about regression models accurately.

Consequently, I’ll use fitted line plots to illustrate the concepts for models with one independent variable. However, these interpretations remain valid for multiple regression.

Let’s consider two regression models that assess the relationship between Input and Output. In both models, Input is statistically significant. The equations for these models are below:

- Output1 = 44.53 + 2.024*Input
- Output2 = 44.86 + 2.134*Input

These two regression equations are almost exactly equal. If you saw only the equations, you’d think the models are very similar. Now consider that the R-squared for the Output1 model is 14.7% and for Output2 it is 86.5%. The models aren’t as similar as they first appear.

Graphs can really bring the differences to life. Let’s see what these models and data actually look like. In the two graphs below, the scales are the same to make the comparison easier. You can download the CSV data file: HighLowRsquaredData.

Whoa! Did you expect that much of a difference?

To understand how to interpret a regression model with significant independent variables but a low R-squared, we’ll compare the similarities and the differences between these two models.
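To get a feel for how two nearly identical equations can come with very different R-squared values, here is a minimal sketch using simulated data. It does not use the downloadable CSV file; the Input range, the noise levels (140 and 23), and the helper name `fit_line` are assumptions chosen only to mimic the general pattern.

```python
import random

def fit_line(x, y):
    """Ordinary least squares with one predictor.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot

rng = random.Random(0)  # fixed seed so the sketch is repeatable
inputs = [i * 0.2 for i in range(500)]  # hypothetical Input values, 0 to 99.8

# Same underlying line (44 + 2 * Input); only the noise level differs.
noisy_output = [44 + 2 * x + rng.gauss(0, 140) for x in inputs]
tight_output = [44 + 2 * x + rng.gauss(0, 23) for x in inputs]

b0_noisy, b1_noisy, r2_noisy = fit_line(inputs, noisy_output)
b0_tight, b1_tight, r2_tight = fit_line(inputs, tight_output)

print(f"noisy data: slope={b1_noisy:.2f}, R-squared={r2_noisy:.1%}")
print(f"tight data: slope={b1_tight:.2f}, R-squared={r2_tight:.1%}")
```

Both fits should recover a slope near 2, while the R-squared values land far apart because only the scatter around the line changed.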

## Regression Model Similarities

The models are similar in the following ways:

- The equations are nearly equal: Output = 44 + 2 * Input
- Input is significant with a p-value < 0.001

Additionally, the regression lines in both plots provide an unbiased fit to the upward trend in both datasets. They have nearly the same upward slope of about 2.

Interpreting a regression coefficient that is statistically significant does not change based on the R-squared value. Both graphs show that if you move to the right on the x-axis by one unit of Input, Output increases on the y-axis by an average of two units. This mean change in output is the same for both models even though the R-squared values are different.

Furthermore, if you enter the same Input value in the two equations, you’ll obtain approximately equal predicted values for Output. For example, an Input of 10 produces predicted values of 64.8 for Output1 and 66.2 for Output2. These values represent the predicted *mean* value of the dependent variable.
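The arithmetic is easy to check by plugging Input = 10 into the two fitted equations given earlier. The function names below are only illustrative.

```python
def predict_output1(input_value):
    # Fitted equation for the low R-squared model, as quoted in the post
    return 44.53 + 2.024 * input_value

def predict_output2(input_value):
    # Fitted equation for the high R-squared model, as quoted in the post
    return 44.86 + 2.134 * input_value

pred1 = predict_output1(10)
pred2 = predict_output2(10)
print(f"Output1: {pred1:.1f}, Output2: {pred2:.1f}")  # prints "Output1: 64.8, Output2: 66.2"
```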

## Regression Model Differences

The similarities all focus on the mean—the mean change and the mean predicted value. However, the biggest difference between the two models is the *variability* around those means. In fact, I’d guess that the difference in variability is the first thing about the plots that grabbed your attention. Understanding this topic boils down to grasping the separate concepts of central tendency and variability, and how they relate to the distribution of data points around the fitted line.

While the regression coefficients and predicted values focus on the mean, R-squared measures the scatter of the data around the regression lines. That’s why the two R-squared values are so different. For a given dataset, higher variability around the regression line produces a lower R-squared value.
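One common way to write this, using notation that isn’t defined elsewhere in this post ($y_i$ for the observed values, $\hat{y}_i$ for the fitted values, and $\bar{y}$ for the mean of the observed values):

$$
R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}
$$

The numerator is the scatter of the observations around the fitted line, and the denominator is their scatter around the overall mean. For a given dataset the denominator is fixed, so more scatter around the line means a lower R-squared.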

Take a look at the chart with the low R-squared. Even these relatively noisy data have a significant trend. You can see that as the Input value increases, the Output value also increases. This statistically significant relationship between the variables tells us that knowing the value of Input provides information about the value of Output. The difference between the models is the spread of the data points around the predicted mean at any given location along the regression line.

Be sure to keep the low R-squared graph in mind if you need to comprehend a model that has significant independent variables but a low R-squared!

While the two models produce mean predictions that are almost the same, the variability (i.e., the precision) around the predictions is different. I’ll show you how to assess precision using prediction intervals. This method is particularly useful when you have more than one independent variable and can’t graph the models to see the spread of data around the regression line.

## Using Prediction Intervals to See the Variability

A prediction interval is a range where a single new observation is likely to fall given values of the independent variables that you specify. Narrower prediction intervals represent more precise predictions. Most statistical software can calculate prediction intervals.

**Related posts**: Making Predictions with Regression Analysis and Understand Precision in Applied Regression to Avoid Costly Mistakes

The statistical output below displays the fitted values and prediction intervals that are associated with an Input value of 10 for both models. The first output is for the model with the low R-squared.

As I mentioned earlier, the mean predicted values (i.e., the fit) are nearly equal. However, the prediction intervals are very different because they incorporate the variability. The high variability/low R-squared model has a prediction interval of approximately -500 to 630. That’s over 1100 units!

On the other hand, the low variability/high R-squared model has a much narrower prediction interval of roughly -30 to 160, about 190 units.

After seeing the variability in the data, the differing levels of precision should make sense.
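For readers who want to see where the interval width comes from, here is a rough sketch for a model with one predictor. It uses simulated data rather than the post’s dataset, and it assumes a normal critical value in place of the exact t value, which is a reasonable shortcut only because the simulated sample is large. The function name `prediction_interval` and the noise levels are hypothetical.

```python
import math
import random
from statistics import NormalDist

def prediction_interval(x, y, x0, confidence=0.95):
    """Approximate prediction interval for one new observation at x0."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(ss_res / (n - 2))  # standard error of the regression
    # Assumption: normal critical value instead of the exact t value.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    fit = intercept + slope * x0
    margin = z * s * math.sqrt(1 + 1 / n + (x0 - mean_x) ** 2 / sxx)
    return fit, fit - margin, fit + margin

rng = random.Random(1)  # fixed seed so the sketch is repeatable
inputs = [i * 0.2 for i in range(500)]  # hypothetical Input values
noisy_output = [44 + 2 * x + rng.gauss(0, 140) for x in inputs]
tight_output = [44 + 2 * x + rng.gauss(0, 23) for x in inputs]

fit_noisy, lo_noisy, hi_noisy = prediction_interval(inputs, noisy_output, 10)
fit_tight, lo_tight, hi_tight = prediction_interval(inputs, tight_output, 10)

print(f"noisy: fit={fit_noisy:.1f}, PI=({lo_noisy:.1f}, {hi_noisy:.1f})")
print(f"tight: fit={fit_tight:.1f}, PI=({lo_tight:.1f}, {hi_tight:.1f})")
```

The width of the interval is driven mostly by `s`, the scatter around the line, which is why the noisier data produce a much wider interval around a similar fitted value.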

## Key Points about Low R-squared Values

Let’s go over the key points.

- Regression coefficients and fitted values represent means.
- R-squared and prediction intervals represent variability.
- You interpret the coefficients for significant variables the same way regardless of the R-squared value.
- Low R-squared values can warn of imprecise predictions.

What can be done about that low R-squared value? That’s the next question I usually hear in this context. Often, the first thought is to add more variables to the model to increase R-squared.

**Related post**: How High Does R-squared Need to Be?

If you can find legitimate predictors, that can work in some cases. However, for every study area there is an inherent amount of unexplainable variability. For instance, studies that attempt to predict human behavior generally have R-squared values less than 50%. People are hard to predict. You can force a regression model to go past this point but it comes at the cost of misleading regression coefficients, p-values, and R-squared.

Adjusted R-squared and predicted R-squared are tools that help you avoid this problem.
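As a small illustration of the first of these, the standard adjusted R-squared formula penalizes each additional predictor. The numbers below are hypothetical and chosen only to show the effect.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a least squares model with an intercept."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical case: the same R-squared of 30% from 50 observations,
# reached with different numbers of predictors.
r2 = 0.30
for k in (2, 10, 20):
    adj = adjusted_r_squared(r2, n_obs=50, n_predictors=k)
    print(f"{k} predictors: adjusted R-squared = {adj:.3f}")
```

The same raw R-squared is worth less when it takes more predictors to reach it, and with enough predictors the adjusted value can even turn negative.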

If you are mainly interested in understanding the relationships between the variables, the good news is that a low R-squared does not negate the importance of any significant variables. Even with a low R-squared, statistically significant P-values continue to identify relationships and coefficients have the same interpretation. Generally, you have no additional cause to discount these findings.

For more information about choosing the correct regression model, see my post about model specification.

Emma says

Thank you for this detailed response. I have tested my variables for normality and they are left skewed (significant at .001).

How would I report this in a dissertation? I’ve also been asked to do a correlation. Would this be Pearson?

Jim Frost says

Hi Emma,

For regression analysis, you need to assess the normality of the residuals NOT the variables. In my previous reply, I mentioned some things for you to try to produce normal residuals. This is something you need to fix because otherwise your results are unreliable.

Assuming all your variables are continuous, yes, you’d probably use Pearson’s correlation. However, you should graph them in a scatterplot too so you can see if they are straight line relationships because Pearson’s is only valid for straight line relationships.

Emma Wiles says

How would you suggest I document this?

I cannot take out the second variable, which, yes, was non-significant with a p-value of .164. My normal distribution tests also came up as significant, so the assumptions were not met? Is this OK? My participant number was 57, so it may lack statistical power?

Jim Frost says

When you say the normal distribution tests, what are you testing exactly? The variables or the residuals? The assumption applies only to the residuals. If they are severely nonnormal, it is a problem. The significant normality test suggests it violates the normality assumption if it’s for the residuals.

Look at the residual plots. If the residuals are problematic, you can also create scatterplots with each of your predictors and the response. Look for curvature that you might need to model. You might try fitting an interaction term if that makes theoretical sense. You might be able to find a way to tweak the model to produce normal residuals. If nothing works, you might need to use a transformation or even a different type of regression. It really depends on what you’re seeing in the residual plots and scatterplots.

Outside of that issue, having an overall F-test that is not significant and the other predictor not being significant are not problems. You have a significant predictor. Focus on that significant predictor and what it means. And mention that you must include the 2nd predictor even though it’s not significant (explain why you need to include it). It’s usually not a problem to include an extra variable that’s not significant. (It’s more problematic to not include a variable that should be in.) So, you really don’t have a problem on that side of things. You don’t really need to discuss the overall significance of the model because you have a significant predictor.

57 observations is decent for a model with two variables. It could be a lack of statistical power if the effect of the insignificant predictor is small. But you really can’t say for sure. Is it a variable that you think should be significant for theoretical reasons?

But you should try to get normal residuals. Depending on what’s causing the nonnormality, there’s a chance that resolving that will improve your model’s fit and perhaps make the other predictor significant. Maybe, maybe not. But correctly specifying the model should at least improve its fit.

miss Emma Wiles says

Hi Jim.

great information! I have a query.

I have two IV predictors and one DV

my regression model summary states a non-significant effect of p= .099, however, my R square is a low 8.7%.

One predictor is significant p= .04

What does this mean regarding my low R-squared but non-significant model? It’s confusing me, as isn’t that contradictory?

Jim Frost says

Hi Emma,

I’ll assume you’re referring to the F-test of overall significance as the regression model summary with the p-value of 0.099. In a technical sense, that tells you that you can’t reject the null hypothesis that R-squared equals zero. In other words, your R-squared is not significantly different from zero.

It also measures statistical significance differently from the individual predictors, so it’s not surprising that it can get different results.

The second predictor, which you don’t mention the p-value for, must not be significant. If you were to remove the 2nd predictor, chances are the significant predictor remains significant and the overall model would also become significant. Basically, right now, your model is borderline insignificant while the one predictor is borderline significant. It’s these borderline conditions that tend to produce contradictory results. One is just on one side of significance and the other is just on the other side. And the non-significant predictor is enough to tilt the balance for the model.

While you can likely produce a significant overall model using the approach I mentioned above, that may or may not be the best thing to do. If there’s a strong theoretical reason for including the non-significant predictor, and/or if removing it produces residuals that violate assumptions, you probably should leave it in despite the non-significance. As I write in my article about choosing the correct regression model, you shouldn’t settle on your final model due to statistical significance alone but let theory guide you.

As I write in this post you’re commenting on, even though your R-squared is low, you still have a significant predictor. The low R-squared doesn’t take away from the predictor’s significance. But it does tell you there is still a lot of unexplained variance in the response/dependent variable.

I hope that helps!

Stefania says

Hi Jim,

I am encountering a similar problem in my thesis. Could you help me with a reference supporting the validity of the results even in the case of a non-significant model (one of the coefficients is still significant and I would like to interpret it as such)

Aarti says

Hello Jim, Thank you so much for your valuable inputs. Now I think, I should provide a strong justification for my results and proceed with the same.

Talking about control variables, I have already tried to run the model with control variables but that does not make any changes into my result.

Jim Frost says

Hi Aarti,

Agreed! I think you’re all set for that!

Best wishes and I’m sure you’ll do great! It sounds like you have a great understanding of your data and results.

Jim

Aarti says

Hello Jim, I am here sharing the SEM results of my study.

| Path | Original sample (O) | Sample (M) | STDEV | T statistic | P values |
|---|---|---|---|---|---|
| IV -> DV | 0.07 | 0.073 | 0.065 | 1.078 | 0.281 |
| IV -> M1 | 0.687 | 0.69 | 0.029 | 23.687 | 0 |
| IV -> M2 | 0.35 | 0.356 | 0.054 | 6.496 | 0 |
| M1 -> M2 | 0.365 | 0.361 | 0.055 | 6.69 | 0 |
| M2 -> DV | 0.178 | 0.182 | 0.06 | 2.978 | 0.003 |

Serial mediation: specific indirect effects

| Path | Original sample (O) | Sample (M) | STDEV | T statistic | P values |
|---|---|---|---|---|---|
| IV -> M1 -> M2 | 0.251 | 0.249 | 0.038 | 6.573 | 0 |
| M1 -> M2 -> DV | 0.065 | 0.065 | 0.022 | 2.901 | 0.004 |
| IV -> M2 -> DV | 0.062 | 0.065 | 0.024 | 2.594 | 0.01 |
| IV -> M1 -> M2 -> DV | 0.045 | 0.045 | 0.016 | 2.841 | 0.005 |

| F-square | IV | DV | M1 | M2 |
|---|---|---|---|---|
| IV | | 0.003 | 0.895 | 0.113 |
| DV | | | | |
| M1 | | | | 0.123 |
| M2 | | 0.021 | | |

| | R-square | R-square adjusted |
|---|---|---|
| DV | 0.052 | 0.048 |
| M1 | 0.472 | 0.471 |
| M2 | 0.431 | 0.428 |

Jim Frost says

Hi Aarti,

Taking a look at your results, I don’t see anything that contradicts what you and I have both said previously about your results. There does not seem to be a relationship directly between the IV and DV in your sample because the R-squared is low and (more importantly) the p-value is high (not significant). However, the serial mediation (IV > M1 > M2 > DV) is very significant.

The one thing I would ask is if the previous studies included any control variables besides the mediators? Perhaps controlling for demographic information? If so, it’s possible the IV/DV relationship might become significant.

Otherwise, it’s interesting that your data were collected during COVID. The pandemic conditions could explain that lack of a relationship in your sample whereas the previous studies found a direct relationship. Could working remotely explain it? That could make an interesting discussion right there.

I don’t have much else I can add. You’ve got a great handle on what your results are telling you. And you have a good possible explanation for why there are differences with previous literature. Just be transparent about the differences and discuss possible reasons.

Aarti says

Hello Jim, thank you. I understand it might be hard to answer my query, but I appreciate your inputs and I find them quite informative.

To continue the same, I am working on HR practices and employee attitude. My sample size is 530. I am running my model in PLS-SEM.

I can explain the insignificant relationship between my IV and DV with the help of the literature, keeping in mind that my data were collected during the COVID period, which might be a possible reason for the deviation in results. BUT my main concern here is the low value of R-squared (0.057 for the main DV). How should I interpret this?

Jim Frost says

Hi Aarti,

The low R-squared value of 0.057 indicates that the relationship between the IV and DV is very weak and possibly non-existent. What is the p-value for the IV in that specific model?

Aarti says

Dear Jim, first of all, thank you. I found your comments on R-squared very useful. My question is this: I am working on my thesis data, and my data analysis shows that there is no relationship between the IV and DV; only the indirect relationship exists in the presence of 2 mediators (serial mediation). However, the literature supports a positive relationship between the IV and DV.

The R-squared value for the same is very low, 0.052.

- All the measurement model criteria (i.e., loadings, Cronbach’s alpha, AVE, HTMT, VIF values) meet the criteria.
- All other relationships are significant (IV-mediator-DV).

Now my question is, what could be the possible reason for the insignificant relationship between the IV and DV?

How should I proceed with this?

Why is the value of R-squared for the main DV so low, whereas for the mediators it’s 0.45 and 0.52?

If I proceed with the same, would it create any problem at the time of my PhD defence?

Jim Frost says

Hi Aarti,

It will be difficult for me to provide specific answers because I don’t know the subject area. But here are some points to consider.

For one thing, you (and everyone else) are working with samples. So, it’s not only possible but expected that your sample results won’t always match those of previous studies. Consider the context, sample, and methods of previous studies to see how they compare to yours and whether they might explain the differences. You might find issues that explain the difference or perhaps it’s just random differences between samples. Either way, that’ll make good discussion in your defense.

Consider whether your study is adequately powered. An underpowered study might fail to detect a significant relationship that exists. Perhaps the relationship between the IV and DV does exist but your sample was too small or perhaps the measurements were too imprecise.

It is interesting that the relationships with the mediators are significant, but not the direct relationship between the IV and DV. That might not fit previous studies, but it might give you clues in looking for an answer. It’s possible that the relationship between the variables and the mediators is just stronger and therefore detectable by your study while the weaker direct relationship is not. Or, again, it could be just random variation in your sample that is making it different from previous studies. Hard to say. But that’s the type of thing to look at.

I wish I could give you more concrete answers, but I would address the discrepancy head-on. Discuss the differences between your study and previous literature, offering potential explanations. Highlight the importance of the mediators in this context. Even if you don’t fully understand why the differences exist between your study and previous studies, be sure to know whether your study’s methodology, context, or sample differed in any identifiable way from the previous studies because that will provide context that might explain the differences in results.

I hope that helps at least somewhat! Best of luck with your Dissertation!!

Ifeanyichukwu Okoro says

Good day Sir,

From my research, I found out that the independent variables could account for the variation or have a relationship with the dependent variable. But the R2 is 48, 46 and 42 percent. Please, how do I interpret this?

Thank you.

HASRUL BIN HASHOM says

I just wanna say, “thank you Jim” for the sharing. It is really helpful.

Lee Hen Cheah says

Good day. I am trying to gather support from a literature review for the conclusion that an R-squared of 50% is acceptable in predicting probability of default. Can anyone recommend some credible research papers or books, please? Thanks a lot.

Jim Frost says

Hi,

R-squared isn’t the best measure to use when determining whether your model’s predictions are sufficiently precise. You’d want to use either the standard error of the regression (s) or prediction intervals to evaluate precision.

Habiba Ahmed says

Hi Jim, thanks for the very useful article. I ran a regression model trying to predict the impact of a specific practice on human behavior/personality traits. I have a large sample of 1k, p<.001, and R2 of .058 and .037. Do the low R2 values make this whole question/project not meaningful? Not sure if there’s much to say about this, unfortunately. Is the high significance level not meaningful because the large sample size makes basically anything significant, or is this still insightful?

Thanks in advance.

Jim Frost says

Hi Habiba,

The low R-squared value doesn’t necessarily make your project not meaningful for the reasons I explain in this article.

However, the large sample size does raise the possibility that the coefficient effect size might not be practically significant even though it is statistically significant. Using your subject area knowledge and probably a literature review, I’d assess your effect size and see if it represents practical significance. You might want to read my article about statistical vs. practical significance for ideas.

There is a chance that your coefficient represents a practically significant effect, and you could still have meaningful results. But you’ll need to look into it more to know for sure.

jean says

Hi Mr.Jim,

what if my R-squared is 0.074 which is very low, is that acceptable in predicting grade point average?

Jim Frost says

Hi Jean,

A high R-squared isn’t always important, particularly when you’re mainly interested in understanding the relationships between variables. However, when you need to make predictions, you need to have a high R-squared to produce precise predictions. While you can use your model to make predictions, you’ll have no reason to believe that the predicted values will be near the observed values. In other words, your grade point average predictions will likely be off by quite a bit with such a low R-squared.

Adele Manulat says

Hello sir, what if I have 5 predictor variables and only one was significant? I used the backward deletion technique, and as I removed one variable at a time, the adjusted R-squared was increasing, which means that I have removed the useless variables. However, when I had only 2 predictor variables, the adjusted R-squared went from 0.27 to 0.25 (what does this imply?). But still, among the remaining 2, only one has a significant p-value.

Umang Gada says

How would you interpret a low r-squared and high p-value result for a SLR?

Jim Frost says

Hi Umang,

Unfortunately, that type of model doesn’t explain much of the variance and you don’t have any significant independent variables. Taken together, the model doesn’t explain changes in the DV and there are no significant relationships between the IV and DV. There really isn’t anything noteworthy occurring. You should check the residual plots to be sure that you’re not missing curvature or obviously misspecifying the model.

TapiwaMlambo says

Hi Jim, thanks a lot for this article; this is very clearly explained. How about a situation where the linear model is for a dependent variable that is a scale (1-5, ignoring the debate around this for now), the IVs are binary, and you are getting a very low R-squared but significant IVs for the model? How do you explain that, and will you please include graphs as you did in that article? Also, any hints on how you can check the data for a possible explanation for this occurrence? Thanks!

Jim Frost says

Hi,

If you’re using Likert scale data for your DV, then you should be using ordinal logistic regression. You might be OK using least squares regression but be extra careful about checking the assumptions. It might be harder to obtain an adequate fit using OLS with an ordinal DV.

As for your question, I don’t have data to create graphs that show that, but the idea will be the same as I discuss in this post. When you have significant independent variables (even binary ones), you know there’s a relationship between them and the dependent variable. They might not explain a lot of the variance, but there is a relationship.

Arul says

Hi Jim,

Using the below link I tested the strength of the x and y relationship.

https://www.statskingdom.com/linear-regression-calculator.html

For the given X and Y values, the system returned the below result. and the outcome matches exactly with my SPSS report.

Y and X relationship

================

“R Square (R2) equals 0.1753. It means that 17.5% of the variability of Y is explained by X.

correlation (R) equals 0.4187. It means that there is a “MODERATE DIRECT RELATIONSHIP BETWEEN X and Y.”

Y and X relationship

================

R Square (R2) equals 0.1564. It means that 15.6% of the variability of Y is explained by X.

correlation (R) equals 0.3954. It means that there is a “WEAK DIRECT RELATIONSHIP BETWEEN X and Y.”

Y and X relationship

================

R Square (R2) equals 0.2795. It means that 28% of the variability of Y is explained by X.

correlation (R) equals 0.5287. It means that there is a “MODERATE DIRECT RELATIONSHIP BETWEEN X and Y.”

Y and X relationship

================

R Square (R2) equals 0.3873. It means that 38.7% of the variability of Y is explained by X.

correlation (R) equals 0.6223. It means that there is a STRONG DIRECT RELATIONSHIP BETWEEN X and Y.”

I referred to many papers, and the R/R2 value ranges used to classify the strength vary.

Question:

=======

It would be great if you could explain which guidelines or rule of thumb the outcome shown above is derived from. It would also be helpful for my research if you could provide some references.

Awaiting your reply.

Best,

Arul

Jim Frost says

Hi Arul,

I’m not a fan of those types of categorizations. There’s no rule of thumb that applies to all fields. If you’re assessing a physical phenomenon and have high precision measurements, you’d expect correlations to be in the high 90s. In that context, a correlation of 0.8 might be considered unacceptably low.

However, if you’re in psychology and predicting human behavior, you’d be doing very well if you obtained a correlation of 0.5.

That same reasoning applies to R-squared, which in a model with a single predictor is literally just r (the correlation) squared.

So, a generic rule that applies to all cases across all fields of study just isn’t possible. Consequently, I don’t subscribe to any of them. Instead, you need to be familiar with the subject area and what other studies have obtained. Then you can see how your study compares to them.
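The squaring relationship can be checked against the figures quoted in the question above. This quick sketch assumes each result comes from a simple regression with a single X, which is the case where R-squared equals r squared.

```python
# (correlation r, reported R-squared) pairs quoted in the question above
reported = [
    (0.4187, 0.1753),
    (0.3954, 0.1564),
    (0.5287, 0.2795),
    (0.6223, 0.3873),
]

for r, r2 in reported:
    print(f"r = {r:.4f}   r squared = {r * r:.4f}   reported R-squared = {r2:.4f}")

# Largest gap between r squared and the reported R-squared
max_gap = max(abs(r * r - r2) for r, r2 in reported)
print(f"largest gap: {max_gap:.5f}")
```

The small gaps are consistent with rounding in the reported values. The "weak", "moderate", and "strong" labels are something the calculator adds on top of these numbers.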

Reem says

Hi Jim,

Thank you for the useful information. I just want to know the reference/citation of the information provided regarding low R squared. I want to support my explanation since I have low R squared value in my research results.

Jim Frost says

Hi Reem,

That’s a basic property of regression models. I think most regression textbooks would cover that. I generally use *Applied Linear Statistical Models* by Neter et al.

Huazhao Li says

Dear Jim,

I was confused when I used binary logistic regression with pseudo R-squared (Cox and Snell). Firstly, I found the R-squared was very low. Then I noticed that it was caused by the similarity of the results of the dependent variable.

E.g., I set two protocols (protocol A and B) as the independent variable with n=100 in each protocol, and the dependent variable was Yes or No. In protocol A, the proportion was 90%, so there were 90 protocol A cases with a Yes result and 10 with a No result. Then I set the proportion to 40% in protocol B. After regression analysis, the pseudo R-squared was 0.2568. The value was not bad. However, I then set the proportion to 50% in protocol A and 40% in protocol B with n=100 or n=1000 in each protocol. When n=100, the p-value was >0.05; when n=1000, the p-value was <0.05, but the R-squared was very low and equal to 0.0101 whether n=100 or 1000.

It seems that if the results of the dependent variable were close to each other, the R-squared was very low no matter how great the n was.

Hence, is there any term to describe this phenomenon?

And can I still compare models to select the most important independent variables by R-squared even if the R-squared values are very low?

Thank you for your time and help.
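The figures in this comment can be reproduced by hand, which helps show why the sample size doesn’t move them. The sketch below assumes a logistic model with one binary predictor, in which case the fitted probabilities are simply the two group proportions, and uses the standard Cox and Snell formula. The function name `cox_snell_r2` is hypothetical.

```python
import math

def cox_snell_r2(p_yes_a, p_yes_b, n_per_group):
    """Cox and Snell pseudo R-squared for a logistic model with one binary predictor."""
    def log_lik(p, n_yes, n_no):
        # Binomial log-likelihood (constant term omitted; it cancels out)
        return n_yes * math.log(p) + n_no * math.log(1 - p)

    yes_a = p_yes_a * n_per_group
    yes_b = p_yes_b * n_per_group
    no_a = n_per_group - yes_a
    no_b = n_per_group - yes_b
    n_total = 2 * n_per_group

    # With a single binary predictor, fitted probabilities equal the group proportions.
    ll_model = log_lik(p_yes_a, yes_a, no_a) + log_lik(p_yes_b, yes_b, no_b)
    # The intercept-only model predicts the pooled proportion for everyone.
    p_pooled = (yes_a + yes_b) / n_total
    ll_null = log_lik(p_pooled, yes_a + yes_b, no_a + no_b)

    return 1 - math.exp(-2 * (ll_model - ll_null) / n_total)

print(f"90% vs 40%, n=100:  {cox_snell_r2(0.9, 0.4, 100):.4f}")
print(f"50% vs 40%, n=100:  {cox_snell_r2(0.5, 0.4, 100):.4f}")
print(f"50% vs 40%, n=1000: {cox_snell_r2(0.5, 0.4, 1000):.4f}")
```

In this setup the log-likelihood difference grows in proportion to n and is then divided by n, so the statistic depends only on how far apart the two proportions are. The p-value, by contrast, does depend on n, which matches the pattern described in the comment.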

Jarrod says

Hi Jim!

I have to analyse multivariate regression models and determine which variables are significant in the model. How do I do this?

Jim Frost says

Hi Jarrod,

Well, I have a whole book dedicated to that! Regression Analysis: An Intuitive Guide.

If you’re looking for something a bit shorter to get started on, read my post about How to Perform Regression Analysis using Excel.

Katie says

Hi Jim,

I am having a few issues interpreting my multiple regression results. My overall model is not significant (F(5, 64) = 2.27, p = .058) and has a low R-squared, and I have 5 predictors, two of which significantly predict the DV (p = 0.01 and p = 0.02). All my assumptions have been met (e.g., multicollinearity) and I cannot add/remove any IVs. The significant results found by those two predictors confirm one of my hypotheses. However, is this actually confirmed if my overall model is not significant? And I am unsure how to explain WHY my overall model was not significant, even when some IVs predict the DV within the model.

Jim Frost says

Hi Katie,

It sounds like you have 2 significant IVs and 3 IVs that are not significant. Why can’t you remove the insignificant IVs? There are sometimes legitimate reasons you shouldn’t remove them, but analysts will often remove insignificant IVs from the model.

The insignificant IVs are the reason why your overall F-test is not significant. They dilute the overall model’s significance.

If you truly can’t remove the non-significant IVs and the assumptions are satisfied, you can still trust your results. The two significant IVs are truly significant. The non-significant overall F-test doesn’t change that. Including the insignificant IVs can reduce the precision of your model but they don’t invalidate the results.
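The overall F-test and R-squared are tied together algebraically, so the F statistic reported in the question implies an R-squared value. The sketch below assumes an ordinary least squares model with an intercept; the function name is hypothetical.

```python
def r_squared_from_f(f_stat, df_model, df_error):
    """R-squared implied by the overall F statistic of a least squares model.

    Rearranges F = (R2 / df_model) / ((1 - R2) / df_error).
    """
    return f_stat * df_model / (f_stat * df_model + df_error)

# F(5, 64) = 2.27, as reported in the question above
implied_r2 = r_squared_from_f(2.27, df_model=5, df_error=64)
print(f"implied R-squared: {implied_r2:.3f}")  # prints "implied R-squared: 0.151"
```

Because the F statistic divides the explained variation by the number of predictors, the same R-squared spread across 5 predictors yields a smaller F than it would across 2.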

Tom says

Thank you very much Jim.

I really appreciate your help.

Best Wishes,

Tom

Tom says

Hi Jim,

Once again thank you for all explanations, they all very useful. I have one more question please. If on the basis of my model (taking into account 3 predictors) I removed an influential point, and would like now to remove a variable from the model to leave just 2 predictors , would it be ok to continue further from the sample with removed influential point, or should I go back to the total sample that I started with?

Best wishes,

Tom

Jim Frost says

Hi Tom,

I’d fit your final model with and without the influential point to see how much the model changes. If you have a good rationale for removing it from the model with 3 predictors, I’m guessing it’ll still be a valid reason to remove it from the reduced model with two predictors. However, I find it’s helpful to know how much the model estimates change. Be sure to explain the rationale for removing the data point in your write-up!

Tom says

Dear Jim,

Unfortunately I still need some more guidance if possible please. I decided to rerun my analysis and again used the linear multiple regression but with different set of independent variables. Firstly, I used a model with 3 independent variables which was significant but 2 variables missed the P<.05 significance level. One of them had very large β and a p=0.12, the other non-significant variable's β was tiny and p=.43. I decided to remove from the model the one with small β and achieved a nice model which is both significant and with significant predictors (there is still one residual outlier ,just over 1.5 SD- is it ok to leave it?)

I also tried a different method and run back step regression which also indicated the same set of variable to be the best model. However I know that there are mixed views about the back step , which indeed with having only 3 variables, I did not have to run because I easily worked out which variable should be removed.

Nevertheless, I need to write how I worked it out in my thesis. What would be more professional, to write: that I chose to remove the one with small β and large significance level or that I used back step method? Also I do not know how I could write my alternative hypothesis to allow for showing both models . Is it possible to make this hypothesis without stating particular predictors? Or should I make two hypothesis , one for the model with 3 and one for the model with 2 predictors?

Best Wishes,

Tom

Jim Frost says

Hi Tom,

Ideally, you use a mix of statistical measures and theoretical issues to determine which variables to leave in and remove from your model. Read my post about regression model specification for more details about that process.

For your case, unless you have strong theoretical reasons to leave the insignificant variable in the model, I’d consider removing it and just stating that it was not significant. The rationale is that your sample provides insufficient evidence to indicate that it has a relationship with the DV. However, check the residual plots to be sure that removing that variable does not create patterns in your residuals. If you have random residuals after removing the variable, that’s something else you can report to support your decision. If it creates patterns, it’s back to the drawing board!

You have individual null and alternative hypotheses for each independent variable. That’s what the p-values for each IV relate to. Read my post about interpreting regression coefficients and their p-values for more details.

You can also have a null and alternative hypothesis for the entire final model. For more information on that, read my post about the Overall F-test of Significance.

Tom says

Dear Jim,

Thank you very much for all the clarification. I found it extremely helpful.

Best wishes,

Tom

Tom says

Dear Jim ,

I have a significant model with adjusted R square = .65. Only one of my two predictors is significant (β = .71, p < .001); the second predictor is not significant (β = .17, p = .29). I only had a small sample of 22 observations. Could you please help with the interpretation of the data?

Should I say that IV1 predicts DV when controlled for IV2, but IV2 is not a valid predictor?

Also if hypothesis is that those two predict the DV , can I say that it was partially supported?

thanks,

Tom

Jim Frost says

Hi Tom,

I’ve written a post with tips about how to choose the correct regression model that I recommend you read. It can be a challenging process!

The correct answer depends heavily on theory and other findings in the subject area. So, I can’t provide a concrete answer.

Analysts will often remove independent variables that are not significant. That’s one route you can go. In that case, you can talk about the relationship between the one IV and the DV and not worry about the 2nd IV. However, if theory and/or similar studies suggest that the non-significant IV is an important variable, you should consider leaving it in your model. It’s possible that it is a relevant variable that has a place in the correct model even though it’s not significant in your model. That can happen due to chance (sampling error) or because you have a small sample size and your model had insufficient power to detect that effect.

If you include both IVs, you can still say that the coefficient for each IV is estimated while controlling the other IV. Also, if you go with both IVs, you’re leaving the 2nd one in despite a lack of evidence that it is significant. You would do so because of some theoretical reasons. In that case, you can say that while the IV was not significant, you left it in because of theory/subject matter concerns. Your study doesn’t provide evidence that it belongs in the model but other studies and/or theory suggests it belongs.

Keep in mind that insignificant results like that are NOT proof that an effect doesn’t exist. Instead, it just means your study didn’t detect it. To see what I mean in more detail, read my post about failing to reject the null hypothesis (which is what happens for insignificant results).

I don’t know which route is best for you, but that should give you ideas for what to consider.

Several other things to think about. Check the residual plots. If they look good with both variables but not just the one, then go with both. Check for multicollinearity between the two variables. That condition can reduce your statistical power and hide findings. You’ll find those and other tips in my post about choosing the correct model.

I hope that helps! Also, if you’re working on a project involving regression, consider my Regression Analysis: An Intuitive Guide book!

Fanny says

Hello,

I am dealing with GLMM and I found your article, which is very interesting. I have the same problem of low R2 in my modeling. I was wondering how I could estimate the reliability of my GLMM model. R2 is very low but, as you said, it is not the best indicator, so what should I use to see if my model fits well?

Thank you very much in advance for any advice you could give me.

Jim Frost says

Hi Fanny,

First and foremost, I’d check those residual plots! Maybe you’ll see a pattern. If so, that means you can improve your model and potentially increase the R-squared. Also, check to see that at least one predictor is significant.

You might also want to read my article about Specifying the Correct Model for other tips about what to look for!

Ethan says

Dear Jim,

I found a strange phenomenon while checking the R square change in my study.

Could you please help identify some potential reasons why I see this phenomenon?

Also, do you think it is normal for this to happen when applying the fixed-effect panel model?

Basically, I apply the fixed-effect panel model on my sample (~9 yearly observations per company).

My sample included 3000 companies and 12250 observations.

When I apply the fixed-effect panel and include the year fixed-effect, the R-squared does not change much between models when I add new regressors. The adjusted R-squared increases from 0.1938 to 0.1951 when I add 5 more variables to the base model, which is almost no increment.

Nevertheless, if I do not apply the year fixed-effect and only add industry dummies in the fixed-effect model, the change in adjusted R-squared looks normal: from 0.1741 to 0.1951 from the base model to the main model (adding 5 more independent variables).

I do not know how to deal with this situation. Do you mind offering some suggestions and explanations on this issue?

Thank you very much in advance.

Best regards,

Ethan

Vicky says

What if you add other independent variables that correlates with the independent variable you use in the model? Will a model with extra such independent variables decrease effect size and increase p-value of this independent variable?

Janet Acke says

Great article – thank you.

How would you answer the question ‘how much of the variation is explained by these 2 different factors’? If one of the explanatory variables had lots of variation between observations and the other one didn’t, in trying to explain for example wage levels.

If you had 4 individuals who all had very similar levels of schooling but very different levels of IQ, and the dependent variable is wage levels. Would it be best to run a regression on each individually and see which has the higher coefficient, or the higher R^2. Or run them together and look at coefficients?

Say schooling is very significant but doesn’t vary much, would it be fair to say that “schooling explains much of the difference in wages”. Or vice versa and something is hardly significant (lower coefficient, or indeed insignificant) but it varies a lot between observations.

Or is there another value aside from r^2 or the coefficient which captures this idea of variance in both X and Y, as well as variation between them?

Is there a single value that quantifies the notion of ‘how much variation in Y is explained by X1 vs X2… etc’

Thanks so much!

Jim Frost says

Hi Janet,

I’ve written an article that I think will help you out: Identifying the Most Important Variables in Regression Models. It talks about some of the issues you mention and ways to answer that question. If you have more questions after reading that one, please post them in the comments for that post. Thanks!

And, I’ll just add that basing conclusions on four people is an extremely small sample size. I would think it would be hard to draw any firm conclusions from such a small sample.

Akanksha says

Thank you for your easy and to the point concept clarity articles. You are awesome. Every small and big term and concept is explained so intricately.

sajad ahmad mir says

which is true: if r squared increases, variable is significant. if r squared decreases variable is not significant

Jim Frost says

Hi Sajad,

Neither statement is necessarily true. The first can be true or false. If you add a variable to a model, R-squared always increases by some amount whether or not the variable is significant. It does not mean the variable is significant. The second statement is false because the R-squared never decreases when you add a variable even when it is not significant.

If you’re talking about adjusted R-squared, the answers are slightly different.
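To make the adjusted R-squared point concrete, here is a minimal sketch using the standard adjusted R-squared formula. The sample size and the R-squared values are hypothetical numbers chosen for illustration, not results from any real model. It shows a case where adding a predictor raises R-squared slightly while adjusted R-squared falls.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared for a model with k predictors (plus an intercept)
    fit to n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical values: adding a third predictor nudges R-squared up a little.
n = 30
r2_two, r2_three = 0.500, 0.501

adj_two = adjusted_r2(r2_two, n, k=2)
adj_three = adjusted_r2(r2_three, n, k=3)

print(f"R-squared:          {r2_two:.3f} -> {r2_three:.3f}")
print(f"Adjusted R-squared: {adj_two:.3f} -> {adj_three:.3f}")
```

In this made-up case the tiny gain in R-squared does not offset the lost degree of freedom, so adjusted R-squared drops. Neither direction of change is a significance test; the coefficient p-value is still what tells you about significance.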

Matt G says

Hi Jim,

What does it mean when the ANOVA table has reported the model to be significant, but the individual coefficient variables are insignificant? Does this suggest that the independent variables cannot significantly predict the outcome of the dependent variable, but they can better predict the dependent variable once combined into a model?

Many thanks.

Jim Frost says

Hi Matt,

I write about that in my post about the overall F-test. I think you’ll find your answer there. If you have more questions after reading that, don’t hesitate to post them!

Zakia Siddiqui says

Dear Sir,

I hope you are doing well.

It seems my regression model isn’t significant (R square is 0.18, Adjusted R square is 0.12, F is 3.28 and p is 0.13 (Intercept)), right? However, beta is 1.36 and the p-value of the IV is 0.09. Can you please suggest how to do the interpretation?

Thanks in advance.

Jim Frost says

Hi Zakia,

That F-value and its corresponding p-value are for the overall significance of your regression model. Click that link to learn how to interpret it.

From what you write, it looks like neither your model nor the IV is significant. Click to learn how to interpret regression coefficients and their p-values.

Sheran says

Hello Sir, what does a null model mean in simpler terms? And what does a statistically significant model mean?

Jim Frost says

Hi Sheran, I discuss null models and what a statistically significant model means in my post about the overall F-test for regression.

Simon S says

Dear Jim!

We are writing a master’s thesis, and we are having the same problem of the adjusted R-square being too low for most of our independent variables. What are the minimum requirements for beta, R-square and significance levels? Do we also need to calculate the statistical power of our test?

Thank you in advance!

mandakini das says

sir,

My R squared value for moderator variables is only 22%, which is low. But my main effects and interaction effects (moderator) are statistically significant. Is the model a good one?

Deepa Chandwani says

Hi sir..

Thanks a lot for the explanation.

Please help me with my query.

There are two independent variables in my study where my hypothesis is one has positive impact i.e. direct relationship and the other has negative impact i.e. inverse relationship with the dependent variable.

I carried out multiple regression in Microsoft Excel. R2 was 0.19, F value = 3.79 and Significance F = 0.03.

But since the relationships are different, how should I interpret the p-values?

If multiple regression is not to be done then which test should be carried out?

Thanks SIR.

Jim Frost says

Hi Deepa,

It certainly sounds like the right type of problem for multiple regression. You don’t state what the p-values for the independent variables are. It looks like you have a significant model overall. It sounds like you have two IVs. As one IV increases, the mean of the DV also tends to increase. As the other IV increases, the mean of the DV tends to decrease. However, without the p-values for the IVs, I don’t know if both of those relationships are statistically significant.

For more information, read my post about how to interpret regression coefficients and their p-values.

I’m not sure exactly what you’re asking?

Grace Ison says

Thank you so much, Mr. Frost! This is really a great help for my study. Thanks again! 🙂

Jim Frost says

You’re very welcome, Grace. Best of luck with your studies! 🙂

Grace Ison says

Hi Mr. Frost! Thank you for your prompt reply. This helps me a lot in my thesis. I just have one more question. How about if there is a moderating variable like for example, ‘age’ in the multiple regression. Should I also state it this way: “Is there a significant relationship between the independent and dependent variable when age is used as test factor?” or is there another way of stating the research question? Thank you so much!

Jim Frost says

Hi again Grace, a moderating variable exists when there is a significant interaction effect. Suppose you have DV Y, IV X, and a moderating variable M. If there is a statistically significant interaction between X*M, then it indicates that the relationship between X and Y changes based on the value of M. Or, posed as a question, does the relationship between X and Y change based on the value of M?

For more information, read my post about interaction effects.
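As a rough illustration of what a moderating variable does, the sketch below takes a model of the form Y = b0 + b_x*X + b_m*M + b_xm*X*M and computes the slope of X at several values of M. The coefficient values and names are hypothetical, invented for this example; in practice they would come from your fitted model.

```python
def slope_of_x(b_x: float, b_xm: float, m: float) -> float:
    """Effect of a one-unit increase in X on the mean of Y, at a given M,
    for the model Y = b0 + b_x*X + b_m*M + b_xm*X*M."""
    return b_x + b_xm * m


# Hypothetical fitted coefficients (not taken from any real analysis).
b_x, b_xm = 2.0, -0.3

# The X-Y relationship at three different values of the moderator M.
slopes = {m: slope_of_x(b_x, b_xm, m) for m in (0, 5, 10)}
for m, slope in slopes.items():
    print(f"M = {m:>2}: slope of X = {slope:+.2f}")
```

With a non-zero interaction coefficient, the slope of X is different at each value of M, which is what "the relationship between X and Y changes based on the value of M" means. Whether that interaction is real is judged by the p-value of the X*M term.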

Grace Ison says

Hi, I just have a question. What is the right research question for a multiple regression analysis? Is it, “is there a significant relationship between dependent variable and independent variable” or “is there a significant difference between dependent variable and independent variable”. Thank you so much!

Jim Frost says

Hi Grace,

Regression determines whether there is a significant relationship between an independent variable and the dependent variable. You mention “multiple” regression, which means your model contains at least two IVs. The analysis tests the relationship between each IV and the DV.

leeterhao says

hi sir. I have a question.

If my anova result get the r=0.2195 and my r^2 = 0.04. So how can i interpret the value

Jim Frost says

Hi, your R-squared value is extremely low. In fact, it might not be significantly different from zero. Look at your F-test of overall significance to check that.

Your model accounts for 4% of the variance in your dependent variable.

Narendra says

Hello Sir,

In my case, Predicted model is adequate as F value is greater than tabulated F value. But I am getting pure error 29.71 and R-squared value is 77.00%. How can I interpret the results? Please help!

pradip says

sir, how we know over all F test is significant when independent variable is not signficant

Jim Frost says

Hi Pradip,

I’m not 100% sure that I understand what you’re asking. However, you can determine whether the overall F-test is significant by comparing its p-value to your significance level. If its p-value is less than your significance level, the overall F-test is significant. You can read more in my post about the overall F-test.

I hope this helps!

Mostafa Haj Hamidi says

I thank you for your efforts and I wish you a healthy health

Mackenzie says

Hi Jim,

Thank you for the quick response and helpful discussion. I really appreciate it!

Mackenzie says

Hi Jim,

How would you interpret a model that is significant (p < .001) but has a low R-squared (16%), and where only one of the seven predictors is significant (β = .23, p < .01)?

Jim Frost says

Hi Mackenzie,

Because only one predictor is significant, you have enough evidence to conclude that there is a relationship between only that predictor and the response (DV). The low R-squared indicates that there is a lot of variability around the fitted line.

Look at the images of the fitted line plots for the two models in this blog post. Your model more closely resembles the plot for the low R-squared model. You can see that there is a trend, but the distances between the data points and the line are greater.

A low R-squared like that is a problem only when you’re trying to produce accurate predictions. But, if you just want to understand the relationships between variables, you still have that relationship, which is statistically significant.

I hope this helps!
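The contrast between the two fitted line plots can be reproduced with a small simulation. This is only a sketch on made-up data: the true intercept (5), true slope (2), noise levels, sample size, and seed are all assumptions chosen for illustration. Both datasets share the same underlying relationship, and only the scatter around the line differs.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns (slope, intercept, R-squared)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return slope, intercept, 1.0 - sse / sst


def simulate(noise_sd, n=200, seed=1):
    """Simulate y = 5 + 2x + noise and fit a line to the simulated data."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [5.0 + 2.0 * x + rng.gauss(0, noise_sd) for x in xs]
    return fit_line(xs, ys)


slope_tight, _, r2_tight = simulate(noise_sd=1.0)   # little scatter around the line
slope_noisy, _, r2_noisy = simulate(noise_sd=10.0)  # a lot of scatter around the line

print(f"low noise : slope = {slope_tight:.2f}, R-squared = {r2_tight:.2f}")
print(f"high noise: slope = {slope_noisy:.2f}, R-squared = {r2_noisy:.2f}")
```

Both fits estimate a slope in the neighborhood of the true value, but the noisy dataset has a much lower R-squared. The trend is the same; what changes is how far individual points fall from the line, and therefore how precise predictions can be.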

Waseem Majeed says

How can we calculate or say that our regression model is significant?

Jim Frost says

Hi Waseem,

You can assess the coefficient p-values and the F-test of overall significance. I’ve written blog posts on both, which you should read.

Also, consider buying my ebook about regression analysis for a more thorough look!

I hope that helps!

lesley l says

Hi, when reporting results from a linear regression with a t-test (R-squared has a low value; the x axis is temperature, the y axis is change in species richness) and high variability about the line of best fit, how do I report these results?

I am not sure which numbers to report, basically, and how to make sense of them once the test has been run. Do I include the slope of the line? The t-stat? How do I incorporate the standard error into reporting statistical evidence?

I have the coefficients for the intercept and the x variable, plus SE and t-stat values, but I don’t know how to use them in my report.

Hari Poudel says

Dear Jim,

I need to ask a follow-up question to the above explanation. I ran a logistic regression in which one of the IVs is significant across various models. Other IVs are not significant. I also ran a postestimation model test and it also gave me similar kinds of relationships. What would be a better way to explain such relationships? Thank you for your time and help.

Jim Frost says

Hi Hari,

It sounds like your sample only provides sufficient evidence to suggest that the one IV has a relationship with the DV in the population. For your other IVs, the sample does not provide sufficient evidence to conclude that those relationships exist in the population. Whatever sample effects you’re seeing for the non-significant IVs might well be random error rather than representing a relationship that exists in the population.

Pranesh Debnath says

Hello Jim, nice to see your explanation!

I am giving the following data. Kindly clarify my question.

Multiple R=0.339958

R-Square=0.115571

Adj. R-Square=0.109458

Standard Error=0.221261

Observation (sample size)=875

No. of Independent variables=6

My question: Is there any mathematical relationship with “Standard Error” and other information given above as like R-Square and Adj. R-Square?

Jim Frost says

Hi Pranesh,

I’d say that the standard error of the regression is probably the most closely related to adjusted R-squared. Both goodness-of-fit measures use the adjusted sums of squares, which incorporates the error degrees of freedom. However, it would also be related to R-squared even though it does not make that adjustment. In general, as R-squared and, particularly, adjusted R-squared increase for a particular dataset, the standard error tends to decrease.
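The link between the two measures can be written out, assuming an ordinary least squares model with an intercept: adjusted R-squared equals 1 minus S squared divided by the sample variance of the DV, where S is the standard error of the regression. The sketch below applies that identity to the figures Pranesh reports; the function names are made up for this example, and the implied standard deviation of the DV is a derived quantity, not something stated in the comment.

```python
import math


def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared from R-squared, n observations, and k predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def implied_sd_of_dv(std_error: float, adj_r2: float) -> float:
    """Sample standard deviation of the DV implied by
    adj R^2 = 1 - S^2 / s_y^2 (OLS with an intercept assumed)."""
    return std_error / math.sqrt(1.0 - adj_r2)


# Values as reported in the comment above.
n, k = 875, 6
r2 = 0.115571
s = 0.221261

adj = adjusted_r2(r2, n, k)
sd_y = implied_sd_of_dv(s, adj)

print(f"adjusted R-squared recomputed from R-squared: {adj:.6f}")
print(f"implied standard deviation of the DV:         {sd_y:.4f}")
```

The recomputed adjusted R-squared agrees with the reported 0.109458 to within rounding. For a fixed dataset the variance of the DV does not change, so a higher adjusted R-squared necessarily means a smaller standard error of the regression.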

Jerry says

This is indeed an excellent explanation of R-squared with good links to definitions and other articles. Thank you! I’d like to know more about prediction intervals though. Is a prediction interval the range of all output values that occur in the data from the specified input value? And can SPSS produce prediction intervals? thanks

Jim Frost says

Hi Jerry,

Prediction intervals aren’t based on the range of output given the input but rather the mean square error, which is a measure of the variability of the residuals. The mean square error is small when the observed values are close to the fitted values–which in turn produces tighter prediction intervals.

As for SPSS, I don’t regularly use that software. However, I checked around and you can create prediction intervals for simple regression at least. I’m not sure about multiple regression. SPSS refers to them as individual confidence intervals in the regression context. You can create them in SPSS scatterplots. Go to the Chart Editor dialog for scatterplots and click the Fit Line tab. Under Confidence Intervals, choose Individual.

I hope that helps!
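For readers who want to see how the mean square error drives the width of a prediction interval, here is a minimal sketch for simple regression using the textbook formula. The data are hypothetical, and the critical t-value is passed in by hand (2.306 is the two-sided 95% value for 8 degrees of freedom) because the Python standard library has no t-distribution function. This is not a description of what SPSS does internally.

```python
import math


def prediction_interval(xs, ys, x0, t_crit):
    """Prediction interval for a new observation at x0 from a simple
    least-squares line. Returns (lower, fitted, upper)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Mean square error: residual sum of squares over error degrees of freedom.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    mse = sse / (n - 2)

    fitted = intercept + slope * x0
    half_width = t_crit * math.sqrt(mse * (1 + 1 / n + (x0 - x_bar) ** 2 / sxx))
    return fitted - half_width, fitted, fitted + half_width


# Hypothetical data: 10 observations, so the error df is 8.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]
T_CRIT = 2.306  # two-sided 95% critical value of t with 8 degrees of freedom

near = prediction_interval(xs, ys, x0=5.5, t_crit=T_CRIT)
far = prediction_interval(xs, ys, x0=15.0, t_crit=T_CRIT)

print("at x = 5.5 :", tuple(round(v, 2) for v in near))
print("at x = 15.0:", tuple(round(v, 2) for v in far))
```

A smaller MSE shrinks the interval everywhere, and the interval also widens as x0 moves away from the mean of the observed x values.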

Chris Spadi says

Hi Jim,

I made scatter plots where x is the year (spanning over 70 years) and Y is the count (fires) for each month in a specific region. My R2 values range between .22 to .62. Showing a positive trend.

Things like increased drought, rising temperature, decreased precipitation, and wind factor into each year.

Is a R2 value range of .22 to .62 necessarily bad in this case?

Thanks,

Chris

Jim Frost says

Hi Chris, it sounds like in addition to year, you have other independent (X) variables? And, the R-squared varies based on the specific combination of those variables you include in the model?

I’ve noticed that statisticians sometimes want to categorize the strength of the results based on things like the value of R-squared. A well-known one even created ranges of R-squared values and attributed labels such as strong, moderate, and weak.

I don’t think that’s the way to go. Some study areas are inherently more unpredictable than others. When an area is inherently more unpredictable, you’d expect lower R-squared values as a matter of course. You have to adjust your idea of what constitutes a higher R-squared based on the subject area.

So, that all said, those R-squared values aren’t necessarily bad. However, you’ll need to use subject area knowledge and compare your results to similar studies to make that determination. As long as you have significant predictors, your model is supplying important information about the factors that increase fires. One potential downside occurs if you want to use your model to make predictions rather than just understanding the role of underlying factors. Models with R-squared values in the range you mention often produce imprecise predictions (wide prediction intervals).

Best of luck with your analysis!

Ivan says

Hello Jim,

I am trying to predict house prices based on the number of basic house characteristics. R-sq is 0.52, but most of the independent variables have a p-value within 0.2 and 0.99. Does it mean they are all insignificant? How do I know which ones to include and which ones to eliminate? Thank you very much

Jim Frost says

Hi Ivan,

Yes, those variables are not statistically significant. You’d often remove them from the model. For more information, read my post about specifying the correct regression model.

Keryn says

Hi Jim, if my regression model is not significant (R ² = .011, adjusted R ² = .005, F (3, 574) = 2.031, p = .108). But my beta value for one of the IV is significant (just) beta =.082, p=.048, how do I interpret this?

Jim Frost says

Hi Keryn,

It can happen that the overall significance doesn’t necessarily match the fact of whether there are any significant independent variables, such as in your model. These tests usually agree. If you have a significant IV, you usually obtain a significant overall test of significance. But, when you have borderline results, like in your model, they can disagree. For more information, read my post about the overall F-test of significance.
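The overall F-statistic and R-squared are tied together algebraically for a least squares model with an intercept, so one can be recovered from the other. The sketch below applies the standard formulas to the figures Keryn reports, F(3, 574) = 2.031; the function names are made up for this example.

```python
def r2_from_f(f_stat: float, df_model: int, df_error: int) -> float:
    """R-squared implied by the overall F-statistic and its degrees of freedom."""
    return (f_stat * df_model) / (f_stat * df_model + df_error)


def f_from_r2(r2: float, df_model: int, df_error: int) -> float:
    """Overall F-statistic implied by R-squared (the inverse of r2_from_f)."""
    return (r2 / df_model) / ((1.0 - r2) / df_error)


def adjusted_r2(r2: float, df_model: int, df_error: int) -> float:
    """Adjusted R-squared; n - 1 equals df_model + df_error for a model with an intercept."""
    return 1.0 - (1.0 - r2) * (df_model + df_error) / df_error


# Values as reported in the comment above: F(3, 574) = 2.031.
f_stat, df_model, df_error = 2.031, 3, 574

r2 = r2_from_f(f_stat, df_model, df_error)
adj = adjusted_r2(r2, df_model, df_error)

print(f"implied R-squared:          {r2:.4f}")
print(f"implied adjusted R-squared: {adj:.4f}")
```

The implied values are consistent with the reported R² = .011 and adjusted R² = .005 after rounding. The overall F-test is a test of whether R-squared differs from zero, so an R-squared this small needs a lot of data before that test becomes significant, even when one coefficient just clears the 0.05 threshold.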

As for having a significant variable but a very low R-squared, interpret it exactly as I describe in this post. The relationship between the IV and the DV is statistically significant. In other words, knowing the value of the IV provides some information about the DV. However, there is a lot of variability around the fitted values that your model doesn’t explain.

A couple of suggestions for you. Given the extremely low R-squared value, I’d double check the residual plots to be sure that they look good. If you see a non-random pattern in the residual plots, you might be able to improve your model.

Also, given the borderline nature of the p-value for your IV, combined with the non-significant overall test and low R-squared, you should consider these results as suggestive or preliminary.

Best of luck with your analysis!

Alvinn-emmanuel Yao says

Hello Sir,

I have a question. if we have a low R squared value but the P value of F is significant, is the model a good one?

Jim Frost says

Hi,

The first thing you should check is to determine whether you have an independent variable (IV) that has a significant p-value. If you do, then I’d say that the model is probably a good one (always assuming that your residuals look good). You can link an effect to a specific IV or IVs.

If you don’t have an IV that is significant but the overall F-test is significant, it gets a little tricky. In this case, you don’t have sufficient evidence to conclude that any particular IV has a statistically significant relationship with the dependent variable (DV). However, all the IVs in the model taken together have a significant relationship with the DV. Unfortunately, that relationship, as measured by the low R-squared, is fairly weak. You’re pretty much at the minimum limits of useful knowledge in this scenario. You can’t pinpoint the effect to specific IVs and it’s a weak effect to boot. I’d say that a study like this potentially provides evidence that some effect is present but you’d need additional, larger studies to really learn something useful.

So, first step is to look at the p-values for the IVs to see which general scenario I describe above your model fits in. Then, take it from there!

I hope this helps!

Thomas Mkhabela says

Very precise, i really appreciate the answers.

Jim Frost says

Hi Thomas, Thanks! I’m glad it was helpful!

Mark Davis says

Very clear and concise article. You did a great job helping me figure out the right questions to ask!

Abhisek Guha says

Excellent concept clearing article.

Jim Frost says

Thank you, Abhisek!