R-squared is a goodness-of-fit measure for linear regression models. This statistic indicates the percentage of the variance in the dependent variable that the independent variables explain collectively. R-squared measures the strength of the relationship between your model and the dependent variable on a convenient 0 – 100% scale.

After fitting a linear regression model, you need to determine how well the model fits the data. Does it do a good job of explaining changes in the dependent variable? There are several key goodness-of-fit statistics for regression analysis. In this post, we’ll examine R-squared (R^{2}), highlight some of its limitations, and discover some surprises. For instance, small R-squared values are not always a problem, and high R-squared values are not necessarily good!

**Related post**: When Should I Use Regression Analysis?

## Assessing Goodness-of-Fit in a Regression Model

Linear regression identifies the equation that produces the smallest difference between all of the observed values and their fitted values. To be precise, linear regression finds the smallest sum of squared residuals that is possible for the dataset.
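To make the "smallest sum of squared residuals" idea concrete, here is a minimal sketch using NumPy. The data are simulated and the helper name `sum_squared_residuals` is made up for illustration. It fits a least-squares line and then shows that nudging either coefficient away from the fitted values increases the sum of squared residuals.

```python
import numpy as np

def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared differences between observed and fitted values."""
    fitted = intercept + slope * x
    return float(np.sum((y - fitted) ** 2))

# Hypothetical data: a linear trend plus random noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(scale=2.0, size=x.size)

# np.polyfit with deg=1 returns the least-squares [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)
ssr_best = sum_squared_residuals(x, y, intercept, slope)

# Moving either coefficient away from the least-squares solution
# increases the sum of squared residuals.
ssr_shifted_slope = sum_squared_residuals(x, y, intercept, slope + 0.1)
ssr_shifted_intercept = sum_squared_residuals(x, y, intercept + 0.5, slope)

print(f"least-squares SSR:   {ssr_best:.2f}")
print(f"slope + 0.1 SSR:     {ssr_shifted_slope:.2f}")
print(f"intercept + 0.5 SSR: {ssr_shifted_intercept:.2f}")
```

Any other shift you try should also give a larger value than the least-squares fit.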

Statisticians say that a regression model fits the data well if the differences between the observations and the predicted values are small and unbiased. Unbiased in this context means that the fitted values are not systematically too high or too low anywhere in the observation space.

However, before assessing numeric measures of goodness-of-fit, like R-squared, you should evaluate the residual plots. Residual plots can expose a biased model far more effectively than the numeric output by displaying problematic patterns in the residuals. If your model is biased, you cannot trust the results. If your residual plots look good, go ahead and assess your R-squared and other statistics.

Read my post about checking the residual plots.

## R-squared and the Goodness-of-Fit

R-squared evaluates the scatter of the data points around the fitted regression line. It is also called the coefficient of determination, or the coefficient of multiple determination for multiple regression. For the same data set, higher R-squared values represent smaller differences between the observed data and the fitted values.

R-squared is the percentage of the dependent variable variation that a linear model explains.

For a least-squares model that includes an intercept, evaluated on the data used to fit it, R-squared is always between 0 and 100%:

- 0% represents a model that does not explain any of the variation in the response variable around its mean. The mean of the dependent variable predicts the dependent variable as well as the regression model.
- 100% represents a model that explains all of the variation in the response variable around its mean.
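The two endpoints above follow from the usual formula, R-squared = 1 − (residual sum of squares / total sum of squares around the mean). Here is a small sketch of that calculation. The function name `r_squared` and the five data points are made up for illustration.

```python
import numpy as np

def r_squared(y_observed, y_fitted):
    """R-squared = 1 - SS_residual / SS_total (total is around the mean)."""
    y_observed = np.asarray(y_observed, dtype=float)
    y_fitted = np.asarray(y_fitted, dtype=float)
    ss_residual = np.sum((y_observed - y_fitted) ** 2)
    ss_total = np.sum((y_observed - y_observed.mean()) ** 2)
    return float(1.0 - ss_residual / ss_total)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Predicting the mean for every observation explains none of the variation.
r2_mean_only = r_squared(y, np.full_like(y, y.mean()))

# Fitted values that equal the observations explain all of it.
r2_perfect = r_squared(y, y)

# A least-squares line falls somewhere in between.
slope, intercept = np.polyfit(x, y, deg=1)
r2_line = r_squared(y, intercept + slope * x)

print(f"mean-only model: {r2_mean_only:.2f}")  # 0.00
print(f"perfect fit:     {r2_perfect:.2f}")    # 1.00
print(f"fitted line:     {r2_line:.2f}")       # 0.60
```

For these five points, the fitted line accounts for 60% of the variation in `y` around its mean.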

Usually, the larger the R^{2}, the better the regression model fits your observations. However, this guideline has important caveats that I’ll discuss in both this post and the next post.

## Visual Representation of R-squared

To visually demonstrate how R-squared values represent the scatter around the regression line, you can plot the fitted values by observed values.

The R-squared for the regression model on the left is 15%, and for the model on the right it is 85%. When a regression model accounts for more of the variance, the data points are closer to the regression line. In practice, you’ll never see a regression model with an R^{2} of 100%. If you did, the fitted values would equal the data values and, consequently, all of the observations would fall exactly on the regression line.
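If you want to reproduce this effect yourself, the following sketch simulates two datasets that share the same underlying line but differ in how much noise surrounds it. The data, noise levels, and the helper `fit_line_r_squared` are assumptions chosen for illustration. They are not the datasets behind the plots discussed here.

```python
import numpy as np

def fit_line_r_squared(x, y):
    """Fit a least-squares line and return its R-squared."""
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    ss_residual = np.sum((y - fitted) ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_residual / ss_total)

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 200)
signal = 1.0 + 0.5 * x  # same true relationship in both datasets

# Only the amount of scatter around the line differs.
y_tight = signal + rng.normal(scale=0.5, size=x.size)
y_scattered = signal + rng.normal(scale=3.5, size=x.size)

r2_tight = fit_line_r_squared(x, y_tight)
r2_scattered = fit_line_r_squared(x, y_scattered)

print(f"low scatter:  R-squared = {r2_tight:.1%}")
print(f"high scatter: R-squared = {r2_scattered:.1%}")
```

The relationship is identical in both cases. Only the scatter around the line changes, and R-squared tracks that scatter.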

## R-squared has Limitations

You cannot use R-squared to determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots.

R-squared does not indicate if a regression model provides an adequate fit to your data. A good model can have a low R^{2} value. On the other hand, a biased model can have a high R^{2} value!

## Are Low R-squared Values Always a Problem?

No! Regression models with low R-squared values can be perfectly good models for several reasons.

Some fields of study have an inherently greater amount of unexplainable variation. In these areas, your R^{2} values are bound to be lower. For example, studies that try to explain human behavior generally have R^{2} values less than 50%. People are just harder to predict than things like physical processes.

Fortunately, if you have a low R-squared value but the independent variables are statistically significant, you can still draw important conclusions about the relationships between the variables. Statistically significant coefficients continue to represent the mean change in the dependent variable given a one-unit shift in the independent variable. Clearly, being able to draw conclusions like this is vital.
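Here is a rough simulation of that situation: a real slope buried in a lot of noise. The data-generating process (true slope of 2, noise standard deviation of 10, 2,000 observations) is an assumption made for illustration. The t-statistic is computed by hand with the textbook simple-regression formula rather than with a statistics package.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000
x = rng.normal(size=n)
# Assumed data-generating process: true slope of 2 with heavy noise.
y = 1.0 + 2.0 * x + rng.normal(scale=10.0, size=n)

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

ss_residual = np.sum(residuals ** 2)
ss_total = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_residual / ss_total

# Standard error of the slope in simple regression:
# sqrt(residual variance / sum of squared deviations of x).
residual_variance = ss_residual / (n - 2)
slope_se = np.sqrt(residual_variance / np.sum((x - x.mean()) ** 2))
t_statistic = slope / slope_se

print(f"R-squared:       {r2:.1%}")
print(f"estimated slope: {slope:.2f} (standard error {slope_se:.2f})")
print(f"t-statistic:     {t_statistic:.1f}")
```

The R-squared is only a few percent, yet the slope estimate lands near the true value and its t-statistic is far from zero. The low R-squared reflects the noise, not a missing relationship.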

**Related post**: How to Interpret Regression Models that have Significant Variables but a Low R-squared

There is a scenario where small R-squared values can cause problems. If you need to generate predictions that are relatively precise (narrow prediction intervals), a low R^{2} can be a show stopper.

How high does R-squared need to be for the model to produce useful predictions? That depends on the precision that you require and the amount of variation present in your data. A high R^{2} is necessary for precise predictions, but it is not sufficient by itself, as we’ll uncover in the next section.

**Related post**: Understand Precision in Applied Regression to Avoid Costly Mistakes

## Are High R-squared Values Always Great?

No! A regression model with a high R-squared value can have a multitude of problems. You probably expect that a high R^{2} indicates a good model but examine the graphs below. The fitted line plot models the association between electron mobility and density.

The data in the fitted line plot follow a very low-noise relationship, and the R-squared is 98.5%, which seems fantastic. However, the regression line systematically over- and under-predicts the data along the curve, which is bias. The Residuals versus Fits plot emphasizes this unwanted pattern. An unbiased model has residuals that are randomly scattered around zero. Non-random residual patterns indicate a bad fit despite a high R^{2}. Always check your residual plots!

This type of specification bias occurs when your linear model is underspecified. In other words, it is missing significant independent variables, polynomial terms, or interaction terms. To produce random residuals, try adding terms to the model or fitting a nonlinear model.
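The sketch below imitates this problem with simulated data, not the electron mobility data. It assumes a curved (quadratic) relationship with very little noise, fits a straight line to it, and then summarizes the residuals in the lower, middle, and upper thirds of the x range. The variable names are made up for illustration.

```python
import numpy as np

def r_squared(y, fitted):
    """R-squared = 1 - SS_residual / SS_total."""
    ss_residual = np.sum((y - fitted) ** 2)
    ss_total = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_residual / ss_total)

# Hypothetical low-noise data that follow a curve, not a straight line.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 99)
y = x ** 2 + rng.normal(scale=1.0, size=x.size)

# Underspecified model (straight line) versus a model with a squared term.
linear_fit = np.polyval(np.polyfit(x, y, deg=1), x)
quadratic_fit = np.polyval(np.polyfit(x, y, deg=2), x)
r2_linear = r_squared(y, linear_fit)
r2_quadratic = r_squared(y, quadratic_fit)

# Average residual of the straight-line model in each third of the x range.
residuals = y - linear_fit
low, middle, high = np.array_split(residuals, 3)
mean_residual_by_third = [float(part.mean()) for part in (low, middle, high)]

print(f"linear R-squared:    {r2_linear:.1%}")
print(f"quadratic R-squared: {r2_quadratic:.1%}")
print("mean residual by third of x:", np.round(mean_residual_by_third, 2))
```

The straight line earns a high R-squared, but its residuals are positive at both ends of the range and negative in the middle. That systematic pattern is the bias, and adding the squared term removes it.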

**Related post**: Model Specification: Choosing the Correct Regression Model

A variety of other circumstances can artificially inflate your R^{2}. These reasons include overfitting the model and data mining. Either of these can produce a model that looks like it provides an excellent fit to the data but in reality the results can be entirely deceptive.

An overfit model is one where the model fits the random quirks of the sample. Data mining can take advantage of chance correlations. In either case, you can obtain a model with a high R^{2} even for entirely random data!
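You can see this with a quick experiment. The sketch below regresses a response made of pure random noise on predictors that are also pure random noise. The sample size of 20 and the helper `r_squared_ols` are assumptions for illustration. Because each model contains all the predictors of the one before it, R-squared can only go up as predictors are added.

```python
import numpy as np

def r_squared_ols(X, y):
    """Fit least squares with an intercept and return R-squared."""
    design = np.column_stack([np.ones(len(y)), X])
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coefficients
    ss_residual = np.sum((y - fitted) ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_residual / ss_total)

rng = np.random.default_rng(7)
n_observations = 20
y = rng.normal(size=n_observations)                  # pure noise response
predictors = rng.normal(size=(n_observations, 19))   # pure noise predictors

# R-squared as more and more unrelated predictors enter the model.
predictor_counts = [1, 5, 10, 15, 19]
r2_by_count = [r_squared_ols(predictors[:, :k], y) for k in predictor_counts]

for k, r2 in zip(predictor_counts, r2_by_count):
    print(f"{k:2d} random predictors: R-squared = {r2:.1%}")
```

With 19 predictors plus an intercept and only 20 observations, the model has as many parameters as data points, so it reproduces the noise exactly and R-squared reaches 100%.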

**Related post**: Five Reasons Why Your R-squared can be Too High

## R-squared Is Not Always Straightforward

At first glance, R-squared seems like an easy-to-understand statistic that indicates how well a regression model fits a data set. However, it doesn’t tell us the entire story. To get the full picture, you must consider R^{2} values in combination with residual plots, other statistics, and in-depth knowledge of the subject area.

I’ll continue to explore the limitations of R^{2} in my next post and examine two other types of R^{2}: adjusted R-squared and predicted R-squared. These two statistics address particular problems with R-squared. They provide extra information by which you can assess your regression model’s goodness-of-fit.

You can also read about the standard error of the regression, which is a different type of goodness-of-fit measure.

Be sure to read my post where I answer the eternal question: How high does R-squared need to be?

If you’re learning regression and like the approach I use in my blog, check out my eBook!

**Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.**

Hitesh says

Hello Jim… Thanks for such a wonderful explanation about R2..

I would like to know of references, like a book or journal, that explain the limitations of R2 as you have explained them.

Jim Frost says

Hi Hitesh,

Because these involve the basic properties of R-squared, you should be able to find references to these properties in any textbook.

Luca romen says

Hello Jim, thanks for your help. Can you please talk about discrete variables regression? How does it work? How is it different from normal regression?

Jeff says

Thank you for everything, Jim. You explain difficult concepts so easily.

Badr says

Thanks Jim for the article.

I am wondering if there is any way in stat to enhance R2s in the linear regression?

Jim Frost says

Hi, I’m not sure what you mean by “enhance?” If you’re asking how to increase R-squared, you can do that by adding independent variables to your model, properly modeling curvature, and considering interaction terms where appropriate.

However, be sure that if you take any of these actions you’re doing so because they are appropriate variables to add given subject-area knowledge and theory. Don’t simply make your model more complicated to chase a higher R-squared. For any given study area, there’s an inherent amount of unexplainable uncertainty, which represents a ceiling for R-squared. As I mention in this post, you can push past the ceiling but at the risk of producing results that you can’t trust!

And, remember that a lower R-squared isn’t necessarily bad!

Thomas says

Hello, thanks for the great article! There is a concept I can’t wrap my head around for some reason and I’m hoping you can shed some light!

What do we mean when we say that a model “explains” a percentage of a dependent variable’s variation? The variation of the variable y is the squared differences of the actual observations from their mean! Now, I think that the word explain is used metaphorically but still I’m not exactly sure what it actually means! Thanks in advance Jim and thanks again for the article

Jim Frost says

Hi Thomas,

That’s a great question. I actually answer it in my ebook. You can get the free sample version which has the complete first two chapters. The reason I mention that is because I talk about this issue in the 2nd chapter, which is included in the free sample book. It’s a free download and you don’t need a credit card. If you get it, read pages 44-46.

In a nutshell, you calculate different variances to see how well the model fits the data. I personally prefer saying the model “accounts” for variability. It might be a semantics issue, but to me “explains” implies more of a causal relationship. You can actually have a model with proxy variables that are only indirectly correlated with the dependent variable and “explains” doesn’t sound right for that situation either! Read that section of the ebook and if you still have questions, let me know!

I hope this helps!

Dana says

In a hierarchical regression, would the R2 change for, say, the third predictor, tell us the percentage of variance that that predictor is responsible for? I seem to have things that way for some reason but I’m unsure where I got that from or if it was a mistake.

Jim Frost says

Hi Dana,

In general, I find that determining how much R-squared changes when you add a predictor to the model last can be very meaningful. It indicates the amount of variance that a variable accounts for uniquely. You can read more about it in my post about identifying the most important variables in a model.

Takunda says

Hi Jim,

I used a random effects model to perform my regression analysis. Would R-squared and adjusted R-squared serve as appropriate reliability tests? I have all the results from Stata.

Jim Frost says

Hi Takunda,

Unfortunately, I have not used Stata for random effects models. I’ve found this discussion thread, which might be helpful.

Jim says

Wow Jim, thank you so much for this article, I’ve been banging my head against the wall for a while now watching every youtube video I could find trying to understand this. I finally actually feel like I can relate a lot of what you’ve said to my own regression analysis, which is huge for me…… thank you so much.

Jim Frost says

You’re very welcome! Your comment really makes my day because I strive to make statistics more relatable. Because you’re using regression analysis, you might consider my ebook about regression analysis.

Greg says

Very interesting discussion. And I actually understood most of it! I was using Microsoft Excel to chart some daily variances we are experiencing in our fuel storage tanks. (we operate gas bars & convenience stores).

I see that we are experiencing day to day variances (both gains and losses), but I wanted to graph these variances, and run a trend line, to see if we were losing or gaining fuel – over time. Excel has a few options for trend lines (linear, logarithmic & polynomial). Based on your discussion, I used the option with the highest R-squared value, thinking it would be the best predictor. However, all the trend line options had extremely low R-square values…ranging from .5% to 3%. I thought perhaps my data variances were too extreme to allow for a predictive trend line. I was curious as to what a high r-square trend line might look like, so I created a “mock” table of data, covering 30 days, and used numbers that were in a fairly tight range (95 to 105). I graphed this range of data & ran the trend line. I expected the R-square value to be close to 100% – but it’s only at 10%.

Long story short – I can’t figure out why my R-square values are so low!

Jim Frost says

Hi Greg,

I’m glad that you understood most of it! That’s my goal! 🙂

The first thing you should do is just graph it in a scatterplot. Do you see an upward or downward trend? If it’s flat overall, that explains your low R-squared right there. It might be that your variances aren’t related to time. Or, perhaps they are but your data don’t cover enough time to capture it. In other words, your predictor (time, it sounds like) just isn’t explaining the variances.

Make sure that your trend line follows the data. You can get a lower R-squared when your model isn’t fitting the data.

You also want to check for something called heteroscedasticity. If you’re measuring variances using plus and minus values, and the absolute value of the variances increases over time, you could see a flat trend but increases in the spread of values around the fitted line over time. You’d see a cone shape in your data. You can also see that in a residuals plot. You can use the search box on my website to find my post about heteroscedasticity if you see that fan/cone shape in the graph of your data over time. That could happen if both plus and minus variances grew in magnitude over time.

Those are the types of things I’d look into. You could also try other variables instead/in addition to time such as temperature. Anything that might be related to the variances. Maybe it’s related to other conditions rather than the passage of time. Or maybe you need a longer time frame for the time effect to reveal itself?

Sajad Hussain says

Indeed a clear and precise way to understand the concept of R square. I was a bit worried as my R square value was coming out at .037. Thanks for your assistance with multiple regression and its related parameters. I would like to benefit more from upcoming online study materials on statistics.

Thanks & Regards

Angie says

Very easy to follow. Glad I found your site.

Charles says

Thanks Jim, I wonder whether you have YouTube videos. Second, I would like to see an explanation of how to reshape data into a time-to-event format in Stata.

Luyando says

thank you

Alexandros says

Dear Jim,

Very nice work! Just one question.

Could you please share some references? I’m especially interested for the part “Are Low R-squared Values Always a Problem”.

I am writing a report concerning my research (field of asphalt properties) and I’m experiencing lower R square (from 0.21 to 0.469 for different models). In my case, I really believe that asphalt can be as complex as a human and therefore when you try to fit properties in a regression model the interpretation of the result can be similar to the case you give as an example concerning human behavior.

Again thank you for sharing this article and looking forward to your reply!

Cheers!

Jim Frost says

Hi Alexandros,

I don’t have a specific reference for that issue about low R-squared values not always being a problem other than it is based on the equations and accepted properties of R-squared that you’ll find in any regression/linear model text book.

A couple of thoughts for you. One, if you haven’t read it already, you should probably read my post about how to interpret regression models with low R-squared values and significant independent variables. It sounds like this situation matches yours.

Also, there’s an important distinction between complexity and predictability. You can have a complex subject area that is still predictable. Although, it might be more difficult to specify the correct model and/or obtain all the necessary data. On the other hand, I think you can probably argue that you can have a simple subject area that is hard to predict. For example, rolling a fair die, you can only predict the outcome accurately 1/6 of the time! Of course, you can be in a subject area that is both complex and unpredictable!

I mention this distinction because you’ll need to determine whether your subject area is predictable rather than just the complexity. I don’t know your field so I can’t answer that but typically physical properties are more predictable than human behavior.

The practical aspect you need to determine is whether your R-squared is low because it’s inherently unpredictable or because you’re not including an important variable, modeling curvature, modeling an interaction, or possibly using imprecise measurements? If it’s inherently unpredictable, then you’ve hit a brick wall that you can’t legitimately get past. However, if it’s one of the other issues, you can legitimately improve your model. The trick is to determine which case you fall under!

I hope this helps at least a bit!

Kesinee Meekaewnoi says

Hi Jim,

Your works are amazing. It’s very clear and useful. Please keep up the good work.

Jim Frost says

Thank you so much for your kind words, Kesinee! It’s great motivation to keep going! 🙂

Narasimha murthy says

Hi Jim,

Excellent articulation and the language is simple..I enjoy reading your blog.

Jim Frost says

Thank you very much, Narasimha!

Nic says

Hi Jim,

What a fantastic, clear article explaining the ideas behind R-squared;

Jim Frost says

Hi Nic,

Thank you. As for your note about the other text, yes, I’m the author of both–hence the similarity.

Mahogany Hartley says

What would be considered a low R2

Jim Frost says

Hi Mahogany,

That depends on the subject matter. If you are working in the physical sciences and have a low-noise, predictable process, then an R-squared of 60% would be considered to be extremely low and represent some sort of problem with the study. However, if you’re predicting human behavior, the same R-squared would be very high! However, I think any study would consider an R-squared of 15% to be very low.

I hope this helps!

Don says

Sir, I would like to ask: if I obtain R^2 = .1027 (10.27%), can I assume that the dependent and independent variables are inversely proportional to each other?

Jim Frost says

Hi Don,

From the R-squared value, you can’t determine the direction of the relationship(s) between the independent variable(s) and dependent variable. The 10% value indicates that the relationship between your independent variable and dependent variable is weak, but it doesn’t tell you the direction. To make that determination, I’d create a scatterplot using those variables and visually assess the relationship. You can also calculate the correlation, which does indicate the direction. Read my post about understanding correlation.

Ajay verma says

Hello Sir,

In some situations adjusted R square may be negative. How do we interpret it then?

Jim Frost says

Hi Ajay,

Yes, it’s entirely possible for adjusted R-squared (and predicted R-squared) to be negative. Some statistical software will report a 0% for these cases while other software returns the negative value.

The interpretation is really no different than if you had an adjusted R-squared of zero. In the case of zero, you’d say your model is terrible! I guess you could say that a negative value is even worse, but that doesn’t change what you’d do. If you have a zero value (or negative), you know that your model is unusable. The next step is to check the regular R-squared. Is it much higher? If so, your problem might be only that you’re including too many independent variables and you need to use a simpler model. However, if the regular R-squared is similarly low, then you know that your model just isn’t explaining much of the variance.

The only additional information that you might glean from a negative value, as opposed to a small or zero value, is that there is a higher probability that you’re working with a small sample and including too many variables. If you have a large sample size, it’s harder to get a negative value even when your model doesn’t explain much of the variance. So, if you obtain a negative value, be aware that you are probably working with a particularly small sample, which severely limits the degree of complexity for your model that will yield valid results.

I hope this helps!

Kamala says

Hi Jim,

Thank you for your reply…

I will go through your reference for the low R-squared values and get back to you.

Thank You,

Kamala.

Kamala says

:)….

I am just looking at the relationship between variance and regression… Is it true that with more variance the data points are closer to the regression line??? I have doubts about this statement….

Can you explain ….

Regards

Kamala

Jim Frost says

Hi Kamala,

R-squared measures the amount of variance around the fitted values. If you have a simple regression model with one independent variable and create a fitted line plot, it measures the amount of variance around the fitted line. The lower the variance around the fitted values, the higher the R-squared. Another way to think about it is that it measures the strength of the relationship between the set of independent variables and the dependent variable. Either way, the closer the observed values are to the fitted values for a given dataset, the higher the R-squared.

I’ve written a couple of other posts that illustrate this concept in action. In this post about low R-squared values, I compare models with high and low R-squared values to show how they’re different. And, in this post about correlation, I show how the variance around a line indicates the strength of the relationship.

I hope this helps!

Kamala says

Jim Frost… I have seen many sites regarding these concepts… yours is the best!!! Very clear. Thank you so much.

Jim Frost says

Hi Kamala, thanks so much! Your kind comments made my day!

Qayoom Khachoo says

Hi sir thank you very much for the informative post. It would be more enriching if you could kindly highlight the causes of low R-square in panel data. I’m struggling to defend my model because of low R-square value. Thank you

Jim Frost says

Hi Qayoom, you’re in luck because I’ve written an entire post about how a low R-squared isn’t necessarily a problem! It isn’t specifically about panel data, but I think you’ll find the concepts helpful. Best of luck with your model!

Akhilesh Gupta says

Sir, how can I calculate R-squared for time series models, and how do I interpret that R-squared?

Miteya says

Hello Sir, Thank you for the data. Can you please suggest some methods like the R-square method to compare the results I get by using R square methods.

Jim Frost says

Hi Miteya, R-squared is an example of a goodness-of-fit statistic. There are other goodness-of-fit statistics that you can use. I have written about some of them. In the last section of this post, look for and click the links for two posts that are about:

- adjusted and predicted R-squared, and
- standard error of the regression.

These three statistics all assess the goodness-of-fit, like R-squared, but they are different.

The Akaike information criterion (AIC) is another goodness-of-fit statistic. I haven’t written about that one yet, but you can search for it.

I hope this helps!

Dharmendra Dubey says

It is very useful. Thanks a lot sir

Jim Frost says

You’re very welcome! I’m glad you found it to be helpful!