R-squared is a goodness-of-fit measure for linear regression models. This statistic indicates the percentage of the variance in the dependent variable that the independent variables explain collectively. R-squared measures the strength of the relationship between your model and the dependent variable on a convenient 0 – 100% scale.

After fitting a linear regression model, you need to determine how well the model fits the data. Does it do a good job of explaining changes in the dependent variable? There are a several key goodness-of-fit statistics for regression analysis. In this post, we’ll examine R-squared (R^{2 }), highlight some of its limitations, and discover some surprises. For instance, small R-squared values are not always a problem, and high R-squared values are not necessarily good!

**Related post**: When Should I Use Regression Analysis?

## Assessing Goodness-of-Fit in a Regression Model

Linear regression identifies the equation that produces the smallest difference between all of the observed values and their fitted values. To be precise, linear regression finds the smallest sum of squared residuals that is possible for the dataset.

Statisticians say that a regression model fits the data well if the differences between the observations and the predicted values are small and unbiased. Unbiased in this context means that the fitted values are not systematically too high or too low anywhere in the observation space.

However, before assessing numeric measures of goodness-of-fit, like R-squared, you should evaluate the residual plots. Residual plots can expose a biased model far more effectively than the numeric output by displaying problematic patterns in the residuals. If your model is biased, you cannot trust the results. If your residual plots look good, go ahead and assess your R-squared and other statistics.

Read my post about checking the residual plots.

## R-squared and the Goodness-of-Fit

R-squared evaluates the scatter of the data points around the fitted regression line. It is also called the coefficient of determination, or the coefficient of multiple determination for multiple regression. For the same data set, higher R-squared values represent smaller differences between the observed data and the fitted values.

R-squared is the percentage of the dependent variable variation that a linear model explains.

R-squared is always between 0 and 100%:

- 0% represents a model that does not explain any of the variation in the response variable around its mean. The mean of the dependent variable predicts the dependent variable as well as the regression model.
- 100% represents a model that explains all of the variation in the response variable around its mean.

Usually, the larger the R^{2}, the better the regression model fits your observations. However, this guideline has important caveats that I’ll discuss in both this post and the next post.

## Visual Representation of R-squared

To visually demonstrate how R-squared values represent the scatter around the regression line, you can plot the fitted values by observed values.

The R-squared for the regression model on the left is 15%, and for the model on the right it is 85%. When a regression model accounts for more of the variance, the data points are closer to the regression line. In practice, you’ll never see a regression model with an R^{2} of 100%. In that case, the fitted values equal the data values and, consequently, all of the observations fall exactly on the regression line.

## R-squared has Limitations

You cannot use R-squared to determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots.

R-squared does not indicate if a regression model provides an adequate fit to your data. A good model can have a low R^{2} value. On the other hand, a biased model can have a high R^{2} value!

## Are Low R-squared Values Always a Problem?

No! Regression models with low R-squared values can be perfectly good models for several reasons.

Some fields of study have an inherently greater amount of unexplainable variation. In these areas, your R^{2} values are bound to be lower. For example, studies that try to explain human behavior generally have R^{2} values less than 50%. People are just harder to predict than things like physical processes.

Fortunately, if you have a low R-squared value but the independent variables are statistically significant, you can still draw important conclusions about the relationships between the variables. Statistically significant coefficients continue to represent the mean change in the dependent variable given a one-unit shift in the independent variable. Clearly, being able to draw conclusions like this is vital.

**Related post**: How to Interpret Regression Models that have Significant Variables but a Low R-squared

There is a scenario where small R-squared values can cause problems. If you need to generate predictions that are relatively precise (narrow prediction intervals), a low R^{2} can be a show stopper.

How high does R-squared need to be for the model produce useful predictions? That depends on the precision that you require and the amount of variation present in your data. A high R^{2} is necessary for precise predictions, but it is not sufficient by itself, as we’ll uncover in the next section.

**Related post**: Understand Precision in Applied Regression to Avoid Costly Mistakes

## Are High R-squared Values Always Great?

No! A regression model with a high R-squared value can have a multitude of problems. You probably expect that a high R^{2} indicates a good model but examine the graphs below. The fitted line plot models the association between electron mobility and density.

The data in the fitted line plot follow a very low noise relationship, and the R-squared is 98.5%, which seems fantastic. However, the regression line consistently under and over-predicts the data along the curve, which is bias. The Residuals versus Fits plot emphasizes this unwanted pattern. An unbiased model has residuals that are randomly scattered around zero. Non-random residual patterns indicate a bad fit despite a high R^{2}. Always check your residual plots!

This type of specification bias occurs when your linear model is underspecified. In other words, it is missing significant independent variables, polynomial terms, and interaction terms. To produce random residuals, try adding terms to the model or fitting a nonlinear model.

**Related post**: Model Specification: Choosing the Correct Regression Model

A variety of other circumstances can artificially inflate your R^{2}. These reasons include overfitting the model and data mining. Either of these can produce a model that looks like it provides an excellent fit to the data but in reality the results can be entirely deceptive.

An overfit model is one where the model fits the random quirks of the sample. Data mining can take advantage of chance correlations. In either case, you can obtain a model with a high R^{2} even for entirely random data!

**Related post**: Five Reasons Why Your R-squared can be Too High

## R-squared Is Not Always Straightforward

At first glance, R-squared seems like an easy to understand statistic that indicates how well a regression model fits a data set. However, it doesn’t tell us the entire story. To get the full picture, you must consider R^{2} values in combination with residual plots, other statistics, and in-depth knowledge of the subject area.

I’ll continue to explore the limitations of R^{2} in my next post and examine two other types of R^{2}: adjusted R-squared and predicted R-squared. These two statistics address particular problems with R-squared. They provide extra information by which you can assess your regression model’s goodness-of-fit.

You can also read about the standard error of the regression, which is a different type of goodness-of-fit measure.

Be sure to read my post where I answer the eternal question: How high does R-squared need to be?

If you’re learning regression, check out my Regression Tutorial!

**Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.**

Dharmendra Dubey says

It is very useful. Thanks a lot sir

Jim Frost says

You’re very welcome! I’m glad you found it to be helpful!

Miteya says

Hello Sir, Thank you for the data. Can you please suggest some methods like the R-square method to compare the results I get by using R square methods.

Jim Frost says

Hi Miteya, R-squared is an example of a goodness-of-fit statistic. There are other goodness-of-fit statistics that you can use. I have written about some of them. In the last section of this post, look for and click the links for two posts that are about:

adjusted and predicted R-squared, and

standard error of the regression.

These three statistics all assess the goodness-of-fit, like R-squared, but they are different.

The Akaike information criterion (AIC) is another goodness-of-fit statistic. I haven’t written about that one yet, but you can search for it.

I hope this helps!

Akhilesh Gupta says

Sir how can i calculate R-square for time series models and how to interpret that R-square

Qayoom Khachoo says

Hi sir thank you very much for the informative post. It would be more enriching if you could kindly highlight the causes of low R-square in panel data. I’m struggling to defend my model because of low R-square value. Thank you

Jim Frost says

Hi Qayoom, you’re in luck because I’ve written an entire post about how a low R-squared isn’t necessarily a problem! It isn’t specifically about panel data, but I think you’ll find the concepts helpful. Best of luck with your model!

Kamala says

Jim Frost…I have seen many sites regarding these concepts…your’s is the best!!!Very Clear..Thank You so much.

Jim Frost says

Hi Kamala, thanks so much! Your kind comments made my day!

Kamala says

:)….

am just seeing the relationship between variance and regression…is it so that for more variance does the data points are closer to the regression line??? I doubt on this statement….

Can you explain ….

Regards

Kamala

Jim Frost says

Hi Kamala,

R-squared measures the amount of variance around the fitted values. If you have a simple regression model with one independent variable and create a fitted line plot, it measures the amount of variance around the fitted line. The lower the variance around the fitted values, the higher the R-squared. Another way to think about it is that it measures the strength of the relationship between the set of independent variables and the dependent variable. Either way, the closer the observed values are to the fitted values for a given dataset, the higher the R-squared.

I’ve written a couple of other posts that illustrate this concept in action. In this post about low R-squared values, I compare models with high and low R-squared values to show how they’re different. And, in this post about correlation, I show how the variance around a line that indicates the strength of the relationship.

I hope this helps!

Kamala says

Hi Jim,

Thank you for your reply…

I will go through your reference for the low R-squared values and get back to you.

Thank You,

Kamala.

Ajay verma says

Hello Sir,

In some situation adjusted R square may be negative then how we interpret them?

Jim Frost says

Hi Ajay,

Yes, it’s entirely possible for adjusted R-squared (and predicted R-squared) to be negative. Some statistical software will report a 0% for these cases while other software returns the negative value.

The interpretation is really no different than if you had an adjusted R-squared of zero. In the case of zero, you’d say your model is terrible! I guess you could say that a negative value is even worse, but that doesn’t change what you’d do. If you have a zero value (or negative), you know that your model is unusable. The next step is to check the regular R-squared. Is it much higher? If so, your problem might be only that you’re including too many independent variables and you need to use a simpler model. However, if the regular R-squared is similarly low, then you know that your model just isn’t explaining much of the variance.

The only additional information that you might glean from a negative value, as opposed to a small or zero value, is that there is a higher probability that you’re working with a small sample and including too many variables. If you have a large sample size, it’s harder to get a negative value even when your model doesn’t explain much of the variance. So, if you obtain a negative value, be aware that you are probably working with a particularly small sample, which severely limits the degree of complexity for your model that will yield valid results.

I hope this helps!

Don says

Sir I would like to ask, if I acquire is R^2 = .1027 (10.27%) can I assumed that the dependent and indepent variable are inversely proportional to each other?

Jim Frost says

Hi Don,

From the R-squared value, you can’t determine the direction of the relationship(s) between the independent variable(s) and dependent variable. The 10% value indicates that the relationship between your independent variable and dependent variable is weak, but it doesn’t tell you the direction. To make that determination, I’d create a scatterplot using those variables and visually assess the relationship. You can also calculate the correlation, which does indicate the direction. Read my post about understanding correlation.

Mahogany Hartley says

What would be considered a low R2

Jim Frost says

Hi Mahogany,

That depends on the subject matter. If you are working in the physical sciences and has a low noise, predictable process, then an R-squared of 60% would be considered to be extremely low and represent some sort of problem with the study. However, if you’re predicting human behavior, the same R-squared would be very high! However, I think any study would consider and R-squared of 15% to be very low.

I hope this helps!

Nic says

Hi Jim,

What a fantastic, clear article explaining the ideas behind R-squared;

Jim Frost says

Hi Nic,

Thank you. As for your note about the other text, yes, I’m the author of both–hence the similarity.

Narasimha murthy says

Hi Jim,

Excellent articulation and the language is simple..I enjoy reading your blog.

Jim Frost says

Thank you very much, Narasimha!

Kesinee Meekaewnoi says

Hi Jim,

Your works are amazing. It’s very clear and useful. Please keep up the good work.

Jim Frost says

Thank you so much for your kind words, Kesinee! It’s great motivation to keep going! ðŸ™‚

Alexandros says

Dear Jim,

Very nice work! Just one question.

Could you please share some references? I’m especially interested for the part “Are Low R-squared Values Always a Problem”.

I am writing a report concerning my research (field of asphalt properties) and I’m experiencing lower R square (from 0.21 to 0.469 for different models). In my case, I really believe that asphalt can be as complex as a human and therefore when you try to fit properties in a regression model the interpretation of the result can be similar to the case you give as an example concerning human behavior.

Again thank you for sharing this article and looking forward to your reply!

Cheers!

Jim Frost says

Hi Alexandros,

I don’t have a specific reference for that issue about low R-squared values not always being a problem other than it is based on the equations and accepted properties of R-squared that you’ll find in any regression/linear model text book.

A couple of thoughts for you. One, if you haven’t read it already, you should probably read my post about how to interpret regression models with low R-squared values and significant independent variables. It sounds like this situation matches yours.

Also, there’s an important distinction between complexity and predictability. You can have a complex subject area that is still predictable. Although, it might be more difficult to specify the correct model and/or obtain all the necessary data. On the other hand, I think you can probably argue that you can have a simple subject area that is hard to predict. For example, rolling a fair die, you can only predict the outcome accurately 1/6 of the time! Of course, you can be in a subject area that is both complex and unpredictable!

I mention this distinction because you’ll need to determine whether your subject area is predictable rather than just the complexity. I don’t know your field so I can’t answer that but typically physical properties are more predictable than human behavior.

The practical aspect you need to determine is whether your R-squared is low because it’s inherently unpredictable or because you’re not including an important variable, modeling curvature, modeling an interaction, or possibly using imprecise measurements? If it’s inherently unpredictable, then you’ve hit a brick wall that you can’t legitimately get passed. However, if it’s one of the other issues, you can legitimately improve your model. The trick is to determine which case you fall under!

I hope this helps at least a bit!

Luyando says

thank you

Charles says

Thanks Jim, i wonder whether you youtube videos. Second, i would to see an explanation of how to reshape data to have it, in a time to event nature, in STATA.