How high does R-squared need to be in regression analysis? That seems to be an eternal question.

Previously, I explained how to interpret R-squared. I showed how the interpretation of R^{2} is not always straightforward. A low R-squared isn’t always a problem, and a high R-squared doesn’t automatically indicate that you have a good model.

So, how high should R-squared be? The definitive answer is . . . it depends. You’ll need some patience because my assertion is that this question is the wrong question. In this post, I reveal why it is the wrong question and which questions you should ask instead.

**Related post**: Five Reasons Why Your R-squared can be Too High

## How High Does R-squared Need to be is the Wrong Question

How high does R-squared need to be? If you think about it, there is only one correct answer. R-squared should accurately reflect the percentage of the dependent variable variation that the linear model explains. Your R^{2} should not be any higher or lower than this value.
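For reference, here is the usual textbook definition behind "the percentage of the dependent variable variation that the linear model explains." The symbols are standard notation that I'm assuming here rather than something defined earlier in this post: $y_i$ are the observed values, $\hat{y}_i$ are the fitted values, and $\bar{y}$ is the mean of the observed values.

$$
R^{2} = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^{2}}{\sum_{i}(y_i - \bar{y})^{2}}
$$

The numerator is the variation the model leaves unexplained, and the denominator is the total variation. Multiply by 100 to express the result as a percentage.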

The correct R^{2} value depends on your study area. Different research questions have different amounts of variability that are inherently unexplainable. Case in point, humans are hard to predict. Any study that attempts to predict human behavior will tend to have R-squared values less than 50%. However, if you analyze a physical process and have very good measurements, you might expect R-squared values over 90%. There is no one-size-fits-all answer for how high R-squared should be.

Consequently, the answer to “how high does R-squared need to be?” is that it depends on the amount of variability that is actually explainable. Clearly, your R-squared should not be greater than the amount of variability that is actually explainable—which can happen in regression. To see if your R-squared is in the right ballpark, compare your R^{2} to those from other studies.

Chasing a high R^{2} value can produce an inflated value and a misleading model. Read my post about adjusted R-squared and predicted R-squared to see how these statistics can help you avoid these problems.
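As a quick reminder of how adjusted R-squared guards against inflation, here is its standard formula. The notation is my assumption for illustration: $n$ is the number of observations and $p$ is the number of independent variables (not counting the constant).

$$
R^{2}_{\text{adj}} = 1 - \left(1 - R^{2}\right)\frac{n - 1}{n - p - 1}
$$

Because the fraction grows as $p$ increases, adding a variable raises adjusted R-squared only when the improvement in fit is large enough to offset the penalty.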

**Related post**: Model Specification: Choosing the Correct Regression Model

## Define Your Objectives for the Regression Model

When you wonder if the R-squared is high enough, it’s probably because you want to know if the regression model satisfies your objectives. Given your requirements, does the model meet your needs? Therefore, you need to define your objectives before proceeding.

To determine whether a model meets your objectives, you’ll need to ask different questions because R^{2} doesn’t address this issue. The correct questions depend on whether your primary goal for the model is:

- To understand the relationships between the independent variables and dependent variable. Or,
- To predict the dependent variable.

## R-squared and Understanding the Relationships between the Variables

If your primary goal is to understand the relationships between the variables in your model, the answer to how high R-squared needs to be is very simple. For this objective, R^{2} is irrelevant.

This statement might surprise you. However, the interpretation of the significant relationships in a regression model does not change regardless of whether your R^{2} is 15% or 85%! The regression coefficients define the relationship between each independent variable and the dependent variable. The interpretation of the coefficients doesn’t change based on the value of R-squared.

Suppose we have a statistically significant coefficient that equals 2. This coefficient indicates that the mean of the dependent variable increases by 2 for every one-unit increase in the independent variable *irrespective* of the R^{2} value.
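To make this concrete, here is a minimal sketch using simulated data. Everything in it is an assumption for illustration: the true slope of 2, the noise levels, the sample size, and the helper `fit_line` are all made up. It fits the same underlying relationship twice, once with little noise and once with a lot of noise.

```python
import random

def fit_line(x, y):
    """Ordinary least squares with one predictor.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Unexplained variation (residual sum of squares) and total variation.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot

rng = random.Random(42)  # fixed seed so the run is repeatable
x = [rng.uniform(0, 10) for _ in range(2000)]

# Same true relationship (slope = 2), different amounts of random noise.
y_small_noise = [5 + 2 * xi + rng.gauss(0, 2) for xi in x]
y_large_noise = [5 + 2 * xi + rng.gauss(0, 15) for xi in x]

_, slope_a, r2_a = fit_line(x, y_small_noise)
_, slope_b, r2_b = fit_line(x, y_large_noise)

print(f"small noise: slope = {slope_a:.2f}, R-squared = {r2_a:.1%}")
print(f"large noise: slope = {slope_b:.2f}, R-squared = {r2_b:.1%}")
```

Both fits should report a slope close to 2, while the R-squared values land far apart. The noisy data make the slope estimate less precise, but they don't change what the coefficient means.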

**Related post**: See a graphical illustration of why the interpretation of coefficients does not depend on R-squared.

The question about how high R-squared needs to be doesn’t make sense in this context because it *doesn’t matter*. A small R^{2} doesn’t nullify or change the interpretation of the coefficient for an independent variable that is statistically significant.

Instead of wondering if your R-squared value is high enough, you should ask the following questions to ensure that you can trust your results:

- Do I have a sound basis for my model?
- Can I trust my data?
- Do the residual plots look good?
- Do the results fit theory?
- How do I interpret the regression coefficients and P-values?

## R-squared and Predicting the Dependent Variable

On the other hand, if your primary goal is to use your regression model to predict the value of the dependent variable, R-squared is a consideration.

Predictions are more complex than just the single predicted value. Predictions include a margin of error. More precise predictions have a smaller amount of error.

R^{2} is relevant in this context because it is a relative measure of the error: it reflects how much of the variation in the dependent variable the model leaves unexplained. Lower R^{2} values correspond to models with more error, which in turn produces predictions that are less precise. In other words, if your R^{2} is too low, your predictions will be too imprecise to be useful.

A low R-squared can be an indicator of imprecise predictions. However, R^{2} doesn’t tell you directly whether the predictions are sufficiently precise for your requirements.

We need a direct measure of precision that uses the units of the dependent variable. That’s why asking, “How high does R-squared need to be?” still is not the correct question.
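Here is a small sketch of why a unitless percentage can't answer the precision question. It uses simulated data, and the helper `r_squared_and_s` is a hypothetical name of my own. The code fits the same data twice, the second time with the dependent variable recorded in units that are 10 times smaller (so every value is 10 times larger). It reports R-squared next to the standard error of the regression, S, which is in the units of the dependent variable.

```python
import math
import random

def r_squared_and_s(x, y):
    """Fit a one-predictor least squares line; return (r_squared, s).

    s is the standard error of the regression: the typical size of a
    residual, expressed in the units of y.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    s = math.sqrt(ss_res / (n - 2))  # n - 2: two estimated parameters
    return 1 - ss_res / ss_tot, s

rng = random.Random(0)  # fixed seed so the run is repeatable
x = [rng.uniform(0, 10) for _ in range(200)]
y = [3 + 1.5 * xi + rng.gauss(0, 2) for xi in x]

# The same measurements expressed in a unit 10 times smaller.
y_rescaled = [10 * yi for yi in y]

r2_a, s_a = r_squared_and_s(x, y)
r2_b, s_b = r_squared_and_s(x, y_rescaled)

print(f"original units: R-squared = {r2_a:.3f}, S = {s_a:.2f}")
print(f"rescaled units: R-squared = {r2_b:.3f}, S = {s_b:.2f}")
```

R-squared is identical in both fits, while S scales with the units. Only a measure like S, or a prediction interval built from it, can be compared directly against a requirement such as "predictions must be within 5 units."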

Instead, you should ask the questions above plus the following question:

- Are the prediction intervals precise enough for my requirements?

## Using Prediction Intervals to Assess Precision

Most statistical software can calculate prediction intervals, and they are easy to use.

A prediction interval is a range where a single new observation is likely to fall given values of the independent variable(s) that you specify. These ranges incorporate the margin of error around the predicted value. If the prediction intervals are too wide, the predictions don’t provide useful information. Narrow prediction intervals represent more precise predictions.
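If you're curious what your software does behind the scenes, here is a minimal sketch for a model with one independent variable. Treat it as an illustration under stated assumptions, not a replacement for your statistical package: the data are simulated, `prediction_interval` and `max_width` are hypothetical names, and I use a normal quantile as a large-sample stand-in for the t-distribution critical value (with n - 2 degrees of freedom) that statistical software normally uses.

```python
import math
import random
from statistics import NormalDist

def prediction_interval(x, y, x_new, level=0.95):
    """Approximate prediction interval for one new observation at x_new.

    Returns (lower, predicted, upper). Uses a normal quantile, which is
    only a reasonable approximation when the sample is large.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(ss_res / (n - 2))  # standard error of the regression
    predicted = intercept + slope * x_new
    # The leading 1 accounts for the scatter of a single new observation;
    # the other terms account for uncertainty in the fitted line.
    se_pred = s * math.sqrt(1 + 1 / n + (x_new - mean_x) ** 2 / sxx)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return predicted - z * se_pred, predicted, predicted + z * se_pred

rng = random.Random(1)  # fixed seed so the run is repeatable
x = [rng.uniform(0, 10) for _ in range(500)]
y = [5 + 2 * xi + rng.gauss(0, 3.5) for xi in x]  # simulated, not real data

lower, predicted, upper = prediction_interval(x, y, x_new=4)
width = upper - lower

max_width = 10  # hypothetical requirement, in the units of y
is_precise_enough = width <= max_width

print(f"prediction: {predicted:.1f}, 95% PI: [{lower:.1f}, {upper:.1f}]")
print(f"width = {width:.1f}; precise enough? {is_precise_enough}")
```

The last two lines are the important part: the width of the interval is in the units of the dependent variable, so you can compare it directly to your requirement.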

In my post about using regression analysis to make predictions, I present the model displayed in the graph. This model uses BMI to predict the percentage of body fat. The 95% prediction interval for a BMI of 18 is 16-30% body fat. We can be 95% confident that the body fat percentage of an individual with a BMI of 18 will fall within this range.

At this point, you need to use client requirements, spec limits, and subject area knowledge to determine whether the prediction intervals are narrow enough to represent meaningful predictions. By assessing the prediction intervals, you are evaluating the precision of the model directly rather than relying on an arbitrary cut-off value for R-squared.

I’m not a medical expert, but I’d guess that the 14-point range of 16-30% is too wide to provide meaningful information. If this is true, our regression model is too imprecise to be useful.

**Related posts**: Understand Precision in Applied Regression to Avoid Costly Mistakes and Confidence Intervals vs Prediction Intervals vs Tolerance Intervals.

## R-squared Is Overrated!

Asking “How high does R-squared need to be?” is usually not the correct question to ask. You probably want to know if the regression model can meet your needs. To this end, there are better questions that you should ask.

R-squared gets all of the attention for assessing the goodness-of-fit. It seems like a simple statistic to interpret. However, evaluating the fit involves more than just this single statistic. You need to use subject area knowledge, residual plots, and coefficients. If you’re making predictions, you also need prediction intervals.

However, R-squared does have some good uses. For one thing, compare your R^{2} value to values from similar studies. If your R^{2} is markedly higher or lower, you should investigate because there might be a problem.

Be sure to read my post about the standard error of the regression (S), which is a different type of goodness-of-fit measure that is more useful when you need to make predictions.

If you’re learning regression, check out my Regression Tutorial!

avianto nugroho says

Hi Jim,

thank you very much for your posts, very helpful. Although, I am still trying to figure every single theory out.

I am Avi, a master student who is currently writing master’s thesis.

So, I am analysing my data using GAM. To this point, I have come up with several models and done model selections. As a result, I got a model which I think (still not sure) that it is the best model. Considering, the best model is the model having the lowest AIC.

My question is, which one do I have to choose between the highest R-squared and the lowest AIC? Or in between?

Kindly, give me some advices on this case. Likewise, I think you should consider writing a post concerning GAM modelling or similar models particularly its operation using R. Because I would say that your blog is a more simple and understandable for the beginners in statistics.

Thank you again. You’re just cool.

All the best,

Avi