When your regression model has a high R-squared, you assume it’s a good thing. You want a high R-squared, right? However, as I’ll show in this post, a high R-squared can occasionally indicate that there is a problem with your model. I’ll explain five reasons why your R-squared can be too high and how to determine whether one of them affects your regression model.

R-squared is not as intuitive as it seems. In my post about how to interpret R-squared, I explain that small R-squared values are not always a problem, and high R-squared values are not necessarily good! I have also written a post that explains why asking, “how high should R-squared be?” is the wrong question. And, in some cases, I think the standard error of the regression is a better goodness-of-fit measure.

The five reasons I go over aren’t a complete list, but they are the most common explanations.

## High R-squared Values can be a Problem

Let’s start by defining how R-squared can be too high.

R-squared is the percentage of the dependent variable variation that the model explains. The value in your statistical output is an estimate of the population value that is based on your sample. Like other estimates in inferential statistics, you want your R-squared estimate to be close to the population value.
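As a rough illustration of that definition, here is a minimal Python sketch that computes R-squared as one minus the ratio of unexplained variation to total variation around the mean. The function name `r_squared` and the five data points are made up for the example; they are not from any particular dataset or software package.

```python
def r_squared(observed, fitted):
    """Share of the variation around the mean of `observed` that `fitted` accounts for."""
    mean_y = sum(observed) / len(observed)
    ss_total = sum((y - mean_y) ** 2 for y in observed)
    ss_residual = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    return 1 - ss_residual / ss_total

# Hypothetical data: five observed values and the fitted values from a
# least-squares line through them.
observed = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(observed, fitted)
print(f"R-squared: {r2:.0%}")  # total variation is 6, unexplained is 2.4, so 60%
```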

The issues I discuss in this post can create situations where the R^{2} in your output is much higher than the correct value for the entire population. Additionally, these conditions can cause other problems, such as misleading coefficients. Consequently, it *is* possible to have an R-squared value that is too high even though that sounds counter-intuitive.

High R^{2} values are not always a problem. In fact, sometimes you can legitimately expect very large values. For example, if you are studying a physical process and have very precise and accurate measurements, it’s possible to obtain valid R-squared values in the high 90s.

On the other hand, human behavior inherently has much more unexplainable variability, and this produces R^{2} values that are usually less than 50%. 90% is way too high in this context!

You need to use your knowledge of the subject area to determine what R^{2} values are reasonable. Compare your study to comparable studies to see what values they obtained. How inherently unpredictable is your research question?

If your R-squared value is too high, consider the following potential explanations. To determine whether any apply to your regression model, use your expertise, knowledge about your sample data, and the details about the process you used to fit the model.

## Reason 1: R-squared is a biased estimate

Here’s a potential surprise for you. The R-squared value in your regression output has a tendency to be too high. When calculated from a sample, R^{2} is a biased estimator. In statistics, a biased estimator is one that is systematically higher or lower than the population value. R-squared estimates tend to be greater than the correct population value. This bias causes some researchers to avoid R^{2} altogether and use adjusted R^{2} instead.

Think of R-squared as a defective bathroom scale that reads too high on average. That’s the last thing you want! Statisticians have long understood that linear regression methodology gets tripped up by chance correlations that are present in the sample, which causes an inflated R^{2}.
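One way to see the bias is a small simulation, sketched here under assumptions of my own choosing: one predictor, 10 observations per sample, normally distributed noise, and an arbitrary seed. The predictor and the response are generated independently, so the population R-squared is exactly zero, yet the average sample R-squared lands near 1/(n − 1), or roughly 11% here.

```python
import random

def simple_r_squared(x, y):
    """R-squared of a one-predictor least-squares fit (the squared Pearson correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

rng = random.Random(1)        # fixed seed (arbitrary) so the run is repeatable
n_obs, n_sims = 10, 5000      # illustrative sizes, not recommendations

# x and y are independent noise, so the true relationship is nil.
sample_r2 = []
for _ in range(n_sims):
    x = [rng.gauss(0, 1) for _ in range(n_obs)]
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    sample_r2.append(simple_r_squared(x, y))

mean_r2 = sum(sample_r2) / n_sims
print(f"average sample R-squared: {mean_r2:.3f} (population value is 0)")
```

Increasing `n_obs` should pull the average toward zero, which is the chance-correlation effect fading as the sample grows.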

If you had a bathroom scale that reads too high, you’d adjust it downward so that it displays the correct weight on average. Adjusted R-squared does just that with the R^{2} value. It reduces the value of R-squared to offset the bias, which yields an approximately unbiased estimate of the population value. Statisticians refer to this as R-squared shrinkage.

To determine the correct amount of shrinkage, the calculations compare the sample size to the number of terms in the model. When there are few samples per term, the R^{2} bias tends to be larger and requires more shrinkage to correct. Conversely, models with many samples per term need less shrinkage.
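The comparison of sample size to number of terms can be sketched with the common textbook formula for adjusted R-squared. I'm assuming here that `n_terms` counts the predictors and excludes the constant; your software may define or label things slightly differently, so treat this as an illustration and not a reference implementation.

```python
def adjusted_r_squared(r2, n_obs, n_terms):
    """Textbook adjustment: 1 - (1 - R^2) * (n - 1) / (n - k - 1).

    n_terms (k) counts the predictors and excludes the constant.
    """
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_terms - 1)

# Same R-squared and same number of terms, different sample sizes (hypothetical values).
few = adjusted_r_squared(0.60, n_obs=20, n_terms=5)    # 4 observations per term
many = adjusted_r_squared(0.60, n_obs=200, n_terms=5)  # 40 observations per term

print(f"few observations per term:  {few:.3f}")   # about 0.457
print(f"many observations per term: {many:.3f}")  # about 0.590
```

With few observations per term, the same 60% shrinks much more than it does with many observations per term.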

The graph below displays the amount of shrinkage required based on the number of samples per term.

I’ve also written about using adjusted R-squared in a different context. Adjusted R-squared allows you to compare the goodness-of-fit for models with different numbers of terms.

## Reason 2: Overfitting your model

Overfitting a model is a condition where a statistical model begins to describe the random error in the data rather than the relationships between variables. This problem occurs when the model is too complex. Unfortunately, one of the symptoms of an overfit model is an R-squared value that is too high.

While the R^{2} looks good, there can be serious problems with an overfit model. For one thing, the regression coefficients represent the noise rather than the genuine relationships in the population. Additionally, an overfit regression model is tailor-made to fit the random quirks of one sample and is unlikely to fit the random quirks of another sample. Thus, overfitting a regression model reduces its generalizability outside the original dataset.

Adjusted R-squared isn’t designed to detect overfitting, but predicted R-squared can detect it.
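For readers who haven't met predicted R-squared, here is a minimal sketch of the usual leave-one-out idea: remove each observation in turn, refit the model, predict the removed point, and sum the squared prediction errors (often called PRESS). The one-predictor setup, the function names, and the data are all assumptions made for this example.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope

def predicted_r_squared(x, y):
    """1 - PRESS / SS_total, where each point is predicted by a fit that excludes it."""
    mean_y = sum(y) / len(y)
    ss_total = sum((b - mean_y) ** 2 for b in y)
    press = 0.0
    for i in range(len(x)):
        x_rest, y_rest = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        intercept, slope = fit_line(x_rest, y_rest)
        press += (y[i] - (intercept + slope * x[i])) ** 2
    return 1 - press / ss_total

# Hypothetical data with a strong linear relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 2.9, 4.2, 4.8, 6.3, 6.9, 8.4, 8.8]

intercept, slope = fit_line(x, y)
mean_y = sum(y) / len(y)
ss_total = sum((b - mean_y) ** 2 for b in y)
ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
ordinary_r2 = 1 - ss_residual / ss_total
pred_r2 = predicted_r_squared(x, y)

print(f"R-squared: {ordinary_r2:.3f}, predicted R-squared: {pred_r2:.3f}")
```

Predicted R-squared is lower than the ordinary value because each point is judged by a model that never saw it. A gap that is large relative to the ordinary R-squared is the warning sign for overfitting.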

**Related post**: Overfitting Regression Models: Problems, Detection, and Avoidance

## Reason 3: Data mining and chance correlations

Data mining is the process of fitting many different models, trying many different independent variables, and primarily using statistical significance to build the final model rather than being guided by theory. This process introduces a variety of problems, including misleading coefficients and an inflated R-squared value.

For all hypothesis tests, including tests for regression coefficients, there is always the chance of rejecting a null hypothesis that is actually true (Type I error). This error rate equals your significance level, which is often 5%.

Let’s apply this to regression analysis. When you fit many models, you are performing many hypothesis tests on all of the coefficients. In fact, if you use an automated model building procedure like stepwise or best subsets regression, you might be performing hundreds if not thousands of hypothesis tests on your sample. With this many tests, you will inevitably encounter false positives. If you are guided mainly by statistical significance, you’ll keep these variables in the model.
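To put rough numbers on "inevitably," here is a small arithmetic sketch. It assumes the tests are independent and that every null hypothesis is true, which real model-building rarely satisfies exactly, so read it as an order-of-magnitude guide. The function name is made up for the example.

```python
def chance_of_any_false_positive(alpha, n_tests):
    """Probability of at least one Type I error across independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests

for n_tests in (1, 10, 100):
    print(n_tests, round(chance_of_any_false_positive(0.05, n_tests), 3))
# 1 0.05
# 10 0.401
# 100 0.994
```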

How serious is this problem? Data mining can produce statistically significant variables and a high R^{2} from data that are randomly generated! You can’t usually detect these problems using a statistical procedure, and your final model might not be overfit. Often there are no visible signs of problems. So, what do you do?
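Here is a stripped-down sketch of that effect, with assumptions of my own: 30 observations, 100 candidate predictors, everything drawn as independent random noise, and a "search" that simply keeps the single candidate with the best fit. Real stepwise or best subsets procedures keep several variables, so they can inflate R-squared further than this one-variable version does.

```python
import random

def simple_r_squared(x, y):
    """R-squared of a one-predictor least-squares fit (the squared Pearson correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

rng = random.Random(7)              # arbitrary seed for repeatability
n_obs, n_candidates = 30, 100       # illustrative sizes

# The response and every candidate predictor are pure noise, unrelated to each other.
y = [rng.gauss(0, 1) for _ in range(n_obs)]
candidates = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_candidates)]

r2_values = [simple_r_squared(x, y) for x in candidates]
best_r2 = max(r2_values)                             # what a significance-driven search keeps
typical_r2 = sorted(r2_values)[n_candidates // 2]    # what a typical candidate achieves

print(f"typical candidate R-squared: {typical_r2:.3f}")
print(f"best candidate R-squared:    {best_r2:.3f}")
```

The best candidate looks far better than a typical one even though none of them has any real relationship with the response.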

The answer lies in conducting subject-area research before you begin your study. This research helps you reduce the number of models you fit and allows you to compare your results to theory.

**Related post**: See how data mining can inflate the R-squared and cause other problems.

## Reason 4: Trends in Panel (Time Series) Data

If you have panel data and your dependent variable and an independent variable both have trends over time, this can produce inflated R-squared values. Try a time series analysis or include time-related independent variables in your regression model. For instance, try lagging and differencing your variables.
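A minimal sketch of the trend problem and the differencing fix, using made-up data: two series that each drift upward over 100 periods but are otherwise unrelated. The slopes, noise level, and seed are arbitrary assumptions, and first differencing is only one of the remedies mentioned above.

```python
import random

def simple_r_squared(x, y):
    """R-squared of a one-predictor least-squares fit (the squared Pearson correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def difference(series):
    """First differences: the change from each period to the next."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

rng = random.Random(3)   # arbitrary seed for repeatability
n_periods = 100

# Both series trend upward, but their period-to-period noise is independent.
x = [0.3 * t + rng.gauss(0, 1) for t in range(n_periods)]
y = [0.5 * t + rng.gauss(0, 1) for t in range(n_periods)]

r2_levels = simple_r_squared(x, y)
r2_differenced = simple_r_squared(difference(x), difference(y))

print(f"R-squared using the trending levels: {r2_levels:.3f}")
print(f"R-squared after differencing:        {r2_differenced:.3f}")
```

The shared upward drift alone produces a very high R-squared in the levels. Once differencing removes the trend, little is left to explain.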

## Reason 5: Form of a Variable

If you include a different form of the same variable for both the dependent variable and an independent variable, you obtain an artificially inflated R-squared.

For example, if the dependent variable is temperature in Celsius and your model contains an independent variable of temperature on a different scale, your R^{2} is nearly 100%. That’s an obvious example, but there are more subtle forms of it. For instance, you can expect an inflated R^{2} value if your dependent variable is poverty rate and one of your independent variables is income. Poverty rate is defined by income.

If you’re learning regression, check out my Regression Tutorial!

Jonathan says

March 22, 2019 at 3:20 pm

Yes please!

Daniel Acland says

March 22, 2019 at 10:50 am

Thank you very much. What an awesome resource you provide!

Jim Frost says

March 22, 2019 at 11:19 am

You bet–and thanks for the kind words!

Also, I think I need to add some content to this post about R-squared without the constant!

Dan Acland says

March 21, 2019 at 5:06 pm

Thanks for your quick reply. Super helpful. I reran with an omitted category and the R-squared went from .8765 to .0286. That’s the range I’d expected. The point estimates and confidence intervals are the same. I don’t know how Stata runs the regression without an omitted category, but I’ve done it before and they clearly have some kind of solution, because, as I say, the results are the same. Only the F-stat and R-squared are different. The categories are definitely mutually exclusive, so it’s a mystery to me what’s going on under the hood. But if the point estimates and confidence intervals are the same, I should be okay, no?

Jim Frost says

March 21, 2019 at 10:34 pm

That sounds more reasonable! After you fit the model while omitting a level, it is possible to calculate the effect for that omitted level afterwards. That must be what Stata is doing. I’ve seen other software take that approach.

Yes, you should be good! Check the residuals to be sure but it sounds like you’re on the right path!

Dan Acland says

March 21, 2019 at 4:23 pm

I’ve got 754 observations (different players of an online game dispersed around the planet, so probably not much correlation of errors), and my model has nothing but dummies for the eight values of a categorical variable. This is human behavior, explained by one characteristic of the individual. I’m getting R-squared values as high as 90%, which seems improbable. I’m running the regressions with no constant term and no omitted category. I don’t know if that makes a difference. I don’t think I have any of the problems you listed above. Any thoughts about what might be going on?

Jim Frost says

March 21, 2019 at 4:49 pm

Hi Dan,

At least part of the problem is that you’re not including the constant. When you don’t include the constant, R-squared measures the amount of the variation around zero that the model accounts for rather than the amount of variation around the DV’s mean. Fit the model with the constant and you’ll get a valid R-squared. It’ll be lower, potentially much lower. You should almost always fit a regression model with a constant.
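The difference between variation around zero and variation around the mean can be sketched with made-up data that loosely resembles this situation: eight mutually exclusive groups, tiny group effects, and an outcome whose mean is far from zero. The group sizes, effect sizes, and seed are assumptions for illustration only, and the exact R-squared a package reports without a constant depends on the software.

```python
import random

rng = random.Random(11)             # arbitrary seed for repeatability
n_groups, n_per_group = 8, 50       # hypothetical design

# Hypothetical outcome: a mean far from zero, tiny group effects, noisy individuals.
groups, y = [], []
for g in range(n_groups):
    for _ in range(n_per_group):
        groups.append(g)
        y.append(5.0 + 0.1 * g + rng.gauss(0, 2))

# With one dummy per level and no constant, least squares fits each group's mean.
group_means = [sum(v for v, k in zip(y, groups) if k == g) / n_per_group
               for g in range(n_groups)]
fitted = [group_means[g] for g in groups]

ss_residual = sum((v - f) ** 2 for v, f in zip(y, fitted))
grand_mean = sum(y) / len(y)
ss_around_mean = sum((v - grand_mean) ** 2 for v in y)
ss_around_zero = sum(v ** 2 for v in y)

r2_around_mean = 1 - ss_residual / ss_around_mean   # the usual R-squared
r2_around_zero = 1 - ss_residual / ss_around_zero   # the no-constant version

print(f"R-squared around the mean: {r2_around_mean:.3f}")
print(f"R-squared around zero:     {r2_around_zero:.3f}")
```

The fitted values and residuals are identical in both calculations. Only the baseline for "total variation" changes, and that alone moves the result from a few percent to something that looks impressive.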

As for your categorical variable, here’s what it sounds like you’re saying, but let me know if I’m misunderstanding. You have a categorical variable, which you’re representing with 8 dummy variables (aka indicator variables). If that’s the case, and you didn’t omit one level of your categorical variable, that’s a problem.

If your categorical variable has eight values (often referred to as levels), you should include seven of those levels and exclude one, which becomes the reference level. However, one thing puzzles me. If you have eight levels and you include all eight levels, your software shouldn’t even be able to fit the model. The problem occurs because you have perfect multicollinearity, which is a show stopper for fitting a regression model. You can use 7 levels to exactly predict the 8th level.

I’m not sure how you were able to even fit that model. Is each of your levels mutually exclusive from the other levels? If one level applies to an observation, all the other levels should not apply to that observation. If they are mutually exclusive, perhaps there were data entry errors that eliminated the perfect multicollinearity? Each observation (row in your dataset) should have a single value of “1” and 7 zeroes.

BTW, it’s ok if you manually created the indicator variables, but most software will take the raw categorical variable and create the indicator variables behind the scenes for you. It’s quite a bit easier.

I write about using categorical predictors in regression models in much more depth, along with many other aspects of regression, in my ebook about regression analysis that I just published. You can find that in the right-hand sidebar.

Best of luck with your analysis!

Jonathan says

February 13, 2018 at 1:58 am

Hi Jim,

“On the other hand, human behavior inherently has much more unexplainable variability, and this produces R^{2} values that are usually less than 50%. 90% is way too high in this context!”

I was wondering if you had some reputable sources that I could read up on that expand on this specific idea that an R-squared of 90%+ is way too high when modeling behavior. I’m writing a research paper about this idea and am looking for good places to read, but am having trouble navigating through the google clutter.

Thank you for your help!!

Jonathan

Jim Frost says

February 13, 2018 at 9:56 am

Hi Jonathan,

I’ve been asked this before and I never have a good answer. Unfortunately, I don’t have a good reference for you. Those statements are based on what I have seen of research that attempts to predict human behavior. For instance, the R-squared for using SAT scores to predict college success is typically around 25%. For predicting the success of job candidates in their job, it is typically around 10-15%! And, there are other examples. But, from all of the research that I’ve seen, it’s usually 50% or less.

Randi Griffin says

July 30, 2017 at 10:20 pm

What about autocorrelated / pseudoreplicated data? You touch on this when you mention temporal trends, but more generally, ANY unmodeled autocorrelation in the data will lead to a spurious / inflated R squared. Depending on the type of data, autocorrelation could be driven by time, space, phylogeny, a social network, etc.

Jim Frost says

July 30, 2017 at 11:37 pm

Very true! Statistical analysis always requires a very close understanding of the data to avoid being tripped up. I wanted to cover some basics in this blog post, but there are definitely other possibilities.