When your regression model has a high R-squared, you assume it’s a good thing. You want a high R-squared, right? However, as I’ll show in this post, a high R-squared can occasionally indicate that there is a problem with your model. I’ll explain five reasons why your R-squared can be too high and how to determine whether one of them affects your regression model.
R-squared is not as intuitive as it seems. In my post about how to interpret R-squared, I explain that small R-squared values are not always a problem, and high R-squared values are not necessarily good! I have also written a post that explains why asking, “how high should R-squared be?” is the wrong question. And, in some cases, I think the standard error of the regression is a better goodness-of-fit measure.
The five reasons I go over aren’t a complete list, but they are the most common explanations.
High R-squared Values can be a Problem
Let’s start by defining how R-squared can be too high.
R-squared is the percentage of the dependent variable variation that the model explains. The value in your statistical output is an estimate of the population value that is based on your sample. Like other estimates in inferential statistics, you want your R-squared estimate to be close to the population value.
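To make the definition concrete, here is a minimal sketch in Python with NumPy. The data are simulated, and the function name `r_squared` is a hypothetical helper chosen for illustration, not part of any library. It computes R-squared as one minus the ratio of unexplained variation to total variation.

```python
import numpy as np

def r_squared(y, y_hat):
    """Share of the variation in y explained by the fitted values y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)        # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical data: a linear relationship plus random noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 1.5 * x + rng.normal(size=200)

X = np.column_stack([np.ones_like(x), x])    # intercept column plus one predictor
beta, *_ = np.linalg.lstsq(X, y, rcond=None) # ordinary least squares fit
r2 = r_squared(y, X @ beta)
print(f"sample R-squared: {r2:.3f}")
```

The printed value is an estimate from one sample. A different random sample from the same process would give a somewhat different number.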
The issues I discuss in this post can create situations where the R-squared in your output is much higher than the correct value for the entire population. Additionally, these conditions can cause other problems, such as misleading coefficients. Consequently, it is possible to have an R-squared value that is too high, even though that sounds counterintuitive.
High R-squared values are not always a problem. In fact, sometimes you can legitimately expect very large values. For example, if you are studying a physical process and have very precise and accurate measurements, it’s possible to obtain valid R-squared values in the high 90s.
On the other hand, human behavior inherently has much more unexplainable variability, and this produces R-squared values that are usually less than 50%. In this context, 90% would be suspiciously high!
You need to use your knowledge of the subject area to determine what R-squared values are reasonable. Compare your study to similar studies to see what values they obtained. How inherently unpredictable is the outcome you are studying?
If your R-squared value is too high, consider the following potential explanations. To determine whether any apply to your regression model, use your expertise, knowledge about your sample data, and the details about the process you used to fit the model.
Reason 1: R-squared is a biased estimate
Here’s a potential surprise for you. The R-squared value in your regression output has a tendency to be too high. When calculated from a sample, R-squared is a biased estimator. In statistics, a biased estimator is one that is systematically higher or lower than the population value. R-squared estimates tend to be greater than the correct population value. This bias causes some researchers to avoid R-squared altogether and use adjusted R-squared instead.
Think of R-squared as a defective bathroom scale that reads too high on average. That’s the last thing you want! Statisticians have long understood that linear regression methodology gets tripped up by chance correlations that are present in the sample, which causes an inflated R-squared.
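A small simulation can illustrate this bias. The sketch below assumes Python with NumPy and uses hypothetical settings (30 observations, 5 predictors, 2,000 repetitions). The dependent variable is pure noise that is unrelated to the predictors, so the population R-squared is zero. Any R-squared that shows up in a sample comes from chance correlations alone.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, reps = 30, 5, 2000     # hypothetical sample size, predictors, repetitions

r2_values = np.empty(reps)
for i in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    y = rng.normal(size=n)               # unrelated to X: population R-squared is 0
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2_values[i] = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

mean_r2 = r2_values.mean()
print(f"average sample R-squared: {mean_r2:.3f}")
```

Under these assumptions (normal errors and a model with an intercept), theory puts the average sample R-squared at p / (n - 1), which is well above the true value of zero when there are few observations per predictor.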
If you had a bathroom scale that reads too high, you’d adjust it downward so that it displays the correct weight on average. Adjusted R-squared does just that with the R-squared value. Adjusted R-squared reduces the value of R-squared so that it is a much less biased estimate of the population value. Statisticians refer to this as R-squared shrinkage.
To determine the correct amount of shrinkage, the calculations compare the sample size to the number of terms in the model. When there are few observations per term, the R-squared bias tends to be larger and requires more shrinkage to correct. Conversely, models with many observations per term need less shrinkage.
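The usual adjustment can be sketched in a few lines of Python. The function name `adjusted_r_squared` and the example values are hypothetical. The formula is the standard one for a model with an intercept: 1 - (1 - R-squared) × (n - 1) / (n - p - 1), where n is the number of observations and p is the number of terms.

```python
def adjusted_r_squared(r2, n_obs, n_terms):
    """Adjusted R-squared for a model with an intercept and n_terms predictors."""
    if n_obs - n_terms - 1 <= 0:
        raise ValueError("need more observations than terms plus the intercept")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_terms - 1)

# Same R-squared and number of terms, different sample sizes (hypothetical values).
for n_obs in (15, 50, 500):
    adj = adjusted_r_squared(0.60, n_obs, n_terms=5)
    print(f"n = {n_obs:3d}: adjusted R-squared = {adj:.3f}")
```

With only 15 observations for 5 terms, the adjustment is large. With 500 observations, the adjusted value is almost the same as the unadjusted one.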
The graph below displays the amount of shrinkage required based on the number of samples per term.
I’ve also written about using adjusted R-squared in a different context. Adjusted R-squared allows you to compare the goodness-of-fit for models with different numbers of terms.
Reason 2: Overfitting your model
Overfitting a model is a condition where a statistical model begins to describe the random error in the data rather than the relationships between variables. This problem occurs when the model is too complex. Unfortunately, one of the symptoms of an overfit model is an R-squared value that is too high.
While the R-squared looks good, there can be serious problems with an overfit model. For one thing, the regression coefficients represent the noise rather than the genuine relationships in the population. Additionally, an overfit regression model is tailor-made to fit the random quirks of one sample and is unlikely to fit the random quirks of another sample. Thus, overfitting a regression model reduces its generalizability outside the original dataset.
Adjusted R-squared isn’t designed to detect overfitting, but predicted R-squared can help you detect it.
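Here is a rough sketch of how predicted R-squared behaves, assuming Python with NumPy and simulated data. Predicted R-squared is commonly computed from the PRESS statistic: each observation is predicted by a model fit without it, and the resulting leave-one-out residuals replace the ordinary residuals. The function names and the data below are hypothetical. For ordinary least squares, the leave-one-out residual equals the ordinary residual divided by one minus the observation's leverage, so no refitting is needed.

```python
import numpy as np

def r_squared(X, y):
    """Ordinary R-squared for an OLS fit; X must include the intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def predicted_r_squared(X, y):
    """PRESS-based (leave-one-out) R-squared for the same OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    Q, _ = np.linalg.qr(X)                    # thin QR; leverage = squared row norms of Q
    leverage = np.sum(Q ** 2, axis=1)
    loo_resid = resid / (1.0 - leverage)      # residual for each point when it is left out
    return 1.0 - loo_resid @ loo_resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(2)
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)        # only x truly matters
noise = rng.normal(size=(n, 15))              # 15 irrelevant predictors

X_simple = np.column_stack([np.ones(n), x])
X_overfit = np.column_stack([X_simple, noise])

for name, X in (("simple", X_simple), ("overfit", X_overfit)):
    print(f"{name:8s} R2 = {r_squared(X, y):.3f}  "
          f"predicted R2 = {predicted_r_squared(X, y):.3f}")
```

The overfit model always has the higher ordinary R-squared because it contains the simple model. A large gap between R-squared and predicted R-squared is the warning sign to look for.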
Reason 3: Data mining and chance correlations
Data mining is the process of fitting many different models, trying many different independent variables, and primarily using statistical significance to build the final model rather than being guided by theory. This process introduces a variety of problems, including misleading coefficients and an inflated R-squared value.
For all hypothesis tests, including tests for regression coefficients, there is always the chance of rejecting a null hypothesis that is actually true (Type I error). When the null hypothesis is true, the probability of this error equals your significance level, which is often 5%.
Let’s apply this to regression analysis. When you fit many models, you are performing many hypothesis tests on all of the coefficients. In fact, if you use an automated model building procedure like stepwise or best subsets regression, you might be performing hundreds if not thousands of hypothesis tests on your sample. With this many tests, you will inevitably encounter false positives. If you are guided mainly by statistical significance, you’ll keep these variables in the model.
How serious is this problem? Data mining can produce statistically significant variables and a high R-squared from data that are randomly generated! You can’t usually detect these problems using a statistical procedure, and your final model might not be overfit. Often there are no visible signs of problems. So, what do you do?
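To see how this can happen, consider the following sketch in Python with NumPy. All sizes and names are hypothetical. Every variable is random noise, so no predictor has any real relationship with the dependent variable. The "mined" model keeps the five candidates that happen to correlate most strongly with the dependent variable in this sample. The comparison model uses five candidates chosen before looking at the data.

```python
import numpy as np

def r_squared(X, y):
    """Ordinary R-squared for an OLS fit; X must include the intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(3)
n, n_candidates, n_keep = 100, 50, 5           # hypothetical sizes
candidates = rng.normal(size=(n, n_candidates))
y = rng.normal(size=n)                         # unrelated to every candidate

# "Mine" the data: keep the candidates most correlated with y in this sample.
corr = np.array([np.corrcoef(candidates[:, j], y)[0, 1]
                 for j in range(n_candidates)])
best = np.argsort(-np.abs(corr))[:n_keep]

X_mined = np.column_stack([np.ones(n), candidates[:, best]])
X_fixed = np.column_stack([np.ones(n), candidates[:, :n_keep]])  # chosen before seeing y

r2_mined = r_squared(X_mined, y)
r2_fixed = r_squared(X_fixed, y)
print(f"pre-specified predictors: R-squared = {r2_fixed:.3f}")
print(f"mined predictors:         R-squared = {r2_mined:.3f}")
```

This selection rule is a simplified stand-in for stepwise or best subsets procedures, not a reproduction of them. The point is only that choosing variables based on the sample inflates R-squared even when the true value is zero.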
The answer lies in conducting subject-area research before you begin your study. This research helps you reduce the number of models you fit and allows you to compare your results to theory.
Reason 4: Trends in Panel (Time Series) Data
If you have panel data and your dependent variable and an independent variable both have trends over time, this can produce inflated R-squared values. Try a time series analysis or include time-related independent variables in your regression model. For instance, try lagging and differencing your variables.
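The effect of shared trends, and of differencing as a remedy, can be sketched with simulated data in Python with NumPy. The two series below are hypothetical. Each has its own upward trend plus independent noise, so neither one influences the other. The sketch uses a single pair of series rather than a full panel to keep things short.

```python
import numpy as np

def r_squared(x, y):
    """R-squared from a simple regression of y on x (the squared correlation)."""
    return np.corrcoef(x, y)[0, 1] ** 2

rng = np.random.default_rng(4)
t = np.arange(200)                                   # time index
x = 0.3 * t + rng.normal(scale=5.0, size=t.size)     # trending, otherwise unrelated to y
y = 0.5 * t + rng.normal(scale=5.0, size=t.size)

r2_levels = r_squared(x, y)                          # regression on the raw series
r2_diff = r_squared(np.diff(x), np.diff(y))          # regression on first differences

print(f"levels:      R-squared = {r2_levels:.3f}")
print(f"differences: R-squared = {r2_diff:.3f}")
```

In the raw series, the shared trend dominates and R-squared is high. After differencing, which removes the trend, the apparent relationship largely disappears.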
Reason 5: Form of a Variable
If you include a different form of the same variable for both the dependent variable and an independent variable, you obtain an artificially inflated R-squared.
For example, if the dependent variable is temperature in Celsius and your model contains an independent variable of temperature on a different scale, your R-squared is nearly 100%. That’s an obvious example, but there are more subtle forms of it. For instance, you can expect an inflated R-squared value if your dependent variable is poverty rate and one of your independent variables is income. Poverty rate is defined by income.
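The temperature example is easy to check with a short sketch in Python with NumPy. The readings are simulated, and rounding the Fahrenheit values to one decimal place is an assumption meant to mimic a thermometer display.

```python
import numpy as np

rng = np.random.default_rng(5)
celsius = rng.normal(loc=15.0, scale=8.0, size=100)      # hypothetical readings
fahrenheit = np.round(celsius * 9.0 / 5.0 + 32.0, 1)     # same quantity, rounded like a display

# With a single predictor, R-squared equals the squared correlation.
r2 = np.corrcoef(fahrenheit, celsius)[0, 1] ** 2
print(f"R-squared: {r2:.6f}")
```

The model explains almost all of the variation because both variables measure the same thing. A result like this tells you nothing new about temperature.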