When your regression model has a high R-squared, you assume it’s a good thing. You want a high R-squared, right? However, as I’ll show in this post, a high R-squared can occasionally indicate that there is a problem with your model. I’ll explain five reasons why your R-squared can be too high and how to determine whether one of them affects your regression model.
R-squared is not as intuitive as it seems. In my post about how to interpret R-squared, I explain that small R-squared values are not always a problem, and high R-squared values are not necessarily good! I have also written a post that explains why asking, “how high should R-squared be?” is the wrong question. And, in some cases, I think the standard error of the regression is a better goodness-of-fit measure.
The five reasons I go over aren’t a complete list, but they are the most common explanations.
High R-squared Values can be a Problem
Let’s start by defining how R-squared can be too high.
R-squared is the percentage of the dependent variable variation that the model explains. The value in your statistical output is an estimate of the population value that is based on your sample. Like other estimates in inferential statistics, you want your R-squared estimate to be close to the population value.
The issues I discuss in this post can create situations where the R-squared in your output is much higher than the correct value for the entire population. Additionally, these conditions can cause other problems, such as misleading coefficients. Consequently, it is possible to have an R-squared value that is too high, even though that sounds counterintuitive.
High R-squared values are not always a problem. In fact, sometimes you can legitimately expect very large values. For example, if you are studying a physical process and have very precise and accurate measurements, it’s possible to obtain valid R-squared values in the high 90s.
On the other hand, human behavior inherently has much more unexplainable variability, and this produces R-squared values that are usually less than 50%. 90% is way too high in this context!
You need to use your knowledge of the subject area to determine what R-squared values are reasonable. Compare your study to similar studies to see what values they obtained. How inherently unpredictable is your research question?
If your R-squared value is too high, consider the following potential explanations. To determine whether any apply to your regression model, use your expertise, knowledge about your sample data, and the details about the process you used to fit the model.
Reason 1: R-squared is a biased estimate
Here’s a potential surprise for you. The R-squared value in your regression output has a tendency to be too high. When calculated from a sample, R-squared is a biased estimator. In statistics, a biased estimator is one that is systematically higher or lower than the population value. R-squared estimates tend to be greater than the correct population value. This bias causes some researchers to avoid R-squared altogether and use adjusted R-squared instead.
Think of R-squared as a defective bathroom scale that reads too high on average. That’s the last thing you want! Statisticians have long understood that linear regression methodology gets tripped up by chance correlations that are present in the sample, which causes an inflated R-squared.
If you had a bathroom scale that reads too high, you’d adjust it downward so that it displays the correct weight on average. Adjusted R-squared does just that with the R-squared value. Adjusted R-squared reduces the value of R-squared until it becomes an unbiased estimate of the population value. Statisticians refer to this as R-squared shrinkage.
To determine the correct amount of shrinkage, the calculations compare the sample size to the number of terms in the model. When there are few samples per term, the R2 bias tends to be larger and requires more shrinkage to correct. Conversely, models with many samples per term need less shrinkage.
The graph below displays the amount of shrinkage required based on the number of samples per term.
I’ve also written about using adjusted R-squared in a different context. Adjusted R-squared allows you to compare the goodness-of-fit for models with different numbers of terms.
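To make the shrinkage concrete, here’s a quick sketch of my own (not output from any statistical package) of the standard adjusted R-squared formula, which shrinks more when there are fewer observations per model term:

```python
# Standard adjusted R-squared formula: shrinkage grows as the number of
# observations per model term shrinks.
def adjusted_r_squared(r2, n, k):
    """r2: sample R-squared, n: number of observations, k: predictor terms."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# The same sample R-squared of 0.60 with 5 predictor terms:
small_sample = adjusted_r_squared(0.60, 15, 5)    # 3 observations per term
large_sample = adjusted_r_squared(0.60, 150, 5)   # 30 observations per term

print(round(small_sample, 3))  # 0.378 -- heavy shrinkage
print(round(large_sample, 3))  # 0.586 -- mild shrinkage
```

With only 3 observations per term, the same sample R-squared of 0.60 shrinks to about 0.38; with 30 observations per term, it shrinks only to about 0.59.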
Reason 2: Overfitting your model
Overfitting a model is a condition where a statistical model begins to describe the random error in the data rather than the relationships between variables. This problem occurs when the model is too complex. Unfortunately, one of the symptoms of an overfit model is an R-squared value that is too high.
While the R-squared looks good, there can be serious problems with an overfit model. For one thing, the regression coefficients represent the noise rather than the genuine relationships in the population. Additionally, an overfit regression model is tailor-made to fit the random quirks of one sample and is unlikely to fit the random quirks of another sample. Thus, overfitting a regression model reduces its generalizability outside the original dataset.
Adjusted R-squared isn’t designed to detect overfitting, but predicted R-squared can.
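To see overfitting inflate R-squared firsthand, here’s a small simulation of my own (the setup and numbers are illustrative). The response and all 15 predictors are pure random noise, so the population R-squared is zero, yet the complex model reports a much larger training R-squared than the simple one:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 25
X = rng.normal(size=(n, 15))   # 15 candidate predictors, all pure noise
y = rng.normal(size=n)         # response is also pure noise

def fit_r_squared(X, y):
    """Ordinary least squares with an intercept; returns training R-squared."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

r2_small = fit_r_squared(X[:, :2], y)   # 2 noise predictors
r2_big = fit_r_squared(X, y)            # all 15 noise predictors: typically much higher

print(round(r2_small, 2), round(r2_big, 2))
```

Because the 15-predictor model nests the 2-predictor model, its training R-squared is always at least as high, even though every predictor is noise.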
Related post: Overfitting Regression Models: Problems, Detection, and Avoidance
Reason 3: Data mining and chance correlations
Data mining is the process of fitting many different models, trying many different independent variables, and primarily using statistical significance to build the final model rather than being guided by theory. This process introduces a variety of problems, including misleading coefficients and an inflated R-squared value.
For all hypothesis tests, including tests for regression coefficients, there is always the chance of rejecting a null hypothesis that is actually true (Type I error). This error rate equals your significance level, which is often 5%.
Let’s apply this to regression analysis. When you fit many models, you are performing many hypothesis tests on all of the coefficients. In fact, if you use an automated model building procedure like stepwise or best subsets regression, you might be performing hundreds if not thousands of hypothesis tests on your sample. With this many tests, you will inevitably encounter false positives. If you are guided mainly by statistical significance, you’ll keep these variables in the model.
How serious is this problem? Data mining can produce statistically significant variables and a high R-squared from data that are randomly generated! You can’t usually detect these problems using a statistical procedure, and your final model might not be overfit. Often there are no visible signs of problems. So, what do you do?
The answer lies in conducting subject-area research before you begin your study. This research helps you reduce the number of models you fit and allows you to compare your results to theory.
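Here’s a quick simulation of my own that mimics data mining: generate a response and 200 candidate predictors that are all random noise, keep the 5 candidates that happen to correlate most strongly with the response in this sample, and fit a model with them. The numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

n, candidates = 30, 200
X = rng.normal(size=(n, candidates))   # 200 candidate predictors, all noise
y = rng.normal(size=n)                 # response unrelated to every predictor

# "Data mining" step: keep the 5 candidates most correlated with y in this sample.
cors = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(candidates)])
best = np.argsort(-np.abs(cors))[:5]

# Fit OLS with an intercept using only the screened predictors.
design = np.column_stack([np.ones(n), X[:, best]])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
resid = y - design @ beta
r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
print(round(r2, 2))  # well above zero even though every predictor is noise
```

The screening step cherry-picks chance correlations, so the final model’s R-squared is far above the true population value of zero.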
Related post: See how data mining can inflate the R-squared and cause other problems.
Reason 4: Trends in Panel (Time Series) Data
If you have panel data and your dependent variable and an independent variable both have trends over time, this can produce inflated R-squared values. Try a time series analysis or include time-related independent variables in your regression model. For instance, try lagging and differencing your variables.
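As a quick illustration of my own (not from any real dataset), two completely unrelated series that both trend over time produce a huge R-squared, and differencing removes most of it:

```python
import numpy as np

rng = np.random.default_rng(3)

t = np.arange(100)
y = 0.5 * t + rng.normal(scale=2, size=100)   # trending response
x = 0.3 * t + rng.normal(scale=2, size=100)   # unrelated trending predictor

def r_squared(x, y):
    """Simple-regression R-squared is the squared correlation."""
    return np.corrcoef(x, y)[0, 1] ** 2

print(round(r_squared(x, y), 2))                    # very high: shared trend only
print(round(r_squared(np.diff(x), np.diff(y)), 2))  # near zero after differencing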
Reason 5: Form of a Variable
If you include a different form of the same variable for both the dependent variable and an independent variable, you obtain an artificially inflated R-squared.
For example, if the dependent variable is temperature in Celsius and your model contains an independent variable that is the same temperature measured on a different scale, your R-squared is nearly 100%. That’s an obvious example, but there are more subtle forms of it. For instance, you can expect an inflated R-squared value if your dependent variable is the poverty rate and one of your independent variables is income. The poverty rate is defined by income.
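Here’s a tiny simulation of my own showing the temperature example: regressing Celsius readings on Fahrenheit readings of the same temperatures (plus a bit of measurement error) yields an R-squared of essentially 100%:

```python
import numpy as np

rng = np.random.default_rng(0)

celsius = rng.uniform(-10, 35, size=50)
# The "different scale" version of the same measurement, with tiny reading error:
fahrenheit = celsius * 9 / 5 + 32 + rng.normal(scale=0.5, size=50)

# Simple-regression R-squared is the squared correlation.
r2 = np.corrcoef(celsius, fahrenheit)[0, 1] ** 2
print(round(r2, 3))  # essentially 1.0
```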
If you’re learning regression, check out my Regression Tutorial!
I honestly have no idea what was going through their minds when they did a regression analysis and made one candidate’s number of votes a predictor variable of another candidate’s number of votes. Thank you for answering my question. Have a nice day!
Hi Jim! The elections in my country recently concluded. I am seeing posts of people doing correlation and regression analyses between the number of votes of two candidates across time stamps. In the regression analysis, the R-squared is really high, like 0.9. Is it because (a) the variables remotely resemble a trend in a clear direction, and (b) the nature of the variables (both being the number of votes) make the analysis invalid? Also, isn’t it a mistake to run a regression analysis between two time series data? You may also add possible reasons as to why the R-squared value is extremely high. Thank you.
Hi Paul,
So, reason #4 in this article, Trends in Panel (Time Series) Data, refers to what you’re talking about. That could well be inflating their R-squared. There are a variety of methods and approaches for fitting data with trends, such as detrending, lagging, differencing, etc. Unfortunately, I don’t have a detailed article about it, but I do reference the potential problem in this one.
Also, I’m not really sure what those people think a correlation between votes and time actually shows. Maybe one candidate started getting more votes as the day wore on? There could well be demographic reasons for something like that.
Is this info from one of your own publications that could be cited?
Hi Brian,
I do cover similar material in my books, and you can cite them. I include a recommended citation for each book in the References section near the end. This type of content is in my regression book.
Alternatively, you can cite this webpage using the format shown on Purdue University’s Online Writing Lab (OWL) page for citing electronic resources.
My adjusted R-squared value is 0.366. Would you please interpret it? Thank you, sir!
You are a great writer. Thank you!
Sir, in my research the factors are highly correlated, but the significance level is greater than 0.05. What can I do to solve this issue?
Yes please!
Thank you very much. What an awesome resource you provide!
You bet–and thanks for the kind words!
Also, I think I need to add some content to this post about R-squared without the constant!
Thanks for your quick reply. Super helpful. I reran with an omitted category and the R-squared went from .8765 to .0286. That’s the range I’d expected. The point estimates and confidence intervals are the same. I don’t know how Stata runs the regression without an omitted category, but I’ve done it before and they clearly have some kind of solution, because, as I say, the results are the same. Only the F-stat and R-squared are different. The categories are definitely mutually exclusive, so it’s a mystery to me what’s going on under the hood. But if the point estimates and confidence intervals are the same, I should be okay, no?
That sounds more reasonable! After you fit the model while omitting a level, it is possible to calculate the effect for that omitted level afterwards. That must be what Stata is doing. I’ve seen other software take that approach.
Yes, you should be good! Check the residuals to be sure but it sounds like you’re on the right path!
I’ve got 754 observations (different players of an online game dispersed around the planet, so probably not much correlation of errors), and my model has nothing but dummies for the eight values of a categorical variable. This is human behavior, explained by one characteristic of the individual. I’m getting R-square values as high as 90%, which seems improbable. I’m running the regressions with no constant term and no omitted category. I don’t know if that makes a difference. I don’t think I have any of the problems you listed above. Any thoughts about what might be going on?
Hi Dan,
At least part of the problem is that you’re not including the constant. When you don’t include the constant, R-squared measures the amount of the variation around zero that the model accounts for rather than the amount of variation around the DV’s mean. Fit the model with the constant and you’ll get a valid R-squared. It’ll be lower, potentially much lower. You should almost always fit a regression model with a constant.
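If it helps, here’s a quick simulated sketch of my own (not your data) showing the difference. The response has a large mean and no real relationship to the predictor, yet the no-constant model reports a big R-squared because it measures variation around zero rather than around the mean:

```python
import numpy as np

rng = np.random.default_rng(5)

x = rng.uniform(0, 1, 100)
y = 50 + rng.normal(size=100)   # large mean, no real relationship to x

# With a constant: variation measured around the mean of y.
X1 = np.column_stack([np.ones(100), x])
b1, *_ = np.linalg.lstsq(X1, y, rcond=None)
res1 = y - X1 @ b1
r2_with = 1 - res1 @ res1 / np.sum((y - y.mean()) ** 2)

# Without a constant: variation measured around zero.
b0, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)
res0 = y - x * b0[0]
r2_without = 1 - res0 @ res0 / np.sum(y ** 2)

print(round(r2_with, 3), round(r2_without, 3))  # tiny vs. large
```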
As for your categorical variable, here’s what it sounds like you’re saying, but let me know if I’m misunderstanding. You have a categorical variable, which you’re representing with 8 dummy variables (aka indicator variables). If that’s the case, and you didn’t omit one level of your categorical variable, that’s a problem.
If your categorical variable has eight values (often referred to as levels), you should include seven of those levels and exclude one, which becomes the reference level. However, one thing puzzles me. If you have eight levels and you include all eight, your software shouldn’t even be able to fit the model. The problem occurs because you have perfect multicollinearity, which is a show stopper for fitting a regression model. You can use 7 of the levels to exactly predict the 8th level.
I’m not sure how you were able to fit that model at all. Are each of your levels mutually exclusive from the other levels? If one level applies to an observation, none of the other levels should apply to that observation. If they are mutually exclusive, perhaps there were data entry errors that eliminated the perfect multicollinearity? Each observation (row in your dataset) should have a single value of “1” and 7 zeroes.
BTW, it’s ok if you manually created the indicator variables, but most software will take the raw categorical variable and create the indicator variables behind the scenes for you. It’s quite a bit easier.
I write about using categorical predictors in regression models in much more depth, along with many other aspects of regression, in my ebook about regression analysis that I just published. You can find that in the right-hand sidebar.
Best of luck with your analysis!
Hi Jim,
“On the other hand, human behavior inherently has much more unexplainable variability, and this produces R-squared values that are usually less than 50%. 90% is way too high in this context!”
I was wondering if you had some reputable sources that I could read up on that expand on this specific idea that an R-squared of 90%+ is way too high when modeling behavior. I’m writing a research paper about this idea and am looking for good places to read, but am having trouble navigating through the google clutter.
Thank you for your help!!
Jonathan
Hi Jonathan,
I’ve been asked this before and I never have a good answer. Unfortunately, I don’t have a good reference for you. Those statements are based on what I have seen of research that attempts to predict human behavior. For instance, the R-squared for using SAT scores to predict college success is typically around 25%. For predicting the success of job candidates in their jobs, it’s typically around 10-15%! And there are other examples. But, from all of the research that I’ve seen, it’s usually 50% or less.
What about autocorrelated / pseudoreplicated data? You touch on this when you mention temporal trends, but more generally, ANY unmodeled autocorrelation in the data will lead to a spurious / inflated R squared. Depending on the type of data, autocorrelation could be driven by time, space, phylogeny, a social network, etc.
Very true! Statistical analysis always requires a very close understanding of the data to avoid being tripped up. I wanted to cover some basics in this blog post, but there are definitely other possibilities.