R-squared tends to reward you for including too many independent variables in a regression model, and it doesn’t provide any incentive to stop adding more. Adjusted R-squared and predicted R-squared use different approaches to help you fight that impulse to add too many. The protection that adjusted R-squared and predicted R-squared provide is critical because too many terms in a model can produce results that you can’t trust. These statistics help you include the correct number of independent variables in your regression model.
Multiple linear regression can seduce you! Yep, you read it here first. It’s an incredibly tempting statistical analysis that practically begs you to include additional independent variables in your model. Every time you add a variable, the R-squared increases, which tempts you to add more. Some of the independent variables will be statistically significant. Perhaps there is an actual relationship? Or is it just a chance correlation?
You just pop the variables into the model as they occur to you or just because the data are readily available. Higher-order polynomials curve your regression line any which way you want. But are you fitting real relationships or just playing connect the dots? Meanwhile, the R-squared increases, mischievously convincing you to include yet more variables!
In my post about interpreting R-squared, I show how evaluating how well a linear regression model fits the data is not as intuitive as you may think. Now, I’ll explore reasons why you need to use adjusted R-squared and predicted R-squared to help you specify a good regression model!
Some Problems with R-squared
Previously, I demonstrated that you cannot use R-squared to conclude whether your model is biased. To check for this bias, you need to check your residual plots. Unfortunately, there are yet more problems with R-squared that we need to address.
Problem 1: R-squared increases every time you add an independent variable to the model. It never decreases, not even when the new variable correlates with the response purely by chance. Consequently, a regression model that contains more independent variables than another model can appear to provide a better fit merely because it contains more variables.
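You can see Problem 1 in action with a short simulation. This minimal numpy sketch (illustrative data, not from any analysis in this post) regresses a random response on pure-noise predictors, adding one predictor at a time. R-squared climbs even though none of the predictors have any real relationship with the response:

```python
import numpy as np

def r_squared(X, y):
    # Ordinary R-squared from an OLS fit with an intercept.
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(0)
n = 30
y = rng.normal(size=n)              # random response: nothing to explain
X = rng.normal(size=(n, 10))        # ten pure-noise predictors

# Add the noise predictors one at a time; R-squared only climbs.
r2s = [r_squared(X[:, :k], y) for k in range(1, 11)]
for k, r2 in enumerate(r2s, start=1):
    print(f"{k} predictors: R-squared = {r2:.3f}")
```

Every value the loop prints explains nothing real, yet the sequence never goes down.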
Problem 2: When a model contains an excessive number of independent variables and polynomial terms, it becomes overly customized to fit the peculiarities and random noise in your sample rather than reflecting the entire population. Statisticians call this overfitting the model, and it produces deceptively high R-squared values and a decreased capability for precise predictions.
Fortunately for us, adjusted R-squared and predicted R-squared address both of these problems.
What Is the Adjusted R-squared?
Use adjusted R-squared to compare the goodness-of-fit for regression models that contain differing numbers of independent variables.
Let’s say you are comparing a model with five independent variables to a model with one variable, and the five-variable model has a higher R-squared. Is the model with five variables actually better, or does it just have more variables? To determine this, compare the adjusted R-squared values!
The adjusted R-squared adjusts for the number of terms in the model. Importantly, its value increases only when the new term improves the model fit more than expected by chance alone. The adjusted R-squared value actually decreases when the term doesn’t improve the model fit by a sufficient amount.
The example below shows how the adjusted R-squared increases up to a point and then decreases. On the other hand, R-squared blithely increases with each and every additional independent variable.
In this example, the researchers might want to include only three independent variables in their regression model. My R-squared blog post shows how an under-specified model (too few terms) can produce biased estimates. However, an overspecified model (too many terms) can reduce the model’s precision. In other words, both the coefficient estimates and predicted values can have larger margins of error around them. That’s why you don’t want to include too many terms in the regression model!
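The adjustment itself is a simple formula: adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the sample size and k is the number of independent variables. Here's a small numpy sketch (simulated data, not the example above) showing how adjusted R-squared penalizes noise terms that plain R-squared happily rewards:

```python
import numpy as np

def r_squared(X, y):
    # Ordinary R-squared from an OLS fit with an intercept.
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

def adjusted_r_squared(r2, n, k):
    # Penalize for the k predictors: 1 - (1 - R^2)(n - 1)/(n - k - 1)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

rng = np.random.default_rng(42)
n = 40
x = rng.normal(size=n)
y = 3 * x + rng.normal(size=n)       # one genuinely useful predictor
noise = rng.normal(size=(n, 5))      # five pure-noise predictors
full = np.column_stack([x, noise])

r2s, adjs = [], []
for k in range(1, 7):
    r2 = r_squared(full[:, :k], y)
    r2s.append(r2)
    adjs.append(adjusted_r_squared(r2, n, k))
    print(f"{k} predictors: R2 = {r2:.3f}, adjusted R2 = {adjs[-1]:.3f}")
```

R-squared rises with every added column, while adjusted R-squared tends to stall or drop once the noise predictors start arriving.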
What Is the Predicted R-squared?
Use predicted R-squared to determine how well a regression model makes predictions. This statistic helps you identify cases where the model provides a good fit for the existing data but isn’t as good at making predictions. However, even if you aren’t using your model to make predictions, predicted R-squared still offers valuable insights about your model.
Statistical software calculates predicted R-squared using the following procedure:
- Remove a data point from the dataset.
- Calculate the regression equation without it.
- Evaluate how well the refit model predicts the removed observation.
- Repeat this for every data point in the dataset.
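The leave-one-out procedure above is easy to sketch in code. This minimal numpy version (my own illustration, not any particular software package's implementation) pools the squared leave-one-out prediction errors, known as the PRESS statistic, and converts them into predicted R-squared:

```python
import numpy as np

def predicted_r_squared(X, y):
    """Predicted R-squared via leave-one-out: refit the model without
    each point, predict that point, and pool the squared errors (PRESS)."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), np.asarray(X).reshape(n, -1)])
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X1[keep], y[keep], rcond=None)
        press += (y[i] - X1[i] @ beta) ** 2
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - press / ss_tot   # can even be negative for badly overfit models

# A model with a real relationship predicts held-out points well.
rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2 * x + rng.normal(size=50)
pr2 = predicted_r_squared(x, y)
print(f"Predicted R-squared: {pr2:.3f}")
```

Unlike ordinary R-squared, this statistic can go negative, which happens when the model predicts held-out points worse than simply using the mean.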
Predicted R-squared helps you determine whether you are overfitting a regression model. Again, an overfit model includes an excessive number of terms, and it begins to fit the random noise in your sample.
By its very definition, it is not possible to predict random noise. Consequently, if your model fits a lot of random noise, the predicted R-squared value must fall. A predicted R-squared that is distinctly smaller than R-squared is a warning sign that you are overfitting the model. Try reducing the number of terms.
If I had to name my favorite flavor of R-squared, it would be predicted R-squared!
Example of an Overfit Model and Predicted R-squared
You can try this example using this CSV data file: PresidentRanking.
These data come from an analysis I performed that assessed the relationship between the highest approval rating that a U.S. President achieved and their rank by historians. I found no correlation between these variables, as shown in the fitted line plot. It’s nearly a perfect example of no relationship because it is a flat line with an R-squared of 0.7%!
Now, imagine that we are chasing a high R-squared and we fit the model using a cubic term that provides an S-shape.
Amazing! R-squared and adjusted R-squared look great! The coefficients are statistically significant because their p-values are all less than 0.05. I didn’t show the residual plots, but they look good as well.
Hold on a moment! We’re just twisting the regression line to force it to connect the dots rather than finding an actual relationship. We overfit the model, and the predicted R-squared of 0% gives this away.
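You can reproduce this overfitting pattern with simulated data (not the presidential dataset itself): fit a fifth-degree polynomial to pure noise and compare R-squared to its leave-one-out predicted counterpart. This sketch assumes the same PRESS-based definition of predicted R-squared described earlier:

```python
import numpy as np

def fit_r2_and_pred_r2(X, y):
    # Fit OLS with an intercept; return ordinary R-squared and
    # leave-one-out (PRESS-based) predicted R-squared.
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1 - (resid @ resid) / ss_tot
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        b, *_ = np.linalg.lstsq(X1[keep], y[keep], rcond=None)
        press += (y[i] - X1[i] @ b) ** 2
    return r2, 1 - press / ss_tot

rng = np.random.default_rng(7)
x = rng.normal(size=12)
y = rng.normal(size=12)             # pure noise: no real relationship

# Chase R-squared with polynomial terms up to x^5.
X = np.column_stack([x ** p for p in range(1, 6)])
r2, pred_r2 = fit_r2_and_pred_r2(X, y)
print(f"R-squared: {r2:.3f}, predicted R-squared: {pred_r2:.3f}")
```

The polynomial twists itself through the sample points, so R-squared looks respectable, while predicted R-squared falls far below it and typically goes negative.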
If the predicted R-squared is small compared to R-squared, you might be overfitting the model even if the independent variables are statistically significant.
To read about the analysis above where I had to be extremely careful to avoid an overfit model, read Understanding Historians’ Rankings of U.S. Presidents using Regression Models.
A Caution about the Problems of Chasing a High R-squared
All study areas involve a certain amount of variability that you can’t explain. If you chase a high R-squared by including an excessive number of variables, you force the model to explain the unexplainable. This is not good. While this approach can obtain higher R-squared values, it comes at the cost of misleading regression coefficients, p-values, and R-squared values, along with imprecise predictions.
Adjusted R-squared and predicted R-squared help you resist the urge to add too many independent variables to your model.
- Adjusted R-squared compares models with different numbers of variables.
- Predicted R-squared can guard against models that are too complicated.
Remember, the great power that comes with multiple regression analysis requires your restraint to use it wisely!
If you’re learning regression, check out my Regression Tutorial!
Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.