Use residual plots to check the assumptions of an OLS linear regression model. If you violate the assumptions, you risk producing results that you can’t trust. Residual plots display the residual values on the y-axis and fitted values, or another variable, on the x-axis. After you fit a regression model, it is crucial to check the residual plots. If your plots display unwanted patterns, you can’t trust the regression coefficients and other numeric results.
In this post, I explain the conceptual reasons why residual plots help ensure that your regression model is valid. I’ll also show you what to look for and how to fix the problems.
First, let’s go over a couple of basics.
There are two fundamental parts to regression models, the deterministic and random components. If your model is not random where it supposed to be random, it has problems, and this is where residual plots come in.
The essential parts of a regression model:
Dependent Variable = (Constant +Independent Variables) + Error
Dependent Variable = Deterministic + Stochastic
The deterministic component is the portion of the variation in the dependent variable that the independent variables explain. In other words, the mean of the dependent variable is a function of the independent variables. In a regression model, all of the explanatory power should reside here.
Stochastic just means unpredictable. In statistics, the error is the difference between the expected value and the observed value. Let’s put these terms together—the gap between the expected and observed values must not be predictable. Or, no explanatory power should be in the error. If you can use the error to make predictions about the response, your model has a problem. This issue is where residual plots play a role.
The theory here is that the deterministic component of a regression model does such a great job of explaining the dependent variable that it leaves only the intrinsically inexplicable portion of your study area for the error. If you can identify non-randomness in the error term, your independent variables are not explaining everything that they can.
Don’t worry. This is actually easy to understand. It just means that you should not be able to see patterns in the residual plots!
Statistical note: The residuals estimate the true error in the same manner that regression coefficients estimate the true population coefficients.
How to Check Residual Plots
When looking at residual plots, you simply want to determine whether the residuals are consistent with random error. I’ll use an analogy of rolling a die. You shouldn’t be able to use one roll to predict the outcome of the next roll because it is supposed to be random. So, if you record a series of tosses, you should see only random results. If you start to see patterns, you know something is wrong with your model of how the die works. You think it’s random, but it’s not. If you were a gambler, you’d use this information to adjust how you play to match the actual die outcomes better.
You can apply this idea to regression models too. If you look at a series of errors, it should look random. If there are patterns in the errors, this means that you can use one error to predict another. As with the die analogy, if there are patterns in the residuals, you need to adjust your model. But, don’t fret, this just means that you can improve the fit of the model by moving this predictability over to the deterministic side of things (i.e., your independent variables).
How do you determine whether the residuals are random in regression analysis? It’s pretty simple, just check that they are randomly scattered around zero for the entire range of fitted values. When the residuals center on zero, they indicate that the model’s predictions are correct on average rather than systematically too high or low. Regression also assumes that the residuals follow a normal distribution and that the degree of scattering is the same for all fitted values.
Residuals should look like this.
How to Fix Problematic Residual Plots
The residual plot below clearly has a pattern!
If you know the fitted value, you can use it to predict the residual. For instance, fitted values near 5 and 10 tend to have positive residuals while fitted values near 7 tend to have negative values. If they were truly random, you wouldn’t be able to make these predictions.
This residual plot indicates that the independent variables do not capture the entire deterministic component. Unfortunately, some of the explanatory information has leaked over to the supposedly random error. There are a variety of reasons why a model can have this problem. The possibilities include a missing:
To fix the problem, you need to identify the missing information, variable, or higher-order term and include it in the model. After you correct the problem and refit the model, the residuals should look nice and random! It might require subject-area knowledge and research to do this. The solution is very particular to your research.
Other Potential Problems
There are several other ways that explanatory information might make its way into your residuals:
- Another variable must not be correlated with the residuals. If a variable is related to the residuals, that variable can predict the residuals, which is a no-no. Try including this variable in the model. To identify this correlation, graph the residuals by other variables. This problem relates to confounding variables and causes omitted variable bias.
- Neighboring residuals must not be correlated. If adjacent residuals are correlated, one residual can predict the next residual. In statistics, this is known as autocorrelation. This correlation represents explanatory information that the independent variables do not describe. Models that use time-series data are susceptible to this problem. To resolve this issue, try adding an independent variable that contains the pertinent time information. Use the Durbin-Watson test to assess autocorrelation.
- Residuals must have a constant variance. Heteroscedasticity refers to cases where the residuals have a non-constant variance. Read my post about how to identify and correct heteroscedasticity.
Residual Plots are Easy!
Hopefully, you see that checking your residuals plots is a crucial but simple thing to do. You need random residuals. Your independent variables should describe the relationship so thoroughly that only random error remains. Non-random patterns in your residuals signify that your variables are missing something.
Importantly, appreciate that if you do see unwanted patterns in your residual plots, it actually represents a chance to improve your model because there is something more that your independent variables can explain. That’s a good thing!
When your linear regression model satisfies the OLS assumptions, the procedure generates unbiased coefficient estimates that tend to be relatively close to the true population values (minimum variance). In fact, the Gauss-Markov theorem states that OLS produces estimates that are better than estimates from all other linear model estimation methods when the assumptions hold true.
For more information about the implications of this theorem on OLS estimates, read my post: The Gauss-Markov Theorem and BLUE OLS Coefficient Estimates.
If you’re learning regression and like the approach I use in my blog, check out my eBook!
Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.