Use residual plots to check the assumptions of an OLS linear regression model. If you violate the assumptions, you risk producing results that you can’t trust. Residual plots display the residual values on the y-axis and fitted values, or another variable, on the x-axis. After you fit a regression model, it is crucial to check the residual plots. If your plots display unwanted patterns, you can’t trust the regression coefficients and other numeric results.

In this post, I explain the conceptual reasons why residual plots help ensure that your regression model is valid. I’ll also show you what to look for and how to fix the problems.

First, let’s go over a couple of basics.

There are two fundamental parts to regression models, the deterministic and random components. If your model is not random where it supposed to be random, it has problems, and this is where residual plots come in.

The essential parts of a regression model:

Dependent Variable = (Constant +Independent Variables) + Error

Or:

Dependent Variable = Deterministic + Stochastic

## Deterministic Component

The deterministic component is the portion of the variation in the dependent variable that the independent variables explain. In other words, the mean of the dependent variable is a function of the independent variables. In a regression model, all of the explanatory power should reside here.

## Stochastic Error

Stochastic just means unpredictable. In statistics, the error is the difference between the expected value and the observed value. Let’s put these terms together—the gap between the expected and observed values must not be predictable. Or, no explanatory power should be in the error. If you can use the error to make predictions about the response, your model has a problem. This issue is where residual plots play a role.

The theory here is that the deterministic component of a regression model does such a great job of explaining the dependent variable that it leaves only the intrinsically inexplicable portion of your study area for the error. If you can identify non-randomness in the error term, your independent variables are not explaining everything that they can.

Don’t worry. This is actually easy to understand. It just means that you should not be able to see patterns in the residual plots!

**Statistical note: **The residuals estimate the true error in the same manner that regression coefficients estimate the true population coefficients.

## How to Check Residual Plots

When looking at residual plots, you simply want to determine whether the residuals are consistent with random error. I’ll use an analogy of rolling a die. You shouldn’t be able to use one roll to predict the outcome of the next roll because it is supposed to be random. So, if you record a series of tosses, you should see only random results. If you start to see patterns, you know something is wrong with your model of how the die works. You think it’s random, but it’s not. If you were a gambler, you’d use this information to adjust how you play to match the actual die outcomes better.

You can apply this idea to regression models too. If you look at a series of errors, it should look random. If there are patterns in the errors, this means that you can use one error to predict another. As with the die analogy, if there are patterns in the residuals, you need to adjust your model. But, don’t fret, this just means that you can improve the fit of the model by moving this predictability over to the deterministic side of things (i.e., your independent variables).

How do you determine whether the residuals are random in regression analysis? It’s pretty simple, just check that they are randomly scattered around zero for the entire range of fitted values. When the residuals center on zero, they indicate that the model’s predictions are correct on average rather than systematically too high or low. Regression also assumes that the residuals follow a normal distribution and that the degree of scattering is the same for all fitted values.

Residuals should look like this.

## How to Fix Problematic Residual Plots

The residual plot below clearly has a pattern!

If you know the fitted value, you can use it to predict the residual. For instance, fitted values near 5 and 10 tend to have positive residuals while fitted values near 7 tend to have negative values. If they were truly random, you wouldn’t be able to make these predictions.

This residual plot indicates that the independent variables do not capture the entire deterministic component. Unfortunately, some of the explanatory information has leaked over to the supposedly random error. There are a variety of reasons why a model can have this problem. The possibilities include a missing:

- Independent variable.
- Polynomial term to model a curve.
- Interaction term.

To fix the problem, you need to identify the missing information, variable, or higher-order term and include it in the model. After you correct the problem and refit the model, the residuals should look nice and random! It might require subject-area knowledge and research to do this. The solution is very particular to your research.

## Other Potential Problems

There are several other ways that explanatory information might make its way into your residuals:

**Another variable must not be correlated with the residuals.**If a variable is related to the residuals, that variable can predict the residuals, which is a no-no. Try including this variable in the model. To identify this correlation, graph the residuals by other variables.**Neighboring residuals must not be correlated.**If adjacent residuals are correlated, one residual can predict the next residual. In statistics, this is known as autocorrelation. This correlation represents explanatory information that the independent variables do not describe. Models that use time-series data are susceptible to this problem. To resolve this issue, try adding an independent variable that contains the pertinent time information. Use the Durbin-Watson test to assess autocorrelation.**Residuals must have a constant variance.**Heteroscedasticity refers to cases where the residuals have a non-constant variance. Read my post about how to identify and correct heteroscedasticity.

## Residual Plots are Easy!

Hopefully, you see that checking your residuals plots is a crucial but simple thing to do. You need random residuals. Your independent variables should describe the relationship so thoroughly that only random error remains. Non-random patterns in your residuals signify that your variables are missing something.

Importantly, appreciate that if you do see unwanted patterns in your residual plots, it actually represents a chance to improve your model because there is something more that your independent variables can explain. That’s a good thing!

When your linear regression model satisfies the OLS assumptions, the procedure generates unbiased coefficient estimates that tend to be relatively close to the true population values (minimum variance). In fact, the Gauss-Markov theorem states that OLS produces estimates that are better than estimates from all other linear model estimation methods when the assumptions hold true.

For more information about the implications of this theorem on OLS estimates, read my post: The Gauss-Markov Theorem and BLUE OLS Coefficient Estimates.

If you’re learning statistics, check out my Regression Tutorial!

Nithashanasar says

Wow.. It’s such an amazing discovery… I believe that u will be a mile stone in statistics… Here ..it facilitate the concept of scatter plot…am doing MSC statistics.. So…I am really proud of u..thank jim

Jim Frost says

Hi Nithashanasar, I’m very happy to hear that this is helpful for you! Also, thank you so much for your kind words. I really appreciate it!

Jim

john says

Jim, Can give an example, on “To resolve this issue, try adding an independent variable that contains the pertinent time information”? What is the pertinent time information? Do you have the other post to address it? thanksjohn

Jim Frost says

Hi John, I don’t have a post on that topic yet but will write one at some point. Suppose your data are time series data. You could include a lagged variable for an independent variable. For example, if you think a variable has a delayed effect on the outcome, you can lag the variable so that the value from a previous observation appears in the current observation. You could also possibly add a variable the indicates the month, day, season, hour, etc if you think that is relevant to the outcome. These types of variable can all supply important information to the model. A model that is missing important information can provide untrustworthy results. Sometimes this includes time-related information. As always, you have to use subject-area knowledge and expertise to include the correct information.

Thanks for writing with the great question!

Jim

Nate says

I have a question about my residual plot. I am looking at the influence of precipitation on population change in a species of ground nesting birds. I looked at precipitation for a given month as a percent of the average precipitation for that month. There was a correlation in March and August. So I conducted a regression analysis for the population change and precipitation for a given month. There is a positive relationship in March and negative in August. The residual plot in March does not show a pattern but the August residual plot shows a pattern. How would I look at the combination of March precipitation and August precipitation combined and population change?

Jim Frost says

Hi Nate, I don’t completely understand how your study is set up. It seems like you have separate models for each month rather than one model? Typically, you want the residual plots to be random. You don’t want patterns in the residuals even when you have correlations in the data. Because you see a pattern in August, I’d be considered that your model doesn’t fit the data well for that month.

If I understand correctly, in March, the more you have rain the more the population increases? And, in August, more rain equals less population? I don’t know the study area but you’d have determine whether that reversal makes sense. Also, with time series data, you have to be really careful about time order effects sneaking in. Is it possible that rain is correlated with something else that is driving population change? Maybe in March rain happens occurs at the same time as something else that actually drives the population to increase? And, in August maybe it happens to correspond with some causes population to decrease? I’m not saying that’s the case for sure, but you have to think about possibilities like that. Especially when the relationship changes direction like that. Does that make theoretical sense?

Mohammad Kamel says

Thanks Jim, your articles are really helpful, you make the statistics concepts very easy and logical

Jim Frost says

Thank you, Mohammad! I really appreciate the kind words and I’m so glad that you find them to be helpful!

Ghouse says

brilliant explanation. Thanks Jim

jonamjar says

Why dint I find this before about?!! Amazing

Jim Frost says

Thank you so much! And, I’m glad you found my blog!

Cathrine says

Hello Jim,

The fitted values that you plot against the residuals, what are they?

In the case of multiple linear regression, are they the mean of all the estimated independent variables in the model? Or just the Parameter of Interest?

Kind regards?

Jim Frost says

Hi Cathrine, that’s a great question and it suggests to me that I need to make my post more clear about that!

Here’s how a residuals by fitted value plot is created. The software uses the regression model to make a prediction for each observation (row in your data table). That prediction is the fitted value and it falls on the X-axis on the scatterplot. A fitted value is a statistical model’s prediction of the mean response value when you input the values of the predictors, factor levels, or components into the model.

The residual is the the difference between the observed value and the fitted value that the model predicts for each observation. This value falls on the Y-axis of the scatterplot.

Consequently, each observation in your data set produces the pair of X (fitted value) and Y (residual) values that are graphed on the scatterplot. The goal of a regression model is that the residuals do not fall systematically above or below zero because zero indicates that the model made a perfect prediction. If the residuals are systematially high or low, your model is biased and needs to be fixed. Some models might work better for high, medium, or low fitted values, which is why we use that for the X-axis.

I hope this answers your questions!

Mislav says

Hello, Mr Frost,

Since your explanations seem theoretically sound and also understandable to less educated statisticians, I would be glad if you can help me with regression. As far as I understand it has something to do with residuals.

I made a regression model (supposed for prediction), and it looks very good (equation confirmed in other research, nice fit, high R2, residuals vs. fitted value is almost Ok). However when I tried to predict values using this model there “popped out” kind of bias: higher values of y were underestimated, and lower values were overestimated. Then I looked at the residual plot vs. observed values, and there was clear trend – negative values of residuals for small y, positive for large y. I read some spurious explanation that it must happen, and I am asking – what is theoretic cause and how to correct that, practically? Does it maybe have something to do with data? Or Maybe validation set should be different from modeling data set? It seems that there is no lack of fit, nor do i have any other clever independent variables to include. Hoping my problem and question are clear enough, thank you in advance.

Jim Frost says

Hi Mislav, assessing residual plots is the perfect way to pick up problems like this one. Typically, when you see non-random patterns like this, you often have an under-specified model. In other words, you might be missing a variable, or not specifying the curvature correctly. A high R-squared by itself does not tell you that your model is good. Below are several links that will help you specify an unbiased model:

Model specification

Curve fitting

R-squared (high R-squared values are not necessarily good)

I hope this helps!

Naman says

Hi Jim,

As usual thank you for sharing your explanation and understanding of statistics in such a fluid and easy way.

It would be great if you can share some methods to fix auto correlation problem.

Jim Frost says

That’s a great topic for a future blog post!

Sundar says

Hi Jim,

Really nice article put up in a simple way for beginner like me to understand. However I plotted fitted values v/s residuals for my model and though it appear random almost all of the values are on the positive side of the axis. In your post you have mentioned that the values should be “randomly scattered around zero”. So does it mean mine is showing any pattern ?

Jim Frost says

Hi Sundar,

If almost all of your residuals are positive, that indicates that model is positively biased. The fitted values are systematically higher than the observed values. The residuals should be randomly scattered around zero. In fact, one of the assumptions for ordinary least squares regression is that the mean of the residuals equals zero.

If you fit a model with the constant (which is almost always the case), this forces the mean of your residuals to equal zero. For more information about this, read my post about the constant in a regression model. This leads to two possible cases based on whether your model includes the constant:

If your model does

notinclude the constant, consider adding it because your model will satisfy that particular assumption at least. You must still check the other assumptions.However, if your model does include the constant and it appears like almost all of your residuals are positive, that is an interesting case. Because of the constant, you know the mean must equal zero even though most of the residuals are positive. One possibility is that you have a few unusual observations with large negative residuals that offset the multitude of positive residuals and produces that zero mean. In this scenario, you’d need to investigate those data points and determine whether those observations are outliers that should removed from the model.

I hope this helps!