Use residual plots to check the assumptions of an OLS linear regression model. If you violate the assumptions, you risk producing results that you can’t trust. Residual plots display the residual values on the y-axis and fitted values, or another variable, on the x-axis. After you fit a regression model, it is crucial to check the residual plots. If your plots display unwanted patterns, you can’t trust the regression coefficients and other numeric results.

In this post, I explain the conceptual reasons why residual plots help ensure that your regression model is valid. I’ll also show you what to look for and how to fix the problems.

First, let’s go over a couple of basics.

There are two fundamental parts to regression models: the deterministic and random components. If your model is not random where it is supposed to be random, it has problems, and this is where residual plots come in.

The essential parts of a regression model:

Dependent Variable = (Constant + Independent Variables) + Error

Or:

Dependent Variable = Deterministic + Stochastic
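In symbols, the same decomposition is commonly written as follows for observation i with k independent variables. The notation is a standard textbook convention that I'm assuming here, not something defined elsewhere in this post: the betas are the constant and the coefficients, and epsilon is the error.

```latex
y_i \;=\; \underbrace{\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}}_{\text{deterministic}}
\;+\; \underbrace{\varepsilon_i}_{\text{stochastic}}
```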

## Deterministic Component

The deterministic component is the portion of the variation in the dependent variable that the independent variables explain. In other words, the mean of the dependent variable is a function of the independent variables. In a regression model, all of the explanatory power should reside here.

## Stochastic Error

Stochastic just means unpredictable. In statistics, the error is the difference between the expected value and the observed value. Let’s put these terms together—the gap between the expected and observed values must not be predictable. Or, no explanatory power should be in the error. If you can use the error to make predictions about the response, your model has a problem. This issue is where residual plots play a role.

The theory here is that the deterministic component of a regression model does such a great job of explaining the dependent variable that it leaves only the intrinsically inexplicable portion of your study area for the error. If you can identify non-randomness in the error term, your independent variables are not explaining everything that they can.

Don’t worry. This is actually easy to understand. It just means that you should not be able to see patterns in the residual plots!

**Statistical note:** The residuals estimate the true error in the same manner that regression coefficients estimate the true population coefficients.
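To make the statistical note concrete, here is a minimal sketch using simulated data, where the true error is known because we generate it ourselves. The data, coefficient values, and variable names are all hypothetical. It fits the model with NumPy's least-squares routine and compares the residuals to the true error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical): two predictors plus a known random error.
n = 500
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
true_error = rng.normal(0, 1.0, n)
y = 2.0 + 1.5 * x1 - 0.8 * x2 + true_error

# Design matrix with a column of ones for the constant.
X = np.column_stack([np.ones(n), x1, x2])

# OLS coefficient estimates via least squares.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ coef        # the model's prediction for each observation
residuals = y - fitted   # observed value minus fitted value

corr = np.corrcoef(residuals, true_error)[0, 1]
print("estimated coefficients:", np.round(coef, 2))
print("correlation between residuals and true error:", round(corr, 3))
```

In real data you never see the true error, which is why the residuals are what you inspect.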

## How to Check Residual Plots

When looking at residual plots, you simply want to determine whether the residuals are consistent with random error. I’ll use an analogy of rolling a die. You shouldn’t be able to use one roll to predict the outcome of the next roll because it is supposed to be random. So, if you record a series of tosses, you should see only random results. If you start to see patterns, you know something is wrong with your model of how the die works. You think it’s random, but it’s not. If you were a gambler, you’d use this information to adjust how you play to match the actual die outcomes better.

You can apply this idea to regression models too. If you look at a series of errors, it should look random. If there are patterns in the errors, this means that you can use one error to predict another. As with the die analogy, if there are patterns in the residuals, you need to adjust your model. But, don’t fret, this just means that you can improve the fit of the model by moving this predictability over to the deterministic side of things (i.e., your independent variables).

How do you determine whether the residuals are random in regression analysis? It’s pretty simple. Just check that they are randomly scattered around zero for the entire range of fitted values. When the residuals center on zero, they indicate that the model’s predictions are correct on average rather than systematically too high or low. Regression also assumes that the residuals follow a normal distribution and that the degree of scattering is the same for all fitted values.

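Residuals should look like this: a structureless cloud of points centered on zero, with roughly the same vertical spread across the whole range of fitted values.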
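As a rough sketch of that check, the code below simulates a correctly specified straight-line model (hypothetical data), then summarizes the residuals within bins of fitted values. The helper names `fit_ols`, `binned_residual_summary`, and `plot_residuals` are my own for illustration. Bin means near zero and similar bin standard deviations are the numeric counterpart of a well-behaved plot.

```python
import numpy as np

def fit_ols(X, y):
    """Return fitted values and residuals from an OLS fit (X includes the constant column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    return fitted, y - fitted

def binned_residual_summary(fitted, residuals, n_bins=5):
    """Mean and standard deviation of residuals within equal-count bins of fitted values."""
    order = np.argsort(fitted)
    groups = np.array_split(residuals[order], n_bins)
    means = np.array([g.mean() for g in groups])
    sds = np.array([g.std(ddof=1) for g in groups])
    return means, sds

def plot_residuals(fitted, residuals):
    """Draw a residuals-versus-fitted scatterplot (assumes matplotlib is installed)."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.scatter(fitted, residuals, s=10)
    ax.axhline(0, color="black", linewidth=1)
    ax.set_xlabel("Fitted value")
    ax.set_ylabel("Residual")
    return fig

rng = np.random.default_rng(1)
n = 1000
x = rng.uniform(0, 10, n)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, n)   # correctly specified straight-line model
X = np.column_stack([np.ones(n), x])

fitted, residuals = fit_ols(X, y)
means, sds = binned_residual_summary(fitted, residuals)
print("bin means:", np.round(means, 2))   # should all be close to zero
print("bin SDs:  ", np.round(sds, 2))     # should all be similar to one another
```

Calling `plot_residuals(fitted, residuals)` draws the scatterplot itself. The binned summary is only a supplement to looking at the graph, not a replacement for it.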

## How to Fix Problematic Residual Plots

The residual plot below clearly has a pattern!

If you know the fitted value, you can use it to predict the residual. For instance, fitted values near 5 and 10 tend to have positive residuals while fitted values near 7 tend to have negative values. If they were truly random, you wouldn’t be able to make these predictions.

This residual plot indicates that the independent variables do not capture the entire deterministic component. Unfortunately, some of the explanatory information has leaked over to the supposedly random error. There are a variety of reasons why a model can have this problem. The possibilities include a missing independent variable or a missing higher-order term, such as a polynomial term that models curvature.

To fix the problem, you need to identify the missing information, variable, or higher-order term and include it in the model. After you correct the problem and refit the model, the residuals should look nice and random! It might require subject-area knowledge and research to do this. The solution is very particular to your research.
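Here is a minimal sketch of one such fix, assuming the missing piece is a squared term. The data are simulated and the numbers are hypothetical. A straight-line fit to curved data leaves residuals that are positive at the ends and negative in the middle, and adding the squared term removes that pattern.

```python
import numpy as np

def ols_residuals(X, y):
    """Residuals (observed minus fitted) from a least-squares fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

rng = np.random.default_rng(2)
n = 600
x = rng.uniform(0, 10, n)
# The true relationship is curved (hypothetical coefficients).
y = 5.0 + 1.0 * x + 0.5 * (x - 5.0) ** 2 + rng.normal(0, 1.0, n)

ones = np.ones(n)
res_linear = ols_residuals(np.column_stack([ones, x]), y)        # omits the curvature
res_quad = ols_residuals(np.column_stack([ones, x, x ** 2]), y)  # includes the squared term

def mean_by_thirds(res):
    """Average residual in the low, middle, and high thirds of x."""
    order = np.argsort(x)
    return np.array([g.mean() for g in np.array_split(res[order], 3)])

thirds_linear = mean_by_thirds(res_linear)
thirds_quad = mean_by_thirds(res_quad)
rmse_linear = np.sqrt(np.mean(res_linear ** 2))
rmse_quad = np.sqrt(np.mean(res_quad ** 2))

print("straight-line model, mean residual by thirds:", np.round(thirds_linear, 2))
print("with squared term,   mean residual by thirds:", np.round(thirds_quad, 2))
print("RMSE:", round(rmse_linear, 2), "->", round(rmse_quad, 2))
```

The residuals are binned by x here so that both models are compared on the same axis. In this sketch, removing the pattern and improving the fit happen together.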

## Other Potential Problems

There are several other ways that explanatory information might make its way into your residuals:

**Another variable must not be correlated with the residuals.** If a variable is related to the residuals, that variable can predict the residuals, which is a no-no. Try including this variable in the model. To identify this correlation, graph the residuals by other variables. This problem relates to confounding variables and causes omitted variable bias.

**Neighboring residuals must not be correlated.** If adjacent residuals are correlated, one residual can predict the next residual. In statistics, this is known as autocorrelation. This correlation represents explanatory information that the independent variables do not describe. Models that use time-series data are susceptible to this problem. To resolve this issue, try adding an independent variable that contains the pertinent time information. Use the Durbin-Watson test to assess autocorrelation.

**Residuals must have a constant variance.** Heteroscedasticity refers to cases where the residuals have a non-constant variance. Read my post about how to identify and correct heteroscedasticity.
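For the autocorrelation check, the Durbin-Watson statistic is simple enough to compute by hand. The sketch below uses simulated data with hypothetical values and compares independent errors with errors that follow a first-order autoregressive process. It computes only the statistic. A formal test also needs critical values, which are not shown here. If you use statsmodels, `statsmodels.stats.stattools.durbin_watson` computes the same quantity.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic.

    Values near 2 suggest little first-order autocorrelation; values toward 0
    suggest positive autocorrelation and values toward 4 suggest negative.
    """
    diff = np.diff(residuals)
    return float(np.sum(diff ** 2) / np.sum(residuals ** 2))

def ols_residuals(X, y):
    """Residuals (observed minus fitted) from a least-squares fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

rng = np.random.default_rng(3)
n = 500
x = rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])

# Case 1: independent errors.
e_iid = rng.normal(0, 1, n)

# Case 2: each error carries over 80% of the previous one (AR(1), assumed for illustration).
shocks = rng.normal(0, 1, n)
e_ar = np.zeros(n)
for i in range(1, n):
    e_ar[i] = 0.8 * e_ar[i - 1] + shocks[i]

dw_iid = durbin_watson(ols_residuals(X, 1.0 + 2.0 * x + e_iid))
dw_ar = durbin_watson(ols_residuals(X, 1.0 + 2.0 * x + e_ar))

print("Durbin-Watson, independent errors:   ", round(dw_iid, 2))
print("Durbin-Watson, autocorrelated errors:", round(dw_ar, 2))
```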

## Residual Plots are Easy!

Hopefully, you see that checking your residual plots is a crucial but simple thing to do. You need random residuals. Your independent variables should describe the relationship so thoroughly that only random error remains. Non-random patterns in your residuals signify that your variables are missing something.

Importantly, appreciate that if you do see unwanted patterns in your residual plots, it actually represents a chance to improve your model because there is something more that your independent variables can explain. That’s a good thing!

When your linear regression model satisfies the OLS assumptions, the procedure generates unbiased coefficient estimates that tend to be relatively close to the true population values (minimum variance). In fact, the Gauss-Markov theorem states that, when the assumptions hold true, OLS produces the estimates with the smallest variance among all linear unbiased estimators.

For more information about the implications of this theorem on OLS estimates, read my post: The Gauss-Markov Theorem and BLUE OLS Coefficient Estimates.


**Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.**

Tridib dutta says

Hi Jim,

Let me thank you for these posts. Posts are full of information that are useful and hard to extract out of text books. So for self taught or statistically naive person, these posts are wonderful. Having said that, I would like to ask you a question. I have tried to use regression model with other algorithms (Support vector). Those have completely different sets of assumptions. I am wondering if in these cases, looking at the residual plots would help improve the model. In my case, the residual vs fitted values has a somewhat cone shape with tip of the cone being the origin and spreads to the right (all my predictors have either positive value or categorical in nature).

Your thoughts would be appreciated.

Thanks,

Jim Frost says

Hi Tridib,

What you describe sounds just like heteroscedasticity! Read my post about detecting heteroscedasticity to see if that helps you. I include several methods of how to handle it. I’m not completely familiar with the assumptions your analysis requires, but I’m expecting heteroscedasticity is not a good thing for it!

Mira says

Hi Jim!

I just wanted to thank you for all this work! I really like your blog and I’m going to use it to refresh contents I should already know (ups). I really appreciate how you always focus on how to apply the concepts, which was exactly what I was looking for. And I’m shocked you actually take time to answer questions. Thanks!

Jim Frost says

Hi Mira,

I’m so happy to hear that my website has been helpful. I work hard to make the explanations clear and helpful, so your kind words really mean a lot! Thank you!

Surya says

Hi Jim,

In the above post on residuals,

1) If they exhibit a pattern, it means some variation in the DV is still to be explained by the IVs. Can I say that SSR (sum of squares of regression) is less than it ought to be and SSE (sum of squares of error) is more than it ought to be? And that SSE’s reduction should be SSR’s gain?

And

2) Do you also have plan to publish articles on Logistic regression as well ?

3) Please ensure your book is available in India as well. You have fans over here 🙂

Jim Frost says

Hi Surya,

Yes, that’s exactly it. SSR + SSE = TSS. The total sum of squares (TSS) is independent of your model. Given a fixed dataset, it should be a fixed value. Therefore, if you specify different models with the same dataset, and it changes the SSR, then SSE must change accordingly. You’re correct: if SSR decreases, then SSE must increase by the same amount. If you have missing values in your data, it can complicate that slightly. Missing values can cause entire observations to be omitted from the analysis, which can change TSS slightly depending on the model and the observations that are excluded.
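Here is a small numeric check of that identity, using simulated data with hypothetical values and a model that includes the constant. Two models are fit to the same dataset. TSS stays the same, and whatever SSR gains from the extra predictor, SSE loses.

```python
import numpy as np

def sums_of_squares(X, y):
    """Return (SSR, SSE, TSS) for an OLS fit; X must include the constant column."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    tss = np.sum((y - y.mean()) ** 2)        # total variation in the dependent variable
    ssr = np.sum((fitted - y.mean()) ** 2)   # variation explained by the regression
    sse = np.sum((y - fitted) ** 2)          # variation left in the residuals
    return ssr, sse, tss

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(0, 1, n)
ones = np.ones(n)

# Model A omits x2; model B includes it.
ssr_a, sse_a, tss_a = sums_of_squares(np.column_stack([ones, x1]), y)
ssr_b, sse_b, tss_b = sums_of_squares(np.column_stack([ones, x1, x2]), y)

print("Model A: SSR + SSE =", round(ssr_a + sse_a, 3), " TSS =", round(tss_a, 3))
print("Model B: SSR + SSE =", round(ssr_b + sse_b, 3), " TSS =", round(tss_b, 3))
print("SSR gain:", round(ssr_b - ssr_a, 3), " SSE reduction:", round(sse_a - sse_b, 3))
```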

I have one post about logistic regression, where I show an example of binary logistic regression. I do plan to write a more in-depth article about it. Hopefully, mid to late 2019.

Thanks for asking about my book! I’ve been working hard to finish that up and I’m almost there! Initially, it’ll be available only as an ebook and I expect you should be able to purchase that from anywhere. And, I’ll make sure it’s available in India! I have a soft spot for India because I’ve been there several times. And, the second highest number of visitors come to my website from India! 🙂

Thanks for writing. It’s great hearing from you again!

Curious says

Hi Jim, great post!

I’ve a question: While choosing a model that best fits the data at hand, should one prefer the model that minimises RMSE (i.e. standard deviation of residuals) which is the general practice, or the model that most randomises the residuals i.e. extracts all the systematic information from the X-variables leaving only random noise?

Many thanks!

Jim Frost says

Hi,

This is a tricky question! There’s no solid rule. Yes, you generally want to minimize RMSE or, conversely, maximize R-squared. However, as I write in various places (see R-squared for instance), chasing a closer fit can lead to problems. For example, that approach can lead you to overfit your model and you won’t be able to trust the results.

However, you usually do not want patterns in the residuals. And, when I say “usually,” I mean almost never! And, typically, models that produce residuals with patterns won’t have the lowest RMSE. You should see some sort of improvement by fitting a better model that eliminates the residual patterns. That addresses two issues simultaneously: eliminating the pattern and improving the fit.

To summarize, you probably don’t want a model with residual patterns. You want a model with random residuals and a *relatively* low RMSE or *relatively* high R-squared. But, you might not necessarily want to minimize/maximize those values.

For more of my thoughts on this topic, read my post about choosing the correct regression model.

Keegan says

Hi Jim,

Are residual probability plot interpretations same for linear and non linear regression models?

Jim Frost says

Hi Keegan,

Yes, in fact, both the interpretation and the requirements for the residual plots are the same for linear and nonlinear regression.

Muhammad Khan says

Hey Jim, Hope you are doing well. I am still not very clear about the reasons for why we need the residuals to be normally distributed. Could you please elaborate on that?

Jim Frost says

Hi Muhammad, if you want to be able to use the p-values to determine which predictors are statistically significant, those hypothesis tests assume that the residuals are normally distributed. Also, the CIs for the coefficients and the prediction intervals for new observations all assume that the residuals are normally distributed. If the residuals are not normally distributed, you might not be able to trust those results. I talk about this specific assumption a bit more in my post for OLS Assumptions. Look for the specific assumption about this issue.

I hope this helps!

Om Shanker Pandey says

I have read so many posts on statistics and heard so many people talk about regressions but loved the way you explain things. It is crystal clear and love reading your blogs. Please write more.

Jim Frost says

Thank you very much. I strive to make my blog posts as clear as possible so your comment means a lot to me. I’m glad that you found them to be helpful!

Dale says

Hi Jim,

I have a partial regression plot that looks very cyclical. So while it is technically scattered around 0, it seems like there is indeed a pattern. Would this mean that a polynomial factor would improve the model as you suggested? Also, is a partial regression plot the same as a residual plot? I am still a beginner as you can tell.

Sundar says

Hi Jim,

Really nice article put up in a simple way for beginner like me to understand. However I plotted fitted values v/s residuals for my model and though it appear random almost all of the values are on the positive side of the axis. In your post you have mentioned that the values should be “randomly scattered around zero”. So does it mean mine is showing any pattern ?

Jim Frost says

Hi Sundar,

If almost all of your residuals are positive, that indicates that your model is biased. Because a residual is the observed value minus the fitted value, mostly positive residuals mean that the fitted values are systematically lower than the observed values. The residuals should be randomly scattered around zero. In fact, one of the assumptions for ordinary least squares regression is that the mean of the residuals equals zero.

If you fit a model with the constant (which is almost always the case), this forces the mean of your residuals to equal zero. For more information about this, read my post about the constant in a regression model. This leads to two possible cases based on whether your model includes the constant:

If your model does *not* include the constant, consider adding it because your model will then satisfy that particular assumption at least. You must still check the other assumptions.

However, if your model does include the constant and it appears that almost all of your residuals are positive, that is an interesting case. Because of the constant, you know the mean must equal zero even though most of the residuals are positive. One possibility is that you have a few unusual observations with large negative residuals that offset the multitude of positive residuals and produce that zero mean. In this scenario, you’d need to investigate those data points and determine whether those observations are outliers that should be removed from the model.
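Both cases can be sketched with simulated data. All values are hypothetical, and the outliers are placed near the center of x only to keep the illustration clean. The first part compares the mean residual with and without the constant. The second adds a few large negative outliers to a model that does include the constant.

```python
import numpy as np

def ols_residuals(X, y):
    """Residuals (observed minus fitted) from a least-squares fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

rng = np.random.default_rng(5)
n = 300
x = rng.uniform(1, 10, n)
y = 20.0 + 2.0 * x + rng.normal(0, 0.5, n)

with_const = np.column_stack([np.ones(n), x])   # constant plus x
no_const = x.reshape(-1, 1)                     # x only, line forced through the origin

# Case 1: the constant forces the residual mean to zero; omitting it does not.
mean_with = ols_residuals(with_const, y).mean()
mean_without = ols_residuals(no_const, y).mean()
print("mean residual with constant:   ", round(mean_with, 6))
print("mean residual without constant:", round(mean_without, 3))

# Case 2: constant included, but ten observations are pulled far below the line.
y_out = y.copy()
near_center = np.argsort(np.abs(x - x.mean()))[:10]
y_out[near_center] -= 30.0
res_out = ols_residuals(with_const, y_out)
share_positive = np.mean(res_out > 0)
print("mean residual with outliers:", round(res_out.mean(), 6))
print("share of positive residuals:", round(share_positive, 3))
```

In the second case, the mean residual is still zero, yet most residuals are positive because a handful of large negative ones balance them out.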

I hope this helps!

Naman says

Hi Jim,

As usual thank you for sharing your explanation and understanding of statistics in such a fluid and easy way.

It would be great if you can share some methods to fix auto correlation problem.

Jim Frost says

That’s a great topic for a future blog post!

Mislav says

Hello, Mr Frost,

Since your explanations seem theoretically sound and also understandable to less educated statisticians, I would be glad if you can help me with regression. As far as I understand it has something to do with residuals.

I made a regression model (supposed for prediction), and it looks very good (equation confirmed in other research, nice fit, high R2, residuals vs. fitted value is almost Ok). However when I tried to predict values using this model there “popped out” kind of bias: higher values of y were underestimated, and lower values were overestimated. Then I looked at the residual plot vs. observed values, and there was clear trend – negative values of residuals for small y, positive for large y. I read some spurious explanation that it must happen, and I am asking – what is theoretic cause and how to correct that, practically? Does it maybe have something to do with data? Or Maybe validation set should be different from modeling data set? It seems that there is no lack of fit, nor do i have any other clever independent variables to include. Hoping my problem and question are clear enough, thank you in advance.

Jim Frost says

Hi Mislav, assessing residual plots is the perfect way to pick up problems like this one. Typically, when you see non-random patterns like this, you have an under-specified model. In other words, you might be missing a variable, or not specifying the curvature correctly. A high R-squared by itself does not tell you that your model is good. Below are several links that will help you specify an unbiased model:

Model specification

Curve fitting

R-squared (high R-squared values are not necessarily good)

I hope this helps!

Cathrine says

Hello Jim,

The fitted values that you plot against the residuals, what are they?

In the case of multiple linear regression, are they the mean of all the estimated independent variables in the model? Or just the Parameter of Interest?

Kind regards?

Jim Frost says

Hi Cathrine, that’s a great question and it suggests to me that I need to make my post more clear about that!

Here’s how a residuals by fitted value plot is created. The software uses the regression model to make a prediction for each observation (row in your data table). That prediction is the fitted value and it falls on the X-axis on the scatterplot. A fitted value is a statistical model’s prediction of the mean response value when you input the values of the predictors, factor levels, or components into the model.

The residual is the difference between the observed value and the fitted value that the model predicts for each observation. This value falls on the Y-axis of the scatterplot.

Consequently, each observation in your data set produces the pair of X (fitted value) and Y (residual) values that are graphed on the scatterplot. The goal of a regression model is that the residuals do not fall systematically above or below zero because zero indicates that the model made a perfect prediction. If the residuals are systematically high or low, your model is biased and needs to be fixed. Some models might work better for high, medium, or low fitted values, which is why we use that for the X-axis.

I hope this answers your questions!

jonamjar says

Why dint I find this before about?!! Amazing

Jim Frost says

Thank you so much! And, I’m glad you found my blog!

Ghouse says

brilliant explanation. Thanks Jim

Mohammad Kamel says

Thanks Jim, your articles are really helpful, you make the statistics concepts very easy and logical

Jim Frost says

Thank you, Mohammad! I really appreciate the kind words and I’m so glad that you find them to be helpful!

Nate says

I have a question about my residual plot. I am looking at the influence of precipitation on population change in a species of ground nesting birds. I looked at precipitation for a given month as a percent of the average precipitation for that month. There was a correlation in March and August. So I conducted a regression analysis for the population change and precipitation for a given month. There is a positive relationship in March and negative in August. The residual plot in March does not show a pattern but the August residual plot shows a pattern. How would I look at the combination of March precipitation and August precipitation combined and population change?

Jim Frost says

Hi Nate, I don’t completely understand how your study is set up. It seems like you have separate models for each month rather than one model? Typically, you want the residual plots to be random. You don’t want patterns in the residuals even when you have correlations in the data. Because you see a pattern in August, I’d be concerned that your model doesn’t fit the data well for that month.

If I understand correctly, in March, the more rain you have, the more the population increases? And, in August, more rain equals less population? I don’t know the study area but you’d have to determine whether that reversal makes sense. Also, with time series data, you have to be really careful about time order effects sneaking in. Is it possible that rain is correlated with something else that is driving population change? Maybe in March rain occurs at the same time as something else that actually drives the population to increase? And, in August maybe it happens to correspond with something that causes the population to decrease? I’m not saying that’s the case for sure, but you have to think about possibilities like that. Especially when the relationship changes direction like that. Does that make theoretical sense?

john says

Jim, can you give an example of “To resolve this issue, try adding an independent variable that contains the pertinent time information”? What is the pertinent time information? Do you have another post that addresses it? Thanks, John

Jim Frost says

Hi John, I don’t have a post on that topic yet but will write one at some point. Suppose your data are time series data. You could include a lagged variable for an independent variable. For example, if you think a variable has a delayed effect on the outcome, you can lag the variable so that the value from a previous observation appears in the current observation. You could also possibly add a variable that indicates the month, day, season, hour, etc. if you think that is relevant to the outcome. These types of variables can all supply important information to the model. A model that is missing important information can provide untrustworthy results. Sometimes this includes time-related information. As always, you have to use subject-area knowledge and expertise to include the correct information.
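As a minimal sketch of the lagged-variable idea, the code below simulates data in which the outcome responds to the previous period’s value of x. The data and the helper name `lag` are hypothetical. It builds the lagged column with NumPy, drops the first row (which has no previous observation), and compares a model that uses the current x with one that uses the lagged x.

```python
import numpy as np

def lag(values, k=1):
    """Shift a series forward by k >= 1 observations.

    The first k entries have no earlier observation, so they are set to NaN.
    """
    values = np.asarray(values, dtype=float)
    out = np.full_like(values, np.nan)
    out[k:] = values[:-k]
    return out

def fit_sse(X, y):
    """Return OLS coefficients and the sum of squared residuals."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, float(np.sum((y - X @ coef) ** 2))

rng = np.random.default_rng(6)
n = 200
x = rng.normal(0, 1, n)
x_lag1 = lag(x, 1)

# Assumed data-generating process: the outcome depends on last period's x.
y = 1.0 + 2.0 * x_lag1 + rng.normal(0, 0.5, n)

keep = ~np.isnan(x_lag1)          # drop the first row, which has no lagged value
ones = np.ones(keep.sum())

coef_now, sse_now = fit_sse(np.column_stack([ones, x[keep]]), y[keep])
coef_lag, sse_lag = fit_sse(np.column_stack([ones, x_lag1[keep]]), y[keep])

print("SSE using current x:", round(sse_now, 1))
print("SSE using lagged x: ", round(sse_lag, 1))
print("estimated effect of lagged x:", round(coef_lag[1], 2))
```

Which lag to use, and whether a month or season indicator belongs in the model instead, depends on your subject area.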

Thanks for writing with the great question!

Jim

Nithashanasar says

Wow.. It’s such an amazing discovery… I believe that u will be a mile stone in statistics… Here ..it facilitate the concept of scatter plot…am doing MSC statistics.. So…I am really proud of u..thank jim

Jim Frost says

Hi Nithashanasar, I’m very happy to hear that this is helpful for you! Also, thank you so much for your kind words. I really appreciate it!

Jim