Use residual plots to check the assumptions of an OLS linear regression model. If you violate the assumptions, you risk producing results that you can’t trust. Residual plots display the residual values on the y-axis and fitted values, or another variable, on the x-axis. After you fit a regression model, it is crucial to check the residual plots. If your plots display unwanted patterns, you can’t trust the regression coefficients and other numeric results.

In this post, I explain the conceptual reasons why residual plots help ensure that your regression model is valid. I’ll also show you what to look for and how to fix the problems.

First, let’s go over a couple of basics.

There are two fundamental parts to regression models: the deterministic and random components. If your model is not random where it is supposed to be random, it has problems, and this is where residual plots come in.

The essential parts of a regression model:

Dependent Variable = (Constant + Independent Variables) + Error

Or:

Dependent Variable = Deterministic + Stochastic

## Deterministic Component

The deterministic component is the portion of the variation in the dependent variable that the independent variables explain. In other words, the mean of the dependent variable is a function of the independent variables. In a regression model, all of the explanatory power should reside here.

## Stochastic Error

Stochastic just means unpredictable. In statistics, the error is the difference between the expected value and the observed value. Let’s put these terms together—the gap between the expected and observed values must not be predictable. Or, no explanatory power should be in the error. If you can use the error to make predictions about the response, your model has a problem. This issue is where residual plots play a role.

The theory here is that the deterministic component of a regression model does such a great job of explaining the dependent variable that it leaves only the intrinsically inexplicable portion of your study area for the error. If you can identify non-randomness in the error term, your independent variables are not explaining everything that they can.

Don’t worry. This is actually easy to understand. It just means that you should not be able to see patterns in the residual plots!

**Statistical note:** The residuals estimate the true error in the same manner that regression coefficients estimate the true population coefficients.
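To make the deterministic + stochastic split concrete, here is a minimal simulation sketch in Python with NumPy. The coefficients (3.0 and 1.5), the noise level, and the sample size are made-up values chosen for illustration; they don't come from any dataset in this post. Because the data are simulated, the true error is known and can be compared with the residuals that estimate it.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

n = 200
x = rng.uniform(0, 10, size=n)

# Stochastic component: unpredictable noise (assumed normal, sd = 2)
true_error = rng.normal(0, 2, size=n)

# Dependent variable = deterministic part + stochastic part
y = 3.0 + 1.5 * x + true_error

# Fit OLS with an intercept; the design matrix columns are [1, x]
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ coef          # estimated deterministic component
residuals = y - fitted     # estimate of the stochastic component

print("estimated intercept and slope:", coef)
print("mean residual:", residuals.mean())
print("corr(residuals, true error):", np.corrcoef(residuals, true_error)[0, 1])
```

With a correctly specified model like this one, the residuals track the true errors closely, and their mean is zero up to rounding because the model includes a constant.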

## How to Check Residual Plots

When looking at residual plots, you simply want to determine whether the residuals are consistent with random error. I’ll use an analogy of rolling a die. You shouldn’t be able to use one roll to predict the outcome of the next roll because it is supposed to be random. So, if you record a series of tosses, you should see only random results. If you start to see patterns, you know something is wrong with your model of how the die works. You think it’s random, but it’s not. If you were a gambler, you’d use this information to adjust how you play to match the actual die outcomes better.

You can apply this idea to regression models too. If you look at a series of errors, it should look random. If there are patterns in the errors, this means that you can use one error to predict another. As with the die analogy, if there are patterns in the residuals, you need to adjust your model. But, don’t fret, this just means that you can improve the fit of the model by moving this predictability over to the deterministic side of things (i.e., your independent variables).

How do you determine whether the residuals are random in regression analysis? It’s pretty simple: just check that they are randomly scattered around zero for the entire range of fitted values. When the residuals center on zero, they indicate that the model’s predictions are correct on average rather than systematically too high or low. Regression also assumes that the degree of scattering is the same for all fitted values and, if you want to use the p-values and confidence intervals, that the residuals follow a normal distribution.

Residuals should look like this.
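If you prefer a numeric companion to the visual check, one rough approach is to sort the observations by fitted value, split them into a few groups, and compare the mean and spread of the residuals in each group. The sketch below assumes simulated data with made-up coefficients and an arbitrary choice of five groups; it is an illustration, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.8 * x + rng.normal(0, 1, size=n)  # well-behaved simulated data

X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
residuals = y - fitted

# Split observations into 5 equal-sized groups ordered by fitted value
order = np.argsort(fitted)
groups = np.array_split(order, 5)

bin_means = np.array([residuals[g].mean() for g in groups])
bin_sds = np.array([residuals[g].std(ddof=1) for g in groups])

for k, (m, s) in enumerate(zip(bin_means, bin_sds), start=1):
    print(f"group {k}: mean residual = {m:+.3f}, sd = {s:.3f}")
```

For healthy residuals, every group mean should sit close to zero and the group standard deviations should be similar. The same `fitted` and `residuals` arrays are what you would pass to a scatterplot to draw the residuals by fitted values plot.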

## How to Fix Problematic Residual Plots

The residual plot below clearly has a pattern!

If you know the fitted value, you can use it to predict the residual. For instance, fitted values near 5 and 10 tend to have positive residuals while fitted values near 7 tend to have negative values. If they were truly random, you wouldn’t be able to make these predictions.

This residual plot indicates that the independent variables do not capture the entire deterministic component. Unfortunately, some of the explanatory information has leaked over to the supposedly random error. There are a variety of reasons why a model can have this problem. The possibilities include a missing:

- Independent variable.
- Polynomial term to model a curve.
- Interaction term.

To fix the problem, you need to identify the missing information, variable, or higher-order term and include it in the model. After you correct the problem and refit the model, the residuals should look nice and random! It might require subject-area knowledge and research to do this. The solution is very particular to your research.
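Here is a small sketch of the missing polynomial term case, using simulated data where the true relationship is known to be curved. The coefficients and the range of `x` are assumptions for the example. The straight-line model leaves the curvature in the residuals; adding the squared term moves it to the deterministic side.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(-3, 3, size=n)
# True relationship includes curvature (0.5 * x^2)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(0, 1, size=n)

def fit_ols(X, y):
    """Return fitted values and residuals from an OLS fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    return fitted, y - fitted

ones = np.ones(n)

# Model 1: straight line only (the squared term is missing)
_, resid_line = fit_ols(np.column_stack([ones, x]), y)

# Model 2: include the polynomial term
_, resid_quad = fit_ols(np.column_stack([ones, x, x**2]), y)

# How well does the omitted term predict the residuals?
corr_before = np.corrcoef(resid_line, x**2)[0, 1]
corr_after = np.corrcoef(resid_quad, x**2)[0, 1]

rmse_before = np.sqrt(np.mean(resid_line**2))
rmse_after = np.sqrt(np.mean(resid_quad**2))

print(f"corr(residuals, x^2): before = {corr_before:.3f}, after = {corr_after:.3f}")
print(f"RMSE: before = {rmse_before:.3f}, after = {rmse_after:.3f}")
```

In real data you won't know the missing term in advance, which is why subject-area knowledge matters. The sketch only shows what happens once the right term has been identified.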

Residuals should follow the normal distribution if you plan to use the p-values and confidence intervals. Use a QQ plot to check them!
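As a rough numeric stand-in for eyeballing a QQ plot, you can correlate the sorted residuals with the normal quantiles they would be plotted against. The sketch below assumes simulated data and uses the plotting positions (i − 0.5)/n, which is one common convention among several. The helper names `qq_correlation` and `ols_residuals` are hypothetical, introduced only for this example.

```python
import numpy as np
from statistics import NormalDist

def qq_correlation(residuals):
    """Correlation between sorted residuals and normal theoretical quantiles."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = r.size
    probs = (np.arange(1, n + 1) - 0.5) / n          # plotting positions
    theo = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return np.corrcoef(theo, r)[0, 1]

rng = np.random.default_rng(3)
n = 300
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])

def ols_residuals(y):
    """Residuals from regressing y on the intercept and x defined above."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

# Same deterministic part, two different error distributions
resid_normal = ols_residuals(1.0 + 0.5 * x + rng.normal(0, 1, size=n))
resid_skewed = ols_residuals(1.0 + 0.5 * x + rng.exponential(1.0, size=n))

qq_normal = qq_correlation(resid_normal)
qq_skewed = qq_correlation(resid_skewed)

print(f"QQ correlation, normal errors: {qq_normal:.4f}")
print(f"QQ correlation, skewed errors: {qq_skewed:.4f}")
```

A value very close to 1 corresponds to points hugging the straight line on a QQ plot. The plot itself is still worth drawing, because it shows where the departure from normality occurs (tails, skew, outliers), which a single number cannot.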

## Other Potential Problems

There are several other ways that explanatory information might make its way into your residuals:

- **Another variable must not be correlated with the residuals.** If a variable is related to the residuals, that variable can predict the residuals, which is a no-no. Try including this variable in the model. To identify this correlation, graph the residuals by other variables. This problem relates to confounding variables and causes omitted variable bias.
- **Neighboring residuals must not be correlated.** If adjacent residuals are correlated, one residual can predict the next residual. In statistics, this is known as autocorrelation. This correlation represents explanatory information that the independent variables do not describe. Models that use time-series data are susceptible to this problem. To resolve this issue, try adding an independent variable that contains the pertinent time information. Use the Durbin-Watson test to assess autocorrelation.
- **Residuals must have a constant variance.** Heteroscedasticity refers to cases where the residuals have a non-constant variance. Read my post about how to identify and correct heteroscedasticity.
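For the autocorrelation check, the Durbin-Watson statistic is simple enough to compute by hand: it is the sum of squared differences between neighboring residuals divided by the sum of squared residuals. The sketch below assumes simulated time-ordered data and an AR(1) coefficient of 0.8 that I picked for illustration. It computes only the statistic, not the critical values you would need for a formal test.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: near 2 suggests no first-order autocorrelation."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(4)
n = 300
t = np.arange(n, dtype=float)              # time index as the predictor
X = np.column_stack([np.ones(n), t])

def ols_residuals(y):
    """Residuals from regressing y on the intercept and time index."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

# Case 1: independent errors
e_indep = rng.normal(0, 1, size=n)

# Case 2: AR(1) errors, where each error carries over 80% of the previous one
shocks = rng.normal(0, 1, size=n)
e_ar = np.zeros(n)
for i in range(1, n):
    e_ar[i] = 0.8 * e_ar[i - 1] + shocks[i]

dw_indep = durbin_watson(ols_residuals(2.0 + 0.05 * t + e_indep))
dw_ar = durbin_watson(ols_residuals(2.0 + 0.05 * t + e_ar))

print(f"Durbin-Watson, independent errors: {dw_indep:.2f}")
print(f"Durbin-Watson, AR(1) errors:       {dw_ar:.2f}")
```

The statistic ranges from 0 to 4. Values near 2 are consistent with uncorrelated neighbors, values well below 2 point to positive autocorrelation, and values well above 2 point to negative autocorrelation.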

## Residual Plots are Easy!

Hopefully, you see that checking your residuals plots is a crucial but simple thing to do. You need random residuals. Your independent variables should describe the relationship so thoroughly that only random error remains. Non-random patterns in your residuals signify that your variables are missing something.

Importantly, appreciate that if you do see unwanted patterns in your residual plots, it actually represents a chance to improve your model because there is something more that your independent variables can explain. That’s a good thing!

When your linear regression model satisfies the OLS assumptions, the procedure generates unbiased coefficient estimates that tend to be relatively close to the true population values (minimum variance). In fact, the Gauss-Markov theorem states that when the assumptions hold true, OLS produces estimates with the smallest variance among all linear unbiased estimators.

For more information about the implications of this theorem on OLS estimates, read my post: The Gauss-Markov Theorem and BLUE OLS Coefficient Estimates.

If you’re learning regression and like the approach I use in my blog, check out my eBook!

**Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.**

Rick says

Hi Jim, I’m sorry to say so, but I believe part of your explanation of residual plots is wrong. You have indicated that the x-axis shows the fitted values, but this is not at all what is plotted on the x-axis. First, on a scatter plot of x and y values containing a line of best fit, the x-axis is the independent variable, the y-axis is the dependent variable, and the line of best fit are the fitted values. Then, on a residuals plot, the x-axis is still the independent variable, the y-axis is the residuals, and the fitted values are shown by the line y = 0. The line y = 0 is the fitted values, not the values on the x-axis.

Jim Frost says

Hi Rick,

My explanation is correct. But there are multiple types of residual plots. One general type has the residuals on the y-axis and either the fitted values or dependent variable values on the x-axis. Both graphs present similar types of information, and it’s the type I’m focusing on in this article. It helps you find properties that violate assumptions, such as heteroscedasticity, bias, and outliers. Look at the residuals plot I show in this article and you’ll see that it is indeed a Residuals by Fitted Values Plot. 😉

There are additional types of residual plots, one of which is what you’re describing. That’s a residual plot where you again have the residuals on the y-axis and the values of an independent variable on the x-axis. These graphs look for a different property that violates a different assumption than the above graph. Specifically, they look for a relationship between an IV and the residuals. So, what you describe is one type of residual plot, but it serves a different purpose, and it’s not the type that I’m focusing on in this post.

The other graph you mention, with the y-axis being the dependent variable and the x-axis being an independent variable, is just a scatterplot or a fitted line plot. It’s not a residual plot at all because it doesn’t display residuals! However, you can use these to assess assumptions for a simple regression model because these models have only one independent variable. You can’t use them for multiple regression and will need to use residual plots in those cases.

One final point: your suggestion that the y = 0 line on the residuals by IV graph represents the fitted values is incorrect. This type of graph doesn’t display fitted values. Because the y-axis shows residual values, a line at y = 0 is where the residuals equal zero, representing where the observed values equal the fitted values. But it’s not actually displaying fitted values because neither axis represents fitted values. For all the residual plots I discuss in this comment, you want your residuals to cluster randomly and equally around this y = 0 line. It’s a reference line that makes this visual determination easier.

I hope that helps clarify!

Eneyi says

I just found your blog Jim. Amazing work. This is good quality stuff. Thank you Jim

Jim Frost says

Thanks so much, Eneyi!

David says

Hi Jim,

Great description on residuals. You have noted in the past that for very large sample sizes the need for normality of the residuals in order to form proper CI disappears because of the CLT. Is this always true? I am thinking if the model is biased or there is some autocorrelation between the residuals a large sample will not help you form a reliable CI? Thanks

David

Jim Frost says

Hi David,

For OLS regression, normality is only required when you need to use hypothesis testing and CIs. And, yes, like for hypothesis tests, the CLT can come into play here. However, the role of the CLT is much clearer for things like t-tests and one-way ANOVA where you’re dealing with two or more groups based on one variable. However, in regression, it’s a lot more complex. You can have differing numbers of variables and types of variables (e.g., continuous and categorical).

I haven’t looked into whether there are rules of thumb, such as number of observations per predictor, where you can assume that the CLT has sufficiently kicked in. While I suspect that would happen eventually, I don’t have a good idea as to when. And I’m sure that depends on the nature of your model.

On to your question, note that the CLT allows you to use the p-values and CIs when your residuals are non-normal only when you’ve specified the correct model. Your examples of biased models and autocorrelation are misspecified models. They’re not correct. Hence, a large sample size/CLT won’t fix those problems. So, you’re correct that for those conditions, a large sample size will not produce reliable CIs (or p-values). But that’s because your model is wrong and nothing to do with the CLT.

Enrico Mendoza says

Hi Jim,

Is checking for residual plots applicable for logistic regression? DV is binary, independent variables are all binary as well. Thank you very much

Timothy Ipevnor says

Good day Prof., Please, what kind of test can I carry out to satisfy the assumption of ‘Independent variables are not stochastic’ in panel data regression?

Pedro Barbosa says

Hi, Prof. Jim!

I would like to know if there is any case where it would be better to use a Bland-Altman plot. I believe the main difference from the residual plot is the x-axis, which in Bland-Altman is the average of the observed and fitted values (originally, in their context, method A and method B). How would you compare the analytical power of those two kinds of plots when assessing the model’s performance?

Thank you for your very didactic posts!

DAIANE CAROLINA SILVA says

Prof Jim, thank you very much for your work that has helped me in such a beautiful discipline like statistics.

I am performing my regressions models for my research project. Reading this post you talked about analysing residuals using Scatter plots.

I am using more Histograms to check the normality of the residuals. So I have two questions:

- If I use histograms and the Shapiro-Wilk test to test and visualise the normality of the residuals, is that OK, or is a scatter plot also needed?

- If I have significant variables and normal residuals, but the signs of some independent variables are not congruent with theory and previous models in the literature, would you judge that the model is still valid? Or should theory and the models in the literature have the last word, so that the signs must correspond for the model to be justified as valid?

Thank you very much.

Jim Frost says

Hi Daiane,

You’re very welcome! I’m so glad that my website has been helpful!

For starters, histograms are not the best way to assess normality. It’s easier and more accurate using a normal probability plot (aka a Q-Q plot). For more information, read my post where I compare using histograms and normal probability plots to assess normality.

It is OK to use a normality test (such as the Shapiro-Wilk test). However, be aware that the power of these tests increases with sample size. With a large enough sample, they can detect trivial departures from normality as being statistically significant, even though these trivial departures don’t affect your results. Just something to be aware of when you’re using a large dataset. The normal probability plots I mention above remain a good method even for very large datasets. That’s why I recommend them!

There are several reasons why your coefficient signs don’t match theory. You really need to determine why it’s occurring and either correct your model or find an explanation for it that justifies the unexpected signs. In my post about model specification, I write about how theory needs to be your guide. I’d take a look at that.

Unfortunately, I can’t tell you what is causing your unexpected coefficient signs, but I’ll point you toward three main culprits that you can investigate.

Omitted variable bias/confounders: Leaving important variables out of your model can bias the coefficients of the variables that you include.

Overfitting: If you include too many terms in your model given the number of observations, you can get strange results.

Multicollinearity: Can flip coefficient signs.
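To check for the multicollinearity culprit, a common diagnostic is the variance inflation factor (VIF): regress each predictor on all the other predictors and compute 1 / (1 − R²). Here is a minimal sketch with simulated predictors. The function name `vif` and the data are assumptions for the example, and the often-quoted cutoffs (such as 5 or 10) are rules of thumb rather than strict thresholds.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (X has no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        # Regress predictor j on an intercept plus all the other predictors
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1.0 - resid.var() / X[:, j].var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)              # unrelated to the others

vifs = vif(np.column_stack([x1, x2, x3]))
for name, v in zip(["x1", "x2", "x3"], vifs):
    print(f"VIF for {name}: {v:.1f}")
```

Predictors with very high VIFs have coefficients that are estimated imprecisely, which is how signs can end up flipped relative to theory.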

Also, I’d highly recommend picking up my regression analysis book, which you can find in my web store. In it, I cover many aspects of regression analysis in detail!

Kaushik Ghosh says

Hi Jim. I can’t say how thankful I am to understand statistics in a simple way, being from non-statistical background. Can you please tell me how i should cite your website in my research work. Thanks a lot

Rudzani Mulaudzi says

Thanks for this article, really helpful.

I have implemented about 16 different machine learning models in a forecasting task. My R-squared is negative across all models, which led me to investigate the residual plots, which are surprisingly all bimodal. I have transformed the input data through a standard scaler which ensures all my data is between 0.0001 and 1. I am not sure what to do as this is across all my models and the data has been transformed.

I implemented regression techniques, kernel based techniques, deep learning techniques, and tree based techniques. My RMSE and MAE looks fine across all models.

Jim Duncan says

Hi Jim, Great website. Will be getting one or two of your e books pronto. Like you say, residuals are so important! Looking for confirmation or advice on what I did to counter a breach of a critical regression Assumption.

I am analyzing an owl abundance (dependent variable) time series that had a significant positive autocorrelation, shedding doubt on the value of the significant simple linear regression result. The owl data came from a survey that changed methods about halfway through the 20-year period, and that affected detection rates. More owls were detected per km in the earlier period than the latter, resulting in a significant trend that was likely due to the change in method. I believe the method change is considered a stationary extrinsic trend, correct?

I detrended the owl data by calculating the residuals (observed – predicted) from the regression. The residuals are larger in absolute terms for larger indices. Therefore, the residuals from each survey period can be made comparable by converting them to residual ratios: dividing each residual by the mean index for that method’s period. It worked, as there was a dramatic difference in the before and after detrending graphs. The transformed data was no longer significantly autocorrelated, the R-squared increased, and the p-value became smaller (more significant).

Any eye pokers with this approach? Is this a method that you describe in one of your books? More references on this or other detrending methods would be much appreciated.

Cheers, Jim

Miae says

Thanks Jim for the very useful article!

What is the difference between a plot based on fitted values on the x-axis (residual plot described here) and a plot based on observed values on the x-axis?

Jim Frost says

Hi Miae,

They often present very similar information. In fact, when you have a good model, your fitted values should be very close to the observed values. In that case, they will be nearly identical. However, differences can appear when the model isn’t so good because the fitted values diverge more from the observed values. Even then, I don’t personally recall cases where they’re telling vastly different stories. Theoretically, you’d want to use the fitted values because you’re assessing the model. How do the residuals relate to the model’s predictions? In other words, how wrong are the predictions? You’re not assessing the wrongness of the observed values.

But most times you’ll probably get the same impression from either type of graph.
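One difference between the two graphs can be shown numerically. For an OLS model with a constant, the residuals are exactly uncorrelated with the fitted values, but their correlation with the observed values equals the square root of (1 − R²). So a residuals by observed values plot has a built-in upward tilt that grows as the fit gets weaker, and that tilt is not a sign of a problem. The sketch below uses simulated data with made-up coefficients and a deliberately noisy fit.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(0, 10, size=n)
y = 5.0 + 1.0 * x + rng.normal(0, 3, size=n)   # fairly noisy relationship

X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
residuals = y - fitted

r_squared = 1.0 - np.sum(residuals**2) / np.sum((y - y.mean())**2)

corr_fitted = np.corrcoef(residuals, fitted)[0, 1]
corr_observed = np.corrcoef(residuals, y)[0, 1]

print(f"R-squared:                  {r_squared:.3f}")
print(f"corr(residuals, fitted):    {corr_fitted:.3e}")
print(f"corr(residuals, observed):  {corr_observed:.3f}")
print(f"sqrt(1 - R-squared):        {np.sqrt(1 - r_squared):.3f}")
```

When R-squared is high, the tilt is small and the two graphs look nearly the same, which fits the experience described above.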

Grace says

Hi Jim,

Firstly, thank you for the helpful information!

I’m working on a simple tide model that predicts the height of the high and low tides. It’s a nonlinear regression model. I’ve seen some websites that say that residual plots are used to see if a linear regression is suitable, and if the residual plot isn’t randomly scattered, that is why the nonlinear regression is used. However, I am already starting with a model that I know is nonlinear. I want to use the residual plots to show that there is no unexplained error in the model, not to decide between a linear or nonlinear regression. Is this alright or is the residual plot pointless? I’m trying to show the goodness of fit of the model.

When I plotted the residual plot, the points were scattered randomly but they lie within certain intervals, i.e., there is an empty space with no data points in the middle of the graph. This makes sense because the fitted values are the high and low tides, so I think the low tides are in the lower interval, while the high tides are grouped around the higher interval. This doesn’t mean the points are predictable, right? At least not on the y-axis or response, which is what the stochastic error is.

Jim Frost says

Hi Grace,

Nonlinear regression requires you to check the residual plots in the same manner as linear regression. You want the residuals to be randomly scattered around zero. I’m assuming that you know the difference between linear models that fit curves versus truly nonlinear models. For details, read my post about linear vs. nonlinear regression.

As you say, residual plots can help you determine whether a linear model is adequate for your data. Problems in the residual plots don’t necessarily mean you should go straight to a nonlinear model. Perhaps you need to fit curvature that is present or include additional variables. However, in some cases, you can’t get an adequate fit using linear regression and you’d consider using nonlinear regression.

It sounds like you’re deciding based on subject-area knowledge that you need to use a nonlinear model. And, I think it’s fantastic that you’ve done advance research to make that determination! Seeing how others have handled modeling in a research area is a crucial step. To answer your question, yes, check those residual plots for your nonlinear model to help you determine whether your model provides a good fit.

Assuming that you’re looking at a residuals versus fitted values (or actual values) plot, that gap doesn’t mean that they’re predictable as long as each cluster centers on zero. However, that gap between points on the residual plots worries me for another reason. When you have two clusters of dots like that, the model treats them almost as two data points, and it’s fitting those two data points. However, it’s possible that within each cluster the fit is biased a bit. That may or may not be happening. Make sure each cluster of points is randomly scattered around zero with no patterns inside each cluster. Also, I’d consider fitting separate models for low and high tides and comparing those model fit results to the model with both just to see.

Callum Brindley says

Jim, great website! Can you possibly expand on how this article relates to the OLS assumption about fixed versus random (stochastic) regressors? I couldn’t find this issue discussed anywhere. If I understand correctly, when we assume fixed regressors we are saying that the regressors and errors are independent and uncorrelated. Referencing the essential parts of a regression mentioned at the beginning of this article (i.e., dependent variable = deterministic + stochastic), this means there’s nothing random apart from the error. It also means that our parameter estimates are unbiased (correct on average) and efficient (close to the true value). I’m struggling to understand what this assumption means intuitively with examples of random and non-random regressors. Thanks for any insights or intuitive explanations you can share.

Sanne Nelissen says

Dear Jim,

Your articles are so helpful for my research!

I am doing a research internship for my thesis and I want to test whether my continuous variables are linearly related to each other. However, in my scatter plots I don’t really see a clear correlation.

Could you help me find a way to test linearity and decide whether I can do a linear regression analysis or not?

Thank you so much!

Sanne

Jim Frost says

Hi Sanne,

Thank you! And, I’m very happy to hear that my posts have been helpful!

It sounds like you’re doing the right thing to look for relationships between your continuous variables. I always recommend creating scatterplots as a first step for looking for relationships between continuous variables. If there is a relationship, you should be able to see it on the graph. You can also calculate correlation coefficients. However, I recommend scatterplots first because correlation only looks at straight-line relationships and can miss curved relationships, which you’d still see on the scatterplots. So, you’re doing the right thing! Here’s a link to my blog post for correlation that might help.

If you’re not seeing a relationship, is it possible that there isn’t one? Or maybe you have too few observations to see it? Another possibility is that a confounding variable is masking the relationship.

Best of luck with your analysis!

JAMIL says

Whenever I face any problem with statistics OR with my research I browse your website. Thank you so much for such a helpful content, your posts are helping me understand a lot.

Jim Frost says

You’re very welcome, Jamil. It makes me happy to hear that my posts are helping you out! 🙂

Lola says

Jim, thanks for all of your great articles and most importantly congratulations on your book. I highly recommend it to anyone here that has found themselves returning to Jim’s articles over and over again. It covers all of the main topics of regression in a thoughtful order to build on each new topic. It has really helped me gain a deeper understanding and appreciation of regressions.

I’m hoping you could help me determine if I’m applying the residual plots of a regression correctly.

Are residual plots still useful/important to check when the regression uses Newey-West standard errors? If I understand correctly, using Newey-West standard errors does not affect the residuals because the coefficients are the same as a regression using the regular standard errors. Therefore, the plots of residuals versus predicted values are the same for both versions of the regression. Is this correct or am I missing something?

My specific worry is that my regression is using Newey-West standard error to account for heteroskedasticity and autocorrelation. The residuals from this regression are displaying patterns. Is this still a valid indication that I’m missing a variable or have some other type of model misspecification?

Thanks again for all your help, I look forward to your next book!

Betty LIU says

Thanks Jim!

I’m a university student. We have just started learning these things and this page is very useful.

Actually, I drew a residual plot and it looks like the picture you gave: the dots are uniformly distributed around zero except for one outlier, and the points lie between -5 and 5. Our target is to find whether the residuals satisfy the assumptions of a test for the existence of regression. But I am not sure what the assumptions are… How can I know from the residual plot whether those assumptions are satisfied? Or does the perfectly distributed graph imply the existence of regression?

Jim Frost says

Hi Betty,

That sounds like a promising result for your residual plots, although you should check the potential outlier.

I’ve written a post about checking the assumptions for linear regression. In that post, I go over the assumptions and how to check them.

Also, if you like how I write about regression, you might consider getting my ebook about regression analysis.

Best of luck with your analysis!

philoinme says

Yeah that helped.

Which one is good practice – evaluate assumptions with the data I have and fit the model or fit the model, evaluate the assumptions and then improve it?

Jim Frost says

Hi, regression is a bit different than many other statistical analyses. For other analyses, you can often check some assumptions in advance, such as whether the data follow the normal distribution. In regression analysis, you typically assess the assumptions through the residuals, which means you have to fit the model first. For example, you fit a regression model and then determine whether the residuals follow the normal distribution. There aren’t assumptions about the DV or the IVs. At least not strictly. However, if you have a severely skewed DV, it’ll be difficult to obtain normally distributed residuals without using, say, a transformation.

Consequently, start by fitting the model as best as you can, hopefully guided by reviewing similar studies, but then assess the residual plots and change as needed.

philoinme says

Nice, Jim. I liked the post and most of your content focuses on application part and answers all the WHYs.

Question:

Seeing patterns in the residual plots when fitting a smaller amount of data (say <100) is typical and easy. That doesn't mean the model I fit always has some scope to improve. How should I deal with this?

Jim Frost says

Hi,

I’d disagree that patterns in the residual plots for small datasets are no reason for concern. Patterns in residual plots typically indicate that your model violates at least one assumption, and that is concerning. I have a post about OLS assumptions that describes problems of violating the assumptions and how to address different types of problems in residual plots. The solution depends on the pattern or problem that you detect. There is no general answer I can give you.

I hope this helps!

Felix says

Thank you for your response. Actually, this is an image processing area of work.

However, the data are extracted features (vector features) from an image-processing data set. Since a very low RMSE value is said to be good, does that mean having zero is OK too?

Thank you.

Felix says

Hello Jim, thank you for the good work you are doing here. I have a doubt on the result of my residual pattern of data set. How can I interpret this where I just have a dot on the zero plane and my RMSE value is zero.

Thank you.

Jim Frost says

Hi Felix,

If you have just one dot on your residual plot, does this mean you have only one observation in your dataset?! It sounds like your model is explaining all the variance.

Tridib dutta says

Hi Jim,

Let me thank you for these posts. They are full of information that is useful and hard to extract from textbooks. So for a self-taught or statistically naive person, these posts are wonderful. Having said that, I would like to ask you a question. I have tried to use a regression model with other algorithms (support vector). Those have completely different sets of assumptions. I am wondering whether, in these cases, looking at the residual plots would help improve the model. In my case, the residuals vs. fitted values plot has a somewhat cone-like shape, with the tip of the cone at the origin, that spreads to the right (all my predictors either have positive values or are categorical in nature).

Your thoughts would be appreciated.

Thanks,

Jim Frost says

Hi Tridib,

What you describe sounds just like heteroscedasticity! Read my post about detecting heteroscedasticity to see if that helps you. I include several methods of how to handle it. I’m not completely familiar with the assumptions your analysis requires, but I’m expecting heteroscedasticity is not a good thing for it!
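A quick way to put a number on a cone shape like that is to check whether the size of the residuals grows with the fitted values. The sketch below simulates data where the error standard deviation is proportional to the predictor (an assumption made purely to produce a cone) and fits plain OLS rather than a support vector model. It is an informal check, not a formal test such as Breusch-Pagan.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
x = rng.uniform(1, 10, size=n)

# Error spread grows with x, which produces the cone shape
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=n) * 0.5 * x

X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
residuals = y - fitted

# Do the absolute residuals get larger as the fitted values increase?
spread_corr = np.corrcoef(np.abs(residuals), fitted)[0, 1]

# Compare residual spread in the lowest and highest thirds of fitted values
order = np.argsort(fitted)
low, high = order[: n // 3], order[-(n // 3):]
sd_low = residuals[low].std(ddof=1)
sd_high = residuals[high].std(ddof=1)

print(f"corr(|residuals|, fitted): {spread_corr:.3f}")
print(f"residual sd, lowest third:  {sd_low:.3f}")
print(f"residual sd, highest third: {sd_high:.3f}")
```

With constant variance, the correlation should hover around zero and the two standard deviations should be similar. A clearly positive correlation and a much larger spread at the high end match the cone described above.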

Mira says

Hi Jim!

I just wanted to thank you for all this work! I really like your blog and I’m going to use it to refresh contents I should already know (ups). I really appreciate how you always focus on how to apply the concepts, which was exactly what I was looking for. And I’m shocked you actually take time to answer questions. Thanks!

Jim Frost says

Hi Mira,

I’m so happy to hear that my website has been helpful. I work hard to make the explanations clear and helpful, so your kind words really mean a lot! Thank you!

Surya says

Hi Jim,

In the above post on residuals,

1) If they exhibit a pattern, it means some variation in the DV is still to be explained by the IVs. Can I say that SSR (sum of squares of regression) is less than it ought to be and SSE (sum of squares of error) is more than it ought to be? And that SSE’s reduction should be SSR’s gain?

And

2) Do you also have plan to publish articles on Logistic regression as well ?

3) Please ensure your book is available in India as well. You have fans over here 🙂

Jim Frost says

Hi Surya,

Yes, that’s exactly it. SSR + SSE = TSS. The total sum of squares (TSS) is independent of your model. Given a fixed dataset, it should be a fixed value. Therefore, if you specify different models with the same dataset and that changes the SSR, then SSE must change accordingly. You’re correct: if SSR decreases, then SSE must increase by the same amount. If you have missing values in your data, it can complicate that slightly. Missing values can cause entire observations to be omitted from the analysis, which can change TSS slightly depending on the model and the observations that are excluded.
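The SSR + SSE = TSS relationship can be checked numerically. The sketch below uses simulated data and a hypothetical helper named `sums_of_squares`, both of which are assumptions for illustration. It fits two models to the same complete dataset, both with a constant.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (an assumption for illustration).
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)


def sums_of_squares(X, y):
    """Return (SSR, SSE, TSS) for an OLS fit of y on X.

    X must already contain a column of ones for the constant.
    """
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    ssr = np.sum((fitted - y.mean()) ** 2)  # regression sum of squares
    sse = np.sum((y - fitted) ** 2)         # error sum of squares
    tss = np.sum((y - y.mean()) ** 2)       # total sum of squares
    return ssr, sse, tss


ones = np.ones(n)
small = sums_of_squares(np.column_stack([ones, x1]), y)      # omits x2
full = sums_of_squares(np.column_stack([ones, x1, x2]), y)   # includes x2

for name, (ssr, sse, tss) in [("x1 only", small), ("x1 + x2", full)]:
    print(f"{name}: SSR={ssr:.1f}  SSE={sse:.1f}  TSS={tss:.1f}")
```

TSS is the same for both models because it depends only on the response values. Adding `x2` moves sum of squares from SSE to SSR by the same amount.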

I have one post about logistic regression, where I show an example of binary logistic regression. I do plan to write a more in-depth article about it. Hopefully, mid to late 2019.

Thanks for asking about my book! I’ve been working hard to finish that up and I’m almost there! Initially, it’ll be available only as an ebook and I expect you should be able to purchase that from anywhere. And, I’ll make sure it’s available in India! I have a soft spot for India because I’ve been there several times. And, the second highest number of visitors come to my website from India! 🙂

Thanks for writing. It’s great hearing from you again!

Curious says

Hi Jim, great post!

I’ve a question: While choosing a model that best fits the data at hand, should one prefer the model that minimises RMSE (i.e. standard deviation of residuals) which is the general practice, or the model that most randomises the residuals i.e. extracts all the systematic information from the X-variables leaving only random noise?

Many thanks!

Jim Frost says

Hi,

This is a tricky question! There’s no solid rule. Yes, you generally want to minimize RMSE or, conversely, maximize R-squared. However, as I write in various places (see R-squared for instance), chasing a closer fit can lead to problems. For example, that approach can lead you to overfit your model and you won’t be able to trust the results.

However, you usually do not want patterns in the residuals. And, when I say “usually,” I mean almost never! And, typically, models that produce residuals with patterns won’t have the lowest RMSE. You should have some sort of improvement by fitting a better model that eliminates the residual patterns. That addresses two issues simultaneously: eliminating the pattern and improving the fit.

To summarize, you probably don’t want a model with residual patterns. You want a model with random residuals and a relatively low RMSE or relatively high R-squared. But, you might not necessarily want to minimize/maximize those values. For more of my thoughts on this topic, read my post about choosing the correct regression model.
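To make the link between residual patterns and fit concrete, here is a small sketch with simulated curved data. The data, the helper name `fit_ols`, and the use of the correlation between the residuals and the squared predictor as a "pattern" measure are all assumptions for this example. Note that the RMSE here divides by n rather than by the residual degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data with genuine curvature (an assumption for illustration).
n = 300
x = rng.uniform(-3, 3, size=n)
y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(0, 1.0, size=n)


def fit_ols(X, y):
    """Fit OLS and return (residuals, RMSE)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    rmse = np.sqrt(np.mean(residuals**2))
    return residuals, rmse


ones = np.ones(n)
res_lin, rmse_lin = fit_ols(np.column_stack([ones, x]), y)          # misses curvature
res_quad, rmse_quad = fit_ols(np.column_stack([ones, x, x**2]), y)  # models curvature

# Leftover curvature shows up as correlation between residuals and x squared.
pattern_lin = np.corrcoef(res_lin, x**2)[0, 1]
pattern_quad = np.corrcoef(res_quad, x**2)[0, 1]

print(f"Linear:    RMSE={rmse_lin:.2f}, corr(residuals, x^2)={pattern_lin:.2f}")
print(f"Quadratic: RMSE={rmse_quad:.2f}, corr(residuals, x^2)={pattern_quad:.2f}")
```

In this simulation, the model that removes the residual pattern also has the lower RMSE. That will not hold for every pair of models, which is why both the plots and the fit statistics are worth checking.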

Keegan says

Hi Jim,

Are residual probability plot interpretations the same for linear and nonlinear regression models?

Jim Frost says

Hi Keegan,

Yes, in fact, both the interpretation and the requirements for the residual plots are the same for linear and nonlinear regression.

Muhammad Khan says

Hey Jim, Hope you are doing well. I am still not very clear about the reasons for why we need the residuals to be normally distributed. Could you please elaborate on that?

Jim Frost says

Hi Muhammad, if you want to be able to use the p-values to determine which predictors are statistically significant, those hypothesis tests assume that the residuals are normally distributed. Also, the CIs for the coefficients and the prediction intervals for new observations all assume that the residuals are normally distributed. If the residuals are not normally distributed, you might not be able to trust those results. I talk about this specific assumption a bit more in my post for OLS Assumptions. Look for the specific assumption about this issue.
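One common way to assess the normality of residuals is a normal probability (Q-Q) plot. The sketch below computes the plot's points using NumPy and the standard library's `statistics.NormalDist`. The helper name `qq_points`, the plotting positions (i − 0.5)/n, and the simulated residuals are assumptions for illustration, and the correlation of the Q-Q points is only a rough summary, not a formal test.

```python
import numpy as np
from statistics import NormalDist


def qq_points(residuals):
    """Return (theoretical, sample) quantiles for a normal Q-Q plot."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = r.size
    # Plotting positions (i - 0.5) / n, one common convention.
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    # Standardize the sorted residuals so both axes share a scale.
    sample = (r - r.mean()) / r.std(ddof=1)
    return theoretical, sample


rng = np.random.default_rng(3)
normal_res = rng.normal(size=500)             # stand-in for well-behaved residuals
skewed_res = rng.exponential(size=500) - 1.0  # stand-in for right-skewed residuals

# Points that follow a straight line have a correlation close to 1.
qq_corr_normal = np.corrcoef(*qq_points(normal_res))[0, 1]
qq_corr_skewed = np.corrcoef(*qq_points(skewed_res))[0, 1]

print(f"Q-Q correlation, normal residuals: {qq_corr_normal:.3f}")
print(f"Q-Q correlation, skewed residuals: {qq_corr_skewed:.3f}")
```

Plotting `theoretical` against `sample` gives the usual picture: normally distributed residuals fall close to a straight line, while skewed residuals bend away from it.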

I hope this helps!

Om Shanker Pandey says

I have read so many posts on statistics and heard so many people talk about regressions but loved the way you explain things. It is crystal clear and love reading your blogs. Please write more.

Jim Frost says

Thank you very much. I strive to make my blog posts as clear as possible so your comment means a lot to me. I’m glad that you found them to be helpful!

Dale says

Hi Jim,

I have a partial regression plot that looks very cyclical. So while it is technically scattered around 0, it seems like there is indeed a pattern. Would this mean that a polynomial factor would improve the model as you suggested? Also, is a partial regression plot the same as a residual plot? I am still a beginner as you can tell.

Sundar says

Hi Jim,

Really nice article, put in a simple way for a beginner like me to understand. However, I plotted fitted values vs. residuals for my model, and though they appear random, almost all of the values are on the positive side of the axis. In your post you have mentioned that the values should be “randomly scattered around zero”. So does it mean mine is showing a pattern?

Jim Frost says

Hi Sundar,

If almost all of your residuals are positive, that indicates that your model is biased. Because a residual is the observed value minus the fitted value, mostly positive residuals mean the fitted values are systematically lower than the observed values. The residuals should be randomly scattered around zero. In fact, one of the assumptions for ordinary least squares regression is that the mean of the residuals equals zero.

If you fit a model with the constant (which is almost always the case), this forces the mean of your residuals to equal zero. For more information about this, read my post about the constant in a regression model. This leads to two possible cases based on whether your model includes the constant:

If your model does not include the constant, consider adding it because your model will satisfy that particular assumption at least. You must still check the other assumptions.

However, if your model does include the constant and it appears like almost all of your residuals are positive, that is an interesting case. Because of the constant, you know the mean must equal zero even though most of the residuals are positive. One possibility is that you have a few unusual observations with large negative residuals that offset the multitude of positive residuals and produce that zero mean. In this scenario, you’d need to investigate those data points and determine whether those observations are outliers that should be removed from the model.
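The effect of the constant on the mean of the residuals can be seen in a short simulation. The data and variable names below are assumptions for illustration: the true relationship has a nonzero intercept, and the same data are fit with and without a constant.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated data with a true intercept of 5 (an assumption for illustration).
n = 200
x = rng.uniform(0, 10, size=n)
y = 5.0 + 2.0 * x + rng.normal(size=n)

# Model WITH the constant: a column of ones plus the predictor.
X_const = np.column_stack([np.ones(n), x])
beta_const, *_ = np.linalg.lstsq(X_const, y, rcond=None)
mean_res_const = np.mean(y - X_const @ beta_const)

# Model WITHOUT the constant: regression through the origin.
X_no_const = x.reshape(-1, 1)
beta_no_const, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)
mean_res_no_const = np.mean(y - X_no_const @ beta_no_const)

print(f"Mean residual with constant:    {mean_res_const:.6f}")
print(f"Mean residual without constant: {mean_res_no_const:.6f}")
```

With the constant, the mean residual is zero up to floating-point error. Without it, nothing forces the residuals to average out, so the mean can drift away from zero.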

I hope this helps!

Naman says

Hi Jim,

As usual thank you for sharing your explanation and understanding of statistics in such a fluid and easy way.

It would be great if you can share some methods to fix auto correlation problem.

Jim Frost says

That’s a great topic for a future blog post!

Mislav says

Hello, Mr Frost,

Since your explanations seem theoretically sound and also understandable to less educated statisticians, I would be glad if you can help me with regression. As far as I understand it has something to do with residuals.

I made a regression model (intended for prediction), and it looks very good (equation confirmed in other research, nice fit, high R2, residuals vs. fitted values plot is almost OK). However, when I tried to predict values using this model, a kind of bias “popped out”: higher values of y were underestimated, and lower values were overestimated. Then I looked at the plot of residuals vs. observed values, and there was a clear trend: negative residuals for small y, positive for large y. I read some spurious explanation that it must happen, and I am asking: what is the theoretical cause and how do I correct that, practically? Does it maybe have something to do with the data? Or maybe the validation set should be different from the modeling data set? It seems that there is no lack of fit, nor do I have any other clever independent variables to include. Hoping my problem and question are clear enough, thank you in advance.

Jim Frost says

Hi Mislav, assessing residual plots is the perfect way to pick up problems like this one. When you see non-random patterns like this, you often have an under-specified model. In other words, you might be missing a variable, or not specifying the curvature correctly. A high R-squared by itself does not tell you that your model is good. Below are several links that will help you specify an unbiased model:

Model specification

Curve fitting

R-squared (high R-squared values are not necessarily good)

I hope this helps!

Cathrine says

Hello Jim,

The fitted values that you plot against the residuals, what are they?

In the case of multiple linear regression, are they the mean of all the estimated independent variables in the model? Or just the Parameter of Interest?

Kind regards,

Jim Frost says

Hi Cathrine, that’s a great question and it suggests to me that I need to make my post more clear about that!

Here’s how a residuals by fitted value plot is created. The software uses the regression model to make a prediction for each observation (row in your data table). That prediction is the fitted value and it falls on the X-axis on the scatterplot. A fitted value is a statistical model’s prediction of the mean response value when you input the values of the predictors, factor levels, or components into the model.

The residual is the difference between the observed value and the fitted value that the model predicts for each observation. This value falls on the Y-axis of the scatterplot.

Consequently, each observation in your data set produces the pair of X (fitted value) and Y (residual) values that are graphed on the scatterplot. The goal of a regression model is that the residuals do not fall systematically above or below zero because zero indicates that the model made a perfect prediction. If the residuals are systematically high or low, your model is biased and needs to be fixed. Some models might work better for high, medium, or low fitted values, which is why we use that for the X-axis.
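The construction described above can be sketched in a few lines of Python with NumPy. The simulated data and variable names are assumptions for illustration. The point is that each observation gets exactly one fitted value, computed from all of the predictors together, and one residual.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated multiple regression data (an assumption for illustration).
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 + 1.5 * x1 - 2.0 * x2 + rng.normal(size=n)

# Fit OLS with a constant and two predictors.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# One fitted value per row: the model's prediction using all predictors.
fitted = X @ beta

# One residual per row: observed value minus fitted value.
residuals = y - fitted

# Each row is an (X, Y) point on the residuals vs. fitted values plot.
plot_pairs = np.column_stack([fitted, residuals])
print(plot_pairs[:5])
```

Passing these pairs to a scatterplot function, for example `plt.scatter(fitted, residuals)` in matplotlib, produces the residuals by fitted value plot.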

I hope this answers your questions!

jonamjar says

Why didn’t I find this before?!! Amazing

Jim Frost says

Thank you so much! And, I’m glad you found my blog!

Ghouse says

brilliant explanation. Thanks Jim

Mohammad Kamel says

Thanks Jim, your articles are really helpful, you make the statistics concepts very easy and logical

Jim Frost says

Thank you, Mohammad! I really appreciate the kind words and I’m so glad that you find them to be helpful!

Nate says

I have a question about my residual plot. I am looking at the influence of precipitation on population change in a species of ground nesting birds. I looked at precipitation for a given month as a percent of the average precipitation for that month. There was a correlation in March and August. So I conducted a regression analysis for the population change and precipitation for a given month. There is a positive relationship in March and negative in August. The residual plot in March does not show a pattern but the August residual plot shows a pattern. How would I look at the combination of March precipitation and August precipitation combined and population change?

Jim Frost says

Hi Nate, I don’t completely understand how your study is set up. It seems like you have separate models for each month rather than one model? Typically, you want the residual plots to be random. You don’t want patterns in the residuals even when you have correlations in the data. Because you see a pattern in August, I’d be concerned that your model doesn’t fit the data well for that month.

If I understand correctly, in March, the more rain you have, the more the population increases? And, in August, more rain equals less population? I don’t know the study area but you’d have to determine whether that reversal makes sense. Also, with time series data, you have to be really careful about time order effects sneaking in. Is it possible that rain is correlated with something else that is driving population change? Maybe in March rain occurs at the same time as something else that actually drives the population to increase? And, in August maybe it happens to correspond with something that causes the population to decrease? I’m not saying that’s the case for sure, but you have to think about possibilities like that. Especially when the relationship changes direction like that. Does that make theoretical sense?

john says

Jim, can you give an example of “To resolve this issue, try adding an independent variable that contains the pertinent time information”? What is the pertinent time information? Do you have another post that addresses it? Thanks, John

Jim Frost says

Hi John, I don’t have a post on that topic yet but will write one at some point. Suppose your data are time series data. You could include a lagged variable for an independent variable. For example, if you think a variable has a delayed effect on the outcome, you can lag the variable so that the value from a previous observation appears in the current observation. You could also possibly add a variable that indicates the month, day, season, hour, etc. if you think that is relevant to the outcome. These types of variables can all supply important information to the model. A model that is missing important information can provide untrustworthy results. Sometimes this includes time-related information. As always, you have to use subject-area knowledge and expertise to include the correct information.
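As a rough sketch of what lagged and time-indicator variables look like in practice, the Python example below builds both with NumPy. The simulated monthly series, the one-period lag, and names such as `x_lag1`, `month_dummies`, and `error_ss` are assumptions for illustration, not a recommendation for any particular dataset.

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated series of 120 consecutive monthly observations (an assumption).
n = 120
x = rng.normal(size=n)
month = np.arange(n) % 12  # 0 = first month of the year, ..., 11 = last

# The outcome responds to the PREVIOUS period's x (a delayed effect).
y = np.empty(n)
y[0] = rng.normal()
y[1:] = 1.0 + 2.0 * x[:-1] + rng.normal(0, 0.5, size=n - 1)

# Build the lagged variable: row t holds the value of x from row t - 1.
# The first observation has no previous value, so it is dropped.
y_cur = y[1:]
x_cur = x[1:]
x_lag1 = x[:-1]


def error_ss(X, y):
    """Return the error sum of squares for an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)


ones = np.ones(n - 1)
sse_current_only = error_ss(np.column_stack([ones, x_cur]), y_cur)
sse_with_lag = error_ss(np.column_stack([ones, x_cur, x_lag1]), y_cur)

# Month indicator (dummy) columns: 11 columns, with month 0 as the baseline.
month_dummies = (month[1:, None] == np.arange(1, 12)).astype(float)

print(f"SSE without lag: {sse_current_only:.1f}")
print(f"SSE with lag:    {sse_with_lag:.1f}")
```

In this simulation, the model without the lagged variable is missing the information that drives the outcome, so its error sum of squares is much larger. The `month_dummies` columns could be added to the design matrix in the same way if the month is thought to matter.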

Thanks for writing with the great question!

Jim

Nithashanasar says

Wow.. It’s such an amazing discovery… I believe that u will be a mile stone in statistics… Here ..it facilitate the concept of scatter plot…am doing MSC statistics.. So…I am really proud of u..thank jim

Jim Frost says

Hi Nithashanasar, I’m very happy to hear that this is helpful for you! Also, thank you so much for your kind words. I really appreciate it!

Jim