Ordinary Least Squares (OLS) is the most common estimation method for linear models—and that’s true for a good reason. As long as your model satisfies the OLS assumptions for linear regression, you can rest easy knowing that you’re getting the best possible estimates.

Regression is a powerful analysis that can handle multiple variables simultaneously to answer complex research questions. However, if you don’t satisfy the OLS assumptions, you might not be able to trust the results.

In this post, I cover the OLS linear regression assumptions, explain why they’re essential, and help you determine whether your model satisfies them.

## What Does OLS Estimate, and What Are Good Estimates?

First, a bit of context.

Regression analysis is like other inferential methodologies. Our goal is to draw a random sample from a population and use it to estimate the properties of that population.

In regression analysis, the coefficients in the regression equation are estimates of the actual population parameters. We want these coefficient estimates to be the best possible estimates!

Suppose you request an estimate—say for the cost of a service that you are considering. How would you define a reasonable estimate?

- The estimates should tend to be right on target. They should not be systematically too high or too low. In other words, they should be unbiased or correct on average.
- Recognizing that estimates are almost never exactly correct, you want to minimize the discrepancy between the estimated value and actual value. Large differences are bad!

These two properties are exactly what we need for our coefficient estimates!

When your linear regression model satisfies the OLS assumptions, the procedure generates unbiased coefficient estimates that tend to be relatively close to the true population values (minimum variance). In fact, the Gauss-Markov theorem states that when the assumptions hold true, OLS produces the best estimates among all linear unbiased estimators.

For more information about the implications of this theorem on OLS estimates, read my post: The Gauss-Markov Theorem and BLUE OLS Coefficient Estimates.

## The Seven Classical OLS Assumptions

Like many statistical analyses, ordinary least squares (OLS) regression has underlying assumptions. When these classical assumptions for linear regression are true, ordinary least squares produces the best estimates. However, if some of these assumptions are not true, you might need to employ remedial measures or use other estimation methods to improve the results.

Many of these assumptions describe properties of the error term. Unfortunately, the error term is a population value that we’ll never know. Instead, we’ll use the next best thing that is available—the residuals. Residuals are the sample estimate of the error for each observation.

Residual = Observed value − Fitted value

When it comes to checking OLS assumptions, assessing the residuals is crucial!
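To make this concrete, here’s a minimal sketch in Python using statsmodels and simulated data (the variable names and numbers are purely illustrative). It fits a model and confirms that the residuals are simply the observed values minus the fitted values:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

# Simulated data: a true linear relationship plus random error.
x = rng.uniform(0, 10, size=100)
y = 5 + 2 * x + rng.normal(0, 1, size=100)

# Fit the model by OLS (add_constant includes the intercept).
model = sm.OLS(y, sm.add_constant(x)).fit()

# Residual = observed value - fitted value.
residuals = y - model.fittedvalues
print(np.allclose(residuals, model.resid))  # True
```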

There are seven classical OLS assumptions for linear regression. The first six are mandatory to produce the best estimates. While the quality of the estimates does not depend on the seventh assumption, analysts often evaluate it for other important reasons that I’ll cover.

## OLS Assumption 1: The regression model is linear in the coefficients and the error term

This assumption addresses the functional form of the model. In statistics, a regression model is linear when all terms in the model are either the constant or a parameter multiplied by an independent variable. You build the model equation only by adding the terms together. These rules constrain the model to one type:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε

In the equation, the betas (βs) are the parameters that OLS estimates. Epsilon (ε) is the random error.

In fact, the defining characteristic of linear regression is this functional form of the *parameters* rather than the ability to model curvature. Linear models can model curvature by including nonlinear *variables*, such as polynomial terms, or by transforming the variables (for example, taking the log of an exponential relationship).

To satisfy this assumption, the correctly specified model must fit the data’s pattern using this linear functional form.
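If you’d like to see the “linear in the parameters” idea in action, here’s an illustrative sketch with simulated data. Even though the relationship is curved, the model stays linear because every term is a parameter multiplied by a variable:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=200)

# A curved (quadratic) relationship in the data.
y = 3 + 1.5 * x - 0.4 * x**2 + rng.normal(0, 0.5, size=200)

# Including x**2 as a column keeps the model linear in the betas:
#   y = b0 + b1*x + b2*x^2 + error
X = sm.add_constant(np.column_stack([x, x**2]))
model = sm.OLS(y, X).fit()
print(model.params)  # estimates should be close to [3, 1.5, -0.4]
```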

**Related posts**: The Difference Between Linear and Nonlinear Regression and How to Specify a Regression Model

## OLS Assumption 2: The error term has a population mean of zero

The error term accounts for the variation in the dependent variable that the independent variables do not explain. Random chance should determine the values of the error term. For your model to be unbiased, the average value of the error term must equal zero.

Suppose the average error is +7. This non-zero average error indicates that our model systematically underpredicts the observed values. Statisticians refer to systematic error like this as bias, and it signifies that our model is inadequate because it is not correct on average.

Stated another way, we want the expected value of the error to equal zero. If the expected value is +7 rather than zero, part of the error term is predictable, and we should add that information to the regression model itself. We want only random error left for the error term.

You don’t need to worry about this assumption when you include the constant in your regression model because it forces the mean of the residuals to equal zero. For more information about this assumption, read my post about the regression constant.
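Here’s a quick illustrative check, again with simulated data, showing that including the constant forces the mean of the residuals to (essentially) zero, while omitting it does not:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=100)
y = 5 + 2 * x + rng.normal(0, 1, size=100)

# With the constant, the mean residual is zero (up to floating point).
with_const = sm.OLS(y, sm.add_constant(x)).fit()
print(with_const.resid.mean())  # ~0.0

# Without the constant, nothing forces the residuals to average zero.
without_const = sm.OLS(y, x).fit()
print(without_const.resid.mean())  # generally nonzero
```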

## OLS Assumption 3: All independent variables are uncorrelated with the error term

If an independent variable is correlated with the error term, we can use the independent variable to predict the error term, which violates the notion that the error term represents unpredictable random error. We need to find a way to incorporate that information into the regression model itself.

This assumption is also referred to as exogeneity. When this type of correlation exists, there is endogeneity. Violations of this assumption can occur because there is simultaneity between the independent and dependent variables, omitted variable bias, or measurement error in the independent variables.

Violating this assumption biases the coefficient estimate. To understand why this bias occurs, keep in mind that the error term always explains some of the variability in the dependent variable. However, when an independent variable correlates with the error term, OLS incorrectly attributes some of the variance that the error term actually explains to the independent variable instead. For more information about violating this assumption, read my post about confounding variables and omitted variable bias.
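A small simulation can illustrate this bias. In the sketch below (all names and values are made up for illustration), an omitted variable z correlates with the included predictor x, and its effect leaks into the error term:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 10_000

# z affects y and also correlates with x. Omitting z pushes its
# effect into the error term, which then correlates with x.
z = rng.normal(0, 1, n)
x = 0.8 * z + rng.normal(0, 1, n)
y = 2 + 1.0 * x + 1.5 * z + rng.normal(0, 1, n)

full = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()
omitted = sm.OLS(y, sm.add_constant(x)).fit()

print(full.params[1])     # ~1.0 (unbiased)
print(omitted.params[1])  # noticeably above 1.0 (biased upward)
```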

## OLS Assumption 4: Observations of the error term are uncorrelated with each other

One observation of the error term should not predict the next observation. For instance, if the error for one observation is positive and that systematically increases the probability that the following error is positive, that is a positive correlation. If the subsequent error is more likely to have the opposite sign, that is a negative correlation. This problem is known both as serial correlation and autocorrelation.

Assess this assumption by graphing the residuals in the order that the data were collected. You want to see randomness in the plot. In a time-order plot for a sales model, for example, a cyclical pattern indicates a positive correlation.

As I’ve explained, if you have information that allows you to predict the error term for an observation, you need to incorporate that information into the model itself. Serial correlation is most likely to occur in time series models. To resolve this issue, you might need to add an independent variable to the model that captures this information. For the sales model, we probably need to add variables that explain the cyclical pattern.

Serial correlation reduces the precision of OLS estimates.
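One common numeric check, shown here as an illustrative sketch on simulated data, is the Durbin-Watson statistic. Values near 2 indicate no autocorrelation, while values toward 0 indicate positive serial correlation:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(0, 10, n)

# Build autocorrelated errors: each error carries over part of the previous one.
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.8 * e[t - 1] + rng.normal(0, 1)

y = 5 + 2 * x + e
res = sm.OLS(y, sm.add_constant(x)).fit()

print(durbin_watson(res.resid))  # well below 2 here, flagging positive autocorrelation
```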

## OLS Assumption 5: The error term has a constant variance (no heteroscedasticity)

The variance of the errors should be consistent for all observations. In other words, the variance does not change for each observation or for a range of observations. This preferred condition is known as homoscedasticity (same scatter). If the variance changes, we refer to that as heteroscedasticity (different scatter).

The easiest way to check this assumption is to create a residuals versus fitted values plot. On this type of graph, heteroscedasticity appears as a cone shape, where the spread of the residuals increases in one direction, such as the spread growing as the fitted values increase.

Heteroscedasticity reduces the precision of the estimates in OLS linear regression.
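Besides eyeballing the residuals versus fits plot, you can run a formal test. This sketch, again on simulated data, uses the Breusch-Pagan test from statsmodels; a small p-value is evidence of heteroscedasticity:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(11)
x = rng.uniform(1, 10, 300)

# The error spread grows with x, producing the classic cone shape.
y = 5 + 2 * x + rng.normal(0, 0.5 * x)

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

# Returns (LM statistic, LM p-value, F statistic, F p-value).
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(res.resid, X)
print(lm_pvalue)  # a small p-value is evidence of heteroscedasticity
```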

**Related post**: Heteroscedasticity in Regression Analysis

Note: When assumptions 4 (no autocorrelation) and 5 (homoscedasticity) are both true, statisticians say that the error term is independent and identically distributed (IID) and refer to them as spherical errors.

## OLS Assumption 6: No independent variable is a perfect linear function of other explanatory variables

Perfect correlation occurs when two variables have a Pearson’s correlation coefficient of +1 or -1. When one of the variables changes, the other variable also changes by a completely fixed proportion. The two variables move in unison.

Perfect correlation suggests that two variables are different forms of the same variable. For example, games won and games lost have a perfect negative correlation (-1). The temperature in Fahrenheit and Celsius have a perfect positive correlation (+1).

Ordinary least squares cannot distinguish one variable from the other when they are perfectly correlated. If you specify a model that contains independent variables with perfect correlation, your statistical software can’t fit the model, and it will display an error message. You must remove one of the variables from the model to proceed.

Perfect correlation is a show stopper. However, your statistical software can fit OLS regression models with imperfect but strong relationships between the independent variables. If these correlations are high enough, they can cause problems. Statisticians refer to this condition as multicollinearity, and it reduces the precision of the estimates in OLS linear regression.
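Short of perfect correlation, you can gauge how severe multicollinearity is with variance inflation factors (VIFs). Here’s an illustrative sketch on simulated predictors; common rules of thumb flag VIFs above roughly 5 to 10:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
n = 500

x1 = rng.normal(0, 1, n)
x2 = 0.95 * x1 + rng.normal(0, 0.2, n)  # strongly, but not perfectly, correlated with x1
x3 = rng.normal(0, 1, n)                # independent of the others

X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF for each predictor (column 0 is the constant, so start at 1).
for i, name in enumerate(["x1", "x2", "x3"], start=1):
    print(name, variance_inflation_factor(X, i))
```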

**Related post**: Multicollinearity in Regression Analysis: Problems, Detection, and Solutions

## OLS Assumption 7: The error term is normally distributed (optional)

OLS does not require that the error term follows a normal distribution to produce unbiased estimates with the minimum variance. However, satisfying this assumption allows you to perform statistical hypothesis testing and generate reliable confidence intervals and prediction intervals.

The easiest way to determine whether the residuals follow a normal distribution is to assess a normal probability plot. If the residuals follow the straight line on this type of graph, they are approximately normally distributed.

If you need to obtain p-values for the coefficient estimates and the overall test of significance, check this assumption!
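Here’s what that check might look like in Python; this is just an illustrative sketch on simulated data, using the Q-Q plot that statsmodels provides:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(9)
x = rng.uniform(0, 10, 150)
y = 5 + 2 * x + rng.normal(0, 1, 150)

res = sm.OLS(y, sm.add_constant(x)).fit()

# Points hugging the reference line indicate approximately normal residuals.
sm.qqplot(res.resid, line="s")
plt.show()
```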

## Why You Should Care About the Classical OLS Assumptions

In a nutshell, your linear model should produce residuals that have a mean of zero, have a constant variance, and are not correlated with themselves or other variables.

If these assumptions hold true, the OLS procedure creates the best possible estimates. In statistics, estimators that produce unbiased estimates that have the smallest variance are referred to as being “efficient.” Efficiency is a statistical concept that compares the quality of the estimates calculated by different procedures while holding the sample size constant. OLS is the most efficient linear regression estimator when the assumptions hold true.

Another benefit of satisfying these assumptions is consistency: as the sample size increases to infinity, the coefficient estimates converge on the actual population parameters.

If your error term also follows the normal distribution, you can safely use hypothesis testing to determine whether the independent variables and the entire model are statistically significant. You can also produce reliable confidence intervals and prediction intervals.

Knowing that you’re maximizing the value of your data by using the most efficient methodology to obtain the best possible estimates should set your mind at ease. It’s worthwhile checking these OLS assumptions! The best way to assess them is by using residual plots. To learn how to do this, read my post about using residual plots!

If you’re learning regression and like the approach I use in my blog, check out my eBook!

jack says

Hi jim,

Is it true that the t-test in hypothesis testing requires that the sampling distribution of the estimators follow the normal distribution? Do you agree with this statement?

Jim Frost says

Hi Jack, yes, and the distribution of the coefficient estimates is linked to the assumption about the distribution of the residuals. If the residuals follow a normal distribution, you can conclude that the distributions for the estimators are also normal. I suspect that the central limit theorem applies here as well, in that if you have a sufficiently large sample size, the sampling distributions will approximate the normal distribution even when the residuals are nonnormal. However, I don’t have good numbers for when that would kick in. Presumably it depends on the number of observations per predictor.

jackson says

Sir, thank you for explaining well

jackson says

“The ordinary least squares (OLS) estimators are still unbiased even though the error term is not normally distributed”. Comment on this statement.

Jim Frost says

That statement can be correct but it isn’t necessarily correct. If the residuals are nonnormal because you misspecified the model, the estimators will be biased. As I state in this post, OLS does not require that the error term follows a normal distribution to produce unbiased estimates with the minimum variance. However, if you want to test hypotheses, it should follow a normal distribution.

NYAMUYONJO DAVID says

Jim, thank you for explaining well and being kind.

Chris Akenroye says

Please add me to your mailing list. I just went through your post on 7 Classical Assumptions of Ordinary Least Squares (OLS) Linear Regression. Your explanation was reader-friendly and simple. I really appreciate you and your style of knowledge dissemination.

Thanks

Lovelyn says

Very useful and informative.

Katrina says

Hi Jim,

Thanks for this post and all you do. I really appreciate it! I’m trying to solve a business problem and I want to know if OLS is the right regression here... essentially I’m trying to do a driver analysis of net promoter score/customer happiness. I have survey results from 20000 respondents.

My dependent variable: overall score (11-pt scale containing a single rating from 0 – 10)

My predictor variables: three driver scores (11-pt scale containing a single rating from 0 – 10) asked in the same survey.

So far, I did correlation for the overall rating against the three driver ratings separately, and results show they are positively correlated.

To answer the question “if I increase any of the three drivers’ ratings by 1, how much would that affect the overall score,” I tried to use Excel’s regression, but the p-values for all three drivers are basically 0, and that doesn’t tell me which one is the most important and explains the overall score. My company wants to know which driver area they should prioritize, so I’m running into a wall, and I’m wondering if OLS is even the right model to use as these aren’t measurement data but rather ordinal data... could you please advise? I also read something about multinomial logistic regression online but that’s beyond me... any tips to proceed?

Thank you so much!!

Katrina

dbadrysys says

Hi Jim,

Your post is very helpful in explaining detail.

But I’m still unclear on one point: in case the relationship is nonlinear, can we use exponential or logarithm functions to transform the data before using it (as a fix for linearity, OLS assumption 1)?

Thanks.

Jim Frost says

Hi,

Yes, you can use data transformations as you describe to help. However, I always recommend those as a last resort. Try other possibilities, such as specifying a better model, before using data transformations. However, when the dependent variable is very skewed/nonnormal, it can be difficult to satisfy the assumptions without a transformation.

Harry says

Hello Mr Jim,

Thank you for your exceptional work that helps so many including me.

I am working on the abalone dataset, which has been previously used to model the age of abalone (rings) through a number of predictors mainly involving gender (categorical variable), weight variables and other mm variables (eg. length of abalone). In my case I want to model instead the shucked weight (i.e. meat) for explanatory purpose, using the other variables as independent and I don’t have in my possession the age variable this time. However it appears that the independent variables are pairwise highly collinear in which case it makes it really hard to find a proper model. In addition the biplots against the shucked weight are not all linear and they all have a tendency of increasing variation of the response. As a consequence the residual vs fitted plot of most possible models, say we take the full model, show heteroscedasticity which I tried to solve via transformation of either the response or the independent variables. At the moment I don’t seem to be finding an exit to a model that has at least constant variation.

Do you think you could give me some constructive advice? Thanks.

Jim Frost says

Hi Harry,

That sounds like a very interesting analysis that you’re performing!

You mention you don’t have the age, but do you have the number of rings that could be a proxy for age? If so, use that as a proxy variable. Proxy variables are variables that aren’t the actual DV but they’re related to both the DV and IV and can incorporate some of the same information into the model. It helps prevent omitted variable bias. Read my post about omitted variable bias, which also discusses proxy variables as a potential solution.

You also mention the problems of multicollinearity and heteroscedasticity. Please read my post about multicollinearity and learn about VIFs. Some multicollinearity is OK. VIFs will tell you if you have problematic levels. I also present some solutions.

And, read my post about heteroscedasticity. Again that post shows how to detect it (which it seems like you have) and potential solutions.

You also mention the need to fit curvature, so I think my post about curve fitting will be helpful.

I think those posts will provide many answers to your questions. If after reading those posts, you have specific questions about addressing those problems, please post in the relevant posts.

Also, because your analysis depends so heavily on regression analysis, I highly recommend buying my regression ebook. In that book, I go into much more detail and cover more topics. For example, I would not be surprised if you need to use some sort of data transformation. That seems to be common in biology growth models (but I’m not an expert in that field). I don’t cover data transformations for regression in a blog post (although I do have an example post) but I cover transformations in my ebook.

Best of luck with your analysis!

khedidja djaballah says

I wish to know what is the relevance of intercept only models. And on what kind of data can it be applied?

Jim Frost says

Hi Khedidja,

An intercept-only model refers to one which does not contain any independent variables. These models produce fitted values (or predictions) that equal the mean of the dependent variable. Use the F-test of overall significance to compare a model to an intercept-only model. This test determines whether your model with IVs is better than the intercept-only model (no IVs). When the F-test is statistically significant, you can conclude that your model is better than the intercept-only model. If that F-test is not significant, your model with IVs is not better than the intercept-only model. In other words, your model doesn’t explain the changes in the DV any better than just using the DV mean. For more information, read my post about the overall F-test of significance.

Typically, you’d only use an intercept only model when you have no significant IVs and when the overall F-test is not significant. You have no IVs that have a significant relationship with the DV. In this scenario, the only meaningful way you can predict the DV is by using the mean of the DV. This outcome is not a good one because you want to find IVs that explain changes in the DV.
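If it helps, here’s a small illustrative sketch (simulated data) showing that an intercept-only model’s fitted values all equal the mean of the DV, and that the overall F-test compares your model against that baseline:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 100)
y = 5 + 2 * x + rng.normal(0, 1, 100)

# Intercept-only model: every fitted value equals the mean of y.
intercept_only = sm.OLS(y, np.ones_like(y)).fit()
print(np.allclose(intercept_only.fittedvalues, y.mean()))  # True

# The overall F-test compares the model with IVs against that baseline.
full = sm.OLS(y, sm.add_constant(x)).fit()
print(full.fvalue, full.f_pvalue)
```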

EMMANUEL APPIAHH says

what is the importance of relative efficiency

Shannon says

This was an actual lifesaver. I’m taking cross section econometrics at my university and was really struggling with OLS. We were doing all these linear algebraic proofs and went over the assumptions, but the conceptual explanations were a bit difficult to navigate. I particularly found your section on the importance of the “expected value of the error term being zero” extremely helpful.

Sophia says

Thanks for the amazingly detailed response, Jim! This is very helpful – I’m eagerly looking forward to reading the articles in your reply. Thanks also for maintaining such an informative blog!

Sophia says

Thanks for replying, I was asking about this because even after adding what we think are relevant regressors (like weather..), we are always either significantly under-ordering or over-ordering, and I was wondering if it was because the assumptions of Linear Regression were not being met for the store sales data, and if we should look into a different model.

Any guidance on alternative models to accurately estimate appropriate inventory ordering quantities would be very helpful! Thanks,

Jim Frost says

Hi Sophia, and apologies because your previous question about using OLS for your ordering system fell through the cracks. The answer is that, yes, it might well be a suitable system. However, there are some potential challenges. First, be sure that your model does satisfy the assumptions. This helps ensure that your model fits the data. It’s possible there’s, say, curvature in the data that the model isn’t fitting. That can cause biased predictions (systematically too high or too low).

If the model does satisfy the assumptions, say everything is perfect, it’s still possible to have large differences between the predicted value and the eventual actual value. It’s possible that your model fits the data well but it’s insufficiently precise. In other words, the standard deviation of the residuals is simply too large to produce predictions that are sufficiently precise for your requirements. There are different possibilities at work, and I can’t be sure which one(s) would apply to your case.

You might have:

Too few data points.

Too few independent variables.

Too much inherent variability in the data.

The first two items, you can address. Unfortunately, the last one you can’t.

I’ve written about this issue with prediction in other blog posts. First and foremost, read this blog post to see if imprecision might be affecting your model: Understand Precision in Predictions. There are two key measures you should become familiar with: standard error of the regression and prediction intervals.

I also walk through using regression to make predictions and assess the precision.

Also, what is the R-squared for your model? While I’ve written that R-squared is overrated, a low R-squared does indicate that predictions will be imprecise. You can see this at work in this post about models with low R-squared values. In practice, I’ve found that models with R-squared values less than 70% produce fairly imprecise predictions. Even higher R-squared values can be too imprecise depending on your requirements. Again, the standard error of the regression and prediction intervals are better and more direct measures of prediction precision.

So, the first step would be to identify where the problem lies. Does the model fit the data? If not, resolve that. If it does fit the data, assess the precision of the predictions.

I hope this helps!

Sinks says

You can only use these assumptions if the sample is drawn from the main population to ensure that the results generated are close or are a reflection of the population. However, there is no harm in using regression for the entire population (as in your case) to assess the trend in the sales.

Jim Frost says

Hi, I’d disagree with your statement slightly. It’s true that if you perform regression analysis on the entire population, you don’t need to perform hypothesis testing. In that light, the residuals don’t need to be normally distributed. However, other assumptions certainly apply. There are other assumptions that address how well your model fits the data points. For example, if your data exhibit curvature, your model needs to fit that curvature. If it doesn’t, you’ll see it show up in the residuals. Consequently, you still want to check the residuals vs. fits plot to ensure that the residuals are randomly scattered around zero.

Additionally, if you want to use your model to make predictions, the prediction intervals are valid only when the residuals are normally distributed. Consequently, even when you’re working with a population, that normality assumption might still be in effect!

Dan says

Would you mind discussing (briefly or point me to the right direction) the relationships among unbiasness, variance, and consistency for an estimator? Thanks!

Jim Frost says

Hi Dan,

Unbiased means that there is no systematic tendency for the estimator to be too high or too low. Overall, the estimator tends to be correct on average. When you assess an unbiased estimator, you know that it’s equally likely to be too high as it is to be too low.

Variance relates to the margin of error around the estimator. In other words, how precise of an estimate is it? You want minimum variance because that indicates that your estimator will be relatively close to the correct value. As variance increases, the probability increases that the estimator is further away from the correct value.

Consistency indicates that as you increase the sample size, the value of the estimator converges on the correct value.

I hope this helps!

Florentino Menéndez says

Thanks a lot for your answer! There are some topics in which I need some additional study. But now I have a place to look! Thanks again 🙂

Florentino Menéndez says

First of all, congratulations on your book. I have bought it and find it very, very clear. I have learnt a lot of details that collaborate to round up my comprehension of the topic. Again: congratulations and thanks for it.

Second thing: a question. I teach basic statistics and a student brought me a linear regression with repeated observations. There were 40 medical patients, each measured three times, so the file had 120 rows. In other respects the regression was OK, but I objected that the observations were not independent, so the p-values were not real.

My student asked me how he could do the regression in order to use the 120 measurements, but I don’t know what we could do. I use Stata and SPSS.

Any help will be very much appreciated.

Jim Frost says

Hi Florentino,

Thank you for buying my book! I’m very happy to hear that you found it to be helpful!

I once had a repeated measures model where I tracked the outcome over time. For that model, I simply used the change in the outcome variable as the dependent variable. Perhaps that would work for your student?

I believe that repeated measures designs are more frequently covered in ANOVA classes. I do have a post about repeated measures ANOVA. That post talks about crossover designs where subjects are in multiple treatment groups–which may or may not be relevant to your student’s study. However, it also discusses how to include subjects as a random factor in the model, which is relevant and will give you a glimpse into how linear mixed models work, which I discuss more below. Mixed models contain both fixed effects and random effects. That post also explains how these models account for the variability of each subject.

Linear mixed models, also known as mixed effects models, are a more complex but very flexible type of model that you can use for this type of situation. This type of model adds random effects to the model for the subjects. In other words, the model controls for the variability of each subject. There are different types of linear mixed models. Random intercept models account for subjects that always have high or low values. Individual growth curve models describe each subject’s trajectory over time. These types of models are very flexible but also very advanced. I don’t have much experience with them so I don’t want to give bad advice about what type to use. Just be aware that they are complicated and easy to misspecify. If your student goes this route, it’ll take some research to find the correct type and model specification that meets their study’s requirements.

Another possibility is multilevel modelling. This type of model is particularly good for nested data.

Again, it’ll take a bit of research combined with knowledge about the student’s data and objectives to determine the best approach.

Hopefully this will at least help point your student in the right direction!

Gelgelo says

Hello Mr. Jim, it was helpful, but I have some questions to ask.

I am conducting research on the socioeconomic impacts of droughts, and I want to show some effects of droughts using logit models. For example, I have pastoralist and agro-pastoralist groups and also sex, age, farm land, etc., and the effects are livestock mortality, rangeland degradation, loss of services, and others. So how can I use logit regression models with coefficients estimated using an ordinary least squares regression?

Jim Frost says

Hi Gelgelo,

I’m not sure that I understand what you’re asking. Typically, you use logit models for binary dependent variables and OLS for continuous dependent variables. That’s usually how you decide.

I hope this helps!

Sophia says

Very insightful article, Jim! I have a fairly basic question – I get that these assumptions need to be checked when you’re working on a sample of the data, but what if I perform linear regression on the entire population – would the same assumptions still need to be satisfied?

I’m working on an inventory ordering project at school, and have sales data for a few years for some products sold in a store. I’d like to regress the sales data for each product against some independent variables (to understand which variables affect the demand for a product), and I’m trying to figure out if Linear regression would be a suitable model.

I’d greatly appreciate any insight – thanks!

Pavel grabov says

Thank you very much for your blog.

I have only one question: what about the assumption that only ‘Y’ is subject to the errors of measurement? We are searching for the best-fit line on the basis of the sum of the distances between the results of experiments and a fitting line, and we measure these distances in parallel to the Y-axis. In principle, these distances could be measured in parallel to the X-axis or orthogonal to the fitted line. Obviously, if the X-values are supposed to be error-free, the distances should be measured in parallel to the Y-axis, but if this assumption is invalid, the linear model from ordinary least squares linear regression will be incorrect.

Best Regards,

Pavel

Jim Frost says

Hi Pavel,

So, the assumption about no errors in the X-values is the ideal scenario. I don’t think any model is going to satisfy that 100%. You just have to try to minimize measurement errors of the independent variables (X-values). Most studies don’t need to worry about this problem.

Justin Tusoe says

hi Jim,

I must say the book is very helpful. I am doing research on the impact of human capital development on economic growth in Ghana. There are thousands of factors that affect economic growth. How do I choose the right variables?

Jim Frost says

Hi Justin,

First off, thank you so much for buying my ebook about regression analysis! I’m glad you’re finding it very helpful. One of the themes throughout the book is that you need to conduct a lot of research before beginning your analysis. What variables do other similar studies include? Which variables tend to be significant, what are their signs and magnitudes, etc.? This will help you gather the correct data and include the correct variables in the model. You need to gain a lot of subject-area knowledge to know which variables you should include.

In the book, chapter 7 discusses how to specify and settle on a final model.

Best of luck with your analysis!

Alberto Javier Vigil-Escalera says

Hi Jim

First of all, thank you for so kindly getting back to me.

A couple of final questions.

In order to detrend the series, do you mean, for example, differencing them or using percentage change, like

Year on Year changes?

If that is what you mean, I am doing that: my dependent variable is the % annual change in the SP500, and I am doing the same with the independent variable. (Sorry for my lack of clarity in my explanation.)

Also, by reading your excellent book (I am already on page 106), I realized that I may have another problem: the equation that I try to find may be nonlinear, or, if linear, I may need to transform some of the variables. Could this also be an explanation for the heteroscedasticity and the residual correlations that I have in my model?

How do I know which variables I should transform? How do I know which nonlinear regression I should look at? Is that also in your book?

Many thanks.

Jim Frost says

Hi again Alberto,

Yes, that’s exactly what I meant. If you’ve done that, I would’ve thought that would’ve also removed the heteroscedasticity. You might have to try other options for resolving that issue, which I cover later in the book, starting on page 213.

You can certainly try a transformation. I have a section dedicated to transformations, starting on page 244. Consider that as the last resort though. But, yes, transformations can fit nonlinear relationships, fix heteroscedasticity, and fix residuals issues. However, while that all sounds great, save that until the last. Try the other solutions first. I do provide guidelines, tips, etc. for choosing transformations and for which variables. However, choosing the correct transformation method and variables to transform is a bit of trial and error. You can also look to see how other similar studies have handled it.

I hope this helps!

Davidjackline says

This is an awesome and brilliant elaboration. Thank you Jim.

Alberto Javier Vigil-Escalera says

Hi Jim

Thank you very much for your blog, and congratulations on your book. I found it very interesting and easy to understand, especially since I don’t have a math background; I am just a lawyer who got involved in finance.

My question is: Can I still use the results of a regression that violates the rules on heteroscedasticity and residual correlations?

Let me give you a bit of background.

I am doing regression analysis on the SP 500. My dependent variable is the monthly SP500 year-on-year return. My independent variables are the usual ones: activity indicators, monetary indicators, volatility indicators. I have more than 200 monthly observations, and I don’t have more than 7 independent variables.

All, dependent variables and independent variables are used in Year on Year changes.

I run the regression and I get an R-squared above 0.7. I looked at F and it looks very significant. Then I looked at the t-scores on the independent variables and they are all significant (with the exception of the constant).

Also the regression makes sense from an economic point of view. For example: the more economic activity the higher the returns, the sign (+/-) of the monetary policy also makes economic sense…

I looked for collinearity and there is none. So far so good, I think, but my lucky streak ends right there.

When I checked for heteroscedasticity (Breusch-Pagan) and residual correlation (Durbin-Watson), it shows that both exist.

I know that those are serious problems but I still wonder:

1. Residual serial correlation means that there are still independent variables out there that I should find. However, my R-squared is high and my F is too. So even if I don’t know all of the factors that move the SP 500, I know a good amount of them.

2. The heteroscedasticity makes my estimates of the SP 500 very weak. However, could I still use this model to give me the SP 500 direction instead of using it to find a specific target of return? I mean, could I use this model to tell me if the SP 500 could go up or down from here, instead of using it to tell me if it will go up 10% or -15%?

Thanks for your help.

Jim Frost says

Hi Alberto,

First, thanks for buying the book and I’m glad to hear that it was helpful!

The quick answer is that you really should fix both heteroscedasticity and autocorrelation. Both of these conditions produce less precise coefficient estimates. And, heteroscedasticity tends to produce p-values for the coefficients that are smaller than they should be. So, you’re thinking they’re significant when they might not be.

I also see an additional potential problem with your model. I think you’re going to have long term positive trends in both economic activity and the S&P 500. When you have trends in your data and perform regression analysis, you’ll get significant results and an inflated R-squared. After all, both things are following long term trends. What you really need to do is detrend the data. Then show how deviations from the trend on the economic activity side of things relate to deviations from the trend on the S&P 500 side of things. That might also help with your heteroscedasticity (I have a section in the book about handling heteroscedasticity you should read). Consider adding lag variables to reduce the autocorrelation. It might be that previous economic activity relates to the current S&P 500. That might be the type of information you’re seeing in the residuals. There’s a bunch of additional things to consider with using regression analysis with time series data.
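To illustrate what I mean by detrending, here’s a rough sketch with made-up series; differencing both variables removes the shared trend that inflates R-squared:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 240  # e.g., 20 years of monthly data

# Two unrelated series that both trend upward over time.
t = np.arange(n)
activity = 100 + 0.5 * t + rng.normal(0, 2, n)
sp500 = 1000 + 5 * t + rng.normal(0, 20, n)

# Regressing the raw levels mostly captures the shared trend.
levels = sm.OLS(sp500, sm.add_constant(activity)).fit()
print("levels R-squared:", levels.rsquared)       # inflated by the trend

# First-difference both series to remove the trend, then regress.
diffed = sm.OLS(np.diff(sp500), sm.add_constant(np.diff(activity))).fit()
print("differenced R-squared:", diffed.rsquared)  # far lower once the trend is gone
```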

Predicting the stock market is very difficult. It’s not surprising that you have a weak model. After all, if it was easy, everyone would be able to make a fortune predicting the market!

Kenji Kitamura says

Thank you very much for you explanation. It became clear!

Kenji Kitamura says

Thank you very much for your great explanation. It is so helpful.

I have a small question.

You state that

“if the average error is +7, this non-zero error indicates that our model systematically underpredicts the observed values.”

The average error for simple linear regression is

E[e|x] = (1/n) Σᵢ [yᵢ − (a + b₁xᵢ)]

Thus, the size of the average error E[e|x] depends on the scale of the dependent and independent variables. Therefore, I wonder why you can say +7 is a cutoff point for the bias. I understand that we don’t really care about this in practice given that the constant addresses this bias, but I’m just curious about this claim.

Jim Frost says

Hi Kenji,

I see I’m going to have to clarify that text! I was just using +7 as an example, and didn’t mean to make it sound like a cutoff value for bias. If the average residual is anything other than zero, the model is biased to some extent. You want the model to be correct on average, which suggests that the mean of the residuals should equal zero. Note that right after I mention +7, I refer to it as a non-zero value, which is the real problem.

And, you’re correct, as I mention in the post, when you include the constant, the average will always equal zero which eliminates the worry of an overall bias. Although, you can still have local bias, such as when you don’t correctly model curvature.

Farhan says

Thanks Jim…. 😐

MahboobUllah says

The Best

CATHERINE NAMIRIMU says

Thanks Jim, the explanations are very easy to understand. It’s very interesting to study here.

kwaters126 says

I think you meant to say 4 and 5 here:

“Note: When assumption 5 (no autocorrelation) and 6 (homoscedasticity) are both true, statisticians say that the error term is independent and identically distributed (IID) and refer to them as spherical errors.”

Otherwise, wonderful post

Jim Frost says

Yes! Thank you for catching that! I’m making the change now.

John Grenci says

Hey Jim, I happened to find your site, and hoping you can help me. I am doing a study on predicting home run rates of actual baseball players. So, I set up criteria, a certain number of plate appearances, etc., and modeled home run rate for a year based on their previous home run rates (going back 5 years). I also have an age flag. All coefficients are highly significant. I performed a test of normality. The r-squared for several thousand observations is a little more than .6, so I think the fit is good.

But here is my question, and this question could apply in many contexts, I think. It deals with homoscedasticity. It seems intuitively that this should rarely hold up. Why? Because isn’t it true that if you have two (or more) ranges of similar values, the variances will be in a similar proportion? In other words, take two rooms of males. One has 30 newborns, and one has 30 20-year-olds. You are analyzing their weights. Assume the mean weight of the newborns is 9, and the mean weight of the 20-year-olds is 200. It is certain that the variance of the newborns will be smaller than the variance of the 20-year-olds, and my best guess would be that the variances have the same ratio as the ratio of the means (9 to 200).

So, when predicting ANYTHING, whether it be advertising predicting revenue, or previous home run rates predicting home run rates for the upcoming season, it just seems almost certain that, at least in the case of home run rates, the same type of phenomenon will happen. Much like the weights: among the 20-year-olds, you have some people who weigh 350 pounds and some that weigh 130. It is IMPOSSIBLE to have that variability among newborns. I gave an extreme example to illustrate my point. Thanks, John

Jim Frost says

Hi John,

I’m glad you found my site! Great questions!

One potential issue I see for your model is the fact that you’re using the model to make predictions and you have an R-squared of 0.6. Now, one thing I never do is have a blanket rule for what an R-squared should be. That might be the perfectly correct R-squared for the subject area. However, R-squared values that aren’t particularly high are often associated with prediction intervals that are too wide to be useful. I’ve written several posts about using regression to make predictions, prediction intervals and precision, etc. that talk about this. One you should check out is my post about how high your R-squared needs to be, and then maybe some of the others.

Now, on to homoscedasticity. First, you should check out my post about heteroscedasticity. It talks about the issues you discuss, along with solutions. You’re absolutely correct that when you have a large range of dependent variable values, you’re more likely to have heteroscedasticity. In contrast, I often use a height-weight dataset as an example, but it’s limited to young teen girls. It’s more restricted and there’s no heteroscedasticity present, which fits in nicely as the converse of your example.

That all said, I’m often surprised at how rarely heteroscedasticity appears outside of extreme cases like the one that you describe. Anyway, read that blog post, and if you have questions after that, don’t hesitate to ask!

Rainard Mutuku says

Hae Jim, thanks.

Your presentation is well illustrated and precise.

ghazanfar says

Sir, thank you, you make stats easy for me with your good explanations, but one thing is confusing me: which test is best to check for heteroscedasticity?

Amit says

Since you replied, sir... I am elated to ask about some of my doubts:

a) Sir, we know the expectation of the errors being zero is a basic assumption, but we also get the summation of errors = 0 as the first constraint from the least squares method (LSM). Now, what is the difference between the two? I think that the linear regression line always passes through the center of the points, but only LSM minimizes the errors. But E(e) = 0 even if we do not use LSM (OLS). Am I right?

b) Sometimes our software fits a line to curved data, and then our E(e) = 0 as well; then we need to add squared terms or transformations to meet homoscedasticity, and still E(e) = 0, meaning the software always tries to predict and get E(e) = 0.

Jim Frost says

Hi Amit,

I’m not 100% sure that I understand your questions. But, yes, the expectation that errors are zero and the summation of errors equaling zero are related. Furthermore, if you include the constant in your model, you’ll automatically satisfy this assumption. Read my post about the regression constant for more information about this aspect.

However, what I find is that while the overall expectation is that the error equals zero, you can have patterns in the residuals where it won’t equal zero for specific ranges. The classic example of that is where you try to fit a straight line to data that have curvature. You might have ranges of fitted values that systematically under-predict the observed values and other ranges that over-predict it even though the overall expectation is zero. In that case, you need to fit the curvature so that those patterns in the residuals no longer exist. In other words, having an overall expectation equal to zero is not sufficient. Check those residual plots for patterns. I talk about this in my post about residual plots.

I don’t know about your software and what it does automatically, but in general the analyst needs to be sure not only that the overall expectation equals zero, which isn’t a problem when you include the constant, but also that there are no ranges of fitted values that systematically over- and under-predict. Again, read my post about checking the residual plots!

I hope this helps!

Uma Shankar says

Hi Jim,

I agree with the point that Y need not follow a normal distribution, as we don’t know the distribution of the population of Y. However, the sample statistics, i.e., the regression coefficients or the parameter estimates, follow a normal distribution (thanks to the central limit theorem: the sampling distribution of the sample mean follows a normal distribution). In that case, since Y-hat is a linear combination of parameter estimates, shouldn’t it turn out that Y-hat follows a normal distribution?

A linear combination of normally distributed random variables results in a normal distribution.

Thank you.

Jim Frost says

Hi Uma,

Sorry about the delay in replying!

As it turns out, y-hat doesn’t necessarily follow a normal distribution even though it is a linear combination of parameter estimates.

If the residuals are normally distributed, it implies that the betas are also normally distributed. That part is true. It would also seem to imply that the y-hats are also normally distributed but that’s not necessarily true. However, if you include polynomials to model curvature, they can allow the model to fit nonnormally distributed Ys and yet still produce normally distributed residuals. Even though it is modeling curvature, it is still a linear model. I actually have an example of this using real data, which you can download–using regression to make predictions. I don’t mention it in the post, but the dependent variable is not normally distributed. Because the model provides a good fit, we know that the y-hats are also nonnormal.

I hope this helps!

Uma Shankar says

Expected value of error is still zero as it is assumed that the mean value of error clusters around zero. However the error need not be normally distributed which is not a strict assumption even in OLS regression.

In linear regression, Y-hat is a linear combination of parameter estimates, with the expected value of the error being zero, as the errors are assumed to be IID with mean clustered around zero. The same applies here as well, because the errors are independent and all independent variables are exogenous.

My question here is: how can Y-hat satisfy the normality assumption (it being a sampling distribution) when, here, Y-hat is not a linear combination of parameter estimates, unlike in linear regression? How does the inferential statistics work here?

Jim, Please help with the analysis and correct me if I’m wrong here with expected error being zero in the question asked.

Thanks all.

Jim Frost says

Hi Uma,

Neither Y-hat nor Y need to follow the normal distribution. The assumptions all apply to the residuals for both linear and nonlinear regression. While the residuals don’t need to be normally distributed, it is helpful if you want to perform hypothesis testing and generate confidence intervals and prediction intervals. Does that answer your question?

Amit says

If the function is Y = e^(Xβ), then is E(e) = 0 or not? In other words, if it is not linear regression, will the expectation of the errors be zero? Why or why not?

Jim Frost says

Hi Amit,

The assumptions for the residuals from nonlinear regression are the same as those from linear regression. Consequently, you want the expectation of the errors to equal zero. If you fit a model that adequately describes the data, that expectation will be zero. Of course, if the model doesn’t fit the data, it might not equal zero. But, that is the goal!

I hope this helps!

Riana says

This is just wonderfully written! Thank you so much! I often heard this iid assumption, but never quite knew what was meant by it! I will definitely read all your other posts.

I hope you will also easily explain the field of time series econometrics and/or asymptotics anytime soon 🙂

Jim Frost says

Hi Riana,

Thank you so much! Your kind words mean a lot to me!

I plan to write about those other topics at a future date, but there’s so much to write about!

Uma Shankar Surreddy says

Hi Jim, wonderful explanation. I have a doubt about assumption 2, “The error term has a population mean of zero.” Isn’t this about the residuals and not the error/disturbance term? Because the error/disturbance term (a population object) is ideally independent of, or uncorrelated with, the other errors, and their sum is almost never zero. But in the case of a sample statistic like the sample mean, the residuals are not independent and hence make up for a mean value of zero. Please correct me if I’m wrong.

Jim Frost says

Hi Uma,

The error term is an unknown just like the true parameter values. The coefficients estimate the parameters while the residuals estimate the error term. Ideally, the error terms have a mean of zero and are independent of each other. Because we can’t know the real errors, the best we can do is to have a model that produces residuals with these properties.

So, yes, the error term can and should have a mean of zero. But, we can only use the residuals to estimate these properties. Consequently, the residuals should have a mean of zero and be independent of each other.

I hope this helps!

Giulio Graziani says

This is gold Jim thanks a lot!

Felix Ajayi says

Sir you are simply wonderful. Your post is reader-friendly. Kindly send this piece to my email

f********@*****.com

I want to follow you for a guide to learning and teaching in econometrics and more importantly running the analyses in my academic research.

Regards.

Jim Frost says

Thank you, Felix! That means a lot to me. I removed your email address from your comment for privacy. I don’t have anything to email now, but I’ll save your email address for when that occasion arises. You can always receive alerts about new posts by filling in the subscribe box in the right navigation pane. I don’t send any junk mail!

Isaac kojo Annan Yalley says

Thanks for making statistics easy and understandable for us

Jim Frost says

You’re very welcome, Isaac. I’m glad my website has been helpful!

Tavares says

Thank You. I appreciated the content

Jim Frost says

You’re very welcome. I’m glad it was helpful!