Omitted variable bias occurs when a regression model leaves out relevant independent variables, which are known as confounding variables. This condition forces the model to attribute the effects of omitted variables to variables that are in the model, which biases the coefficient estimates.
This problem occurs because your linear regression model is specified incorrectly—either because the confounding variables are unknown or because the data do not exist. If this bias affects your model, it is a severe condition because you can’t trust your results.
In this post, you’ll learn about this type of bias, how it occurs, and how to detect and correct it.
Related post: Specifying the Correct Regression Model
What Are the Effects of Omitted Variable Bias?
Omitting confounding variables from your regression model can bias the coefficient estimates. What does that mean exactly? When you’re assessing the effects of the independent variables in the regression output, this bias can produce the following problems:
- Overestimate the strength of an effect.
- Underestimate the strength of an effect.
- Change the sign of an effect.
- Mask an effect that actually exists.
You don’t want any of these problems to affect your regression results!
To learn more about the properties of biased and unbiased estimates in regression analysis, read my post about the Gauss-Markov theorem.
Synonyms for Confounding Variables and Omitted Variable Bias
In the context of regression analysis, there are various synonyms for omitted variables and the bias they can cause. Analysts often refer to omitted variables that cause bias as confounding variables, confounders, and lurking variables. These are important variables that the statistical model does not include and, therefore, cannot control. Additionally, they call the bias itself omitted variable bias, spurious effects, and spurious relationships. I’ll use these terms interchangeably.
What Conditions Cause Omitted Variable Bias?
How does this bias occur? How can variables you leave out of the model affect the variables that you include in the model? At first glance, this problem might not make sense.
For omitted variable bias to occur, the following two conditions must exist:
- The omitted variable must correlate with the dependent variable.
- The omitted variable must correlate with at least one independent variable that is in the regression model.
The diagram below illustrates these two conditions. There must be non-zero correlations (r) on all three sides of the triangle.
This correlation structure causes confounding variables that are not in the model to bias the estimates that appear in your regression results. For example, removing either X variable will bias the coefficient estimate for the other X variable.
The amount of bias depends on the strength of these correlations. Strong correlations produce greater bias. If the relationships are weak, the bias might not be severe. And, if the omitted variable is not correlated with another independent variable at all, excluding it does not produce bias.
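For readers who want the algebra behind this, here is a sketch of the textbook two-predictor case. The symbols are assumptions for illustration and are not taken from any study in this post: the true model contains X_1 and X_2, the error term is uncorrelated with both, and X_2 is the variable that gets left out. In large samples, the slope on X_1 from the model that omits X_2 is approximately:

```latex
% True model (X_2 will be omitted):
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon

% Slope on X_1 when Y is regressed on X_1 alone (large-sample value):
\hat{b}_1 \approx \beta_1 + \beta_2 \, \frac{\operatorname{Cov}(X_1, X_2)}{\operatorname{Var}(X_1)}
```

The second term is the bias. It vanishes when X_2 has no effect on Y (\beta_2 = 0) or when X_1 and X_2 are uncorrelated, and it grows as either relationship strengthens.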
Finally, if you’re performing an experiment that uses random assignment, omitted variable bias is less likely to be a problem. Random assignment minimizes the effects of confounding variables by equally distributing them across the experimental groups. Omitted variable bias tends to occur in observational studies.
I’ll explain how confounding variables can bias the results using two approaches. First, I’ll work through an example and describe how the omitted variable forces the model to attribute the effects of the excluded variable to the one in the model. Then, I’ll go into a more statistical explanation that details the correlation structure, residuals, and an assumption violation. Explaining confounding variables using both approaches will give you a solid grasp of how the bias occurs.
Related posts: Understanding Correlations and Random Assignment in Experiments and Observational Studies Explained
Practical Example of How Confounding Variables Can Produce Bias
I used to work in a biomechanics lab. One study assessed the effects of physical activity on bone density. We measured various characteristics including the subjects’ activity levels, their weights, and bone densities among many others. Theories about how our bodies build bone suggest that there should be a positive correlation between activity level and bone density. In other words, higher activity produces greater bone density.
Early in the study, I wanted to validate our initial data quickly by using simple regression analysis to determine whether there is a relationship between activity and bone density. If our data were valid, there should be a positive relationship. To my great surprise, there was no relationship at all!
What was happening? The theory is well established in the field. Maybe our data were messed up somehow? Long story short, thanks to a confounding variable, the model was exhibiting omitted variable bias.
To perform the quick assessment, I included activity level as the only independent variable, but it turns out there is another variable that correlates with both activity and bone density—the subject’s weight.
After including weight in the regression model, along with activity, the results indicated that both activity and weight are statistically significant and have positive correlations with bone density. The diagram below shows the signs of the correlations between the variables.
How the Omitted Confounding Variable Hid the Relationship
Right away we see that these conditions can produce omitted variable bias because all three sides of the triangle have non-zero correlations. Let’s find out how leaving weight out of the model masked the relationship between activity and bone density.
Subjects who are more active tend to have higher bone density. Additionally, subjects who weigh more also tend to have higher bone density. However, there is a negative correlation between activity and weight. More active subjects tend to weigh less.
This correlation structure produces two opposing effects of activity. More active subjects get a bone density boost. However, they also tend to weigh less, which reduces bone density.
When I fit a regression model with only activity, the model had to attribute both opposing effects to activity alone. Hence, the zero correlation. However, when I fit the model with both activity and weight, it could assign the opposing effects to each variable separately.
For this example, when I omitted weight from the model, it produced a negative bias because the model underestimated the effect of activity. The results said there is no correlation when there is, in fact, a positive correlation.
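To see the masking effect in numbers, here is a small simulation sketch. The data are synthetic, not the lab's measurements, and the effect sizes (0.5 for activity, 1.0 for weight, and a correlation of -0.5 between them) are hypothetical values chosen so that the two opposing effects of activity cancel. The helper `fit_ols` is likewise a name made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic, standardized variables (hypothetical values, not real study data).
activity = rng.normal(size=n)
rho = -0.5  # assumed correlation between activity and weight
weight = rho * activity + np.sqrt(1 - rho**2) * rng.normal(size=n)

# Assumed true effects: both activity and weight raise bone density.
bone_density = 0.5 * activity + 1.0 * weight + rng.normal(size=n)


def fit_ols(y, *predictors):
    """Return OLS coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


# Model 1 omits weight; model 2 includes it.
activity_only = fit_ols(bone_density, activity)
both = fit_ols(bone_density, activity, weight)

print(f"Activity only:     activity = {activity_only[1]:.3f}")
print(f"Activity + weight: activity = {both[1]:.3f}, weight = {both[2]:.3f}")
```

Under these assumptions, the activity-only slope should land near zero because 0.5 + 1.0 × (-0.5) = 0, while the two-variable model should recover coefficients near 0.5 and 1.0.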
Correlations, Residuals, and OLS Assumptions
Now, let’s look at this from another angle that involves the residuals and an assumption. When you satisfy the ordinary least squares (OLS) assumptions, the Gauss-Markov theorem states that your estimates will be unbiased and have the minimum variance among linear unbiased estimators.
However, omitted variable bias occurs because the residuals violate one of the assumptions. To see how this works, you need to follow a chain of events.
Suppose you have a regression model with two significant independent variables, X1 and X2. These independent variables correlate with each other and the dependent variable—which are the requirements for omitted variable bias.
Now, imagine that we take variable X2 out of the model. It is the confounding variable. Here’s what happens:
- The model fits the data less well because we’ve removed a significant explanatory variable. Consequently, the gap between the observed values and the fitted values increases. These gaps are the residuals.
- The degree to which each residual increases depends on the relationship between X2 and the dependent variable. Consequently, the residuals correlate with X2.
- X1 correlates with X2, and X2 correlates with the residuals. Ergo, variable X1 correlates with the residuals.
- Hence, this condition violates the ordinary least squares assumption that independent variables in the model do not correlate with the residuals. Violations of this assumption produce biased estimates.
This explanation serves a purpose later in this post!
The important takeaway here is that leaving out an important variable not only reduces the goodness-of-fit (larger residuals), but it can also bias the coefficient estimates.
Related posts: 7 Classical OLS Assumptions and Check Your Residual Plots
Predicting the Direction of Omitted Variable Bias
We can use correlation structures, like the one in the example, to predict the direction of bias that occurs when the model omits a confounding variable. The direction depends on both the correlation between the included and omitted independent variables and the correlation between the included independent variable and the dependent variable. The table below summarizes these relationships and the direction of bias.
| | Included and Omitted: Negative Correlation | Included and Omitted: Positive Correlation |
| Included and Dependent: Negative Correlation | Positive bias: coefficient is overestimated. | Negative bias: coefficient is underestimated. |
| Included and Dependent: Positive Correlation | Negative bias: coefficient is underestimated. | Positive bias: coefficient is overestimated. |
Let’s apply this table to the bone density example. The included variable (Activity) and the omitted confounding variable (Weight) have a negative correlation, so we need to use the middle column. The included variable (Activity) and the dependent variable (Bone Density) have a positive relationship, which corresponds to the bottom row. At the intersection of the middle column and bottom row, the table indicates that we can expect a negative bias, which matches our results.
Suppose we hadn’t collected weight and were unable to include it in the model. In that case, we can use this table, along with the hypothesized relationships, to predict the direction of the omitted variable bias.
How to Detect Omitted Variable Bias and Identify Confounding Variables
You saw one method of detecting omitted variable bias in this post. If you include different combinations of independent variables in the model, and you see the coefficients changing, you’re watching omitted variable bias in action!
In this post, I started with a regression model that has activity as the lone independent variable and bone density as the dependent variable. After adding weight to the model, the correlation changed from zero to positive.
However, if we don’t have the data, it can be harder to detect omitted variable bias. If my study hadn’t collected the weight data, the answer would not be as clear.
I presented a clue earlier in this post. We know that for omitted variable bias to exist, an independent variable must correlate with the residuals. Consequently, we can plot the residuals by the variables in our model. If we see a relationship in the plot, rather than random scatter, it both tells us that there is a problem and points us towards the solution. We know which independent variable correlates with the confounding variable.
Another step is to carefully consider theory and other studies. Ask yourself several questions:
- Do the coefficient estimates match the theoretical signs and magnitudes? If not, you need to investigate. That was my first tip-off!
- Can you think of confounding variables that you didn’t measure that are likely to correlate with both the dependent variable and at least one independent variable? Reviewing the literature, consulting experts, and brainstorming sessions can shed light on this possibility.
Obstacles to Correcting Omitted Variable Bias
Again, you saw the best correction possible in this post—including the variable in the model! Including confounding variables in a regression model allows the analysis to control for them and prevent the spurious effects that the omitted variables would have caused otherwise. Theoretically, you should include all independent variables that have a relationship with the dependent variable. That’s easier said than done because this approach produces real-world problems.
For starters, you might need to collect data on many more characteristics than is feasible. Additionally, some of these characteristics might be very difficult or even impossible to measure. Suppose you fit a model for salary that includes experience and education. Ability might also be a significant variable, but one that is much harder to measure in some fields.
Furthermore, as you include more variables in the model, the number of observations must increase to avoid overfitting the model, which can also produce unreliable results. Measuring more characteristics and gathering a larger sample size can be an expensive proposition!
Because the bias occurs when the confounding variables correlate with independent variables, including these confounders invariably introduces multicollinearity into your model. Multicollinearity causes its own problems including unstable coefficient estimates, lower statistical power, and less precise estimates.
It’s important to note a tradeoff that might occur between precision and bias. As you include the formerly omitted variables, you lessen the bias, but the multicollinearity can potentially reduce the precision of the estimates.
It’s a balancing act! Let’s get into some practical recommendations.
Related posts: Overfitting Regression Models and Multicollinearity in Regression
Recommendations for Addressing Confounding Variables and Omitted Variable Bias
Before you begin your study, arm yourself with all the possible background information you can gather. Research the study area, review the literature, and consult with experts. This process enables you to identify and measure the crucial variables that you should include in your model. It helps you avoid the problem in the first place. Just imagine if you collect all your data and then realize that you didn’t measure a critical variable. That’s an expensive mistake!
After the analysis, this background information can help you identify potential bias, and, if necessary, track down the solution.
Check those residual plots! Sometimes you might not be sure whether bias exists, but the plots can clearly display the hallmarks of confounding variables.
Recognize that omitted variable bias lessens as the strength of the correlations decreases. It might not always be a significant problem. Understanding the relationships between the variables helps you make this determination.
Remember that a tradeoff between bias and the precision of the estimates might occur. As you add confounding variables to reduce the bias, keep an eye on the precision of the estimates. To track the precision, check the confidence intervals of the coefficient estimates. If the intervals become wider, the estimates are less precise. In the end, you might accept a little bias if it significantly improves precision.
Related post: 5 Steps for Conducting Scientific Studies with Statistical Analyses
What to Do When Including Confounding Variables is Impossible
If you absolutely cannot include an important variable and it causes omitted variable bias, consider using a proxy variable. Typically, proxy variables are easy to measure, and analysts use them instead of variables that are either impossible or difficult to measure. The proxy variable can be a characteristic that is not of any great importance itself, but has a good correlation with the confounding variable. These variables allow you to include some of the information in your model that would not otherwise be possible, and, thereby, reduce omitted variable bias. For example, if it is crucial to include historical climate data in your model, but those data do not exist, you might include tree ring widths instead.
Finally, if you can’t correct omitted variable bias using any method, you can at least predict the direction of bias for your estimates. After identifying confounding variable candidates, you can estimate their theoretical correlations with the relevant variables and predict the direction of the bias—as we did with the bone density example.
If you aren’t careful, the hidden hazards of confounding variables and omitted variable bias can completely flip the results of your regression analysis!
If you’re learning regression and like the approach I use in my blog, check out my eBook!
Erin says
Hi Jim,
I have been trying to figure out covariates for a study we are doing for some time. My colleague believes that if two covariates have a high correlation (>20%) then one should be removed from the model. I’m assuming this is true unless both are correlated to the dependent variable, per your discussion above? Also, what do you think about selecting covariates by using the 10% change method?
Any thoughts would be helpful. We’ve had a heck of a time selecting covariates for this study.
Thanks,
Erin
Jim Frost says
Hi Erin,
It’s usually ok to have covariates that have a correlation greater than 20%. The exact value depends on the number of covariates and the strength of their correlations. But 20% is low and almost never a problem. When covariates are correlated, it’s known as multicollinearity. And, there’s a special measure known as variance inflation factors (VIFs) that helps determine whether you have an excessive amount of correlation amongst your covariates. I have a post that discusses multicollinearity and how to detect and correct it.
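As a rough illustration of why a 20% correlation is rarely a concern, here is a sketch that computes VIFs with NumPy. With only two covariates, each VIF equals 1 / (1 - r^2), so r = 0.2 gives about 1.04. The data are synthetic, and the function name `vifs` and the variables `x1`, `x2`, and `x3` are hypothetical. Common rules of thumb only start to flag VIFs somewhere around 5 to 10.

```python
import numpy as np


def vifs(X):
    """Variance inflation factors for the columns of X (rows are observations)."""
    corr = np.corrcoef(X, rowvar=False)
    # The diagonal of the inverse correlation matrix equals 1 / (1 - R_j^2),
    # where R_j^2 comes from regressing column j on the other columns.
    return np.diag(np.linalg.inv(corr))


rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.2 * x1 + np.sqrt(1 - 0.2**2) * rng.normal(size=n)    # correlation near 0.2
x3 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(size=n)  # correlation near 0.95

mild = vifs(np.column_stack([x1, x2]))
strong = vifs(np.column_stack([x1, x3]))

print("VIFs with ~0.20 correlation:", np.round(mild, 2))
print("VIFs with ~0.95 correlation:", np.round(strong, 2))
```

The first pair should produce VIFs barely above 1, while the second pair should produce values well above common rule-of-thumb cutoffs.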
I have not used the 10% change method myself. However, I would suggest using that method only as one point of information. I’d really place more emphasis on theory and understanding the subject area. However, observing how much a coefficient changes when you add a covariate can provide useful information about whether bias is a problem or not. In general, if you’re uncertain, I’d err on the side of unnecessarily including a covariate rather than leaving it out. There are usually fewer problems associated with having an additional variable than omitting one. However, keep an eye on the VIFs as you do that. And, having a number of unnecessary variables could lead to problems if taken to an extreme or if you have a really small sample size.
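For reference, the change-in-estimate idea is usually described as fitting the model with and without the candidate covariate and checking how far the coefficient of interest moves. The sketch below assumes the change is measured relative to the adjusted estimate, which is one convention among several. The data are synthetic, and the names `ols_coefs`, `percent_change`, `exposure`, `confounder`, and `noise` are hypothetical.

```python
import numpy as np


def ols_coefs(y, *columns):
    """Return OLS coefficients (intercept first) for y regressed on the columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


def percent_change(y, exposure, covariate):
    """Percent change in the exposure coefficient, relative to the adjusted fit."""
    crude = ols_coefs(y, exposure)[1]
    adjusted = ols_coefs(y, exposure, covariate)[1]
    return 100.0 * (crude - adjusted) / adjusted


rng = np.random.default_rng(1)
n = 2000
confounder = rng.normal(size=n)                  # related to exposure and outcome
noise = rng.normal(size=n)                       # related to neither
exposure = 0.6 * confounder + rng.normal(size=n)
y = 1.0 * exposure + 0.8 * confounder + rng.normal(size=n)

change_confounder = percent_change(y, exposure, confounder)
change_noise = percent_change(y, exposure, noise)

print(f"Change when adding the confounder:         {change_confounder:.1f}%")
print(f"Change when adding the unrelated variable: {change_noise:.1f}%")
```

In this setup, adding the confounder should shift the exposure coefficient by far more than 10%, while adding the unrelated variable should barely move it. As noted above, treat the number as one piece of evidence rather than a decision rule.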
I wrote a post about model selection. I give some practical tips in it. Overall, I suggest using a mix of theory, subject area knowledge, and statistical approaches. I’d suggest reading that. It’s not specifically about controlling for confounders but the same principles apply. Also, I’d highly recommend reading about what researchers performing similar studies have done if that’s at all possible. They might have already addressed that issue!
Charlotte Stuart says
Hi Jim
I’m not sure whether my problem fits under this category or not, so apologies if not. I am looking at whether an inflammatory biomarker (independent variable) correlates with a measure of cognitive function (dependent variable). It does if it’s just a simple linear regression; however, the biomarker (independent variable) is affected by age, sex and whether you’re a smoker or not. Correcting for these 3 covariables in the model shows that actually there is no correlation between the biomarker and cognitive function. I assume this was the correct thing to do but wanted to make sure seeing as a) none of the 3 covariables correlate with/predict my dependent variable, and b) as age correlates highly with the biomarker, does this not introduce collinearity?
Thanks!
Charlotte
Jim Frost says
Hi Charlotte,
Yes, it sounds like you did the right thing. Including the other variables in the model allows the model to control for them.
The collinearity (aka multicollinearity or correlation between independent variables) between age and the biomarker is a potential concern. However, a little correlation, or a moderate amount of correlation is fine. What you really need to do is to assess the VIFs for your independent variables. I discuss VIFs and multicollinearity in my post about multicollinearity. So, your next step should be to determine whether you have problematic levels of multicollinearity.
One symptom of multicollinearity is a lack of statistical significance, which your model is experiencing. So, it would be good to check.
Actually, I’m noticing that at least several of your independent variables are binary. Smoker. Gender. Is the biomarker also binary? Present or not present? If so, that doesn’t change the rationale for including the other variables in the model but it does mean VIFs won’t detect the multicollinearity.
Humberto Calvani says
Thanks for the clarification, Jim. Best regards.
Humberto Calvani says
Hi Jim,
I think the section on “Predicting the Direction of Omitted Variable Bias” has a typo on the first column, first two rows. It should state:
*Omitted* and Dependent: Negative Correlation
*Omitted* and Dependent: Positive Correlation
This makes it consistent with the required two conditions for Omitted Variable Bias to occur:
The *omitted* variable must correlate with the dependent variable.
The omitted variable must correlate with at least one independent variable that is in the regression model.
Jim Frost says
Hi Humberto,
Thanks for the close reading of my article! The table is correct as it is, but you are also correct. Let’s see why!
There are the following two requirements for omitted variable bias to exist:
- The omitted variable must correlate with an IV in the model.
- That IV must correlate with the DV.
The table accurately depicts both those conditions. The columns indicate the relationship between the IV (included) and omitted variable. The rows indicate the nature of the relationship between the IV and DV.
If both those conditions are true, you can then infer that there is a correlation between the omitted variable and the dependent variable and the nature of the correlation, as you indicate. I could include that in the table, but it is redundant information.
We’re thinking along the same lines and portraying the same overall picture. Alas, I’d need to use a three dimensional matrix to portray those three conditions! Fortunately, using the two conditions that I show in the table, we can still determine the direction of bias. And you could use those two relationships to determine the relationship between the omitted variable and dependent variable if you so wanted. However, that information doesn’t change our understanding of the direction of bias because it’s redundant with information already in the table.
Thanks for the great comment and it’s always beneficial thinking through these things using a different perspective!
Vineeth says
Thank you for the intuitive explanation, Jim!
I would like to ask a question. Suppose I have two groups, one with a recently diagnosed lung disease and another with chronic lung disease, and I would like to do an independent t-test for the amount of lung damage. It happens that the two groups also significantly differ in their mean age. The group with recently diagnosed disease has a lower mean age than the group with chronic disease. Also, theory says age can cause some lung damage in the normal course too. So if I include age as a covariate in the model, won’t it regress out the effect of DV and give an underestimated effect, as the IV (age) significantly correlates with the DV (lung damage)? How do we address this confounding effect of correlation between only IV and DV? Should it be by having a control group without lung disease? If so, can one control group help? Or should there be 2 control groups with age-matching to the two study groups? Thank you in advance.
Jim Frost says
Hi Vineeth,
First, yes, if you know age is a factor, you should include it as a covariate in the model. It won’t “regress out” the true effect between the two groups. I would think of it a little differently.
You have two groups and you suspect that something caused those two groups to have differing amounts of lung damage. You also know that age plays a role. And those groups have different ages. So, if you look only at the groups without factoring in age, the effect of age is still present but the model is incorrectly attributing it to the groups. In your case, it will make the effect look larger.
When you include age, yes, it will reduce the effect size between the groups, but it will reveal the correct effect by accounting for age. So, yes, in your case, it’ll make the group difference look smaller, but don’t think of it as “regressing out” the effect. Instead, it is removing the bias in the other results. In other words, you’re improving the quality of your results.
When you look at your model results for, say, the grouping variable, it’s already controlling for the age variable. So, you’re left with just the effect between the IV and DV that is not accounted for by another variable in the model, such as age. That’s what you need!
A control group for any experiment is always a good idea if you can manage one. However, it’s not always possible. I write about these experimental design issues, randomized experiments, observational studies, how to design a good experiment, etc. among other topics in my Introduction to Statistics ebook, which you might consider. It’s also just now available in print on Amazon!
Ivan says
Hi Jim,
I was wondering whether it’s correct to check the correlation between the independent variables and the error term in order to check for endogeneity.
If we assume that there is endogeneity then the estimated errors aren’t correct and so the correlation between the independent variables and those errors doesn’t say much. Am I missing something here?
best regards,
Ivan.
Lauren Madia says
Hi Jim,
I wanted to look at the effects of confounders on my study but I’m not sure what analysis(es) to use for dichotomous covariates. I have one categorical IV with two levels, two continuous DVs, and then the two dichotomous confounding variables. It was hard to find information for categorical covariates online. Thanks in advance Jim!
Dirk says
Hi Jim,
Thank you for your nice blog. I still have a question. Let’s say I want to determine the effect of one independent variable on a dependent variable with a linear regression analysis. I have selected a number of potential variables for this relationship based on literature, such as age, gender, health status and education level. How can I check (with statistical analyses) if these are indeed confounders? I would like to know which of them I should control for in my linear regression analysis. Can I create a correlation matrix beforehand to see if the potential confounder is correlated with both my independent and dependent variable? And what threshold for the correlation coefficient should be taken here? Is this every correlation coefficient except zero (for instance, 0.004)? Are there scientific articles/books that endorse this threshold? Or is it maybe better to use a “change-in-estimate” criterion to see if my regression coefficient changes by a particular amount after adding my potential confounder to the linear regression model? What would be the threshold here?
I hope my question is clear. Thanks in advance!
Martin Fierz says
Dear Jim,
thanks for a wonderful website!
I love your example with the bone density which does not appear to be correlated to physical activity if looked at alone, and needs to have the weight added as explanatory variable to make both of them appear as significantly correlated with bone density.
I would love to use this example in my class, as I think it is very important to understand that there are situations where a single-parameter model can lead you badly astray (here into thinking activity is not correlated with bone density).
Of course, I could make up some numbers for my students, but it would be even nicer if I could give them your real data. Could you by any chance make a file of real measurements of bone densities, physical activity and weight available? I would be very grateful, and I suppose a lot of other teachers/students too!
best regards
Martin
Jim Frost says
Hi Martin,
When I wrote this post, I wanted to share the data. Unfortunately, it seems like I no longer have it. If I uncover it, I’ll add it to the post.
evangelia panagiotidou says
Hello Jim,
The work you have done is amazing, and I’ve learned so much through this website.
I am at beginner level in SPSS and I would be grateful if you could answer my question.
I have found that a medical treatment results in worse quality of life.
But I know from crosstabs that people that are taking this treatment present more severe disease (continuous variable) that also correlates to quality of life.
How can I test if it is treatment or severity that worsens quality of life?
Jim Frost says
Hi Evangelia,
Thanks so much for your kind words, I really appreciate them! And, I’m glad my website has been helpful!
That’s a great question and a valid concern to have. Fortunately, in a regression model, the solution is very simple. Just include both the treatment and severity of the disease in the model as independent variables. Doing that allows the model to hold disease severity constant (i.e., controls for it) while it estimates the effect of the treatment.
Conversely, if you did not include severity of the disease in the model, and it correlates with both the treatment and quality of life, it is uncontrolled and will be a confounding variable. In other words, if you don’t include severity of disease, the estimate for the relationship between treatment and quality of life will be biased.
We can use the table in this post for estimating the direction of bias. Based on what you wrote, I’ll assume that the treatment condition and severity have a positive correlation. Those taking the treatment present a more severe disease. And, that the treatment condition has a negative correlation with quality of life. Those on the treatment have a lower quality of life for the reasons you indicated. That puts us in the top-right quadrant of the table, which indicates that if you do not include severity of disease as an IV, the treatment effect will be underestimated.
Again, simply by including disease severity in your model will reduce the bias!
I hope that helps!
Johnny says
Hello,
Just a question about what you said about power. Will adding more independent variables to a regression model cause a loss of power? (at a fixed sample size). Or does it depend on the type of independent variable added: confounder vs. non confounder.
Luis says
you mention “Suppose you have a regression model with two significant independent variables, X1 and X2. These independent variables correlate with each other and the dependent variable”
How is it possible for two random variables (in this case the two factors) to correlate with each other if they are independent? If two random variables are independent then covariance is zero and therefore correlation is zero.
Corr(X1,X2)=Cov(X1, X2)/(sqrt(var(X1))*sqrt(var(X2)))
Cov(X1,X2)=E[X1*X2]-E[X1]*E[X2]
if X1 and X2 are independent then E[X1*X2]=E[X1]*E[X2] and therefore covariance is zero.
Jim Frost says
Hi Luis,
Ah, there’s a bit of confusion here. The explanatory variables in a regression model are often referred to as independent variables, as well as predictors, x-variables, inputs, etc. I was using “independent variable” as the name. You’re correct, if they were independent in the sense that you describe them, there would be no correlation. Ideally, there would be no correlation between them in a regression model. However, they can, in fact, be correlated. If that correlation is too strong, it will cause problems with the model.
“Independent variable” in the regression context refers to the predictors and describes their ideal state. In practice, they’ll often have some degree of correlation.
I hope this helps!
Scott Stevens says
Ah! Enlightenment!
I had taken your statement about the correlation of the independent variable with the residuals to be a statement about the computed value of the correlation between them, that is, that cor(X1, resid) was nonzero. I believe that (in a model with a constant term), this is impossible.
But I think I get now that you were using the term more loosely, referring to a (nonlinear) pattern appearing between the values of X1 and the corresponding residuals, in the same way as you would see a parabolic pattern in a scatterplot of residuals versus X if you tried to make a linear fit of quadratic data. The linear correlation between X and the residuals would still compute out, numerically, to zero, so X1 and the residuals would technically be uncorrelated, but they would not be statistically independent. If the residuals are showing a nonlinear pattern when plotted against X, look for a lurker.
The Albany example was very helpful. Thanks so much for digging it up!
Scott
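A minimal Python sketch of the pattern Scott describes, using made-up data (the grid of X values, the coefficients, and the noise level are all assumptions for illustration). A straight line is fitted to data that actually follow a quadratic curve. The computed correlation between X and the residuals is zero up to rounding, yet the residuals tend to be positive at both ends of the X range and negative in the middle.

```python
import random

def mean(v):
    return sum(v) / len(v)

def corr(u, v):
    """Pearson correlation of two equal-length lists."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

random.seed(1)
x = [i / 10 for i in range(101)]  # hypothetical X values from 0 to 10
y = [1 + 2 * xi + 0.5 * xi ** 2 + random.gauss(0, 1) for xi in x]

# Ordinary least-squares straight line (with a constant) fitted to curved data.
mx, my = mean(x), mean(y)
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

corr_x_resid = corr(x, residuals)   # zero up to floating-point rounding
low_third = mean(residuals[:34])    # average residual for the smallest X values
mid_third = mean(residuals[34:67])  # average residual for the middle X values
high_third = mean(residuals[67:])   # average residual for the largest X values
print(corr_x_resid, low_third, mid_third, high_third)
```

The zero correlation and the visible U-shape are both true at once, which is why a residual plot can reveal what a correlation coefficient misses.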
Scott Stevens says
Hi, Jim! Thanks very much for your speedy reply!
I appreciate the clarity that you aim for in your writing, and I’m sorry if I wasn’t clear in my post. Let me try again, being a bit more precise, hopefully without getting too technical.
My problem is that I think that the very process used in finding the OLS coefficients (minimizing the sum of squared residuals) results in a regression equation that satisfies two properties. First, the sum (or mean) of the resulting residuals is zero. Second, for any regressor Xi, Xi is orthogonal to the vector of residuals, which in turn means the covariance of the residuals with any regressor has to be zero. Certainly, the true error terms need not sum to zero, nor need they be uncorrelated with a regressor…but if I understand correctly, these properties of the _residuals_ are automatic consequences of fitting OLS to a data set, regardless of whether the actual error terms are correlated with the regressor or not.
I’ve found a number of sources that seem to say this–one online example is on page two here: https://www.stat.berkeley.edu/~aditya/resources/LectureSIX.pdf. I’ll be happy to provide others on request.
I’ve also generated a number of my own data sets with correlated regressors X1 and X2 and Y values generated by a X1 + b X2 + (error), where a and b are constants and (error) is a normally distributed error term of fixed variance, independently chosen for each point in the data set. In each case, leaving X2 out of the model still left me with zero correlation between X1 and the residuals, although there was a correlation between X1 and the true error terms, of course.
If I have it wrong, I’d love to see a data set that demonstrates what you’re talking about. If you don’t have time to find one (which I certainly understand), I’d be quite happy with any reference you might point me to that talks about this kind of correlation between residuals and one of the regressors in OLS, in any context.
Thanks again for your help, and for making regression more comprehensible to so many people.
Scott Stevens
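The experiment Scott describes can be sketched in a few lines of Python. This is an illustration under assumed values, not his actual data: the sample size, the constants a and b, and the strength of the link between X1 and X2 are all hypothetical. It fits Y on X1 alone and then compares X1 with the residuals, with the true error of the short model, and compares the omitted X2 with the residuals.

```python
import random

def mean(v):
    return sum(v) / len(v)

def corr(u, v):
    """Pearson correlation of two equal-length lists."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

random.seed(0)
n, a, b = 2000, 2.0, 3.0  # hypothetical sample size and true coefficients
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.7 * v + random.gauss(0, 1) for v in x1]  # X2 correlates with X1
noise = [random.gauss(0, 1) for _ in range(n)]
y = [a * u + b * v + e for u, v, e in zip(x1, x2, noise)]

# Fit Y on X1 only (with a constant), omitting X2.
m1, my = mean(x1), mean(y)
b1_hat = sum((u - m1) * (w - my) for u, w in zip(x1, y)) / sum((u - m1) ** 2 for u in x1)
b0_hat = my - b1_hat * m1
residuals = [w - (b0_hat + b1_hat * u) for u, w in zip(x1, y)]

# True error of the short model: everything in Y that a*X1 does not explain.
true_error = [b * v + e for v, e in zip(x2, noise)]

slope_bias = b1_hat - a                    # roughly b * 0.7 in this setup
corr_x1_resid = corr(x1, residuals)        # zero up to rounding
corr_x1_true_error = corr(x1, true_error)  # clearly positive
corr_x2_resid = corr(x2, residuals)        # the omitted variable shows up here
print(slope_bias, corr_x1_resid, corr_x1_true_error, corr_x2_resid)
```

In this setup the coefficient on X1 is biased even though X1 and the residuals are numerically uncorrelated. The residuals do correlate with the omitted X2, which is what a plot of residuals against a candidate omitted variable would pick up.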
Jim Frost says
Hi Scott,
Unfortunately, the analysis doesn’t fix all possible problems with the residuals. It is possible to specify models where the residuals exhibit various problems. You mention that residuals will sum to zero. However, if you specify a model without a constant, the residuals won’t necessarily sum to zero (read about that here). If you have a time series model, it’s possible to have autocorrelation in the residuals if you leave out important variables. If you specify a model that doesn’t adequately model curvature in the data, you’ll see patterns in the residuals.
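As a small illustration of the point about the constant, here is a Python sketch with made-up data (the X values, the true intercept of 5, and the noise level are assumptions). It fits the same data with and without a constant and sums the residuals in each case.

```python
import random

random.seed(2)
x = [float(i) for i in range(1, 21)]                 # hypothetical predictor values
y = [5 + 2 * xi + random.gauss(0, 0.5) for xi in x]  # true intercept is 5, not 0

# Model with a constant: y = b0 + b1*x
mx, my = sum(x) / len(x), sum(y) / len(y)
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx
resid_with_const = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Model forced through the origin: y = c1*x
c1 = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
resid_no_const = [yi - c1 * xi for xi, yi in zip(x, y)]

sum_with_const = sum(resid_with_const)  # zero up to rounding
sum_no_const = sum(resid_no_const)      # clearly not zero
print(sum_with_const, sum_no_const)
```

With the constant in the model, the residuals sum to zero by construction. Forcing the line through the origin removes that guarantee.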
In a similar vein, if you leave out an important variable that is correlated both with the DV and another IV in the model, you can have residuals that correlate with an IV. The standard practice is to graph the residuals by the independent variable to look for that relationship because it might have a curved shape which indicates a relationship but not necessarily a linear one that correlation would detect.
As for references, any regression textbook should cover this assumption. Again, it’ll refer to error, but the key is to remember that residuals are the proxy for error.
Here’s a reference from the University of Albany about Omitted Variable Bias that goes into it in more detail from the standpoint of residuals and includes an example of graphing the residuals by the omitted variable.
Scott Stevens says
Hi, Jim. I very much enjoy how you make regression more accessible, and I like to use your approaches with my own students. I’m confused, though, by the matter brought up by SFDude.
I certainly see how the _error_ term in a regression model will be correlated with an independent variable when a confounding variable is omitted, but it seems to me that the normal equations that define the regression coefficients assure that an independent variable in the model will always be uncorrelated with the _residuals_ of that model, regardless of whether an omitted confounding variable exists or not. Certainly, “X1 correlates with X2, and X2 correlates with the residuals. Ergo, variable X1 correlates with the residuals” would not hold for any three variables X1 and X2 and R. For example, if A and B are independent, then “A correlates with A + B, A + B correlates with B. Ergo, A correlates with B” is a false statement.
If I’m missing something here, I’d very much appreciate a data set that demonstrates the kind of correlation between an independent variable and the residuals of the model that it seems you’re talking about.
Thanks!
Scott Stevens
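Scott's counterexample is easy to check numerically. The following Python sketch uses simulated values (the sample size and the choice of standard normal draws are assumptions): A and B are generated independently, so each correlates with A + B, but they do not correlate with each other.

```python
import random

def mean(v):
    return sum(v) / len(v)

def corr(u, v):
    """Pearson correlation of two equal-length lists."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

random.seed(3)
n = 5000  # hypothetical sample size
a = [random.gauss(0, 1) for _ in range(n)]
b = [random.gauss(0, 1) for _ in range(n)]  # drawn independently of a
a_plus_b = [u + v for u, v in zip(a, b)]

corr_a_sum = corr(a, a_plus_b)  # near 1/sqrt(2), about 0.71
corr_sum_b = corr(a_plus_b, b)  # near 1/sqrt(2), about 0.71
corr_a_b = corr(a, b)           # close to 0
print(corr_a_sum, corr_sum_b, corr_a_b)
```

Correlation is not transitive, so two strong pairwise correlations do not by themselves imply a third.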
Jim Frost says
Hi Scott,
Thanks for writing. And, I’m glad to hear that you find my website helpful!
The key thing to remember is that while the OLS assumptions refer to the error, we can’t directly observe the true error. So, we use the residuals as estimates of the error. If the error is correlated with an omitted variable, we’d expect the residuals to be correlated as well in approximately the same manner. Omitted variable bias is a real condition, and that description is simply getting deep into the nuts and bolts of how it works. But, it’s the accepted explanation. You can read it in textbooks. While the assumptions refer to error, we can only assess the residuals instead. They’re the best we’ve got!
When you say A and B are “independent”, if you mean they are not correlated, I’d agree that removing a truly uncorrelated variable from the model does not cause this type of bias. I mention that in this post. This bias only occurs when independent variables are correlated with each other to some degree, and with the dependent variable, and you exclude one of the IVs.
I guess I’m not exactly sure which part is causing the difficulty? The regression equations can’t ensure that the residuals are uncorrelated if the model is specified in such a way that it causes them to be correlated. It’s just like in time series regression models, where you have to be on the lookout for autocorrelation (correlated residuals) because the model doesn’t account for time-order effects. Incorrectly specified models can and do cause problems with the residuals, including residuals that are correlated with other variables and themselves.
I’ll have to see if I can find a dataset with this condition.
I hope this helps!
Reeba says
Hi Jim,
I am involved in a study which looks into a number of clinical parameters, like platelet count and haemoglobin, for patients who underwent an emergency change of a mechanical circulatory support device due to thrombosis or clotting of the actual device. The purpose is to see if there is a trend in these parameters in the time frame from 3 days before to 3 days after the change, and to establish whether these parameters could be used as predictors of the event. My concern is that there is no control group for this study. But I don’t see the need for looking into the trend in a group which never had an event itself. Will not having a control group be considered a weakness of this study?
Also, what would be the best statistical test for this? I was thinking of the generalized linear model.
I would really appreciate your guidance here.
Thank you
Susan Mitchell says
Dear Jim,
I’m looking at a published paper that develops clinical prediction rules by using logistic regression in order to help primary care doctors to decide who to refer to breast clinics for further investigation. The dependent variable is simply whether breast cancer is found to be present or not. The independent variables include 11 symptoms and age in (mostly) ten year increments (six separate age bands). The age bands were decided before the logistic regression was carried out. The paper goes on to use the data to create a scoring system based on symptoms and age. If this scoring system were to be used then above a certain score a woman would be referred, and below a certain score a woman would not be referred.
The total sample size is 6590 women referred to a breast clinic, of whom 320 were found to have breast cancer. The sample itself is very skewed. In younger women, breast cancer is rare, and so in some categories the numbers are very low. So for instance, in the 18-29 age band there are 62 women referred of whom 8 women have breast cancer, and in the 30-39 age band there are 755 women referred of whom only one woman has breast cancer. So my first question is: if there are fewer individuals in particular categories than symptoms, can the paper still use logistic regression to predict who to refer to a breast clinic based on a scoring system that includes both age and symptoms? My second question is: if there are meant to be at least 10 individuals per variable in logistic regression, are the numbers of women with breast cancer in these age groups too small for logistic regression to apply?
When I look at the total number of women in the sample (6590) and then the total number of symptoms (8616) there is a discrepancy. This means that some women have had more than one symptom recorded. (Or from the symptoms’ point of view, some women have been recorded more than once). So my third question is: does this mean that some of the independent variables are not actually independent of each other? (There is around a 30%-32% discrepancy in all categories. How significant is this?)
There are lots of other problems with the paper (the fact the authors only look at referred women rather than all the symptomatic women that a primary care doctor sees is a case in point) but I’d like to know whether the statistics are flawed too. If there are any other questions I need to ask about the data please do let me know.
With very best wishes,
Ms Susan Mitchell
Jim Frost says
Hi Susan,
Offhand, I don’t see anything that screams to me that there is a definite problem. I’d have to read the study to be really sure. Here are some thoughts.
I’m not in the medical field, but I’ve heard talks by people in that field and it sounds like this is a fairly common use for binary logistic regression. The analyst creates a model where you indicate which characteristics, risk factors, etc. apply to an individual. Then, the model predicts the probability of an outcome for them. I’ve seen similar models for surgical success, death, etc. The idea is that it’s fairly easy to use because someone can just enter the characteristics of the patient and the model spits out a probability. For any model of this type, you’d really have to check the residuals and see all the output to determine how well the model fits the data. But, there’s nothing inherently wrong with this approach.
I don’t see a problem with the sample size (6590) and the number of IVs (12). That’s actually a very good ratio of observations per IV.
It’s ok that there are fewer individuals in some categories. It’s better if you have a fairly equal number but it’s not a show stopper. Categories with fewer observations will have less precise estimates. It can potentially reduce the precision of the model. You’d have to see how well the model fits the data to really know how well it works out. But, yes, if you have an extremely low number of individuals that have a particular symptom, you won’t get as precise an estimate for that symptom’s effect. You might see a wider CI for its odds ratio. But, it’s hard to say without seeing all of that output and how the numbers break down by symptom. And, it’s possible that they selected the characteristics that apply to a sufficient number of women. Again, I wouldn’t be able to say. It’s an issue to consider for sure.
As for the number of symptoms versus the number of women, it’s ok that a woman can have more than one symptom. Each symptom is in its own column and will be coded with a 1 or 0. A row corresponds to one woman and she’ll have a 1 for each characteristic that she has and 0s for the ones that she does not have. It’s possible these symptoms are correlated. These are categorical variables, so you couldn’t use Pearson’s correlation. You’d need to use something like the chi-square test of independence. And, some correlation is okay. Only very high correlation would be problematic. Again, I can’t say whether that’s a problem in this study or not because it depends on the degree of correlation. It might be, but it’s not necessarily a problem. You’d hope that the study strategically included a good set of IVs that aren’t overly correlated.
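To make the coding and the test concrete, here is a rough Python sketch. The counts are invented for illustration and are not from the paper Susan describes. It builds one row per woman with a 0/1 column for each of two hypothetical symptoms, then computes the Pearson chi-square statistic for independence by hand (no continuity correction) along with the phi coefficient as a measure of how strong the association is.

```python
import math

# Hypothetical counts for two symptoms, A and B, across 200 women.
both, only_a, only_b, neither = 30, 20, 10, 140

# One row per woman, one 0/1 column per symptom.
rows = [(1, 1)] * both + [(1, 0)] * only_a + [(0, 1)] * only_b + [(0, 0)] * neither

# Rebuild the 2x2 table of observed counts from the coded rows.
observed = [[0, 0], [0, 0]]
for sym_a, sym_b in rows:
    observed[sym_a][sym_b] += 1

n = len(rows)
row_totals = [sum(r) for r in observed]
col_totals = [observed[0][j] + observed[1][j] for j in range(2)]

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (observed[i][j] - expected) ** 2 / expected

p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
phi = math.sqrt(chi2 / n)                 # strength of association, 0 to 1
print(chi2, p_value, phi)
```

A small p-value says only that the two symptoms are associated. The size of phi is the more useful guide to whether the association is strong enough to worry about.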
Regarding the referred women vs symptomatic women, that comes down to the population that is being sampled and how generalizable the results are. Not being familiar with the field, I don’t have a good sense for how that affects generalizability, but yes that would be a concern to consider.
So, I don’t see anything that shouts to me that it’s a definite problem. But, as with any regression model, it would come down to the usual assessments of how well the model fits the data. You mention issues that could be concerns, but again, it depends on the specifics.
Sorry I couldn’t provide more detailed thoughts but evaluating these things requires real specific information. But, the general approach for this study seems sound to me.
Terri Leach says
Hi Jim,
I have a question: how well can we evaluate whether a regression equation “fits” the data by examining the R-squared statistic, and test the statistical significance of the whole regression equation using the F-test?
Thank you~
Jim Frost says
Hi Terri,
I have two blog posts that will be perfect for you!
Interpreting R-squared
Interpreting the F-test of Overall Significance
If you have questions about either one, please post them in the comments section of the corresponding post. But, I think those posts will go a long way in answering your questions!
Dahlia says
Mr. Frost I know I need to run a regression model however I’m still unsure of which one. I’m examining the effects of alcohol use on teenagers with 4 confounders.
Jim Frost says
Hi Dahlia, to make the decision, I’d need to know what types of variables they all are (continuous, categorical, binary, etc). However, if the effect of alcohol is a continuous variable, then OLS linear regression is a great place to start!
Best of luck with your analysis!
Patrik Silva says
Thank you very much Jim,
Very helpful. I think my problem is really the number of observations (25 obs). Yes, I have read that post also, and I always keep the theory in mind when analyzing the IVs.
My main objective is to show the existing relationship between X2 and Y, which is also supported by the literature. However, if I do not control for X1, I will never be sure whether the effect I have found is due to X2 or X1, because X1 and X2 are correlated.
I think correlation alone would be OK, since my number of observations is limited and regression limits the number of IVs I can include in the model, which may force me to leave some other IVs out of the model, which is also bad.
Thank you again
Best regards!
PS
Patrik Silva says
Hi Jim,
Thank you for this very good post.
However, I have a question. What should I do if the IVs X1 and X2 are correlated (say, 0.75) and both are correlated with Y (the DV) at 0.60? When I include X1 and X2 in the same model, X2 is not statistically significant, but when I enter them separately each is statistically significant. On the other hand, the model with only X1 has higher explanatory power than the model with only X2.
Note: In the individual models both meet the OLS assumptions but, together, X2 is no longer statistically significant (using stepwise regression, X2 is removed from the model). What does this mean?
In addition, I know from the literature that X2 affects Y, but I am testing X1, and X1 is showing a better fit than X2.
Thank you in advance, I hope you understand my question!
Jim Frost says
Hi Patrik,
Yes, I understand completely! This situation isn’t too unusual. The underlying problem is that because the two IVs are correlated, they’re supplying a similar type of predictive information. There isn’t enough unique predictive information for both of them to be statistically significant. If you had a larger sample size, it’s possible that both would be significant. Also, keep in mind that correlation is a pairwise measure and doesn’t account for other variables. When you include both IVs in the model, the relationship between each IV and the DV is determined after accounting for the other variables in the model. That’s why you can see a pairwise correlation but not a relationship in a regression model.
I know you’ve read a number of my posts, but I’m not sure if you’ve read the one about model specification. In that post, a key point I make is not to use statistical measures alone to determine which IVs to leave in the model. If theory suggests that X2 should be included, you have a very strong case for including it even if it’s not significant when X1 is in the model–just be sure to include that discussion in your write-up.
Conversely, just because X1 seems to provide a better fit statistically and is significant with or without X2 doesn’t mean you must include it in the model. Those are strong signs that you should consider including a variable in the model. However, as always, use theory as a guide and document the rationale for the decisions you make.
For your case, you might consider including both IVs in the model. If they’re both supplying similar information and X2 is justified by theory, chances are that X1 is as well. Again, document your rationale. If you include both, check the VIFs to be sure that you don’t have problematic levels of multicollinearity when you include both IVs. If those are the only two IVs in your model, that won’t be problematic given the correlations you describe. But, it could be problematic if you have more IVs in the model that are also correlated with X1 and X2.
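For the two-IV case, the VIF can be worked out directly from the pairwise correlation, because the R-squared from regressing one IV on the other is just the squared correlation. A minimal Python sketch, assuming exactly two IVs and the 0.75 correlation Patrik reports (the function name is made up for illustration):

```python
def vif_two_predictors(r):
    """VIF for either predictor when the model has exactly two predictors
    whose pairwise correlation is r: VIF = 1 / (1 - r^2)."""
    return 1.0 / (1.0 - r ** 2)

vif = vif_two_predictors(0.75)  # correlation reported between X1 and X2
print(round(vif, 2))  # 2.29
```

A VIF of about 2.3 is well under the rule-of-thumb cutoffs of 5 or 10 that analysts often quote. With more than two IVs this shortcut no longer applies, and each VIF has to come from regressing that IV on all of the others.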
Another thing to look at is whether the coefficients for X1 and X2 vary greatly depending on whether you have one or both of the IVs in the model. If they don’t change much, that’s nice and simple. However, if they do change quite a bit, then you need to determine which coefficient values are likely to be closer to the correct value because that corresponds to the choice about which IVs to include! I’m sounding like a broken record, but if this is a factor, document your rationale and decisions.
I hope that helps! Best of luck with your analysis!
Patrick says
Hi Jim,
Another great post! Thank you for truly making statistics intuitive.
I learned a lot of this material back in school, but am only now understanding it more conceptually thanks to you. Super useful for my work in analytics. Please keep it up!
Jim Frost says
Thanks, Patrick! It’s great to hear that it was helpful!
Jayant Jain says
I think there may be a typo here – “These are important variables that the statistical model does include and, therefore, cannot control.” Shouldn’t it be “does not include”, if I understand correctly?
Jim Frost says
Thanks, Jayant! Good eagle eyes! That is indeed a typo. I will fix it. Thanks for pointing it out!
Lucy Quinlan says
Mr. Jim, thank you for making me understand econometrics. I thought that omitted variables are excluded from the model and that’s why they cause the coefficients to be under- or overestimated. Somewhere in this article you mentioned that they are still included in the model but not controlled for. I find that very confusing. Would you be able to clarify?
Thanks a lot.
Jim Frost says
Hi Lucy,
You’re definitely correct. Omitted variable bias occurs when you exclude a variable from the model. If I gave the impression that it’s included, please let me know where in the text because I want to clarify that! Thanks!
By excluding the variable, the model does not control for it, which biases the results. When you include a previously excluded variable, the model can now control for it and the bias goes away. Maybe I wrote that in a confusing way?
Thanks! I always strive to make my posts as clear as possible, so I’ll think about how to explain this better.
Stan Alekman says
In addition to mean square error and adjusted R-squared, I use Cp, IC, HQC, and SBIC to decide the number of independent variables in multiple regression.
Jim Frost says
I think there are a variety of good measures. I’d also add predicted R-squared, as long as you use them in conjunction with subject-area expertise. As I mention in this post, the entire set of estimated relationships must make theoretical sense. If they don’t, the statistical measures are not important.
Stan Alekman says
I have to read the article you named. Having said that, caution is needed when regression models are used for systems or processes that are not in statistical control. Also, some processes have physical bounds that a regression model does not capture, so calculated predicted values can have no physical meaning. Further, models built from narrow ranges of the independent variables may not be applicable outside those ranges.
Jim Frost says
Hi Stan, those are all great points, and true. They all illustrate how you need to use your subject-area knowledge in conjunction with statistical analyses.
I talk about the issue of not going outside the range of the data, amongst other issues, in my post about Using Regression to Make Predictions.
I also agree about statistical control, which I think is underappreciated outside of the quality improvement arena. I’ve written about this in a post about using control charts with hypothesis tests.
Stan Alekman says
Valid confidence/prediction intervals are important if the regression model represents a process that is being characterized. When the prediction intervals are wide or too wide, the model’s validity and utility are in question.
Jim Frost says
Hi Stan,
You’re definitely correct! If the model doesn’t fit the data, your predictions are worthless. One minor caveat that I’d add to your comment.
The prediction intervals can be too wide to be useful yet the model might still be valid. It’s really two separate assessments. Valid model and degree of precision. I write about this in several posts including the following: Understanding Precision in Prediction
Stan Alekman says
Jim, does centering any independent explanatory variable require centering them all? Center the dependent and explanatory variables?
I always make a normal probability plot of the deleted residuals as one test of the prediction capability of the fitted model. It is remarkable how good models give good normal probability plots. I also use the Shapiro-Wilk test to assess the deleted residuals for normality.
Stan Alekman
Jim Frost says
Hi Stan,
Yes, you should center all of the continuous independent variables if your goal is to reduce multicollinearity and/or to be able to interpret the intercept. I’ve never seen a reason to center the dependent variable.
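One common case where centering helps is a model that includes a squared term (or an interaction), because the raw variable and its square tend to be highly correlated. The Python sketch below uses made-up predictor values and assumes that this kind of structural multicollinearity is the concern. It compares the correlation between X and X squared before and after centering.

```python
def mean(v):
    return sum(v) / len(v)

def corr(u, v):
    """Pearson correlation of two equal-length lists."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

x = [float(i) for i in range(1, 21)]  # hypothetical predictor values
x_sq = [v ** 2 for v in x]

x_mean = mean(x)
xc = [v - x_mean for v in x]          # centered predictor
xc_sq = [v ** 2 for v in xc]          # squared term built from the centered values

corr_raw = corr(x, x_sq)              # close to 1
corr_centered = corr(xc, xc_sq)       # zero here because the values are symmetric
print(corr_raw, corr_centered)
```

The centered correlation is exactly zero in this example only because the X values are evenly spaced around their mean. With skewed data, centering typically reduces the correlation without removing it entirely.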
It’s funny that you mention that about normally distributed residuals! I, too, have been impressed with how frequently that occurs even with fairly simple models. I’ve recently written a post about OLS assumptions and I mention how normal residuals are sort of optional. They only need to be normally distributed if you want to perform hypothesis tests and have valid confidence/prediction intervals. Most analysts want at least the hypothesis tests!
Mugdha Bhatnagar says
Hey Jim, your blogs are really helpful for me to learn data science. Here is the question in my assignment:
You have built a classification model with 90% accuracy, but your client is not happy because the false positive rate was very high. What will you do? Can we do something about it using precision or recall?
This is the whole question. No background is given, though they should have given some!
Brahim KHOUILED says
Thank you Jim
Really interesting
Jim Frost says
Hi Brahim, you’re very welcome! I’m glad it was interesting!
MG says
Hey Jim, you are awesome.
Jim Frost says
Aw, MG, thanks so much!! 🙂
SFdude says
Thanks for another great article, Jim!
Q: Could you expand with a specific plot example to explain this statement more clearly:
“We know that for omitted variable bias to exist, an independent variable must correlate with the residuals. Consequently, we can plot the residuals by the variables in our model. If we see a relationship in the plot, rather than random scatter, it both tells us that there is a problem and points us towards the solution. We know which independent variable correlates with the confounding variable.”
Thanks!
SFdude
Jim Frost says
Hi, thanks!
I’ll try to find a good example plot to include soon. Basically, you’re looking for any non-random pattern. For example, the residuals might tend to either increase or decrease as the value of independent variable increases. That relationship can follow a straight line or display curvature, depending on the nature of relationship.
I hope this helps!
Saketh prasad says
It’s been a long time since I heard from you, Jim. Missed your stats!
Jim Frost says
Hi Saketh, thanks, you’re too kind! I try to post here every two weeks at least. Occasionally, weekly!