Omitted variable bias occurs when a regression model leaves out relevant independent variables, which are known as confounding variables. This condition forces the model to attribute the effects of omitted variables to variables that are in the model, which biases the coefficient estimates.

This problem occurs because your linear regression model is specified incorrectly—either because the confounding variables are unknown or because the data do not exist. If this bias affects your model, it is a severe condition because you can’t trust your results.

In this post, you’ll learn about this type of bias, how it occurs, and how to detect and correct it.

**Related post**: Specifying the Correct Regression Model

## What Are the Effects of Omitted Variable Bias?

Omitting confounding variables from your regression model can bias the coefficient estimates. What does that mean exactly? When you’re assessing the effects of the independent variables in the regression output, this bias can produce the following problems:

- Overestimate the strength of an effect.
- Underestimate the strength of an effect.
- Change the sign of an effect.
- Mask an effect that actually exists.

You don’t want any of these problems to affect your regression results!

To learn more about the properties of biased and unbiased estimates in regression analysis, read my post about the Gauss-Markov theorem.

## Synonyms for Confounding Variables and Omitted Variable Bias

In the context of regression analysis, there are various synonyms for omitted variables and the bias they can cause. Analysts often refer to omitted variables that cause bias as confounding variables, confounders, and lurking variables. These are important variables that the statistical model does not include and, therefore, cannot control. Additionally, they call the bias itself omitted variable bias, spurious effects, and spurious relationships. I’ll use these terms interchangeably.

## What Conditions Cause Omitted Variable Bias?

How does this bias occur? How can variables you leave out of the model affect the variables that you include in the model? At first glance, this problem might not make sense.

For omitted variable bias to occur, the following two conditions must exist:

- The omitted variable must correlate with the dependent variable.
- The omitted variable must correlate with at least one independent variable that is in the regression model.

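The diagram below illustrates these two conditions as a triangle formed by the dependent variable, the included independent variable, and the omitted variable. There must be non-zero correlations (r) on all three sides of the triangle.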

This correlation structure causes confounding variables that are not in the model to bias the estimates that appear in your regression results. For example, removing either X variable from the model will bias the coefficient estimate for the other X variable.

The amount of bias depends on the strength of these correlations. Strong correlations produce greater bias. If the relationships are weak, the bias might not be severe. And, if the omitted variable is not correlated with another independent variable at all, excluding it does not produce bias.
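For readers who prefer a formula, the two conditions can be summarized with the standard textbook result for the simplest case of one included variable and one omitted variable. The symbols below are not from this post; they are introduced only for illustration. Suppose the correct model is

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon
$$

If you omit $x_2$ and regress $y$ on $x_1$ alone, the estimated slope $\tilde{\beta}_1$ centers (in large samples) on

$$
\tilde{\beta}_1 \;\longrightarrow\; \beta_1 + \beta_2\,\delta_1,
\qquad
\delta_1 = \frac{\operatorname{Cov}(x_1, x_2)}{\operatorname{Var}(x_1)}
$$

The bias is the product $\beta_2\,\delta_1$. It is zero when the omitted variable has no effect on the dependent variable ($\beta_2 = 0$) or when it is uncorrelated with the included variable ($\delta_1 = 0$), and it grows as either relationship strengthens.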

Finally, if you’re performing a randomized experiment, omitted variable bias is less likely to be a problem. Randomized studies minimize the effects of confounding variables by equally distributing them across the treatment groups. Omitted variable bias tends to occur in observational studies.
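To make the contrast between randomized and observational designs concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the variable names, the effect sizes (2 for the treatment, 3 for the confounder), and the assignment rules are invented and do not come from a real study.

```python
import numpy as np

def slope(x, y):
    """OLS slope from a simple regression of y on x (with an intercept)."""
    design = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

rng = np.random.default_rng(42)
n = 20_000
confounder = rng.normal(size=n)

# Randomized experiment: assignment ignores the confounder entirely.
treat_random = rng.integers(0, 2, size=n).astype(float)

# Observational study: subjects with a higher confounder value are
# more likely to end up in the treatment group.
treat_observed = (confounder + rng.normal(size=n) > 0).astype(float)

def outcome(treatment):
    # Hypothetical true model: treatment effect 2, confounder effect 3.
    return 2.0 * treatment + 3.0 * confounder + rng.normal(size=n)

# In both cases the confounder is left out of the regression.
slope_random = slope(treat_random, outcome(treat_random))
slope_observed = slope(treat_observed, outcome(treat_observed))

print(f"randomized design:    {slope_random:.2f}")
print(f"observational design: {slope_observed:.2f}")
```

Under these assumptions, the randomized estimate should land near the true effect of 2 even though the confounder is omitted, while the observational estimate should be inflated because treatment and confounder move together.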

I’ll explain how confounding variables can bias the results using two approaches. First, I’ll work through an example and describe how the omitted variable forces the model to attribute the effects of the excluded variable to the one in the model. Then, I’ll go into a more statistical explanation that details the correlation structure, residuals, and an assumption violation. Explaining confounding variables using both approaches will give you a solid grasp of how the bias occurs.

**Related post**: Understanding Correlations

## Practical Example of How Confounding Variables Can Produce Bias

I used to work in a biomechanics lab. One study assessed the effects of physical activity on bone density. We measured various characteristics including the subjects’ activity levels, their weights, and bone densities among many others. Theories about how our bodies build bone suggest that there should be a positive correlation between activity level and bone density. In other words, higher activity produces greater bone density.

Early in the study, I wanted to validate our initial data quickly by using simple regression analysis to determine whether there is a relationship between activity and bone density. If our data were valid, there should be a positive relationship. To my great surprise, there was no relationship at all!

What was happening? The theory is well established in the field. Maybe our data was messed up somehow? Long story short, thanks to a confounding variable, the model was exhibiting omitted variable bias.

To perform the quick assessment, I included activity level as the only independent variable, but it turns out there is another variable that correlates with both activity and bone density—the subject’s weight.

After including weight in the regression model, along with activity, the results indicated that both activity and weight are statistically significant and have positive correlations with bone density. The diagram below shows the signs of the correlations between the variables.

## How the Omitted Confounding Variable Hid the Relationship

Right away we see that these conditions can produce omitted variable bias because all three sides of the triangle have non-zero correlations. Let’s find out how leaving weight out of the model masked the relationship between activity and bone density.

Subjects who are more active tend to have higher bone density. Additionally, subjects who weigh more also tend to have higher bone density. However, there is a negative correlation between activity and weight. More active subjects tend to weigh less.

This correlation structure produces two opposing effects of activity. More active subjects get a bone density boost. However, they also tend to weigh less, which reduces bone density.

When I fit a regression model with only activity, the model had to attribute both opposing effects to activity alone. Hence, the zero correlation. However, when I fit the model with both activity and weight, it could assign the opposing effects to each variable separately.

For this example, when I omitted weight from the model, it produced a negative bias because the model underestimated the effect of activity. The results said there is no correlation when there is, in fact, a positive correlation.
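The masking effect is easy to reproduce with simulated data. The sketch below does not use the lab’s data. It is a made-up data-generating process in which activity and weight both raise bone density while activity and weight are negatively correlated, with coefficients (1.0, 2.0, and -0.5) chosen so that the two effects cancel.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients (intercept first) for predictors X and response y."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 5000

# Hypothetical synthetic data (assumed values, not measurements).
activity = rng.normal(size=n)
# More active subjects tend to weigh less (negative correlation).
weight = -0.5 * activity + rng.normal(scale=0.5, size=n)
# Both activity and weight increase bone density.
bone_density = 1.0 * activity + 2.0 * weight + rng.normal(size=n)

short = ols(activity, bone_density)                             # weight omitted
full = ols(np.column_stack([activity, weight]), bone_density)   # weight included

print(f"activity coefficient, weight omitted:  {short[1]: .3f}")
print(f"activity coefficient, weight included: {full[1]: .3f}")
print(f"weight coefficient:                    {full[2]: .3f}")
```

With weight omitted, the activity coefficient should sit near zero. With weight included, it should recover a value close to the assumed 1.0.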

## Correlations, Residuals, and OLS Assumptions

Now, let’s look at this from another angle that involves the residuals and an assumption. When you satisfy the ordinary least squares (OLS) assumptions, the Gauss-Markov theorem states that your estimates will be unbiased and have the minimum variance among linear unbiased estimators.

However, omitted variable bias occurs because the residuals violate one of the assumptions. To see how this works, you need to follow a chain of events.

Suppose you have a regression model with two significant independent variables, X1 and X2. These independent variables correlate with each other and the dependent variable—which are the requirements for omitted variable bias.

Now, imagine that we take variable X2 out of the model. It is the confounding variable. Here’s what happens:

- The model fits the data less well because we’ve removed a significant explanatory variable. Consequently, the gap between the observed values and the fitted values increases. These gaps are the residuals.
- The degree to which each residual increases depends on the relationship between X2 and the dependent variable. Consequently, the residuals correlate with X2.
- X1 correlates with X2, and X2 correlates with the residuals. Ergo, variable X1 correlates with the residuals.
- Hence, this condition violates the ordinary least squares assumption that the independent variables in the model do not correlate with the error term, which is the unexplained part of the dependent variable that the residuals estimate. Violations of this assumption produce biased estimates.

This explanation serves a purpose later in this post!

The important takeaway here is that leaving out an important variable not only reduces the goodness-of-fit (larger residuals), but it can also bias the coefficient estimates.
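One nuance about the chain of events above: strictly speaking, the assumption concerns the model’s unobserved error term. The residuals that OLS computes after fitting are, by construction, uncorrelated with the variables in the model, because the biased coefficient absorbs the correlation. The sketch below illustrates the difference with simulated data. The coefficients (1.0, 1.5, and 0.7) are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)            # X2 correlates with X1
y = 1.0 * x1 + 1.5 * x2 + rng.normal(size=n)  # assumed true model

# Error term of the model without X2, evaluated at the TRUE X1 coefficient.
# It contains the effect of the omitted X2.
true_error = y - 1.0 * x1

# Fit the model without X2 by OLS and compute the fitted residuals.
design = np.column_stack([np.ones(n), x1])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted_residuals = y - design @ coef

corr_error = np.corrcoef(x1, true_error)[0, 1]
corr_resid = np.corrcoef(x1, fitted_residuals)[0, 1]

print(f"corr(X1, error term):       {corr_error:.3f}")
print(f"corr(X1, fitted residuals): {corr_resid:.3f}")
print(f"estimated X1 coefficient:   {coef[1]:.3f}  (true value 1.0)")
```

The error term correlates clearly with X1, which is the assumption violation. The fitted residuals do not, and the price is an X1 coefficient that is pushed well away from its true value.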

**Related posts**: 7 Classical OLS Assumptions and Check Your Residual Plots

## Predicting the Direction of Omitted Variable Bias

We can use correlation structures, like the one in the example, to predict the direction of bias that occurs when the model omits a confounding variable. The direction depends on both the correlation between the included and omitted independent variables and the relationship between the omitted variable and the dependent variable. The table below summarizes these relationships and the direction of bias.

| | Included and Omitted: Negative Correlation | Included and Omitted: Positive Correlation |
|---|---|---|
| Omitted and Dependent: Negative Correlation | Positive bias: coefficient is overestimated. | Negative bias: coefficient is underestimated. |
| Omitted and Dependent: Positive Correlation | Negative bias: coefficient is underestimated. | Positive bias: coefficient is overestimated. |

Let’s apply this table to the bone density example. The included (Activity) and omitted confounding variable (Weight) have a negative correlation, so we need to use the middle column. The omitted variable (Weight) and the dependent variable (Bone Density) have a positive relationship, which corresponds to the bottom row. At the intersection of the middle column and bottom row, the table indicates that we can expect a negative bias, which matches our results.

Suppose we hadn’t collected weight and were unable to include it in the model. In that case, we can use this table, along with the hypothesized relationships, to predict the direction of the omitted variable bias.
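The table’s logic reduces to a sign rule: multiply the sign of the included-omitted correlation by the sign of the omitted variable’s relationship with the dependent variable. Here is a small sketch of that rule. The function name and its arguments are hypothetical, and the second argument follows the standard omitted variable bias formula in using the omitted variable’s relationship with the dependent variable.

```python
def predict_bias_direction(included_omitted: float, omitted_dependent: float) -> str:
    """Predict the direction of omitted variable bias from two hypothesized signs.

    included_omitted:  correlation between the included and omitted variables.
    omitted_dependent: relationship between the omitted and dependent variables.
    Only the signs of the two arguments matter.
    """
    product = included_omitted * omitted_dependent
    if product > 0:
        return "positive bias: coefficient is overestimated"
    if product < 0:
        return "negative bias: coefficient is underestimated"
    return "no bias expected"

# Bone density example: activity and weight correlate negatively,
# and weight relates positively to bone density.
print(predict_bias_direction(-1, +1))
```

For the bone density example, the rule returns a negative bias, which agrees with the table.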

## How to Detect Omitted Variable Bias and Identify Confounding Variables

You saw one method of detecting omitted variable bias in this post. If you include different combinations of independent variables in the model, and you see the coefficients changing, you’re watching omitted variable bias in action!

In this post, I started with a regression model that has activity as the lone independent variable and bone density as the dependent variable. After adding weight to the model, the estimated relationship between activity and bone density changed from zero to positive.

However, if we don’t have the data, it can be harder to detect omitted variable bias. If my study hadn’t collected the weight data, the answer would not be as clear.

I presented a clue earlier in this post. We know that for omitted variable bias to exist, an independent variable must correlate with the residuals. Consequently, we can plot the residuals by the variables in our model. If we see a relationship in the plot, rather than random scatter, it both tells us that there is a problem and points us towards the solution. We know which independent variable correlates with the confounding variable.
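A practical caveat when reading these plots: OLS forces the residuals to have zero linear correlation with every variable in the model, so a purely linear confounding relationship will not appear as a sloped trend. What can show up is curvature or another non-random pattern. The sketch below is a numerical stand-in for a residual plot, using an invented omitted variable that relates to the included one both linearly and through a squared term.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000

x1 = rng.normal(size=n)
# Hypothetical omitted variable: related to x1 linearly AND with curvature.
x2 = 0.7 * x1 + 0.5 * x1**2 + rng.normal(scale=0.5, size=n)
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)  # assumed true model

# Fit the model that omits x2 and compute its residuals.
design = np.column_stack([np.ones(n), x1])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
residuals = y - design @ coef

# Compare the average residual near the center of x1 with the tails.
mean_center = residuals[np.abs(x1) < 0.5].mean()
mean_tails = residuals[np.abs(x1) > 1.5].mean()
linear_corr = np.corrcoef(x1, residuals)[0, 1]

print(f"mean residual, center of x1: {mean_center:.2f}")
print(f"mean residual, tails of x1:  {mean_tails:.2f}")
print(f"linear corr(x1, residuals):  {linear_corr:.3f}")
```

In this setup, the residuals average below zero in the middle of the x1 range and above zero in the tails, which is the U-shaped pattern you would see in the plot, even though their linear correlation with x1 is zero.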

Another step is to carefully consider theory and other studies. Ask yourself several questions:

- Do the coefficient estimates match the theoretical signs and magnitudes? If not, you need to investigate. That was my first tip-off!
- Can you think of confounding variables that you didn’t measure that are likely to correlate with both the dependent variable and at least one independent variable? Reviewing the literature, consulting experts, and brainstorming sessions can shed light on this possibility.

## Obstacles to Correcting Omitted Variable Bias

Again, you saw the best correction possible in this post—including the variable in the model! Including confounding variables in a regression model allows the analysis to control for them and prevent the spurious effects that the omitted variables would have caused otherwise. Theoretically, you should include all independent variables that have a relationship with the dependent variable. That’s easier said than done because this approach produces real-world problems.

For starters, you might need to collect data on many more characteristics than is feasible. Additionally, some of these characteristics might be very difficult or even impossible to measure. Suppose you fit a model for salary that includes experience and education. Ability might also be a significant variable, but one that is much harder to measure in some fields.

Furthermore, as you include more variables in the model, the number of observations must increase to avoid overfitting the model, which can also produce unreliable results. Measuring more characteristics *and* gathering a larger sample size can be an expensive proposition!

Because the bias occurs when the confounding variables correlate with independent variables, including these confounders invariably introduces multicollinearity into your model. Multicollinearity causes its own problems including unstable coefficient estimates, lower statistical power, and less precise estimates.

It’s important to note a tradeoff that might occur between precision and bias. As you include the formerly omitted variables, you lessen the bias, but the multicollinearity can potentially reduce the precision of the estimates.
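One common way to quantify multicollinearity is the variance inflation factor (VIF), which for each independent variable equals 1 / (1 - R²), where R² comes from regressing that variable on the other independent variables. The helper below is a minimal NumPy sketch. The function name `vif` and the simulated data (two predictors with a correlation of about 0.75) are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_predictors)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on the remaining predictors (with an intercept).
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r_squared = 1.0 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)

rng = np.random.default_rng(5)
n = 5000
x1 = rng.normal(size=n)
# Construct x2 so that its correlation with x1 is roughly 0.75.
x2 = 0.75 * x1 + np.sqrt(1 - 0.75**2) * rng.normal(size=n)

vifs = vif(np.column_stack([x1, x2]))
print(vifs)
```

With a correlation near 0.75, both VIFs should come out around 1 / (1 - 0.75²), or roughly 2.3. A VIF of 1 indicates no multicollinearity for that variable.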

It’s a balancing act! Let’s get into some practical recommendations.

**Related posts**: Overfitting Regression Models and Multicollinearity in Regression

## Recommendations for Addressing Confounding Variables and Omitted Variable Bias

Before you begin your study, arm yourself with all the possible background information you can gather. Research the study area, review the literature, and consult with experts. This process enables you to identify and measure the crucial variables that you should include in your model. It helps you avoid the problem in the first place. Just imagine if you collect all your data and *then* realize that you didn’t measure a critical variable. That’s an expensive mistake!

After the analysis, this background information can help you identify potential bias, and, if necessary, track down the solution.

Check those residual plots! Sometimes you might not be sure whether bias exists, but the plots can clearly display the hallmarks of confounding variables.

Recognize that omitted variable bias lessens as the strength of the correlations decreases. It might not always be a significant problem. Understanding the relationships between the variables helps you make this determination.

Remember that a tradeoff between bias and the precision of the estimates *might* occur. As you add confounding variables to reduce the bias, keep an eye on the precision of the estimates. To track the precision, check the confidence intervals of the coefficient estimates. If the intervals become wider, the estimates are less precise. In the end, you might accept a little bias if it significantly improves precision.
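The tradeoff can be seen by comparing confidence interval widths with and without a highly correlated confounder. The sketch below uses simulated data with assumed values (a correlation of about 0.95 between the predictors and a small confounder effect of 0.3) and a normal-approximation 95% interval. The helper `fit_with_ci` is hypothetical.

```python
import numpy as np

def fit_with_ci(X, y, z=1.96):
    """OLS coefficients with approximate 95% confidence half-widths."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    dof = len(y) - design.shape[1]
    sigma2 = resid @ resid / dof                      # residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)   # coefficient covariance
    return coef, z * np.sqrt(np.diag(cov))

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(size=n)
y = 1.0 * x1 + 0.3 * x2 + rng.normal(size=n)  # assumed true model

coef_short, hw_short = fit_with_ci(x1, y)                        # x2 omitted
coef_full, hw_full = fit_with_ci(np.column_stack([x1, x2]), y)   # x2 included

print(f"x1 alone:   {coef_short[1]:.2f} +/- {hw_short[1]:.2f}")
print(f"x1 with x2: {coef_full[1]:.2f} +/- {hw_full[1]:.2f}")
```

In this setup, the model without x2 gives a tight interval centered on a biased value, while the model with x2 centers on the true value but with an interval several times wider.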

## What to Do When Including Confounding Variables is Impossible

If you absolutely cannot include an important variable and it causes omitted variable bias, consider using a proxy variable. Typically, proxy variables are easy to measure, and analysts use them instead of variables that are either impossible or difficult to measure. The proxy variable can be a characteristic that is not of any great importance itself, but has a good correlation with the confounding variable. These variables allow you to include some of the information in your model that would not otherwise be possible, and, thereby, reduce omitted variable bias. For example, if it is crucial to include historical climate data in your model, but those data do not exist, you might include tree ring widths instead.
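How much a proxy helps depends on how closely it tracks the confounder. In the simulated sketch below, the proxy is the true confounder plus measurement noise. All coefficients and noise levels are assumptions chosen for illustration.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients (intercept first) for predictors X and response y."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

rng = np.random.default_rng(11)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)            # confounder we cannot measure
proxy = x2 + rng.normal(size=n)               # noisy stand-in we can measure
y = 1.0 * x1 + 1.5 * x2 + rng.normal(size=n)  # assumed true model

b_omitted = ols(x1, y)[1]                           # confounder left out
b_proxy = ols(np.column_stack([x1, proxy]), y)[1]   # proxy included
b_full = ols(np.column_stack([x1, x2]), y)[1]       # confounder included

print(f"x1 coefficient, confounder omitted: {b_omitted:.2f}")
print(f"x1 coefficient, proxy included:     {b_proxy:.2f}")
print(f"x1 coefficient, confounder itself:  {b_full:.2f}  (true value 1.0)")
```

Under these assumptions, the proxy removes roughly half of the bias rather than all of it. A proxy that tracks the confounder more closely would remove more.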

Finally, if you can’t correct omitted variable bias using any method, you can at least predict the direction of bias for your estimates. After identifying confounding variable candidates, you can estimate their theoretical correlations with the relevant variables and predict the direction of the bias—as we did with the bone density example.

If you aren’t careful, the hidden hazards of confounding variables and omitted variable bias can completely flip the results of your regression analysis!

Patrik Silva says

Thank you very much Jim,

Very helpful. I think my problem is really the number of observations (25 obs). Yes, I have read that post also, and I always keep the theory in mind when analyzing the IVs.

My main objective is to show the existing relationship between X2 and Y, which is also supported by the literature. However, if I do not control for X1, I will never be sure whether the effect I have found is due to X2 or X1, because X1 and X2 are correlated.

I think correlation alone would be OK, since my number of observations is limited. Using regression also limits the number of IVs I can include in the model, which may make me leave some other IVs out of the model, which is also bad.

Thank you again

Best regards!

PS

Patrik Silva says

Hi Jim,

Thank you for this very good post.

However, I have a question. What should I do if the IVs X1 and X2 are correlated (say, 0.75) and both are correlated with Y (the DV) at 0.60? When I include X1 and X2 in the same model, X2 is not statistically significant, but when I enter them separately, they are both statistically significant. On the other hand, the model with only X1 has higher explanatory power than the model with only X2.

Note: Each individual model meets the OLS assumptions, but together, X2 is no longer statistically significant (using stepwise regression, X2 is removed from the model). What does this mean?

In addition, I know from the literature that X2 affects Y, but I am testing X1, and X1 shows a better fit than X2.

Thank you in advance, I hope you understand my question!

Jim Frost says

Hi Patrik,

Yes, I understand completely! This situation isn’t too unusual. The underlying problem is that because the two IVs are correlated, they’re supplying a similar type of predictive information. There isn’t enough unique predictive information for both of them to be statistically significant. If you had a larger sample size, it’s *possible* that both would be significant. Also, keep in mind that correlation is a pairwise measure and doesn’t account for other variables. When you include both IVs in the model, the relationship between each IV and the DV is determined after accounting for the other variables in the model. That’s why you can see a pairwise correlation but not a relationship in a regression model.

I know you’ve read a number of my posts, but I’m not sure if you’ve read the one about model specification. In that post, a key point I make is not to use statistical measures alone to determine which IVs to leave in the model. If theory suggests that X2 should be included, you have a very strong case for including it even if it’s not significant when X1 is in the model. Just be sure to include that discussion in your write-up.

Conversely, just because X1 seems to provide a better fit statistically and is significant with or without X2 doesn’t mean you must include it in the model. Those *are* strong signs that you should consider including a variable in the model. However, as always, use theory as a guide and document the rationale for the decisions you make.

For your case, you might consider including both IVs in the model. If they’re both supplying similar information and X2 is justified by theory, chances are that X1 is as well. Again, document your rationale. If you include both, check the VIFs to be sure that you don’t have problematic levels of multicollinearity. If those are the only two IVs in your model, that won’t be problematic given the correlations you describe. But, it could be problematic if you have more IVs in the model that are also correlated with X1 and X2.

Another thing to look at is whether the coefficients for X1 and X2 vary greatly depending on whether you have one or both of the IVs in the model. If they don’t change much, that’s nice and simple. However, if they do change quite a bit, then you need to determine which coefficient values are likely to be closer to the correct value because that corresponds to the choice about which IVs to include! I’m sounding like a broken record, but if this is a factor, document your rationale and decisions.

I hope that helps! Best of luck with your analysis!

Patrick says

Hi Jim,

Another great post! Thank you for truly making statistics intuitive.

I learned a lot of this material back in school, but am only now understanding them more conceptually thanks to you. Super useful for my work in analytics. Please keep it up!

Jim Frost says

Thanks, Patrick! It’s great to hear that it was helpful!

Jayant Jain says

I think there may be a typo here – “These are important variables that the statistical model does include and, therefore, cannot control.” Shouldn’t it be “does not include”, if I understand correctly?

Jim Frost says

Thanks, Jayant! Good eagle eyes! That is indeed a typo. I will fix it. Thanks for pointing it out!

Lucy Quinlan says

Mr. Jim, thank you for making me understand econometrics. I thought that omitted variables are excluded from the model and that’s why they under/overestimate the coefficients. Somewhere in this article you mentioned that they are still included in the model but not controlled for. I find that very confusing. Would you be able to clarify?

Thanks a lot.

Jim Frost says

Hi Lucy,

You’re definitely correct. Omitted variable bias occurs when you exclude a variable from the model. If I gave the impression that it’s included, please let me know where in the text because I want to clarify that! Thanks!

By excluding the variable, the model does not control for it, which biases the results. When you include a previously excluded variable, the model can now control for it and the bias goes away. Maybe I wrote that in a confusing way?

Thanks! I always strive to make my posts as clear as possible, so I’ll think about how to explain this better.

Stan Alekman says

In addition to mean square error and adj R-squared, I use Cp, IC, HQC, and SBIC to decide the number of independent variables in multiple regression.

Jim Frost says

I think there are a variety of good measures. I’d also add predicted R-squared, as long as you use them in conjunction with subject-area expertise. As I mention in this post, the entire set of estimated relationships must make theoretical sense. If they don’t, the statistical measures are not important.

Stan Alekman says

I have to read the article you named. Having said that, caution is needed when regression models describe systems or processes that are not in statistical control. Also, some processes have physical bounds that a regression model does not capture, so calculated predicted values can have no physical meaning. Further, models built from narrow ranges of the independent variables may not be applicable outside those ranges.

Jim Frost says

Hi Stan, those are all great points, and true. They all illustrate how you need to use your subject-area knowledge in conjunction with statistical analyses.

I talk about the issue of not going outside the range of the data, amongst other issues, in my post about Using Regression to Make Predictions.

I also agree about statistical control, which I think is underappreciated outside of the quality improvement arena. I’ve written about this in a post about using control charts with hypothesis tests.

Stan Alekman says

Valid confidence/prediction intervals are important if the regression model represents a process that is being characterized. When the prediction intervals are wide or too wide, the model’s validity and utility are in question.

Jim Frost says

Hi Stan,

You’re definitely correct! If the model doesn’t fit the data, your predictions are worthless. One minor caveat that I’d add to your comment.

The prediction intervals can be too wide to be useful yet the model might still be valid. It’s really two separate assessments. Valid model and degree of precision. I write about this in several posts including the following: Understanding Precision in Prediction

Stan Alekman says

Jim, does centering any independent explanatory variable require centering them all? Center the dependent and explanatory variables?

I always make a normal probability plot of the deleted residuals as one test of the prediction capability of the fitted model. It is remarkable how good models give good normal probability plots. I also use the Shapiro-Wilk test to assess the deleted residuals for normality.

Stan Alekman

Jim Frost says

Hi Stan,

Yes, you should center all of the continuous independent variables if your goal is to reduce multicollinearity and/or to be able to interpret the intercept. I’ve never seen a reason to center the dependent variable.

It’s funny that you mention that about normally distributed residuals! I, too, have been impressed with how frequently that occurs even with fairly simple models. I’ve recently written a post about OLS assumptions and I mention how normal residuals are sort of optional. They only need to be normally distributed if you want to perform hypothesis tests and have valid confidence/prediction intervals. Most analysts want at least the hypothesis tests!

Mugdha Bhatnagar says

Hey Jim, your blogs are really helpful for me to learn data science. Here is a question from my assignment:

You have built a classification model with 90% accuracy, but your client is not happy because the false positive rate was very high. What will you do?

Can we address this using precision or recall? That is the whole question. No background is given, though they should have given some!

Brahim KHOUILED says

Thank you Jim

Really interesting

Jim Frost says

Hi Brahim, you’re very welcome! I’m glad it was interesting!

MG says

Hey Jim, you are awesome.

Jim Frost says

Aw, MG, thanks so much!! 🙂

SFdude says

Thanks for another great article, Jim!

Q: Could you expand with a specific plot example to explain this statement more clearly:

“We know that for omitted variable bias to exist, an independent variable must correlate with the residuals. Consequently, we can plot the residuals by the variables in our model. If we see a relationship in the plot, rather than random scatter, it both tells us that there is a problem and points us towards the solution. We know which independent variable correlates with the confounding variable.”

Thanks!

SFdude

Jim Frost says

Hi, thanks!

I’ll try to find a good example plot to include soon. Basically, you’re looking for any non-random pattern. For example, the residuals might tend to either increase or decrease as the value of independent variable increases. That relationship can follow a straight line or display curvature, depending on the nature of relationship.

I hope this helps!

Saketh prasad says

It’s been a long time since I heard from you, Jim. Missed your stats!

Jim Frost says

Hi Saketh, thanks, you’re too kind! I try to post here every two weeks at least. Occasionally, weekly!