What is Omitted Variable Bias?
Omitted variable bias (OVB) occurs when a regression model excludes a relevant variable. The absence of these critical variables can skew the estimated relationships between variables in the model, potentially leading to erroneous interpretations. This bias can exaggerate, mask, or entirely flip the direction of the estimated relationship between an independent and dependent variable.
Omitted variable bias is insidious! In the regression context, variables that aren’t in the model can bias the results for those in it! How does that happen?
For an omitted variable to bias the results, it must correlate with the dependent variable and at least one independent variable, making it a confounding variable. When this correlation structure exists, it forces the statistical procedure to attribute the effects of the omitted variable to variables in the model, distorting the genuine relationship.
The diagram below illustrates the conditions that cause omitted variable bias. There must be non-zero correlations (r) on all three sides of the triangle. X1 is an independent variable in the model, while X2 is omitted. Y is the dependent variable. Learn more about Correlations.
The degree of bias hinges on the collective strength of these correlations. Stronger correlations produce more bias. With weaker relationships, the bias might not be severe. The omitted variable won’t bias the results if any correlations are zero.
Understanding and addressing omitted variable bias is crucial for researchers and analysts to draw trustworthy conclusions from their regression models.
In this post, you’ll learn about omitted variable bias, see an example, learn how it occurs, and how to detect and avoid it.
Related post: Identifying Independent and Dependent Variables
Omitted Variable Bias Example
In a biomechanics lab, we studied the impact of physical activity on bone density, measuring variables like activity levels, weights, and bone densities. Theory suggests a positive link between activity and bone density. Using regression analysis, I found no relationship between these variables, contradicting established theories.
This unexpected result was due to omitted variable bias. Initially, I only included activity level in the model, overlooking a crucial variable: the subject’s weight. When I added weight to the analysis, both activity and weight showed significant, positive correlations with bone density.
The diagram shows that the correlation structure can produce omitted variable bias because all three sides have significant associations. Let’s see how leaving weight out of the model hid the positive relationship between activity and bone density.
The correlation pattern creates two opposite effects of activity on bone density. Higher activity increases density directly. However, active subjects tend to weigh less, which in turn reduces the density.
When I fit the regression model with only activity, the model had to assign both opposing effects to activity alone. Hence, there was no apparent relationship initially. However, the model with both activity and weight could correctly allocate both effects.
Next, I’ll go into a more statistical explanation for omitted variable bias from an assumption violation perspective.
Related post: Specifying the Correct Regression Model
Violating an OLS Residuals Assumption
However, omitted variable bias occurs because the residuals violate one of the assumptions. You need to follow a chain of events to see how this works.
Suppose you have a regression model with two significant independent variables, X1 and X2. These independent variables correlate with each other and the dependent variable—which are the requirements for omitted variable bias.
Now, imagine that we take variable X2 out of the model. It is the confounding variable. Here’s what happens to cause omitted variable bias:
- The model fits the data less well because we’ve removed a significant explanatory variable. Consequently, the gap between the observed values and the fitted values increases. These gaps are the residuals.
- The degree to which each residual increases depends on the relationship between X2 and the dependent variable. Consequently, the residuals correlate with X2.
- X1 correlates with X2, and X2 correlates with the residuals. Ergo, variable X1 correlates with the residuals.
- Hence, this condition violates the ordinary least squares assumption that independent variables in the model do not correlate with the residuals. Violations of this assumption produce biased estimates.
This explanation serves a purpose in the next section!
How to Detect OVB
You saw one method for detecting omitted variable bias in this post. If you include different combinations of independent variables in the model and see the coefficients changing, you’re watching omitted variable bias in action!
In this post, I started with a regression model that had activity as the lone independent variable and bone density as the dependent variable. After adding weight to the model, the correlation changed from zero to positive.
However, if we don’t have the data, it can be harder to detect omitted variable bias. If my study hadn’t collected the weight data, the answer would not be as clear.
I presented a clue in the previous section. For omitted variable bias to exist, a confounding variable must correlate with the residuals. Consequently, we can plot the residuals by the variables in our model. If we see a relationship in the plot rather than a random scatter, it tells us there is a problem and points us toward the solution. We know which independent variable correlates with the confounding variable.
Another step is to consider theory and other studies carefully. Ask yourself several questions:
- Do the coefficient estimates match the theoretical signs and magnitudes? If not, you need to investigate. That was my first tip-off!
- Can you think of confounding variables that likely correlate with the dependent variable and at least one independent variable? Reviewing the literature, consulting experts, and brainstorming sessions can illuminate this possibility.
How to Avoid Omitted Variable Bias
Before you begin your study, arm yourself with all the possible background information you can gather. Research the study area, review the literature, and consult with experts. This process enables you to identify and measure potential confounding variables that you should include in your model. Sometimes, you can set control variables in an experiment to combat omitted variable bias. These approaches help you avoid the problem in the first place.
Imagine collecting all your data and then realizing that you didn’t measure a critical variable. That’s an expensive mistake!
If you absolutely cannot include a confounding variable and it causes OVB, consider using a proxy variable. Typically, proxy variables are easy to measure, and analysts use them instead of variables that are either impossible or difficult to measure. The proxy variable can be a characteristic that is not important itself but has a good correlation with the confounding variable.
These variables allow you to include some of the information in your model that would otherwise be impossible, thereby reducing omitted variable bias. For example, if it is crucial to incorporate historical climate data in your model, but those data do not exist, you might include tree ring widths instead.
Learn more about Proxy Variables.
If you aren’t careful, the hidden hazards of omitted variable bias can completely flip the results of your regression analysis!
Lopes, H. F. (2016, September 21). Omitted Variable Bias: The Simple Case. Hedibert.