Confounding Variable Definition
In studies examining possible causal links, a confounding variable is an unaccounted factor that impacts both the potential cause and effect and can distort the results. Recognizing and addressing these variables in your experimental design is crucial for producing valid findings. Statisticians also refer to confounding variables that cause bias as confounders, omitted variables, and lurking variables.
A confounding variable systematically influences both an independent and dependent variable in a manner that changes the apparent relationship between them. Failing to account for a confounding variable can bias your results, leading to erroneous interpretations. This bias can produce the following problems:
- Overestimate the strength of an effect.
- Underestimate the strength of an effect.
- Change the direction of an effect.
- Mask an effect that actually exists.
- Create spurious correlations.
Additionally, confounding variables reduce an experiment’s internal validity, thereby weakening its ability to support causal inferences about treatment effects. You don’t want any of these problems!
In this post, you’ll learn about confounding variables, the problems they cause, and how to minimize their effects. I’ll provide plenty of examples along the way!
What is a Confounding Variable?
Confounding variables bias the results when researchers don’t account for them. How can variables you don’t measure affect the results for variables that you record? At first glance, this problem might not make sense.
Confounding variables influence both the independent and dependent variable, distorting the observed relationship between them. To be a confounding variable, the following two conditions must exist:
- It must correlate with the dependent variable.
- It must correlate with at least one independent variable in the experiment.
The diagram below illustrates these two conditions. There must be non-zero correlations (r) on all three sides of the triangle. X1 is the independent variable of interest while Y is the dependent variable. X2 is the confounding variable.
This correlation structure is what allows confounding variables to bias the results that appear in your statistical output. In short, the amount of bias depends on the strength of these correlations. Strong correlations produce greater bias. If the relationships are weak, the bias might not be severe. If any of the correlations are zero, the extraneous variable won’t produce bias even if the researchers don’t control for it.
Leaving a confounding variable out of a regression model can produce omitted variable bias.
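To make those conditions concrete, here is a minimal simulation sketch (not data from any real study; all numbers are invented). It creates a confounder X2 that correlates with both X1 and Y, then compares the X1 estimate with and without X2 in the model.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 10_000

# X2 is the confounder: it correlates with X1 and directly affects Y.
x2 = rng.normal(size=n)
x1 = 0.7 * x2 + rng.normal(size=n)            # condition: X1 correlates with X2
y = 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)  # condition: X2 correlates with Y

# The model that includes the confounder recovers the true X1 effect (~2.0).
both = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# Omitting X2 biases the X1 coefficient upward (roughly 3.4 here).
omitted = sm.OLS(y, sm.add_constant(x1)).fit()

print("X1 with confounder:   ", round(both.params[1], 2))
print("X1 without confounder:", round(omitted.params[1], 2))
# If you set the 0.7 above to 0, the two estimates match: with no correlation
# between X1 and X2, omitting X2 no longer produces bias.
```

Because X2 correlates with both X1 and Y, the underspecified model folds X2’s effect into the X1 coefficient, which is exactly the bias the triangle diagram describes.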
Confounding Variable Examples
Exercise and Weight Loss
In a study examining the relationship between regular exercise and weight loss, diet is a confounding variable. People who exercise are likely to have other healthy habits that affect weight loss, such as diet. Without controlling for dietary habits, it’s unclear whether weight loss is due to exercise, changes in diet, or both.
Education and Income Level
When researching the correlation between the level of education and income, geographic location can be a confounding variable. Different regions may have varying economic opportunities, influencing income levels irrespective of education. Without controlling for location, you can’t be sure if education or location is driving income.
Exercise and Bone Density
I used to work in a biomechanics lab. For a bone density study, we measured various characteristics including the subjects’ activity levels, their weights, and bone densities among many others. Bone growth theories suggest that a positive correlation between activity level and bone density likely exists. Higher activity should produce greater bone density.
Early in the study, I wanted to validate our initial data quickly by using simple regression analysis to assess the relationship between activity and bone density. There should be a positive relationship. To my great surprise, there was no relationship at all!
Long story short, a confounding variable was hiding a significant positive correlation between activity and bone density. The offending variable was the subjects’ weights because it correlates with both the independent (activity) and dependent variable (bone density), thus allowing it to bias the results.
After including weight in the regression model, the results indicated that both activity and weight are statistically significant and positively correlate with bone density. Accounting for the confounding variable revealed the true relationship!
The diagram below shows the signs of the correlations between the variables. In the next section, I’ll explain how the confounder (Weight) hid the true relationship.
Related post: Identifying Independent and Dependent Variables
How the Confounder Hid the Relationship
The diagram for the Activity and Bone Density study indicates the conditions exist for the confounding variable (Weight) to bias the results because all three sides of the triangle have non-zero correlations. Let’s find out how leaving the confounding variable of weight out of the model masked the relationship between activity and bone density.
The correlation structure produces two opposing effects of activity. More active subjects get a bone density boost directly. However, they also tend to weigh less, which reduces bone density.
When I fit a regression model with only activity, the model had to attribute both opposing effects to activity alone. Hence, the zero correlation. However, when I fit the model with both activity and weight, it could assign the opposing effects to each variable separately.
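Here is a small simulation sketch that mimics that correlation structure (the data are made up; it reproduces only the signs of the relationships, not the study’s actual values). The simple model shows essentially no activity effect, while the model with weight shows two positive coefficients.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2_000

# Invented data with the study's correlation structure: activity raises bone
# density directly, more active subjects weigh less, and higher weight raises
# bone density.
activity = rng.normal(size=n)
weight = -0.8 * activity + rng.normal(size=n)
bone_density = 1.0 * activity + 1.2 * weight + rng.normal(size=n)
df = pd.DataFrame({"activity": activity, "weight": weight,
                   "bone_density": bone_density})

# Simple regression: the direct boost and the weight pathway cancel out,
# so activity appears unrelated to bone density.
simple = smf.ols("bone_density ~ activity", data=df).fit()

# Adding the confounder lets the model assign each effect separately:
# both coefficients come out clearly positive.
adjusted = smf.ols("bone_density ~ activity + weight", data=df).fit()

print(simple.params["activity"])                  # near zero
print(adjusted.params[["activity", "weight"]])    # both positive
```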
Now imagine if we didn’t have the weight data. We wouldn’t have discovered the positive correlation between activity and bone density. Hence, the example shows the importance of controlling for confounding variables. Which leads to the next section!
Reducing the Effect of Confounding Variables
As you saw above, accounting for the influence of confounding variables is essential to ensure your findings’ validity. Here are four methods to reduce their effects.
Restriction
Restriction involves limiting the study population to a specific group or criteria to eliminate confounding variables.
For example, in a study on the effects of caffeine on heart rate, researchers might restrict participants to non-smokers. This restriction eliminates smoking as a confounder that can influence heart rate.
Matching
This process involves pairing subjects by matching characteristics pertinent to the study. Then, researchers randomly assign one individual from each pair to the control group and the other to the experimental group. This randomness helps eliminate bias, ensuring a balanced and fair comparison between groups. Matching controls confounding variables by equalizing them between groups. The goal is to create groups that are as similar as possible except for the experimental treatment.
For example, in a study examining the impact of a new education method on student performance, researchers match students on age, socioeconomic status, and baseline academic performance to control these potential confounders.
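A rough sketch of matched-pair assignment (hypothetical data and column names, matching on a single baseline score; real studies often match on several characteristics): sort subjects on the matching variable, pair neighbors, then randomize within each pair.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical subjects with a single matching variable (baseline score).
df = pd.DataFrame({"subject": range(20),
                   "baseline": rng.normal(70, 10, size=20)})

# Pair subjects with similar baselines by sorting and grouping them in twos.
df = df.sort_values("baseline").reset_index(drop=True)
df["pair"] = df.index // 2

# Within each pair, randomly send one subject to treatment and one to control.
assignments = []
for _, pair in df.groupby("pair"):
    labels = rng.permutation(["treatment", "control"])
    assignments.append(pd.Series(labels, index=pair.index))
df["group"] = pd.concat(assignments)

print(df.head(6))
```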
Learn more about Matched Pairs Design: Use & Examples.
Random Assignment
Randomly assigning subjects to the control and treatment groups helps ensure that the groups are statistically similar, minimizing the influence of confounding variables.
For example, in clinical trials for a new medication, participants are randomly assigned to either the treatment or control group. This random assignment helps evenly distribute variables such as age, gender, and health status across both groups.
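A minimal sketch of simple randomization (participant IDs are placeholders): shuffle the IDs and split the list, so chance, rather than participant characteristics, decides group membership.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical participant IDs.
participants = np.arange(100)

# Shuffle, then split in half: the first 50 get the treatment, the rest control.
shuffled = rng.permutation(participants)
treatment_ids, control_ids = shuffled[:50], shuffled[50:]

print(treatment_ids[:10])
print(control_ids[:10])
```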
Learn more about Random Assignment in Experiments.
Statistical Control
Statistical control involves using analytical techniques to adjust for the effect of confounding variables in the analysis phase. Researchers can use methods like regression analysis to control potential confounders.
For example, I showed you how I controlled for weight as a confounding variable in the bone density study. Including weight in the regression model revealed the genuine relationship between activity and bone density.
Learn more about controlling confounders by using regression analysis.
By incorporating these strategies into research design and analysis, researchers can significantly reduce the impact of confounding variables, leading to more accurate results.
If you aren’t careful, the hidden hazards of a confounding variable can completely flip the results of your experiment!
Reference
Kamangar F. Confounding variables in epidemiologic studies: basics and beyond. Arch Iran Med. 2012 Aug;15(8):508-16. PMID: 22827790.
Stan Alekman says
Hi Jim,
To address this potential problem, I collect all the possible variables and create a correlation matrix to identify all the correlations, their direction, and their statistical significance, before regression.
Jim Frost says
Hi Stan,
That’s a great practice for understanding the underlying correlation structure of your data. Definitely a good thing to do along with graphing the scatterplots for all those pairs because they’re good at displaying curved relationships that might not register with Pearson’s correlation.
It’s been a while since I worked on the bone density study, but I’m sure I created that correlation & scatterplot matrix to get the lay of the land.
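For readers who want to try this, here’s a minimal sketch using pandas (the file name and columns are placeholders for your own dataset):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder: load whatever dataset holds your candidate variables,
# e.g., activity, weight, and bone density from the post's example.
df = pd.read_csv("study_data.csv")

# Pairwise Pearson correlations: sign and rough strength of each relationship.
print(df.corr().round(2))

# Scatterplot matrix: reveals curved relationships that Pearson's r can miss.
pd.plotting.scatter_matrix(df, figsize=(8, 8), diagonal="kde")
plt.show()
```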
A couple of caveats:
Those correlations are pairwise relationships, equivalent to one predictor for a response (but without the directionality). So, those correlations can be affected by a confounding variable just like a simple regression model. Going back to the example in my post, if I did a pairwise correlation between all the variables, including activity and bone density, that correlation would still have been essentially zero, affected by the weight confounder in the same way as the simple regression model. At least with a correlation matrix, you’d be able to piece together that weight was a confounder likely affecting the other correlation.
And a confounder can exist outside your dataset. You might not have even measured a confounder, so it won’t be in your correlation matrix, but it can still impact your results. Hence, it’s always good to consider variables that you didn’t record as well.
I’m guessing you know all that, I’m more spelling it out for other readers.
And if I remember correctly, your background is more with randomized experiments. The random assignment process should break any correlation between a confounder and the treatment assignment, making it essentially zero. Consequently, randomized experiments tend to prevent confounding variables from affecting the results.
Raymond F Palmer says
Hi Jim,
In multivariate regression, I have always removed variables that aren’t significant. However, a reviewer recently said that this approach is unjustified. Is there a consensus about this? A reference article?
thanks, Ray
Jim Frost says
Hi Raymond,
I don’t have an article handy to refer you to. But based on what happens to models when you retain and exclude variables, I recommend the following approach.
Deciding whether to eliminate an insignificant independent variable from a regression model requires a thorough understanding of the theoretical implications related to that variable. If there’s strong theoretical justification for its inclusion, it might be advisable to keep it within the model, despite its insignificance.
Maintaining an insignificant variable in the model does not typically degrade its overall performance. On the contrary, removing a theoretically justified but insignificant variable can lead to biased outcomes for the remaining independent variables, a situation known as omitted variable bias. Therefore, it can be beneficial to retain an insignificant variable within the model.
It’s vital to consider two major aspects when making this decision. Firstly, whether there’s strong theoretical support for retaining the insignificant variable, and secondly, whether excluding it has a significant impact on the coefficient estimates of the remaining variables. In short, if you remove an insignificant variable and the other coefficients change, you need to assess the situation.
If there are no theoretical reasons to retain an insignificant variable and removing it doesn’t appear to bias the result, then you probably should remove it because it might increase the precision of your model somewhat.
Consequently, I advise “considering” the removal of insignificant independent variables from the model, instead of asserting that you “should” remove them, as this decision depends on the aforementioned factors and is not a hard-and-fast rule. Of course, when you do the write-up, explain your reasoning for including insignificant variables along with everything else.
Veronica Milani says
Thank you very much! That helped a lot.
Veronica Milani says
Hi Jim,
thank you for the interesting post. I would like to ask a question because I think that I am very much stuck into a discipline mismatch. I come from economics but I am now working in the social sciences field.
You describe the conditions for confounding bias: 1) there is a correlation between x1 and x2 (the omitted variable), 2) x1 associates with y, and 3) x2 associates with y. I interpret 1) as meaning that sometimes x1 may determine x2 or the contrary.
However, I read quite recently a social stat paper in which they define confounding bias differently.
Conditions 2) and 3) still hold, but 1) says that x2 –> x1, not the contrary. So, the direction of the relationship cannot go the other way around. Otherwise, that would be mediation.
I am a bit confused and think that this could be due to the different disciplines but I would be interested in knowing what you think.
Thank you.
Best,
Vero
Jim Frost says
Hi Veronica,
Some of your notation looks garbled in the comment, but I think I get the gist of your question. Unfortunately, the comments section doesn’t handle formatting well!
So, X1 and X2 are explanatory variables while Y is the outcome. The two x variables correlate with each other and the Y variable. In this scenario, yes, if you exclude X2, it will cause some degree of omitted variable bias. It is a confounding variable. The degree of bias depends on the collective strength of all three correlations.
Now, as for the question of the direction of the relationship between X1 and X2, that doesn’t matter statistically. As long as the correlation is there, the potential for confounding bias exists. This is true whether the relationship between X1 and X2 is causal in either direction or totally non-causal. It just depends on the set of correlations existing.
I think you’re correct in that this is a difference between disciplines.
The social sciences define a mediator variable as explaining the process by which two variables are related, which gets to your point about the direction of a causal relationship. When X1 –> X2, I’d say that the social sciences would call that a mediator variable AND that X2 is still a confounder that will cause bias if it is omitted from the model. Both things are true.
I hope that helps!
Rita Fontes says
Hi Jim,
Thanks in advance for your awesome content.
Regarding this question brought by Lucy, I want to ask the following: If introducing variables reduces the bias (because the model controls for it), why don’t we just insert all variables at once to see the real impact of each variable?
Let’s say I have a dataset of 150 observations and I want to study the impact of 20 variables (dummies and continuous). Is it advantageous to introduce everything at once and see which variables are significant?
I got the idea that introducing variables is always positive because it forces the model to show the real effects (of course, I am talking about well-grounded variables), but are there any caveats to doing so? Is it possible that some variables may in fact “hide” the significance of others because they overshadow the other regressors? Usually it is said that, if the significance changes when introducing a variable, it was due to confounding. My question now is: is it possible that confounding was not the case and, in fact, the significance is just being hidden due to the presence of a much stronger predictor?
Jim Frost says
Hi Rita,
In some ways, you’re correct. Generally speaking, it is better to include too many variables than too few. However, there is a cost for including more variables than necessary, particularly when they’re not significant. Adding more variables than needed increases the model’s variance, which reduces statistical power and precision of the estimates. Ideally, you want a balance of all the necessary variables, no more, and no less. I write about this tradeoff in my post about selecting the best model. That should answer a lot of your questions.
I think the approach of starting with a model that includes all possible variables has merit. You can always start removing the ones that are not significant. Just do that by removing them one at a time, starting with the least significant. Watch for any abrupt changes in coefficient signs and p-values as you remove each one.
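As a rough sketch of that one-variable-at-a-time process (column names are placeholders, it assumes continuous predictors, and df is a pandas DataFrame holding them plus the response y), you could automate the bookkeeping like this:

```python
import statsmodels.formula.api as smf

# Placeholder predictor names; df is assumed to hold them plus the response y.
predictors = ["x1", "x2", "x3", "x4", "x5"]

while True:
    model = smf.ols("y ~ " + " + ".join(predictors), data=df).fit()
    print(model.params.round(3))    # watch for abrupt coefficient changes

    pvalues = model.pvalues.drop("Intercept")
    least_significant = pvalues.idxmax()
    if pvalues[least_significant] < 0.05 or len(predictors) == 1:
        break      # stop: remaining predictors are significant (or one is left)
    predictors.remove(least_significant)   # drop the least significant predictor
```

Treat the printout as a diagnostic, not a decision rule: if a removal noticeably shifts the other coefficients, that’s the omitted variable bias warning sign discussed above, and theory should drive the final call.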
As for caveats, there are rules of thumb as to how many independent variables you can include in a model based on how many observations you have. If you include too many, you can run into overfitting, which can produce whacky results. Read my post about overfitting models for information about that. So, in some cases, you just won’t be able to add all the potential variables at once, but that depends on the number of variables versus the number of observations. The overfitting post describes that.
And, to answer your last question, overfitting is another case where adding variables can change the significance that’s not due to confounding.
I hope that helps!
Sterre says
Hi Jim,
Thanks for the clear explanation, it was really helpful!
I do have a question regarding this sentence:
“The important takeaway here is that leaving out a confounding variable not only reduces the goodness-of-fit (larger residuals), but it can also bias the coefficient estimates.”
Is it always the case that leaving out a confounding variable leads to a lesser fit? I was thinking about the case of positive bias: say variables x and y are both negatively correlated with the dependent variable, but x and y are positively correlated with each other. If a high value for x is caused by a high value of y, both variables ‘convey the information’ of variable y. So adding variable x to a model wouldn’t add any additional information, and thus wouldn’t improve the fit of the model.
Am I making a mistake in my reasoning somewhere? Or does leaving out a confounding variable not lead to a worse fit in this case?
Thanks again for the article!
Sterre
Jim Frost says
Hi Sterre,
Think about it this way. In general, adding an IV always causes R-squared to increase to some degree–even when it’s only a chance correlation. That still applies when you add a confounding variable. However, with a confounding variable, you know it’s an appropriate variable to add.
Yes, the correlation with the IV in the model might capture some of the confounder’s explanatory power, but you can also be sure that adding it will cause the model to fit better. And, again, it’s an entirely appropriate variable to include because of its relationship with the DV (i.e., you’re not adding it just to artificially inflate R-squared/goodness-of-fit). Additionally, unless there’s a perfect correlation between the included IV and the confounder, the included IV can’t contain all the confounder’s information. But, if there was a perfect correlation, you wouldn’t be able to add both anyway.
There are cases where you might not want to include the confounder. If you’re mainly interested in making predictions and don’t need to understand the role of each IV, you might not need to include the confounder if your model makes sufficiently precise predictions. That’s particularly true if the confounder is difficult/expensive to measure.
Alternatively, if there is a very high, but not perfect correlation, between the included IV and the confounder, adding the confounder might introduce too much multicollinearity, which causes its own problems. So, you might be willing to take the tradeoff between exchanging multicollinearity issues for omitted variable bias. However, that’s a very specific weighing of pros and cons given the relative degree of severity for both problems for your specific model. So, there’s no general advice for which way to go. It’s also important to note that there are other types of regression analysis (Ridge and LASSO) that can effectively handle multicollinearity, although at the cost of introducing a slight bias. Another possibility to balance!
But, to your main question, yes, if you add the confounder, you can expect the model fit to improve to some degree. It may or may not be an improvement that’s important in a practical sense. Even if the fit isn’t notably better, it’s often worthwhile adding the confounder to address the bias.
gqe66 says
Jim, this was a great article, but I do not understand the table. I am sure it is easy, and I am missing something basic. What does it mean to be included and omitted: negative correlation, etc., in the 2-by-2 table? I cannot wrap my head around the titles and corresponding scenarios. Thanks, John
Jim Frost says
Hi
When I refer to “included” and “omitted,” I’m talking about whether the variable in question is an independent variable IN the model (included) or a potential independent variable that is NOT in the model (omitted). After all, we’re talking about omitted variable bias, which is the bias caused by leaving an important variable out of the model.
The table allows you to determine the direction the coefficient estimate is being biased if you can determine the direction of the correlation between several variables.
In the example, I’m looking at a model where Activity (the included IV) predicts the bone density of the individual (the DV). The omitted confounder is weight. So, now we just need to assess the relationships between those variables to determine the direction of the bias. I explain the process of using the table with this example in the paragraph below the table, so I won’t retype it here. But, if you don’t understand something I write there, PLEASE let me know and I’ll help clarify it!
In the example, Activity = Included, Weight = Omitted, and Dependent = Bone Density. I use the signs from the triangle diagram, which appears a ways before the table and lists these three variables, to determine the column and row to use.
Again, I’m not sure which part is tripping you up!
mohan says
Thank you Jim !
The two groups are both people with illness; they differ because the illnesses occur at different ages. The first illness group is younger, around age 30, and the other is older, around 45. The overlap of ages between these groups is very minimal.
By control group, I meant a third group of healthy people without illness, with ages uniformly distributed across the range represented in the two patient groups, so that the group factor would now have three levels.
I was wondering if this could reduce the previous problem of directly comparing the young and old patient groups, where adding age as a covariate can cause a collinearity problem.
Jim Frost says
Ah, ok. I didn’t realize that both groups had an illness. Usually a control group won’t have a condition.
I really wouldn’t worry about the type of multicollinearity you’re referring to. You’d want to include those two groups and age plus the interaction term, which you could remove if it’s not significant. If the two groups were completely distinct in age and had a decent gap between them, there are other model estimation problems to worry about, but that doesn’t seem to be the case. If age is a factor in this study area, you definitely don’t want to exclude it. Including it allows you to control for it. Otherwise, if you leave it out, the age effect will get rolled into the groups and, thereby, bias your results. Including age is particularly important in your case because you know the groups are unbalanced in age. You don’t want the model to attribute the difference in outcomes to the illness condition when it’s actually age that is unbalanced between those two conditions. I’d go so far as to say that your model urgently needs you to include age!
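To make that concrete, here’s a minimal sketch of the model I’m describing (it assumes a DataFrame df with placeholder columns score, illness_group, and age; this isn’t your actual data):

```python
import statsmodels.formula.api as smf

# Group comparison while controlling for age, with an interaction term that
# lets the age slope differ between groups.
full = smf.ols("score ~ C(illness_group) * age", data=df).fit()
print(full.summary())

# If the interaction terms aren't significant, the additive model still
# controls for age while estimating the group effect.
additive = smf.ols("score ~ C(illness_group) + age", data=df).fit()
```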
That said, I would collect a true control group that has healthy people and ideally a broad range of ages that covers both groups. That will give you several benefits. Right now, you won’t know how your illness groups compare to a healthy group. You’ll only know how they compare to each other. Having that third group will allow you to compare each illness group to the healthy group. I’m assuming that’s useful information. Plus, having a full range of ages will allow the model to produce a better estimate of the age effect.
I hope that helps!
Mohan says
Hi Jim, thanks a lot for your intuitive explanations!
I want to study the effect of two Groups of patients (X1) on y (a test performance score), in a GLM framework. Age (X2) and Education (X3) are potential confounders on y.
However, it’s not possible to match these two groups for age, as the illnesses occur in different age groups: one group is younger than the other. Hence, the mean ages are significantly different between these groups.
I’m afraid adding age as a covariate could potentially cause a multicollinearity problem, as age differs significantly between groups, and make the estimation of the group effect (β1) erroneous, although it might improve the model.
Is recruiting a control group with an age distribution comparable to the pooled patient groups, hence with a mean age midway between the two patient groups, a good idea to improve the statistical power of the study? In this case, my group factor X1 will have three levels. Can this reduce the multicollinearity problem to an extent, as the ages of patients in the two patient groups are approximately represented in the control group also? Should I add an interaction term of Age*Group in the GLM to account for the age difference between groups? Thank you in advance.
-Mohan
Jim Frost says
Hi Mohan,
I’d at least try including age to see what happens. If there’s any overlap in age between the two groups, I think you’ll be ok. Even if there is no overlap, age is obviously a crucial variable. My guess would be that it’s doing more harm by excluding it from the model when it’s clearly important.
I’m a bit confused by what you’re suggesting for the control group. Isn’t one of your groups those individuals with the condition and the other without it?
It does sound possible that there would be an interaction effect in this case. I’d definitely try fitting it and see what the results are! That interaction term would show whether the relationship between age and test score is different between the groups.
Joshua says
In the paragraph below the table, both weight and activity are referred to as included variables.
Jim Frost says
Hi Joshua, yes, you’re correct! A big thanks! I’ve corrected the text. In that example, activity is the included variable, weight is the omitted variable, and bone density is the dependent variable.
Joshua Tian says
Hi, Jim. Great article. However, is that a typo in the direction of omitted variable bias table? For the rows, it makes more sense to me if they were “correlation between dependent and omitted variables” instead of “between dependent and included variables”.
Jim Frost says
Hi Joshua,
No, that’s not a typo!
Gyan says
Hi Jim,
Please let me know if this summary makes sense. Again, Thanks for the great posts !
Scenario 1: There are 10 IVs. They are modeled using OLS. We get the regression coefficients.
Scenario 2: One of the IVs is removed. It is not a confounder. The only impact is on the residuals (they increase). The coefficients obtained in Scenario 1 remain intact. Is that correct?
Scenario 3: The IV that was removed in Scenario 2 is placed back into the mix. This time, another IV is removed. Now this one’s a confounder. OLS modeling is re-run. There are 3 results.
1) The residuals increase — because it is correlated with the dependent variable.
2) The coefficient of the other IV, to which this removed confounder is correlated, changes.
3) The coefficients of the other IVs remain intact.
Are these 3 scenarios an accurate summary, Jim? A reply would be much appreciated !
Again, do keep up the good work.
Cheers
Jim Frost says
Hi Gyan,
Yes, that all sounds right on! 🙂
Gyan says
Great post, Jim !
Probably a basic question, but would appreciate your answer on this, since we have encountered this in practical scenarios. Thanks in advance.
What if we know of a variable that should be included on the IV side, we don’t have data for it, and we know (from domain expertise) that it is correlated with the dependent variable but not with any of the IVs? In other words, it is not a confounding variable in the strictest sense of the term (since it is not correlated with any of the IVs).
How do we account for such variables?
Here again, would the solution be to use proxy variables? In other words, can we consider proxy variables to be a workaround for not just confounders, but also non-confounders of the above type?
Thanks again !
Jim Frost says
Hi Gyan,
I discuss several methods in this article. The one I’d recommend if at all possible is identifying a proxy variable that stands in for the important variable that you don’t have. It sounds like in your case it’s not a confounder. So, it’s probably not biasing your other coefficients. However, your model is missing important information. You might be able to improve the precision using a proxy variable.
David says
Hi Jim, that article is helping me a lot during my research project, thank you so much for that!
However, there is one question for which I couldn’t find a satisfactory answer on the internet, so I hope that maybe you can shed some light on this:
In my panel regression, my main independent variable is “Policy Uncertainty”, which captures uncertainty related to the possible impact of future government policies. It is based on an index that has a mean of 100. My dependent variable is whether a firm has received funding in quarter t (Yes = 1, No = 0), thus I want to estimate the impact of policy uncertainty on the likelihood of receiving external funding. In my baseline regression, the coefficient on policy uncertainty is insignificant, suggesting that policy uncertainty has no impact.
When I now add a proxy for uncertainty related to financial markets (e.g., implied stock market volatility), policy uncertainty becomes significant at the 1% level, and the market uncertainty proxy is statistically significant at the 1% level too! The correlation between the two is rather low, 0.2. Furthermore, they have opposite signs (policy uncertainty is positively associated with the likelihood of receiving funding); additionally, the magnitudes of the coefficients are comparable.
Now I am wondering what this tells me: did the variable on policy uncertainty previously capture the effect of market uncertainty before including the latter in the regression? Would be great if you could help 🙂
Jim Frost says
Hi David,
Thanks for writing with the interesting questions!
First, I’ll assume you’re using binary logistic regression because you have a binary dependent variable. For logistic regression, you don’t interpret the coefficients the same way as you do for, say, least squares regression. Typically, you’ll assess the odds ratios to understand each IV’s relationship with the binary DV.
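As a minimal sketch (the variable names are placeholders, and it ignores the panel structure, such as clustering or fixed effects), the odds ratios come from exponentiating the logit coefficients:

```python
import numpy as np
import statsmodels.formula.api as smf

# Placeholder columns: funded (0/1), policy_uncertainty, market_volatility.
model = smf.logit("funded ~ policy_uncertainty + market_volatility",
                  data=df).fit()

# Exponentiated coefficients are odds ratios: the multiplicative change in
# the odds of receiving funding per one-unit increase in each predictor.
print(np.exp(model.params))
print(np.exp(model.conf_int()))   # 95% confidence intervals for the odds ratios
```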
On to your example. It’s entirely possible that leaving out market uncertainty was causing omitted variable bias in the policy uncertainty estimate. That might be what is happening. But, the positive sign of one and the negative sign of the other could be cancelling each other out when you only include the one. That is what happens in the example I use in this post. However, for that type of bias/confounding, you’d expect there to be a correlation between the two IVs, and you say it is low.
Another possibility is the fact that for each variable in a model, the significance refers to the Adj SS for the variable, which factors in all the other variables before entering the variable in question. So, the policy uncertainty in the model with market volatility is significant after accounting for the variance that the other variables explain, including market volatility. For the model without market volatility, the policy uncertainty is not significant in that different pool of remaining variability. Given the low correlation (0.2) between those two IVs, I’d lean towards this explanation. If there were a stronger correlation between policy and market uncertainty, I’d lean towards omitted variable bias.
Also be sure that your model doesn’t have any other type of problems, such as overfitting or patterns in the residual plots. Those can cause weird things to happen with the coefficients.
It can be unnerving when the significance of one variable depends entirely on the presence of another variable. It makes choosing the correct model difficult! I’d let theory be your guide. I write about that towards the end of my post about selecting the correct regression model. That’s written in the context of least squares regression, but the same ideas about theory and other research apply here.
You should definitely investigate this mystery further!
Kushal Jain says
Hello Jim,
Thank you for this blog. I have a question: If two independent variables are correlated, can we not convert one into the other and replace it in the model? For example, if Y = X1 + X2 and X2 = -0.5*X1, then Y = 0.5*X1. However, I don’t see that as a suggestion in the blog. The blog mentions that activity is related to weight, but then somehow both are finally included in the model, rather than replacing one with the other. Will this not help with multicollinearity, too? I am sure I am missing something here that you can see, but I am unable to find it out. Can you please help?
Regards,
Kushal Jain
Jim Frost says
Hi Kushal,
Why would you want to convert one to another? Typically, you want to understand the relationship between each independent variable and the dependent variable. In the model I talk about, I’d want to know the relationship between both activity and weight with bone density. Converting activity to weight does not help with that.
And, I’m not understanding what you mean by “then somehow both are finally included in the model.” You just include both variables in the model the normal way.
There’s no benefit to converting the variables as you describe and there are reasons not to do that!
Erin says
Hi Jim,
I have been trying to figure out covariates for a study we are doing for some time. My colleague believes that if two covariates have a high correlation (>20%) then one should be removed from the model. I’m assuming this is true unless both are correlated to the dependent variable, per your discussion above? Also, what do you think about selecting covariates by using the 10% change method?
Any thoughts would be helpful. We’ve had a heck of a time selecting covariates for this study.
Thanks,
Erin
Jim Frost says
Hi Erin,
It’s usually ok to have covariates with a correlation greater than 20%. The exact threshold depends on the number of covariates and the strength of their correlations, but 20% is low and almost never a problem. When covariates are correlated, it’s known as multicollinearity. And there’s a special measure, known as VIFs, that determines whether you have an excessive amount of correlation amongst your covariates. I have a post that discusses multicollinearity and how to detect and correct it.
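For readers who want to check this themselves, here’s a minimal sketch of computing VIFs with statsmodels (the column names are placeholders for your own covariates):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Placeholder: df holds the covariates (continuous or dummy-coded).
X = sm.add_constant(df[["covariate1", "covariate2", "covariate3"]])

vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs.drop("const"))  # common rule of thumb: VIFs above ~5-10 are worrisome
```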
I have not used the 10% change method myself. However, I would suggest using that method only as one point of information. I’d really place more emphasis on theory and understanding the subject area. That said, observing how much a coefficient changes when you add or remove a covariate can provide useful information about whether bias is a problem or not. In general, if you’re uncertain, I’d err on the side of unnecessarily including a covariate rather than leaving it out. There are usually fewer problems associated with having an additional variable than omitting one. However, keep an eye on the VIFs as you do that. And having a number of unnecessary variables could lead to problems if taken to an extreme or if you have a really small sample size.
I wrote a post about model selection. I give some practical tips in it. Overall, I suggest using a mix of theory, subject area knowledge, and statistical approaches. I’d suggest reading that. It’s not specifically about controlling for confounders but the same principles apply. Also, I’d highly recommend reading about what researchers performing similar studies have done if that’s at all possible. They might have already addressed that issue!
Charlotte Stuart says
Hi Jim
I’m not sure whether my problem fits under this category or not, so apologies if not. I am looking at whether an inflammatory biomarker (independent variable) correlates with a measure of cognitive function (dependent variable). It does if it’s just a simple linear regression; however, the biomarker (independent variable) is affected by age, sex, and whether you’re a smoker or not. Correcting for these 3 covariables in the model shows that actually there is no correlation between the biomarker and cognitive function. I assume this was the correct thing to do but wanted to make sure, seeing as a) none of the 3 covariables correlate with/predict my dependent variable, and b) as age correlates highly with the biomarker, does this not introduce collinearity?
Thanks!
Charlotte
Jim Frost says
Hi Charlotte,
Yes, it sounds like you did the right thing. Including the other variables in the model allows the model to control for them.
The collinearity (aka multicollinearity or correlation between independent variables) between age and the biomarker is a potential concern. However, a little correlation, or a moderate amount of correlation is fine. What you really need to do is to assess the VIFs for your independent variables. I discuss VIFs and multicollinearity in my post about multicollinearity. So, your next step should be to determine whether you have problematic levels of multicollinearity.
One symptom of multicollinearity is a lack of statistical significance, which your model is experiencing. So, it would be good to check.
Actually, I’m noticing that at least several of your independent variables are binary. Smoker. Gender. Is the biomarker also binary? Present or not present? If so, that doesn’t change the rationale for including the other variables in the model, but it does mean VIFs won’t detect the multicollinearity.
Humberto Calvani says
Thanks for the clarification, Jim. Best regards.
Humberto Calvani says
Hi Jim,
I think the section on “Predicting the Direction of Omitted Variable Bias” has a typo on the first column, first two rows. It should state:
*Omitted* and Dependent: Negative Correlation
*Omitted* and Dependent: Positive Correlation
This makes it consistent with the required two conditions for Omitted Variable Bias to occurs:
The *omitted* variable must correlate with the dependent variable.
The omitted variable must correlate with at least one independent variable that is in the regression model.
Jim Frost says
Hi Humberto,
Thanks for the close reading of my article! The table is correct as it is, but you are also correct. Let’s see why!
There are the following two requirements for omitted variable bias to exist:
*The omitted variable must correlate with an IV in the model.
*That IV must correlate with the DV.
The table accurately depicts both those conditions. The columns indicate the relationship between the IV (included) and omitted variable. The rows indicate the nature of the relationship between the IV and DV.
If both those conditions are true, you can then infer that there is a correlation between the omitted variable and the dependent variable and the nature of the correlation, as you indicate. I could include that in the table, but it is redundant information.
We’re thinking along the same lines and portraying the same overall picture. Alas, I’d need to use a three dimensional matrix to portray those three conditions! Fortunately, using the two conditions that I show in the table, we can still determine the direction of bias. And you could use those two relationships to determine the relationship between the omitted variable and dependent variable if you so wanted. However, that information doesn’t change our understanding of the direction of bias because it’s redundant with information already in the table.
Thanks for the great comment and it’s always beneficial thinking through these things using a different perspective!
Vineeth says
Thank you for the intuitive explanation, Jim!
I would like to ask a query. Suppose I have two groups, one with a recently diagnosed lung disease and another with chronic lung disease, where I would like to do an independent t-test for the amount of lung damage. It happens that the two groups also significantly differ in their mean age. The group with recently diagnosed disease has a lower mean age than the group with chronic disease. Also, theory says age can cause some lung damage as a normal course too. So if I include age as a covariate in the model, won’t it regress out the effect on the DV and give an underestimated effect, as the IV (age) significantly correlates with the DV (lung damage)? How do we address this confounding effect of the correlation between the IV and DV? Should it be by having a control group without lung disease? If so, can one control group help? Or should there be two control groups age-matched to the two study groups? Thank you in advance.
Jim Frost says
Hi Vineeth,
First, yes, if you know age is a factor, you should include it as a covariate in the model. It won’t “regress out” the true effect between the two groups. I would think of it a little differently.
You have two groups and you suspect that something caused those two groups to have differing amounts of lung damage. You also know that age plays a role. And those groups have different ages. So, if you look only at the groups without factoring in age, the effect of age is still present but the model is incorrectly attributing it to the groups. In your case, it will make the effect look larger.
When you include age, yes, it will reduce the effect size between the groups, but it’ll reveal the correct effect by accounting for age. So, yes, in your case, it’ll make the group difference look smaller, but don’t think of it as “regressing out” the effect; instead, it is removing the bias from the other results. In other words, you’re improving the quality of your results.
When you look at your model results for, say, the grouping variable, the model is already controlling for the age variable. So, you’re left with just the effect between the IV and DV after accounting for the other variables in the model, such as age. That’s what you need!
A control group for any experiment is always a good idea if you can manage one. However, it’s not always possible. I write about these experimental design issues, randomized experiments, observational studies, how to design a good experiment, etc. among other topics in my Introduction to Statistics ebook, which you might consider. It’s also just now available in print on Amazon!
Ivan says
Hi Jim,
I was wondering whether it’s correct to check the correlation between the independent variables and the error term in order to check for endogeneity.
If we assume that there is endogeneity then the estimated errors aren’t correct and so the correlation between the independent variables and those errors doesn’t say much. Am I missing something here?
best regards,
Ivan.
Lauren Madia says
Hi Jim,
I wanted to look at the effects of confounders in my study, but I’m not sure what analysis(es) to use for dichotomous covariates. I have one categorical IV with two levels, two continuous DVs, and then the two dichotomous confounding variables. It was hard to find information for categorical covariates online. Thanks in advance, Jim!
Dirk says
Hi Jim,
Thank you for your nice blog. I still have a question. Let’s say I want to determine the effect of one independent variable on a dependent variable with a linear regression analysis. I have selected a number of potential variables for this relationship based on the literature, such as age, gender, health status, and education level. How can I check (with statistical analyses) if these are indeed confounders? I would like to know which of them I should control for in my linear regression analysis. Can I create a correlation matrix beforehand to see if a potential confounder is correlated with both my independent and dependent variable? And what threshold for the correlation coefficient should be used here? Is it every correlation coefficient except zero (for instance, 0.004)? Are there scientific articles/books that endorse this threshold? Or is it maybe better to use a “change-in-estimate” criterion to see if my regression coefficient changes by a particular amount after adding the potential confounder to the linear regression model? What would be the threshold here?
I hope my question is clear. Thanks in advance!
Martin Fierz says
Dear Jim,
thanks for a wonderful website!
I love your example with bone density, which does not appear to be correlated with physical activity when looked at alone, and needs weight added as an explanatory variable to make both of them appear significantly correlated with bone density.
I would love to use this example in my class, as I think it is very important to understand that there are situations where a single-parameter model can lead you badly astray (here into thinking activity is not correlated with bone density).
Of course, I could make up some numbers for my students, but it would be even nicer if I could give them your real data. Could you by any chance make a file of real measurements of bone densities, physical activity and weight available? I would be very grateful, and I suppose a lot of other teachers/students too!
best regards
Martin
Jim Frost says
Hi Martin,
When I wrote this post, I wanted to share the data. Unfortunately, it seems like I no longer have it. If I uncover it, I’ll add it to the post.
evangelia panagiotidou says
Hello Jim,
The work you have done is amazing, and I’ve learned so much through this website.
I am at beginner level in SPSS and I would be grateful if you could answer my question.
I have found that a medical treatment results in worse quality of life.
But I know from crosstabs that people who are taking this treatment present more severe disease (a continuous variable), which also correlates with quality of life.
How can I test if it is treatment or severity that worsens quality of life?
Jim Frost says
Hi Evangelia,
Thanks so much for your kind words, I really appreciate them! And, I’m glad my website has been helpful!
That’s a great question and a valid concern to have. Fortunately, in a regression model, the solution is very simple. Just include both the treatment and severity of the disease in the model as independent variables. Doing that allows the model to hold disease severity constant (i.e., controls for it) while it estimates the effect of the treatment.
Conversely, if you did not include severity of the disease in the model, and it correlates with both the treatment and quality of life, it is uncontrolled and will be a confounding variable. In other words, if you don’t include severity of disease, the estimate for the relationship between treatment and quality of life will be biased.
We can use the table in this post for estimating the direction of bias. Based on what you wrote, I’ll assume that the treatment condition and severity have a positive correlation. Those taking the treatment present a more severe disease. And, that the treatment condition has a negative correlation with quality of life. Those on the treatment have a lower quality of life for the reasons you indicated. That puts us in the top-right quadrant of the table, which indicates that if you do not include severity of disease as an IV, the treatment effect will be underestimated.
Again, simply by including disease severity in your model will reduce the bias!
I hope that helps!
Johnny says
Hello,
Just a question about what you said about power. Will adding more independent variables to a regression model cause a loss of power (at a fixed sample size)? Or does it depend on the type of independent variable added: confounder vs. non-confounder?
Luis says
You mention, “Suppose you have a regression model with two significant independent variables, X1 and X2. These independent variables correlate with each other and the dependent variable.”
How is it possible for two random variables (in this case, the two factors) to correlate with each other if they are independent? If two random variables are independent, then the covariance is zero and therefore the correlation is zero.
Corr(X1, X2) = Cov(X1, X2) / (sqrt(Var(X1)) * sqrt(Var(X2)))
Cov(X1, X2) = E[X1*X2] - E[X1]*E[X2]
If X1 and X2 are independent, then E[X1*X2] = E[X1]*E[X2], and therefore the covariance is zero.
Jim Frost says
Hi Luis,
Ah, there’s a bit of confusion here. The explanatory variables in a regression model are often referred to as independent variables, as well as predictors, x-variables, inputs, etc. I was using “independent variable” as the name. You’re correct, if they were independent in the sense that you describe them, there would be no correlation. Ideally, there would be no correlation between them in a regression model. However, they can, in fact, be correlated. If that correlation is too strong, it will cause problems with the model.
“Independent variable” in the regression context refers to the predictors and describes their ideal state. In practice, they’ll often have some degree of correlation.
I hope this helps!
Scott Stevens says
Ah! Enlightenment!
I had taken your statement about the correlation of the independent variable with the residuals to be a statement about computed value of the correlation between them, that is, that cor(X1, resid) was nonzero. I believe that (in a model with a constant term), this is impossible.
But I think I get now that you were using the term more loosely, referring to a (nonlinear) pattern appearing between the values of X1 and the corresponding residuals, in the same way as you would see a parabolic pattern in a scatterplot of residuals versus X if you tried to make a linear fit of quadratic data. The linear correlation between X and the residuals would still compute out, numerically, to zero, so X1 and the residuals would technically be uncorrelated, but they would not be statistically independent. If the residuals are showing a nonlinear pattern when plotted against X, look for a lurker.
The Albany example was very helpful. Thanks so much for digging it up!
Scott
Scott Stevens says
Hi, Jim! Thanks very much for your speedy reply!
I appreciate the clarity that you aim for in your writing, and I’m sorry if I wasn’t clear in my post. Let me try again, being a bit more precise, hopefully without getting too technical.
My problem is that I think the very process used in finding the OLS coefficients (minimizing the sum of the squared residuals) results in a regression equation that satisfies two properties. First, the sum (or mean) of the resulting residuals is zero. Second, for any regressor Xi, Xi is orthogonal to the vector of residuals, which in turn leads to the covariance of the residuals with any regressor having to be zero. Certainly, the true error terms need not sum to zero, nor need they be uncorrelated with a regressor…but if I understand correctly, these properties of the _residuals_ are an automatic consequence of fitting OLS to a data set, regardless of whether the actual error terms are correlated with the regressor or not.
I’ve found a number of sources that seem to say this–one online example is on page two here: https://www.stat.berkeley.edu/~aditya/resources/LectureSIX.pdf. I’ll be happy to provide others on request.
I’ve also generated a number of my own data sets with correlated regressors X1 and X2 and Y values generated by a X1 + b X2 + (error), where a and b are constants and (error) is a normally distributed error term of fixed variance, independently chosen for each point in the data set. In each case, leaving X2 out of the model still left me with zero correlation between X1 and the residuals, although there was a correlation between X1 and the true error terms, of course.
If I have it wrong, I’d love to see a data set that demonstrates what you’re talking about. If you don’t have time to find one (which I certainly understand), I’d be quite happy with any reference you might point me to that talks about this kind of correlation between residuals and one of the regressors in OLS, in any context.
Thanks again for your help, and for making regression more comprehensible to so many people.
Scott Stevens
Jim Frost says
Hi Scott,
Unfortunately, the analysis doesn’t fix all possible problems with the residuals. It is possible to specify models where the residuals exhibit various problems. You mention that residuals will sum to zero. However, if you specify a model without a constant, the residuals won’t necessarily sum to zero; read about that here. If you have a time series model, it’s possible to have autocorrelation in the residuals if you leave out important variables. If you specify a model that doesn’t adequately model curvature in the data, you’ll see patterns in the residuals.
In a similar vein, if you leave out an important variable that is correlated both with the DV and another IV in the model, you can have residuals that correlate with an IV. The standard practice is to graph the residuals by the independent variable to look for that relationship because it might have a curved shape which indicates a relationship but not necessarily a linear one that correlation would detect.
As for references, any regression textbook should cover this assumption. Again, it’ll refer to error, but the key is to remember that residuals are the proxy for error.
Here’s a reference from the University of Albany about Omitted Variable Bias that goes into it in more detail from the standpoint of residuals and includes an example of graphing the residuals by the omitted variable.
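In lieu of a real dataset, here’s a quick simulated sketch of the idea (all numbers are invented): the residuals from the underspecified model are, by construction, uncorrelated with the included X1, but they still show a clear relationship with the omitted X2.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 1_000

# Simulated data where x2 affects y and correlates with x1, but is omitted.
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

underspecified = sm.OLS(y, sm.add_constant(x1)).fit()
resid = underspecified.resid

print(np.corrcoef(resid, x1)[0, 1])  # ~0: OLS forces this orthogonality
print(np.corrcoef(resid, x2)[0, 1])  # clearly non-zero: the omitted variable

plt.scatter(x2, resid, s=5)
plt.xlabel("omitted variable x2")
plt.ylabel("residuals from y ~ x1")
plt.show()
```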
Scott Stevens says
Hi, Jim. I very much enjoy how you make regression more accessible, and I like to use your approaches with my own students. I’m confused, though by the matter brought up by SFDude.
I certainly see how the _error_ term in a regression model will be correlated with an independent variable when a confounding variable is omitted, but it seems to me that the normal equations that define the regression coefficients assure that an independent variable in the model will always be uncorrelated with the _residuals_ of that model, regardless of whether an omitted confounding variable exists or not. Certainly, “X1 correlates with X2, and X2 correlates with the residuals. Ergo, variable X1 correlates with the residuals” would not hold for any three variables X1 and X2 and R. For example, if A and B are independent, then “A correlates with A + B, A + B correlates with B. Ergo, A correlates with B” is a false statement.
If I’m missing something here, I’d very much appreciate a data set that demonstrates the kind of correlation between an independent variable and the residuals of the model that it seems you’re talking about.
Thanks!
Scott Stevens
Jim Frost says
Hi Scott,
Thanks for writing. And, I’m glad to hear that you find my website helpful!
The key thing to remember is that while the OLS assumptions refer to the error, we can’t directly observe the true error. So, we use the residuals as estimates of the error. If the error is correlated with an omitted variable, we’d expect the residuals to be correlated as well in approximately the same manner. Omitted variable bias is a real condition, and that description is simply getting deep into the nuts and bolts of how it works. But, it’s the accepted explanation. You can read it in textbooks. While the assumptions refer to error, we can only assess the residuals instead. They’re the best we’ve got!
When you say A and B are “independent”, if you mean they are not correlated, I’d agree that removing a truly uncorrelated variable from the model does not cause this type of bias. I mention that in this post. This bias only occurs when independent variables are correlated with each other to some degree, and with the dependent variable, and you exclude one of the IVs.
I guess I’m not exactly sure which part is causing the difficulty? The regression equations can’t ensure that the residuals are uncorrelated if the model is specified in a way that causes them to be correlated. It’s just like in time series regression models, where you have to be on the lookout for autocorrelation (correlated residuals) because the model doesn’t account for time-order effects. Incorrectly specified models can and do cause problems with the residuals, including residuals that are correlated with other variables and with themselves.
I’ll have to see if I can find a dataset with this condition.
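In the meantime, here’s a minimal simulation sketch (all numbers invented) of the condition we’re discussing: X1 and X2 are correlated, both affect Y, and the fitted model omits X2. The X1 coefficient comes out biased, and the residuals correlate with the omitted X2, though, as you note, not with the included X1.

```python
# A minimal simulation sketch (all numbers invented): X1 and X2 are correlated,
# both affect Y, and the fitted model omits X2.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.7, size=n)       # X2 correlates with X1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)  # true X1 effect is 2.0

short = sm.OLS(y, sm.add_constant(x1)).fit()        # model omits X2

print(short.params)                        # X1 coefficient is biased well above 2.0
print(np.corrcoef(x1, short.resid)[0, 1])  # ~0: OLS forces this for included IVs
print(np.corrcoef(x2, short.resid)[0, 1])  # clearly non-zero: the omitted variable
```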
I hope this helps!
Reeba says
Hi Jim,
I am involved in a study looking into a number of clinical parameters, like platelet count and haemoglobin, for patients who underwent an emergency change of a mechanical circulatory support device due to thrombosis or clotting of the actual device. The purpose is to look for a trend in these parameters in the time frame of 3 days before and 3 days after the change and establish whether these parameters could be used as predictors of the event. My concern is that there is no control group for this study. But I don’t see the need for looking into the trend in a group which never had the event itself. Will not having a control group be considered a weakness of this study?
Also, what would be the best statistical test for this? I was thinking of the generalized linear model.
I would really appreciate your guidance here.
Thank you
Susan Mitchell says
Dear Jim,
I’m looking at a published paper that develops clinical prediction rules by using logistic regression in order to help primary care doctors decide who to refer to breast clinics for further investigation. The dependent variable is simply whether breast cancer is found to be present or not. The independent variables include 11 symptoms and age in (mostly) ten-year increments (six separate age bands). The age bands were decided before the logistic regression was carried out. The paper goes on to use the data to create a scoring system based on symptoms and age. If this scoring system were to be used, then above a certain score a woman would be referred, and below a certain score a woman would not be referred.
The total sample size is 6590 women referred to a breast clinic, of whom 320 were found to have breast cancer. The sample itself is very skewed. In younger women, breast cancer is rare, and so in some categories the numbers are very low. So, for instance, in the 18-29 age band there are 62 women referred, of whom 8 have breast cancer, and in the 30-39 age band there are 755 women referred, of whom only one has breast cancer. So my first question is: if there are fewer individuals in particular categories than there are symptoms, can the paper still use logistic regression to predict who to refer to a breast clinic based on a scoring system that includes both age and symptoms? My second question is: if there are meant to be at least 10 individuals per variable in logistic regression, are the numbers of women with breast cancer in these age groups too small for logistic regression to apply?
When I look at the total number of women in the sample (6590) and then the total number of symptoms (8616) there is a discrepancy. This means that some women have had more than one symptom recorded. (Or from the symptoms’ point of view, some women have been recorded more than once). So my third question is: does this mean that some of the independent variables are not actually independent of each other? (There is around a 30%-32% discrepancy in all categories. How significant is this?)
There are lots of other problems with the paper (the fact the authors only look at referred women rather than all the symptomatic women that a primary care doctor sees is a case in point) but I’d like to know whether the statistics are flawed too. If there are any other questions I need to ask about the data please do let me know.
With very best wishes,
Ms Susan Mitchell
Jim Frost says
Hi Susan,
Offhand, I don’t see anything that screams to me that there is a definite problem. I’d have to read the study to be really sure. Here are some thoughts.
I’m not in the medical field, but I’ve heard talks by people in that field, and it sounds like this is a fairly common use for binary logistic regression. The analyst creates a model where you indicate which characteristics, risk factors, etc., apply to an individual. Then, the model predicts the probability of an outcome for them. I’ve seen similar models for surgical success, death, etc. The idea is that it’s fairly easy to use because someone can just enter the characteristics of the patient and the model spits out a probability. For any model of this type, you’d really have to check the residuals and see all the output to determine how well the model fits the data. But, there’s nothing inherently wrong with this approach.
I don’t see a problem with the sample size (6590) and the number of IVs (12). That’s actually a very good ratio of observations per IV.
It’s ok that there are fewer individuals in some categories. It’s better if you have a fairly equal number, but it’s not a show stopper. Categories with fewer observations will have less precise estimates, which can potentially reduce the precision of the model. You’d have to see how well the model fits the data to really know how it works out. But, yes, if you have an extremely low number of individuals with a particular symptom, you won’t get as precise an estimate for that symptom’s effect. You might see a wider CI for its odds ratio. But, it’s hard to say without seeing all of that output and how the numbers break down by symptom. And, it’s possible that they selected characteristics that apply to a sufficient number of women. Again, I wouldn’t be able to say. It’s an issue to consider for sure.
As for the number of symptoms versus the number of women, it’s ok that a woman can have more than one symptom. Each symptom is in its own column and will be coded with a 1 or 0. A row corresponds to one woman, and she’ll have a 1 for each characteristic that she has and 0s for the ones that she does not have. It’s possible these symptoms are correlated. These are categorical variables, so you couldn’t use Pearson’s correlation; you’d need something like the chi-square test of independence. And, some correlation is okay. Only very high correlation would be problematic. Again, I can’t say whether that’s a problem in this study or not because it depends on the degree of correlation. It might be, but it’s not necessarily a problem. You’d hope that the study strategically included a good set of IVs that aren’t overly correlated.
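If it helps, here’s a minimal sketch with made-up symptom indicators (nothing from the actual paper) showing how you could check two binary symptoms for association with a chi-square test of independence:

```python
# A minimal sketch with made-up symptom indicators: test whether two binary
# symptoms are associated using a chi-square test of independence.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
n = 500
symptom_a = rng.integers(0, 2, n)
# symptom_b partly follows symptom_a, so the two are associated by construction
symptom_b = np.where(rng.random(n) < 0.4, symptom_a, rng.integers(0, 2, n))

# Build the 2x2 contingency table of counts
table = np.zeros((2, 2), dtype=int)
for a, b in zip(symptom_a, symptom_b):
    table[a, b] += 1

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.4g}")
```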
Regarding the referred women vs. symptomatic women, that comes down to the population that is being sampled and how generalizable the results are. Not being familiar with the field, I don’t have a good sense for how that affects generalizability, but yes, that would be a concern to consider.
So, I don’t see anything that shouts to me that it’s a definite problem. But, as with any regression model, it would come down to the usual assessments of how well the model fits the data. You mention issues that could be concerns, but again, it depends on the specifics.
Sorry I couldn’t provide more detailed thoughts, but evaluating these things requires really specific information. Still, the general approach for this study seems sound to me.
Terri Leach says
Hi Jim,
I have a question: how well can we evaluate whether a regression equation fits the data by examining the R-squared statistic, and how can we test the statistical significance of the whole regression equation using the F-test?
Thank you~
Jim Frost says
Hi Terri,
I have two blog posts that will be perfect for you!
Interpreting R-squared
Interpreting the F-test of Overall Significance
If you have questions about either one, please post it in the comments section of the corresponding post. But, I think those posts will go a long way in answering your questions!
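In the meantime, here’s a minimal sketch with simulated data (not from any real study) showing where both statistics appear after fitting an ordinary least squares model:

```python
# A minimal sketch with simulated data: where R-squared and the overall F-test
# show up after fitting an OLS model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 2))
y = 1 + 2 * X[:, 0] - 1 * X[:, 1] + rng.normal(size=100)

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.rsquared)               # proportion of variance the model explains
print(fit.fvalue, fit.f_pvalue)   # F-test of overall significance
print(fit.summary())              # both statistics also appear in the summary table
```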
Dahlia says
Mr. Frost, I know I need to run a regression model; however, I’m still unsure which one. I’m examining the effects of alcohol use on teenagers with 4 confounders.
Jim Frost says
Hi Dahlia, to make the decision, I’d need to know what types of variables they all are (continuous, categorical, binary, etc.). However, if the outcome you’re measuring (the dependent variable) is continuous, then OLS linear regression is a great place to start!
Best of luck with your analysis!
Patrik Silva says
Thank you very much Jim,
Very helpful. I think my problem is really the number of observations (25). Yes, I have read that post also, and I always keep the theory in mind when analyzing the IVs.
My main objective is to show the relationship between X2 and Y, which is also supported by the literature. However, if I do not control for X1, I will never be sure whether the effect I have found is due to X2 or X1, because X1 and X2 are correlated.
I think correlation alone might be OK, since my number of observations is limited, and using regression limits the number of IVs that can be included in the model, which may force me to leave some other IVs out of the model, which is also bad.
Thank you again
Best regards!
PS
Patrik Silva says
Hi Jim,
Thank you for this very good post.
However, I have a question. What should I do if the IVs X1 and X2 are correlated (say 0.75) and both are correlated with Y (the DV) at 0.60? When I include X1 and X2 in the same model, X2 is not statistically significant, but when I include them separately, each is statistically significant. On the other hand, the model with only X1 has higher explanatory power than the model with only X2.
Note: In the individual models both meet the OLS assumptions, but together X2 becomes not statistically significant (using stepwise regression, X2 is removed from the model). What does this mean?
In addition, I know from the literature that X2 affects Y, but I am testing X1, and X1 is showing a better fit than X2.
Thank you in advance, I hope you understand my question!
Jim Frost says
Hi Patrik,
Yes, I understand completely! This situation isn’t too unusual. The underlying problem is that because the two IVs are correlated, they’re supplying a similar type of predictive information. There isn’t enough unique predictive information for both of them to be statistically significant. If you had a larger sample size, it’s possible that both would be significant. Also, keep in mind that correlation is a pairwise measure and doesn’t account for other variables. When you include both IVs in the model, the relationship between each IV and the DV is determined after accounting for the other variables in the model. That’s why you can see a pairwise correlation but not a relationship in a regression model.
I know you’ve read a number of my posts, but I’m not sure if you’ve read the one about model specification. In that post, a key point I make is not to use statistical measures alone to determine which IVs to leave in the model. If theory suggests that X2 should be included, you have a very strong case for including it even if it’s not significant when X1 is in the model–just be sure to include that discussion in your write-up.
Conversely, just because X1 seems to provide a better fit statistically and is significant with or without X2 doesn’t mean you must include it in the model. Those are strong signs that you should consider including a variable in the model. However, as always, use theory as a guide and document the rationale for the decisions you make.
For your case, you might consider including both IVs in the model. If they’re both supplying similar information and X2 is justified by theory, chances are that X1 is as well. Again, document your rationale. If you include both, check the VIFs to be sure that you don’t have problematic levels of multicollinearity. If those are the only two IVs in your model, that won’t be problematic given the correlations you describe. But, it could be problematic if you have more IVs in the model that are also correlated with X1 and X2.
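Here’s a minimal sketch with simulated data (the correlations roughly match the ones you describe, but nothing here comes from your study) showing one way to compute the VIFs with both IVs in the model:

```python
# A minimal sketch with simulated data: compute VIFs with both correlated IVs
# included in the model.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = 0.75 * x1 + rng.normal(scale=0.66, size=n)   # roughly r = 0.75 with x1
y = 2 + 1.5 * x1 + 1.0 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
for i, name in enumerate(["const", "x1", "x2"]):
    # x1 and x2 come out near 2.3 here, well under common trouble thresholds
    print(name, variance_inflation_factor(X, i))
```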
Another thing to look at is whether the coefficients for X1 and X2 vary greatly depending on whether you have one or both of the IVs in the model. If they don’t change much, that’s nice and simple. However, if they do change quite a bit, then you need to determine which coefficient values are likely to be closer to the correct values, because that corresponds to the choice about which IVs to include! I’m sounding like a broken record, but if this is a factor, document your rationale and decisions.
I hope that helps! Best of luck with your analysis!
Patrick says
Hi Jim,
Another great post! Thank you for truly making statistics intuitive.
I learned a lot of this material back in school, but am only now understanding them more conceptually thanks to you. Super useful for my work in analytics. Please keep it up!
Jim Frost says
Thanks, Patrick! It’s great to hear that it was helpful!
Jayant Jain says
I think there may be a typo here – “These are important variables that the statistical model does include and, therefore, cannot control.” Shouldn’t it be “does not include”, if I understand correctly?
Jim Frost says
Thanks, Jayant! Good eagle eyes! That is indeed a typo. I will fix it. Thanks for pointing it out!
Lucy Quinlan says
Mr. Jim, thank you for making me understand econometrics. I thought that an omitted variable is excluded from the model and that’s why it under/overestimates the coefficients. Somewhere in this article you mentioned that they are still included in the model but not controlled for. I find that very confusing. Would you be able to clarify?
Thanks a lot.
Jim Frost says
Hi Lucy,
You’re definitely correct. Omitted variable bias occurs when you exclude a variable from the model. If I gave the impression that it’s included, please let me know where in the text because I want to clarify that! Thanks!
By excluding the variable, the model does not control for it, which biases the results. When you include a previously excluded variable, the model can now control for it and the bias goes away. Maybe I wrote that in a confusing way?
Thanks! I always strive to make my posts as clear as possible, so I’ll think about how to explain this better.
Stan Alekman says
In addition to mean square error and adjusted R-squared, I use Cp, IC, HQC, and SBIC to decide the number of independent variables in multiple regression.
Jim Frost says
I think there are a variety of good measures. I’d also add predicted R-squared, as long as you use them in conjunction with subject-area expertise. As I mention in this post, the entire set of estimated relationships must make theoretical sense. If they don’t, the statistical measures are not important.
Stan Alekman says
I have to read the article you named. Having said that, caution is warranted when regression models describe systems or processes that are not in statistical control. Also, some processes have physical bounds that a regression model does not capture, so calculated predicted values may have no physical meaning. Further, models built from narrow ranges of the independent variables may not be applicable outside those ranges.
Jim Frost says
Hi Stan, those are all great points, and true. They all illustrate how you need to use your subject-area knowledge in conjunction with statistical analyses.
I talk about the issue of not going outside the range of the data, amongst other issues, in my post about Using Regression to Make Predictions.
I also agree about statistical control, which I think is underappreciated outside of the quality improvement arena. I’ve written about this in a post about using control charts with hypothesis tests.
Stan Alekman says
Valid confidence/prediction intervals are important if the regression model represents a process that is being characterized. When the prediction intervals are too wide, the model’s validity and utility are in question.
Jim Frost says
Hi Stan,
You’re definitely correct! If the model doesn’t fit the data, your predictions are worthless. One minor caveat that I’d add to your comment.
The prediction intervals can be too wide to be useful yet the model might still be valid. It’s really two separate assessments. Valid model and degree of precision. I write about this in several posts including the following: Understanding Precision in Prediction
Stan Alekman says
Jim, does centering any independent explanatory variable require centering them all? Center the dependent and explanatory variables?
I always make a normal probability plot of the deleted residuals as one test of the prediction capability of the fitted model. It is remarkable how consistently good models give good normal probability plots. I also use the Shapiro-Wilk test to assess the deleted residuals for normality.
Stan Alekman
Jim Frost says
Hi Stan,
Yes, you should center all of the continuous independent variables if your goal is to reduce multicollinearity and/or to be able to interpret the intercept. I’ve never seen a reason to center the dependent variable.
It’s funny that you mention that about normally distributed residuals! I, too, have been impressed with how frequently that occurs even with fairly simple models. I’ve recently written a post about OLS assumptions and I mention how normal residuals are sort of optional. They only need to be normally distributed if you want to perform hypothesis tests and have valid confidence/prediction intervals. Most analysts want at least the hypothesis tests!
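For anyone following along, here’s a minimal sketch with made-up data (using ordinary rather than deleted residuals, for simplicity) showing one way to center the continuous IVs and then check the residuals with a Shapiro-Wilk test and a normal probability plot:

```python
# A minimal sketch with made-up data: center the continuous IVs, fit the model,
# then check the residuals for normality.
import numpy as np
import statsmodels.api as sm
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
n = 150
x1 = rng.uniform(10, 50, n)
x2 = rng.uniform(0, 100, n)
y = 5 + 0.8 * x1 + 0.3 * x2 + 0.02 * x1 * x2 + rng.normal(size=n)

# Center each continuous IV before forming the interaction term
x1c, x2c = x1 - x1.mean(), x2 - x2.mean()
X = sm.add_constant(np.column_stack([x1c, x2c, x1c * x2c]))
fit = sm.OLS(y, X).fit()

w, p = stats.shapiro(fit.resid)        # Shapiro-Wilk test of the residuals
print(f"Shapiro-Wilk W = {w:.3f}, p = {p:.3f}")

stats.probplot(fit.resid, plot=plt)    # normal probability (Q-Q) plot
plt.show()
```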
Mugdha Bhatnagar says
Hey Jim, your blogs are really helpful for me to learn data science. Here is a question from my assignment:
You have built a classification model with 90% accuracy, but your client is not happy because the false positive rate was very high. What will you do?
Can we do something about it using precision or recall?
That’s the whole question; nothing is given in the background, though they should have given more!
Brahim KHOUILED says
Thank you Jim
Really interesting
Jim Frost says
Hi Brahim, you’re very welcome! I’m glad it was interesting!
MG says
Hey Jim, you are awesome.
Jim Frost says
Aw, MG, thanks so much!! 🙂
SFdude says
Thanks for another great article, Jim!
Q: Could you expand, with a specific plot example, to explain this statement more clearly:
“We know that for omitted variable bias to exist, an independent variable must correlate with the residuals. Consequently, we can plot the residuals by the variables in our model. If we see a relationship in the plot, rather than random scatter, it both tells us that there is a problem and points us towards the solution. We know which independent variable correlates with the confounding variable.”
Thanks!
SFdude
Jim Frost says
Hi, thanks!
I’ll try to find a good example plot to include soon. Basically, you’re looking for any non-random pattern. For example, the residuals might tend to either increase or decrease as the value of the independent variable increases. That relationship can follow a straight line or display curvature, depending on the nature of the relationship.
I hope this helps!
Saketh prasad says
It’s been a long time since I heard from you, Jim. Missed your stats!
Jim Frost says
Hi Saketh, thanks, you’re too kind! I try to post here every two weeks at least. Occasionally, weekly!