Variance Inflation Factors (VIFs) measure the correlation among independent variables in least squares regression models. Statisticians refer to this type of correlation as multicollinearity. Excessive multicollinearity can cause problems for regression models.
In this post, I focus on VIFs and how they detect multicollinearity, why they’re better than pairwise correlations, how to calculate VIFs yourself, and interpreting VIFs. If you need a refresher about the types of problems that multicollinearity causes and how to fix them, read my post: Multicollinearity: Problems, Detection, and Solutions.
Why Use VIFs Rather Than Pairwise Correlations?
Multicollinearity is correlation amongst the independent variables. Consequently, it seems logical to assess the pairwise correlation between all independent variables (IVs) in the model. That is one possible method. However, imagine a scenario where you have four IVs, and the pairwise correlations between each pair are not high, say around 0.6. No problem, right?
Unfortunately, you might still have problematic levels of collinearity. While the correlations between IV pairs are not exceptionally high, it’s possible that three IVs together could explain a very high proportion of the variance in the fourth IV.
Hmmm, using multiple variables to explain variability in another variable sounds like multiple regression analysis. And that’s the method that VIFs use!
Related post: Interpreting Correlation Coefficients
Calculating Variance Inflation Factors
VIFs use multiple regression to calculate the degree of multicollinearity. Imagine you have four independent variables: X1, X2, X3, and X4. Of course, the model has a dependent variable (Y), but we don’t need to worry about it for our purposes. When your statistical software calculates VIFs, it uses multiple regression to regress each IV on all of the other IVs. It repeats this process for every IV in the model, as shown below:
- X1 ⇐ X2, X3, X4
- X2 ⇐ X1, X3, X4
- X3 ⇐ X1, X2, X4
- X4 ⇐ X1, X2, X3
To calculate the VIFs, each independent variable takes a turn as the dependent variable in one of these models. Each model produces an R-squared value indicating the percentage of the variance in that IV that the remaining IVs explain. Consequently, higher R-squared values indicate higher degrees of multicollinearity. VIF calculations use these R-squared values. The VIF for an independent variable equals the following:

VIFᵢ = 1 / (1 − Rᵢ²)

Where the subscript i indicates the independent variable. There is a VIF for each IV.
When R-squared equals zero, there is no multicollinearity because the set of IVs does not explain any of the variability in the remaining IV. Take a look at the equation and notice that when R-squared equals 0, both the numerator and denominator equal 1, producing a VIF of 1. This is the lowest possible VIF and it indicates absolutely no multicollinearity.
As R-squared increases, the denominator decreases, causing the VIFs to increase. In other words, as the set of IVs explains more of the variance in the individual IV, it indicates higher multicollinearity and the VIFs increase from 1.
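To make the procedure concrete, below is a minimal sketch in Python with statsmodels. The data are simulated purely for illustration, and the X1 through X4 names simply mirror the list above; the loop regresses each IV on the remaining IVs and applies VIF = 1 / (1 − R²).

```python
# A minimal sketch of the VIF calculation described above (simulated data).
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulate four correlated independent variables (hypothetical data).
rng = np.random.default_rng(1)
X1 = rng.normal(size=200)
X2 = 0.6 * X1 + rng.normal(size=200)
X3 = 0.6 * X1 + 0.3 * X2 + rng.normal(size=200)
X4 = rng.normal(size=200)
ivs = pd.DataFrame({"X1": X1, "X2": X2, "X3": X3, "X4": X4})

# For each IV, regress it on the remaining IVs and apply VIF = 1 / (1 - R^2).
for col in ivs.columns:
    others = sm.add_constant(ivs.drop(columns=col))
    r_squared = sm.OLS(ivs[col], others).fit().rsquared
    vif = 1 / (1 - r_squared)
    print(f"{col}: R-squared = {r_squared:.3f}, VIF = {vif:.2f}")
```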
Related post: How to Interpret R-squared
How to Interpret VIFs
From the above, we know that a VIF of 1 represents no multicollinearity, and higher values indicate more multicollinearity is present. What do these values actually mean?
The name “variance inflation factor” gives it away. VIFs represent the factor by which the correlations amongst the predictors inflate the variance. For example, a VIF of 4 indicates that multicollinearity inflates the variance by a factor of 4 compared to a model with no multicollinearity.
The next logical questions are: What variance are we talking about? And why should I care?
The variances in question are the variances of the coefficient estimates; their square roots are the standard errors, which measure the precision of these estimates. Because multicollinearity inflates a coefficient’s variance by its VIF, it inflates the standard error by the square root of the VIF, so a VIF of 4 doubles the standard error. Confidence interval calculations use these standard errors to determine the interval widths for the coefficients. Larger standard errors produce wider confidence intervals, which indicates that the coefficient estimates are less precise. Lower precision makes it more difficult to obtain statistically significant results. Additionally, wide CIs are more likely to cross zero, explaining why multicollinearity can cause coefficient signs to flip. In short, larger standard errors produce imprecise, unstable coefficient estimates and reduce statistical power.
These are the classic effects of multicollinearity.
How to Calculate VIFs
In my blog post about multicollinearity, I use regression analysis to model the relationship between the independent variables (physical activity, body fat percentage, weight, and the interaction between weight and body fat) and the dependent variable (bone mineral density of the femoral neck). Below is the portion of the statistical output that contains the VIFs. We’ll focus on %Fat, which has a VIF of 14.93.
Now, let’s calculate the VIF for the %Fat independent variable ourselves. To do that, I’ll fit a regression model where %Fat is now the dependent variable and include the remaining independent variables and the interaction term as IVs in this new model (physical activity, weight, and the Weight*Fat interaction). To try it yourself, download this CSV dataset: MulticollinearityVIF.
Below is the statistical output for this model. You can see that the other variables are at or near statistical significance.
Collectively, the model explains 93.3% of the variance of %Fat, which is quite high. To calculate the VIF, we plug this R-squared value into the formula:

VIF = 1 / (1 − R²) = 1 / (1 − 0.933) = 14.925
That rounds to the value in the original statistical output!
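If you’d like to check this calculation in code, here is a minimal sketch in Python with statsmodels. The column names (Activity, Weight, Fat) are assumptions about how the downloadable CSV is laid out, so adjust them to match the actual file.

```python
# A sketch of reproducing the %Fat VIF from the downloadable dataset.
# The column names below are assumptions; rename them to match the CSV.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("MulticollinearityVIF.csv")
df["WeightByFat"] = df["Weight"] * df["Fat"]  # the Weight*Fat interaction term

# Regress %Fat on the remaining IVs, then apply VIF = 1 / (1 - R^2).
ivs = sm.add_constant(df[["Activity", "Weight", "WeightByFat"]])
r_squared = sm.OLS(df["Fat"], ivs).fit().rsquared
print(f"R-squared = {r_squared:.3f}, VIF = {1 / (1 - r_squared):.2f}")
```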
How High is Too High?
A moderate amount of multicollinearity is okay. However, as multicollinearity increases, its effects also increase. What VIF values represent too much correlation?
I’ve seen different thresholds. I consider VIFs of five and higher to represent problematic amounts of multicollinearity. However, I’ve seen recommendations range from 2.5 to 10! I wish I could give you a concrete answer. However, I’ve long used five as a threshold, and that seems to work well.
Another consideration is that each coefficient estimate has its own standard error and, consequently, its own VIF. Therefore, you should evaluate each independent variable separately for multicollinearity. Even when some IVs in your model have high VIFs, other variables might have low VIFs and be perfectly fine. Extreme multicollinearity among some variables does not affect IVs with low VIFs.
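Because each IV has its own VIF, it’s handy to screen them all at once. Below is a minimal sketch using statsmodels’ built-in variance_inflation_factor function; the helper name and the default cutoff of five (matching the threshold above) are just illustrative choices.

```python
# A sketch of screening every IV at once with statsmodels' built-in
# variance_inflation_factor. Pass a DataFrame containing only the IVs.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(ivs: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Return each IV's VIF and flag values above the chosen threshold."""
    exog = sm.add_constant(ivs)  # include an intercept in the auxiliary regressions
    rows = []
    for i, name in enumerate(exog.columns):
        if name == "const":
            continue  # the intercept itself does not get a VIF
        vif = variance_inflation_factor(exog.values, i)
        rows.append({"variable": name, "VIF": vif, "high": vif > threshold})
    return pd.DataFrame(rows)
```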
Great post. I actually did this very thing a few months ago. Doubtless it was another post of yours that inspired me to do VIF analysis, b/c I would not have thought of it on my own.
I have a model with six continuous IVs. Two of the coefficients were negative, which in the context of my data makes NO SENSE whatsoever. So I regressed each of the IVs against the others. Hoo-BOY! Lots of collinearity. Now I use fewer IVs and have a stronger model.
Jim Frost –
This posting on VIFs is great! I’ve only thought of them superficially until now, and have never before seen such a thorough and clear explanation. Explaining it in terms of R^2 with a link to another posting on that is nice. It helps show why VIFs are a bit nebulous, since R^2s are subject to interpretation themselves. I like graphical residual analyses and think one may somewhat ignore R^2 values when it is so easy to see the graphics, and then one can consider cross-validation. I see here that VIFs could possibly be very misleading since we don’t generally do the corresponding graphical residual analyses for them. But I have given them some consideration, and it is good to understand them better.
Very nice posting. Thank you.
I have a question which betrays my ignorance here. Assuming that the R^2s in the VIFs were very meaningful, not finding problems in the scatterplots, I’m interested in the inflation which could cause a sign to flip. So in the example you gave, I see that the standard error of the coefficient that you picked was estimated to be 0.00409. The VIF was 14.925. Does that mean that the effective standard error is really
SQRT(((0.00409)^2)(14.925)) = 0.0158, if the R^2 is meaningful, or am I looking at this incorrectly?
Hope that isn’t too silly of a question, but it is the question I have since I get a little lost.
🙂
Anyway, thank you for the posting.
Be safe and have a Happy Holiday Season.
Cheers – Jim Knaub