Variance Inflation Factors (VIFs) measure the correlation among independent variables in least squares regression models. Statisticians refer to this type of correlation as multicollinearity. Excessive multicollinearity can cause problems for regression models.
In this post, I focus on VIFs: how they detect multicollinearity, why they’re better than pairwise correlations, how to calculate them yourself, and how to interpret them. If you need a refresher about the types of problems that multicollinearity causes and how to fix them, read my post: Multicollinearity: Problems, Detection, and Solutions.
Why Use VIFs Rather Than Pairwise Correlations?
Multicollinearity is correlation amongst the independent variables. Consequently, it seems logical to assess the pairwise correlation between all independent variables (IVs) in the model. That is one possible method. However, imagine a scenario where you have four IVs, and the pairwise correlations between each pair are not high, say around 0.6. No problem, right?
Unfortunately, you might still have problematic levels of collinearity. While the correlations between IV pairs are not exceptionally high, it’s possible that three IVs together could explain a very high proportion of the variance in the fourth IV.
Hmmm, using multiple variables to explain variability in another variable sounds like multiple regression analysis. And that’s the method that VIFs use!
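To make that scenario concrete, here is a minimal simulation sketch. The data are hypothetical and not from this post: three unrelated IVs plus a fourth that is nearly their sum, so the fourth correlates only moderately with each of the others. NumPy and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 5000

# Hypothetical data: three unrelated IVs, plus a fourth that is nearly their sum.
x1, x2, x3 = rng.standard_normal((3, n))
x4 = x1 + x2 + x3 + 0.3 * rng.standard_normal(n)
X = np.column_stack([x1, x2, x3, x4])

# Largest absolute pairwise correlation, ignoring the diagonal of 1s.
corr = np.corrcoef(X, rowvar=False)
largest_pairwise = np.abs(corr - np.eye(4)).max()

# Regress x4 on x1, x2, x3 (with an intercept) and compute R-squared.
design = np.column_stack([np.ones(n), x1, x2, x3])
coef, *_ = np.linalg.lstsq(design, x4, rcond=None)
residuals = x4 - design @ coef
r_squared = 1 - residuals.var() / x4.var()

print(f"largest pairwise correlation: {largest_pairwise:.2f}")
print(f"R-squared for x4 on the other three: {r_squared:.3f}")
```

No single pairwise correlation looks alarming here, yet the three IVs together explain nearly all of the variance in the fourth.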
Related post: Interpreting Correlation Coefficients
Calculating Variance Inflation Factors
VIFs use multiple regression to calculate the degree of multicollinearity. Imagine you have four independent variables: X1, X2, X3, and X4. Of course, the model has a dependent variable (Y), but we don’t need to worry about it for our purposes. When your statistical software calculates VIFs, it uses multiple regression to regress one IV on all the other IVs. It repeats this process for every IV, as shown below:
- X1 ⇐ X2, X3, X4
- X2 ⇐ X1, X3, X4
- X3 ⇐ X1, X2, X4
- X4 ⇐ X1, X2, X3
To calculate the VIFs, each independent variable takes a turn as the dependent variable. Each model produces an R-squared value indicating the percentage of the variance in the individual IV that the other IVs explain. Consequently, higher R-squared values indicate higher degrees of multicollinearity. VIF calculations use these R-squared values. The VIF for an independent variable equals the following:

$$ \text{VIF}_i = \frac{1}{1 - R_i^2} $$

Where the subscript i indicates the independent variable. There is a VIF for each IV.
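The procedure can be sketched in a few lines of Python. This is a minimal illustration rather than what any particular statistical package does internally: it regresses each IV on the others with an intercept and computes each VIF as 1 / (1 − R²). The function name and the simulated data are hypothetical.

```python
import numpy as np

def vifs_by_auxiliary_regression(X):
    """Return one VIF per column of X (rows = observations, columns = IVs)."""
    X = np.asarray(X, dtype=float)
    n_obs, n_ivs = X.shape
    vifs = np.empty(n_ivs)
    for i in range(n_ivs):
        target = X[:, i]                      # this IV plays the dependent variable
        others = np.delete(X, i, axis=1)      # all remaining IVs are the predictors
        design = np.column_stack([np.ones(n_obs), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = residuals @ residuals
        ss_tot = ((target - target.mean()) ** 2).sum()
        r_squared = 1 - ss_res / ss_tot
        vifs[i] = 1 / (1 - r_squared)
    return vifs

# Hypothetical data: X4 is built partly from X1, so both should show inflated VIFs.
rng = np.random.default_rng(1)
demo = rng.standard_normal((200, 4))
demo[:, 3] += 2 * demo[:, 0]
print(vifs_by_auxiliary_regression(demo))
```

Note that the dependent variable Y never appears: VIFs depend only on the IVs.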
When R-squared equals zero, there is no multicollinearity because the set of IVs does not explain any of the variability in the remaining IV. Take a look at the equation and notice that when R-squared equals 0, both the numerator and denominator equal 1, producing a VIF of 1. This is the lowest possible VIF and it indicates absolutely no multicollinearity.
As R-squared increases, the denominator decreases, causing the VIFs to increase. In other words, as the set of IVs explains more of the variance in the individual IV, it indicates higher multicollinearity and the VIFs increase from 1.
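A quick way to get a feel for this relationship is to tabulate VIF = 1 / (1 − R²) for a few R-squared values. The helper name below is hypothetical, and the R-squared values are chosen only for illustration.

```python
def vif_from_r_squared(r_squared):
    """VIF implied by the R-squared of one IV regressed on the others."""
    if not 0 <= r_squared < 1:
        raise ValueError("R-squared must be in [0, 1)")
    return 1 / (1 - r_squared)

for r2 in (0.0, 0.5, 0.8, 0.9, 0.99):
    print(f"R-squared = {r2:.2f}  ->  VIF = {vif_from_r_squared(r2):.1f}")
```

The increase is far from linear: VIFs climb slowly at first and then shoot upward as R-squared approaches 1.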
Related post: How to Interpret R-squared
How to Interpret VIFs
From the above, we know that a VIF of 1 represents no multicollinearity, and higher values indicate more multicollinearity is present. What do these values actually mean?
The name “variance inflation factor” gives it away. VIFs represent the factor by which the correlations amongst the predictors inflate the variance. For example, a VIF of 4 indicates that multicollinearity inflates the variance by a factor of 4 compared to a model with no multicollinearity.
The next logical questions are: What variance are we talking about? And why should I care?
The variances in question are the variances of the coefficient estimates. Their square roots are the standard errors, which describe the precision of these estimates, so a VIF of 4 doubles a standard error. Confidence interval calculations use these standard errors to determine the interval widths for the coefficients. Larger standard errors produce wider confidence intervals, which indicate that the coefficient estimates are less precise. Lower precision makes it more difficult to obtain statistically significant results. Additionally, wide CIs are more likely to cross zero, explaining why multicollinearity can cause coefficient signs to flip. In short, larger standard errors produce imprecise, unstable coefficient estimates and reduce statistical power.
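Because a standard error is the square root of a variance, a VIF inflates the standard error, and with it the confidence interval width, by the square root of the VIF. Here is a small sketch of that arithmetic, assuming everything else about the model stays the same. The function name is hypothetical.

```python
import math

def se_inflation(vif):
    """Factor by which a coefficient's standard error grows for a given VIF."""
    if vif < 1:
        raise ValueError("VIF cannot be below 1")
    return math.sqrt(vif)

for vif in (1, 4, 9, 14.93):
    print(f"VIF = {vif:>5}: standard error and CI width x {se_inflation(vif):.2f}")
```

So a VIF of 4 corresponds to a confidence interval about twice as wide as it would be with no multicollinearity.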
These are the classic effects of multicollinearity.
How to Calculate VIFs
In my blog post about multicollinearity, I use regression analysis to model the relationship between the independent variables (physical activity, body fat percentage, weight, and the interaction between weight and body fat) and the dependent variable (bone mineral density of the femoral neck). Below is the portion of the statistical output that contains the VIFs. We’ll focus on %Fat, which has a VIF of 14.93.
Now, let’s calculate the VIF for the %Fat independent variable ourselves. To do that, I’ll fit a regression model where %Fat is now the dependent variable and include the remaining independent variables and the interaction term as IVs in this new model (physical activity, weight, and the Weight*Fat interaction). To try it yourself, download this CSV dataset: MulticollinearityVIF.
Below is the statistical output for this model. You can see that the other variables are at or near statistical significance.
Collectively, the model explains 93.3% of the variance for %Fat, which is quite high. For calculating the VIF, we’ll need to use this R-squared value:

$$ \text{VIF}_{\%\text{Fat}} = \frac{1}{1 - 0.933} = \frac{1}{0.067} \approx 14.93 $$
That rounds to the value in the original statistical output!
How High is Too High?
A moderate amount of multicollinearity is okay. However, as multicollinearity increases, its effects also increase. What VIF values represent too much correlation?
I’ve seen different thresholds. I consider VIFs of five and higher to represent problematic amounts of multicollinearity. However, I’ve seen recommendations range from 2.5 to 10! I wish I could give you a concrete answer. However, I’ve long used five as a threshold, and that seems to work well.
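One way to compare these thresholds is to translate each back into the R-squared it implies, by rearranging the VIF formula to R² = 1 − 1 / VIF. The sketch below does just that. The function name is hypothetical.

```python
def r_squared_from_vif(vif):
    """R-squared of the IV-on-other-IVs regression implied by a given VIF."""
    if vif < 1:
        raise ValueError("VIF cannot be below 1")
    return 1 - 1 / vif

for threshold in (2.5, 5, 10):
    print(f"VIF = {threshold:>4}  ->  R-squared = {r_squared_from_vif(threshold):.2f}")
```

Seen this way, a threshold of five flags an IV once the other IVs explain 80% of its variance, while thresholds of 2.5 and 10 correspond to 60% and 90%.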
Another consideration is that each coefficient estimate has its own standard error with its corresponding VIF. Consequently, you should evaluate each independent variable separately for multicollinearity. Even when some IVs in your model have high VIFs, other variables might have low VIFs and be absolutely fine. When extreme multicollinearity is present in some variables, it does not affect IVs with low VIFs.
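The simulation sketch below illustrates this point with hypothetical data: two IVs that are nearly copies of each other and a third that is unrelated to both. For brevity it uses a shortcut instead of separate regressions: the VIFs equal the diagonal of the inverse of the IVs' correlation matrix. The threshold of five follows the rule of thumb above.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed so the sketch is repeatable
n = 1000

x1 = rng.standard_normal(n)
x2 = x1 + 0.2 * rng.standard_normal(n)   # nearly a copy of x1
x3 = rng.standard_normal(n)              # unrelated to x1 and x2
X = np.column_stack([x1, x2, x3])

# Shortcut: the diagonal of the inverse correlation matrix gives the VIFs.
vifs = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))

for name, vif in zip(("x1", "x2", "x3"), vifs):
    flag = "high" if vif >= 5 else "ok"
    print(f"{name}: VIF = {vif:.2f} ({flag})")
```

The two near-duplicate IVs get very high VIFs, while the unrelated IV stays close to 1.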