Variance Inflation Factors (VIFs) measure the correlation among independent variables in least squares regression models. Statisticians refer to this type of correlation as multicollinearity. Excessive multicollinearity can cause problems for regression models.
In this post, I focus on VIFs and how they detect multicollinearity, why they’re better than pairwise correlations, how to calculate VIFs yourself, and interpreting VIFs. If you need a refresher about the types of problems that multicollinearity causes and how to fix them, read my post: Multicollinearity: Problems, Detection, and Solutions.
Why Use VIFs Rather Than Pairwise Correlations?
Multicollinearity is correlation amongst the independent variables. Consequently, it seems logical to assess the pairwise correlation between all independent variables (IVs) in the model. That is one possible method. However, imagine a scenario where you have four IVs, and the pairwise correlations between each pair are not high, say around 0.6. No problem, right?
Unfortunately, you might still have problematic levels of collinearity. While the correlations between IV pairs are not exceptionally high, it’s possible that three IVs together could explain a very high proportion of the variance in the fourth IV.
Hmmm, using multiple variables to explain variability in another variable sounds like multiple regression analysis. And that’s the method that VIFs use!
Related post: Interpreting Correlation Coefficients
Calculating Variance Inflation Factors
VIFs use multiple regression to calculate the degree of multicollinearity. Imagine you have four independent variables: X1, X2, X3, and X4. Of course, the model has a dependent variable (Y), but we don’t need to worry about it for our purposes. When your statistical software calculates VIFs, it uses multiple regression to regress each IV on all of the remaining IVs. It repeats this process for all IVs, as shown below:
- X1 ⇐ X2, X3, X4
- X2 ⇐ X1, X3, X4
- X3 ⇐ X1, X2, X4
- X4 ⇐ X1, X2, X3
To calculate the VIFs, each independent variable takes a turn as the dependent variable. Each model produces an R-squared value indicating the percentage of the variance in the individual IV that the remaining set of IVs explains. Consequently, higher R-squared values indicate higher degrees of multicollinearity. VIF calculations use these R-squared values. The VIF for an independent variable equals the following:
VIF_i = 1 / (1 − R_i²)

Where the subscript i indicates the independent variable and R_i² is the R-squared from regressing that IV on the remaining IVs. There is a VIF for each IV.
When R-squared equals zero, there is no multicollinearity because the set of IVs does not explain any of the variability in the remaining IV. Take a look at the equation and notice that when R-squared equals 0, both the numerator and denominator equal 1, producing a VIF of 1. This is the lowest possible VIF, and it indicates absolutely no multicollinearity.
As R-squared increases, the denominator decreases, causing the VIFs to increase. In other words, as the set of IVs explains more of the variance in the individual IV, it indicates higher multicollinearity and the VIFs increase from 1.
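To make the auxiliary-regression recipe concrete, here is a minimal sketch in Python. The `vif` helper and the simulated data are my own illustration of the process described above, using only NumPy's least-squares solver:

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor for each column of X by
    regressing it on the remaining columns (with an intercept) and
    applying VIF = 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for i in range(k):
        y = X[:, i]                                   # this IV becomes the DV
        others = np.delete(X, i, axis=1)              # remaining IVs are predictors
        A = np.column_stack([np.ones(n), others])     # add an intercept column
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # ordinary least squares
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        factors.append(1.0 / (1.0 - r2))
    return factors

# Simulated data: x3 is nearly a linear combination of x1 and x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.5 * x2 + 0.1 * rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
print([round(v, 1) for v in vifs])  # x3's VIF is far above any common threshold
```

Dedicated implementations exist (for example, in statsmodels), but the loop above is exactly the auxiliary-regression process this section describes.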
Related post: How to Interpret R-squared
How to Interpret VIFs
From the above, we know that a VIF of 1 represents no multicollinearity, and higher values indicate more multicollinearity is present. What do these values actually mean?
The name “variance inflation factor” gives it away. VIFs represent the factor by which the correlations amongst the predictors inflate the variance. For example, a VIF of 4 indicates that multicollinearity inflates the variance by a factor of 4 compared to a model with no multicollinearity.
The next logical questions are: What variance are we talking about? And why should I care?
The variances in question are those of the coefficient estimates; their square roots are the standard errors, which reflect the precision of those estimates. Confidence interval calculations use these standard errors to determine the interval widths for the coefficients. Larger standard errors produce wider confidence intervals, which indicates that the coefficient estimates are less precise. Lower precision makes it more difficult to obtain statistically significant results. Additionally, wide CIs are more likely to cross zero, explaining why multicollinearity can cause coefficient signs to flip. In short, larger standard errors produce imprecise, unstable coefficient estimates and reduce statistical power.
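One arithmetic detail worth spelling out: because the VIF multiplies a variance, the standard error itself grows by the square root of the VIF. A tiny sketch with a hypothetical baseline standard error:

```python
import math

vif_value = 4.0   # VIF reported for a coefficient
se_base = 0.5     # hypothetical SE the estimate would have with no multicollinearity

inflated_variance = (se_base ** 2) * vif_value  # the VIF scales the variance
inflated_se = math.sqrt(inflated_variance)      # so the SE scales by sqrt(VIF) = 2
print(inflated_se)  # 1.0
```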
These are the classic effects of multicollinearity.
How to Calculate VIFs
In my blog post about multicollinearity, I use regression analysis to model the relationship between the independent variables (physical activity, body fat percentage, weight, and the interaction between weight and body fat) and the dependent variable (bone mineral density of the femoral neck). Below is the portion of the statistical output that contains the VIFs. We’ll focus on %Fat, which has a VIF of 14.93.
Now, let’s calculate the VIF for the %Fat independent variable ourselves. To do that, I’ll fit a regression model where %Fat is now the dependent variable and include the remaining independent variables and the interaction term as IVs in this new model (physical activity, weight, and the Weight*Fat interaction). To try it yourself, download this CSV dataset: MulticollinearityVIF.
Below is the statistical output for this model. You can see that the other variables are at or near statistical significance.
Collectively, the model explains 93.3% of the variance for %Fat, which is quite high. To calculate the VIF, plug this R-squared value into the formula: VIF = 1 / (1 − 0.933) = 1 / 0.067 ≈ 14.93.
That matches the value in the original statistical output!
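The same arithmetic in a couple of lines of Python, as a sanity check:

```python
r_squared = 0.933            # R-squared from the auxiliary model for %Fat
vif_fat = 1 / (1 - r_squared)
print(round(vif_fat, 2))     # 14.93
```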
How High is Too High?
A moderate amount of multicollinearity is okay. However, as multicollinearity increases, its effects also increase. What VIF values represent too much correlation?
I’ve seen different thresholds. I consider VIFs of five and higher to represent problematic amounts of multicollinearity. However, I’ve seen recommendations range from 2.5 to 10! I wish I could give you a concrete answer. However, I’ve long used five as a threshold, and that seems to work well.
Another consideration is that each coefficient estimate has its own standard error with its corresponding VIF. Consequently, you should evaluate each independent variable separately for multicollinearity. Even when some IVs in your model have high VIFs, other variables might have low VIFs and be absolutely fine. When extreme multicollinearity is present in some variables, it does not affect IVs with low VIFs.
This is a very concise but insightful post. Thank you!
From this post, the VIF must not be less than 1.
I’ve experienced a case in which the VIF of the constant was zero.
Could there have been something wrong?
Looking forward to your reply.
Thanks again for posting.
Jim Frost says
In order for a correlation to exist, you must have variability in two variables. By definition, the constant is not a variable. Hence, correlation is not possible. So, having a zero VIF for the constant doesn’t indicate anything is wrong. The only surprising aspect is that statistical software usually won’t even calculate a VIF for the constant because it’s not needed!
Thank you Jim, must I test for multicollinearity of IVs when using multinomial logistic regression?
Does it make sense to calculate VIFs when using logistic regression? Is it possible to calculate VIFs for categorical i.e. non-numeric variables? Based on your article above, I believe the answer to both of these questions is no. But I just wanted to be sure.
I’m seeing a scenario where, due to the difficulty of obtaining the R2 values from BigQuery ML (BQML) for a multiple linear regression model, I’m trying to create the auxiliary regressions, get the R2 for each scenario (excluding one IV at a time), and then calculate VIF = 1/(1-R2).
But I actually obtained a negative R2 for one scenario.
In this case, if I use the R2 value directly, the denominator becomes 1 - (-0.00414) = 1.00414,
and 1/1.00414 = 0.995877, which is less than 1. That shouldn’t be the case, right?
Could you please clarify whether the modulus function (|x|) has to be used for R2?
i.e., VIF = 1 / (1 – |R2| )
Thanks in advance! (Stuck on a deadline – a quick reply might save the day for me)
Jim Frost says
I think I see what you’re trying to do. I don’t know why you’re obtaining strange R2 values. Yes, your formula for VIF is correct. However, one possibility for the strange values is that the underlying model is nonlinear. Please note the technical definition of nonlinear because it is not synonymous with curved.
R-squared is not valid for nonlinear models, and you can obtain values outside the acceptable range. For more information, read my post about why R-squared is invalid for nonlinear models.
Your blogs are concise and easy to understand.
Thank you – Jim.
Philip Thorpe says
Great post on VIFs, thank you. You said that you use 5 as a VIF threshold, do you have a link to one of your papers where you have demonstrated this? I wish to cite this as part of a study on orthoptera distribution in the UK.
Jim Frost says
Typically, statisticians will say that it becomes problematic somewhere in the 5-10 range. Some statisticians will adjust the value based on the number of IVs in the model. Personally, I use 5 based on practical experience.
The reference below for Fox (2016) suggests a cutoff value of four (IIRC). At this value, precision is cut in half. However, there’s no magic dividing line where on one side there is no reduction of precision and on the other there is. It’s a question of how much loss of precision is acceptable, which leads to the differences in opinion for a good cutoff value.
Fox, John. 2016. Applied Regression Analysis and Generalized Linear Models. 3rd ed. Los Angeles: Sage Publications.
RAKESH MONDAL says
Each of your articles related to statistics is highly understandable, and so is this one on VIFs.
Jim Frost says
Thanks so much, Rakesh!
I am trying to figure out how a regression model can be used to predict if a customer will default on payment terms. I know that regression requires independent and dependent variables. I am not sure how to set up the data. Could you please provide some insight? Also, would I be able to use the regression in the data analysis toolkit in Excel?
Jim Frost says
That sounds like a binary logistic regression model to me. For those models, the dependent variable is binary, such as default yes or no. You include independent variables that allow you to predict the probability of the DV. In your case, you could use it to predict the probability that a customer will default based on their IV values.
To set up the data, you’ll need to have each variable in a column. Each row represents the values for a single customer. Depending on your software, the DV will need to be either 1s and 0s, or possibly you can use text values, which the software will interpret as a binary variable. For Excel, you’d need to code it using 1s for defaults and 0s for no default.
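For instance, here is a minimal sketch of that data layout in Python (all field names and values are hypothetical):

```python
# Hypothetical raw customer records: (defaulted?, annual income, late payments)
raw = [("yes", 52000, 3), ("no", 81000, 0), ("no", 64000, 1), ("yes", 40000, 5)]

# Code the binary DV as 1 (default) / 0 (no default); one row per customer,
# with the DV followed by that customer's IV values.
rows = [(1 if d == "yes" else 0, income, late) for d, income, late in raw]
print(rows[0])  # (1, 52000, 3)
```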
You cannot perform binary logistic regression using the Regression option in the Data Analysis toolkit. However, I gather that Excel can perform binary logistic regression using the Solver add-in, after some data preprocessing. Check the Data tab near the Analysis option and look for Solver. I’m thinking about writing a blog post about that some day, but right now I don’t have all the details.
To learn more about this type of regression analysis, read my post where I use binary logistic regression to analyze data. It’ll give you an idea of its capabilities.
I hope that helps!
Uri Gottlieb says
Thank you! Very clear and straightforward, and helped me understand both the intuition and the calculations.
Would be amazing if you could attach relevant R codes to future posts.
All the best,
This is a wonderful explanation! Very easy to understand.
Joseph Lombardi says
Great post. I actually did this very thing a few months ago. Doubtless it was another post of yours that inspired me to do VIF analysis, b/c I would not have thought of it on my own.
I have a model with six continuous IVs. Two of the coefficients were negative, which in the context of my data makes NO SENSE whatsoever. So I regressed each of the IVs against the others. Hoo-BOY! Lotsa collinearity. Now I use fewer IVs and have a stronger model.
Charles Wheelus says
Excellent post and very timely for a project I am working on. Thanks Jim and keep up the great work!!
Muzi Dlamini says
Thank you Jim, it’s always a pleasure to read your posts.
Jim Frost says
Thank you so much, Muzi!
Nathan Jones says
Great explanation using easy-to-understand terminology.
Jim Knaub says
Jim Frost –
This posting on VIFs is great! I’ve only thought of them superficially until now, and have never before seen such a thorough and clear explanation. Explaining it in terms of R^2 with a link to another posting on that is nice. It helps show why VIFs are a bit nebulous, since R^2s are subject to interpretation themselves. I like graphical residual analyses and think one may somewhat ignore R^2 values when it is so easy to see the graphics, and then one can consider cross-validation. I see here that VIFs could possibly be very misleading since we don’t generally do the corresponding graphical residual analyses for them. But I have given them some consideration, and it is good to understand them better.
Very nice posting. Thank you.
I have a question which betrays my ignorance here. Assuming that the R^2s in the VIFs were very meaningful, not finding problems in the scatterplots, I’m interested in the inflation which could cause a sign to flip. So in the example you gave, I see that the standard error of the coefficient that you picked was estimated to be 0.00409. The VIF was 14.925. Does that mean that the effective standard error is really
SQRT(((0.00409)^2)(14.925)) = 0.0158, if the R^2 is meaningful, or am I looking at this incorrectly?
Hope that isn’t too silly of a question, but it is the question I have since I get a little lost.
Anyway, thank you for the posting.
Be safe and have a Happy Holiday Season.
Cheers – Jim Knaub