Variance Inflation Factors (VIFs)

By Jim Frost

Variance Inflation Factors (VIFs) measure the correlation among independent variables in least squares regression models. Statisticians refer to this type of correlation as multicollinearity. Excessive multicollinearity can cause problems for regression models.

In this post, I focus on VIFs: how they detect multicollinearity, why they’re better than pairwise correlations, how to calculate them yourself, and how to interpret them. If you need a refresher about the types of problems that multicollinearity causes and how to fix them, read my post: Multicollinearity: Problems, Detection, and Solutions.

Why Use VIFs Rather Than Pairwise Correlations?

Multicollinearity is correlation amongst the independent variables. Consequently, it seems logical to assess the pairwise correlation between all independent variables (IVs) in the model. That is one possible method. However, imagine a scenario where you have four IVs, and the pairwise correlations between each pair are not high, say around 0.6. No problem, right?

Unfortunately, you might still have problematic levels of collinearity. While the correlations between IV pairs are not exceptionally high, it’s possible that three IVs together could explain a very high proportion of the variance in the fourth IV.

Hmmm, using multiple variables to explain variability in another variable sounds like multiple regression analysis. And that’s the method that VIFs use!
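To see this concretely, here’s a minimal Python simulation sketch (the variables and numbers are invented purely for illustration): no pairwise correlation rises much above 0.6, yet the first three IVs jointly explain roughly 90% of the variance in the fourth.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
x1, x2, x3 = rng.normal(size=(3, n))
# x4 is nearly a linear combination of the other three, plus a little noise.
x4 = (x1 + x2 + x3) / np.sqrt(3) + rng.normal(scale=0.3, size=n)

X = np.column_stack([x1, x2, x3, x4])
print(np.corrcoef(X, rowvar=False).round(2))  # no pairwise r much above ~0.55

# Yet x1, x2, and x3 together explain most of x4's variance:
A = np.column_stack([np.ones(n), x1, x2, x3])
beta, *_ = np.linalg.lstsq(A, x4, rcond=None)
resid = x4 - A @ beta
r2 = 1 - (resid @ resid) / ((x4 - x4.mean()) ** 2).sum()
print(round(r2, 3))  # roughly 0.92, far higher than any pairwise correlation suggests
```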

Related post: Interpreting Correlation Coefficients

Calculating Variance Inflation Factors

VIFs use multiple regression to calculate the degree of multicollinearity. Imagine you have four independent variables: X1, X2, X3, and X4. Of course, the model has a dependent variable (Y), but we don’t need to worry about it for our purposes. When your statistical software calculates VIFs, it uses multiple regression to regress each IV on all of the other IVs. It repeats this process for every IV, as shown below:

  • X1 ⇐ X2, X3, X4
  • X2 ⇐ X1, X3, X4
  • X3 ⇐ X1, X2, X4
  • X4 ⇐ X1, X2, X3

To calculate the VIFs, each independent variable takes a turn as the dependent variable. Each model produces an R-squared value indicating the percentage of the variance in that IV that the other IVs explain. Consequently, higher R-squared values indicate higher degrees of multicollinearity. VIF calculations use these R-squared values. The VIF for an independent variable equals the following:

VIFᵢ = 1 / (1 − Rᵢ²)

where the subscript i indicates the independent variable. There is a VIF for each IV.

When R-squared equals zero, there is no multicollinearity because the other IVs do not explain any of the variability in the remaining IV. Take a look at the equation and notice that when R-squared equals 0, both the numerator and denominator equal 1, producing a VIF of 1. This is the lowest possible VIF, and it indicates absolutely no multicollinearity.

As R-squared increases, the denominator decreases, causing the VIFs to increase. In other words, as the set of IVs explains more of the variance in the individual IV, it indicates higher multicollinearity and the VIFs increase from 1.
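Here’s a minimal Python sketch of that calculation (the vif function and the simulated data are my own illustration, not from a particular package). It regresses each IV on the others, takes the R-squared, and applies the formula above:

```python
import numpy as np

def vif(X):
    """Return the VIF for each column of X (n_samples x n_predictors)."""
    n, k = X.shape
    result = np.empty(k)
    for i in range(k):
        y = X[:, i]                                # this IV plays the DV role
        others = np.delete(X, i, axis=1)           # the remaining IVs
        A = np.column_stack([np.ones(n), others])  # add an intercept column
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        result[i] = 1 / (1 - r2)                   # VIF_i = 1 / (1 - R_i^2)
    return result

# Quick check: two deliberately correlated predictors and one independent one.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.5, size=500)  # strongly related to x1
x3 = rng.normal(size=500)                  # unrelated to the others
print(vif(np.column_stack([x1, x2, x3])).round(2))  # x3's VIF stays near 1
```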

Related post: How to Interpret R-squared

How to Interpret VIFs

[Photo of a tire pressure gauge. VIFs check your inflation!]

From the above, we know that a VIF of 1 represents no multicollinearity, and higher values indicate more multicollinearity is present. What do these values actually mean?

The name “variance inflation factor” gives it away. VIFs represent the factor by which the correlations amongst the predictors inflate the variance. For example, a VIF of 4 indicates that multicollinearity inflates the variance by a factor of 4 compared to a model with no multicollinearity.

The next logical questions are: What variance are we talking about? And why should I care?

The variances in question are the variances of the coefficient estimates (the squares of their standard errors), which relate to the precision of these estimates. Confidence interval calculations use these standard errors to determine the interval widths for the coefficients. Larger standard errors produce wider confidence intervals, which indicates that the coefficient estimates are less precise. Lower precision makes it more difficult to obtain statistically significant results. Additionally, wide CIs are more likely to cross zero, explaining why multicollinearity can cause coefficient signs to flip. In short, larger standard errors produce imprecise, unstable coefficient estimates and reduce statistical power.

These are the classic effects of multicollinearity.
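You can also translate a VIF directly into its effect on the standard error: because a standard error is the square root of a variance, multicollinearity inflates a coefficient’s standard error by a factor of √VIF. A VIF of 4 therefore doubles the standard error (√4 = 2), and a VIF of 14.93, like the one in the example below, multiplies it by nearly 4 (√14.93 ≈ 3.86).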

How to Calculate VIFs

In my blog post about multicollinearity, I use regression analysis to model the relationship between the independent variables (physical activity, body fat percentage, weight, and the interaction between weight and body fat) and the dependent variable (bone mineral density of the femoral neck). Below is the portion of the statistical output that contains the VIFs. We’ll focus on %Fat, which has a VIF of 14.93.

[Regression analysis output with VIFs.]

Now, let’s calculate the VIF for the %Fat independent variable ourselves. To do that, I’ll fit a regression model where %Fat is now the dependent variable and include the remaining independent variables and the interaction term as IVs in this new model (physical activity, weight, and the Weight*Fat interaction). To try it yourself, download this CSV dataset: MulticollinearityVIF.

Below is the statistical output for this model. You can see that the other variables are at or near statistical significance.

[Regression model output for calculating the VIF.]

Collectively, the model explains 93.3% of the variance in %Fat, which is quite high. For calculating the VIF, we’ll need to use this R-squared value, as shown below.

VIF = 1 / (1 − 0.933) = 1 / 0.067 ≈ 14.93

That rounds to the value in the original statistical output!
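If you’d rather have software run the auxiliary regression for you, statsmodels can compute VIFs directly. Here’s a sketch; the column names (Activity, Fat, Weight) are my assumptions, so match them to the actual headers in the downloaded CSV:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("MulticollinearityVIF.csv")
df["WeightFat"] = df["Weight"] * df["Fat"]  # the Weight*%Fat interaction term

# Build the design matrix with an intercept, mirroring the original model.
X = sm.add_constant(df[["Activity", "Fat", "Weight", "WeightFat"]])
for i, name in enumerate(X.columns):
    if name == "const":
        continue  # the constant has no variance, so it gets no VIF
    print(f"{name}: {variance_inflation_factor(X.values, i):.2f}")
```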

How High is Too High?

A moderate amount of multicollinearity is okay. However, as multicollinearity increases, its effects also increase. What VIF values represent too much correlation?

I’ve seen different thresholds. I consider VIFs of five and higher to represent problematic amounts of multicollinearity. However, I’ve seen recommendations range from 2.5 to 10! I wish I could give you a concrete answer. However, I’ve long used five as a threshold, and that seems to work well.

Another consideration is that each coefficient estimate has its own standard error with its corresponding VIF. Consequently, you should evaluate each independent variable separately for multicollinearity. Even when some IVs in your model have high VIFs, other variables might have low VIFs and be absolutely fine. When extreme multicollinearity is present in some variables, it does not affect IVs with low VIFs.
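If you automate this check, keep that per-variable framing: compare each IV’s VIF against your chosen cutoff instead of issuing a single verdict for the whole model. A tiny sketch with placeholder values (not real output):

```python
# Placeholder VIFs for illustration; substitute your model's actual values.
vifs_by_iv = {"x1": 1.2, "x2": 6.8, "x3": 1.1}
CUTOFF = 5  # the threshold this post recommends; published advice ranges from 2.5 to 10

for name, value in vifs_by_iv.items():
    verdict = "investigate" if value >= CUTOFF else "fine"
    print(f"{name}: VIF = {value} -> {verdict}")
```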


Comments

  1. Benedict says

    October 23, 2022 at 5:26 am

    Hey Jim,
    This is a very concise but insightful post. Thank you!
From this post, the VIF must not be less than 1.
I’ve experienced a case in which the VIF of the constant was zero.
Could there have been something wrong?
    Looking forward to your reply.

    Thanks again for posting.

    Ben.

    • Jim Frost says

      October 23, 2022 at 4:19 pm

      Hi Benedict,

      In order for a correlation to exist, you must have variability in two variables. By definition, the constant is not a variable. Hence, correlation is not possible. So, having a zero VIF for the constant doesn’t indicate anything is wrong. The only surprising aspect is that statistical software usually won’t even calculate a VIF for the constant because it’s not needed!

  2. Isaac says

    June 13, 2022 at 6:19 am

Thank you, Jim. Must I test for multicollinearity of IVs when using multinomial logistic regression?

  3. Jagdish says

    April 20, 2022 at 2:15 pm

Does it make sense to calculate VIFs when using logistic regression? Is it possible to calculate VIFs for categorical (i.e., non-numeric) variables? Based on your article above, I believe the answer to both of these questions is no. But I just wanted to be sure.

  4. Anu says

    February 4, 2022 at 5:25 am

    Hey Jim!
I’m seeing a scenario: due to the difficulty of obtaining the R2 values from BQML for a multiple linear regression model, I’m trying to create the auxiliary regressions, get the R2 for each scenario (excluding one IV at a time), and then calculate VIF = 1/(1-R2).

    But, I obtained a negative R2 actually for 1 scenario.

In this case, if I use the R2 value directly, the denominator becomes (1 − (−0.00414)) = 1.00414, and
1/(1.00414) = 0.995877069, which is less than 1 and shouldn’t be the case, right?

Could you please clarify whether the modulus function (|x|) has to be used for R2?
    i.e., VIF = 1 / (1 – |R2| )

    Thanks in advance! (Stuck on a deadline – a quick reply might save the day for me)

    • Jim Frost says

      February 4, 2022 at 5:14 pm

      Hi Anu,

I think I see what you’re trying to do. I don’t know why you’re obtaining strange R2 values. Yes, the formula for VIF is correct. However, one possibility for the strange values is that the underlying model is nonlinear. Please note the technical definition of nonlinear because it is not synonymous with curved.

R-squared is not valid for nonlinear models, and you can obtain values outside the acceptable range. For more information, read my post about why R-squared is invalid for nonlinear models.

  5. Dakshin says

    April 22, 2021 at 4:34 am

    Your blogs are concise and easy to understand.
Thank you, Jim.

  6. Philip Thorpe says

    March 20, 2021 at 10:23 am

    Hi Jim

Great post on VIFs, thank you. You said that you use 5 as a VIF threshold; do you have a link to one of your papers where you have demonstrated this? I wish to cite it as part of a study on Orthoptera distribution in the UK.

    Thank you

    Phil

    • Jim Frost says

      March 22, 2021 at 12:22 am

      Hi Philip,

Typically, statisticians will say that it becomes problematic somewhere in the 5-10 range. Some statisticians will adjust the value based on the number of IVs in the model. Personally, I use 5 based on practical experience.

      The reference below for Fox (2016) suggests a cutoff value of four (IIRC). At this value, precision is cut in half. However, there’s no magic dividing line where on one side there is no reduction of precision and on the other there is. It’s a question of how much loss of precision is acceptable, which leads to the differences in opinion for a good cutoff value.

      Fox, John. 2016. Applied Regression Analysis and Generalized Linear Models. 3rd ed. Los Angeles: Sage Publications.

  7. RAKESH MONDAL says

    March 19, 2021 at 2:19 am

Each of your articles related to statistics is highly understandable, and so is this one on VIFs.
Amazing!

    • Jim Frost says

      March 19, 2021 at 3:00 pm

      Thanks so much, Rakesh!

  8. Tony says

    March 7, 2021 at 3:49 pm

    Hi Jim,

I am trying to figure out how a regression model can be used to predict whether a customer will default on payment terms. I know that regression requires independent and dependent variables. I am not sure how to set up the data. Could you please provide some insight? Also, would I be able to use the regression in the Data Analysis ToolPak in Excel?

    Thanks,

    Tony

    • Jim Frost says

      March 9, 2021 at 7:36 pm

      Hi Tony,

That sounds like a binary logistic regression model to me. For those models, the dependent variable is binary, such as default yes or no. You include independent variables that allow you to predict the probability of the DV. In your case, you could use it to predict the probability that a customer will default based on their IV values.

      To set up the data, you’ll need to have each variable in a column. Each row represents the values for a single customer. Depending on your software, the DV will need to be either 1s and 0s, or possibly you can use text values, which the software will interpret as a binary variable. For Excel, you’d need to code it using 1s for defaults and 0s for no default.

You cannot perform binary logistic regression using the Regression option in the Data Analysis ToolPak. However, I gather that Excel can perform binary logistic regression using the Solver add-in. Check the Data tab near the Analysis option and look for Solver. You’ll need to do some data preprocessing first. I’m thinking about writing a blog post about that some day, but right now I don’t have all the details.

      To learn more about this type of regression analysis, read my post where I use binary logistic regression to analyze data. It’ll give you an idea of its capabilities.

      I hope that helps!

  9. Uri Gottlieb says

    February 5, 2021 at 5:38 am

    Thank you! Very clear and straightforward, and helped me understand both the intuition and the calculations.
It would be amazing if you could attach relevant R code to future posts.

    All the best,
    Uri,

  10. WQ says

    December 11, 2020 at 2:39 pm

    This is a wonderful explanation! Very easy to understand.

  11. Joseph Lombardi says

    December 8, 2020 at 9:33 am

    Great post. I actually did this very thing a few months ago. Doubtless it was another post of yours that inspired me to do VIF analysis, b/c I would not have thought of it on my own.

I have a model with six continuous IVs. Two of the coefficients were negative, which in the context of my data makes NO SENSE whatsoever. So I regressed each of the IVs against the others. Hoo-BOY! Lotsa collinearity. Now I use fewer IVs and have a stronger model.

  12. Charles Wheelus says

    December 8, 2020 at 9:18 am

    Excellent post and very timely for a project I am working on. Thanks Jim and keep up the great work!!

    -Charles Wheelus

  13. Muzi Dlamini says

    December 7, 2020 at 3:45 am

Thank you, Jim, it’s always a pleasure to read your posts.

    • Jim Frost says

      December 7, 2020 at 11:20 pm

      Thank you so much, Muzi!

  14. Nathan Jones says

    December 6, 2020 at 7:44 pm

Great explanation using easy-to-understand terminology.

    • Jim Frost says

      December 7, 2020 at 11:20 pm

      Thanks, Nathan!

  15. Jim Knaub says

    December 6, 2020 at 7:22 pm

    Jim Frost –

    This posting on VIFs is great! I’ve only thought of them superficially until now, and have never before seen such a thorough and clear explanation. Explaining it in terms of R^2 with a link to another posting on that is nice. It helps show why VIFs are a bit nebulous, since R^2s are subject to interpretation themselves. I like graphical residual analyses and think one may somewhat ignore R^2 values when it is so easy to see the graphics, and then one can consider cross-validation. I see here that VIFs could possibly be very misleading since we don’t generally do the corresponding graphical residual analyses for them. But I have given them some consideration, and it is good to understand them better.

    Very nice posting. Thank you.

    I have a question which betrays my ignorance here. Assuming that the R^2s in the VIFs were very meaningful, not finding problems in the scatterplots, I’m interested in the inflation which could cause a sign to flip. So in the example you gave, I see that the standard error of the coefficient that you picked was estimated to be 0.00409. The VIF was 14.925. Does that mean that the effective standard error is really
    SQRT(((0.00409)^2)(14.925)) = 0.0158, if the R^2 is meaningful, or am I looking at this incorrectly?

    Hope that isn’t too silly of a question, but it is the question I have since I get a little lost.
    🙂

    Anyway, thank you for the posting.

    Be safe and have a Happy Holiday Season.

    Cheers – Jim Knaub

