Multicollinearity occurs when independent variables in a regression model are correlated. This correlation is a problem because independent variables should be *independent*. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.

In this blog post, I’ll highlight the problems that multicollinearity can cause, show you how to test your model for it, and highlight some ways to resolve it. In some cases, multicollinearity isn’t necessarily a problem, and I’ll show you how to make this determination. I’ll work through an example dataset which contains multicollinearity to bring it all to life!

## Why is Multicollinearity a Potential Problem?

A key goal of regression analysis is to isolate the relationship between each independent variable and the dependent variable. The interpretation of a regression coefficient is that it represents the mean change in the dependent variable for each 1 unit change in an independent variable when you *hold all of the other independent variables constant*. That last portion is crucial for our discussion about multicollinearity.

The idea is that you can change the value of one independent variable and not the others. However, when independent variables are correlated, it indicates that changes in one variable are associated with shifts in another variable. The stronger the correlation, the more difficult it is to change one variable without changing another. It becomes difficult for the model to estimate the relationship between each independent variable and the dependent variable *independently* because the independent variables tend to change in unison.

There are two basic kinds of multicollinearity:

- **Structural multicollinearity**: This type occurs when we create a model term using other terms. In other words, it’s a byproduct of the model that we specify rather than being present in the data itself. For example, if you square term X to model curvature, clearly there is a correlation between X and X².
- **Data multicollinearity**: This type of multicollinearity is present in the data itself rather than being an artifact of our model. Observational studies are more likely to exhibit this kind of multicollinearity.

## What Problems Does Multicollinearity Cause?

Multicollinearity causes the following two basic types of problems:

- The coefficient estimates can swing wildly based on which other independent variables are in the model. The coefficients become very sensitive to small changes in the model.
- Multicollinearity reduces the precision of the estimated coefficients, which weakens the statistical power of your regression model. You might not be able to trust the p-values to identify independent variables that are statistically significant.

Imagine you fit a regression model and the coefficient values, and even the signs, change dramatically depending on the specific variables that you include in the model. It’s a disconcerting feeling when slightly different models lead to very different conclusions. You don’t feel like you know the actual effect of each variable!

Now, throw in the fact that you can’t necessarily trust the p-values to select the independent variables to include in the model. This problem makes it difficult both to specify the correct model and to justify the model if many of your p-values are not statistically significant.

As the severity of the multicollinearity increases so do these problematic effects. However, these issues affect only those independent variables that are correlated. You can have a model with severe multicollinearity and yet some variables in the model can be completely unaffected.

The regression example with multicollinearity that I work through later on illustrates these problems in action.

## Do I Have to Fix Multicollinearity?

Multicollinearity makes it hard to interpret your coefficients, and it reduces the power of your model to identify independent variables that are statistically significant. These are definitely serious problems. However, the good news is that you don’t always have to find a way to fix multicollinearity.

The need to reduce multicollinearity depends on its severity and your primary goal for your regression model. Keep the following three points in mind:

- The severity of the problems increases with the degree of the multicollinearity. Therefore, if you have only moderate multicollinearity, you may not need to resolve it.
- Multicollinearity affects only the specific independent variables that are correlated. Therefore, if multicollinearity is not present for the independent variables that you are particularly interested in, you may not need to resolve it. Suppose your model contains the experimental variables of interest and some control variables. If high multicollinearity exists for the control variables but not the experimental variables, then you can interpret the experimental variables without problems.
- Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, the precision of the predictions, or the goodness-of-fit statistics. If your primary goal is to make predictions, and you don’t need to understand the role of each independent variable, you don’t need to reduce severe multicollinearity.

Over the years, I’ve found that many people are incredulous over the third point, so here’s a reference!

“The fact that some or all predictor variables are correlated among themselves does not, in general, inhibit our ability to obtain a good fit nor does it tend to affect inferences about mean responses or predictions of new observations.” (Applied Linear Statistical Models, 4th Edition, p. 289)

## Testing for Multicollinearity with Variance Inflation Factors (VIF)

If you can identify which variables are affected by multicollinearity and the strength of the correlation, you’re well on your way to determining whether you need to fix it. Fortunately, there is a very simple test to assess multicollinearity in your regression model. The variance inflation factor (VIF) identifies correlation between independent variables and the strength of that correlation.

Statistical software calculates a VIF for each independent variable. VIFs start at 1 and have no upper limit. A value of 1 indicates that there is no correlation between this independent variable and any others. VIFs between 1 and 5 suggest that there is a moderate correlation, but it is not severe enough to warrant corrective measures. VIFs greater than 5 represent critical levels of multicollinearity where the coefficients are poorly estimated, and the p-values are questionable.

Use VIFs to identify correlations between variables and determine the strength of the relationships. Most statistical software can display VIFs for you. Assessing VIFs is particularly important for observational studies because these studies are more prone to having multicollinearity.
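If your software doesn’t report VIFs, you can compute them yourself. The VIF for a term equals 1 / (1 − R²), where R² comes from regressing that term on all of the other independent variables. Here is a minimal sketch in Python with NumPy. The function name `vif` and the synthetic variables `x1`, `x2`, and `x3` are my own illustrations, not part of the example dataset used later in this post.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (rows = observations, columns = predictors)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]                                  # treat predictor j as the response
        others = np.delete(X, j, axis=1)             # all remaining predictors
        A = np.column_stack([np.ones(n), others])    # add an intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Synthetic illustration (assumed data): x2 is built from x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.3 * rng.normal(size=500)
x3 = rng.normal(size=500)

vif_values = vif(np.column_stack([x1, x2, x3]))
print(vif_values.round(2))
```

In this made-up data, the two correlated predictors should show VIFs well above 5, while the unrelated predictor should stay close to 1.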

## Multicollinearity Example: Predicting Bone Density in the Femur

This regression example uses a subset of variables that I collected for an experiment. In this example, I’ll show you how to detect multicollinearity as well as illustrate its effects. I’ll also show you how to remove structural multicollinearity. You can download the CSV data file: MulticollinearityExample.

I’ll use regression analysis to model the relationship between the independent variables (physical activity, body fat percentage, weight, and the interaction between weight and body fat) and the dependent variable (bone mineral density of the femoral neck).

Here are the regression results:

These results show that Weight, Activity, and the interaction between body fat and weight are statistically significant. The percent body fat is not statistically significant. However, the VIFs indicate that our model has severe multicollinearity for some of the independent variables.

Notice that Activity has a VIF near 1, which shows that multicollinearity does not affect it and we can trust this coefficient and p-value with no further action. However, the coefficients and p-values for the other terms are suspect!

Additionally, at least some of the multicollinearity in our model is the structural type. We’ve included the interaction term of body fat * weight. Clearly, there is a correlation between the interaction term and both of the main effect terms. The VIFs reflect these relationships.

I have a neat trick to show you. There’s a method to remove this type of structural multicollinearity quickly and easily!

## Center the Independent Variables to Reduce Structural Multicollinearity

In our model, the interaction term is at least partially responsible for the high VIFs. Both higher-order terms and interaction terms produce multicollinearity because these terms include the main effects. Centering the variables is a simple way to reduce structural multicollinearity.

Centering the variables is also known as standardizing the variables by subtracting the mean. This process involves calculating the mean for each continuous independent variable and then subtracting the mean from all observed values of that variable. Then, use these centered variables in your model. Most statistical software provides the feature of fitting your model using standardized variables.

There are other standardization methods, but the advantage of just subtracting the mean is that the interpretation of the coefficients remains the same. The coefficients continue to represent the mean change in the dependent variable given a 1 unit change in the independent variable.
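To see why centering helps, here is a small sketch in Python with NumPy. It uses synthetic stand-ins for body fat and weight (the means, spreads, and sample size are assumptions I made up, not the values in the example dataset) and compares how strongly a main effect correlates with the interaction term before and after subtracting the means.

```python
import numpy as np

rng = np.random.default_rng(1)
fat = rng.normal(30, 5, size=2000)      # hypothetical %fat values
weight = rng.normal(70, 10, size=2000)  # hypothetical weights in kg

def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

# Uncentered: the interaction term is built from large positive values,
# so it rises and falls with each main effect.
raw_corr = corr(fat, fat * weight)

# Centered: subtract each variable's mean before forming the interaction.
fat_c = fat - fat.mean()
weight_c = weight - weight.mean()
centered_corr = corr(fat_c, fat_c * weight_c)

print(f"raw: {raw_corr:.2f}, centered: {centered_corr:.2f}")
```

With variables that sit far from zero, the uncentered interaction is strongly correlated with its main effect. After centering, that correlation should drop to near zero in this synthetic setup.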

In the worksheet, I’ve included the centered independent variables in the columns with an S added to the variable names.

For more about this, read my post about standardizing your continuous independent variables.

## Regression with Centered Variables

Let’s fit the same model but using the centered independent variables.

The most apparent difference is that the VIFs are all down to satisfactory values; they’re all less than 5. By removing the structural multicollinearity, we can see that there is some multicollinearity in our data, but it is not severe enough to warrant further corrective measures.

Removing the structural multicollinearity produced other notable differences in the output that we’ll investigate.

## Comparing Regression Models to Reveal Multicollinearity Effects

We can compare two versions of the same model, one with high multicollinearity and one without it. This comparison highlights its effects.

The first independent variable we’ll look at is Activity. This variable was the only one to have almost no multicollinearity in the first model. Compare the Activity coefficients and p-values between the two models and you’ll see that they are the same (coefficient = 0.000022, p-value = 0.003). This illustrates how multicollinearity affects only the variables that are highly correlated.

Let’s look at the variables that had high VIFs in the first model. The standard error of the coefficient measures the precision of the estimates. Lower values indicate more precise estimates. The standard errors in the second model are lower for both %Fat and Weight. Additionally, %Fat is significant in the second model even though it wasn’t in the first model. Not only that, but the sign for %Fat has changed from positive to negative!

The lower precision, switched signs, and a lack of statistical significance are typical problems associated with multicollinearity.

Now, take a look at the Summary of Model tables for both models. You’ll notice that the standard error of the regression (S), R-squared, adjusted R-squared, and predicted R-squared are all identical. As I mentioned earlier, multicollinearity doesn’t affect the predictions or goodness-of-fit. If you just want to make predictions, the model with severe multicollinearity is just as good!
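You can check this behavior yourself. The sketch below, in Python with NumPy, fits the same interaction model to raw and centered versions of synthetic data. The data-generating values and the helper `fit` are assumptions for illustration only. Because the centered terms are linear combinations of the raw terms plus the constant, both models produce the same fitted values and the same R-squared, even though the main-effect coefficients can differ.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
fat = rng.normal(30, 5, size=n)       # hypothetical %fat values
weight = rng.normal(70, 10, size=n)   # hypothetical weights in kg
y = (0.5 + 0.01 * weight - 0.005 * fat
     + 0.0002 * fat * weight + rng.normal(0, 0.05, size=n))

def fit(x1, x2, y):
    """OLS fit of y on x1, x2, and x1*x2; returns coefficients, fitted values, R-squared."""
    A = np.column_stack([np.ones(len(y)), x1, x2, x1 * x2])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ coef
    r2 = 1 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return coef, fitted, r2

coef_raw, fitted_raw, r2_raw = fit(fat, weight, y)
coef_cen, fitted_cen, r2_cen = fit(fat - fat.mean(), weight - weight.mean(), y)

print("R-squared (raw, centered):", round(r2_raw, 6), round(r2_cen, 6))
print("main-effect coefficients, raw:     ", coef_raw[1:3])
print("main-effect coefficients, centered:", coef_cen[1:3])
```

The interaction coefficient is also the same in both fits. Only the main effects change, because in the centered model each main effect is evaluated at the mean of the other variable rather than at zero.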

## How to Deal with Multicollinearity

I showed how there are a variety of situations where you don’t need to deal with it. The multicollinearity might not be severe, it might not affect the variables you’re most interested in, or maybe you just need to make predictions. Or, perhaps it’s just structural multicollinearity that you can get rid of by centering the variables.

But, what if you have severe multicollinearity in your data and you find that you must deal with it? What do you do then? Unfortunately, this situation can be difficult to resolve. There are a variety of methods that you can try, but each one has some drawbacks. You’ll need to use your subject-area knowledge and factor in the goals of your study to pick the solution that provides the best mix of advantages and disadvantages.

The potential solutions include the following:

- Remove some of the highly correlated independent variables.
- Linearly combine the independent variables, such as adding them together.
- Perform an analysis designed for highly correlated variables, such as principal components analysis or partial least squares regression.

As you consider a solution, remember that all of these have downsides. If you can accept less precise coefficients, or a regression model with a high R-squared but hardly any statistically significant variables, then not doing anything about the multicollinearity might be the best solution.

Do you have experience dealing with multicollinearity?


Anastasia says

Hi Jim!

Thank you for this post, it is definitely very helpful. I see a few people asked about detecting collinearity for categorical variables, and the suggested solution is the chi-square test. My biggest problem is that when I calculated VIFs for my linear mixed-effects model with up to 4-way interactions, some interactions of categorical variables correspond to ginormous VIFs. The interaction term seems to be collinear with one of the main effect terms included in this interaction. Those are essential interactions for my hypothesis, so I’m not sure what to do about it (sort of gave up on this for now and switched to random forests…). Am I right to evaluate the collinearity in my model using the VIF? Or can/should I use the chi-square test for that?

Thanks a lot,

Anastasia

Jim Frost says

Hi Anastasia,

Because you’re referring to VIFs, I’ll assume you have continuous interactions. (VIFs are calculated only for continuous variables. Chi-squared is only for categorical variables.) Whenever you include interaction terms, you’ll produce high VIFs because the main effects that comprise the interaction terms are, of course, highly correlated with the interaction terms. Fortunately, there is a very simple solution for that, which I illustrate in this blog post. Simply center all of your continuous independent variables and fit the model using the centered variables. That should cut down those VIFs by quite a bit. Often centering will get the VIFs down to manageable levels. So, reread that section of this post!

I hardly ever see models with three-way interactions, much less four-way interactions. Be sure that they’re actually improving the model. If the improvement is small, consider removing them. Higher-ordered interaction terms like those are very difficult to interpret. Of course, your model might be much better with those terms in it. But, at least check because it is fairly unusual.

I hope this helps!

JP says

Hi Jim, thanks for the great article. I am working on a multiple regression model where the VIF values are less than 2 for all the variables. One variable in the model has a coefficient sign opposite to what theory predicts, even though its VIF is less than 2. When I checked that variable alone against the DV, it showed the expected sign, but it was not statistically significant on its own. Should I look into interactions with that variable?

Jim Frost says

Hi JP,

With VIFs that small, it’s unlikely that multicollinearity is flipping the signs of the coefficients.

Several possibilities come to mind that can flip signs. The two most common are issues that I’ve written about: Omitted variable bias and incorrectly modeling curvature that is present. For a quick check on the curvature issue, check the residual plots! And, yes, it’s possible that not including an interaction term that should be in the model can cause this problem. All of these issues are forms of model specification error (i.e., your model is incorrect and giving you biased results). I’d check those things first.

I’m assuming that the independent variable that has the incorrect sign is statistically significant?

Zara says

Hello! I’m doing a regression analysis where I have 2 measurements of the same concept. Since the measurements are very correlated, I was thinking of creating a composite variable by averaging their z-scores. Is this a good step, and would the interpretation of the regression change if I’m using standardized scores?

Jim Frost says

Hi Zara,

The approach you describe is definitely an acceptable one. First, I’d fit the model with both measures and check the VIFs as I describe in this post. Be sure that’s actually a problem. The two variables have to be very highly correlated to cause problems.

If it is a problem, I’d also compare the goodness-of-fit for models with both measurements versus models with just one of the measurements. If they are very highly correlated, you might not lose much explanatory power by simply removing one because they supply highly redundant information to begin with. If the change in goodness-of-fit is small, you can consider including only one measure. This approach gives you the ability to estimate the effect of one measurement.

However, if you do decide to combine the measurements as you describe, it does change the interpretation. You’re standardizing the measurements. For standardized data, the coefficients represent the mean change in the dependent variable given a change of one standard deviation in the independent variable. That’s still a fairly intuitive interpretation. However, you’d be averaging the z-scores, so it’s not quite that straightforward. Instead, the interpretation will be the mean change in the DV given an average change of one standard deviation across those two measurements. You’ll lose the ability to link the change in a specific IV to the DV. However, it’s possible you’ll gain more explanatory power. Before settling on this approach, I’d check to be sure that it actually improves the fit of the model.

It’s possibly a good approach, depending on whether the VIFs are problematic, the loss of explanatory power from using just one measurement, and the gains in explanatory power from averaging the two.

Best of luck with your analysis!

Rui Fang says

Hi Jim,

Can I use VIF to test for multicollinearity between categorical independent variables?

Jim Frost says

Hi Rui,

VIFs only work for continuous IVs. I think you’d need to use chi-squared test of independence with categorical variables. It is harder to determine with categorical variables.

Nhat T.Tran says

Hi Jim,

Thank you for your reply. If it would not take up too much of your time, I would like to ask you one more question.

As you mentioned in the example, using standardized variables does not influence the predictions, so my primary goal is to investigate relationships of independent variables with a dependent variable.

In your example, after using the centered independent variables, the sign for %Fat has changed from positive to negative. Using subject-area knowledge, if %Fat has a negative relationship with the bone mineral density of the femoral neck, we may pick the standardized solution. On the other hand, if %Fat has a positive relationship with the bone mineral density of the femoral neck, we may choose the uncoded model.

I wonder whether the above solutions are appropriate or not. Therefore, I would appreciate any assistance or comments you could give me. Thank you again for your time and consideration.

Best regards,

Nhat Tran.

Jim Frost says

Hi Nhat,

Unfortunately, because you’re interested in understanding the relationships between the variables, this is not a case where you can choose between these two models based on theory. A model with excessive multicollinearity is one that has a problem that specifically obscures the relationships between the variables. Consequently, you don’t want to choose the model with multicollinearity because it happens to agree with your theoretical notions. That’s probably just a chance agreement!

What you really need to do is find a method that both resolves the multicollinearity and estimates relationships that match theory (or at least you come up with an explanation for why it does not). If I wasn’t able to use centering to reduce the multicollinearity, I’d probably need to use something like Ridge or LASSO regression to accomplish this task.

If the model with acceptable multicollinearity produces estimates that don’t match theory, consider the possibility that you’re specifying the incorrect model. That would be the next issue I’d look into. But, don’t go with the model that has problematic multicollinearity just because it happens to agree with your expectations.

Nhat T.Tran says

Hi Jim,

Thank you so much for creating a great blog for statistics.

In the ANOVA result of your example, you used the adjusted sums of squares (Adj SS). I also ran the same test on your data in Minitab using sequential sums of squares (Seq SS). However, the two results differ, as follows.

Analysis of Variance

| Source | DF | Adj SS | Adj MS | F-Value | P-Value |
|---|---|---|---|---|---|
| Regression | 4 | 0.555785 | 0.138946 | 27.95 | 0.000 |
| %Fat | 1 | 0.009240 | 0.009240 | 1.86 | 0.176 |
| Weight kg | 1 | 0.127942 | 0.127942 | 25.73 | 0.000 |
| Activity | 1 | 0.047027 | 0.047027 | 9.46 | 0.003 |
| %Fat*Weight kg | 1 | 0.041745 | 0.041745 | 8.40 | 0.005 |
| Error | 87 | 0.432557 | 0.004972 | | |
| Total | 91 | 0.988342 | | | |

Analysis of Variance

| Source | DF | Seq SS | Seq MS | F-Value | P-Value |
|---|---|---|---|---|---|
| Regression | 4 | 0.55578 | 0.138946 | 27.95 | 0.000 |
| %Fat | 1 | 0.20514 | 0.205137 | 41.26 | 0.000 |
| Weight kg | 1 | 0.24506 | 0.245059 | 49.29 | 0.000 |
| Activity | 1 | 0.06384 | 0.063843 | 12.84 | 0.001 |
| %Fat*Weight kg | 1 | 0.04175 | 0.041745 | 8.40 | 0.005 |
| Error | 87 | 0.43256 | 0.004972 | | |
| Total | 91 | 0.98834 | | | |

The ANOVA results using Seq SS show a statistically significant effect of %Fat on the Femoral Neck (P value = 0.000), while the ANOVA results using Adj SS indicate that there is not a statistically significant effect of %Fat on the Femoral Neck (P value = 0.176).

Therefore, I wonder when we should use “sequential sums of squares” and when we should use “adjusted sums of squares”.

If you could answer both of my questions, I would be most grateful.

Best regards,

Nhat Tran.

Jim Frost says

Hi Nhat,

Quick definitions first.

Adjusted sums of squares: Calculates the reduction in the error sum of squares for each variable based on a model that already includes all of the other variables in the model. The procedure adds each IV to a model that already contains all of the other IVs, and determines how much variance it accounts for.

Sequential sums of squares: Calculates this reduction by entering the variables in the specific order that they are listed. The procedure enters the first variable first, then the second variable, third, and so on.

The analysis uses these sums of squares to calculate F-values and t-values, which in turn determine the p-values. So, it’s not surprising that changing the sums of squares affects the p-values.

The standard choice you almost always want to use is the Adjusted Sums of Squares (Type III). Using this method, the model can determine the unique portion of variance that each variable explains because it calculates the variance reduction for each variable when it is entered last. This type of SS is used for at least 99% of the regression models! Basically, use this type unless you know of very strong reasons to use the sequential sums of squares.

I don’t have a strong case for using sequential sums of squares. You’d need really strong theoretical reasons for why the variables need to be entered into the model in a specific order. This option is almost never used that I’m aware of.
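To make the two definitions concrete, here is a rough sketch in Python with NumPy. It is not how Minitab computes its tables. The helper names (`sse`, `sequential_ss`, `adjusted_ss`) and the synthetic data are assumptions for illustration. Each function measures how much the error sum of squares drops when a term enters the model, either in the listed order or last.

```python
import numpy as np

def sse(columns, y):
    """Error sum of squares for an OLS fit with an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def sequential_ss(columns, y):
    """Reduction in SSE as each term is entered in the listed order."""
    return [sse(columns[:k], y) - sse(columns[:k + 1], y)
            for k in range(len(columns))]

def adjusted_ss(columns, y):
    """Reduction in SSE for each term when it is entered last."""
    full = sse(columns, y)
    return [sse(columns[:k] + columns[k + 1:], y) - full
            for k in range(len(columns))]

# Synthetic illustration: x1 and x2 are correlated, but only x2 drives y.
rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
y = 2.0 * x2 + rng.normal(size=n)

terms = [x1, x2]
seq = sequential_ss(terms, y)
adj = adjusted_ss(terms, y)
print("sequential:", np.round(seq, 2))
print("adjusted:  ", np.round(adj, 2))
```

In this setup, x1 gets a large sequential SS because it is entered first and borrows credit from the correlated x2, while its adjusted SS is small. The last term has the same value under both methods, which matches the %Fat*Weight row in your two tables.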

I hope this helps!

akshay thakar says

Thanks for such a quick response !! The chi square test for Independence can involve only 2 categorical variables at a time, so should I take multiple pairs one by one to check for multi collinearity…?? Or is there any way to do the chi square test for multiple variables..??

Jim Frost says

Hi again Akshay,

Yes! You can definitely use additional variables. In chi-squared, they’re referred to as layer variables. Although it can get a bit unwieldy when you have many, I’d try that approach. It can help show whether you potentially have a problem and where to look for it!

akshay thakar says

Hi Jim,

Your blog is amazing !!

I am trying to run a regression analysis however I am facing the issue of multi-collinearity between categorical variables , are there any tests to identify multi-collinearity between categorical variables ??

Jim Frost says

Hi Akshay, thanks for the kind words!

The chi-squared test of independence would be a good way to detect correlations between categorical variables. I cover this method in this blog post: Chi-squared Test of Independence and an Example.

Lola says

Thank you! Thank you!

Lola says

Thanks a bunch Jim!

Is that (pairwise correlation) the same as producing a correlation matrix of all the independent variables from the regression model?

I’d also like to know if and how you take classes . I’d really want to hone my statistical ‘skills’.

Jim Frost says

Hi Lola, you’re very welcome!

You could certainly do a matrix of all the variables as you describe. However, that might provide more correlations than you need.

Suppose X1 is the IV with the high VIF and that the others have low VIFs. You’d really just need the correlations of X1 with X2, X3, X4, and so on.

If you did a matrix, you get all the correlations with say X2 and X3, X2 and X4, etc. You might not need those. It doesn’t hurt to obtain extra correlations, it’s just more numbers to sort through!

Just keep in mind that only the terms in the model with high VIFs are actually affected by multicollinearity. You can have some terms with high VIFs and others with low VIFs. Multicollinearity does not affect the variables with low VIFs, and you don’t need to worry about those.

Lola says

Hi Jim, this page is absolutely brilliant. Thanks for this initiative.

However, I have a question on something a bit unclear regarding the VIF interpretation. In the first regression model above, it can be seen that ‘Fat’, ‘Weight’ and ‘Fat*Weight’ have “worrisome” VIF values, which indicate multicollinearity. And as explained above, multicollinearity involves 2 or more independent variables.

The question is: for each independent variable with a worrisome VIF value, how does one determine which of the other IVs it is highly correlated with?

Jim Frost says

Hi Lola,

That is a great question! The VIF for a specific term in the model shows the total multicollinearity across all of the other terms in the model. So, you’re right, seeing a high VIF indicates there is a problem, but it doesn’t tell you which variable(s) are the primary culprits. At that point, you should probably calculate the pairwise correlations between the independent variable in question and the other IVs. The variables with the highest correlations would be the primary offenders.
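As a rough illustration of that follow-up step, here is a sketch in Python with NumPy. The variable names X1 through X4 and the synthetic data are assumptions, with X1 playing the role of the term that has the high VIF. It ranks the other IVs by the absolute size of their correlation with X1.

```python
import numpy as np

# Synthetic illustration: X2 is built from X1, the others are unrelated.
rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.4 * rng.normal(size=n)
x3 = rng.normal(size=n)
x4 = rng.normal(size=n)

names = ["X1", "X2", "X3", "X4"]
X = np.column_stack([x1, x2, x3, x4])

corr = np.corrcoef(X, rowvar=False)  # full correlation matrix of the IVs
target = 0                           # index of the IV with the high VIF

# Keep only the correlations that involve the target, strongest first.
pairs = sorted(
    ((names[j], corr[target, j]) for j in range(len(names)) if j != target),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in pairs:
    print(f"{names[target]} vs {name}: r = {r:.2f}")
```

Keep in mind that a high VIF can also come from a combination of several IVs, so the pairwise list may not always point to a single obvious culprit.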

Ioakim Boutakidis says

Greetings Jim…just wanted to drop a quick note and say that you have laid out some excellent content here. I landed on your site after a student of mine came across it looking for info on multicollinearity, and so I felt I had to check it out to make sure she was getting legitimate information. I have to say I am very impressed. It looks like you are helping a lot of people do better research, and that’s something you should be very proud of.

Jim Frost says

Hi Ioakim, thank you so much for your kind words. They mean a lot to me! I’m glad my site has been helpful for your students!

Michal says

Very much! Thank you!

Michal says

Could it be that two highly related independent variables (r=0.834!!) yield a VIF of 3.82? Of course, it makes my life easier that I don’t have to deal with the multicollinearity problem, but I don’t understand how this can happen….

Jim Frost says

Hi Michal,

Yes, that might be surprising but it is accurate. In fact, for the example in this blog post, the %fat and body weight variables have a correlation of 0.83, yet the VIF for a model with only those two predictor variables is just 3.2. That’s very similar to your situation. When you have only a pair of correlated predictors in your model, the correlation between them has to be very high (~0.9) before it starts to cause problems.

However, when you have more than two predictors, the collective predictive power between the predictors adds up more quickly. As you increase the number of predictors, each pair can have a lower correlation, but the overall strength of those relationships accumulates. VIFs work by regressing a set of predictors on another predictor. Consequently, it’s easier to get higher VIFs when you have more predictors in the model. No one predictor has to “work very hard” to produce problems.

But, when you have only two predictors, the relationship between them must be very strong to produce problems!
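For the two-predictor case, the arithmetic is easy to check because the R-squared from regressing one predictor on the other is just the squared correlation, so VIF = 1 / (1 − r²). A quick sketch in Python (the function name is my own):

```python
def vif_two_predictors(r):
    """VIF for either predictor in a model with exactly two predictors correlated at r."""
    return 1.0 / (1.0 - r ** 2)

for r in (0.83, 0.90, 0.95):
    print(f"r = {r:.2f} -> VIF = {vif_two_predictors(r):.2f}")
```

A correlation of 0.83 gives a VIF of about 3.2, and the VIF only crosses 5 once the correlation reaches roughly 0.9.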

I hope this helps!

Ben says

What if the sole purpose of the regression is to identify the “rank” of the contributions of the independent variables? If the variables (including the dependent variable) are all correlated with each other, does this “drivers analysis” still hold?

Javed Iqbal says

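| Q | L | K |
|---|---|---|
| 11 | 12.2 | 10.1 |
| 34.6 | 30.2 | 28.2 |
| 21.9 | 23 | 24 |
| 28.2 | 22.3 | 21.3 |
| 14.7 | 15.7 | 14.3 |
| 20.2 | 20.8 | 18.4 |
| 9.7 | 11.5 | 10 |
| 22.2 | 25.9 | 24 |
| 17.3 | 21.5 | 20.3 |
| 19.5 | 22.4 | 20.5 |
| 13.6 | 14.4 | 12.2 |
| 34 | 29.5 | 29.2 |
| 35.1 | 26.8 | 25.5 |
| 10.6 | 12.7 | 10.8 |
| 18.6 | 19.6 | 19.9 |
| 22.9 | 25 | 24 |
| 27.4 | 25.7 | 23.2 |
| 16.4 | 18 | 16.2 |
| 22 | 18.3 | 19.4 |
| 27 | 19.7 | 17.2 |
| 27.1 | 23.7 | 25 |
| 15.6 | 21.2 | 20.5 |
| 13.2 | 23 | 22.1 |
| 27.3 | 26.3 | 24.3 |
| 15.4 | 22.6 | 20.8 |
| 30.6 | 30.5 | 28.9 |
| 24.4 | 28.6 | 28.1 |
| 36.1 | 26.7 | 27.9 |
| 24.8 | 21.7 | 20.7 |
| 21 | 18.5 | 17.1 |
| 10.2 | 13.5 | 11.1 |
| 20.4 | 13.4 | 11.7 |
| 14.3 | 15.7 | 15.4 |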

This is the data ‘cobb’ from Hill, Griffiths, and Lim, Principles of Econometrics. Estimating the production function gives the following results (with an R-squared of 0.69 and an overall F of 33.12 with a p-value of 0.000). This is a classic case of multicollinearity, as none of the individual coefficients are significant. The sample correlation between log L and log K is 0.985, and the VIFs are 35.15 for both variables. I would appreciate it if you could comment on resolving the multicollinearity issue.

| Variable | Coefficient | Std. Error | t-Statistic | Prob. |
|---|---|---|---|---|
| C | -0.128673 | 0.546132 | -0.235608 | 0.8153 |
| LOG(L) | 0.558992 | 0.816438 | 0.684672 | 0.4988 |
| LOG(K) | 0.487731 | 0.703872 | 0.692925 | 0.4937 |

derekness says

Hi Jim,

thanks again for the useful input.

I have been playing with the model and trying to see how it responds when I put collinear data in. If I put in the two strong peaks for A and B twice, it goes mad and falls over. This is good. I think it tells me that the additional data contained in the ratio of the two strong peaks is different from the data contained in the two individual peaks. This is good as it really appears to help the model work and give me great predictive powers. It gives me confidence that all is working well.

I am also working to improve the performance of the model on how it handles independent data.

We have to make calibration mixes of A and B in the lab. We can do this very accurately and have a great machine for making super homogeneous, air-free mixes. The model uses these mixes with varying amounts of A and B to then be able to test mixes made in a production environment from a large metering and mixing machine. Unfortunately these do not “look” like the perfect lab-made mixes, and the analysis gives me wide variations in the compositions of A and B. I think that this is not real. I therefore am now tuning the model so that it handles the production mixes better. This makes the Rsq. values for the calibration model worse, but I now get much better Rsq. values from the production mixes. This approach has worked well, and the model is now insensitive to whatever is different in the real-world production mixes.

This has been a great learning process for me ( and also a lot of fun), but I always am cautious as it is a bit of a “black box” and I have no idea what it is up to!

Oh, and yes, we also use PLS treatments for some analyses, and that can be really good on tricky materials. For this one I have to use the ILS method, but it seems pretty good. With the PLS work you have more insight into what is going on with the factors and PRESS data; the ILS software doesn’t let on how it does it!

regards,

Derek.

Jim Frost says

Hi Derek,

Thanks for the follow-up! I always enjoy reading how different fields use these analyses. While the methods are often the same, the contexts can be so different!

It does sound like you have a promising model. And, the super accurate machine you have explains how you can obtain a very high R-squared when you assess the lab mixes. As you found, you’d expect the real life mixtures to have a lower R-squared.

That actually reminds me of research done in the education field. When some researchers tested a new teaching method, they did so in a very controlled lab-like setting. It worked in that setting. However, when they tried it in the field, their results were not nearly as good. They learned that because the new method had to work in the field that it had to be robust to variations you’d expect in the field. You always think of reducing the variability for experiments, but there’s also the need to reflect the actual conditions. And, that sounds kind of like what you’re dealing with.

It’s great that it’s been a fun experience for you! That’s the way I see it too and what I always try to convey through my blog. That statistics can be fun by helping you discover how something works. It’s the process of discovery.

Thanks for sharing!

Derek Ness says

Jim, thanks for that. We have a good grasp of the chemistry, but we don’t understand the maths that we use and have no feel for it at all. The FTIR spectrum is just a collection of peaks. We do one for component A and another for component B. We look for areas where we get a strong peak in one component and nothing in the other. We then put these peaks into the Inverse Least Squares (I think it is actually an MLR) treatment, on a series of mixes with varying amounts of A and B in. I do this and I get an OK model. If I then add in the ratio of the two peaks as well as the individual peaks, the model gets amazingly good, going from Rsq. 0.9 to Rsq. 0.9999. It also gets very good at analysing my set of independent data. I am mentally struggling with whether this is real or whether the treatment is cheating and it will all fall apart when I get a new set of data to analyse.

I will probably work up both approaches and use them on the next application and see if the super one keeps looking great!

Jim Frost says

Hi Derek,

I don’t know enough about the subject to have an opinion on whether it is “cheating” or valid. You should think through the logic of the model and determine whether it makes sense. I wouldn’t say there is anything inherently “cheating” about including the ratio. But, does that make sense from a subject matter point of view? What was your rationale for including it? And, do the coefficient estimates fit your theory?

Also, consider your sample size. Just be sure that you’re not overfitting the model. It doesn’t sound like you have too many predictors, but just something to consider.

Ultimately, yes, I think cross validation with a new dataset is the best way to evaluate a model. I think that’s always true. But, even more so when you have questions like the ones you have! Sounds like you’ve got a great plan to address that! I’d be curious to hear how that goes if you don’t mind sharing at a later point?

Derek Ness says

Jim, I am working on FTIR spectra of mixes and we are using an ILS treatment to build an analysis model. We can select peaks to include in the model, and we can also use ratios of peaks. As I see it, if I use two strong peaks from component A and two strong peaks from component B, I have a collinearity issue, as these peaks are related. Does this cause a problem?

Right now I am using a single peak from A and a single peak from B. This works OK, but if I then use a ratio of these 2 peaks as well, the model looks amazing with an Rsq. of 1. This looks suspicious to me. Are collinearity effects causing this, and is it thus dangerous to use the ratio and the 2 individual peaks? The cross validation for this also looks great, and it appears to analyse an independent set of data really well.

Jim Frost says

Hi Derek,

First, let me say that I know so little about the subject area, which will limit what I can say. Statistics should always be a mix of statistical knowledge and subject-area expertise.

If your goal is to make a good prediction, then you don’t need to worry about multicollinearity. It’s often surprising, but multicollinearity isn’t bad when it comes to the goodness-of-fit measures, such as R-squared and S. It does affect the precision of the coefficient estimates and their p-values. But, if the coefficients/p-values aren’t your main concern, multicollinearity isn’t necessarily a problem.

Again, I don’t know the subject area, but for physical phenomena, it’s not impossible to obtain very high R-squared values if there’s very little noise/random error in the process. That’s something you’ll have to determine using your subject-area knowledge. Maybe research what others have done and the results they obtained. But, again, the very high R-squared might not be a problem. I did write a post about reasons why your R-squared value might be too high, which you should read because it covers other potential reasons why it could be too high. But, the fact that your cross validation looks good is a great sign!

Finally, I’m also aware that analysts often use partial least squares (PLS) regression for analyzing spectra because of both a large number of predictors and multicollinearity. This form of regression is a mix between principal components analysis and least squares regression. I’m not sure if it would be helpful for your analysis, but it’s an analysis to consider. Unfortunately, I don’t have much firsthand experience using PLS so I don’t have much insight to offer. But, if you need to consider other forms of analysis (which you might not), it’s one I’d look into.

I hope this helps. Best of luck with your analysis!

Jerry Avura says

Hello sir, how can one calculate for VIF using R? Thanks in anticipation.

Jim Frost says

Hi Jerry,

Unfortunately, I don’t use R and don’t know the answer. Sorry.

Jerry Avura says

thank you sir. Looking forward to it

Jerry Avura says

Thank you sir for the nice job, it was so clear and perfect. I’m Jerry, an MSc Student of Statistics. I’m currently working on Multicollinearity. Please can you throw more light on “Ridge Regression “?

Thanks in anticipation.

Jim Frost says

Hi Jerry, at some point I’ll try to write a blog post about Ridge Regression, but I have a bit of a backlog right now! I do talk about it a bit in my post about choosing the correct type of regression analysis.

John Komlos says

That’s interesting. Thank you very much. I did not know that. Do you have a citation for me by any chance? Thanks in advance, John

Jim Frost says

You bet! That’s a generally recognized property of multicollinearity so any linear model textbook should discuss this issue. In this post, I include a reference to my preferred textbook for another issue. That’s the one I’d recommend, but any good textbook will talk about this issue. I don’t know of any articles offhand.

John Komlos says

Thank you Jim, appreciate the explanation. One more question: would it be possible for the two variables to be significant in spite of multicollinearity?

Thanks.

Best regards,

John

Jim Frost says

Hi, it’s definitely possible. While multicollinearity weakens the statistical power of the analysis, it’s still possible to obtain significant results–it’s just more difficult. Additionally, the coefficient estimates are erratic and can swing widely depending on which variables are in the model. While you can obtain significant results, this instability makes it more difficult for you to be confident in which specific estimates are correct.

John Komlos says

I wonder if two multicollinear variables can both be statistically significant. One is large and negative while the other is large and positive, and both are significant. I have a feeling that, like magnets, they repel each other. Is that possible?

Jim Frost says

Hi John,

It’s certainly possible for multicollinear variables to have opposite signs like you describe. However, there is no propensity for that situation to occur. That is to say, having different signs or the same signs are equally likely and just depends on the nature of the correlations in your data. The real issue is that you can use one independent variable to predict another. It’s really the absolute magnitude of the correlation coefficient that is the issue rather than the signs themselves.

It actually gets a bit more involved than that. VIFs aren’t just assessing pairs of independent variables. Instead, the VIF calculation regresses each independent variable on all of the other independent variables. It’s possible that two or more independent variables collectively explain a large proportion of the variance in another independent variable. In a VIF regression model, it’s possible to have a mix of positive and negative signs!

That’s probably more than you want to know! But, to your question, yes, it’s possible but it’s really the absolute magnitude that is the issue.
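To make the auxiliary-regression idea concrete, here is a minimal Python sketch that regresses each predictor on all of the others and converts each R-squared into a VIF using VIF = 1 / (1 - R²). Everything in it is an assumption made for illustration: the simulated predictors `x1` through `x4`, the helper names, and the choice of plain Python with a small Gaussian-elimination solver instead of a statistics package. In the simulated data, `x3` is built as roughly `x1 - x2`, so its auxiliary regression has one positive and one negative coefficient.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def ols(y, columns):
    """Least-squares fit of y on the given columns plus an intercept.

    Returns (coefficients, r_squared); coefficients[0] is the intercept.
    """
    n = len(y)
    design = [[1.0] * n] + columns
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in design] for ci in design]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in design]
    beta = solve(xtx, xty)
    fitted = [sum(b * col[i] for b, col in zip(beta, design)) for i in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot

def vif_table(predictors):
    """Regress each predictor on all the others and return {name: VIF}."""
    table = {}
    for name, column in predictors.items():
        others = [col for key, col in predictors.items() if key != name]
        _, r2 = ols(column, others)
        table[name] = 1.0 / (1.0 - r2)
    return table

# Simulated, hypothetical predictors: x3 is close to x1 - x2, x4 is unrelated.
rng = random.Random(0)
n = 200
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [a - b + 0.3 * rng.gauss(0, 1) for a, b in zip(x1, x2)]
x4 = [rng.gauss(0, 1) for _ in range(n)]
predictors = {"x1": x1, "x2": x2, "x3": x3, "x4": x4}

vifs = vif_table(predictors)
aux_coefs, _ = ols(x3, [x1, x2, x4])  # auxiliary regression behind x3's VIF

for name, value in vifs.items():
    print(f"VIF for {name}: {value:.2f}")
print("x3 regressed on x1, x2, x4 (after the intercept):",
      [round(c, 2) for c in aux_coefs[1:]])
```

In this setup `x1` and `x2` are essentially uncorrelated with each other, yet all three of `x1`, `x2`, and `x3` end up with high VIFs because any one of them can be predicted from the other two. The unrelated `x4` stays near a VIF of 1.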

Patrik Silva says

Hi Jim, Thank you, I got your point! You are helping me a lot!

Patrik Silva says

Hi, Jim

I would like to know, when you mention Fat S, Weight S, Activity S, and Fat S * Weight S: is the interaction term calculated by multiplying the standardized variables, Fat S (standardized) * Weight S (standardized), or by standardizing the product of the two raw variables, (Fat * Weight) S?

I do not know if you got my point!

Thank you!

Jim Frost says

Hey Patrik, it’s the first scenario that you list.
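The difference between the two scenarios can be sketched in a few lines of Python. The data below are simulated and purely hypothetical (they are not the dataset from the post), and the variable names simply mirror the ones in the question. The sketch compares how strongly Fat is correlated with the interaction term when the term is built from raw values, from standardized values (the first scenario), and by standardizing the raw product (the second scenario).

```python
import random
import statistics

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Simulated, hypothetical measurements; units and scales are made up.
rng = random.Random(2)
fat = [rng.gauss(30, 5) for _ in range(500)]
weight = [rng.gauss(70, 10) for _ in range(500)]

fat_s = standardize(fat)
weight_s = standardize(weight)

raw_product = [f * w for f, w in zip(fat, weight)]
first_scenario = [f * w for f, w in zip(fat_s, weight_s)]  # Fat S * Weight S
second_scenario = standardize(raw_product)                  # (Fat * Weight) S

r_raw = pearson(fat, raw_product)
r_first = pearson(fat_s, first_scenario)
r_second = pearson(fat_s, second_scenario)

print(f"Fat vs raw product:          {r_raw:.3f}")
print(f"Fat S vs Fat S * Weight S:   {r_first:.3f}")
print(f"Fat S vs (Fat * Weight) S:   {r_second:.3f}")
```

Standardizing the product afterward is just a linear rescaling, so it leaves the correlation with Fat exactly where it was. Multiplying the already standardized variables is what removes most of the structural multicollinearity between the main effect and the interaction term.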

Patrik Silva says

Dear Jim, you are making people love statistics! Every time I come here to read something, I love “Stats” a little more. You explain statistics so easily that I feel like I am reading or hearing a beautiful story.

Thank you Jim!

Sinan says

Hi Jim,

Methods such as backward elimination can be used for feature selection in multiple linear regression (MLR). In these methods, features are removed from the model, just as they are when dealing with multicollinearity. The question I want to ask is: in an MLR application, which should be done first, addressing multicollinearity or model selection?

Jim Frost says

Hi Sinan,

This is a tricky situation. The problem is that multicollinearity makes it difficult for stepwise regression (which includes the backward elimination method) to fit the best model. I write about this problem in my post that compares stepwise and best subsets regression. You can find it in the second half of the post where I talk about factors that influence how well it works.

However, removing multicollinearity can be difficult. But, I would try to remove the multicollinearity first.

There is another approach that you can try: LASSO regression. This method both addresses the multicollinearity and can help choose the model. I describe it in my post about choosing the right type of regression analysis to use. I don’t have hands-on experience with it myself, but it might be something you can look into if it sounds like it can do what you need it to do.

I hope this helps!

Veikko says

Hi, do you have an author for your reference “Applied Linear Statistical Models”?

Jim Frost says

Michael H. Kutner et al.

John Velez says

Hi Jim,

Again, thank you for such a great explanation!

My question is as follows:

Say you have two “severely” correlated IVs (X1 and X2) but you’re interested in examining each one individually. How would controlling (i.e., enter as a covariate) X2 while examining X1 influence your coefficients and p values? I would assume a loss of power but what else may occur? I’m also interested in any potential pitfalls of using this approach.

Thanks for your time!

John

Jim Frost says

Hi John,

There are several problems with including severely correlated IVs in your model. One, it saps the statistical power. However, it also makes the coefficients unstable. You can change the model by including or excluding variables and the coefficients can swing around wildly. This condition makes it very difficult to have confidence in the true value of the coefficient. So, the lower statistical power and unstable coefficients are the major drawbacks. Basically, it’s hard to tell which IVs are correlated with the DV and the nature of those relationships.

One thing I don’t mention in this post, but I should add, is that you can try Ridge regression and Lasso regression, which are more advanced forms of regression analysis that are better at handling multicollinearity. I don’t have much first hand experience using them for that reason but they could be worth looking into. I mention them in my post about choosing the correct type of regression analysis.
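As a rough illustration of what ridge regression does with two nearly duplicate predictors, here is a minimal Python sketch. It is not a recommended workflow: the data are simulated, the penalty value `lam = 5.0` is an arbitrary assumption (in practice it is usually chosen by cross-validation, with the predictors standardized first), and `ridge_two` is a made-up helper that solves the two-predictor ridge equations (X'X + λI)b = X'y directly. Setting the penalty to zero gives ordinary least squares for comparison.

```python
import random

def center(values):
    """Subtract the mean so that no intercept is needed."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def ridge_two(x1, x2, y, lam):
    """Ridge coefficients for two centered predictors.

    Solves (X'X + lam * I) b = X'y for the 2x2 case; lam = 0 is plain OLS.
    """
    s11 = sum(u * u for u in x1) + lam
    s22 = sum(v * v for v in x2) + lam
    s12 = sum(u * v for u, v in zip(x1, x2))
    s1y = sum(u * t for u, t in zip(x1, y))
    s2y = sum(v * t for v, t in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

# Simulated, hypothetical data: x2 is almost a copy of x1, true coefficients are 1 and 1.
rng = random.Random(3)
n = 50
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [u + 0.05 * rng.gauss(0, 1) for u in x1]
y = [u + v + rng.gauss(0, 1) for u, v in zip(x1, x2)]
x1c, x2c, yc = center(x1), center(x2), center(y)

ols_b = ridge_two(x1c, x2c, yc, lam=0.0)
ridge_b = ridge_two(x1c, x2c, yc, lam=5.0)

def norm(b):
    return (b[0] ** 2 + b[1] ** 2) ** 0.5

print("OLS coefficients:  ", [round(c, 2) for c in ols_b])
print("Ridge coefficients:", [round(c, 2) for c in ridge_b])
print(f"Coefficient norm: OLS {norm(ols_b):.2f}, ridge {norm(ridge_b):.2f}")
```

The penalty shrinks the coefficient vector, which trades a little bias for much less variance in the individual estimates. The sum of the two coefficients, which the data pin down well, moves only slightly. The split between them, which the data barely identify, is what gets pulled in.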

I hope this helps!

seyi says

Hi Jim,

Just to thank you for the clear explanation on your articles.

Jim Frost says

Hi Seyi, you’re very welcome! Thanks for taking the time to write such a nice comment! It means a lot to me!

Filmon says

You are just amazing Mr. Jim Thank you for the wonderful and exhaustive note

Jim Frost says

You’re very welcome! Thanks for taking the time to write such a kind comment!

Vijay says

Hello Jim Frost,

That’s an awesome explanation of multicollinearity.

But I have one question: is there any other method to detect multicollinearity besides VIF?

Thanks in advance.

Regards,

Vijay

Jim Frost says

Hi Vijay, thank you for your kind words! Thanks for writing with the great question!

VIFs really are the best way because they calculate the correlation between each independent variable and ALL of the other independent variables. You get a complete picture of the combined strength of the correlation.

You can also assess the individual simple correlations between pairs of IVs. This approach can tell you if two IVs are highly correlated, but can miss out on correlations with multiple IVs. For instance, suppose that individually, IV1 has a moderate (but not problematic) simple correlation with IV2 and IV3. However, collectively IV1 has problematic levels of correlation with both IV2 and IV3 combined. VIFs will detect the problem while simple correlation will miss it.
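The IV1, IV2, IV3 scenario can be checked with a little arithmetic. With three predictors, the R-squared from regressing IV1 on IV2 and IV3 can be written in terms of the three pairwise correlations, and the VIF is 1 / (1 - R²). The correlation values in this Python sketch are hypothetical, chosen only so that each pairwise correlation looks moderate.

```python
def r_squared_from_correlations(r12, r13, r23):
    """R-squared from regressing IV1 on IV2 and IV3, given pairwise correlations."""
    return (r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2)

# Hypothetical correlations: IV1 is moderately correlated with IV2 and IV3,
# while IV2 and IV3 are weakly negatively correlated with each other.
# (This set is a valid correlation matrix: its determinant is positive.)
r12, r13, r23 = 0.6, 0.6, -0.2

# What you would conclude from looking at a single pair in isolation.
pairwise_vif = 1 / (1 - r12 ** 2)

# What the full VIF calculation for IV1 reports.
joint_r2 = r_squared_from_correlations(r12, r13, r23)
joint_vif = 1 / (1 - joint_r2)

print(f"VIF implied by one pairwise correlation of {r12}: {pairwise_vif:.2f}")
print(f"R-squared of IV1 on IV2 and IV3 together: {joint_r2:.2f}")
print(f"Actual VIF for IV1: {joint_vif:.1f}")
```

No single pairwise correlation exceeds 0.6 in absolute value, which on its own suggests a VIF of about 1.6, yet IV2 and IV3 together explain 90% of the variance in IV1, giving a VIF of 10.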

I hope this helps!

Blake Wareham says

Hi, Jim! Third year MA/PhD student here, very much appreciating the time and work you’ve put into communicating these concepts so effectively. I’ve been reading through a few of your posts for information about a three-way interaction in a regression used for my thesis, and will certainly cite your pages. I’m also wondering if it would also be possible to provide other sources in these posts that reiterate or elaborate on the concepts? I imagine at least some of it includes Dawson, Preacher, and perhaps the Aiken and West paper.

Thanks again for what you’re doing here.

Best,

Blake

Jim Frost says

Hi Blake,

Thanks for the kind words! They mean a lot!

I cover two-way interactions in a previous post. However, I find the interpretation of three-way interactions to be much more complicated. Suppose the three-way interaction A*B*C is statistically significant. This interaction indicates that the effect of A on the dependent variable (Y) depends on the values of both B and C. You can also come up with similar interpretations for the effects of B and C on Y.

In my post about two-way interactions, I show how graphs are particularly useful for understanding interaction effects. For three-way interactions, you’ll likely need even more graphs. If you’re using categorical variables in ANOVA, you can also perform post-hoc tests to determine whether the differences between the groups formed by the interaction term are statistically significant.

Down the road, I might well write a more advanced post about interpreting three-way interactions. However, in the meantime, I hope this explanation about interpreting three-way interaction effects is helpful.

Nivedan says

Hi Jim!

What if a categorical independent variable (converted to dummies) has 2, 3 or more category levels and are severely correlated?

Regards

Nivedan.

Jim Frost says

With correlated categorical, independent variables, you’ll face the same problems as with any correlated independent variables. You can also use the same potential solutions–partial least squares, Lasso regression, Ridge regression, removing one of them, combining them linearly, etc.

Avinash says

Thanks, Jim!

Avinash says

Hi Jim,

If one independent variable is a subset of another independent variable, then does that also cause problems? Say, for example, I’m looking at the number of people who have completed high school and also the number of people with a bachelor’s degree. The latter is clearly a subset of the former. Should they be considered separately in 2 different models to study the effects?

And a second question – what about a cluster analysis? Does it cause any problem to have 2 correlated variables?

Jim Frost says

Hi Avinash,

In your education example, yes, those variables would clearly be correlated. Individuals with a bachelor’s degree must also have HS degree. And those without a HS degree cannot have a bachelor’s degree. Usually, analyses will include a variable for the highest degree obtained (or worked on), which eliminates that problem.

Correlated variables can affect cluster analysis. Highly correlated variables are not sufficiently unique to identify distinct clusters. And, the aspects associated with the correlated variable will be overrepresented in the solution.

I hope this helps!

Jim

prabuddh says

I am still trying to figure out the solution and will share it here once I get it. Thank you for your response, though, and for the great articles for understanding these concepts!

Jim Frost says

I would love to hear what you do. Please do share! Best of luck!

Prabuddh says

In the case of a marketing mix, where we want to understand the impact of, say, radio, TV, internet, and other media campaigns individually, we cannot remove or merge these variables. How should we handle them? And obviously these variables are highly correlated.

Jim Frost says

Hi, first you should make sure that they’re correlated to a problematic degree. Some correlation is OK. Check those VIFs! If they’re less than 5, then you don’t need to worry about multicollinearity. I’m not sure that I’d necessarily expect the correlation between the different types of media campaigns to be so high as to cause problems, but I’m not an expert in that area.

If it is too high, you need to figure out which solution to go with. The correct solution depends on your specific study area and the requirements of what you need to learn. It can be a difficult choice and not one that I can make based on general information. In other words, you might need to consult with a statistician. You’d have to choose among options such as removing variables, linearly combining the measures, using principal components analysis, and partial least squares regression.

You’ve already ruled a couple of those out but there are still some options left. So, I’d look into those.

I hope this helps!

Jim

Sibashis Chakraborty says

Say I am regressing Y on two variables, x1 and x2, and x1 and x2 are highly correlated. If I run a regression of Y on x1, then Y on x2, and finally Y on x1 and x2, I will get different values for the coefficients, and that will affect my prediction, don’t you think?

Jim Frost says

Hi, it depends on all of the specifics. However, if X1 and X2 are highly correlated, they provide very similar information in the model. Consequently, using X1 alone, X2 alone, or both X1 and X2 might not change the predictions all that much. As the correlation between the two increases, this becomes ever more the case.
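Here is a minimal Python sketch of that comparison, under stated assumptions: the data are simulated, `x2` is deliberately built as a near copy of `x1`, and the true coefficients (2 and 3) are made up. It fits Y on X1 alone, on X2 alone, and on both, then measures how far apart the fitted predictions are relative to how much the predictions themselves vary.

```python
import random

def mean(values):
    return sum(values) / len(values)

def fit_one(x, y):
    """Simple regression y = a + b*x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def fit_two(x1, x2, y):
    """Two-predictor regression y = a + b1*x1 + b2*x2; returns (a, b1, b2)."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (t - my) for u, t in zip(x1, y))
    s2y = sum((v - m2) * (t - my) for v, t in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

def rms_gap(p, q):
    """Root-mean-square difference between two lists of predictions."""
    return (sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)) ** 0.5

# Simulated, hypothetical data: x2 is nearly a copy of x1.
rng = random.Random(1)
n = 200
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [u + 0.1 * rng.gauss(0, 1) for u in x1]
y = [1 + 2 * u + 3 * v + rng.gauss(0, 1) for u, v in zip(x1, x2)]

a_1, b_1 = fit_one(x1, y)
a_2, b_2 = fit_one(x2, y)
a_12, b1_12, b2_12 = fit_two(x1, x2, y)

pred_1 = [a_1 + b_1 * u for u in x1]
pred_2 = [a_2 + b_2 * v for v in x2]
pred_12 = [a_12 + b1_12 * u + b2_12 * v for u, v in zip(x1, x2)]

spread = rms_gap(pred_12, [mean(pred_12)] * n)  # variation in the predictions

print(f"Slope, X1 only: {b_1:.2f}   Slope, X2 only: {b_2:.2f}")
print(f"Both: b1 = {b1_12:.2f}, b2 = {b2_12:.2f}")
print(f"Gap (X1 only vs both) / spread: {rms_gap(pred_1, pred_12) / spread:.3f}")
print(f"Gap (X2 only vs both) / spread: {rms_gap(pred_2, pred_12) / spread:.3f}")
```

The coefficients do differ from model to model, as the question anticipates: each single-predictor slope absorbs the combined effect of both variables. The predictions, however, differ by only a small fraction of their overall spread, and that fraction shrinks further as the correlation between the two predictors rises.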

I hope this helps!

Jim