Multicollinearity occurs when independent variables in a regression model are correlated. This correlation is a problem because independent variables should be *independent*. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.

In this blog post, I’ll highlight the problems that multicollinearity can cause, show you how to test your model for it, and highlight some ways to resolve it. In some cases, multicollinearity isn’t necessarily a problem, and I’ll show you how to make this determination. I’ll work through an example dataset which contains multicollinearity to bring it all to life!

## Why is Multicollinearity a Potential Problem?

A key goal of regression analysis is to isolate the relationship between each independent variable and the dependent variable. The interpretation of a regression coefficient is that it represents the mean change in the dependent variable for each 1 unit change in an independent variable when you *hold all of the other independent variables constant*. That last portion is crucial for our discussion about multicollinearity.

The idea is that you can change the value of one independent variable and not the others. However, when independent variables are correlated, it indicates that changes in one variable are associated with shifts in another variable. The stronger the correlation, the more difficult it is to change one variable without changing another. It becomes difficult for the model to estimate the relationship between each independent variable and the dependent variable *independently* because the independent variables tend to change in unison.

There are two basic kinds of multicollinearity:

- **Structural multicollinearity**: This type occurs when we create a model term using other terms. In other words, it’s a byproduct of the model that we specify rather than being present in the data itself. For example, if you square term X to model curvature, clearly there is a correlation between X and X².
- **Data multicollinearity**: This type of multicollinearity is present in the data itself rather than being an artifact of our model. Observational studies are more likely to exhibit this kind of multicollinearity.
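To make the structural case concrete, here is a minimal sketch that assumes NumPy is available. The values of `x` are made up for illustration and are not taken from the example dataset. It simply measures how strongly a positive-valued term correlates with its own square.

```python
import numpy as np

# Hypothetical predictor: 50 evenly spaced, strictly positive values.
x = np.linspace(1.0, 10.0, 50)

# The squared term we might add to the model to capture curvature.
x_squared = x ** 2

# Pearson correlation between the original term and the squared term.
r = np.corrcoef(x, x_squared)[0, 1]
print(f"corr(X, X^2) = {r:.3f}")
```

The correlation comes out close to 1 even though nothing in the data forced it. It exists only because one model term was built from another.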

## What Problems Does Multicollinearity Cause?

Multicollinearity causes the following two basic types of problems:

- The coefficient estimates can swing wildly based on which other independent variables are in the model. The coefficients become very sensitive to small changes in the model.
- Multicollinearity reduces the precision of the estimated coefficients, which weakens the statistical power of your regression model. You might not be able to trust the p-values to identify independent variables that are statistically significant.

Imagine you fit a regression model and the coefficient values, and even the signs, change dramatically depending on the specific variables that you include in the model. It’s a disconcerting feeling when slightly different models lead to very different conclusions. You don’t feel like you know the actual effect of each variable!

Now, throw in the fact that you can’t necessarily trust the p-values to select the independent variables to include in the model. This problem makes it difficult both to specify the correct model and to justify the model if many of your p-values are not statistically significant.

As the severity of the multicollinearity increases so do these problematic effects. However, these issues affect only those independent variables that are correlated. You can have a model with severe multicollinearity and yet some variables in the model can be completely unaffected.
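A small simulation can illustrate the loss of precision. This is a sketch under stated assumptions: the data are randomly generated, the true coefficients (2 and 1) are arbitrary, and the helper `ols` is a hypothetical name for a plain least-squares fit written with NumPy. The same model is fit twice, once with an unrelated second predictor and once with a second predictor that nearly duplicates the first.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and standard errors (X must include an intercept column)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)          # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)     # covariance matrix of the coefficients
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)                 # unrelated to x1
x2_corr = x1 + 0.05 * rng.normal(size=n)      # almost a copy of x1
noise = rng.normal(size=n)

results = {}
for label, x2 in [("uncorrelated", x2_indep), ("correlated", x2_corr)]:
    y = 1.0 + 2.0 * x1 + 1.0 * x2 + noise     # same true coefficients in both cases
    X = np.column_stack([np.ones(n), x1, x2])
    beta, se = ols(X, y)
    results[label] = (beta, se)
    print(f"{label:>12}: b1 = {beta[1]:.2f} (SE {se[1]:.3f}), "
          f"b2 = {beta[2]:.2f} (SE {se[2]:.3f})")
```

In the correlated case, the standard errors for both predictors should be many times larger, which is the reduced precision described above.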

The regression example with multicollinearity that I work through later on illustrates these problems in action.

## Do I Have to Fix Multicollinearity?

Multicollinearity makes it hard to interpret your coefficients, and it reduces the power of your model to identify independent variables that are statistically significant. These are definitely serious problems. However, the good news is that you don’t always have to find a way to fix multicollinearity.

The need to reduce multicollinearity depends on its severity and your primary goal for your regression model. Keep the following three points in mind:

- The severity of the problems increases with the degree of the multicollinearity. Therefore, if you have only moderate multicollinearity, you may not need to resolve it.
- Multicollinearity affects only the specific independent variables that are correlated. Therefore, if multicollinearity is not present for the independent variables that you are particularly interested in, you may not need to resolve it. Suppose your model contains the experimental variables of interest and some control variables. If high multicollinearity exists for the control variables but not the experimental variables, then you can interpret the experimental variables without problems.
- Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, the precision of the predictions, or the goodness-of-fit statistics. If your primary goal is to make predictions, and you don’t need to understand the role of each independent variable, you don’t need to reduce severe multicollinearity.

Over the years, I’ve found that many people are incredulous over the third point, so here’s a reference!

“The fact that some or all predictor variables are correlated among themselves does not, in general, inhibit our ability to obtain a good fit nor does it tend to affect inferences about mean responses or predictions of new observations.” (*Applied Linear Statistical Models*, 4th Edition, p. 289)

## Testing for Multicollinearity with Variance Inflation Factors (VIF)

If you can identify which variables are affected by multicollinearity and the strength of the correlation, you’re well on your way to determining whether you need to fix it. Fortunately, there is a very simple test to assess multicollinearity in your regression model. The variance inflation factor (VIF) identifies correlation between independent variables and the strength of that correlation.

Statistical software calculates a VIF for each independent variable. VIFs start at 1 and have no upper limit. A value of 1 indicates that there is no correlation between this independent variable and any others. VIFs between 1 and 5 suggest that there is a moderate correlation, but it is not severe enough to warrant corrective measures. VIFs greater than 5 represent critical levels of multicollinearity where the coefficients are poorly estimated, and the p-values are questionable.
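If you want to see what the software is doing, the standard definition is VIF = 1 / (1 − R²), where R² comes from regressing one independent variable on all of the others. The sketch below assumes NumPy is available. The function name `variance_inflation_factors` and the simulated columns `a`, `b`, and `c` are illustrative choices, not part of any particular package.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X (predictors only, without an intercept column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress predictor j on an intercept plus all remaining predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r_squared = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

# Simulated data: a is unrelated to the rest, while c is mostly b plus noise.
rng = np.random.default_rng(42)
n = 300
a = rng.normal(size=n)
b = rng.normal(size=n)
c = b + 0.3 * rng.normal(size=n)

vif_values = variance_inflation_factors(np.column_stack([a, b, c]))
for name, value in zip(["a", "b", "c"], vif_values):
    print(f"VIF({name}) = {value:.2f}")
```

With this setup, `a` should land near 1 while `b` and `c` should exceed the cutoff of 5 used in this post.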

Use VIFs to identify correlations between variables and determine the strength of the relationships. Most statistical software can display VIFs for you. Assessing VIFs is particularly important for observational studies because these studies are more prone to having multicollinearity.

## Multicollinearity Example: Predicting Bone Density in the Femur

This regression example uses a subset of variables that I collected for an experiment. In this example, I’ll show you how to detect multicollinearity as well as illustrate its effects. I’ll also show you how to remove structural multicollinearity. You can download the CSV data file: MulticollinearityExample.

I’ll use regression analysis to model the relationship between the independent variables (physical activity, body fat percentage, weight, and the interaction between weight and body fat) and the dependent variable (bone mineral density of the femoral neck).

Here are the regression results:

These results show that Weight, Activity, and the interaction between weight and body fat are statistically significant. The percent body fat is not statistically significant. However, the VIFs indicate that our model has severe multicollinearity for some of the independent variables.

Notice that Activity has a VIF near 1, which shows that multicollinearity does not affect it and we can trust this coefficient and p-value with no further action. However, the coefficients and p-values for the other terms are suspect!

Additionally, at least some of the multicollinearity in our model is the structural type. We’ve included the interaction term of body fat * weight. Clearly, there is a correlation between the interaction term and both of the main effect terms. The VIFs reflect these relationships.

I have a neat trick to show you. There’s a method to remove this type of structural multicollinearity quickly and easily!

## Center the Independent Variables to Reduce Structural Multicollinearity

In our model, the interaction term is at least partially responsible for the high VIFs. Both higher-order terms and interaction terms produce multicollinearity because these terms include the main effects. Centering the variables is a simple way to reduce structural multicollinearity.

Centering the variables is also known as standardizing the variables by subtracting the mean. This process involves calculating the mean for each continuous independent variable and then subtracting the mean from all observed values of that variable. Then, use these centered variables in your model. Most statistical software provides the feature of fitting your model using standardized variables.

There are other standardization methods, but the advantage of just subtracting the mean is that the interpretation of the coefficients remains the same. The coefficients continue to represent the mean change in the dependent variable given a 1 unit change in the independent variable.
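Here is a minimal sketch of the centering step, assuming NumPy is available. The `weight` and `fat` values are simulated stand-ins, not the values in the downloadable dataset, and the helper `vifs` is a hypothetical name. It uses the fact that the VIFs equal the diagonal of the inverse of the predictors’ correlation matrix, which is equivalent to 1 / (1 − R²) for each predictor.

```python
import numpy as np

def vifs(columns):
    """VIFs as the diagonal of the inverse correlation matrix of the predictors."""
    corr = np.corrcoef(np.column_stack(columns), rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(1)
n = 300
# Simulated, moderately correlated predictors in made-up units.
weight = rng.normal(70.0, 10.0, size=n)
fat = 25.0 + 0.2 * (weight - 70.0) + rng.normal(0.0, 4.0, size=n)

# Model terms built from the raw variables: two main effects plus the interaction.
raw = vifs([weight, fat, weight * fat])

# Center each variable by subtracting its mean, then rebuild the interaction.
weight_c = weight - weight.mean()
fat_c = fat - fat.mean()
centered = vifs([weight_c, fat_c, weight_c * fat_c])

labels = ["weight", "fat", "weight*fat"]
for name, before, after in zip(labels, raw, centered):
    print(f"{name:>10}: VIF raw = {before:8.1f}, centered = {after:5.2f}")
```

The interaction has to be recomputed from the centered columns. Centering the raw product itself would not change its correlation with the main effects.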

In the worksheet, I’ve included the centered independent variables in the columns with an S added to the variable names.

For more about this, read my post about standardizing your continuous independent variables.

## Regression with Centered Variables

Let’s fit the same model but using the centered independent variables.

The most apparent difference is that the VIFs are all down to satisfactory values; they’re all less than 5. By removing the structural multicollinearity, we can see that there is some multicollinearity in our data, but it is not severe enough to warrant further corrective measures.

Removing the structural multicollinearity produced other notable differences in the output that we’ll investigate.

## Comparing Regression Models to Reveal Multicollinearity Effects

We can compare two versions of the same model, one with high multicollinearity and one without it. This comparison highlights its effects.

The first independent variable we’ll look at is Activity. This variable was the only one to have almost no multicollinearity in the first model. Compare the Activity coefficients and p-values between the two models and you’ll see that they are the same (coefficient = 0.000022, p-value = 0.003). This illustrates how the problems of multicollinearity affect only the variables that are highly correlated.

Let’s look at the variables that had high VIFs in the first model. The standard error of the coefficient measures the precision of the estimates. Lower values indicate more precise estimates. The standard errors in the second model are lower for both %Fat and Weight. Additionally, %Fat is significant in the second model even though it wasn’t in the first model. Not only that, but the sign for %Fat has changed from positive to negative!

The lower precision, switched signs, and a lack of statistical significance are typical problems associated with multicollinearity.

Now, take a look at the Summary of Model tables for both models. You’ll notice that the standard error of the regression (S), R-squared, adjusted R-squared, and predicted R-squared are all identical. As I mentioned earlier, multicollinearity doesn’t affect the predictions or goodness-of-fit. If you just want to make predictions, the model with severe multicollinearity is just as good!
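You can check this equivalence yourself on simulated data. The sketch below assumes NumPy is available. The data-generating coefficients and the helper `fit` are hypothetical choices for illustration. It fits the same interaction model with raw and with centered variables and compares the fitted values and R-squared.

```python
import numpy as np

def fit(X, y):
    """OLS fit; returns coefficients, fitted values, and R-squared."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return beta, fitted, 1.0 - ss_res / ss_tot

rng = np.random.default_rng(2)
n = 200
weight = rng.normal(70.0, 10.0, size=n)
fat = rng.normal(25.0, 5.0, size=n)
# Arbitrary true model with an interaction, plus random noise.
y = (0.5 + 0.01 * weight - 0.02 * fat + 0.0005 * weight * fat
     + rng.normal(0.0, 0.05, size=n))

ones = np.ones(n)
X_raw = np.column_stack([ones, weight, fat, weight * fat])

wc, fc = weight - weight.mean(), fat - fat.mean()
X_cen = np.column_stack([ones, wc, fc, wc * fc])

beta_raw, fitted_raw, r2_raw = fit(X_raw, y)
beta_cen, fitted_cen, r2_cen = fit(X_cen, y)

print(f"R-squared raw = {r2_raw:.6f}, centered = {r2_cen:.6f}")
print(f"max difference in fitted values = {np.max(np.abs(fitted_raw - fitted_cen)):.2e}")
print(f"main-effect coefficients raw:      {beta_raw[1]:.5f}, {beta_raw[2]:.5f}")
print(f"main-effect coefficients centered: {beta_cen[1]:.5f}, {beta_cen[2]:.5f}")
```

The two designs span the same set of possible fitted values, so the predictions and R-squared match up to rounding error. Only the main-effect coefficients and their standard errors change.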

## How to Deal with Multicollinearity

I showed that there are a variety of situations where you don’t need to deal with multicollinearity. The multicollinearity might not be severe, it might not affect the variables you’re most interested in, or maybe you just need to make predictions. Or, perhaps it’s just structural multicollinearity that you can get rid of by centering the variables.

But, what if you have severe multicollinearity in your data and you find that you must deal with it? What do you do then? Unfortunately, this situation can be difficult to resolve. There are a variety of methods that you can try, but each one has some drawbacks. You’ll need to use your subject-area knowledge and factor in the goals of your study to pick the solution that provides the best mix of advantages and disadvantages.

The potential solutions include the following:

- Remove some of the highly correlated independent variables.
- Linearly combine the independent variables, such as adding them together.
- Perform an analysis designed for highly correlated variables, such as principal components analysis or partial least squares regression.
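The options above can be sketched on simulated data. This is an illustration under stated assumptions, not a recommendation for any particular study. NumPy is assumed to be available, the columns `a`, `b`, and `c` are made up, and `vifs` is a hypothetical helper that reads the VIFs from the diagonal of the inverse correlation matrix. The principal components part uses a plain singular value decomposition rather than a dedicated statistics package.

```python
import numpy as np

def vifs(X):
    """VIFs as the diagonal of the inverse correlation matrix of the columns of X."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(3)
n = 300
a = rng.normal(size=n)
b = a + 0.2 * rng.normal(size=n)   # nearly duplicates a
c = rng.normal(size=n)             # unrelated to a and b
X = np.column_stack([a, b, c])

vif_original = vifs(X)

# Option 1: remove one of the highly correlated variables.
vif_dropped = vifs(np.column_stack([a, c]))

# Option 2: linearly combine the correlated variables (here, their average).
vif_combined = vifs(np.column_stack([(a + b) / 2.0, c]))

# Option 3: replace the predictors with principal component scores.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each column
_, _, vt = np.linalg.svd(Z, full_matrices=False)   # rows of vt are the component directions
scores = Z @ vt.T                                  # uncorrelated component scores
vif_pca = vifs(scores)

print("original :", np.round(vif_original, 2))
print("dropped b:", np.round(vif_dropped, 2))
print("combined :", np.round(vif_combined, 2))
print("PCA      :", np.round(vif_pca, 2))
```

Each option brings the VIFs down, but each one also changes what the coefficients mean. For example, the component scores are blends of all three original variables, so they no longer answer questions about `a` or `b` individually.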

As you consider a solution, remember that all of these have downsides. If you can accept less precise coefficients, or a regression model with a high R-squared but hardly any statistically significant variables, then not doing anything about the multicollinearity might be the best solution.

Do you have experience dealing with multicollinearity?

If you’re learning regression, check out my Regression Tutorial!

Sibashis Chakraborty says

Say I am regressing Y on two variables, x1 and x2, and x1 and x2 have a high correlation between them. If I run a regression of Y on x1, then Y on x2, and finally Y on x1 and x2, I will get different values for the coefficients, and that will affect my predictions. Don’t you think?

Jim Frost says

Hi, it depends on all of the specifics. However, if X1 and X2 are highly correlated, they provide very similar information in the model. Consequently, using X1, X2, or both X1 and X2 might not change the predictions all that much. As the correlation between the two increases, this becomes ever more the case.

I hope this helps!

Jim

Prabuddh says

In the case of a marketing mix where we want to understand the impact of, say, radio, TV, internet, and other media campaigns individually, we cannot remove or merge these variables. How should we go about them? And obviously these variables are highly correlated.

Jim Frost says

Hi, first you should make sure that they’re correlated to a problematic degree. Some correlation is OK. Check those VIFs! If they’re less than 5, then you don’t need to worry about multicollinearity. I’m not sure that I’d necessarily expect the correlation between the different types of media campaigns to be so high as to cause problems, but I’m not an expert in that area.

If it is too high, you need to figure out which solution to go with. The correct solution depends on your specific study area and the requirements of what you need to learn. It can be a difficult choice and not one that I can make based on general information. In other words, you might need to consult with a statistician. You’d have to choose among options such as removing variables, linearly combining the measures, using principal components analysis, and partial least squares regression.

You’ve already ruled a couple of those out but there are still some options left. So, I’d look into those.

I hope this helps!

Jim

prabuddh says

I am still trying to figure out the solution and will share it here once I get it. Thank you for your response, though, and for the great articles that help in understanding the concepts!

Jim Frost says

I would love to hear what you do. Please do share! Best of luck!

Avinash says

Hi Jim,

If one independent variable is a subset of another independent variable, then does that also cause problems? Say, for example, I’m looking at the number of people who have completed high school and also the number of people with a bachelor’s degree. The latter is clearly a subset of the former. Should they be considered separately in two different models to study the effects?

And a second question – what about a cluster analysis? Does it cause any problem to have 2 correlated variables?

Jim Frost says

Hi Avinash,

In your education example, yes, those variables would clearly be correlated. Individuals with a bachelor’s degree must also have a HS degree. And those without a HS degree cannot have a bachelor’s degree. Usually, analyses will include a variable for the highest degree obtained (or worked on), which eliminates that problem.

Correlated variables can affect cluster analysis. Highly correlated variables are not sufficiently unique to identify distinct clusters. And, the aspects associated with the correlated variable will be overrepresented in the solution.

I hope this helps!

Jim

Avinash says

Thanks, Jim!

Nivedan says

Hi Jim!

What if a categorical independent variable (converted to dummies) has 2, 3, or more category levels and the dummies are severely correlated?

Regards

Nivedan.

Jim Frost says

With correlated categorical independent variables, you’ll face the same problems as with any correlated independent variables. You can also use the same potential solutions: partial least squares, Lasso regression, Ridge regression, removing one of them, combining them linearly, etc.