Standardization is the process of putting different variables on the same scale. In regression analysis, there are some scenarios where it is crucial to standardize your independent variables or risk obtaining misleading results.

In this blog post, I show when and why you need to standardize your variables in regression analysis. Don’t worry, this process is simple and helps ensure that you can trust your results. In fact, standardizing your variables can reveal essential findings that you would otherwise miss!

## Why Standardize the Variables

In regression analysis, you need to standardize the independent variables when your model contains polynomial terms to model curvature or interaction terms. These terms provide crucial information about the relationships between the independent variables and the dependent variable, but they also generate high amounts of multicollinearity.

Multicollinearity refers to independent variables that are correlated. This problem can obscure the statistical significance of model terms, produce imprecise coefficients, and make it more difficult to choose the correct model.

When you include polynomial and interaction terms, your model almost certainly has excessive amounts of multicollinearity. These higher-order terms multiply independent variables that are in the model. Consequently, it’s easy to see how these terms are correlated with other independent variables in the model.

When your model includes these types of terms, you are at risk of producing misleading results and missing statistically significant terms.

Fortunately, standardizing the independent variables is a simple way to reduce the multicollinearity that higher-order terms produce. However, it’s important to note that it won’t work for other causes of multicollinearity.
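To see why this works, here’s a minimal sketch in Python with NumPy. The data are made up for illustration (a single predictor drawn from a uniform distribution), not taken from any dataset in this post. It compares the correlation between a variable and its squared term before and after subtracting the mean.

```python
import numpy as np

# Synthetic predictor (an assumption for illustration): all values are positive,
# so x and x**2 rise together almost in lockstep.
rng = np.random.default_rng(0)
x = rng.uniform(10, 20, size=500)

def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

# Correlation between the raw variable and its polynomial term.
raw_corr = corr(x, x**2)

# Subtract the mean first, then build the polynomial term from the centered values.
xc = x - x.mean()
centered_corr = corr(xc, xc**2)

print(f"raw:      corr(x, x^2) = {raw_corr:.3f}")
print(f"centered: corr(x, x^2) = {centered_corr:.3f}")
```

With the raw values, the squared term is nearly a straight-line function of the original variable. After subtracting the mean, negative and positive values both produce positive squares, which breaks most of that linear association. How close the second correlation gets to zero depends on how symmetric the variable’s distribution is.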

Standardizing your independent variables can also help you determine which variable is the most important. Read how in my post: Identifying the Most Important Independent Variables in Regression Models.

## How to Standardize the Variables

Standardizing variables is a simple process. Most statistical software can do this for you automatically. Usually, standardization refers to the process of subtracting the mean and dividing by the standard deviation. However, to remove multicollinearity caused by higher-order terms, I recommend only subtracting the mean and **not** dividing by the standard deviation. Subtracting the means is also known as centering the variables.

Centering the variables and standardizing them will both reduce the multicollinearity. However, standardizing changes the interpretation of the coefficients. So, for this post, I’ll center the variables.
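If your software doesn’t do this for you, both transformations take a line or two. The sketch below uses NumPy, and the helper names `center` and `standardize` are my own, not from any particular package. I’m assuming the sample standard deviation (`ddof=1`); some tools divide by the population standard deviation instead.

```python
import numpy as np

def center(x):
    """Subtract the mean; the units of the variable are unchanged."""
    x = np.asarray(x, dtype=float)
    return x - x.mean()

def standardize(x):
    """Subtract the mean and divide by the sample standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

values = np.array([2.0, 4.0, 6.0, 8.0])  # illustrative numbers only

centered = center(values)    # mean is 5, so this is [-3, -1, 1, 3]
z = standardize(values)      # same shape, but now in standard deviation units

print(centered)
print(z)
```

Both versions have a mean of zero. Only the standardized version also has a standard deviation of one, which is what changes the units of the coefficients.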

## Interpreting the Results for Standardized Variables

Centering the independent variables is convenient because you can still interpret the regression coefficients in the usual way. Consequently, this approach is easy to use and produces results that are easy to interpret.

Let’s go through an example that illustrates the problems of higher-order terms and how centering the variables resolves them. You can try this example yourself using the CSV data file: TestSlopes.

## Regression Model with Unstandardized Independent Variables

First, we’ll fit the model without centering the variables. Output is the dependent variable, and the model includes Input, Condition, and the interaction term Input*Condition. The results are below.

Using a significance level of 0.05, Input and Input*Condition are statistically significant while Condition is not. However, notice the VIF values. VIFs greater than 5 indicate that you have problematic levels of multicollinearity. Condition and the interaction term both have VIFs near 5.
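For reference, the usual definition of the variance inflation factor for the j-th independent variable is shown below, where $R_j^2$ is the R-squared you get by regressing that variable on all of the other independent variables in the model:

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
$$

Under this definition, a VIF of 5 corresponds to $R_j^2 = 0.8$. In other words, the other terms in the model explain 80% of the variation in that variable.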

**Related post**: Understanding Interaction Effects

## Regression Model with Standardized Variables

Now, let’s fit the model again, but we’ll standardize the independent variables using the centering method.

Standardizing the variables has reduced the multicollinearity. All VIFs are less than 5. Furthermore, Condition is statistically significant in the model. Previously, multicollinearity was hiding the significance of that variable.
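You can reproduce this kind of before-and-after comparison yourself. The sketch below is not the analysis from this post: it uses synthetic data with two continuous predictors named `x1` and `x2` (my own placeholders), and a small hand-rolled `vif` helper that computes each VIF as 1 / (1 − R²) from a regression of one column on the others.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictor columns only, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    result = []
    for j in range(p):
        y = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        ss_res = resid @ resid
        ss_tot = (y - y.mean()) @ (y - y.mean())
        r2 = 1.0 - ss_res / ss_tot
        result.append(1.0 / (1.0 - r2))
    return np.array(result)

# Synthetic, independent predictors whose means are far from zero (an assumption).
rng = np.random.default_rng(1)
x1 = rng.normal(50, 5, size=200)
x2 = rng.normal(20, 3, size=200)

# Model terms built from the raw variables.
raw = np.column_stack([x1, x2, x1 * x2])

# Center first, then build the interaction from the centered variables.
x1c, x2c = x1 - x1.mean(), x2 - x2.mean()
centered = np.column_stack([x1c, x2c, x1c * x2c])

vif_raw = vif(raw)
vif_centered = vif(centered)
print("raw VIFs:     ", np.round(vif_raw, 1))
print("centered VIFs:", np.round(vif_centered, 2))
```

In this synthetic setup the raw VIFs are well above 5, while the centered VIFs sit close to 1. Your own data will give different numbers, but the pattern should be similar whenever the multicollinearity comes from the higher-order term itself.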

The coded coefficients table shows the coded (standardized) coefficients. My software converts the coded values back to the natural units in the Regression Equation in Uncoded Units. Interpret these values in the usual manner.
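To show what that conversion involves, here is the algebra for a model with two continuous predictors and one interaction, fit on centered variables. The symbols are my own notation, and this is a sketch of the general idea rather than a description of what any particular software package does internally. The centered fit is

$$
\hat{y} = b_0 + b_1 (x_1 - \bar{x}_1) + b_2 (x_2 - \bar{x}_2) + b_3 (x_1 - \bar{x}_1)(x_2 - \bar{x}_2)
$$

Expanding the products and collecting terms gives the same equation in the original units:

$$
\hat{y} = \left( b_0 - b_1 \bar{x}_1 - b_2 \bar{x}_2 + b_3 \bar{x}_1 \bar{x}_2 \right) + \left( b_1 - b_3 \bar{x}_2 \right) x_1 + \left( b_2 - b_3 \bar{x}_1 \right) x_2 + b_3 \, x_1 x_2
$$

The interaction coefficient $b_3$ is the same in both forms. The constant and the main-effect coefficients differ because, in the centered model, each main effect is evaluated at the mean of the other variable rather than at zero.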

Standardizing the independent variables produces vital benefits when your regression model includes interaction terms and polynomial terms. Always standardize your variables when the model has these terms. Keep in mind that it is enough to center the variables for a more straightforward interpretation. It’s an easy thing to do, and you can have more confidence in the results.

For more information about multicollinearity, plus another example of how standardizing the independent variables can help, read my post: Multicollinearity in Regression Analysis: Problems, Detection, and Solutions. The example in that post shows how multicollinearity can change the sign of a coefficient!

If you’re learning regression, check out my Regression Tutorial!

Gabriel Samuel says

Your blog is awesome. I’m grateful I got hooked at this point in my thesis write up. Thanks and keep up the good work.

Jim Frost says

Thank you so much, Gabriel! I’m so happy to hear that my blog has been helpful. Best of luck with your thesis!

Shaarang says

As an aspiring data scientist, I cannot overstate how helpful your setup has been. Thanks a ton.

Jim Frost says

You’re very welcome! It makes my day hearing how it has been helpful for you!

Luke says

Great article! Thanks for sharing. I do have a question regarding what you said here: “However, to remove multicollinearity caused by higher-order terms, I recommend only subtracting the mean and not dividing by the standard deviation. Subtracting the means is also known as centering the variables.” Would you elaborate on how dividing by the standard deviation after centering causes problems?

Jim Frost says

Hi Luke,

All I meant by that was that if you just center the variables, the interpretation of the coefficients doesn’t change from their normal interpretation that a coefficient indicates the mean change in the dependent variable given a one-unit change in the independent variable. However, if you also divide by the standard deviation, the interpretation of the coefficients changes. For that case, the coefficient represents the mean change in the DV for a 1 standard deviation change in the IV.
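A quick way to convince yourself is to fit the same simple regression three ways. This sketch uses synthetic data and NumPy’s `polyfit`; the numbers are invented for illustration.

```python
import numpy as np

# Synthetic data (an assumption): one continuous IV and a roughly linear DV.
rng = np.random.default_rng(2)
x = rng.normal(100, 15, size=300)
y = 3.0 + 0.5 * x + rng.normal(0, 2, size=300)

sd = x.std(ddof=1)

# np.polyfit returns coefficients from highest degree down, so [0] is the slope.
slope_raw = np.polyfit(x, y, 1)[0]
slope_centered = np.polyfit(x - x.mean(), y, 1)[0]
slope_z = np.polyfit((x - x.mean()) / sd, y, 1)[0]

print(f"raw slope:          {slope_raw:.4f}  (change in DV per 1 unit of IV)")
print(f"centered slope:     {slope_centered:.4f}  (same interpretation)")
print(f"standardized slope: {slope_z:.4f}  (change in DV per 1 SD of IV)")
print(f"raw slope * SD:     {slope_raw * sd:.4f}")
```

The raw and centered slopes match, while the standardized slope equals the raw slope multiplied by the standard deviation of the IV.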

I write about how standardizing your continuous IVs can be helpful in a post about How to Identify the Most Important Independent Variables in Your Model. You can read more about that approach in that post.

I hope this helps!

Douglas AMULI says

I found your post very helpful.

I have two questions about it:

– Is it a problem if one runs a regression model where some independent variables are standardized and others are not?

– Imagine a particular case of a MIMIC model with standardized causes but unstandardized indicators. Are the results negatively affected?

Thanks in advance for your reply.

Visar says

This was very useful. Thanks a lot and keep up the good work!

Jim Frost says

Thank you, Visar!

Karien says

Hi Jim,

Thank you for your posts. I have to do a statistical analysis for a project and I have never had to delve so deep into statistics before. Your plain English explanations really help a lot.

When dealing with interactions, do you first get the interactions between the variables and center them as well? Or do you center the independent variables and then get the interactions? Or, are the interactions from the original independent variables and then only the independent variables are centered? I am very confused about the order of things here. Also if you have more than one interaction that is significant, does it become another term in the regression equation?

Thanks,

Karien

Jim Frost says

Hi Karien,

I’m so glad to hear that my blog posts are helpful!

To answer your question, some of it depends on your statistical software. If it can do these things for you automatically, then you don’t have to worry about it.

However, if you need to do them manually, here’s the correct order.

1. Create a new column for each continuous independent variable you need to center.

2. Center the continuous variables in the new columns.

3. Create a new column for each interaction term.

4. Create the interaction term by multiplying the appropriate columns. Be sure to use the centered variables.

Again, many software packages can do some or all of these steps for you automatically. So, you might not need to worry, but do check the documentation for the software.
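If you’re doing this by hand in Python, the steps above can be sketched with pandas as follows. The column names `x1` and `x2` and the values are placeholders I made up, and I’m assuming both variables are continuous.

```python
import pandas as pd

# Placeholder data for two continuous independent variables.
df = pd.DataFrame({
    "x1": [10.0, 12.0, 15.0, 19.0, 24.0],
    "x2": [1.0, 3.0, 2.0, 5.0, 4.0],
})

# Steps 1 and 2: add a new column for each variable and center it.
for col in ["x1", "x2"]:
    df[col + "_c"] = df[col] - df[col].mean()

# Steps 3 and 4: add a column for the interaction, built from the CENTERED columns.
df["x1_x2_c"] = df["x1_c"] * df["x2_c"]

print(df)
```

You would then use `x1_c`, `x2_c`, and `x1_x2_c` as the terms in the model. The original columns stay in the table so you can still refer back to the raw values.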

I hope this helps!

Jim