Standardization is the process of putting different variables on the same scale. In regression analysis, there are scenarios where you need to standardize your independent variables or you risk obtaining misleading results.
In this blog post, I show when and why you need to standardize your variables in regression analysis. Don’t worry, this process is simple and helps ensure that you can trust your results. In fact, standardizing your variables can reveal essential findings that you would otherwise miss!
Why Standardize the Variables
In regression analysis, you need to standardize the independent variables when your model contains polynomial terms (which model curvature) or interaction terms. These terms provide crucial information about the relationships between the independent variables and the dependent variable, but they also generate high levels of multicollinearity.
Multicollinearity refers to independent variables that are correlated. This problem can obscure the statistical significance of model terms, produce imprecise coefficients, and make it more difficult to choose the correct model.
When you include polynomial and interaction terms, your model almost certainly has excessive amounts of multicollinearity. These higher-order terms multiply independent variables that are in the model. Consequently, it’s easy to see how these terms are correlated with other independent variables in the model.
When your model includes these types of terms, you are at risk of producing misleading results and missing statistically significant terms.
Fortunately, standardizing the independent variables is a simple way to reduce the multicollinearity that higher-order terms produce. Note, however, that it won’t work for other causes of multicollinearity.
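To make the idea concrete, here is a minimal sketch using made-up numbers (not the TestSlopes data). It compares the correlation between a predictor and its squared term before and after subtracting the mean. The values in x and the helper pearson are hypothetical and exist only for illustration.

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    ss_a = sum((u - ma) ** 2 for u in a)
    ss_b = sum((v - mb) ** 2 for v in b)
    return cov / (ss_a * ss_b) ** 0.5


# Made-up predictor values, all positive (an assumption for illustration).
x = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

# Raw squared term: it rises almost in lockstep with x.
r_raw = pearson(x, [v ** 2 for v in x])

# Centered squared term: subtract the mean first, then square.
x_bar = mean(x)
xc = [v - x_bar for v in x]
r_centered = pearson(xc, [v ** 2 for v in xc])

print(f"raw: {r_raw:.3f}, centered: {r_centered:.3f}")
```

Because these made-up values are symmetric around their mean, the centered correlation comes out at essentially zero. With real data, it typically shrinks rather than vanishing.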
Standardizing your independent variables can also help you determine which variable is the most important. Read how in my post: Identifying the Most Important Independent Variables in Regression Models.
How to Standardize the Variables
Standardizing variables is a simple process, and most statistical software can do it for you automatically. Usually, standardization refers to subtracting the mean and dividing by the standard deviation. However, to reduce the multicollinearity caused by higher-order terms, I recommend only subtracting the mean and not dividing by the standard deviation. Subtracting the means is also known as centering the variables.
Centering the variables and standardizing them will both reduce the multicollinearity. However, standardizing changes the interpretation of the coefficients. So, for this post, I’ll center the variables.
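The difference between the two methods can be sketched in a few lines. The functions center and standardize and the heights_cm values are hypothetical names and data chosen for illustration. The sketch assumes the sample standard deviation (n - 1 in the denominator), and your software may use a different convention.

```python
from statistics import mean, stdev


def center(values):
    """Subtract the mean; the variable keeps its original units."""
    m = mean(values)
    return [v - m for v in values]


def standardize(values):
    """Subtract the mean, then divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Made-up measurements (hypothetical).
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

centered = center(heights_cm)
z_scores = standardize(heights_cm)

print(centered)                          # [-20.0, -10.0, 0.0, 10.0, 20.0]
print([round(z, 3) for z in z_scores])   # [-1.265, -0.632, 0.0, 0.632, 1.265]
```

The centered values are still in centimeters, which is why the coefficients keep their natural units. The z-scores are in standard deviations.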
Interpreting the Results for Standardized Variables
Centering the independent variables is convenient because you can still interpret the regression coefficients in the usual way. Consequently, this approach is easy to use and produces results that are easy to interpret.
Let’s go through an example that illustrates the problems of higher-order terms and how centering the variables resolves them. You can try this example yourself using the CSV data file: TestSlopes.
Regression Model with Unstandardized Independent Variables
First, we’ll fit the model without centering the variables. Output is the dependent variable, and we’ll include Input, Condition, and the interaction term Input*Condition in the regression model. The results are below.
Using a significance level of 0.05, Input and Input*Condition are statistically significant while Condition is not. However, notice the VIF values. VIFs greater than 5 indicate that you have problematic levels of multicollinearity. Condition and the interaction term both have VIFs near 5.
Related post: Understanding Interaction Effects
Regression Model with Standardized Variables
Now, let’s fit the model again, but we’ll standardize the independent variables using the centering method.
Standardizing the variables has reduced the multicollinearity. All VIFs are less than 5. Furthermore, Condition is statistically significant in the model. Previously, multicollinearity was hiding the significance of that variable.
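If you want to check VIFs yourself, here is a rough sketch with NumPy. It uses simulated data, not the TestSlopes file, and two continuous predictors x1 and x2 that merely stand in for the variables in an interaction model. The vif function is a hypothetical helper that applies the textbook definition, 1 / (1 - R-squared), where R-squared comes from regressing each predictor on the others.

```python
import numpy as np


def vif(X):
    """VIF for each column of X: 1 / (1 - R^2), where R^2 comes from
    regressing that column on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)


# Simulated predictors (an assumption; not the data from this post).
rng = np.random.default_rng(0)
x1 = rng.uniform(10, 20, size=200)
x2 = rng.uniform(50, 60, size=200)

# Model terms without centering: x1, x2, and their product.
raw = np.column_stack([x1, x2, x1 * x2])

# Same terms after centering each predictor before multiplying.
x1c, x2c = x1 - x1.mean(), x2 - x2.mean()
centered = np.column_stack([x1c, x2c, x1c * x2c])

print("raw VIFs:     ", np.round(vif(raw), 1))
print("centered VIFs:", np.round(vif(centered), 1))
```

In this simulation, the raw VIFs are far above 5 while the centered VIFs sit close to 1. The exact numbers depend on the simulated data, so treat them as an illustration rather than a reproduction of the output above.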
The coded coefficients table shows the coded (standardized) coefficients. My software converts the coded values back to the natural units in the Regression Equation in Uncoded Units. Interpret these values in the usual manner.
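To see why the conversion back to natural units is possible, here is a sketch under the assumption that the coding is plain centering. Your software may code variables differently. It fits a model on centered predictors, expands the products algebraically, and compares the result with a direct fit on the uncentered predictors. The data and the names x1, x2, and fit are hypothetical.

```python
import numpy as np

# Simulated data (an assumption; not the data from this post).
rng = np.random.default_rng(1)
n = 100
x1 = rng.uniform(10, 20, size=n)
x2 = rng.uniform(50, 60, size=n)
y = 3.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2 + rng.normal(0, 1, size=n)


def fit(a, b, y):
    """Least-squares fit of y on an intercept, a, b, and a*b."""
    X = np.column_stack([np.ones(len(y)), a, b, a * b])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Fit on centered predictors.
m1, m2 = x1.mean(), x2.mean()
b0, b1, b2, b3 = fit(x1 - m1, x2 - m2, y)

# Expand b0 + b1*(x1 - m1) + b2*(x2 - m2) + b3*(x1 - m1)*(x2 - m2)
# and collect the terms for 1, x1, x2, and x1*x2.
uncoded = np.array([
    b0 - b1 * m1 - b2 * m2 + b3 * m1 * m2,  # intercept
    b1 - b3 * m2,                           # x1
    b2 - b3 * m1,                           # x2
    b3,                                     # x1*x2
])

# Fit directly on the uncentered predictors for comparison.
direct = fit(x1, x2, y)

print(np.round(uncoded, 3))
print(np.round(direct, 3))
```

Both lines should match up to rounding. Centering changes how the model is parameterized, not the fitted equation, which is why the equation in uncoded units can be read in the usual manner.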
Standardizing the independent variables produces vital benefits when your regression model includes interaction terms and polynomial terms. Always standardize your variables when the model has these terms. Keep in mind that it is enough to center the variables for a more straightforward interpretation. It’s an easy thing to do, and you can have more confidence in the results.
For more information about multicollinearity, plus another example of how standardizing the independent variables can help, read my post: Multicollinearity in Regression Analysis: Problems, Detection, and Solutions. The example in that post shows how multicollinearity can change the sign of a coefficient!
If you’re learning regression, check out my Regression Tutorial!