Standardization is the process of putting different variables on the same scale. In regression analysis, there are some scenarios where it is crucial to standardize your independent variables or risk obtaining misleading results.

In this blog post, I show when and why you need to standardize your variables in regression analysis. Don’t worry, this process is simple and helps ensure that you can trust your results. In fact, standardizing your variables can reveal essential findings that you would otherwise miss!

## Why Standardize the Variables

In regression analysis, you need to standardize the independent variables when your model contains polynomial terms to model curvature or interaction terms. These terms provide crucial information about the relationships between the independent variables and the dependent variable, but they also generate high amounts of multicollinearity.

Multicollinearity refers to independent variables that are correlated. This problem can obscure the statistical significance of model terms, produce imprecise coefficients, and make it more difficult to choose the correct model.

When you include polynomial and interaction terms, your model almost certainly has excessive amounts of multicollinearity. These higher-order terms are products of independent variables already in the model, so it’s easy to see why they are correlated with those variables.

When your model includes these types of terms, you are at risk of producing misleading results and missing statistically significant terms.

Fortunately, standardizing the independent variables is a simple method to reduce the multicollinearity produced by higher-order terms. Note, however, that it won’t work for other causes of multicollinearity.

Standardizing your independent variables can also help you determine which variable is the most important. Read how in my post: Identifying the Most Important Independent Variables in Regression Models.

## How to Standardize the Variables

Standardizing variables is a simple process. Most statistical software can do this for you automatically. Usually, standardization refers to the process of subtracting the mean and dividing by the standard deviation. However, to remove multicollinearity caused by higher-order terms, I recommend only subtracting the mean and **not** dividing by the standard deviation. Subtracting the means is also known as centering the variables.

Centering the variables and standardizing them will both reduce the multicollinearity. However, standardizing changes the interpretation of the coefficients. So, for this post, I’ll center the variables.
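The effect is easy to demonstrate. Here’s a minimal sketch (Python with NumPy and simulated data, so the variable names and numbers are illustrative, not taken from this post’s example) that computes VIFs for two predictors and their interaction term, before and after centering:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(50, 10, n)   # predictor with a mean far from zero
x2 = rng.normal(30, 5, n)

def vif(X):
    """VIF for each column: regress it on the other columns, VIF = 1 / (1 - R^2)."""
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])   # intercept + other IVs
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

# Raw IVs plus their interaction: the product closely tracks x1 and x2.
X_raw = np.column_stack([x1, x2, x1 * x2])

# Centered IVs plus the interaction built from the centered columns.
c1, c2 = x1 - x1.mean(), x2 - x2.mean()
X_ctr = np.column_stack([c1, c2, c1 * c2])

print("VIFs (raw):     ", [round(v, 1) for v in vif(X_raw)])
print("VIFs (centered):", [round(v, 2) for v in vif(X_ctr)])
```

With means far from zero, the interaction term is highly correlated with its components, so the raw VIFs are large. After centering, all VIFs drop to near 1, even though the model fits the same relationship.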

## Interpreting the Results for Standardized Variables

Centering the independent variables is very convenient because you can interpret the regression coefficients in the usual way. Consequently, this approach is easy to use and produces results that are easy to interpret.

Let’s go through an example that illustrates the problems of higher-order terms and how centering the variables resolves them. You can try this example yourself using the CSV data file: TestSlopes.

## Regression Model with Unstandardized Independent Variables

First, we’ll fit the model without centering the variables. Output is the dependent variable. And, we’ll include Input, Condition, and the interaction term Input*Condition in the regression model. The results are below.

Using a significance level of 0.05, Input and Input*Condition are statistically significant while Condition is not. However, notice the VIF values. VIFs greater than 5 indicate that you have problematic levels of multicollinearity. Condition and the interaction term both have VIFs near 5.

**Related post**: Understanding Interaction Effects

## Regression Model with Standardized Variables

Now, let’s fit the model again, but we’ll standardize the independent variables using the centering method.

Standardizing the variables has reduced the multicollinearity. All VIFs are less than 5. Furthermore, Condition is statistically significant in the model. Previously, multicollinearity was hiding the significance of that variable.

The coded coefficients table shows the coded (standardized) coefficients. My software converts the coded values back to the natural units in the Regression Equation in Uncoded Units. Interpret these values in the usual manner.

Standardizing the independent variables produces vital benefits when your regression model includes interaction terms and polynomial terms. Always standardize your variables when the model has these terms. Keep in mind that it is enough to center the variables for a more straightforward interpretation. It’s an easy thing to do, and you can have more confidence in the results.

For more information about multicollinearity, plus another example of how standardizing the independent variables can help, read my post: Multicollinearity in Regression Analysis: Problems, Detection, and Solutions. The example in that post shows how multicollinearity can change the sign of a coefficient!

If you’re learning regression, check out my Regression Tutorial!

Richard says

Hi Jim,

Thank you for your blog. Would you please explain why the center method will reduce VIF?

f.g. there are three explanatory Variables: a, b, c

VIF of a is 1/(1-R2), where R2 comes from lm(a ~ b + c).

If I center a, b, c, they become a_center, b_center, c_center

VIF of a_center is 1/(1-R2_center), where R2_center comes from lm(a_center ~ b_center + c_center).

My current understanding is that centering will not change the relationship between variables. So lm(a ~ b + c) and

lm(a_center ~ b_center + c_center) should output the same result. Thus R2 is the same as R2_center.

I used my data to test whether they are the same, they are sometimes the same, sometimes not.

Would you help me understand?

Thank you

Ronan Murphy says

Hi Jim,

Thanks for creating this statistical resource. It’s invaluable.

I have run into a roadblock with something that I think should be easy: how do I undo standardisation?

I created a Python Jupyter notebook to fit a line to a set of (x,y) samples by minimising the sum of squared errors (SSE) in order to deepen my understanding of linear regression.

For (x,y) sample data I used set 1 of Anscombe’s Quartet. The x values range from 4 to 13 and the y values range from 4.26 to 10.84. Before running the loop to minimise the SSE I standardised the x values using the formula Xnew = (Xorig - Xmin)/(Xmax - Xmin). Similarly for the y values.

Running the loop gives values for m and c. Once I got to an acceptable level of error I plotted the “fitting” line using y=mx+c. I got a straight line that fits the standardised points.

I’d like to reverse the standardisation and plot the line against the original (x, y) samples. The slope m would not be affected by this reversal but c would be.

I tried the obvious reversal, namely Yorig =(Ynew * (Ymax-Ymin))+Ymin. This gave me a line which was above the set of points.

Do you know how I reverse the standardisation so that my c is correct?

Your help would be much appreciated.

Thank you.

John Grenci says

Jim, I am still not getting my question answered. maybe I have lost it. 🙁 perhaps if I give the specific example, you can see what I am asking. I have 4 0-1 IV’s. I left out variable 4. (called it rank4). when the regression is run, the results are below.

```
##               Estimate Std. Error    z value     Pr(>|z|)
## (Intercept) -1.5224265  0.3186144 -4.7782726 1.768076e-06
## rank3        0.3220316  0.3846844  0.8371319 4.025184e-01
## rank2        0.9366996  0.3610304  2.5945174 9.472383e-03
## rank1        1.6867296  0.4093073  4.1209370 3.773346e-05
```

if we take e to the -1.52 (the intercept) power and divide by that number plus 1 (I will call that expression #), it should translate back to the percent of times the dependent variable is 1 when rank4 is 1. that is EXACTLY what we get, in this case, 17.9%. but if I run it with NOMINAL IV’s I get a totally DIFFERENT SET of coefficients, and I don’t see how they can be ‘equally’ correct. because certainly, the above expression # does not hold if I get an intercept that is something other than the intercept above. this tells me that if I got the correct interpretations when the IV’s were CONTINUOUS (which they had to be in this case), then they must always be continuous, or else I will not be able to at the very least correctly interpret the intercept (we will leave interpreting other variables for another discussion). I don’t see where it has to do with how the software handles data unless it is a case by case basis depending upon the type of IV……..i.e. when continuous, interpret by the expression #, and if nominal, we use an entirely different expression. I don’t think that is the case. does my question make sense? this is the fourth forum I am at, and if you cannot answer it, then I may have to move on. if my question does not make sense, please say so, and there is no need for you to further answer. thanks. sorry, extremely frustrated.

Jim Frost says

Hi,

I haven’t had time to look through the model in depth. But, it’s not completely surprising to obtain different results for nominal versus numeric. When it’s a nominal variable, you’re depending on an algorithm deciding how to code the different levels. You’d expect different coding to produce different coefficients. It’s possible that your software is coding 1s as 0s and 0s as 1s. That would be weird but it’s a possibility. It might be getting confused by the fact that you’re asking it to treat binary data as nominal.

Additionally, there’s another possibility at play. If you have a categorical variable with three levels, you need two indicators for two of the levels, and leave one baseline level out. If you have four levels, you’ll have three indicator variables, and so on. I’m not sure if the variables in your model are related to a categorical variable in that manner, but each software package has an algorithm that determines which level it leaves out by default. The choice of the baseline level also affects the coefficients. So, I can see multiple reasons why the coefficients could change. Although I see you left out rank4 intentionally, so maybe it’s not that.

As a general rule, when you change the coding scheme and/or baseline level, the different sets of coefficients tell the same story but from a different perspective. To interpret them correctly, you need to know how the indicator variables are coded and, if applicable, which categorical levels are left out to be the baseline.

As for exactly what’s going on, I can’t tell. I don’t know what your software is doing. There is a bit of mystery there. You’d think that if the data were already 1s and 0s and you told the software that they are nominal data, it would recode them as 1s and 0s consistently. So, I know I haven’t answered your question exactly, other than showing several ways it could’ve happened in general. I suspect you’d need to dig into the documentation to determine what is happening. It’s not something obvious that comes to mind.

I’ve written a post about interaction effects and for a while I was confused. Readers said that they got different coefficients for interactions with categorical variables. Eventually, after enough communication, we realized that their software and my software coded the categorical variables differently and/or used a different baseline value. The result was that the coefficients were different but the p-values were the same. That’s what led me down that path with your results. Except that in your case it’s not different software. But, my thought was that somehow the coding worked weirdly on data that was already binary. If it’s not that or the baseline issue, I’m not really sure!

One other thing to check is to make sure it’s using the same link function too. I don’t know if that would change, but it’s something to consider. Again, that would be weird, but I’m grasping at straws!

John Grenc says

Hello again, Jim. I did order the ebook just now. looking forward to it. however, I am still befuddled here. does this not suggest that it is of ultra importance which way you assign your independent 0-1 variables? but here is my point. what if I run a logistic regression, and get certain coefficients? How can I know that the interpretation is correct if I have just witnessed two completely different interpretations (depending upon the data type) of the same data and model? a further question would be how it is that a 0-1 variable defined as CONTINUOUS would bear the correct coefficients and NOT what everyone would presume to be the correct ones (having defined them as nominal). I say correct because when defining as continuous we get an intercept that completely reflects the actual data for the missing 0-1 variable (there are four of them and I ran 3 in the model). do you see what I am asking? or maybe I don’t understand something about logistic regression (or regression in general) that others do? thanks John

Jim Frost says

Hi again,

Yes, you’re right. You must know how the variables are encoded to properly interpret them. The results are statistically correct either way. But, knowing what a 1 indicates allows you to interpret the results correctly. For example, if your IV is gender, a 1 could indicate either male or female, and 0s represent the other gender. Either way, you obtain meaningful results. However, to interpret the results correctly, you must know what a 1 represents. Consequently, changing the coding scheme does change the coefficients. Logically, whether 1 represents males or females changes the coefficient. However, the gender variable itself has the same significance either way. You’ll read about this in my book, although not in the logistic regression context. But, the same ideas apply.

For continuous, I wasn’t clear, but I was referring to the DV. It would be better to classify indicator variables as numeric.

I’m not sure how your software codes nominal variables so I’m unsure of what is happening. But, you can be sure that it is recoding nominal variables into indicator variables using an algorithm. If you already have a binary variable that uses the numeric values of 1 and 0, you don’t need to recode it. It’s already in the correct format that it can use to fit the model.

Telling your software that a variable is nominal tells it that the values in your data sheet, whether they’re numbers or text, represent group membership. The software can’t use group membership to fit models without recoding them as indicator variables. It depends on the nature of the data. Again, refer back to my example where you use the values 1, 2, and 3 to represent three groups. You could use those numeric values to fit the model. However, the results would be incorrect. The software would assume that a value of three indicates three times some measurement compared to a value of 1. However, that’s incorrect because these are separate groups and there is no distance between them.

I don’t entirely understand what you’re asking or how your software handles the data. But, it comes down to representing the data accurately for your software (groups versus numeric), and then understanding how it handles nominal data. In your case, it’s already entered as indicator variables, so you don’t need to worry about how it recodes it. Using it as numeric data is fine.

John Grenci says

hello, Jim. I have been on this site before, and really appreciate your helpfulness. I had a question, and I was not sure where it fit in one of the articles, and this looked as good as any, so apologies if it is not in the right place. it has to do with logistic regression.

I found an article online that ran a logistic regression and I was trying to duplicate values by running the same set of data, and my coefficients were different. there is one 0-1 dependent variable, and there are four 0-1 independent variables. I found out the reason I was getting different coefficients was that I had run it with the independent variables as NOMINAL, and the person whose data I was trying to duplicate had run them as continuous. his interpretation was actually correct as verified by the interpretation of the intercept along with the actual data (i.e. leaving out variable rank4 as one of the variables gave an interpretation of 79% “ones” when rank4 was set to one, and that is exactly what the data reflected). so, I had two questions. one is why it mattered? what is the math behind it? but further, that seems to suggest that in ALL cases one should run independent binary variables as continuous. I can provide the data, or more so the output of coefficients, if you like. thanks John

Jim Frost says

Hi John,

I don’t have many posts on logistic regression. I should fix that! So, no worries! I do have one here but it doesn’t address your question.

Actually, your question isn’t specific to logistic regression. It applies to least squares regression for continuous data as well.

In regression analysis, you can include categorical (nominal) variables. However, the software needs to convert them to indicator variables. Indicator variables are simply binary variables that indicate the presence or absence of a condition, characteristic, or event. Now, if your data is already coded as binary indicator variables, there’s no need to recode them and you can just use them as they are. The way it works mathematically is that the procedure estimates the coefficient for each indicator variable. When the characteristic is present, you multiply the coefficient by one, which means you just add the coefficient into the fitted value. However, when the characteristic is absent, you multiply the coefficient by zero, so that term contributes nothing to the fitted value. I go into more detail about this process and its implications in my ebook about regression.

As for why it makes a difference, I’m not sure. I’m picturing that you have a column of 1s and 0s. By telling your software it’s a nominal variable, it’ll recode a column of 1s and 0s into a column of 1s and 0s. Perhaps it’s recoding 1s into 0s and vice versa? I’m not sure. You’ll need to check your software documentation to determine how it recodes nominal data. Usually, there’s a scheme, such as the higher value being recoded as 1s and the lower value as 0s, or the other way around. Either way can produce valid results, but you need to know what it considers to be an “event” to interpret it correctly. You should be able to change the coding scheme too.

The real value of running a variable as nominal data is when you have text values, such as Pass and Fail. You’ll define one of those as the “event,” which then gets coded as the 1s. The OLS procedure can’t understand text, so that needs to be converted to indicator variables. Or, suppose you have three groups of data and they’re entered as groups: 1, 2, and 3. Treating this as numeric would be a mistake. The software would think that a value of 3 represents three times the amount of something compared to a value of 1. In reality, they’re just different groups where the numbers don’t represent anything other than group membership. In that case, these three values would be recoded into two columns of indicator variables that indicate group membership. Again, my ebook covers this in much more detail.
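To illustrate that three-group recoding, here’s a minimal sketch (Python with NumPy; the group labels are made up for illustration). It turns a column of group labels into two indicator columns, with group 1 as the baseline:

```python
import numpy as np

# Group membership entered as the labels 1, 2, and 3 (not amounts of anything).
group = np.array([1, 2, 3, 2, 1, 3, 3, 1])

# Recode into two indicator columns, leaving group 1 out as the baseline.
ind2 = (group == 2).astype(float)   # 1 when the row belongs to group 2
ind3 = (group == 3).astype(float)   # 1 when the row belongs to group 3

# Baseline rows (group 1) are all zeros; every other row has exactly one 1.
print(np.column_stack([ind2, ind3]))
```

These two indicator columns, not the raw 1/2/3 labels, are what the regression procedure actually uses.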

So, yes, if your data are already entered as binary 1s and 0s, run it as numeric! No recoding is necessary!

I hope this helps!

Raghad says

Thanks a lot

that was so helpful

Jim Frost says

You’re very welcome. Best of luck with your analysis!

Because your study uses regression analysis, you might consider my ebook about regression. It covers this material in more depth.

Raghad says

Many Many thanks for the valuable information.

I just have a question…

If I have collinearity just between two independent variables, do I have to standardize one of them, or standardize all independent variables in the model?

Jim Frost says

Hi Raghad,

Standardizing your IVs only reduces multicollinearity for interaction terms and polynomials. It won’t help reduce multicollinearity that exists between two IVs. To address that type of multicollinearity, read my post on that topic for some potential remedies.

In terms of standardizing variables for multicollinearity due to interaction terms and polynomials, strictly speaking, you only need to standardize the variables that are included in those terms. You don’t need to standardize variables that are not included in those terms. However, analysts will frequently standardize all of the continuous variables for consistency.

Indeed, when you standardize all the continuous IVs, the constant takes on a special interpretation: it represents the average DV value when all the continuous IVs are at their mean values. If you have a mix of standardized and non-standardized IVs, it complicates the interpretation of the constant.

Additionally, if you’re making predictions and entering values into the equation, you’ll need to remember which IVs to enter as scores relative to their mean versus which scores to enter in their raw form.

Consequently, it’s often easier and more useful to standardize all of your IVs if you decide that you need to standardize any of them. However, again, standardizing won’t help reduce multicollinearity between two IVs. It sounds like you’ll need to consider other remedies.

jagriti khanna says

Thank you so much Jim.

I’m so fortunate to have come across these websites of yours. You have compiled everything at a single platform in a very easy to understand way. I thoroughly went across all these sites and it helped me a lot for my project. I can’t thank you enough 🙂

Jim Frost says

You’re very welcome, Jagriti! I’m glad my website was helpful!

jagriti khanna says

Hi JIm

I think I wasn’t able to explain myself clearly.

Basically, I just want to know how a change in the value of x for a multivariate curvilinear regression will affect the value of Y, since now we don’t have coefficients directly as slopes because of the polynomial nature of each variable

Jim Frost says

Hi Jagriti,

Interpreting coefficients when you have polynomial terms is always more difficult than when you only have straight-line effects. Standardizing your variables doesn’t really make it more difficult. I always start out with graphs to understand visually the nature of the relationship between two variables.

In terms of graphing, when you have only one independent variable, I highly recommend using fitted line plots. These fantastic plots show the curved fit (when you have polynomials) along with your data points, which helps you determine how well the model fits the data. You can see examples in my posts about coefficients and p-values and modeling curvature.

When you have more than one independent variable, you need to use a main effects plot. These plots display the relationship between one IV and the DV, while holding the other IVs constant. I don’t have a blog post to point you towards, but I cover these plots and this particular use for them in my ebook about regression analysis.

In terms of predicting values, normally you’d just enter the values of your IVs into the equation and calculate the fitted result. However, because you standardized your variables, you’ll need to adjust the values that you enter into the equation. All you need to do is subtract the mean of each IV from the value you want to enter into the equation. For example, if the mean of the IV is 15 and you want to predict the mean DV for X, you’d enter X-15 into the equation. Although, some software packages will do this for you automatically. Check the documentation.
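That adjustment can be sketched in a few lines (Python with NumPy; the data are simulated and the numbers, including the mean of about 15, are made up to match the example above):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(15, 3, 100)                 # IV with a mean of about 15
y = 2.0 * x + rng.normal(0, 0.5, 100)      # DV: true slope of 2 per unit of x

# Fit the model on the centered IV.
x_bar = x.mean()
A = np.column_stack([np.ones_like(x), x - x_bar])
b0, b1 = np.linalg.lstsq(A, y, rcond=None)[0]

# To predict for a new raw value, subtract the SAME mean used for centering.
x_new = 20.0
y_hat = b0 + b1 * (x_new - x_bar)
print(round(y_hat, 1))   # close to 2 * 20 = 40
```

The key point is that the stored training mean, not the mean of any new data, is what you subtract from each new value before plugging it into the equation.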

Best of luck with your analysis!

jagriti khanna says

Hi Jim

Can you please tell me, after I standardized my x matrix, how will I standardize the future values of x which I want to use for prediction, if I want to study cases like: if I increase one of my x by 5%, what will be the change in my y (which is slope)? And also tell me how to see the actual polynomial combined equation of all the variables after I’ve done p-value and R-sq analysis, checked the residuals, and done the final multivariate linear regression, since now my coefficients won’t directly determine the slope as they are part of polynomial equations.

Thanks in advance

Gabriel Samuel says

Your blog is awesome. I’m grateful I got hooked at this point in my thesis write up. Thanks and keep up the good work.

Jim Frost says

Thank you so much, Gabriel! I’m so happy to hear that my blog has been helpful. Best of luck with your thesis!

Shaarang says

As an aspiring data scientist, I cannot overstate how helpful your setup has been. Thanks a ton

Jim Frost says

You’re very welcome! It makes my day hearing how it has been helpful for you!

Luke says

Great article! Thanks for sharing. I do have a question regarding what you said here “However, to remove multicollinearity caused by higher-order terms, I recommend only subtracting the mean and not dividing by the standard deviation. Subtracting the means is also known as centering the variables”, would you elaborate how will it cause problems by dividing the standard deviation after centering?

Jim Frost says

Hi Luke,

All I meant by that was that if you just center the variables, the interpretation of the coefficients doesn’t change from their normal interpretation that a coefficient indicates the mean change in the dependent variable given a one-unit change in the independent variable. However, if you also divide by the standard deviation, the interpretation of the coefficients changes. For that case, the coefficient represents the mean change in the DV for a 1 standard deviation change in the IV.

I write about how standardizing your continuous IVs can be helpful in a post about How to Identify the Most Important Independent Variables in Your Model. You can read more about that approach in that post.

I hope this helps!

Douglas AMULI says

I found your post very helpful.

Concerning it i have two questions:

– Is it a problem if one runs a regression model where some independent variables are standardized and others are not?

– Imagine a particular case of a mimic model with standardized causes but not standardized indicators. Are results negatively affected?

Thanks in advance for your reply.

Visar says

This was very useful. Thanks a lot and keep up the good work!

Jim Frost says

Thank you, Visar!

Karien says

Hi Jim,

Thank you for your posts. I have to do a statistical analysis for a project and I have never had to delve so deep into statistics before. Your plain English explanations really help a lot.

When dealing with interactions, do you first get the interactions between the variables and center them as well? Or do you center the independent variables and then get the interactions? Or, are the interactions from the original independent variables and then only the independent variables are centered? I am very confused about the order of things here. Also if you have more than one interaction that is significant, does it become another term in the regression equation?

Thanks,

Karien

Jim Frost says

Hi Karien,

I’m so glad to hear that my blog posts are helpful!

To answer your question, some of it depends on your statistical software. If it can do these things for you automatically, then you don’t have to worry about it.

However, if you need to do them manually, here’s the correct order.

1. Create a new column for each continuous independent variable you need to center.

2. Center the continuous variables in the new columns.

3. Create a new column for each interaction term.

4. Create the interaction term by multiplying the appropriate columns. Be sure to use the centered variables.

Again, many software packages can do some or all of these steps for you automatically. So, you might not need to worry, but do check the documentation for the software.
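The four steps above can be sketched as follows (Python with NumPy; the column names come from this post’s example but the values are made up for illustration):

```python
import numpy as np

# Toy data sheet: two continuous IVs, entered as columns.
input_col = np.array([12.0, 15.0, 9.0, 18.0, 14.0])
condition = np.array([3.0, 5.0, 2.0, 6.0, 4.0])

# Steps 1-2: new columns holding the centered variables.
input_c = input_col - input_col.mean()
condition_c = condition - condition.mean()

# Steps 3-4: the interaction column, multiplying the *centered* columns.
interaction_c = input_c * condition_c

# These columns are the IVs you'd feed into the regression.
X = np.column_stack([input_c, condition_c, interaction_c])
print(X.round(2))
```

Note that the interaction column is built from the centered columns, not the raw ones; building it from the raw columns would leave the multicollinearity in place.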

I hope this helps!

Jim