Standardization is the process of putting different variables on the same scale. In regression analysis, there are some scenarios where it is crucial to standardize your independent variables or risk obtaining misleading results.
In this blog post, I show when and why you need to standardize your variables in regression analysis. Don’t worry, this process is simple and helps ensure that you can trust your results. In fact, standardizing your variables can reveal essential findings that you would otherwise miss!
Why Standardize the Variables
In regression analysis, you need to standardize the independent variables when your model contains polynomial terms to model curvature or interaction terms. These terms provide crucial information about the relationships between the independent variables and the dependent variable, but they also generate high amounts of multicollinearity.
Multicollinearity refers to independent variables that are correlated. This problem can obscure the statistical significance of model terms, produce imprecise coefficients, and make it more difficult to choose the correct model.
When you include polynomial and interaction terms, your model almost certainly has excessive amounts of multicollinearity. These higher-order terms are products of independent variables that are already in the model, so it’s easy to see how they end up correlated with those other independent variables.
When your model includes these types of terms, you are at risk of producing misleading results and missing statistically significant terms.
Fortunately, standardizing the independent variables is a simple way to reduce the multicollinearity that higher-order terms produce. However, it won’t work for other causes of multicollinearity.
Standardizing your independent variables can also help you determine which variable is the most important. Read how in my post: Identifying the Most Important Independent Variables in Regression Models.
How to Standardize the Variables
Standardizing variables is a simple process. Most statistical software can do this for you automatically. Usually, standardization refers to the process of subtracting the mean and dividing by the standard deviation. However, to remove multicollinearity caused by higher-order terms, I recommend only subtracting the mean and not dividing by the standard deviation. Subtracting the means is also known as centering the variables.
Centering the variables and standardizing them will both reduce the multicollinearity. However, standardizing changes the interpretation of the coefficients. So, for this post, I’ll center the variables.
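If you want to do the calculation yourself in code, here is a minimal sketch in Python with pandas; the column name X and the values are made up purely for illustration.

import pandas as pd

df = pd.DataFrame({"X": [3.0, 5.0, 8.0, 10.0, 14.0]})  # made-up data

# Centering: subtract the mean only. Coefficients keep their usual
# "mean change per one-unit increase" interpretation.
df["X_centered"] = df["X"] - df["X"].mean()

# Standardizing: subtract the mean AND divide by the standard deviation.
# Coefficients then represent the change per one standard deviation.
df["X_standardized"] = (df["X"] - df["X"].mean()) / df["X"].std()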
Interpreting the Results for Standardized Variables
When you center the independent variables, you can still interpret the regression coefficients in the usual way. Consequently, this approach is easy to use and produces results that are easy to interpret.
Let’s go through an example that illustrates the problems of higher-order terms and how centering the variables resolves them. You can try this example yourself using the CSV data file: TestSlopes.
Regression Model with Unstandardized Independent Variables
First, we’ll fit the model without centering the variables. Output is the dependent variable, and we’ll include Input, Condition, and the interaction term Input*Condition in the regression model. The results are below.
Using a significance level of 0.05, Input and Input*Condition are statistically significant while Condition is not. However, notice the VIF values. VIFs greater than 5 indicate that you have problematic levels of multicollinearity. Condition and the interaction term both have VIFs near 5.
Related post: Understanding Interaction Effects
Regression Model with Standardized Variables
Now, let’s fit the model again, but we’ll standardize the independent variables using the centering method.
Standardizing the variables has reduced the multicollinearity. All VIFs are less than 5. Furthermore, Condition is statistically significant in the model. Previously, multicollinearity was hiding the significance of that variable.
The coded coefficients table shows the coded (standardized) coefficients. My software converts the coded values back to the natural units in the Regression Equation in Uncoded Units. Interpret these values in the usual manner.
Standardizing the independent variables produces vital benefits when your regression model includes interaction terms or polynomial terms. Always standardize your variables when the model has these terms. Keep in mind that centering alone is enough and gives you a more straightforward interpretation. It’s an easy step, and it lets you have more confidence in the results.
For more information about multicollinearity, plus another example of how standardizing the independent variables can help, read my post: Multicollinearity in Regression Analysis: Problems, Detection, and Solutions. The example in that post shows how multicollinearity can change the sign of a coefficient!
If you’re learning regression, check out my Regression Tutorial!
Kathryn says
Thank you for your reply! I see now, I was not paying attention to the fact that Condition is a binary variable. So, the final regression coefficients calculated with centered independent variables can be interpreted “in the usual way” as you said, meaning that the coefficient is the difference in the dependent variable for each one-unit change in the independent variable, assuming all other independent variables are held constant. If the regression equation is then used for prediction, does that mean that new data can be directly plugged into the equation? Or does it also have to be centered (maybe using the mean that was used to center the original data)?
Jim Frost says
Hi Kathryn,
That’s right, just interpret them the usual way with the usual meaning as you wrote it.
If you fit a model using centered data and then want to use it for prediction, you’ll need to enter the data as centered values. Let’s say your X variable has a mean of 10 and you want to predict for a raw data value of 15. You need to subtract the mean from the raw value. So, to predict for an x-value of 15, you’d need to enter 5. The same for all the other continuous variables that you centered when fitting the model.
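Here’s a rough sketch of that adjustment in Python. The intercept and slope are made-up placeholders; only the mean of 10 and the raw value of 15 come from the example above.

# Hypothetical centered-model equation: y = b0 + b1 * (x - x_mean)
b0, b1 = 50.0, 2.0   # placeholder coefficients for illustration
x_mean = 10.0        # mean of X from the data used to fit the model

def predict(x_raw):
    # Center the new observation with the SAME mean used during fitting,
    # then plug it into the centered-model equation.
    return b0 + b1 * (x_raw - x_mean)

print(predict(15))   # equivalent to entering 5 into the centered equation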
Kathryn says
Hi Jim, thank you for this informative post! I have a question about how the coefficients of the final regression equation are converted back to the natural units. After the independent variables have been centered, the polynomial or interaction terms have been created, and the model has been fit, the regression coefficients are based on the centered data. But then, you mention “My software converts the coded values back to the natural units in the Regression Equation in Uncoded Units.” How does it do this? I’m using python and I don’t see a built-in method for using centered independent variables and then converting the coefficients back to natural units. Thank you!
Jim Frost says
Hi Kathryn,
First, I need to make a distinction between centered and standardized data. Centered refers to only subtracting each variable’s mean from the variable’s values. Standardization refers to centering followed by dividing the values by the standard deviation.
The good news is that if you’re centering your variables, the continuous coefficients remain in the natural units and you can interpret them normally.
However, in this example, my software is doing a little extra by creating the equations for each of the two conditions. If it hadn’t done that, you could interpret all the coefficients in the normal manner except for the constant. In that model, the constant now represents the mean dependent value when all the continuous variables are at their mean and any categorical variables are at their baseline values.
That’s not true when you actually standardize because the continuous coefficients represent the mean change in the DV given a 1 standard deviation change in the IV.
If you only want to reduce structural multicollinearity, I recommend centering rather than standardizing. It keeps the interpretation simple.
As for the transformation from coded to uncoded units in this example, it’s really not so much a transformation but splitting the one equation into two for each condition. It’s more of a transformation process when you use standardization or transform the DV, but the software uses the same terminology across those various conditions. Here’s how it did the conversion for this example:
When you have an interaction between the binary variable “Condition” and the continuous variable “Input” in your regression model, converting the coefficients back to their natural units affects both the intercept and the slope of “Input”. Here’s how the coefficients adjust, where b0 = the coded constant, b1 = the coded coefficient for Input, b2 = the coded coefficient for Condition B, and b3 = the coded interaction term coefficient.
Intercept for Condition = A: The adjusted intercept is calculated as b0′ = b0 - b1 * X_mean. This removes the effect of the mean subtraction during centering and gives the value of Output when Input equals 0 for Condition = A.
Intercept for Condition = B: When Condition = B, the intercept includes both the condition effect and the interaction adjustment: b0 + b2 - (b1 + b3) * X_mean. This accounts for changes in the intercept and slope due to Condition B and the interaction.
Slope for Input when Condition = A: The slope remains b1, indicating the effect of Input when Condition = A.
Slope for Input when Condition = B: For Condition B, the slope is adjusted to b1 + b3, reflecting the combined effect of Input and its interaction with Condition B.
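To make that concrete, here is a small Python sketch of the back-conversion using the b0-b3 names defined above. The numeric values are placeholders, not the coefficients from this post’s output.

# Coded (centered-model) coefficients -- placeholder numbers for illustration
b0 = 5.0       # coded constant
b1 = 1.5       # coded coefficient for Input
b2 = 0.8       # coded coefficient for Condition B
b3 = 0.4       # coded coefficient for the Input*Condition B interaction
x_mean = 10.0  # mean of Input used for centering

# Uncoded (natural-units) equation for Condition A
intercept_A = b0 - b1 * x_mean
slope_A = b1

# Uncoded equation for Condition B
intercept_B = b0 + b2 - (b1 + b3) * x_mean
slope_B = b1 + b3

print(f"Condition A: Output = {intercept_A:.2f} + {slope_A:.2f} * Input")
print(f"Condition B: Output = {intercept_B:.2f} + {slope_B:.2f} * Input")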
Valério says
Thank you!
Jim Frost says
You’re very welcome! 🙂
Valério says
Hi Sarah, I have the same questions you presented. Were they answered?
Jim Frost says
Hi Valério,
When I saw your comment, I looked and realized that Sarah’s original questions were from 2020 and somehow fell through the cracks! I’ve just replied to her. She might not see the answer now that it is 4 years later, but hopefully it’s of interest to you.
Torruam Japheth Terande says
I appreciate your efforts. Is it correct to standardize panel data within the estimator to deal with heterogeneous time intervals?
Won says
Hi, Jim
Thanks for your post. I have two questions :
1) Do you recommend to standardize proportional data that ranges between 0 and 1?
2) Should we standardize all variables or only the variables in the interaction term?
thanks for making statistics accessible.
Best,
Won
Charlie says
Hi Jim
A nice post, thank you! I have recently run into a problem: if the data are standardized during input, why is the coefficient of the output variable still large, even greater than 1? The VIF test shows no collinearity. This is shown below. Looking forward to your reply!
Best wishes
Charlie
Jim Frost says
Hi Charlie,
I don’t have enough information to answer your question. It’s possible there is no problem and the coefficient is accurate. However, there are several other potential problems that can cause it. It sounds like you’ve ruled out multicollinearity. So, that’s one down.
But there are other possible causes. It could be that you’ve omitted an important variable from the model and it’s biasing this variable. Maybe you’re incorrectly fitting curvature in the data and that’s biasing the coefficient. Or perhaps you’re overfitting your model, which can cause strange results.
There are several things it could be, and it’ll take an investigation to determine whether it’s actually a problem and, if so, what is causing it! Be sure to check the residual plots because they can help identify problems.
Stein Roest says
Hi Jim,
I use binary logistic regression. For my paper I have to compare the odds ratios of two credit rating models. One is based on a 6-grade scale and the other on a 5-grade scale. Therefore, the odds cannot be compared directly.
Can I adjust the scale of the credit ratings by multiplying the coefficient by the standard deviation?
Thanks in advance!
boris says
Hi, Jim
Your Posts are really helpful for my current study, I’m really grateful.
I have a question about standardization with a dataset that contains both continuous and categorical variables.
If I only standardize the continuous variables, can the coefficients of those standardized continuous variables and the categorical variables be compared to find the important independent variables?
Look forward to your answer!
Jim Frost says
Hi Boris,
You can only standardize continuous variables because the process requires you (well, your software) to calculate the mean and standard deviation for a variable. You can’t calculate either for categorical/nominal variables. For example, what’s the mean and standard deviation of a categorical variable like College Major?
You have to assess categorical variables differently. How much does the dependent variable mean change between the different levels of the categorical variable in raw units? Is that a meaningful or trivial amount? It’s a different process.
I hope that helps!
sagar says
Hi Jim,
This is so beneficial. I want to ask you a question: suppose my model has an interaction term and I resolved the multicollinearity by centering the variables, but doing so made my main effect variable insignificant. How should I approach that case?
Jim Frost says
Hi Sagar,
Centering the variables made the main effect insignificant? Or adding the interaction term? You say centering the variable, but I’m assuming you centered all continuous variables in your model?
Typically, when you include an interaction term, you also include the lower order terms that comprise the interaction term even when they’re not significant. That’s called a hierarchical model (that term is also used in a different context in regression analysis, so it can be confusing). So, if the main effect is not significant but the interaction effect is significant, I’d leave the main effect in the model anyway. Doing so allows the procedure to better estimate the interaction term.
Aly says
Hi Jim,
I’m trying to build a new equation model to predict public transportation efficiency. I was able to gather a sample of 10 countries to build my training model using four identified variables or metrics. I was advised to decide on my Y (efficiency) variable by making an arbitrary ranking based on an existing online ranking where these countries were previously ranked from 100-0. Plugging in the newly scaled x variables, I’m looking at my regression model and am a bit confused as to how I can decide whether this model is accurate.
My next step is to basically get the regression equation using 7 of the countries in the sample (training the model), and then use that equation to predict the efficiency for the remaining three countries.
In this process, which equation will be my final efficiency equation? Also, is this thought process correct for building a prediction model? My concerns are mainly rooted in using the online ranking system for the initial regression. My final goal is to be able to use that equation to rank these countries again using the efficiency equation.
Fred says
Hey Jim,
I read most of the comments carefully and learned some new reasons why it is useful to standardize variables. Thanks a lot!
But I asked myself, if considering simple linear regression:
What is a good reason not to ‘standardize’ parameter estimates?
I have some ideas, but I am not really sure about them.
I can’t thank you enough 😀
Fred
Jim Frost says
Hi Fred,
True standardization (subtracting the mean and dividing by the standard deviation) changes the interpretation of the regression coefficients. If you want the usual interpretation: the coefficient represents the mean change in the DV given a one-unit increase in the IV, then you don’t want to standardize.
However, you can get some of the benefits of standardizing by centering the variable (subtracting the mean) and still retain the regular interpretation. It does change the interpretation of the constant but usually you can’t meaningfully interpret the constant anyway.
In simple regression, there are interpretation reasons not to standardize. While you can center the variable, you really don’t gain any benefits.
I hope that helps! And thanks so much for the kind words! 🙂
RABIA NOUSHEEN says
Hi Jim
I want to ask that
1) Is it necessary to convert all predictors into factors while running a generalized linear model, or can we keep numeric terms as numeric?
2) I was interested in doing model selection on the basis of AIC. I observed that model selection (using backward selection or the drop1 method) was only possible with the all-factors model but not when there were numeric predictors. Can you share your thoughts on this?
3) How do we determine homogeneity of variance when predictors are quantitative? Levene's test does not work with quantitative predictors.
4) While doing model selection, I first put all predictors in the model (I have 4 predictors) like this:
Model1 <- glm(A*B*C*D, family = binomial, weights = exposed, data = data1)
When I do Levene's test for Model 1, it indicates that there is homogeneity of variance. However, during model selection, some of the interaction terms, particularly the 4-way interaction term, are removed. So for the final model, how should I test the variance? Or is the variance tested at the start enough?
5) If there is multicollinearity in the model, then is it OK to remove those coefficients which show collinearity and whose removal decreases the model AIC as well?
I shall be grateful if you share your thoughts on above asked questions.
Regards
RABIA NOUSHEEN says
Hi Jim, I am desperately waiting for your comments on this question. I shall be thankful if you can explain all this in a general context.
Jim Frost says
Hi, I did answer your question. Please look more carefully.
Singapore360 says
Dear Jim,
Thank you for the post. On a slightly different topic – Would you advise that the DV outcome be centered (or standardised) too? If the dependent variable is a scale (ie, composite score of X items), will there be a difference in findings if we use the mean of the DV (ie, divide by X)?
Your advice much appreciated!
Leo
RABIA NOUSHEEN says
Hi Jim
Thank you for your posts. These are always very helpful. I am currently facing a collinearity issue with my data and would like to get your help with it.
Briefly, I am using R to run a logistic regression model on my data set, which has 3 categorical independent variables and a binary response variable. Initially when I ran the model, I noticed that a few coefficients (levels of variables) were missing (R was using them as the baseline). As I wanted to have all treatments included in the model, I tried to code my variables so that only the control is selected as the reference. I now get all levels included, but two have “NA” written there. R says coefficients are missing due to singularities. What should I do in this case to get an output for the missing ones too?
sonia says
Hi Jim, I have no words to praise your work enough. Your posts are life savers and very helpful.
I have one question. I want to standardize observed temperature values with respect to their ideal values. But my ideal values are a range, for example 10-12 degrees centigrade. In this case, how can I handle it?
Thanks.
Jim Frost says
Hi Sonia,
Thanks so much! I’m glad they’ve been helpful!
You can standardize variables by values other than their means. However, it can be only one value per variable. For your example, assuming you’re talking about one variable that has an ideal range of 10-12, you’ll need to pick one value to use. You might use the center value of 11, unless some other value makes more sense for your study. You’d simply subtract the value you choose from each observation, and then divide by the standard deviation. The end result is a value that represents the number of standard deviations above or below the ideal value for each observation. To interpret the coefficients, read my post about identifying the most important variables in your model. One of the methods is to standardize your variables. Even though I use the mean during standardization, the interpretation remains the same.
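Here’s a quick sketch of that calculation in Python; the temperature values are made up, and the ideal value of 11 is just the center of your 10-12 range.

import numpy as np

temps = np.array([9.5, 10.8, 12.3, 11.1, 13.0])  # made-up observed temperatures
ideal = 11.0                                     # chosen value from the ideal range
std = temps.std(ddof=1)                          # sample standard deviation

# Number of standard deviations above (+) or below (-) the ideal value
temps_standardized = (temps - ideal) / std
print(temps_standardized)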
I hope that helps!
Ajay Tomar says
Hey Jim,
Does standardization help when an independent variable contains 0? I understood your article, but the assignment I'm doing has a single independent variable that contains a 0 too.
So, I'm divided on whether I have to do standardization or not. On one side, I have no polynomial terms or interaction terms. On the other, I have a 0 in my dataset.
I just want to know why having a 0 in a dataset would be an exception for doing standardization in linear regression.
Mavis says
Hi Jim,
Thanks for your great post.
I have read a statement that transforming/standardizing variables wouldn’t change its p-value as long as we keep the same model.
But, I found that there was a small difference in p-values for Condition B variable before/after standardization in your post.
Could you explain what makes the p-value change?
Can transforming/standardizing variables change p-values even though we use the same model?
If so, the statement I read is not trustworthy, or it only applies with condition, I guess.
Could you please share your thoughts on this? I will really appreciate it.
Thank you very much.
Jim Frost says
Hi Mavis,
Yes, standardizing the variables can change the p-values. There’s one prime reason I can think of off the top of my head: structural multicollinearity. Correlated predictors (multicollinearity) inflate the standard errors of the coefficients, which increases p-values. Structural multicollinearity is when the model specification causes the multicollinearity. For example, when you include polynomials for curvature or interaction terms, the fact that you’re multiplying predictors in the higher-order term creates multicollinearity. Standardizing (or just centering) can reduce structural multicollinearity, which increases the precision of your model and can make p-values become significant.
So, not only is it possible, but sometimes standardizing the variables is intentionally used to get the p-values to change (legitimately) because the p-values might be artificially high (not significant) in models with multicollinearity.
For more information on this process, read my post about multicollinearity and using standardization to reduce it.
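If you want to see this effect for yourself, here is a minimal sketch in Python, assuming the pandas and statsmodels packages and using simulated data purely for illustration:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x = rng.uniform(10, 20, 200)                  # simulated predictor

raw = pd.DataFrame({"x": x, "x_sq": x**2})    # polynomial term from the raw variable
cen = pd.DataFrame({"x": x - x.mean()})
cen["x_sq"] = cen["x"] ** 2                   # polynomial term from the CENTERED variable

def vifs(frame):
    X = sm.add_constant(frame)
    # Report VIFs for every column except the constant
    return {col: variance_inflation_factor(X.values, i)
            for i, col in enumerate(X.columns) if col != "const"}

print("Raw:     ", vifs(raw))   # typically very large VIFs
print("Centered:", vifs(cen))   # VIFs close to 1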
Kemal says
Hi Jim,
Thank for providing valuable information.
The only thing I wonder is whether standardizing is applicable to get rid of zero and negative values in the data set.
For example, I have data like 0 0 0 -3 5 7 9 11 15 17……
Is it suitable for me to use standardization?
If yes, could you offer a reference to add to my paper?
Best Regards.
Jim Frost says
Hi Kemal,
Sorry, I don’t have a reference. But it’s totally fine to use standardization with variables that contain zeros and negative values. However, it won’t rid your data set of zeros and negative values. In fact, standardized data will always have both positive and negative values because there will always be values above (+) and below (-) the mean. Well, unless all the values are constant, in which case they would all be zero.
But, yes, it’s perfectly OK to standardize data that contain zeros and negative values.
Rafi says
Hi Jim,
I have a dataset which has both discrete and continuous variables. I need to scale/standardize it. Can you please tell me whether I should standardize both continuous and discrete variables? I assume we only need to scale the continuous variables. Please confirm.
Thanks,
Mashrafi
Jim Frost says
Hi Rafi,
You’d standardize the variables that you’re using in your model as continuous variables. That could potentially include discrete variables. Suppose you have a discrete variable that can take the values of 1 through 10, integers only. That’s technically a discrete variable, but if your model uses it as a continuous variable, you could standardize it. However, you might use another discrete variable as a categorical grouping variable. For example, you might have a variable that can be 1, 2, 3, 4, 5 but use those values to form groups. In that case, you wouldn’t standardize it.
Abi Revyansah Perwira says
Hello Jim
I have questions for you:
1. When I have some variables with high VIF values, should I standardize the problematic variables only or all independent variables?
2. Could I cite your blog as my paper reference?
Thank you
Jim Frost says
Hi Abi,
Yes, you can standardize just the problematic variables. Keep in mind that it only helps with multicollinearity caused by interaction terms and polynomials. If you standardize by centering, then you can get the benefits without changing the interpretation! I’d recommend centering all variables that are included in interaction and polynomial terms.
Yes, please do! Here’s a link to Purdue University’s Guide for Electronic Sources, which shows how to cite websites/blogs.
Thanks for writing!
james says
Hello Jim,
I have obtained data on birds and have calculated their ordinary abundance (how many birds I saw in the sample area). Now, as my transects are of different lengths, I want to standardize this data to the sampled area and survey effort (birds I saw per 10 square km per hour).
Can you suggest software or a procedure to proceed with my analysis?
Jim Frost says
Hi James,
It sounds like you just need to come up with a good standard area and time period for the number of observed birds. I don’t know what researchers use in this area. I’d recommend doing a literature review to see how other researchers have done it. Your question is very subject-area specific and I just don’t have the experience in the area to give a good answer!
tempy temper says
Do you write for minitab?
Jim Frost says
Click the Read More link in the About me section to find out!
Muj says
Hi Jim,
I am confused about what I have to standardize. I typed this into Stata:
insheet using TestSlopes.csv, comma clear
gen condition2=.
replace condition2=0 if condition==”A”
replace condition2=1 if condition==”B”
drop condition
rename condition2 condition
reg output c.input##i.condition
then I could replicate your first result without standardization.
Then, I did standardization.
egen input2=std(input)
egen condition2=std(condition)
But then I tried this.
reg output c.input2##c.condition2
But this failed to replicate your second result.
I also tried
reg output c.input2##i.condition
This replicated the coefficients for the constant and condition, but the coefficients on input2 and the interaction term are different from yours.
What am I doing wrong?
Muhammad says
Hello Jim,
I was actually looking for a solution to a problem I'm facing. The issue is that I have two independent variables: one is extremely small, with values between 0.001 and 0.01, and the other can vary from -10 to +20. I want the dependent variable to have a cut value that is a mix of the two variables, but it seems quite difficult to combine the variable that is very small with the other one, which is a thousand times larger.
Could you please help me approach this problem ?
Thanks in advance,
Muhammad.
Jim Frost says
Hi Muhammad,
I’m not sure exactly what you mean by a cut value? But, I take it that you want to combine these two variables into one? If so, I agree that combining a variable that has such small values with another variable that has much larger values is a problem. What you want to do is standardize both variables (subtract the mean and divide by the standard deviation), and then add them together. That will give both variables equal weighting. You’ll need to interpret the coefficient differently. It represents the mean change in the DV given a one standard deviation increase in this new variable.
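Here’s a minimal sketch of that combination in Python; the variable names and values are made up for illustration.

import numpy as np

small = np.array([0.002, 0.004, 0.007, 0.009])   # values between 0.001 and 0.01
large = np.array([-8.0, 3.0, 12.0, 19.0])        # values between -10 and +20

def standardize(v):
    # Subtract the mean and divide by the standard deviation
    return (v - v.mean()) / v.std(ddof=1)

# Both variables are now on the same scale, so adding them gives each
# one equal weight in the combined predictor.
combined = standardize(small) + standardize(large)
print(combined)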
I hope that helps!
Sarah says
Hi Jim, thank you so much for this fantastic post! I have one quick question:
Assume I want to perform this regression: y ~ a + b + c + a^2 + b^2 + c^2
Should I either:
a) standardize my original variables (i.e. a, b and c) first and THEN create the quadratic features, OR
b) Create the quadratic terms first and THEN standardize my set of features (a, b, c, a^2, b^2, c^2)?
I found the following advice on someone else’s blog post, but just wanted to check with you:
“Always standardize AFTER generating Polynomial Features because:
1) Loss of signal:
When you create feature interactions, you’re generating values that are multiples and squares of themselves.
When you standardize, you’re converting values to z-scores, which are usually between -3 and +3.
By creating interactions between z-score sized values, you’ll get values a magnitude smaller than the original.
To better illustrate this, imagine multiplying values between 0 and 1 by each other. You can only end up with more values between 0 and 1. The purpose of squaring values in PolynomialFeatures is to increase signal. To retain this signal, it’s better to generate the interactions first then standardize second.
2) Making random negatives:
When you standardize, you turn a set of only positive values into positive and negative values (with the mean at zero).
When you multiply negative by positive, you get negative. Doing this to your data will create negative values from previously all-positive values. In other words, your data will be jacked up.”
Hoping to hear from you soon!
Thanks 🙂
Jim Frost says
Hi Sarah,
You should center the variables before creating the quadratic feature. Centering is different from standardizing because it involves subtracting each variable's mean from all of its values. Standardizing takes the additional step of dividing by the variable's standard deviation. For both approaches, you'd do this before creating the quadratic term. The advantage of centering is that you can still use the standard interpretation of the coefficients.
You want to center/standardize your variables before creating the quadratic feature so you can reduce structural multicollinearity. Doing so doesn't cause any "loss of signal." It can change some positive values to negative values, but it doesn't "jack up" your data. The spacing between the data values remains the same, so you're retaining the same information. It's all there, just in a format that reduces structural multicollinearity.
Also, the purpose of squaring values is NOT to increase signal. It's so you can model curved relationships in the data. And remember, if you inflate the signal but the outcomes remain the same, it reduces the apparent relationship between them! In other words, the larger signals with more variability aren't producing larger changes in the outcomes. The relatively restricted variance in the outcome (compared to the inflated squared predictor) actually reduces the correlation between them.
The only reason you'd center the variables is to reduce structural multicollinearity. Standardizing the variables also reduces structural multicollinearity and results in the same basic model as centering. However, standardizing does change your interpretation because the coefficients now relate to changes in the variables in terms of standard deviations rather than a 1-unit change in their native scale (which is the interpretation for the raw data and centered data). In other words, standardized models use standard units rather than raw data units, but the underlying model and results are the same.
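As a short Python sketch of that order of operations (the column name and data are hypothetical):

import pandas as pd

df = pd.DataFrame({"a": [2.0, 4.0, 5.0, 7.0, 9.0]})  # made-up data

# 1. Center the original variable first
df["a_c"] = df["a"] - df["a"].mean()

# 2. THEN create the quadratic term from the centered variable
df["a_c_sq"] = df["a_c"] ** 2

# Squaring the raw variable and standardizing afterwards would leave the
# structural multicollinearity between a and a^2 largely in place.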
Humairaa says
Hi, I am doing a multiple regression analysis for my dissertation, but my variables come from questionnaires with 93 responses and a total of 96 questions. I'm a bit confused about how to do this regression with this data. All the videos and help I have looked at have one column of information, but for my variables I have lots of information. Will I need to calculate a total score or standardize them? Also, SPSS is not letting me work out the sum. I also have recoded data. I am unsure how to do a regression with multiple pieces of information for each variable I have.
Abusheha says
Hi Jim,
Thank you for your posts. I have two questions:
1. Do we need to standardize the variables in a Quantile Regression Model?
2. Is it right that centering vs. standardizing the variables only has to be done to the independent variables?
Thank you in advance,
Jim Frost says
Hi Abusheha,
Yes, if you want to reduce multicollinearity or compare effect sizes, I’d center/standardize the continuous independent variables in quantile regression.
If you just want to reduce multicollinearity caused by polynomials and interaction terms, centering is sufficient. Centering doesn’t change how you interpret the coefficient.
However, if you want to compare the effect sizes across IVs, you should standardize your continuous IVs. Then, interpret the coefficients as I describe in my post about identifying the most important independent variable.
João Faria says
Thank you very much! That’s precisely it.
João Faria says
Hi Jim,
Thank you very much for this resources.
I tried skimming through the answers before asking this but can’t seem to find a proper answer:
When I use a dataset for the analyses and model its interactions, do I also need to restandardize the continuous variables if I run another analysis on just a subset of the data?
I think this should be true since the subset data are no longer correctly centered. However, I also think this may reduce possible comparisons and the strength of the general work since the variables “differ” between subset datasets.
Thank you once more!
Best wishes
Jim Frost says
Hi João,
You’re very welcome. I’m glad you’ve been finding them helpful!
I’m not 100% sure what you’re asking. I suspect some of the answer depends on what you’re doing exactly. If you mean you’re subsetting your data so a subset column contains only part of the original column and then run an analysis on that new shorter column, yes, I’d center that subset column based on its mean. You should subset the raw, uncentered data and then center the subsetted data.
Lucas says
Do you always need to center the variables when dealing with interactions?
Jim Frost says
Hi Lucas,
Yes, I’d always recommend centering variables when you have interaction terms in the model. Interaction terms always introduce multicollinearity, and centering helps counter it. The same goes for polynomial terms that model curvature.
Lucas says
Hello Jim, I haven’t quite understood why standardizing the predictors eliminates the multicollinearity. If two predictors are correlated, shouldn’t they also be correlated when standardized?
Jim Frost says
Hi Lucas,
That’s another great question. You have to remember that correlation calculations involve multiplying pairs of scores. So, the key is understanding how centering the variables reduces the products of paired observations.
For simplicity, imagine the original distributions for X and Y are the same. They both have a mean of 10 and range from 5 to 15. Now imagine the maximum X value of 15 corresponds to a Y value of 15. (We’re keeping this simple). 15 X 15 = 225.
Now, let’s center those two variables. We subtract the mean (10 for both variables) from the observed values for both variables (15). So, 15 – 10 = 5. Five is the centered value for both X and Y. The product of those two is: 5 X 5 = 25.
The original product for that pair is 225, which is much greater than 25 for the centered pair. The uncentered values will produce a higher correlation. This example was intentionally simplified, but that’s the principle. Centering reduces the product of pairs that have higher absolute values. Conveniently, centering does not change the interpretation of the coefficients. It’s like having your cake and eating it too! 🙂
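Here’s a tiny numeric illustration of that principle in Python, using a variable and its square; the values are made up, with a mean of 10.

import numpy as np

x = np.array([5.0, 7.5, 10.0, 12.5, 15.0])   # made-up values, mean = 10

raw_corr = np.corrcoef(x, x**2)[0, 1]        # correlation between x and x^2

x_c = x - x.mean()                           # centered values
centered_corr = np.corrcoef(x_c, x_c**2)[0, 1]

print(round(raw_corr, 3), round(centered_corr, 3))  # roughly 0.99 vs 0.0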
Joao says
Hi there!
Do I have to standardize the variables (response and independent variables) before running a Multivariate Linear Regression Analysis (with same order terms)?
Which consequences to the results can emerge from that (comparing to non-standardized regression results)?
Thank you,
Joao
Jim Frost says
Hi Joao,
You often don’t need to standardize the variables. I can think of two common scenarios where you might need to standardize the continuous independent variables: when your model includes polynomial or interaction terms (to reduce the structural multicollinearity they create), and when you want to compare coefficients to identify the most important variables.
If those reasons don’t apply, you probably don’t need to standardize the independent variables.
This post and the other post will answer your questions about consequences and interpretations!
Richard says
Hi Jim,
Thank you for your blog. Would you please explain why the centering method reduces VIF?
E.g., there are three explanatory variables: a, b, c.
The VIF of a is 1/(1 - R²), where R² comes from lm(a ~ b + c).
If I center a, b, c, they become a_center, b_center, c_center.
The VIF of a_center is 1/(1 - R²_center), where R²_center comes from lm(a_center ~ b_center + c_center).
My current understanding is that centering will not change the relationship between variables. So lm(a ~ b + c) and lm(a_center ~ b_center + c_center) should output the same result. Thus R² is the same as R²_center.
I used my data to test whether they are the same; they are sometimes the same, sometimes not.
Would you help me understand?
Thank you
Ronan Murphy says
Hi Jim,
Thanks for creating this statistical resources. Its invaluable.
I have run into a road block with something that I think should be easy; how do I undo standardisation?
I created a Python Jupyter notebook to fit a line to a set of (x,y) samples by minimising the sum of squared errors (SSE) in order to deepen my understanding of linear regression.
For (x,y) sample data I used set 1 of Anscombe’s Quartet. The x values range from 4 to 13 and the y values range from 4.26 to 10.84. Before running the loop to minimise the SSE I standardised the x values using the formula Xnew = (Xorig – Xmin)/(Xmax – Xmin). Similarly for the y values.
Running the loop gives values for m and c. Once I got to an acceptable level of error I plotted the “fitting” line using y=mx+c. I got a straight line that fits the standardised points.
I’d like to reverse the standardisation and plot the line against the original (x, y) samples. The slope m would not be affected by this reversal but c would be.
I tried the obvious reversal, namely Yorig =(Ynew * (Ymax-Ymin))+Ymin. This gave me a line which was above the set of points.
Do you know how I reverse the standardisation so that my c is correct?
Your help would be much appreciated.
Thank you.
John Grenci says
Jim, I am still not getting my question answered. Maybe I have lost it. 🙁 Perhaps if I give the specific example, you can see what I am asking. I have 4 0-1 IVs. I left out variable 4 (I called it rank4). When the regression is run, the results are below.
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.5224265 0.3186144 -4.7782726 1.768076e-06
## rank3 0.3220316 0.3846844 0.8371319 4.025184e-01
## rank2 0.9366996 0.3610304 2.5945174 9.472383e-03
## rank1 1.6867296 0.4093073 4.1209370 3.773346e-05
If we take e to the -1.52 (the intercept) power and divide by that number plus 1 (I will call that expression #), it should translate back to the percent of times the dependent variable is 1 when rank4 is 1. That is EXACTLY what we get, in this case 17.9%. But if I run it with NOMINAL IVs I get a totally DIFFERENT SET of coefficients, and I don't see how they can be 'equally' correct, because certainly expression # does not hold if I get an intercept that is something other than the intercept above. This tells me that if I got the correct interpretations when the IVs were CONTINUOUS (which they had to be in this case), then they must always be continuous, or else I will not be able to correctly interpret at least the intercept (we will leave interpreting the other variables for another discussion). I don't see how it has to do with how the software handles data unless it is a case-by-case basis depending upon the type of IV, i.e., when continuous, interpret by expression #, and if nominal, we use an entirely different expression. I don't think that is the case. Does my question make sense? This is the fourth forum I have asked at, and if you cannot answer it, then I may have to move on. If my question does not make sense, please say so, and there is no need for you to further answer. Thanks. Sorry, extremely frustrated.
Jim Frost says
Hi,
I haven’t had time to look through the model in depth. But it’s not completely surprising to obtain different results for nominal versus numeric. When it’s a nominal variable, you’re depending on an algorithm to decide how to code the different levels. You’d expect different coding to produce different coefficients. It’s possible that your software is coding 1s as 0s and 0s as 1s. That would be weird, but it’s a possibility. It might be getting confused by the fact that you’re asking it to treat binary data as nominal.
Additionally, there’s another possibility at play. If you have a categorical variable with three levels, you need two indicators for two of the levels and leave one baseline level out. If you have four levels, you’ll have three indicator variables, and so on. I’m not sure if the variables in your model are related to a categorical variable in that manner, but each software package has an algorithm that determines which level it leaves out by default. The choice of the baseline level also affects the coefficients. So, I can see multiple reasons why the coefficients could change. Although, I see you left out rank4 intentionally, so maybe it’s not that.
As a general rule, when you change the coding scheme and/or baseline level, the different sets of coefficients tell the same story but from a different perspective. To interpret them correctly, you need to know how the indicator variables are coded and, if applicable, which categorical levels are left out to be the baseline.
As for exactly what’s going on, I can’t tell. I don’t know what your software is doing, so there is a bit of mystery there. You’d think that if the data were already 1s and 0s and you told the software they are nominal data, it would recode them as 1s and 0s consistently. So, I know I haven’t answered your question exactly, other than showing several ways it could’ve happened in general. I suspect you’d need to dig into the documentation to determine what is happening. It’s not something obvious that comes to my mind.
I’ve written a post about interaction effects, and for a while I was confused. Readers said that they got different coefficients for interactions with categorical variables. Eventually, after enough communication, we realized that their software and my software coded the categorical variables differently and/or used a different baseline value. The result was that the coefficients were different but the p-values were the same. That’s what led me down that path with your results, except that in your case it’s not different software. But my thought was that somehow the coding worked weirdly on data that were already binary. If it’s not that or the baseline issue, I’m not really sure!
One other thing to check is to make sure it’s using the same link function. I don’t know if that would change anything, but it’s something to consider. Again, that would be weird, but I’m grasping at straws!
John Grenc says
Hello again, Jim. I did order the ebook just now; looking forward to it. However, I am still befuddled here. Does this not suggest that it is of ultra importance which way you assign your independent 0-1 variables? But here is my point: what if I run a logistic regression and get certain coefficients? How can I know that the interpretation is correct if I have just witnessed two completely different interpretations (depending upon the data type) of the same data and model? A further question would be how it is that a 0-1 variable defined as CONTINUOUS would bear the correct coefficients and NOT what everyone would presume to be the correct ones (having defined them as nominal). I say correct because when defining them as continuous we get an intercept that completely reflects the actual data for the missing 0-1 variable (there are four of them and I ran 3 in the model). Do you see what I am asking? Or maybe I don't understand something about logistic regression (or regression in general) that others do? Thanks, John
Jim Frost says
Hi again,
Yes, you’re right. You must know how the variables are encoded to properly interpret them. The results are statistically correct either way. But knowing what a 1 indicates allows you to interpret the results correctly. For example, if your IV is gender, a 1 could indicate either male or female, with 0s representing the other gender. Either way, you obtain meaningful results. However, to interpret the results correctly, you must know what a 1 represents. Consequently, changing the coding scheme does change the coefficients. However, the overall significance for that variable does not change. Logically, whether 1 represents males or females changes the coefficient, but the gender variable itself has the same significance either way. You’ll read about this in my book, although not in the logistic regression context. But the same ideas apply.
For continuous, I wasn’t clear, but I was referring to the DV. It would be better to classify indicator variables as numeric.
I’m not sure how your software codes nominal variables so I’m unsure of what is happening. But, you can be sure that it is recoding nominal variables into indicator variables using an algorithm. If you already have a binary variable that uses the numeric values of 1 and 0, you don’t need to recode it. It’s already in the correct format that it can use to fit the model.
Telling your software that a variable is nominal tells it that the values in your data sheet, whether they’re numbers or text, represent group membership. The software can’t use group membership to fit models without recoding them as indicator variables. It depends on the nature of the data. Again, refer back to my example where you use the values 1, 2, and 3 to represent three groups. You could use those numeric values to fit the model. However, the results would be incorrect. The software would assume that a value of three indicates three times some measurement compared to a value of 1. However, that’s incorrect because these are separate groups and there is no distance between them.
I don’t entirely understand what you’re asking or how your software handles the data. But, it comes down to representing the data accurately for your software (groups versus numeric), and then understanding how it handles nominal data. In your case, it’s already entered as indicator variables, so you don’t need to worry about how it recodes it. Using it as numeric data is fine.
John Grenci says
Hello, Jim. I have been on this site before and really appreciate your helpfulness. I had a question, and I was not sure where it fit among the articles, and this one looked as good as any, so apologies if it is not in the right place. It has to do with logistic regression.
I found an article online that ran a logistic regression, and I was trying to duplicate the values by running the same set of data, but my coefficients were different. There is one 0-1 dependent variable and there are four 0-1 independent variables. I found out the reason I was getting different coefficients was that I had run it with the independent variables as NOMINAL, and the person whose data I was trying to duplicate had run them as continuous. His interpretation was actually correct, as verified by the interpretation of the intercept along with the actual data (i.e., leaving out variable rank4 as one of the variables gave an interpretation of 79% "ones" when rank4 was set to one, and that is exactly what the data reflected). So, I had two questions: one is why it mattered? What is the math behind it? But further, that seems to suggest that in ALL cases one should run independent binary variables as continuous. I can provide the data, or more specifically the output of coefficients, if you like. Thanks, John
Jim Frost says
Hi John,
I don’t have many posts on logistic regression. I should fix that! So, no worries! I do have one here but it doesn’t address your question.
Actually, your question isn’t specific to logistic regression. It applies to least squares regression for continuous data as well.
In regression analysis, you can include categorical (nominal) variables. However, the software needs to convert them to indicator variables. Indicator variables are simply binary variables that indicate the presence or absence of a condition, characteristic, or event. Now, if your data are already coded as binary indicator variables, there's no need to recode them, and you can just use them as they are. The way it works mathematically is that the procedure estimates a coefficient for each indicator variable. When the characteristic is present, you multiply the coefficient by one, which means you just add the coefficient into the fitted value. When the characteristic is absent, you multiply by zero, so that term contributes nothing to the fitted value. I go into more detail about this process and its implications in my ebook about regression.
As for why it makes a difference, I’m not sure. I’m picturing that you have a column of 1s and 0s. By telling your software it’s a nominal variable, it’ll recode a column of 1s and 0s into a column of 1s and 0s. Perhaps it’s recoding 1s into 0s and vice versa? I’m not sure. You’ll need to check your software documentation to determine how it recodes nominal data. Usually, there’s a scheme, such as the higher value being recoded as 1s and low values as zeros–or the other way around. Either way can produce valid results, but you need to know what it considers to be an “event” to be able to interpret it correctly. You should be able to change the coding scheme too.
The real value of running a variable as nominal data is when you have text values, such as Pass and Fail. You'll define one of those as the "event," which then gets coded as the 1s. The OLS procedure can't understand text, so that needs to be converted to indicator variables. Or, suppose you have three groups of data entered as 1, 2, and 3. Treating this as numeric would be a mistake. The software would think that a value of 3 represents three times the amount of something compared to a value of 1. In reality, they're just different groups where the differences between the numbers don't represent anything other than group membership. In that case, these three values would be recoded into two columns of indicator variables that indicate group membership. Again, my ebook covers this in much more detail.
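For example, here is a rough sketch of that recoding in Python with pandas; the group labels are hypothetical.

import pandas as pd

df = pd.DataFrame({"group": [1, 2, 3, 2, 1, 3]})  # 1/2/3 are group labels, not amounts

# Recode the nominal variable into indicator (dummy) columns,
# dropping one level to serve as the baseline.
dummies = pd.get_dummies(df["group"], prefix="group", drop_first=True)
print(dummies)  # columns group_2 and group_3; group 1 is the baseline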
So, yes, if your data are already entered as binary 1s and 0s, run it as numeric! No recoding is necessary!
I hope this helps!
Raghad says
Thanks a lot
that was so helpful
Jim Frost says
You’re very welcome. Best of luck with your analysis!
Because your study uses regression analysis, you might consider my ebook about regression. It covers this material in more depth.
Raghad says
Many Many thanks for the valuable information.
I just have a question…
If I have collinearity just between two independent variables, do I have to standardize one of them or all the independent variables in the model?
Jim Frost says
Hi Raghad,
Standardizing your IVs only reduces multicollinearity for interaction terms and polynomials. It won’t help reduce multicollinearity that exists between two IVs. To address that type of multicollinearity, read my post on that topic for some potential remedies.
In terms of standardizing variables for multicollinearity due to interaction terms and polynomials, strictly speaking, you only need to standardize the variables that are included in those terms. You don’t need to standardize variables that are not included in those terms. However, analysts will frequently standardize all of the continuous variables for consistency.
Indeed, when you standardize all the continuous IVs, the constant takes on a special interpretation: it represents the average DV value when all the continuous IVs are at their mean values. If you have a mix of standardized and non-standardized IVs, it complicates the interpretation of the constant.
Additionally, if you’re making predictions and entering values into the equation, you’ll need to remember which IVs to enter as scores relative to their mean versus which scores to enter in their raw form.
Consequently, it’s often easier and more useful to standardize all of your IVs if you decide that you need to standardize any of them. However, again, standardizing won’t help reduce multicollinearity between two IVs. It sounds like you’ll need to consider other remedies.
jagriti khanna says
Thank you so much Jim.
I’m so fortunate to have come across these websites of yours. You have compiled everything at a single platform in a very easy to understand way. I thoroughly went across all these sites and it helped me a lot for my project. I can’t thank you enough 🙂
Jim Frost says
You’re very welcome, Jagriti! I’m glad my website was helpful!
jagriti khanna says
Hi JIm
I think I wasn’t able to explain myself clearly.
Basically, I just want to know how a change in the value of x for a multivariate curvilinear regression will affect the value of Y, since now we don't have coefficients directly as slopes because of the polynomial nature of each variable.
Jim Frost says
Hi Jagriti,
Interpreting coefficients when you have polynomial terms is always more difficult than when you only have straight-line effects. Standardizing your variables doesn’t really make it more difficult. I always start out with graphs to understand visually the nature of the relationship between two variables.
In terms of graphing, when you have only one independent variable, I highly recommend using fitted line plots. These fantastic plots show the curved fit (when you have polynomials) along with your data points, which helps you determine how well the model fits the data. You can see examples in my posts about coefficients and p-values and modeling curvature.
When you have more than one independent variable, you need to use a main effects plot. These plots display the relationship between one IV and the DV, while holding the other IVs constant. I don’t have blog post to point you towards, but I cover these plots and this particular use for them in my ebook about regression analysis.
In terms of predicting values, normally you’d just enter the values of your IVs into the equation and calculate the fitted result. However, because you standardized your variables, you’ll need to adjust the values that you enter into the equation. All you need to do is subtract the mean of each IV from the value you want to enter into the equation for that IV. For example, if the mean of the IV is 15 and you want to predict the mean DV for X, you’d enter X-15 into the equation. Although, some software packages will do this for you automatically. Check the documentation.
Best of luck with your analysis!
jagriti khanna says
Hi Jim
Can you please tell me, after I standardize my x matrix, how will I standardize the future values of x which I want to use for prediction, if I want to study cases like: if I increase one of my x's by 5%, what will be the change in my y (which is the slope)? And also tell me how to see the actual combined polynomial equation of all the variables after I've done the p-value and R-squared analysis, checked the residuals, and done the final multivariate linear regression, since now my coefficients won't directly determine the slope as they are part of polynomial equations.
Thanks in advance
Gabriel Samuel says
Your blog is awesome. I’m grateful I got hooked at this point in my thesis write up. Thanks and keep up the good work.
Jim Frost says
Thank you so much, Gabriel! I’m so happy to hear that my blog has been helpful. Best of luck with your thesis!
Shaarang says
As an aspiring data scientist, I can not overstate how helpful your setup has been. Thanks a ton
Jim Frost says
You’re very welcome! It makes my day hearing how it has been helpful for you!
Luke says
Great article! Thanks for sharing. I do have a question regarding what you said here: “However, to remove multicollinearity caused by higher-order terms, I recommend only subtracting the mean and not dividing by the standard deviation. Subtracting the means is also known as centering the variables.” Would you elaborate on how dividing by the standard deviation after centering causes problems?
Jim Frost says
Hi Luke,
All I meant by that was that if you just center the variables, the interpretation of the coefficients doesn’t change from their normal interpretation that a coefficient indicates the mean change in the dependent variable given a one-unit change in the independent variable. However, if you also divide by the standard deviation, the interpretation of the coefficients changes. For that case, the coefficient represents the mean change in the DV for a 1 standard deviation change in the IV.
I write about how standardizing your continuous IVs can be helpful in a post about How to Identify the Most Important Independent Variables in Your Model. You can read more about that approach in that post.
I hope this helps!
Douglas AMULI says
I found very helpful your post.
Concerning it i have two questions:
– Is it a problem if one runs a regression model where some independent variables are standardized and others are not ?
– Imagine a particular case of a mimic model with standardized causes but not standardized indicators. Are results negatively affected ?
Thanks in advance for your reply.
Visar says
This was very useful. Thanks a lot and keep up the good work!
Jim Frost says
Thank you, Visar!
Karien says
Hi Jim,
Thank you for your posts. I have to do a statistical analysis for a project and I have never had to delve so deep into statistics before. Your plain English explanations really help a lot.
When dealing with interactions, do you first get the interactions between the variables and center them as well? Or do you center the independent variables and then get the interactions? Or, are the interactions from the original independent variables and then only the independent variables are centered? I am very confused about the order of things here. Also if you have more than one interaction that is significant, does it become another term in the regression equation?
Thanks,
Karien
Jim Frost says
Hi Karien,
I’m so glad to hear that my blog posts are helpful!
To answer your question, some of it depends on your statistical software. If it can do these things for you automatically, then you don’t have to worry about it.
However, if you need to do them manually, here's the correct order.
1. Create a new column for each continuous independent variable you need to center.
2. Center the continuous variables in the new columns.
3. Create a new column for each interaction term.
4. Create the interaction term by multiplying the appropriate columns. Be sure to use the centered variables.
Again, many software packages can do some or all of these steps for you automatically. So, you might not need to worry, but do check the documentation for the software.
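Here's a minimal sketch of those steps in Python with pandas; the file and column names are hypothetical.

import pandas as pd

df = pd.read_csv("mydata.csv")   # hypothetical file with continuous columns X1, X2

# Steps 1-2: new columns holding the centered continuous variables
df["X1_c"] = df["X1"] - df["X1"].mean()
df["X2_c"] = df["X2"] - df["X2"].mean()

# Steps 3-4: new column for the interaction term, built from the CENTERED variables
df["X1c_X2c"] = df["X1_c"] * df["X2_c"]

# Then fit the model using the centered main effects plus the interaction column,
# for example with statsmodels: smf.ols("Y ~ X1_c + X2_c + X1c_X2c", data=df).fit()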
I hope this helps!
Jim