You’ve settled on a regression model that contains independent variables that are statistically significant. By interpreting the statistical results, you can understand how changes in the independent variables are related to shifts in the dependent variable. At this point, it’s natural to wonder, “Which independent variable is the most important?”

Surprisingly, determining which variable is the most important is more complicated than it first appears. For a start, you need to define what you mean by “most important.” The definition should include details about your subject area and your goals for the regression model. So, there is no one-size-fits-all definition for the most important independent variable. Furthermore, the methods you use to collect and measure your data can affect the seeming importance of the independent variables.

In this blog post, I’ll help you determine which independent variable is the most important while keeping these issues in mind. First, I’ll reveal surprising statistics that are not related to importance. You don’t want to get tripped up by them! Then, I’ll cover statistical and non-statistical approaches for identifying the most important independent variables in your linear regression model. I’ll also include an example regression model where we’ll try these methods out.

**Related post**: When Should I Use Regression Analysis?

## __Do Not__ Associate Regular Regression Coefficients with the Importance of Independent Variables

The regular regression coefficients that you see in your statistical output describe the relationship between the independent variables and the dependent variable. The coefficient value represents the mean change of the dependent variable given a one-unit shift in an independent variable. Consequently, you might think you can use the absolute sizes of the coefficients to identify the most important variable. After all, a larger coefficient signifies a greater change in the mean of the dependent variable.

However, the independent variables can have dramatically different types of units, which make comparing the coefficients meaningless. For example, the meaning of a one-unit change differs considerably when your variables measure time, pressure, and temperature.

Additionally, a single type of measurement can use different units. For example, you can measure weight in grams or kilograms. If you fit two regression models using the same dataset, but use grams in one model and kilograms in the other, the weight coefficient changes by a factor of a thousand! Obviously, the importance of weight did not change at all even though the coefficient changed substantially. The model’s goodness-of-fit remains the same.

**Key point**: Larger coefficients don’t necessarily represent more important independent variables.
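To see the units problem for yourself, here is a minimal sketch in Python using NumPy. The data are simulated and the names (`weight_kg`, `fit_ols`) are hypothetical, chosen only for illustration. It fits the same model twice, once with weight in kilograms and once in grams, and compares the coefficients and the R-squared values.

```python
import numpy as np

def fit_ols(X, y):
    """Return (coefficients including intercept, R-squared) for an OLS fit."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    r_squared = 1.0 - residuals @ residuals / np.sum((y - y.mean()) ** 2)
    return beta, r_squared

# Simulated data (an assumption for illustration, not the post's dataset)
rng = np.random.default_rng(0)
weight_kg = rng.uniform(1.0, 5.0, size=200)
y = 3.0 + 2.0 * weight_kg + rng.normal(0.0, 1.0, size=200)

beta_kg, r2_kg = fit_ols(weight_kg, y)           # weight in kilograms
beta_g, r2_g = fit_ols(weight_kg * 1000.0, y)    # same weights in grams

ratio = beta_kg[1] / beta_g[1]                   # how much the coefficient changed
print("coefficient ratio (kg / g):", ratio)
print("R-squared (kg, g):", r2_kg, r2_g)
```

The coefficient changes by the conversion factor while the fit of the model does not change, which is why the raw coefficient size says nothing about importance.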

## __Do Not__ Link P-values to Importance

You can’t use the coefficient to determine the importance of an independent variable, but how about the variable’s p-value? Comparing p-values seems to make sense because we use them to determine which variables to include in the model. Do lower p-values represent more important variables?

Calculations for p-values include various properties of the variable, but importance is not one of them. A very small p-value does not indicate that the variable is important in a practical sense. An independent variable can have a tiny p-value when it has a very precise estimate, low variability, or a large sample size. The result is that effect sizes that are trivial in the practical sense can still have very low p-values. Consequently, when assessing statistical results, it’s important to determine whether an effect size is practically significant in addition to being statistically significant.

**Key point**: Low p-values don’t necessarily represent independent variables that are practically important.
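The sketch below illustrates this point with simulated data. It assumes a true slope of 0.02, which I'm treating as practically trivial, and fits a simple regression at two sample sizes. The helper `slope_test` is hypothetical, and the p-value uses a normal approximation to the t-distribution to keep the example dependency-free, so treat the exact p-values as approximate.

```python
import math
import numpy as np

def slope_test(x, y):
    """Simple regression: return slope, t-statistic, approximate p-value, R-squared."""
    n = len(y)
    x_c = x - x.mean()
    slope = x_c @ (y - y.mean()) / (x_c @ x_c)
    intercept = y.mean() - slope * x.mean()
    residuals = y - intercept - slope * x
    sse = residuals @ residuals
    std_err = math.sqrt(sse / (n - 2) / (x_c @ x_c))
    t_stat = slope / std_err
    # Two-sided p-value from a normal approximation (an assumption; fine for large n)
    p_value = math.erfc(abs(t_stat) / math.sqrt(2.0))
    r_squared = 1.0 - sse / np.sum((y - y.mean()) ** 2)
    return slope, t_stat, p_value, r_squared

rng = np.random.default_rng(1)
results = {}
for n in (100, 1_000_000):
    x = rng.normal(size=n)
    y = 0.02 * x + rng.normal(size=n)   # tiny true effect, lots of noise
    results[n] = slope_test(x, y)
    slope, t_stat, p_value, r_squared = results[n]
    print(f"n={n}: slope={slope:.4f}, t={t_stat:.2f}, p={p_value:.3g}, R^2={r_squared:.5f}")
```

With the very large sample, the p-value becomes extremely small even though the slope and the R-squared stay tiny. The effect did not become more important; the estimate just became more precise.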

## __Do__ Assess These Statistics to Identify Variables That Might Be Important

I showed how you can’t use several of the more notable statistics to determine which independent variables are most important in a regression model. The good news is that there are several statistics that you can use. Unfortunately, they sometimes disagree because each one defines “most important” differently.

### Standardized coefficients

As I explained previously, you can’t compare the regular regression coefficients because they use different scales. However, standardized coefficients all use the same scale, which means you can compare them.

Statistical software calculates standardized regression coefficients by first standardizing the observed values of each independent variable and then fitting the model using the standardized independent variables. Standardization involves subtracting the variable’s mean from each observed value and then dividing by the variable’s standard deviation.

Fit the regression model using the standardized independent variables and compare the standardized coefficients. Because they all use the same scale, you can compare them directly. Standardized coefficients signify the mean change of the dependent variable given a one standard deviation shift in an independent variable.

Statisticians consider standardized regression coefficients to be a standardized effect size because they indicate the strength of the relationship between variables without using the original data units. Instead, this measure indicates the effect size in terms of standard deviations. Effect sizes help you understand how important the findings are in a practical sense. To learn more about unstandardized and standardized effect sizes, read my post about Effect Sizes in Statistics.

**Key point**: Identify the independent variable that has the largest absolute value for its standardized coefficient.
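Here is a minimal sketch of that procedure in Python with NumPy. The data are simulated, and the variable names and scales (`temperature`, `pressure`) are assumptions for illustration rather than the example dataset from this post. As described above, only the independent variables are standardized; the dependent variable stays in its original units.

```python
import numpy as np

def ols_slopes(X, y):
    """Fit OLS with an intercept and return only the slope coefficients."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]

# Simulated predictors on very different scales (hypothetical values)
rng = np.random.default_rng(2)
n = 500
temperature = rng.normal(200.0, 15.0, size=n)
pressure = rng.normal(2.0, 0.1, size=n)
y = 0.5 * temperature + 20.0 * pressure + rng.normal(0.0, 2.0, size=n)

X = np.column_stack([temperature, pressure])
raw = ols_slopes(X, y)

# Standardize: subtract each column's mean, divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
standardized = ols_slopes(X_std, y)

print("raw coefficients (temperature, pressure):", raw)
print("standardized coefficients:", standardized)
```

In this simulation, the raw coefficient for pressure is much larger, but the standardized coefficient for temperature is larger because temperature varies far more in its own units. Each standardized coefficient equals the raw coefficient multiplied by that variable's standard deviation.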

**Related post**: Standardizing your variables can also help when your model contains polynomials and interaction terms.

### Change in R-squared for the last variable added to the model

Many statistical software packages include a very helpful analysis. They can calculate the increase in R-squared when each variable is added to a model that already contains all of the other variables. In other words, how much does the R-squared increase for each variable when you add it to the model last?

This analysis might not sound like much, but there’s more to it than is readily apparent. When an independent variable is the last one entered into the model, the associated change in R-squared represents the improvement in the goodness-of-fit that is due solely to that last variable after all of the other variables have been accounted for. In other words, it represents the *unique* portion of the goodness-of-fit that is attributable only to each independent variable.

**Key point**: Identify the independent variable that produces the largest R-squared increase when it is the last variable added to the model.
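If your software doesn't report this statistic, you can compute it by fitting the full model and then refitting with each variable left out in turn. The sketch below does that in Python with NumPy on simulated data. The function names (`r_squared`, `delta_r2_last`) and the predictors `x1` to `x3` are hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an OLS fit with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    return 1.0 - residuals @ residuals / np.sum((y - y.mean()) ** 2)

def delta_r2_last(X, y, names):
    """Increase in R-squared for each variable when it enters the model last."""
    full = r_squared(X, y)
    increases = {}
    for j, name in enumerate(names):
        reduced = r_squared(np.delete(X, j, axis=1), y)  # model without variable j
        increases[name] = full - reduced
    return full, increases

# Simulated data with two correlated predictors (an assumption for illustration)
rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
x3 = rng.normal(size=n)
y = 3.0 * x1 + 1.0 * x2 + 0.2 * x3 + rng.normal(size=n)

X = np.column_stack([x1, x2, x3])
full_r2, unique = delta_r2_last(X, y, ["x1", "x2", "x3"])
print("full-model R-squared:", full_r2)
for name, increase in unique.items():
    print(f"{name}: R-squared increase when added last = {increase:.4f}")
```

Because `x1` and `x2` are correlated here, the unique increases sum to less than the full-model R-squared. The shared portion isn't credited to either variable.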

## Example of Identifying the Most Important Independent Variables in a Regression Model

The example output below shows a regression model that has three independent variables. You can download the CSV data file to try it yourself: ImportantVariables.

The statistical output displays the coded coefficients, which are the standardized coefficients. Temperature has the standardized coefficient with the largest absolute value. This measure suggests that Temperature is the most important independent variable in the regression model.

The graphical output below shows the incremental impact of each independent variable. This graph displays the increase in R-squared associated with each variable when it is added to the model last. Temperature uniquely accounts for the largest proportion of the variance. For our example, both statistics suggest that Temperature is the most important variable in the regression model.

## Cautions for Using Statistics to Pinpoint Important Variables

Standardized coefficients and the change in R-squared when a variable is added to the model last can both help identify the more important independent variables in a regression model—from a purely statistical standpoint. Unfortunately, these statistics can’t determine the practical importance of the variables. For that, you’ll need to use your knowledge of the subject area.

The manner in which you obtain and measure your sample can bias these statistics and throw off your assessment of importance.

When you collect a random sample, you can expect the sample variability of the independent variable values to reflect the variability in the population. Consequently, the change in R-squared values and standardized coefficients should reflect the correct population values.

However, if the sample contains a restricted range (less variability) for a variable, both statistics tend to underestimate the importance. Conversely, if the variability of the sample is greater than the population variability, the statistics tend to overestimate the importance of that variable.
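A small simulation can illustrate the restricted-range problem. In this sketch, the population relationship is identical in both fits; the only difference is that the second sample keeps observations from a narrow band of the independent variable. The data, the cutoff of 0.5, and the helper `simple_fit` are assumptions for illustration.

```python
import numpy as np

def simple_fit(x, y):
    """Return raw slope, standardized slope (per 1 SD of x), and R-squared."""
    slope, _intercept = np.polyfit(x, y, 1)
    r_squared = np.corrcoef(x, y)[0, 1] ** 2   # valid for a single predictor
    return slope, slope * x.std(ddof=1), r_squared

rng = np.random.default_rng(4)
x_pop = rng.normal(0.0, 1.0, size=5000)
y_pop = 2.0 * x_pop + rng.normal(0.0, 1.0, size=5000)

keep = np.abs(x_pop) < 0.5                     # restricted range of x
full = simple_fit(x_pop, y_pop)
restricted = simple_fit(x_pop[keep], y_pop[keep])

print("full range       (slope, standardized slope, R^2):", full)
print("restricted range (slope, standardized slope, R^2):", restricted)
```

The raw slope is about the same in both fits, but the standardized coefficient and the R-squared both shrink in the restricted sample, which makes the variable look less important than it is in the population.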

Also, consider the quality of measurements for your independent variables. If the measurement precision for a particular variable is relatively low, that variable can appear to be less predictive than it truly is.
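To illustrate the measurement issue, the sketch below simulates a predictor that truly drives the dependent variable and then fits the model twice: once with the true values and once with a noisy measurement of them. The data and the amount of measurement noise are assumptions chosen for illustration.

```python
import numpy as np

def slope_and_r2(x, y):
    """Return the simple-regression slope and R-squared."""
    slope, _intercept = np.polyfit(x, y, 1)
    return slope, np.corrcoef(x, y)[0, 1] ** 2

rng = np.random.default_rng(5)
n = 5000
x_true = rng.normal(0.0, 1.0, size=n)
y = 2.0 * x_true + rng.normal(0.0, 1.0, size=n)

# Same variable, but recorded with imprecise measurements
x_noisy = x_true + rng.normal(0.0, 1.0, size=n)

slope_true, r2_true = slope_and_r2(x_true, y)
slope_noisy, r2_noisy = slope_and_r2(x_noisy, y)

print("precise measurement:   slope, R^2 =", slope_true, r2_true)
print("imprecise measurement: slope, R^2 =", slope_noisy, r2_noisy)
```

Both the slope and the R-squared are smaller with the noisy measurement even though the underlying relationship is unchanged.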

When the goal of your analysis is to change the mean of the dependent variable, you must be sure that the relationships between the independent variables and the dependent variable are causal rather than just correlational. If these relationships are not causal, then intentional changes in the independent variables won’t cause the desired changes in the dependent variable despite any statistical measures of importance.

Typically, you need to perform a randomized experiment to determine whether the relationships are causal.

## Non-Statistical Issues to Help Find Important Variables

The definition of “most important” should depend on your goals and the subject area. Practical issues can influence which variable you consider to be the most important.

For instance, when you want to affect the value of the dependent variable by changing the independent variables, use your knowledge to identify the variables that are easiest to change. Some variables can be difficult, expensive, or even impossible to change.

“Most important” is a subjective, context-sensitive quality. Statistics can highlight candidate variables, but you still need to apply your subject-area expertise.

If you’re learning regression, check out my Regression Tutorial!

**Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.**

JP Carvallo says

Hi Jim,

A quick note that the Oaxaca-Blinder decomposition can be a relatively rigorous way to test how much a given variable contributes to explaining variance in the dependent variable, when certain model conditions are met.

Thanks,

JP

Jim Frost says

Hi JP,

Thanks for the information. I haven’t used the Oaxaca-Blinder decomposition method myself. However, my understanding is that it explains the sources for the differences between group means. For example, if males and females have different mean values, what causes that difference? Is it the categorical IV itself (e.g., gender) or other sources? To the extent of my knowledge, it’s not used to identify the relative importance of the various IVs in the model. That’s a different type of question it answers.

Chris says

[Please excuse my addition of two cents to question addressed at the blog author. If this is undue, feel free to remove this comment.]

Yes, there are indices that try to measure the proportional contribution of a variable to R^2. The problem with the so-called “ANOVA type I” R^2 decomposition is that it depends on the order of variables. A workaround is to average over all permutations, but this is computationally only feasible for a small number of variables. See the following OpenAccess papers for an overview:

Grömping: “Variable importance in regression models.” Wiley interdisciplinary reviews: Computational statistics 7.2 (2015): 137-152.

Grömping: “Relative importance for linear regression in R: the package relaimpo.” Journal of statistical software 17 (2007): 1-27.

The R^2 based metric suggested by Jim corresponds to the metric “last” in the second paper. It is equivalent to a so-called “type III ANOVA” which does not provide an R^2 decomposition, i.e. the individual SumOfSquares contributions do not add up to the total SumOfSquares. This does not make the metric useless, but it might not be what you were asking for.

Jim Frost says

Hi Chris,

Thanks for adding your thoughts. I appreciated the expanded discussion. It appears like we might differ somewhat on several points.

My take is that using the adjusted sum of squares (Type III) is actually quite valuable. It allows you to find the totally unique contributions for each variable that none of the other variables can explain.

Thank you for bringing up the all permutations approach.

While I understand its appeal, I have some reservations about it. Averaging over all permutations inherently means considering a multitude of models, many of which are likely to be misspecified. Such models are prone to various biases, including specification bias and omitted variable bias. By averaging results from these biased models, we risk diluting the validity of our findings.

Averaging over all permutations assumes that each permutation is equally likely or equally representative of the underlying process. This might not be the case for the reasons I discuss above.

Additionally, considering all permutations is a form of data dredging. It can pull up all sorts of chance correlations that look like real relationships but are in fact the product of sifting through all possible models. The all permutations process is entirely heuristic in nature and has no theoretical underpinnings.

Instead of relying on an average from a plethora of models, I believe it’s prudent to focus on a specific model. Analysts should carefully choose it using statistical, graphical, and theoretical assessments. By doing so, we can ensure a more robust and meaningful interpretation of the variable’s importance.

Will Ovenden says

Hi Jim,

Thanks, that paper is interesting and very useful. I hesitate to say I understand it all, but I get the gist of it.

One thing that confuses me slightly, in section 5.2, is relative importance vs. absolute importance. Is absolute importance just referring to a parameter estimate, for example, where you can interpret an increase in X as causing a change in Y?

Do you know if there is a method to establish the proportion of the variation in the response that can be attributed to individual variables (R-squared)?

thanks again for the help,

Will

Jim Frost says

Hi Will,

The absolute importance is just the effect of each predictor. Unsurprisingly, the article concludes that measures that are good for establishing absolute importance work better for establishing relative importance. That makes sense because if a measure can’t determine the absolute effect of a predictor, how can it rank multiple predictors? Akaike weights are a good example of what doesn’t work. These weights do not establish the absolute effect of a predictor. Hence, they are not good for the relative ranking of multiple predictors.

As for your question about using R-squared for individual variables, reread the section in this post titled “Change in R-squared for the last variable added to the model.” I really like this measure because it doesn’t just compare the amount of variance each predictor accounts for; it compares the unique variance each one accounts for, the portion of variance that no other predictor can explain. Read that section more closely and it’ll answer your question.

Great discussion!

Will Ovenden says

Hi Jim, One more question if you don’t mind… I have seen some articles use Akaike weights to determine relative importance by summing the weights of the models for which the variable was included. Am I right in thinking this approach would be similar to using the F statistic, where a variable might be important with respect to reducing residual error in the model, but the Akaike weights approach doesn’t provide much information about the gradient of the relationship between the response and the variable, so it is less useful than standardised coefficients when trying to establish which predictors are having the greatest impact on the response? I have just noticed that said paper (https://onlinelibrary.wiley.com/doi/full/10.1111/j.1600-0587.2012.00020.x) uses the weight sum of each variable as “variable contribution to the model” in the results but describes it as a method for establishing the relative importance of the variable in the methods, so perhaps it’s not quite fit for the purpose they intended? I think the fact that many variables have a weighted sum of 1, meaning they’re included in all the models, confirms this because there is no way of telling which of the variables included in all the models are more important in increasing/decreasing the response.

best,

Will

Jim Frost says

Hi Will,

Yes, I’d agree with your comments about all the variables with the weighted sum of 1.

Additionally, I’m a bit leery about using either F-values or Akaike weights for determining which predictors are most important. Neither really gets at the effect size. Additionally, Akaike weights have a low repeatability across studies.

Instead, consider model-averaged standardized effect sizes which are more consistent across studies and directly evaluate relative effect sizes.

For reference, please see, A farewell to the sum of Akaike weights by Galipaud et al.

Will Ovenden says

Ahh, that makes sense. Thank you, Jim!

Will

Will Ovenden says

Hi Jim,

Thanks for the article. I am wondering if you could explain the individual F values for parameters in a regression model. I understand that the F value is the ratio of the mean sum of squares for regression to the mean sum of squares for error, and that a higher F value indicates the variable is more significant. But my results are confusing. I have used standardised coefficients to rank the importance of variables, but when doing anova(model_object), the F values are not even close to being in a similar order of importance to the standardised coefficients. I have checked for collinearity and it doesn’t seem too high.

thanks in advance!

Will

Jim Frost says

Hi Will,

It’s unsurprising that the F-values and importance of the standardized coefficients don’t line up because they’re measuring different things.

F-values relate to statistical significance. And statistical significance does not necessarily imply real-world, practical significance. It’s the same reason why I say to not use the p-value to determine the importance of a predictor. Reread the section about not using P-values and apply those ideas to F-values.

Standardized coefficients measure the average change in the DV given a 1 STD change in the IV/Predictor. In this way, it captures the practical significance in a manner that allows you to compare it between the disparate IVs.

As I recommend in this post, use the standardized coefficients and change in R-squared to assess the importance of the IVs. Don’t use measures such as p-values and F-values. As always, consider the caveats I discuss in the post.

I hope this helps!

Leila says

Hi!

Thank you for the helpful post.

I would simply have a quick question about the order in which the variables should be entered in the models, and whether variables should be taken out as we enter new ones. I am currently writing my thesis using regression analyses, and I have 7 independent variables which I want to test to see which has the most important effect in the dependent variable. Should I include all variables 1 by 1 into the model, and keep all of them in until the end, so that I start at 1 variable and end with 7 in the same model? Or should I delete the preceding variable every time I enter a new one into the model?

For example, let’s say my starting variable would be “feelings towards a brand”, and two of my other independent variables are “skepticism about advertisements” and “perceived authenticity of the brand”, with my dependent variable being “increased support toward the brand after watching an advertisement”. Should I first run the model with “feelings toward the brand”, then add “skepticism”, and then “perceived authenticity”, so that I end up having all 3 in the same model. Or should I do the first two together, then take out “skepticism” and add “perceived authenticity”?

I hope this is not too confusing of a question and explanation. Thank you in advance for your response.

Dr. MM Ali says

Hi Jim,

Very interesting article. I was looking for such an explanation in selecting the most important variables. I still have to go through the entire discussion in the blog and some of the links you provide, but I need some quick help from you. Can you please suggest a freely available statistical software package that can calculate the increase in R-squared when each variable is added to a model that already contains all of the other variables?

Besides posting in this blog, please also send me an email so that I won’t miss your information.

Regards.

Ali

Jim Frost says

Hi Ali,

Thanks for writing. I know that R can do what you ask. I don’t know of other free software that can do it. You can of course get these results by manually fitting the full model and then removing an IV using any statistical software (including Excel) that lets you perform multiple regression and produces R-squared.

Fit the full model and record the R-squared. Then, remove one variable and record that R-squared. The difference between the two R-squared values is the increase in R-squared for that one variable. Repeat that process for each IV in your model. You want to compare the full model R-squared to the model that has all the variables except one. Depending on the number of IVs in your full model, that might be tedious, but it will give you the information you need.

Finally, if I emailed everyone who commented, it would be hard to keep up with it all. Fortunately, you can just check the “Notify me of new comments by email” checkbox before submitting your comment. That way, the system will automatically email you.

A says

Hi Jim! With your help, I think I figured it out, and my coefficients are now looking a lot better. Thank you. I did have just one more quick question though – is using regression coefficients better than a correlation matrix because it specifies an actual relationship?

Jim Frost says

Hi, you’re welcome!

Regression coefficients are better for two reasons. One is that not only do you know the direction of the relationship, but you also know how much the DV changes on average given a change in the IV. You don’t learn that from correlation coefficients. Additionally, regression coefficients control for all the other IVs in your model. In other words, regression coefficients isolate the effect of each IV from all the other IVs. Correlation coefficients don’t do that.

A says

Hi Jim,

Love your articles. Thank you for writing them. I’m trying to tell which X variables matter the most as they relate to the Y variable – just like your article!

I do have a quick question though. I’m running a linear regression with quite a lot of X variables. When I run the regression with all the X variables, I get quite a few coefficients returned with a value of NA. After throwing out a few variables (I think this is due to multicollinearity), there are a few X variables with comparatively very high coefficients. The results are the opposite of what I’d expected. For example, I thought the variable crime would have a negative coefficient as it relates to a Y variable of happiness score. Instead it has a very positive coefficient.

This even happens when I do one X variable. Happiness score ~ Crime.

Do you have any ideas why this would be? Should I normalize the data? I’m having trouble explaining why this is the case. I’d appreciate any tips.

Jim Frost says

Hi,

I’m so glad my articles have been helpful!

For the NAs, it might be that you’re including more IVs than you have observations. I’m not sure. You mentioned that you had a lot of X variables, so that could be a reason. If that’s the case, even when you include fewer variables, you might be overfitting the model, which can produce results you can’t trust. Click the link to learn more!

How many observations do you have and how many X variables?

Multicollinearity can cause that as well. You can check for that to see if it is a problem. Click that link to learn more!

There are other possibilities as well. For example, you might not be correctly fitting curvature. Check the residual plots to see if there’s an obvious pattern in them. Or there could be omitted variable bias. That might be what is happening in your model with just one X variable.

So, you have some things to check that might give you clues.

As for normalizing, that’s a fuzzy term with different meanings. I’d need to know what you mean by that more specifically before commenting in more detail. If you mean standardizing your variables, which I mention in this post, there’s limited potential for that to help you resolve the high coefficients. It will only help if your model has multicollinearity specifically caused by including polynomial and/or interaction terms.

For general help in fitting your model, read my article about choosing the correct regression model. Lots of advice and tips in it! It’s great that you’re thinking about whether your results make theoretical sense. That’s a crucial but frequently overlooked step!

Chris says

Thanks for the quick reply. Yes, I agree that p-values are less useful, because the non-linear transformation of F or t quickly decays to values close to zero. It is interesting to note, though, that measuring the difference in R-squared when a variable is added last is (for continuous variables) exactly equivalent to comparing t-values (or t^2, to be precise) (*). In other words, if this is your variable importance metric, you can simply compare the t-values reported in the summary of a linear regression. It is problematic in case of (multi-)collinear predictors, but in that case more modern approaches like permutation based variable importance can become problematic, too, despite their easier interpretability and the nice fact that they can be applied to ANY predictive model, not just linear regression.

(*) This is not difficult to prove after a little calculation based on their definitions, but if you are looking for a formal reference, see John Bring: “How to standardize regression coefficients.” Am. Stat. 48,3 pp. 209-213. Bring actually advocates the use of t-values for measuring variable importance.

Jim Frost says

Hi Chris,

I’m not quite sure if you have a question in there somewhere or if you’re just presenting various musings (which is fine)? But there are reasons why I suggest using the change in R-squared as a measure of importance for linear models. We’ve covered some of them.

Yes, I’ve seen schemes where analysts have used t-values as you suggest. However, as you indicate, they are affected by multicollinearity whereas R-squared is not. I’ve seen the use of the absolute value of t-values largely in the space of factorial designed experiments where multicollinearity should not be a problem. Additionally, you might want to compare the entire effect of a categorical variable, which has an F-value rather than a t-value. Change in R-squared works in all those cases.

I’m not surprised that other methods besides linear models have their own ways for determining importance, but I’m not discussing those methods in this post. Perhaps down the road in a different post. Another larger point that I’m raising in this post is that there are various definitions of importance, which affects the identification of the one you call most important. Also, I show how statistics will only get you so far in determining importance and indeed has some shortcomings and can be affected by how you collect your data. Practical knowledge of the subject area is also crucially important here.

Chris says

Thanks for the detailed answer, Jim. In general, these observations concerning p-values hold and p-values are unrelated to effect size. Type III ANOVA, however, is a special case, because the denominator of the test statistic is identical for all variables and the numerator is the increase in R-squared when this variable is added last. In other words, the test statistic is how much R-squared decreases when a variable is removed from the model. The p-value is an (inversely) monotonic transformation of the test statistic and thus is equivalent to your suggested criterion “change in R-squared”. As this transformation is non-linear, directly comparing the change in R-squared is easier to interpret, but the ranking should be the same.

Jim Frost says

Hi Chris,

I believe that you’re correct mathematically in terms of producing the same ranking. However, you lose a lot of information going by p-values, which makes them inadequate for this purpose.

As you say, the change in R-squared is easier to interpret and it gives you a sense of how much of the unique DV variability each variable accounts for. Additionally, you can have two similar p-values but the change in R-squared can be fairly different.

In fact, you see that in this post. In the example, both Temperature and Pressure have p-values of 0.000 (which actually means less than 0.0005). The p-values are VERY close because they're both within the minuscule range of 0 - 0.0005 (non-inclusive). However, despite the negligible p-value difference, there is a marked difference in the change in R-squared for each one (~80% vs ~20%). While the precise p-values for these two variables might produce the same ranking, they hide the larger difference between the variables. Consequently, p-values are insufficient for this task. It's also easy to imagine cases where you have two very significant p-values that are nearly equal. One might account for a significant portion of the variance while the other accounts for a trivial amount. Yet, if you use p-values, while you'd know that one was more important than the other, you'd be misled into thinking both were important. Or, there's the case where neither is important despite both having very significant p-values. Again, don't use P-values to assess practical significance!

chris says

Thanks for the blog post, Jim. I am not sure about the point concerning p-values, though. For a continuous IV, the p-value of the t-statistic measures how far its coefficient is away from zero in units of its standard deviation. This is identical to the square root of the F-statistic, which measures how much R-squared decreases when this variable is removed (the so-called type III ANOVA). IOW, looking at the p-value of a continuous IV (not recommended in your post) is equivalent to looking at the increase in R-squared when this variable is added last (recommended in your post). Can you please shed light on this contradiction?

Jim Frost says

Hi Chris,

Thanks for the great question! There are a couple of things that aren’t quite accurate with your statements. For t-values, it’s technically the standard error of the mean rather than the standard deviation. Additionally, F-values are the ratio of two variances. In some cases, F-statistics and t-values have the relationship you describe. However, neither are related to R-squared.

Both tests describe the likelihood of the observed results if the null hypothesis is true. You can obtain large F and t-values even when you explain a low proportion of the variance (R-squared). They are just different things. You can have a relatively trivial effect that is very statistically significant because you have a combination of a large sample size and low data variability. Or think of it this way. Imagine you have a p-value that is say 0.04. Significant but barely. Now imagine you perform the the same exact study but with a dramatically larger sample size and/or more precise measurements. And we’ll assume that the effect truly exists in the population but it’s trivial. Given that situation, the p-value will decrease, becoming more significant, even though the effect remains trivial.

The end result is that you can have minuscule p-values for trivial effects, which is why p-values aren’t a good measure of importance.
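To put rough numbers on the sample size point, here is a minimal Python sketch. It relies on the standard OLS identity for a variable entered last, t² = F = ΔR² × (n − k − 1) / (1 − R²), where k is the number of predictors in the full model. The values used (a 1% change in R-squared, a full-model R-squared of 30%, three predictors) are hypothetical, and the p-value uses a normal approximation rather than the exact t-distribution, so treat it as an illustration only.

```python
from math import sqrt
from statistics import NormalDist


def t_for_last_variable(delta_r2, r2_full, n, k):
    """|t| for a variable entered last: t^2 = F = delta_R2 * (n - k - 1) / (1 - R2_full)."""
    return sqrt(delta_r2 * (n - k - 1) / (1.0 - r2_full))


def approx_two_sided_p(t):
    """Normal approximation to the two-sided p-value (an assumption; reasonable for large n)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


# Hypothetical setup: the variable always adds 1% to R-squared; only n changes.
results = {}
for n in (50, 500, 5000):
    t = t_for_last_variable(delta_r2=0.01, r2_full=0.30, n=n, k=3)
    results[n] = (t, approx_two_sided_p(t))
    print(f"n={n:5d}  |t|={t:6.3f}  approx p={results[n][1]:.4g}")
```

The change in R-squared is identical in every row, yet the t-value grows and the p-value shrinks as the sample size increases. The t-value mixes effect size with sample size, while the change in R-squared does not.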

Nima Lord says

Dear Jim

How can I interpret the impact of a one-unit change in the independent variable on the dependent variable when I have an interaction term or a higher-order term of that independent variable in my model?

For example, if I have temperature and temperature*pressure in my model, and the temperature coefficient is 5 and the temperature*pressure coefficient is 3, what can be inferred from 3 and 5? Can I still say that a one-unit increase in temperature causes a 5-unit increase in the dependent variable?

Jim Frost says

Hi Nima,

When you have an interaction term, you cannot just interpret the main effect, which is the mean change in the DV given a one-unit increase in the IV. That interpretation will give misleading results. To learn how to interpret models with interaction effects, read my post about interaction effects.
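As a small illustration using the hypothetical coefficients from the question (5 for temperature, 3 for temperature*pressure), the change in the DV for a one-unit increase in temperature is 5 + 3 × pressure, so it depends on where pressure is held. The function name below is made up for this sketch.

```python
def temperature_effect(pressure, b_temp=5.0, b_interaction=3.0):
    """Mean change in the DV per one-unit increase in temperature at a given pressure.

    For a model containing b_temp * T + b_interaction * T * P, the slope with
    respect to T is b_temp + b_interaction * P.
    """
    return b_temp + b_interaction * pressure


for p in (0, 1, 2):
    print(f"pressure={p}: one more unit of temperature changes the DV by {temperature_effect(p)}")
```

The coefficient of 5 describes the temperature effect only when pressure equals zero, which may not even be a meaningful value in the data.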

Henra Muller says

Dear Jim

How does one go about deciding the best combination of a specific set of parameters with numerous variables?

Jim Frost says

Hi Henra,

That gets into another detailed area called response optimization. Response optimization helps you find the combination of variable settings that jointly optimize a response or a set of responses. You can also enter constraints as well as other considerations for variables that are easier or harder to change. This method uses an algorithm to find optimal settings while factoring in other constraints and considerations. Here’s a Wikipedia article about optimization.

There’s a simpler approach you can use if you know some variable settings that are good. Basically, you’re mentally doing some of the work that the algorithm would do. For that, read my article about making predictions while accounting for variability.

Isaac Asante says

I’ll help you intuitively understand statistics by focusing on concepts and using plain English…

And you do exactly that! Thanks.

Do you have a YouTube channel?

Ghannaii says

Hi Jim,

Please, is there a limit to the number of independent variables that can be entered into a logistic regression model?

Thanks.

Jim Frost says

Hi Ghannaii,

Yes, there’s a limit. It’s generally defined by your sample size. If you include too many IVs, it’s referred to as overfitting your model. I write about it from the perspective of least squares regression in Overfitting Regression Models: Problems, Detection, and Avoidance. The same principles apply to logistic regression but it’s a bit more complex. In that article, I include a reference that, if my memory serves, discusses logistic regression.

I hope that helps!

Didier says

Thank you for replying,

Yes, the change in R-square was larger for diabetes distress,

Sorry for the inconvenience. I thought there was a problem with my connection. It was later that I understood that you moderate the comments.

I will go and check for the omitted variable bias. Your site is very helpful.

Thank you so much

Jim Frost says

Hi Didier, no worries! I’m glad my site was helpful. It does sound like both measures suggest that diabetes distress is the more important variable.

Didier says

Hi Jim, Thank you for this helpful article,

I had a similar issue with my dissertation project. I wanted to know which of social support and diabetes distress was the best predictor of self-care.

At first, I ran a Pearson correlation and found that only diabetes distress related significantly with self-care, whereas there was no significant relationship between self-care and social support.

For the multiple regression, I was confused because both social support and diabetes distress were significant. Moreover, for the unstandardized coefficients, the social support value was 0.206 and diabetes distress was -.187. Therefore, although it did not really make sense, I assumed social support was the best predictor.

But the standardized coefficients tell another story: social support has a value of .234 and diabetes distress -.458, and the part correlation of diabetes distress is also higher than that of social support.

Therefore, if I understand what you said, according to the standardized value and change in R squared, Diabetes distress should be my best predictor?

Can you please advise,

Jim Frost says

Hi Didier,

When you perform a Pearson correlation, it’s a pairwise analysis. It just includes a pair of variables and isn’t controlling for other variables. However, when you performed the multiple regression analysis, it controlled for the other variables in the model. That explains the different results. You’re probably witnessing omitted variable bias in action. To learn more, read my post about omitted variable bias. I actually include a similar example in it.

As for the most important predictor, the standardized coefficients suggest that the diabetes distress is a more important predictor, but note the caveats I mention in this article.

I didn’t see the change in R-square so I can’t comment. Was it larger for diabetes distress?

By the way, I saw multiple attempts to post this comment. Bear in mind that I manually moderate the comments, so there will often be a delay before you see it.

Alvin says

Noted, thanks for your reply Jim

Alvin says

Hi Jim

So after the process of standardization, the higher value of coefficient indicates a more important predictor? How about negative value? My model is nonlinear in the form of y = b0 * exp ( b1 * x1 + b2 * x2 + … )

Jim Frost says

Hi Alvin,

The key point to consider is that I write that you should look for the standardized coefficient with “the largest absolute value.” For example, if you have coefficients of -3 and +2, the -3 coefficient has a larger absolute value.

Ankit says

Thanks Jim for your reply.

So the data I have is manufacturing data where the dependent variable or outcome is the yield. There are multiple parameters that affect the yield.

Whenever the value of a specific parameter changes, it causes a yield drop or a yield increase.

For example, this month’s yield could be affected by a change in one parameter. Similarly, next month’s yield could be affected by another parameter.

I want to build an automated system that pulls data every month and tells me which parameters are currently causing the yield to increase or drop. The parameter values are controlled by different machines and can change at times, causing the drop in yield.

I saw your post about finding the most important independent variables and thought of using measures like the standardized Pearson coefficient, MIC score, etc.

How could I proceed with this project?

Jim Frost says

Hi Ankit,

What puzzles me is why some variables would be significant one week and other variables another week. Until you understand that, you’ll have a hard time modeling the process!

For more information about choosing the correct model, please read my post about specifying the correct model. However, it does not address the issue above!

Ankit Singh says

Hello Jim

Great tutorial.

I am stuck on a real world problem. I have a dependent variable which is affected by around 20 independent variables but not always. Maybe this week one independent variable is causing change to the dependent one. Maybe next week some other would cause drop or rise of the dependent variable. All variables are numeric having different scales.

Would this method help me in building an automatic model which gives me the independent variable which currently is causing the dependent to drop or rise?

Jim Frost says

Hi Ankit,

To me, it sounds like the real problem with your analysis is the changing significance of the variables over time. That will make modeling the outcome very difficult. Do you have any understanding of why that’s happening? Typically, the variables in a model will consistently have a significant relationship with the outcome variable. Are you sure that it’s not random changes causing chance correlations? Are there theoretical reasons why the relationships would come and go?

I’m not sure which method you’re asking me about in your last sentence?

It sounds like using an automated model building method might be behind your problems by detecting chance correlations. Try building a model using the principles I describe in my post about specifying the correct model. Also, it sounds like you’re modeling something over time, which introduces special challenges. For example, you might need to include lagged versions of the variables to account for effects over time.
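For readers unfamiliar with lagged variables, here is a minimal Python sketch of the idea. The `lag` helper and the parameter and yield values are hypothetical. A lag-1 version of a predictor simply pairs each period’s outcome with the predictor’s value from the previous period.

```python
def lag(values, k):
    """Return the series shifted by k periods; the first k entries have no lagged value."""
    values = list(values)
    if k < 0:
        raise ValueError("k must be non-negative")
    k = min(k, len(values))
    return [None] * k + values[:len(values) - k]


# Hypothetical monthly data: one machine parameter and the resulting yield.
parameter = [10.0, 10.5, 11.0, 10.2, 9.8]
yield_pct = [92.0, 91.5, 90.8, 90.1, 91.9]

parameter_lag1 = lag(parameter, 1)

# Keep only the rows where the lagged value exists, ready to pass to a regression routine.
rows = [
    (y, x, x_prev)
    for y, x, x_prev in zip(yield_pct, parameter, parameter_lag1)
    if x_prev is not None
]
print(rows)
```

Each lag costs one observation at the start of the series, so with short series the number of lags you can include is limited.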

Fekremariam says

Thank you, it helps!

Charles Adusei says

It is amazing how you are imparting knowledge to students and lecturers. I read your post on identifying the most important independent variables in regression models. This is very timely for me and very insightful. It will help to improve my current research paper for a top journal. Thanks for your service.

Peeyush says

Dear Jim,

I hope you are well and fine.

I have one dependent variable and one independent variable in my regression, but the data show heteroscedasticity, which may be due to negative values in the data. Can I add a constant positive value to both the dependent and independent variables to make both series positive? This would let me remove the heteroscedasticity with a log transformation. In this case, the series mean will change but the variance will not change. After that I will apply regression. Is it possible to do so?

Siyabonga says

Please write a book 🙏🏿 You’re a godsend

Sreeja says

Thank you Sir

Jithin Narayanan says

How to find which variable is having more impact using interpretation

Jim Frost says

Hi Jithin, well, that’s the exact topic of this article! Read through it carefully and if you have any more specific questions, be sure to ask!

Jakub says

Hi

Thank you. It’s nice to know that I can use the change in R-squared for the last variable added to the model to assess the importance of variables in the case of regression with categorical/binary variables. You make things clearer for me.

Jakub says

Thank you for your quick response. I would be very glad if you could answer one last additional question on this topic. Could I use the change in R-squared for the last variable added to the model to assess the importance of variables in that case?

Jim Frost says

Hi Jakub, yes you can. That’s one of the methods that I recommend in this post.

R Saleh says

Hi, just a reply to your question. Some independent variables add noise to the model instead of signal. You can use the adjusted R^2 to see if adding a new variable increases or decreases this statistic. If it increases, keep it. Otherwise it is adding noise.

Another reason is that overfitting can occur if there are too many variables. So we would like to select the optimal set if we can find it to build our compact model.
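For reference, the usual adjusted R-squared formula is 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p is the number of predictors. The short Python sketch below uses made-up numbers (30 observations, R-squared rising from 0.600 to 0.605 when a fourth predictor is added) to show how the adjusted value can fall even though the regular R-squared rises.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fit to n observations."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Hypothetical comparison: the fourth predictor barely raises R-squared.
before = adjusted_r2(0.600, n=30, p=3)
after = adjusted_r2(0.605, n=30, p=4)
print(f"adjusted R-squared before: {before:.4f}, after: {after:.4f}")
```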

R. Saleh says

Hi, just wanted to comment on your question. It turns out that the absolute value of the t-statistic is a very good first-order method of determining variable importance. It is good at telling you about three groups: most important, moderately important and least important. You can use the adjusted R^2 to validate the ordering. The result can be shown to be correct using a penalty function on LSE but I won’t go into those details here. That is, the groupings are usually very good but ordering within the group may not be perfect.

Jakub says

Hey Jim

Very interesting article. I have some questions about it. Is it true that standardized coefficients shouldn’t be used in the case of linear regression with dichotomous or qualitative variables? Are there any other methods that can be useful in linear regression with those types of variables?

Jim Frost says

Hi Jakub,

It’s only possible to standardize continuous variables. Standardization is the process of subtracting the mean and dividing by the standard deviation. You can’t do either with categorical/binary variables. Offhand, I can’t think of a similar way to “standardize” those types of variables.

A/P NADARAJAH THIVIYA says

Can I know why we should be cautious in adding more independent variables to the model?

Coco says

Hi Jim,

I wonder if you know about the CARET package in R and apparently they are using the absolute value of the t-statistic to determine variable importance for linear models including regression. What do you think of that?

Bhagirath Baria says

Hey Jim! Thanks for such lucid and clear explanations. It is quite clear that you have a lot of clarity on the foundations. Even though intermediate texts clarify some issues, there always remains a lack of “intuitive” clarity. Your explanation just gives that. Keep up the good work. I wonder if you have written a full-fledged textbook/reference on Applied Statistics. If so, please share the link; I will purchase it. If not, please get into it!

Aritra Rawat says

Thank you so much sir for your valuable feedback and suggestion. It was really helpful.

Aritra Rawat says

Hey Jim, thank you for such an intuitive and insightful article, though I am still confused.

I have 20 independent variables (IVs). Can you please tell me how I should choose the predictors that influence my dependent variable (price of house) the most?

Few more questions.

1. Do I have to consider VIF and the tolerance factor?

2. Do I have to remove one of the factors if the correlation between them is very high (e.g., .963)?

3. Can I choose the factors that are obtained by using the stepwise calculation in SPSS?

Before (21 independent variables): adjusted R-squared = .546

Used stepwise method

After (6 independent variables): adjusted R-squared = .555

Jim Frost says

Hi Aritra,

In terms of choosing the best regression model, you should read my post about model specification.

VIFs don’t help you choose independent variables. However, if your IVs have high VIFs, then it’s difficult to determine the correct model. I write about that in my post about stepwise regression, which you should probably read anyway because you’re using it! If you have high VIFs, you’ll need to consider how to handle them.

If your IVs have such a high correlation as 0.963, then you undoubtedly have problematic levels of multicollinearity/high VIFs.
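As a rough check on that correlation, here is a small Python sketch. It assumes the simplest case of just two predictors, where the VIF for each is 1 / (1 − r²) and r is their pairwise correlation. With more predictors in the model, the VIF can only be equal to or higher than this pairwise value.

```python
def vif_from_pairwise_r(r):
    """VIF implied by a pairwise correlation r when the model has only two predictors."""
    if abs(r) >= 1:
        raise ValueError("perfect correlation: the VIF is undefined (infinite)")
    return 1.0 / (1.0 - r ** 2)


vif = vif_from_pairwise_r(0.963)
print(f"VIF for r = 0.963: {vif:.1f}")
```

The result is well above the thresholds of 5 or 10 that are commonly used as rules of thumb for problematic multicollinearity.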

As for stepwise, read my article about that. You’ll see how stepwise can often get you close to the correct model but usually not all the way there. It’s an algorithm that knows nothing about the subject area. Read my posts about multicollinearity, model specification, and stepwise regression and I think you’ll gain a lot of valuable knowledge.

Also, consider buying my ebook about regression analysis which has even more information.

Best of luck with your analysis!

Michael J Buono says

How do you compare standardized beta weights? If one independent variable has a standardized beta weight of 0.4 and another has a value of 0.1, is the first one 4 times more important?

Jim Frost says

Hi Michael,

Yes, that’s how you’d interpret it. Just be aware of the caveats I point out about “importance” in the post.

Indira says

Hey Jim,

Thank you for posting this. This article helps me a lot.

The thing is,

What is the lowest R-squared we can tolerate?

Is it possible that the values of the current predictor variables switch (the bigger becomes smaller, the smaller becomes bigger) when we add more predictor variables?

Thank you.

Jim Frost says

Hi Indira,

In terms of the lowest R-squared that is acceptable, that varies by subject area. For example, if you’re studying the relationship between physical characteristics and have high precision measurements, you might expect R-squared values to be in the 90s. In that context, 80% might be too low. On the other hand, when you’re trying to predict human behavior, there’s inherently more uncertainty. Consequently, an R-squared of 50% is probably considered high. You’ll need to look at similar research to determine what is considered a normal R-squared value as well as which values are just plain too low. I talk about this issue in much more detail in my post, How High Does R-squared Need to Be?

And, yes, when you add predictors to the model, it’s entirely possible that some variables that were really strong become weaker and others become stronger. This tends to happen when variables have a particular correlation pattern and it is called omitted variable bias. For more information, read my post about omitted variable bias.

I hope this helps!

Curt Miller says

Hi Jim,

My coworkers and I are running a multiple regression model (as required by federal regulations). We are including the 10 predictor variables required, and running against our department’s full population dataset.

We are running this analysis in SAS.

When adding any combination of only 3 of the 10 predictor variables, the results are complete and reasonable. However, when we add any 4th variable to the model, the results are as follows:

Note: Model is not full rank. Least-squares solutions for the parameters are not unique. Some statistics will be misleading. A reported DF of 0 or B means that the estimate is biased.

Note: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.

We are researching this and, thus far, are unable to find any information that discusses how to resolve this issue.

Can you offer any advice?

Thanks,

curt miller

Jim Frost says

Hi Curt,

So, what that means in the most general sense is that you don’t have enough information to estimate the model that you’re specifying. There are two main, more specific causes. One is that you have a very small sample size. I don’t know how large the department is, but it’s probably large enough to support a model with 4 predictors!

The other likely explanation is that you have enough data but too much is redundant, saying the same thing. This can happen when variables are perfectly linearly dependent. You can use one or more predictors to exactly predict the value of another predictor. It would be like including both male and female in the model. Perfect collinearity. I think that’s more likely. Some of your predictors are perfectly correlated. If so, you are fine excluding the redundancy and you won’t be losing information.

That’s my sense!
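To see why perfectly redundant predictors trigger a “not full rank” message, here is a minimal Python sketch with made-up data. It builds a design matrix with an intercept plus both a male and a female indicator, so male + female always equals the intercept column. The X'X matrix that least squares needs to invert then has a determinant of zero, so no unique solution exists. The helper names are hypothetical, and this only illustrates the general issue; it is not what SAS does internally.

```python
def gram(X):
    """Return X'X for a design matrix given as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def det(M):
    """Determinant by cofactor expansion (fine for tiny matrices)."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j, pivot in enumerate(M[0]):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * pivot * det(minor)
    return total


male = [1, 0, 1, 0]  # hypothetical indicator values for four people

full = [[1, m, 1 - m] for m in male]   # intercept, male, female (redundant)
reduced = [[1, m] for m in male]       # intercept, male only

print("det(X'X) with both indicators:", det(gram(full)))
print("det(X'X) after dropping one:  ", det(gram(reduced)))
```

Dropping the redundant column restores a nonzero determinant without losing any information, because the female indicator is fully determined by the male indicator.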

Navaneeth B S says

Hi Jim, many thanks for writing this blog and helping many like me out there gaining better understanding of statistics. I need your suggestion on independent variable transformation for OLS regression.

I have 10 years’ time series data, measured at quarterly interval (40 observations). My dependent variable (Y) is a transformed variable, calculated as year-on-year percentage change, as follows:

Y = { [Sales (t) – Sales (t – 4)] / [Sales (t – 4)] } * 100

Where, t is the current quarter; and (t – 4) is the same quarter from the previous year.

I have a set independent variables (X), which are also time series, measured at quarterly interval. My question is, is it required that all the independent variables should be transformed in line with the ‘Y’ or I can try transforming the variables at different levels as well, for example:

X = { [ X (t) – X (t – 1)] / [X (t – 1)] } * 100

X = { [ X (t) – X (t – 2)] / [X (t – 2) ] } * 100

X = { [ X (t) – X (t – 3)] / [X (t – 3)] } * 100

I appreciate your help. Many thanks.

Regards,

Navaneeth

Jim Frost says

Hi Navaneeth,

Yes, it’s entirely OK to pick and choose which IVs to transform. You don’t need to transform all or even any of them when you transform the DV. It depends on your data and subject area knowledge. You can also use different data transformations. If you use different transformations, you’ll have to be very careful about keeping it all straight when it comes to interpretations!
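The percentage-change formulas in the question differ only in the lag, so they can be written as one small function. This is a minimal Python sketch with hypothetical quarterly figures. A lag of 4 gives the year-on-year change used for Y, and lags of 1, 2, or 3 give the alternatives proposed for X.

```python
def pct_change(series, lag):
    """Percentage change versus the value `lag` periods earlier.

    Returns None where no earlier value exists. Assumes earlier values are nonzero.
    """
    out = []
    for t, current in enumerate(series):
        if t < lag:
            out.append(None)
        else:
            previous = series[t - lag]
            out.append((current - previous) / previous * 100.0)
    return out


# Hypothetical quarterly sales for two years.
sales = [100, 110, 120, 130, 110, 121, 150, 130]

yoy = pct_change(sales, 4)   # same quarter, previous year
qoq = pct_change(sales, 1)   # previous quarter
print(yoy)
print(qoq)
```

Note that each transformation loses as many leading observations as its lag, which matters with only 40 quarterly data points.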

Best of luck with your analysis!

Seman Kedir Ousman says

Dear Jim Frost

I want to determine the most important variable in logistic regression using stata software. Most of the independent variables are categorical including the outcome variable and others continuous. The question is how I can standardize these covariates all together and decide about the variables strength. Tips using command for stata user might be very helpful. Thank you so much for your response in advance.

Bunga Aisyah says

This content is so helpful for me, thank you very much. But do you mind if I ask you for a textbook related to your explanation?

Jim Frost says

Hi Bunga, I regularly use Applied Linear Models by Neter et al. You can find this book listed on my recommendations page. While all of this content should be in that book, it’s not compiled in as nice and neat a package as this blog post!

Claudia says

who was first….?

Jim Frost says

Yes, it was me in both cases. Thanks for writing! 🙂

jeff tennis says

Terrific article, this is exactly what I needed. This is a naive question (still new to predictive modeling), but when you say “standardize”, does that mean if I standardize all continuous variables I can compare them? If I create a linear model predicting home value based on square footage and age, then standardize both the square footage and age, could I then compare their model coefficients?

Taking it a step further, if square footage has a standardized coefficient of 2 and age has a standardized coefficient of 1, can I conclude square footage is 2x more important than age in predicting home value? I appreciate your help.

Jim Frost says

Hi Jeff, I’m glad the article was helpful!

Yes, it’s basically as you describe it. Standardize each continuous IV by subtracting its mean and dividing by the standard deviation for all observed values. Fit the model using the standardized variables and you obtain the standardized coefficients. Of course, many statistical packages will do this for you automatically and you won’t have to perform those steps.

When you’re working with these standardized coefficients, the coefficient represents the mean change in the DV given a one standard deviation change in the IV. The standard deviation of the IVs become the common scale. Of course, a 1 SD change in one IV equates to a different sized change in absolute terms compared to a 1 SD change in another IV. But, it puts them on a common scale to make them comparable.

It’s important to remember that there are a ton of caveats for all of this that I describe. Your interpretation in your example is one possible interpretation if you decide that standardized coefficients are meaningful for your study. But, flip the coefficients for the sake of argument. Suppose the age coefficient is 2 and square footage is 1. Further suppose you are looking at what a home owner can do to increase the value. In that scenario, even though age has a coefficient twice as large, the owner cannot change the age but can change the square footage. So, square footage is more important despite the smaller standardized coefficient!

Just be sure that you fully understand what most important means for your specific analysis.
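Here is a minimal Python sketch of the standardization steps described above, using made-up home data and, for simplicity, a single predictor. It standardizes the IV only, as in the description, and confirms that the coefficient from the standardized fit equals the raw coefficient multiplied by the IV’s standard deviation. The function names are hypothetical.

```python
from statistics import mean, stdev


def ols_slope(x, y):
    """Least squares slope for a simple regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def standardize(x):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(x), stdev(x)
    return [(a - m) / s for a in x]


# Hypothetical data: square footage and home value in thousands of dollars.
sqft = [1000, 1500, 2000, 2500, 3000]
value = [200, 260, 330, 370, 450]

raw_slope = ols_slope(sqft, value)               # change in value per square foot
std_slope = ols_slope(standardize(sqft), value)  # change in value per 1 SD of square footage

print(f"raw coefficient:          {raw_slope:.4f}")
print(f"standardized coefficient: {std_slope:.4f}")
print(f"raw coefficient * SD:     {raw_slope * stdev(sqft):.4f}")
```

The same rescaling applies to each continuous IV in a multiple regression, which is why the standardized coefficients share a common “per one standard deviation” scale.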

Paul Yindeemark says

Thank you so much. This whole time I thought the significance of the p-values was the determinant for identifying the most related independent variables.

Jim Frost says

Hi Paul, you’re very welcome! I think that’s a common misunderstanding. After all, we use p-values to determine which variables are statistically significant in the first place. Unfortunately, it doesn’t quite work that way!

Patrik Silva says

Hi Jim, fantastic post!

Which software do you normally use to produce results used in this blog?

I do not think that all software has these options!

Thank You!!!

Jim Frost says

Hi Patrik,

Thanks so much! I’m glad you’ve been enjoying them! I use Minitab statistical software in these posts. However, I think most functions are available in other software. Specifically for this post, I believe you can do all of this in both R and SPSS. However, they might have different terminology. For example, I believe SPSS refers to standardized coefficients as beta (which doesn’t make sense).

santosh says

Thanks!

sachin says

nice explanation…

Jim Frost says

Thank you!