You’ve settled on a regression model that contains independent variables that are statistically significant. By interpreting the statistical results, you can understand how changes in the independent variables are related to shifts in the dependent variable. At this point, it’s natural to wonder, “Which independent variable is the most important?”

Surprisingly, determining which variable is the most important is more complicated than it first appears. For a start, you need to define what you mean by “most important.” The definition should include details about your subject area and your goals for the regression model. So, there is no one-size-fits-all definition for the most important independent variable. Furthermore, the methods you use to collect and measure your data can affect the seeming importance of the independent variables.

In this blog post, I’ll help you determine which independent variable is the most important while keeping these issues in mind. First, I’ll reveal surprising statistics that are not related to importance. You don’t want to get tripped up by them! Then, I’ll cover statistical and non-statistical approaches for identifying the most important independent variables in your regression model. I’ll also include an example regression model where we’ll try these methods out.

**Related post**: When Should I Use Regression Analysis?

## __Do Not__ Associate Regular Regression Coefficients with the Importance of Independent Variables

The regular regression coefficients that you see in your statistical output describe the relationship between the independent variables and the dependent variable. The coefficient value represents the mean change of the dependent variable given a one-unit shift in an independent variable. Consequently, you might think you can use the absolute sizes of the coefficients to identify the most important variable. After all, a larger coefficient signifies a greater change in the mean of the dependent variable.

However, the independent variables can have dramatically different types of units, which make comparing the coefficients meaningless. For example, the meaning of a one-unit change differs considerably when your variables measure time, pressure, and temperature.

Additionally, a single type of measurement can use different units. For example, you can measure weight in grams and kilograms. If you fit two regression models using the same dataset, but use grams in one model and kilograms in the other, the weight coefficient changes by a factor of a thousand! Obviously, the importance of weight did not change at all even though the coefficient changed substantially. The model’s goodness-of-fit remains the same.

**Key point**: Larger coefficients don’t necessarily represent more important independent variables.
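To see the units problem in action, here is a minimal sketch in Python that uses only the standard library. The data are simulated, and the names `weight_kg`, `weight_g`, and `response` are hypothetical. It fits the same one-predictor model twice, once with weight in kilograms and once in grams.

```python
import random

def simple_ols(x, y):
    """Return (intercept, slope, r_squared) for a one-predictor least-squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared

# Simulated data (an assumption for illustration only).
random.seed(1)
weight_kg = [random.uniform(50, 100) for _ in range(200)]
response = [10 + 0.8 * w + random.gauss(0, 5) for w in weight_kg]

# Same measurements, different units.
weight_g = [w * 1000 for w in weight_kg]

_, slope_kg, r2_kg = simple_ols(weight_kg, response)
_, slope_g, r2_g = simple_ols(weight_g, response)

print(f"slope per kg: {slope_kg:.4f}   R-squared: {r2_kg:.4f}")
print(f"slope per g:  {slope_g:.7f}   R-squared: {r2_g:.4f}")
print(f"ratio of slopes: {slope_kg / slope_g:.1f}")
```

The two slopes differ by a factor of a thousand while R-squared is identical, so the size of the raw coefficient reflects the choice of units rather than importance.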

## __Do Not__ Link P-values to Importance

You can’t use the coefficient to determine the importance of an independent variable, but how about the variable’s p-value? Comparing p-values seems to make sense because we use them to determine which variables to include in the model. Do lower p-values represent more important variables?

Calculations for p-values include various properties of the variable, but importance is not one of them. A very small p-value does not indicate that the variable is important in a practical sense. An independent variable can have a tiny p-value when it has a very precise estimate, low variability, or a large sample size. The result is that effect sizes that are trivial in the practical sense can still have very low p-values. Consequently, when assessing statistical results, it’s important to determine whether an effect size is practically significant in addition to being statistically significant.

**Key point**: Low p-values don’t necessarily represent independent variables that are practically important.
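The sketch below illustrates how a trivial effect can earn a tiny p-value when the sample is large. It is a simulation under assumptions I chose for illustration (a true slope of 0.05, noise with a standard deviation of 1, and 50,000 observations), and it uses a normal approximation for the p-value, which is reasonable at this sample size.

```python
import math
import random

# Simulated data: a very weak relationship and a very large sample (assumed values).
random.seed(7)
n = 50_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.05 * xi + random.gauss(0, 1) for xi in x]

mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
syy = sum((yi - mean_y) ** 2 for yi in y)

slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual sum of squares and the standard error of the slope.
sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
se_slope = math.sqrt(sse / (n - 2) / sxx)
t_stat = slope / se_slope

# Two-sided p-value; with n this large the t distribution is effectively normal.
p_value = math.erfc(abs(t_stat) / math.sqrt(2))
r_squared = 1 - sse / syy

print(f"slope: {slope:.4f}")
print(f"t-statistic: {t_stat:.2f}")
print(f"p-value: {p_value:.3g}")
print(f"R-squared: {r_squared:.4f}")
```

The p-value is far below any conventional significance level, yet the variable explains well under 1% of the variance. Statistical significance here says little about practical importance.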

## __Do__ Assess These Statistics to Identify Variables that Might Be Important

I showed how you can’t use several of the more notable statistics to determine which independent variables are most important in a regression model. The good news is that there are several statistics that you can use. Unfortunately, they sometimes disagree because each one defines “most important” differently.

### Standardized coefficients

As I explained previously, you can’t compare the regular regression coefficients because they use different scales. However, standardized coefficients all use the same scale, which means you can compare them.

Statistical software calculates standardized regression coefficients by first standardizing the observed values of each independent variable and then fitting the model using the standardized independent variables. Standardization involves subtracting the variable’s mean from each observed value and then dividing by the variable’s standard deviation.

Fit the regression model using the standardized independent variables and compare the standardized coefficients. Because they all use the same scale, you can compare them directly. Standardized coefficients signify the mean change of the dependent variable given a one standard deviation shift in an independent variable.

Statisticians consider standardized regression coefficients to be a standardized effect size because they indicate the strength of the relationship between variables without using the original data units. Instead, this measure indicates the effect size in terms of standard deviations. Effect sizes help you understand how important the findings are in a practical sense. To learn more about unstandardized and standardized effect sizes, read my post about Effect Sizes in Statistics.

**Key point**: Identify the independent variable that has the largest absolute value for its standardized coefficient.
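Here is a rough sketch of the standardization procedure described above, written in plain Python. The tiny dataset, the variable names `temperature` and `pressure`, and the helper `fit_ols` are all made up for illustration; your statistical software will normally do this for you.

```python
import statistics

def fit_ols(columns, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Build the augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(k):
            if r != c:
                factor = a[r][c] / a[c][c]
                a[r] = [v - factor * p for v, p in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical data: the two predictors are on very different scales.
temperature = [20.0, 22.0, 21.0, 25.0, 23.0, 24.0, 26.0, 27.0]
pressure = [1.1, 1.0, 1.4, 1.2, 1.5, 1.3, 1.6, 1.7]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, -0.05, 0.2, -0.15]
y = [0.5 * t + 3.0 * p + e for t, p, e in zip(temperature, pressure, noise)]

raw = fit_ols([temperature, pressure], y)
std = fit_ols([standardize(temperature), standardize(pressure)], y)

print(f"raw coefficients:          temperature={raw[1]:.3f}  pressure={raw[2]:.3f}")
print(f"standardized coefficients: temperature={std[1]:.3f}  pressure={std[2]:.3f}")
```

In this made-up example, pressure has the larger raw coefficient, but temperature has the larger standardized coefficient because a one standard deviation change in temperature covers far more units than a one standard deviation change in pressure. Each standardized coefficient equals the raw coefficient multiplied by that variable's standard deviation.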

**Related post**: Standardizing your variables can also help when your model contains polynomials and interaction terms.

### Change in R-squared for the last variable added to the model

Many statistical software packages include a very helpful analysis. They can calculate the increase in R-squared when each variable is added to a model that already contains all of the other variables. In other words, how much does the R-squared increase for each variable when you add it to the model last?

This analysis might not sound like much, but there’s more to it than is readily apparent. When an independent variable is the last one entered into the model, the associated change in R-squared represents the improvement in the goodness-of-fit that is due solely to that last variable after all of the other variables have been accounted for. In other words, it represents the *unique* portion of the goodness-of-fit that is attributable only to each independent variable.

**Key point**: Identify the independent variable that produces the largest R-squared increase when it is the last variable added to the model.
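If your software doesn't report this statistic, you can approximate it yourself: fit the full model, refit it without each variable in turn, and take the difference in R-squared. The sketch below does that in plain Python on simulated data. The variable names, the coefficients used to generate the data, and the helpers `fit_ols` and `r_squared` are assumptions for illustration.

```python
import random

def fit_ols(columns, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(k):
            if r != c:
                factor = a[r][c] / a[c][c]
                a[r] = [v - factor * p for v, p in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]

def r_squared(columns, y):
    """R-squared of the least-squares fit of y on the given predictor columns."""
    coefs = fit_ols(columns, y)
    mean_y = sum(y) / len(y)
    fitted = [coefs[0] + sum(b * col[i] for b, col in zip(coefs[1:], columns))
              for i in range(len(y))]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst

# Simulated predictors and response (assumed values, not the post's dataset).
random.seed(42)
n = 200
data = {
    "temperature": [random.gauss(100, 10) for _ in range(n)],
    "pressure": [random.gauss(30, 5) for _ in range(n)],
    "time": [random.gauss(12, 2) for _ in range(n)],
}
y = [0.3 * t + 0.2 * p + 0.25 * h + random.gauss(0, 1)
     for t, p, h in zip(data["temperature"], data["pressure"], data["time"])]

full_r2 = r_squared(list(data.values()), y)

# Increase in R-squared when each variable is the last one added.
delta_r2 = {}
for name in data:
    others = [col for other, col in data.items() if other != name]
    delta_r2[name] = full_r2 - r_squared(others, y)

print(f"full model R-squared: {full_r2:.4f}")
for name, delta in delta_r2.items():
    print(f"{name}: increase in R-squared when added last = {delta:.4f}")
```

Because the simulated predictors are nearly uncorrelated, the increases add up to roughly the full R-squared. With correlated predictors, the unique portions typically sum to less than the total because some of the explained variance is shared.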

## Example of Identifying the Most Important Independent Variables in a Regression Model

The example output below shows a regression model that has three independent variables. You can download the CSV data file to try it yourself: ImportantVariables.

The statistical output displays the coded coefficients, which are the standardized coefficients. Temperature has the standardized coefficient with the largest absolute value. This measure suggests that Temperature is the most important independent variable in the regression model.

The graphical output below shows the incremental impact of each independent variable. This graph displays the increase in R-squared associated with each variable when it is added to the model last. Temperature uniquely accounts for the largest proportion of the variance. For our example, both statistics suggest that Temperature is the most important variable in the regression model.

## Cautions for Using Statistics to Pinpoint Important Variables

Standardized coefficients and the change in R-squared when a variable is added to the model last can both help identify the more important independent variables in a regression model—from a purely statistical standpoint. Unfortunately, these statistics can’t determine the practical importance of the variables. For that, you’ll need to use your knowledge of the subject area.

The manner in which you obtain and measure your sample can bias these statistics and throw off your assessment of importance.

When you collect a random sample, you can expect the sample variability of the independent variable values to reflect the variability in the population. Consequently, the change in R-squared values and standardized coefficients should reflect the correct population values.

However, if the sample contains a restricted range (less variability) for a variable, both statistics tend to underestimate the importance. Conversely, if the variability of the sample is greater than the population variability, the statistics tend to overestimate the importance of that variable.
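A quick simulation can make the restricted-range effect concrete. In this sketch, the true relationship is the same in both samples (a slope of 1, chosen arbitrarily), but the second sample only includes observations where the predictor falls in a narrow band. All values and names are assumptions for illustration.

```python
import random
import statistics

def summarize(x, y):
    """Return (slope, standardized slope, R-squared) for a one-predictor fit."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    # Standardized slope: change in y per one standard deviation of x.
    return slope, slope * statistics.stdev(x), sxy ** 2 / (sxx * syy)

random.seed(3)
n = 5000
x = [random.uniform(0, 10) for _ in range(n)]
y = [2 + 1.0 * xi + random.gauss(0, 2) for xi in x]

# Restricted sample: keep only observations with x between 4 and 6.
kept = [(xi, yi) for xi, yi in zip(x, y) if 4 <= xi <= 6]
x_restricted = [xi for xi, _ in kept]
y_restricted = [yi for _, yi in kept]

full = summarize(x, y)
restricted = summarize(x_restricted, y_restricted)

print(f"full range:       slope={full[0]:.3f}  standardized={full[1]:.3f}  R-squared={full[2]:.3f}")
print(f"restricted range: slope={restricted[0]:.3f}  standardized={restricted[1]:.3f}  R-squared={restricted[2]:.3f}")
```

The raw slope stays close to its true value in both samples, but the standardized coefficient and R-squared both shrink in the restricted sample, which makes the variable look less important than it is.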

Also, consider the quality of measurements for your independent variables. If the measurement precision for a particular variable is relatively low, that variable can appear to be less predictive than it truly is.
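The following sketch shows one way imprecise measurement can mask a variable's predictive power. It simulates a predictor, then creates a second copy with random measurement error added, and fits the same model to each. The error model (independent, normally distributed noise) and all numeric values are assumptions for illustration.

```python
import random

def slope_and_r2(x, y):
    """Return (slope, R-squared) for a one-predictor least-squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sxx, sxy ** 2 / (sxx * syy)

random.seed(11)
n = 5000
x_true = [random.gauss(50, 2) for _ in range(n)]
y = [1.5 * xi + random.gauss(0, 1) for xi in x_true]

# The same predictor recorded with a low-precision instrument (assumed error SD of 2).
x_noisy = [xi + random.gauss(0, 2) for xi in x_true]

slope_precise, r2_precise = slope_and_r2(x_true, y)
slope_noisy, r2_noisy = slope_and_r2(x_noisy, y)

print(f"precise measurement: slope={slope_precise:.3f}  R-squared={r2_precise:.3f}")
print(f"noisy measurement:   slope={slope_noisy:.3f}  R-squared={r2_noisy:.3f}")
```

The underlying relationship is identical in both fits, yet the noisy version produces a smaller slope and a lower R-squared. The variable hasn't become less important; it is just measured less precisely.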

When the goal of your analysis is to change the mean of the dependent variable, you must be sure that the relationships between the independent variables and the dependent variable are causal rather than just correlational. If these relationships are not causal, then intentional changes in the independent variables won’t cause the desired changes in the dependent variable despite any statistical measures of importance.

Typically, you need to perform a randomized experiment to determine whether the relationships are causal.

## Non-Statistical Issues to Help Find Important Variables

The definition of “most important” should depend on your goals and the subject area. Practical issues can influence which variable you consider to be the most important.

For instance, when you want to affect the value of the dependent variable by changing the independent variables, use your knowledge to identify the variables that are easiest to change. Some variables can be difficult, expensive, or even impossible to change.

“Most important” is a subjective, context-sensitive quality. Statistics can highlight candidate variables, but you still need to apply your subject-area expertise.

If you’re learning regression, check out my Regression Tutorial!

**Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.**

Ghannaii says

Hi Jim,

Please, is there a limit to the number of independent variables that can be entered into a logistic regression model?

Thanks.

Jim Frost says

Hi Ghannail,

Yes, there’s a limit. It’s generally defined by your sample size. If you include too many IVs, it’s referred to as overfitting your model. I write about it from the perspective of least squares regression in Overfitting Regression Models: Problems, Detection, and Avoidance. The same principles apply to logistic regression but it’s a bit more complex. In that article, I include a reference that, if my memory serves, discusses logistic regression.

I hope that helps!

Didier says

Thank you for replying,

Yes, the change in R-square was larger for diabetes distress,

Sorry for the inconvenience. I thought there was a problem with my connection. It was later that I understood that you moderate the comments.

I will go and check for the omitted variable bias. Your site is very helpful.

Thank you so much

Jim Frost says

Hi Didier, no worries! I’m glad my site was helpful. It does sound like both measures suggest that diabetes distress is the more important variable.

Didier says

Hi Jim, Thank you for this helpful article,

I had a similar issue with my dissertation project. I wanted to know which of social support and diabetes distress was the best predictor of self-care

At first, I ran a Pearson correlation and found that only diabetes distress related significantly with self-care, whereas there was no significant relationship between self-care and social support.

For the multiple regression, I was confused because both social support and diabetes distress were significant. Moreover, for the unstandardized coefficient, social support value was 0.206 and diabetes distress -.187. Therefore, although it did not really make sense, I assumed social support was the best predictor.

But effectively, the standardized coefficient tells another story, social support has a value of .234 and diabetes distress -.458, and the part correlation of diabetes distress is also higher than social support,

Therefore, if I understand what you said, according to the standardized value and change in R squared, Diabetes distress should be my best predictor?

Can you please advise,

Jim Frost says

Hi Didier,

When you perform a Pearson correlation, it’s a pairwise analysis. It just includes a pair of variables and isn’t controlling for other variables. However, when you performed the multiple regression analysis, it controlled for the other variables in the model. That explains the different results. You’re probably witnessing omitted variable bias in action. Click the link to learn more. I actually include a similar example in it.

As for the most important predictor, the standardized coefficients suggest that the diabetes distress is a more important predictor, but note the caveats I mention in this article.

I didn’t see the change in R-square so I can’t comment. Was it larger for diabetes distress?

By the way, I saw multiple attempts to post this comment. Bear in mind that I manually moderate the comments, so there will often be a delay before you see it.

Alvin says

Noted, thanks for your reply Jim

Alvin says

Hi Jim

So after the process of standardization, the higher value of coefficient indicates a more important predictor? How about negative value? My model is nonlinear in the form of y = b0 * exp ( b1 * x1 + b2 * x2 + … )

Jim Frost says

Hi Alvin,

The key point to consider is that I write that you should look for the standardized coefficient with “the largest *absolute* value.” For example, if you have coefficients of -3 and +2, the -3 coefficient has a larger absolute value.

Ankit says

Thanks Jim for your reply.

So the data I have is manufacturing data where the dependent variable or outcome is the yield. There are multiple parameters that affect the yield.

Whenever the value of a specific parameter changes, it causes a yield drop or yield increase.

For example, this month’s yield could be affected by a change in one parameter. Similarly, next month’s yield could be affected by another parameter.

I want to build an automated system that pulls data every month and tells me which parameters are currently causing the yield to increase or drop. The parameter values are controlled by different machines and can change at times, causing the drop in yield.

I saw your post about finding the most important independent variables and thought of using measures like the standardized Pearson coefficient, MIC score, etc.

How could I proceed with this project?

Jim Frost says

Hi Ankit,

What puzzles me is why some variables would be significant one week and other variables another week. Until you understand why that happens, you’ll have a hard time modeling the process!

For more information about choosing the correct model, please read my post about specifying the correct model. However, it does not address the issue above!

Ankit Singh says

Hello Jim

Great tutorial.

I am stuck on a real world problem. I have a dependent variable which is affected by around 20 independent variables but not always. Maybe this week one independent variable is causing change to the dependent one. Maybe next week some other would cause drop or rise of the dependent variable. All variables are numeric having different scales.

Would this method help me in building an automatic model which gives me the independent variable which currently is causing the dependent to drop or rise?

Jim Frost says

Hi Ankit,

To me, it sounds like the real problem with your analysis is the changing significance of the variables over time. That will make modeling the outcome very difficult. Do you have any understanding of why that’s happening? Typically, the variables in a model will consistently have a significant relationship with the outcome variable. Are you sure that it’s not random changes causing chance correlations? Are there theoretical reasons why the relationships would come and go?

I’m not sure which method you’re asking me about in your last sentence?

It sounds like using an automated model building method might be behind your problems by detecting chance correlations. Try building a model using the principles I describe in my post about specifying the correct model. Also, it sounds like you’re modeling something over time, which introduces special challenges. For example, you might need to include lagged versions of the variables to account for effects over time.

Fekremariam says

Thank you, it helps!

Charles Adusei says

It is amazing how you are imparting knowledge to students and lecturers. I read your post on identifying the most important independent variables in regression models. This is very timely for me and very insightful. It will help to improve my current research paper for a top journal. Thanks for your service.

Peeyush says

Dear Jim,

I hope you are well and fine.

I have one dependent variable and independent variable in my regression, but data showing heteroscedasticity may be due to negative values in data. Can I add constant positive value to both dependent and independent variables to make both series positive? This will help me to remove heteroscedasticity by log transformation. In this case, series mean will change but the variance will not change. After that I will apply regression, is it possible to do so?

Siyabonga says

Please write a book 🙏🏿 you a God send

Sreeja says

Thank you Sir

Jithin Narayanan says

How to find which variable is having more impact using interpretation

Jim Frost says

Hi Jithin, well, that’s the exact topic of this article! Read through it carefully and if you have any more specific questions, be sure to ask!

Jakub says

Hi

Thank you. It’s nice to know that I can use the change in R-squared for the last variable added to the model to assess the importance of variables in the case of regression with categorical/binary variables. You make things clearer for me.

Jakub says

Thank you for your quick response. I will be very glad if you answer my last additional question on this topic. Could I use the change in R-squared for the last variable added to the model to assess the importance of variables in that case?

Jim Frost says

Hi Jakub, yes you can. That’s one of the methods that I recommend in this post.

R Saleh says

Hi, just a reply to your question. Some independent variables add noise to the model instead of signal. You can use the adjusted R^2 to see if adding a new variable increases or decreases this statistic. If it increases, keep it. Otherwise it is adding noise.

Another reason is that overfitting can occur if there are too many variables. So we would like to select the optimal set if we can find it to build our compact model.

R. Saleh says

Hi, just wanted to comment on your question. It turns out that the absolute value of the t-statistic is a very good first-order method of determining variable importance. It is good at telling you about three groups: most important, moderately important and least important. You can use the adjusted R^2 to validate the ordering. The result can be shown to be correct using a penalty function on LSE but I won’t go into those details here. That is, the groupings are usually very good but ordering within the group may not be perfect.

Jakub says

Hey Jim

Very interesting article. I have some questions about it. Is it true that standardized coefficients shouldn’t be used in the case of linear regression with dichotomous or qualitative variables? Are there any other methods that can be useful in linear regression with those types of variables?

Jim Frost says

Hi Jakub,

It’s only possible to standardize continuous variables. Standardization is the process of subtracting the mean and dividing by the standard deviation. You can’t do either with categorical/binary variables. Off hand, I can’t think of a similar way to “standardize” those types of variables.

A/P NADARAJAH THIVIYA says

can i know why we should be cautious in adding more independent variables in the model?

Coco says

Hi Jim,

I wonder if you know about the CARET package in R and apparently they are using the absolute value of the t-statistic to determine variable importance for linear models including regression. What do you think of that?

Bhagirath Baria says

Hey Jim! Thanks for such lucid and clear explanations. It is quite clear that you have a lot of clarity on the foundations. Even though intermediate texts clarify some issues, there always remains a lack of “intuitive” clarity. Your explanation just gives that. Keep up the good work. I wonder if you have written a full-fledged textbook/reference on Applied Statistics. If so, please share the link; I will purchase it. If not, please get into it!

Aritra Rawat says

Thank you so much sir for your valuable feedback and suggestion. It was really helpful.

Aritra Rawat says

Hey Jim, thank you for such an intuitive and insightful article, though i am still confused,

I have 20 independent variables (IVs). Please can you tell me how I should choose the predictors that influence my dependent variable (price of house) most?

Few more questions.

1. Do I have to consider VIF and the tolerance factor?

2. Do I have to remove one of the factors if the correlation between them is very high, like .963?

3. Can I choose the factors that are obtained by using the stepwise calculation in SPSS?

Before —-(independent variable-21)— adjusted R square – .546

Used stepwise method

After ——-(independent variable -6) — adjusted R square – .555

Jim Frost says

Hi Aritra,

In terms of choosing the best regression model, you should read my post about model specification.

VIFs don’t help you choose independent variables. However, if your IVs have high VIFs, then it’s difficult to determine the correct model. I write about that in my post about stepwise regression, which you should probably read anyway because you’re using it! If you have high VIFs, you’ll need to consider how to handle them.

If your IVs have such a high correlation as 0.963, then you undoubtedly have problematic levels of multicollinearity/high VIFs.

As for stepwise, read my article about that. You’ll see how stepwise can often get you close to the correct model but usually not all the way there. It’s an algorithm that knows nothing about the subject area. Read my posts about multicollinearity, model specification, and stepwise regression and I think you’ll gain a lot of valuable knowledge.

Also, consider buying my ebook about regression analysis which has even more information.

Best of luck with your analysis!

Michael J Buono says

How do you compare standardized beta weights? If one independent variable has a standardized beta weight of 0.4 and another has a value of 0.1, is the first one 4 times more important?

Jim Frost says

Hi Michael,

Yes, that’s how you’d interpret it. Just be aware of the caveats I point out about “importance” in the post.

Indira says

Hey Jim,

Thank you for posting this. This article helps me a lot.

The thing is,

What is the lowest R-squared we can tolerate?

Is it possible that the values of the current predictor variables switch (the bigger becomes smaller, the smaller becomes bigger) when we add more predictor variables?

Thank you.

Jim Frost says

Hi Indira,

In terms of the lowest R-squared that is acceptable, that varies by subject area. For example, if you’re studying the relationship between physical characteristics and have high precision measurements, you might expect R-squared values to be in the 90s. In that context, 80% might be too low. On the other hand, when you’re trying to predict human behavior, there’s inherently more uncertainty. Consequently, an R-squared of 50% is probably considered high. You’ll need to look at similar research to determine what is considered a normal R-squared value as well as which values are just plain too low. I talk about this issue in much more detail in my post about how high R-squared needs to be.

And, yes, when you add predictors to the model, it’s entirely possible that some variables that were really strong become weaker and others can become stronger. This tends to happen when variables have a particular correlation pattern and it is called omitted variable bias. For more information, read my post about omitted variable bias.

I hope this helps!

Curt Miller says

Hi Jim,

My coworkers and I are running a multiple regression model (as required by federal regulations). We are including the 10 predictor variables required, and running against our department’s full population dataset.

We are running this analysis in SAS.

When adding any combination of only 3 of the 10 predictor variables, the results are complete and reasonable. However, when we add any 4th variable to the model, the results are as follows:

Note: Model is not full rank. Least-squares solutions for the parameters are not unique. Some statistics will be misleading. A reported DF of 0 or B means that the estimate is biased.

Note: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.

We are researching this and, thus far, are unable to find any information that discusses how to resolve this issue.

Can you offer any advice?

Thanks,

curt miller

Jim Frost says

Hi Curt,

So, what that means in the most general sense is that you don’t have enough information to estimate the model that you’re specifying. There are two main more specific causes. One is that you have a very small sample size. I don’t know how large the department is, but it’s probably large enough to support a model with 4 predictors!

The other likely explanation is that you have enough data but too much is redundant, saying the same thing. This can happen when variables are perfectly linearly dependent. You can use one or more predictors to exactly predict the value of another predictor. It would be like including both male and female in the model. Perfect collinearity. I think that’s more likely. Some of your predictors are perfectly correlated. If so, you are fine excluding the redundancy and you won’t be losing information.

That’s my sense!

Navaneeth B S says

Hi Jim, many thanks for writing this blog and helping many like me out there gaining better understanding of statistics. I need your suggestion on independent variable transformation for OLS regression.

I have 10 years’ time series data, measured at quarterly interval (40 observations). My dependent variable (Y) is a transformed variable, calculated as year-on-year percentage change, as follows:

Y = { [Sales (t) – Sales (t – 4)] / [Sales (t – 4)] } * 100

Where, t is the current quarter; and (t – 4) is the same quarter from the previous year.

I have a set independent variables (X), which are also time series, measured at quarterly interval. My question is, is it required that all the independent variables should be transformed in line with the ‘Y’ or I can try transforming the variables at different levels as well, for example:

X = { [ X (t) – X (t – 1)] / [X (t – 1)] } * 100

X = { [ X (t) – X (t – 2)] / [X (t – 2) ] } * 100

X = { [ X (t) – X (t – 3)] / [X (t – 3)] } * 100

I appreciate your help. Many thanks.

Regards,

Navaneeth

Jim Frost says

Hi Navaneeth,

Yes, it’s entirely OK to pick and choose which IVs to transform. You don’t need to transform all or even any of them when you transform the DV. It depends on your data and subject area knowledge. You can also use different data transformations. If you use different transformations, you’ll have to be very careful about keeping it all straight when it comes to interpretations!

Best of luck with your analysis!

Seman Kedir Ousman says

Dear Jim Frost

I want to determine the most important variable in logistic regression using stata software. Most of the independent variables are categorical including the outcome variable and others continuous. The question is how I can standardize these covariates all together and decide about the variables strength. Tips using command for stata user might be very helpful. Thank you so much for your response in advance.

Bunga Aisyah says

This content is so helpful for me, thank you very much. But do you mind if I ask you for a textbook that is related to your explanation?

Jim Frost says

Hi Bunga, I regularly use Applied Linear Models by Neter et al. You can find this book listed on my recommendations page. While all of this content should be in that book, it’s not compiled as neatly as it is in this blog post!

Claudia says

who was first….?

Jim Frost says

Yes, it was me in both cases. Thanks for writing! 🙂

jeff tennis says

Terrific article, this is exactly what I needed. This is a naive question (still new to predictive modeling), but when you say “standardize”, does that mean if I standardize all continuous variables I can compare them? If I create a linear model predicting home value based on square footage and age, then standardize both the square footage and age, could I then compare their model coefficients?

Taking it a step further, if square footage has a standardized coefficient of 2 and age has a standardized coefficient of 1, can I conclude square footage is 2x more important than age in predicting home value? I appreciate your help.

Jim Frost says

Hi Jeff, I’m glad the article was helpful!

Yes, it’s basically as you describe it. Standardize each continuous IV by subtracting its mean and dividing by the standard deviation for all observed values. Fit the model using the standardized variables and you obtain the standardized coefficients. Of course, many statistical packages will do this for you automatically and you won’t have to perform those steps.

When you’re working with these standardized coefficients, the coefficient represents the mean change in the DV given a one standard deviation change in the IV. The standard deviation of the IVs become the common scale. Of course, a 1 SD change in one IV equates to a different sized change in absolute terms compared to a 1 SD change in another IV. But, it puts them on a common scale to make them comparable.

It’s important to remember that there are a ton of caveats for all of this that I describe. Your interpretation in your example is one possible interpretation if you decide that standardized coefficients are meaningful for your study. But, flip the coefficients for the sake of argument. Suppose the age coefficient is 2 and square footage is 1. Further suppose you are looking at what a home owner can do to increase the value. In that scenario, even though age has a coefficient twice as large, the owner cannot change the age but can change the square footage. So, square footage is more important despite the smaller standardized coefficient!

Just be sure that you fully understand what most important means for your specific analysis.

Paul Yindeemark says

Thank you so much. This whole time I thought the significance of p-values was the determinant for identifying the most related independent variables.

Jim Frost says

Hi Paul, you’re very welcome! I think that’s a common misunderstanding. After all, we use p-values to determine which variables are statistically significant in the first place. Unfortunately, it doesn’t quite work that way!

Patrik Silva says

Hi Jim, fantastic post!

Which software do you normally use to produce results used in this blog?

I do not think that all software has these options!

Thank You!!!

Jim Frost says

Hi Patrik,

Thanks so much! I’m glad you’ve been enjoying them! I use Minitab statistical software in these posts. However, I think most functions are available in other software. Specifically for this post, I believe you can do all of this in both R and SPSS. However, they might have different terminology. For example, I believe SPSS refers to standardized coefficients as beta (which doesn’t make sense).

santosh says

Thanks!

sachin says

nice explanation…

Jim Frost says

Thank you!