You’ve settled on a regression model that contains independent variables that are statistically significant. By interpreting the statistical results, you can understand how changes in the independent variables are related to shifts in the dependent variable. At this point, it’s natural to wonder, “Which independent variable is the most important?”

Surprisingly, determining which variable is the most important is more complicated than it first appears. For a start, you need to define what you mean by “most important.” The definition should include details about your subject area and your goals for the regression model. So, there is no one-size-fits-all definition for the most important independent variable. Furthermore, the methods you use to collect and measure your data can affect the apparent importance of the independent variables.

In this blog post, I’ll help you determine which independent variable is the most important while keeping these issues in mind. First, I’ll reveal surprising statistics that are not related to importance. You don’t want to get tripped up by them! Then, I’ll cover statistical and non-statistical approaches for identifying the most important independent variables in your regression model. I’ll also include an example regression model where we’ll try these methods out.

**Related post**: When Should I Use Regression Analysis?

## __Do Not__ Associate Regular Regression Coefficients with the Importance of Independent Variables

The regular regression coefficients that you see in your statistical output describe the relationship between the independent variables and the dependent variable. The coefficient value represents the mean change of the dependent variable given a one-unit shift in an independent variable. Consequently, you might think you can use the absolute sizes of the coefficients to identify the most important variable. After all, a larger coefficient signifies a greater change in the mean of the dependent variable.

However, the independent variables can have dramatically different types of units, which make comparing the coefficients meaningless. For example, the meaning of a one-unit change differs considerably when your variables measure time, pressure, and temperature.

Additionally, a single type of measurement can use different units. For example, you can measure weight in grams or kilograms. If you fit two regression models using the same dataset, but use grams in one model and kilograms in the other, the weight coefficient changes by a factor of a thousand! Obviously, the importance of weight did not change at all even though the coefficient changed substantially. The model’s goodness-of-fit remains the same.

**Key point**: Larger coefficients don’t necessarily represent more important independent variables.

## __Do Not__ Link P-values to Importance

You can’t use the coefficient to determine the importance of an independent variable, but how about the variable’s p-value? Comparing p-values seems to make sense because we use them to determine which variables to include in the model. Do lower p-values represent more important variables?

Calculations for p-values include various properties of the variable, but importance is not one of them. A very small p-value does not indicate that the variable is important in a practical sense. An independent variable can have a tiny p-value when it has a very precise estimate, low variability, or a large sample size. The result is that effect sizes that are trivial in the practical sense can still have very low p-values. Consequently, when assessing statistical results, it’s important to determine whether an effect size is practically significant in addition to being statistically significant.

**Key point**: Low p-values don’t necessarily represent independent variables that are practically important.

## __Do__ Assess These Statistics to Identify Variables That Might Be Important

I showed how you can’t use several of the more notable statistics to determine which independent variables are most important in a regression model. The good news is that there are several statistics that you can use. Unfortunately, they sometimes disagree because each one defines “most important” differently.

### Standardized coefficients

As I explained previously, you can’t compare the regular regression coefficients because they use different scales. However, standardized coefficients all use the same scale, which means you can compare them.

Statistical software calculates standardized regression coefficients by first standardizing the observed values of each independent variable and then fitting the model using the standardized independent variables. Standardization involves subtracting the variable’s mean from each observed value and then dividing by the variable’s standard deviation.

After you fit the model using the standardized independent variables, compare the resulting standardized coefficients directly. Each standardized coefficient signifies the mean change of the dependent variable given a one standard deviation shift in that independent variable.
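If your software doesn’t report standardized coefficients, the procedure can be sketched by hand. The example below is an illustration only: the data are simulated, the variable names and true coefficients are made up, and the small `solve` and `ols` helpers are hypothetical stand-ins for whatever regression routine you normally use.

```python
import random
import statistics

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(columns, y):
    """Return [intercept, b1, b2, ...] from the normal equations."""
    rows = [[1.0, *vals] for vals in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def standardize(values):
    """Subtract the mean, then divide by the standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

random.seed(2)
n = 300
# Hypothetical predictors measured on very different scales.
temperature = [random.uniform(150, 250) for _ in range(n)]
pressure = [random.uniform(1, 5) for _ in range(n)]
time_min = [random.uniform(10, 60) for _ in range(n)]
strength = [0.5 * t + 6.0 * p + 0.2 * m + random.gauss(0, 5)
            for t, p, m in zip(temperature, pressure, time_min)]

names = ["Temperature", "Pressure", "Time"]
predictors = [temperature, pressure, time_min]

# Drop the intercept ([1:]) and keep one coefficient per predictor.
raw = dict(zip(names, ols(predictors, strength)[1:]))
std = dict(zip(names, ols([standardize(c) for c in predictors], strength)[1:]))

for name in names:
    print(f"{name:12s} raw: {raw[name]:8.3f}   standardized: {std[name]:8.3f}")
```

In this simulated data, Pressure has the largest raw coefficient only because its units are small, while Temperature has the largest standardized coefficient. Only the independent variables are standardized here, so the standardized coefficients remain in the units of the dependent variable.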

**Key point**: Identify the independent variable that has the largest absolute value for its standardized coefficient.

**Related post**: Standardizing your variables can also help when your model contains polynomials and interaction terms.

### Change in R-squared for the last variable added to the model

Many statistical software packages include a very helpful analysis. They can calculate the increase in R-squared when each variable is added to a model that already contains all of the other variables. In other words, how much does the R-squared increase for each variable when you add it to the model last?

This analysis might not sound like much, but there’s more to it than is readily apparent. When an independent variable is the last one entered into the model, the associated change in R-squared represents the improvement in the goodness-of-fit that is due solely to that last variable after all of the other variables have been accounted for. In other words, it represents the *unique* portion of the goodness-of-fit that is attributable only to that independent variable.
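The calculation amounts to fitting the full model, refitting it with one variable left out, and taking the difference in R-squared. Here is a rough sketch under the same assumptions as before: simulated data, made-up variable names, and hypothetical `solve`, `ols`, and `r_squared` helpers rather than any particular package’s API.

```python
import random
import statistics

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(columns, y):
    """Return [intercept, b1, b2, ...] from the normal equations."""
    rows = [[1.0, *vals] for vals in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def r_squared(columns, y):
    """R-squared = 1 - SSE / SST for a model with an intercept."""
    coefs = ols(columns, y)
    mean_y = statistics.fmean(y)
    sst = sum((yi - mean_y) ** 2 for yi in y)
    sse = sum(
        (yi - (coefs[0] + sum(b * v for b, v in zip(coefs[1:], vals)))) ** 2
        for vals, yi in zip(zip(*columns), y)
    )
    return 1 - sse / sst

random.seed(2)
n = 300
# Hypothetical predictors and response.
data = {
    "Temperature": [random.uniform(150, 250) for _ in range(n)],
    "Pressure": [random.uniform(1, 5) for _ in range(n)],
    "Time": [random.uniform(10, 60) for _ in range(n)],
}
y = [0.5 * t + 6.0 * p + 0.2 * m + random.gauss(0, 5)
     for t, p, m in zip(*data.values())]

full_r2 = r_squared(list(data.values()), y)

# Increase in R-squared when each variable is the last one added.
delta_r2 = {}
for name in data:
    others = [col for other, col in data.items() if other != name]
    delta_r2[name] = full_r2 - r_squared(others, y)

print(f"Full model R-squared: {full_r2:.3f}")
for name, delta in delta_r2.items():
    print(f"{name:12s} increase in R-squared when added last: {delta:.3f}")
```

When predictors are correlated with each other, these unique contributions will not sum to the full model’s R-squared, because the shared portion is not credited to any single variable.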

**Key point**: Identify the independent variable that produces the largest R-squared increase when it is the last variable added to the model.

## Example of Identifying the Most Important Independent Variables in a Regression Model

The example output below shows a regression model that has three independent variables. You can download the CSV data file to try it yourself: ImportantVariables.

The statistical output displays the coded coefficients, which are the standardized coefficients. Temperature has the standardized coefficient with the largest absolute value. This measure suggests that Temperature is the most important independent variable in the regression model.

The graphical output below shows the incremental impact of each independent variable. This graph displays the increase in R-squared associated with each variable when it is added to the model last. Temperature uniquely accounts for the largest proportion of the variance. For our example, both statistics suggest that Temperature is the most important variable in the regression model.

## Cautions for Using Statistics to Pinpoint Important Variables

Standardized coefficients and the change in R-squared when a variable is added to the model last can both help identify the more important independent variables in a regression model—from a purely statistical standpoint. Unfortunately, these statistics can’t determine the practical importance of the variables. For that, you’ll need to use your knowledge of the subject area.

The manner in which you obtain and measure your sample can bias these statistics and throw off your assessment of importance.

When you collect a random sample, you can expect the sample variability of the independent variable values to reflect the variability in the population. Consequently, the change in R-squared values and standardized coefficients should reflect the correct population values.

However, if the sample contains a restricted range (less variability) for a variable, both statistics tend to underestimate the importance. Conversely, if the variability of the sample is greater than the population variability, the statistics tend to overestimate the importance of that variable.

Also, consider the quality of measurements for your independent variables. If the measurement precision for a particular variable is relatively low, that variable can appear to be less predictive than it truly is.

When the goal of your analysis is to change the mean of the dependent variable, you must be sure that the relationships between the independent variables and the dependent variable are causal rather than merely correlational. If these relationships are not causal, then intentional changes in the independent variables won’t cause the desired changes in the dependent variable despite any statistical measures of importance.

Typically, you need to perform a randomized experiment to determine whether the relationships are causal.

## Non-Statistical Issues to Help Find Important Variables

The definition of “most important” should depend on your goals and the subject area. Practical issues can influence which variable you consider to be the most important.

For instance, when you want to affect the value of the dependent variable by changing the independent variables, use your knowledge to identify the variables that are easiest to change. Some variables can be difficult, expensive, or even impossible to change.

“Most important” is a subjective, context-sensitive quality. Statistics can highlight candidate variables, but you still need to apply your subject-area expertise.

If you’re learning regression, check out my Regression Tutorial!

**Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.**

Curt Miller says

Hi Jim,

My coworkers and I are running a MultiRegression Model (as required by federal regulations). We are including the 10 required predictor variables and running the model against our department’s full population dataset.

We are running this analysis in SAS.

When adding any combination of only 3 of the 10 predictor variables, the results are complete and reasonable. However, when we add any 4th variable to the model, the results are as follows:

Note: Model is not full rank. Least-squares solutions for the parameters are not unique. Some statistics will be misleading. A reported DF of 0 or B means that the estimate is biased.

Note: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.

We are researching this and, thus far, are unable to find any information that discusses how to resolve this issue.

Can you offer any advice?

Thanks,

curt miller

Jim Frost says

Hi Curt,

So, what that means in the most general sense is that you don’t have enough information to estimate the model that you’re specifying. There are two main more specific causes. One is that you have a very small sample size. I don’t know how large the department is, but it’s probably large enough to support a model with 4 predictors!

The other likely explanation is that you have enough data but too much is redundant, saying the same thing. This can happen when variables are perfectly linearly dependent. You can use one or more predictors to exactly predict the value of another predictor. It would be like including both male and female in the model. Perfect collinearity. I think that’s more likely. Some of your predictors are perfectly correlated. If so, you are fine excluding the redundancy and you won’t be losing information.

That’s my sense!

Navaneeth B S says

Hi Jim, many thanks for writing this blog and helping many like me out there gaining better understanding of statistics. I need your suggestion on independent variable transformation for OLS regression.

I have 10 years’ time series data, measured at quarterly interval (40 observations). My dependent variable (Y) is a transformed variable, calculated as year-on-year percentage change, as follows:

Y = { [Sales (t) – Sales (t – 4)] / [Sales (t – 4)] } * 100

Where, t is the current quarter; and (t – 4) is the same quarter from the previous year.

I have a set of independent variables (X), which are also time series, measured at quarterly intervals. My question is: must all the independent variables be transformed in line with ‘Y’, or can I try transforming the variables at different lags as well? For example:

X = { [ X (t) – X (t – 1)] / [X (t – 1)] } * 100

X = { [ X (t) – X (t – 2)] / [X (t – 2) ] } * 100

X = { [ X (t) – X (t – 3)] / [X (t – 3)] } * 100

I appreciate your help. Many thanks.

Regards,

Navaneeth

Jim Frost says

Hi Navaneeth,

Yes, it’s entirely OK to pick and choose which IVs to transform. You don’t need to transform all or even any of them when you transform the DV. It depends on your data and subject area knowledge. You can also use different data transformations. If you use different transformations, you’ll have to be very careful about keeping it all straight when it comes to interpretations!

Best of luck with your analysis!

Seman Kedir Ousman says

Dear Jim Frost

I want to determine the most important variable in logistic regression using stata software. Most of the independent variables are categorical including the outcome variable and others continuous. The question is how I can standardize these covariates all together and decide about the variables strength. Tips using command for stata user might be very helpful. Thank you so much for your response in advance.

Bunga Aisyah says

this content is so helpful for me, thankyou very much. But, do you mind if i ask you for a text book that related with your explanation?

Jim Frost says

Hi Bunga, I regularly use Applied Linear Models by Neter et al. You can find this book listed on my recommendations page. While all of this content should be in that book, it’s not compiled into as neat a package as this blog post!

Claudia says

who was first….?

Jim Frost says

Yes, it was me in both cases. Thanks for writing! 🙂

jeff tennis says

Terrific article, this is exactly what I needed. This is a naive question (still new to predictive modeling), but when you say “standardize”, does that mean if I standardize all continuous variables I can compare them? If I create a linear model predicting home value based on square footage and age, then standardize both the square footage and age, could I then compare their model coefficients?

Taking it a step further, if square footage has a standardized coefficient of 2 and age has a standardized coefficient of 1, can I conclude square footage is 2x more important than age in predicting home value? I appreciate your help.

Jim Frost says

Hi Jeff, I’m glad the article was helpful!

Yes, it’s basically as you describe it. Standardize each continuous IV by subtracting its mean and dividing by the standard deviation for all observed values. Fit the model using the standardized variables and you obtain the standardized coefficients. Of course, many statistical packages will do this for you automatically and you won’t have to perform those steps.

When you’re working with these standardized coefficients, the coefficient represents the mean change in the DV given a one standard deviation change in the IV. The standard deviations of the IVs become the common scale. Of course, a 1 SD change in one IV equates to a different sized change in absolute terms compared to a 1 SD change in another IV. But, it puts them on a common scale to make them comparable.

It’s important to remember that there are a ton of caveats for all of this that I describe. Your interpretation in your example is one possible interpretation if you decide that standardized coefficients are meaningful for your study. But, flip the coefficients for the sake of argument. Suppose the age coefficient is 2 and square footage is 1. Further suppose you are looking at what a home owner can do to increase the value. In that scenario, even though age has a coefficient twice as large, the owner cannot change the age but can change the square footage. So, square footage is more important despite the smaller standardized coefficient!

Just be sure that you fully understand what most important means for your specific analysis.

Paul Yindeemark says

Thank you so much. This whole time I thought the significance of p-values determined which independent variables were most related.

Jim Frost says

Hi Paul, you’re very welcome! I think that’s a common misunderstanding. After all, we use p-values to determine which variables are statistically significant in the first place. Unfortunately, it doesn’t quite work that way!

Patrik Silva says

Hi Jim, fantastic post!

Which software do you normally use to produce results used in this blog?

I do not think that all software has these options!

Thank You!!!

Jim Frost says

Hi Patrik,

Thanks so much! I’m glad you’ve been enjoying them! I use Minitab statistical software in these posts. However, I think most functions are available in other software. Specifically for this post, I believe you can do all of this in both R and SPSS. However, they might have different terminology. For example, I believe SPSS refers to standardized coefficients as beta (which doesn’t make sense).

santosh says

Thanks!

sachin says

nice explanation…

Jim Frost says

Thank you!