You’ve settled on a regression model that contains independent variables that are statistically significant. By interpreting the statistical results, you can understand how changes in the independent variables are related to shifts in the dependent variable. At this point, it’s natural to wonder, “Which independent variable is the most important?”
Surprisingly, determining which variable is the most important is more complicated than it first appears. For a start, you need to define what you mean by “most important.” The definition should include details about your subject area and your goals for the regression model. So, there is no one-size-fits-all definition for the most important independent variable. Furthermore, the methods you use to collect and measure your data can affect the apparent importance of the independent variables.
In this blog post, I’ll help you determine which independent variable is the most important while keeping these issues in mind. First, I’ll reveal surprising statistics that are not related to importance. You don’t want to get tripped up by them! Then, I’ll cover statistical and non-statistical approaches for identifying the most important independent variables in your regression model. I’ll also include an example regression model where we’ll try these methods out.
Related post: When Should I Use Regression Analysis?
Do Not Associate Regular Regression Coefficients with the Importance of Independent Variables
The regular regression coefficients that you see in your statistical output describe the relationship between the independent variables and the dependent variable. The coefficient value represents the mean change of the dependent variable given a one-unit shift in an independent variable. Consequently, you might think you can use the absolute sizes of the coefficients to identify the most important variable. After all, a larger coefficient signifies a greater change in the mean of the dependent variable.
However, the independent variables can have dramatically different types of units, which makes comparing the coefficients meaningless. For example, the meaning of a one-unit change differs considerably when your variables measure time, pressure, and temperature.
Additionally, a single type of measurement can use different units. For example, you can measure weight in grams or kilograms. If you fit two regression models using the same dataset, but use grams in one model and kilograms in the other, the weight coefficient changes by a factor of a thousand! Obviously, the importance of weight did not change at all even though the coefficient changed substantially. The model’s goodness-of-fit remains the same.
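To see the grams-versus-kilograms effect for yourself, here is a minimal sketch using NumPy. The data are synthetic and the names (`weight_kg`, `fit_simple`) are my own assumptions for illustration, not part of any dataset from this post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration only).
n = 200
weight_kg = rng.uniform(1.0, 5.0, size=n)
y = 3.0 + 2.0 * weight_kg + rng.normal(0.0, 1.0, size=n)


def fit_simple(x, y):
    """Return (slope, r_squared) for the least-squares line y = a + b*x."""
    design = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta[1], r2


# Same data, two unit choices for the same predictor.
slope_kg, r2_kg = fit_simple(weight_kg, y)
slope_g, r2_g = fit_simple(weight_kg * 1000.0, y)

print(f"slope per kg: {slope_kg:.4f}   R-squared: {r2_kg:.4f}")
print(f"slope per g:  {slope_g:.7f}   R-squared: {r2_g:.4f}")
print(f"ratio of slopes: {slope_kg / slope_g:.1f}")
```

The two slopes differ by the unit conversion factor, while R-squared is identical in both fits.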
Key point: Larger coefficients don’t necessarily represent more important independent variables.
Do Not Link P-values to Importance
You can’t use the coefficient to determine the importance of an independent variable, but how about the variable’s p-value? Comparing p-values seems to make sense because we use them to determine which variables to include in the model. Do lower p-values represent more important variables?
Calculations for p-values include various properties of the variable, but importance is not one of them. A very small p-value does not indicate that the variable is important in a practical sense. A coefficient can have a tiny p-value when its estimate is very precise, which happens when the data have low variability or the sample size is large. The result is that effect sizes that are trivial in the practical sense can still have very low p-values. Consequently, when assessing statistical results, it’s important to determine whether an effect size is practically significant in addition to being statistically significant.
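Here is a rough sketch of how a trivial effect can still earn a tiny p-value. It assumes synthetic data with a very small true slope and a very large sample, and it uses a normal approximation to the t distribution, which is reasonable only because the sample is so large.

```python
import math

import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumption): a real but tiny effect, huge sample.
n = 1_000_000
x = rng.normal(0.0, 1.0, size=n)
y = 0.01 * x + rng.normal(0.0, 1.0, size=n)

# Simple-regression slope and its standard error.
x_c = x - x.mean()
y_c = y - y.mean()
sxx = x_c @ x_c
slope = (x_c @ y_c) / sxx
resid = y_c - slope * x_c
se = math.sqrt((resid @ resid) / (n - 2) / sxx)

# Two-sided p-value from a normal approximation to the t distribution.
z = slope / se
p_value = math.erfc(abs(z) / math.sqrt(2.0))

# Share of the variance in y that x explains.
r_squared = 1.0 - (resid @ resid) / (y_c @ y_c)

print(f"slope = {slope:.4f}, p-value = {p_value:.3g}, R-squared = {r_squared:.5f}")
```

The p-value comes out extremely small even though the variable explains almost none of the variation in the dependent variable.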
Key point: Low p-values don’t necessarily represent independent variables that are practically important.
Do Assess These Statistics to Identify Variables That Might Be Important
I showed how you can’t use several of the more notable statistics to determine which independent variables are most important in a regression model. The good news is that there are several statistics that you can use. Unfortunately, they sometimes disagree because each one defines “most important” differently.
Standardized coefficients
As I explained previously, you can’t compare the regular regression coefficients because they use different scales. However, standardized coefficients all use the same scale, which means you can compare them.
Statistical software calculates standardized regression coefficients by first standardizing the observed values of each independent variable and then fitting the model using the standardized independent variables. Standardization involves subtracting the variable’s mean from each observed value and then dividing by the variable’s standard deviation.
Once the model is fit with the standardized independent variables, you can compare the standardized coefficients directly. Each one signifies the mean change of the dependent variable given a one-standard-deviation shift in an independent variable.
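The standardization steps described above can be sketched in a few lines of NumPy. The data are synthetic, and the variable names, scales, and true coefficients are assumptions I chose so that the raw and standardized rankings disagree. They are not the values in the example dataset.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic predictors on very different scales (assumed for illustration).
n = 500
names = ["Temperature", "Pressure", "Time"]
X = np.column_stack([
    rng.normal(150.0, 15.0, size=n),  # Temperature
    rng.normal(30.0, 3.0, size=n),    # Pressure
    rng.normal(60.0, 20.0, size=n),   # Time
])
y = 0.5 * X[:, 0] + 1.2 * X[:, 1] + 0.05 * X[:, 2] + rng.normal(0.0, 2.0, size=n)


def ols_slopes(X, y):
    """Least-squares slopes (intercept fitted but not returned)."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]


# Regular coefficients, in each variable's own units.
raw = ols_slopes(X, y)

# Standardize each predictor: subtract its mean, divide by its std dev.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
std = ols_slopes(Z, y)

for name, b_raw, b_std in zip(names, raw, std):
    print(f"{name:12s} raw = {b_raw:7.3f}   standardized = {b_std:7.3f}")

largest_raw = names[int(np.argmax(np.abs(raw)))]
largest_std = names[int(np.argmax(np.abs(std)))]
print("largest raw coefficient:", largest_raw)
print("largest standardized coefficient:", largest_std)
```

Because only the independent variables are standardized here, each standardized coefficient equals the regular coefficient multiplied by that variable’s sample standard deviation.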
Key point: Identify the independent variable that has the largest absolute value for its standardized coefficient.
Change in R-squared for the last variable added to the model
Many statistical software packages include a very helpful analysis. They can calculate the increase in R-squared when each variable is added to a model that already contains all of the other variables. In other words, how much does the R-squared increase for each variable when you add it to the model last?
This analysis might not sound like much, but there’s more to it than is readily apparent. When an independent variable is the last one entered into the model, the associated change in R-squared represents the improvement in the goodness-of-fit that is due solely to that last variable after all of the other variables have been accounted for. In other words, it represents the unique portion of the goodness-of-fit that is attributable only to that independent variable.
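If your software doesn’t report this statistic, you can approximate the idea by fitting the full model and then refitting it once per variable with that variable left out. The sketch below does this with NumPy on synthetic data. The variable names and true coefficients are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data (assumed for illustration).
n = 500
names = ["Temperature", "Pressure", "Time"]
X = np.column_stack([
    rng.normal(150.0, 15.0, size=n),
    rng.normal(30.0, 3.0, size=n),
    rng.normal(60.0, 20.0, size=n),
])
y = 0.5 * X[:, 0] + 1.2 * X[:, 1] + 0.05 * X[:, 2] + rng.normal(0.0, 2.0, size=n)


def r_squared(X, y):
    """R-squared of an OLS fit with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


full = r_squared(X, y)

# Increase in R-squared when each variable is added last:
# full-model R-squared minus R-squared of the model without that variable.
delta = {}
for j, name in enumerate(names):
    reduced = np.delete(X, j, axis=1)
    delta[name] = full - r_squared(reduced, y)

for name, d in delta.items():
    print(f"{name:12s} increase in R-squared when added last = {d:.4f}")
print("largest unique contribution:", max(delta, key=delta.get))
```

When the independent variables are correlated with each other, these increases generally won’t sum to the full-model R-squared, because shared explanatory power isn’t credited to any single variable.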
Key point: Identify the independent variable that produces the largest R-squared increase when it is the last variable added to the model.
Example of Identifying the Most Important Independent Variables in a Regression Model
The example output below shows a regression model that has three independent variables. You can download the CSV data file to try it yourself: ImportantVariables.
The statistical output displays the coded coefficients, which are the standardized coefficients. Temperature has the standardized coefficient with the largest absolute value. This measure suggests that Temperature is the most important independent variable in the regression model.
The graphical output below shows the incremental impact of each independent variable. This graph displays the increase in R-squared associated with each variable when it is added to the model last. Temperature uniquely accounts for the largest proportion of the variance. For our example, both statistics suggest that Temperature is the most important variable in the regression model.
Cautions for Using Statistics to Pinpoint Important Variables
Standardized coefficients and the change in R-squared when a variable is added to the model last can both help identify the more important independent variables in a regression model—from a purely statistical standpoint. Unfortunately, these statistics can’t determine the practical importance of the variables. For that, you’ll need to use your knowledge of the subject area.
The manner in which you obtain and measure your sample can bias these statistics and throw off your assessment of importance.
When you collect a random sample, you can expect the sample variability of the independent variable values to reflect the variability in the population. Consequently, the change in R-squared values and standardized coefficients should reflect the correct population values.
However, if the sample contains a restricted range (less variability) for a variable, both statistics tend to underestimate the importance of that variable. Conversely, if the variability of the sample is greater than the population variability, the statistics tend to overestimate the importance of that variable.
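Here is a small sketch of the restricted-range problem. It assumes synthetic data with two equally strong predictors, `x1` and `x2` (hypothetical names), and then keeps only the rows where `x1` falls in a narrow band.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic sample where both predictors matter equally (assumption).
n = 5000
x1 = rng.normal(0.0, 1.0, size=n)
x2 = rng.normal(0.0, 1.0, size=n)
y = 1.0 * x1 + 1.0 * x2 + rng.normal(0.0, 0.5, size=n)
X = np.column_stack([x1, x2])


def slopes(X, y, standardize):
    """OLS slopes, optionally after z-scoring the predictors."""
    if standardize:
        X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]


# Restricted sample: only observations with x1 in a narrow range.
keep = np.abs(x1) < 0.5

raw_full = slopes(X, y, standardize=False)
raw_restricted = slopes(X[keep], y[keep], standardize=False)
std_full = slopes(X, y, standardize=True)
std_restricted = slopes(X[keep], y[keep], standardize=True)

print("raw slopes, full sample:         ", np.round(raw_full, 3))
print("raw slopes, restricted x1:       ", np.round(raw_restricted, 3))
print("standardized, full sample:       ", np.round(std_full, 3))
print("standardized, restricted x1:     ", np.round(std_restricted, 3))
```

The regular coefficient for `x1` stays about the same, but its standardized coefficient shrinks because the sample standard deviation of `x1` is now much smaller.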
Also, consider the quality of measurements for your independent variables. If the measurement precision for a particular variable is relatively low, that variable can appear to be less predictive than it truly is.
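To illustrate, the sketch below compares a predictor measured precisely with the same predictor measured with added random error. The data are synthetic, and the amount of measurement error is an assumption I picked to make the effect easy to see.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic data (assumption): y depends on the true value of x.
n = 5000
x_true = rng.normal(0.0, 1.0, size=n)
y = 2.0 * x_true + rng.normal(0.0, 1.0, size=n)

# The same variable recorded with low measurement precision.
x_noisy = x_true + rng.normal(0.0, 1.0, size=n)


def slope_and_r2(x, y):
    """Slope and R-squared for the least-squares line y = a + b*x."""
    design = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta[1], r2


slope_true, r2_true = slope_and_r2(x_true, y)
slope_noisy, r2_noisy = slope_and_r2(x_noisy, y)

print(f"precise measurement: slope = {slope_true:.3f}, R-squared = {r2_true:.3f}")
print(f"noisy measurement:   slope = {slope_noisy:.3f}, R-squared = {r2_noisy:.3f}")
```

With the noisy measurements, both the coefficient and the R-squared are pulled toward zero, so the variable looks less predictive even though the underlying relationship hasn’t changed.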
When the goal of your analysis is to change the mean of the dependent variable, you must be sure that the relationships between the independent variables and the dependent variable are causal rather than merely correlational. If these relationships are not causal, then intentional changes in the independent variables won’t cause the desired changes in the dependent variable despite any statistical measures of importance.
Typically, you need to perform a randomized experiment to determine whether the relationships are causal.
Non-Statistical Issues to Help Find Important Variables
The definition of “most important” should depend on your goals and the subject area. Practical issues can influence which variable you consider to be the most important.
For instance, when you want to affect the value of the dependent variable by changing the independent variables, use your knowledge to identify the variables that are easiest to change. Some variables can be difficult, expensive, or even impossible to change.
“Most important” is a subjective, context-sensitive quality. Statistics can highlight candidate variables, but you still need to apply your subject-area expertise.
If you’re learning regression, check out my Regression Tutorial!
Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.