Automatic variable selection procedures are algorithms that pick the variables to include in your regression model. Stepwise regression and Best Subsets regression are two of the more common variable selection methods. In this post, I compare how these methods work and which one provides better results.
These automatic procedures can be helpful when you have many independent variables and you need some help in the investigative stages of the variable selection process. You could specify many models with different combinations of independent variables, or you could have your statistical software do this for you.
These procedures are especially useful when theory and experience provide only a vague sense of which variables you should include in the model. However, if theory and expertise are strong guides, it’s generally better to follow them than to use an automated procedure. Additionally, if you use one of these procedures, you should consider it as only the first step of the model selection process.
Here are my objectives for this blog post. I will:
- Show how stepwise regression and best subsets regression work differently.
- Use both procedures on one example dataset to compare their results.
- Explore whether one procedure is better.
- Examine the factors that affect a method’s ability to choose the correct model.
Related post: Model Specification: Choosing the Correct Regression Model
How Stepwise Regression Works
As the name stepwise regression suggests, this procedure selects variables in a step-by-step manner. The procedure adds or removes independent variables one at a time using the variable’s statistical significance. Stepwise either adds the most significant variable or removes the least significant variable. It does not consider all possible models, and it produces a single regression model when the algorithm ends.
Typically, you can control the specifics of the stepwise procedure. For example, you can specify whether it can only add variables, only remove variables, or both. You can also set the significance level for including and excluding the independent variables.
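To make the mechanics concrete, here is a minimal sketch of a bidirectional stepwise procedure in Python. This is not the algorithm from any particular statistical package; the function name, the use of statsmodels, and the default 0.15 thresholds are illustrative assumptions.

```python
# Minimal sketch of bidirectional (add/remove) stepwise selection by p-value.
# Assumes a pandas DataFrame `df` with the dependent variable in `y_col`.
# The thresholds mirror the alpha-to-enter / alpha-to-remove settings
# described above. (A production version would also guard against cycling.)
import statsmodels.api as sm

def stepwise_select(df, y_col, alpha_enter=0.15, alpha_remove=0.15):
    candidates = [c for c in df.columns if c != y_col]
    selected = []
    while True:
        changed = False
        # Forward step: try adding the most significant remaining variable.
        remaining = [c for c in candidates if c not in selected]
        pvals = {}
        for var in remaining:
            X = sm.add_constant(df[selected + [var]])
            pvals[var] = sm.OLS(df[y_col], X).fit().pvalues[var]
        if pvals:
            best, p = min(pvals.items(), key=lambda kv: kv[1])
            if p < alpha_enter:
                selected.append(best)
                changed = True
        # Backward step: drop the least significant selected variable, if any.
        if selected:
            X = sm.add_constant(df[selected])
            p = sm.OLS(df[y_col], X).fit().pvalues.drop("const")
            worst = p.idxmax()
            if p[worst] > alpha_remove:
                selected.remove(worst)
                changed = True
        if not changed:
            return selected
```

Calling something like `stepwise_select(df, "Strength")` on a suitable DataFrame would return the list of selected variable names, one add-or-remove decision per pass.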
How Best Subsets Regression Works
Best subsets regression is also known as “all possible regressions” and “all possible models.” Again, the name of the procedure indicates how it works. Unlike stepwise, best subsets regression fits all possible models based on the independent variables that you specify.
The number of models that this procedure fits grows quickly. Best subsets regression fits 2^P models, where P is the number of candidate predictors. If you have 10 independent variables, it fits 2^10 = 1,024 models. However, if you have 20 variables, it fits 2^20 = 1,048,576 models!
After fitting all of the models, best subsets regression then displays the best fitting models with one independent variable, two variables, three variables, and so on. Usually, either adjusted R-squared or Mallows’ Cp is the criterion for picking the best fitting models for this process.
The procedure displays the best fitting models of different sizes up to the full model. You need to compare the models to determine which one is the best. In some cases, it is not clear which model is the best, and you’ll need to use your judgment.
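For comparison, here is a sketch of the best subsets idea: enumerate every non-empty combination of the candidate predictors and keep the best-fitting model at each size. Ranking by adjusted R-squared and all names here are assumptions; real packages typically also report Mallows' Cp and other statistics.

```python
# Sketch of best subsets regression: fit every non-empty combination of
# predictors and keep the best model (by adjusted R-squared) at each size.
from itertools import combinations
import statsmodels.api as sm

def best_subsets(df, y_col):
    predictors = [c for c in df.columns if c != y_col]
    best = {}  # model size -> (adjusted R-squared, tuple of variables)
    for k in range(1, len(predictors) + 1):
        for combo in combinations(predictors, k):
            X = sm.add_constant(df[list(combo)])
            fit = sm.OLS(df[y_col], X).fit()
            if k not in best or fit.rsquared_adj > best[k][0]:
                best[k] = (fit.rsquared_adj, combo)
    return best
```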
Comparison of Stepwise to Best Subsets Regression
While both automatic variable selection procedures assess the set of independent variables that you specify, the end results can be different. Stepwise regression does not fit all models but instead assesses the statistical significance of the variables one at a time and arrives at a single model. Best subsets regression fits all possible models and displays some of the best candidates based on adjusted R-squared or Mallows’ Cp.
The single model that stepwise regression produces can be simpler for the analyst. However, best subsets regression presents more information that is potentially valuable.
Enough talk about how these procedures work. Let’s see them in action!
Example Using Stepwise and Best Subsets on the Same Dataset
Our example scenario models a manufacturing process. We’ll determine whether the production conditions are related to the strength of a product. If you want to try this yourself, you can download the CSV data file: ProductStrength.
For both variable selection procedures, we’ll use the same independent and dependent variables.
Dependent variable: Strength
Independent variables: Temperature, Pressure, Rate, Concentration, Time
Example of Stepwise Regression
Let’s use stepwise regression to pick the variables for our model. I’ll use the stepwise method that allows the procedure to both add and remove independent variables as needed. The output below shows the steps up to the fourth and final step.
For our example data, the stepwise procedure added a variable in each step. The process stopped when there were no variables it could add or remove from the model. The final column displays the model that the procedure produced.
The four independent variables in our model are Concentration, Rate, Pressure, and Temperature. This model has an R-squared of 89.09% and the highest adjusted R-squared. You also want Mallows’ Cp to be close to the number of independent variables plus one (for the constant). Mallows’ Cp for the final model is closer to the ideal value than the other models. It all looks good!
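As an aside, a common textbook formulation of Mallows' Cp is sketched below. It compares a candidate model's error to the full model's error variance, so values near p (the number of terms, including the constant) suggest little bias. This is a generic formula, not the output of any particular package.

```python
# A common formulation of Mallows' Cp for a candidate model with p terms
# (including the constant), fit to n observations:
#   Cp = SSE_candidate / MSE_full - (n - 2 * p)
# Values near p suggest the candidate model is roughly unbiased relative
# to the full model that contains every candidate predictor.
def mallows_cp(sse_candidate, mse_full, n, p):
    return sse_candidate / mse_full - (n - 2 * p)
```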
Example of Best Subsets Regression
Next, I’ll perform best subsets regression on the same dataset.
The best subsets procedure fits all possible models using our five independent variables. That means it fit 2^5 = 32 models. Each horizontal line represents a different model. By default, this statistical software package displays the top two models for each number of independent variables that are in the model. X’s indicate the independent variables that are in each model.
Below are the results.
We’re looking for a model that has a high adjusted R-squared, a small standard error of the regression, and a Mallows’ Cp close to the number of variables plus one.
The model I circled is the one that the stepwise method produced. Based on the goodness-of-fit measures, this model appears to be a good candidate. However, the best subsets regression results provide a larger context that might help us make a choice using our subject-area knowledge and goals.
This type of evaluation helps us find a parsimonious model—one that is relatively simple but effective. Learn more about Parsimonious Models: Benefits and Selecting.
Using Best Subsets Regression in Conjunction with Our Requirements
We might have specific priorities that affect our choice for the best model.
For instance, if our top priorities are to simplify and reduce the costs of data collection, we might be interested in the models with fewer independent variables that fit the data nearly as well. The first model listed with three variables has an adjusted R-squared that is only 1.4 percentage points less than the circled model. In fact, the best two-variable model is not far behind.
On the other hand, if using the model to make accurate predictions is our top priority, we might be interested in the model with all five independent variables. Almost all of the goodness-of-fit measures are marginally better for the full model compared to the best model with four variables. However, the predicted R-squared for the full model declined slightly compared to the model with four variables.
Often, predicted R-squared starts to decline when the model becomes too complex and begins to fit the noise in the data. Sometimes simpler models can produce more precise predictions. For the most predictive model, we might use the best two-variable model because it has the highest predicted R-squared.
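Predicted R-squared is usually computed from the PRESS statistic, which for linear regression is equivalent to leave-one-out cross-validation. Here is a sketch that assumes a fitted statsmodels OLS result:

```python
# Sketch: predicted R-squared via the PRESS statistic for a fitted
# statsmodels OLS result. Each residual is inflated by its leverage,
# which reproduces leave-one-out prediction errors for linear regression.
import numpy as np

def predicted_r_squared(fit):
    leverage = fit.get_influence().hat_matrix_diag
    press = np.sum((fit.resid / (1 - leverage)) ** 2)
    sst = np.sum((fit.model.endog - fit.model.endog.mean()) ** 2)
    return 1 - press / sst
```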
I value this extra information that best subsets regression provides. While this procedure requires more knowledge and effort to sort through the multiple models, it helps us choose the best model using our requirements. However, this method also fits many more models than stepwise regression, which is a form of data mining and increases the risk of finding chance correlations.
Assess Your Candidate Regression Models Thoroughly
If you use stepwise regression or best subsets regression to help pick your model, you need to investigate the candidates thoroughly. That entails fitting the candidate models the normal way and checking the residual plots to be sure the fit is unbiased. You also need to assess the signs and values of the regression coefficients to be sure that they make sense. These automatic model selection procedures can find chance correlations in the sample data and produce models that don’t make sense in the real world.
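As a sketch of that workflow, the snippet below refits the four-variable candidate from the example and plots residuals against fitted values; the DataFrame `df` and the column names are assumptions carried over from the example dataset above.

```python
# Quick residual check for a candidate model: residuals vs. fitted values
# should show no pattern if the fit is unbiased.
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

model = smf.ols("Strength ~ Concentration + Rate + Pressure + Temperature",
                data=df).fit()
print(model.summary())          # check coefficient signs and magnitudes

plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```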
Automatic variable selection procedures can be helpful tools, particularly in the exploratory stage. However, you can’t expect an automated algorithm to understand the subject area better than you! Be aware of the following potential problems.
- These procedures can sift through many different models and find correlations that exist by chance in the sample. Assess the results critically and use your expertise to determine whether they make sense.
- These procedures cannot take real-world knowledge into account. The model may not be right in a practical sense.
- Stepwise regression does not always choose the model with the largest R-squared value.
We saw how stepwise and best subsets regression compare. At this point, there is a logical question. Does one of these procedures work better? Read on!
Which is Better, Stepwise Regression or Best Subsets Regression?
Which automatic variable selection procedure works better? Olejnik, Mills, and Keselman* performed a simulation study to compare how frequently stepwise regression and best subsets regression choose the correct model. The authors included 32 conditions in their study that differ by the number of candidate variables, the number of correct variables, the sample size, and the amount of multicollinearity. For each condition, a computer generated 1,000 datasets. The authors analyzed each dataset using both stepwise and best subsets regression. For best subsets regression, they compared the effectiveness of using the lowest Mallows’ Cp to using the highest adjusted R-squared.
Drum roll, please!
The winner is … stepwise regression!
However, it is a very close competition. Overall, stepwise regression beats best subsets regression using the lowest Mallows’ Cp by less than 3%. Best subsets regression using the highest adjusted R-squared approach is the clear loser here.
However, there is a big warning to reveal.
Stepwise regression does not usually pick the correct model!
How Accurate is Stepwise Regression?
Let’s take a closer look at the results. I’m going to cover only the stepwise results. However, best subsets regression using the lowest Mallows’ Cp follows the same patterns and is virtually tied.
First, let’s define some terms in this study.
- Authentic variables are the independent variables that truly have a relationship with the dependent variable.
- Noise variables are independent variables that do not have an actual relationship with the dependent variable.
- The correct model includes all of the authentic variables and excludes all of the noise variables.
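To see how such a simulation can work in principle, here is a simplified sketch, not the authors' exact design: generate datasets where only the first few predictors are authentic, run the stepwise sketch from earlier in this post, and count how often it recovers exactly the authentic set.

```python
# Simplified simulation in the spirit of the study (not the authors' exact
# design): only the first `n_authentic` predictors truly affect y; the rest
# are noise variables. Count how often stepwise recovers exactly that set.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def pct_correct(n_candidates=4, n_authentic=2, n_obs=110, n_datasets=200):
    hits = 0
    for _ in range(n_datasets):
        X = rng.normal(size=(n_obs, n_candidates))
        y = X[:, :n_authentic].sum(axis=1) + rng.normal(size=n_obs)
        df = pd.DataFrame(X, columns=[f"x{i}" for i in range(n_candidates)])
        df["y"] = y
        chosen = set(stepwise_select(df, "y"))   # sketch defined earlier
        if chosen == {f"x{i}" for i in range(n_authentic)}:
            hits += 1
    return 100 * hits / n_datasets
```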
Let’s explore the accuracy of stepwise regression in picking the correct model, and the conditions that affect its accuracy.
When stepwise regression is most accurate
Let’s start by looking at the best case scenario for the stepwise procedure. In the study, this procedure is most capable when there are four candidate variables, three of the variables are authentic, there is no multicollinearity, and there is an extra-large sample size of 500 observations. This sample size is larger than the number of observations that most studies will collect if they are considering only four candidate variables.
In this scenario, stepwise regression chooses the correct model 84% of the time. The bad news is that this scenario is not realistic for most studies, and the accuracy drops from here.
The role of the number of candidate variables and authentic variables in stepwise regression accuracy
The study assesses conditions with either 4 or 8 independent variables (IVs) that are candidates. When there are more variables to evaluate, it is harder for stepwise regression to identify the correct model. This pattern also applies to the number of authentic independent variables.
The table below illustrates this pattern for scenarios with no multicollinearity and a good sample size (100-120). The percentage correct decreases as the number of candidate variables and authentic variables increase. Notice how most scenarios produce the correct model less than half the time!
| Candidate IVs | Authentic IVs | % Correct model |
|---------------|---------------|-----------------|
| 4             | 1             | 62.7            |
| 4             | 2             | 54.3            |
| 4             | 3             | 34.4            |
| 8             | 2             | 31.3            |
| 8             | 4             | 12.7            |
| 8             | 6             | 1.1             |
The role of multicollinearity in stepwise regression accuracy
The study also assesses the role that multicollinearity plays in the ability of stepwise regression to choose the correct model. When independent variables are correlated, it’s harder to isolate the individual effect of each variable. This difficulty occurs regardless of whether it is a human or a computer algorithm trying to identify the correct model.
The table below illustrates how the percentage correct varies by the amount of correlation and the number of variables. The results are based on a good sample size (100-120). As the correlation increases, the percentage correct decreases.
| Candidate IVs | Authentic IVs | Correlation | % Correct model |
|---------------|---------------|-------------|-----------------|
| 4             | 2             | 0.0         | 54.3            |
| 4             | 2             | 0.2         | 43.1            |
| 4             | 2             | 0.6         | 15.7            |
| 8             | 4             | 0.0         | 12.7            |
| 8             | 4             | 0.2         | 1.0             |
| 8             | 4             | 0.6         | 0.4             |
Related post: Multicollinearity in Regression Analysis: Problems, Detection, and Solutions
The role of sample size in stepwise regression accuracy
The study assesses two sample sizes to determine how sample size affects the ability of stepwise regression to choose the correct model. The smaller sample size is based on the number of observations necessary to obtain 0.80 statistical power, which works out to 100-120 observations. This approach is consistent with best practices, and I’ve referred to this size as a “good sample size” previously.
The study also uses a very large sample size, which is five times the size of the good sample size.
The table below shows that a very large sample size improves the capability of stepwise regression to choose the correct model. Collecting a very large sample size might be more expensive, but it dramatically improves the variable selection process.
| Candidate IVs | Authentic IVs | Correlation | % Correct – good sample size | % Correct – very large sample |
|---------------|---------------|-------------|------------------------------|-------------------------------|
| 4             | 2             | 0.0         | 54.3                         | 72.1                          |
| 4             | 2             | 0.2         | 43.1                         | 72.9                          |
| 4             | 2             | 0.6         | 15.7                         | 69.2                          |
| 8             | 4             | 0.0         | 12.7                         | 53.9                          |
| 8             | 4             | 0.2         | 1.0                          | 39.5                          |
| 8             | 4             | 0.6         | 0.4                          | 1.8                           |
Closing Thoughts on Choosing the Correct Model
Stepwise regression and best subsets regression don’t usually pick the correct model. This finding holds even with the relatively low number of candidate independent variables that the simulation study assesses. In actual studies, it would not be surprising if researchers need to assess many more variables, which would further reduce the percentage. In fact, unlike the simulation study, you can’t even be sure that you are assessing all of the authentic variables in a real-world experiment!
Given these findings, you might be asking, “Are stepwise regression and best subsets regression (using the lowest Mallows’ Cp) useful tools?”
I think they provide value during the very early, investigative stages of a study, particularly when theory doesn’t provide much guidance. However, you must rigorously assess the candidate models to see if they make sense. Further, it is important to understand that stepwise regression usually only gets you close to the correct model, but not all of the way there.
In that sense, I think stepwise regression provides some benefits. It can help you get to the right ballpark and provide a glimpse of the relationships in your data.
However, reality is complicated, and we are trying to model it with a sample. Choosing the correct model can be difficult even when researchers are armed with extensive subject-area knowledge. It is unreasonable to expect an automatic variable selection procedure to figure it out. Stepwise regression follows simple rules to pick the variables and does not know anything about the study area.
It’s up to you to go from the rough notion to the correct model. To do this, you need to use your expertise, theory, and common sense rather than depending on only simple variable selection rules. For more information about successful regression modeling, read my post: Five Regression Analysis Tips to Avoid Common Mistakes.
If you’re learning regression, check out my Regression Tutorial!
Reference
*Stephen Olejnik, Jamie Mills, and Harvey Keselman, “Using Wherry’s Adjusted R2 and Mallows’ Cp for Model Selection from All Possible Regressions”, The Journal of Experimental Education, 2000, 68(4), 365-380.
Hi Jim, I’m a student picking up regression analytics, and I would like to understand what exactly R and Significance F represent. I know that the p-value must not be higher than 0.15. I’m trying to do a predictive analysis of how a potential employee could perform on a rating scale using other variables such as Critical Thinking Score and Interview Rating, but I can’t seem to derive a probable p-value.
For example:

| Avg perf grade (past 3 yrs) | Degree level | Critical Thinking Score | Interview Rating |
|-----------------------------|--------------|-------------------------|------------------|
| 3.4                         | 1            | 79                      | 70               |
| 3.6                         | 1            | 79                      | 86               |
| 4                           | 1            | 82                      | 66               |
| 4.9                         | 2            | 83                      | 76               |
Hi Denise,
I don’t know what you mean by “deriving a probable p-value,” so I’m not sure how to answer your question. Here are some topics that might help you out.
Stepwise regression will produce p-values for all variables and an R-squared. Click those links to learn more about those concepts and how to interpret them. The exact p-value that stepwise regression uses depends on how you set your software. As an exploratory tool, it’s not unusual to use higher significance levels, such as 0.10 or 0.15. Stepwise helps you identify candidate variables but, as I write in this article, don’t expect it to get you to the exactly correct answer.
Also, I’ve written a post about using regression to make predictions. That will provide some useful information about prediction considerations and how to assess the predictions themselves.
Jim,
That is quite a bit for me to read and think about this weekend. I ran several models today. By considering the variables in blocks, I was able to determine that omitting Whites and Others greatly reduced collinearity, which made almost all my variables significant (p < .05). The hierarchical method reduced R-squared at first, but my final model produced an R-squared equal to (I think slightly higher than) the R-squared for simultaneous entry. I do not know if I will get the same results with simultaneous entry (omitting the same variables), but I will certainly take a look tomorrow.
Jim,
I need to run a regression analysis with one categorical variable, two continuous variables, and two groups of interrelated continuous variables. My variable of interest will likely be much less significant than some of the others. I would like to use hierarchical regression entry to enter the variable of interest first to capture any variance that it shares with other variables. Then I would like to enter the variables and groups (block entry) in order of their significance. I plan to use an a priori backwards stepwise analysis to determine the order that I will then use for my hierarchical entry because I have little theory or previous research on which to base my model. Does this make sense?
Troy
Hi Troy,
I see what you’re trying to do but I don’t think that’s a good approach. But first some background knowledge.
What you’re talking about is really the difference between adjusted sums of squares (Adj SS) and sequential sums of squares (Seq SS). The default choice for all software that I’m familiar with is to use Adj SS. Using Adj SS, the order that you enter the variables into the model does not matter. More specifically, the results in your output for each IV are based on that IV being entered into the model last. In other words, all the other IVs are in the model and then the software adds the IV. That’s the standard approach because it tells you the unique portion of the regression SS that each IV accounts for.
What you’re talking about is switching to Seq SS and then intentionally entering your variables of interest first to, as you say, capture the SS that they share with the other IVs. It is possible to fit the model that way as long as you switch to Seq SS. However, it is a totally non-standard approach. Again, you typically want to know the SS that each variable accounts for uniquely. If you go this route, be sure to describe how you performed the analysis and that it is not the unique SS, in order to provide the proper context for interpretation. This approach will make your variables of interest appear more important than they actually are.
As a separate point, I’m not a big fan of hierarchical regression. I think it’s fairly common in the social sciences, so you’re on safe ground in terms of what your colleagues will accept. However, the idea of entering a block of variables (e.g., demographic variables) and then another block of the research variables has problems. I’m sure the researchers think of those groups differently. However, in reality, they’re all just variables that exist simultaneously. The modeling process doesn’t treat them any differently. There is also a potential problem of omitted variable bias. If your variables in the second group correlate with demographic variables (in the first group) and the DV, your model for the first group is biased because it omits important variables. Consequently, there’s no real purpose for fitting that first model. But, I digress!
At any rate, I personally would not use the approach you describe. I would stick with the default modeling process that uses Adj SS. Using this approach, the order in which you enter the variables does not matter, and the SS for each variable is the portion that it uniquely explains.
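For readers who want to see this distinction concretely, statsmodels exposes both through its ANOVA table: typ=1 gives sequential (order-dependent) SS, and typ=2 gives adjusted SS. The variable names and DataFrame below are illustrative.

```python
# Sequential (Type I) vs. adjusted (Type II) sums of squares in statsmodels.
# With typ=1 the SS depend on the order variables appear in the formula;
# with typ=2 each variable's SS is its unique contribution, so order
# does not matter. Assumes a DataFrame `df` with columns y, x1, x2, x3.
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

fit = smf.ols("y ~ x1 + x2 + x3", data=df).fit()
print(anova_lm(fit, typ=1))   # sequential SS: order matters
print(anova_lm(fit, typ=2))   # adjusted SS: order does not matter
```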
I hope this helps!
Could you give an idea of how the forward and backward LR methods work in logistic regression?
Sir, is the number of steps taken in stepwise regression always equal to the number of variables in the result? For example, if the stepwise procedure took 6 steps, will the number of variables also be 6?
Hello Cess,
The stepwise procedure will not always take the same number of steps as the number of independent variables. If you have 6 variables and they’re all statistically significant, the procedure can take 6 steps, adding one variable to the model at each step. However, if not all the variables are significant enough to enter the model, you will have fewer steps. It depends on how significant your variables are.
In fact, it’s possible to have more steps than variables in some cases. If you allow the procedure to both add and remove variables, the total number of steps can exceed the number of variables. However, in my experience, that is very rare.
Typically, you will have fewer steps than independent variables unless all the variables are highly significant, in which case the number of steps equals the number of variables.
I hope this helps!
Very good articles and very good website. The author Jim has written lots of gems
Sir,
Very nice post. Enjoyed it.
Can you please elaborate on the topic of chance correlations?
Thank you, yes.
Hello,
Can you comment on the following from Stack Exchange? In my current exploration using best subsets, I too am finding IVs with low correlations with the DV in the better models, even if the t-statistic is not significant. I have only just begun, so I still have to dig deeper, but I am curious about the general rationale, as per the quote below.
https://stats.stackexchange.com/questions/90711/can-independent-variables-with-low-correlation-with-dependent-variable-be-signif
“With a correlation matrix, you are examining unconditional (crude) associations between your variables. With a regression model, you are examining the joint associations of your IVs with your DVs, thus looking at conditional associations (for each IV, its association with the DV conditional on the other IVs). Depending on the structure of your data, these two can yield very different, even contrary results.”
Hi,
This is a great question. I agree with the quote fully. You can get wonky results when you try to model relationships that involve multiple variables but do so only one at a time, such as when you use correlation with pairs of variables. In other words, the subject matter is too complex for such a simple analysis as correlation. Keep in mind that when you include independent variables in a regression model and you assess the effect of a particular variable, the analysis holds the other variables constant (i.e., controls for them). However, when you assess the correlation, you’re not controlling for any other variables, and this can affect the results.
When you omit important variables from the analysis, it can bias your results. And you’re in luck: just a couple of weeks ago, I wrote a post about confounding variables and omitted variable bias that takes a close look at how this bias occurs. In fact, I use an example where I looked at a model with only one variable that was not significant, but it became significant when I added another variable. That’s very much like your situation, and I discuss why and under what conditions that occurs.
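For anyone who wants to see this effect directly, here is a small synthetic sketch (the coefficients and variable names are made up): a predictor whose crude correlation with the DV is weak becomes clearly significant once a correlated variable is controlled for.

```python
# Synthetic demonstration: x1's pairwise correlation with y is weak, but x1
# becomes clearly significant once x2 is controlled for in the model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
x2 = rng.normal(size=n)
x1 = -0.9 * x2 + rng.normal(scale=0.5, size=n)   # x1 and x2 are correlated
y = x1 + x2 + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

print(df["y"].corr(df["x1"]))                         # weak crude correlation
print(smf.ols("y ~ x1", data=df).fit().pvalues)       # may look unimpressive
print(smf.ols("y ~ x1 + x2", data=df).fit().pvalues)  # x1 clearly significant
```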
I hope this helps!