Automatic variable selection procedures are algorithms that pick the variables to include in your regression model. Stepwise regression and Best Subsets regression are two of the more common variable selection methods. In this post, I compare how these methods work and which one provides better results.

These automatic procedures can be helpful when you have many independent variables and you need some help in the investigative stages of the variable selection process. You could specify many models with different combinations of independent variables, or you can have your statistical software do this for you.

These procedures are especially useful when theory and experience provide only a vague sense of which variables you should include in the model. However, if theory and expertise are strong guides, it’s generally better to follow them than to use an automated procedure. Additionally, if you use one of these procedures, you should consider it as only the first step of the model selection process.

Here are my objectives for this blog post. I will:

- Show how stepwise regression and best subsets regression work differently.
- Use both procedures on one example dataset to compare their results.
- Explore whether one procedure is better.
- Examine the factors that affect a method’s ability to choose the correct model.

**Related post**: Model Specification: Choosing the Correct Regression Model

## How Stepwise Regression Works

As the name stepwise regression suggests, this procedure selects variables in a step-by-step manner. The procedure adds or removes independent variables one at a time using the variable’s statistical significance. Stepwise either adds the most significant variable or removes the least significant variable. It does not consider all possible models, and it produces a single regression model when the algorithm ends.

Typically, you can control the specifics of the stepwise procedure. For example, you can specify whether it can only add variables, only remove variables, or both. You can also set the significance level for including and excluding the independent variables.
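To make the mechanics concrete, here is a minimal sketch of forward stepwise selection with numpy and scipy. It is illustrative only, not the algorithm in any particular statistical package: the entry threshold, data, and variable names are assumptions, and for brevity it only adds variables (a full stepwise procedure would also check whether previously entered variables should be removed).

```python
import numpy as np
from scipy import stats

def pvalues(X, y):
    """OLS coefficient p-values for design matrix X (first column = intercept)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)           # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)      # coefficient covariance matrix
    t = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t), df=n - k)

def forward_stepwise(X, y, names, alpha_enter=0.05):
    """Add the most significant candidate at each step until none qualifies."""
    n = len(y)
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining:
        best_p, best_j = 1.0, None
        for j in remaining:
            design = np.column_stack([np.ones(n), X[:, selected + [j]]])
            p = pvalues(design, y)[-1]         # p-value of the newest variable
            if p < best_p:
                best_p, best_j = p, j
        if best_p >= alpha_enter:              # nothing left that is significant
            break
        selected.append(best_j)
        remaining.remove(best_j)
    return [names[j] for j in selected]
```

Run on simulated data where only the first two predictors truly matter, the procedure should pick them up within the first two steps.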

## How Best Subsets Regression Works

Best subsets regression is also known as “all possible regressions” and “all possible models.” Again, the name of the procedure indicates how it works. Unlike stepwise, best subsets regression fits all possible models based on the independent variables that you specify.

The number of models that this procedure fits grows exponentially. Best subsets regression fits 2^P models, where P is the number of candidate predictors. If you have 10 independent variables, it fits 1,024 models. However, if you have 20 variables, it fits 1,048,576 models!

After fitting all of the models, best subsets regression then displays the best-fitting models with one independent variable, two variables, three variables, and so on. Usually, either adjusted R-squared or Mallows' Cp is the criterion for picking the best-fitting models in this process.

The result is a display of the best-fitting models of different sizes up to the full model. You need to compare the models to determine which one is best. In some cases, it is not clear which model is best, and you'll need to use your judgment.
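The enumeration described above is straightforward to sketch with `itertools.combinations`. This is a simplified illustration, not any package's actual implementation: it ranks the non-empty subsets (2^P − 1 of them; counting the intercept-only model gives the full 2^P) by adjusted R-squared and keeps the top few per size.

```python
import itertools
import numpy as np

def adj_r2(X, y):
    """Adjusted R-squared for an OLS fit (X includes an intercept column)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - (rss / (n - k)) / (tss / (n - 1))

def best_subsets(X, y, names, top=2):
    """Fit every non-empty subset of predictors; return the top models per size."""
    n, p = X.shape
    by_size = {}
    for size in range(1, p + 1):
        fits = []
        for combo in itertools.combinations(range(p), size):
            design = np.column_stack([np.ones(n), X[:, combo]])
            fits.append((adj_r2(design, y), [names[j] for j in combo]))
        fits.sort(reverse=True)                # best adjusted R-squared first
        by_size[size] = fits[:top]
    return by_size
```

With five candidate predictors this loop fits 31 regressions, which mirrors the exponential growth noted above: doubling the candidates squares the work.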

## Comparison of Stepwise to Best Subsets Regression

While both automatic variable selection procedures assess the set of independent variables that you specify, the end results can be different. Stepwise regression does not fit all models but instead assesses the statistical significance of the variables one at a time and arrives at a single model. Best subsets regression fits all possible models and displays some of the best candidates based on adjusted R-squared or Mallows’ Cp.

The single model that stepwise regression produces can be simpler for the analyst. However, best subsets regression presents more information that is potentially valuable.

Enough talk about how these procedures work. Let’s see them in action!

## Example Using Stepwise and Best Subsets on the Same Dataset

Our example scenario models a manufacturing process. We’ll determine whether the production conditions are related to the strength of a product. If you want to try this yourself, you can download the CSV data file: ProductStrength.

For both variable selection procedures, we’ll use the same independent and dependent variables.

**Dependent variable:** Strength

**Independent variables:** Temperature, Pressure, Rate, Concentration, Time

## Example of Stepwise Regression

Let’s use stepwise regression to pick the variables for our model. I’ll use the stepwise method that allows the procedure to both add and remove independent variables as needed. The output below shows the steps up to the fourth and final step.

For our example data, the stepwise procedure added a variable in each step. The process stopped when there were no variables left that it could add to or remove from the model. The final column displays the model that the procedure produced.

The four independent variables in our model are Concentration, Rate, Pressure, and Temperature. This model has an R-squared of 89.09% and the highest adjusted R-squared. You also want Mallows’ Cp to be close to the number of independent variables plus the constant. Mallows’ Cp for the final model is closer to the ideal value than the other models. It all looks good!
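The Mallows' Cp guideline mentioned above can be stated as a formula: Cp = SSE_p / MSE_full − (n − 2p), where p counts the submodel's parameters including the constant, and MSE_full comes from the model with all candidates. For the full model, Cp equals p exactly, which is why a submodel whose Cp lands near its own parameter count looks well specified. A short sketch, using illustrative function names:

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares for an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

def mallows_cp(X_sub, X_full, y):
    """Mallows' Cp for a submodel, judged against the full model's error variance.

    Both design matrices are assumed to include an intercept column.
    """
    n, k_full = X_full.shape
    mse_full = sse(X_full, y) / (n - k_full)   # estimate of the true error variance
    p = X_sub.shape[1]                          # parameters in the submodel
    return sse(X_sub, y) / mse_full - (n - 2 * p)
```

A submodel that drops only noise variables keeps Cp near p; dropping an authentic variable inflates SSE_p and pushes Cp far above it.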

## Example of Best Subsets Regression

Next, I’ll perform best subsets regression on the same dataset.

The best subsets procedure fits all possible models using our five independent variables. That means it fit 2^5 = 32 models. Each horizontal line represents a different model. By default, this statistical software package displays the top two models for each number of independent variables in the model. X's indicate the independent variables that are in each model.

Below are the results.

We're looking for a model that has a high adjusted R-squared, a small standard error of the regression, and a Mallows' Cp close to the number of independent variables plus the constant.

The model I circled is the one that the stepwise method produced. Based on the goodness-of-fit measures, this model appears to be a good candidate. However, the best subsets regression results provide a larger context that might help us make a choice using our subject-area knowledge and goals.

## Using Best Subsets Regression in Conjunction with Our Requirements

We might have specific priorities that affect our choice for the best model.

For instance, if our top priorities are to simplify and reduce the costs of data collection, we might be interested in the models with fewer independent variables that fit the data nearly as well. The first model listed with three variables has an adjusted R-squared that is only 1.4 percentage points less than the circled model. In fact, the best two-variable model is not far behind.

On the other hand, if using the model to make accurate predictions is our top priority, we might be interested in the model with all five independent variables. Almost all of the goodness-of-fit measures are marginally better for the full model compared to the best model with four variables. However, the predicted R-squared for the full model declined slightly compared to the model with four variables.

Often, predicted R-squared starts to decline when the model becomes too complex and begins to fit the noise in the data. Sometimes simpler models can produce more precise predictions. For the most predictive model, we might use the best two-variable model because it has the highest predicted R-squared.
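Predicted R-squared, which drives the reasoning above, is a leave-one-out statistic, but it does not require refitting the model n times. The deleted residual can be computed from a single fit via the hat matrix: e_i / (1 − h_ii). A minimal numpy sketch (the PRESS shortcut is standard; the data in the usage example are invented):

```python
import numpy as np

def predicted_r2(X, y):
    """Predicted (leave-one-out) R-squared via the PRESS statistic.

    Uses the hat-matrix shortcut: the deleted residual is e_i / (1 - h_ii),
    so no model needs to be refit. X should include an intercept column.
    """
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverages: the diagonal of the hat matrix X (X'X)^-1 X'
    h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
    press = np.sum((resid / (1 - h)) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - press / tss
```

Because each leverage h_ii is less than one, PRESS is never smaller than the ordinary residual sum of squares, so predicted R-squared never exceeds ordinary R-squared; a large gap between the two is the overfitting signal described above.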

I value this extra information that best subsets regression provides. While this procedure requires more knowledge and effort to sort through the multiple models, it helps us choose the best model based on our specific requirements. However, this method also fits many more models than stepwise regression, which increases the risk of finding chance correlations.

## Assess Your Candidate Regression Models Thoroughly

If you use stepwise regression or best subsets regression to help pick your model, you need to investigate the candidates thoroughly. That entails fitting the candidate models the normal way and checking the residual plots to be sure the fit is unbiased. You also need to assess the signs and values of the regression coefficients to be sure that they make sense. These automatic model selection procedures can find chance correlations in the sample data and produce models that don’t make sense in the real world.

Automatic variable selection procedures can be helpful tools, particularly in the exploratory stage. However, you can’t expect an automated algorithm to understand the subject area better than you! Be aware of the following potential problems.

- These procedures can sift through many different models and find correlations that exist by chance in the sample. Assess the results critically and use your expertise to determine whether they make sense.
- These procedures cannot take real-world knowledge into account. The model may not be right in a practical sense.
- Stepwise regression does not always choose the model with the largest R-squared value.

We saw how stepwise and best subsets regression compare. At this point, there is a logical question. Does one of these procedures work better? Read on!

## Which is Better, Stepwise Regression or Best Subsets Regression?

Which automatic variable selection procedure works better? Olejnik, Mills, and Keselman* performed a simulation study to compare how frequently stepwise regression and best subsets regression choose the correct model. The authors included 32 conditions in their study that differ by the number of candidate variables, the number of correct variables, the sample size, and the amount of multicollinearity. For each condition, a computer generated 1,000 datasets. The authors analyzed each dataset using both stepwise and best subsets regression. For best subsets regression, they compared the effectiveness of using the lowest Mallows' Cp to using the highest adjusted R-squared.

Drum roll, please!

The winner is … stepwise regression!

However, it is a very close competition. Overall, stepwise regression outperforms best subsets regression using the lowest Mallows' Cp by less than 3%. Best subsets regression using the highest adjusted R-squared approach is the clear loser.

However, there is a big warning to reveal.

Stepwise regression does not usually pick the correct model!

## How Accurate is Stepwise Regression?

Let’s take a closer look at the results. I’m going to cover only the stepwise results. However, best subsets regression using the lowest Mallows’ Cp follows the same patterns and is virtually tied.

First, let’s define some terms in this study.

- Authentic variables are the independent variables that truly have a relationship with the dependent variable.
- Noise variables are independent variables that do not have an actual relationship with the dependent variable.
- The correct model includes all of the authentic variables and excludes all of the noise variables.

Let’s explore the accuracy of stepwise regression in picking the correct model, and the conditions that affect its accuracy.

### When stepwise regression is most accurate

Let’s start by looking at the best case scenario for the stepwise procedure. In the study, this procedure is most capable when there are four candidate variables, three of the variables are authentic, there is no multicollinearity, and there is an extra-large sample size of 500 observations. This sample size is larger than the number of observations that most studies will collect if they are considering only four candidate variables.

In this scenario, stepwise regression chooses the correct model 84% of the time. The bad news is that this scenario is not realistic for most studies, and the accuracy drops from here.

### The role of the number of candidate variables and authentic variables in stepwise regression accuracy

The study assesses conditions with either 4 or 8 candidate independent variables (IVs). When there are more variables to evaluate, it is harder for stepwise regression to identify the correct model. This pattern also applies to the number of authentic independent variables.

The table below illustrates this pattern for scenarios with no multicollinearity and a good sample size (100-120 observations). The percentage correct decreases as the numbers of candidate variables and authentic variables increase. Notice how most scenarios produce the correct model less than half the time!

| Candidate IVs | Authentic IVs | % Correct model |
|---|---|---|
| 4 | 1 | 62.7 |
| 4 | 2 | 54.3 |
| 4 | 3 | 34.4 |
| 8 | 2 | 31.3 |
| 8 | 4 | 12.7 |
| 8 | 6 | 1.1 |

### The role of multicollinearity in stepwise regression accuracy

The study also assesses the role that multicollinearity plays in the ability of stepwise regression to choose the correct model. When independent variables are correlated, it's harder to isolate the individual effect of each variable. This difficulty occurs regardless of whether a human or a computer algorithm is trying to identify the correct model.

The table below illustrates how the percentage correct varies by the amount of correlation and the number of variables. The results are based on a good sample size (100-120). As the correlation increases, the percentage correct decreases.

| Candidate IVs | Authentic IVs | Correlation | % Correct model |
|---|---|---|---|
| 4 | 2 | 0.0 | 54.3 |
| 4 | 2 | 0.2 | 43.1 |
| 4 | 2 | 0.6 | 15.7 |
| 8 | 4 | 0.0 | 12.7 |
| 8 | 4 | 0.2 | 1.0 |
| 8 | 4 | 0.6 | 0.4 |

**Related post**: Multicollinearity in Regression Analysis: Problems, Detection, and Solutions

### The role of sample size in stepwise regression accuracy

The study assesses two sample sizes to determine how sample size affects the ability of stepwise regression to choose the correct model. The smaller sample size is based on the number of observations necessary to obtain 0.80 statistical power, which is between 100 and 120 observations. This approach is consistent with best practices, and I've referred to this size as a "good sample size" previously.

The study also uses a very large sample size, which is five times the size of the good sample size.

The table below shows that a very large sample size improves the capability of stepwise regression to choose the correct model. Collecting a very large sample size might be more expensive, but it dramatically improves the variable selection process.

| Candidate IVs | Authentic IVs | Correlation | % Correct – good sample size | % Correct – very large sample |
|---|---|---|---|---|
| 4 | 2 | 0.0 | 54.3 | 72.1 |
| 4 | 2 | 0.2 | 43.1 | 72.9 |
| 4 | 2 | 0.6 | 15.7 | 69.2 |
| 8 | 4 | 0.0 | 12.7 | 53.9 |
| 8 | 4 | 0.2 | 1.0 | 39.5 |
| 8 | 4 | 0.6 | 0.4 | 1.8 |

## Closing Thoughts on Choosing the Correct Model

Stepwise regression and best subsets regression don't usually pick the correct model. This finding holds even with the relatively low number of candidate independent variables that the simulation study assesses. In actual studies, it would not be surprising if the researchers needed to assess many more variables, which would further reduce the percentage. In fact, unlike in the simulation study, you can't even be sure that you are assessing all of the authentic variables in a real-world experiment!

Given these findings, you might be asking, "Are stepwise regression and best subsets regression (using the lowest Mallows' Cp) useful tools?"

I think they provide value during the very early, investigative stages of a study, particularly when theory doesn’t provide much guidance. However, you must rigorously assess the candidate models to see if they make sense. Further, it is important to understand that stepwise regression usually only gets you close to the correct model, but not all of the way there.

In that sense, I think stepwise regression provides some benefits. It can help you get to the right ballpark and provide a glimpse of the relationships in your data.

However, reality is complicated, and we are trying to model it with a sample. Choosing the correct model can be difficult even when researchers are armed with extensive subject-area knowledge. It is unreasonable to expect an automatic variable selection procedure to figure it out. Stepwise regression follows simple rules to pick the variables and does not know anything about the study area.

It’s up to you to go from the rough notion to the correct model. To do this, you need to use your expertise, theory, and common sense rather than depending on only simple variable selection rules. For more information about successful regression modeling, read my post: Five Regression Analysis Tips to Avoid Common Mistakes.

If you’re learning regression, check out my Regression Tutorial!

### Reference

*Stephen Olejnik, Jamie Mills, and Harvey Keselman, "Using Wherry's Adjusted R2 and Mallows' Cp for Model Selection from All Possible Regressions," *The Journal of Experimental Education*, 2000, 68(4), 365-380.


Varun Bhat says

Sir,

Very nice post. Enjoyed it.

Can you please elaborate on the topic of chance correlations?


KNS says

Hello,

Can you comment on the following from Stack Exchange? In my current exploration using best subsets, I too am finding IVs with low correlations with the DV in the better models, even when the t-statistic is not significant. I have only just begun, so I still have to dig deeper, but I am curious about the general rationale, as per the quote below.

https://stats.stackexchange.com/questions/90711/can-independent-variables-with-low-correlation-with-dependent-variable-be-signif

“With a correlation matrix, you are examining unconditional (crude) associations between your variables. With a regression model, you are examining the joint associations of your IVs with your DVs, thus looking at conditional associations (for each IV, its association with the DV conditional on the other IVs). Depending on the structure of your data, these two can yield very different, even contrary results.”

Jim Frost says

Hi,

This is a great question. I agree with the quote fully. You can get wonky results when you try to model relationships that involve multiple variables but do so only one at a time, such as when you use correlation with pairs of variables. In other words, the subject matter is too complex for an analysis as simple as pairwise correlation. Keep in mind that when you include independent variables in a regression model and you assess the effect of a particular variable, the analysis holds the other variables constant (i.e., controls for them). However, when you assess a correlation, you're not controlling for any other variables, and this can affect the results.

When you omit important variables from the analysis, it can bias your results. And you're in luck: just a couple of weeks ago I wrote a post about confounding variables and omitted variable bias that takes a close look at how this bias occurs. In fact, I use an example where a model with only one variable was not significant, but that variable became significant when I added another variable. That's very much like your situation, and I discuss why that occurs and under what conditions it occurs.

I hope this helps!