Model specification is the process of determining which independent variables to include in, and exclude from, a regression equation. How do you choose the best regression model? The world is complicated, and trying to explain it with a small sample makes the job harder. In this post, I’ll show you how to decide on the model. I’ll cover statistical methods and the difficulties that can arise, and I’ll provide practical suggestions for selecting your model. Often, the variable selection process is a mixture of statistics, theory, and practical knowledge.
Model selection in statistics is a crucial process. If you don’t select the correct model, you have made a specification error, which can invalidate your results.
Specification error occurs when the independent variables and their functional form (i.e., curvature and interactions) inaccurately portray the real relationship present in the data. Specification error can cause bias, which can exaggerate, understate, or entirely hide the presence of underlying relationships. In short, you can’t trust your results! Consequently, you need to understand model selection in statistics to choose the best regression model.
Model Selection in Statistics
The need to decide on a model often begins when a researcher wants to mathematically define the relationship between independent variables and the dependent variable. Typically, investigators measure many variables but include only some in the model. Analysts try to exclude independent variables that are not related and include only those that have an actual relationship with the dependent variable. During the specification process, the analysts typically try different combinations of variables and various forms of the model. For example, they can try different terms that explain interactions between variables and curvature in the data. During this process, analysts need to avoid a misspecification error.
The analysts need to reach a Goldilocks balance by including the correct number of independent variables in the regression equation.
- Too few: Underspecified models tend to be biased.
- Too many: Overspecified models tend to be less precise.
- Just right: Models with the correct terms are not biased and are the most precise.
To avoid biased results, your regression equation should contain any independent variables that you are specifically testing as part of the study plus other variables that affect the dependent variable.
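To make the "too few" case concrete, here is a minimal simulation sketch. Everything in it is hypothetical: the variable names, the coefficients, and the helper function `ols` are my own illustration, and it assumes NumPy is available. The confounder `x2` is correlated with `x1`, so leaving it out pushes its effect onto the `x1` coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: x2 is a confounder correlated with x1, and both affect y.
x2 = rng.normal(size=n)
x1 = 0.7 * x2 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

def ols(y, *columns):
    """Least-squares coefficients; an intercept column is added first."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

full = ols(y, x1, x2)   # correctly specified model
under = ols(y, x1)      # underspecified model: x2 is omitted

print("x1 coefficient, full model:", round(full[1], 2))
print("x1 coefficient, x2 omitted:", round(under[1], 2))
```

In this setup the true `x1` coefficient is 2. The full model should recover a value close to that, while the underspecified model should land noticeably higher because `x1` absorbs part of the effect of the missing variable.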
Model Selection Statistics
Various model selection statistics and algorithms can help you determine which independent variables to include in your regression equation. I review some standard approaches to model selection here, and I cover each of them in more detail in separate posts.
Adjusted R-squared and Predicted R-squared: Typically, you want to select models that have larger adjusted and predicted R-squared values. These statistics can help you avoid the fundamental problem with regular R-squared—it always increases when you add an independent variable. This property tempts you into specifying a model that is too complex, which can produce misleading results.
- Adjusted R-squared increases only when a new variable improves the model by more than chance. Low-quality variables can cause it to decrease.
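- Predicted R-squared is a cross-validation measure that can also decrease. It is typically calculated by removing each observation in turn, refitting the model, and checking how well the refitted model predicts the removed observation. This tells you whether the model is generalizable outside of your dataset.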
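Here is a rough sketch of how these three statistics can be computed side by side. It assumes NumPy, uses simulated data, and the function name `fit_stats` is my own. For predicted R-squared I'm assuming the usual leave-one-out (PRESS) definition, computed with the leverage shortcut rather than by refitting the model n times.

```python
import numpy as np

def fit_stats(X, y):
    """Return (R-squared, adjusted R-squared, predicted R-squared).

    X must already contain an intercept column.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sse = resid @ resid
    sst = np.sum((y - y.mean()) ** 2)
    # Leverages: the diagonal of the hat matrix H = X pinv(X).
    leverage = np.einsum("ij,ji->i", X, np.linalg.pinv(X))
    # PRESS: sum of squared leave-one-out prediction errors.
    press = np.sum((resid / (1.0 - leverage)) ** 2)
    r2 = 1.0 - sse / sst
    adj = 1.0 - (sse / (n - p)) / (sst / (n - 1))
    pred = 1.0 - press / sst
    return r2, adj, pred

rng = np.random.default_rng(1)
n = 40
x1 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 + rng.normal(size=n)   # only x1 matters
noise = rng.normal(size=(n, 10))           # ten irrelevant variables

base = np.column_stack([np.ones(n), x1])
bloated = np.column_stack([base, noise])

base_stats = fit_stats(base, y)
bloated_stats = fit_stats(bloated, y)
for name, stats in [("x1 only", base_stats), ("x1 + 10 noise", bloated_stats)]:
    print(f"{name:>14}: R2={stats[0]:.3f}  adj={stats[1]:.3f}  pred={stats[2]:.3f}")
```

Regular R-squared cannot go down when the noise columns are added. Adjusted and predicted R-squared are free to fall, and with this many irrelevant variables they typically do.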
P-values for the independent variables: In regression, p-values less than the significance level indicate that the term is statistically significant. “Reducing the model” is the process of including all candidate variables in the model, and then repeatedly removing the single term with the highest non-significant p-value until your model contains only significant terms.
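The reduction procedure can be sketched as a short loop. This is an illustration under stated assumptions, not a definitive implementation: the function names `ols_pvalues` and `reduce_model` are hypothetical, the data are simulated, and the p-values use a normal approximation to the t distribution, which is only reasonable when the sample is much larger than the number of terms. In real work you would take exact p-values from your statistical software.

```python
import math
import numpy as np

def ols_pvalues(X, y):
    """Coefficients and approximate two-sided p-values for an OLS fit.

    Assumption: normal approximation to the t distribution (large n).
    """
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)
    se = np.sqrt(sigma2 * np.diag(xtx_inv))
    t = beta / se
    pvals = np.array([math.erfc(abs(ti) / math.sqrt(2.0)) for ti in t])
    return beta, pvals

def reduce_model(X, y, names, alpha=0.05):
    """Remove the term with the largest p-value until every term is significant.

    Column 0 is assumed to be the intercept and is never removed.
    """
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        _, pvals = ols_pvalues(X[:, keep], y)
        worst = int(np.argmax(pvals[1:])) + 1   # position in keep, skipping the intercept
        if pvals[worst] < alpha:
            break                               # all remaining terms are significant
        del keep[worst]
    return [names[i] for i in keep]

rng = np.random.default_rng(2)
n = 500
predictors = rng.normal(size=(n, 6))
# Hypothetical truth: only x0 and x1 affect y.
y = 1.0 + 2.0 * predictors[:, 0] - 1.5 * predictors[:, 1] + rng.normal(size=n)
X = np.column_stack([np.ones(n), predictors])
names = ["intercept"] + [f"x{j}" for j in range(6)]

selected = reduce_model(X, y, names)
print(selected)
```

The two real predictors should always survive here because their effects are large. An irrelevant variable can occasionally survive as well, purely by chance.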
Stepwise regression and Best subsets regression: These two automated model selection procedures are algorithms that pick the variables to include in your regression equation. These automated methods can be helpful when you have many independent variables, and you need some help in the investigative stages of the variable selection process. These procedures can provide the Mallows’ Cp statistic, which helps you balance the tradeoff between precision and bias.
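To show what best subsets does behind the scenes, here is a small brute-force sketch that fits every subset of predictors and ranks them by Mallows' Cp. I'm assuming the common form Cp = SSE_p / MSE_full - n + 2p, where p counts the intercept. The function names and the simulated data are hypothetical, and real software uses much faster search methods.

```python
from itertools import combinations
import numpy as np

def sse(X, y):
    """Sum of squared residuals for a least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def best_subsets_cp(predictors, y):
    """Mallows' Cp for every non-empty subset of predictor columns, best first."""
    n, k = predictors.shape
    ones = np.ones((n, 1))
    mse_full = sse(np.hstack([ones, predictors]), y) / (n - k - 1)
    results = []
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            p = size + 1   # number of parameters, including the intercept
            X = np.hstack([ones, predictors[:, list(cols)]])
            cp = sse(X, y) / mse_full - n + 2 * p
            results.append((cols, p, cp))
    return sorted(results, key=lambda r: r[2])

rng = np.random.default_rng(5)
n, k = 200, 5
predictors = rng.normal(size=(n, k))
# Hypothetical truth: only columns 0 and 1 affect y.
y = 1.0 + 2.0 * predictors[:, 0] - 1.5 * predictors[:, 1] + rng.normal(size=n)

ranking = best_subsets_cp(predictors, y)
for cols, p, cp in ranking[:3]:
    print(f"columns {cols}: p={p}, Cp={cp:.2f}")
```

A good candidate has a small Cp that is close to its number of parameters. Subsets that leave out an important variable show a Cp far above p, which signals bias.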
Real World Complications in the Model Specification Process
The good news is that there are model selection statistics that can help you choose the best regression model. Unfortunately, there are a variety of complications that can arise. Fear not! I’ll provide some practical advice!
- Your best regression model is only as good as the data you collect. Specification of the correct model depends on you measuring the proper variables. In fact, when you omit important variables from the model, the estimates for the variables that you include can be biased. This condition is known as omitted variable bias. If you can’t include a confounder, consider including a proxy variable to avoid this bias.
- The sample you collect can be unusual, either by luck or methodology. False discoveries and false negatives are inevitable when you work with samples.
- Multicollinearity occurs when independent variables in a regression equation are correlated. When multicollinearity is present, small changes in the equation can produce dramatic changes in coefficients and p-values. It can also reduce statistical significance in variables that are relevant. For these reasons, multicollinearity makes model selection challenging.
- If you fit many models during the model selection process, you will find variables that appear to be statistically significant, but they are correlated only by chance. This problem occurs because every hypothesis test has a false positive rate, and those chances accumulate as you run more tests. This type of data mining can make even random data appear to have significant relationships!
- P-values, adjusted R-squared, predicted R-squared, and Mallows’ Cp can point to different regression equations. Sometimes there is not a clear answer.
- Stepwise regression and best subsets regression can help in the early stages of model specification. However, studies show that these tools can get close to the right answer but they usually don’t specify the correct model.
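The data mining complication above is easy to see in a simulation. In this sketch, the response and all 200 candidate predictors are pure random noise, so any "significant" correlation is a false positive by construction. The setup is hypothetical and assumes NumPy. The cutoff of 1.98 is my approximation of the two-sided 5% critical t-value for 98 degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_candidates = 100, 200

# Pure noise: y is unrelated to every candidate predictor by construction.
y = rng.normal(size=n)
candidates = rng.normal(size=(n, n_candidates))

# Correlation of y with each candidate, then its t statistic:
# t = r * sqrt(n - 2) / sqrt(1 - r^2)
yc = y - y.mean()
cc = candidates - candidates.mean(axis=0)
r = (cc.T @ yc) / np.sqrt((cc ** 2).sum(axis=0) * (yc @ yc))
t = r * np.sqrt(n - 2) / np.sqrt(1.0 - r ** 2)

# Assumption: 1.98 approximates the two-sided 5% critical value (98 df).
false_hits = int(np.sum(np.abs(t) > 1.98))
print(f"{false_hits} of {n_candidates} noise predictors look 'significant'")
```

With a 5% significance level, you should expect roughly 5% of the noise predictors to clear the bar. The more candidates you screen, the more of these chance findings you collect.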
Practical Recommendations for Model Specification
Regression model specification is as much a science as it is an art. Statistical methods can help choose the best regression model, but ultimately you’ll need to place a high weight on theory and other considerations.
The best practice for model selection in statistics is to review the literature to develop a theoretical understanding of the relevant independent variables, their relationships with the dependent variable, and the expected coefficient signs and effect magnitudes before you begin collecting data. Building your knowledge helps you collect the correct data in the first place and it helps you specify the best regression equation without resorting to data mining. For more information about this process, read 5 Steps for Conducting Scientific Studies with Statistical Analyses.
Deciding on the model should not be based only on model selection statistics. In fact, the foundation of your model selection process should depend largely on theoretical concerns. Be sure to determine whether your statistical results match theory and, if necessary, make adjustments. For example, if theory suggests that an independent variable is important, you might include it in the regression equation even when its p-value is not significant. If a coefficient sign is the opposite of theory, investigate and either modify the model or explain the inconsistency.
Analysts often think that complex problems require complicated regression equations. However, studies reveal that simplification usually produces more precise models*. When you have several models with similar predictive power, choose the simplest because it is the most likely to be the best regression model.
Start simple and then add complexity only when it is actually needed. As you make a model more complex, it becomes more likely that you are tailoring it to fit the quirks in your particular dataset rather than actual relationships in the population. This overfitting reduces generalizability and can produce results that you can’t trust.
To avoid overly complex models, don’t chase a high R-squared mindlessly. Confirm that additional complexity aligns with theory and produces narrower prediction intervals. Check other measures, such as predicted R-squared, which can alert you to overfitting.
When you’re deciding on your model, check the residual plots. Residual plots are an easy way to detect biased models and can help you make adjustments. For instance, residual plots display patterns when an underspecified regression equation is biased, which can indicate the need to model curvature. The simplest model that creates random residuals is a great contender for being reasonably precise and unbiased.
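Here is a small sketch of the residual check, using simulated data where the true relationship is curved. The data and names are hypothetical and the code assumes NumPy. In practice you would plot the residuals against the fitted values and look at them. As a crude numeric stand-in for eyeballing the plot, the code measures how strongly the residuals track the squared term.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(-3, 3, size=n)
# Hypothetical truth: the relationship between x and y is curved.
y = 1.0 + 0.5 * x + 0.8 * x ** 2 + rng.normal(size=n)

def residuals(X, y):
    """Residuals from a least-squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

ones = np.ones(n)
res_linear = residuals(np.column_stack([ones, x]), y)         # underspecified
res_quad = residuals(np.column_stack([ones, x, x ** 2]), y)   # models curvature

# Correlation between residuals and x^2: a strong value signals a leftover pattern.
pattern_linear = np.corrcoef(res_linear, x ** 2)[0, 1]
pattern_quad = np.corrcoef(res_quad, x ** 2)[0, 1]

print(f"linear model:    corr(residuals, x^2) = {pattern_linear:.2f}")
print(f"quadratic model: corr(residuals, x^2) = {pattern_quad:.2f}")
```

The straight-line model leaves a strong curved pattern in its residuals, while the model with the squared term leaves residuals that are uncorrelated with it. That is the numeric version of a residual plot going from a U shape to a random scatter.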
Ultimately, model selection statistics alone can’t tell you which regression model is best. They just don’t understand the fundamentals of the subject area. Your expertise is always a vital part of the model specification process! For more help with the regression model selection process, read my post: Five Regression Analysis Tips to Avoid Common Mistakes.
Choosing the best regression model is one issue, while choosing the right type of regression analysis for your data is an entirely different matter.
* Zellner, A. (2001). Keep it sophisticatedly simple. In H. Keuzenkamp & M. McAleer (Eds.), Simplicity, Inference, and Modelling: Keeping it Sophisticatedly Simple. Cambridge: Cambridge University Press.
Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.