Model specification is the process of determining which independent variables to include in, and which to exclude from, a regression equation. How do you choose the best regression model? The world is complicated, and trying to explain it with a small sample doesn’t help. In this post, I’ll show you how to select the correct model. I’ll cover statistical methods, discuss difficulties that can arise, and provide practical suggestions for selecting your model. Often, the variable selection process is a mixture of statistics, theory, and practical knowledge.
The need for model selection often begins when a researcher wants to mathematically define the relationship between independent variables and the dependent variable. Typically, investigators measure many variables but include only some in the model. Analysts try to exclude independent variables that are not related and include only those that have an actual relationship with the dependent variable. During the specification process, the analysts typically try different combinations of variables and various forms of the model. For example, they can try different terms that explain interactions between variables and curvature in the data.
The analysts need to reach a Goldilocks balance by including the correct number of independent variables in the regression equation.
- Too few: Underspecified models tend to be biased.
- Too many: Overspecified models tend to be less precise.
- Just right: Models with the correct terms are not biased and are the most precise.
To avoid biased results, your regression equation should contain any independent variables that you are specifically testing as part of the study plus other variables that affect the dependent variable.
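To make the "too few" case concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the variable names, the true coefficients (2 for x1, 3 for x2), and the 0.7 link between x1 and x2. With those choices, theory says that omitting x2 should push the estimated slope on x1 toward roughly 2 + 3 × 0.7 = 4.1, while the correctly specified model should average close to 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_ols(X, y):
    """Least-squares coefficients for y ~ X (X already contains an intercept column)."""
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

n_sims, n = 500, 100
slopes_under, slopes_right = [], []
for _ in range(n_sims):
    x1 = rng.normal(size=n)
    x2 = 0.7 * x1 + rng.normal(size=n)                   # x2 is correlated with x1
    y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)   # true slope on x1 is 2

    ones = np.ones(n)
    # Underspecified model: x2 is left out
    slopes_under.append(fit_ols(np.column_stack([ones, x1]), y)[1])
    # Correctly specified model: both variables included
    slopes_right.append(fit_ols(np.column_stack([ones, x1, x2]), y)[1])

mean_under = float(np.mean(slopes_under))
mean_right = float(np.mean(slopes_right))
print(f"omitting x2:  mean estimated slope on x1 = {mean_under:.2f}")
print(f"including x2: mean estimated slope on x1 = {mean_right:.2f}")
```

The bias appears only because x2 both affects y and is correlated with x1. If you set the 0.7 to 0, the underspecified slope should no longer be biased, although it will be noisier.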
Statistical Methods for Model Specification
You can use statistical assessments during the model specification process. Various metrics and algorithms can help you determine which independent variables to include in your regression equation. Here I review some standard approaches to model selection; my more detailed posts cover each of them in depth.
Adjusted R-squared and Predicted R-squared: Typically, you want to select models that have larger adjusted and predicted R-squared values. These statistics can help you avoid the fundamental problem with regular R-squared: it never decreases when you add an independent variable, even one that is unrelated to the dependent variable. This property tempts you into specifying a model that is too complex, which can produce misleading results.
- Adjusted R-squared increases only when a new variable improves the model by more than chance. Low-quality variables can cause it to decrease.
- Predicted R-squared is based on cross-validation, and it can also decrease when you add a variable. Cross-validation partitions your data to determine whether the model generalizes beyond your dataset.
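The sketch below shows one common way to compute the three statistics, under some assumptions of mine: adjusted R-squared is taken as 1 − (1 − R²)(n − 1)/(n − p − 1) with p predictors, and predicted R-squared is taken as 1 − PRESS/SST, where PRESS is the sum of squared leave-one-out prediction errors. Your software may define them slightly differently. The function name r2_summary and the simulated data (one real predictor plus ten pure-noise columns) are hypothetical.

```python
import numpy as np

def r2_summary(X, y):
    """Return (R², adjusted R², predicted R²) for OLS; X must include an intercept column."""
    n, k = X.shape                                   # k counts the intercept, so k = p + 1
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sse = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    # Leverages: the diagonal of the hat matrix X (X'X)^-1 X'
    leverage = np.einsum("ij,ji->i", X, np.linalg.pinv(X))
    # PRESS: leave-one-out errors obtained without refitting the model n times
    press = float(((resid / (1.0 - leverage)) ** 2).sum())
    r2 = 1.0 - sse / sst
    adj = 1.0 - (sse / (n - k)) / (sst / (n - 1))
    pred = 1.0 - press / sst
    return r2, adj, pred

rng = np.random.default_rng(1)
n = 40
x1 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 + rng.normal(size=n)
noise = rng.normal(size=(n, 10))                     # predictors unrelated to y

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([X_small, noise])

small = r2_summary(X_small, y)
big = r2_summary(X_big, y)
for label, stats in (("x1 only", small), ("x1 + 10 noise columns", big)):
    print(f"{label:>22}: R2={stats[0]:.3f}  adjusted={stats[1]:.3f}  predicted={stats[2]:.3f}")
```

Regular R-squared for the larger model cannot be lower than for the smaller one. Adjusted and predicted R-squared have no such guarantee, and with this many noise columns they will typically fall.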
P-values for the independent variables: In regression, p-values less than the significance level indicate that the term is statistically significant. “Reducing the model” is the process of including all candidate variables in the model, and then repeatedly removing the single term with the highest non-significant p-value until your model contains only significant terms.
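Here is a rough sketch of that "reducing the model" loop. To keep it dependency-free beyond NumPy, I approximate the p-values with the normal distribution rather than the exact t distribution, which is reasonable only for fairly large samples. The function names, the 0.05 significance level, and the simulated data are all assumptions for illustration.

```python
import math
import numpy as np

def ols_pvalues(X, y):
    """Two-sided p-values per coefficient (normal approximation to the t distribution)."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)                 # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(xtx_inv))          # coefficient standard errors
    t = beta / se
    return np.array([math.erfc(abs(ti) / math.sqrt(2.0)) for ti in t])

def backward_eliminate(X, y, names, alpha=0.05):
    """Repeatedly drop the predictor with the highest p-value until all are below alpha."""
    keep = list(range(X.shape[1]))
    while keep:
        design = np.column_stack([np.ones(len(y)), X[:, keep]])
        pvals = ols_pvalues(design, y)[1:]           # skip the intercept
        worst = int(np.argmax(pvals))
        if pvals[worst] < alpha:
            break                                    # every remaining term is significant
        del keep[worst]                              # remove one term, then refit
    return [names[i] for i in keep]

rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 5))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)   # only x0 and x1 matter
names = ["x0", "x1", "x2", "x3", "x4"]

selected = backward_eliminate(X, y, names)
print(selected)
```

Removing one term at a time matters because p-values change after every refit. Note that an unrelated variable can still survive this procedure by chance.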
Stepwise regression and Best subsets regression: These two automated model selection procedures are algorithms that pick the variables to include in your regression equation. These automated methods can be helpful when you have many independent variables, and you need some help in the investigative stages of the variable selection process. These procedures can provide the Mallows’ Cp statistic, which helps you balance the tradeoff between precision and bias.
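To illustrate the idea behind best subsets regression, the sketch below fits every combination of candidate predictors and ranks them by Mallows’ Cp. I’m assuming the usual definition Cp = SSE_p / MSE_full − n + 2p, where p counts the parameters including the intercept, and where a model with little bias has Cp close to p. The function names and simulated data are hypothetical, and the exhaustive search is practical only for a small number of candidates.

```python
import itertools
import numpy as np

def sse(X, y):
    """Sum of squared residuals from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def best_subsets_cp(X, y, names):
    """Return (Cp, p, variable names) for every subset, sorted from lowest Cp to highest."""
    n, m = X.shape
    ones = np.ones((n, 1))
    mse_full = sse(np.hstack([ones, X]), y) / (n - m - 1)   # error variance from the full model
    rows = []
    for size in range(m + 1):
        for cols in itertools.combinations(range(m), size):
            p = size + 1                                     # parameters, including the intercept
            design = np.hstack([ones, X[:, list(cols)]])
            cp = sse(design, y) / mse_full - n + 2 * p
            rows.append((cp, p, [names[c] for c in cols]))
    return sorted(rows, key=lambda row: row[0])

rng = np.random.default_rng(5)
n = 100
X = rng.normal(size=(n, 4))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)   # only x0 and x1 matter
names = ["x0", "x1", "x2", "x3"]

ranked = best_subsets_cp(X, y, names)
for cp, p, subset in ranked[:3]:
    print(f"Cp={cp:7.2f}  p={p}  variables={subset}")
```

By construction, the full model always has Cp equal to its own p, so Cp is most informative for comparing the smaller candidate models.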
Real World Complications in the Model Specification Process
The good news is that there are statistical methods that can help you with model specification. Unfortunately, there are a variety of complications that can arise. Fear not! I’ll provide some practical advice!
- Your best model is only as good as the data you collect. Specification of the correct model depends on you measuring the proper variables. In fact, when you omit important variables from the model, the estimates for the variables that you include can be biased. This condition is known as omitted variable bias. If you can’t include a confounder, consider including a proxy variable to avoid this bias.
- The sample you collect can be unusual, either by luck or methodology. False discoveries and false negatives are inevitable when you work with samples.
- Multicollinearity occurs when independent variables in a regression equation are correlated. When multicollinearity is present, small changes in the equation can produce dramatic changes in coefficients and p-values. It can also reduce statistical significance in variables that are relevant. For these reasons, multicollinearity makes model selection challenging.
- If you fit many models during the model selection process, you will find variables that appear to be statistically significant but that are related to the dependent variable only by chance. This problem occurs because every hypothesis test has a false positive rate. This type of data mining can make even random data appear to have significant relationships!
- P-values, adjusted R-squared, predicted R-squared, and Mallows’ Cp can point to different regression equations. Sometimes there is not a clear answer.
- Stepwise regression and best subsets regression can help in the early stages of model specification. However, studies show that these tools can get close to the right answer but they usually don’t specify the correct model.
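The data mining problem in the list above is easy to reproduce. In this sketch, both the dependent variable and all 200 candidate predictors are pure random noise, and each candidate is tested in its own simple regression. The sample size, the number of candidates, and the approximate 5% critical value of 2.01 (for 48 degrees of freedom) are my assumptions. At a 5% significance level, you would expect roughly 10 of the 200 candidates to look significant on average.

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_candidates = 50, 200
y = rng.normal(size=n)                          # the response is pure noise
X = rng.normal(size=(n, n_candidates))          # so is every candidate predictor

# In simple regression, the slope's t statistic follows from the correlation r:
#   t = r * sqrt((n - 2) / (1 - r^2))
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_candidates)])
t = r * np.sqrt((n - 2) / (1.0 - r**2))

t_crit = 2.01                                   # approx. two-sided 5% cutoff, 48 df (assumed)
false_hits = int((np.abs(t) > t_crit).sum())
print(f"{false_hits} of {n_candidates} random predictors look 'significant'")
```

None of these relationships exist in the population, which is why theory should guide which candidates you test in the first place.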
Practical Recommendations for Model Specification
Regression model specification is as much a science as it is an art. Statistical methods can help, but ultimately you’ll need to place a high weight on theory and other considerations.
The best practice is to review the literature to develop a theoretical understanding of the relevant independent variables, their relationships with the dependent variable, and the expected coefficient signs and effect magnitudes before you begin collecting data. Building your knowledge helps you collect the correct data in the first place and it helps you specify the best regression equation without resorting to data mining. For more information about this process, read 5 Steps for Conducting Scientific Studies with Statistical Analyses.
Specification should not be based only on statistical measures. In fact, the foundation of your model selection process should depend largely on theoretical concerns. Be sure to determine whether your statistical results match theory and, if necessary, make adjustments. For example, if theory suggests that an independent variable is important, you might include it in the regression equation even when its p-value is not significant. If a coefficient sign is the opposite of theory, investigate and either modify the model or explain the inconsistency.
Analysts often think that complex problems require complicated regression equations. However, studies reveal that simplification usually produces more precise models*. When you have several models with similar predictive power, choose the simplest because it is the most likely to be the best model.
Start simple and then add complexity only when it is actually needed. As you make a model more complex, it becomes more likely that you are tailoring it to fit the quirks in your particular dataset rather than actual relationships in the population. This overfitting reduces generalizability and can produce results that you can’t trust.
To avoid overly complex models, don’t chase a high R-squared mindlessly. Confirm that additional complexity aligns with theory and produces narrower prediction intervals. Check other measures, such as predicted R-squared, which can alert you to overfitting.
During the specification process, check the residual plots. Residual plots are an easy way to avoid biased models and can help you make adjustments. For instance, residual plots display patterns when an underspecified regression equation is biased, which can indicate the need to model curvature. The simplest model that creates random residuals is a great contender for being reasonably precise and unbiased.
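As a text-only stand-in for a residual plot, the sketch below averages the residuals within bins of the predictor. The simulated data have a curved relationship (an assumption for illustration, as are the function names and the choice of six bins). When the model omits the squared term, the binned means should trace a clear U shape. After adding it, they should hover near zero.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
x = rng.uniform(-3, 3, size=n)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(size=n)   # the true relationship is curved

def residuals(X, y):
    """Residuals from an OLS fit of y on X (X includes an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def binned_means(x, resid, n_bins=6):
    """Average residual within equal-width bins of x."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    return np.array([resid[idx == b].mean() for b in range(n_bins)])

ones = np.ones(n)
lin = binned_means(x, residuals(np.column_stack([ones, x]), y))          # underspecified
quad = binned_means(x, residuals(np.column_stack([ones, x, x**2]), y))   # curvature modeled

print("linear model, mean residual by bin:   ", np.round(lin, 2))
print("quadratic model, mean residual by bin:", np.round(quad, 2))
```

In practice you would look at an actual plot of residuals versus fitted values, but the binned means capture the same pattern.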
Ultimately, statistical measures can’t tell you which regression equation is best. They just don’t understand the fundamentals of the subject area. Your expertise is always a vital part of the model specification process! For more help with the regression modeling process, read my post: Five Regression Analysis Tips to Avoid Common Mistakes.
If you’re learning regression, check out my Regression Tutorial!
* Zellner, A. (2001). Keep it sophisticatedly simple. In H. Keuzenkamp & M. McAleer (Eds.), Simplicity, Inference, and Modelling: Keeping it Sophisticatedly Simple. Cambridge University Press, Cambridge.
Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.