Model specification is the process of determining which independent variables to include in and exclude from a regression equation. How do you choose the best regression model? The world is complicated, and trying to explain it with a small sample doesn’t help. In this post, I’ll show you how to select the correct model. I’ll cover statistical methods, difficulties that can arise, and provide practical suggestions for selecting your model. Often, the variable selection process is a mixture of statistics, theory, and practical knowledge.

The need for model selection often begins when a researcher wants to mathematically define the relationship between independent variables and the dependent variable. Typically, investigators measure many variables but include only some in the model. Analysts try to exclude independent variables that are not related and include only those that have an actual relationship with the dependent variable. During the specification process, the analysts typically try different combinations of variables and various forms of the model. For example, they can try different terms that explain interactions between variables and curvature in the data.

The analysts need to reach a Goldilocks balance by including the correct number of independent variables in the regression equation.

- **Too few**: Underspecified models tend to be biased.
- **Too many**: Overspecified models tend to be less precise.
- **Just right**: Models with the correct terms are not biased and are the most precise.

To avoid biased results, your regression equation should contain any independent variables that you are specifically testing as part of the study plus other variables that affect the dependent variable.

**Related post**: When Should I Use Regression?

## Statistical Methods for Model Specification

You can use statistical assessments during the model specification process. Various metrics and algorithms can help you determine which independent variables to include in your regression equation. I review some standard approaches to model selection, but please click the links to read my more detailed posts about them.

**Adjusted R-squared and Predicted R-squared**: Typically, you want to select models that have larger adjusted and predicted R-squared values. These statistics can help you avoid the fundamental problem with regular R-squared: it *never* decreases when you add an independent variable, and in practice it almost always increases. This property tempts you into specifying a model that is too complex, which can produce misleading results.

- Adjusted R-squared increases only when a new variable improves the model by more than chance. Low-quality variables can cause it to decrease.
- Predicted R-squared is a cross-validation method that can also decrease. Cross-validation partitions your data to determine whether the model is generalizable outside of your dataset.
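To make the difference between these statistics concrete, here is a minimal sketch in Python using NumPy. The function name `r_squared_summary` and the simulated data are my own assumptions for illustration, not output from any particular statistics package. Predicted R-squared is computed from the PRESS statistic, which is equivalent to leave-one-out cross-validation for ordinary least squares.

```python
import numpy as np

def r_squared_summary(X, y):
    """Return (R2, adjusted R2, predicted R2) for an OLS fit with an intercept."""
    n, p = X.shape                                # p = number of predictors
    A = np.column_stack([np.ones(n), X])          # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sse = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    # Leverages: diagonal of the hat matrix A @ pinv(A)
    leverage = np.einsum("ij,ji->i", A, np.linalg.pinv(A))
    # PRESS: sum of squared leave-one-out prediction errors
    press = float(((resid / (1.0 - leverage)) ** 2).sum())
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    pred_r2 = 1.0 - press / sst
    return r2, adj_r2, pred_r2

# Simulated data (illustrative only): one real predictor plus five noise columns
rng = np.random.default_rng(0)
n = 40
x1 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 + rng.normal(size=n)
noise = rng.normal(size=(n, 5))

r2_small, adj_small, pred_small = r_squared_summary(x1.reshape(-1, 1), y)
r2_big, adj_big, pred_big = r_squared_summary(np.column_stack([x1, noise]), y)

print(f"x1 only:      R2={r2_small:.3f}  adj={adj_small:.3f}  pred={pred_small:.3f}")
print(f"x1 + 5 noise: R2={r2_big:.3f}  adj={adj_big:.3f}  pred={pred_big:.3f}")
```

Regular R-squared for the larger model cannot be lower than for the smaller one, because the models are nested. Adjusted and predicted R-squared carry no such guarantee, which is what makes them useful for comparing models of different sizes.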

**P-values for the independent variables**: In regression, p-values less than the significance level indicate that the term is statistically significant. “Reducing the model” is the process of including all candidate variables in the model, and then repeatedly removing the single term with the highest non-significant p-value until your model contains only significant terms.

**Stepwise regression and Best subsets regression**: These two automated model selection procedures are algorithms that pick the variables to include in your regression equation. These automated methods can be helpful when you have many independent variables, and you need some help in the investigative stages of the variable selection process. These procedures can provide the Mallows’ Cp statistic, which helps you balance the tradeoff between precision and bias.
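As a rough illustration of what best subsets regression does, the sketch below fits every combination of candidate predictors and reports Mallows’ Cp for each. It assumes the common definition Cp = SSE_subset / MSE_full − n + 2k, where k counts the parameters including the intercept. The helper names (`sse`, `mallows_cp_table`) and the simulated data are hypothetical, and statistical software may differ in the details.

```python
from itertools import combinations
import numpy as np

def sse(X, y):
    """Sum of squared errors for an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def mallows_cp_table(X, y):
    """Return (columns, k, Cp) for every non-empty subset of predictors."""
    n, m = X.shape
    mse_full = sse(X, y) / (n - m - 1)            # error variance from the full model
    rows = []
    for size in range(1, m + 1):
        for cols in combinations(range(m), size):
            k = size + 1                          # parameters incl. intercept
            cp = sse(X[:, list(cols)], y) / mse_full - n + 2 * k
            rows.append((cols, k, cp))
    return rows

# Simulated data (illustrative only): columns 0 and 1 matter, 2 and 3 are noise
rng = np.random.default_rng(42)
n = 60
X = rng.normal(size=(n, 4))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

table = mallows_cp_table(X, y)
for cols, k, cp in sorted(table, key=lambda row: row[2])[:5]:
    print(f"predictors={cols}  k={k}  Cp={cp:.2f}")
```

A Cp value close to k suggests a subset with little bias. By construction, the full model always has Cp equal to its own k, so that row alone tells you nothing. Look for smaller subsets whose Cp is near their parameter count.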

## Real World Complications in the Model Specification Process

The good news is that there are statistical methods that can help you with model specification. Unfortunately, there are a variety of complications that can arise. Fear not! I’ll provide some practical advice!

- Your best model is only as good as the data you collect. Specification of the correct model depends on you measuring the proper variables. In fact, when you omit important variables from the model, the estimates for the variables that you include can be biased. This condition is known as omitted variable bias.
- The sample you collect can be unusual, either by luck or methodology. False discoveries and false negatives are inevitable when you work with samples.
- Multicollinearity occurs when independent variables in a regression equation are correlated. When multicollinearity is present, small changes in the equation can produce dramatic changes in coefficients and p-values. It can also reduce statistical significance in variables that are relevant. For these reasons, multicollinearity makes model selection challenging.
- If you fit many models during the model selection process, you will find variables that appear to be statistically significant but are related to the dependent variable only by chance. This problem occurs because every hypothesis test has a false positive rate, and those chances accumulate as you run more tests. This type of data mining can make even random data appear to have significant relationships!
- P-values, adjusted R-squared, predicted R-squared, and Mallows’ Cp can point to different regression equations. Sometimes there is not a clear answer.
- Stepwise regression and best subsets regression can help in the early stages of model specification. However, studies show that these tools can get close to the right answer but they usually don’t specify the correct model.
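The data mining problem in the list above comes down to simple arithmetic. The sketch below assumes the tests are independent and that none of the candidate variables has a real effect, which is a simplification, and the function name is mine.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance of one or more false positives across independent tests
    when every null hypothesis is actually true (a simplifying assumption)."""
    return 1.0 - (1.0 - alpha) ** n_tests

n_candidates = 20
alpha = 0.05
p_any = prob_at_least_one_false_positive(n_candidates, alpha)
expected_false_positives = n_candidates * alpha

print(f"P(at least one false positive) = {p_any:.2f}")
print(f"Expected number of false positives = {expected_false_positives:.1f}")
```

With 20 unrelated candidate variables and a significance level of 0.05, the chance of at least one spuriously significant result is about 0.64, and you should expect one false positive on average.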

## Practical Recommendations for Model Specification

Regression model specification is as much a science as it is an art. Statistical methods can help, but ultimately you’ll need to place a high weight on theory and other considerations.

### Theory

The best practice is to review the literature to develop a theoretical understanding of the relevant independent variables, their relationships with the dependent variable, and the expected coefficient signs and effect magnitudes before you begin collecting data. Building your knowledge helps you collect the correct data in the first place and it helps you specify the best regression equation without resorting to data mining.

Specification should not be based only on statistical measures. In fact, the foundation of your model selection process should depend largely on theoretical concerns. Be sure to determine whether your statistical results match theory and, if necessary, make adjustments. For example, if theory suggests that an independent variable is important, you might include it in the regression equation even when its p-value is not significant. If a coefficient sign is the opposite of theory, investigate and either modify the model or explain the inconsistency.

### Simplicity

Analysts often think that complex problems require complicated regression equations. However, studies reveal that simplification usually produces more precise models*. When you have several models with similar predictive power, choose the simplest because it is the most likely to be the best model.

Start simple and then add complexity only when it is actually needed. As you make a model more complex, it becomes more likely that you are tailoring it to fit the quirks in your particular dataset rather than actual relationships in the population. This overfitting reduces generalizability and can produce results that you can’t trust.

To avoid overly complex models, don’t chase a high R-squared mindlessly. Confirm that additional complexity aligns with theory and produces narrower prediction intervals. Check other measures, such as predicted R-squared, which can alert you to overfitting.

### Residual Plots

During the specification process, check the residual plots. Residual plots are an easy way to avoid biased models and can help you make adjustments. For instance, residual plots display patterns when an underspecified regression equation is biased, which can indicate the need to model curvature. The simplest model that creates random residuals is a great contender for being reasonably precise and unbiased.
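Here is a small sketch of how an underspecified model leaves a pattern in its residuals. The data are simulated with a known curved relationship, purely as an assumption for illustration. In place of a plot, it uses the correlation between the residuals and the squared predictor as a rough numeric stand-in. In practice, you would plot the residuals against the fitted values and look at them.

```python
import numpy as np

# Simulated data (illustrative only): the true relationship is curved
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 60)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(scale=0.5, size=x.size)

def fit_residuals(A, y):
    """Residuals from an OLS fit of y on the columns of design matrix A."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ beta

ones = np.ones_like(x)
resid_linear = fit_residuals(np.column_stack([ones, x]), y)
resid_quadratic = fit_residuals(np.column_stack([ones, x, x**2]), y)

# A strong correlation with x^2 means the linear model left curvature behind
corr_linear = np.corrcoef(resid_linear, x**2)[0, 1]
corr_quadratic = np.corrcoef(resid_quadratic, x**2)[0, 1]

print(f"linear model:    corr(residuals, x^2) = {corr_linear:.3f}")
print(f"quadratic model: corr(residuals, x^2) = {corr_quadratic:.3f}")
```

The straight-line model leaves residuals that track the squared term closely, which is the U-shaped pattern you would see in the plot. Adding the squared term removes that pattern.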

Ultimately, statistical measures can’t tell you which regression equation is best. They just don’t understand the fundamentals of the subject area. Your expertise is always a vital part of the model specification process! For more help with the regression modeling process, read my post: Five Regression Analysis Tips to Avoid Common Mistakes.

Choosing the correct regression model is one issue, while choosing the right type of regression analysis for your data is an entirely different matter.

If you’re learning regression, check out my Regression Tutorial!

*Reference*

Zellner, A. (2001), Keep it sophisticatedly simple. In Keuzenkamp, H. & McAleer, M. Eds. *Simplicity, Inference, and Modelling: Keeping it Sophisticatedly Simple*. Cambridge University Press, Cambridge.

**Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.**

karishma says

Thank you for your help Jim.

karishma says

Hi Jim,

I’m doing a multiple regression analysis on time series data. Can you recommend me some models that I can use for my analysis?

Thanks!

Jim Frost says

Hi Karishma,

You can use OLS regression to analyze time series data. Generally, you’ll need to include lagged variables and other time related predictors. Importantly, you can include predictors that are important to your study, which allows the analysis to estimate effects for them. You can use the model to make predictions. Be sure to pay particular attention to your residuals by order plot and the Durbin-Watson statistic to be sure that your model fits the data.

You can also use ARIMA, which is a regression-like approach to time series data. It includes multiple time series methods in one model (autoregressive, differencing, and moving average components). You can use relatively sophisticated correlational methods to uncover otherwise hidden patterns. You can use the model to make predictions. However, while it models the dependent variable, it does not allow you to add other predictors into the model.

There are simpler time series models available, but they are less like regression, so I won’t detail them here.

Unfortunately, I don’t have much experience using regression analyses with time series data. There are undoubtedly other options available.

I hope this helps!

Hanna says

Hi Jim,

Thanks for this really helpful blog!

I am wondering whether I can use AIC and BIC to help me see which model fits my data best. Or are AIC and BIC only applicable when comparing the same model with different sets of variables (i.e., do they tell me which variable selection is the best)? So could I use AIC and BIC to tell me whether a Poisson or a negative binomial regression is best? And could I also compare OLS with count data models?

Any advice is much appreciated!

Peter Strauss says

So in 2015, a fairly similar article was posted on another website.

Care to at least give that one as a source?

Jim Frost says

Hi Peter,

Yes, I wrote both articles. I’ve been adding notes to that effect in several places and will need to add one to this post.

For some reason, the organization removed most authors’ names from the articles. If you use the Internet Archive Wayback Machine and view an older version of that article, you’ll see that I am the author.

Thanks for writing!