Model specification is the process of determining which independent variables to include in, and which to exclude from, a regression equation. How do you choose the best regression model? The world is complicated, and trying to explain it with a small sample doesn’t help. In this post, I’ll show you how to select the correct model. I’ll cover statistical methods, describe difficulties that can arise, and provide practical suggestions for selecting your model. Often, the variable selection process is a mixture of statistics, theory, and practical knowledge.

The need for model selection often begins when a researcher wants to mathematically define the relationship between independent variables and the dependent variable. Typically, investigators measure many variables but include only some in the model. Analysts try to exclude independent variables that are not related and include only those that have an actual relationship with the dependent variable. During the specification process, the analysts typically try different combinations of variables and various forms of the model. For example, they can try different terms that explain interactions between variables and curvature in the data.

The analysts need to reach a Goldilocks balance by including the correct number of independent variables in the regression equation.

- **Too few**: Underspecified models tend to be biased.
- **Too many**: Overspecified models tend to be less precise.
- **Just right**: Models with the correct terms are not biased and are the most precise.

To avoid biased results, your regression equation should contain any independent variables that you are specifically testing as part of the study plus other variables that affect the dependent variable.

**Related post**: When Should I Use Regression?

## Statistical Methods for Model Specification

You can use statistical assessments during the model specification process. Various metrics and algorithms can help you determine which independent variables to include in your regression equation. I review some standard approaches to model selection here; my more detailed posts cover each of them in depth.

**Adjusted R-squared and Predicted R-squared**: Typically, you want to select models that have larger adjusted and predicted R-squared values. These statistics can help you avoid the fundamental problem with regular R-squared: it *never* decreases when you add an independent variable, and it almost always increases. This property tempts you into specifying a model that is too complex, which can produce misleading results.

- Adjusted R-squared increases only when a new variable improves the model by more than chance. Low-quality variables can cause it to decrease.
- Predicted R-squared is a cross-validation method that can also decrease. Cross-validation partitions your data to determine whether the model is generalizable outside of your dataset.
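To make these two statistics concrete, here is a minimal Python sketch using NumPy. It assumes the common definitions: adjusted R-squared = 1 − (1 − R²)(n − 1)/(n − p − 1), and predicted R-squared = 1 − PRESS/SST, where PRESS is the sum of squared leave-one-out prediction errors. The function names and the simulated data are illustrative assumptions, not taken from any particular statistics package.

```python
import numpy as np

def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (intercept not counted)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def r2_summary(X, y):
    """Return (R2, adjusted R2, predicted R2) for an OLS fit that includes an intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    p = design.shape[1] - 1
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sst = float(np.sum((y - y.mean()) ** 2))
    r2 = 1 - float(resid @ resid) / sst
    # Leverages (diagonal of the hat matrix) give the leave-one-out residuals
    # without refitting the model n times.
    leverage = np.einsum("ij,ji->i", design, np.linalg.pinv(design))
    press = float(np.sum((resid / (1 - leverage)) ** 2))
    return r2, adjusted_r2(r2, n, p), 1 - press / sst

# Simulated data (hypothetical): one real predictor plus five pure-noise predictors.
rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)
noise_vars = rng.normal(size=(n, 5))

small = r2_summary(x, y)
big = r2_summary(np.column_stack([x, noise_vars]), y)
print("1 predictor : R2=%.3f adj=%.3f pred=%.3f" % small)
print("6 predictors: R2=%.3f adj=%.3f pred=%.3f" % big)
```

Regular R-squared cannot fall when the noise predictors are added, whereas adjusted and predicted R-squared are free to drop. As a check of the formula, `adjusted_r2(0.2037, 62, 3)` gives roughly 0.1625, which agrees with the first regression output a reader posted in the comments (62 observations, 3 terms).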

**P-values for the independent variables**: In regression, p-values less than the significance level indicate that the term is statistically significant. “Reducing the model” is the process of including all candidate variables in the model, and then repeatedly removing the single term with the highest non-significant p-value until your model contains only significant terms.
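The reduction procedure described above could be sketched roughly as follows. This is only an illustration: to stay within NumPy and the standard library, the p-values use a large-sample normal approximation rather than the t-distribution that statistical software reports, and the helper names (`ols_pvalues`, `reduce_model`) and simulated data are hypothetical.

```python
import math
import numpy as np

def ols_pvalues(design, y):
    """Two-sided p-values for each column of the design matrix (normal approximation)."""
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (n - k)
    cov = sigma2 * np.linalg.inv(design.T @ design)
    t_stats = beta / np.sqrt(np.diag(cov))
    return np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])

def reduce_model(X, y, names, alpha=0.05):
    """Start with all candidates; repeatedly drop the term with the highest
    non-significant p-value until every remaining term is significant."""
    keep = list(range(X.shape[1]))
    while keep:
        design = np.column_stack([np.ones(len(y)), X[:, keep]])
        pvals = ols_pvalues(design, y)[1:]  # skip the intercept
        worst = int(np.argmax(pvals))
        if pvals[worst] < alpha:
            break  # everything left is significant
        del keep[worst]
    return [names[i] for i in keep]

# Simulated data (hypothetical): x1 and x2 matter, x3 and x4 are noise.
rng = np.random.default_rng(7)
n = 500
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)
names = ["x1", "x2", "x3", "x4"]
kept = reduce_model(X, y, names)
print(kept)
```

Because each test has a 5% false positive rate at this significance level, a noise variable can occasionally survive the procedure, which is one reason not to rely on p-values alone.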

**Stepwise regression and Best subsets regression**: These two automated model selection procedures are algorithms that pick the variables to include in your regression equation. These automated methods can be helpful when you have many independent variables, and you need some help in the investigative stages of the variable selection process. These procedures can provide the Mallows’ Cp statistic, which helps you balance the tradeoff between precision and bias.
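For intuition, a bare-bones best subsets search with Mallows’ Cp might look like the sketch below. It assumes the usual definition Cp = SSE_p / MSE_full − n + 2p, where p counts the model’s parameters including the intercept and MSE_full comes from the model with all candidate predictors. The function names and simulated data are hypothetical, and real software uses far more efficient search strategies.

```python
from itertools import combinations
import numpy as np

def sse(design, y):
    """Sum of squared residuals from an OLS fit of y on the design matrix."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

def best_subsets_cp(X, y, names):
    """Fit every non-empty subset of predictors and rank the fits by Mallows' Cp."""
    n, k = X.shape
    ones = np.ones((n, 1))
    mse_full = sse(np.hstack([ones, X]), y) / (n - k - 1)
    rows = []
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            p = size + 1  # parameters, counting the intercept
            cp = sse(np.hstack([ones, X[:, list(cols)]]), y) / mse_full - n + 2 * p
            rows.append((cp, p, [names[c] for c in cols]))
    return sorted(rows, key=lambda row: row[0])

# Simulated data (hypothetical): only x1 and x2 affect y.
rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=(n, 4))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)
names = ["x1", "x2", "x3", "x4"]
results = best_subsets_cp(X, y, names)
for cp, p, terms in results[:5]:
    print(f"Cp={cp:6.2f}  p={p}  terms={terms}")
```

A Cp value close to the number of parameters p suggests a model with little bias; by construction, the full model always has Cp equal to its own p, so that row is not informative by itself.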

## Real World Complications in the Model Specification Process

The good news is that there are statistical methods that can help you with model specification. Unfortunately, there are a variety of complications that can arise. Fear not! I’ll provide some practical advice!

- Your best model is only as good as the data you collect. Specification of the correct model depends on you measuring the proper variables. In fact, when you omit important variables from the model, the estimates for the variables that you include can be biased. This condition is known as omitted variable bias.
- The sample you collect can be unusual, either by luck or methodology. False positives and false negatives are inevitable when you work with samples.
- Multicollinearity occurs when independent variables in a regression equation are correlated. When multicollinearity is present, small changes in the equation can produce dramatic changes in coefficients and p-values. It can also reduce statistical significance in variables that are relevant. For these reasons, multicollinearity makes model selection challenging.
- If you fit many models during the model selection process, you will find variables that appear to be statistically significant but are correlated with the dependent variable only by chance. This problem occurs because every hypothesis test has a false positive rate. This type of data mining can make even random data appear to have significant relationships!
- P-values, adjusted R-squared, predicted R-squared, and Mallows’ Cp can point to different regression equations. Sometimes there is not a clear answer.
- Stepwise regression and best subsets regression can help in the early stages of model specification. However, studies show that these tools can get close to the right answer but they usually don’t specify the correct model.
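Omitted variable bias, the first complication in the list above, is easy to see in a small simulation. The sketch below uses made-up data in which `x2` is correlated with `x1` and both affect `y`; all names and coefficient values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)            # x2 is correlated with x1
y = 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)  # true effects: 2 for x1, 3 for x2

def ols(columns, y):
    """OLS coefficients (intercept first) for the given predictor columns."""
    design = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

full = ols([x1, x2], y)   # correctly specified model
short = ols([x1], y)      # x2 omitted

print("x1 coefficient with x2 in the model:", round(float(full[1]), 2))
print("x1 coefficient with x2 omitted:     ", round(float(short[1]), 2))
```

In this setup, the estimate for `x1` should land near its true value of 2 when `x2` is included, but near 2 + 3 × 0.5 = 3.5 when `x2` is omitted, because `x1` absorbs part of the effect of the missing variable.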

## Practical Recommendations for Model Specification

Regression model specification is as much a science as it is an art. Statistical methods can help, but ultimately you’ll need to place a high weight on theory and other considerations.

### Theory

The best practice is to review the literature to develop a theoretical understanding of the relevant independent variables, their relationships with the dependent variable, and the expected coefficient signs and effect magnitudes before you begin collecting data. Building your knowledge helps you collect the correct data in the first place and it helps you specify the best regression equation without resorting to data mining. For more information about this process, read 5 Steps for Conducting Scientific Studies with Statistical Analyses.

Specification should not be based only on statistical measures. In fact, the foundation of your model selection process should depend largely on theoretical concerns. Be sure to determine whether your statistical results match theory and, if necessary, make adjustments. For example, if theory suggests that an independent variable is important, you might include it in the regression equation even when its p-value is not significant. If a coefficient sign is the opposite of theory, investigate and either modify the model or explain the inconsistency.

### Simplicity

Analysts often think that complex problems require complicated regression equations. However, studies reveal that simplification usually produces more precise models*. When you have several models with similar predictive power, choose the simplest because it is the most likely to be the best model.

Start simple and then add complexity only when it is actually needed. As you make a model more complex, it becomes more likely that you are tailoring it to fit the quirks in your particular dataset rather than actual relationships in the population. This overfitting reduces generalizability and can produce results that you can’t trust.

To avoid overly complex models, don’t chase a high R-squared mindlessly. Confirm that additional complexity aligns with theory and produces narrower prediction intervals. Check other measures, such as predicted R-squared, which can alert you to overfitting.

### Residual Plots

During the specification process, check the residual plots. Residual plots are an easy way to avoid biased models and can help you make adjustments. For instance, residual plots display patterns when an underspecified regression equation is biased, which can indicate the need to model curvature. The simplest model that creates random residuals is a great contender for being reasonably precise and unbiased.
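As a rough numeric stand-in for eyeballing a residual plot, the sketch below fits a straight line and then a quadratic to simulated curved data, and measures how strongly each model’s residuals track a curvature term. The data and variable names are hypothetical; in practice you would plot the residuals against the fitted values rather than rely on a single number.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
y = 1.0 + 0.5 * x + 0.3 * x**2 + rng.normal(scale=1.0, size=x.size)  # curved relationship

def residuals(design, y):
    """Residuals from an OLS fit of y on the design matrix."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta

ones = np.ones_like(x)
resid_linear = residuals(np.column_stack([ones, x]), y)           # underspecified
resid_quadratic = residuals(np.column_stack([ones, x, x**2]), y)  # models the curvature

# A U-shaped residual pattern shows up as correlation with (x - mean)^2.
curvature = (x - x.mean()) ** 2
corr_linear = np.corrcoef(resid_linear, curvature)[0, 1]
corr_quadratic = np.corrcoef(resid_quadratic, curvature)[0, 1]
print(f"linear model residuals vs curvature:    {corr_linear:.2f}")
print(f"quadratic model residuals vs curvature: {corr_quadratic:.2f}")
```

The residuals from the straight-line fit keep a strong systematic pattern, while those from the quadratic fit are uncorrelated with the curvature term, which is what random-looking residuals in a plot correspond to.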

Ultimately, statistical measures can’t tell you which regression equation is best. They just don’t understand the fundamentals of the subject area. Your expertise is always a vital part of the model specification process! For more help with the regression modeling process, read my post: Five Regression Analysis Tips to Avoid Common Mistakes.

Choosing the correct regression model is one issue, while choosing the right type of regression analysis for your data is an entirely different matter.

If you’re learning regression, check out my Regression Tutorial!

*Reference*

Zellner, A. (2001), Keep it sophisticatedly simple. In Keuzenkamp, H. & McAleer, M. Eds. *Simplicity, Inference, and Modelling: Keeping it Sophisticatedly Simple*. Cambridge University Press, Cambridge.

**Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.**

John says

Hi Jim,

How would I specify a regression model consisting of both continuous and categorical regressors? And how do I interpret the output of that model?

Xiaojie Cheng says

Hi Jim,

Thank you for your excellent and intuitive explanations. I’m a graduate student, and recently I’ve been trying to find interactive relationships between two genes by adding their interaction terms to regression models. I have some questions about choosing the best regression model. The DV can be affected by several IVs (B1, B2, …, Bn), and my aim is to find which Bn may be regulated by another IV (A). I have built three models to deal with that, but the results are very different.

Model 1: DV=A+Bn+A*Bn

I include only one pair of IVs (A and Bn) in the model each time, and then repeat this model n times. When Bn is B1 (DV=A+B1+A*B1), all of the terms are significant.

Coefficients:

| Term | Estimate | Std. Error | t value | Pr(>\|t\|) | Signif. |
|---|---|---|---|---|---|
| (Intercept) | -1.732e+03 | 3.987e+02 | -4.343 | 5.72e-05 | *** |
| A | 2.658e+01 | 8.261e+00 | 3.217 | 0.00212 | ** |
| B1 | 6.576e+00 | 2.140e+00 | 3.073 | 0.00323 | ** |
| A*B1 | -8.390e-02 | 2.889e-02 | -2.904 | 0.00521 | ** |

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1065 on 58 degrees of freedom

Multiple R-squared: 0.2037, Adjusted R-squared: 0.1625

F-statistic: 4.945 on 3 and 58 DF, p-value: 0.003994

Model 2: DV=A+B1+B2+…+Bn+A*Bn

To avoid biased results, as you suggested, I add all the IVs that may affect the DV, but keep only one target interaction term. I then repeat this model n times.

When interaction term is A*B1, the interaction effect is insignificant.

Coefficients:

| Term | Estimate | Std. Error | t value | Pr(>\|t\|) | Signif. |
|---|---|---|---|---|---|
| (Intercept) | -2.124e+03 | 2.815e+02 | -7.546 | 7.49e-10 | *** |
| A | 1.516e+01 | 5.994e+00 | 2.530 | 0.01454 | * |
| B1 | 2.056e+00 | 1.810e+00 | 1.136 | 0.26145 | |
| B2 | 3.657e+00 | 2.402e+00 | 1.523 | 0.13404 | |
| B3 | 6.188e-01 | 4.108e-01 | 1.506 | 0.13822 | |
| B4 | 4.790e-01 | 3.337e-01 | 1.435 | 0.15734 | |
| B5 | -4.909e-01 | 1.355e+00 | -0.362 | 0.71871 | |
| B6 | 1.485e+00 | 6.239e-01 | 2.381 | 0.02104 | * |
| B7 | 1.600e+01 | 5.756e+00 | 2.780 | 0.00759 | ** |
| B8 | 2.062e-02 | 1.827e-02 | 1.129 | 0.26433 | |
| A*B1 | -3.465e-02 | 2.225e-02 | -1.558 | 0.12551 | |

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 674.5 on 51 degrees of freedom

Multiple R-squared: 0.7194, Adjusted R-squared: 0.6643

F-statistic: 13.07 on 10 and 51 DF, p-value: 6.148e-11

Model 3: DV=A+B1+A*B1+B2+A*B2…+Bn+A*Bn

In this model, I add all the IVs (Bn) and their interaction terms with A simultaneously, so the model runs only once. In this situation, none of the terms are significant.

Coefficients:

| Term | Estimate | Std. Error | t value | Pr(>\|t\|) | Signif. |
|---|---|---|---|---|---|
| (Intercept) | -2.314e+03 | 3.984e+02 | -5.809 | 6.45e-07 | *** |
| A | 2.410e+01 | 1.277e+01 | 1.886 | 0.0658 | . |
| B1 | 5.936e-01 | 2.170e+00 | 0.274 | 0.7857 | |
| B2 | 5.281e+00 | 6.525e+00 | 0.809 | 0.4226 | |
| B3 | 4.074e-01 | 1.238e+00 | 0.329 | 0.7436 | |
| B4 | 4.417e-01 | 1.202e+00 | 0.368 | 0.7150 | |
| B5 | -4.153e-01 | 3.814e+00 | -0.109 | 0.9138 | |
| B6 | 2.775e+00 | 1.777e+00 | 1.562 | 0.1255 | |
| B7 | 9.274e+00 | 1.136e+01 | 0.816 | 0.4187 | |
| B8 | 4.297e-02 | 4.573e-02 | 0.940 | 0.3524 | |
| A*B1 | -1.749e-02 | 3.531e-02 | -0.495 | 0.6228 | |
| A*B2 | -8.492e-02 | 1.707e-01 | -0.498 | 0.6212 | |
| A*B3 | 6.077e-03 | 2.901e-02 | 0.209 | 0.8350 | |
| A*B4 | 1.723e-03 | 2.737e-02 | 0.063 | 0.9501 | |
| A*B5 | 4.894e-02 | 1.136e-01 | 0.431 | 0.6688 | |
| A*B6 | -5.186e-02 | 5.362e-02 | -0.967 | 0.3388 | |
| A*B7 | 3.067e-01 | 5.010e-01 | 0.612 | 0.5436 | |
| A*B8 | -4.106e-04 | 8.732e-04 | -0.470 | 0.6405 | |

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 686 on 44 degrees of freedom

Multiple R-squared: 0.7496, Adjusted R-squared: 0.6528

F-statistic: 7.747 on 17 and 44 DF, p-value: 2.326e-08

My question: Is the significant interaction effect between A and B1 in model 1 reliable? Which is the best model for finding the interactive relationship between A and Bn?

In addition, the IVs above are not centered, because after centering I get the same results for the interaction terms and sometimes less significant main effects.

Dan says

Thank you very much for your help and support

SAMUEL K BREFO-ABABIO says

Hey Jim, thanks for your insightful post. Are there any steps or factors that best determine whether a data analyst should build one comprehensive model or instead build many models on partitions of the data?

Dan says

Thank you for your useful content.

Does that mean we should use the same control variables as the previous literature, or can we use the most suitable variables after running some experiments?

Jim Frost says

Hi Dan,

Theory and the scientific literature should guide you when possible. If other studies find that particular variables are important, you should consider them for your study. Because of omitted variable bias, it can be risky to leave out variables that other studies have found to be important. That is particularly true if you’re performing an observational study rather than a randomized study. However, you can certainly add your own variables into the mix if you’re testing new theories and/or have access to new types of data.

So, be very careful when removing control variables that have been identified as being important. You should have, and be able to explain, good reasons for removing them. Feel freer when it comes to adding new variables.

Sravani Lekkala says

What should we do if the output variable is highly skewed (skewness > 4)?

Jim Frost says

Hi Sravani,

When the output/dependent variable is skewed, it can be more difficult to satisfy the OLS assumptions. Note that the OLS assumptions don’t state that the dependent variable must be normally distributed itself, but instead state that the residuals should be normally distributed. And, obtaining normally distributed residuals can be more difficult when the DV is skewed.

There are several things you can try.

Sometimes modeling the curvature, if it exists, will help. In my post about using regression to make predictions, I use BMI to predict body fat percentage. Body fat percentage is the DV and it is skewed. However, the relationship between BMI and BF% is curved and by modeling that curvature, the residuals are normally distributed.

As the skew worsens, it becomes harder to get good residuals. You might need to transform your DV. I don’t have a blog post about that but I include a lot of information about data transformations in my regression ebook.

Those are several things that I’d look into first.

Best of luck with your analysis!

Chuck says

Hi Jim,

What does it mean when a regression model has a negative prediction R2 while the R2 and adjusted R2 are positive and reasonable?

Jim Frost says

Hi Chuck,

Any time the predicted R-squared is notably less than the adjusted/regular R-squared values, it means that the model doesn’t predict new observations as well as it explains the observations that were used in the model fitting process. Often this indicates that you’re overfitting the model: it has too many predictors given the size of the dataset. Usually when it’s so bad as to be negative, it’s because the dataset is pretty small. Read my posts about adjusted and predicted R-squared and overfitting for more information.

While the regular R-squared ranges from 0 to 100%, both predicted and adjusted R-squared can have negative values. A negative value doesn’t have any special interpretation other than just being really bad. Some statistical software will round negative values to zero. I tend to see negative values for predicted R-squared more than adjusted R-squared. As you’ll see in the post I recommend, it’s often the more sensitive measure to problems with the model.

Take the negative predicted R-squared seriously. You’re probably overfitting your model. I’d also bet that you have a fairly small dataset.

Mariyam says

Hi Jim,

I’m currently doing research for my Economics degree, and this has been very helpful. I do have some doubts though. My research topic is “Relationship between Inflation and Economic growth in Maldives and how it affects the Maldivian economy”.

For this topic, I’m using GDP as the dependent variable and inflation, unemployment, and GDP per capita as independent variables. I want to know whether it’s right to use all of these variables in one equation for this topic. Once I figure that out, it will be easy to run the regression.

Hope you could help me out in this.

Thanks

Maryam

Muideen says

Very useful write up. Thanks Jim

Where a number of empirical models relate similar independent variables to a particular dependent variable, what are the usual justifications for choosing the particular empirical model that one intends to build one’s research on?

Jim Frost says

I’d focus on using theory and the literature to guide you. Statistical measures can also provide information. I describe the process that you should use in this blog post.

Twiza says

Hi jim,

I’m truly grateful for this beautiful blog. It is truly helping me with my dissertation!

I need help choosing a model for a binary DV (poverty). I ran different types of logistic regression on my dataset depending on which post-estimation tests I was carrying out.

When I tested goodness of fit with estat gof and linktest after running a logistic regression, my prob>chi was 0.0000, which rejects the null hypothesis (H0) that the model fits.

I tried adding more independent variables, but to no avail. I have 3 categorical independent variables and 1 continuous independent variable that are insignificant. The other 6 continuous independent variables are significant.

What do I do about those two tests? I seriously need help.

Thanks in advance.

Regards Twiza.

Jim Frost says

Hi Twiza,

Because you have a binary DV, you need to use binary logistic regression. However, it’s impossible for me to determine why your model isn’t fitting. Some suggestions would be to try fitting interaction terms and polynomial terms, just as you would for a least squares model. Another possibility is to try different link functions.

jagriti khanna says

Hi Jim

I read your post thoroughly, but I still have some doubts. I’m doing multiple regression with 9 predictor variables. I’ve used p-values to check which of my variables are important. I also plotted each independent variable against the dependent variable and noted that each variable has a polynomial relationship at the individual level. So how do I do multivariate polynomial regression? Can you please help me with this?

Thanks in advance

Jim Frost says

Hi Jagriti,

It’s great that you graphed the data like that. It’s such an important step, but so many people skip it!

It sounds like you just need to add the polynomial terms to your model. I write more about this in my post about fitting curves, which explains that process. After you fit the curvature, be sure to check the residual plots to make sure that you didn’t miss anything!

Henry Lee says

Hi Jim thanks for your blog.

My problem is much simpler than a multiple regression: I have some data showing a curved trend, and I would like to select the best polynomial model (1st, 2nd, 3rd or 4th order polynomial) for fitting these data. The ‘best’ model should have a good fit but should also be as simple as possible (the lowest order polynomial that produces a good fit).

Someone suggested the Akaike Information Criterion, which penalizes the complexity of the model. What are the possible tests or approaches for this (apparently) simple problem?

Thank you in advance!

Henry Lee

Jim Frost says

Hi Henry,

I definitely agree with your approach about using the simplest model that fits the data adequately.

I write about using polynomials to fit curvature in my post about curve fitting with regression. In practice, I find that 3rd order and higher polynomials are very rare. I’d suggest starting by graphing your data, counting the bends that you see, and using the corresponding polynomial, as I describe in the curve fitting post. You should also apply theory, particularly if you’re using 3rd order or higher. Does theory support modeling those extra bends in the data, or are they likely the product of a fluky sample or a small data set?

As for statistical tests, p-values are a good place to start. If a polynomial term is not significant, consider removing it. I also suggest using adjusted R-squared because you’re comparing models with different numbers of terms. Perhaps even more crucial is using predicted R-squared. That statistic helps prevent you from overfitting your model. As you increase the polynomial order, you might just be playing connect the dots and fitting the noise in your data rather than fitting the real relationships. I’ve written a post about adjusted R-squared and predicted R-squared that you should read. I even include an example where it appears like a 3rd order polynomial provides a good fit but predicted R-squared indicates that you’re overfitting the data.
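A comparison along these lines could be sketched in Python with NumPy as follows. The data are simulated from a quadratic curve purely for illustration, the function names are hypothetical, and predicted R-squared is computed here by literally refitting the polynomial with each point left out in turn.

```python
import numpy as np

def adjusted_r2(y, fitted, order):
    """Adjusted R-squared for a polynomial of the given order (order = number of terms)."""
    n = len(y)
    sse = np.sum((y - fitted) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return float(1 - (sse / (n - order - 1)) / (sst / (n - 1)))

def predicted_r2(x, y, order):
    """Leave-one-out: refit without each point, then predict the point that was left out."""
    press = 0.0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        coefs = np.polyfit(x[mask], y[mask], order)
        press += (y[i] - np.polyval(coefs, x[i])) ** 2
    return float(1 - press / np.sum((y - y.mean()) ** 2))

# Simulated data (hypothetical): a single bend, so a 2nd order polynomial is the true form.
rng = np.random.default_rng(3)
x = np.linspace(0, 10, 40)
y = 5 + 2 * x - 0.3 * x**2 + rng.normal(scale=1.0, size=x.size)

results = {}
for order in (1, 2, 3, 4):
    fitted = np.polyval(np.polyfit(x, y, order), x)
    results[order] = (adjusted_r2(y, fitted, order), predicted_r2(x, y, order))
    print(f"order {order}: adjusted R2 = {results[order][0]:.3f}, "
          f"predicted R2 = {results[order][1]:.3f}")
```

With data like these, both statistics should jump when moving from a 1st to a 2nd order polynomial and then level off or decline, which points toward the lowest order that fits adequately.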

Finally, be sure to assess the residual plots because they’ll show you if you’re not adequately modeling the curvature.

Best of luck with your analysis!

Arup Dey says

I’m doing multiple regression analysis with four independent variables, so it is not possible to see the shape of a graph that shows the relationship between the DV and the IVs. In this case, how can I know the best regression model for my data? For example, linear, quadratic, or exponential.

Jim Frost says

Hi Arup,

I’ve written a blog post about fitting the curvature in your data. That post will answer your questions! Also, consider graphing your residuals by each IV to see if you need to fit a curve for each variable. I talk about these methods in even more detail in my ebook about regression. You might check that out!

Best of luck with fitting your model!

Mani says

Hey Sir,

My question might not be related, but I’m quite confused about some problems. For example, when we study human behavior, we use demographic variables such as the age and sex of the child. Why do we use them, and what is the rationale behind this? And how do we interpret them?

Thanks.

Jim Frost says

Hi Mani,

You’d use these demographic variables because you think that they correlate with your DV. For instance, understanding age and gender might help you understand changes in the DV. For example, your DV might increase for males compared to females or increase with age. These variables can provide important information in their own right. Additionally, if you don’t include these variables and they are important, you risk biasing the estimates for your other variables. See omitted variable bias for details!

If you include these demographic variables in your model and they are not statistically significant, you can consider removing them from the model.

You interpret this type of variable in the same manner as any other independent variable. See regression coefficients and p-values for details.
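As a small illustration of that interpretation, the sketch below simulates a hypothetical outcome that depends on age (continuous) and sex (coded as a 0/1 indicator) and fits both in one least squares model. The variable names, coding, and effect sizes are all assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
age = rng.uniform(5, 15, size=n)     # continuous predictor
male = rng.integers(0, 2, size=n)    # indicator coding: 1 = male, 0 = female
# Hypothetical DV: rises 1.5 units per year of age and is 4 units higher for males.
score = 10 + 1.5 * age + 4.0 * male + rng.normal(scale=2.0, size=n)

design = np.column_stack([np.ones(n), age, male])
intercept, b_age, b_male = np.linalg.lstsq(design, score, rcond=None)[0]

print(f"age coefficient:  {b_age:.2f}  (change in DV per additional year, holding sex constant)")
print(f"male coefficient: {b_male:.2f}  (mean difference, males vs females, holding age constant)")
```

The coefficient on the indicator is read as the average difference between the two groups at the same age, while the coefficient on age is read as the change in the DV for each additional year within either group.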

karishma says

Thank you for your help Jim.

karishma says

Hi Jim,

I’m doing a multiple regression analysis on time series data. Can you recommend me some models that I can use for my analysis?

Thanks!

Jim Frost says

Hi Karishma,

You can use OLS regression to analyze time series data. Generally, you’ll need to include lagged variables and other time related predictors. Importantly, you can include predictors that are important to your study, which allows the analysis to estimate effects for them. You can use the model to make predictions. Be sure to pay particular attention to your residuals by order plot and the Durbin-Watson statistic to be sure that your model fits the data.
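To show what a lagged predictor and the Durbin-Watson check look like in practice, here is a minimal sketch on a simulated autocorrelated series. It assumes the standard definition of the statistic (sum of squared successive residual differences divided by the sum of squared residuals), where values near 2 suggest little first-order autocorrelation. The series and function names are hypothetical.

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: near 2 suggests no first-order autocorrelation,
    values well below 2 suggest positive autocorrelation."""
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

def ols_residuals(design, y):
    """Residuals from an OLS fit of y on the design matrix."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta

# Simulated series (hypothetical): each value depends on the previous one.
rng = np.random.default_rng(11)
n = 300
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.8 * y[t - 1] + rng.normal()

target = y[1:]        # y at time t
lagged = y[:-1]       # y at time t-1, used as a predictor
ones = np.ones(n - 1)

dw_no_lag = durbin_watson(ols_residuals(ones[:, None], target))
dw_with_lag = durbin_watson(ols_residuals(np.column_stack([ones, lagged]), target))
print(f"Durbin-Watson without lagged term: {dw_no_lag:.2f}")
print(f"Durbin-Watson with lagged term:    {dw_with_lag:.2f}")
```

Without the lagged term, the residuals remain strongly autocorrelated and the statistic falls well below 2; including the lag moves it close to 2. Note that the Durbin-Watson statistic is known to be biased toward 2 when a lagged dependent variable is among the predictors, so treat the second value as a rough check and still inspect the residuals by order plot.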

You can also use ARIMA, which is a regression-like approach to time series data. It includes multiple time series methods in one model (autoregressive, differencing, and moving average components). You can use relatively sophisticated correlational methods to uncover otherwise hidden patterns. You can use the model to make predictions. However, while it models the dependent variable, a basic ARIMA model does not allow you to add other predictors into the model.

There are simpler time series models available, but they are less like regression, so I won’t detail them here.

Unfortunately, I don’t have much experience using regression analyses with time series data. There are undoubtedly other options available.

I hope this helps!

Hanna says

Hi Jim,

Thanks for this really helpful blog!

I am wondering whether I can use AIC and BIC to help me see which model fits my data best. Or are AIC and BIC only applicable when comparing the same model with different sets of variables (i.e., they tell me which variable selection is the best)? So could I use AIC and BIC to tell me whether a Poisson or a negative binomial regression is best? And could I also compare OLS with count data models?

Any advice is much appreciated!

Peter Strauss says

So in 2015, a fairly similar article was posted on another website.

Care to at least give that one as a source?

Jim Frost says

Hi Peter,

Yes, I wrote both articles. I’ve been adding notes to that effect in several places and will need to add one to this post.

For some reason, the organization removed most authors’ names from the articles. If you use the Internet Archive Wayback Machine and view an older version of that article, you’ll see that I am the author.

Thanks for writing!