Model specification is the process of determining which independent variables to include in and exclude from a regression equation. How do you choose the best regression model? The world is complicated, and trying to explain it with a small sample doesn't make the task any easier. In this post, I'll show you how to decide on a model. I'll cover statistical methods, difficulties that can arise, and practical suggestions for selecting your model. Often, the variable selection process is a mixture of statistics, theory, and practical knowledge.
Model selection in statistics is a crucial process. If you don’t select the correct model, you have made a specification error, which can invalidate your results.
Specification error is when the independent variables and their functional form (i.e., curvature and interactions) inaccurately portray the real relationship present in the data. Specification error can cause bias, which can exaggerate, understate, or entirely hide the presence of underlying relationships. In short, you can’t trust your results! Consequently, you need to understand model selection in statistics to choose the best regression model.
Model Selection in Statistics
The need to decide on a model often begins when a researcher wants to mathematically define the relationship between independent variables and the dependent variable. Typically, investigators measure many variables but include only some in the model. Analysts try to exclude independent variables that are not related and include only those that have an actual relationship with the dependent variable. During the specification process, the analysts typically try different combinations of variables and various forms of the model. For example, they can try different terms that explain interactions between variables and curvature in the data. During this process, analysts need to avoid a misspecification error.
The analysts need to reach a Goldilocks balance by including the correct number of independent variables in the regression equation.
- Too few: Underspecified models tend to be biased.
- Too many: Overspecified models tend to be less precise.
- Just right: Models with the correct terms are not biased and are the most precise.
To avoid biased results, your regression equation should contain any independent variables that you are specifically testing as part of the study plus other variables that affect the dependent variable.
Related post: When Should I Use Regression?
Model Selection Statistics
You can use various model selection statistics to help you decide on the best regression model. These metrics and algorithms can help you determine which independent variables to include in your regression equation. I review some standard approaches to model selection below, but please click the links to read my more detailed posts about them.
Adjusted R-squared and Predicted R-squared: Typically, you want to select models that have larger adjusted and predicted R-squared values. These statistics can help you avoid the fundamental problem with regular R-squared—it always increases when you add an independent variable. This property tempts you into specifying a model that is too complex, which can produce misleading results.
- Adjusted R-squared increases only when a new variable improves the model by more than chance. Low-quality variables can cause it to decrease.
- Predicted R-squared is a cross-validation method that can also decrease. Cross-validation partitions your data to determine whether the model is generalizable outside of your dataset.
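To make the predicted R-squared idea concrete, here's a minimal sketch in Python with statsmodels that computes it from the leave-one-out PRESS residuals. This is illustrative only; the function name and the made-up data are my own, not output from any particular statistical package.

```python
import numpy as np
import statsmodels.api as sm

def predicted_r_squared(X, y):
    """Predicted R-squared from the PRESS statistic: leave-one-out
    residuals computed efficiently from the hat matrix diagonal."""
    X = sm.add_constant(X)                      # include the intercept
    results = sm.OLS(y, X).fit()
    hat = results.get_influence().hat_matrix_diag
    press = np.sum((results.resid / (1 - hat)) ** 2)
    ss_total = np.sum((y - np.mean(y)) ** 2)
    return 1 - press / ss_total

# Made-up data: y depends on x1 only; x2 is pure noise.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=50), rng.normal(size=50)
y = 3 + 2 * x1 + rng.normal(size=50)

print(predicted_r_squared(np.column_stack([x1]), y))      # useful predictor
print(predicted_r_squared(np.column_stack([x1, x2]), y))  # the noise term usually lowers it
```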
P-values for the independent variables: In regression, p-values less than the significance level indicate that the term is statistically significant. “Reducing the model” is the process of including all candidate variables in the model, and then repeatedly removing the single term with the highest non-significant p-value until your model contains only significant terms.
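To illustrate that reduction process (not as a recommendation to rely on p-values alone), here's a rough Python/statsmodels sketch. The function name, the 0.05 cutoff, and the assumption that the candidate variables live in a pandas DataFrame are all mine.

```python
import pandas as pd
import statsmodels.api as sm

def reduce_model(X: pd.DataFrame, y, alpha=0.05):
    """Backward elimination: repeatedly drop the term with the highest
    non-significant p-value, then refit, until all terms are significant."""
    terms = list(X.columns)
    while terms:
        model = sm.OLS(y, sm.add_constant(X[terms])).fit()
        pvalues = model.pvalues.drop("const")   # ignore the intercept's p-value
        worst = pvalues.idxmax()
        if pvalues[worst] <= alpha:
            return model                        # every remaining term is significant
        terms.remove(worst)                     # drop the worst term and refit
    return None                                 # no candidate term was significant
```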
Stepwise regression and Best subsets regression: These two automated model selection procedures are algorithms that pick the variables to include in your regression equation. These automated methods can be helpful when you have many independent variables, and you need some help in the investigative stages of the variable selection process. These procedures can provide the Mallows’ Cp statistic, which helps you balance the tradeoff between precision and bias.
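Best subsets regression is usually a built-in feature of statistical software, but a small Python sketch shows the idea: fit every combination of candidate predictors and compare Mallows' Cp (models with Cp close to the number of parameters tend to balance bias and precision). The DataFrame layout and function name are assumptions for illustration.

```python
from itertools import combinations
import statsmodels.api as sm

def best_subsets_cp(X, y):
    """Fit every subset of candidate predictors and compute Mallows' Cp:
    Cp = SSE_p / MSE_full - (n - 2p), where p counts the intercept too."""
    full = sm.OLS(y, sm.add_constant(X)).fit()
    mse_full = full.mse_resid
    n = len(y)
    results = []
    for k in range(1, len(X.columns) + 1):
        for subset in combinations(X.columns, k):
            fit = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
            p = len(subset) + 1
            cp = fit.ssr / mse_full - (n - 2 * p)
            results.append((subset, cp, fit.rsquared_adj))
    return sorted(results, key=lambda r: r[1])   # smallest Cp first
```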
Real World Complications in the Model Specification Process
The good news is that there are model selection statistics that can help you choose the best regression model. Unfortunately, there are a variety of complications that can arise. Fear not! I’ll provide some practical advice!
- Your best regression model is only as good as the data you collect. Specification of the correct model depends on you measuring the proper variables. In fact, when you omit important variables from the model, the estimates for the variables that you include can be biased. This condition is known as omitted variable bias. If you can’t include a confounder, consider including a proxy variable to avoid this bias.
- The sample you collect can be unusual, either by luck or methodology. False discoveries and false negatives are inevitable when you work with samples.
- Multicollinearity occurs when independent variables in a regression equation are correlated with each other. When multicollinearity is present, small changes in the equation can produce dramatic changes in coefficients and p-values. It can also hide the statistical significance of relevant variables. For these reasons, multicollinearity makes model selection challenging (see the VIF sketch after this list).
- If you fit many models during the model selection process, you will find variables that appear to be statistically significant, but they are correlated only by chance. This problem occurs because all hypothesis tests have a false discovery rate. This type of data mining can make even random data appear to have significant relationships!
- P-values, adjusted R-squared, predicted R-squared, and Mallows’ Cp can point to different regression equations. Sometimes there is not a clear answer.
- Stepwise regression and best subsets regression can help in the early stages of model specification. However, studies show that these tools can get close to the right answer but they usually don’t specify the correct model.
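For the multicollinearity point above, here's a minimal sketch of checking variance inflation factors (VIFs) with statsmodels. Treat the usual 5-10 threshold as a rule of thumb rather than a hard cutoff; the function name is just for illustration.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF for each candidate predictor; values above roughly 5-10
    suggest multicollinearity that can destabilize the model."""
    X_const = sm.add_constant(X)
    return pd.Series({col: variance_inflation_factor(X_const.values, i)
                      for i, col in enumerate(X_const.columns) if col != "const"})
```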
Practical Recommendations for Model Specification
Regression model specification is as much an art as it is a science. Statistical methods can help choose the best regression model, but ultimately you'll need to place a high weight on theory and other considerations.
Theory
The best practice for model selection in statistics is to review the literature to develop a theoretical understanding of the relevant independent variables, their relationships with the dependent variable, and the expected coefficient signs and effect magnitudes before you begin collecting data. Building your knowledge helps you collect the correct data in the first place and it helps you specify the best regression equation without resorting to data mining. For more information about this process, read 5 Steps for Conducting Scientific Studies with Statistical Analyses.
Deciding on the model should not be based only on model selection statistics. In fact, the foundation of your model selection process should depend largely on theoretical concerns. Be sure to determine whether your statistical results match theory and, if necessary, make adjustments. For example, if theory suggests that an independent variable is important, you might include it in the regression equation even when its p-value is not significant. If a coefficient sign is the opposite of theory, investigate and either modify the model or explain the inconsistency.
Simplicity
Analysts often think that complex problems require complicated regression equations. However, studies reveal that simplification usually produces more precise models*. When you have several models with similar predictive power, choose the simplest because it is the most likely to be the best regression model.
Start simple and then add complexity only when it is actually needed. As you make a model more complex, it becomes more likely that you are tailoring it to fit the quirks in your particular dataset rather than actual relationships in the population. This overfitting reduces generalizability and can produce results that you can’t trust.
To avoid overly complex models, don’t chase a high R-squared mindlessly. Confirm that additional complexity aligns with theory and produces narrower prediction intervals. Check other measures, such as predicted R-squared, which can alert you to overfitting.
Statisticians say that a simple but effective model is parsimonious. Learn more about Parsimonious Models: Benefits and Selection.
Residual Plots
When you're deciding on your model, check the residual plots. Residual plots are an easy way to avoid biased models and can help you make adjustments. For instance, residual plots display patterns when an underspecified regression equation is biased, which can indicate the need to model curvature. The simplest model that produces random residuals is a great candidate for being reasonably precise and unbiased.
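If your software doesn't produce these plots automatically, they're easy to generate. Here's a rough Python sketch (assuming you already have a fitted statsmodels OLS results object):

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

def residual_plots(results):
    """Residuals vs. fitted values plus a normal Q-Q plot.
    Curvature or funnel shapes suggest a misspecified model."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.scatter(results.fittedvalues, results.resid, alpha=0.6)
    ax1.axhline(0, color="gray", linestyle="--")
    ax1.set_xlabel("Fitted values")
    ax1.set_ylabel("Residuals")
    sm.qqplot(results.resid, line="45", fit=True, ax=ax2)
    plt.tight_layout()
    plt.show()
```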
Ultimately, model selection statistics alone can’t tell you which regression model is best. They just don’t understand the fundamentals of the subject area. Your expertise is always a vital part of the model specification process! For more help with the regression model selection process, read my post: Five Regression Analysis Tips to Avoid Common Mistakes.
Choosing the best regression model is one issue, while choosing the right type of regression analysis for your data is an entirely different matter.
If you’re learning regression, check out my Regression Tutorial!
Reference
Zellner, A. (2001). Keep it sophisticatedly simple. In H. Keuzenkamp & M. McAleer (Eds.), Simplicity, Inference and Modelling: Keeping It Sophisticatedly Simple. Cambridge: Cambridge University Press.
Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.
Robert Seymour says
Hi Jim
Just finished your book on Regression, and in there you give a rule of thumb that you should have at least 10-15 observations for every term in the model (IV, interaction, and polynomial) to avoid overfitting the model. My question is, when you add a categorical variable that has a number of levels (say 4), is that still adding a single term, or will the number of levels impact how many observations you might need to stay within the rule of thumb?
thanks
Rob
Jim Frost says
Hi Rob,
Categorical variables count as multiple terms because of the way they’re entered into the model as a series of indicator (dummy) variables. Depending on your software, you might not see that but it’s happening behind the scenes. I talk about that in the book in the section about categorical variables. If your categorical variable has 4 levels, then the software adds 3 indicator variables. Hence, following the rule of thumb, you should include 30-45 observations for that one categorical variable.
However, there’s no exact consensus on the number of observations per term. The need to add additional observations can vary depending on the strength of the expected effect and the base number of observations you plan to have. But this approach gives you a good rough idea for the minimum. And if you read the section about the automated selection procedures, you know it’s better to have more than the minimums (even if you’re selecting the variables yourself).
It is important to note that categorical variables do use more degrees of freedom than continuous variables. And if you include interaction terms, the problem multiplies . . . literally! Plan accordingly!
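To illustrate what’s happening behind the scenes, here’s a quick pandas sketch (the variable name and levels are made up):

```python
import pandas as pd

# A 4-level categorical variable becomes 3 indicator (dummy) columns,
# so it uses 3 degrees of freedom rather than 1.
df = pd.DataFrame({"machine": ["A", "B", "C", "D", "A", "C"]})
dummies = pd.get_dummies(df["machine"], prefix="machine", drop_first=True)
print(dummies.columns.tolist())   # ['machine_B', 'machine_C', 'machine_D']
```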
Birara Endalew says
Dear Professor Jim Frost,
I would like to cite your post in my publication, but I could not find the publication date. Could you share the date and year of your post with me?
Thank you.
Jim Frost says
Hi Birara!
Definitely feel free to cite this article! 🙂
When citing online resources, you typically use an “Accessed” date rather than a publication date because online content can change over time. For more information, read Purdue University’s Citing Electronic Resources.
Andualem says
If the dependent variable is categorical data, how do I test for and deal with heteroscedasticity?
Jake says
Hi Jim!
Thanks for sharing! What would you highlight as the most important elements in relation to model specification in customer analytics, and would there be anything of particular importance when working with binomial logistic regression, multinomial logistic regression, and conjoint analysis?
Best regards
Jim Frost says
Hi Jake,
In this post, I highlight those important elements, and those same elements apply to the other types of regression as well. When you’re creating a model to explain the relationships between the IVs and the DV, you need to blend theory, statistical measures, and residual plots, as I discuss in this post. They all work together to help you find the best model. Subject-area knowledge should be your guide in this process because the final model has to make sense in terms of the relationships, their directions, and their magnitudes.
The exception is when you’re creating a purely predictive model. That’s where you want to use the model to predict an outcome and aren’t interested in explaining the relationships. In this case, you don’t worry about theory but just want to find IVs that are both easy to measure and good at predicting the DV.
RICHARD MAYANDA says
how do you solve this question: The fitted equation from a study on infant head circumference is as follows:
head circumference = 1.76 + 0.86×gestational age – 2.82×toxemia + 0.046×(gestational age×toxemia)
where gestational age is measured in weeks and toxemia is an indicator variable for the mother’s toxemia status during pregnancy (1=had toxemia).
For infants whose mothers did not have toxemia during pregnancy, what is the effect of an extra two weeks of gestation? What about for those whose mothers had toxemia?
What other information or calculations would you need to decide whether to include this effect in the final model?
What effect does the last term represent? How would you interpret this effect?
Specify the regression models and interpret regression results
Jim Frost says
Hi Richard,
I don’t want to do your homework or test questions. But the answers about how to decide which terms to include in the model are in this post. To calculate the effects of two weeks, read my post about how to interpret regression coefficients and p-values.
As for the last term, it is an interaction term. Click the link to learn about them!
Enrico Mendoza says
Hi JIm,
Thank you for this wonderful, easy to understand blog. It is very helpful to many people, including me; I am not a statistics graduate, though I need statistics right now for my PhD dissertation. In my dissertation, I am using multilevel modeling, and I have a binary dependent variable as well as binary level 1 independent variables. You mentioned above in your blog that I can check the residuals to check my model specification. My question is: is checking residuals applicable to logistic regression wherein all my independent variables and the dependent variable are binary?
Hope you can find time to answer my question.
Thank you very much. God bless always.
Enrico Mendoza
Michael Green says
Hi Jim,
Thanks for your helpful responses. I have had PD for about 12 years now. When I see my neurologist, I want to show him my results and make sure I did the statistical analysis correctly. Is there anything else I can do to lower the VIFs?
Michael Green says
Thanks a bunch for your response. My strategy was to first demonstrate that latitude was correlated with PD and then find other variables that could correlate with latitude (since latitude obviously doesn’t cause PD). When I entered the IVs (no backwards regression), the VIFs were 18 and 15 for Latitude and max length of day; the rest were < 4 for all IVs. After reading about multicollinearity, I figured Latitude and maximum day length (r=0.96) are structural and one of them (Latitude) could be justifiably eliminated. So when I entered all my IVs except Latitude, all VIFs were < 5, except max day length (VIF = 6). Is this acceptable? Then I did a backwards regression and max day length was significant, but r(partial)=0.32 did not seem so exciting. The p value for the magnetic field element was 0.07, r(partial)=0.24. When combined, I get a multiple correlation coefficient of 0.56. After a bit of research, I found out that vitamin D deficiency is associated with length of day, which is associated with latitude, which may affect PD. Could there be something to the magnetic field strength?
Jim Frost says
Hi Michael,
It seems like something associated with latitude is potentially playing a role. If all you wanted to do was predict PD, then you could just leave latitude in as a predictor and use that to generate predictions. However, because you want an explanatory model, you really need to, at some point, remove latitude and include the real variables that causally explain the changes in PD.
The best statistical analyses blend subject area expertise with the statistics. And, I’m certainly not an expert in PD. I don’t know if there is an association between magnetic fields and PD. I know the Earth’s magnetic field is fairly weak, but I don’t know if it is strong enough to affect PD. So, I wouldn’t presume to have any idea whether it’s a promising lead or not. I would strongly suggest conducting background research to see what others have found.
I wonder if hours of daylight, or lack thereof in the winter, might be a factor. There seems to be a connection between depression and developing Parkinson’s. There is a connection between the long, dark days nearer the poles and seasonal depression. Perhaps a connection? The vitamin D angle is interesting too. I really don’t know though. Again, research and see what the experts have already found.
I think you have a reasonable approach in identifying candidates though. Things related to latitude.
VIFs greater than five are problematic. You’re right on the borderline.
Mike Green says
Hi Jim,
I would like to know if I used the best regression model to show that Parkinson’s Disease is associated with Latitude.
I found a database of countries with # of PD Deaths per 100,000 (age standardized).
I noticed that countries at higher latitudes seemed to have a higher prevalence of PD. So, I decided to see if there was any association of PD deaths with Latitude.
Grouped countries according to Latitude 0 (n=6), 15 (n=25), 30 (n=14), 45 (n=11), and 60 (n=5) degrees.
1. Latitude
I tried to think of independent variables that can affect health status for each country:
2. Human Development Index (The index incorporates three dimensions of human development, a long and healthy life, knowledge, and decent living standards.)
3. Diet (% Fruits and Vegetables)
Then I added other parameters I thought would be associated with Latitude:
4. Average yearly Temperature
5. Maximum daylight hours
6. Earth’s Magnetic Field (Z-component)
I also added Longitude for “good measure”.
7. Longitude
I entered the data into a statistical program using Backwards Least Squares Multiple Regression function and only “Latitude” was statistically significant. The multiple correlation coefficient was 0.53 p<0.0001, n = 58. F ratio = 21.7 p<0.0001, Accepted Normality. Did I use the right method to show that PD death rates are associated with latitude independent of HDI, Diet, Temperature, maximum daylight hours, Earth’s magnetic Field and Longitude?
I repeated the program using the same independent variables, but using another neurological disease as the dependent variable, i.e. Alzheimer’s/Dementia deaths per 100,000. This time “HDI” was statistically significant. The multiple correlation coefficient was 0.27 p<0.04, n = 58. F ratio = 4.4 p<0.04, Accepted Normality.
Would you conclude that PD is associated with latitude?
Thanks,
Mike
Jim Frost says
Hi Mike,
Those are interesting results! I would say that there’s evidence that there is a correlation between latitude and PD. However, you can’t tell if it’s a causal relationship or just correlation. I’d imagine it’s probably just correlation because I don’t see how the latitude itself could cause Parkinson’s. I would suggest there are some confounding factors involved. You did include some of the potential confounders, which is great. Was there any correlation between the other variables and Latitude? Did you check the VIFs for multicollinearity? It’s possible that a medium sample size, a relatively large number of variables for the sample size, plus the chance of multicollinearity could reduce the power of the hypothesis tests and produce insignificant results even though there might be an effect. In some cases, when variables are correlated, backwards elimination will remove one because there’s not enough significance to go around, so to speak. So, it has to pick one basically. It’s possible that latitude rolled up several aspects into one variable and, therefore, looked more significant to the model.
Alternatively, there could be other confounders.
So, a good place to start is to check the correlation between the other IVs and latitude. Check the VIFs too. See my post about multicollinearity for details!
My sense would be that you’re on the right track, thinking about the right things, but that there’s more at play here. Latitude itself can’t really be causing PD. Some physical property associated with latitude, or something about the countries that happen to be at those latitudes, is probably more of an underlying factor. I always suggest doing background research to see what other researchers have found. They might have identified the confounders/other causal variables. You don’t need to reinvent the wheel!
Sayan says
Hello sir,
I have a question regarding response variables. Suppose I have 3 response variables, and I would like to choose one to perform my regression analysis. Is there any way to determine which one I should choose without creating separate models for each of them?
Jim Frost says
Hi Sayan,
I can think of several possible ways off the top of my head.
You might choose one response variable if you were aware of research in the same subject-area that suggests using a particular response variable. Or, a particular response variable is more relevant for theoretical reasons. Another could be that a particular response variable is easier to measure, easier to interpret, or easier to apply to your particular use case.
In other words, look at what you really want to learn from your study, how you want to use the results, what other studies have done, and then make a decision that includes those factors.
tomy says
10q
vishnu kramesh says
Dear Sir
My question is that I have a dep variable say X and a variable of interest Y with some control variables(Z)
Now when I run following regressions
1) X at time t , Y & Z at t-1
2) X at time t , Y at t-1 & Z at t
3) X at time t , Y & Z at t
The sign of my variable of interest changes (and its significance too). If there is no theory to guide me with respect to the lag specification of the variable of interest and control variables, which of the above models should I use? What is the general principle?
Erly says
Got it! Thank you!!! =)
minitabuser says
can I use regression to check if a change affects product specifications
Jim Frost says
Hi,
You can probably use regression to predict whether a change affects a product’s characteristics. However, product specifications are imposed by outside limitations. Products outside the spec limits are considered defects. Spec limits are usually devised because a product won’t be satisfactory outside those limits. Typically, you don’t use regression analysis to determine the spec limits. However, I suppose if you knew enough about the product’s use and could model the relevant factors, you might be able to show that changes in the product could affect the spec limits. I’m not familiar with that being done, but I suppose it’s possible.
If you really need to know the answer, I’d check with industry experts. My take is that it would theoretically be possible if you could model the usage well enough but it’s probably not typical.
Asitha Don says
Thank You, Sir.
Asitha says
A model without an intercept gives a high R^2, so should I select that model as the best?
Jim Frost says
Hi Asitha,
That’s a deceptive property of fitting a model without the intercept. When you fit the model with an intercept, R-squared assesses the variability around the dependent variable’s mean that the model accounts for. However, when you don’t fit an intercept, R-squared assesses the variability around zero. Because they measure different things, you can’t compare them. The R-squared without the intercept is almost always much higher than the R-squared with the intercept because of this property.
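Here’s a small simulated demonstration of that property in Python with statsmodels (the numbers are purely illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=60)
y = 50 + 2 * x + rng.normal(scale=5, size=60)    # true intercept far from zero

with_int = sm.OLS(y, sm.add_constant(x)).fit()   # R-squared vs. variation around the mean
without_int = sm.OLS(y, x).fit()                 # "R-squared" vs. variation around zero

print(with_int.rsquared, without_int.rsquared)   # the no-intercept value is inflated
```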
By the way, to learn why you should almost always include the intercept in the model, read my post about the y intercept.
Ali says
Thank you for this, it’s really helpful.
My research thesis is
Population growth and unemployment rate in……
So how do I specify my model?
Jim Frost says
Hi Ali,
Specifying your model is a process that requires a lot of research. Follow the approaches I discuss in this article. I think the first, best place to start is by researching how others have specified their models in the same area. Do a literature review to get ideas for the variables to include.
Martize Smith says
Please let me know your thoughts. If you have two different models that you ran regressions on in Excel, what methods, in sequence, do you look at to determine which model is better?
Please critique me. What I currently do is first use the backward or forward approach, then observe the p-values for significance, then use the t-stat with a range of higher than 2 or less than -2 as a guideline for a good coefficient predictor. Lastly, what should be done to choose the best model when, let’s say, model A has an adjusted R2 higher than model B, but model A has at least one insignificant variable while model B does not?
Please help.
Jim Frost says
Hi Martize,
Here would be my suggestions. Keep in mind that all the statistical measures you mention, and even others, can help guide the process. However, you shouldn’t go by statistics alone. Chasing a high R-squared, or even adjusted R-squared, can lead you astray. Consider all the statistics, but then also think about theory and what that suggests. I’d read that section in this post again (near the end). In your case, when you have several candidate models where the statistics point in different directions, let theory help you choose. If possible, consider what other studies have done as well.
Stepwise regression can help you identify candidate variables, but studies have shown that it usually does not pick the correct model. Read my article about stepwise and best subsets regression for more details.
For adjusted R-squared, any variable that has an absolute t-value greater than 1 will cause the adjusted R-squared to increase. However, variables with t-values near 1/-1 won’t be statistically significant. So, fitting a model by increasing the adjusted R-squared can cause you to include variables that are not significant but do increase adjusted R-squared, as you have found.
If you’re debating over whether to include a variable or not, it’s generally considered better to include an unnecessary variable than risk excluding an important variable. There are caveats. Including too many insignificant variables can reduce the precision of your model. You also need to be sure that you’re not straying into overfitting your model by including extra variables.
I know that doesn’t give you a concrete answer to go by! Regression modeling is like that sometimes. But do focus more on the theoretical/other-studies side of the coin to consider along with the statistical side. Go for simplicity when possible. The simplest model that produces good residual plots and is consistent with theory is often a good candidate.
Heidi says
Okay, I need to calculate a regression equation for a multiple regression with 3 independent variables. My text gives the equation y = b1x1 + b2x2 + b3x3 + b0 + e, but what are the values for x1, x2, x3 to plug in? I thought I knew yesterday, and now I have no clue and can’t find any examples that actually show the equation with the data and plugged-in numbers. I have to include the equation in my assignment report, so I need to know what values to include.
One other thing: if one of the variables is not statistically significant, should I repeat the regression without using that data at all? I know it will change/decrease my value for r-sq (which is already very low at 11%).
Note I am using Excel with the Data Analysis ToolPak because it is the program required by my instructor.
Jim Frost says
Hi Heidi,
The x-values represent the variables in your dataset that you include in the model. You can either plug in the observed values for an observation to see what the model predicts for that observation or enter new values to predict a new observation with the specified characteristics.
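For example, with made-up coefficients (since I don’t have your Excel output), plugging in values works like this:

```python
# Hypothetical fitted coefficients: b0 (intercept), b1, b2, b3
b0, b1, b2, b3 = 5.0, 2.0, -1.5, 0.8

def predict(x1, x2, x3):
    """Plug observed (or new) values of the three IVs into the fitted equation."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x3

# For an observation with x1=10, x2=4, x3=7:
print(predict(10, 4, 7))   # 5.0 + 2.0*10 - 1.5*4 + 0.8*7 = 24.6
```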
And, yes, as I point out in this post, typically, you at least consider removing variables that are not significant. Also as I point out, don’t chase the highest R-squared. The model with the highest R-squared is not necessarily the best.
Hadas says
THANK YOU VERY MUCH, JIM!!!!
Hadas says
Dear Jim, I am Hadas. I was reading your comments and your constructive suggestions to lots of individuals about statistics questions. I was analyzing data using both descriptive statistics and a logit model. The results from the descriptive analysis found that the selected variables have influences, but in the logit results most variables are not statistically significant at 95%; for p = 5%, only 4 of 15 variables were found statistically significant. A Likert-type question was used to measure the level of participation (5 levels). Does statistical insignificance imply the variables didn’t influence the dependent variable? What are the problems there?
THANK YOU JIM
Jim Frost says
Hi Hadas,
The first thing to recognize is that there might not be a problem at all. Maybe there just is no relationship between the insignificant independent variables and the dependent variable? That is one possibility. Check the literature and theory to assess that.
If you have reason to believe there should be significant relationships for the variables in question, there are several possibilities. Perhaps your sample size is too small to be able to detect the effect? Perhaps you’ve left out a confounding variable or are otherwise violating an assumption, which is biasing the estimate toward non-significance?
On the other hand, if your descriptive statistics display an apparent effect but the variable is not significant in your model, there are several possibilities for that case. Your descriptive statistics do not account for sampling error. You can have visible effects that might be caused by random error rather than by an effect that exists in the population. Hypothesis testing accounts for that possibility. Additionally, when you look at the descriptive statistics, they do not account for (i.e., control for) other variables. However, when you fit a regression model, the procedure controls for the other variables in the model. After controlling for the effect of other variables in the model, what looked like strong results in the descriptive statistics might not actually exist.
Technically, a variable that is not significant indicates that you have insufficient evidence to conclude there is an effect. It is NOT proof that an effect doesn’t exist. For more information about that, read my post about failing to reject the null hypothesis.
There’s a range of potential questions for you to look into!
Dimpy Nandwani says
Hello Jim!
Thank you for such a helpful article!
In our study, we have 3 independent variables and one dependent variable.
For all the variables we are using an already developed scale which has around 5-9 questions each and uses the Likert scale for answers.
We just wanted to know if we have followed the right steps and wanted your guidance on the same.
First, we took the sum of each participant’s responses on every questionnaire. For example, the questionnaire on work autonomy (which is one of our variables) had 5 questions, and a participant answered 2, 3, 2, 3, 4 respectively for the 5 questions. Then, we took 14 as that participant’s summed score on the questionnaire. This score was calculated for all the respondents, for all the questionnaires/variables.
Then, we used multiple regression analysis to study the effect of the 3 independent variables on the dependent variable.
Could you please let us know if we are on the right track and if we have used the correct analysis? Should we use ordinal regression instead?
Thank you so much!
Jim Frost says
Hi Dimpy,
Yes, that sounds like a good approach. When you take the average or sum of a Likert scale variable like you are, you can often treat it as a continuous variable.
One potential problem is that as you change values in Likert scales by going from 2 to 3 to 4, etc., you don’t know for sure whether those represent fixed increases. It’s like comparing the times of first place, second place, and third place in a race: they’re not necessarily increasing at a fixed rate. That’s the nature of ordinal variables. You might need to fit curvature, etc. But, if you can fit a model where the residuals look good and the results make theoretical sense, then I think you’ve got a good model!
Best of luck with your analysis!
John says
Hi Jim,
How would I specify a regression model consisting of both continuous and categorical regressors? And how to interpret the output of that model?
Xiaojie Cheng says
Hi Jim,
Thank you for your excellent and intuitive explanations. I’m a graduate student, and recently I’ve been trying to find interactive relationships between two genes by adding their interaction terms to regression models. I have some questions about choosing the best regression model. The DV can be affected by several IVs (B1, B2, …, Bn), and my aim is to find which Bn may be regulated by another IV (A). I have built three models to deal with that, but the results are very different.
Model 1: DV=A+Bn+A*Bn
I only input one pair IVs(A and Bn) in model each time, and then repeat this model n times. When Bn is B1(DV=A+B1+A*B1), all of the terms are significant.
—————————————————————-
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.732e+03 3.987e+02 -4.343 5.72e-05 ***
A 2.658e+01 8.261e+00 3.217 0.00212 **
B1 6.576e+00 2.140e+00 3.073 0.00323 **
A*B1 -8.390e-02 2.889e-02 -2.904 0.00521 **
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1065 on 58 degrees of freedom
Multiple R-squared: 0.2037, Adjusted R-squared: 0.1625
F-statistic: 4.945 on 3 and 58 DF, p-value: 0.003994
—————————————————————
Model 2: DV=A+B1+B2+…+Bn+A*Bn
To avoid biased results, as you suggested, I add all the IVs that may affect the DV, but only one target interaction term is retained. Then I repeat this model n times.
When interaction term is A*B1, the interaction effect is insignificant.
—————————————————————-
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.124e+03 2.815e+02 -7.546 7.49e-10 ***
A 1.516e+01 5.994e+00 2.530 0.01454 *
B1 2.056e+00 1.810e+00 1.136 0.26145
B2 3.657e+00 2.402e+00 1.523 0.13404
B3 6.188e-01 4.108e-01 1.506 0.13822
B4 4.790e-01 3.337e-01 1.435 0.15734
B5 -4.909e-01 1.355e+00 -0.362 0.71871
B6 1.485e+00 6.239e-01 2.381 0.02104 *
B7 1.600e+01 5.756e+00 2.780 0.00759 **
B8 2.062e-02 1.827e-02 1.129 0.26433
A*B1 -3.465e-02 2.225e-02 -1.558 0.12551
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 674.5 on 51 degrees of freedom
Multiple R-squared: 0.7194, Adjusted R-squared: 0.6643
F-statistic: 13.07 on 10 and 51 DF, p-value: 6.148e-11
—————————————————————–
Model 3: DV=A+B1+A*B1+B2+A*B2…+Bn+A*Bn
In this model, I add all the IVs (Bn) and their interaction terms with A simultaneously, so the model runs once. In this situation, there are no significant terms.
——————————————————————
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.314e+03 3.984e+02 -5.809 6.45e-07 ***
A 2.410e+01 1.277e+01 1.886 0.0658 .
B1 5.936e-01 2.170e+00 0.274 0.7857
B2 5.281e+00 6.525e+00 0.809 0.4226
B3 4.074e-01 1.238e+00 0.329 0.7436
B4 4.417e-01 1.202e+00 0.368 0.7150
B5 -4.153e-01 3.814e+00 -0.109 0.9138
B6 2.775e+00 1.777e+00 1.562 0.1255
B7 9.274e+00 1.136e+01 0.816 0.4187
B8 4.297e-02 4.573e-02 0.940 0.3524
A*B1 -1.749e-02 3.531e-02 -0.495 0.6228
A*B2 -8.492e-02 1.707e-01 -0.498 0.6212
A*B3 6.077e-03 2.901e-02 0.209 0.8350
A*B4 1.723e-03 2.737e-02 0.063 0.9501
A*B5 4.894e-02 1.136e-01 0.431 0.6688
A*B6 -5.186e-02 5.362e-02 -0.967 0.3388
A*B7 3.067e-01 5.010e-01 0.612 0.5436
A*B8 -4.106e-04 8.732e-04 -0.470 0.6405
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 686 on 44 degrees of freedom
Multiple R-squared: 0.7496, Adjusted R-squared: 0.6528
F-statistic: 7.747 on 17 and 44 DF, p-value: 2.326e-08
——————————————————————–
My question: Is the significant interaction effect between A and B1 in model 1 reliable? Which is the best model to find the Interactive relationship between A and Bn?
In addition, the IVs above are not centered, as I get the same results for the interaction terms, and sometimes less significant main effects, after centering.
Dan says
Thank you very much for your help and support
SAMUEL K BREFO-ABABIO says
Hey Jim, thanks for your insightful post. Please, are there any steps or factors that best determine whether a data analyst should build one comprehensive model or, simply put, build many models on partitions of the data?
Dan says
Thank you for your useful content.
Does that mean we should use the same control variables from previous literature, or can we use the most suitable variables after running some experiments?
Jim Frost says
Hi Dan,
Theory and the scientific literature should guide you when possible. If other studies find that particular variables are important, you should consider them for your study. Because of omitted variable bias, it can be risky, in terms of bias, not to include variables that other studies have found to be important. That is particularly true if you’re performing an observational study rather than a randomized study. However, you can certainly add your own variables into the mix if you’re testing new theories and/or have access to new types of data.
So, be very careful when removing control variables that have been identified as being important. You should have, and be able to explain, good reasons for removing them. Feel freer when it comes to adding new variables.
Sravani Lekkala says
What should we do if the output variable is highly skewed (skewness > 4)?
Jim Frost says
Hi Sravani,
When the output/dependent variable is skewed, it can be more difficult to satisfy the OLS assumptions. Note that the OLS assumptions don’t state that the dependent variable must be normally distributed itself, but instead state that the residuals should be normally distributed. And, obtaining normally distributed residuals can be more difficult when the DV is skewed.
There are several things you can try.
Sometimes modeling the curvature, if it exists, will help. In my post about using regression to make predictions, I use BMI to predict body fat percentage. Body fat percentage is the DV and it is skewed. However, the relationship between BMI and BF% is curved and by modeling that curvature, the residuals are normally distributed.
As the skew worsens, it becomes harder to get good residuals. You might need to transform your DV. I don’t have a blog post about that, but I include a lot of information about data transformations in my regression ebook.
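As a rough sketch of the transformation idea in Python (with simulated data, not a prescription for your dataset):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 3, size=100)
y = np.exp(1 + 0.8 * x + rng.normal(scale=0.4, size=100))   # right-skewed DV

fit_raw = sm.OLS(y, sm.add_constant(x)).fit()            # residuals often look skewed
fit_log = sm.OLS(np.log(y), sm.add_constant(x)).fit()    # log transform can fix them

# Predictions from the log model are in log units; back-transform with np.exp().
print(np.exp(fit_log.fittedvalues[:5]))
```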
Those are several things that I’d look into first.
Best of luck with your analysis!
Chuck says
Hi Jim,
What does it mean when a regression model has a negative prediction R2 while the R2 and adjusted R2 are positive and reasonable?
Jim Frost says
Hi Chuck,
Any time the predicted R-squared is notably less than the adjusted/regular R-squared values, it means that the model doesn’t predict new observations as well as it explains the observations that were used in the model fitting process. Often this indicates you’re overfitting the model: too many predictors given the size of the dataset. Usually when it’s so bad as to be negative, it’s because the dataset is pretty small. Read my posts about adjusted and predicted R-squared and overfitting for more information.
While the regular R-squared ranges between 0 and 100%, both predicted and adjusted R-squared can have negative values. A negative value doesn’t have any special interpretation other than just being really bad. Some statistical software will round negative values to zero. I tend to see negative values for predicted R-squared more than adjusted R-squared. As you’ll see in the post I recommend, it’s often the more sensitive measure of problems with the model.
Take the negative predicted R-squared seriously. You’re probably overfitting your model. I’d also bet that you have fairly small dataset.
Mariyam says
Hi Jim,
Currently I’m doing research for my Economics degree. This has been very helpful. I do have some doubts, though. My research topic is “Relationship between Inflation and Economic growth in Maldives and how it affects the Maldivian economy”.
For this topic, I’m using GDP as the dependent variable and inflation, unemployment, and GDP per capita as independent variables. I want to know whether it’s right to use all of these variables in one equation for this topic. Once I figure that out, it would be easy to run the regression.
Hope you could help me out in this.
Thanks
Maryam
Muideen says
Very useful write up. Thanks Jim
Please, where a number of empirical models relate similar independent variables to a particular dependent variable, what are the usual justifications for opting for a particular empirical model on which to build one’s research?
Jim Frost says
I’d focus on using theory and the literature to guide you. Statistical measures can also provide information. I describe the process that you should use in this blog post.
Twiza says
Hi Jim,
I’m truly grateful for this beautiful blog; it truly is assisting me with my dissertation!
So I needed help with what model to use, having a binary DV (poverty). I ran different types of logistic regression on my dataset depending on what type of post-estimation tests I was carrying out.
As I was testing for goodness of fit (that’s estat gof and linktest), of course after running a logistic regression, my prob>chi was 0.0000, rejecting the H0 hypothesis, which states that the model fits if prob>chi is > 0.0000.
I tried adding more independent variables, but to no avail. I have 3 categorical independent variables that are insignificant and 1 continuous independent variable that is insignificant. The other 6 continuous independent variables are significant.
What do I do about those two tests? I seriously need help.
Thanks in advance.
Regards Twiza.
Jim Frost says
Hi Twiza,
Because you have a binary DV, you need to use binary logistic regression. However, it’s impossible for me to determine why your model isn’t fitting. Some suggestions would be to try fitting interaction terms and polynomial terms, just like you would for a least squares model. Another possibility is to try different link functions.
jagriti khanna says
Hi Jim
I read your post thoroughly. I still have some doubts. I’m doing a multiple regression which includes 9 predictor variables. I’ve used p-values to check which of my variables are important. Also, I plotted the graph of each independent variable against the dependent variable and noted that each variable has a polynomial relationship at the individual level. So how do I do multivariate polynomial regression? Can you please help me with this?
Thanks in advance
Jim Frost says
Hi Jagriti,
It’s great that you graphed the data like that. It’s such an important step, but so many people skip it!
It sounds like you just need to add the polynomial terms to your model. I write more about this my post about fitting curves, which explains that process. After you fit the curvature, be sure to check the residual plots to make sure that you didn’t miss anything!
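As a minimal sketch of adding a quadratic term with statsmodels formulas (the data and variable names are made up, and you’d repeat this for each predictor whose plot shows a bend):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"x1": np.linspace(0, 10, 50)})
df["y"] = 2 + 1.5 * df["x1"] - 0.12 * df["x1"] ** 2 + rng.normal(scale=0.5, size=50)

# I(x1 ** 2) adds the squared term, modeling one bend in the relationship.
quad = smf.ols("y ~ x1 + I(x1 ** 2)", data=df).fit()
print(quad.summary().tables[1])
```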
Henry Lee says
Hi Jim, thanks for your blog.
My problem is much simpler than a multiple regression: I have some data showing a curved trend, and I would like to select the best polynomial model (1st, 2nd, 3rd, or 4th order polynomial) fitting these data. The ‘best’ model should have a good fit but should also be as simple as possible (the lowest-order polynomial producing a good fit…).
Someone suggested the Akaike Information Criterion to me, which penalizes the complexity of the model. What are the possible tests or approaches to this (apparently) simple problem?
Thank you in advance!
Henry Lee
Jim Frost says
Hi Henry,
I definitely agree with your approach about using the simplest model that fits the data adequately.
I write about using polynomials to fit curvature in my post about curve fitting with regression. In practice, I find that 3rd order and higher polynomials are very rare. I’d suggest starting by graphing your data, counting the bends that you see, and using the corresponding polynomial, as I describe in the curve fitting post. You should also apply theory, particularly if you’re using 3rd order or higher. Does theory support modeling those extra bends in the data, or are they likely the product of a fluky sample or a small dataset?
As for statistical tests, p-values are good place to start. If a polynomial term is not significant, consider removing it. I also suggest using adjusted R-squared because you’re comparing models with different numbers of terms. Perhaps even more crucial is using predicted R-squared. That statistic helps prevent you from overfitting your model. As you increase the polynomial order, you might just be playing connect the dots and fitting the noise in your data rather than fitting the real relationships. I’ve written a post about adjusted R-squared and predicted R-squared that you should read. I even include an example where it appears like a 3rd order polynomial provides a good fit but predicted R-squared indicates that you’re overfitting the data.
Finally, be sure to assess the residual plots because they’ll show you if you’re not adequately modeling the curvature.
Best of luck with your analysis!
Arup Dey says
I’m doing multiple regression analysis, and there are four independent variables. So, it is not possible to know the shape of a graph that indicates the relationship between the DV and each IV. In this case, how can I know the best regression model for my data? For example, linear, quadratic, or exponential.
Jim Frost says
Hi Arup,
I’ve written a blog post about fitting the curvature in your data. That post will answer your questions! Also, consider graphing your residuals by each IV to see if you need to fit curvature for each variable. I talk about these methods in even more detail in my ebook about regression. You might check that out!
Best of luck with fitting your model!
Mani says
Hey Sir,
My question might not be related, but I’m confused about some problems. For example, when we study human behavior, we use demographic variables like the age and sex of a child. Why do we use them, what is the rationale behind this, and how do we interpret them?
Thanks.
Jim Frost says
Hi Mani,
You’d use these demographic variables because you think that they correlate with your DV. For instance, understanding age and gender might help you understand changes in the DV. For example, your DV might increase for males compared to females or increase with age. These variables can provide important information in their own right. Additionally, if you don’t include these variables and they are important, you risk biasing the estimates for your other variables. See omitted variable bias for details!
If you include these demographic variables in your model and they are not statistically significant, you can consider removing them from the model.
You interpret this type of variable in the same manner as any other independent variable. See regression coefficients and p-values for details.
karishma says
Thank you for your help Jim.
karishma says
Hi Jim,
I’m doing a multiple regression analysis on time series data. Can you recommend me some models that I can use for my analysis?
Thanks!
Jim Frost says
Hi Karishma,
You can use OLS regression to analyze time series data. Generally, you’ll need to include lagged variables and other time related predictors. Importantly, you can include predictors that are important to your study, which allows the analysis to estimate effects for them. You can use the model to make predictions. Be sure to pay particular attention to your residuals by order plot and the Durbin-Watson statistic to be sure that your model fits the data.
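Here’s a rough sketch of what the lagged-predictor setup can look like in Python/pandas (the column names and lag count are made up):

```python
import pandas as pd
import statsmodels.api as sm

def fit_with_lags(df: pd.DataFrame, lags: int = 2):
    """OLS on a time series with the current predictor plus lagged copies of
    the DV and the predictor, letting the model capture time dependence."""
    data = df.copy()
    for lag in range(1, lags + 1):
        data[f"y_lag{lag}"] = data["y"].shift(lag)
        data[f"x_lag{lag}"] = data["x"].shift(lag)
    data = data.dropna()                         # drop rows lost to shifting
    predictors = [col for col in data.columns if col != "y"]
    return sm.OLS(data["y"], sm.add_constant(data[predictors])).fit()
```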
You can also use ARIMA, which is a regression-like approach to time series data. It includes multiple time series methods in one model (autoregressive, differencing, and moving average components). You can use relatively sophisticated correlational methods to uncover otherwise hidden patterns. You can use the model to make predictions. However, while it models the dependent variable, it does not allow you to add other predictors into the model.
There are simpler time series models available, but they are less like regression, so I won’t detail them here.
Unfortunately, I don’t have much experience using regression analyses with time series data. There are undoubtedly other options available.
I hope this helps!
Hanna says
Hi Jim,
Thanks for this really helpful blog!
I am wondering whether I can use AIC and BIC to help me see which model fits my data best. Or are AIC and BIC only applicable when comparing the same model with different sets of variables (i.e., do they tell me which variable selection is best)? So could I use AIC and BIC to tell me whether a Poisson or a negative binomial regression is best? And could I also compare OLS with count data models?
Any advice is much appreciated!
Peter Strauss says
So in 2015, a fairly similar article was posted on another website.
Care to at least give that one as a source?
Jim Frost says
Hi Peter,
Yes, I wrote both articles. I’ve been adding notes to that effect in several places and will need to add one to this post.
For some reason, the organization removed most authors’ names from the articles. If you use the Internet Archive Wayback Machine and view an older version of that article, you’ll see that I am the author.
Thanks for writing!