Regression is a very powerful statistical analysis. It allows you to isolate and understand the effects of individual variables, model curvature and interactions, and make predictions. Regression analysis offers high flexibility but presents a variety of potential pitfalls. Great power requires great responsibility!

In this post, I offer five tips that will not only help you avoid common problems but also make the modeling process easier. I’ll close by contrasting the modeling process that a top analyst uses with the procedure of a less rigorous analyst.

## Tip 1: Conduct A Lot of Research *Before* Starting

Before you begin the regression analysis, you should review the literature to develop an understanding of the relevant variables, their relationships, and the expected coefficient signs and effect magnitudes. Developing your knowledge base helps you gather the correct data in the first place, and it allows you to specify the best regression equation without resorting to data mining.

Regrettably, large databases stuffed with convenient data, combined with automated model-building procedures, have pushed analysts away from this knowledge-based approach. Data-mining procedures can build a misleading model that has significant variables and a good R-squared using randomly generated data!

In my blog post, Using Data Mining to Select Regression Model Can Create Serious Problems, I show this in action with a model that stepwise regression built from entirely random data. In the final step, the R-squared is decently high, and all of the variables have very low p-values!

Automated model-building procedures can have a place in the exploratory phase. However, you can’t expect them to identify precisely the correct model. For more information, read my Guide to Stepwise Regression and Best Subsets Regression.

## Tip 2: Use a Simple Model When Possible

It seems that complex problems should require complicated regression equations. However, studies show that simplification usually produces more precise models.* How simple should the models be? In many cases, three independent variables are sufficient for complex problems.

The tip is to start with a simple model and then make it more complicated only when truly needed. If you do make a model more complex, confirm that the prediction intervals become more precise (narrower). When several models have comparable predictive abilities, choose the simplest because it is likely to be the best model. Another benefit is that simpler models are easier to understand and explain to others!

As you make a model more elaborate, the R-squared increases, but it becomes more likely that you are customizing it to fit the vagaries of your specific dataset rather than actual relationships in the population. This overfitting reduces generalizability and produces results that you can’t trust.

Learn how both adjusted R-squared and predicted R-squared can help you include the correct number of variables and avoid overfitting.

**Related post**: Overfitting Regression Models: Problems, Detection, and Avoidance

## Tip 3: Correlation Does Not Imply Causation . . . Even in Regression

Correlation does not imply causation. Statistics classes have burned this familiar mantra into the brains of all statistics students! It seems simple enough. However, analysts can forget this important rule while performing regression analysis. As you build a model that has significant variables and a high R-squared, it’s easy to forget that you might only be revealing correlation. Causation is an entirely different matter. Typically, to establish causation, you need to perform a designed experiment with randomization. If you’re using regression to analyze data that weren’t collected in such an experiment, you can’t be certain about causation.

Fortunately, correlation can be just fine in some cases. For instance, if you want to predict the outcome, you don’t always need variables that have causal relationships with the dependent variable. If you measure a variable that is related to changes in the outcome but doesn’t influence the outcome, you can still obtain good predictions. Sometimes it is easier to measure these proxy variables. However, if your goal is to affect the outcome by setting the values of the input variables, you must identify variables with truly causal relationships.

For example, if vitamin consumption is only correlated with improved health but does not cause good health, then altering vitamin use won’t improve your health. There must be a causal relationship between two variables for changes in one to cause changes in the other.

**Related post**: Causation versus Correlation in Statistics

## Tip 4: Include Graphs, Confidence, and Prediction Intervals in the Results

This tip focuses on the fact that how you present your results can influence how people interpret them. The information can be the same, but the presentation style can prompt different reactions. For instance, confidence intervals and statistical significance provide consistent information. When a p-value is less than the 0.05 significance level, the corresponding 95% confidence interval will always exclude zero. However, the impact on the reader is very different.

A study by Cumming* finds that statistical reports that refer only to statistical significance bring about correct interpretations only 40% of the time. When the results also include confidence intervals, the percentage rises to 95%! Other research by Soyer and Hogarth* shows dramatic increases in correct interpretations when you include graphs in regression analysis reports. In general, you want to make the statistical results as intuitively understandable as possible.

**Related post**: Confidence Intervals vs Prediction Intervals vs Tolerance Intervals.

## Tip 5: Check Your Residual Plots!

Residual plots are a quick and easy way to check for problems in your regression model. These graphs can also help you make adjustments. For instance, residual plots display patterns when you fail to model curvature that is present in your data.

For more information, read my post: Check Your Residual Plots to Ensure Trustworthy Regression Results!
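Even without a plot, you can detect this kind of pattern numerically. In the sketch below (Python/NumPy, my assumptions), a straight line is fit to curved data; the residuals remain strongly correlated with the squared term until the curvature is actually modeled:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-3, 3, 60)
y = x**2 + rng.normal(scale=0.5, size=x.size)   # the true relationship is curved

# Misspecified straight-line fit: the residuals retain the curvature
line = np.polyfit(x, y, 1)
resid_line = y - np.polyval(line, x)
print(round(np.corrcoef(resid_line, x**2)[0, 1], 2))  # strong pattern

# Correctly specified quadratic fit: the pattern disappears
quad = np.polyfit(x, y, 2)
resid_quad = y - np.polyval(quad, x)
print(round(np.corrcoef(resid_quad, x**2)[0, 1], 2))  # essentially zero
```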

## Differences Between a Top Analyst and a Less Rigorous Analyst

Top analysts tend to do the following:

- Conduct research to understand the study area before starting.
- Use large quantities of reliable data and a few independent variables with well-established relationships.
- Use sound reasoning to determine which variables to include in the regression model.
- Combine different lines of research as needed.
- Present the results using charts, prediction intervals, and confidence intervals in a lucid manner that ensures appropriate interpretation by others.

On the other hand, a less rigorous analyst tends to do the following:

- Does not research the study area and similar studies before starting.
- Uses regression outside of designed experiments to hunt for causal relationships.
- Uses data-mining to rummage for relationships because databases provide a lot of convenient data.
- Includes variables in the model based mainly on statistical significance.
- Uses a complicated model to increase R-squared.
- Reports only the basic statistics of coefficients, p-values, and R-squared values.

I hope these regression analysis tips have helped you out! Do you have any tips of your own to share? For more information about how to choose the best model, read my post: Model Specification: Choosing the Correct Regression Model.

If you’re learning regression, check out my Regression Tutorial!

### References

Armstrong, J. S. (2012), Illusions in Regression Analysis, *International Journal of Forecasting*, 28 (3), 689-694.

Cumming, G. (2012), *Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis (Multivariate Applications Series)*. New York: Routledge.

Ord, K. (2012), The Illusion of Predictability: A call to action, *International Journal of Forecasting*, March 5, 2012.

Soyer, E. and Hogarth, R. M. (2012), The illusion of predictability: How regression statistics mislead experts, *International Journal of Forecasting*, 28 (3), 695-711.

Zellner, A. (2001), Keep it sophisticatedly simple. In Keuzenkamp, H. & McAleer, M. Eds., *Simplicity, Inference and Modelling: Keeping it Sophisticatedly Simple*, Cambridge University Press, Cambridge.

Mayank says

Hello Jim,

How can we use regression if we have an equation for total income and its sources?

I.e., total income = wages + rent + profit + gambling

This data is deterministic, so I am unable to do regression.

Is there some way to use some other variable and do regression?

Also, how can we use regression to show the before-and-after effect of COVID-19?

Thank you.

Bradley James Quiring says

Hi Jim,

Thanks for the great insights here and elsewhere. I’m trying to find the best model building method within the limitations of Excel to teach my intro business stats students. (These are not stats students, so simpler is always better.)

I’m assuming best subsets is not feasible with more than two or three independent variables. So my first question is which method would you recommend assuming Excel is our only software?

Forgive my second question, which will no doubt reveal my lack of theoretical experience with regression, but I’d like to understand, when building the best model, why adding another variable can increase the adjusted r squared even when the added variable itself has an insignificant p-value? In other words, what takes precedence, the adjusted r squared or the variable’s p-value? In the example I’m working on (a textbook example!), I started with two independent variables ( p = 0.00, p = 0.03) and an adjusted r squared of 0.45. In adding a third variable, my adjusted r squared jumps to 0.47, but now two of the three variables have insignificant p-values. In adding a fourth variable, my r squared jumps to 0.55 and all four variables now have significant p-values. How can this be best explained and which of the models should we use?

(I have just purchased your book on regression, so perhaps the answers are already there, but I have yet to find an intro stats textbook that addresses regression statistics that appear to contradict one another.)

Thanks for your time!

Brad

Jim Frost says

Hi Bradley,

I think the best approach for Excel is the manual model reduction approach. As far as I know, Excel doesn’t have any automated model fitting process. But, you can fit the full model and then one-by-one remove any variables that are not significant. For example, if you have multiple non-significant variables, remove the one with the highest p-value but leave the other non-significant variables in and refit. Repeat until there are no insignificant variables. Of course, as with any automated method, check to make sure that the final model and the signs and magnitudes of the coefficients make theoretical sense. By the way, I have written a post about using Excel to fit regression models that you might be interested in.

For your second question, that’s a good one that I’m willing to bet most don’t know the answer to! When a variable’s t-value is greater than 1 (in absolute value), adding it increases the adjusted R-squared. That’s just a byproduct baked into its calculation. However, for statistical significance, the t-value needs to be ~1.96, depending on the DF. Consequently, there’s a range from t = 1 to 1.96 where the adjusted R-squared increases but the variable is not significant.

Unfortunately, it’s impossible for me to answer your question definitively about which model to use given your example. It’s not just about the statistics involved, and they can point you in different directions, as you’ve seen! It also involves subject-area knowledge about which variables should and should not be included, and an evaluation of the coefficient signs and magnitudes to see if they make theoretical sense. Of course, you also need to check the residuals. Patterns in the residuals would tell you the model needs fixing regardless of what the various statistics say! I do include a discussion about model specification in my regression book. It’s the entirety of Chapter 7. I talk about all the issues I mention here along with others.

Offhand, I’d lean towards the model with four variables, assuming it passes everything I mention here, given that it has the highest adjusted R-squared and all the variables are significant. Unless you have many fewer than 40 observations, in which case you might be overfitting your model. But the differences in your adjusted R-squared values really aren’t that large, so dropping some variables if needed wouldn’t be problematic.

I hope that helps!

Anshum Saran says

Dear Sir,

Thanks a ton for your patience with me; I know I am taking up a lot of your time. It’s just this last thing. Refer to this post ‘https://statisticsbyjim.com/regression/standard-error-regression-vs-r-squared/’; there is a fitted line plot graph there. This graph basically shows S measuring the precision of the model’s predictions. Consequently, it uses S to obtain a rough estimate of the 95% prediction interval. I want to know what the data points you are referring to are. Are these the residuals? How do I get their values? I am going to use the standard rule that 95% of the data points should fall within a range that extends +/- 2 * the standard error of the regression from the fitted line, but I want to make this fitted line graph to check whether my data points fall within it or not. So my confusion is how to make this graph; can I make it in Minitab? Also, I have found prediction data as per this post, ‘https://statisticsbyjim.com/regression/prediction-precision-applied-regression/’; will any of that output help me make this graph? Sir, I am not talking about what precision my predictions should be useful at; I am still not well familiar with regression, so I can’t decide that on my own, and I will take what you have mentioned in your post (‘95% of the data points should fall within a range that extends from +/- 2 * standard error of the regression from the fitted line’). The problem is how do I get this fitted line plot graph for multiple regression. Also, can I make this graph directly in Minitab?

Thanks a ton for your patience with me.

Best Regards,

Anshum Saran

Jim Frost says

Hi Anshum,

You can find the dataset for creating that graph you’re asking about in my post about making predictions with regression analysis. In that post, you’ll find the link to the dataset. Take that dataset and then use the fitted line plot feature in Minitab. Have Minitab display the prediction intervals, which is one of the options.

You can use fitted line plots only when you have one predictor. That’s because you need one axis for the predictor and one for the response. If you have more predictors, you’d need extra dimensions! While you can’t use a fitted line plot with multiple regression, I show you how to use Minitab’s prediction feature to calculate prediction intervals when you have multiple predictors. That’s what you’ll need to do with your data.

Again, you should buy my regression ebook which covers this in more detail.

Anshum Saran says

Sir, I read your reply, and I am confused. I also read both of the posts. I did find out where to check the S value, but I’m not sure what I am comparing it to. I cannot find any study material about this online either. Can you help me understand? I am very new to this and I am really trying my best to understand it. The post regarding precision in predictions discusses just one prediction, while the other post shows a graph. I want to make that graph; how do I do it? What will be on the x-axis and what will be on the y-axis? How will I make my regression line? Is there a way to make the graph in Minitab? I have this software, which I’m using for my analysis.

Thanks and appreciate your help and time you have taken out for replying back.

Regards,

Anshum

Jim Frost says

Hi Anshum,

Ah, I see where the confusion lies now. As the analyst and subject-area expert, you need to supply the value for comparison. For your predictions, how much precision do you need for them to be useful. I talk about that in the post about the standard error of the regression. The model doesn’t make predictions that are precise enough to be useful. What is considered useful varies by subject-matter and application. There is no statistical measure for determining how much precision is required. So, you’ll need to use your subject-area knowledge and standards of the field to figure out how precise you need the predictions for them to be useful. Then, compare your S to the S required for sufficiently precise prediction. You can also use prediction intervals for the same purpose.

I’ve made all my graphs in Minitab. I’m not sure which graphs you’re referring to. I have multiple graphs in multiple posts. Please specify precisely which graphs in which posts.

Anshum Saran says

Good Day Sir,

I appreciate your reply. I am a bit confused about the part where you say ‘You can use S in your output (24236.7) for a good estimate of the precision. 95% of new observations will fall within +/- 2*S from the predicted value.’ I was unable to understand how I prove this. I did read your posts related to this, but unfortunately they only talk about it using linear regression. Can this be used in multiple regression as well? I am using Minitab and it doesn’t have this feature. Can you please help me with this last question?

Please explain how do I use my Std Error in verifying my model.

Thanks and Regards,

Anshum

Jim Frost says

Hi Anshum,

I had to remove your data and output because it was rather long. However, you need to use the standard error of the regression (S), rather than the standardized residuals. In Minitab, you’ll find S listed in the Model Summary table, which is the same section as R-squared. For more information about this statistic, read my post about the standard error of the regression. Multiple regression is linear regression. So, yes, you can use it in multiple regression. You’ll find it right there in the output. 🙂

To see precision used in a multiple regression context, read my post about precision in predictions. One of the examples uses multiple predictors. That post focuses on using prediction intervals, but you’ll see S in the example output in the Model Summary table.

Anshum Saran says

Good Day Sir,

I had written to you earlier as well, I haven’t got a reply yet. I am trying to do a multiple regression forecasting, but I am unable to interpret the results from the ANOVA table. I have used the obtained equation for forecasting my values and have found some promising results. I request you to please have a look at my analysis and help me understand and interpret the same.

Sir If you look at my p value for the regression equation, it is within the significance, but the p value for the other variables (independent) are above the value of significance. I am not sure how to interpret it also the value of t and f in the analysis.

Also if this is a success and it can be modelled and used for my forecasting.

I would love to hear from you at your soonest convenience as I have to submit my thesis by the end of this month and have to run it by my supervisor.

Thanks in advance for your valuable time and feedback.

Regression Equation:

```
1 Year Timecharter Rate Capesiz = 120984 - 6.90 Average Haul Iron Ore and Coal
                                  - 0.00203 Capesize Bulkcarrier Demolition

Model Summary

      S    R-sq  R-sq(adj)  R-sq(pred)
24236.7  55.78%     47.74%      35.23%

Analysis of Variance

Source                             DF       Adj SS      Adj MS  F-Value  P-Value
Regression                          2   8151898785  4075949392     6.94    0.011
  Average Haul Iron Ore and Coal    1   2095391492  2095391492     3.57    0.086
  Capesize Bulkcarrier Demolition   1    840607035   840607035     1.43    0.257
Error                              11   6461589869   587417261
Total                              13  14613488654
```

Thanks and Best Regards,

Anshum

Jim Frost says

Hi Anshum,

First off, given that your thesis depends on regression analysis and the extensive nature of your questions, I highly recommend that you get my ebook about regression analysis.

Your overall F-test of significance says that the model is statistically significant but there’s not enough evidence to suggest that any of the individual predictors are significant. I write about this condition in my post about the overall F-test. I think part of the problem is that you have very few observations, which lowers the power of the analysis.

Typically, you don’t interpret the t and F-values directly. Those are the test statistics which the analysis uses to calculate the p-values. So, you can just interpret the p-values. Read my post for more information about how to interpret the coefficients and p-values.

In terms of using the model to make predictions, you’d need to first check the residual plots to be sure that the model provides a good fit for the data. Otherwise, the predictions might be biased. You can’t make that determination from the numeric output. Assuming the residual plots look good, I see one additional problem. Your R-squared and particularly predicted R-squared are low. While R-squared is often overrated, a fairly high R-squared is important when you need to make precise predictions. Your predictions are likely to be imprecise. You can use S in your output (24236.7) for a good estimate of the precision. 95% of new observations will fall within +/- 2*S from the predicted value. For more information about these concepts, read the following posts:

Making Predictions with Regression Analysis: pay particular attention to the sections on precision

Understand Precision to Avoid Costly Mistakes: again, focus on precision

S vs R-squared: More about how S is better than R-squared when it comes to precision

In a nutshell, given the small sample size, lack of significant predictors, and low R-squared (and particularly the low predicted R-squared), your model doesn’t provide much explanatory power. Predictions based on the model are likely to be too imprecise to be useful (although you can assess the precision using information I provided to make the determination).

Best of luck with your thesis!
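The ±2 × S rule of thumb from the reply above can be verified with a quick simulation (a Python sketch with made-up data, not the thesis dataset):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.normal(size=n)
y = 5 + 2 * x + rng.normal(scale=3, size=n)   # true noise SD is 3

# Least-squares fit and the standard error of the regression, S
b1, b0 = np.polyfit(x, y, 1)                  # polyfit returns [slope, intercept]
resid = y - (b0 + b1 * x)
S = np.sqrt(np.sum(resid**2) / (n - 2))

# Draw new observations and check how many land within +/- 2*S of the line
x_new = rng.normal(size=10_000)
y_new = 5 + 2 * x_new + rng.normal(scale=3, size=10_000)
covered = np.mean(np.abs(y_new - (b0 + b1 * x_new)) <= 2 * S)
print(f"S = {S:.2f}, coverage = {covered:.3f}")   # roughly 95%
```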

Vivian Yu says

Very helpful! Thanks!

Jim Frost says

Thank you, Vivian!

bilalahmaduoc says

Thank you, Jim, for spreading knowledge. I am working on a research paper where I want to develop a regression model whose variables haven’t been used before in any literature. I have two simple questions. 1. If I have 4 independent variables, how will I show them in mathematical form? 2. If my independent variables are correlated, can I include them in my model?

Jim Frost says

Hi, thank you for writing. I’m not sure that I understand your first question. After you fit the model, you’ll see the regression equation in the output. That’s how to write the mathematical form. As for correlated independent variables, or multicollinearity as it is called, yes, some correlation is OK. You need to check the VIFs. If they are less than 5, you should be good. I write a blog post about multicollinearity that you should read.

I hope this helps!

Jim
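For readers wondering what a VIF actually measures: it is 1 / (1 − R²) from regressing each independent variable on all the others. Here is a rough sketch in Python with made-up data (my illustration, not from the post):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column: regress that column
    on the remaining columns (plus an intercept) and return 1/(1 - R^2)."""
    n, k = X.shape
    out = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        xj = X[:, j]
        beta, *_ = np.linalg.lstsq(others, xj, rcond=None)
        resid = xj - others @ beta
        r2 = 1 - resid.var() / xj.var()
        out.append(1 / (1 - r2))
    return out

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.5, size=n)   # strongly correlated with x1
x3 = rng.normal(size=n)                    # independent of the others

vifs = vif(np.column_stack([x1, x2, x3]))
print([round(v, 1) for v in vifs])         # x1 and x2 inflate; x3 stays near 1
```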

Toktam says

Thank you very much for the informative posts. I am going to conduct a 3-level ordered logistic regression analysis on the World Values Survey data using Stata, and I’m not sure what I should check before building models. Would you please kindly explain?

Jim Frost says

Hi Toktam, the very first thing I’d do is research what others have done in this area. Maybe others have even used the same data for the same reason? At the very least, you want to learn about the area, see what others have found, and see what variables should be related to your dependent variable. This process helps you with identifying candidate variables and determining whether your results make sense. Check out my blog post about model specification for more ideas.

Toktam says

Thank you for the prompt reply. Can I ask more detailed questions about particular issues in multilevel analysis (using Stata)?