Regression is a very powerful statistical analysis. It allows you to isolate and understand the effects of individual variables, model curvature and interactions, and make predictions. Regression analysis offers high flexibility but presents a variety of potential pitfalls. Great power requires great responsibility!
In this post, I offer five tips that will not only help you avoid common problems but also make the modeling process easier. I’ll close by showing you the difference between the modeling process that a top analyst uses versus the procedure of a less rigorous analyst.
Tip 1: Conduct A Lot of Research Before Starting
Before you begin the regression analysis, you should review the literature to develop an understanding of the relevant variables, their relationships, and the expected coefficient signs and effect magnitudes. Developing your knowledge base helps you gather the correct data in the first place, and it allows you to specify the best regression equation without resorting to data mining.
Regrettably, large databases stuffed with handy data combined with automated model-building procedures have pushed analysts away from this knowledge-based approach. Data mining procedures can build a misleading model that has significant variables and a good R-squared using randomly generated data!
In my blog post, Using Data Mining to Select Regression Model Can Create Serious Problems, I show this in action. The output below is a model that stepwise regression built from entirely random data. In the final step, the R-squared is decently high, and all of the variables have very low p-values!
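You can see the heart of the problem in a small simulation (my illustration, not output from the post above). None of the 50 candidate predictors below has any relationship with the response, yet simply picking the best-correlated one out of the pile makes it look meaningful:

```python
import random
import statistics

# A sketch of the data-mining pitfall: every predictor here is pure noise,
# but screening many candidates and keeping the "best" one inflates the
# apparent relationship. (Illustrative simulation with made-up settings.)
random.seed(42)
n_obs, n_candidates = 30, 50

y = [random.gauss(0, 1) for _ in range(n_obs)]
candidates = [[random.gauss(0, 1) for _ in range(n_obs)]
              for _ in range(n_candidates)]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = (sum((a - mx) ** 2 for a in xs) *
           sum((b - my) ** 2 for b in ys)) ** 0.5
    return num / den

abs_r = [abs(pearson_r(x, y)) for x in candidates]
print(f"Typical |r| for a random predictor: {statistics.fmean(abs_r):.2f}")
print(f"Best |r| after screening 50 of them: {max(abs_r):.2f}")
```

The screened "winner" correlates noticeably better than a typical random predictor, even though all of them are equally unrelated to the response. Automated selection does this screening for you, silently, across thousands of candidate models.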
Automated model building procedures can have a place in the exploratory phase. However, you can’t expect them to pinpoint the correct model. For more information, read my Guide to Stepwise Regression and Best Subsets Regression.
Tip 2: Use a Simple Model When Possible
It seems that complex problems should require complicated regression equations. However, studies show that simplification usually produces more precise models.* How simple should the models be? In many cases, three independent variables are sufficient for complex problems.
The tip is to start with a simple model and then make it more complicated only when it is truly needed. If you make a model more complex, confirm that the prediction intervals are more precise (narrower). When you have several models with comparable predictive abilities, choose the simplest because it is likely to be the best model. Another benefit is that simpler models are easier to understand and explain to others!
As you make a model more elaborate, the R-squared increases, but it becomes more likely that you are customizing it to fit the vagaries of your specific dataset rather than actual relationships in the population. This overfitting reduces generalizability and produces results that you can’t trust.
Learn how both adjusted R-squared and predicted R-squared can help you include the correct number of variables and avoid overfitting.
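To see why adjusted R-squared helps here, recall its standard formula: it penalizes each additional independent variable, so it can fall when a new variable doesn't pull its weight even though plain R-squared always rises. The R-squared values below are hypothetical numbers I made up for illustration:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k independent variables.
    Standard formula: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical fits on the same 30 observations: adding a fourth variable
# nudges R-squared up a little, but adjusted R-squared goes DOWN, flagging
# the extra complexity as not worth it.
simple = adjusted_r2(0.60, n=30, k=3)   # three-variable model
complex_ = adjusted_r2(0.61, n=30, k=4) # one more variable, tiny R^2 gain
print(f"3 variables: adjusted R-squared = {simple:.3f}")
print(f"4 variables: adjusted R-squared = {complex_:.3f}")
```

Raw R-squared rewards every added variable; adjusted R-squared only rewards variables that improve the fit more than chance alone would.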
Tip 3: Correlation Does Not Imply Causation . . . Even in Regression
Correlation does not imply causation. Statistics classes have burned this familiar mantra into the brains of all statistics students! It seems simple enough. However, analysts can forget this important rule while performing regression analysis. As you build a model that has significant variables and a high R-squared, it’s easy to forget that you might only be revealing correlation. Causation is an entirely different matter. Typically, to establish causation, you need to perform a designed experiment with randomization. If you’re using regression to analyze data that weren’t collected in such an experiment, you can’t be certain about causation.
Fortunately, correlation can be just fine in some cases. For instance, if you want to predict the outcome, you don’t always need variables that have causal relationships with the dependent variable. If you measure a variable that is related to changes in the outcome but doesn’t influence the outcome, you can still obtain good predictions. Sometimes it is easier to measure these proxy variables. However, if your goal is to affect the outcome by setting the values of the input variables, you must identify variables with truly causal relationships.
For example, if vitamin consumption is only correlated with improved health but does not cause good health, then altering vitamin use won’t improve your health. There must be a causal relationship between two variables for changes in one to cause changes in the other.
Related post: Causation versus Correlation in Statistics
Tip 4: Include Graphs, Confidence, and Prediction Intervals in the Results
This tip focuses on the fact that how you present your results can influence how people interpret them. The information can be the same, but the presentation style can prompt different reactions. For instance, confidence intervals and statistical significance provide consistent information. When a p-value is less than the 0.05 significance level, the corresponding 95% confidence interval will always exclude zero. However, the impact on the reader is very different.
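The equivalence is easy to verify. Under the usual normal approximation, a coefficient is significant at the 0.05 level exactly when the estimate is more than 1.96 standard errors from zero, which is exactly when the 95% confidence interval excludes zero. The coefficient values below are invented for illustration:

```python
Z_95 = 1.96  # normal critical value for 95% confidence / alpha = 0.05

def summarize(estimate, std_error):
    """Return the 95% CI plus the two equivalent ways of judging it."""
    lower = estimate - Z_95 * std_error
    upper = estimate + Z_95 * std_error
    significant = abs(estimate / std_error) > Z_95  # same as p < 0.05
    excludes_zero = lower > 0 or upper < 0
    return (lower, upper), significant, excludes_zero

# Hypothetical coefficient estimates and standard errors:
ci, sig, excl = summarize(2.0, 0.5)
print(f"CI = ({ci[0]:.2f}, {ci[1]:.2f}), significant={sig}, excludes zero={excl}")
ci, sig, excl = summarize(0.5, 0.4)
print(f"CI = ({ci[0]:.2f}, {ci[1]:.2f}), significant={sig}, excludes zero={excl}")
```

The two flags always agree, but the interval also tells the reader how large the effect plausibly is, which a bare p-value never does.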
A study by Cumming* finds that statistical reports which refer only to statistical significance bring about correct interpretations only 40% of the time. When the results also include confidence intervals, the percentage rises to 95%! Other research by Soyer and Hogarth* shows dramatic increases in correct interpretations when you include graphs in regression analysis reports. In general, you want to make the statistical results as intuitively understandable as possible.
Related post: Confidence Intervals vs Prediction Intervals vs Tolerance Intervals.
Tip 5: Check Your Residual Plots!
Residual plots are a quick and easy way to check for problems in your regression model. These graphs can also help you make adjustments. For instance, residual plots display patterns when you fail to model curvature that is present in your data.
For more information, read my post: Check Your Residual Plots to Ensure Trustworthy Regression Results!
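Here's a minimal sketch of that curvature pattern, using toy data I made up (y = x², which is obviously curved). Fitting a straight line to it leaves residuals that trace a clear U-shape instead of random scatter:

```python
# Toy curved data: fitting a straight line to y = x^2 should leave
# patterned residuals, the telltale sign of unmodeled curvature.
xs = list(range(-5, 6))
ys = [x ** 2 for x in xs]

# Ordinary least squares for a single predictor, closed form.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) /
         sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print([round(r, 1) for r in residuals])
# Positive at both ends, negative in the middle: a U-shaped pattern
# showing the straight line misses the curvature.
```

A well-specified model leaves residuals that look like random noise; any systematic pattern like this one means something is missing from the model.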
Differences Between a Top Analyst and a Less Rigorous Analyst
A top analyst tends to do the following:
- Conducts research to understand the study area before starting.
- Uses large quantities of reliable data and a few independent variables with well-established relationships.
- Uses sound reasoning to determine which variables to include in the regression model.
- Combines different lines of research as needed.
- Presents the results using charts, prediction intervals, and confidence intervals in a lucid manner that ensures the appropriate interpretation by others.
On the other hand, a less rigorous analyst tends to do the following:
- Does not do the research to understand the research area and similar studies.
- Uses regression outside of designed experiments to hunt for causal relationships.
- Uses data-mining to rummage for relationships because databases provide a lot of convenient data.
- Includes variables in the model based mainly on statistical significance.
- Uses a complicated model to increase R-squared.
- Reports only the basic statistics of coefficients, p-values, and R-squared values.
I hope these regression analysis tips have helped you out! Do you have any tips of your own to share? For more information about how to choose the best model, read my post: Model Specification: Choosing the Correct Regression Model.
If you’re learning regression, check out my Regression Tutorial!
Armstrong, J. S. (2012), Illusions in Regression Analysis, International Journal of Forecasting, 28(3), 689-694.
Ord, K. (2012), The Illusion of Predictability: A Call to Action, International Journal of Forecasting, 28(3).
Soyer, E. and Hogarth, R. M. (2012), The Illusion of Predictability: How Regression Statistics Mislead Experts, International Journal of Forecasting, 28(3), 695-711.
Zellner, A. (2001), Keep It Sophisticatedly Simple. In Keuzenkamp, H. & McAleer, M. (Eds.), Simplicity, Inference and Modelling: Keeping It Sophisticatedly Simple, Cambridge University Press, Cambridge.