What is Linear Regression?
Linear regression models the relationship between one or more explanatory variables and an outcome variable. These variables are known as the independent and dependent variables, respectively. When there is one independent variable (IV), the procedure is known as simple linear regression. When there is more than one IV, statisticians refer to it as multiple regression.
Learn more about independent and dependent variables.
This flexible analysis allows you to untangle the effects in complicated research questions by modeling and controlling for all relevant variables. It lets you isolate the role that each variable plays. This procedure uses sample data to estimate the population parameters. The regression coefficients in your statistical output are the parameter estimates.
Learn more about when you should use regression analysis.
Linear regression has two primary purposes—understanding the relationships between variables and forecasting.
- The coefficients represent the estimated magnitude and direction (positive/negative) of the relationship between each independent variable and the dependent variable.
- A linear regression equation allows you to predict the mean value of the dependent variable given values of the independent variables that you specify.
Learn more about interpreting regression coefficients and using regression to make predictions.
Linear Models
Despite the name, linear regression can model curved relationships. In this context, the term “linear” describes the form of the regression equation. A regression equation is linear when all its terms are one of the following:
- Constant.
- Parameter multiplying an independent variable.
Additionally, a linear regression equation can only add terms together, producing one general form:
Dependent variable = constant + parameter * IV + … + parameter * IV
Statisticians refer to this form as being linear in the parameters. Hence, you cannot include parameters in an exponent in linear regression, but you can raise a variable to a power to model curvature.
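As a minimal sketch of this idea (using Python with NumPy on simulated data, purely for illustration), the model below includes a squared term to capture curvature, yet it is still fit with ordinary least squares because the equation remains linear in the parameters:

```python
import numpy as np

# Simulated data with a curved relationship between x and y.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3 + 2 * x - 0.5 * x**2 + rng.normal(scale=1.0, size=x.size)

# Design matrix with one column per term: constant, x, and x^2.
# Raising x to a power models curvature, but the equation is still
# linear in the parameters b0, b1, b2, so this is linear regression.
X = np.column_stack([np.ones_like(x), x, x**2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coefs)  # estimates of b0, b1, b2 (close to 3, 2, and -0.5)
```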
Linear regression was the original form that statisticians studied, and it is the easiest type of model to fit and interpret. However, a linear model cannot fit some datasets well, and in those cases a nonlinear model is required.
Specifying the correct model requires balancing subject-area knowledge, statistical results, and satisfying the assumptions.
Learn more about the difference between linear and nonlinear models and specifying the correct regression model.
Linear Regression Assumptions
Least squares regression, also known as ordinary least squares, is the most common form of linear regression. However, there are other types, such as least absolute deviation and ridge regression.
Each type has a set of assumptions that you primarily assess using the residuals. Residuals are the difference between the observed value and the mean value that the model predicts for that observation. If you fail to satisfy the assumptions, the results might not be valid.
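As an illustration (a minimal Python sketch with NumPy and simulated data, not tied to any particular dataset), residuals are simply the observed values minus the fitted values from the model:

```python
import numpy as np

# Simulated data from a simple linear relationship with noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 40)
y = 5 + 1.5 * x + rng.normal(scale=2.0, size=x.size)

# Fit an ordinary least squares model with a constant term.
X = np.column_stack([np.ones_like(x), x])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual = observed value - value the model predicts for that observation.
fitted = X @ coefs
residuals = y - fitted

# To assess assumptions, plot residuals against fitted values and look for
# random scatter around zero with roughly constant spread.
print(residuals.mean())  # approximately zero for OLS with a constant term
```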
Learn more about the assumptions for ordinary least squares.
Example of Linear Regression
Suppose we use linear regression to model how the outside temperature in Celsius and insulation thickness in centimeters, our two independent variables, relate to air conditioning costs in dollars (the dependent variable).
Let’s interpret the results for the following multiple linear regression equation:
Air Conditioning Costs ($) = 2 * Temperature (°C) – 1.5 * Insulation (cm)
The coefficient sign for temperature is positive (+2), which indicates a positive relationship between temperature and costs. As the temperature increases, so do air conditioning costs. More specifically, the coefficient value of 2 indicates that for every 1°C increase, the average air conditioning cost increases by two dollars.
On the other hand, the negative coefficient for insulation (–1.5) represents a negative relationship between insulation and air conditioning costs. As insulation thickness increases, air conditioning costs decrease. For every 1 cm increase, the average air conditioning cost drops by $1.50.
We can also enter values for temperature and insulation into this linear regression equation to predict the mean air conditioning cost.
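For instance, a quick sketch of that prediction in Python (using hypothetical input values and the equation exactly as written above, with no constant term) might look like this:

```python
# Predicted mean air conditioning cost from the example equation above.
def predicted_mean_cost(temperature_c: float, insulation_cm: float) -> float:
    """Return the predicted mean cost in dollars."""
    return 2 * temperature_c - 1.5 * insulation_cm

# Hypothetical inputs: 30 degrees Celsius and 10 cm of insulation.
print(predicted_mean_cost(30, 10))  # 2*30 - 1.5*10 = 45 dollars
```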
Hi Jim,
I’m trying to understand why the beta and significance change in a linear regression when I add another independent variable to the model. I am currently working on a mediation analysis, and as you know, linear regression is part of that. A simple linear regression between the IV (X) and the DV (Y) returns a statistically significant result. But when I add another IV (M), X becomes insignificant. Can you explain this?
Seeking some clarity,
Peta.
Hi Peta!
This is a common occurrence in linear regression and is crucial for mediation analysis.
Adding M (the mediator) might capture some of the variance that was initially attributed to X. If M is a mediator, the effect of X on Y is being channeled through M. So when M is included in the model, it’s possible that the direct effect of X on Y becomes weaker or even insignificant, while the indirect effect (through M) becomes significant.
If X and M share variance in predicting Y, when both are in the model, they might “compete” for explaining the variance in Y. This can lead to a situation where the significance of X drops when M is added.
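If it helps to see the pattern numerically, here is a small illustrative simulation in Python (using NumPy and statsmodels with made-up data, not your actual variables) in which X affects Y only through M:

```python
import numpy as np
import statsmodels.api as sm

# Simulate a pure mediation structure: X drives M, and M drives Y.
rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)
m = 0.8 * x + rng.normal(scale=0.5, size=n)   # mediator driven by X
y = 1.2 * m + rng.normal(scale=1.0, size=n)   # Y driven by M, not directly by X

# X alone: significant, because X's effect reaches Y through M.
print(sm.OLS(y, sm.add_constant(x)).fit().pvalues)

# X and M together: M absorbs the shared variance, and X's p-value rises.
print(sm.OLS(y, sm.add_constant(np.column_stack([x, m]))).fit().pvalues)
```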
I hope that helps!
Thanks!
Jim, hi! I am working on an interpretation of a multiple linear regression, and I am having a bit of trouble getting help. Is there a way to post the table so that I may initiate a coherent discussion of my interpretation?
Is it possible to get significant correlations but no significant prediction in a multiple regression analysis? I am seeing that with my data, and I am so confused. Could mediation be a factor (i.e., the IVs are not predicting the outcome variables because the relationship operates through mediators)?
Hi Noor,
I’m not sure what you mean by “significant prediction.” Typically, the predictions you obtain from regression analysis will be a fitted value (the prediction) and a prediction interval that indicates the precision of the prediction (how close is it likely to be to the correct value). We don’t usually refer to “significance” when talking about predictions. Can you explain what you mean? Thanks!
Dear Jim,
I want to do a multiple regression analysis in SPSS (creating a predictive model), where IQ is my dependent variable and my independent variables consist of different cognitive domains. The IQ scores are already scaled for age. How can I control my independent variables for age without doing it again for the IQ scores? I can’t add age as an independent variable in the model.
I hope that you can give me some advice, thank you so much!
Hi Irene,
If you include age as an independent variable, the model controls for it while calculating the effects of the other IVs. And don’t worry, including age as an IV won’t double count it for IQ because that is your DV.
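As a rough sketch of that setup in Python (using statsmodels’ formula interface; the column names and values below are hypothetical, not your actual data), including age as a predictor adjusts the other coefficients for age without touching the IQ scores themselves:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: IQ, two cognitive domain scores, and age.
df = pd.DataFrame({
    "iq": [102, 98, 110, 95, 120, 105, 99, 115],
    "memory": [50, 45, 60, 40, 70, 55, 48, 65],
    "processing_speed": [31, 26, 38, 22, 36, 30, 27, 34],
    "age": [25, 40, 31, 52, 29, 47, 36, 33],
})

# With age in the model, the memory and processing_speed coefficients
# reflect their relationships with IQ after controlling for age.
model = smf.ols("iq ~ memory + processing_speed + age", data=df).fit()
print(model.params)
```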
Hi Jim,
Is there a reason you would want your covariates to be associated with your independent variable before including them in the model?
So in deciding which covariates to include in the model, it was specified that covariates associated with both the dependent variable and independent variable at p<0.10 will be included in the model.
My question is why would you want the covariates to be associated with the independent variable?
Thank you
Hi Joy,
In some cases, it’s absolutely crucial to include covariates that correlate with other independent variables, although it’s not a sufficient reason by itself. When you have a potential independent variable that correlates with other IVs and it also correlates with the dependent variable, it becomes a confounding variable and omitting it from the model can cause a bias in the variables that you do include. In this scenario, the degree of bias depends on the strengths of the correlations involved. Observational studies are particularly susceptible to this type of omitted variable bias. However, when you’re performing a true, randomized experiment, this type of bias becomes a non-issue.
I’ve never heard of a formalized rule such as the one you mention. Personally, I wouldn’t use p-values to make this determination. You can have low p-values for weak correlations in some cases. Instead, I’d look at the strength of the correlations between IVs. However, it’s not as simple as a single criterion like that. The strength of the correlation between the potential IV and the DV also plays a role.
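As an illustrative simulation (made-up data, Python with NumPy and statsmodels), omitting a confounder Z that correlates with both the included IV (X) and the DV (Y) biases the estimate for X:

```python
import numpy as np
import statsmodels.api as sm

# Z is a confounder: it drives both X and Y.
rng = np.random.default_rng(7)
n = 500
z = rng.normal(size=n)
x = 0.7 * z + rng.normal(scale=0.7, size=n)   # X correlates with Z
y = 2.0 * x + 3.0 * z + rng.normal(size=n)    # true effect of X on Y is 2.0

# Omitting Z: the X coefficient is biased upward because X proxies for Z.
print(sm.OLS(y, sm.add_constant(x)).fit().params)

# Including Z: the X coefficient is close to the true value of 2.0.
print(sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit().params)
```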
I’ve written an article that discusses these issues in more detail: read Confounding Variables Can Bias Your Results.
Jim, as if by serendipity: having been on your mailing list for years, I looked up your information on multiple regression this weekend for a grad school advanced statistics case study. I’m a fan of your admirable gift for making complicated topics approachable and digestible.
Specifically, I was looking for information on how pronounced the triangular/funnel shape must be, and in what directions it may point, to suggest heteroscedasticity in a regression scatterplot of standardized residuals vs. standardized predicted values. It seemed to me that my resulting plot for a 5-predictor regression model featured an obtuse triangular left point that violated homoscedasticity; my professors disagreed, stating the triangular “funnel” aspect would need to be more prominent and overt.
Thus, should you be looking for a future discussion point, my query to you would be some pearls on the nature of a qualifying heteroscedastic funnel shape: How severe must it be? Is there a quantifiable magnitude to that severity, and if so, how would one quantify it, and what numeric outputs in common statistical software would best support or deny a suspicion based on graphical interpretation? What directions can the funnel point; are only some directions suggestive, while others are not? Thanks for entertaining my comment, and, as always, thanks for doing what you do.