What is Linear Regression?
Linear regression models the relationship between one or more explanatory variables and an outcome variable. This flexible analysis lets you untangle complicated research questions by isolating each variable’s role. Additionally, linear models can fit curvature and interaction effects.
Statisticians refer to the explanatory variables in linear regression as independent variables (IVs) and the outcome as the dependent variable (DV). When a linear model has one IV, the procedure is known as simple linear regression. When there is more than one IV, statisticians refer to it as multiple regression. These models assume that the average value of the dependent variable depends on a linear function of the independent variables.
Linear regression has two primary purposes—understanding the relationships between variables and prediction.
- The coefficients represent the estimated magnitude and direction (positive/negative) of the relationship between each independent variable and the dependent variable.
- The equation allows you to predict the mean value of the dependent variable given the values of the independent variables that you specify.
Linear regression finds the constant and the coefficient values for the IVs that produce the line that best fits your sample data. The graph below shows the best linear fit for the height and weight data points, revealing the mathematical relationship between them. Additionally, you can use the line’s equation to predict a person’s weight from their height.
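To make that concrete, here is a minimal sketch in Python using NumPy; the height and weight values are made up purely for illustration, not a real dataset.

```python
import numpy as np

# Hypothetical height (cm) and weight (kg) values, for illustration only.
height = np.array([152, 160, 165, 170, 175, 180, 185, 190])
weight = np.array([51, 56, 61, 64, 68, 74, 79, 85])

# Fit a straight line: weight = slope * height + constant.
slope, constant = np.polyfit(height, weight, 1)
print(f"weight ~ {slope:.2f} * height + {constant:.2f}")

# Use the fitted line to predict the mean weight for a new height.
new_height = 172
print(f"predicted weight at {new_height} cm: {slope * new_height + constant:.1f} kg")
```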
Linear regression was one of the earliest types of regression analysis to be rigorously studied and widely applied in real-world scenarios. This popularity stems from the relative ease of fitting linear models to data and the straightforward nature of analyzing their statistical properties. Unlike more complex models whose parameters enter non-linearly, linear models simplify both the estimation and the interpretation of the data.
In this post, you’ll learn how to interpret linear regression with an example, about the linear formula, how it finds the coefficient estimates, and its assumptions.
Learn more about when you should use regression analysis and independent and dependent variables.
Linear Regression Example
Suppose we use linear regression to model how the outside temperature in Celsius and insulation thickness in centimeters, our two independent variables, relate to air conditioning costs in dollars (the dependent variable).
Let’s interpret the results for the following multiple linear regression equation:
Air Conditioning Costs ($) = 2 * Temperature (C) - 1.5 * Insulation (cm)
The coefficient sign for Temperature is positive (+2), which indicates a positive relationship between temperature and costs. As the temperature increases, so do air conditioning costs. More specifically, the coefficient value of 2 indicates that for every 1 C increase, the average air conditioning cost increases by two dollars.
On the other hand, the negative coefficient for insulation (-1.5) represents a negative relationship between insulation and air conditioning costs. As insulation thickness increases, air conditioning costs decrease. For every 1 cm increase, the average air conditioning cost drops by $1.50.
We can also enter values for temperature and insulation into this linear regression equation to predict the mean air conditioning cost.
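For example, here is a small sketch that plugs values into the example equation above; the function name and the inputs are just for illustration.

```python
def predicted_cost(temperature_c, insulation_cm):
    """Mean air conditioning cost ($) from the example equation."""
    return 2 * temperature_c - 1.5 * insulation_cm

# At 30 C outside with 4 cm of insulation:
print(predicted_cost(30, 4))  # 2*30 - 1.5*4 = 54.0 dollars
```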
Learn more about interpreting regression coefficients and using regression to make predictions.
Linear Regression Formula
The name linear regression refers to the form of the regression equation these models use. These models follow a particular formula arrangement that requires all terms to be one of the following:
- The constant
- A parameter multiplied by an independent variable (IV)
Then, you build the linear regression formula by adding the terms together. These rules limit the form to just one type:
Dependent variable = constant + parameter * IV + … + parameter * IV
This formula is linear in the parameters. However, despite the name, linear regression can model curvature: as long as the formula remains linear in the parameters, you can raise an independent variable to an exponent. For example, if you square an independent variable, linear regression can fit a U-shaped curve.
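As a rough sketch of that idea, the snippet below adds a squared term to the model and still estimates it with ordinary linear regression; the U-shaped data are invented for illustration.

```python
import numpy as np

# Hypothetical U-shaped data, for illustration only.
x = np.array([-4, -3, -2, -1, 0, 1, 2, 3, 4], dtype=float)
y = np.array([17.2, 9.8, 4.9, 1.2, 0.3, 1.1, 4.2, 9.1, 16.5])

# Design matrix with a constant, x, and x squared.
# The model is still linear in the parameters b0, b1, b2.
X = np.column_stack([np.ones_like(x), x, x**2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"y ~ {b[0]:.2f} + {b[1]:.2f}*x + {b[2]:.2f}*x^2")
```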
Specifying the correct linear model requires balancing subject-area knowledge, statistical results, and satisfying the assumptions.
Learn more about the difference between linear and nonlinear models and specifying the correct regression model.
How to Find the Linear Regression Line
Linear regression can use various estimation methods to find the best-fitting line. However, analysts use least squares most frequently because, when you can satisfy all its assumptions, it is the most precise method and it doesn’t systematically overestimate or underestimate the correct values.
The beauty of the least squares method is its simplicity and efficiency. The calculations required to find the best-fitting line are straightforward, making it accessible even for beginners and widely used in various statistical applications. Here’s how it works:
- Objective: Minimize the differences between the observed values and the values the linear regression model predicts. These differences are known as “residuals” and represent the model’s errors.
- Minimizing Errors: The method squares each residual and makes the sum of these squared differences as small as possible.
- Best-Fitting Line: By finding the values of the model parameters that achieve this minimum sum, the least squares method effectively determines the best-fitting line through the data points.
By employing the least squares method in linear regression and checking the assumptions in the next section, you can ensure that your model is as precise and unbiased as possible. This method’s ability to minimize errors and find the best-fitting line is a valuable asset in statistical analysis.
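Here is a minimal sketch of those steps in Python, solving the normal equations directly for one IV plus a constant; the data are invented for illustration.

```python
import numpy as np

# Hypothetical data: one IV (x) and the DV (y), for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

X = np.column_stack([np.ones_like(x), x])   # columns: constant, IV

# Ordinary least squares via the normal equations: b = (X'X)^-1 X'y.
# These parameter values minimize the sum of squared residuals.
b = np.linalg.solve(X.T @ X, X.T @ y)

residuals = y - X @ b                       # observed minus predicted values
sse = np.sum(residuals**2)                  # the quantity least squares minimizes
print(f"constant = {b[0]:.3f}, coefficient = {b[1]:.3f}, SSE = {sse:.3f}")
```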
Assumptions
Linear regression using the least squares method has the following assumptions:
- A linear model satisfactorily fits the relationship.
- The residuals follow a normal distribution.
- The residuals have a constant scatter.
- Independent observations.
- The IVs are not perfectly correlated.
Residuals are the difference between the observed value and the mean value that the model predicts for that observation. If you fail to satisfy the assumptions, the results might not be valid.
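A quick way to eyeball several of these assumptions is to plot the residuals. The sketch below shows one minimal approach in Python with made-up data: the residuals-versus-fits plot should show random scatter around zero with roughly constant spread, and the histogram gives a rough check of normality.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data and a least squares fit, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 3.8, 6.1, 8.2, 9.7, 12.4, 13.8, 16.1])
X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ b
residuals = y - fitted

fig, axes = plt.subplots(1, 2, figsize=(9, 3.5))

# Residuals vs fitted values: look for random scatter with no curvature or funnel shape.
axes[0].scatter(fitted, residuals)
axes[0].axhline(0, linestyle="--")
axes[0].set(xlabel="Fitted values", ylabel="Residuals", title="Residuals vs fits")

# Histogram of residuals: a rough check of the normality assumption.
axes[1].hist(residuals, bins=5)
axes[1].set(xlabel="Residual", title="Residual distribution")

plt.tight_layout()
plt.show()
```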
Learn more about the assumptions for ordinary least squares and How to Assess Residual Plots.
alex says
I managed to figure it out myself! Apparently the difference comes from using Type 1 ANOVA instead of Type 2 (in R the default anova function uses Type 1, whereas the function in Python uses Type 2).
As I understand it, in type 1 it’s done sequentially and the order of the variables in the model changes the results in this case, whereas in type 2 anova it is done marginally.
Jim Frost says
Hi Alex,
That’s great! I assume you’re referring to the sum of squares, which would actually be Adjusted Type 3 (the default in statistics) and Sequential Type I.
As you say, Type 1 depends on the order that the variables are entered into a model. That’s not usually used because a truly unimportant variable can look more important simply by being added to the model first.
Type 3 gives the results for each variable when all the other variables are already in the model. That puts them all on an even playing field and gives you the results for the unique variance each variable explains that the other variables do not.
You say Type 2. There is a Type 2 sum of squares, but it’s much less common than Types 1 and 3. Type 2 considers each main effect as being added after all the other main effects but before the interaction terms.
I don’t know why Python uses Type 2. Generally speaking, you should use Type 3 unless you have very strong theoretical/subject-area knowledge indicating that a different type is better. But that’s very rare. Almost always use Type 3.
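For anyone who wants to see the difference in code, here is a minimal sketch using Python’s statsmodels with made-up data standing in for the book’s Income ~ Major + Experience example (the column names mirror the example, but nothing here is the actual dataset).

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Made-up data standing in for the Income ~ Major + Experience example.
rng = np.random.default_rng(1)
n = 30
df = pd.DataFrame({
    "Major": rng.choice(["Statistics", "Psychology", "Political"], size=n),
    "Experience": rng.integers(1, 20, size=n),
})
df["Income"] = 30000 + 2000 * df["Experience"] + rng.normal(0, 15000, size=n)

model = smf.ols("Income ~ Major + Experience", data=df).fit()

# Sequential (Type I) sums of squares: each term is assessed in the order entered.
print(sm.stats.anova_lm(model, typ=1))

# Type III sums of squares: each term is assessed with all the others already
# in the model (sum-to-zero contrasts are usually recommended for Type III).
print(sm.stats.anova_lm(model, typ=3))
```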
alex says
Hi Jim,
Thank you for your blog, it has saved me from so much hair being pulled out in frustration!
I am reading your book on Regression (big thumbs up, recommend to everyone!) and trying to recreate results on Income ~ Major+Experience example on p 68.
I tried it in Python and got the same results as yours, then tried R (all seems the same: one factor, one numerical independent variable), and got different results:
Analysis of Variance Table
Response: Income
            Df     Sum Sq    Mean Sq F value  Pr(>F)
Major        2 2.5165e+09 1258246677  2.7701 0.08117
Experience   1 2.2523e+09 2252342774  4.9587 0.03483
Residuals   26 1.1810e+10  454222144
Meaning that the Major isn’t significant!
The coefficient table is exactly the same as yours.
Can you please help me to understand what is going on?
I am going mad trying to solve it.
Thanks,
Alex
Stan Alekman says
Hi Jim,
Why not perform centering or standardization with all linear regression to arrive at a better estimate of the y-intercept?
Jim Frost says
Hi Stan,
I talk about centering elsewhere. This article just covers the basics of what linear regression does.
A little statistical niggle on centering creating a “better estimate” of the y-intercept. In statistics, there’s a specific meaning to “better estimate,” relating to precision and a lack of bias. Centering (or standardizing) doesn’t create a better estimate in that sense. It can create a more interpretable value in some situations, which is better in common usage.
Peta says
Hi Jim,
I’m trying to understand why the Beta and significance changes in a linear regression, when I add another independent variable to the model. I am currently working on a mediation analysis, and as you know the linear regression is part of that. A simple linear regression between the IV (X) and the DV (Y) returns a statistically significant result. But when I add another IV (M), X becomes insignificant. Can you explain this?
Seeking some clarity,
Peta.
Jim Frost says
Hi Peta!
This is a common occurrence in linear regression and is crucial for mediation analysis.
By adding M (mediator), it might be capturing some of the variance that was initially attributed to X. If M is a mediator, it means the effect of X on Y is being channeled through M. So when M is included in the model, it’s possible that the direct effect of X on Y becomes weaker or even insignificant, while the indirect effect (through M) becomes significant.
If X and M share variance in predicting Y, when both are in the model, they might “compete” for explaining the variance in Y. This can lead to a situation where the significance of X drops when M is added.
I hope that helps!
Susan Bullington says
Thanks!
Susan Bullington says
Jim, Hi! I am working on an interpretation of multiple linear regression. I am having a bit of trouble getting help. Is there a way to post the table so that I may initiate a coherent discussion on my interpretation?
Noor says
Is it possible that we get significant correlations but no significant prediction in a multiple regression analysis? I am seeing that with my data and I am so confused. Could mediation be a factor (i.e., the IVs are not predicting the outcome variable because the relationship is made possible through mediators)?
Jim Frost says
Hi Noor,
I’m not sure what you mean by “significant prediction.” Typically, the predictions you obtain from regression analysis will be a fitted value (the prediction) and a prediction interval that indicates the precision of the prediction (how close is it likely to be to the correct value). We don’t usually refer to “significance” when talking about predictions. Can you explain what you mean? Thanks!
Irene says
Dear Jim,
I want to do a multiple regression analysis in SPSS (creating a predictive model), where IQ is my dependent variable and my independent variables consist of different cognitive domains. The IQ scores are already scaled for age. How can I control my independent variables for age without doing it again for the IQ scores? I can't add age as an independent variable in the model.
I hope that you can give me some advice, thank you so much!
Jim Frost says
Hi Irene,
If you include age as an independent variable, the model controls for it while calculating the effects of the other IVs. And don’t worry, including age as an IV won’t double count it for IQ because that is your DV.
Joy says
Hi Jim,
Is there a reason you would want your covariates to be associated with your independent variable before including them in the model?
So in deciding which covariates to include in the model, it was specified that covariates associated with both the dependent variable and independent variable at p<0.10 will be included in the model.
My question is why would you want the covariates to be associated with the independent variable?
Thank you
Jim Frost says
Hi Joy,
In some cases, it’s absolutely crucial to include covariates that correlate with other independent variables, although it’s not a sufficient reason by itself. When you have a potential independent variable that correlates with other IVs and it also correlates with the dependent variable, it becomes a confounding variable and omitting it from the model can cause a bias in the variables that you do include. In this scenario, the degree of bias depends on the strengths of the correlations involved. Observational studies are particularly susceptible to this type of omitted variable bias. However, when you’re performing a true, randomized experiment, this type of bias becomes a non-issue.
I’ve never heard of a formalized rule such as the one that you mention. Personally, I wouldn’t use p-values to make this determination. You can have low p-values for weak correlations in some cases. Instead, I’d look at the strength of the correlations between IVs. However, it’s not as simple as a single criterion like that. The strength of the correlation between the potential IV and the DV also plays a role.
I’ve written an article that discusses these issues in more detail: read Confounding Variables Can Bias Your Results.
Wes McFee says
Jim, as if by serendipity: having been on your mailing list for years, I looked up your information on multiple regression this weekend for a grad school advanced statistics case study. I’m a fan of your admirable gift to make complicated topics approachable and digestible. Specifically, I was looking for information on how pronounced the triangular/funnel shape must be–and in what directions it may point–to suggest heteroscedasticity in a regression scatterplot of standardized residuals vs standardized predicted values. It seemed to me that my resulting plot of a 5 predictor variable regression model featured an obtuse triangular left point that violated homoscedasticity; my professors disagreed, stating the triangular “funnel” aspect would be more prominent and overt. Thus, should you be looking for a new future discussion point, my query to you then might be some pearls on the nature of a qualifying heteroscedastic funnel shape: How severe must it be? Is there a quantifiable magnitude to said severity, and if so, how would one quantify this and/or what numeric outputs in common statistical software would best support or deny a suspicion based on graphical interpretation? What directions can the funnel point; are only some directions suggestive, whereby others are not? Thanks for entertaining my comment, and, as always, thanks for doing what you do.