A least squares regression line represents the relationship between variables in a scatterplot. The procedure fits the line to the data points in a way that minimizes the sum of the squared vertical distances between the line and the points. It is also known as a line of best fit or a trend line.
In the example below, we could look at the data points and attempt to draw a line by hand that minimizes the overall distance between the line and points while ensuring about the same number of points are above and below the line.
That’s a tall order, particularly with larger datasets! And subjectivity creeps in.
Instead, we can use least squares regression to mathematically find the best possible line and its equation. I’ll show you those later in this post.
In this post, I’ll define the least squares regression line, explain how it works, and work through an example of finding that line using the least squares formulas.
What is a Least Squares Regression Line?
Least squares regression lines are a specific type of model that analysts frequently use to display relationships in their data. Statisticians call it “least squares” because it minimizes the residual sum of squares. Let’s unpack what that means!
The Importance of Residuals
Residuals are the differences between the observed data values and the least squares regression line. The line represents the model’s predictions. Hence, a residual is the difference between the observed value and the model’s predicted value. There is one residual per data point, and they collectively indicate the degree to which the model is wrong.
Calculating a residual mathematically is simple subtraction:
Residual = Observed value – Model value.
Or, equivalently:
y – ŷ
Where ŷ is the regression model’s predicted value of y.
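For instance, if an observed value is 80 and the model predicts 75, the residual is 80 – 75 = 5.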
Graphically, residuals are the vertical distances between the observed values and the line, as shown in the image below. The line segments connecting the data points to the regression line represent the residuals, and their lengths correspond to the residuals’ magnitudes. Data points above the line have positive residuals, while those below the line have negative residuals.
The best models have data points close to the line, producing small absolute residuals.
Minimizing the Squared Error
Residuals represent the error in a least squares model. You want to minimize the total error because it means that the data points are collectively as close to the model’s values as possible.
Before minimizing error, you first need to quantify it.
Unfortunately, you can’t simply sum the residuals to represent the total error because the positive and negative values cancel each other out, even when the individual residuals are relatively large.
Instead, least squares regression takes those residuals and squares them, so they’re always positive. In this manner, the process can add them up without canceling each other. Statisticians refer to squared residuals as squared errors and their total as the sum of squared errors (SSE), shown below mathematically.
SSE = Σ(y – ŷ)²
Σ represents a sum. In this case, it’s the sum of all residuals squared. You’ll see a lot of sums in the least squares line formula section!
For a given dataset, the least squares regression line produces the smallest SSE compared to all other possible lines—hence, “least squares”!
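To make this concrete, here is a minimal Python sketch (using made-up numbers, not the dataset in the example below) that computes the residuals and SSE for one candidate line:

```python
# Minimal sketch: residuals and SSE for one candidate line y = b + m*x.
# The numbers below are made up purely for illustration.
x = [1, 2, 3, 4, 5]          # e.g., hours studied
y = [12, 14, 15, 17, 16]     # observed values

b, m = 11.0, 1.2             # one candidate intercept and slope

y_hat = [b + m * xi for xi in x]                      # predicted values (y-hat)
residuals = [yi - yhi for yi, yhi in zip(y, y_hat)]   # observed minus predicted
sse = sum(r ** 2 for r in residuals)                  # sum of squared errors

print(residuals)  # roughly [-0.2, 0.6, 0.4, 1.2, -1.0]
print(sse)        # roughly 3.0
```

The least squares procedure finds the intercept and slope that make this SSE as small as possible for the data.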
Least Squares Regression Line Example
Imagine we have a list of people’s study hours and test scores. In the scatterplot, we can see a positive relationship exists between study time and test scores. Statistical software can display the least squares regression line and its equation.
From the discussion above, we know that this line minimizes the squared distance between the line and the data points. It’s impossible to draw a different line that fits these data better! Great!
But how does the least squares regression procedure find the line’s equation? We’ll do that in the next section using these example data!
How to Find a Least Squares Regression Line
The regression output produces an equation for the best-fitting line. So, how do you find a least squares regression line?
First, I’ll cover the formulas and then use them to work through our example dataset.
Least Squares Regression Line Formulas
For starters, the following equation represents the best fitting regression line:
y = b + mx
Where:
- y is the dependent variable.
- x is the independent variable.
- b is the y-intercept.
- m is the slope of the line.
The slope represents the mean change in the dependent variable for a one-unit change in the independent variable.
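For example, if the slope in a study time and test score model is 2, each additional hour of studying is associated with a 2-point increase in the average test score.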
You might recognize this equation as the slope-intercept form of a linear equation from algebra. For a refresher, read my post: Slope-Intercept Form: A Guide.
We need to calculate the values of m and b to find the equation for the best-fitting line.
Here are the least squares regression line formulas for the slope (m) and intercept (b):
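m = (N Σxy – Σx Σy) / (N Σx² – (Σx)²)

b = (Σy – m Σx) / N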
Where:
- Σ represents a sum.
- N is the number of observations.
You must find the slope first because you need to enter that value into the formula for calculating the intercept.
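Here is a minimal Python sketch of these formulas. The numbers are made up for illustration (the actual dataset is in the Excel file linked in the worked example below), and numpy.polyfit is included only as an independent check.

```python
import numpy as np

# Made-up illustration data: hours studied (x) and test scores (y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([13.0, 14.0, 15.5, 15.0, 17.0, 18.0])

N = len(x)
sum_x, sum_y = x.sum(), y.sum()
sum_xy = (x * y).sum()     # sum of x*y
sum_x2 = (x ** 2).sum()    # sum of x squared

# Slope first, because the intercept formula needs it.
m = (N * sum_xy - sum_x * sum_y) / (N * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / N

print(f"least squares line: y = {b:.3f} + {m:.3f}x")

# Independent check with NumPy's built-in degree-1 polynomial fit.
m_check, b_check = np.polyfit(x, y, deg=1)
print(f"np.polyfit check:   y = {b_check:.3f} + {m_check:.3f}x")
```

Swapping in the study-hours dataset from the Excel file should reproduce the equation found in the worked example below.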
Worked Example
Let’s take the data from the hours of studying example. We’ll use the least squares regression line formulas to find the slope and constant for our model.
To start, we need to calculate the following sums for the formulas: Σx, Σy, Σx², and Σxy. I’ve calculated these sums, as shown below. Download the Excel file that contains the dataset and calculations: Least Squares Regression Line example.
Next, we’ll plug those sums into the slope formula.
Now that we have the slope (m), we can find the y-intercept (b) for the line.
Let’s plug the slope and intercept values in the least squares regression line equation:
y = 11.329 + 1.0616x
This linear equation matches the one that the software displays on the graph. We can use this equation to make predictions. For example, if we want to predict the score for studying 5 hours, we simply plug x = 5 into the equation:
y = 11.329 + 1.0616 * 5 = 16.637
Therefore, the model predicts that people studying for 5 hours will have an average test score of 16.637.
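As a quick sketch, the fitted equation translates directly into a prediction function using the coefficients reported above:

```python
def predict_score(hours):
    """Predicted test score from the fitted line y = 11.329 + 1.0616x."""
    return 11.329 + 1.0616 * hours

print(f"{predict_score(5):.3f}")  # 16.637
```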
Learn how to assess the following least squares regression line output:
- Linear Regression Equation Explained
- Regression Coefficients and their P-values
- Assessing R-squared for Goodness-of-Fit
For accurate results, the least squares regression line must satisfy various assumptions, including linearity, independent errors with constant variance, and normally distributed residuals. Read my posts about the regression assumptions to learn how to assess them.
Mr. Stat says
Hi Jim, thanks a lot for your explanations. In another section you explain the interpretation of coefficients for a linear regression. I am not very clear on the interpretation of coefficients in the case of linear regression with bucketed variables (binning) and target encoding for independent variables. Let’s say I have two variables: country and gender. I group the most similar distinct values in terms of the target variable (historical saving rate of candidates, in my case) into buckets for each of these two variables. These will be the input variables in the regression.
For example, the average saving rate in the US and Canada is 60%. Thus, the dependent variable (y) is the historical saving rate of each individual and the independent variable (x) is the average value of historical saving rate for each separate bucket.
You wrote, “The coefficient value signifies how much the mean of the dependent variable changes given a one-unit shift in the independent variable while holding other variables in the model constant.” For example, does a coefficient of 0.5 indicate that the mean saving rate in these countries will change by 0.5% when the average saving rate changes by 1%? I am not sure if my understanding of this sentence for my regression is meaningful.
Additionally, I am not sure how I should assess the goodness of the regression. Does R-squared make sense for such a type of regression, given that the independent variable is an average value of the dependent variable?
Richard Rosen says
Thanks for your explanation, Jim, but you have ignored my basic point, which is that in doing science, different data points should not be given different (unknown) weights via the squaring process. To me, that takes precedence over any mathematical considerations with respect to normal distributions and what you call biased results. The best results are those that do not give different data points different weights in the first place.
Jim Frost says
Hi Richard,
I answered your question. Least squares regression provides the most precise, unbiased estimates for linear regression when you can satisfy all the assumptions. It considers all data points using the same algorithm. Generally, when you’re performing an analysis, you want the best results possible. That’s why you’d use it.
We can take a look at your concerns about “different weights.” First of all, not all regression involves people. Secondly, you seem to prefer LAD regression, but that also weights observations differently: outliers still count for more than observations closer to the line. The difference is a linear versus nonlinear function, but both methods count more extreme observations more heavily the further out they are. Crucially, both methods use relevant characteristics of the observations toward that end, not some arbitrary criteria.
But that’s not a problem. Let’s take a look at the field of medicine. Suppose you have two patients and, based on their relevant characteristics, one has risk factors for a disease while the other does not. For the best results, the doctor will treat the patients differently based on those relevant criteria. Similarly, least squares and LAD regression take the relevant characteristics of the observations and handle them accordingly. Treating everyone the same doesn’t always work best in either medicine or regression. You need to treat them appropriately based on relevant criteria.
That’s what least squares and LAD regression do.
What you should focus on instead is whether least squares or LAD is more appropriate for your data given its characteristics. If your dataset has outliers that you can’t remove, consider LAD amongst other possibilities. If you can satisfy all the assumptions of least squares, use it.
Alternatively, you could perform a special study involving only the outliers. Maybe they’re a separate population requiring a specific study?
Those are more pertinent and productive questions than just criticizing least squares. Least squares is good for some cases while other methods are better for others. You need to make that determination.
Richard Rosen says
First of all, most real-world data is not normally distributed, and one certainly should not assume that it is for the purpose of doing science, namely trying to find causes and effects. Secondly, therefore, the mathematical properties of least squares methods are not relevant to doing science, which means analyzing the data as it comes. Third, in doing science, as far as I can see, there is absolutely no rationale for weighting different data points differently as a function of how far they end up lying from the regression line. Can you give me a rationale? For example, when doing regression analysis for epidemiology, each data point is a person, and all people’s data should be considered equally!!!
Jim Frost says
Hi Richard,
You seem a little bent out of shape over the normal distribution! So, my first suggestion is to calm down.
I explained the rationale for using squared differences in least squares in my previous reply. Go back and reread that.
From a mathematical standpoint, using the least squares method provides the best coefficient estimates (unbiased, lowest variance) for linear models when the error term is normally distributed. That’s a factual statement from a mathematical perspective. The mathematical properties ARE relevant when you’re using regression to analyze your data assuming you want to obtain trustworthy results.
Your point about the prevalence of the normal distribution is a separate issue. Regardless of the normal distribution’s prevalence, it doesn’t change the mathematical truth behind the least squares method; it just affects how often you can apply it.
I’m not going to debate you on that but I’ll provide my perspective. For starters, the normality assumption for least squares regression applies to the residuals, not the distribution of the variables. You can have nonnormal variables but still produce normally distributed residuals.
As for the variables, or the phenomena themselves, in my extensive experience in the field, I’m surprised at how often data fit the normal distribution. And, in regression, you frequently can obtain normally distributed residuals. However, that’s not always true. For instance, while human heights are normally distributed, weights are not. I suspect that some fields deal with nonnormal data more than others.
But this property is something you can check. I’m not sure where you got the idea that you just assume data or residuals follow a normal distribution. You can check it using distribution tests (e.g., normality tests). And I always point out the importance of checking residual plots after performing regression analysis. One of the reasons is to see whether the residuals are normally distributed!
So, it’s not guaranteed whether your variables or residuals follow the normal distribution. You need to check that out. But, if you can obtain normally distributed residuals and satisfy the other least squares assumptions, then from a mathematical standpoint you will get the best results using the least squares method. Period.
If you have significant outliers in your data that you can’t remove, then consider LAD regression as one of the possible remedies. But it’s a specialized method that is only good for certain cases.
Yes, your data comes in however it is. But you must choose the correct statistical methods based on your data’s properties. Failure to do that can distort your results.
Richard Rosen says
Why don’t you favor least absolute differences for regression analysis instead of least squares? Why should points farther from the regression line have much greater impact on where the line is located?
Jim Frost says
Hi Richard,
That’s a great question!
The squared-differences approach aligns better with the normal distribution. Consider that in the bell-shaped curve, about two-thirds of the observations fall within +/- 1 standard deviation of the mean, so a range just 2 SD wide contains roughly 2/3 of the values. Now take the range from negative infinity to -1 SD plus the range from +1 SD to positive infinity, which I show in the image below for clarity. That massive range contains only 31.73% of the observations. And only 4.55% of values are more extreme than +/- 2 SD.
This non-linear drop-off means that extreme values (outliers) are much, much rarer than values close to the mean. Least squares regression leverages this property by squaring the residuals, which penalizes larger deviations more severely. This results in a regression line that is heavily influenced by outliers, aligning well with the assumption of normally distributed errors.
In contrast, least absolute differences (LAD) regression minimizes the sum of the absolute values of the residuals. This method treats all deviations equally, regardless of their magnitude. While LAD is less sensitive to outliers, making it useful in cases with significant outliers, it does not align as closely with the characteristics of the normal distribution.
Least squares regression is preferred in many cases due to its mathematical properties, such as producing the best linear unbiased estimators (BLUE) under normally distributed errors. However, for data with many outliers, LAD regression or other robust methods may be more suitable.
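As a quick sketch, you can verify those normal-distribution percentages yourself with SciPy (assuming it is installed):

```python
from scipy.stats import norm

within_1sd = norm.cdf(1) - norm.cdf(-1)   # ~0.6827: within +/- 1 SD
beyond_1sd = 1 - within_1sd               # ~0.3173: beyond +/- 1 SD
beyond_2sd = 2 * norm.cdf(-2)             # ~0.0455: beyond +/- 2 SD

print(f"within +/- 1 SD: {within_1sd:.4f}")
print(f"beyond +/- 1 SD: {beyond_1sd:.4f}")
print(f"beyond +/- 2 SD: {beyond_2sd:.4f}")
```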
I hope that helps!
Jobin FRancis says
It’s wonderful. Very well explained.
Khursheed Ahmad Ganaie says
Amazing sir, no words for the explanation.