Sum of Squares: Definition, Formula & Types

By Jim Frost

What is the Sum of Squares?

The sum of squares (SS) is a statistic that measures the variability of a dataset’s observations around the mean. It’s the cumulative total of each data point’s squared difference from the mean.

[Figure: low vs. high variability. Variability measures how far observations fall from the center.]

Larger values indicate a greater degree of dispersion. However, it is an unscaled measure that doesn’t adjust for the number of data points. Adding new data points to a dataset is virtually guaranteed to cause the SS to increase, and it can never decrease.

In short, a larger dataset tends to have a higher sum of squares simply because it contains more observations. Consequently, you can’t compare these values between different-sized datasets. Additionally, this statistic is in squared units, further reducing interpretability.

Despite these shortcomings, SS is an invaluable tool for assessing linear models by partitioning their variance. This process breaks down the variability in our data and linear model into distinct components, helping us evaluate our model. More on that shortly!

Let’s move on to understanding the sum of squares formula and how it is the starting point for other variability measures. Then I’ll show you how it’s a fundamental component of least squares regression. Let’s dive in!

Related post: Measures of Variability

Sum of Squares Formula

The sum of squares formula is the following:

SS = Σᵢ₌₁ⁿ (Xᵢ − X̅)²

Where:

  • Σ represents the sum for all observations from 1 to n.
  • n is the sample size.
  • Xᵢ is an individual data point.
  • X̅ (pronounced “X-bar”) is the mean of the data points.

This formula provides a measure of variability or dispersion in a dataset. Here’s how to find the sum of squares:

  1. Take each data point and subtract the mean from it.
  2. Square that difference.
  3. Add all the squared values to the running total.

Notice how the squaring process in the sum of squares formula ensures that it tends to increase with each additional data point. Negative differences are squared, producing a positive value that adds to the total. Only the rare observations that equal the mean exactly contribute zero to the sum.

The formula leaves the statistic in its squared form (i.e., it does not take the square root).

Finally, there is no denominator in the sum of squares formula to divide by the number of observations or degrees of freedom. That’s the unscaled nature of SS. This statistic grows with the sample size.
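To make those steps concrete, here’s a quick Python sketch that computes the sum of squares by hand (the dataset is made up purely for illustration):

import statistics

data = [4, 7, 9, 10, 15]                 # hypothetical observations
mean = statistics.mean(data)             # X-bar = 9.0

# Subtract the mean from each point, square the difference, and total them.
ss = sum((x - mean) ** 2 for x in data)

print(ss)                                # 66.0

Notice that nothing divides the total by the sample size, so adding a sixth observation could only keep this value the same or push it higher.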

This sum of squares formula is the starting point for other variability measures that do factor in the sample size. Some also take the square root to use the data’s natural units. These statistics include the following:

  • Variance
  • Standard deviation
  • Mean squared error
  • Root mean squared error

SS in Regression Analysis

In regression analysis, the sum of squares (SS) is particularly helpful because it separates variability into three types: total SS, regression SS, and error SS. After explaining them individually, I’ll show you how they work together.

Total Sum of Squares (TSS)

TSS measures the total variability in your data. It’s essentially the sum of squares we discussed earlier, but linear regression applies it to your response variable.

It measures the overall variability of the dependent variable around its mean. Consider it the total amount of variation available for your model to explain.

The total sum of squares formula is the following:

TSS = Σᵢ₌₁ⁿ (yᵢ − ȳ)²

This value is the sum of the squared distances between the observed values of the dependent variable (yᵢ) and its mean (ȳ).

Regression Sum of Squares (RSS)

RSS measures the variability in the model’s predicted values around the dependent variable’s mean. It reflects the additional variability your model explains compared to a model that contains no variables and uses only the mean to predict the dependent variable. In simpler terms, it’s the amount of variability that your model explains.

Higher RSS values indicate that your model explains more of your data’s variability.

The regression sum of squares formula is the following:

RSS = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)²

This value is the sum of the squared distances between the fitted values (ŷᵢ) and the mean of the dependent variable (ȳ). Fitted values are your model’s predictions for each observation.

Error Sum of Squares (SSE)

Finally, we reach SSE—the portion of variability not captured by your regression model. It measures the overall variability of the distances between the data points and the fitted values.

For a specific data set, smaller SSE values indicate that the observations fall closer to the fitted values. Typically, you want this number to be as low as possible because it suggests that your model’s predictions are close to the actual data values. In other words, they’re good predictions.

Ordinary least squares (OLS) regression minimizes SSE, which means you get the best-fitting line in the least squares sense. That’s why statisticians named the procedure OLS!

Learn How to Find the Least Squares Line.

The error sum of squares formula is the following:

SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

This value is the sum of the squared distances between the data points (yᵢ) and the fitted values (ŷᵢ). Alternatively, it’s known as the residual sum of squares because it sums the squared residuals (yᵢ − ŷᵢ).

Note: Some notation uses RSS for residual SS instead of regression SS. Be aware of this potentially confusing acronym switch!
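Putting the three definitions side by side, here’s a minimal Python sketch with made-up numbers; the fitted values come from the least squares line ŷ = 1.8 + 0.8x fitted to the points x = 1, 2, 3, 4, 5:

y = [2, 4, 5, 4, 6]                    # observed values (hypothetical)
y_hat = [2.6, 3.4, 4.2, 5.0, 5.8]      # fitted values from the OLS line above
y_bar = sum(y) / len(y)                # mean of the dependent variable = 4.2

tss = sum((yi - y_bar) ** 2 for yi in y)               # total SS      ≈ 8.8
rss = sum((fi - y_bar) ** 2 for fi in y_hat)           # regression SS ≈ 6.4
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # error SS      ≈ 2.4

print(tss, rss, sse)

Notice that 6.4 + 2.4 = 8.8. That’s no coincidence, as the next section explains.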

Relationship Between the Types of SS

In the context of regression, these three types of SS serve as our map, guiding us through the variability. This partitioning reveals the proportions of explained and unexplained variance, which in turn tell us about our model’s performance.

Understanding how each sum of squares relates to the others is straightforward:

  • RSS represents the variability that your model explains. Higher is usually good.
  • SSE represents the variability that your model does not explain. Smaller is generally good.
  • TSS represents the variability inherent in your dependent variable.

Consequently, these three statistics have the following mathematical relationship:

RSS + SSE = TSS

Or, Explained Variability + Unexplained Variability = Total Variability

Simple math!

As you fit better models for the same dataset, RSS increases and SSE decreases by precisely the same amount. RSS cannot be greater than TSS, while SSE cannot be less than zero.

Additionally, if you take RSS / TSS (or 1 – SSE / TSS), you get the proportion of the variability of the dependent variable that your model explains. That proportion is the R-squared statistic—a vital goodness-of-fit measure for linear models!
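As a quick check on both the identity and the R-squared calculation, here’s a sketch that fits the least squares line with NumPy (same made-up data as in the earlier example) and confirms that the pieces add up:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])            # hypothetical data

slope, intercept = np.polyfit(x, y, 1)   # OLS fit, which minimizes SSE
y_hat = intercept + slope * x            # fitted values

tss = np.sum((y - y.mean()) ** 2)        # total SS      ≈ 8.8
rss = np.sum((y_hat - y.mean()) ** 2)    # regression SS ≈ 6.4
sse = np.sum((y - y_hat) ** 2)           # error SS      ≈ 2.4

print(np.isclose(rss + sse, tss))        # True: explained + unexplained = total
print(rss / tss, 1 - sse / tss)          # both ≈ 0.727, the R-squared value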

Learn How to Interpret R-squared.

