The standard error of the regression (S) and R-squared are two key goodness-of-fit measures for regression analysis. While R-squared is the most well-known amongst the goodness-of-fit statistics, I think it is a bit over-hyped.
In this post, I’ll compare these two statistics. We’ll also work through a regression example to help make the comparison. I think you’ll see that the oft-overlooked standard error of the regression can tell you things that the high and mighty R-squared simply can’t. At the very least, you’ll find that the standard error of the regression is a great tool to add to your statistical toolkit!
Comparison of R-squared to the Standard Error of the Regression (S)
You can find the standard error of the regression, also known as the standard error of the estimate, near R-squared in the goodness-of-fit section of most statistical output. Both of these measures give you a numeric assessment of how well a model fits the sample data. However, there are differences between the two statistics.
- The standard error of the regression provides the absolute measure of the typical distance that the data points fall from the regression line. S is in the units of the dependent variable.
- R-squared provides the relative measure of the percentage of the dependent variable variance that the model explains. R-squared can range from 0 to 100%.
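To make the two definitions concrete, here is a minimal Python sketch of how both statistics come from the same residuals. It assumes the usual least-squares formulas: S is the square root of the residual sum of squares divided by the error degrees of freedom, and R-squared is one minus the ratio of the residual to the total sum of squares. The function name `fit_statistics` and the data are hypothetical and chosen only for illustration.

```python
import math

def fit_statistics(observed, fitted, n_params):
    """Return (S, R-squared) for observed values and their fitted values.

    n_params counts every estimated coefficient, including the intercept.
    """
    n = len(observed)
    mean_y = sum(observed) / n
    # Variation the model leaves unexplained (sum of squared residuals)
    sse = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    # Total variation of the dependent variable around its mean
    sst = sum((y - mean_y) ** 2 for y in observed)
    s = math.sqrt(sse / (n - n_params))  # absolute: in the units of y
    r_squared = 1.0 - sse / sst          # relative: a unitless proportion
    return s, r_squared

# Hypothetical data: four observations and the fitted values from the
# least-squares line y = 0.5 + 0.8 * x evaluated at x = 1, 2, 3, 4.
observed = [1.0, 3.0, 2.0, 4.0]
fitted = [1.3, 2.1, 2.9, 3.7]

s, r_squared = fit_statistics(observed, fitted, n_params=2)
print(f"S = {s:.3f} (units of y), R-squared = {r_squared:.1%}")
```

Notice that S carries the units of the dependent variable, while R-squared is a pure proportion.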
An analogy makes the difference very clear. Suppose we’re talking about how fast a car is traveling.
R-squared is equivalent to saying that the car went 80% faster. That sounds a lot faster! However, it makes a huge difference whether the initial speed was 20 MPH or 90 MPH. The increased velocity based on the percentage can be either 16 MPH or 72 MPH, respectively. One is lame, and the other is very impressive. If you need to know exactly how much faster, the relative measure just isn’t going to tell you.
The standard error of the regression is equivalent to telling you directly how many MPH faster the car is traveling. The car went 72 MPH faster. Now that’s impressive!
Let’s move on to how we can use these two goodness-of-fit measures in regression analysis.
Standard Error of the Regression and R-squared in Practice
In my view, the standard error of the regression has several advantages. S tells you straight up how precise the model’s predictions are using the units of the dependent variable. This statistic indicates how far the data points are from the regression line on average. You want lower values of S because it signifies that the distances between the data points and the fitted values are smaller. S is also valid for both linear and nonlinear regression models. This fact is convenient if you need to compare the fit between both types of models.
For R-squared, you want the regression model to explain higher percentages of the variance. Higher R-squared values indicate that the data points are closer to the fitted values. While higher R-squared values are good, they don’t tell you how far the data points are from the regression line. Additionally, R-squared is valid for only linear models. You can’t use R-squared to compare a linear model to a nonlinear model.
Note: Linear models can use polynomials to model curvature. I’m using the term linear to refer to models that are linear in the parameters. Read my post that explains the difference between linear and nonlinear regression models.
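As a rough illustration of why R-squared doesn’t tell you how far the data points are from the regression line, the sketch below fits the same hypothetical data twice: once as given and once with the dependent variable multiplied by 10. The helper names (`simple_ols`, `s_and_r_squared`) and the data are assumptions made up for this example, and the fit is a plain one-predictor least-squares line.

```python
import math

def simple_ols(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def s_and_r_squared(x, y):
    """Fit a straight line and return (S, R-squared)."""
    slope, intercept = simple_ols(x, y)
    fitted = [intercept + slope * xi for xi in x]
    mean_y = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    # Two estimated parameters: slope and intercept
    return math.sqrt(sse / (len(y) - 2)), 1.0 - sse / sst

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_small = [2.0, 2.9, 4.2, 4.8, 6.1]    # hypothetical response values
y_large = [10 * yi for yi in y_small]  # same pattern on a 10x larger scale

s_small, r2_small = s_and_r_squared(x, y_small)
s_large, r2_large = s_and_r_squared(x, y_large)

print(f"original scale: S = {s_small:.3f}, R-squared = {r2_small:.1%}")
print(f"10x scale:      S = {s_large:.3f}, R-squared = {r2_large:.1%}")
```

Both fits report the same R-squared, yet the typical distance from the line is ten times larger in the second one. Only S picks up that difference.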
Example Regression Model: BMI and Body Fat Percentage
This regression model describes the relationship between body mass index (BMI) and body fat percentage in middle school girls. It’s a linear model that uses a polynomial term to model the curvature. The fitted line plot indicates that the standard error of the regression is 3.53399% body fat. The interpretation of this S is that the typical distance between the observations and the regression line is about 3.5% body fat.
S measures the precision of the model’s predictions. Consequently, we can use S to obtain a rough estimate of the 95% prediction interval. About 95% of the data points fall within +/- 2 * S of the fitted line.
For the regression example, approximately 95% of the data points lie within +/- 7% body fat of the regression line (2 * 3.5 ≈ 7).
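The arithmetic behind that rough interval can be sketched in a few lines of Python. The value of S comes from the example above. The function name `rough_prediction_interval` and the fitted value of 25% body fat are hypothetical, and the multiplier of 2 is only the rule of thumb described here, not an exact prediction interval.

```python
def rough_prediction_interval(fitted_value, s, multiplier=2.0):
    """Approximate 95% prediction interval: fitted value +/- multiplier * S."""
    half_width = multiplier * s
    return fitted_value - half_width, fitted_value + half_width

S = 3.53399          # standard error of the regression, in % body fat
fitted_value = 25.0  # hypothetical fitted value, in % body fat

low, high = rough_prediction_interval(fitted_value, S)
print(f"half-width: {2.0 * S:.2f}% body fat")
print(f"rough 95% range: {low:.2f}% to {high:.2f}% body fat")
```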
The R-squared is 76.1%. I have an entire blog post dedicated to interpreting R-squared. So, I won’t cover that in detail here.
I Often Prefer the Standard Error of the Regression
R-squared is a percentage, which seems easy to understand. However, I often appreciate the standard error of the regression a bit more. I value the concrete insight provided by using the original units of the dependent variable. If I’m using the regression model to produce predictions, S tells me at a glance if the model is sufficiently precise.
On the other hand, R-squared doesn’t have any units, and it feels more ambiguous than S. If all we know is that R-squared is 76.1%, we don’t know how wrong the model is on average. You do need a high R-squared to produce precise predictions, but you don’t know how high it must be exactly. By itself, R-squared can’t tell you how precise the predictions are.
To demonstrate this, we’ll look at the regression example. Let’s assume that our predictions must be within +/- 5% body fat of the observed values to be useful. If we know only that R-squared is 76.1%, can we determine whether our model is sufficiently precise? No, we can’t tell using R-squared.
However, we can use the standard error of the regression. For our model to have the required precision, S must be less than 2.5% body fat because 2.5 * 2 = 5. In an instant, we know that our S (about 3.5% body fat) is too large. We need a more precise model. Thanks, S!
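This check is easy to express in code. Here is a small sketch that assumes the same +/- 2 * S rule of thumb used earlier. The function names `max_allowed_s` and `is_precise_enough` are hypothetical.

```python
def max_allowed_s(required_margin, multiplier=2.0):
    """Largest S that keeps the rough 95% range within the required margin."""
    return required_margin / multiplier

def is_precise_enough(s, required_margin, multiplier=2.0):
    """True when S is strictly below the threshold, as in the text."""
    return s < max_allowed_s(required_margin, multiplier)

S = 3.53399            # from the BMI example, in % body fat
required_margin = 5.0  # predictions must be within +/- 5% body fat

threshold = max_allowed_s(required_margin)
verdict = is_precise_enough(S, required_margin)
print(f"S must be below {threshold}% body fat; precise enough? {verdict}")
```

Knowing R-squared alone would not let you write an equivalent check, because it carries no information about the units of the dependent variable.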
While I really like the standard error of the regression, you can, of course, consider both goodness-of-fit measures simultaneously. This is the statistical equivalent of having your cake and eating it too!