The constant term in regression analysis is the value at which the regression line crosses the y-axis. The constant is also known as the y-intercept. That sounds simple enough, right? Mathematically, the regression constant really is that simple. However, the difficulties begin when you try to interpret the meaning of the y-intercept in your regression output.
Why is it difficult to interpret the constant term? Because the y-intercept is almost always meaningless! Surprisingly, while the constant doesn’t usually have a meaning, it is almost always vital to include it in your regression models!
In this post, I will teach you all about the constant in regression analysis.
The Definition of the Constant is Correct but Misleading
The constant is often defined as the mean of the dependent variable when you set all of the independent variables in your model to zero. In a purely mathematical sense, this definition is correct. Unfortunately, you frequently can’t set all of the variables to zero because that combination is impossible or nonsensical in the real world.
I use the example below in my post about how to interpret regression p-values and coefficients. The graph displays a regression model that assesses the relationship between height and weight. For this post, I modified the y-axis scale to illustrate the y-intercept, but the overall results haven’t changed.
If you extend the regression line downwards until you reach the point where it crosses the y-axis, you’ll find that the y-intercept value is negative!
In fact, the regression equation shows us that the intercept is -114.3. Using the traditional definition for the regression constant, if height is zero, the expected mean weight is -114.3 kilograms! Huh? Neither a zero height nor a negative weight makes any sense at all!
The negative y-intercept for this regression model has no real meaning, and you should not try attributing one to it.
You think that is a head scratcher? Try imagining a regression analysis with multiple independent variables. The more variables you have, the less likely it is that each and every one of them can equal zero simultaneously.
If the independent variables can’t all equal zero, or you get an impossible negative y-intercept, don’t interpret the value of the y-intercept!
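To see how sensible data can produce a nonsensical intercept, here is a minimal sketch in Python with NumPy. The heights and weights are synthetic numbers I made up for illustration (they are not the data behind the graph above), and the variable names are my own. Every simulated girl has a positive height and weight, yet the fitted line still crosses the y-axis well below zero.

```python
import numpy as np

# Synthetic data for illustration only; NOT the dataset used in this post.
rng = np.random.default_rng(0)
height_m = rng.uniform(1.3, 1.7, size=80)  # heights in meters
weight_kg = -100 + 100 * height_m + rng.normal(0, 3, size=80)

# A degree-1 polyfit returns the coefficients as [slope, intercept].
slope, intercept = np.polyfit(height_m, weight_kg, 1)

print(f"weight = {intercept:.1f} + {slope:.1f} * height")
print(f"smallest observed height: {height_m.min():.2f} m")
print(f"smallest observed weight: {weight_kg.min():.1f} kg")
```

The intercept is simply where the line lands when you slide it all the way back to a height of zero, a place where none of the observations live.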
The Y-Intercept Might Be Outside of the Observed Data
I’ll stipulate that, in a few cases, it is possible for all independent variables to equal zero simultaneously. However, to have any chance of interpreting the constant, this all-zero data point must be within the observation space of your dataset.
As a general statistical guideline, never make a prediction for a point that is outside the range of observed values that you used to fit the regression model. The relationship between the variables can change as you move outside the observed region—but you don’t know it changes because you don’t have that data!
This guideline comes into play here because the constant predicts the dependent variable for a particular point. If your data don’t include the all-zero data point, don’t believe the y-intercept.
I’ll use the height and weight regression example again to show you how this works. This model estimates its parameters using data from middle school girls whose heights and weights fall within a certain range. We should not trust this estimated relationship for values that fall outside the observed range. Fortunately, for this example, we can deduce that the relationship does change by using common sense.
I’ve indicated the mean height and weight for a newborn baby on the graph with a red circle. This height isn’t exactly zero, but it is as close as possible. By looking at the chart, it is evident that the actual relationship must change over the extended range!
The observed relationship is locally linear, but it must curve as it decreases below the observed values. Don’t predict outside the range of your data! This principle is an additional reason why the y-intercept might not be interpretable.
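Here is a rough sketch of that idea in Python. I am assuming a hypothetical curved relationship (weight proportional to height squared) purely for illustration; it is not a claim about how children actually grow. A straight line fitted only to heights between 1.3 and 1.7 meters tracks the curve closely inside that range, then goes badly wrong at a newborn-sized height of 0.5 meters.

```python
import numpy as np

def true_weight(height_m):
    # Hypothetical curved relationship, chosen only for illustration.
    return 22.0 * height_m ** 2

# "Observed" range: heights between 1.3 m and 1.7 m, no noise added.
height_m = np.linspace(1.3, 1.7, 50)
weight_kg = true_weight(height_m)

# Fit a straight line to the observed range only.
slope, intercept = np.polyfit(height_m, weight_kg, 1)

def linear_prediction(h):
    return intercept + slope * h

# Inside the observed range, the line is a good local approximation.
max_error_in_range = np.max(np.abs(linear_prediction(height_m) - weight_kg))

# Far outside the range, the line and the curve part ways.
newborn_height = 0.5
predicted_newborn = linear_prediction(newborn_height)
actual_newborn = true_weight(newborn_height)

print(f"largest error inside the observed range: {max_error_in_range:.2f} kg")
print(f"line's prediction at {newborn_height} m: {predicted_newborn:.1f} kg")
print(f"curve's value at {newborn_height} m: {actual_newborn:.1f} kg")
```

The line is off by less than a kilogram across the observed heights, but it predicts a negative weight at 0.5 meters while the assumed curve stays positive.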
The Constant Absorbs the Bias for the Regression Model
Now, let’s assume that all of the predictors in your model can reasonably equal zero and you specifically collect data in that area. You should be good to interpret the constant, right? Unfortunately, the y-intercept might still be garbage!
Part of the estimate for the y-intercept reflects relevant variables that you excluded from the regression model. When you leave relevant variables out, this can produce bias in the model. Bias exists if the residuals have an overall positive or negative mean. In other words, the model tends to make predictions that are systematically too high or too low. The constant term prevents this overall bias by forcing the residual mean to equal zero.
Imagine that you can move the regression line up or down to the point where the residual mean equals zero. For example, if the regression produces residuals with a positive average, just move the line up until the mean equals zero. This process is how the constant ensures that the regression model satisfies the critical assumption that the residual average equals zero. However, this process does not focus on producing a y-intercept that is meaningful for your study area. Instead, it focuses entirely on providing that mean of zero.
The constant ensures the residuals don’t have an overall bias, but that might make it meaningless.
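The sketch below shows this absorbing behavior with simulated data. Everything in it is an assumption for illustration: the "true" constant is 5, and `x2` is a made-up relevant variable (mean 4, coefficient 3) that I deliberately leave out of the fitted model. The predictor `x1` ranges down to zero, so the all-zero point is inside the data.

```python
import numpy as np

# Simulated data for illustration only.
rng = np.random.default_rng(1)
n = 500
x1 = rng.uniform(0, 10, size=n)  # predictor included in the model
x2 = rng.normal(4, 1, size=n)    # relevant predictor we leave out
y = 5 + 2 * x1 + 3 * x2 + rng.normal(0, 1, size=n)

# Design matrix: a column of ones (the constant) plus x1.
X = np.column_stack([np.ones(n), x1])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

residuals = y - X @ coef

print(f"estimated constant: {intercept:.2f} (value used in the simulation: 5)")
print(f"estimated slope for x1: {slope:.2f}")
print(f"mean of residuals: {residuals.mean():.2e}")
```

The residual mean comes out at zero up to floating-point rounding, but the estimated constant lands near 5 + 3 × 4 = 17 rather than 5, because it soaks up the average contribution of the omitted variable.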
Generally It Is Essential to Include the Constant in a Regression Model
The reason I just discussed explains why you should almost always have the constant in your regression model—it forces the residuals to have that crucial zero mean.
Furthermore, if you don’t include the constant in your regression model, you are actually setting the constant to equal zero. This action forces the regression line to go through the origin. In other words, a model that doesn’t include the constant requires the fitted value of the dependent variable to equal zero when all of the independent variables equal zero.
If this isn’t correct for your study area, your regression model will exhibit bias without the constant. To illustrate this, I’ll use the height and weight example again, but this time I won’t include the constant. Below, there is only a height coefficient but no constant.
Now, I’ll draw a green line based on this equation on the previous graph. This comparison allows us to assess the regression model when we include and exclude the constant.
Clearly, the green line does not fit the data at all. Its slope is nowhere close to being correct, and its fitted values are biased.
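You can reproduce the same kind of comparison with a short Python sketch. Again, the heights and weights are synthetic values I invented for illustration, not the data in the graph. The code fits one model with the constant and one without, then checks the residuals of the no-constant model separately for shorter and taller girls.

```python
import numpy as np

# Synthetic data for illustration only; NOT the dataset used in this post.
rng = np.random.default_rng(0)
height_m = rng.uniform(1.3, 1.7, size=80)
weight_kg = -100 + 100 * height_m + rng.normal(0, 3, size=80)

# Model with the constant: columns are [1, height].
X_full = np.column_stack([np.ones_like(height_m), height_m])
coef_full, *_ = np.linalg.lstsq(X_full, weight_kg, rcond=None)
resid_full = weight_kg - X_full @ coef_full

# Model without the constant: the line is forced through the origin.
X_origin = height_m.reshape(-1, 1)
coef_origin, *_ = np.linalg.lstsq(X_origin, weight_kg, rcond=None)
resid_origin = weight_kg - X_origin @ coef_origin

# Compare residuals for shorter versus taller girls.
short = height_m < np.median(height_m)

print(f"slope with constant:    {coef_full[1]:.1f}")
print(f"slope without constant: {coef_origin[0]:.1f}")
print(f"no-constant mean residual, shorter half: {resid_origin[short].mean():.1f}")
print(f"no-constant mean residual, taller half:  {resid_origin[~short].mean():.1f}")
```

Without the constant, the slope is far too shallow, and the model overpredicts weight for the shorter girls while underpredicting it for the taller ones.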
The bottom line on the constant is simple: you should almost always include it in your regression model even though it is almost never worth interpreting. The key benefit of regression analysis is determining how changes in the independent variables are associated with shifts in the dependent variable. Don’t think about the y-intercept too much!