How do you compare regression lines statistically? Imagine you are studying the relationship between height and weight and want to determine whether this relationship differs between basketball players and non-basketball players. You can graph the two regression lines to see if they look different. However, you should perform hypothesis tests to determine whether the visible differences are statistically significant. In this blog post, I show you how to determine whether the differences between coefficients and constants in different regression models are statistically significant.
Suppose we estimate the relationship between X and Y under two different conditions, processes, contexts, or other qualitative change. We want to determine whether the difference affects the relationship between X and Y. Fortunately, these statistical tests are easy to perform.
For the regression examples in this post, I use an input variable and an output variable for a fictional process. Our goal is to determine whether the relationship between these two variables changes between two conditions. First, I’ll show you how to determine whether the constants are different. Then, we’ll assess whether the coefficients are different.
Related post: When Should I Use Regression Analysis?
Hypothesis Tests for Comparing Regression Constants
When the constant (y intercept) differs between regression equations, the regression lines are shifted up or down on the y-axis. The scatterplot below shows how the output for Condition B is consistently higher than Condition A for any given Input. These two models have different constants. We’ll use a hypothesis test to determine whether this vertical shift is statistically significant.
Related post: How Hypothesis Tests Work
To test the difference between the constants, we need to combine the two datasets into one. Then, create a categorical variable that identifies the condition for each observation. Our dataset contains the three variables of Input, Condition, and Output. All we need to do now is to fit the model!
I fit the model with Input and Condition as the independent variables and Output as the dependent variable. Here is the CSV data file for this example: TestConstants.
Interpreting the results
The regression equation table displays the two constants, which differ by 10 units. We will determine whether this difference is statistically significant.
Next, check the coefficients table in the statistical output.
For Input, the p-value for the coefficient is 0.000. This value indicates that the relationship between the two variables is statistically significant. The positive coefficient indicates that as Input increases, so does Output, which matches the scatterplot above.
To perform a hypothesis test on the difference between the constants, we need to assess the Condition variable. The Condition coefficient is 10, which is the vertical difference between the two models. The p-value for Condition is 0.000. This value indicates that the difference between the two constants is statistically significant. In other words, the sample evidence is strong enough to reject the null hypothesis that the population difference equals zero (i.e., no difference).
The hypothesis test supports the conclusion that the constants are different.
Related posts: How to Interpret Regression Coefficients and P values and How to Interpret the Constant
Hypothesis Tests for Comparing Regression Coefficients
Let’s move on to testing the difference between regression coefficients. When the coefficients are different, a one-unit change in an independent variable is related to varying changes in the mean of the dependent variable.
The scatterplot below displays two Input/Output models. It appears that Condition B has a steeper line than Condition A. Our goal is to determine whether the difference between these slopes is statistically significant. In other words, does Condition affect the relationship between Input and Output?
Performing this hypothesis test might seem complex, but it is straightforward. To start, we’ll use the same approach for testing the constants. We need to combine both datasets into one and create a categorical Condition variable. Here is the CSV data file for this example: TestSlopes.
We need to determine whether the relationship between Input and Output depends on Condition. In statistics, when the relationship between two variables depends on another variable, it is called an interaction effect. Consequently, to perform a hypothesis test on the difference between regression coefficients, we just need to include the proper interaction term in the model! In this case, we’ll include the interaction term for Input*Condition.
Learn more about interaction effects!
I fit the regression model with Input (continuous independent variable), Condition (main effect), and Input *Condition (interaction effect). This model produces the following results.
Interpreting the results
The p-value for Input is 0.000, which indicates that the relationship between Input and Output is statistically significant.
Next, look at Condition. This term is the main effect that tests for the difference between the constants. The coefficient indicates that the difference between the constants is -2.36, but the p-value is only 0.093. The lack of statistical significance indicates that we can’t conclude the constants are different.
Now, let’s move on to the interaction term (Input*Condition). The coefficient of 0.469 represents the difference between the coefficient for Condition A and Condition B. The p-value of 0.000 indicates that this difference is statistically significant. We can reject the null hypothesis that the difference is zero. In other words, we can conclude that Condition affects the relationship between Input and Output.
The regression equation table below shows both models. Thanks to the hypothesis tests that we performed, we know that the constants are not significantly different, but the Input coefficients are significantly different.
By including a categorical variable in regression models, it’s simple to perform hypothesis tests to determine whether the differences between constants and coefficients are statistically significant. These tests are beneficial when you can see differences between models and you want to support your observations with p-values.
If you’re learning regression, check out my Regression Tutorial!
Would you please suggest me how to compare two curve (Relative luminosity vs time).
Hi Durgesh, I’m not exactly sure what information you need, but I have written a blog post about how to compare different curves to determine which one best fits your data. Maybe this is what you need? Curve Fitting using Linear and Nonlinear Regression
Thank you Jim for such an intuitive and efficient description (what many many expensive econometric books lack). There is one thing where I’m not sure in my own regression. Maybe you could help me?
I’m doing a panel univariate (gls) regression with two growth rates y on x (+ time effects). And like you have described in this post I want to test for a significant difference in the beta for two subsamples/conditions. My two conditions (A, B) are two different time periods. “Normally” from the separate two condition A and B regression and the full regression with the interaction term we should have (like in your example):
(from the full regression) beta_input + beta_input*condition = beta_input_B (from the separate B regression).
So we have exact betas (for A and B) regardless of taking them only from the separate A and B regression or taking/calculating both betas from full regression with the interaction term.
But when I controll for heteroscedasticity in my panel regression this equation is not true anymore.
So my actual question is: are the betas from the separate A and B regressions still “right” and the p-value for beta_input*condition in the full regression with interaction term still decides for this two betas whether they are statistically different or is the p-value for beta_input*condition only valid for the two (in my case with controlling for heteroscedasticity now slightly different) betas: beta_input and beta_input + beta_input*condition?
Hi again Jim,
This is a very interesting post. However, I am felling a little confuse. Related to the fist graph shown in this post, the regression model was running separately, right? Because you have to constant coefficient (One for Input versus Output for A condition and another for input versus Output for B condition)! Is the first equation refers to the results of this two models? Since that one model can not produce two coefficient for a single variable.
In relation to the second regression results (plotted), I see that Condition B is shown as part of independent variable…shouldn’t be Condition only (including A and B). Why is showing Condition B?
Another question is related to the interaction terms (I have read your post about it also), what you mean by Input*Condition (*) is not a multiplication since there are numerical and categorical variable, how is it incorporated in the model? There is some option in the software to incorporate it?
I hope you understand me, it seems confuse to me!
Waiting for your kind feedback,
Patrik Silva
Hi Patrik,
So, that first graph can represent two different possibilities. It can represent two separate regression models displayed on one graph. Maybe the analyst collected the data for the two conditions at different points in time? Or, it can represent a single regression model. A big point of this blog post is that sometimes analysts want to compare different models. Are the differences between models significant. So, let’s assume for the entire post that the analyst collected the data for each condition separately and originally fit separate models for each condition. At a later point, the analyst wants to determine whether the differences between the models are statistically significant. That sounds difficult. However, if you combine the two datasets into one and fit the model using an indicator variable and interaction terms as I describe, it’s very easy!
As for the two regression equations with different coefficients. Again, that can represent the two possibilities I describe above (separate models or models that use indicator variables and interaction terms). When you include an indicator variable (Condition in these examples), you’re adding a fixed amount of vertical height on the Y-axis to the fitted line. In this case, the effect of Condition B is 10 units. So, it shifts up the fitted regression line by 10 units on the Y Axis. You can represent this using a coefficient of 10 for Condition B or you can add those 10 units to the intercept for the Condition B model. They’re equivalent. The software I use automatically outputs both the separate models and the coefficients. However, it’s in the coefficients table where you can tell whether the effect of Condition is significant or not.
As for the difference in slope coefficients in the second example, that’s a similar situation but instead of depending on the indicator variable (Condition), it depends on the interaction term. An interaction indicates that the effect changes based on the value of another variable. This shows up in the graphs as different slopes, which corresponds to different slope coefficients in the output. Again, my software automatically outputs the equations for the separate models and the coefficients. That explains the different slope coefficients, but again, it is in the coefficients table where you can determine whether the difference between slopes is statistically significant.
The condition variable is categorical–A or B are the two levels. However, behind the scenes, statistical software have to represent it numerically and it does this by creating indicator variables. This type of variable is simply a column of 1s and 0s. A 1 indicates the presence of a characteristic while 0 indicates the lack of it. You used to have to create these variables manually but software does it automatically now. It’s all behind the scenes and you won’t see these indicator variables.
In this case, the software defined the indicator variable as the presence of condition B. For indicator variables, you must leave one level out of the model. The software left out condition A. One level must be excluded because it’s redundant to have one indicator variable say that an observation is condition A while another indicator variable says that the same observation is not condition B. That analysis won’t run because of the perfect correlation between those two independent variables. So, the software left out the indicator variable for condition A. However, it could’ve left out B instead and the results would’ve been the same. That’s why the output displays only Condition B. And, that’s how you multiply input*Condition. It’s really either the input value multiplied by 0 or 1 for each observation depending on whether the observation is of process A or B.
I hope that makes it clear!
Thank you again Jim.
Yes now its much clear,
You meant the “input” is multiplied by condition (0 or 1), meaning that basically the variable Input*Condition will be 0 when Input is multiplied by 0 (Condition) and Input when is multiplied by 1, right?
Thank you in advance!
Patrik
That’s correct. The value for the interaction term for each observation is basically either zero or the input value depending on whether Condition is A or B. The regression procedure uses this to determine the interaction effect.
Ok thanks Jim, it seems clear now!
Best regards!