How do you compare regression lines statistically? Imagine you are studying the relationship between height and weight and want to determine whether this relationship differs between basketball players and non-basketball players. You can graph the two regression lines to see if they look different. However, you should perform hypothesis tests to determine whether the visible differences are statistically significant. In this blog post, I show you how to determine whether the differences between coefficients and constants in different regression models are statistically significant.

Suppose we estimate the relationship between X and Y under two different conditions, processes, contexts, or other qualitative changes. We want to determine whether the difference affects the relationship between X and Y. Fortunately, these statistical tests are easy to perform.

For the regression examples in this post, I use an input variable and an output variable for a fictional process. Our goal is to determine whether the relationship between these two variables changes between two conditions. First, I’ll show you how to determine whether the constants are different. Then, we’ll assess whether the coefficients are different.

**Related post**: When Should I Use Regression Analysis?

## Hypothesis Tests for Comparing Regression Constants

When the constant (y-intercept) differs between regression equations, the regression lines are shifted up or down on the y-axis. The scatterplot below shows how the output for Condition B is consistently higher than Condition A for any given Input. These two models have different constants. We’ll use a hypothesis test to determine whether this vertical shift is statistically significant.

**Related post**: How Hypothesis Tests Work

To test the difference between the constants, we need to combine the two datasets into one. Then, create a categorical variable that identifies the condition for each observation. Our dataset contains the three variables of Input, Condition, and Output. All we need to do now is to fit the model!
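The combining step can be sketched in a few lines of Python. This is only an illustration: the measurements are made-up numbers, not the values in the TestConstants file, and `combine` is a hypothetical helper name.

```python
# Hypothetical measurements for two conditions (made-up numbers for illustration).
condition_a = [{"Input": 1.0, "Output": 12.1}, {"Input": 2.0, "Output": 14.3}]
condition_b = [{"Input": 1.0, "Output": 22.4}, {"Input": 2.0, "Output": 23.9}]


def combine(datasets):
    """Stack several datasets and tag each row with the condition it came from."""
    combined = []
    for label, rows in datasets.items():
        for row in rows:
            combined.append(
                {"Input": row["Input"], "Condition": label, "Output": row["Output"]}
            )
    return combined


data = combine({"A": condition_a, "B": condition_b})
for row in data:
    print(row)
```

Each row of the combined dataset now carries the three variables the model needs: Input, Condition, and Output.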

I fit the model with Input and Condition as the independent variables and Output as the dependent variable. Here is the CSV data file for this example: TestConstants.

### Interpreting the results

The regression equation table displays the two constants, which differ by 10 units. We will determine whether this difference is statistically significant.

Next, check the coefficients table in the statistical output.

For Input, the p-value for the coefficient is 0.000. This value indicates that the relationship between the two variables is statistically significant. The positive coefficient indicates that as Input increases, so does Output, which matches the scatterplot above.

To perform a hypothesis test on the difference between the constants, we need to assess the Condition variable. The Condition coefficient is 10, which is the vertical difference between the two models. The p-value for Condition is 0.000. This value indicates that the difference between the two constants is statistically significant. In other words, the sample evidence is strong enough to reject the null hypothesis that the population difference equals zero (i.e., no difference).
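As a sketch, the model behind this test can be written out as equations for the mean of Output. The symbols are illustrative notation, and the 0/1 coding with Condition A as the baseline level is an assumption about how the software codes the variable.

```latex
\begin{aligned}
\text{Output} &= \beta_0 + \beta_1\,\text{Input} + \beta_2\,C,
  \qquad C = \begin{cases} 0 & \text{Condition A} \\ 1 & \text{Condition B} \end{cases} \\
\text{Condition A:}\quad \text{Output} &= \beta_0 + \beta_1\,\text{Input} \\
\text{Condition B:}\quad \text{Output} &= (\beta_0 + \beta_2) + \beta_1\,\text{Input} \\
H_0 &: \beta_2 = 0
\end{aligned}
```

Under that coding, the estimate of β2 is the Condition coefficient (10 in this example), and its p-value tests whether the two constants differ.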

The hypothesis test supports the conclusion that the constants are different.

**Related posts**: How to Interpret Regression Coefficients and P values and How to Interpret the Constant

## Hypothesis Tests for Comparing Regression Coefficients

Let’s move on to testing the difference between regression coefficients. When the coefficients are different, it indicates that the slopes are different on a graph. A one-unit change in an independent variable is related to varying changes in the mean of the dependent variable depending on the condition or characteristic.

The scatterplot below displays two Input/Output models. It appears that Condition B has a steeper line than Condition A. Our goal is to determine whether the difference between these slopes is statistically significant. In other words, does Condition affect the relationship between Input and Output?

Performing this hypothesis test might seem complex, but it is straightforward. To start, we’ll use the same approach for testing the constants. We need to combine both datasets into one and create a categorical Condition variable. Here is the CSV data file for this example: TestSlopes.

We need to determine whether the relationship between Input and Output depends on Condition. In statistics, when the relationship between two variables depends on another variable, it is called an interaction effect. Consequently, to perform a hypothesis test on the difference between regression coefficients, we just need to include the proper interaction term in the model! In this case, we’ll include the interaction term for Input*Condition.
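Written out for the mean of Output, a model with the interaction term can be sketched like this. The symbols are illustrative notation, and coding Condition A as 0 and Condition B as 1 is an assumption about the software.

```latex
\begin{aligned}
\text{Output} &= \beta_0 + \beta_1\,\text{Input} + \beta_2\,C + \beta_3\,(\text{Input} \times C),
  \qquad C = \begin{cases} 0 & \text{Condition A} \\ 1 & \text{Condition B} \end{cases} \\
\text{Condition A:}\quad \text{Output} &= \beta_0 + \beta_1\,\text{Input} \\
\text{Condition B:}\quad \text{Output} &= (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\,\text{Input} \\
H_0 &: \beta_3 = 0
\end{aligned}
```

With this coding, β3 is the difference between the two slopes, so the p-value for the interaction coefficient is the test of whether the slopes differ.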

Learn more about interaction effects!

I fit the regression model with Input (continuous independent variable), Condition (main effect), and Input*Condition (interaction effect). This model produces the following results.
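For readers who want to try a fit like this themselves, here is a minimal sketch that uses only the Python standard library. Several things in it are assumptions: the data are synthetic (not the TestSlopes file), the true coefficients 9.0, 1.5, and 2.0 are invented, `fit_ols` is a hypothetical helper, and the p-values use a normal approximation to the t-distribution. In practice you would let your statistical software do this.

```python
import math
import random


def fit_ols(X, y):
    """Least squares via the normal equations; returns (coefficients, standard errors)."""
    n, k = len(X), len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    # Invert X'X with Gauss-Jordan elimination on the augmented matrix [X'X | I].
    aug = [row + [float(i == j) for j in range(k)] for i, row in enumerate(xtx)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    inv = [row[k:] for row in aug]
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    residuals = [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]
    sigma2 = sum(e * e for e in residuals) / (n - k)  # residual variance
    se = [math.sqrt(sigma2 * inv[i][i]) for i in range(k)]
    return beta, se


# Synthetic data (NOT the post's TestSlopes file): same intercept, different slopes.
rng = random.Random(1)
X, y = [], []
for cond, slope in ((0, 1.5), (1, 2.0)):  # 0 = Condition A, 1 = Condition B
    for i in range(40):
        x = 1 + 0.5 * i
        # Columns: constant, Input, Condition, Input*Condition
        X.append([1.0, x, float(cond), x * cond])
        y.append(9.0 + slope * x + rng.gauss(0, 1.5))

beta, se = fit_ols(X, y)
names = ["Constant", "Input", "Condition", "Input*Condition"]
for name, b, s in zip(names, beta, se):
    t = b / s
    p = math.erfc(abs(t) / math.sqrt(2))  # two-sided p-value, normal approximation
    print(f"{name:16s} coef={b:8.3f}  se={s:6.3f}  t={t:7.2f}  p~{p:.4f}")
```

The Input*Condition row estimates the difference between the two slopes, and the Condition row estimates the difference between the two constants.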

### Interpreting the results

The p-value for Input is 0.000, which indicates that the relationship between Input and Output is statistically significant.

Next, look at Condition. This term is the main effect that tests for the difference between the constants. The coefficient indicates that the difference between the constants is -2.36, but the p-value of 0.093 is greater than the usual significance level of 0.05. Because this result is not statistically significant, we can’t conclude the constants are different.

Now, let’s move on to the interaction term (Input*Condition). The coefficient of 0.469 represents the difference between the Input coefficients for Condition A and Condition B. The p-value of 0.000 indicates that this difference is statistically significant. We can reject the null hypothesis that the difference is zero. In other words, we can conclude that Condition affects the relationship between Input and Output.

The regression equation table below shows both models. Thanks to the hypothesis tests that we performed, we know that the constants are not significantly different, but the Input coefficients are significantly different.

By including a categorical variable in regression models, it’s simple to perform hypothesis tests to determine whether the differences between constants and coefficients are statistically significant. These tests are beneficial when you can see differences between models and you want to support your observations with p-values.

If you’re learning regression, check out my Regression Tutorial!

Jeremy says

Thanks for that, Jim. DVs are continuous. To be clear, it is one sample. Each person in the sample has values on the two DVs I mentioned. The IV (a binary variable) is used to predict the DVs in two separate models.

I want to know if the strength of the association between the IV and the two distinct DVs is significantly different.

Thanks again for your help – I really appreciate it.

Jim Frost says

Is it just the one IV? If so, you could just use a 2-sample t-test for each DV. Although, if it’s just the one IV, you can’t use an interaction effect. I was thinking there was an additional condition variable.

Jeremy says

Hi Jim:

Thank you for this nice article. It seems quite clear to me. I am trying to wrap my head around a specific question and wanted to run it by you.

I am running separate models where my main IV of interest in each is a binary variable (it is dummy coded). The outcomes are different types of stress.

I want to test whether the binary variable is more strongly associated with some forms of stress versus others.

What I did was I created a new variable called “stress score”. Condition 1 was one type of stress, and condition 2 was another type of stress (I called the variable “Type”)

Then I basically doubled my sample size (because each participant has a score for Condition 1 and Condition 2)… I effectively copied the group variable so it repeats.

This doesn’t seem correct, but I’m not sure how to follow the logic above on to my example. You do say to combine the datasets, which would lead to doubling the sample size if I’m following.

Thank you in advance for your insights.

Jim Frost says

Hi Jeremy,

What type of variable is your outcome (DV) variable? Is it continuous, ordinal, categorical, or binary? That’ll affect the type of regression analysis you can use and it’s not clear to me the nature of your DV.

But, in general, yes, you’d want to collect a certain sample size for each condition on your IV side. It doesn’t have to be exactly double when you have two conditions. But, you want good numbers in both conditions, and you maximize the power for your total sample size when the condition group sizes are equal. If it worked out that it wasn’t exactly double, it shouldn’t be a problem as long as you have sufficient numbers in the smallest group.

I hope that helps!

Yunfei says

Hi Jim,

Your articles are excellent! I have been learning a lot about statistics from you.

I am wondering if you could help me with some questions please. Recently I used the method you posted here to compare regression slopes and intercepts simultaneously in a manuscript. However, I have received a comment from the reviewer suggesting that a method called ‘overall test for coincidence’ shall be used. I quote his comment here: ‘With regards to the slope comparisons, the authors might consider an overall test for coincidence. Here you would assess whether a single equation fit the data better than multiple equation fits. This would support the premise that these relationships stay true, regardless of groups. ‘

I search in the literature but found very limited information about this ‘overall test for coincidence’. I am wondering what this test is and how it is different from the method in your article?

Also, in the second scenario in this article of yours, my understanding is that the constants are compared at X=0, so the means of the Output of the two groups are not compared in the model. My question is, in situations like this, what is the point in reporting the comparison of the constants. I am asking because I reported only the comparison of slopes in my manuscript but the reviewer asked why did not I compare intercepts.

A final question would be, the method you used in this article to compare slopes and intercepts, shall we call it ANCOVA or general linear model?

Many thanks! Your help would be very much appreciated.

Best

Yunfei

kubra says

How can I compare two curves in graph? Which analyses I should use? Thank you

P M says

Thanks for the clarification. It sounds like checking the interaction plot is always key.

I liked your post on failing to reject the null hypothesis. It’s very clear and will help me not make the mistake of “proving a negative” when interpreting p-values.

Take care,

-PM

Jim Frost says

Hi,

I’m a huge fan of interaction plots. They can make sense out of potentially confusing interaction coefficients. One glance and you know what is happening! I’m glad the other post was helpful too!

P M says

Hi Jim,

Thanks for an excellent website, I’m learning so much from your clear explanations and examples.

To see if I understand how to compare regression coefficients using interaction terms, I’m hoping you can tell me if this following example would be a correct way of dealing with three conditions instead of just two:

Let’s say we want to know if the satisfaction we get from eating out (Output) is affected by the price we pay for a dish (Input), and we want to know if this relationship is the same across courses (appetizers, entrees, and desserts). So we treat the three courses as Conditions and we fit three Input Output models, one for each of the three Conditions. Let’s say that inspecting the scatterplot we find the slopes all look different.

Now we want to compare the coefficients of the three models. Would it be a good idea to use interaction terms in this three-condition situation, creating indicator variables for Condition A (appetizers, say) and Condition B (entrees), leaving out Condition C to avoid perfect correlation (https://statisticsbyjim.com/regression/ols-linear-regression-assumptions/) and fitting a regression that includes the interaction terms Input * Condition A and Input * Condition B?

If the interaction approach is a valid one in this case, I wonder about the interpretation of results. Let’s say the coefficients of the two interaction terms have p-values above our significance level (say they are larger than 0.05). Can we conclude that Condition does not affect the relationship between Input and Output? What if (assuming it’s possible) the p-value is above 0.05 for the coefficient of Input * Condition A (appetizers) but less than 0.05 for Input * Condition B? Can we then conclude that Condition affects the relationship between Input and Output when the dishes are entrees and desserts but not when they are appetizers?

Jim Frost says

Hi,

Yes, that approach is a good one. However, the interaction term is still just Condition*Input. It’s just that in your example condition has three values instead of just two. You’d have to leave one out for the reference level as you indicate. Choose the baseline value (the one you leave out) based on what makes the most sense for the study area/research question. If the interaction term is significant, create an interaction plot to get a clear picture of what the interaction means.
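To make the three-condition setup concrete, here is a small sketch of how the indicator and interaction columns could be built by hand. The function name `design_row`, the price of 12.5, and the choice of desserts as the baseline are all hypothetical. Most statistical software creates these columns for you when you add a categorical variable and its interaction.

```python
def design_row(input_value, condition, levels, baseline):
    """Build one model row: constant, Input, one indicator per non-baseline
    level, then Input multiplied by each of those indicators."""
    indicators = [
        1.0 if condition == level else 0.0 for level in levels if level != baseline
    ]
    interactions = [input_value * ind for ind in indicators]
    return [1.0, input_value] + indicators + interactions


levels = ["appetizer", "entree", "dessert"]

# Hypothetical choice of baseline; pick whatever suits the research question.
entree_row = design_row(12.5, "entree", levels, baseline="dessert")
dessert_row = design_row(12.5, "dessert", levels, baseline="dessert")

print(entree_row)   # [1.0, 12.5, 0.0, 1.0, 0.0, 12.5]
print(dessert_row)  # [1.0, 12.5, 0.0, 0.0, 0.0, 0.0]
```

The baseline level has zeros in every indicator and interaction column, so its line is described by the constant and the Input coefficient alone. The other levels are expressed as differences from it.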

The question of what a p-value greater than a significance level indicates is a different matter. I’ve written a post about failing to reject the null hypothesis, which you should read. In a nutshell, you can’t prove a negative. Instead, you have insufficient evidence to conclude that an effect exists, which is different than proving it doesn’t exist. Read that other post for more details.

I hope this helps!

Federico says

Hello Jim,

Great explanation! I have a question. Let’s say that I have a metal pipe that I want to study its corrosion. I have 3 pipes that I bury in mud and 3 pipes that I bury in wet sand. Every month I measure the corrosion in each pipe using the weight of the pipe as the variable. So, at the end of the experiment of 6 months I will average the values for each month for the 3 pipes in the mud and the 3 pipes in the wet sand. So I will have 6 data points for mud with its average and SD and 6 data points for wet sand with its average and SD. I perform the least squares to have the linear regression for each group. What test can I use to compare whether the corrosion of the pipes in mud or wet sand is statistically significantly different? I can see that the SD bars in the plot overlap and I know that they are not statistically different, but how can I formally do it? How can I know the P-value to know how statistically similar are they?

Thank you very much!

Jim Frost says

Hi Federico,

That sounds more like a 2-sample t-test. However, having only three samples per group is tiny. In fact, if you’re not sure that your data follow the normal distribution (which would be impossible to test with such small sample sizes), you’d probably need to use the Mann-Whitney nonparametric test for the medians.

Sup says

Hello Jim. Thank you for this helpful post. I have a question. I am trying to compare the relationship between 2 independent variables and 1 dependent variable under 3 different conditions. So, if I need to compare the regression lines of these 3 relationships, say with a hypothesis test for the y-intercept, how do I set up the condition variable that you have used in the above case?

Giulia says

Hello,

I am sorry if my question is stupid but it’s the first time I try this and I’m not sure how to carry out this fitting.

First of all, I am using Origin. I have an input (i.e. concentration), an output (i.e. absorbance) and four conditions (which I marked as A, B, C, D). I want to know if their slope is statistically different. I ordered them like in the excel you kindly provided. But I still don’t understand:

1- which hypothesis test are you referring to? Or do you mean the statistic reported during the fitting? but you didn’t fit all of them together in the pictures you showed…

2- how is it possible that a condition marked with a letter is recognized by the system during the test?

I probably misunderstood the meaning of the post, but you would be very kind if you could explain to me what to do…

Hope everything is well; eventually, thanks in advance.

Mathias Mericskay says

Thank you very much Jim, very counterintuitive to me, but intuition is not science, so I’ll follow your expert advice. I understand the point that there is a difference but maybe I would have understood better a chi2 comparison or some other test that would indeed conclude that the distribution of the biomarker level by various age ranges differs between the males and females. The conclusion would be a bit the same but the postulate less strong maybe.

Mathias Mericskay says

Dear Jim,

I have been reading with interest all the posts, very clear explanations, thanks for that.

Regarding the comparison of 2 linear regression lines, I have a question on which a collaborator and I disagree. He compared two regression lines, which are the level of a blood biomarker as a function of age in males and females. He finds they are different with p<0.05, but each of the regression lines is itself not significant, i.e. the slope is not different from 0, with p=0.1 for one line and 0.21 for the other. For me it does not make sense to compare two regression lines in that condition, but I cannot find a simple reference stating that it is a required condition for the comparison of 2 linear regression lines. I need support to nail my point 🙂

Thanks in advance if you read that, and even better if you have a reference in mind.

Best wishes

Jim Frost says

Hi Mathias,

Sorry, but I agree with your collaborator. Under the right circumstances, it is possible for the slopes of two regression lines to each be not significantly different from zero yet be significantly different from each other. To visualize that condition, imagine that each slope differs from zero by an insignificant amount but in opposite directions. If you add the two insignificant differences together, the total distance between the slopes can be significant. That condition does provide some meaningful information. It’s true that each regression line is not significantly different from zero. However, you have enough evidence to conclude they’re different from each other. It is kind of a muddled interpretation but it does indicate that there is a significant difference. Ultimately, you’d probably want to take the results as indicating further study is required. I don’t see any reason not to make the comparison between regression lines. However, the fact that each one is not significant by itself does weaken the findings in my opinion.
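A quick numeric illustration of that situation, using hypothetical slope estimates and standard errors (not Mathias’s data). It assumes the two groups are independent, so the standard error of the difference is the square root of the sum of the squared standard errors, and it treats a t-value of roughly 2 as the cutoff for significance, which is only a rule of thumb for reasonably large samples.

```python
import math

# Hypothetical slope estimates and standard errors for two independent groups.
slope_1, se_1 = 0.50, 0.30
slope_2, se_2 = -0.50, 0.30

# Test each slope against zero.
t_1 = slope_1 / se_1
t_2 = slope_2 / se_2

# Test the difference between the slopes (independent groups assumed).
se_diff = math.sqrt(se_1**2 + se_2**2)
t_diff = (slope_1 - slope_2) / se_diff

print(round(t_1, 2), round(t_2, 2), round(t_diff, 2))  # 1.67 -1.67 2.36
```

Neither slope clears the rough cutoff of 2 on its own, but the difference between them does, because the slopes point in opposite directions.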

It’s possible that given the sample size, your analysis didn’t have the statistical power necessary to detect the smaller difference between each line and zero but it did have enough power to detect the larger difference between the two lines. Additional studies with larger sample sizes will help clarify the results.

I hope that helps!

Marc says

Hi Jim, thank you for your post! if I am trying to compare the beta coefficients of two continuous variables in the same model, can I still interpret the interaction term of the two continuous variables as I would that of a categorical*continuous variable? So if my IVs are A (continuous), B(continuous) and A*B with dependent variable C (continuous) will the P-value for A*B still tell me whether or not the slopes AC and BC are significantly different? Thank you for any insights!!!

Katharina says

Hey Jim, thank you for this interesting post! I really like your blog and the way you explain complex topics. But I’m not sure, if I made the right conclusion for my data. I try to compare the y-intercepts for two data sets under different conditions (linearized cumulative Weibull-distribution). I do not want to compare the slope. When I use the first approach, the slopes are set at the same value for both data sets. But when I perform linear regressions with my data points, I observe different slopes. Does this have an impact on my result? While using the second way the regression equations are calculated with the “correct” y-intercepts and slopes. So, my question is: is the resulting p-value for condition only related to the difference in the y-intercept? And can I disregard the results for Input*Condition for my purpose?

Or is there actually no difference to the first approach?

Kind regards,

Katharina

Jim Frost says

Hi Katharina,

I’m not sure that I understand what you’re comparing. Are you saying your data follow the Weibull distribution? That doesn’t sound like a condition you compare unless you’re meaning something else?

In general, if you want to compare intercepts, you need to include one or more indicator variables that represent the different conditions. The coefficients and p-values for the indicator variable(s) tell you the differences between intercepts and the significance of those differences. I cover that in this post.

Patrick says

Dear Jim, Sorry for a stupid question. I have downloaded your dataset and worked on it with JMP in an attempt to reproduce your results. (In case you know JMP, I did the following: Analyze>Fit Model> Set “Output” as response. Added “Input”, “Condition”, “Input*Condition” under Construct Model Effects in the software. My results are:

| Term | Estimate | Std Error | t Ratio | Prob > \|t\| |
|---|---|---|---|---|
| Intercept | 7,919316 | 0,692899 | 11,43 | <,0001 |
| Input | 1,770479 | 0,058164 | 30,44 | <,0001 |
| Condition[A] | -1,28327 | 0,327306 | -3,92 | 0,0002 |
| (Input-10,5)*Condition[A] | -0,23457 | 0,058164 | -4,03 | 0,0001 |

Not sure why this deviates from your results. Where did I go wrong?

Jim Frost says

Hi Patrick,

I haven’t run this analysis differently to verify what I’m about to say but I’m nearly 100% sure it’s accurate!

Based on the fact that your output displays Condition A, I can assume that JMP is using Condition B as the baseline level. However, my analysis uses Condition A as the baseline level. The gist of the results should be the same. This harkens back to how least squares regression incorporates categorical variables by coding them as one or more indicator (binary) variables–which I describe in detail in my ebook about regression analysis. In a nutshell, JMP is essentially using a single, binary column/variable of Condition A, where 1s represent condition A and 0s represent not-condition A. My software is doing the reverse where it is using one column for Condition B where 1s represent condition B and 0s represent not-condition B.

So, you’re not doing anything wrong. The results should say the same thing even though the numbers will be different. If you want to get the same results, tell JMP to use Condition A as the baseline level.
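The arithmetic below is one way to check that Patrick’s table and the post’s results describe the same two lines. It rests on an assumption that is not confirmed in this thread: that JMP coded Condition as +1 for A and -1 for B and centered Input at 10.5 in the interaction term, as the label `(Input-10,5)*Condition[A]` suggests. The helper name `line_for` is hypothetical.

```python
# Estimates copied from Patrick's JMP table (decimal commas converted to points).
intercept = 7.919316
input_coef = 1.770479
cond_a = -1.28327    # Condition[A]
inter_a = -0.23457   # (Input-10.5)*Condition[A]
center = 10.5        # centering value shown in the interaction label


def line_for(code):
    """Constant and slope for a condition coded +1 (A) or -1 (B).
    The +1/-1 coding is an assumption about JMP, not stated in the thread."""
    slope = input_coef + inter_a * code
    const = intercept + cond_a * code - inter_a * code * center
    return const, slope


const_a, slope_a = line_for(+1)
const_b, slope_b = line_for(-1)

print(f"A: Output = {const_a:.3f} + {slope_a:.3f}*Input")
print(f"B: Output = {const_b:.3f} + {slope_b:.3f}*Input")
print(f"B - A: constant {const_b - const_a:.3f}, slope {slope_b - slope_a:.3f}")
```

Under that assumption, the differences between the two lines come out to about -2.36 for the constants and 0.469 for the slopes, which match the Condition and Input*Condition coefficients reported in the post.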

Mohammad Mohaghegh Faghih says

Thanks for your reply!

Actually we are measuring pressure decay inside pressurized bags of different model (model here is co-variance (?) ) over time. So the independent variable is time (horizontal axis) while vertical axis denotes the dependent variable, which is pressure in our case.

The thing is that the data points for the independent variable (time) are not the same for different bag models (due to the different sampling rate we used each time for each model of the bag).

The pressure decay sort of behaves like a decreasing exponential curve.

Now I need to test if the pressure decay in different models of the bags is significantly different or not.

Thanks

Mohammad Mohaghegh Faghih says

Thanks for your post.

If I am using Minitab, what method should I use to statistically compare two scatter-plots to see if they are significantly different?

Remember that, similar to your example, the independent variables across the two scatter-plots are NOT the same.

Thanks

Jim Frost says

Hi Mohammad,

Assuming that linear regression is appropriate for your data, you’d use that analysis just like I do in these examples. Additionally, to be meaningful, you’d need at least one IV to be the same. Otherwise, you’re just using a completely different set of IVs to fit completely different regression models for your DV.

Just add the indicator variables and interaction terms as I show in this blog post to suit your data. You might need to combine datasets and create indicator variables to reflect the different conditions/datasets. That depends on how you collected and recorded your data originally.

Best of luck with your analysis!

Naama says

Ok great, I understand…

Thanks again,

Naama

Naama says

Hi Jim,

Thank you very much for your post, it helped me a lot!

I would appreciate your advice regarding my case:

When I compared the constants of my two regression lines (according to your example), I got a result that the two y-intercepts differ significantly. However, when I compared the two slopes (the interaction) it was not significant but now condition B was also not !!

So my question is whether I can say that because the interaction is not significant I can remove it and stay with the significant difference between the constants?

I hope I was clear,

Thanks 🙂

Naama

Jim Frost says

Hi Naama,

Here’s what I think is happening. When you fit the two regression lines with the indicator variable with and without the interaction term, you end up with two different pairs of constants. So, it’s not surprising that the difference between one pair is statistically significant while the other difference is not.

In general, yes, I agree with you. If the interaction term is not significant, you usually don’t include it. Consequently, you can conclude that the difference between the two constants for the model without the interaction is statistically significant. I do have one caveat. If theory strongly suggests that you should include that interaction, you’d consider leaving it in even when it’s not significant. But, generally you’d remove it from the model.

Thanks for the great question. I hope that helps! I’m glad that you found this post to be helpful!

Crystal Santos says

Hi Jim,

I’m looking for a way to statistically compare two relationships in a multiple regression. My hypothesis is that one predictor will have a stronger effect on the DV than the other. It seems simple enough but I’m struggling to find out how to do this. Can you point me in the right direction?

Jim Frost says

Hi Crystal,

Determining which predictor has a stronger effect, or is more important, is a more complicated issue than you might think! Fortunately, I’ve written a blog post all about that! Please read, Identifying the Most Important Variables in Regression Models. It should answer your questions!

Best of luck with your analysis!

Atanas Gumbo says

Hie Jim. I want to determine value relevance of negative earnings (losses) on EBIT and book value. My panel dataset has 30 companies and I did my regressions using System GMM estimators in Stata on this dataset. I then divided the dataset into profit-making and loss-making companies and ran the same regressions as before. Unfortunately the loss-making sample became too small to run meaningful regressions (observations were lost during conversion of EBIT to logs because of negative earnings).

Is it possible to determine the impact of the loss-making sample by comparing regression results of the full sample to those of the profit-making sample, say using an F-test? Unfortunately, the profit sample is already part of the full sample, so combining the two datasets as described in this post does not work. Is there another way of comparing the regressions run from two datasets, where one is a subset of the other?

Jim Frost says

Hi, I might not be completely understanding your goal. But, it sounds like you just need to add an indicator variable in your dataset that identifies whether the observation is from the profit sample or loss sample. You can use 1 to represent profit and 0 to represent loss. Then you can include the indicator variable as both a main effect and an interaction effect with the various predictors as I show in this post. This process will help you determine whether the differences between the intercepts and coefficients for the two groups are statistically significant.

Best of luck with your analysis!

Abotiyuure Gray elvis says

Great job Jim. I’m impressed with your presentation.

I need some clarifications. I have an interaction between a continuous variable and categorical variable that has 6 categories. The interaction term is jointly significant and I read somewhere that I need to calculate the marginal effects, which I did. However, I’m stuck with how to interpret the marginal effects.

Any help is much appreciated. Thanks.

Pamela Marcum says

” … if summer is 1 and winter is 3, the analysis would assume that 3*summer = winter!” LOL, I think that one qualifies as the “awesome quote of the week” (for geeks, of course)!! Thanks again! And thanks for the kind words 🙂

Jim Frost says

That’s too funny. Geek humor is the best! BTW, I sent you an email.

Pamela Marcum says

Thanks for that explanation, Jim. I now understand what I was completely missing … I wasn’t thinking in terms of each “Fall”, “Winter”, “Spring” and “Summer” as essentially being individual “indicators” with their own columns of data … where the “bit” that is flipped to “1” indicates the applicable category for that data entry. So the “levels” that you were referring to earlier are these different columns of 0’s and 1’s (e.g, the “fall”, “winter”, and so on columns). When I asked the question, I naively thought of “seasons” as a single indicator (one column of data) that would take on a value of 0, 1, 2 or 3 for Fall, Winter, Spring, Summer, respectively (or however one wanted to assign those values) and assumed the magic of linear regression math would work out the coefficients to accommodate these arbitrary season value assignments. I now see that I was completely wrong in this thinking! Trying to extend a 2-category model to 3+ categories is where I went astray. (I am writing these details in case it helps other newbies like me who missed this initial critical concept!). Thanks again, Jim, for your delightful patience and tremendous assistance.

Jim Frost says

I don’t think it’s intuitive at all until you see it in action. You need to create a number of variables to describe one categorical variable! But, it all works together. And, you can’t use numbers to represent the seasons. For example, if summer is 1 and winter is 3, the analysis would assume that 3*summer = winter! It also suggests that the difference between summer and winter is twice the difference between summer and Fall! You can’t treat categorical variables as integers for those reasons. Categorical variables deal with the presence or absence of characteristics that you can’t (or just aren’t) measuring numerically.

Again, you don’t have to worry about this issue with most modern statistical software. You can just include a categorical variable in the model. But, back in the day, I did have to create indicator variables! Now, you know what is going on behind the scenes!

You’re very welcome, Pamela! By the way, I’ve read your blog, and you’re a very good writer! Take care!

Pamela Marcum says

Hey Jim, you say in the above reply to Patrik’s question: “For indicator variables, you must leave one level out of the model. … One level must be excluded because it’s redundant to have one indicator variable say that an observation is condition A while another indicator variable says that the same observation is not condition B.”

I’ve come across similar explanations on a few other sites. Clearly there is something that I am not understanding as well as I thought: if an observation can only have a condition of “0” versus “1”, how is that condition more than one level, and what does it mean to leave out one of the levels in the model? Let’s take the example of the binary condition regarding whether one is a pet-owner or not (“Yes” versus “No”, respectively). The “Yes” gets turned into a “1” and the “No” gets turned into a “0”. There is only a single number … a “0” or a “1”. How in this syntax is a level excluded? Isn’t the interaction term dealing with “Condition” that can either have a value of “0” or “1” … and you need both of those values to properly evaluate the linear regression equation? It’s the “leave one level out of the model” phrase that is utterly confusing to me, because in my mind I am translating that phrase as “leave out all the data from the model that is associated with a Condition=0”, which obviously must not be the case.

On another related note, how does one assign value to conditions that are more than just binary … maybe an example would be what kind of pet you own (“N”=no pet, “C”=cat, “D”=dog, etc). Would you do something like: 0=no pet, 1=cat, 2=dog, etc? (The arbitrariness of the number assignment and possible unintended consequences of the larger valued conditions giving more prominent “weight” to terms in the regression equation makes me nervous).

Jim Frost says

Hey Pamela,

If you have just one binary column of data that defines the presence or absence of a condition, that’s fine. The rule comes into play for the second part of your question, when you want to have a categorical variable that is NOT binary.

So, let’s use an example of the seasons of the year: Spring, Summer, Winter, and Fall. Those four values are the levels of the categorical variable “Seasons.” To include Seasons in the regression model, we need to create a binary indicator variable for each level of the categorical variable. So, we have a column for Spring, which would contain 1s and 0s to define whether each specific observation occurred in the spring. Same for the other three seasons. Now, we need to include them in the model, but this point is where we have to leave one of them out. Why? One of the assumptions for OLS regression is that there is no perfect correlation amongst the predictor variables. When there is perfect correlation, the OLS procedure won’t even run.

If you include all four indicator variables, there is perfect correlation. To illustrate this, assume that we’re looking at a Spring observation. That observation has a 1 in the Spring column. However, it also has 0s in all of the 3 remaining indicator variable columns. In other words, if you see 0s in Summer, Fall, and Winter, you know that it’s a Spring observation even if you don’t see the Spring value. You can use these other three columns to perfectly predict the fourth column. That’s perfect correlation. So, you have to take one column out. But, you’re not losing any of the information because that column was redundant to begin with. Say we take out the Spring column. Both we and the regression model will still know which observations are in the Spring because they’ll have 0s for the other three seasons.
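To see that redundancy concretely, here is a minimal Python sketch. The list of seasons is made up purely for illustration; the point is only that the four indicator columns always sum to 1 (the same as the constant column), so any one of them can be rebuilt from the other three.

```python
# Hypothetical observations: the season each one was recorded in.
seasons = ["Spring", "Summer", "Fall", "Winter", "Spring", "Fall"]
levels = ["Spring", "Summer", "Fall", "Winter"]

# One 0/1 indicator column per level of the categorical variable.
indicators = {lvl: [1 if s == lvl else 0 for s in seasons] for lvl in levels}

# Every row has exactly one 1, so the four columns always sum to 1,
# which duplicates the constant (intercept) column.
row_sums = [sum(indicators[lvl][i] for lvl in levels) for i in range(len(seasons))]

# Spring can be rebuilt perfectly from the other three columns.
spring_rebuilt = [
    1 - indicators["Summer"][i] - indicators["Fall"][i] - indicators["Winter"][i]
    for i in range(len(seasons))
]

print(row_sums)                               # [1, 1, 1, 1, 1, 1]
print(spring_rebuilt == indicators["Spring"])  # True
```

Dropping any one of the four columns removes the exact dependence while keeping all of the information.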

Statistically, it doesn’t matter which indicator variable you leave out. It won’t affect the statistical significance. It does affect the regression coefficients but in a logically consistent manner. The coefficient for each indicator variable represents the mean difference between each level and the omitted level. For example, suppose the coefficient for Summer is 1.5 and we leave out Spring. This indicates that the Summer observations are an average of 1.5 units higher than Spring observations. If we left out Summer instead, Spring would have a coefficient of -1.5 because it is an average of 1.5 units less than Summer. So, it doesn’t really matter which one you leave out. If there is a natural baseline, comparison, or control level, the results are more intuitive to interpret if you leave that column out of the model.
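A small Python sketch of that sign flip, under a simplifying assumption: the model contains only the season indicators and no other predictors. In that special case the least-squares constant is the mean of the omitted level and each coefficient is a plain difference in group means. The numbers and the helper `indicator_coefficients` are hypothetical, chosen so that Summer is 1.5 units above Spring.

```python
from statistics import mean

# Hypothetical outcome values grouped by season (made up for illustration).
data = {
    "Spring": [10.0, 12.0],
    "Summer": [11.5, 13.5],
    "Fall": [9.0, 10.0],
}

def indicator_coefficients(groups, reference):
    """Constant and coefficients for a model that contains only indicators.

    The constant is the mean of the omitted (reference) level, and each
    coefficient is that level's mean minus the reference mean.
    """
    base = mean(groups[reference])
    coefs = {lvl: mean(vals) - base for lvl, vals in groups.items() if lvl != reference}
    return base, coefs

const_spring, coefs_spring = indicator_coefficients(data, "Spring")
const_summer, coefs_summer = indicator_coefficients(data, "Summer")

print(coefs_spring["Summer"])  # 1.5  (Spring omitted)
print(coefs_summer["Spring"])  # -1.5 (Summer omitted)
```

With additional predictors in the model, the coefficients become adjusted mean differences, but the same change of reference level still only re-expresses the same fit.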

A couple of notes. I was going to use your Cat and Dog example, but that’s not necessarily mutually exclusive so it doesn’t work. Someone can own both cats and dogs. The levels of categorical variables must be mutually exclusive. And, most modern statistical software will do all of this coding for you behind the scenes. Back in the day, you had to create these indicator variables yourself, but now you shouldn’t need to. The most you might need to do is to specify which level to use as the reference or baseline level–and the software will leave that level out of the model.

Pamela Marcum says

Very illustrative example you provided — thanks again!!

Pamela Marcum says

I just realized that any cond^2 term is unnecessary when “cond” is either a “0” or “1”, in which case “cond^2” is redundant with the “cond” term! My question still remains, though, if one simply sets the “f” coefficient in the above examples to zero!

Pamela Marcum says

Hi Jim,

I have a follow-up question to a reply you recently provided me on one of your other posts. In that reply, you suggested that I perform a regression fit with an interaction term. I’ve made quite a bit of progress, but now I am stymied by another question: Suppose in your example above, a quadratic fits the data better than a line for each of the 2 conditions, e.g., for Condition A, output = a + b*input + c*input^2 (and similarly for Condition B but with different coefficients, of course). My question is, how do we then handle the interaction term? (specifically, do we have to worry about what I could call the “cross terms”?) In other words, would we use:

out = a + b*in + c*in^2 + d*cond + e*cond*in

… or instead must we use ….

out = a + b*in + c*in^2 + d*cond + e*cond*in + f*cond^2

… or do we have to also worry about interaction with the in^2 term and use …

out = a + b*in + c*in^2 + d*cond + e*cond*in + f*cond^2 + g*cond^2*in^2

where “out” = output, “in” = input, “cond” = condition and a-g are coefficients.

My gut tells me that if one is dealing with a 2nd order polynomial, one must take into account all the “cross terms” (or whatever they are called), eg the terms associated with “f” and “g” coefficients in my example above. The question then is how to interpret the resulting p-values: if the p-values associated with, say, the cond*in are very small but large for the cond^2 term, then what would one say about the significance of that interaction effect? Thanks in advance for any guidance you can provide!

Jim Frost says

Hi again Pamela,

Yes, you can and probably should try an interaction for the quadratic term. Let’s start with this model:

Output = Constant + Input + Input^2 + Input*Condition + Input^2*Condition

Here’s how this works. The Input*Condition is the interaction effect for the slopes of the lines. The Input^2*Condition is the interaction effect for the shape of the curve. The interpretation depends on whether one or both of these interaction terms are significant.

If Input*Condition is not significant, the overall slopes of the lines are the same. Even though we’re talking about curved lines, their overall orientation on the graph would be the same. On the other hand, if this term is significant, then the overall slopes are different.

If Input^2*Condition is not significant, the overall shapes of the curves are the same. However, if this term is significant, the shapes of the curves are different to a statistically significant degree. Maybe one curve is tighter than the other.

And, you can combine the two interactions to fit various conditions. For example, the slopes and curves can both be the same or both be different. Or maybe the slopes are the same and the shapes of the curves are different. Or, vice versa! Just look at the p-values to make that determination.

Below is an example where we’re looking at interactions in a model with a quadratic term. In this case, the Input*Condition interaction term is significant but the Input^2*Condition term is not significant. The result is that the slopes are different but the shapes of the curve are the same. Basically, you take the same shape curve and just rotate it.

Interpreting the coefficients and regression equations themselves would be particularly difficult with both curvature and interactions. Graphs really bring the analysis to life! And then we can say that the patterns in the graph are statistically significant, or not, as the case may be.
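For readers who want to see how the columns of such a model are built, here is a rough Python sketch using only the standard library. Everything in it is an assumption for illustration: the data are made up and noise-free, the helper functions `solve` and `ols` are hypothetical, and the sketch also includes a Condition main effect alongside the two interaction terms. It fits the coefficients only; it does not compute p-values, which you would get from your statistical software.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

# Made-up, noise-free data: Condition B changes the slope but not the curvature.
X, y = [], []
for cond in (0, 1):          # 0 = Condition A (reference), 1 = Condition B
    for x in range(6):
        X.append([1.0, x, x**2, cond, x * cond, x**2 * cond])
        y.append(5 + 2 * x + 0.5 * x**2 + 10 * cond + 3 * x * cond)

names = ["Constant", "Input", "Input^2", "Condition",
         "Input*Condition", "Input^2*Condition"]
coefs = ols(X, y)
for name, b in zip(names, coefs):
    print(f"{name}: {b:.3f}")
```

Because the made-up data have a slope difference but no curvature difference, the fitted Input*Condition coefficient comes out near 3 and the Input^2*Condition coefficient near 0, which is the "same shape, rotated" pattern described above.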

Nguyen Tran Vy says

Dear Jim. It is very nice to discover your website. It is very useful for my study and research. I found your website this afternoon as I am reviewing basic statistics and statistical tests for ecology. Could you share your email address? Thanks.

Jim Frost says

Hi Nguyen, thanks so much for your kind words. They mean a lot to me!

If you have statistical questions, please find the relevant blog post and ask your question in the comments there. I like to have public questions and answers because they can help all readers. Thanks!

Patrik Silva says

Ok thanks Jim, it seems clear now!

Best regards!

Patrik Silva says

Thank you again Jim.

Yes, now it’s much clearer.

You meant the “input” is multiplied by condition (0 or 1), meaning that basically the variable Input*Condition will be 0 when Input is multiplied by 0 (Condition) and equal to Input when it is multiplied by 1, right?

Thank you in advance!

Patrik

Jim Frost says

That’s correct. The value for the interaction term for each observation is basically either zero or the input value depending on whether Condition is A or B. The regression procedure uses this to determine the interaction effect.
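As a tiny Python illustration of that column (the rows are made up, and I’m assuming Condition A is the omitted reference level, so the indicator is 1 for Condition B):

```python
# Hypothetical rows: (input value, condition label).
rows = [(2.0, "A"), (3.5, "B"), (5.0, "A"), (6.5, "B")]

# Indicator for Condition B; Condition A is the omitted (reference) level.
cond_b = [1 if cond == "B" else 0 for _, cond in rows]

# The interaction column is simply the input multiplied by the indicator.
interaction = [x * c for (x, _), c in zip(rows, cond_b)]

print(interaction)  # [0.0, 3.5, 0.0, 6.5]
```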

Patrik Silva says

Hi again Jim,

This is a very interesting post. However, I am feeling a little confused. Regarding the first graph shown in this post, the regression models were run separately, right? Because you have two constant coefficients (one for Input versus Output for condition A and another for Input versus Output for condition B)! Does the first equation refer to the results of these two models? After all, one model cannot produce two coefficients for a single variable.

In relation to the second regression results (plotted), I see that Condition B is shown as part of the independent variables. Shouldn’t it be Condition only (including A and B)? Why is it showing Condition B?

Another question is related to the interaction terms (I have read your post about them also). What do you mean by Input*Condition? The (*) can’t be a multiplication since one variable is numerical and the other is categorical, so how is it incorporated in the model? Is there some option in the software to incorporate it?

I hope you understand me; it seems confusing to me!

Waiting for your kind feedback,

Patrik Silva

Jim Frost says

Hi Patrik,

So, that first graph can represent two different possibilities. It can represent two separate regression models displayed on one graph. Maybe the analyst collected the data for the two conditions at different points in time? Or, it can represent a single regression model. A big point of this blog post is that sometimes analysts want to compare different models. Are the differences between models significant? So, let’s assume for the entire post that the analyst collected the data for each condition separately and originally fit separate models for each condition. At a later point, the analyst wants to determine whether the differences between the models are statistically significant. That sounds difficult. However, if you combine the two datasets into one and fit the model using an indicator variable and interaction terms as I describe, it’s very easy!

As for the two regression equations with different coefficients, again, that can represent the two possibilities I describe above (separate models or models that use indicator variables and interaction terms). When you include an indicator variable (Condition in these examples), you’re adding a fixed amount of vertical height on the Y-axis to the fitted line. In this case, the effect of Condition B is 10 units. So, it shifts up the fitted regression line by 10 units on the Y-axis. You can represent this using a coefficient of 10 for Condition B or you can add those 10 units to the intercept for the Condition B model. They’re equivalent. The software I use automatically outputs both the separate models and the coefficients. However, it’s in the coefficients table where you can tell whether the effect of Condition is significant or not.

As for the difference in slope coefficients in the second example, that’s a similar situation but instead of depending on the indicator variable (Condition), it depends on the interaction term. An interaction indicates that the effect changes based on the value of another variable. This shows up in the graphs as different slopes, which corresponds to different slope coefficients in the output. Again, my software automatically outputs the equations for the separate models and the coefficients. That explains the different slope coefficients, but again, it is in the coefficients table where you can determine whether the difference between slopes is statistically significant.

The condition variable is categorical–A and B are the two levels. However, behind the scenes, statistical software has to represent it numerically and it does this by creating indicator variables. This type of variable is simply a column of 1s and 0s. A 1 indicates the presence of a characteristic while 0 indicates the lack of it. You used to have to create these variables manually but software does it automatically now. It’s all behind the scenes and you won’t see these indicator variables.

In this case, the software defined the indicator variable as the presence of condition B. For indicator variables, you must leave one level out of the model. The software left out condition A. One level must be excluded because it’s redundant to have one indicator variable say that an observation is condition A while another indicator variable says that the same observation is not condition B. That analysis won’t run because of the perfect correlation between those two independent variables. So, the software left out the indicator variable for condition A. However, it could’ve left out B instead and the results would’ve been the same. That’s why the output displays only Condition B. And, that’s how you multiply input*Condition. It’s really either the input value multiplied by 0 or 1 for each observation depending on whether the observation is of process A or B.

I hope that makes it clear!

Anastasia says

Thank you Jim for such an intuitive and efficient description (what many many expensive econometric books lack). There is one thing where I’m not sure in my own regression. Maybe you could help me?

I’m doing a panel univariate (gls) regression with two growth rates y on x (+ time effects). And like you have described in this post I want to test for a significant difference in the beta for two subsamples/conditions. My two conditions (A, B) are two different time periods. “Normally” from the separate two condition A and B regression and the full regression with the interaction term we should have (like in your example):

(from the full regression) beta_input + beta_input*condition = beta_input_B (from the separate B regression).

So we have exact betas (for A and B) regardless of taking them only from the separate A and B regression or taking/calculating both betas from full regression with the interaction term.

But when I control for heteroscedasticity in my panel regression, this equation is not true anymore.

So my actual question is: are the betas from the separate A and B regressions still “right” and the p-value for beta_input*condition in the full regression with interaction term still decides for this two betas whether they are statistically different or is the p-value for beta_input*condition only valid for the two (in my case with controlling for heteroscedasticity now slightly different) betas: beta_input and beta_input + beta_input*condition?
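For the plain OLS case only (no heteroscedasticity correction, no panel structure), the identity described above can be checked with a short Python sketch. The data and the helper `simple_fit` are made up for illustration. Rather than fitting the full model directly, the sketch builds candidate full-model coefficients from the two separate fits and confirms that they satisfy the OLS normal equations, meaning the residuals are orthogonal to every column of the combined model.

```python
from statistics import mean

def simple_fit(xs, ys):
    """Intercept and slope of an ordinary least-squares line."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# Made-up data for the two conditions.
x_a, y_a = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
x_b, y_b = [1.0, 2.0, 3.0, 4.0], [5.0, 8.1, 10.9, 14.2]

int_a, slope_a = simple_fit(x_a, y_a)
int_b, slope_b = simple_fit(x_b, y_b)

# Candidate coefficients for the full model:
# Output = b0 + b1*Input + b2*Condition + b3*Input*Condition
b0, b1 = int_a, slope_a
b2, b3 = int_b - int_a, slope_b - slope_a

# Rows of the combined dataset: (constant, Input, Condition, Input*Condition, Output).
rows = ([(1.0, x, 0, 0.0, y) for x, y in zip(x_a, y_a)]
        + [(1.0, x, 1, x, y) for x, y in zip(x_b, y_b)])

# OLS residuals must be orthogonal to every column of the design matrix.
resid = [y - (b0 * c0 + b1 * c1 + b2 * c2 + b3 * c3) for c0, c1, c2, c3, y in rows]
dots = [sum(r * row[j] for r, row in zip(resid, rows)) for j in range(4)]

print(dots)                 # all four values are zero up to rounding error
print(b1 + b3, slope_b)     # the two slopes for Condition B agree
```

This sketch says nothing about what happens once weights or robust corrections enter the estimation; it only shows why the equality holds for ordinary least squares.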

Durgesh Pitale says

Would you please suggest how to compare two curves (relative luminosity vs. time)?

Jim Frost says

Hi Durgesh, I’m not exactly sure what information you need, but I have written a blog post about how to compare different curves to determine which one best fits your data. Maybe this is what you need? Curve Fitting using Linear and Nonlinear Regression