Use regression analysis to describe the relationships between a set of independent variables and the dependent variable. Regression analysis produces a regression equation where the coefficients represent the relationship between each independent variable and the dependent variable. You can also use the equation to make predictions.

As a statistician, I should probably tell you that I love all statistical analyses equally—like parents with their kids. But, shhh, I have secret! Regression analysis is my favorite because it provides tremendous flexibility, which makes it useful in so many different circumstances. In fact, I’ve described regression analysis as taking correlation to the next level!

In this blog post, I explain the capabilities of regression analysis, the types of relationships it can assess, how it controls the variables, and generally why I love it! You’ll learn when you should consider using regression analysis.

## Use Regression to Analyze a Wide Variety of Relationships

Regression analysis can handle many things. For example, you can use regression analysis to do the following:

- Model multiple independent variables
- Include continuous and categorical variables
- Use polynomial terms to model curvature
- Assess interaction terms to determine whether the effect of one independent variable depends on the value of another variable

These capabilities are all cool, but they don’t include an almost magical ability. Regression analysis can unscramble very intricate problems where the variables are entangled like spaghetti. For example, imagine you’re a researcher studying any of the following:

- Do socio-economic status and race affect educational achievement?
- Do education and IQ affect earnings?
- Do exercise habits and diet effect weight?
- Are drinking coffee and smoking cigarettes related to mortality risk?
- Does a particular exercise intervention have an impact on bone density that is a distinct effect from other physical activities?

More on the last two examples later!

All these research questions have entwined independent variables that can influence the dependent variables. How do you untangle a web of related variables? Which variables are statistically significant and what role does each one play? Regression comes to the rescue because you can use it for all of these scenarios!

## Use Regression Analysis to Control the Independent Variables

As I mentioned, regression analysis describes how the changes in each independent variable are related to changes in the dependent variable. Crucially, regression also statistically controls every variable in your model.

### What does controlling for a variable mean?

When you perform regression analysis, you need to isolate the role of each variable. For example, I participated in an exercise intervention study where our goal was to determine whether the intervention increased the subjects’ bone mineral density. We needed to isolate the role of the exercise intervention from everything else that can impact bone mineral density, which ranges from diet to other physical activity.

To accomplish this goal, you must minimize the effect of confounding variables. Regression analysis does this by estimating the effect that changing one independent variable has on the dependent variable while holding all the other independent variables constant. This process allows you to learn the role of each independent variable without worrying about the other variables in the model. Again, you want to isolate the effect of each variable.

### How do you control the other variables in regression?

A beautiful aspect of regression analysis is that you hold the other independent variables constant by merely including them in your model! Let’s look at this in action with an example.

A recent study analyzed the effect of coffee consumption on mortality. The first results indicated that higher coffee intake is related to a higher risk of death. However, coffee drinkers frequently smoke, and the researchers did not include smoking in their initial model. After they included smoking in the model, the regression results indicated that coffee intake lowers the risk of mortality while smoking increases it. This model isolates the role of each variable while holding the other variable constant. You can assess the effect of coffee intake while controlling for smoking. Conveniently, you’re also controlling for coffee intake when looking at the effect of smoking.

Note that the study also illustrates how excluding a relevant variable can produce misleading results. Omitting an important variable causes it to be uncontrolled, and it can bias the results for the variables that you do include in the model. This warning is particularly applicable for observational studies where the effects of omitted variables might be unbalanced. On the other hand, the randomization process in a true experiment tends to distribute the effects of these variables equally, which lessens omitted variable bias.

## How to Interpret Regression Output

To answer questions using regression analysis, you first need to fit and verify that you have a good model. Then, you look through the regression coefficients and p-values. When you have a low p-value (typically < 0.05), the independent variable is statistically significant. The coefficients represent the average change in the dependent variable given a one-unit change in the independent variable (IV) while controlling the other IVs.

For instance, if your dependent variable is income and your IVs include IQ and education (among other relevant variables), you might see output like this:

The low p-values indicate that both education and IQ are statistically significant. The coefficient for IQ indicates that each additional IQ point increases your income by an average of approximately $4.80 while controlling everything else in the model. Furthermore, an additional unit of education increases average earnings by $24.22 while holding the other variables constant.

Regression analysis is a form of inferential statistics. The p-values help determine whether the relationships that you observe in your sample also exist in the larger population. I’ve written an entire blog post about how to interpret regression coefficients and their p-values, which I highly recommend.

## Obtaining Trustworthy Regression Results

With the vast power of using regression comes great responsibility. Sorry, but that’s the way it must be. To obtain regression results that you can trust, you need to do the following:

- Specify the correct model. As we saw, if you fail to include all the important variables in your model, the results can be biased.
- Check your residual plots. Be sure that your model fits the data adequately.
- Correlation between the independent variables is called multicollinearity. As we saw, some multicollinearity is OK. However, excessive multicollinearity can be a problem.

Using regression analysis gives you the ability to separate the effects of complicated research questions. You can disentangle the spaghetti noodles by modeling and controlling all relevant variables, and then assess the role that each one plays.

There are many different regression analysis procedures. Read my post to determine which type of regression is correct for your data.

If you’re learning regression, check out my Regression Tutorial!

NARAYANASWAMY AUDINARAYANA says

In linear regression, can we use categorical variables as Independent variables? If yes, what should be the minimum or maximum categories in an Independent variable?

Jim Frost says

Hi, yes you can use categorical variables as independent variables! The number of groups really depends on what makes sense for your study area. Of course, the minimum is two. There really is no maximum in theory. It depends on what makes sense for your study. However, in practice, having more groups requires a larger total sample size, which can become expensive. If you have 2-9 groups, you should have at least 15 in each group. For 10-12 groups, you should have 20. These numbers are based on simulation studies for ANOVA, but they also apply to categorical variables in regression. In a nutshell, figure out what makes sense for your study and then be sure to collect enough data!

I hope this help!

Jim

Khadidja Benallou says

Thank you Mr. Jim

Jim Frost says

You’re very welcome!

Shamsun Naher says

In my model, I use different independent variables. Now my question is before useing regression, do I need to check the distribution of data? if yes then please write the name tests. My title is Education and Productivity Nexus, : evidence from pharmaceutical sector in Bangladesh.

Jim Frost says

Hi Shamsun, typically you test the distribution of the residuals after you fit a model. I’ve written a blog post about checking your residual plots that should read.

I hope this helps!

Jim

Ghulam Mustafa says

Sir usually we take

5% level of significance for comparing why 0

Jim Frost says

Hi Ghulam, yes, the significance level is usually 0.05. I’m not sure what you’re asking about in regards to zero? The p-values in the example output are all listed as 0.000, which is less than the significance level of 0.05, so they are statistically significant.

ghulam mustafa says

why do we use 5% level of significance usually for comparing instead of 1% or other

Jim Frost says

Hi, I actually write about this topic in a post about hypothesis testing. It’s basically a tradeoff between several different error rates–and a dash of tradition. Read that post and see if it answers your questions.

Kaleem says

Hello, Jim.

What is the impact* on the independent variables in the model if I omit a variable that is a determinant of dependent variable but is not related to any of the independent variables?

*Here impact relates to the independent variables’ p-values and the coefficients.

Kind regards.

Jim Frost says

Hi Kaleem,

If the independent variable is not correlated with the other independent variables, it’s likely that there would be a minimal effect on the other independent variables. Your model won’t fit the data as well as before depending on the strength of the relationship between the dropped independent variable and the dependent variable. You should also check the residual plots to be sure that by removing the variable you’re not introducing bias.

Kaleem says

Thanks for the reply. Jim.

I am unable to understand “Your model won’t fit the data as well as before depending on the strength of the relationship between the dropped independent variable and the dependent variable”. Are you stating that other independent variables will be fine but r-square will become low? I will be grateful if you can explain this.

Kind regards

Jim Frost says

Hi, you indicated that the removed independent variable is related to the dependent variable, but it is not correlated with the other independent variables. Consequently, removing that independent variable should reduce R-squared. For one thing, that’s the typical result of removing variables, even when they’re not statistically significant. In this case, because it is not correlated to the other independent variables, you know that the removed variable is supplying unique information. Taking that variable out means that information is no longer included in the model. R-squared will definitely go down, possibly dramatically.

R-squared measures the strength of the relationship between the entire set of IVs and the DP. Read my post about R-squared for more information.

Kaleem says

Further clarification on my above post. From internet I found that if a variable (z) that is related to y but unrelated to x then inclusion of z will reduce standard errors of x. So, if z is excluded, but f-stat and adjusted r-square are fine then does high standard errors create problems? I look forward to your reply.

Jim Frost says

Yes, what you read is correct. Typically, if Z is statistically significant, you should include it in your model. If you exclude it, the precision of your coefficient estimates will be lower (higher standard errors). You also risk a biased model because you are not including important information in the model–check the residual plots. The F-test of overall significance and adjusted R-squared depend on the other IVs in your model. If Z is by far the best variable, it’s possible that removing it will cause the F-test to not be significant and adjusted R-square might drop noticeably. Again, that depends on how the explanatory power of Z compares to the other IVs. Why do you want to remove a significant variable?

Kaleem says

Thanks for your reply. I really appreciate it. Could you please also provide an answer to my query mentioned below for further clarification?

Kaleem says

Thanks for the reply. I apologise if I am taking a considerable time out of your schedule.

Based on the literature, there isn’t any conclusive evidence that z is a determinant of y. So, that is why I intend to remove z. Some studies include it while some do not and some find significant association (between y and z) while some find the association insignificant. Hence, I think I can safely remove it.

Moreover, I will be grateful if you can answer another query. From an statistical viewpoint, is it fine if I use Generalized method of moments (GMM) for binary dependent variable?

Kind regards.

Jim Frost says

While I can’t offer you a concrete statement about whether you should include or exclude the variable (clearly there is disagreement in your own field), I do suggest that you read my article about specifying the correct regression model. I include a number of tips and considerations.

Unfortunately, I don’t know enough about GMM to make a recommendation. All of the examples I have seen personally are for continuous data, but I don’t know about binary data.

Kaleem says

Thanks for your reply and for the guidance.

I read your posts which are very helpful. After reading them, I concluded that only the independent variables which have a well-established association with the dependent variable should be included. Hence, in my case, variable Z should not be included given that the association of Z with dependent variable is not well-established.

Furthermore, suppose there is another variable (A) and literature suggests that it, in general, has an association with dependent variable. However, assume that A does not affect any independent variables so there is no omitted variable bias. In this case, if there is no data available for A (due to the study being conducted in different environment/context) then what statistical techniques can be deployed to address any problems caused due to the exclusion of A?

I look forward to your reply and I will be grateful for your reply.

Kind regards.

Hari says

Very nicely explanined. thank you

Jim Frost says

Thanks you, Hari!

Martin Amsteus says

No statistical tool or method turns a survey or corrolation study into an experiment, i.e. regression does not test or imply cause effect relationship. A positive relationship between smoking and cancer in a regression analysis does not mean that smoking cause cancer. You have not controlled for what you are unaware of.

Jim Frost says

Hi Martin, you are 100% correct about the fact that correlation doesn’t imply causation. This issue is one that I plan to cover in future posts.

There are two issues at play here. The type of study under which the data were collected and the statistical findings.

Being able to determine causation comes down to the difference between an observational study versus a randomized experiment. You actually use the same analyses to assess both types of designs. In an observational study, you can only establish correlation and not causality. However, in a randomized experiment, the same patterns and correlations in the data can suggest causality. So, regression analysis

canhelp establish causality, but only when it’s performed on data that were collected through a randomized experiment.Martin Amsteus says

Yes. But it may be that you miss my point. Because I argue that a proper and sound experiment will allow you to test for causality, regardless of if you deploy e.g. Pearsons r or regression. With no experimental design, neither Pearsons r nor a regression will test for an effect relationship between the variables. Randomisation makes a better case for controlling for variables that you are unaware of than picking a few, and then proclaim that your study found that x will cause an incrrase in y or that x has an effect on y. You may as well argue that you dont need to control for any variables and argue that any correlational study test for Effect relationships.

Jim Frost says

Hi Martin, yes, that is

exactlywhat I’m saying. Whether you can draw causal conclusion depends on whether you used a randomized experiment to collect your data. If it’s an observational study, you can’t assume it’s anything other than correlation. What you write in your comment agrees with what I’m saying.The controlling for other variables that I mention in this post is a different matter. Yes, if you include a variable in a regression model, it is held constant while estimating the effects of the other variables. That doesn’t mean you can assume causality though.

Cara says

Hi Jim,

I was wondering if you can help me? I am doing my dissertation and I have 1 within-subjects IV, and 3 between-subjects IVs.. most of my variables are categorical, but one is not categorical, it is a questionnaire which I am using to determine sleep quality, with both Likert scales and own answers to amount of sleep (hours), amount of times woke in the night etc. Can I use a regression when making use of both categorical data and other? I also have multiple DVs (angry/sad Likert ratings).. but I *could* combine those into one overall ’emotion’ DV. Any help would be much appreciated!

Jim Frost says

Hi Cara, because your DV use the Likert scale, you really should be using Ordinal Logistic Regression. This type of regression is designed for ordinal dependent variables like yours. As for the IVs, it can be tricky using ordinal variables. They’re not quite either continuous or categorical. My suggestion is to give them a try as continuous variable and check the residual plots to see how they look. If they look good, then it’s probably ok. However, if they don’t look good, you can try refitting the model using them as categorical variables and then rechecking the residual plots. If the residuals still don’t look good, you can then try using the chi-square test of independence for ordinal data.

As for combining the data, that would seem to be a subject-area specific decision, and I don’t know that area well enough to make an informed recommendation.

Ahmed says

Hi Jim

Hope all thing is well,

I have faced problem with plotting, which is included the relationship between dependent variable (response) and the independent variables .

when i do the main effect plots, i have the straight line increasing.

y= x, this linear trending

to change it i need to make y= square root for time

Im stuck with this thing i couldn’t find solution for it

Regards

SUBROTO Chatterjee says

In the coffee intake and smoking example, the first result showed that higher coffee intake leads to higher mortality, but after including smoking, coffee intake leads to lower or no mortality? Smoking was revealed to cause the mortality, but how did coffee intake now result in the opposite? Was a separate test taken for this result? Please let me know.

S. CHATTERJEE

Jim Frost says

Hi, that’s a great question. It turns out that coffee and smoking are correlated. The negative effects of smoking on mortality are well documented. However, for some reason, the researchers did not originally include smoking in their model. Because drinking coffee and smoking are correlated, the variable for coffee consumption took on some of smoking’s effect on mortality.

Put another way, because smoking was not included in the model, it was not being controlled (held constant). So, as you increased coffee consumption, smoking also tended to increase because it is both positively correlated with coffee consumption and not in the model. Therefore, it appeared as though increased coffee consumption is correlated with higher mortality rates but only because smoking was not included in the model.

Presumably, the researchers had already collected data about smoking. So, all they had to do was include the smoking variable in their regression model. Voila, the model now controls for smoking and the new output displays the new estimate of the effect that coffee has on mortality.

This point illustrates a potential problem. If the researchers had not collected the smoking information, they would have really been stuck. Before conducting any study researchers need to do a lot of background research to be sure that they are collecting the correct data!

I hope this helps!

Patrik Silva says

Hi Jim, I am everywhere in your post!

I am starting loving statistic, that’s why I am not quiet,

I have some questions for you:

To use OLS regression, one of the assumptions is that the dependent variable is normally distributed. To achieve this requirement what I should do with my data? Should I check the normality of my dependent variable, for example using Shapiro test (etc)? If I conclude that my dependent variable is not following the normal distribution I should start to see data transformation, right?

Another way that I have used to see people analyzing the normality is by plotting the dependent variable with the independent variable and if the relationship doesn’t follow linear trend then they go to data transformation (which one you recommend me?)

Should I perform the regression using my data (original) and then the residuals will show me non-normality if do exists?

When should I transform my independent variables, and what is the consequence of transforming them?

Sorry, I use to ask many questions in a single comment, but I think this is the way to understand the full picture of my doubt,

You are being so useful to me,

Thank you again!

Patrik

Jim Frost says

Hi Patrik, I’m so happy to hear that you’re starting to love statistics! It’s a great field that is exciting. The thrill of discovery combined with getting the most value out of your data. I’m not sure if you’ve read my post about The Importance of Statistics, but if you haven’t, I recommend it. It explains why the field of statistics is more important than ever!

In OLS regression, the dependent variable does not have to be normally distributed. Instead, you need to assess the distribution of the residuals using residual plots. If your residuals are not normally distributed, there are a variety of possible reasons and different ways to resolve that issue. I always recommend that transforming your data is the last resort. For example, the residuals might be nonnormal because the model is specified incorrectly. Maybe there is curvature in the data that you aren’t modeling correctly? If so, transforming the data might mask the problem. You really want to specify the best possible model. However, if all else fails, you might need to transform the data. When you transform the data, you’ll need to back transform the results to make sense of the results because everything applies to the transformed data. Most statistical software should do this for you.

Be aware that you can’t trust R-squared and the standard error of the regression when you transform your dependent variable because they apply to the transformed data rather than the raw data (backtransformation won’t help there).

In terms of testing the normality of the residuals, I recommend using normal probability plots. You can usually tell at a glance whether they are normally distributed. If you need a test, I generally use the Anderson-Darling test–which you can see in action in my post about identifying the distribution of your data. By the way, as a case in point, the data in that post are not normal, but I use it as the dependent variable in OLS regression in this post about using regression to make predictions. The residuals are normally distributed even though the dependent variable is not.

Patrik Silva says

Many Thanks Jim!!! You have no idea about how much you helped me.

Very well clarified!!!

God bless you always!!!

Patrik

Patrik Silva says

Hi Jim, I hope you are having good time!

I would like to ask you a question, please!

I have 24 observations to perform a regression analysis (let’s say Zones), and I have many independent variables (IV). I would like to know what is the minimum number of observations I should have to perform a reasonable linear regression model. I would like to hear something from you about how to test many regression model with different IV, since I can not use many IV in a model where a have few observations (24).

Thank you in advance!

Patrik

Jim Frost says

Hi Patrik, great to hear from you again!

Those are great questions. For 24 observations, I’d say that you usually wouldn’t want more than 2 IVs. I write an entire post about how many variables you can include in a regression model. Including too many IVs (and other terms such as interactions and polynomials) is known as overfitting the model. Check that post out because it’ll provide guidance and show you the dangers of including too many.

There’s another issue a play too because you want to compare a number of different regression models to each other. If you compare many models, it’s a form of data mining. The risk here is that if you compare enough models, you will uncover chance correlations. These chance correlations look like the real thing but only appear in your sample and not the population. I’ve written a post about how using this type of data mining to choose a regression model causes problems. This concern is particularly problematic with a small sample size like yours. It can find “patterns” in randomly generated data.

So, there’s really two issues for you to watch out for–overfitting and chance correlations found through data mining!

Hope this helps!

Patrik Silva says

Thank you Jim!

I really appreciate it!

PS

Saeed Anowar says

Thanks

Kaushal Kumar Bhagat says

Is it necessary to conduct correlation analysis before regression analysis?

Jim Frost says

Hi Kaushal,

No it’s not absolutely required. I actually prefer producing a series of scatterplots (or a matrix plot) so I can see the nature of the different relationships. That helps give me a better feel for the data along with the types of relationships. However, if you have a good theory and a solid background knowledge on which variables should be included in the model, you can go straight to modeling. I think it depends a lot on your existing level of knowledge.

That all said, I personally like knowing the correlation structure between all of the variables. It gives me a better feel for the data.