Use regression analysis to describe the relationships between a set of independent variables and the dependent variable. Regression analysis produces a regression equation where the coefficients represent the relationship between each independent variable and the dependent variable. You can also use the equation to make predictions.

As a statistician, I should probably tell you that I love all statistical analyses equally, like parents with their kids. But, shhh, I have a secret! Regression analysis is my favorite because it provides tremendous flexibility, which makes it useful in so many different circumstances. In fact, I’ve described regression analysis as taking correlation to the next level!

In this blog post, I explain the capabilities of regression analysis, the types of relationships it can assess, how it controls the variables, and generally why I love it! You’ll learn when you should consider using regression analysis.

## Use Regression to Analyze a Wide Variety of Relationships

Regression analysis can handle many things. For example, you can use regression analysis to do the following:

- Model multiple independent variables
- Include continuous and categorical variables
- Use polynomial terms to model curvature
- Assess interaction terms to determine whether the effect of one independent variable depends on the value of another variable

These capabilities are all cool, but they don’t include an almost magical ability. Regression analysis can unscramble very intricate problems where the variables are entangled like spaghetti. For example, imagine you’re a researcher studying any of the following:

- Do socio-economic status and race affect educational achievement?
- Do education and IQ affect earnings?
- Do exercise habits and diet affect weight?
- Are drinking coffee and smoking cigarettes related to mortality risk?
- Does a particular exercise intervention have an impact on bone density that is a distinct effect from other physical activities?

More on the last two examples later!

All these research questions have entwined independent variables that can influence the dependent variables. How do you untangle a web of related variables? Which variables are statistically significant and what role does each one play? Regression comes to the rescue because you can use it for all of these scenarios!

## Use Regression Analysis to Control the Independent Variables

As I mentioned, regression analysis describes how the changes in each independent variable are related to changes in the dependent variable. Crucially, regression also statistically controls every variable in your model.

### What does controlling for a variable mean?

When you perform regression analysis, you need to isolate the role of each variable. For example, I participated in an exercise intervention study where our goal was to determine whether the intervention increased the subjects’ bone mineral density. We needed to isolate the role of the exercise intervention from everything else that can impact bone mineral density, which ranges from diet to other physical activity.

To accomplish this goal, you must minimize the effect of confounding variables. Regression analysis does this by estimating the effect that changing one independent variable has on the dependent variable while holding all the other independent variables constant. This process allows you to learn the role of each independent variable without worrying about the other variables in the model. Again, you want to isolate the effect of each variable.

### How do you control the other variables in regression?

A beautiful aspect of regression analysis is that you hold the other independent variables constant by merely including them in your model! Let’s look at this in action with an example.

A recent study analyzed the effect of coffee consumption on mortality. The first results indicated that higher coffee intake is related to a higher risk of death. However, coffee drinkers frequently smoke, and the researchers did not include smoking in their initial model. After they included smoking in the model, the regression results indicated that coffee intake lowers the risk of mortality while smoking increases it. This model isolates the role of each variable while holding the other variable constant. You can assess the effect of coffee intake while controlling for smoking. Conveniently, you’re also controlling for coffee intake when looking at the effect of smoking.

Note that the study also illustrates how excluding a relevant variable can produce misleading results. Omitting an important variable causes it to be uncontrolled, and it can bias the results for the variables that you do include in the model. This warning is particularly applicable for observational studies where the effects of omitted variables might be unbalanced. On the other hand, the randomization process in a true experiment tends to distribute the effects of these variables equally, which lessens omitted variable bias.
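To see omitted variable bias in action, here is a minimal sketch using simulated data. It is not the data from the coffee study: the effect sizes (-0.5 for coffee, +2.0 for smoking), the strength of the coffee-smoking correlation, and the helper `fit_ols` are all assumptions made up for illustration. The outcome is a continuous "risk score" rather than actual mortality.

```python
import random

def fit_ols(rows, y):
    """Least-squares coefficients for y ~ intercept + predictors (normal equations)."""
    X = [[1.0] + list(r) for r in rows]          # prepend the intercept column
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [[sum(xi[a] * xi[b] for xi in X) for b in range(k)]
         + [sum(xi[a] * yi for xi, yi in zip(X, y))] for a in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(A[row][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [v - factor * p for v, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

random.seed(1)
n = 5000
smoking = [random.gauss(0, 1) for _ in range(n)]
# Assumption: coffee drinkers tend to smoke, so coffee is correlated with smoking.
coffee = [0.8 * s + random.gauss(0, 1) for s in smoking]
# Assumed "true" effects: coffee lowers risk (-0.5), smoking raises it (+2.0).
risk = [-0.5 * c + 2.0 * s + random.gauss(0, 1) for c, s in zip(coffee, smoking)]

omitted = fit_ols([[c] for c in coffee], risk)       # smoking left out
full = fit_ols(list(zip(coffee, smoking)), risk)     # smoking included

print("coffee coefficient, smoking omitted: ", round(omitted[1], 2))
print("coffee coefficient, smoking included:", round(full[1], 2))
print("smoking coefficient:                 ", round(full[2], 2))
```

With smoking left out, the coffee coefficient comes out positive because coffee is standing in for smoking. Once smoking is in the model, the coffee coefficient turns negative, close to the value built into the simulation.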

**Related post**: Confounding Variables and Omitted Variable Bias

## How to Interpret Regression Output

To answer questions using regression analysis, you first need to fit and verify that you have a good model. Then, you look through the regression coefficients and p-values. When you have a low p-value (typically < 0.05), the independent variable is statistically significant. The coefficients represent the average change in the dependent variable given a one-unit change in the independent variable (IV) while controlling the other IVs.

For instance, if your dependent variable is income and your IVs include IQ and education (among other relevant variables), you might see output like this:

The low p-values indicate that both education and IQ are statistically significant. The coefficient for IQ indicates that each additional IQ point increases your income by an average of approximately $4.80 while controlling everything else in the model. Furthermore, an additional unit of education increases average earnings by $24.22 while holding the other variables constant.
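The interpretation above can be sketched in a few lines of code. The two coefficients come from the text. The intercept is not given in this post, so the value used below is a placeholder, and the function name `predict_income` is hypothetical. Because the example only looks at differences between predictions, the placeholder intercept cancels out.

```python
# Coefficients quoted in the text. The intercept is NOT given in the post,
# so the value below is a made-up placeholder for illustration only.
INTERCEPT = 0.0          # hypothetical
COEF_IQ = 4.80           # from the text: dollars per IQ point
COEF_EDUCATION = 24.22   # from the text: dollars per unit of education

def predict_income(iq, education):
    """Apply the regression equation to one person's values."""
    return INTERCEPT + COEF_IQ * iq + COEF_EDUCATION * education

base = predict_income(iq=100, education=12)

# Raise IQ by one point while holding education constant.
iq_effect = round(predict_income(iq=101, education=12) - base, 2)

# Raise education by one unit while holding IQ constant.
edu_effect = round(predict_income(iq=100, education=13) - base, 2)

print(iq_effect, edu_effect)
```

Changing one input by one unit while leaving the other fixed changes the prediction by exactly that input's coefficient, which is what "holding the other variables constant" means in practice.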

Regression analysis is a form of inferential statistics. The p-values help determine whether the relationships that you observe in your sample also exist in the larger population. I’ve written an entire blog post about how to interpret regression coefficients and their p-values, which I highly recommend.

## Obtaining Trustworthy Regression Results

With the vast power of using regression comes great responsibility. Sorry, but that’s the way it must be. To obtain regression results that you can trust, you need to do the following:

- Specify the correct model. As we saw, if you fail to include all the important variables in your model, the results can be biased.
- Check your residual plots. Be sure that your model fits the data adequately.
- Check for multicollinearity. Correlation between the independent variables is called multicollinearity. As we saw with coffee intake and smoking, some multicollinearity is OK. However, excessive multicollinearity can be a problem.
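
One common way to quantify multicollinearity is the variance inflation factor (VIF). The sketch below uses simulated predictors and covers only the special case of two predictors, where the VIF reduces to 1 / (1 - r²) with r being their Pearson correlation. The function names and the data are assumptions for illustration. Thresholds such as 5 or 10 are common rules of thumb, not hard limits.

```python
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def vif_two_predictors(x1, x2):
    """With exactly two predictors, R-squared of one on the other is r**2."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

random.seed(2)
n = 2000
x1 = [random.gauss(0, 1) for _ in range(n)]
mild = [0.3 * a + random.gauss(0, 1) for a in x1]      # weakly related to x1
severe = [a + random.gauss(0, 0.1) for a in x1]        # nearly a copy of x1

vif_mild = vif_two_predictors(x1, mild)
vif_severe = vif_two_predictors(x1, severe)
print(round(vif_mild, 2), round(vif_severe, 1))
```

The weakly related pair gives a VIF close to 1, while the near-duplicate predictor gives a very large VIF, which signals that the two coefficients would be estimated imprecisely.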

Using regression analysis gives you the ability to separate the effects of complicated research questions. You can disentangle the spaghetti noodles by modeling and controlling all relevant variables, and then assess the role that each one plays.

There are many different regression analysis procedures. Read my post to determine which type of regression is correct for your data.

If you’re learning regression and like the approach I use in my blog, check out my eBook!

Irina says

Many thanks. I appreciate it.

Irina says

Hello Jim,

I stumbled across your website in hopes of finding an answer to a couple of questions regarding the methodology of my political science paper. If you could help, I would be very grateful.

My research question is “Why do North-South regional trade agreements tend to generate economic convergence while South-South agreements sooner cause economic divergence?”. North = OECD developed countries and South = non-OECD developing countries.

This is my lineup of variables and hypotheses:

DV: Economic convergence between country members in a regional trade agreement

IV1: Complementarity (differentness) of relative factor abundance

IV2: Market size of region

IV3: Economic policy coordination (Harmonization of Foreign Direct Investment (FDI) policy)

H1: The higher the factor endowment difference between countries, the greater the convergence

H2: The larger the market size, the greater the convergence

H3: The greater the harmonization of FDI policies, the greater the convergence

I am not sure what the best methodological approach is. I will have to take North-South and South-South groups of countries and assign values for the groupings. I want to show the relationship between the IVs and DV, so I thought to use a regression. But there are at least two issues:

1. I feel the variables are not appropriate for a time series, which is usually used to show relationships. This is because e.g. the market size of a region will not be changing with time. Can I not do a time series and still have meaningful results?

2. The IVs are not completely independent of one another. How can I work with that?

Also, what kind of regression would be most appropriate in your view?

Many sincere thanks in advance.

Irina

Jim Frost says

Hi Irina,

I’m not an expert in that specific field, so I can’t give you concrete advice, but here are some things to consider.

The question about whether you need to include time related information in the model depends on the nature of your data and whether you expect temporal effects to exist. If your data are essentially collected at the same time and refer to the same time period, you probably don’t need to account for time effects. If theory suggests that the outcome does not change over time, you probably don’t need to include variables for time effects.

However, if your data are collected at or otherwise describe different points in time, and you suspect that the relationships between the IVs and DV change over time, or there is an overall shift over time, yes, you’d need to account for the time effects in your model. In that case, failure to account for the effects of time can bias your other coefficients. Basically, there’s the potential for omitted variable bias.

I don’t know the subject area well enough to be able to answer those questions, but that’s what I’d think about.

You mention that the IVs are potentially correlated (multicollinearity). That might or might not be a problem. It depends on the degree of the correlation. Some correlation is OK and might not be a problem. I’d perform the analysis and check the VIFs, which measure multicollinearity. Read my post about multicollinearity, which discusses how to detect it, determine whether it’s a problem and some corrective measures.

I’d start with linear regression. Move away from that only if you have specific reason to do so.

Best of luck with your analysis!

Lizzy Casey says

Hi Jim

I was wondering if you could help. I’m currently doing a lab report on numerical cognition in human and non-human primates. We are looking at whether the size, quantity, and visibility of food affect choice. We have tested humans so far and are going to test chimps in the future. My IV is condition (visible and opaque containers) and my DV is the number of correct responses. So far I have compared the means of the number of correct responses for both conditions using a one-way repeated measures ANOVA, but I don’t think this is correct. After having a look at your website, should I look to run a regression analysis instead? Sorry for the confusion, I’m really a rookie at this. Hope you can help!

Jim Frost says

Hi Lizzy,

Linear regression analysis and ANOVA are really the same type of analysis: linear models. They both use the same math “underneath the hood.” They each have their own historical traditions and terminology, but they’re really the same thing. In general, ANOVA tends to focus on categorical (nominal) independent variables while regression tends to focus on continuous IVs. However, you can add continuous variables into an ANOVA model and categorical variables into a regression model. If you fit the same model in ANOVA as regression, you’ll get the same results.

So, for your study, you can use either ANOVA or regression. However, because you have only one categorical IV, I’d normally suggest using one-way ANOVA. In fact, if you have only those two groups (visible vs opaque), you can use a 2-sample t-test.
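A minimal sketch of that equivalence, using made-up counts and assuming two independent groups (a repeated-measures design would need a paired analysis instead): regressing the outcome on a 0/1 indicator for the condition gives a slope equal to the difference in group means, and the slope's t-value matches the pooled-variance 2-sample t-test.

```python
from statistics import mean

visible = [8, 9, 7, 10, 9, 8]   # hypothetical correct-response counts
opaque = [6, 7, 5, 8, 6, 7]     # hypothetical correct-response counts

# Regression: code the condition as a 0/1 indicator (dummy) variable.
x = [0] * len(opaque) + [1] * len(visible)
y = opaque + visible
n = len(y)
x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
t_regression = slope / (sse / (n - 2) / sxx) ** 0.5

# Classic pooled-variance 2-sample t-test on the same data.
def sum_sq(values):
    """Sum of squared deviations from the group mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

pooled_var = (sum_sq(visible) + sum_sq(opaque)) / (n - 2)
diff = mean(visible) - mean(opaque)
t_two_sample = diff / (pooled_var * (1 / len(visible) + 1 / len(opaque))) ** 0.5

print(slope, diff)
print(round(t_regression, 4), round(t_two_sample, 4))
```

Both routes produce the same estimate and the same test statistic, which is why the choice between them is mostly about convention and convenience.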

You mention repeated measures. You can use that if you do in fact have pre-test and post-test conditions. You could even use a paired t-test if you have only the two groups and both a pre-test and a post-test.

There is one potential complication. You mention that the DV is a count of correct responses. Counts often do not follow the normal distribution but can follow other distributions such as the Poisson and negative binomial distributions, although counts can approximate the normal distribution when the mean is high enough (roughly above 20). However, if you have two groups and each group has more than 15 observations, the analyses are robust to departures from the normal distribution.

I hope this helps! Best of luck with your analysis!

Kris mckinnon says

Thank you so much for the reply. I appreciate it. I finally worked it out and got a good mark on the lab report, which was good :). I appreciate your time replying. You explain things very clearly, so thank you.

Kris mckinnon says

Hi there. I am currently doing a lab report and have not done stats in years, so I’m hoping someone can help, as it’s due tomorrow.

When I do a bivariate correlation test, it shows the correlation is not significant between a personality trait and a particular cognitive task. Yet when I conduct a simple t-test, it shows a significant p-value and gives the 95% confidence interval. If I want to show that higher scores on one trait tend to mean higher scores on a particular cognitive task, should I be doing a regression? We were told to use basic correlations, so I did the bivariate option and just stated that the Pearson’s r is not significant, r=.. n= p =.84 for example. Yet if I do a regression analysis for each, it is significant. Why could this be?

Thankyou

Jim Frost says

Hi Kris,

There are not quite enough details to know for sure what is happening, but here are some ideas.

Be aware that a series of pairwise correlations is not equivalent to performing regression analysis with multiple predictors. Suppose you have your outcome variable and two predictors (Y X1 X2). When you perform the pairwise correlations (X1 and Y, X2 and Y), each correlation does not account for the other X. However, when you include both X1 and X2 in a regression model, it estimates the relationship between each X and Y while accounting for the other X.

If the correlation and regression model results differ as you describe, you might well have a confounding variable, which biases your correlation results. I write about this in my post about omitted variable bias. You’d favor the regression results in this situation.

As for the difference between the 2-sample t-test and correlation, that’s not surprising because they are doing two entirely different things. The 2-sample t-test requires a continuous outcome variable and a categorical grouping variable and it tests the mean difference between the two groups. Correlations measure the linear association between two continuous variables. It’s not surprising the results can differ.

It sounds like you should probably use regression analysis and include your multiple continuous variables in the model along with your categorical grouping variables as independent variables to model your outcome variable.

Best of luck with your analysis!

Kathlene Gale M. Dulay says

This is Kathlene, and I am a Grade 12 student. I am currently doing my research. It’s quantitative research. I am having a little trouble with how I will approach my statistical treatment. My research is entitled “Emotional Quotient and Academic Performance Among Senior High School Students in Tarlac National High School: Basis to a Guidance Program.”

I was battling what to use to determine the relationship between the variables in my study.

I’m thinking of using the chi-square method, but a friend said it would be more accurate to use the regression analysis method. Math is not really my field of study, so I badly need your opinion regarding this.

I’m hoping you could lend me a helping hand.

Thank you.

Jim Frost says

Hi Kathlene,

It sounds like you’re in a great program! I wish more 12th grade students were conducting studies and analyzing their results! 🙂

To determine how to model the relationships between your variables, it depends on the type of variables you have. It sounds like your outcome variable is academic performance. If that’s a continuous variable, like GPA, then I’d agree with your friend that regression analysis would be a good place to start!

Chi-square assesses the relationship between categorical variables.

Best of luck with your analysis!

Umar Awan says

Hi Mr Jim,

I am using an orthogonal design with 7 factors at three levels. I have done regression analysis in Minitab, but I don’t know how to explain or interpret the results. I need your help in this regard.

Jim Frost says

Hi Umar,

I have a lot of content throughout my blog that will help you, including how to interpret the results. For a complete list for regression analysis, check out my regression tutorial.

Also, early next year I’ll be publishing a book about regression analysis as well that contains even more information.

If you have a more specific question after reading my other posts, you can ask them in the comments for the appropriate blog post.

Best of luck!

Ty Pulliam says

By the way my gun laws vs VCR, is part of a regression model. Any help you can give, I’d greatly appreciate.

Ty Pulliam says

Mr. Jim, I have a problem. I’m working on a research design on gun laws vs homicides with my dependent variable being violent crime rate. My sig is .308 The constant’s (VCR) standard error is 24.712 my n for violent crime rate is 430.44. I really need help ASAP. I don’t know how to interpret this well. Please help!!!

Jim Frost says

Hi Ty,

There’s not enough information for me to know how to interpret the results. How are you measuring gun laws? Also, VCR is your dependent variable, not the constant as you state. You don’t usually interpret the constant. All I can really say is that based on your p-value, it appears your independent variable is not statistically significant. You have insufficient evidence to conclude that there is a relationship between gun laws and homicides (or is it VCR?).

angela says

Hi Jim

Your blog has been very useful. I have a query.. if I am conducting a multiple regression is it okay to have an outcome variable which is normally distributed ( i winsorized an outlier to achieve this) and have two other predictor variables which are not normally distributed? ( the normality tests scores were significant).

I have read in many places that you have to transform your data to achieve normality for the entire data set to conduct a multiple regression, but doing so has not helped me at all. Please advise.

Jim Frost says

Hi Angela,

I’m dubious about the Winsorizing process in general. Winsorizing reduces the effect of outliers. However, this process is fairly indiscriminate in terms of identifying outliers. It simply defines outliers as being more extreme than an upper and lower percentile and changes those extreme values to equal the specified percentiles. Identifying outliers should be a point by point investigation. Simply changing unusual values is not a good process. It might improve the fit of your data but it is an artificial improvement that overstates the true precision of the study area. If that point is truly an outlier, it might be better to remove it altogether, but make sure you have a good explanation for why it’s an outlier.
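To make the mechanics concrete, here is a minimal sketch of Winsorizing. The `winsorize` function, its percentile convention (nearest rank, one of several in use), and the data are all assumptions for illustration and are not taken from any particular software package.

```python
def winsorize(values, lower_pct=5, upper_pct=95):
    """Clamp values outside the given percentiles to those percentile values."""
    ordered = sorted(values)
    n = len(ordered)
    # Nearest-rank style percentile positions (one of several conventions).
    lo = ordered[int(round((n - 1) * lower_pct / 100))]
    hi = ordered[int(round((n - 1) * upper_pct / 100))]
    return [min(max(v, lo), hi) for v in values]

data = [12, 14, 15, 15, 16, 17, 18, 18, 19, 20, 95]   # 95 is unusual
clipped = winsorize(data, lower_pct=10, upper_pct=90)
print(clipped)
```

Notice that the procedure changes the unusual value of 95, but it also changes the smallest value, 12, which is not unusual at all. That is the indiscriminate behavior described above.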

For regression analysis, the distributions of your predictors and response don’t necessarily need to be normally distributed. However, it’s helpful, and generally sought, to have residuals that are normally distributed. So, check your residual plots! For more information, read my post about OLS assumptions so you know what you need to check!

If your residuals are nonnormally distributed, sometimes transforming the response can help. There are many transformations you can try. It’s a bit of trial and error. I suggest you look into the Box-Cox and Johnson transformations. Both methods assess families of transformations and pick one that works best for your data. However, it sounds like your outcome is already normally distributed so you might not need to do that.

Also, see what other researchers in your field have done with similar data. There’s little general advice I can offer other than to check the residuals and make sure they look good. If there are patterns in the residuals, make sure you’re fitting curvature that might be present. You can graph the various predictors by the residuals to find where the problem lies. You can also try transforming the variables as I described earlier. While the variables don’t need to follow the normal distribution, if they’re very nonnormally distributed, it can cause problems in the residuals.

Best of luck with your analysis!

DMA says

Hi, I am confused about the assumption of independent observations in multiple linear regression. Here’s the case. I have heart rate data per five-minute for a day of 14 people. The dependent variable is the heart rate. During the day, the workers worked for 8 hours (8 am to 5 pm), so basically, I have 90 data points per worker for a day. So that makes it 1260 data points (90 times 14) to be included in the model. Is it valid to use multiple linear regression for this type of data?

Jim Frost says

Hi DMA,

It sounds like your model is more of a time series model. You can model those using regression analysis as well, but there are special concerns that you need to address. Your data are not independent. If someone has a high heart rate during one measurement, it’s very likely it’ll also be elevated 5 minutes later. The residuals are likely to be serially correlated, which violates one of the OLS assumptions.

You’ll likely need to include other variables in your model that capture this time dependent information, such as lagged variables. There are various considerations you’ll need to address that go beyond the scope of these comments. You’ll need to do some additional research into using regression analysis for time series data.
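One widely used check for serial correlation is the Durbin-Watson statistic, computed on the residuals in time order. The sketch below uses simulated series as stand-ins for residuals, not heart rate data, and the carry-over of 0.9 between consecutive values is an assumption. Values near 2 suggest little lag-1 correlation, while values near 0 suggest strong positive correlation.

```python
import random

def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

random.seed(3)
n = 900
# Stand-in for residuals with no carry-over between time points.
independent = [random.gauss(0, 1) for _ in range(n)]

# Stand-in for serially correlated residuals: each value carries over
# 90% of the previous one (an AR(1) process, assumed for illustration).
correlated = [random.gauss(0, 1)]
for _ in range(n - 1):
    correlated.append(0.9 * correlated[-1] + random.gauss(0, 1))

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)
print(round(dw_independent, 2), round(dw_correlated, 2))
```

The independent series lands close to 2, and the series with carry-over lands far below it, which is the pattern you would expect from measurements taken every five minutes on the same person.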

Best of luck with your analysis!

Asad says

Ok.Thank you so much.

Asad says

Thank you so much for your time!

Actually i don’t have authentic data about property values (dependent variable) nor the concerning institutions have this data. Can i ask the property value directly to the property owner thorough walk interview?

Jim Frost says

You really need to have valid data. Using a self-reported valuation might be better than no data. However, be aware there might be differences between what the property owner says and the true market value. Your model would describe self-valuation rather than market valuation. Typically, I’ve seen studies like yours use actual sales prices.

Asad says

Hello Sir!

Is it necessary for the dependent variable in a multiple regression model to have values? I have a number of independent variables (age of property, stories in building, location close to park) and a single dependent variable (property values).

Some independent variable decrease the value of dependent variable, while some independent variables increase the value of the dependent variable?

Can I put the value of my single dependent variable as ( a.<200000, b.<300000,c. d. 500000)?

Jim Frost says

Hi Asad,

Why can’t you enter the actual property values? Ideally, that’s what you would do. If you are missing a value for a particular observation, you typically need to exclude the entire observation from the analysis. However, there are some ways to estimate missing values. For example, SPSS has advanced methods for imputing missing values. But, you should use those only to estimate a few missing values. Your plan should be to obtain the property values. If you can’t do that, it will be difficult to perform regression analysis.

There are some cases where you can’t record the exact values and it’s usually related to the observation time. This is known as censored data. A common example is in reliability analysis where you record failure times for a product. You run the experiment for a certain amount of time and you obtain some failures and know their failure times. However, some products don’t fail and you only know that their failure time is greater than the test time. There are censored regression models you can use in situations like that. However, I don’t think that applies to your subject-area, at least as far as I can tell.

Best of luck with your analysis!

Rachel Wang says

thank you so much Jim! this is really helpful 🙂

Jim Frost says

You’re very welcome! Best of luck with your analysis!

Rachel Wang says

Hi Jim,

The variances (SD) for the 3 groups are 0.45, 0.7 and 1. Would you say that they vary by a lot?

Another follow-up question: does a narrower CI equal a better estimate?

thanks!

Jim Frost says

Yes, that’s definitely it!

I would suggest using Welch’s one-way ANOVA to analyze it and potentially use that analysis to calculate the CI. You’re essentially performing a one-way ANOVA. And, in ANOVA, there is the assumption of equal variances between groups, which your data do not satisfy. In regression, we’d refer to it as heteroscedasticity. In Welch’s ANOVA, you don’t need to satisfy that assumption. That makes it a simple solution for your case.

In terms of CIs, yes, narrower CIs indicate that the estimate is more precise than if you had a wider CI. Think of the CI as a margin of error around the estimate and it’s good to have a smaller margin of error. With a narrower CI, you can expect the actual mean to fall closer to the fitted value.

Rachel Wang says

Hi Jim,

Thank you so much for the quick response!

I checked the residual plots, it gives me a pretty trend line at y=0, and my R square = 0.87. However the CI it gives me by using all 15 points (regression inference) is a little wider (2.012 – 3.655) than if I just use that 5 points(2.245 – 3.355). In this case, would you still prefer using all 15 points?

Thank you!

Jim Frost says

That’s tricky. I hate to throw out data, but it does seem warranted. At least you have a good rationale for not using the data!

CIs of the mean for a point at the end of a data range in a regression model do tend to be wider than in the middle of the range. Still, I’m not sure why it would be wider. Are the variances of the groups roughly equal? If not, that might well be the reason.

Rachel Wang says

Hi Jim,

suppose I have total of 15 data points at x=0, x=40, and x=80 (5 data points at each x value), now I can use regression to estimate y when x=60. But what if I want to estimate the average when x=0? Should I just use that 5 data points when x=0, or use the intercept from the regression line? Which is the best estimate for a 95% CI for the average y value when x=0?

Thank you 🙂

Jim Frost says

Hi Rachel,

Assuming that model provides a good fit to the data (check the residual plots), I’d use all the data to come up with the CI for the fitted value that corresponds to X = 0. That approach uses more data to calculate the estimate. Your CI might even be more precise (narrower) using all the data.
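Here is a rough sketch of the two options, using made-up responses with roughly equal spread at each X value. It compares the 95% CI half-width for the mean at X = 0 computed from only the five points at X = 0 against the half-width from the regression line fitted to all 15 points. The t critical values are hard-coded from a standard t table, and all data values are hypothetical.

```python
from statistics import mean, stdev

x = [0] * 5 + [40] * 5 + [80] * 5
y = [2.4, 2.9, 3.1, 2.6, 3.0,    # hypothetical responses at x = 0
     4.8, 5.3, 5.1, 4.6, 5.2,    # hypothetical responses at x = 40
     7.2, 6.8, 7.5, 7.0, 7.4]    # hypothetical responses at x = 80

# Two-sided 95% t critical values from a standard t table, keyed by df.
T_CRIT = {4: 2.776, 13: 2.160}

# Option 1: only the five observations at x = 0 (df = 5 - 1 = 4).
y0 = y[:5]
half_width_subset = T_CRIT[4] * stdev(y0) / len(y0) ** 0.5

# Option 2: the regression line fitted to all 15 points (df = 15 - 2 = 13).
n = len(x)
x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar
sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
s = (sse / (n - 2)) ** 0.5                       # residual standard error
x0 = 0
se_fit = s * (1 / n + (x0 - x_bar) ** 2 / sxx) ** 0.5
half_width_regression = T_CRIT[13] * se_fit

print(round(half_width_subset, 3), round(half_width_regression, 3))
```

In this made-up data set the straight line fits well and the spread is similar in each group, so the interval based on all 15 points is narrower. With unequal variances or a poor fit, that advantage can disappear.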

Salam says

Hi,

What make us use the linear regression instead of other types of regression. In other words, the motivation for selecting a linear model?

Jim Frost says

Hi Salam,

Typically, try linear regression first. If your data contain curvature, you might still be able to use linear regression. Linear regression is generally easier to use and includes some useful statistics that nonlinear regression can’t provide, such as p-values for the coefficients and R-squared.

However, if you can’t adequately fit the curvature in your data, it might be time to try nonlinear regression. While both types allow you to fit curvature, nonlinear regression is more flexible because it allows your model to fit more types of curvature.

I’ve written a post about how to choose between linear and nonlinear regression that you should read. Within that post are various related links that talk about how to fit curves using both types of regression, along with additional information about both types.

I hope this helps!

V.G.Subramanian says

Hi Jim

Thank u so much for your reply. I am really gorgeous to know much more of this . I shall keep sending mails seeking your reply which i hope you will not mind

V.G.Subramanian says

Hi Jim

I have been unfortunate to get your reply to my comment on 18/09/2018

Jim Frost says

Hi V.G.,

Sorry about the delay. As you can no doubt imagine, my schedule gets busy and things can fall through the cracks.

I replied under your original comment.

Aisling Dunphy says

Hi Jim,

Your blog has been really helpful! 🙂 I am currently completing my Masters Thesis and my primary outcome is to assess the relationship between Diabetes Distress and blood glucose control. I am a newbie to SPSS and I am at a loss as to how best to analyse my small (not normally distributed pre and post data transformation) data set.

I have been advised that regression analysis may be appropriate and better than correlations? However my data does not appear to be linear.

My diabetes distress variables consist of a score of 1-6 based on a likert scale and also are categorical (low, moderate, high distress) and my blood glucose consists of continuous data and also a categorical variable of poorly controlled blood glucose and well controlled blood glucose.

At the moment I am struggling to complete this analysis. Any help would be greatly appreciated 🙂

Tetyana says

Dear Jim, thank you very much for this post! Could you, please, explain the following.

You are writing: “you first need to fit and verify that you have a good model. Then, you look through the regression coefficients and p-values”

What if I have a small R-squared, but the coefficients are statistically significant with small values?

Almadi Obere says

Hi Jim

Thanks for your enlightened explanations. However, I want to engage you a bit. Under how to interpret regression results, you indicated that a small p-value indicates that the “independent variable is statistically significant”. I tend not to agree. Note that since the null hypothesis is that the coefficient of the independent variable is equal to zero, its rejection as evidenced by a low p-value should imply that it is the coefficient which is significantly different from zero and not the variable.

almadi

Jim Frost says

Hi Almadi,

Yes, you’re correct that the p-value tests whether the coefficient estimate is significantly different from zero. If it is, you can say that the coefficient is statistically significant. Alternatively, statisticians often say that the independent variable is statistically significant. In this context, these are two different ways of saying the same thing because the coefficient is a property of the variable itself.

V.G.Subramanian says

Hi Jim

As u must be well aware, govt releases price indices and these are broadly used to determine the effect of base prices during a given period of time.

The construction industry normally uses these price indices running over a period of time to redetermine the prices based on the movement between the base date and current date, which is called price adjustment.

Govt after a few years of time releases a new series of price indices where we may not have the data of indices with old series which will necessitate us to use these new indices with a conversion factor to arrive at the equivalent value of the base price.

Where do you feel that Regression Analysis could be of help where we have to determine the current value of the base price using the new indices.

It is a bit amusing that someone was suggesting to me.

Regards

V.G.Subramanian

Jim Frost says

Hi V.G.,

I agree that switching price indices can be a problem. If the indices overlap, you can perform regression analysis where the old index is the independent variable and the new index is the dependent variable. However, that is problematic if you don’t have both indices. If you had both indices, I suppose it wouldn’t be a problem to begin with!
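As a rough illustration of the overlapping-index idea, here is a minimal Python sketch. The index values are made up for illustration, and the names `fit_line` and `old_to_new` are hypothetical helpers rather than part of any established method. It fits a simple least-squares line with the old index as the independent variable and the new index as the dependent variable, then uses that line as a conversion.

```python
# Hypothetical readings for periods in which BOTH index series were published.
old_index = [100.0, 104.0, 109.0, 115.0, 120.0, 126.0]
new_index = [50.5, 52.4, 55.1, 57.9, 60.6, 63.4]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(old_index, new_index)


def old_to_new(old_value):
    """Convert an old-series value to the new series using the fitted line."""
    return intercept + slope * old_value


print(f"new = {intercept:.3f} + {slope:.4f} * old")
print(f"old index 112 maps to about {old_to_new(112.0):.2f} on the new series")
```

This only works inside the range where the two series overlap. Extrapolating far beyond that range assumes the relationship between the indices stays the same, which may not hold.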

Ideally, you’d understand the differences behind how the government calculates both indices, and you could use that to estimate the value of the other index.

I’m not particularly familiar with this practice, so I don’t have a whole lot of insight into it. I hope this helps somewhat!

Antonio Padua says

Thank you for this, Jim. I’ve always felt a common sense explanation minus all the impressive math formulas is what is needed in statistics for data science. This is a big part of the basics I’ve been missing. I’m looking forward to your Logistic Regression Tutorial. How is that coming along for you?

Jim Frost says

Hi Antonio,

Thanks so much for your kind words! They mean a lot to me! Yes, I totally agree, explanations should focus on being intuitive and helping people grasp the concepts.

I have written a post on binary logistic regression. Unfortunately, it’ll be a while before I have a chance to write a more in-depth article. There are just too many subjects to write about!

Rashmi Bs says

Dear sir,

I have a few questions about when to use ANOVA and when to use regression analysis. In my study, I conducted an experiment with temperature, pH, and weight of a compound as independent variables and extraction as the dependent variable (I am describing this very generally, but I have some specific independent and dependent variables along with these). I performed the statistical analysis using one-way ANOVA with Tukey’s test, and I used the grouping method (letters a, b, c, …) to show significance based on the p-value. My question is, can I use regression analysis for this type of data? And what is the main difference between Tukey’s test and regression analysis?

Jim Frost says

Hi Rashmi,

Both regression analysis and ANOVA are linear models. As linear models, both types of analyses have the same math “under the hood.” You can even use them interchangeably and get the same results. Traditionally, you use ANOVA when you have only, or mainly, categorical factors–although you can add in covariates (continuous variables). On the other hand, you tend to use regression when you have only, or mainly, continuous variables–although you can add in categorical variables.
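To see the “same math under the hood” point in action, here is a small sketch in plain Python. The data are hypothetical, and `solve` and `ols` are illustrative helpers written for this example. It computes the one-way ANOVA F-statistic from group means, then fits a regression with indicator (dummy) variables for the same groups and computes the overall F-statistic from that model.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical extraction yields at three temperature settings.
groups = {
    "low": [12.1, 11.8, 12.6, 12.0],
    "mid": [13.4, 13.9, 13.1, 13.6],
    "high": [15.2, 14.8, 15.5, 15.1],
}

# One-way ANOVA F-statistic computed from the group means.
all_y = [v for vals in groups.values() for v in vals]
n, k = len(all_y), len(groups)
grand = sum(all_y) / n
ss_between = sum(len(v) * (sum(v) / len(v) - grand) ** 2 for v in groups.values())
ss_within = sum((yi - sum(v) / len(v)) ** 2 for v in groups.values() for yi in v)
f_anova = (ss_between / (k - 1)) / (ss_within / (n - k))

# Regression with an intercept and two indicator variables ("low" is the baseline).
X, y = [], []
for name, vals in groups.items():
    for yi in vals:
        X.append([1.0, float(name == "mid"), float(name == "high")])
        y.append(yi)
b = ols(X, y)
fitted = [sum(bj * xj for bj, xj in zip(b, row)) for row in X]
ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ss_model = sum((fi - grand) ** 2 for fi in fitted)
f_regression = (ss_model / (k - 1)) / (ss_resid / (n - k))

print(f"ANOVA F = {f_anova:.4f}, regression F = {f_regression:.4f}")
```

The two F-statistics agree up to floating-point rounding because the regression’s fitted values are just the group means. The intercept estimates the mean of the baseline group, and each indicator coefficient estimates a difference from that baseline.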

Because ANOVA focuses on categorical factors and comparing multiple group means, statisticians have developed additional post hoc analyses to work with ANOVA, such as Tukey’s test. Typically, you’ll perform the ANOVA first and then the post hoc test. Suppose you perform a one-way ANOVA and obtain significant results. This significance tells you that not all of the group means are equal. However, it does not tell you which differences are statistically significant.

That point is where post hoc tests come in. These tests do two things. They’ll tell you which differences are statistically significant. They also control the family error rate for the group of comparisons. When you compare multiple differences like that, you increase the risk of a Type I error, which is when you say there is a difference but there really isn’t. When you compare multiple means, the Type I error rate will be higher than your significance level (alpha). These post hoc tests (other than Fisher’s) maintain the Type I error rate so it continues to equal alpha, which is what you would expect.
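A quick back-of-the-envelope calculation shows how fast the family error rate grows without a correction. This is a sketch under a simplifying assumption: it treats the pairwise comparisons as independent, which they are not exactly, so the numbers are illustrative rather than exact. The function name `familywise_error` is made up for this example.

```python
from math import comb


def familywise_error(alpha, n_comparisons):
    """Chance of at least one false positive across independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_comparisons


alpha = 0.05
for n_groups in (3, 4, 5, 6):
    n_comparisons = comb(n_groups, 2)  # number of pairwise comparisons
    rate = familywise_error(alpha, n_comparisons)
    print(f"{n_groups} groups -> {n_comparisons} comparisons -> family error rate about {rate:.3f}")
```

With only three groups there are already three pairwise comparisons, and the uncorrected family error rate under this assumption is roughly 0.14 rather than 0.05.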

So, use an ANOVA first. If you obtain significant results for a categorical factor, you can use post hoc tests like Tukey’s to explore the differences between the various factor levels.

I really need to write a blog post about this! I will soon!

In the meantime, I hope this helps!

Kaushal Kumar Bhagat says

Is it necessary to conduct correlation analysis before regression analysis?

Jim Frost says

Hi Kaushal,

No, it’s not absolutely required. I actually prefer producing a series of scatterplots (or a matrix plot) so I can see the nature of the different relationships. That helps give me a better feel for the data along with the types of relationships. However, if you have a good theory and solid background knowledge about which variables should be included in the model, you can go straight to modeling. I think it depends a lot on your existing level of knowledge.

That all said, I personally like knowing the correlation structure between all of the variables. It gives me a better feel for the data.

Saeed Anowar says

Thanks

Patrik Silva says

Thank you Jim!

I really appreciate it!

PS

Patrik Silva says

Hi Jim, I hope you are having good time!

I would like to ask you a question, please!

I have 24 observations (let’s say zones) for a regression analysis, and I have many independent variables (IVs). I would like to know the minimum number of observations I should have to fit a reasonable linear regression model. I would also like to hear how to test many regression models with different IVs, since I cannot use many IVs in a model where I have few observations (24).

Thank you in advance!

Patrik

Jim Frost says

Hi Patrik, great to hear from you again!

Those are great questions. For 24 observations, I’d say that you usually wouldn’t want more than 2 IVs. I’ve written an entire post about how many variables you can include in a regression model. Including too many IVs (and other terms such as interactions and polynomials) is known as overfitting the model. Check that post out because it’ll provide guidance and show you the dangers of including too many.

There’s another issue at play too because you want to compare a number of different regression models to each other. If you compare many models, it’s a form of data mining. The risk here is that if you compare enough models, you will uncover chance correlations. These chance correlations look like the real thing but only appear in your sample and not the population. I’ve written a post about how using this type of data mining to choose a regression model causes problems. This concern is particularly problematic with a small sample size like yours. It can find “patterns” in randomly generated data.
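Here is a small simulation sketch of that risk, using only random noise. Everything in it is hypothetical: the 24 observations match the sample size in the question, but the 200 candidate predictors and the seed are arbitrary choices for illustration.

```python
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)   # fixed seed so the run is repeatable
n_obs = 24               # small sample, as in the question
n_candidates = 200       # number of unrelated predictors we "try out"

# The response and every candidate predictor are pure noise, so the true
# correlation between the response and each predictor is zero.
y = [rng.gauss(0, 1) for _ in range(n_obs)]
correlations = [
    abs(pearson_r([rng.gauss(0, 1) for _ in range(n_obs)], y))
    for _ in range(n_candidates)
]
best_r = max(correlations)

print(f"largest |r| among {n_candidates} noise predictors: {best_r:.2f}")
```

Even though nothing here is related to anything else, the best of the 200 noise predictors will typically show a correlation that looks meaningful for a sample of 24. Searching through many candidate models picks out exactly these flukes.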

So, there are really two issues for you to watch out for: overfitting and chance correlations found through data mining!

Hope this helps!

Patrik Silva says

Many Thanks Jim!!! You have no idea about how much you helped me.

Very well clarified!!!

God bless you always!!!

Patrik

Patrik Silva says

Hi Jim, I am everywhere in your post!

I am starting to love statistics, which is why I am not quiet.

I have some questions for you:

To use OLS regression, one of the assumptions is that the dependent variable is normally distributed. What should I do with my data to meet this requirement? Should I check the normality of my dependent variable, for example using the Shapiro test? If I conclude that my dependent variable does not follow the normal distribution, I should start to look at data transformations, right?

Another way I have seen people assess normality is by plotting the dependent variable against the independent variable. If the relationship doesn’t follow a linear trend, they go to data transformation (which one do you recommend?).

Should I perform the regression using my original data, so that the residuals will show me non-normality if it exists?

When should I transform my independent variables, and what is the consequence of transforming them?

Sorry, I tend to ask many questions in a single comment, but I think this is the way to understand the full picture of my doubt.

You are being so useful to me,

Thank you again!

Patrik

Jim Frost says

Hi Patrik, I’m so happy to hear that you’re starting to love statistics! It’s an exciting field that combines the thrill of discovery with getting the most value out of your data. I’m not sure if you’ve read my post about The Importance of Statistics, but if you haven’t, I recommend it. It explains why the field of statistics is more important than ever!

In OLS regression, the dependent variable does not have to be normally distributed. Instead, you need to assess the distribution of the residuals using residual plots. If your residuals are not normally distributed, there are a variety of possible reasons and different ways to resolve that issue. I always recommend that transforming your data is the last resort. For example, the residuals might be nonnormal because the model is specified incorrectly. Maybe there is curvature in the data that you aren’t modeling correctly? If so, transforming the data might mask the problem. You really want to specify the best possible model. However, if all else fails, you might need to transform the data. When you transform the data, you’ll need to back transform the results to make sense of the results because everything applies to the transformed data. Most statistical software should do this for you.

Be aware that you can’t trust R-squared and the standard error of the regression when you transform your dependent variable because they apply to the transformed data rather than the raw data (backtransformation won’t help there).

In terms of testing the normality of the residuals, I recommend using normal probability plots. You can usually tell at a glance whether they are normally distributed. If you need a test, I generally use the Anderson-Darling test–which you can see in action in my post about identifying the distribution of your data. By the way, as a case in point, the data in that post are not normal, but I use it as the dependent variable in OLS regression in this post about using regression to make predictions. The residuals are normally distributed even though the dependent variable is not.
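The point that the residuals can be well behaved even when the dependent variable is not can be sketched with simulated data. Note the assumptions: the data are generated rather than real, and sample skewness is used as a crude stand-in for the normal probability plot or Anderson-Darling test mentioned above, purely to keep the example short. The helper name `skewness` is made up for this sketch.

```python
import random


def skewness(values):
    """Sample skewness: about 0 for symmetric data, positive for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(1)
n = 2000

# A right-skewed independent variable makes the dependent variable skewed too,
# even though the error term itself is normal.
x = [rng.expovariate(1.0) for _ in range(n)]
y = [2.0 + 3.0 * xi + rng.gauss(0, 1) for xi in x]

# Fit y = intercept + slope * x by ordinary least squares.
mean_x, mean_y = sum(x) / n, sum(y) / n
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
intercept = mean_y - slope * mean_x
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"skewness of dependent variable: {skewness(y):.2f}")
print(f"skewness of residuals:          {skewness(residuals):.2f}")
```

The dependent variable is clearly right-skewed, while the residuals are close to symmetric. That is why the check belongs on the residuals after fitting the model, not on the raw dependent variable.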

SUBROTO Chatterjee says

In the coffee intake and smoking example, the first result showed that higher coffee intake leads to higher mortality, but after including smoking, coffee intake leads to lower or no mortality? Smoking was revealed to cause the mortality, but how did coffee intake now result in the opposite? Was a separate test taken for this result? Please let me know.

S. CHATTERJEE

Jim Frost says

Hi, that’s a great question. It turns out that coffee and smoking are correlated. The negative effects of smoking on mortality are well documented. However, for some reason, the researchers did not originally include smoking in their model. Because drinking coffee and smoking are correlated, the variable for coffee consumption took on some of smoking’s effect on mortality.

Put another way, because smoking was not included in the model, it was not being controlled (held constant). So, as you increased coffee consumption, smoking also tended to increase because it is both positively correlated with coffee consumption and not in the model. Therefore, it appeared as though increased coffee consumption is correlated with higher mortality rates but only because smoking was not included in the model.

Presumably, the researchers had already collected data about smoking. So, all they had to do was include the smoking variable in their regression model. Voila, the model now controls for smoking and the new output displays the new estimate of the effect that coffee has on mortality.
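To show the mechanics, here is a simulation sketch. It is not the study’s data: the numbers are invented, the “risk” score is a hypothetical continuous outcome, and the simulation assumes coffee has no true effect at all so that any apparent effect must come from the omitted variable.

```python
import random


def cov(a, b):
    """Sample covariance; cov(a, a) is the sample variance."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


rng = random.Random(42)
n = 5000

smoking = [rng.gauss(0, 1) for _ in range(n)]
# Coffee consumption is positively correlated with smoking.
coffee = [0.7 * s + rng.gauss(0, 0.7) for s in smoking]
# Assumption: smoking raises risk, coffee has NO true effect.
risk = [1.0 * s + 0.0 * c + rng.gauss(0, 1) for s, c in zip(smoking, coffee)]

# Model 1: risk ~ coffee (smoking left out of the model).
coef_coffee_alone = cov(coffee, risk) / cov(coffee, coffee)

# Model 2: risk ~ coffee + smoking, solved from the two-predictor normal equations.
s_cc, s_ss, s_cs = cov(coffee, coffee), cov(smoking, smoking), cov(coffee, smoking)
s_cy, s_sy = cov(coffee, risk), cov(smoking, risk)
det = s_cc * s_ss - s_cs ** 2
coef_coffee_controlled = (s_ss * s_cy - s_cs * s_sy) / det
coef_smoking = (s_cc * s_sy - s_cs * s_cy) / det

print(f"coffee coefficient without smoking in the model: {coef_coffee_alone:.2f}")
print(f"coffee coefficient controlling for smoking:      {coef_coffee_controlled:.2f}")
print(f"smoking coefficient:                             {coef_smoking:.2f}")
```

No separate test is involved. The same data produce both sets of estimates, and the only change is whether smoking is in the model. When it is left out, the coffee coefficient absorbs part of smoking’s effect. When it is included, the coffee coefficient falls back toward its true value, which is zero in this simulation.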

This point illustrates a potential problem. If the researchers had not collected the smoking information, they would have really been stuck. Before conducting any study, researchers need to do a lot of background research to be sure that they are collecting the correct data!

I hope this helps!

Ahmed says

Hi Jim

Hope all is well.

I have a problem with plotting the relationship between the dependent variable (response) and the independent variables.

When I do the main effects plots, I get an increasing straight line.

y = x, a linear trend.

To change it, I need to make y = square root of time.

I’m stuck on this and couldn’t find a solution for it.

Regards

Cara says

Hi Jim,

I was wondering if you can help me? I am doing my dissertation and I have 1 within-subjects IV and 3 between-subjects IVs. Most of my variables are categorical, but one is not. It is a questionnaire that I am using to determine sleep quality, with both Likert scales and participants’ own answers for amount of sleep (hours), number of times they woke in the night, etc. Can I use a regression when making use of both categorical data and other data? I also have multiple DVs (angry/sad Likert ratings), but I *could* combine those into one overall ’emotion’ DV. Any help would be much appreciated!

Jim Frost says

Hi Cara, because your DV uses the Likert scale, you really should be using Ordinal Logistic Regression. This type of regression is designed for ordinal dependent variables like yours. As for the IVs, it can be tricky using ordinal variables. They’re not quite either continuous or categorical. My suggestion is to give them a try as continuous variables and check the residual plots to see how they look. If they look good, then it’s probably ok. However, if they don’t look good, you can try refitting the model using them as categorical variables and then rechecking the residual plots. If the residuals still don’t look good, you can then try using the chi-square test of independence for ordinal data.

As for combining the data, that would seem to be a subject-area specific decision, and I don’t know that area well enough to make an informed recommendation.

Martin Amsteus says

Yes. But it may be that you miss my point. I argue that a proper and sound experiment will allow you to test for causality, regardless of whether you deploy, e.g., Pearson’s r or regression. With no experimental design, neither Pearson’s r nor a regression will test for an effect relationship between the variables. Randomisation makes a better case for controlling for variables that you are unaware of than picking a few and then proclaiming that your study found that x will cause an increase in y or that x has an effect on y. You may as well argue that you don’t need to control for any variables and that any correlational study tests for effect relationships.

Jim Frost says

Hi Martin, yes, that is *exactly* what I’m saying. Whether you can draw causal conclusions depends on whether you used a randomized experiment to collect your data. If it’s an observational study, you can’t assume it’s anything other than correlation. What you write in your comment agrees with what I’m saying.

The controlling for other variables that I mention in this post is a different matter. Yes, if you include a variable in a regression model, it is held constant while estimating the effects of the other variables. That doesn’t mean you can assume causality though.

Martin Amsteus says

No statistical tool or method turns a survey or correlation study into an experiment, i.e., regression does not test or imply a cause-effect relationship. A positive relationship between smoking and cancer in a regression analysis does not mean that smoking causes cancer. You have not controlled for what you are unaware of.

Jim Frost says

Hi Martin, you are 100% correct about the fact that correlation doesn’t imply causation. This issue is one that I plan to cover in future posts.

There are two issues at play here. The type of study under which the data were collected and the statistical findings.

Being able to determine causation comes down to the difference between an observational study versus a randomized experiment. You actually use the same analyses to assess both types of designs. In an observational study, you can only establish correlation and not causality. However, in a randomized experiment, the same patterns and correlations in the data can suggest causality. So, regression analysis *can* help establish causality, but only when it’s performed on data that were collected through a randomized experiment.

Hari says

Very nicely explained. Thank you!

Jim Frost says

Thank you, Hari!

Kaleem says

Thanks for your reply and for the guidance.

I read your posts which are very helpful. After reading them, I concluded that only the independent variables which have a well-established association with the dependent variable should be included. Hence, in my case, variable Z should not be included given that the association of Z with dependent variable is not well-established.

Furthermore, suppose there is another variable (A) and literature suggests that it, in general, has an association with dependent variable. However, assume that A does not affect any independent variables so there is no omitted variable bias. In this case, if there is no data available for A (due to the study being conducted in different environment/context) then what statistical techniques can be deployed to address any problems caused due to the exclusion of A?

I look forward to your reply and I will be grateful for your reply.

Kind regards.

Kaleem says

Thanks for the reply. I apologise if I am taking a considerable time out of your schedule.

Based on the literature, there isn’t any conclusive evidence that z is a determinant of y. So, that is why I intend to remove z. Some studies include it while some do not and some find significant association (between y and z) while some find the association insignificant. Hence, I think I can safely remove it.

Moreover, I will be grateful if you can answer another query. From a statistical viewpoint, is it fine if I use the generalized method of moments (GMM) for a binary dependent variable?

Kind regards.

Jim Frost says

While I can’t offer you a concrete statement about whether you should include or exclude the variable (clearly there is disagreement in your own field), I do suggest that you read my article about specifying the correct regression model. I include a number of tips and considerations.

Unfortunately, I don’t know enough about GMM to make a recommendation. All of the examples I have seen personally are for continuous data, but I don’t know about binary data.

Kaleem says

Thanks for your reply. I really appreciate it. Could you please also provide an answer to my query mentioned below for further clarification?

Kaleem says

Further clarification on my post above: I found online that if a variable (z) is related to y but unrelated to x, then including z will reduce the standard errors of x. So, if z is excluded but the F-statistic and adjusted R-squared are fine, do high standard errors create problems? I look forward to your reply.

Jim Frost says

Yes, what you read is correct. Typically, if Z is statistically significant, you should include it in your model. If you exclude it, the precision of your coefficient estimates will be lower (higher standard errors). You also risk a biased model because you are not including important information in the model, so check the residual plots. The F-test of overall significance and adjusted R-squared depend on the other IVs in your model. If Z is by far the best variable, it’s possible that removing it will cause the F-test to not be significant, and adjusted R-squared might drop noticeably. Again, that depends on how the explanatory power of Z compares to the other IVs. Why do you want to remove a significant variable?

Kaleem says

Thanks for the reply. Jim.

I am unable to understand “Your model won’t fit the data as well as before depending on the strength of the relationship between the dropped independent variable and the dependent variable”. Are you stating that other independent variables will be fine but r-square will become low? I will be grateful if you can explain this.

Kind regards

Jim Frost says

Hi, you indicated that the removed independent variable is related to the dependent variable, but it is not correlated with the other independent variables. Consequently, removing that independent variable should reduce R-squared. For one thing, that’s the typical result of removing variables, even when they’re not statistically significant. In this case, because it is not correlated to the other independent variables, you know that the removed variable is supplying unique information. Taking that variable out means that information is no longer included in the model. R-squared will definitely go down, possibly dramatically.
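A simulated example may help here. This is a sketch with invented data in which Z is generated independently of X and has a strong true effect on Y. The coefficient values, sample size, and seed are arbitrary assumptions for illustration.

```python
import random


def cov(a, b):
    """Sample covariance; cov(a, a) is the sample variance."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


rng = random.Random(7)
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
z = [rng.gauss(0, 1) for _ in range(n)]  # generated independently of x
y = [1.0 * xi + 2.0 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

s_xx, s_zz, s_xz = cov(x, x), cov(z, z), cov(x, z)
s_xy, s_zy, s_yy = cov(x, y), cov(z, y), cov(y, y)

# Model 1: y ~ x, with z omitted.
b_x_only = s_xy / s_xx
r2_x_only = b_x_only * s_xy / s_yy
resid_var_1 = (1 - r2_x_only) * s_yy * (n - 1) / (n - 2)
se_x_only = (resid_var_1 / ((n - 1) * s_xx)) ** 0.5

# Model 2: y ~ x + z, solved from the two-predictor normal equations.
det = s_xx * s_zz - s_xz ** 2
b_x = (s_zz * s_xy - s_xz * s_zy) / det
b_z = (s_xx * s_zy - s_xz * s_xy) / det
r2_full = (b_x * s_xy + b_z * s_zy) / s_yy
resid_var_2 = (1 - r2_full) * s_yy * (n - 1) / (n - 3)
se_x_full = (resid_var_2 * s_zz / ((n - 1) * det)) ** 0.5

print(f"without z: b_x = {b_x_only:.3f}, SE = {se_x_only:.4f}, R-squared = {r2_x_only:.3f}")
print(f"with z:    b_x = {b_x:.3f}, SE = {se_x_full:.4f}, R-squared = {r2_full:.3f}")
```

In this setup, the coefficient for X barely moves when Z is dropped because the two are essentially uncorrelated. What changes is the fit: R-squared falls sharply and the standard error of the X coefficient grows, because the variation Z used to explain is now part of the error term.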

R-squared measures the strength of the relationship between the entire set of IVs and the DV. Read my post about R-squared for more information.

Kaleem says

Hello, Jim.

What is the impact* on the independent variables in the model if I omit a variable that is a determinant of dependent variable but is not related to any of the independent variables?

*Here impact relates to the independent variables’ p-values and the coefficients.

Kind regards.

Jim Frost says

Hi Kaleem,

If the independent variable is not correlated with the other independent variables, it’s likely that there would be a minimal effect on the other independent variables. Your model won’t fit the data as well as before depending on the strength of the relationship between the dropped independent variable and the dependent variable. You should also check the residual plots to be sure that by removing the variable you’re not introducing bias.

ghulam mustafa says

Why do we usually use the 5% level of significance for comparisons instead of 1% or another level?

Jim Frost says

Hi, I actually write about this topic in a post about hypothesis testing. It’s basically a tradeoff between several different error rates–and a dash of tradition. Read that post and see if it answers your questions.

Ghulam Mustafa says

Sir, usually we take the 5% level of significance for comparing. Why 0?

Jim Frost says

Hi Ghulam, yes, the significance level is usually 0.05. I’m not sure what you’re asking about in regards to zero? The p-values in the example output are all listed as 0.000, which is less than the significance level of 0.05, so they are statistically significant.

Shamsun Naher says

In my model, I use different independent variables. My question is, before using regression, do I need to check the distribution of the data? If yes, then please give the names of the tests. My title is “Education and Productivity Nexus: Evidence from the Pharmaceutical Sector in Bangladesh.”

Jim Frost says

Hi Shamsun, typically you test the distribution of the residuals after you fit a model. I’ve written a blog post about checking your residual plots that you should read.

I hope this helps!

Jim

Khadidja Benallou says

Thank you Mr. Jim

Jim Frost says

You’re very welcome!

NARAYANASWAMY AUDINARAYANA says

In linear regression, can we use categorical variables as independent variables? If yes, what should be the minimum or maximum number of categories in an independent variable?

Jim Frost says

Hi, yes you can use categorical variables as independent variables! The number of groups really depends on what makes sense for your study area. Of course, the minimum is two. There really is no maximum in theory. It depends on what makes sense for your study. However, in practice, having more groups requires a larger total sample size, which can become expensive. If you have 2-9 groups, you should have at least 15 in each group. For 10-12 groups, you should have 20. These numbers are based on simulation studies for ANOVA, but they also apply to categorical variables in regression. In a nutshell, figure out what makes sense for your study and then be sure to collect enough data!
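The rule of thumb above is easy to turn into a quick calculation. The function below is a hypothetical helper that encodes only the two ranges mentioned in this reply. It deliberately refuses to guess for more than 12 groups because no guideline is given for that case.

```python
def minimum_total_sample(n_groups):
    """Minimum total sample size implied by the per-group guideline in this reply."""
    if n_groups < 2:
        raise ValueError("a categorical variable needs at least two groups")
    if n_groups <= 9:
        per_group = 15   # guideline for 2-9 groups
    elif n_groups <= 12:
        per_group = 20   # guideline for 10-12 groups
    else:
        raise ValueError("no guideline is given here for more than 12 groups")
    return n_groups * per_group


for n_groups in (2, 5, 9, 10, 12):
    print(f"{n_groups} groups -> at least {minimum_total_sample(n_groups)} observations in total")
```

These totals are minimums for a single categorical variable. A model with additional variables or interaction terms would generally call for more data.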

I hope this helps!

Jim