Use regression analysis to describe the relationships between a set of independent variables and the dependent variable. Regression analysis produces a regression equation where the coefficients represent the relationship between each independent variable and the dependent variable. You can also use the equation to make predictions.

As a statistician, I should probably tell you that I love all statistical analyses equally, like parents with their kids. But, shhh, I have a secret! Regression analysis is my favorite because it provides tremendous flexibility, which makes it useful in so many different circumstances. In fact, I’ve described regression analysis as taking correlation to the next level!

In this blog post, I explain the capabilities of regression analysis, the types of relationships it can assess, how it controls the variables, and generally why I love it! You’ll learn when you should consider using regression analysis.

## Use Regression to Analyze a Wide Variety of Relationships

Regression analysis can handle many things. For example, you can use regression analysis to do the following:

- Model multiple independent variables
- Include continuous and categorical variables
- Use polynomial terms to model curvature
- Assess interaction terms to determine whether the effect of one independent variable depends on the value of another variable
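To make that list concrete, here is a minimal sketch that fits one model containing a continuous predictor, a categorical predictor, a polynomial term, and an interaction term. The data are simulated, the variable names (`x`, `group`) and all coefficients are made up for illustration, and plain NumPy least squares stands in for whatever statistical software you actually use.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: one continuous predictor and one two-level categorical predictor.
x = rng.uniform(0, 10, size=n)
group = rng.integers(0, 2, size=n)  # 0 = reference level, 1 = other level

# Simulated response with curvature and an interaction (made-up coefficients).
y = 5 + 2.0 * x - 0.3 * x**2 + 4.0 * group + 1.5 * x * group + rng.normal(0, 1, size=n)

# Design matrix columns: constant, x, x^2 (polynomial term),
# group indicator (categorical), x * group (interaction term).
X = np.column_stack([np.ones(n), x, x**2, group, x * group])

# Ordinary least squares estimates of the five coefficients.
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
for name, b in zip(["const", "x", "x^2", "group", "x:group"], coefs):
    print(f"{name:8s} {b: .3f}")
```

The estimates should land near the values used to simulate the data, which is the point: one regression model recovers the curvature, the group difference, and the interaction at the same time.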

These capabilities are all cool, but they don’t include an almost magical ability. Regression analysis can unscramble very intricate problems where the variables are entangled like spaghetti. For example, imagine you’re a researcher studying any of the following:

- Do socio-economic status and race affect educational achievement?
- Do education and IQ affect earnings?
- Do exercise habits and diet affect weight?
- Are drinking coffee and smoking cigarettes related to mortality risk?
- Does a particular exercise intervention have an impact on bone density that is a distinct effect from other physical activities?

More on the last two examples later!

All these research questions have entwined independent variables that can influence the dependent variables. How do you untangle a web of related variables? Which variables are statistically significant and what role does each one play? Regression comes to the rescue because you can use it for all of these scenarios!

## Use Regression Analysis to Control the Independent Variables

As I mentioned, regression analysis describes how the changes in each independent variable are related to changes in the dependent variable. Crucially, regression also statistically controls every variable in your model.

### What does controlling for a variable mean?

When you perform regression analysis, you need to isolate the role of each variable. For example, I participated in an exercise intervention study where our goal was to determine whether the intervention increased the subjects’ bone mineral density. We needed to isolate the role of the exercise intervention from everything else that can impact bone mineral density, which ranges from diet to other physical activity.

To accomplish this goal, you must minimize the effect of confounding variables. Regression analysis does this by estimating the effect that changing one independent variable has on the dependent variable while holding all the other independent variables constant. This process allows you to learn the role of each independent variable without worrying about the other variables in the model. Again, you want to isolate the effect of each variable.

### How do you control the other variables in regression?

A beautiful aspect of regression analysis is that you hold the other independent variables constant by merely including them in your model! Let’s look at this in action with an example.

A recent study analyzed the effect of coffee consumption on mortality. The first results indicated that higher coffee intake is related to a higher risk of death. However, coffee drinkers frequently smoke, and the researchers did not include smoking in their initial model. After they included smoking in the model, the regression results indicated that coffee intake lowers the risk of mortality while smoking increases it. This model isolates the role of each variable while holding the other variable constant. You can assess the effect of coffee intake while controlling for smoking. Conveniently, you’re also controlling for coffee intake when looking at the effect of smoking.

Note that the study also illustrates how excluding a relevant variable can produce misleading results. Omitting an important variable causes it to be uncontrolled, and it can bias the results for the variables that you do include in the model. This warning is particularly applicable for observational studies where the effects of omitted variables might be unbalanced. On the other hand, the randomization process in a true experiment tends to distribute the effects of these variables equally, which lessens omitted variable bias.
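A small simulation can show how this sign flip happens. The sketch below does not use the study's data. It assumes a made-up world where coffee lowers risk, smoking raises it, and coffee drinkers tend to smoke, then fits the model with and without smoking. All numbers and the `ols` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Simulated data only; none of these numbers come from the coffee study.
smoking = rng.normal(0, 1, size=n)
coffee = 0.8 * smoking + rng.normal(0, 0.6, size=n)  # coffee drinkers tend to smoke
risk = -0.5 * coffee + 2.0 * smoking + rng.normal(0, 1, size=n)

def ols(y, *predictors):
    """Least squares fit with a constant; returns [constant, coefficients...]."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

naive = ols(risk, coffee)          # smoking omitted, so it is uncontrolled
full = ols(risk, coffee, smoking)  # smoking included, so it is held constant

print("coffee coefficient, smoking omitted: ", round(naive[1], 2))
print("coffee coefficient, smoking included:", round(full[1], 2))
```

In this setup the coffee coefficient is positive when smoking is left out and negative once smoking is in the model, because the omitted variable's effect gets attributed to the variable it is correlated with.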

**Related post**: Confounding Variables and Omitted Variable Bias

## How to Interpret Regression Output

To answer questions using regression analysis, you first need to fit and verify that you have a good model. Then, you look through the regression coefficients and p-values. When you have a low p-value (typically < 0.05), the independent variable is statistically significant. The coefficients represent the average change in the dependent variable given a one-unit change in the independent variable (IV) while controlling the other IVs.

For instance, if your dependent variable is income and your IVs include IQ and education (among other relevant variables), you might see output like this:

The low p-values indicate that both education and IQ are statistically significant. The coefficient for IQ indicates that each additional IQ point increases your income by an average of approximately $4.80 while controlling everything else in the model. Furthermore, an additional unit of education increases average earnings by $24.22 while holding the other variables constant.
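As a quick worked example of reading those coefficients, the sketch below turns them into an expected change in average income. It uses only the two coefficients quoted above. The function name is hypothetical, and the constant and any other terms in the model are deliberately left out because they cancel when you compare two predictions.

```python
# Coefficients quoted in the text (dollars per one-unit change).
COEF_IQ = 4.80
COEF_EDUCATION = 24.22

def expected_income_change(delta_iq=0.0, delta_education=0.0):
    """Average change in income for the given changes in IQ and education,
    holding every other variable in the model constant."""
    return COEF_IQ * delta_iq + COEF_EDUCATION * delta_education

# Ten more IQ points, education unchanged.
print(round(expected_income_change(delta_iq=10), 2))

# One more IQ point and one more unit of education.
print(round(expected_income_change(delta_iq=1, delta_education=1), 2))
```

Each coefficient is a per-unit rate, so a larger change in an independent variable scales the expected change in the dependent variable proportionally.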

Regression analysis is a form of inferential statistics. The p-values help determine whether the relationships that you observe in your sample also exist in the larger population. I’ve written an entire blog post about how to interpret regression coefficients and their p-values, which I highly recommend.

## Obtaining Trustworthy Regression Results

With the vast power of using regression comes great responsibility. Sorry, but that’s the way it must be. To obtain regression results that you can trust, you need to do the following:

- Specify the correct model. As we saw, if you fail to include all the important variables in your model, the results can be biased.
- Check your residual plots. Be sure that your model fits the data adequately.
- Assess multicollinearity, which is correlation between the independent variables. As we saw, some multicollinearity is OK. However, excessive multicollinearity can be a problem.
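One common way to check the multicollinearity item is the variance inflation factor (VIF). The post does not name a specific diagnostic, so treat this as one reasonable option rather than the required method. The sketch below computes VIFs by hand on simulated predictors; the `vif` helper and the data are hypothetical.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X.

    X holds the predictors only (no constant column). Each predictor is
    regressed on all the others, and VIF = 1 / (1 - R^2) of that regression.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
        ss_res = np.sum((target - fitted) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1 - ss_res / ss_tot
        factors.append(1 / (1 - r_squared))
    return np.array(factors)

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = rng.normal(size=500)               # unrelated to a
c = a + rng.normal(0, 0.1, size=500)   # nearly a copy of a
vifs = vif(np.column_stack([a, b, c]))
print(np.round(vifs, 1))
```

Here `b` should have a VIF close to 1, while `a` and `c` should have very large VIFs because each one is almost fully explained by the other.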

Using regression analysis gives you the ability to separate the effects of complicated research questions. You can disentangle the spaghetti noodles by modeling and controlling all relevant variables, and then assess the role that each one plays.

There are many different regression analysis procedures. Read my post to determine which type of regression is correct for your data.

If you’re learning regression and like the approach I use in my blog, check out my eBook!

Lyana says

Hi Jim,

Looks like treating both ordinal variables as continuous seems to solve my problems with non-mutually exclusive levels of the variables if I enter the variables as categorical. My main concern is to look at the variable as a whole not by its levels so it might be what I need; the measurement ranges were based on an established rating system and do not have any weight for my analysis. Tho, I’ll have to look more into it as well as the residual plot etc before deciding. Thank you for highlighting this option!

Is it correct if I assign the numerical value to the levels like this? 1 to 5, from lowest to highest.

Spacing

1: less than 60mm

2: 60-200mm

3: 200-600mm

4: 0.6-2m

5: more than 2m

length

1: less than 1m

2: 1-3m

3: 3-10m

4: 10-20m

5: more than 20m

As for the data repetition, what I mean was say data for Site A is:

Set 1 (quantity: 25) SP3 PER5

Set 2 (quantity: 30) SP4 PER6

set 3 (quantity: 56) SP2 PER3

so in the data input I’d entered set 1 data 25 times, set 2 data 30 times and set 3 data 56 times. From what I have gathered from fellow student and my lecturer, it is correct but I’d like a confirmation from a statistician. Thanks again!

Lyana says

Hi Jim,

I’m sorry, again the levels disappeared. maybe bc I used (>) and (<) so it's messing up the coding of the comment.

spacing levels:

SP1: less than 60mm

SP2: 60-200mm

SP3: 200-600mm

SP4: 0.6-2m

SP5: more than 2m

length level:

PER1: more than 20m

PER2: 10-20m

PER3: 3-10m

PER4: 1-3m

PER5: less than 1m

Spacing and Length were recoded as ranges since they were estimate and not measured individually as it'd take too much time to measure each one (1 set of cracks may have at least 10 cracks, some can reach 50 or more and the measurement are not exactly the same between cracks belonging to the same set).

I've input the dummy like in my previous reply when running the model, tho the resulting equation I've provided does not include the length. Can ordinal variable be converted/treated into continuous variables?

Also, since each set has their own quantities, so I repeated the data in the input according to their quantity. Is that the right way of doing it?

Thanks!

Jim Frost says

Hi Lyana,

Technically those are ordinal variables. I write about this in more detail in my book about regression analysis, but you can enter these variables as either continuous variables (if you assign a numeric value to the groups) or as categorical variables. If you go the categorical route, you’ll need to use the indicator variable scheme and leave out a reference level approach as we discussed. The approach you should use depends on a combination of your analysis goals, the nature of your data, and the ability to adequately fit the model (i.e., properties of the residual plots).

I don’t exactly know what you mean by “repeated the data in the input.” However, you have levels for each categorical variable. Let’s use the lowest level for each variable as the reference level. Here’s how you’d use indicator variables to include both categorical variables in your model (some statistical software will do that for you behind the scenes).

Spacing variable:

Leave out SP1. It’s the reference.

Include an indicator variable for:

SP2

SP3

SP4

SP5

Length Variable:

Leave PER5 out as reference.

Include indicator variables for:

PER1

PER2

PER3

PER4

And just code each indicator variable appropriately based on the presence or absence of the corresponding characteristic. All zeros in a set of indicator variables for a categorical variable represent the reference level for that categorical variable.

As you can see, you’ll need to include many indicator variables (8), which is a drawback of entering them as categorical variables. You can quickly get into overfitting your model.
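A minimal sketch of that indicator scheme, assuming the level names from this thread (SP1 to SP5 and PER1 to PER5) and the reference levels chosen above. The helper names `indicators` and `encode` are hypothetical, and many statistical packages do this coding for you.

```python
SPACING_LEVELS = ["SP1", "SP2", "SP3", "SP4", "SP5"]
LENGTH_LEVELS = ["PER1", "PER2", "PER3", "PER4", "PER5"]

def indicators(value, levels, reference):
    """Return 0/1 indicator variables for every level except the reference."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {level: int(value == level) for level in levels if level != reference}

def encode(spacing, length):
    """Code one observation: SP1 and PER5 are the reference levels."""
    row = indicators(spacing, SPACING_LEVELS, reference="SP1")
    row.update(indicators(length, LENGTH_LEVELS, reference="PER5"))
    return row

# One observation with spacing SP3 and length PER5 (the length reference level).
print(encode("SP3", "PER5"))
```

Each observation gets eight indicator values. At most one indicator per categorical variable equals 1, and an observation at both reference levels is coded as all zeros.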

Lyana says

I’m sorry I had just noticed that the levels are missing

SP1: 2m

Lyana says

For my case, I’m studying the crack sets on a rock face and I have two independent categorical variables (spacing and length) that have 5 levels of measurement ranges each. The dependent variable is the blasted rock size, i.e., I want to know how the spacing and length of the existing cracks on a rock face would affect the size of blasted rocks.

E.g: For Spacing: SP1 = 2m

I’ve coded the levels to run the regression model into:

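| Level | SP1 | SP2 | SP3 | SP4 |
|---|---|---|---|---|
| SP1 | 1 | 0 | 0 | 0 |
| SP2 | 0 | 1 | 0 | 0 |
| SP3 | 0 | 0 | 1 | 0 |
| SP4 | 0 | 0 | 0 | 1 |
| SP5 | 0 | 0 | 0 | 0 |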

From the coding (leaving SP5 out as the reference level) above, after running the model, I have obtained the equation:

Blasted rock size (mm) = 1849.146 + 332.224SP1 + 137.624SP2 – 115.268SP3 – 103.604SP4

1 rock slope could consist of 2 or more crack sets, hence the situation where more than 1 level of spacing and length can be observed. As an example, rock face A consists of 3 crack sets with set #1 having SP1, set #2 having SP3 and set #3 having SP4. To predict blasted rock size for rock face A using the equation, I’ll have to insert “1” for SP1, SP3 and SP4. Which is actually the wrong way of doing it since they are not mutually exclusive? Or can I calculate each crack set separately using the same equation then average the blasted rock size for these 3 crack sets?

From the method in your explanation, does this mean that I’ll have to separate each level into 10 different variables and code them as 1=yes and 0=no? If so, for spacing, will the coding be

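| Level | SP1 | SP2 | SP3 | SP4 | SP5 |
|---|---|---|---|---|---|
| SP1 | 1 | 0 | 0 | 0 | 0 |
| SP2 | 0 | 1 | 0 | 0 | 0 |
| SP3 | 0 | 0 | 1 | 0 | 0 |
| SP4 | 0 | 0 | 0 | 1 | 0 |
| SP5 | 0 | 0 | 0 | 0 | 1 |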

in the input table which would be similar to the initial one except with SP5 included? But if I were to include all levels when running the model, SPSS would automatically excluded 1 level since I ran several rock faces (belonging to a single location) in a model so all levels of spacing and length are present in the data set.

The other way that I can think of is to create interaction for all possible combinations and dummy code them but wouldn’t that end up with a super long equation?

I’m sorry for imposing like this but I couldn’t grasp this problem on my own. Your help is very much appreciated.

Jim Frost says

Hi Lyana,

Ah, ok, it sounds like you have two separate categorical variables. In that case, for each observation, you can have one level for each variable. Additionally, for each categorical variable, you’ll leave out one level for its own reference level.

I do have a question. spacing and length sound like continuous measurements. Why are you including them as categorical variables? There might be a good reason why but it almost seems like you can include them as continuous predictors. Perhaps you don’t have the raw measurements but instead they’re in groups? In which case, they might actually be ordinal variables. You can include ordinal variables as categorical variables. But sometimes they’ll still work as continuous variables.

Lyana says

I see, sorry I couldn’t fully understand your previous reply before this, thanks for the clarification. However, I am dealing with a situation where 2 or more levels of a variable could be observed simultaneously, is it theoretically right to use dummy or is there other method around it?

thanks!

Jim Frost says

Hi Lyana,

That sounds like you’re dealing with more than one variable rather than one categorical variable. Within an individual categorical variable, the levels of the variable are mutually exclusive. In your case, you need to sort out which categorical variables you have and be sure that the levels are mutually exclusive. If you’re looking at the presence and absence of certain characteristics, you can use a series of indicator variables. If these characteristics are not the mutually exclusive levels of a single categorical variable, you don’t use the rule about leaving one out.

For example, in a medical setting, you might include characteristics of a patient using a series of indicator variables: gender (1 = female, 0 = male), high blood pressure (1 = Yes, 0 = No), on medication, etc. These are separate characteristics (not part of one larger categorical variable) and you can just include an indicator variable to indicate the presence or absence of that characteristic. Perhaps that is what you need? But be aware that what you describe with multiple levels possible does not work for a single categorical variable. But the method I describe might be what you need if you’re talking about separate characteristics.

Sakeena Muzzafar says

Thank you , sir

Lyana says

Thanks for the answer Jim,

does that mean predicted value for when both L4 and L1 are observed and when only L1 is observed without L4 is the same? (Y = 133)

thanks again!

Jim Frost says

The groups must be mutually exclusive. Hence, an observation could not be in both L1 and L4.

Lyana says

Hi Jim,

I have a question regarding categorical variables dummy coding, I can’t seem to find any post about this topic. Hope you don’t mind me asking here.

I ran a regression model with categorical variable containing 4 level: using the 4th level as the reference group. Meaning in the equation there will only be level 1 to 3 since level 4 is the reference. Say, the equation is Y = 120 + 13L1 – 6L2 + 15L3, to predict the Y with L4 then I’ll have Y = 120, right?

My question is what if I want to predict Y when there is L1 but no L4? If I calculate Y = 120 + 13L1 that would mean I am including L4 in the equation, or am I wrong about this?

Thank you in advance.

Jim Frost says

Hi Lyana,

I cover how this works in my book about regression analysis. If you’re using regression for a project, you might consider it.

It sounds like your approach is correct. You always leave one level out for the reference group. And, yes, given your equation, the predicted value for level 4 is 120.

For observations where the subject/item belongs to group 1, your equation stays the same, but you enter a 1 for L1 and 0s for L2 and L3. Hence, the predicted value is 133. In other words, you don’t change the equation given the level, you change the X values in the equation. When an observation belongs to group 4, you’ll enter 0s for L1, L2, and L3, which is why the predicted Y is 120. For a given categorical variable, you’ll only enter a single 1 for observations that belong to a non-reference group, and all 0s for observations belonging to the reference group. But the equation stays the same in all cases. I hope that makes sense!
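The arithmetic above can be sketched in a few lines using the equation from this thread, Y = 120 + 13L1 – 6L2 + 15L3, with level 4 as the reference group. The `predict` function is a hypothetical helper for illustration.

```python
# Equation from the thread: Y = 120 + 13*L1 - 6*L2 + 15*L3 (level 4 is the reference).
INTERCEPT = 120
COEFS = {"L1": 13, "L2": -6, "L3": 15}

def predict(level):
    """Predicted Y for an observation that belongs to exactly one level (1 to 4)."""
    if level not in (1, 2, 3, 4):
        raise ValueError("level must be 1, 2, 3, or 4")
    # The equation never changes; only the 0/1 values of the indicators do.
    x = {name: int(name == f"L{level}") for name in COEFS}
    return INTERCEPT + sum(COEFS[name] * x[name] for name in COEFS)

for level in (1, 2, 3, 4):
    print(level, predict(level))
```

This gives 133 for level 1 and 120 for level 4, matching the values discussed above. Because the levels are mutually exclusive, there is no input where two indicators equal 1 at once.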

Anthony says

Hi Jim,

May I just ask if there is a difference between a true and simple linear regression model? I can only think that their difference is the presence of a random error. Thanks a lot!

Jim Frost says

Hi Anthony,

I’ve never heard the dichotomy stated as being true vs. simple linear regression. I take true models to refer to the model that is correctly specified for the population. A simple regression model is just one that has a single predictor whereas multiple regression has more than one predictor. The true model has as many terms as are required, which includes predictors and other terms that fit curvature and interaction as needed.

Isaac Aidoo says

Hi Jim,

I find your explanation to questions very good and so important. Thanks for that.

Please I need your help in my thesis work. My question is if for example I want to measure say level of resilience capacity in a company’s safety management system. What tool would you advise. Regression or which other one ?

Thanks

Kwame

Jim Frost says

Hi Kwame,

The type of analysis you use depends on the data you collect as well as a variety of other factors. The answer is entirely specific to your research question, field of study, data, etc. After you make those determinations, you can begin to figure out which type of analysis to use. I recommend researching your study area to answer all of those questions, including which type of analysis to use. If you need help after you start developing the answers to the preliminary questions, I’d be able to provide more input.

Also, I really recommend reading my post about designing a study that includes statistical analyses. That’ll help you understand what type of information you need to collect and questions you need to answer.

Cindy Julia says

Thank you so much for your answer, Jim!

Cindy Julia says

hello Jim, I have a question. I have one independent variable, and two dependent variables, I will explain the case before asking you a question. So, I obtain the data for independent variable using a questionnaire, and one of my dependent variable is also using a questionnaire. But, another dependent variable, which is my second variable, the data is from official website which is secondary data, different from the another variables. Then, I have a question, Is it okay if I use regression analysis to analyze these three variables? Or I have to use another statistical analysis that suit the best to analyze these variables? Thanks in advance.

Jim Frost says

Hi Cindy,

Most forms of regression analysis allow you to use one dependent variable and multiple independent variables. Because you have two dependent variables, you’ll need to fit two regression models, one for each dependent variable.

In regression, you need to be able to tie together all corresponding values of an observation for the dependent variable and the independent variables. We’ll use an example with people. To fit a regression model, for each person, you’ll need to know their values for the dependent variable and all the independent variables in the model. In your case, it sounds like you’re mixing data from an official website and a survey. If those data sources contain the same people and you can link their values as described, that can work. However, if those data sources have different people, or you can’t link their scores, you won’t be able to perform regression analysis.

Kristian Pocrnjić says

Hi Jim, if you’ve got three predictors and one dependent variable, is it ever worth doing linear regression on each individual predictor beforehand or should you just dive into the multiple regression? Thanks a lot!

Jim Frost says

Hi Kristian,

You should probably just dive right into multiple regression. There’s a risk of being misled by starting out with regressions with individual predictors. It’s possible that omitted variable bias can increase or decrease the observed effect. By leaving out the other predictors, the model can’t control for them, which can cause that bias.

However, that said, it’s often a good idea to graph the relationship between pairs of variables using scatterplots to get an idea of the nature of each relationship. That’s a great place to start. Those plots not only reveal the direction of the relationship but also whether you need to model curvature.

I’d start with graphs and then try modeling with all the variables. You can always remove insignificant variables.

Stefania Bottega says

Hi Jim,

do you think it is correct to estimate a regression model based on historical data as Y=aX+b

and then use the model for the forecast as Y=aX?

Would this be biased?

if the variables involved are growth rates, would it be preferable to directly estimate the model without the intercept?

Thank you in advance

Stefania

Jim Frost says

Hi Stefania,

The answer to that question depends on a very close understanding of the subject area. However, there are very few cases where fitting a model without a constant is advisable. Bias would be very likely. Read my article about the y-intercept, where I discuss this issue specifically.
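To see why dropping the constant is risky, here is a small simulation. It assumes the true relationship has a nonzero constant, which is an assumption of this sketch and not something known about Stefania's data. All values are made up.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated "historical" data with a true constant of 10 and a true slope of 2.
x = rng.uniform(5, 15, size=300)
y = 10 + 2.0 * x + rng.normal(0, 1, size=300)

# Fit Y = aX + b (with a constant).
X_with = np.column_stack([np.ones_like(x), x])
b_with, *_ = np.linalg.lstsq(X_with, y, rcond=None)

# Fit Y = aX (no constant): the slope has to absorb the missing constant.
b_without, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)

# Estimate with a constant, then forecast with the slope only.
x_new = 10.0
forecast_full = b_with[0] + b_with[1] * x_new
forecast_dropped = b_with[1] * x_new

print("slope with constant:   ", round(b_with[1], 2))
print("slope without constant:", round(b_without[0], 2))
print("forecast gap from dropping the constant:", round(forecast_full - forecast_dropped, 2))
```

In this setup, fitting without a constant inflates the slope, and forecasting with the slope alone misses every prediction by the size of the estimated constant.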

William Hruska says

Nice article. Thank you for sharing.

Azad Ibrahim says

If your outcome variable is a pass or fail, then it is binomial logistic. My undergrad thesis was on this topic. Maybe I can offer some help as this topic is of interest to me. Azad

Hashi says

Sir , what is cox regression analysis ?

Samantha says

Hi Jim,

A friend recommended your help with a stats question for my dissertation. I am currently looking at data regarding pass rate and student characteristics. I have collected multiple data points. One example is student pass rate (pass or fail) and observation hours (continuous variable, 0-1000). Would this be a binomial logistic regression? Can that be performed in Excel?

Additionally I am looking at pass rate in relation to faculty characteristics. Another example is pass rate (percentage of 100% maybe continuous data 0-100) and categorical data (Level of degree – bachelor, masters, doctorate)? Additionally, pass rate (percentage of 100) and ratio of faculty to student within classroom (continuous Data) which test would be appropriate for this type of data comparison? Linear regression?

Thanks for your guidance!

Keerthi Kanth says

Hi Jim. Concepts were well explained. Thank you so much for making this content available.

I have the data of Mortgage loan customers who are currently in default. There are various parameters why default would have happened. But predominantly there are two factors where we would have gone wrong while sanctioning the loan one is underwriting the loan( Credit Risk) and/or Property Valuation (Technical Risk). I have data of sub parameters coming under credit and technical risk at the point of sanction.

Now I want to arrive at an output where predominantly where did I go wrong. Either Technical/Credit risk or both. Which model of regression analysis can help in solving this.

hashiiiii says

dear sir,

I’m currently a final-year undergraduate in a BSc Radiography degree, so I chose risk estimation of cardiovascular diseases using several risk factors from regression analysis as my undergraduate research.

i want to predict a percentage value for my cardiovascular risk estimation as a dependent variable using regression analysis.

how can i do that sir,i’m very pleased to have your answer sir ?

Thank you very much.

Jim Frost says

Hi, it sounds like you might need to use binary logistic regression. If your dependent variable indicates the presence or absence (i.e., binary outcome measure) of a cardiovascular condition, binary logistic regression will predict the probability of having that condition given the values of your independent variables.

Niall says

Hi Jim

Thank you for all the information on your page , I am currently beginning to get into statistics and wanted to ask your advice about something

I am a business analyst with MI skills, building dashboards etc. and using sales data and KPIs

I am wondering for regression would a good independent variable be the significance of a salespersons sales performance over the teams total sales performance or am I on the wrong track with that ?

Kanmani K says

Dear Jim… I am a first year ‘MBA’ student having least exposure to the research kind of things. Please have patience and explain me whether I can use regression to determine the impact of a variable on a ‘construct’?

Jiren says

Hi Jim,

which criteria does an independent variable need to meet in order to use it in a regression analysis?

How do you deal with data that does not meet these requirements?

Jim Frost says

Hi Jiren,

I recommend you read my post about specifying the correct regression model. That deals directly with which variables to include in the model. If you have further questions on the specifics, please post them in the comments section there.

Yvonne says

Hi Jim,

How should we interpret the factor A that becomes not significant when fitting with factor B in a model? Can I conclude that factor B incorporates factor A and just ignore the effect of factor A?

Royal says

Hello Mr.Jim and friends,

I have one dependent variable Y and six independent variables X1….X6. I have to find the effect of all independent variables on Y, specifically X6, to check whether it is effective or not

1) Can I use OLS regression

2) which other test i need to do before or after regression analysis

Jim Frost says

Hi,

If your dependent variable is continuous, then OLS is a good place to start. You’ll need to check the OLS assumptions for your model.

adesipo yinka says

good,very explicit processes.

Damian Howard says

Jim,

I hope this comment reaches you in good health as we are living in some pretty tough times right now. Also, thank you for building this website as it is an excellent resource for novice statisticians such as myself. My question has to do with the first paragraph of this post. In it you state,

“Use regression analysis to describe the relationships between a set of independent variables and the dependent variable. Regression analysis produces a regression equation where the coefficients represent the relationship between each independent variable and the dependent variable. You can also use the equation to make predictions.”

Is it possible to use regression analysis to produce a regression equation when you have two independent variables and two dependent variables? Also, while I hopefully have your attention, would I need to do regression analysis twice (one for each dependent variable versus the independent variables)?

Jim Frost says

Hi Damian,

Typically, you would fit separate regression models for each dependent variable. There are a few exceptions. For example, if you use multivariate ANOVA (MANOVA), you can include multiple dependent variables. If those DVs are correlated, using MANOVA provides some benefits. You can include covariates in the MANOVA model. For more information, read my post about MANOVA.

Hoda says

In my study, I intervened with an instructional practice. My intervention has 4 independent variables (A, B, C, and D). In literature each subskill can be graded alone and we can get one whole score.

In literature, the effect of the intervention is holistic (A, B, C, together predict the performance on D).

So, I conducted a multiple regression (enter method) before and after the intervention where individual scores of A, B, C were added as predictors on D.

I added Group (Experimental Vs Control ) to delete any difference at baseline between experimental and control. No significant effect was noticed except for individual score of A and C on D. Model had a weak fit.

However, after the intervention, I repeated the same regression. the group (experimental Vs Control) was the best predictor. No significant effect of A was noticed but significant effect of B and C was noticed

—

How do you think I can interpret the change in the significance value of A? It is relevant in literature but after the intervention it was not significant. Does the significance have to do with the increase of the significance of the Group?

Daniel says

I’d like to ask a question that builds on your example of income regressed on IQ and education. In the dataset I am sure there would be a range of incomes. Let’s say you want to find ways to bring up the low income earners based on the data from this regression.

Can I use the coefficients from the regression to guide ideas on how to improve the lower income earners as an estimate of how much improvement would be expected? For example, if I take the lowest earner and find that he is also below average in IQ and education, could I suggest that he gets another degree and try to improve IQ test results to potentially gain $X (n*IQ + m*Edu) in income?

This example may not be strictly usable because I imagine there are many other factors for income. Assuming that we are confident that we’ve captured most of the variables that affect income, can the numbers be used in this way?

If this is not an appropriate application, how would one go about this? Thanks.

Nita says

Hello

I am completing a reflection paper for Math 221 I work in a call center can I use a regression analysis for this type of work?

Oloruntoba says

Hello Jim,

I am a total novice when it comes to Statistics. My challenge is, I am working on the relationship between population growth of a town and class size of secondary schools in that same town (about 10 schools) over a period of years (2008-2018). Having gathered my data, I don’t know what to use in analyzing my data to show this relationship.

Thanks

Marlene says

Hi Jim!

Im just a student whos trying to finish her science investigation 🙂

but i have a question.

What is linear regression and how do we know if this method is appropriate for our data?

Jim Frost says

Hi Marlene,

I think this blog post describes pretty well when to use regression analysis generally. Linear regression analysis is a specific form of regression. Linear refers to the form of the model, not whether it can fit curvature. I talk about this in my post about the differences between linear and nonlinear regression. I always suggest that you start with linear regression because it’s an easier-to-use analysis. However, sometimes linear regression can’t fit your data. It can fit curvature in your data but it can’t fit all types of curves. Nonlinear regression is more flexible in the types of curves it can fit.
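Here is a minimal sketch of the idea that a linear model can still fit curvature. A model with an x² term is linear in its coefficients, so ordinary least squares can fit it. The data are simulated with made-up values, and the `fit_sse` helper is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated curved data (made-up constant, curvature, and noise level).
x = np.linspace(0, 10, 100)
y = 1 + 0.5 * x**2 + rng.normal(0, 2, size=x.size)

def fit_sse(X, y):
    """Least squares fit; returns the coefficients and the residual sum of squares."""
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coefs
    return coefs, float(resid @ resid)

# Straight-line model: constant and x.
X_line = np.column_stack([np.ones_like(x), x])
# Still a linear model: constant, x, and x^2 (linear in the coefficients).
X_quad = np.column_stack([np.ones_like(x), x, x**2])

_, sse_line = fit_sse(X_line, y)
coefs_quad, sse_quad = fit_sse(X_quad, y)

print("residual sum of squares, straight line:", round(sse_line, 1))
print("residual sum of squares, with x^2 term:", round(sse_quad, 1))
```

The model with the squared term leaves much smaller residuals on these data. In practice, the residual plots for the straight-line model would show a curved pattern, which is the signal that the model needs a term for curvature.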

As for determining whether linear regression is appropriate for your data, you need to see if it can provide an adequate fit to your data. To make that determination, please read my posts about residual plots because that’s how you can tell.
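To make the "linear refers to the form of the model" point concrete, here is a minimal sketch, assuming NumPy and made-up data (the values and variable names are hypothetical, not from any real study). A model with a squared term is still linear regression because it is linear in the coefficients. Comparing the residuals of a straight-line fit and a quadratic fit shows which model fits adequately: a leftover pattern in the residuals signals a poor fit.

```python
import numpy as np

# Hypothetical data with curvature: y = 2 + 0.5*x^2 plus small noise.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 2 + 0.5 * x**2 + rng.normal(scale=0.2, size=x.size)

def fit_ols(X, y):
    """Ordinary least squares; returns coefficients and residuals."""
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs, y - X @ coefs

# Straight-line model: y = b0 + b1*x
X_line = np.column_stack([np.ones_like(x), x])
_, resid_line = fit_ols(X_line, y)

# Quadratic model: y = b0 + b1*x + b2*x^2 (still linear in b0, b1, b2)
X_quad = np.column_stack([np.ones_like(x), x, x**2])
coefs_quad, resid_quad = fit_ols(X_quad, y)

# Residuals from the straight line keep the U-shape, so they track x^2.
# Residuals from the quadratic fit show no such pattern.
pattern_line = np.corrcoef(resid_line, x**2)[0, 1]
pattern_quad = np.corrcoef(resid_quad, x**2)[0, 1]
print("line residuals vs x^2:", round(pattern_line, 3))
print("quadratic residuals vs x^2:", round(pattern_quad, 3))
```

In practice you would look at the residual plots themselves rather than a single correlation, but the idea is the same: the straight-line residuals are not random, while the quadratic residuals are.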

Best of luck with your research!! 🙂

Jinky Esin says

Hello Jim, thank you for this wonderful page. It has enlightened me when to use regression analysis. However, I am a complete beginner to using SPSS (and statistics at that) so I am hoping you can help me with my specific problem.

I intend to use a linear regression analysis. My dependent variable is continuous and I would think it’s ordinal (data was obtained through a 5-point Likert scale). I have two independent variables (also obtained through 5-point Likert scales). However, I also intend to use 7 control variables and this is where my problem lies. My control variables are all (I think) nominal (or is that called categorical in statistics?). They are as follows:

Age – 4 categories

Gender – 2 categories

Marital Status – 4 categories

Education level – 11 categories

Household income – 4 categories

Nationality – 4 categories

Country of origin – 9 categories

Do I input these control variables as it is? Or do I have to do something beforehand? I have heard about creating dummy variables. However, if I try creating dummy variables for each control variable, won’t I end up with many variables?

Please give me some advise regarding this. I am really stuck in this process for a while now. I look forward to hearing from you, thanks.

Jim Frost says

Hi Jinky,

There are several issues to address in your questions. I’ll provide some information. However, my regression ebook goes into the details much further, so I highly recommend you get that.

In terms of the dependent variable, the answer is clear. If your Likert scale data are the actual values of 1, 2, 3, 4, and 5, they are ordinal data and are not considered continuous. You’ll need to use ordinal logistic regression. If the DV is an average of multiple Likert items for each individual, so an individual might have a 3.4, that is continuous data and you can try using linear least squares regression.

Categorical data and nominal data are the same. There are different naming conventions, but those are synonyms.

For categorical data, it’s true that you need to recode them as indicator variables. Most software should do that automatically behind the scenes. However, as you noticed, the recoding (even if your software does it for you) can involve creating many indicator variables (dummy variables), particularly when you have many categorical variables and/or many levels within a categorical variable. That can use up your degrees of freedom! My ebook covers this in more detail.
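As a quick worked count, here is a small sketch using the category counts listed in the question. It assumes the usual coding where one level per variable serves as the reference level, so a variable with k levels needs k - 1 indicator variables.

```python
# Number of levels for each control variable listed in the question.
levels = {
    "Age": 4,
    "Gender": 2,
    "Marital Status": 4,
    "Education level": 11,
    "Household income": 4,
    "Nationality": 4,
    "Country of origin": 9,
}

# With one reference level per variable, k levels need k - 1 indicators.
indicators = {name: k - 1 for name, k in levels.items()}
total_indicators = sum(indicators.values())

print(indicators)
print("Total indicator variables:", total_indicators)  # 31
```

Each of those 31 indicators costs one degree of freedom, on top of the intercept and the two main IVs, which is why the sample size matters so much here.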

For the Likert IVs, the same logic applies. If a variable is an average of multiple Likert items, you can probably include it as a continuous variable. However, if it’s the actual Likert values of 1, 2, 3, 4, and 5, then you’ll need to decide whether to include it as a continuous or categorical variable. There are pros and cons for both approaches. The best answer depends on both your data and your goals. My ebook describes this in more detail.

Yes, as a general rule, you want to include your control variables and IVs that you are specifically testing. Control variables are just more IVs, but they’re usually not your main focus of study. You include them so that you can account for them while testing your main variables of interest. Excluding relevant IVs that are significant can bias the estimates for the variables you’re interested in. However, if you include control variables and find they’re not significant, you can consider removing them from the model.

So, those are some pointers to start with!

San says

Hi Jim and everyone!

I’m starting some statistical analysis and this has been really useful. I have a question regarding variables and samples.

I need to see if there is any relationship between days of the week and number of robberies. I already have the data but I wonder, if my variables (# of robberies in each day of the week (independent) and # of total robberies (dependent)) come from the same data sample, can it be a problem?

Thanks!

Lisa says

Thank you Jim this was really helpful

I have a question

How do you interpret an independent variable lets say AGE with categories that are insignificant

for example i run the regression analysis for the variable age with categories

age as a whole was found to be significant but there appear insignificance within categories , it was as follows

Age =0.002

<30 years =0.201

30-44 years=0.161

45+ ( ref cat)

I had another scenario

occupation = 0.000

peasant farmers =0.061

petty businessmen=0.003

other occupation ( ref cat)

my research question was "what are the effects of socio-demographic characteristics on men's attendance to education classes"

I failed to interpret them , kindly help

Jim Frost says

Hi Lisa,

For categorical variables, the linear regression procedure uses two tests of significance. It uses an F-test to determine the overall significance of the categorical variable across all its levels jointly. And, it uses separate t-tests to determine whether each individual level is different from the reference level. If you change the reference level, it can change the significance of t-tests because that changes the levels that the procedure directly compares. However, changing the reference level won’t change the F-test for the variable as a whole.
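Here is a minimal sketch of that behavior, assuming NumPy and entirely hypothetical outcome values for three age groups (not the data from the question). It fits the same model twice with different reference levels. The coefficients, which are the comparisons the t-tests evaluate, change with the reference level, while the overall F statistic for the variable does not.

```python
import numpy as np

# Hypothetical outcome values for three age groups.
groups = {
    "<30":   np.array([4.1, 5.0, 4.6, 5.3, 4.8]),
    "30-44": np.array([6.9, 7.4, 6.5, 7.1, 7.7]),
    "45+":   np.array([5.8, 6.1, 5.5, 6.4, 6.0]),
}
labels = np.concatenate([[g] * len(v) for g, v in groups.items()])
y = np.concatenate(list(groups.values()))

def fit_with_reference(ref):
    """Fit y on indicator variables, leaving `ref` out as the reference level."""
    others = [g for g in groups if g != ref]
    columns = [np.ones(len(y))] + [(labels == g).astype(float) for g in others]
    X = np.column_stack(columns)
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = float(np.sum((y - X @ coefs) ** 2))
    sst = float(np.sum((y - y.mean()) ** 2))
    df_model, df_error = len(others), len(y) - X.shape[1]
    f_stat = ((sst - sse) / df_model) / (sse / df_error)
    # Each coefficient is that group's mean minus the reference group's mean.
    return dict(zip(others, coefs[1:])), f_stat

coefs_45, f_45 = fit_with_reference("45+")
coefs_30, f_30 = fit_with_reference("<30")

print("Reference 45+:", coefs_45, "F =", round(f_45, 3))
print("Reference <30:", coefs_30, "F =", round(f_30, 3))
```

In this made-up example the two younger groups sit on opposite sides of the 45+ mean, so the largest difference (between <30 and 30-44) only shows up as a coefficient when one of those two groups is the reference.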

In your case, I’m guessing that the mean for <30 is on one side (high or low) compared to the reference category of 45+ while the mean of 30-44 is on the other side of 45+. These two categories are not far enough from 45+ to be significant. However, given the very low p-value for age, I'd guess that if you change the reference level from 45+ to one of the other two groups, you'll see significant p-values for at least one of the t-tests. The very low p-value for Age indicates that the means for the different levels are not all equal. However, given the reference level, you can't tell which means are different. Using a different reference level might provide more meaningful information.

For occupation, the low p-value for the F-test indicates that not all the means for the different types of occupations are equal. The t-test results indicate that the difference in means between petty businessmen and other (reference level) is statistically significant. The difference between peasant farmers and the reference category is not quite significant.

You don't include the coefficients, but those would indicate how those means differ.

Because you're using regression analysis, you should consider getting my regression ebook. I cover this topic, and others, in more detail in the book.

Best of luck with your analysis!

Awudu says

Hi Jim, I have followed your discussion and I want to know if I can apply this analysis in case study

Nigatu Mekonnen says

Hi Jim

really appreciate your excellency in regression analysis.

Would you please help with the steps to draw a single fitted line for several, say five, IVs against a single DV?

with regard

Jim Frost says

Hi Nigatu,

It sounds like you’re dealing with multiple regression because you have more than one IV. Each IV requires an axis (or dimension) on a graph. So, for a two-dimensional graph, you can use the X-axis (horizontal) for the IV and the Y-axis for the DV. If you have two IVs, you could theoretically show them as a hologram in three dimensions: two dimensions for the IVs and one for the DV. However, when you get to three or more IVs, there’s just no way to graph them! You’d need four or more dimensions. So, what can you do?

You can view residual plots to see how the model with all 5 IVs fits the data. And, you can predict specific values by plugging numbers into the equation. But you can’t graph all 5 IVs against the DV at the same time.

You could graph them individually. Each IV by itself against the DV. However, that approach doesn’t control for the other variables in the model and can produce biased results.

The best thing you can do that shows the relationship between an individual IV and a DV while controlling for all the variables in a model is to use main effects plots and interaction plots. You can see interaction plots here. Unfortunately I don’t have a blog post about main effects plots, but I do write about them in my ebook, which I highly recommend you get to understand regression! Learn more about my ebook!

I hope this helps!

Irina says

Many thanks. I appreciate it.

Irina says

Hello Jim,

I stumbled across your website in hopes of finding an answer to a couple of questions regarding the methodology of my political science paper. If you could help, I would be very grateful.

My research question is “Why do North-South regional trade agreements tend to generate economic convergence while South-South agreements sooner cause economic divergence?”. North = OECD developed countries and South = non-OECD developing countries.

This is my lineup of variables and hypotheses:

DV: Economic convergence between country members in a regional trade agreement

IV1: Complementarity (differentness) of relative factor abundance

IV2: Market size of region

IV3: Economic policy coordination (Harmonization of Foreign Direct Investment (FDI) policy)

H1: The higher the factor endowment difference between countries, the greater the convergence

H2: The larger the market size, the greater the convergence

H3: The greater the harmonization of FDI policies, the greater the convergence

I am not sure what the best methodological approach is. I will have to take North-South and South-South groups of countries and assign values for the groupings. I want to show the relationship between the IVs and DV, so I thought to use a regression. But there are at least two issues:

1. I feel the variables are not appropriate for a time series, which is usually used to show relationships. This is because e.g. the market size of a region will not be changing with time. Can I not do a time series and still have meaningful results?

2. The IVs are not completely independent of one another. How can I work with that?

Also, what kind of regression would be most appropriate in your view?

Many sincere thanks in advance.

Irina

Jim Frost says

Hi Irina,

I’m not an expert in that specific field, so I can’t give you concrete advice, but here are some things to consider.

The question about whether you need to include time related information in the model depends on the nature of your data and whether you expect temporal effects to exist. If your data are essentially collected at the same time and refer to the same time period, you probably don’t need to account for time effects. If theory suggests that the outcome does not change over time, you probably don’t need to include variables for time effects.

However, if your data are collected at, or otherwise describe, different points in time, and you suspect that the relationships between the IVs and DV change over time, or that there is an overall shift over time, then yes, you’d need to account for the time effects in your model. In that case, failing to account for the effects of time can bias your other coefficients. Basically, there’s the potential for omitted variable bias.

I don’t know the subject area well enough to be able to answer those questions, but that’s what I’d think about.

You mention that the IVs are potentially correlated (multicollinearity). That might or might not be a problem. It depends on the degree of the correlation. Some correlation is OK and might not be a problem. I’d perform the analysis and check the VIFs, which measure multicollinearity. Read my post about multicollinearity, which discusses how to detect it, determine whether it’s a problem and some corrective measures.
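For reference, a VIF for one IV is 1 / (1 - R²), where R² comes from regressing that IV on all the other IVs. The sketch below, assuming NumPy and simulated data (the variables `x1`, `x2`, `x3` are hypothetical), computes it directly from that definition. Most statistical packages report VIFs for you, so this is only meant to show what the number measures.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (X has no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coefs, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coefs
        r_squared = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1 / (1 - r_squared))
    return factors

# Simulated IVs: x1 and x2 are strongly correlated, x3 is unrelated to both.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=200)
x3 = rng.normal(size=200)

vifs = vif(np.column_stack([x1, x2, x3]))
print([round(v, 2) for v in vifs])
```

A VIF near 1 means the IV is essentially uncorrelated with the others. Commonly cited rules of thumb treat values above roughly 5 or 10 as worth investigating, but the right cutoff depends on your goals.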

I’d start with linear regression. Move away from that only if you have specific reason to do so.

Best of luck with your analysis!

Lizzy Casey says

Hi Jim

I was wondering if you could help. I’m currently doing a lab report on numerical cognition in human and non-human primates, where we are looking at whether the size, quantity, and visibility of food affect choice. We have tested humans so far and are going to test chimps in the future. My IV is condition (visible and opaque containers) and my DV is number of correct responses. So far I have compared the means of number of correct responses for both conditions using a one-way repeated measures ANOVA, but I don’t think this is correct. After having a look at your website, should I look to run a regression analysis instead? Sorry for the confusion, I’m really a rookie at this. Hope you can help!

Jim Frost says

Hi Lizzy,

Linear regression analysis and ANOVA are really the same type of analysis: linear models. They both use the same math “underneath the hood.” They each have their own historical traditions and terminology, but they’re really the same thing. In general, ANOVA tends to focus on categorical (nominal) independent variables while regression tends to focus on continuous IVs. However, you can add continuous variables into an ANOVA model and categorical variables into a regression model. If you fit the same model in ANOVA as regression, you’ll get the same results.

So, for your study, you can use either ANOVA or regression. However, because you have only one categorical IV, I’d normally suggest using one-way ANOVA. In fact, if you have only those two groups (visible vs opaque), you can use a 2-sample t-test.
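The equivalence is easy to check numerically. The sketch below assumes NumPy and uses hypothetical correct-response counts for two independent groups (it ignores any repeated-measures structure to keep things simple). It computes the one-way ANOVA F statistic, the F statistic from a regression on an indicator variable, and the pooled 2-sample t statistic, whose square equals that same F.

```python
import numpy as np

# Hypothetical correct-response counts for two independent groups.
visible = np.array([9, 8, 10, 7, 9, 8], dtype=float)
opaque = np.array([6, 7, 5, 8, 6, 7], dtype=float)
y = np.concatenate([visible, opaque])
n = len(y)

# 1) One-way ANOVA from sums of squares (2 groups, so 1 and n - 2 df).
grand_mean = y.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in (visible, opaque))
ss_within = sum(((g - g.mean()) ** 2).sum() for g in (visible, opaque))
f_anova = (ss_between / 1) / (ss_within / (n - 2))

# 2) Regression on an indicator variable (0 = visible, 1 = opaque).
indicator = np.concatenate([np.zeros(len(visible)), np.ones(len(opaque))])
X = np.column_stack([np.ones(n), indicator])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = np.sum((y - X @ coefs) ** 2)
f_regression = (np.sum((y - grand_mean) ** 2) - sse) / (sse / (n - 2))

# 3) Pooled 2-sample t-test; t squared equals the F statistic above.
pooled_var = ss_within / (n - 2)
std_err = np.sqrt(pooled_var * (1 / len(visible) + 1 / len(opaque)))
t_stat = (visible.mean() - opaque.mean()) / std_err

print(f_anova, f_regression, t_stat**2)
```

The regression coefficient on the indicator is simply the difference between the two group means, which is the same quantity the t-test evaluates.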

You mention repeated measures. You can use that if you do in fact have pre-test and post-test conditions. You could even use a paired t-test if you have only the two groups and you have pre- and post-tests.

There is one potential complication. You mention that the DV is a count of correct responses. Counts often do not follow the normal distribution but can follow other distributions such as the Poisson and negative binomial distributions. That said, counts can approximate the normal distribution when the mean is high enough (>~20). Also, if you have two groups and each group has more than 15 observations, the analyses are robust to departures from the normal distribution.

I hope this helps! Best of luck with your analysis!

Kris mckinnon says

Thank you so much for the reply. I appreciate it. I finally worked it out and got a good mark on the lab report, which was good :). I appreciate your time replying. You explain things very clearly, so thank you.

Kris mckinnon says

Hi there. I am currently doing a lab report and have not done stats in years, so I’m hoping someone can help as it’s due tomorrow.

When I run a bivariate correlation test, it shows the correlation is not significant between a personality trait and a particular cognitive task. Yet when I conduct a simple t test, it shows a significant p value and gives the 95% confidence interval. If I want to show that higher scores on one trait tend to mean higher scores on a particular cognitive task, should I be doing a regression? We were told to use basic correlations, so I did the bivariate option and just stated that the Pearson’s r is not significant (r=.. n= p =.84, for example). Yet if I do a regression analysis for each, it is significant. Why could this be?

Thankyou

Jim Frost says

Hi Kris,

There are not quite enough details to know for sure what is happening, but here are some ideas.

Be aware that a series of pairwise correlations is not equivalent to performing regression analysis with multiple predictors. Suppose you have your outcome variable and two predictors (Y, X1, X2). When you perform the pairwise correlations (X1 and Y, X2 and Y), each correlation does not account for the other X. However, when you include both X1 and X2 in a regression model, it estimates the relationship between each X and Y while accounting for the other X.

If the correlation and regression model results differ as you describe, you might well have a confounding variable, which biases your correlation results. I write about this in my post about omitted variable bias. You’d favor the regression results in this situation.
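Here is a minimal simulated sketch of how that can happen, assuming NumPy. The data are hypothetical and deliberately constructed so that X2 masks the effect of X1: the pairwise correlation between X1 and Y is close to zero even though X1 has a real effect (a true coefficient of 1) once X2 is in the model.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# X1 and X2 are positively correlated (covariance about 0.5, variances about 1).
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + np.sqrt(0.75) * rng.normal(size=n)

# X1 raises Y, X2 lowers Y. The two effects cancel in the raw X1-Y correlation.
y = 1.0 * x1 - 2.0 * x2 + rng.normal(size=n)

# Pairwise correlation ignores X2.
r_x1_y = np.corrcoef(x1, y)[0, 1]

# Multiple regression estimates the X1 effect while accounting for X2.
X = np.column_stack([np.ones(n), x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
b1 = coefs[1]

print("correlation of X1 with Y:", round(r_x1_y, 3))
print("regression coefficient for X1:", round(b1, 3))
```

Real data will rarely cancel this neatly, but the same mechanism can make a correlation look non-significant while the regression coefficient is significant.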

As for the difference between the 2-sample t-test and correlation, that’s not surprising because they are doing two entirely different things. The 2-sample t-test requires a continuous outcome variable and a categorical grouping variable and it tests the mean difference between the two groups. Correlations measure the linear association between two continuous variables. It’s not surprising the results can differ.

It sounds like you should probably use regression analysis and include your multiple continuous variables in the model along with your categorical grouping variables as independent variables to model your outcome variable.

Best of luck with your analysis!

Kathlene Gale M. Dulay says

This is Kathlene, and I am a Grade 12 student. I am currently doing my research. It’s a quantitative research study. I am having a little trouble with how I will approach my statistical treatment. My research is entitled “Emotional Quotient and Academic Performance Among Senior High School Students in Tarlac National High School: Basis to a Guidance Program.”

I was battling what to use to determine the relationship between the variables in my study.

I’m thinking to use chi-square method but a friend said it would be more accurate to use the regression analysis method. Math is not really my field of study so i badly need your opinion regarding this.

I’m hoping you could lend me a helping hand.

Thank you.

Jim Frost says

Hi Kathlene,

It sounds like you’re in a great program! I wish more 12th grade students were conducting studies and analyzing their results! 🙂

To determine how to model the relationships between your variables, it depends on the type of variables you have. It sounds like your outcome variable is academic performance. If that’s a continuous variable, like GPA, then I’d agree with your friend that regression analysis would be a good place to start!

Chi-square assesses the relationship between categorical variables.

Best of luck with your analysis!

Umar Awan says

Hi Mr Jim,

I am using orthogonal design having 7 factors with three levels. I have done regression analysis on Minitab software but i don’t know how to explain them or interpret them. I need your help in this regard.

Jim Frost says

Hi Umar,

I have a lot of content throughout my blog that will help you, including how to interpret the results. For a complete list for regression analysis, check out my regression tutorial.

Also, early next year I’ll be publishing a book about regression analysis as well that contains even more information.

If you have a more specific question after reading my other posts, you can ask them in the comments for the appropriate blog post.

Best of luck!

Ty Pulliam says

By the way my gun laws vs VCR, is part of a regression model. Any help you can give, I’d greatly appreciate.

Ty Pulliam says

Mr. Jim, I have a problem. I’m working on a research design on gun laws vs homicides with my dependent variable being violent crime rate. My sig is .308 The constant’s (VCR) standard error is 24.712 my n for violent crime rate is 430.44. I really need help ASAP. I don’t know how to interpret this well. Please help!!!

Jim Frost says

Hi Ty,

There’s not enough information for me to know how to interpret the results. How are you measuring gun laws? Also, VCR is your dependent variable, not the constant as you state. You don’t usually interpret the constant. All I can really say is that based on your p-value, it appears your independent variable is not statistically significant. You have insufficient evidence to conclude that there is a relationship between gun laws and homicides (or is it VCR?).

angela says

Hi Jim

Your blog has been very useful. I have a query.. if I am conducting a multiple regression is it okay to have an outcome variable which is normally distributed ( i winsorized an outlier to achieve this) and have two other predictor variables which are not normally distributed? ( the normality tests scores were significant).

I have read in many places that you have to transform your data to achieve normality for the entire data set to conduct a multiple regression, but doing so has not helped me at all. Please advise.

Jim Frost says

Hi Angela,

I’m dubious about the Winsorizing process in general. Winsorizing reduces the effect of outliers. However, this process is fairly indiscriminate in terms of identifying outliers. It simply defines outliers as being more extreme than an upper and lower percentile and changes those extreme values to equal the specified percentiles. Identifying outliers should be a point by point investigation. Simply changing unusual values is not a good process. It might improve the fit of your data but it is an artificial improvement that overstates the true precision of the study area. If that point is truly an outlier, it might be better to remove it altogether, but make sure you have a good explanation for why it’s an outlier.
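To show what the procedure does mechanically, here is a minimal sketch assuming NumPy. The data values and the 10th/90th percentile cutoffs are hypothetical choices for illustration only.

```python
import numpy as np

def winsorize(values, lower_pct=5, upper_pct=95):
    """Replace values beyond the given percentiles with the percentile values."""
    values = np.asarray(values, dtype=float)
    low, high = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, low, high)

# Hypothetical sample with one unusually large value.
data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 21, 95], dtype=float)
winsorized = winsorize(data, lower_pct=10, upper_pct=90)

print("original:  ", data)
print("winsorized:", winsorized)
print("SD before:", round(data.std(ddof=1), 2), "SD after:", round(winsorized.std(ddof=1), 2))
```

Note that the rule never asks why the value 95 is unusual. It pulls in both tails purely by rank, and the resulting drop in the standard deviation is where the artificially improved precision comes from.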

For regression analysis, the distributions of your predictors and response don’t necessarily need to be normally distributed. However, it’s helpful, and generally sought, to have residuals that are normally distributed. So, check your residual plots! For more information, read my post about OLS assumptions so you know what you need to check!

If your residuals are nonnormally distributed, sometimes transforming the response can help. There are many transformations you can try. It’s a bit of trial and error. I suggest you look into the Box-Cox and Johnson transformations. Both methods assess families of transformations and pick one that works best for your data. However, it sounds like your outcome is already normally distributed so you might not need to do that.

Also, see what other researchers in your field have done with similar data. There’s little general advice I can offer other than to check the residuals and make sure they look good. If there are patterns in the residuals, make sure you’re fitting curvature that might be present. You can graph the various predictors by the residuals to find where the problem lies. You can also try transforming the variables as I described earlier. While the variables don’t need to follow the normal distribution, if they’re very nonnormally distributed, it can cause problems in the residuals.

Best of luck with your analysis!

DMA says

Hi, I am confused about the assumption of independent observations in multiple linear regression. Here’s the case. I have heart rate data per five-minute for a day of 14 people. The dependent variable is the heart rate. During the day, the workers worked for 8 hours (8 am to 5 pm), so basically, I have 90 data points per worker for a day. So that makes it 1260 data points (90 times 14) to be included in the model. Is it valid to use multiple linear regression for this type of data?

Jim Frost says

Hi DMA,

It sounds like your model is more of a time series model. You can model those using regression analysis as well, but there are special concerns that you need to address. Your data are not independent. If someone has a high heart rate during one measurement, it’s very likely it’ll also be elevated 5 minutes later. The residuals are likely to be serially correlated, which violates one of the OLS assumptions.

You’ll likely need to include other variables in your model that capture this time dependent information, such as lagged variables. There are various considerations you’ll need to address that go beyond the scope of these comments. You’ll need to do some additional research into using regression analysis for time series data.
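As a rough illustration of serial correlation and what a lagged variable does, here is a sketch assuming NumPy and a simulated heart rate series for one person (a hypothetical process where each reading depends on the previous one, and a longer series than in the question so the pattern is clear). It compares the lag-1 autocorrelation of the residuals from a model that ignores time with the residuals from a model that includes the previous reading as a predictor.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000

# Simulated readings: each value stays close to the previous one.
hr = np.empty(n)
hr[0] = 70.0
for t in range(1, n):
    hr[t] = 70 + 0.8 * (hr[t - 1] - 70) + rng.normal(scale=2.0)

def lag1_autocorr(residuals):
    """Correlation between each residual and the one that follows it."""
    return np.corrcoef(residuals[:-1], residuals[1:])[0, 1]

# Model 1: intercept only, which ignores the time ordering entirely.
resid_naive = hr - hr.mean()

# Model 2: include the previous reading as a lagged predictor.
X = np.column_stack([np.ones(n - 1), hr[:-1]])
coefs, *_ = np.linalg.lstsq(X, hr[1:], rcond=None)
resid_lagged = hr[1:] - X @ coefs

autocorr_naive = lag1_autocorr(resid_naive)
autocorr_lagged = lag1_autocorr(resid_lagged)
print("lag-1 autocorrelation, no lag term:  ", round(autocorr_naive, 3))
print("lag-1 autocorrelation, with lag term:", round(autocorr_lagged, 3))
```

With 14 workers you would also need to account for differences between people, which this single-series sketch does not attempt.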

Best of luck with your analysis!

Asad says

Ok.Thank you so much.

Asad says

Thank you so much for your time!

Actually i don’t have authentic data about property values (dependent variable) nor the concerning institutions have this data. Can i ask the property value directly to the property owner thorough walk interview?

Jim Frost says

You really need to have valid data. Using a self-reported valuation might be better than no data. However, be aware there might be differences between what the property owner says and the true market value. Your model would describe self-valuation rather than market valuation. Typically, I’ve seen studies like yours use actual sales prices.

Asad says

Hello Sir!

Is it necessary for the dependent variable in a multiple regression model to have values? I have a number of independent variables (age of property, stories in building, location close to park) and a single dependent variable (property values).

Some independent variable decrease the value of dependent variable, while some independent variables increase the value of the dependent variable?

Can i put the value if my single dependent variable as ( a.<200000, b.<300000,c. d. 500000)?

Jim Frost says

Hi Asad,

Why can’t you enter the actual property values? Ideally, that’s what you would do. If you are missing a value for a particular observation, you typically need to exclude the entire observation from the analysis. However, there are some ways to estimate missing values. For example, SPSS has advanced methods for imputing missing values. But, you should use those only to estimate a few missing values. Your plan should be to obtain the property values. If you can’t do that, it will be difficult to perform regression analysis.

There are some cases where you can’t record the exact values and it’s usually related to the observation time. This is known as censored data. A common example is in reliability analysis where you record failure times for a product. You run the experiment for a certain amount of time and you obtain some failures and know their failure times. However, some products don’t fail and you only know that their failure time is greater than the test time. There are censored regression models you can use in situations like that. However, I don’t think that applies to your subject-area, at least as far as I can tell.

Best of luck with your analysis!

Rachel Wang says

thank you so much Jim! this is really helpful 🙂

Jim Frost says

You’re very welcome! Best of luck with your analysis!

Rachel Wang says

Hi Jim,

The variances (SD) for the 3 groups are 0.45, 0.7 and 1. Would you say that they vary by a lot?

Another follow-up question: does a narrower CI equal a better estimate?

thanks!

Jim Frost says

Yes, that’s definitely it!

I would suggest using Welch’s one-way ANOVA to analyze it and potentially use that analysis to calculate the CI. You’re essentially performing a one-way ANOVA. And, in ANOVA, there is the assumption of equal variances between groups, which your data do not satisfy. In regression, we’d refer to it as heteroscedasticity. In Welch’s ANOVA, you don’t need to satisfy that assumption. That makes it a simple solution for your case.
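For readers curious about what Welch’s ANOVA does differently, here is a sketch in plain Python that computes its F statistic and degrees of freedom from summary statistics. It assumes the reported values 0.45, 0.7, and 1 are standard deviations and that each group has 5 observations, as described elsewhere in this thread. The group means are hypothetical placeholders. Getting a p-value requires the F distribution from statistical software, which this sketch leaves out.

```python
def welch_anova(means, sds, ns):
    """Welch's one-way ANOVA F statistic and degrees of freedom from summary stats."""
    k = len(means)
    # Each group is weighted by n / variance, so noisier groups count less.
    weights = [n / sd**2 for n, sd in zip(ns, sds)]
    total_w = sum(weights)
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / total_w
    between = sum(w * (m - weighted_mean) ** 2 for w, m in zip(weights, means)) / (k - 1)
    lam = sum((1 - w / total_w) ** 2 / (n - 1) for w, n in zip(weights, ns))
    f_stat = between / (1 + 2 * (k - 2) / (k**2 - 1) * lam)
    df1 = k - 1
    df2 = (k**2 - 1) / (3 * lam)
    return f_stat, df1, df2

# SDs and group sizes follow the thread; the means are hypothetical.
f_stat, df1, df2 = welch_anova(means=[3.0, 4.2, 5.1], sds=[0.45, 0.7, 1.0], ns=[5, 5, 5])
print("F =", round(f_stat, 3), "df1 =", df1, "df2 =", round(df2, 2))
```

The key design difference from classic one-way ANOVA is that no pooled variance is used. Each group keeps its own variance, and the denominator degrees of freedom are adjusted downward to compensate.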

In terms of CIs, yes, narrower CIs indicate that the estimate is more precise than if you had a wider CI. Think of the CI as a margin of error around the estimate and it’s good to have a smaller margin of error. With a narrower CI, you can expect the actual mean to fall closer to the fitted value.

Rachel Wang says

Hi Jim,

Thank you so much for the quick response!

I checked the residual plots, it gives me a pretty trend line at y=0, and my R square = 0.87. However the CI it gives me by using all 15 points (regression inference) is a little wider (2.012 – 3.655) than if I just use that 5 points(2.245 – 3.355). In this case, would you still prefer using all 15 points?

Thank you!

Jim Frost says

That’s tricky. I hate to throw out data, but it does seem warranted. At least you have a good rationale for not using the data!

CIs of the mean for a point at the end of a data range in a regression model do tend to be wider than in the middle of the range. Still, I’m not sure why it would be wider. Are the variances of the groups roughly equal? If not, that might well be the reason.

Rachel Wang says

Hi Jim,

suppose I have total of 15 data points at x=0, x=40, and x=80 (5 data points at each x value), now I can use regression to estimate y when x=60. But what if I want to estimate the average when x=0? Should I just use that 5 data points when x=0, or use the intercept from the regression line? Which is the best estimate for a 95% CI for the average y value when x=0?

Thank you 🙂

Jim Frost says

Hi Rachel,

Assuming that model provides a good fit to the data (check the residual plots), I’d use all the data to come up with the CI for the fitted value that corresponds to X = 0. That approach uses more data to calculate the estimate. Your CI might even be more precise (narrower) using all the data.
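A small worked calculation shows why. It uses the design from the question (5 points each at x = 0, 40, and 80) and assumes, for the sake of comparison, that the residual standard deviation from the regression is about the same as the standard deviation of the 5 points at x = 0. Under that assumption, the CI half-width is a multiplier times that standard deviation, and the multipliers can be compared directly. The t critical values are the standard two-sided 95% table values.

```python
import math

# Design from the question: 5 observations at each of three x values.
x_values = [0] * 5 + [40] * 5 + [80] * 5
n = len(x_values)
x_bar = sum(x_values) / n
sxx = sum((x - x_bar) ** 2 for x in x_values)
x0 = 0

# Standard error of the fitted mean at x0, as a multiple of the residual SD.
se_regression = math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)

# Standard error of a plain mean of the 5 points at x = 0, as a multiple of their SD.
se_group = math.sqrt(1 / 5)

# Two-sided 95% t critical values from standard t tables.
T_CRIT_13_DF = 2.160  # regression: 15 observations - 2 coefficients
T_CRIT_4_DF = 2.776   # single group: 5 observations - 1

half_width_regression = T_CRIT_13_DF * se_regression
half_width_group = T_CRIT_4_DF * se_group

print("regression CI half-width multiplier:", round(half_width_regression, 3))
print("5-point CI half-width multiplier:   ", round(half_width_group, 3))
```

The regression-based multiplier is smaller mainly because the model has 13 error degrees of freedom instead of 4. If the variability differs a lot across the x values, the equal-SD assumption behind this comparison no longer holds.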

Salam says

Hi,

What make us use the linear regression instead of other types of regression. In other words, the motivation for selecting a linear model?

Jim Frost says

Hi Salam,

Typically, try linear regression first. If your data contain curvature, you might still be able to use linear regression. Linear regression is generally easier to use and includes some useful statistics that nonlinear regression can’t provide, such as p-values for the coefficients and R-squared.

However, if you can’t adequately fit the curvature in your data, it might be time to try nonlinear regression. While both types allow you to fit curvature, nonlinear regression is more flexible because it allows your model to fit more types of curvature.

I’ve written a post about how to choose between linear and nonlinear regression that you should read. Within that post are various related links that talk about how to fit curves using both types of regression, along with additional information about both types.

I hope this helps!

V.G.Subramanian says

Hi Jim

Thank u so much for your reply. I am really gorgeous to know much more of this . I shall keep sending mails seeking your reply which i hope you will not mind

V.G.Subramanian says

Hi Jim

I have been unfortunate to get your reply to my comment on 18/09/2018

Jim Frost says

Hi V.G.,

Sorry about the delay. As you can no doubt imagine, my schedule gets busy and things can fall through the cracks.

I replied under your original comment.

Aisling Dunphy says

Hi Jim,

Your blog has been really helpful! 🙂 I am currently completing my Masters Thesis and my primary outcome is to assess the relationship between Diabetes Distress and blood glucose control. I am a newbie to SPSS and I am at a loss as to how best to analyse my small (not normally distributed pre and post data transformation) data set.

I have been advised that regression analysis may be appropriate and better than correlations? However my data does not appear to be linear.

My diabetes distress variables consist of a score of 1-6 based on a likert scale and also are categorical (low, moderate, high distress) and my blood glucose consists of continuous data and also a categorical variable of poorly controlled blood glucose and well controlled blood glucose.

At the moment I am struggling to complete this analysis. Any help would be greatly appreciated 🙂

Tetyana says

Dear Jim, thank you very much for this post! Could you please explain the following?

You are writing: “you first need to fit and verify that you have a good model. Then, you look through the regression coefficients and p-values”

What if I have a small R-squared, but the coefficients are statistically significant with small values?

Almadi Obere says

Hi Jim

Thanks for your enlightening explanations. However, I want to engage you a bit. Under how to interpret regression results, you indicated that a small p-value indicates that the “independent variable is statistically significant.” I tend not to agree. Note that since the null hypothesis is that the coefficient of the independent variable is equal to zero, its rejection, as evidenced by a low p-value, should imply that it is the coefficient that is significantly different from zero and not the variable.

almadi

Jim Frost says

Hi Almadi,

Yes, you’re correct that the p-value tests whether the coefficient estimate is significantly different from zero. If it is, you can say that the coefficient is statistically significant. Alternatively, statisticians often say that the independent variable is statistically significant. In this context, these are two different ways of saying the same thing because the coefficient is a property of the variable itself.

V.G.Subramanian says

Hi Jim

As you must be well aware, the government releases price indices, and these are broadly used to determine the effect on base prices during a given period of time.

The construction industry normally uses these price indices, running over a period of time, to redetermine prices based on the movement between the base date and the current date, which is called price adjustment.

After a few years, the government releases a new series of price indices. We may not have index data in the old series, which forces us to use the new indices with a conversion factor to arrive at the equivalent value of the base price.

Where do you feel that regression analysis could help when we have to determine the current value of the base price using the new indices?

It is a bit amusing that someone suggested this to me.

Regards

V.G.Subramanian

Jim Frost says

Hi V.G.,

I agree that switching price indices can be a problem. If the indices overlap, you can perform regression analysis where the old index is the independent variable and the new index is the dependent variable. However, that is problematic if you don’t have both indices. If you had both indices, I suppose it wouldn’t be a problem to begin with!
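A minimal sketch of the overlap idea, assuming a hypothetical period in which both series were published. The index values below are made up for illustration, and `fit_line` and `old_to_new` are illustrative helper names, not part of any official conversion method.

```python
def fit_line(x, y):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical overlap period (made-up numbers, not real government data).
old_index = [100.0, 104.0, 108.0, 112.0, 116.0]  # independent variable
new_index = [50.0, 52.0, 54.0, 56.0, 58.0]       # dependent variable

intercept, slope = fit_line(old_index, new_index)

def old_to_new(old_value):
    """Estimate the new-series value that corresponds to an old-series value."""
    return intercept + slope * old_value

print(f"new = {intercept:.3f} + {slope:.3f} * old")
print(f"old index 120 is roughly new index {old_to_new(120.0):.1f}")
```

Real index series will not line up this cleanly, so it would be worth checking the residuals over the overlap period before relying on the fitted conversion.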

Ideally, you’d understand the differences behind how the government calculates both indices, and you could use that to estimate the value of the other index.

I’m not particularly familiar with this practice, so I don’t have a whole lot of insight into it. I hope this helps somewhat!

Antonio Padua says

Thank you for this, Jim. I’ve always felt a common sense explanation minus all the impressive math formulas is what is needed in statistics for data science. This is a big part of the basics I’ve been missing. I’m looking forward to your Logistic Regression Tutorial. How is that coming along for you?

Jim Frost says

Hi Antonio,

Thanks so much for your kind words! They mean a lot to me! Yes, I totally agree, explanations should focus on being intuitive and helping people grasp the concepts.

I have written a post on binary logistic regression. Unfortunately, it’ll be a while before I have a chance to write a more in-depth article. There are just too many subjects to write about!

Rashmi Bs says

Dear sir,

I have a few questions about when to use ANOVA and when to use regression analysis. In my study, I conducted an experiment with temperature, pH, and weight of a compound as independent variables and extraction as the dependent variable (I describe this very generally, but I have some specific independent and dependent variables along with these). I did the statistical analysis using one-way ANOVA with Tukey’s test, and I used the grouping method (using letters a, b, c, ...) to show significance based on the p-value. My question is: for this type of data, can I use regression analysis? And what is the main difference between Tukey’s test and regression analysis?

Jim Frost says

Hi Rashmi,

Both regression analysis and ANOVA are linear models. As linear models, both types of analyses have the same math “under the hood.” You can even use them interchangeably and get the same results. Traditionally, you use ANOVA when you have only, or mainly, categorical factors–although you can add in covariates (continuous variables). On the other hand, you tend to use regression when you have only, or mainly, continuous variables–although you can add in categorical variables.
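To see the “same math under the hood” point concretely, here is a small sketch with made-up data for three groups. It computes the one-way ANOVA F statistic from sums of squares, then fits a regression on dummy variables (group "a" as the reference level) and computes the overall F-test. The `ols` helper is a bare-bones solver written for this illustration, not a library function.

```python
def ols(rows, y):
    """Least-squares coefficients for y on the columns of rows (intercept added)."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(xi[a] * xi[b] for xi in X) for b in range(p)]
         + [sum(xi[a] * yi for xi, yi in zip(X, y))] for a in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

# Made-up data: three groups of three observations each.
groups = {"a": [4.0, 5.0, 6.0], "b": [7.0, 8.0, 9.0], "c": [10.0, 12.0, 14.0]}

# One-way ANOVA F statistic from between- and within-group sums of squares.
all_y = [v for vals in groups.values() for v in vals]
n, k = len(all_y), len(groups)
grand_mean = sum(all_y) / n
ss_between = sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in groups.values())
ss_within = sum((yi - sum(v) / len(v)) ** 2 for v in groups.values() for yi in v)
f_anova = (ss_between / (k - 1)) / (ss_within / (n - k))

# Regression on two dummy variables, with group "a" as the reference level.
rows = [[float(name == "b"), float(name == "c")] for name, v in groups.items() for _ in v]
coefs = ols(rows, all_y)
fitted = [coefs[0] + coefs[1] * r[0] + coefs[2] * r[1] for r in rows]
sse = sum((yi - fi) ** 2 for yi, fi in zip(all_y, fitted))
sst = sum((yi - grand_mean) ** 2 for yi in all_y)
f_regression = ((sst - sse) / (k - 1)) / (sse / (n - k))

print(f"ANOVA F = {f_anova:.3f}, regression F = {f_regression:.3f}")
print("intercept and dummy coefficients:", [round(c, 3) for c in coefs])
```

The intercept is the mean of the reference group and each dummy coefficient is the difference between that group’s mean and the reference mean, which is why the two analyses agree.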

Because ANOVA focuses on categorical factors and comparing multiple group means, statisticians have developed additional post hoc analyses to work with ANOVA, such as Tukey’s test. Typically, you’ll perform the ANOVA first and then the post hoc test. Suppose you perform a one-way ANOVA and obtain significant results. This significance tells you that not all of the group means are equal. However, it does not tell you which differences are statistically significant.

That point is where post hoc tests come in. These tests do two things. They’ll tell you which differences are statistically significant. They also control the family error rate for the group of comparisons. When you compare multiple differences like that, you increase the risk of a Type I error, which is when you say there is a difference but there really isn’t. When you compare multiple means, the family-wise Type I error rate will be higher than your significance level (alpha). These post hoc tests (other than Fisher’s) maintain the Type I error rate so it continues to equal alpha, which is what you would expect.
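A back-of-the-envelope sketch of why the family error rate grows. It assumes the comparisons are independent, which pairwise comparisons of group means are not exactly, so treat the result as an approximation rather than what a post hoc procedure actually computes. The function names are illustrative.

```python
from math import comb

def pairwise_comparisons(n_groups):
    """Number of distinct pairs of group means."""
    return comb(n_groups, 2)

def approx_family_error_rate(n_comparisons, alpha=0.05):
    """Chance of at least one Type I error across the comparisons,
    under the simplifying assumption that they are independent."""
    return 1 - (1 - alpha) ** n_comparisons

for n_groups in (3, 4, 5):
    m = pairwise_comparisons(n_groups)
    rate = approx_family_error_rate(m)
    print(f"{n_groups} groups -> {m} comparisons -> family error rate about {rate:.3f}")
```

With four groups there are six pairwise comparisons, and under this approximation the chance of at least one false positive is already above 0.26 even though each individual test uses alpha = 0.05.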

So, use an ANOVA first. If you obtain significant results for a categorical factor, you can use post hoc tests like Tukey’s to explore the differences between the various factor levels.

I really need to write a blog post about this! I will soon!

In the meantime, I hope this helps!

Kaushal Kumar Bhagat says

Is it necessary to conduct correlation analysis before regression analysis?

Jim Frost says

Hi Kaushal,

No it’s not absolutely required. I actually prefer producing a series of scatterplots (or a matrix plot) so I can see the nature of the different relationships. That helps give me a better feel for the data along with the types of relationships. However, if you have a good theory and a solid background knowledge on which variables should be included in the model, you can go straight to modeling. I think it depends a lot on your existing level of knowledge.

That all said, I personally like knowing the correlation structure between all of the variables. It gives me a better feel for the data.

Saeed Anowar says

Thanks

Patrik Silva says

Thank you Jim!

I really appreciate it!

PS

Patrik Silva says

Hi Jim, I hope you are having good time!

I would like to ask you a question, please!

I have 24 observations to perform a regression analysis (let’s say Zones), and I have many independent variables (IVs). I would like to know the minimum number of observations I should have to fit a reasonable linear regression model. I would also like to hear from you about how to test many regression models with different IVs, since I cannot use many IVs in a model where I have few observations (24).

Thank you in advance!

Patrik

Jim Frost says

Hi Patrik, great to hear from you again!

Those are great questions. For 24 observations, I’d say that you usually wouldn’t want more than 2 IVs. I’ve written an entire post about how many variables you can include in a regression model. Including too many IVs (and other terms such as interactions and polynomials) is known as overfitting the model. Check that post out because it’ll provide guidance and show you the dangers of including too many.

There’s another issue at play too because you want to compare a number of different regression models to each other. If you compare many models, it’s a form of data mining. The risk here is that if you compare enough models, you will uncover chance correlations. These chance correlations look like the real thing but only appear in your sample and not the population. I’ve written a post about how using this type of data mining to choose a regression model causes problems. This concern is particularly problematic with a small sample size like yours. It can find “patterns” in randomly generated data.

So, there are really two issues for you to watch out for: overfitting and chance correlations found through data mining!
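A small simulation of the chance-correlation risk, assuming 24 observations as in the question and an arbitrary pool of 100 candidate predictors (that number is invented for illustration). Every variable is pure random noise, so any correlation found is spurious by construction.

```python
import random

def correlation(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)          # fixed seed so the run is repeatable
n_obs, n_candidates = 24, 100   # 100 candidates is an arbitrary choice

# The "dependent variable" and every candidate predictor are independent noise.
y = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
candidates = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(n_candidates)]

correlations = [correlation(x, y) for x in candidates]
best = max(abs(r) for r in correlations)
print(f"largest |r| among {n_candidates} pure-noise predictors: {best:.2f}")
```

Picking the best-looking predictor out of a large pool like this is what makes a meaningless variable look like a real finding in a small sample.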

Hope this helps!

Patrik Silva says

Many Thanks Jim!!! You have no idea about how much you helped me.

Very well clarified!!!

God bless you always!!!

Patrik

Patrik Silva says

Hi Jim, I am everywhere in your post!

I am starting to love statistics, which is why I am not quiet.

I have some questions for you:

To use OLS regression, one of the assumptions is that the dependent variable is normally distributed. To achieve this requirement what I should do with my data? Should I check the normality of my dependent variable, for example using Shapiro test (etc)? If I conclude that my dependent variable is not following the normal distribution I should start to see data transformation, right?

Another way that I have used to see people analyzing the normality is by plotting the dependent variable with the independent variable and if the relationship doesn’t follow linear trend then they go to data transformation (which one you recommend me?)

Should I perform the regression using my data (original) and then the residuals will show me non-normality if do exists?

When should I transform my independent variables, and what is the consequence of transforming them?

Sorry, I use to ask many questions in a single comment, but I think this is the way to understand the full picture of my doubt,

You are being so useful to me,

Thank you again!

Patrik

Jim Frost says

Hi Patrik, I’m so happy to hear that you’re starting to love statistics! It’s a great field that is exciting. The thrill of discovery combined with getting the most value out of your data. I’m not sure if you’ve read my post about The Importance of Statistics, but if you haven’t, I recommend it. It explains why the field of statistics is more important than ever!

In OLS regression, the dependent variable does not have to be normally distributed. Instead, you need to assess the distribution of the residuals using residual plots. If your residuals are not normally distributed, there are a variety of possible reasons and different ways to resolve that issue. I always recommend that transforming your data is the last resort. For example, the residuals might be nonnormal because the model is specified incorrectly. Maybe there is curvature in the data that you aren’t modeling correctly? If so, transforming the data might mask the problem. You really want to specify the best possible model. However, if all else fails, you might need to transform the data. When you transform the data, you’ll need to back transform the results to make sense of the results because everything applies to the transformed data. Most statistical software should do this for you.

Be aware that you can’t trust R-squared and the standard error of the regression when you transform your dependent variable because they apply to the transformed data rather than the raw data (backtransformation won’t help there).

In terms of testing the normality of the residuals, I recommend using normal probability plots. You can usually tell at a glance whether they are normally distributed. If you need a test, I generally use the Anderson-Darling test–which you can see in action in my post about identifying the distribution of your data. By the way, as a case in point, the data in that post are not normal, but I use it as the dependent variable in OLS regression in this post about using regression to make predictions. The residuals are normally distributed even though the dependent variable is not.
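Here is a simulated sketch of that last point, using invented data rather than the data from the posts mentioned above. The predictor is strongly right-skewed, so the dependent variable is too, but the errors are normal. Skewness is used only as a crude numerical stand-in for looking at a normal probability plot.

```python
import random

def skewness(values):
    """Sample skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(1)  # fixed seed so the run is repeatable

# Simulated data: skewed predictor, normally distributed errors.
x = [rng.expovariate(1.0) for _ in range(5000)]
y = [3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

# Fit y = intercept + slope * x by ordinary least squares.
mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
         / sum((a - mean_x) ** 2 for a in x))
intercept = mean_y - slope * mean_x
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]

skew_y = skewness(y)
skew_resid = skewness(residuals)
print(f"skewness of dependent variable: {skew_y:.2f}")
print(f"skewness of residuals:          {skew_resid:.2f}")
```

The dependent variable is clearly skewed while the residuals are close to symmetric, which is the pattern the OLS assumption actually cares about.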

SUBROTO Chatterjee says

In the coffee intake and smoking example, the first result showed that higher coffee intake leads to higher mortality, but after including smoking, coffee intake leads to lower or no mortality? Smoking was revealed to cause the mortality, but how did coffee intake now result in the opposite? Was a separate test taken for this result? Please let me know.

S. CHATTERJEE

Jim Frost says

Hi, that’s a great question. It turns out that coffee and smoking are correlated. The negative effects of smoking on mortality are well documented. However, for some reason, the researchers did not originally include smoking in their model. Because drinking coffee and smoking are correlated, the variable for coffee consumption took on some of smoking’s effect on mortality.

Put another way, because smoking was not included in the model, it was not being controlled (held constant). So, as you increased coffee consumption, smoking also tended to increase because it is both positively correlated with coffee consumption and not in the model. Therefore, it appeared as though increased coffee consumption is correlated with higher mortality rates but only because smoking was not included in the model.

Presumably, the researchers had already collected data about smoking. So, all they had to do was include the smoking variable in their regression model. Voila, the model now controls for smoking and the new output displays the new estimate of the effect that coffee has on mortality.
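The mechanism can be sketched with simulated data. The numbers below are invented and are not the study’s data: “mortality” depends only on smoking, coffee has no effect at all, and coffee is positively correlated with smoking. The `ols` helper is a bare-bones solver written for this illustration.

```python
import random

def ols(rows, y):
    """Least-squares coefficients for y on the columns of rows (intercept added)."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(xi[a] * xi[b] for xi in X) for b in range(p)]
         + [sum(xi[a] * yi for xi, yi in zip(X, y))] for a in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

rng = random.Random(2)  # fixed seed so the run is repeatable
n = 5000

# Simulated, standardized scores (hypothetical, for illustration only).
smoking = [rng.gauss(0.0, 1.0) for _ in range(n)]
coffee = [0.7 * s + rng.gauss(0.0, 0.7) for s in smoking]   # correlated with smoking
mortality = [1.0 * s + rng.gauss(0.0, 1.0) for s in smoking]  # coffee has zero true effect

# Model 1 omits smoking; model 2 controls for it.
coffee_only = ols([[c] for c in coffee], mortality)[1]
coffee_adjusted = ols([[c, s] for c, s in zip(coffee, smoking)], mortality)[1]

print(f"coffee coefficient without smoking in the model: {coffee_only:.2f}")
print(f"coffee coefficient with smoking in the model:    {coffee_adjusted:.2f}")
```

When smoking is left out, the coffee coefficient absorbs part of smoking’s effect and looks clearly positive. Once smoking is in the model, the coffee coefficient falls back toward its true value of zero.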

This point illustrates a potential problem. If the researchers had not collected the smoking information, they would have really been stuck. Before conducting any study researchers need to do a lot of background research to be sure that they are collecting the correct data!

I hope this helps!

Ahmed says

Hi Jim

Hope all thing is well,

I have faced problem with plotting, which is included the relationship between dependent variable (response) and the independent variables .

when i do the main effect plots, i have the straight line increasing.

y= x, this linear trending

to change it i need to make y= square root for time

I’m stuck on this and couldn’t find a solution for it.

Regards

Cara says

Hi Jim,

I was wondering if you can help me? I am doing my dissertation and I have 1 within-subjects IV, and 3 between-subjects IVs.. most of my variables are categorical, but one is not categorical, it is a questionnaire which I am using to determine sleep quality, with both Likert scales and own answers to amount of sleep (hours), amount of times woke in the night etc. Can I use a regression when making use of both categorical data and other? I also have multiple DVs (angry/sad Likert ratings).. but I *could* combine those into one overall ’emotion’ DV. Any help would be much appreciated!

Jim Frost says

Hi Cara, because your DVs use the Likert scale, you really should be using Ordinal Logistic Regression. This type of regression is designed for ordinal dependent variables like yours. As for the IVs, it can be tricky using ordinal variables. They’re not quite either continuous or categorical. My suggestion is to give them a try as continuous variables and check the residual plots to see how they look. If they look good, then it’s probably ok. However, if they don’t look good, you can try refitting the model using them as categorical variables and then rechecking the residual plots. If the residuals still don’t look good, you can then try using the chi-square test of independence for ordinal data.

As for combining the data, that would seem to be a subject-area specific decision, and I don’t know that area well enough to make an informed recommendation.

Martin Amsteus says

Yes. But it may be that you miss my point. I argue that a proper and sound experiment will allow you to test for causality, regardless of whether you deploy, e.g., Pearson’s r or regression. With no experimental design, neither Pearson’s r nor a regression will test for an effect relationship between the variables. Randomisation makes a better case for controlling for variables that you are unaware of than picking a few and then proclaiming that your study found that x will cause an increase in y or that x has an effect on y. You may as well argue that you don’t need to control for any variables and that any correlational study tests for effect relationships.

Jim Frost says

Hi Martin, yes, that is exactly what I’m saying. Whether you can draw a causal conclusion depends on whether you used a randomized experiment to collect your data. If it’s an observational study, you can’t assume it’s anything other than correlation. What you write in your comment agrees with what I’m saying.

The controlling for other variables that I mention in this post is a different matter. Yes, if you include a variable in a regression model, it is held constant while estimating the effects of the other variables. That doesn’t mean you can assume causality though.

Martin Amsteus says

No statistical tool or method turns a survey or correlation study into an experiment, i.e., regression does not test or imply a cause-effect relationship. A positive relationship between smoking and cancer in a regression analysis does not mean that smoking causes cancer. You have not controlled for what you are unaware of.

Jim Frost says

Hi Martin, you are 100% correct about the fact that correlation doesn’t imply causation. This issue is one that I plan to cover in future posts.

There are two issues at play here: the type of study under which the data were collected and the statistical findings.

Being able to determine causation comes down to the difference between an observational study versus a randomized experiment. You actually use the same analyses to assess both types of designs. In an observational study, you can only establish correlation and not causality. However, in a randomized experiment, the same patterns and correlations in the data can suggest causality. So, regression analysis can help establish causality, but only when it’s performed on data that were collected through a randomized experiment.

Hari says

Very nicely explained. Thank you.

Jim Frost says

Thank you, Hari!

Kaleem says

Thanks for your reply and for the guidance.

I read your posts which are very helpful. After reading them, I concluded that only the independent variables which have a well-established association with the dependent variable should be included. Hence, in my case, variable Z should not be included given that the association of Z with dependent variable is not well-established.

Furthermore, suppose there is another variable (A) and literature suggests that it, in general, has an association with dependent variable. However, assume that A does not affect any independent variables so there is no omitted variable bias. In this case, if there is no data available for A (due to the study being conducted in different environment/context) then what statistical techniques can be deployed to address any problems caused due to the exclusion of A?

I look forward to your reply and I will be grateful for your reply.

Kind regards.

Kaleem says

Thanks for the reply. I apologise if I am taking a considerable time out of your schedule.

Based on the literature, there isn’t any conclusive evidence that z is a determinant of y. So, that is why I intend to remove z. Some studies include it while some do not and some find significant association (between y and z) while some find the association insignificant. Hence, I think I can safely remove it.

Moreover, I will be grateful if you can answer another query. From a statistical viewpoint, is it fine if I use the generalized method of moments (GMM) for a binary dependent variable?

Kind regards.

Jim Frost says

While I can’t offer you a concrete statement about whether you should include or exclude the variable (clearly there is disagreement in your own field), I do suggest that you read my article about specifying the correct regression model. I include a number of tips and considerations.

Unfortunately, I don’t know enough about GMM to make a recommendation. All of the examples I have seen personally are for continuous data, but I don’t know about binary data.

Kaleem says

Thanks for your reply. I really appreciate it. Could you please also provide an answer to my query mentioned below for further clarification?

Kaleem says

Further clarification on my above post. From the internet, I found that if a variable (z) is related to y but unrelated to x, then including z will reduce the standard errors of x. So, if z is excluded, but the F-statistic and adjusted R-squared are fine, do high standard errors create problems? I look forward to your reply.

Jim Frost says

Yes, what you read is correct. Typically, if Z is statistically significant, you should include it in your model. If you exclude it, the precision of your coefficient estimates will be lower (higher standard errors). You also risk a biased model because you are not including important information in the model–check the residual plots. The F-test of overall significance and adjusted R-squared depend on the other IVs in your model. If Z is by far the best variable, it’s possible that removing it will cause the F-test to not be significant and adjusted R-square might drop noticeably. Again, that depends on how the explanatory power of Z compares to the other IVs. Why do you want to remove a significant variable?

Kaleem says

Thanks for the reply. Jim.

I am unable to understand “Your model won’t fit the data as well as before depending on the strength of the relationship between the dropped independent variable and the dependent variable”. Are you stating that other independent variables will be fine but r-square will become low? I will be grateful if you can explain this.

Kind regards

Jim Frost says

Hi, you indicated that the removed independent variable is related to the dependent variable, but it is not correlated with the other independent variables. Consequently, removing that independent variable should reduce R-squared. For one thing, that’s the typical result of removing variables, even when they’re not statistically significant. In this case, because it is not correlated to the other independent variables, you know that the removed variable is supplying unique information. Taking that variable out means that information is no longer included in the model. R-squared will definitely go down, possibly dramatically.

R-squared measures the strength of the relationship between the entire set of IVs and the DV. Read my post about R-squared for more information.

Kaleem says

Hello, Jim.

What is the impact* on the independent variables in the model if I omit a variable that is a determinant of dependent variable but is not related to any of the independent variables?

*Here impact relates to the independent variables’ p-values and the coefficients.

Kind regards.

Jim Frost says

Hi Kaleem,

If the independent variable is not correlated with the other independent variables, it’s likely that there would be a minimal effect on the other independent variables. Your model won’t fit the data as well as before depending on the strength of the relationship between the dropped independent variable and the dependent variable. You should also check the residual plots to be sure that by removing the variable you’re not introducing bias.
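A simulated sketch of this situation, with invented data: `x2` affects the dependent variable but is generated independently of `x1`. Dropping `x2` should leave the `x1` coefficient roughly where it was while the fit (R-squared) gets noticeably worse. The `ols` and `r_squared` helpers are written for this illustration.

```python
import random

def ols(rows, y):
    """Least-squares coefficients for y on the columns of rows (intercept added)."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(xi[a] * xi[b] for xi in X) for b in range(p)]
         + [sum(xi[a] * yi for xi, yi in zip(X, y))] for a in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

def r_squared(rows, y, coefs):
    """Proportion of the variance in y explained by the fitted model."""
    fitted = [coefs[0] + sum(c * v for c, v in zip(coefs[1:], r)) for r in rows]
    mean_y = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst

rng = random.Random(3)  # fixed seed so the run is repeatable
n = 5000
x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
x2 = [rng.gauss(0.0, 1.0) for _ in range(n)]  # generated independently of x1
y = [2.0 * a + 3.0 * b + rng.gauss(0.0, 1.0) for a, b in zip(x1, x2)]

full_rows = [[a, b] for a, b in zip(x1, x2)]
reduced_rows = [[a] for a in x1]
full = ols(full_rows, y)
reduced = ols(reduced_rows, y)
r2_full = r_squared(full_rows, y, full)
r2_reduced = r_squared(reduced_rows, y, reduced)

print(f"x1 coefficient: full model {full[1]:.2f}, x2 dropped {reduced[1]:.2f}")
print(f"R-squared:      full model {r2_full:.2f}, x2 dropped {r2_reduced:.2f}")
```

The coefficient for `x1` stays close to its true value of 2 in both models, but the unexplained variation grows once `x2` is removed, which is also why the standard errors of the remaining coefficients get larger.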

ghulam mustafa says

Why do we usually use a 5% level of significance for comparing instead of 1% or another level?

Jim Frost says

Hi, I actually write about this topic in a post about hypothesis testing. It’s basically a tradeoff between several different error rates–and a dash of tradition. Read that post and see if it answers your questions.

Ghulam Mustafa says

Sir, usually we take a 5% level of significance for comparing. Why 0?

Jim Frost says

Hi Ghulam, yes, the significance level is usually 0.05. I’m not sure what you’re asking about in regards to zero? The p-values in the example output are all listed as 0.000, which is less than the significance level of 0.05, so they are statistically significant.

Shamsun Naher says

In my model, I use different independent variables. Now my question is: before using regression, do I need to check the distribution of the data? If yes, then please give the names of the tests. My title is Education and Productivity Nexus: Evidence from the Pharmaceutical Sector in Bangladesh.

Jim Frost says

Hi Shamsun, typically you test the distribution of the residuals after you fit a model. I’ve written a blog post about checking your residual plots that you should read.

I hope this helps!

Jim

Khadidja Benallou says

Thank you Mr. Jim

Jim Frost says

You’re very welcome!

NARAYANASWAMY AUDINARAYANA says

In linear regression, can we use categorical variables as Independent variables? If yes, what should be the minimum or maximum categories in an Independent variable?

Jim Frost says

Hi, yes you can use categorical variables as independent variables! The number of groups really depends on what makes sense for your study area. Of course, the minimum is two. There really is no maximum in theory. It depends on what makes sense for your study. However, in practice, having more groups requires a larger total sample size, which can become expensive. If you have 2-9 groups, you should have at least 15 in each group. For 10-12 groups, you should have 20. These numbers are based on simulation studies for ANOVA, but they also apply to categorical variables in regression. In a nutshell, figure out what makes sense for your study and then be sure to collect enough data!
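As a rough illustration of how a categorical variable enters a regression model, the sketch below turns a hypothetical three-level variable into dummy (indicator) columns with one reference level left out, and encodes the group-size guideline from the reply above. The `region` data and both function names are made up for this example.

```python
def dummy_code(values, reference):
    """Return (levels, rows): one 0/1 column per non-reference level."""
    levels = sorted(set(values) - {reference})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

def min_per_group(n_groups):
    """Per-group sample size guideline quoted in the reply above."""
    if 2 <= n_groups <= 9:
        return 15
    if 10 <= n_groups <= 12:
        return 20
    return None  # the reply gives no figure outside 2-12 groups

# Hypothetical categorical independent variable with three groups.
region = ["north", "south", "west", "north", "west"]
levels, rows = dummy_code(region, reference="north")

print("dummy columns:", levels)
for value, row in zip(region, rows):
    print(f"{value:>5} -> {row}")
print("suggested minimum per group:", min_per_group(len(set(region))))
```

A variable with k groups needs k - 1 dummy columns. Each coefficient then estimates the difference between that group and the reference group, holding the other variables in the model constant.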

I hope this helps!

Jim