Regression analysis mathematically describes the relationship between a set of independent variables and a dependent variable. There are numerous types of regression models that you can use. This choice often depends on the kind of data you have for the dependent variable and the type of model that provides the best fit. In this post, I cover the more common types of regression analyses and how to decide which one is right for your data.

I’ll provide an overview along with information to help you choose. I organize the types of regression by the different kinds of dependent variable. If you’re not sure which procedure to use, determine which type of dependent variable you have, and then focus on that section in this post. This process should help narrow the choices! I’ll cover regression models that are appropriate for dependent variables that measure continuous, categorical, and count data.

**Related post**: Guide to Data Types and How to Graph Them

## Regression Analysis with Continuous Dependent Variables

Regression analysis with a continuous dependent variable is probably the first type that comes to mind. While this is the primary case, you still need to decide which one to use.

Continuous variables are measurements on a continuous scale, such as weight, time, and length.

### Linear regression

Linear regression, also known as ordinary least squares (OLS) and linear least squares, is the real workhorse of the regression world. Use linear regression to understand the mean change in a dependent variable given a one-unit change in each independent variable. You can also use polynomials to model curvature and include interaction effects. Despite the term “linear model,” this type can model curvature.

This analysis estimates parameters by minimizing the sum of the squared errors (SSE). Linear models are the most common and most straightforward to use. If you have a continuous dependent variable, linear regression is probably the first type you should consider.
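To make the least-squares idea concrete, here is a minimal sketch in plain Python for the one-predictor case. The data values and the function name `fit_simple_ols` are made up for illustration. In practice you would use statistical software, which also gives you standard errors and p-values.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope, sse) for y = b0 + b1*x fitted by least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Closed-form least-squares estimates for a single predictor.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared errors for the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, sse

# Hypothetical data: hours studied (x) and test score (y).
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]
b0, b1, sse = fit_simple_ols(hours, score)
print(f"intercept={b0:.2f}, slope={b1:.2f}, SSE={sse:.2f}")
# prints: intercept=47.70, slope=4.10, SSE=1.90
```

The slope is the mean change in the dependent variable for a one-unit change in the independent variable. Any other line through these points would have a larger SSE.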

There are some special options available for linear regression.

- **Fitted line plots**: If you have one independent variable and the dependent variable, use a fitted line plot to display the data along with the fitted regression line and essential regression output. These graphs make understanding the model more intuitive.
- **Stepwise regression and Best subsets regression**: These automated methods can help identify candidate variables early in the model specification process.

### Advanced types of linear regression

Linear models are the oldest type of regression. They were designed so that statisticians could do the calculations by hand. However, OLS has several weaknesses, including a sensitivity to both outliers and multicollinearity, and it is prone to overfitting. To address these problems, statisticians have developed several advanced variants:

- **Ridge regression** allows you to analyze data even when severe multicollinearity is present and helps prevent overfitting. This type of model reduces the large, problematic variance that multicollinearity causes by introducing a slight bias in the estimates. The procedure trades away much of the variance in exchange for a little bias, which produces more useful coefficient estimates when multicollinearity is present.
- **Lasso regression** (least absolute shrinkage and selection operator) performs variable selection that aims to increase prediction accuracy by identifying a simpler model. It is similar to Ridge regression but with variable selection.
- **Partial least squares (PLS) regression** is useful when you have very few observations compared to the number of independent variables or when your independent variables are highly correlated. PLS reduces the independent variables to a smaller number of uncorrelated components, similar to Principal Components Analysis. Then, the procedure performs linear regression on these components rather than the original data. PLS emphasizes developing predictive models and is not used for screening variables. Unlike OLS, you can include multiple continuous *dependent* variables. PLS uses the correlation structure to identify smaller effects and model multivariate patterns in the dependent variables.
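Here’s a rough sketch of the ridge idea for the simplest interesting case: two highly correlated predictors. It solves the penalized equations (X'X + λI)b = X'y directly on centered data. The data, the penalty value, and the function name `fit_ridge_two_predictors` are assumptions for illustration, and real software typically standardizes the predictors first. Lasso needs an iterative algorithm, so it isn’t shown here.

```python
def fit_ridge_two_predictors(x1, x2, y, lam):
    """Ridge estimates for two centered predictors; lam=0 gives OLS."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    # Entries of X'X and X'y for the centered data.
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * t for a, t in zip(c1, cy))
    s2y = sum(b * t for b, t in zip(c2, cy))
    # Solve the 2x2 system (X'X + lam*I) b = X'y by Cramer's rule.
    a11, a22 = s11 + lam, s22 + lam
    det = a11 * a22 - s12 * s12
    b1 = (s1y * a22 - s12 * s2y) / det
    b2 = (a11 * s2y - s12 * s1y) / det
    return b1, b2

# Hypothetical data: x2 is nearly a copy of x1 (severe multicollinearity).
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1.2, 1.9, 3.1, 3.8, 5.2, 5.9]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]

ols = fit_ridge_two_predictors(x1, x2, y, lam=0.0)
ridge = fit_ridge_two_predictors(x1, x2, y, lam=1.0)
print("OLS coefficients:  ", ols)
print("Ridge coefficients:", ridge)
```

With a positive penalty, the sum of the squared coefficients is smaller than for OLS. That shrinkage is the “little bias” the procedure accepts in exchange for lower variance.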

### Nonlinear regression

Nonlinear regression also requires a continuous dependent variable, but it provides a greater flexibility to fit curves than linear regression.

Like OLS, nonlinear regression estimates the parameters by minimizing the SSE. However, nonlinear models use an iterative algorithm rather than the linear approach of solving them directly with matrix equations. What this means for you is that you need to worry about which algorithm to use, specifying good starting values, and the possibility of either not converging on a solution or converging on a local minimum rather than a global minimum SSE. And, that’s in addition to specifying the correct functional form!
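To show what “iterative algorithm” and “starting values” mean in practice, here is a minimal sketch of the Gauss-Newton method with step halving for the model y = a·exp(b·x). The model form, the synthetic noise-free data, the starting values, and the function names are all assumptions chosen for illustration. This is one of several algorithms that statistical software may use.

```python
import math

def sse(a, b, x, y):
    """Sum of squared errors for the model y = a * exp(b * x)."""
    return sum((yi - a * math.exp(b * xi)) ** 2 for xi, yi in zip(x, y))

def fit_exponential(x, y, a, b, max_iter=50, tol=1e-12):
    """Gauss-Newton with step halving, starting from the supplied a and b."""
    for _ in range(max_iter):
        # Residuals and partial derivatives at the current estimates.
        e = [math.exp(b * xi) for xi in x]
        r = [yi - a * ei for yi, ei in zip(y, e)]
        ja = e                                        # derivative with respect to a
        jb = [a * xi * ei for xi, ei in zip(x, e)]    # derivative with respect to b
        # Normal equations (J'J) step = J'r, solved as a 2x2 system.
        saa = sum(v * v for v in ja)
        sbb = sum(v * v for v in jb)
        sab = sum(u * v for u, v in zip(ja, jb))
        ga = sum(u * v for u, v in zip(ja, r))
        gb = sum(u * v for u, v in zip(jb, r))
        det = saa * sbb - sab * sab
        da = (ga * sbb - sab * gb) / det
        db = (saa * gb - sab * ga) / det
        # Halve the step until the SSE no longer increases.
        current = sse(a, b, x, y)
        scale = 1.0
        while scale > 1e-6 and sse(a + scale * da, b + scale * db, x, y) > current:
            scale /= 2
        a, b = a + scale * da, b + scale * db
        if abs(scale * da) + abs(scale * db) < tol:
            break
    return a, b

# Synthetic, noise-free data generated from a=2, b=0.5.
x = [0, 1, 2, 3, 4]
y = [2 * math.exp(0.5 * xi) for xi in x]

# The starting values (1.5, 0.4) are a deliberate guess near the truth.
a_hat, b_hat = fit_exponential(x, y, a=1.5, b=0.4)
print(f"a={a_hat:.4f}, b={b_hat:.4f}, SSE={sse(a_hat, b_hat, x, y):.2e}")
```

Unlike OLS, nothing here is solved in one step. The estimates depend on where you start, which is why poor starting values can lead to a local minimum or a failure to converge.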

Most nonlinear models have one continuous independent variable, but it is possible to have more than one. When you have one independent variable, you can graph the results using a fitted line plot.

My advice is to fit a model using linear regression first and then determine whether the linear model provides an adequate fit by checking the residual plots. If you can’t obtain a good fit using linear regression, then try a nonlinear model because it can fit a wider variety of curves. I always recommend that you try OLS first because it is easier to perform and interpret.

I’ve written quite a bit about the differences between linear and nonlinear models. Read the following posts to learn the differences between these two types, how to choose which one is best for your data, and how to interpret the results.

- What is the Difference Between Linear and Nonlinear Models?
- How to Choose Between Linear and Nonlinear Regression?
- Curve Fitting with Linear and Nonlinear Regression

## Regression Analysis with Categorical Dependent Variables

So far, we’ve looked at models that require a continuous dependent variable. Next, let’s move on to categorical dependent variables. A categorical variable has values that you can put into a countable number of distinct groups based on a characteristic. Logistic regression transforms the dependent variable and then uses Maximum Likelihood Estimation, rather than least squares, to estimate the parameters.

Logistic regression describes the relationship between a set of independent variables and a categorical dependent variable. Choose the type of logistic model based on the type of categorical dependent variable you have.

### Binary Logistic Regression

Use binary logistic regression to understand how changes in the independent variables are associated with changes in the probability of an event occurring. This type of model requires a binary dependent variable. A binary variable has only two possible values, such as pass and fail.
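As a rough illustration of Maximum Likelihood Estimation for a binary outcome, the sketch below fits a one-predictor logistic model with Newton-Raphson iterations. The pass/fail data and the function name `fit_binary_logistic` are hypothetical, and real software adds convergence checks and standard errors that I’ve left out.

```python
import math

def fit_binary_logistic(x, y, n_iter=25):
    """Newton-Raphson MLE for P(y=1) = 1 / (1 + exp(-(b0 + b1*x)))."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        p = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1 - pi) for pi in p]
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative Hessian), a 2x2 system.
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        b0 += (g0 * h11 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: hours studied and whether the student passed (1) or failed (0).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_binary_logistic(hours, passed)
print(f"intercept={b0:.3f}, slope={b1:.3f}, odds ratio per hour={math.exp(b1):.3f}")
```

The exponentiated slope is the odds ratio: the multiplicative change in the odds of passing for each additional hour. Note that this sketch assumes the two outcomes overlap on the predictor. With perfectly separated data, the estimates do not converge.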

**Example:** Political scientists assess the odds of the incumbent U.S. President winning reelection based on stock market performance.

Read my post about a binary logistic model that estimates the probability of House Republicans belonging to the Freedom Caucus.

### Ordinal Logistic Regression

Ordinal logistic regression models the relationship between a set of predictors and an ordinal response variable. An ordinal response has at least three groups which have a natural order, such as hot, medium, and cold.

**Example:** Market analysts want to determine which variables influence the decision to buy large, medium, or small popcorn at the movie theater.

### Nominal Logistic Regression

Nominal logistic regression models the relationship between a set of independent variables and a nominal dependent variable. A nominal variable has at least three groups which do not have a natural order, such as scratch, dent, and tear.

**Example**: A quality analyst studies the variables that affect the odds of the type of product defects: scratches, dents, and tears.

## Regression Analysis with Count Dependent Variables

If your dependent variable is a count of items, events, results, or activities, you might need to use a different type of regression model. Counts are nonnegative integers (0, 1, 2, etc.). Count data with higher means tend to be normally distributed and you can often use OLS. However, count data with smaller means can be skewed, and linear regression might have a hard time fitting these data. For these cases, there are several types of models you can use.

### Poisson regression

Count data frequently follow the Poisson distribution, which makes Poisson Regression a good possibility. Poisson variables are a count of something over a constant amount of time, area, or another consistent length of observation. With a Poisson variable, you can calculate and assess a rate of occurrence. A classic example of a Poisson dataset is provided by Ladislaus Bortkiewicz, a Russian economist, who analyzed annual deaths caused by horse kicks in the Prussian Army from 1875-1894.

Use Poisson regression to model how changes in the independent variables are associated with changes in the counts. Poisson models are similar to logistic models because they use Maximum Likelihood Estimation and transform the dependent variable using the natural log. Poisson models can be suitable for rate data, where the rate is a count of events divided by a measure of that unit’s *exposure* (a consistent unit of observation). For example, homicides per month.

**Example**: An analyst uses Poisson regression to model the number of calls that a call center receives daily.
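Here is a minimal sketch of how a Poisson model with a natural log link can be fit by Maximum Likelihood. It extends the call center example with a made-up predictor (the number of ads aired that day). The data and the function name `fit_poisson` are assumptions for illustration only.

```python
import math

def fit_poisson(x, y, n_iter=25):
    """Newton-Raphson MLE for log(mean count) = b0 + b1*x."""
    # Start at the intercept-only fit: the log of the average count.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Information matrix (negative Hessian), a 2x2 system.
        h00 = sum(mu)
        h01 = sum(mi * xi for mi, xi in zip(mu, x))
        h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        b0 += (g0 * h11 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: ads aired per day and calls received that day.
ads = [0, 1, 2, 3, 4, 5]
calls = [2, 3, 6, 7, 8, 9]

b0, b1 = fit_poisson(ads, calls)
print(f"intercept={b0:.3f}, slope={b1:.3f}, rate ratio per ad={math.exp(b1):.3f}")
```

Because the model works on the log scale, the exponentiated slope is a rate ratio: the multiplicative change in the expected count for each additional ad.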

### Alternatives to Poisson regression for count data

Not all count data follow the Poisson distribution because this distribution has some stringent restrictions. Fortunately, there are alternative analyses you can perform when you have count data.

**Negative binomial regression**: Poisson regression assumes that the variance equals the mean. When the variance is greater than the mean, your model has overdispersion. A negative binomial model, also known as NB2, can be more appropriate when overdispersion is present.
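A quick, informal screen for overdispersion is to compare the sample variance of the counts to their mean. The sketch below uses made-up counts. Keep in mind that this looks at the raw counts only. The Poisson assumption really applies to the counts after accounting for the independent variables, so treat this as a first look rather than a formal test.

```python
import statistics

# Hypothetical daily counts of some event.
counts = [0, 1, 0, 2, 9, 0, 3, 12, 1, 0, 7, 1]

mean = statistics.mean(counts)
variance = statistics.variance(counts)   # sample variance (n - 1 denominator)
ratio = variance / mean

print(f"mean={mean:.2f}, variance={variance:.2f}, variance/mean={ratio:.2f}")
# A ratio well above 1 hints at overdispersion.
```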

**Zero-inflated models**: Your count data might have too many zeros to follow the Poisson distribution. In other words, there are more zeros than the Poisson regression predicts. Zero-inflated models assume that two separate processes work together to produce the excessive zeros. One process determines whether there are zero events or more than zero events. The other is the Poisson process that determines how many events occur, some of which can be zero. An example makes this clearer!

Suppose park rangers count the number of fish caught by each park visitor as they exit the park. A zero-inflated model might be appropriate for this scenario because there are two processes for catching zero fish:

- Some park visitors catch zero fish because they did not go fishing.
- Other visitors went fishing, and some of these people caught zero fish.
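The two processes can be written as a simple probability calculation. In this sketch, the share of visitors who don’t fish (40%) and the average catch among those who do (2 fish) are made-up numbers, and `zip_pmf` is a hypothetical helper name.

```python
import math

def zip_pmf(k, pi_zero, lam):
    """P(count = k) under a zero-inflated Poisson model.

    pi_zero: probability of a structural zero (the visitor did not fish).
    lam: Poisson mean for visitors who did fish.
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # Zeros come from both processes.
        return pi_zero + (1 - pi_zero) * poisson
    return (1 - pi_zero) * poisson

pi_zero, lam = 0.4, 2.0
overall_mean = (1 - pi_zero) * lam   # average catch across all visitors

p_zero_zip = zip_pmf(0, pi_zero, lam)
p_zero_poisson = math.exp(-overall_mean)   # plain Poisson with the same mean

print(f"P(zero fish), zero-inflated model: {p_zero_zip:.3f}")
print(f"P(zero fish), plain Poisson:       {p_zero_poisson:.3f}")
```

Under these assumptions, the zero-inflated model expects noticeably more zeros than a regular Poisson model with the same overall mean, which is exactly the pattern that signals the need for a zero-inflated analysis.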

Whew! That’s many different types of regression analysis! If you’re trying to figure out which one to choose, I hope you will use this information to point yourself in the right direction!

If you’re learning regression, check out my Regression Tutorial!

Abdi says

am glad that I found your page. Thank you so much for this informative post.

Actually, I have questions regarding data analysis. I have both scale and nominal variables as independent variables and 4 categorical dependent variables. Which model should I use to analyze them?

Thank you!!

Abdi

Jim Frost says

Hi, it sounds like you need nominal logistic regression assuming that your DVs don’t have a natural order. If they have a natural order, then ordinal logistic regression. You’ll need to fit separate models for each DV.

Best of luck with your analysis!

Harriet noble says

Hi jim,

Thank you for sharing your statistical knowledge. It’s very helpful indeed.

I just wondered, is there an easy way to tell from reading a research article whether a statistician has used a linear or logistic regression model? And if so, how?

Thank you

Jim Frost says

Hi Harriet,

The key giveaway would be the dependent variable. If the dependent variable is binary, ordinal, or nominal, the analyst should be using logistic regression. Also, if the article refers to odds ratios, link functions, or the study uses deviance R-squared, pseudo R-squared, or McFadden’s R-squared, they’re likely using logistic regression. Also, hopefully they’re transparent about which analysis they’re using!

shahd says

Hi Jim

How can I perform jackknife regression in SPSS, and what is the difference between jackknife and bootstrapping?

Sylvia Burgess says

Thanks Jim this is very helpful

Sylvia Burgess says

Jim, I was reading your comments and they seem helpful, so here goes. I have ELL-ESL Training as an ordinal variable and Math Scores as a continuous variable. Should I run linear regression or logistic regression?

Jim Frost says

Hi Sylvia,

I’m not sure which variable is your dependent and independent variable. It makes all the difference! From a similar comment you made, I’ll assume that the ELL-ESL training is the independent variable and you want to determine the impact on math scores as the dependent variable.

Assuming that is true, you’ll want to use linear regression because your dependent variable is continuous. You have an ordinal independent variable, which can be tricky to use. Ordinal variables are a bit like continuous variables and a bit like categorical (nominal) variables. Yet, they’re not quite either. You’ll need to include the ELL-ESL training as either a continuous or categorical independent variable. If you included it as a continuous variable, you might need to use either a polynomial or log transformation to fit curvature that might be present.

My suggestion is to start by using linear regression and fit different models that include the training variable as a continuous variable and then as a categorical variable and see which provides the best fit. Categorical variables use more degrees of freedom than a continuous variable, which can be problematic depending on your sample size and the number of levels in your categorical variable.

Best of luck with your analysis!

Charles says

Jim thank you for making statistics easy to learn. However, do you also have one for discrete time survival analysis?

Emma Kooij says

Hi Jim,

I was hoping you might help me shed some light on a math paper I’m doing for my Calculus class. I’m trying to find a regression that might predict the shape of a circle based on a rainbow arc.

Rainbows are actually circles, we just can’t see them fully from our perspective and I wanted to provide more evidence that a circle is in fact the best way to predict a rainbow’s curve using a regression. However, when I tried several regressions based on the points I had found using an outline of the visible arc, the end result looks more like very wide parabola rather than the narrow curvature of the actual circle it’s supposed to be. My professor is trying to push the point of having our paper sound as if we were explaining it to a high school freshman and that makes it a lot harder to explain and the mathematics somewhat tedious as we have to explain every step and justify our choices as well.

Is there any light you can shed on how I can make this a little more feasible?

Thank you,

Emma

Jim Frost says

Hi Emma,

You’ll most likely need to use a different type of regression analysis, such as nonlinear regression and then choose the correct form of the expectation function. At that point the analysis can fit the data to the model. I think that’s the answer to your problem. Use nonlinear regression and then find the correct expectation function–which I’d guess would be the formula for a circle.

I hope this helps!

Seun Opaleye says

Hi Jim, a quick one.

I sent a paper to a scopus index journal.

My model is a panel data with n=4 & T=21

I used fixed effects model to analyse the data.

The journal sent it back saying we have to test for unit root and stationarity because the data is 21years x 4countries. And one reviewer said try ardl.

What is your advice, and how do I take care of this?

Jim Frost says

Hi Seun,

I’m not an expert in time series regression models. However, I can confirm that those are legitimate concerns. I have not used ARDL (autoregressive distributed lag) myself. I have used ARIMA (autoregressive integrated moving average) a bit. I don’t know enough about the analyses or your subject area to make a recommendation, but yes, you’ll probably need to use something like those analyses. The problem is that you need to account for time order effects and avoid serially correlated residuals.

shahd says

So does the multiple regression model differ from the least squares model? Both gave me the same results.

Jim Frost says

There are different types of multiple regression, but, yes, it usually refers to OLS. If the results match, that’s what you’re using.

Sam says

ah ok thanks for all your help!

Sam says

For my positive data, it is split into ‘yes’ and ‘unsure’ and ‘no’ so I assumed that is a nominal variable? I may be wrong though!

Jim Frost says

Those are ordinal data. Two-way ANOVA might still work, but be sure to check the assumptions on the residual plots.

Sam says

Hi Jim,

thank you for your feedback. I have double checked and my variables are nominal! I am a bit confused whether it is possible to carry out a two way anova on just two nominal variables as I have looked online and the assumptions say I need two independent variables and a dependent variable?

If so, I would do the two way anova and then the post hoc test?

Jim Frost says

If they are nominal, then your comment about “more positive reviews” is not consistent with that.

Yes, two-way ANOVA is designed exactly for that scenario. Your two independent variables are your two nominal/categorical variables. The dependent variable is your continuous outcome.

Yes, do the two-way ANOVA. If you find that one or both of the independent variables are significant, perform the post hoc test.

Shahd says

Hi Jim,

I have a salary data set with 10 variables. I checked for multicollinearity and there is none. I need to use an advanced modeling technique other than regression analysis to get a higher grade on my assignment. Could you please advise me on this? What about principal component analysis? Will this improve my prediction model?

Jim Frost says

Hi Shahd,

It all depends on the nature and characteristics of your data. There is no inherently better analysis. In fact, if a linear model fits your data, then OLS is the best possible analysis to use. To see why, see my post about the Gauss-Markov Theorem and OLS Estimates.

However, if your data has specific problems, such as multicollinearity, other methods can be better. Typically, you’d use PCA when you have a large number of correlated predictors, often in conjunction with a small sample size. In that scenario, you could also use Partial Least Squares (PLS) regression.

OLS is a great starting place and if you find that it somehow fails to fit your data adequately, use that information about how/why it fails to find another analysis that addresses the problem. And, I wouldn’t say the alternatives are more advanced, they just address various problems/characteristics of your data. So, I really can’t make a recommendation. It depends on the specifics of your dataset.

Sam says

Also, I am a bit unsure how I would analyse the data once the anova tests have been run because even if it tells me there is a difference, I don’t think it tells you the relationship between the two variables e.g. I am trying to find out if those with more positive views buy more Fair Trade. I am assuming the test would tell me there is a difference between views and goods bought, but not whether those with positive view bought more? I may be wrong!

Thanks again for your help!!

Jim Frost says

Hi Sam,

For two nominal variables, you’d need two-way ANOVA. Typically, you’d perform ANOVA first and that tells you whether there is a statistically significant difference between group means. However, as you indicate, it doesn’t tell you the nature of those differences. To determine whether differences between specific pairs of groups are significant, you’ll need to perform a post hoc test, such as Tukey’s. I need to write a post about post hoc tests! I don’t have one yet!

Now, one thing I notice in what you wrote. You mention “more positive reviews.” That sounds like it’s potentially not a nominal variable. Possibly ordinal or continuous? If so, that changes things. ANOVA is for nominal (categories). Although, it might possibly work for groups based on ordinal data. It can be tricky figuring out how to analyze data when you have ordinal data as a predictor/IV.

Sam says

Hi Jim,

That’s very helpful thank you! Would I be able to use a one- way anova for two nominal variables?

Thanks again!

Sam

Sam says

Hi Jim!

I would be very grateful for your help with my data analysis.

I have a dependent continuous variable and an independent nominal variable. I wish to use correlation analysis to examine the relationship between the two variables but I do not think this is possible. How else could I analyse the two variables to see if there is a correlation?

Thank you!

Sam

Jim Frost says

Hi Sam,

It sounds like you need to use one-way ANOVA. This analysis will tell you whether the mean of your continuous DV is significantly different across the levels of your nominal IV.

Because the IV is nominal, regular correlation is not possible in the normal sense. In other words, you can’t increase or decrease a nominal variable to see how the other variable tends to change. Nominal variables are a difference in type. Such as type of damage: scratch, dent, and tear. ANOVA will tell you whether the different nominal values (types) are associated with different mean values of your DV.

I hope this helps!

Ellie Sharaki says

Hi, Jim.

If my parameter estimates in ordinal regression show negative values, do I need to report them one by one or does it suffice if I report the statistical (in)significance in the sig. column?

Thanks,

Ellie

shahd says

I need to use advanced modeling to get an A in this class. Can I use the lasso regression method? If I don’t include the interaction term, will this be a problem?

Jim Frost says

Hi, they might if you can show that lasso fits a better model than OLS. Typically, you’d use lasso when you have multicollinearity. It does introduce a bit of bias to reduce the variance. That can be worthwhile in some scenarios. I have no idea if that will produce a better model for your data, but you can give it a try! If it doesn’t produce a better model, it’s not worthwhile doing.

As for the interaction term, again, that depends on your data and study area. If an interaction effect actually occurs in the study area, not including it in the model can produce invalid results. However, not all study areas have interaction effects. Read the link I shared with you in my previous reply.

Best of luck with your analysis!

shahd says

I did that and I got a good fit. And I need to show what other possible regression model for the salary data set as I was wondering if I have to include interaction term because 3 of those variables are dummy. could you please advise with this

Jim Frost says

Hi Shahd,

Fitting the correct model is a balance between statistics and subject-area knowledge. There’s no way I can tell you exactly what you need to do. However, I write about the things you need to consider in my post about specifying the correct regression model. I think that post will be particularly helpful.

If you need to learn more about interaction effects, I wrote about that as well. You can include interaction terms for dummy variables. Would that make sense theoretically for your study area?

Best of luck with your analysis!

shahd says

Hi

What if I have a salary data set and my dependent variable is continuous and I have 10 independent variables one of them is not linear and I transformed it to be exponential

Jim Frost says

Sounds like you should start with OLS linear regression and see if you can get a good fit. That’s where I’d start.

Elisa says

Sorry, I forgot to mention that the variables in factors 1 and 3 can follow linear, quadratic, or other models! That is why I also wanted to analyze them individually.

Elisa says

Thank you Jim! I am getting a clearer idea! I have a model which looks like this: factor 1 –> factor 2 –> factor 3. Factors 1 and 3 are formed by different ordinal variables and factor 2 is an ordinal variable! So I am performing two analyses. One says factor 1 is independent and factor 2 is dependent. The other says factor 2 is independent and factor 3 is dependent. Should I perform logistic regression in both cases? Or can I consider the independent variable as continuous and the dependent variable as continuous or categorical, depending on the better fit for both analyses? In this way, I would perform a normal regression analysis. Another question I have is: if I say factor 2 is continuous in analysis one, must I say it is continuous for the second analysis? Or can I use it as continuous for analysis one and categorical for the second one? I think I might treat it the same in both analyses, so in this case the option would be to assume all the factors are continuous and perform OLS in both cases.

Thank you again!

Elisa says

Hi Jim,

I am quite confused about which model to use. I have some items that affect a variable. The items follow a Likert scale (from 1 to 5), and the variable is an ordinal variable (there are 4 groups that go from highest customization to lowest customization). First, I would like to study independently how each item affects the variable, to see if my hypotheses are correct. Can I use a fitted line plot? I am not sure which model I should use. And what about analyzing the impact of all the items on the ordinal variable? The data do not follow a normal distribution.

Jim Frost says

Hi Elisa,

First you need to figure out which variable is your dependent variable and which are your independent variables. It’s not clear from your comment which are which.

If your dependent variable is ordinal data, you’ll need to use ordinal logistic regression. If the DV is continuous, probably OLS linear regression is a good place to start.

It sounds like your independent variables are ordinal data. That can be tricky because ordinal data are a bit like categorical/nominal data and a bit like continuous data. You’ll have to try including them both ways to see which one produces a better fit. All things being equal, you’d prefer to include them as continuous data. When you include them as categorical data, behind the scenes, the analysis is actually fitting many variables that relate to the levels of each categorical variable. In short, you’re using many more variables going the categorical route, which can cause problems if you treat multiple variables as categorical. However, categorical can sometimes provide a more flexible fit.

You might be able to sum the ordinal variables to create a pseudo-continuous variable. I don’t have experience doing that myself. It seems like ordinal variables are common in some fields, but not others. And, they weren’t in mine. Of course, if you combine them, they might be easier to analyze but you won’t be able to isolate the role of each.

Basically, there’s a lot going on here that goes beyond what I can address in comments. But, that’s the situation in a nutshell! It’ll probably require some research and experimentation on your part to find the best solution out of several possibilities.

Finally, we’re getting to several clear-cut answers now! If you include your variables in a regression model, the simple fact that they’re in the model means that the analysis controls for them, and you can assess the independent effect of each variable. In other words, when you assess the effect of each variable, the model is holding constant the values of all the other variables that are in the model. That makes interpretation easy!

As for how to display the effect, you can’t use a fitted line plot when you have more than one independent variable because you’d need more than two dimensions to display it! However, you can display the effects on a main effects plot. These are specialized plots that graph the fitted values for values of each variable while holding the other variables constant. How to create main effects plots depends on the software you’re using, but they’re a nice feature because you can use them when you have more than one independent variable. It really helps you to graphically isolate the role of each variable.

I show an example of a main effects plot in my post about interaction effects (most of the graphs in that post are interaction plots but there is one example of a main effects plot). I should probably write a post about those some time soon!

Best of luck with your analysis!

Al says

Hi Jim,

Do you have a tutorial on variable selection for logistic regression? I am studying from a textbook that discusses the topic for MLR — the software produces a best-subsets table with adjusted R-squared and Mallow’s Cp for all the different variable combinations. For logistic regression the table contains RSS and Mallows Cp, along with something labeled “probability.”

THanks,

Al

Marcelo Ribeiro says

Hallo Jim,

I have a DV, normally distributed, in percentage (the % of women on the Governing Board in different companies around the world), and my independent variables (national gender imbalance indicators) are percentages, continuous variables, and ordinal variables.

Which model should I use? (and do you have any other post of the chose one for me to test the assumptions in Stata to obtain consistent results).

I checked that my DV and IV are significant correlated but there’s no significance when using a GLM model…

thank you,

Marcelo

Mridha says

I appreciate your work. I have learnt a lot from your blog.

Jim Frost says

Thanks–I’m glad it’s been helpful!

Faruque says

Hello Jim,

I am glad that I found your page. Thank you so much for this informative post.

Actually, I have questions regarding difference in difference (DID) method. I want to employ DID method in my paper to find out the pre-post treatment effect of adopting a technology. When I discussed with a colleague, he recommend me not to choose the treatment and control groups randomly but better to choose from a project. For instance, if the government decide to impose a new policy for X region in Y year and the whole population do not have any other alternatives rather then to adopt, then we can randomly choose treatment group from region X while control group should come from other provinces.

But what if the govenement says its not mandatory but people who will adopt will get many benefits. In this case, if we want to find out the pre-post benefit effects between adopters and non-adopters, can we also employ DID method?

Thank you very much.

Jim Frost says

Hi Faruque,

I haven’t used the difference in difference method myself. I know it is a method that tries to create an experiment using observational data. So, I don’t have firsthand insights.

One difficulty I see in your scenario where it is not mandatory but people choose is that there might be (probably is) a difference between those who choose to participate versus those who do not. If the program provides many benefits, there must be some reason why particular people do not join the program. Whatever that difference is, you’re starting out with it as a selection bias for adopters/non-adopters. This pre-existing difference could bias the results. It’s entirely possible that those who don’t join do that because they won’t receive as many benefits as the average person. If that’s the case, this bias will artificially inflate the difference in benefits between adopters and non-adopters.

It might be that your colleague is making that suggestion about different provinces because it prevents that decision as a source of bias. Maybe. Of course, you then have to worry about whether the citizens of the various provinces are different in some other systematic manner. Such is the nature of observational data!

I always hesitate to offer suggestions based on limited information. But these are the types of issues to consider.
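
For readers who have not seen the arithmetic behind a difference in difference comparison, here is a minimal sketch. The group names and all four means are hypothetical and exist only to show the calculation; this is not a full DID analysis, which would normally be fit as a regression with standard errors.

```python
# Difference-in-differences on four group means.
# All numbers below are hypothetical and exist only to show the arithmetic.
mean_outcome = {
    ("adopters", "pre"): 10.0,
    ("adopters", "post"): 16.0,
    ("non_adopters", "pre"): 8.0,
    ("non_adopters", "post"): 11.0,
}


def did_estimate(means):
    """Return (change among adopters) minus (change among non-adopters)."""
    change_adopters = means[("adopters", "post")] - means[("adopters", "pre")]
    change_others = means[("non_adopters", "post")] - means[("non_adopters", "pre")]
    return change_adopters - change_others


estimate = did_estimate(mean_outcome)
print(f"DID estimate: {estimate}")
```

The result reads as a program effect only if the two groups would have changed by the same amount without the program. Self-selection into the program, as described above, is exactly the kind of thing that can break that assumption.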

Ellie Sharaki says

Dear Jim,

I can’t thank you enough for your advice. I’ll re-do the analysis using ordinal regression.

All the best,

Ellie

Ellie Sharaki says

Hi, Jim,

I already sought your advice re the multinomial reg. I ran for my IVs (Teachers’ gender, years of experience, qualifications, grade they teach) and DV (English Teachers’ agreement to promote learner autonomy with four values: (1) agree and practice, (2) agree but not practice, (3) unsure, (4) disagree). Among the IVs, gender produced very odd Exp(b) values (e.g., 3972393.841 !!) So, I was thinking of running ordinal regression though the four values defined for the DV are not naturally ordered. This yielded reasonable results all less than 1.

What do you think? Should I report the results of ordinal regression for gender separately or is something wrong in the multinomial analysis which I should fix?

I deeply appreciate your time and advice.

Best,

Ellie

Jim Frost says

Hi Ellie,

I looked back at your first comment to refresh my memory. I see that I wasn’t totally clear on the nature of your DV. If your DV is discrete and has a natural order, yes, you should use ordinal logistic regression. If the DV is discrete and there is no natural order (i.e., nominal/categorical), you should use nominal logistic regression.

It appears with the new information that there is a natural order, so you should probably use ordinal logistic regression and disregard the other results.

E says

Hi Jim,

This blog is great! If you don’t mind, I was hoping you could help me with something. I’m looking at employing a cross-sectional study to determine health needs in a community. With a couple exceptions (like age, and the distance it takes someone to drive to the hospital), the majority of my variables are nominal (the questions ask about perceived needs, opinions on community strengths, etc.). I believe I need to do a regression analysis but I’m not entirely sure which one. I think I need to use nominal logistic regression but am curious on your thoughts. I might be missing something… Thank you so much!

Sincerely,

E

Jim Frost says

Hi, it really depends on the type of dependent variable that you have. I couldn’t tell what variable is your dependent variable, so it’s hard to say. You can include nominal variables as independent variables in quite a few different types of regression analysis, which doesn’t narrow it down. If you can clarify the dependent variable specifically, I can give you a better answer. Thanks!

Levan Mumladze says

Thanks for your post. That is exactly what is needed for nonstatistician researchers. But I think it would be even more helpful if one could find other cases as well. For instance, when the dependent variable is a proportion between 0 and 1, and the independent variables are either continuous or categorical. I found beta regression suggested for that last case. Also, I am pretty sure there are other varieties of regression.

best regards

Jim Frost says

Hi Levan,

I’m sure there are many additional types of regression analysis. This post was meant to cover the most common types. Additional research might be required to determine the correct type for more specialized cases.

Timothy Dickson says

Hi Jim !

So basically I have confused myself.

I have an independent variable of age

And the dependent variable is a total of 8 yes/no decision tasks. (yes=1 and no=0, so total scores 0-8)

Can you please clarify my understanding? If I was testing each question individually I would use a binomial logistic regression, but by totalling scores I have created a continuous dependent variable and should proceed with a linear regression ? (and if the model is not adequate potentially a non-linear)

Jim Frost says

Hi Timothy,

Yes, if you analyzed each question individually, you would use binary logistic regression.

Now, summing those eight items together to create a continuous variable might be a bit debatable. The rule of thumb that I’m aware of is that if you have a discrete variable that has 10 equally spaced values and your distribution covers that range, you can consider it a continuous variable. You’re close to that with scores from 0 to 8 (nine possible values). And, it appears that you satisfy the equally spaced aspect. However, I don’t have a good reference for my old rule of thumb! I’m not sure how widely accepted the rule is.

What I would do is try fitting the model using it as a continuous variable. Then, be very sure to check the residuals. That’s always a good practice. If the residuals look good, you’re probably fine fitting the model that way. In your write-up, you might spend a bit more time explaining how the model satisfies the assumptions despite the somewhat unusual dependent variable to allay any fears about that aspect. Also be aware that if you use the model for prediction, you’ll get decimal values and predictions that are potentially past the ends of the data range.
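
To make the last two points concrete, here is a small sketch that fits a straight line to a 0-8 total score, lists the residuals, and then predicts outside the observed ages. The ages and scores are invented for illustration, and `fit_line` and `predict` are helper names introduced here, not part of any statistics package.

```python
from statistics import mean

# Hypothetical data: age and total score (0-8) on eight yes/no tasks.
ages = [20, 30, 40, 50, 60, 70]
scores = [1, 2, 4, 5, 7, 8]


def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = fit_line(ages, scores)


def predict(age):
    """Fitted score for a given age; not restricted to whole numbers or to 0-8."""
    return intercept + slope * age


# Residuals are what you would plot to check the model assumptions.
residuals = [s - predict(a) for a, s in zip(ages, scores)]
for a, r in zip(ages, residuals):
    print(f"age {a}: residual {r:+.2f}")

print(f"predicted score at age 45: {predict(45):.2f}")
print(f"predicted score at age 80: {predict(80):.2f}")
```

With these made-up numbers, the fitted value at age 45 is a decimal, and the fitted value at age 80 exceeds the maximum possible score of 8. A plain linear model has no knowledge of the 0-8 limits.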

Best of luck with your analysis!

Tim says

HI Jim!

I am using age as my independent variable (my stats professor said not to code the groups into older and younger) and a total score from a yes/no decision task as my dependent variable (yes = 1 no=0) over 8 related questions. (so total score can be anywhere between 0-8). Does this become a continuous dependent variable? and if my understanding is correct, I should use a linear regression over a binomial logistic (which I would use if I was testing each question individually)?

Dave says

Jim, I’m a novice trying to recall old business school stats. How do I interpret a high p value for the intercept (0.069) but a significant p (0.01) for the dependent variable?

Jim Frost says

Hi Dave,

A non-significant p-value for the constant technically means that you fail to reject the null hypothesis that the constant equals zero. In other words, you have insufficient evidence to conclude that the constant is different from zero. However, you typically should not interpret the constant and its statistical significance for a variety of reasons. I talk about these reasons in my post about the regression constant.

However, the significant p-value for the independent variable (IV) (and I’m assuming you do mean independent rather than dependent because there are no p-values for the DV) is much more important. First off, there’s no need to attempt to draw a connection or explanation between the lack of significance for the constant and the significance for the IV. Those are independent things. The significant IV indicates that you have sufficient evidence to conclude that there is a relationship between the IV and the DV. For more information about this interpretation, read my post about regression coefficients and p-values.

I hope this helps!

immaculate says

Hello, I am grateful for your detailed work. I am a student doing my first research project, and it is really hard for me right now. My topic is “determinants of male participation in family planning decision making,” and my dependent variable has three categories (mainly respondent, mainly husband, joint decision). The independent variables are demographic and socio-economic determinants. Could you please help me work out which type of regression analysis to use in SPSS?

Ellie says

Hi Jim,

I need to see if there is any association between 4 IVs (teachers’ gender, age (four groups coded nominally), grade they teach, years of experience (three groups coded nominally)) and one DV (belief in learner autonomy at four levels).

I am using multinomial logistic regression, and of course I haven’t observed any significant association between my predictors and dependent variable. Have I chosen the correct stat?

Thanks.

Jim Frost says

Hi Ellie,

For your dependent variable, do the four groups have a natural order? It sounds like they might.

If they do, that order contains some information that you’re missing out on. Instead, you should use ordinal logistic regression.

However, if the groups do not have any type of natural order, then it sounds like you’re using the correct analysis.

I hope this helps!

Ellie says

Agree with you. Thanks, Jim.

Nikola says

Hi Jim. Please, can you tell me which statistical method I should use to analyze the impact of an independent variable (a numerical variable that has the same value throughout 2012-2016) on dependent variables (numerical variables that have different values in each year of 2012-2016)? Thank you.

ab says

Hi Jim, I have both the independent and dependent variables as Likert-type items (strongly agree, agree, somewhat agree, disagree and strongly disagree). What kind of regression method is helpful for finding the effect of the predictor variable on the response variable? Thank you

Rajesh Kavediya says

Dear Jim. Really a good post explaining the type regressions to be used in various situation. I need your help. I am working on analysing the determinants of inflation expectations. My dependent variable is categorical, i.e between 0-1, 1-3, 3-5, 5-10 ,10-15 and above 15 (inflation expectations) and independent variables are either categorical (like age group, income and education level) or macro-economical variables like inflation, unemployment and growth. I will be grateful if you could suggest the appropriate regression framework/model.

vivi says

Hi Jim,

I wanna ask about type of data for multiple regression

I use google to get a rating from a place, for example the ranking for the Eiffel Tower is 4.6. I know, on Google itself, this value is the result of processing between the ratings given by the review of the place being assessed and other factors.

What I want to ask, is a value of 4.6 called ordinal data or numerical data (scale or interval)? so that it can be used as a variable from multiple regression. Because other variables are types of intervals or ratios.

Even if the ranking value is ordinal, should it be changed to numeric first so that it can be used in multiple regression models?

Jim Frost says

Hi Vivian,

Sorry about the delay in replying to your question. I’ve been away traveling.

Ratings are usually ordinal data. For example, if diners can rate a restaurant from 1 to 5 stars, it’s an ordinal scale. You can average the number of stars and obtain an average of 4.6 or other value. But the data points are ordinal.

However, I don’t know how Google determines the rankings. If it’s more complex than users simply entering an ordinal value, Google’s rating might not be ordinal. I don’t know enough about it to really say for sure. But ratings are generally ordinal data.

If the rating is ordinal, you can’t just change it to numeric data. You can represent ordinal data using numbers, but it’s still ordinal data. Imagine that we use the numbers 1, 2, and 3 to represent first, second, and third in a race. Even though we are using numbers, they are not numeric or continuous data. For example, numerically, the number 2 is twice the value of 1. However, that does not mean that the second place finisher took twice as long as the first place finisher. And, the third place finisher didn’t necessarily take three times as long. Etc. You can change how ordinal data are represented, but it doesn’t change the underlying fact that they are ordinal data.
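
The race example can be put into numbers with a short sketch. The runner names and finish times are hypothetical; the point is only to compare what the ranks imply with what the underlying measurements say.

```python
# Hypothetical finish times in seconds for three runners.
finish_times = {"Ana": 61.0, "Ben": 62.5, "Cy": 90.0}

# Convert the continuous times to ordinal places (1 = fastest).
ordered = sorted(finish_times, key=finish_times.get)
ranks = {name: place for place, name in enumerate(ordered, start=1)}

# Treating the ranks as numbers suggests second place is "twice" first place.
rank_ratio = ranks["Ben"] / ranks["Ana"]
# The measured times tell a very different story.
time_ratio = finish_times["Ben"] / finish_times["Ana"]

# Rank gaps are always 1, while the actual time gaps differ a lot.
gap_first_to_second = finish_times["Ben"] - finish_times["Ana"]
gap_second_to_third = finish_times["Cy"] - finish_times["Ben"]

print(ranks)
print(f"rank ratio: {rank_ratio}, time ratio: {time_ratio:.3f}")
print(f"time gaps: {gap_first_to_second} s vs {gap_second_to_third} s")
```

Equal steps between ranks hide unequal steps in the thing being measured, which is why ordinal codes should not be analyzed as if they were continuous values.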

If you have ordinal data and it’s the dependent variable, use ordinal logistic regression. If it’s an independent variable, it can be tricky. You can try using it as an independent variable, but pay extra attention to the residual plots. They may or may not provide a good fit for reasons that I describe in the race example!

Alex says

Hi Jim

Thanks for this constructive post. I have a question.

I run a multiple regression model in which my dependent variable is the vote for Social Democratic parties and my independent variables are associated with a range of factors.

Do you think that I should keep this single multiple regression model or could I divide this model into more multiple regression models? The advantage of dividing the model into more multiple regression models is that I acquire a better R squared value.

However, my argument in favour of keeping the single multiple regression model is that the vote is a complicated phenomenon, which is affected by many factors and by controlling for more factors, we can explain more of the variation in y.

Thank you in advance.

Jim Frost says

Hi Alex,

I’m not quite sure how you would be dividing up the different models if you’re using the same dependent variable? Maybe by election, region, or year?

To make this determination, you’ll really have to use your subject area knowledge. If you think the relationships between the independent variables and the dependent variable are likely to be constant across the entire large model, that’s a good reason to use just one model. However, if those relationships change based on however you are dividing the models, that’s an argument for either dividing the model or modeling those changes themselves in the large model–possibly by including interaction effects.

Best of luck with your analysis!

Linch dan says

First of all, I want to thank you for maintaining an excellent blog; it is very helpful for everyone.

I am a student who is struggling to find the best regression type for my study. In brief, animals were fed a supplement at different doses (0, 0.5%, 1%, 2% and 3%). Each treatment group has 9 replicates (e.g., the 0.5% group has 9 animals). After the feeding trial, different blood parameters (e.g., immune cells) are measured along with the supplement concentration in blood. Now I want to correlate blood parameters (e.g., immune cell counts) with supplement concentration in blood.

For this experiment, what is the correct type of regression analysis?

Which regression type do I need: linear regression or nonlinear regression? I am still on the learning curve, and your help is highly appreciated.

Jim Frost says

Hi Linch,

It could be either linear or nonlinear regression. It depends on the nature of the relationships between the variables. There’s no way I could possibly guess about that. Definitely start with linear regression and determine whether you can obtain an adequate fit by checking the residual plots. If you can get a good fit using linear regression, you can avoid nonlinear regression, which is often more complicated.

Best of luck with your analysis!

Geo says

Hi jim,

Thanks for the post. I have a set of categorical and continuous variables that need to predict the success of an event (fail/pass). The categorical variables in the data set take multiple values (q1 to q400), which may be related to region code or some other parameter that may impact the final output. What may be the best model to fit here?

Jim Frost says

Hi Geo, because of the binary dependent variable, you’d need to use binary logistic regression. Using this type of regression allows you to determine which variables are correlated with changes in the probabilities of the success of an event. You can use both continuous and categorical independent variables with this type of analysis.
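
As a rough illustration of what binary logistic regression estimates, the sketch below fits a one-predictor model by plain gradient ascent on the log-likelihood. The data are invented, and `fit_logistic` and `pass_probability` are hypothetical helpers written for this example; in practice you would use your statistical software’s logistic regression procedure, which also reports standard errors and p-values.

```python
import math

# Hypothetical data: one continuous predictor and a pass (1) / fail (0) outcome.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]


def sigmoid(z):
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, learning_rate=0.1, n_steps=5000):
    """Estimate intercept and slope by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_steps):
        errors = [yi - sigmoid(b0 + b1 * xi) for xi, yi in zip(x, y)]
        b0 += learning_rate * sum(errors) / n
        b1 += learning_rate * sum(e * xi for e, xi in zip(errors, x)) / n
    return b0, b1


b0, b1 = fit_logistic(x, y)


def pass_probability(value):
    """Estimated probability of a pass at a given predictor value."""
    return sigmoid(b0 + b1 * value)


print(f"intercept={b0:.2f}, slope={b1:.2f}")
print(f"P(pass | x=1.0) = {pass_probability(1.0):.2f}")
print(f"P(pass | x=4.0) = {pass_probability(4.0):.2f}")
```

A positive slope means the estimated probability of success rises with the predictor. A categorical predictor would enter the same model as one or more 0/1 indicator columns.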

richard sadaka says

Hi, Jim

I am trying to find the regression line for 1 dependent variable and 35 independent variables (all categorical), but I ran into a problem with the significance of the coefficients: 33 out of 35 are insignificant.

I would really appreciate any help

Jim Frost says

Hi Richard,

That’s a very difficult question to answer. The explanation can range from no relationship existing between those variables to an effect that is too small given your sample size to be detectable (i.e., your statistical power is too low).

Additionally, for 35 categorical independent variables, you probably need at least 700 observations. Possibly higher depending on the number of levels per categorical variable and the distribution of observations across those levels.

Can you narrow those variables down to a few that theory strongly suggests?

richard sadaka says

actually i can’t since they are the components of the consumption (i.e revenue = C1 + C2 +C3 ……..C35 + S)

Lis Bittencourt says

Hi, Jim! Thank you for your post.

My doubt is: if I have a continuous dependent variable and a count independent variable, what is the most suited regression analysis?

I understand that if the count variable was my dependent variable, a poisson regression was OK. But I am not sure if the inverse situation demands a regular simple linear regression.

Thank you,

Lis

Seun Opaleye says

Thank you Jim. I will increase the time period for the model to include period of growth together with the recession, then use structural breaks to identify effects during the period of recession. Right?

Seun Opaleye says

Thanks for your response Jim!

Recall that recession is measured on a quarterly basis and my country experienced recession for a period of 5 quarters which gives us 15 months.

Will it be proper to run an analysis based on data where T=5 (five quarters) & five independent variables.

Will GMM capture this data size? Or what do you suggest?

Jim Frost says

I think you might have a problem there. I’m not the most familiar with the Generalized Method of Moments (GMM). However, my understanding is that this method trades off efficiency in order to obtain more robust estimates using fewer assumptions. In other words, you need a larger sample size using this method compared to OLS. And, you’d have a problem with using OLS for your study. In OLS, you generally need at least 10 observations per term in your model. You’re nowhere near that. I don’t know the guidelines for GMM, but it is less efficient so presumably you’d need more observations per term.
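
The sample size problem can be checked with simple arithmetic. The sketch below assumes the “10 observations per term” guideline counts only the five independent variables, and it also counts the residual degrees of freedom for an OLS model with a constant. Both helper functions are hypothetical names introduced for illustration.

```python
def residual_degrees_of_freedom(n_observations, n_predictors):
    """Observations left over after estimating the constant and each slope."""
    return n_observations - (n_predictors + 1)


def rule_of_thumb_minimum(n_terms, per_term=10):
    """Smallest sample size under the '10 observations per term' guideline."""
    return n_terms * per_term


n_predictors = 5
needed = rule_of_thumb_minimum(n_predictors)

# Compare quarterly data (5 observations) with monthly data (15 observations).
for label, n_obs in [("quarterly", 5), ("monthly", 15)]:
    df_resid = residual_degrees_of_freedom(n_obs, n_predictors)
    print(f"{label}: n={n_obs}, residual df={df_resid}, guideline minimum={needed}")
```

With five quarterly observations, there are fewer data points than parameters, so the residual degrees of freedom are negative and the model cannot be estimated at all. The 15 monthly observations leave some degrees of freedom but remain well under the guideline.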

You probably have an additional issue as well. If you have data only from times of recession, it limits the variability in your dependent variable and possibly the independent variables. This situation weakens the ability of the analysis to detect relationships in the data. It’s much better if you have data from strong and weak economic times because that allows the analysis to more easily determine which independent variables covary with the dependent variable. It’s much harder to determine how the variables covary when at least one of them (dependent variable) doesn’t vary that much.

I think you need more data and particularly include a variety of economic conditions.

Best of luck with your analysis!

Seun Obed says

Hi Jim,.

Nice work you’re doing here. I’d like to find out which model to use for a regression where the time period is in months; we are looking at 15 months and we have five independent variables. The research aims at looking into factors that significantly increased unemployment during a period of recession which lasted 15 months in my country.

Jim Frost says

Thanks Seun! Performing regression analysis with time series data can be tricky but it’s possible. At some point, I might write a post about that topic! I’d start out with linear regression. You’ll almost certainly need to include time information along with your independent variables. Very possibly include time lagged variables as well. For instance, the state of variable X in the previous time period might affect the current time period. You should always check the residual plots. However, when you’re working with time series data, be sure to check the Residuals versus Time Order plot to ensure that you’re accounting for all the time-related effects.
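
Here is a minimal sketch of how a time lagged variable can be built before fitting the regression. The series values and the variable names (`unemployment`, `interest_rate`, `rate_lag1`) are made up; `lagged` is a helper written for this example.

```python
def lagged(series, lag=1):
    """Shift a series back by `lag` periods; the first `lag` entries have no history."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    return [None] * lag + series[:-lag]


# Hypothetical monthly values.
unemployment = [5.1, 5.4, 5.9, 6.3, 6.8, 7.0]
interest_rate = [2.0, 2.0, 2.5, 3.0, 3.0, 3.5]

rate_lag1 = lagged(interest_rate, lag=1)

# One row per month, keeping time order and dropping months with no lagged value.
rows = [
    {"month": t + 1, "unemployment": u, "rate": r, "rate_lag1": r_prev}
    for t, (u, r, r_prev) in enumerate(zip(unemployment, interest_rate, rate_lag1))
    if r_prev is not None
]

for row in rows:
    print(row)
```

Each lag costs one observation at the start of the series, which matters a great deal when only 15 months of data are available.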

Torsha says

Hi Jim,

I found this extremely helpful!

I have a very elementary doubt. Is it practically feasible that both dependent and independent variables (all of them) are all dummies, i.e., are binary in nature?

Jim Frost says

Hi Torsha,

Yes, you can certainly do this using binary logistic regression. That type of regression allows you to use the binary dependent variable, which the other types of regression don’t allow. Then, you can add the binary independent variables, which isn’t unique to binary logistic regression.

I think the type of model you describe is relatively common in the health care field. That’s not my field but I attended a presentation by someone in the field who talked about how they use that type of model to assess the risk of a surgical procedure for different patients. All the independent variables are patient traits (e.g., high blood pressure, etc.) and the dependent variable was survival. The model allowed doctors to enter patient attributes and estimate survival probability. This type of model can also be used in other fields.

Barney says

Thank you Jim!

I’ve been fiddling around with Minitab trying to get it to include CIs for the parameter estimates, but I just can’t find information online to learn to enable it specifically for parameter estimates. Would you know if this is possible on Minitab, or could you please name some software that I might be able to use to do the CIs? It’s for asymptotic regressions.

Jim Frost says

Hi Barney, yes, Minitab can display the CIs for parameter estimates. On the main dialog box, click the Results button. Under Display of Results, choose Expanded. When you rerun the analysis, it will now display those CIs along with various other additional results.

pipi says

Thanks a lot Jim, it does help…..

Jim Frost says

You’re very welcome! Best of luck with your analysis!

pipi says

Thanks a lot for your reply, Jim,

Actually I’m kind of confused, because my supervisor said that my response variable, which is travel time, is discrete data because of the way I collected it: in every time interval, I have only 1 data point for each day.

ex. on Sunday– 6am – 6.59am = 42 minutes, 7am-7.59am = 32 minutes, so on

on Monday — 6 am – 6.59am = 40 minutes, 7 am-7.59am = 30 minutes, so on

on Tuesday — 6 am -6.59am = 30 minutes, 7am-7.59am =20 minutes, so on

is it still continuous data or discrete? Because at first I choose continuous too…

And my next question: if I choose time as my predictor variable, what would it look like?

because as my example above, it is in interval, can it be like:

Y (mean of travel time at 6am-6.59am) = 37.33,

x (predictor variable) = 6

??

And Jim, can I contact you in private because I really need some suggestions about my research or about regression. Thank you so much

Jim Frost says

Hi Pipi,

Time is usually a continuous variable even if you collect it once per interval per day. The type of data doesn’t change based on how often you collect measurements. I suppose you could make the case that it’s a count of minutes if you only recorded whole minutes. In that case, you could try Poisson regression. But, generally time is considered a continuous variable. Personally, I’d try least squares regression first and see how well you can model the data using it.

The response variable, you’d just use the travel time.

For the predictor variables, you’d include time related variables and you could possibly include other variables as well if you have that data. For example, you can try including Hour of Day and Day of Week. And, if you have the data, you could try weather conditions too.

I wish I could help more in depth, but if I did that for everyone who asks, I wouldn’t have time for my own life! I get a lot of requests for that. As it is, answering comments of a more general nature takes a lot of time! I hope you understand. But, I do try to provide general tips and points–like I am here! 🙂

pipi says

Hi Jim,

I am now researching regression models. I was wrong before because I used polynomial regression and Trend Analysis (time series) for predicting my data, in which the response variable is discrete and the predictor is continuous.

Actually I want to estimate and predict the travel time, so I’ll describe the way I collect the data.

I collected the travel time data (response variable) every day for about 1 month. Each day I collected the data from 6am-10.59pm, at intervals of 1 hour.

Ex. on Sunday– 6am-6.59 am = 42 minutes, then 7am-7.59am= 40 minutes, and so on

So I want to estimate and predict by using regression.

Is it right if I use Poisson regression for solving my problem?

Can I combine it with Trend Analysis, especially Quadratic? If I can combine them, what would it look like?

I really need your suggestions,

Thanks Jim

Jim Frost says

Hi Pipi,

Time series analyses require data that are collected at consistent intervals and that do not have any gaps. Your data have gaps (midnight – 5:59AM), so you can’t use time series analysis. Also, you use Poisson regression when your response variable is a count. Your response seems to be a continuous variable.

I would try using linear regression analysis and then including predictors such as time of day, maybe day of week, etc. Include the time components as predictors. You can include polynomials if needed. Regression with time related data can be tricky but it can work.

Try fitting the models, checking the residuals, and adjusting as necessary. In addition to the regular residual plots, be sure to pay extra attention to the residuals vs order plot because you have time ordered data (assuming you record them in your worksheet in time order).
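
As a numeric companion to the residuals vs order plot, one option is the Durbin-Watson statistic, which is not mentioned above but measures the same thing: whether consecutive residuals are related. Values near 2 suggest no first-order autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation. The two residual series below are invented to show the extremes.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals listed in time order.
drifting = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]  # slow drift over time
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]          # flips sign every period

dw_drifting = durbin_watson(drifting)
dw_alternating = durbin_watson(alternating)

print(f"drifting residuals:    DW = {dw_drifting:.2f}")
print(f"alternating residuals: DW = {dw_alternating:.2f}")
```

A drifting pattern like the first series usually means the model is missing a time-related predictor, such as time of day or a lagged variable.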

I have not done much regression with time related data, so I don’t have much more to suggest than trying that approach. Best of luck with your analysis!

Barney says

Hi Jim,

Very informative article.

Pardon me for perhaps a simplistic question, but is it considered regression analysis if the function is known, and I want to test the correlation of experimental data against the function? If so, what should I research to learn more about it?

Jim Frost says

Hi Barney,

Thanks! I’m glad it was helpful!

If you have a theoretical function and want to compare it to the fit you obtain for your data when you specify the same model as the theoretical function, you can use the confidence intervals (CIs) for the parameter estimates. If these CIs do not include the parameter values from the known function, you have sufficient evidence to conclude that the differences between your parameter estimates and the known function are statistically significant. Most statistical software should be able to produce this type of CI, although it might not always be included in the default output.
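
A small worked sketch of that comparison for a straight-line model follows. The data are invented, the theoretical slope of 2.0 is an assumption made for the example, and the critical value 2.306 is the two-sided 95% t value for 8 degrees of freedom, hardcoded here to avoid needing a statistics library. `slope_confidence_interval` is a helper written for this illustration.

```python
import math
from statistics import mean

# Hypothetical experimental data that roughly follow y = 2x.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.3, 3.8, 6.1, 7.6, 10.2, 12.0, 13.9, 16.3, 17.7, 20.1]

theoretical_slope = 2.0   # assumed value from the known function
T_CRIT_95_DF8 = 2.306     # t critical value for 95% CI with n - 2 = 8 df


def slope_confidence_interval(x, y, t_crit):
    """Return (slope, lower, upper) for the OLS slope of y on x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return slope, slope - t_crit * se_slope, slope + t_crit * se_slope


slope, low, high = slope_confidence_interval(x, y, T_CRIT_95_DF8)
inside = low <= theoretical_slope <= high

print(f"slope = {slope:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
print(f"theoretical slope inside CI: {inside}")
```

If the theoretical value falls outside the interval, the difference between the estimate and the known function is statistically significant at that confidence level. The same check applies to each parameter in the model.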

So, that’s what you should look into: CIs for the parameter estimates (coefficients) in a regression equation.

Kelly Parris Yeldham says

Hi Jim,

Cool article!

How should I proceed when I want to compare a hypothetical non-linear graph with an experimental graph, when the function and the shape of graph is unknown? I want to quantify the two graph shapes, and compare their equations, as opposed to visually overlapping the two graphs.

I have a continuous independent variable of time, and the dependent variable of velocity calculated from previous iterations of velocity starting from 0. Increasing the order of the polynomial that I use brings me closer and closer to the shape, but I do not believe that my function is polynomial.

Thank you!

Jim Frost says

Hi Kelly,

I’m not 100% sure what you’re asking, but I think you might be asking how to specify a model that fits the curve in your data and how to determine whether that model adequately fits the curvature. If so, I’ve got the perfect blog post for you: Curve Fitting Using Linear and Nonlinear Regression. I cover the different methods you can use to fit curves and how to determine which provides the best fit.

For your data, if the polynomials don’t provide a good fit, you might well need to transform your data or use a nonlinear model. Note that nonlinear models are different than linear models that use polynomials to model curvature, which is what you’re doing. I talk about all of that in that post!

I hope that helps!

Kirti says

Hi. Thanks for this post. It is very informative.

I am a student. I have a dataset with 300,00 rows and 77 columns. How do I approach the data?

Also I have to do some predictive analysis. My independent variables are a mix of continuous and nominal categorical variables and my dependent variable is continuous. Which regression model should I use?

Jim Frost says

Hi Kirti, with a few exceptions, the type of regression analysis you should use doesn’t depend on the size of the dataset and number of variables. Usually, it’s the type of variables that you have.

For your case, I’d start with multiple linear regression. See if you can get a good fit to your data using that procedure.

Tony says

Hey Jim! Once I’ve trained a logistic model and know which predictors are important, is there a way that I can define an optimal range for my input variables? For example if I’m trying to adjust three settings on a machine to minimize my probability of introducing a defect, how could I use my coefficients from a logistic model to decide what the mean setting should be for all three to maximize probability of no defects? Thanks!

Jim Frost says

Hi Tony,

There are ways to optimize your inputs. The process for doing this depends on the statistical software package that you are using. So, it’s hard for me to give any practical advice about it. Essentially the process takes the model that you settled on and then uses an optimization routine to determine which input values optimize the output. So, it’s a separate process from the model specification and fitting process–although your software might tie them together. Typically, you can specify whether you want to obtain a specific target response value, minimize the response value, or maximize the response value. In your case, you probably want to minimize the probability of defects.
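
To show the idea without tying it to a particular package, here is a sketch that uses a simple grid search as the optimization routine. The coefficients, the setting names (`temperature`, `pressure`, `speed`), and the candidate ranges are all hypothetical; a real analysis would plug in the fitted coefficients and the ranges the machine actually allows.

```python
import math
from itertools import product

# Hypothetical fitted coefficients on the log-odds scale.
coefficients = {"intercept": -2.0, "temperature": 0.8, "pressure": -0.5, "speed": 0.3}

# Hypothetical candidate values for each machine setting.
candidate_settings = {
    "temperature": [1.0, 1.5, 2.0],
    "pressure": [1.0, 2.0, 3.0],
    "speed": [0.5, 1.0, 1.5],
}


def defect_probability(temperature, pressure, speed):
    """Predicted probability of a defect from the logistic regression equation."""
    log_odds = (
        coefficients["intercept"]
        + coefficients["temperature"] * temperature
        + coefficients["pressure"] * pressure
        + coefficients["speed"] * speed
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


# Evaluate every combination and keep the one with the lowest predicted probability.
names = list(candidate_settings)
best = min(
    product(*(candidate_settings[name] for name in names)),
    key=lambda settings: defect_probability(*settings),
)
best_settings = dict(zip(names, best))

print(best_settings, f"P(defect) = {defect_probability(*best):.3f}")
```

One thing this sketch makes visible: when the model has only main effects, the best value for each setting always sits at one end of its allowed range. An interior optimum can appear only if the model includes curvature or interaction terms.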

Another approach is to perform a Monte Carlo simulation where you generate random data for the input values in your regression equation. The data for each input follow a distribution that you specify. You input these randomly generated data in your regression equation, which produces a distribution of outputs that you can then study. Additionally, you can change the distributions of the inputs to determine how that affects the distribution of the outputs. That allows you to answer “what if” types of questions about changing the inputs.
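
A minimal Monte Carlo sketch along those lines is shown below. It assumes a one-input logistic model with made-up coefficients and a normal distribution for the input; `simulate` is a helper written for this example, and the seed is fixed only so the run is repeatable.

```python
import math
import random
from statistics import mean, pstdev

# Hypothetical coefficients from a fitted logistic model with one input.
B0, B_TEMPERATURE = -2.0, 0.8


def defect_probability(temperature):
    """Predicted probability of a defect at a given temperature setting."""
    return 1.0 / (1.0 + math.exp(-(B0 + B_TEMPERATURE * temperature)))


def simulate(mean_temperature, sd_temperature, n_draws=10_000, seed=0):
    """Draw random inputs from a normal distribution and return the predicted outputs."""
    rng = random.Random(seed)
    return [
        defect_probability(rng.gauss(mean_temperature, sd_temperature))
        for _ in range(n_draws)
    ]


# "What if" comparison: same average setting, tightly vs loosely controlled.
tight = simulate(mean_temperature=1.5, sd_temperature=0.1)
loose = simulate(mean_temperature=1.5, sd_temperature=1.0)

print(f"tight control: mean={mean(tight):.3f}, sd={pstdev(tight):.3f}")
print(f"loose control: mean={mean(loose):.3f}, sd={pstdev(loose):.3f}")
```

Comparing the two runs shows how much of the variation in predicted defect probability comes from variation in the input itself, which a single optimal setting does not reveal.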

I hope this helps!

Maro says

Hi Jim

Thank you so much! This is very helpful. Here’s more information on what I’m trying to do.

The problem:

I’m studying the impact of adopting technological capabilities (independent variables) on teams performance (dependent variables) in IT.

The approach:

Identify significant performance clusters between teams, and understand how the adoption of technological capabilities impact teams performance clusters.

My dataset consists of:

1) 17 independent variables, one variable is ordinal (1 to 7 scale), 7 variables are categorical (true/false), 9 variables are continuous/numbers.

2) 3 dependent variables, these are continuous/numbers.

3) Dataset size is 36 subjects.

My analysis is two steps:

1) Run the 3 dependent variables (performance measures) through clustering algorithm and find out if there are significant clusters. This test was successful and I found 3 significant clusters (high, medium, low).

2) Now I want to test the influence of the 17 independent variables (technological capabilities) on the 3 clusters. I planned to use multinomial regression but it didn’t work due to the issue mentioned in the earlier post.

My questions:

1) Given the number of independent variable (17), is there a recommended data size for the multinomial logistic regression to work successfully? How many subject can be good enough?

2) Since my clusters consist of the 3 dependent variables, I’m thinking of testing the impact of the 17 independent variables on the 3 dependent variables that make up the clusters, instead of the clusters themselves, using PLS or multiple linear regression? This is still a workaround but I may consider it just in case my logistic regression model continues to fail.

3) Any other recommendations?

Alternatives:

I checked the linear discriminant function and it seems promising. The problem is that I don’t know how to interpret its results to find out how the independent variables influence the 3 clusters. I’m not planning to build a prediction model with either logistic regression or the discriminant function. I only need the “inferential” piece, not the “prediction” piece, since I just want to understand the influence of the independent variables on the clusters.

Your help is much appreciated! Thanks again!

Maro says

Hi Jim – Thank you so much for this clear post! This is very helpful.

I have a question. My dependent variable is categorical (3 categories). When I tried nominal logistic regression using Minitab, the model didn’t converge. After some research, I found out that my data has a collinearity problem, making it difficult for the nominal logistic model to converge.

Instead, I tried both linear regression and partial least squares, using Minitab, on the same data and the results seem reasonable. My question is, can I use linear or PLS regression if my response variable is categorical? Or do I have to do nominal logistic regression?

Thanks again!

Jim Frost says

Hi Maro,

Thanks for writing with the great question. Unfortunately, when you have a true categorical variable, you cannot treat it as a numeric variable. Suppose you have three groups. You can label each one with numbers: 1, 2, 3. However, that doesn’t mean you can analyze them as numbers. Those numbers might represent: scratch, dent, and tear. Or maybe gold, silver and bronze. The numeric labels don’t measure/represent the actual characteristic that the groups are based on. To illustrate this, the value 2 doesn’t indicate that it is exactly twice the value of 1. The numbers also suggest a logical order to the groups that just doesn’t exist (otherwise you’d be using ordinal logistic regression).

Consequently, you can’t use linear or PLS regression. I don’t know what your model is or your other variables, but if you have only categorical variables, you can try the chi-square test of independence to look for relationships among categorical variables. Otherwise, I think nominal logistic regression is your best bet. To address the collinearity, you might try removing or linearly combining variables and using them in nominal logistic regression. By linearly combining them yourself, you’re incorporating some aspects of PLS into nominal logistic regression.
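As a rough illustration of the chi-square test of independence, here is a minimal sketch that computes the test statistic for a two-way table of counts. The counts are made up for illustration (one true/false variable against three groups), and the function name is my own. The sketch stops at the statistic and degrees of freedom; statistical software will also give you the p-value.

```python
def chi_square_independence(table):
    """Return the chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts: rows are a true/false variable, columns are three groups.
counts = [[8, 5, 2],
          [3, 6, 12]]
stat, dof = chi_square_independence(counts)
# With 2 degrees of freedom, the 0.05 critical value is 5.991.
print(round(stat, 2), dof, stat > 5.991)
```

Keep in mind that the test becomes unreliable when expected counts are small, which is a real concern with small samples.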

I hope this helps!

Lokesh Gupta says

Hi Jim,

Is there any way we can designate a categorical dependent variable as an ordinal variable before passing it into the logistic regression model? It might give the model output an extra edge.

Jim Frost says

Assuming that I understand your question correctly, if you have an ordinal dependent variable, you should use ordinal logistic regression to analyze your data.

Anudeep Venkata says

Hi Jim,

I read your blog post on regression and it was lucid. But I am confused about which regression to use for a model with the variables mentioned below.

I have control and test samples which have discrete quantitative variables.

I have to model this with other independent variables like gender (a dummy variable) and one ordinal variable.

Since my control and test samples come with other factors like gender and housing, I want to determine the link and analyze the effect of housing and gender on my control and test samples.

Can you please suggest which regression model would be appropriate? Is it binary logistic or multinomial logistic regression?

Thank you in advance.

Jim Frost says

Hi Anudeep,

It really depends on what type of dependent variable you have. Can you clarify the nature of your dependent variable?

WW says

Hi Jim, Thanks for this post. It clears the regression clouds that haunted me for a loooong time! It is really helpful! Statistics had been my nightmare since Uni but I guess no more. Look forward to your next posts!

Jim Frost says

Hi WW, I’m so happy to read your comment! I strive to make these posts as easy and intuitive to understand as possible. It makes my day to read that they’ve helped you!

Stay tuned! I am writing more posts but taking a short break at the moment.

SK says

Hi Jim,

Really nice article

I am stuck on a problem where I have to do regression, but I am unable to decide which regression model I should proceed with. To give you some background, I have sales (dependent variable) and, let’s say, a, b, c, d, and e are independent variables (the sales are triggered by these independent variables). My objective is to find the importance of each independent variable so that we can prioritize that channel. The values of each independent variable can be “Open”, “Close”, or blank. I thought of using a binary regression model, but here I have three types of values and binary takes only two.

Pls suggest

Thanks in advance

SK

Jim Frost says

Hi SK,

Assuming that sales is a continuous dependent variable, you do not want to use binary logistic regression or another specialized type. Binary logistic regression is for cases where the DV is binary. You’re talking about independent variables. You have categorical independent variables, which you can include in linear regression. Most statistical software will code those as indicator variables automatically. Does “blank” represent a missing value or is that an actual value for the IVs? If you have “Open”, “Close”, and missing values, you’d just need one indicator variable for each IV. The indicator variable could be something like Open_A, which is a 1 if variable A = Open and 0 if variable A = Close. Repeat for the other channels. But, again, most software will do that automatically.
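If you do want to build an indicator like Open_A by hand, a minimal sketch might look like the following. It assumes that blanks are missing values, and the function name and sample values are made up for illustration.

```python
def code_open_indicator(values):
    """Code "Open" as 1, "Close" as 0, and blanks as None (treated as missing)."""
    mapping = {"open": 1, "close": 0}
    # Anything that is not "Open" or "Close" (e.g. "" or None) maps to None.
    return [mapping.get((v or "").strip().lower()) for v in values]

# Hypothetical raw values for channel A; "" and None stand for blanks.
channel_a = ["Open", "close", "", "Open", None]
open_a = code_open_indicator(channel_a)
print(open_a)  # [1, 0, None, 1, None]
```

If blank turns out to be a real third value rather than a missing one, you would need a second indicator variable for that IV instead of treating it as missing.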

Finding the relative importance of each IV is a separate matter that I write about: Identifying the Most Important IVs

I hope this helps!

Lin says

Hi Jim,

I am currently doing research on behavior patterns in Peru.

My dependent variable is binary, but my independent variables are a mix of binary and continuous variables. I have to use data from a previous round to predict the later round. For example, the dependent variable is smokes at age 15, which is binary, and some of the independent variables are standardized math score at age 12 (continuous) and drinks alcohol at age 12 (binary). I also think that there is an endogeneity problem in this setting. Hence, I do not know which regression in Stata is best for this kind of problem. Also, how do I solve the endogeneity problem?

I thought of using a linear probability model, since it seems the easiest, but I don’t think this is the best method.

Thanks in advance

Kind regards,

J.zhong

Prashant Dey says

Hey Jim,

Thanks a lot.

This is very good explanation of regression techniques.

This post will gain another boost if a flow chart or map for choosing the right technique is provided.

Just a suggestion.

Thank You again!

This is really helpful.

Sandeep says

Sir, which model is best for short-term stock market prediction?

Jim Frost says

Hi Sandeep, ah, I get asked that question many times. And, if I knew the answer to that one, I’d be so rich that I’d have more money than I’d know what to do with! The fact is that the stock market is fairly unpredictable. If you could predict it in the short term, then everyone would know exactly where to put their money at any given point, and then the advantage would be gone. So, it doesn’t work that way.

Kai says

This is fantastic! I’m a 3rd year statistics major at university, and it is so refreshing having this overview of regression set out in such a clear way. Major kudos for all your work!

Jim Frost says

Thank you, Kai!

Antoine says

Hey Jim,

Thank you for this post, really like the way you explain things.

I am working on a project where I am assessing the relation between discriminative attitude and healthcare provision in health care workers:

– Discriminative attitude: is the independent variable and will be measured using a series of 10 scaled questions (scaled from 1 to 5). In that way any respondent will have a score somewhere between 1 and 5, hence I am assuming it is a continuous ordered variable, right?

– Healthcare provision: is the dependent variable and will also be measured using a series of 10 scaled questions (scaled from 1 to 4). Similarly to the independent variable, I am assuming this is a continuous ordered variable.

In your opinion, what analytical model would be most suitable for that purpose?

Thanks!

CMO says

Thank you, Jim. This is helpful

Jim Frost says

Hi, I’m so glad that is helpful!

Sebastian says

Hello Mr. Frost,

First of all, great website. I wish I had known it existed back when I was in my bachelor’s studies. My question concerns log-linear models and binary variables. I developed a model for a thesis that looks like this:

log y_t – log y_{t-1} = beta_0 + beta_1 A + beta_2 B + u. The dependent variable is the percentage change of the Treasury yield and A and B are binary events like FOMC meetings. Is this example considered a log-linear regression model? Thanks in advance.

Ahmad says

Hi Jim

I’m a student and I have a problem choosing which regression model I need to use in my case.

I have many variables with one response, like mix design variables, and the response is the compressive strength of concrete.

Jim Frost says

Hi Ahmad, choosing the best regression model is a very important task. In statistics, we call that process model specification. I’ve written an entire blog post about it that will help you. Model Specification: Choosing the Correct Regression Model

Best of luck with your analysis!

Hassan Elkatawneh says

That is very helpful, but it did not answer my own need. If you can advise me: I have 2 IVs and one DV, and in addition I have one moderator variable. What is the best test if all variables are ratio scale? Thanks in advance for your help.

Jim Frost says

Moderator variables are commonly used in psychology–which isn’t my field. However, from my understanding, they are essentially interaction effects. That is, the effect between an independent variable and a dependent variable depends on the value of another variable. To fit this type of model, you can use OLS multiple regression. You just need to include the appropriate interaction term in the model. For more information, read my post about understanding interaction effects.
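Here is a minimal sketch of what including an interaction term means in practice. The data are hypothetical and noise-free, the variable names are made up, and the small least-squares solver is just for illustration; in real work you would use your statistical software’s regression procedure.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_ols(design, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y_i for row, y_i in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical noise-free data generated from y = 1 + 2*x + 3*m + 0.5*x*m
xs = [1, 2, 3, 4, 5, 6]
ms = [0, 1, 0, 1, 2, 2]
y = [1 + 2 * x + 3 * m + 0.5 * x * m for x, m in zip(xs, ms)]

# Design matrix columns: constant, x, moderator m, and the interaction x*m.
design = [[1.0, x, m, x * m] for x, m in zip(xs, ms)]
coefs = fit_ols(design, y)
print([round(c, 3) for c in coefs])
```

The last coefficient is the interaction effect: it tells you how much the slope for `x` changes for each one-unit increase in the moderator `m`.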

Pankaj Kumar says

Hello Mr. Jim

I hope you are doing very well.

I am confused about testing in regression analysis. In basic statistics, we read that the F-test is a two-tailed test, whereas when we use the F-test in regression analysis, we always treat it as a one-tailed test. Why is that?

Thanks

Pankaj

Jim Frost says

Hi Pankaj,

That’s a great question. As it turns out, for regression and ANOVA, the F-test is always a one-tailed test. The F-test tests the ratio of two variances (technically mean squares rather than variances). In regression and ANOVA, it’s a one-tailed test because of the nature of what you’re testing. In One-Way ANOVA, you’re determining whether the between group variance is greater than the within group variance. In regression, you want to determine whether the model with all of your predictors is better than the model with no predictors (only the constant). Those are one-tailed tests by the definition of how the hypotheses are specified–you are determining whether one variance is significantly larger than the other variance.

To see how the F-test works in detail I suggest you read my post about the F-test and One-Way ANOVA. Regression analysis uses the F-test in a similar way but changes the variances in the ratio. You’re testing the model with all of your predictors compared to the model with no predictors (just the constant). You can also read my post about the F-test of overall significance.

You do use two-tailed F-tests for Variance Tests. In this case, you require the ability to determine whether the variance in the numerator is larger than or less than the variance in the denominator. You’re testing both directions (larger and smaller), hence it’s two-tailed.
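For the regression case, a minimal sketch of the overall F-statistic with a single predictor looks like this. The data are hypothetical and the function name is my own. The ratio puts the regression mean square on top and the error mean square on the bottom, so only large values count as evidence against the constant-only model, which is why the p-value comes from the upper tail alone.

```python
def overall_f_statistic(x, y):
    """F-statistic for a one-predictor regression versus the constant-only model."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * xi for xi in x]
    # Variation explained by the model vs. variation left in the residuals.
    ss_regression = sum((fi - mean_y) ** 2 for fi in fitted)
    ss_error = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    df_regression, df_error = 1, n - 2
    ms_regression = ss_regression / df_regression
    ms_error = ss_error / df_error
    return ms_regression / ms_error, df_regression, df_error

# Hypothetical data for illustration.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
f_stat, df1, df2 = overall_f_statistic(x, y)
print(f"F({df1}, {df2}) = {f_stat:.1f}")
```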

I hope this helps!

Raof says

Thanks Jim for this informative Blog

I want to examine the influence of predictor variables such as physical activity (low, moderate, high), sedentary time, and dietary habits (fruits, vegetables, junk food, etc.) on a dependent variable, BMI, collapsed into ordinal categories (underweight, normal, overweight, and obese). I want to see the odds of a person being overweight/obese based on these behavioural practices. What would be the appropriate regression analysis? Or am I required to dichotomize my dependent variable (1. underweight/normal and 2. overweight/obese) and use binary logistic regression? Your views will be much appreciated.

Jim Frost says

Hi Raof,

It sounds like you need to use Ordinal Logistic Regression. Your dependent variable is an ordinal variable. Unfortunately, I don’t have a good example of this type of regression to share with you, but it can do what you describe.

The one problem I see is that you also have an ordinal predictor (physical activity: high, moderate, and low). That can be problematic. You can try to fit the model and check the residuals to see if you satisfy the assumptions. If it doesn’t work, you can try converting those three levels to two indicator variables. Indicator variables are binary variables, one for each level. However, you need to leave one level out of your model for the analysis to run, so you would include only two (e.g., High and Moderate). I intentionally didn’t include Low, but you can leave out any level.
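As a quick sketch of that coding scheme, the snippet below builds one 0/1 column per level and omits a reference level. The function name and the sample values are made up for illustration, and most statistical software will do this for you.

```python
def indicator_columns(values, levels, reference):
    """Build one 0/1 column per level, omitting the reference level."""
    kept = [level for level in levels if level != reference]
    return {level: [1 if v == level else 0 for v in values] for level in kept}

# Hypothetical physical-activity values; "Low" is the omitted reference level.
activity = ["Low", "High", "Moderate", "Low", "High"]
columns = indicator_columns(activity, ["Low", "Moderate", "High"], reference="Low")
print(columns)
# {'Moderate': [0, 0, 1, 0, 0], 'High': [0, 1, 0, 0, 1]}
```

Observations at the reference level have 0 in every column, so the coefficients for the other levels are interpreted relative to that level.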

But, for your ordinal response variable, use ordinal logistic regression.

I hope this helps!

CMO says

Hi Jim,

Thanks for posting this.

I would appreciate your thoughts on my analyses. I have an independent variable that is a count variable (number of days at work). My dependent variables are all continuous variables. Can I use a simple linear regression model to test a moderated mediation relationship with the IV as a count variable?

Thanks!

Jim Frost says

Hi, I’d give the model a try but check the residual plots to be sure that the model satisfies the assumptions. If you’re fitting just the one independent variable, you can use a fitted line plot and see at a glance whether it provides a good fit. I show an example early in this post.

Shiji says

Hi Jim,

It is very informative and useful for researchers. I have a question about my study: I wish to test the relationship between domestic tourists and foreign tourists. When we look at the total numbers, the two show the same trend. So I wish to know which method can be used to show that the pattern of change in domestic and total tourists is the same, or that the movement of total tourists is the same as that of domestic tourists.

thanking you

Shiji

Jim Frost says

Hi Shiji, I’m not 100% sure I understand what you are studying. However, it sounds like you might need to include one or more interaction terms in your model to determine whether the relationships between your independent variables and dependent variables depend on whether a tourist is a domestic or foreign tourist. I write about comparing regression lines in an article. Read that article and, in the graphs where I show the regression lines for two different groups, imagine that one group represents domestic tourists and the other represents foreign tourists. That might be what you’re looking for. I hope this helps!

nasim says

Hi, I am a student and I have a problem. I want to predict bankruptcy in Iran, and I want to use LASSO regression to choose the most effective independent variables. My dependent variable is y (0, 1), and I have 50 independent variables that are financial ratios. I did the analysis in SPSS, but I have many problems with the results. So my main question is: can I use lasso for prediction with a 0/1 dependent variable, and can I use SPSS to do it?

thank you Jim.

Jim Frost says

Hi Nasim, I haven’t done this myself but apparently it is possible. I recommend that you read this about using Lasso with logistic regression. This example uses R, but I’m not sure about SPSS.

Renee Sartin says

Hi. I am a student, and I am having grave difficulty in determining what types of variables I have for my study. (still in the learning phase). This is my problem statement. It is not known if and to what extent a positive correlation exists between organizational commitment of supervisors and practicum success among students, and whether student intrinsic motivation moderates the relationship.

Please, if you were me, what analysis would you use and why? And to the best of your knowledge, what types of variables are these? I look forward to receiving your response.

Jim Frost says

Hi Renee, most likely you are working with either continuous or ordinal variables. To determine which type of variable, check out my glossary definitions for both:

Continuous variables

Ordinal variables

For pairs of continuous variables, you can use the Pearson correlation. Be sure to create a scatterplot and determine whether the relationship is linear.

For pairs of ordinal variables, you can use Spearman’s correlation.
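To show how the two correlations differ, here is a minimal sketch using made-up data. The helper functions are written out by hand for illustration; Spearman’s correlation is computed as the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # average of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: y always increases with x, but not along a straight line.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]
print("Pearson: ", round(pearson(x, y), 3))
print("Spearman:", round(spearman(x, y), 3))
```

Because the relationship is perfectly monotonic but curved, Spearman’s correlation is at its maximum while Pearson’s is lower. That gap is the reason to check the scatterplot for linearity before relying on Pearson’s correlation.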

Best of luck!

John Petroda says

For the count example (number of calls an analyst receives daily), I’m curious about using a log transformation of the dependent count variable and using random forest on that.

Would that work?

Thank you…

Jim Frost says

Hi John, unfortunately I’m not overly familiar with random forest models. That’s something I should learn more about!

Abhishek Singh (@abhi121289) says

Very intuitive. Loved the way you explained it. Thanks Jim.

Jim Frost says

Thank you, Abhishek! I really appreciate the kind words and I’m glad you found it to be helpful!

Mukesh Bishnoi says

Very knowledgeable points

Jim Frost says

Thank you, Mukesh!

Nicol says

Technically, regression examines a relationship between predictor and response variables. I wish people would stop using IV and DV incorrectly. There’s nothing the researchers are manipulating in your examples either.

Jim Frost says

Hi Nicol,

Predictor and response variables are synonyms for independent and dependent variables, respectively. You can use them interchangeably. Also, you’re correct that none of the examples have researchers setting the values for the independent (predictor) variables. However, that’s just fine in regression analysis. These examples are observational studies where you measure data and observe the relationships.

You can also use regression analysis in designed experiments where you use random assignment and the researchers set the values of the experimental variables. The designed experiment approach is particularly good when you want to establish causality (rather than just correlation) and it helps rule out confounding variables. However, this type of experiment isn’t always feasible, and it’s OK to use observational studies as long as you are aware of the limitations and potential problems.

Thanks for writing!

Jim

roy hampton says

Great post Jim. I really like the way you explain the different types of regression.

Jim Frost says

Thank you, Roy! I’m glad that you found it helpful!