Regression analysis mathematically describes the relationship between a set of independent variables and a dependent variable. There are numerous types of regression models that you can use. This choice often depends on the kind of data you have for the dependent variable and the type of model that provides the best fit. In this post, I cover the more common types of regression analyses and how to decide which one is right for your data.
I’ll provide an overview along with information to help you choose. I organize the types of regression by the different kinds of dependent variable. If you’re not sure which procedure to use, determine which type of dependent variable you have, and then focus on that section in this post. This process should help narrow the choices! I’ll cover regression models that are appropriate for dependent variables that measure continuous, categorical, and count data.
Related post: Guide to Data Types and How to Graph Them
Regression Analysis with Continuous Dependent Variables
Regression analysis with a continuous dependent variable is probably the first type that comes to mind. While this is the primary case, you still need to decide which type of model to use.
Continuous variables are measurements on a continuous scale, such as weight, time, and length.
Linear regression
Linear regression, also known as ordinary least squares (OLS) and linear least squares, is the real workhorse of the regression world. Use linear regression to understand the mean change in a dependent variable given a one-unit change in each independent variable. You can also use polynomials to model curvature and include interaction effects. Despite the term “linear model,” this type can model curvature.
This analysis estimates parameters by minimizing the sum of the squared errors (SSE). Linear models are the most common and most straightforward to use. If you have a continuous dependent variable, linear regression is probably the first type you should consider.
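To make the idea of minimizing the SSE concrete, here is a minimal sketch for the simplest case of one independent variable. The data are made up for illustration, and the closed-form slope and intercept formulas apply only to this single-predictor case. The slope is the mean change in the dependent variable for a one-unit change in the independent variable.

```python
# Minimal OLS sketch for one predictor, standard library only.
# The data below are invented purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]          # independent variable
y = [2.1, 3.9, 6.2, 7.8, 10.1]         # dependent variable

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Closed-form least squares estimates for y = intercept + slope * x
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
sxx = sum((xi - x_mean) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_mean - slope * x_mean


def sse(b0, b1):
    """Sum of squared errors for the candidate line y = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"SSE at the estimates: {sse(intercept, slope):.4f}")
print(f"SSE with a nudged slope: {sse(intercept, slope + 0.1):.4f}")
```

Any other line, such as the one with the nudged slope, produces a larger SSE than the least squares estimates.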
There are some special options available for linear regression.

Fitted line plots: If you have one independent variable and the dependent variable, use a fitted line plot to display the data along with the fitted regression line and essential regression output. These graphs make understanding the model more intuitive.
 Stepwise regression and Best subsets regression: These automated methods can help identify candidate variables early in the model specification process.
Advanced types of linear regression
Linear models are the oldest type of regression. They were designed so that statisticians could do the calculations by hand. However, OLS has several weaknesses, including a sensitivity to both outliers and multicollinearity, and it is prone to overfitting. To address these problems, statisticians have developed several advanced variants:
 Ridge regression allows you to analyze data even when severe multicollinearity is present and helps prevent overfitting. This type of model reduces the large, problematic variance that multicollinearity causes by introducing a slight bias in the estimates. The procedure trades away much of the variance in exchange for a little bias, which produces more useful coefficient estimates when multicollinearity is present.
 Lasso regression (least absolute shrinkage and selection operator) performs variable selection that aims to increase prediction accuracy by identifying a simpler model. It is similar to Ridge regression but with variable selection.
 Partial least squares (PLS) regression is useful when you have very few observations compared to the number of independent variables or when your independent variables are highly correlated. PLS reduces the independent variables to a smaller number of uncorrelated components, similar to Principal Components Analysis. Then, the procedure performs linear regression on these components rather than the original data. PLS emphasizes developing predictive models and is not used for screening variables. Unlike OLS, you can include multiple continuous dependent variables. PLS uses the correlation structure to identify smaller effects and model multivariate patterns in the dependent variables.
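The bias-for-variance trade that ridge regression makes can be sketched with two nearly collinear predictors. This is an illustration under stated assumptions, not a production implementation: the data and the penalty value of 1.0 are made up, the variables are centered so that the intercept is left unpenalized, and the 2x2 system is solved by hand. Real analyses usually standardize the predictors and choose the penalty by cross-validation.

```python
# Ridge vs. OLS with two nearly collinear predictors (standard library only).
# The data and the penalty value are invented for illustration.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 5.9]      # almost a copy of x1
y = [2.0, 4.1, 6.1, 7.9, 10.2, 11.8]


def center(values):
    """Subtract the mean so the intercept drops out of the problem."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


x1c, x2c, yc = center(x1), center(x2), center(y)


def fit(lam):
    """Solve (X'X + lam * I) beta = X'y for the two centered predictors."""
    a = dot(x1c, x1c) + lam
    b = dot(x1c, x2c)
    d = dot(x2c, x2c) + lam
    g1, g2 = dot(x1c, yc), dot(x2c, yc)
    det = a * d - b * b
    return ((d * g1 - b * g2) / det, (a * g2 - b * g1) / det)


ols = fit(0.0)      # lam = 0 reproduces ordinary least squares
ridge = fit(1.0)    # lam > 0 shrinks the coefficients

print("OLS coefficients:  ", [round(c, 3) for c in ols])
print("Ridge coefficients:", [round(c, 3) for c in ridge])
```

With a positive penalty, the overall size of the coefficient vector is smaller than under OLS, which is the shrinkage the description above refers to.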
Nonlinear regression
Nonlinear regression also requires a continuous dependent variable, but it provides greater flexibility to fit curves than linear regression.
Like OLS, nonlinear regression estimates the parameters by minimizing the SSE. However, nonlinear models use an iterative algorithm rather than the linear approach of solving them directly with matrix equations. What this means for you is that you need to worry about which algorithm to use, specifying good starting values, and the possibility of either not converging on a solution or converging on a local minimum rather than a global minimum SSE. And, that’s in addition to specifying the correct functional form!
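The iterative approach can be sketched with the Gauss-Newton method, which is one of several algorithms used for nonlinear least squares. To keep the example short, it assumes a one-parameter model, y = exp(b * x), and noise-free data generated from b = 0.5 so that the target is known. The starting value of 0.3 is an arbitrary choice for illustration.

```python
import math

# Gauss-Newton sketch for the one-parameter model y = exp(b * x).
# Data are generated without noise from b = 0.5, so the answer is known.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [math.exp(0.5 * xi) for xi in x]


def sse(b):
    """Sum of squared errors for a candidate value of b."""
    return sum((yi - math.exp(b * xi)) ** 2 for xi, yi in zip(x, y))


def gauss_newton(b, max_iter=50, tol=1e-12):
    """Repeatedly update b from a starting value; returns (estimate, iterations)."""
    for iteration in range(1, max_iter + 1):
        resid = [yi - math.exp(b * xi) for xi, yi in zip(x, y)]
        jac = [xi * math.exp(b * xi) for xi in x]     # derivative of the model w.r.t. b
        step = sum(j * r for j, r in zip(jac, resid)) / sum(j * j for j in jac)
        b += step
        if abs(step) < tol:
            return b, iteration
    return b, max_iter


b_hat, n_iter = gauss_newton(b=0.3)   # 0.3 is the starting value
print(f"estimate of b: {b_hat:.6f} after {n_iter} iterations")
print(f"SSE at the starting value: {sse(0.3):.4f}")
print(f"SSE at the estimate: {sse(b_hat):.2e}")
```

Unlike OLS, there is no single formula for the answer here. The estimate is reached by repeated updates, and a poor starting value or a harder model can prevent convergence.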
Most nonlinear models have one continuous independent variable, but it is possible to have more than one. When you have one independent variable, you can graph the results using a fitted line plot.
My advice is to fit a model using linear regression first and then determine whether the linear model provides an adequate fit by checking the residual plots. If you can’t obtain a good fit using linear regression, then try a nonlinear model because it can fit a wider variety of curves. I always recommend that you try OLS first because it is easier to perform and interpret.
I’ve written quite a bit about the differences between linear and nonlinear models. Read the following posts to learn the differences between these two types, how to choose which one is best for your data, and how to interpret the results.
 What is the Difference Between Linear and Nonlinear Models?
 How to Choose Between Linear and Nonlinear Regression?
 Curve Fitting with Linear and Nonlinear Regression
Regression Analysis with Categorical Dependent Variables
So far, we’ve looked at models that require a continuous dependent variable. Next, let’s move on to categorical dependent variables. A categorical variable has values that you can put into a countable number of distinct groups based on a characteristic. Logistic regression transforms the dependent variable and then uses Maximum Likelihood Estimation, rather than least squares, to estimate the parameters.
Logistic regression describes the relationship between a set of independent variables and a categorical dependent variable. Choose the type of logistic model based on the type of categorical dependent variable you have.
Binary Logistic Regression
Use binary logistic regression to understand how changes in the independent variables are associated with changes in the probability of an event occurring. This type of model requires a binary dependent variable. A binary variable has only two possible values, such as pass and fail.
Example: Political scientists assess the odds of the incumbent U.S. President winning reelection based on stock market performance.
Read my post about a binary logistic model that estimates the probability of House Republicans belonging to the Freedom Caucus.
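As a rough illustration of how maximum likelihood estimation works for a binary logistic model, the sketch below fits one predictor with Newton-Raphson updates. The data (hours studied and pass/fail) are hypothetical, and the hand-written 2x2 solve is only suitable for this tiny example. Statistical software handles the general case.

```python
import math

# Binary logistic regression by maximum likelihood (Newton-Raphson).
# Hypothetical data: hours studied (x) and pass = 1 / fail = 0 (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0, 0, 1, 0, 1, 1, 0, 1]


def prob(b0, b1, xi):
    """Predicted probability of the event for one observation."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))


def score(b0, b1):
    """Gradient of the log-likelihood; it is zero at the estimates."""
    g0 = sum(yi - prob(b0, b1, xi) for xi, yi in zip(x, y))
    g1 = sum(xi * (yi - prob(b0, b1, xi)) for xi, yi in zip(x, y))
    return g0, g1


b0, b1 = 0.0, 0.0
for _ in range(25):
    g0, g1 = score(b0, b1)
    w = [prob(b0, b1, xi) * (1 - prob(b0, b1, xi)) for xi in x]
    h00 = sum(w)
    h01 = sum(wi * xi for wi, xi in zip(w, x))
    h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = h00 * h11 - h01 * h01
    # Newton step: add (information matrix)^-1 times the score
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

odds_ratio = math.exp(b1)   # multiplicative change in the odds per extra hour
print(f"intercept={b0:.3f}, slope={b1:.3f}, odds ratio={odds_ratio:.3f}")
print(f"P(pass | 2 hours)={prob(b0, b1, 2.0):.3f}, P(pass | 7 hours)={prob(b0, b1, 7.0):.3f}")
```

The exponentiated slope is the odds ratio, which is how changes in an independent variable are usually tied back to changes in the odds of the event.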
Ordinal Logistic Regression
Ordinal logistic regression models the relationship between a set of predictors and an ordinal response variable. An ordinal response has at least three groups which have a natural order, such as hot, medium, and cold.
Example: Market analysts want to determine which variables influence the decision to buy large, medium, or small popcorn at the movie theater.
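One common formulation of ordinal logistic regression is the cumulative logit (proportional odds) model, and the sketch below shows how it turns a predictor value into a probability for each ordered category. The cutpoints, the coefficient, and the predictor are all hypothetical values chosen for illustration, not fitted estimates.

```python
import math

# Cumulative-logit (proportional odds) sketch for an ordered outcome.
# The cutpoints and coefficient are hypothetical, not fitted values.
categories = ["small", "medium", "large"]
cutpoints = [-0.5, 1.0]     # one threshold between each pair of adjacent categories
beta = 0.8                  # effect of a hypothetical predictor


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def category_probs(xval):
    """P(Y = category), built from P(Y <= j) = logistic(cutpoint_j - beta * x)."""
    cumulative = [logistic(c - beta * xval) for c in cutpoints] + [1.0]
    probs = [cumulative[0]]
    probs += [cumulative[j] - cumulative[j - 1] for j in range(1, len(cumulative))]
    return dict(zip(categories, probs))


for xval in (0.0, 2.0):
    rounded = {name: round(p, 3) for name, p in category_probs(xval).items()}
    print(f"x={xval}: {rounded}")
```

Because the categories are ordered, a single coefficient shifts probability toward the higher categories as the predictor increases.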
Nominal Logistic Regression
Nominal logistic regression models the relationship between a set of independent variables and a nominal dependent variable. A nominal variable has at least three groups which do not have a natural order, such as scratch, dent, and tear.
Example: A quality analyst studies the variables that affect the odds of the type of product defects: scratches, dents, and tears.
Regression Analysis with Count Dependent Variables
If your dependent variable is a count of items, events, results, or activities, you might need to use a different type of regression model. Counts are nonnegative integers (0, 1, 2, etc.). Count data with higher means tend to be normally distributed and you can often use OLS. However, count data with smaller means can be skewed, and linear regression might have a hard time fitting these data. For these cases, there are several types of models you can use.
Poisson regression
Count data frequently follow the Poisson distribution, which makes Poisson Regression a good possibility. Poisson variables are a count of something over a constant amount of time, area, or another consistent length of observation. With a Poisson variable, you can calculate and assess a rate of occurrence. A classic example of a Poisson dataset is provided by Ladislaus Bortkiewicz, a Russian economist, who analyzed annual deaths caused by horse kicks in the Prussian Army from 1875 to 1894.
Use Poisson regression to model how changes in the independent variables are associated with changes in the counts. Poisson models are similar to logistic models because they use Maximum Likelihood Estimation and transform the dependent variable using the natural log. Poisson models can be suitable for rate data, where the rate is a count of events divided by a measure of that unit’s exposure (a consistent unit of observation). For example, homicides per month.
Example: An analyst uses Poisson regression to model the number of calls that a call center receives daily.
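A Poisson model with a log link can be sketched in the same maximum likelihood style. The example below is a minimal illustration with made-up counts and a single hypothetical predictor, fit with Newton-Raphson updates that start from the log of the mean count. It is not meant as a general-purpose routine.

```python
import math

# Poisson regression with a log link, fit by Newton-Raphson.
# Hypothetical data: x is a predictor, y is a daily count.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1, 2, 2, 4, 6, 9]

b0 = math.log(sum(y) / len(y))   # start at the log of the mean count
b1 = 0.0
for _ in range(25):
    mu = [math.exp(b0 + b1 * xi) for xi in x]                  # fitted mean counts
    g0 = sum(yi - mi for yi, mi in zip(y, mu))                 # score for the intercept
    g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))   # score for the slope
    h00 = sum(mu)
    h01 = sum(mi * xi for mi, xi in zip(mu, x))
    h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

mu = [math.exp(b0 + b1 * xi) for xi in x]
rate_ratio = math.exp(b1)   # multiplicative change in the expected count per unit of x
print(f"intercept={b0:.3f}, slope={b1:.3f}, rate ratio={rate_ratio:.3f}")
print("fitted counts:", [round(m, 2) for m in mu])
```

Because the model works on the log scale, the exponentiated slope is read as a multiplicative change in the expected count rather than an additive one.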
Alternatives to Poisson regression for count data
Not all count data follow the Poisson distribution because this distribution has some stringent restrictions. Fortunately, there are alternative analyses you can perform when you have count data.
Negative binomial regression: Poisson regression assumes that the variance equals the mean. When the variance is greater than the mean, your model has overdispersion. A negative binomial model, also known as NB2, can be more appropriate when overdispersion is present.
Zero-inflated models: Your count data might have too many zeros to follow the Poisson distribution. In other words, there are more zeros than the Poisson regression predicts. Zero-inflated models assume that two separate processes work together to produce the excessive zeros. One process determines whether there are zero events or more than zero events. The other is the Poisson process that determines how many events occur, some of which can be zero. An example makes this clearer!
Suppose park rangers count the number of fish caught by each park visitor as they exit the park. A zero-inflated model might be appropriate for this scenario because there are two processes for catching zero fish:
 Some park visitors catch zero fish because they did not go fishing.
 Other visitors went fishing, and some of these people caught zero fish.
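The two-process idea can be sketched with a zero-inflated Poisson distribution for the fishing example. Both parameter values are hypothetical: 40% of visitors are assumed not to fish at all, and visitors who do fish are assumed to catch two fish on average. The comparison shows how the extra zeros also push the variance above the mean, which is the overdispersion that a plain Poisson model cannot capture.

```python
import math

# Zero-inflated Poisson sketch for the fishing example.
# Both parameter values are hypothetical.
p_no_fishing = 0.4   # share of visitors who never fish (always zero)
lam = 2.0            # mean catch among visitors who do fish


def poisson_pmf(k, mean):
    return math.exp(-mean) * mean ** k / math.factorial(k)


def zip_pmf(k):
    """P(count = k) when zeros can come from either of the two processes."""
    from_fishing = (1 - p_no_fishing) * poisson_pmf(k, lam)
    return p_no_fishing + from_fishing if k == 0 else from_fishing


zip_mean = (1 - p_no_fishing) * lam
# Counts of 60 or more have negligible probability here, so the sum is truncated.
zip_var = sum((k - zip_mean) ** 2 * zip_pmf(k) for k in range(60))

print(f"P(0) under the zero-inflated model: {zip_pmf(0):.3f}")
print(f"P(0) under a plain Poisson with the same mean: {poisson_pmf(0, zip_mean):.3f}")
print(f"mean={zip_mean:.2f}, variance={zip_var:.2f}")
```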
Whew! That’s many different types of regression analysis! If you’re trying to figure out which one to choose, I hope you will use this information to point yourself in the right direction!
Laura Kukkonen says
Hi Jim!
Thank you a lot for your posts, they are very useful and well-written.
I’m trying to find a way to analyse multiple hypotheses based on questionnaire data. Each hypothesis requires the analysis of multiple dependent and independent variables. The independent variables are of continuous, categorical and binary types, while the dependent variables are in this case all categorical (Likert scale). If possible, I would like to find a single model to analyse each hypothesis, but I’m having a hard time figuring out if there is any statistics available to fit my needs. I have been considering a multinomial logistic regression, but as I understand it only allows for one dependent variable. Is there an alternative where multiple dependent variables are available? If all of my different types of independent variables cannot be included in a single test, which types of separate tests would you recommend? I am thinking of a MANOVA for the categorical independent variables, but I don’t know of any suitable tests for the continuous and binary variables where I can include more than one dependent variable. Thank you a lot in advance!
Kind regards,
Laura
Girma Gedamu says
Hi Jim, I’m interested in predicting drought, looking at the effect of categorical independent variables on a continuous dependent variable.
For example, my data contain five variables: one dependent and four independent. However, the dependent variable has 5 categories. Which regression model should I use? Please help me.
Jim Frost says
Hi Girma,
If the DV’s categories have no natural ordering, then you’d use nominal logistic regression. However, if they do have a natural ordering (e.g., a five-point Likert scale), you’d need to use ordinal logistic regression.
Tim says
Hello Jim,
Thanks for your post.
I have been tasked to perform an OLS regression on wage of a certain population. We are asked to have a close look at the definition of the variable education (1:low and 5:high). What kind of conceptual problem might there be in using it in an OLS?
Thanks in advance!
Jim Frost says
Hi Tim,
One potential issue is that you have an ordinal independent variable. You’ll have to determine whether to include it as a categorical variable or a continuous variable. That depends on the goals of your analysis, the nature and amount of data, and the adequacy of the fit using either method. If you have my book about regression, I write more about the decision and challenges of both approaches.
Also, regarding the definition of the variable, that sounds fairly subjective/vague. You should have a clear definition of how you’ll measure education level. Ideally, come up with an operational definition. Then test it on sample data to be sure that it provides consistent, accurate results in representing education level. I talk about creating operational definitions of your variables in my post about including statistical analyses in scientific studies.
İrem says
Hi Jim,
I’m trying to build a regression model. I have 122 independent variables, and I am trying to predict the legal proceeding amount. I also have many zeros in my dependent variable. These zeros show that the person is not under legal prosecution. Most of the independent variables record the frequency of loan usage by year. I also have a variable for how much credit they use in total. But there are many zeros in these variables because zero was entered for any year in which credit was not used. I also have other variables such as gender and age. Which regression model would be appropriate for me in this context?
Sanjay Mali says
I have a table showing readings of a variable that depends on 4 different ‘Yes-No’ type variables, which are independent of each other. In addition, the description tells me that these “+ –” type variables can themselves vary. For example, suppose there is a circular button. The table gives whether this button is on or off. If it is on, its range of rotation is also given, and that influences part of the question. What type of regression model should I use?
Abeer says
Hi,
Thank you for the valuable information.
I have a question about ordinal logistic regression.
I have a Likert-scale questionnaire with several subscales. I have calculated the composite score for each subscale. First, I did comparisons between males and females in my sample using the mean scores of each. Further, I focused on one of the subscales and wish to do ordinal logistic regression for each item under that subscale separately to test its strength of association with other IVs (gender, GPA, etc.).
Would that be a correct approach? Or is it contraindicated to treat items separately using ordinal logistic regression if they have already been combined as a composite score in the same study?
Hugo says
Hi, I was wondering: if my DV is numerical (ratio), what is the best regression analysis? The IVs are both ratio (numerical) and continuous.
Jim Frost says
Hi Hugo, the best place to start would be linear least squares!
Ian Berryman says
Hi, great informative webpage. I was curious. If the dependent variable is natural logged and the key ind. var. is not (semi-log form), can your controls be natural logged? My dep. var. is percent change in poverty, my key ind. var. is # of new churches (so it must be measured in units), and my controls are all measured in percents (i.e., percent of the population that’s black, percent of the population that earned a HS diploma, etc.). So do I need to have these controls natural logged to interpret my regression as “a 1% increase in a control variable leads to an x% change in poverty” or, since they are already measured in my data set as percentages, is that already how the regression interprets them?
Thanks!
Jim Frost says
Hi Ian,
Yes, you can “mix and match” as needed with using natural logs or not! It really depends on the nature of the data and theories about what is appropriate for it.
Sajeeka Gunasekara says
Thank you for replying to me! If I want to investigate the impact of happiness on GPA, what would be suitable? Happiness would be collected as an average score between 0 and 6. What regression would be suitable?
Jim Frost says
Hi Sajeeka,
Is happiness an average of a set of scores that are something like Likert scale items? If so, taking the average is a method to use it as a continuous independent variable in your model. The coefficient for this variable indicates the average increase in GPA for a one-unit increase in the happiness score. That’s still least squares regression.
Sajeeka Gunasekara says
Hi Jim!
Thank you for your posts, they are really useful. I have a problem with my scenario. My DV is grade point average (GPA). It is a continuous value between 0 and 4. My IVs are happiness score, gender, and academic level (it has 4 levels: 1000, 2000, 3000, and 4000). What type of regression would you recommend for this?
Thank you so much!
Jim Frost says
Hi Sajeeka,
I’d start with linear least squares regression (OLS). You have a continuous DV. Gender and academic level are categorical independent variables, which OLS can handle. For academic level, the regression model will determine whether the mean GPA for each academic level is significantly different from the baseline academic level (which is something you have to pick).
I hope that helps!
Helen says
Hi Jim,
My dependent variable is a score 1 – 5, with 5 being the best rating, and 1 being the worst. I want to see if gender has an association with a higher rating. I have been looking into doing an ordinal logistic regression analysis, but not sure if this is correct. I know I could create a binary outcome of (1-3) vs (4,5) but not sure if I want to lose that data? What type of regression would you recommend for this analysis?
Thanks!
Jim Frost says
Hi Helen,
Your dependent variable is an ordinal variable. Consequently, you’ll need to use ordinal logistic regression. You won’t need to collapse any values using that method. That method will tell you if gender relates to those ordinal values.
Bidisha Chakraborty says
Hi Jim! Your posts are really helpful. My DV is a happiness index lying between 0 and 1 (continuous) that has been constructed from various responses in categorical form. Hence the DV is ordered. Would ordered logit or ordered probit be the ideal regression model to fit? Earlier, I fitted OLS to the happiness scores lying between 10 and 45. R square is coming out to be very low. In ordered logit, chi square is highly significant, but the pseudo R square is very low. Waiting for your valuable comments.
Thanks and regards,
Bidisha
Jim Frost says
Hi Bidisha,
It sounds like you have a continuous DV that is restricted in range between 0 and 1. Because it is continuous, you cannot use binary or ordinal logistic regression. However, you can use a logistic sigmoid function to model it. This process forces the model to respect the limits of the DV. I’ve never performed this type of analysis, so I can’t offer much help. But that should point you in the right direction!
Louis G. Daily says
Jim, I have one continuous predictor (IV) variable in my multiple regression and one ordinal predictor (IV). Can I do this? Can I incorporate this ordinal variable which comes from a Likert scale into the multiple regression?
Jim Frost says
Hi Louis,
I write about this in more detail in my regression book. Go to my webstore for more details about it.
Ordinal variables have a mix of attributes of categorical and continuous variables, which makes including them in a model a bit more complex. You’ll need to enter your ordinal variable as either a continuous variable or a categorical variable. The decision depends on the nature of your data, the goals of your research, and the quality of fit each approach provides.
Abel Chipeta says
Hey Jim
Thanks for the clarifications on the models…they really help.
I am conducting a study on gender decomposition of the effect of education on household savings. Savings is my dependent variable, while education, gender, sex, age, income, marital status, and occupation are the independent variables. Please help me with the best model to use, since savings is a categorical variable while the explanatory variables are a mix of both categorical and continuous variables.
Looking forward to your help.
Best regards
Jim Frost says
Hi Abel,
I would start with linear least squares regression (aka ordinary least squares). Your dependent variable is continuous (savings), which this type of regression can handle. Linear least squares can also handle a mix of continuous and categorical independent variables, such as you have.
Meriem says
Hi Jim,
Could you recommend any references?
Thank you in advance
joel says
Hi Jim,
I’m interested to find out the impacts of categorical independent variables on a continuous dependent variable.
For example, what are the effects of the type of news (financials, dividend payout) on the volatility of stocks.
Jim Frost says
Hi Joel, that’s possible in linear least squares regression along with other types of regression. Assessing the role of categorical IVs is a fairly common and basic usage for many types of regression models. If you have only categorical variables in your model, that’s often called ANOVA (analysis of variances). However, ANOVA uses the same mathematics “under the hood” as regression.
Gemechu says
I have two dependent variables (aboveground biomass and carbon content) and three independent variables (wood density, diameter, and height). Which model is best for analysing my data? Waiting for your reply. Thank you.
Meriem says
Thank you for your answer. What about fixed and random effects? Is it OK to use the mix of binary and continuous variables?
Jim Frost says
Yes, just be sure to use a type of model where you can specify both fixed and random effects, such as the MIXED procedure in SPSS. That aspect does affect the type of procedures you can use.
Meriem says
Hi Jim,
Your content has helped me a lot in my work, Thank you!
I’m conducting a regression analysis using panel data on a sample of 74 individuals over an 8-year period. I’ve been wondering if I can use a mix of binary and continuous independent variables to explain a continuous dependent variable. Would that be OK? How would it affect the type of regression I use?
Thanks
Meriem
Jim Frost says
Hi Meriem,
Yes, it’s entirely ok to use a mix of binary and continuous independent variables for ordinary least squares. You still need to satisfy the same set of OLS assumptions, but there are no additional requirements. Binary independent variables are also known as indicator variables, and analysts frequently use them in linear regression. Typically, the 1s and 0s of an indicator variable represent the presence or absence of a characteristic. You just need to interpret their coefficients differently. The coefficient represents the mean difference between observations with and without the characteristic. The p-value indicates whether that difference is statistically significant.
I hope this helps!
Erick Loetz says
Hello Jim.
When using logistic regression, how critical is it to have balanced treatment groups? I have a binary response variable evaluated from two treatments at n=40 and n=14. How does the need for balance (approximately equal n’s for each group) compare to OLS analysis? Is there any published information on the subject? I appreciate your comments and expertise. Many thanks.
Jim Frost says
Hi Erick,
I don’t have a reference specific to binary logistic regression. However, in general, it’s OK to have unbalanced groups like that. Having balanced groups helps you maximize power for any given number of subjects. But, it’s ok to have an unequal number. I’m not sure for logistic regression, but for most analyses I’d say that having 14 in the smallest group is fine.
Perry Gonen says
When doing multiple regression analysis, as opposed to a simple OLS, where we have a number of independent variables, do you recommend plotting each independent variable against the dependent variable, one at a time, to see what the plot of each variable on its own (without the other variables) looks like, and then, after analyzing each plot on its own, going forward with the statistical analysis?
Sarah says
Hi Jim,
I am hoping to do a regression analysis on social posts. The question I am usually trying to answer is does a certain variable (for example a photo vs no photo) play a role in engagement rate.
I’ve been going about this as coding the post: “Does this post have a photo or not?” Yes = 1, No=0 , but then I’m unsure what type of analysis would make the most sense?
(I would replicate this analysis on a bunch of different categories as well, like is the post light vs. dark, morning or night etc )
Any help/ thoughts would be much appreciated!!
Sarah
Nick Dekkers says
Dear Jim,
I would like to describe the correlation between 1 categorical factor (3 levels) and 2 continuous responses which are nonlinear over time. I have a continuous factor (concentration) which I would like to add to the model. But I’m not sure if I should use the MANCOVA or a nonlinear regression model. What would you suggest?
Thanks in advance!
Daphne says
Dear Jim,
I’ve been reading some of your blog posts and they are very helpful! However, I do have some remaining doubts about the regression I should run for my thesis. The doubt is mainly related to the setup of my dependent variable. I am constructing a measure of corporate social responsibility performance by summing strengths and concerns of several dimensions for each firm per year. There are about 50 dimensions in total, and a firm is given a 1 if it is known to perform well/poorly on a dimension and a 0 otherwise. Hence the dependent variable is constructed by summing these binary data observations. The minimum value of both the total strengths and the total concerns variable is 0. The maximum value of the strengths is equal to 44 and for the concerns this is equal to 36. I thus have only nonnegative integer values. The mean for both strengths and concerns is around 3.0; however, the standard deviation of the strengths is equal to 3.8 and for the concerns it equals 4.9. I’m hoping to be able to use OLS regression, but I was wondering whether there is enough variance in the DV and whether it can be considered a continuous variable. After reading this blog post I also started wondering about the count data dependent variable and whether something like a Poisson regression isn’t more suited. I was also considering the ordinal logistic regression. For a bit more clarity: I am trying to find the effect of several owner types on CSR performance. I hope you can help me out! In case you have some questions about my dependent variable, please let me know, then I will elaborate further!
Many many thanks in advance!
Jayson says
Hello Sir Jim,
What if there are 12 variables of 2 kinds (variables 1 to 6 measure well-being and variables 7 to 12 measure positive thinking)?
Should I run simple linear regression or multiple? How can I know if the variables are affecting each other? Thank you.
Jessica says
Hello Mr Jim Frost,
Thank you so much for the information you have provided on your website, as well as the answers to the questions above they are very helpful!
I am currently designing a research proposal that will assess medication adherence for rural vs urban groups in Canada. I plan on using an adherence scale that will give the participants a score (from 1-8) and based on the score, their adherence will be categorised. So >7 = high adherence, 5-8 = moderate adherence, and <5 = low adherence.
I am planning the statistical analysis for the adherence.
Based on the responses I have read, would it be correct to conduct an ordinal logistic regression for the two groups? We’re also planning on conducting an independent samples t-test between the two groups and a paired samples t-test to compare the same group at the start and end of the study. Would these t-tests be possible if we transformed the data to continuous variables?
We also wanted to conduct a Chi-square test; is this possible if we don’t have estimated adherence scales?
How would you go about the statistical analysis for this study?
Thanks!
Jessica
Jim Frost says
Hi Jessica,
Yes, your plan sounds like a sound one. Ordinal logistic regression is a good choice. You can use t-tests for Likert scale data. Your data aren’t exactly Likert scale. It’s an 8-point ordinal scale. My guess (though I don’t have a reference to cite) is that you could not use a t-test based on the recategorized scale but probably on the full eight-point scale. If you really want to use the recategorized scale, you might need to use a Mann-Whitney test instead. Transformation is probably not necessary. As an ordinal scale has more points, it becomes more like continuous data. Not perfectly so, but studies have shown that at least 5-point scales are close enough to use t-tests. Read my post about using t-tests for Likert data for more information.
About the chi-square test, I’m not sure what you mean by not having estimated adherence scales?
I think any of those approaches are valid. Ordinal logistic regression has the benefit of being designed for ordinal data. While studies have shown that t-tests and Mann-Whitney tests can both work, it avoids a potential debate about the results if you just use a test designed for that data type! Consequently, I’d probably lean in that direction myself.
Geska says
Hi Jim,
If one respondent answers the same set of questions for different brands, when I run the regression analysis, should I compare both answers and use a dummy variable, or should I run the regressions separately? One of my aims is to look at how customer trust affects customer loyalty.
Jim Frost says
Hi Geska,
Let me make sure I’m understanding your scenario correctly. You’ll have the respondents answer the same questions about two or more brands and you’ll have some dependent/outcome variable that you’re measuring? And, you want to determine how the responses relate to the brands?
In that case, you’ll need to use a mixed ANOVA model. You’ll need to include a subject ID as a random factor, survey responses as fixed factors or covariates depending on the type of response, and the brand itself as a fixed factor.
I hope that helps! Unfortunately, I haven’t written a post about mixed models yet. But, that’s the direction it sounds like you need to go!
Dave says
Hi Jim
I am wondering whether regression analysis can be used to predict the strength of relationships between multiple income variables and bottom line expenditure. I have a conundrum in how I compare business performance pre-Covid (Feb) to future performance in, say, October, and how I might model and estimate future income and expenditure. Can you enlighten me?
Fei says
Hi Jim, thank you for sharing this information. It is very helpful! I have a question regarding gender. There are only a few males in my sample, the female – male ratio is 14:1. Will this affect my analysis if I run a multiple linear regression? How about a simple mediation or moderation analysis (IV, DV, mediator/moderator). I plan to use gender as a control variable for these three types of analyses.
Thank you! I look forward to your reply.
Jim Frost says
Hi Fei, because there are so few males, it’ll be hard for the model to estimate the effect of gender and the statistical test will have low statistical power. However, if the effect of gender is large, it might still be a valuable addition and you could possibly get significant results. Honestly, I would just try it and see how it works. Don’t expect too much from gender, but if the effect is large it might still be important to include it.
Michaela says
Hi Jim,
I hope you can help me, I am riding the clock a bit.
I hope you are well and safe with all the madness of COVID-19 in the world. I was hoping you could please help me. I am a student currently in need of a lot of help. Statistical analysis is not my forte but when needs must, we grin and bear it and look to the professionals for help.
I have just received my data back from my survey and I have found that my independent variables are all 1-5 Likert scales. I am looking at consumer attitude and intention, however, I believe if I am not mistaken that my dependent variable is my ‘Intention’ as that is what I am looking to find out overall. From my research, I think ordinal logistic regression is most appropriate to use. I ended up looking at YouTube for how to run the test and what goes where, but a lot of the videos contradict one another.
My independent variables are;
Relative advantage/ Compatibility/ Environmental Impact / Psychological ownership.
Each of these is a subcategory with between 4 and 8 questions under it, all 5-point Likert.
I need to determine these expectancy values against attitude and intention but I have no idea how to.
So my questions are:
Firstly, am I using the right regression?
Do the IVs go into the ‘Factors’ or the ‘Covariates’ box?
Is it Test of Parallel lines where I am receiving my null hypothesis data?
Is there another test I should be running with it?
Also when analysing the output, if I run it through ‘factor’ box, in the parameter estimates, under the location information, is the significance data suggesting that my data is proving that – say for instance – Relative advantage significance is linked to consumer intention or does it mean that there is no link between RA and intention?
Thank you very much for your time. I hope you can help.
Jim Frost says
Hi Michaela,
Choosing the correct type of regression depends on the dependent variable, and I’m not sure what your DV is. If your DV is also Likert scale, then, yes, ordinal logistic regression is the correct type.
For the IVs, Likert scale items can be tricky if you’re using the individual item scores for your values. They’re not continuous but they’re not categorical. However, you’ll need to model them one of those ways. That gets to your question about whether they are “factors” (categorical) or covariates (continuous). There are pros and cons for both ways. And, the goals of your study also play a role. I don’t have an article to refer you to about that, but I do write about it in my regression ebook.
If you’re summing or averaging Likert items together and using those values in your model, you might be able to treat them as continuous variables (covariates).
The test of parallel lines for ordinal (or ordered) logistic regression assesses whether the coefficients are the same across all levels of your DV, which I’m assuming are the Likert scale values. This test really is about determining whether you have a good model or not. If this test is significant, it’s not good. It means that there is something wrong with your model, but it doesn’t tell you what exactly. You could just have a poor fitting model or you might be using the incorrect link function. You want this p-value to be greater than your significance level.
You should also look at the chi-squared goodness-of-fit tables. Again, you want p-values that are higher than your significance level. Low p-values are problematic here because they indicate your model doesn’t fit the data well.
You should also look at the pseudo R-squared values. Like the regular R-squared in linear regression, they give you an idea of how well the model fits the data. Low values don’t necessarily mean the model is incorrect but they do mean that it won’t be good for prediction.
For the location parameter estimates, p-values below your significance level indicate that the IV is statistically significant. That result suggests that the effect of the IV does not equal zero (i.e., there is an effect). It’s a good thing!
I hope that helps! Best of luck with your study!
Robbie says
Hi Jim
I have read your books and a lot of your blog and have found all incredibly useful. One area I am repeatedly confused by (perhaps as it is not in your books) is the interpretation of logistic regressions.
I am looking at various negative outcomes for children (child labour, begging, etc.) and comparing them with a range of household characteristics (child-headed, large households, elderly-headed households, etc.).
The results show some interesting findings. For instance, regressing the binary dependent variable of ‘begging’ or not gives me an odds ratio of 2.5 for child-headed households. For the same regression, the margins command shows probabilities of 0.57 for child-headed households and 0.35 for adult-headed households.
I am at a loss as the best way to interpret these findings in a manner which makes sense for the data…
Is it more intuitive to say child headed households are 2.5 times more likely to beg than adult headed households (or is this not correct interpretation), or is the better interpretation that child headed households are 1.6 times more likely to beg than adult headed households (0.57 v 0.35).
Can you assist?
Robbie says
Hi Jim
Thanks so much for your useful blog and posts. It’s really helped me understand regressions and do some really interesting analysis. And inevitably led me to buy your books on linear regressions and statistics… I only wish there was one on logistic regressions too 🙂
I have run a simple logistic regression using a child labour survey in an East African country.
I ran the number of households that reported begging against those households that were child headed.
The logistic regression showed that child-headed households had an odds ratio of 2.51. The constant was 0.54. I interpret this as meaning that child-headed households are 2.5 times more likely to beg than non-child-headed households (although I am not sure if I need to subtract the constant?)
However, the margins command shows the predicted probability of begging in a child-headed household is 0.57 while in a non-child-headed household it is 0.35. The odds ratio and margins do not speak to each other as I understand them. How can the odds of begging in a child-headed household be 2.5 times greater using the odds ratio but the probability of begging in a child-headed household be far less at 1.6 times greater?
Apologies for the no doubt obvious question, but I am struggling to find any answers.
. logistic begging i.child_head
Logistic regression                             Number of obs     =      1,288
                                                LR chi2(1)        =       3.95
                                                Prob > chi2       =     0.0470
Log likelihood = -836.89941                     Pseudo R2         =     0.0024
------------------------------------------------------------------------------
     begging | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.child_head |    2.51981   1.180173     1.97   0.048     1.006238    6.310081
       _cons |    .545676    .032052   -10.31   0.000     .4863365    .6122557
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
.
end of do-file
. margin child_head
Adjusted predictions                            Number of obs     =      1,288
Model VCE    : OIM
Expression   : Pr(begging), predict()
------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  child_head |
          0  |   .3530339   .0134158    26.31   0.000     .3267393    .3793285
          1  |   .5789474    .113269     5.11   0.000     .3569443    .8009505
------------------------------------------------------------------------------
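The odds ratio and the margins in the output above can be reconciled with a little arithmetic. The following is a minimal Python sketch that uses only the two numbers reported by the logistic command; the variable and function names are illustrative, not part of any Stata or Python library.

```python
# Numbers copied from the logistic output above.
baseline_odds = 0.545676  # _cons: odds of begging in non-child-headed households
odds_ratio = 2.51981      # 1.child_head


def odds_to_probability(odds):
    """Convert odds to a probability: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)


# The odds ratio multiplies the baseline odds; nothing is subtracted.
odds_child_headed = baseline_odds * odds_ratio

p_other = odds_to_probability(baseline_odds)
p_child = odds_to_probability(odds_child_headed)
probability_ratio = p_child / p_other

print(f"P(begging | not child-headed) = {p_other:.3f}")
print(f"P(begging | child-headed)     = {p_child:.3f}")
print(f"Ratio of probabilities        = {probability_ratio:.2f}")
print(f"Ratio of odds                 = {odds_ratio:.2f}")
```

The two probabilities match the margins table (about 0.35 and 0.58), so the outputs agree with each other. The odds ratio compares odds, while the 1.6 figure compares probabilities, and the two are different scales. On this reading, a statement such as "child-headed households have about 2.5 times the odds of begging" describes the odds ratio, while "about 1.6 times as likely" describes the ratio of predicted probabilities. The constant is the baseline odds and is not subtracted.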
Varma says
Jim,
Thanks for the valuable advice!
jamesey10 (@jamesey10) says
Hi Jim,
Maybe you can tell me if I’m on the right track or getting derailed.
I think an Ordinal Logistic Regression or Poisson Regression is right for me. I’m still obtaining my counts and learning RStudio, but I’ve been reading your site and others to figure out what I’ll do with the data I collect.
I have over 1200 observations from content analysis (text mining)
My dependent variable is a count of words relating to one of 5 levels (a,b,c,d,e.)
My independent variable is a count of words relating to one of 4 typologies (y, z, w, x; order does not matter).
I want to determine if any of the 4 typologies correlate to any of the 5 levels.
Dan says
Hi Jim,
Thanks for your article, it is very helpful!
I have a question about my research: I want to examine if a correlation exists between clinical characteristics (amount of pain, degree of physical disability, etc.) and radiological scoring of an X-ray. Most clinical parameters are continuous variables, but one is binary (success/failure) and another is Likert scale with 7 categories (ordinal of course). The X-ray scoring is ordinal with 4 categories (from no degeneration to extensive degeneration).
I figure I could use Spearman’s rho for continuous variables; Kendall’s tau, Somers’ d or Goodman and Kruskal’s gamma for Likert; and the rank-biserial correlation coefficient for the binary variable (success/failure). This way I can correlate these variables to the ordinal X-ray scoring.
The next step is to do regression analysis to determine which independent variables account most for the X-ray scoring. Since my dependent variable is ordinal I was thinking of ordinal logistic regression. But is it possible to incorporate continuous, ordinal and binary independent variables into this model?
Kind regards,
Dan
Jim Frost says
Hi Dan,
Yes, ordinal logistic regression is the way to go for your data. It can handle continuous and binary data with no problem. It can handle categorical data with recoding as indicator variables, which most software should do automatically these days. Your ordinal independent variable is a bit problematic. You’ll need to include it either as a continuous variable or a categorical variable. There are pros and cons for each way. The correct answer depends upon a combination of the goals of your analysis and the nature of your data. I don’t have an article to point you towards but I do write about it in my regression ebook. I write about it in the context of least squares regression, but the same principles apply to ordinal logistic regression.
Best of luck with your analysis!
Varma says
Great post, Jim.
I am dealing with a challenge. I am not sure what regression analysis method to use to analyze a data set where the independent variables are in Likert scale (1 through 5 – completely disagree through completely agree) and the dependent variables are Likert items (1 through 5 – completely disagree through completely agree).
Your advice would be greatly appreciated.
Thanks in advance
Jim Frost says
Hi Varma, because your dependent variable is ordinal (Likert scale), you’ll need to use ordinal logistic regression. However, including ordinal independent variables is problematic. You’ll need to include them as either categorical or continuous variables. Each approach has strengths and weaknesses. The correct choice depends on the goals of your analysis and the characteristics of your data. I write about it in my ebook about regression analysis.
Best of luck with your analysis!
YUSUF JAMAL says
Hello Jim,
I want some help in determining a threshold population density. I have two data sets. One is the population density in many counties of the US and another is the number of cholera cases in those counties. I want to figure out the population density at which there is a 50% chance of getting infected with cholera. Can regression help me here?
Thank You so much
Jim Frost says
Hi Yusuf,
It sounds like you need to use binary logistic regression because that’ll allow you to calculate the probability of an event occurring (cholera infection). You’d need the number of events (infections) and opportunities (county population) along with the population density values. Although, I’m not sure if any US county ever got up to 50% cholera infections.
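Once a binary logistic model has been fitted, the predictor value that corresponds to a given probability can be found by inverting the logistic function. The following is a minimal sketch under stated assumptions: a single predictor (population density) and made-up coefficients that are purely hypothetical and not estimated from any data. The function names are illustrative.

```python
import math


def probability(x, intercept, slope):
    """Logistic model: P(event) = 1 / (1 + exp(-(intercept + slope * x)))."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


def x_at_probability(target_p, intercept, slope):
    """Solve intercept + slope * x = logit(target_p) for x."""
    logit = math.log(target_p / (1.0 - target_p))
    return (logit - intercept) / slope


# Hypothetical coefficients, for illustration only.
intercept = -6.0
slope = 0.002  # change in log-odds per one-unit increase in population density

threshold = x_at_probability(0.5, intercept, slope)
print("Density at which the fitted probability is 50%:", threshold)
print("Check:", probability(threshold, intercept, slope))
```

At a target of 50% the logit is zero, so the threshold reduces to minus the intercept divided by the slope. If the fitted probabilities never approach 50% within the range of the observed densities, the computed threshold is an extrapolation and should be treated with caution.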
ronnie says
hello jim, so my doctor in college gave me the task to make a regression analysis on the impact of smoking, exercise level. and i dont know how to approach it.
EJ says
Hi there! I am trying to find the right correlation test for my data. I have a continuous independent variable, and a categorical dependent variable (with 4 possible categories). Do you have any suggestions for this type of data?
Jim Frost says
Hi EJ,
I’m not sure that finding a correlation is what you’d want to do with that type of data. Instead, use a one-way ANOVA to determine whether the mean of your continuous variable is different between the four groups. Click the link to see an example of that analysis.
Alexis says
Hello Jim , i need your help with my masters project, i have a set of audio recordings from which i extracted the MFCC matrix and the corresponding image sequence from which the mouth landmarks were detected for each MFCC frame, my goal ultimately is to give an audio to a system that will predict the mouth landmarks for each MFCC vector but i literally have no idea what that system should be or how to proceed , can you give me any advice concerning that?
Yusuf Jamal says
Hello Sir,
Is the Hosmer-Lemeshow statistic really important in binary logistic regression? Is a Hosmer-Lemeshow p-value of less than 0.05 sufficient to discard an otherwise good binary logistic regression model?
Thank You
Jim Frost says
Hi Yusuf,
In binary logistic regression, the Hosmer-Lemeshow goodness-of-fit test compares the observed and expected frequencies of events to determine whether the model adequately fits the data. It evaluates whether the differences between the expected probabilities and the observed probabilities are statistically significant. Consequently, your low p-value indicates that your model does not fit the data well because these differences are significant.
In this case, try different link functions and/or change your model.
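To make the comparison of observed and expected frequencies concrete, here is a rough Python sketch of the usual form of the Hosmer-Lemeshow statistic. It assumes cases are sorted by predicted probability and split into equal-sized groups, with degrees of freedom equal to the number of groups minus two. The function names and the toy data are hypothetical, and statistical packages may handle ties and grouping differently.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; exact closed form for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def hosmer_lemeshow(outcomes, predicted, groups=10):
    """Return (statistic, df, p_value) for 0/1 outcomes and predicted probabilities."""
    # Stable sort by predicted probability only, so ties keep their input order.
    pairs = sorted(zip(predicted, outcomes), key=lambda pair: pair[0])
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        observed = sum(y for _, y in chunk)
        expected = sum(p for p, _ in chunk)
        # Assumes 0 < expected < size in every group; otherwise this divides by zero.
        statistic += (observed - expected) ** 2 / (expected * (1.0 - expected / size))
    df = groups - 2
    return statistic, df, chi2_sf_even_df(statistic, df)


# Hypothetical toy data: every case has a predicted probability of 0.5
# and exactly half of the cases in each group experience the event.
toy_outcomes = [0, 1] * 20
toy_predicted = [0.5] * 40
print(hosmer_lemeshow(toy_outcomes, toy_predicted, groups=4))
```

In the toy data the observed and expected counts agree in every group, so the statistic is zero and the p-value is one. Large gaps between observed and expected counts inflate the statistic and shrink the p-value, which is the situation described in the question.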
haneen says
Another question, sorry but I am a beginner in statistics. Now I have to check the residual plot (same as a scatterplot, right?) for positive correlation, right? If there is a positive correlation, does it mean the dependent variable is normally distributed? Or shall I do another test to check for its normal distribution? One more question: do I have to check the residuals and the normal distribution of the dependent variable for the means or for the original responses?
Thanks a lot for your response
Jim Frost says
Hi, I have an article about residual plots. I put a link in my other reply. That’ll answer your questions.
haneen says
Dear Jim,
Yes there is a positive relationship, you mean by the residual plot the scatter plot right? Can I send a screenshot for the scatter plot as a message on your fb page just to make sure that all is ok?
Jim Frost says
Hi Haneen,
You use residual plots to check the OLS assumptions. Read those posts for more information.
Haneen says
Dear Jim,
I have a sample of 140 instructors and 21 tools on the Moodle website and I want to show statistically that the awareness of Moodle tools (in general) influences their usage.
I conducted a 5-item Likert scale survey. Two questions for each tool, one asks about awareness of the tool and the other about usage of the tool. Then I calculated the overall usage of each tool (by finding the mean of the responses for each tool) and the overall awareness of the tool (by finding the mean of the responses for each tool). Then I used a simple linear regression between the overall awareness and the overall usage (i.e., between the means), resulting in 21 points in the scatter plot. Is this right? Or shall I run the linear regression between all responses for all tools (21*140 points in the scatter plot)?
Can you kindly reply a.s.a.p?
regards,
Jim Frost says
Hi Haneen,
I’m not quite clear about your model. For Tool A, are you looking at the mean awareness of tool A by all subjects and the mean usage of tool A for all subjects? And, you have 21 tools, so for your model, your 21 observations are basically 21 means for the DV and the IV? And each mean is calculated using 140 values. Is that right?
You should be able to fit the model that way. Typically, each observation is an individual but in your case each observation is a mean. But, it should tell you whether higher awareness leads to higher usage. You should find a positive relationship. Just be sure to check the residual plots. Using 140 values to calculate each average should allow you to treat the underlying ordinal variables as continuous data. However, because they are based on ordinal data with a constricted range and the difference between values might not be constant, you might have curved relationships and nonnormal distributions. Pay particular attention to the residual plots. If you satisfy the assumptions, you should be good. If the distribution of the DV is highly skewed, you might need to use a transformation. Or possibly fit curvature.
Rose says
Hi Jim,
I really liked your article, I only have one question. My dependent variable is a categorical variable with three levels, and my independent variable is a discrete variable with 5 levels. Is ordinal logistic regression then the right way to analyse my data?
Thanks in advance!
Rose
Vanessa says
Hi Jim,
Thank you so much for all your valuable posts! i was going over them in order to prepare for my own analysis.
I decided to run both an ordered logit regression (ordered categorical dependent variable) and a multinomial logit regression (unordered categorical dependent variable). The question I have right now is what is the best way to interpret the coefficients I get from the logit regressions?
shall i use the odds ratio? the RRR? or simply use conditional or average marginal effects? is there anything i need to take into account?
Many thanks already very much for your reply!! I am so looking forward to it!
Jamaldeen Nurudeen says
please, which type of regression is appropriate to quantify the influence of funding on academic success and why?
Jim Frost says
Hi Jamaldeen, you don’t pick the type of regression based on the subject area but by the types of variables you have, particularly the type of dependent variable. So, determine what your dependent variable is and then look through this post for the types of regression that can analyze that type of dependent variable.
David Lee says
Do you discuss use of the Information-Theoretic Approach in your blog? In particular, do you discuss use of AICc in comparing models? I recently purchased your Regression Analysis book but don’t see this discussed.
Gaia Fiordispini says
Thank you very much Jim! I really appreciate your responsiveness.
I know it would be better to consider it as a nested dataset, and this makes the analysis more complex and difficult to understand for me. I am a marketing student and not a statistician, and my research question is whether there is a relationship between the maturity levels (fixed values for each brand, that I want to use as independent continuous variables) and the consumer mindset metrics (different for each respondent, who answered for one brand only, and that I want to consider as my dependent variable)… wouldn’t it be sufficient to run a correlation? It would come with limitations but I have to answer my research question with my competencies.
Thank you and all the best!
Jim Frost says
There are several considerations. If your DV is Likert, you can’t use the regular Pearson’s correlation, you’d need to use Spearman’s correlation.
However, there is a larger issue with the correlation approach. When you include multiple independent variables in a regression model, it automatically controls for all the variables in the model. Consequently, the effect of each variable is calculated while controlling for (i.e., holding constant) the other variables in the model. You lose that with correlation because you’re assessing IVs one at a time. You could end up with biased estimates, basically omitted variable bias, which might or might not be a problem with your data. That’s a risk to weigh and involves subject-area knowledge that I don’t have. Read here to see an example of this problem in action.
I’m rethinking the nested design as I think I’m understanding your design better. Nesting occurs on the IV side of things. And you don’t have that there. Respondents are answering for one brand, but their responses are on the DV side. So, disregard that. I’d perform ordinal logistic regression with no nesting.
You could perform Spearman’s correlation but with the caveat I mentioned earlier.
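For readers unfamiliar with the difference between the two correlations mentioned above, Spearman’s correlation is Pearson’s correlation computed on ranks, with tied values given their average rank. The following is a small Python sketch of that idea; the helper names and the example scores are hypothetical.

```python
import math
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson's correlation coefficient."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's correlation: Pearson's correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical example: a maturity score and a 1-5 Likert response.
maturity = [12, 15, 15, 20, 22, 27, 30, 31]
likert = [1, 2, 2, 3, 3, 4, 5, 5]
print(round(spearman(maturity, likert), 3))
```

Because only the ordering of the values matters, this approach does not assume equal spacing between Likert categories. It still assesses one pair of variables at a time, so the caveat about not controlling for other variables applies.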
Gaia Fiordispini says
Thank you! My supervisor suggested that I think about CLUSTERED DATA and perform a REPEATED MEASURES MULTIVARIATE ANOVA. Would this mean I have to transform my dataset? Also, I do not understand whether it is the proper way to analyze my data, or whether a specific type of correlation (e.g., I tried Spearman) would be enough!
Jim Frost says
Hi Gaia,
Unless I misunderstood your design, which is possible (see my other comment), it doesn’t sound like you’re using repeated measures. It does sound like you have a nested design because each respondent is answering about one brand.
My other comment has thoughts about the type of regression and other details about your model. It’s a rather complex mix of things, at least if I’m understanding correctly.
Gaia Fiordispini says
Hi Jim! Could you please reply to my question? It is important for my Master Thesis!
Jim Frost says
Hi Gaia,
So sorry about the delay. I’ll get to it soon. It unintentionally fell through the cracks!
Lauren says
Hi Jim,
Great article! I am still a bit confused. Would a type of regression still be best if the data included 1 categorical independent variable (two groups), 2 continuous dependent variables, and 1 categorical dependent variable?
Jim Frost says
Hi Lauren,
Typically, you have just one dependent variable per regression model. If you have multiple dependent variables, you’ll need to fit separate models.
You can use least squares models to fit models for your continuous dependent variables with the categorical independent variables. Although, if you have a continuous DV and just two groups, you can use a 2-sample t-test for that instead of regression.
For the categorical dependent variable, you’d need to use a binary logistic model if it has two levels or a nominal logistic model if it has more than two levels and assuming there is no natural order to the levels.
I hope that helps!
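The link between least squares regression and the 2-sample t-test mentioned above can be seen with a small example. When a two-group categorical variable is coded as a 0/1 indicator, the least squares slope equals the difference between the two group means and the intercept equals the mean of the group coded 0. The following Python sketch uses made-up data, and the function name is illustrative.

```python
from statistics import mean


def ols_simple(x, y):
    """Least squares intercept and slope for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data: group is the 0/1 indicator, y is the continuous DV.
group = [0, 0, 0, 0, 1, 1, 1, 1]
y = [4.1, 5.0, 4.6, 5.3, 6.2, 5.8, 6.9, 6.5]

intercept, slope = ols_simple(group, y)
mean_0 = mean([v for g, v in zip(group, y) if g == 0])
mean_1 = mean([v for g, v in zip(group, y) if g == 1])

print("Intercept:", intercept, "Mean of group 0:", mean_0)
print("Slope:", slope, "Difference in means:", mean_1 - mean_0)
```

The test of that slope in a regression with only the indicator corresponds to the pooled (equal-variance) 2-sample t-test, which is why either analysis can be used for a continuous DV and two groups.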
Diego says
Hi Jim,
Great article! Thanks for the info.
I have a question regarding a doubt I have on which regression to use for the dataset I’m currently working in.
I’m trying to analyze the impact of HIV on MSK health among women aging with HIV. So, I assessed body composition measures (fat mass, lean mass, etc.) in a cohort of HIV-positive and HIV-negative women to see if there are differences between them. Also, I did some physical tests (grip, chair standing, among others). So, I want to see if body composition measurements affect the scores on those physical tests, and how HIV influences that.
In this case, as body composition measurements are continuous variables and the physical test scores are also continuous variables, should I use a linear regression model? But I’m not sure how to include the HIV impact part in that model. Thanks!!
Best,
Diego
Gabriel Villas says
Hi Rahma!
If you want to investigate the level of attendance, your dependent variable will be Categorical, because it has a natural order (Low, Medium and High) to obey.
Considering this, the regression model I suggest you is the Ordinal Logistic Regression.
Kind regards,
Gabriel Villas.
Jim Frost says
Hi Gabriel,
A categorical variable with a natural order is an ordinal variable, and I agree with the recommended analysis. Even better would be to record the number of days attended and use either least squares regression or Poisson (or related) regression depending on how well the count approximates the normal distribution. You’d probably get more information that way.
Gaia Fiordispini says
Hello I have a question regarding a dataset I must analyze. I did two research steps:
1. Based on a managerial maturity model, I ranked 8 brands according to their level of usage of social media tools. For example, how many different content formats they use, from 3 to +5. There are a total of 7 components that, summed all together, give the total maturity score of each brand.
2. I made a survey among consumers to measure their level of brand awareness, consideration and purchase intent. Each respondent was randomly assigned to a brand for which he had to give his rankings, so I have 8 conditions in the dataset.
3. Now my hypothesis is that there is a relationship between the level of maturity and the consumer mindset metrics (namely higher levels of maturity should score higher levels of mindset metrics).
So the questions are:
– I am analyzing the full sample because each condition has only 50 respondents. Should I split the dataset per condition for some analysis?
– The variables I want to compare are: interval Likert scales 1-5 for the mindset metrics (dependent) and ratio variables (probably I will rescale them from 0-100% for example) that are fixed for each condition (independent). How to do correlation and regression analysis? Specifically, which regression is more appropriate?
Sorry for the long request but I tried to be as precise as possible.
Jim Frost says
Hi Gaia,
This is much deeper into the details of a study than I can usually go in blog comments. As you’ll see, your design is a complex mix of elements. From your description, I’m not sure that I have a completely clear understanding of your study, which makes it difficult to comment. However, I think I’ve got it, and here are some thoughts.
If your dependent variable consists of ordinal data (Likert scale), you’ll probably need to use something like ordinal logistic regression–unless the DV is a sum of multiple Likert items, which you can sometimes treat as continuous. If the DV is a sum of multiple Likert items, you can try using least squares regression but you’ll need to pay particular attention to the residual plots to ensure you’re getting an unbiased fit.
It sounds like each respondent is nested within one condition (brand). If that’s the case, you’ll need to use a nested design. You’ll probably need to include respondent as a random factor (as opposed to a fixed factor) in your design.
That’s a complex model to run and you might need to consult with a statistician at your institution to help you out.
I hope that helps! Best of luck with your analysis!
Naduth Must says
Hi
I have a dataset with 3 continuous variables in which the dependent variable includes negative values too. How can I decide what type of regression to apply (other than linear)? By looking at the scatterplot or something?
Rahma says
Hi!! I have a question!
I have data on the number of classes that 86 students attended.
I have grouped those students into LOW attendance, MEDIUM attendance and HIGH attendance, based on the number of classes they attend.
And I have many Independent Variables (age, gender, BMI, smoking status, employment status, marital status), both categorical and continuous.
1) Is my Dependent Variable categorical or continuous?
2) Which regression do I use?
Waiting for your answer!!! Thank you
Maggie says
Hi, Jim.
I really need your help on the appropriate test for my undergraduate thesis. I controlled 3 categorical variables (A,B,C). Each have 3 levels. This is a within subject experiment where each subject was tested for 12 conditions (e.g. condition 1: A1,B1,C2; condition 2: A1,B2, C2…) Could you please suggest a test to analyse the contribution of each variable? I was thinking of the binary logistic test since the response is binary.
Thank you so much for your attention!!
Jim Frost says
Hi Maggie,
In general, that sounds like the right approach. One potential hiccup is that given the within-subjects nature of the study, you’ll need to include subject as a random factor. I discuss this a bit in my post about repeated measures designs. I’ve never seen a similar type of analysis with a binary dependent variable and a random independent variable. So, I’m not sure if there’s software out there that can do that analysis. I’d consult with a statistician who can look into the special needs of your study and find the best solution. If it weren’t for the random factor, I’d say that binary logistic regression would meet your needs.
Katuta says
Hi Jim
I wanted to analyze visual acuity outcome, which is primarily a continuous variable, but I have split the range into three groups: good outcome, borderline and poor outcome. Can I use ordinal or multinomial regression, or am I still required to use linear regression?
My independent variables are age, sex, type of cataract, comorbidities, and level of education.
Thanks.
Jim Frost says
Hi Katuta,
You can use ordinal logistic regression. However, if you have the continuous data for acuity outcome, you might consider analyzing it using linear regression on the continuous data rather than converting it to an ordinal variable. But, if you do transform it as you describe, use ordinal logistic regression.
Priya Mohan says
Hi Jim,
Your post is very helpful and very clear, thank you. In my observational study, the outcome variable is categorical and dichotomous (yes/no), and the independent or explanatory variables are also categorical, for example age groups, gender, education, smoking, etc. I understood that we need to use binary logistic regression. Is this right? And how would we interpret the beta coefficient?
Fads says
Hi Jim, I have a question on the type of regression analysis that would be the most appropriate for me to use. I’m investigating the impact that factors such as ethnicity, geographical region and gender have on educational attainment at GCSE level.
I have 3 categorical variables (ethnicity, geographical region and gender), which I’m measuring against a continuous dependent variable.
Thank you!
ariel says
Hi Jim,
Thank you again for the prompt response.
My regression lacks the offset variable.
So, in the Parameter Estimates table (SPSS) of this regression there is an Exp b column and I wonder how to refer to its values as IRR or RR?
thanks,
ariel
ariel says
Hey Jim,
You are right my dependent variable is not a count variable but it is not a pure rating variable either. It is difference/delta between the student rating pre and post a course and it has a poisson distribution. Is it wrong to use a poisson regression for this type of variable?
Thank you for the prompt answer,
Ariel
Jim Frost says
Hi Ariel,
Yes, you can use Poisson regression in your case. Poisson regression is designed for data that follow a Poisson distribution. Usually that means count data. However, if your DV follows a Poisson distribution, as yours appears to, the procedure will work fine even when it’s not count data. It might raise questions among reviewers because it is an unusual use, so just be ready to explain that!
ariel says
Hey Jim,
I have a dependent variable that is the variation between Likert scale student ratings before and after a course they took. I did a Kolmogorov-Smirnov test and found that this variable has a Poisson distribution, so I constructed a multivariate Poisson regression (without an offset variable).
My first question is did I construct the right regression?
My second question is that the Parameter Estimates table of this regression has an Exp b column and I wonder whether to refer to it as IRR or RR?
thanks,
ariel
Jim Frost says
Hi Ariel,
Typically, you use Poisson regression for count data. However, the student ratings don’t sound like count data. Ratings are ordinal data. Consequently, you should consider using ordinal logistic regression.
Aujah says
I need help finding the right regression model for a categorical, binary independent variable and 3 different continuous dependent variables. Please!
Jim Frost says
Sounds like you need to use a binary logistic model.
Erik says
Jim,
Could you give a thumbs up if zero-inflated models would be appropriate given the following question:
Is there a relationship between the number of backpacking trips in a 3-year period among a purposive sample (n=50) of participants in an outdoor leadership class (dependent variable) and their age, sex, occupation, and race (independent variables)?
Note: there is some evidence that many of the participants did not go on a backpacking trip in the last 3 years.
Thank you in advance and stay safe!
Jim Frost says
Hi Erik,
From what you describe, it sounds like a zero-inflated model might be appropriate, at least as something to consider. You need to consider whether there are two separate processes that you can describe. One would be the desire to go on the trips. Some people want to go while others don’t. That will explain some of the zeros. Then, of those who want to go, there’s an expected distribution of trips. That distribution will likely include zero because I’d imagine there are people who would want to go but for some reason or other weren’t able to. So, you’d have the two sources of zeros, those who just didn’t want to go and those who wanted to go but weren’t able to go.
However, you don’t have to go straight to a zero-inflated model. You can try a regular Poisson or negative binomial model to see if you get a good fit. Sometimes you can even get a good fit with least squares regression when the mean count is high enough.
So, it does sound like a good possible approach but you might give the simpler models a try first. See if you do in fact have an excessive amount of zero trips that the simpler models can’t explain.
Best of luck with your analysis!
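One quick, informal way to see whether there are more zeros than a simple model expects is to compare the observed share of zeros with the share a Poisson distribution with the same mean would predict, which is exp(-mean). Here is a minimal sketch of that check. The trip counts are made up for illustration, and the function name is hypothetical. It is a rough screen, not a replacement for fitting the models and checking their fit.

```python
import math

def poisson_zero_check(counts):
    """Compare the observed share of zeros with the share a Poisson
    distribution with the same mean would predict, exp(-mean)."""
    n = len(counts)
    mean = sum(counts) / n
    observed_zero_share = sum(1 for c in counts if c == 0) / n
    expected_zero_share = math.exp(-mean)
    return mean, observed_zero_share, expected_zero_share

# Hypothetical trip counts for 20 people (made-up data, not from any study).
trips = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 3, 3, 4, 4, 5, 6, 6]

mean, observed, expected = poisson_zero_check(trips)
print(f"mean trips: {mean:.2f}")
print(f"observed share of zeros: {observed:.2f}")
print(f"Poisson-expected share of zeros: {expected:.2f}")
```

If the observed share is far above the expected share, as it is in this made-up sample, that is a hint that a plain Poisson model may struggle with the zeros.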
Luis Saenz says
Hi Jim, first thank you for making statistics and regression analysis so easy to understand for the masses.
I am currently working on an analysis to find the cost of a hypothetical zero load carrying gas distribution pipe. All of my variables are continuous; as the pipe size increases, so does its cost. But I have two main types of material (plastic and metal). How can I do this analysis: by running a regression analysis for each material type and then averaging the results, or by using an ANOVA? I’m lost.
Thank you
Miro Mar says
Valuable information and it explains the subject to its primary scratch!!!
I just want to know the difference between Regression analysis and ANOVA?
Jim Frost says
Hi Miro,
If we’re comparing ANOVA to least squares regression, they’re actually the same analysis but have different historic traditions and terminology. However, the math “under the hood” is the same.
Regression typically works with continuous predictors, although you can add categorical variables. For categorical variables, regression uses binary coding (1, 0) so that you compare the results for each categorical value to a baseline value.
ANOVA typically uses categorical factors. These factors will often use effects coding (-1, 0, 1), which allows you to compare each factor level to the overall mean (rather than to a baseline group). You can include continuous predictors but they’re called covariates. ANOVA is more focused on explaining the differences between groups that are formed by categorical variables. Consequently, ANOVA is also linked to post hoc testing that carefully assesses the differences between multiple groups. However, you could perform the same post hoc tests on regression results.
Despite these differences, you will obtain completely consistent results using both of these methods. The different coding schemes for categorical variables can make some of the coefficients appear to be different, but they’re telling the same story. If you can control the coding scheme in your statistical software, you can obtain completely identical results.
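To make the coding point concrete, here is a small sketch that fits the same one-factor model twice with least squares, once with binary (1, 0) coding against a baseline group and once with effects (-1, 0, 1) coding. The data, the group labels, and the helper names (solve, ols, fit_and_predict) are all made up for illustration, and the tiny solver is just a stand-in for what statistical software does. The coefficients differ between the two schemes, but the fitted values are identical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(design, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data: three groups with three observations each.
groups = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
y = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 13.0, 14.0, 15.0]

# Each row is [intercept, column for B, column for C].
dummy = {"A": [1, 0, 0], "B": [1, 1, 0], "C": [1, 0, 1]}      # A is the baseline
effects = {"A": [1, -1, -1], "B": [1, 1, 0], "C": [1, 0, 1]}  # compare to overall mean

def fit_and_predict(coding):
    design = [coding[g] for g in groups]
    coefs = ols(design, y)
    fitted = [sum(c * v for c, v in zip(coefs, row)) for row in design]
    return coefs, fitted

b_dummy, fitted_dummy = fit_and_predict(dummy)
b_effects, fitted_effects = fit_and_predict(effects)
print("dummy coding coefficients:  ", [round(b, 3) for b in b_dummy])
print("effects coding coefficients:", [round(b, 3) for b in b_effects])
print("same fitted values:",
      all(abs(a - b) < 1e-9 for a, b in zip(fitted_dummy, fitted_effects)))
```

With binary coding the intercept is the baseline group’s mean and the other coefficients are differences from it. With effects coding the intercept is the mean of the group means and the coefficients are deviations from that.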
Jyoti Pandey says
Hi Jim,
This post is very useful. I just need to know, if we have a continuous dependent variable and three independent variables (one continuous, the second nominal, and the third ordinal), how do we calculate the regression in this situation?
Thanks!!
Jim Frost says
Hi Jyoti,
Yes, you can use least squares regression for that type of model with one caveat. There’s no way to include an ordinal independent variable in the model as an ordinal variable. You’ll need to include it in the model as either a continuous variable or a categorical (nominal) variable. There are pros and cons either way. I don’t have a post that discusses that decision but I do write about it in my regression ebook.
Best of luck with your analysis!
Gabriella says
Hi Jim, thanks for the detailed breakdown of the different types of regression analysis!
I would like to seek your advice on how I should run some of my own tests for my undergrad final year study, which is on the topic of how temporary depletion of level of executive functioning may affect honesty behaviour in young children.
– My main IV is executive functioning: either depleted or not depleted (categorical).
– My other variables are a mix of categorical and continuous variables, namely: gender, age (in years), Theory of Mind score, and English vocabulary score.
– My only DV is the child’s honesty behaviour: either truth-teller or lie-teller (categorical).
I am currently about to carry out my statistical tests on SPSS, and have dummy coded my categorical variables to numerical codes.
I haven’t learned the more advanced statistical tests in my undergrad courses, but based on what I have read in your post, I think the Binary Logistic Regression is most suitable for my study’s design. The question I have is, how do I proceed from there?
1) Specifically, how do I control for my other variables (i.e. gender, age, ToM score, vocab score) to ensure that my model is mostly looking at the effects of executive functioning on honesty behaviour? Is that already accounted for by the regression? If not, do I need to do additional tests even before I carry out my regression? What tests would these be to control for these other factors?
2) Conversely, how do I check if the other variables have main effects or interaction effects with my main IV, assuming that they may be present? Can the regression reveal these? If not, what other tests should I run?
I am sorry if these questions are really basic, I am just a mess when it comes to math and statistics. I would appreciate any feedback I can get from you! Thank you and stay safe!
Jonny says
Hi Jim,
I hope you’re keeping well wherever you are during these uncertain times.
I’d really appreciate it if you could give me some help when you get the chance!
I’m writing up the results from my study which is looking at the effect of a single nucleotide polymorphism (random genetic variation) on postprandial glucose response.
I have identified no significant difference between any of the three genotype groups and any of my glucose response outcome variables (fasting, peak, 2-hour postprandial, and incremental area under the curve). I have 73 participants in my sample (27 female and 46 male), and identified a significant difference between sexes for 2-hour postprandial glucose concentration.
Using SPSS statistics, I would like to conduct some kind of multiple regression to assess the effects of both sex and genotype on the different measures of glucose response (primarily 2-hour postprandial to begin with). Therefore, my variables will be categorical.
Would I have to assign numbers to my categories?
I am unsure if my data has a linear relationship, and I don’t know how to check this in SPSS especially with categorical variables. Most of my outcome measures are also non-normally distributed.
Will a non-normal distribution and lack of linear relationship prevent me from doing a multiple regression analysis?
Thank you in advance!
Jim Frost says
Hi Jonny,
Can you more fully describe the nature of your dependent variable and how you’ll record it? That would help me answer. Thank you!
Gretchen Edstrom says
Thank you so much!
Easy explanation of the different types of regression
Nibaldo Colina says
Hi Jim,
I would like to establish the correlation or lack thereof between the cost of accidents and the experience level of the person who caused the accident. I have three levels of experience.
Thank you
eva says
thank you jim! its all making much more sense. i guess im getting confused because ive been seeing some examples where people use t tests or mann whitney u tests to compare things like age or amount of a drug (continuous variables) to a categorical outcome, but that would obviously be incorrect since they’re the independent variables not the dependent variables right?
appreciate your help!
eva says
Hi There! thank you very much for this. I was wondering what kind of analysis you would use to compare a continuous independent variable (i.e. age) with a binary categorical dependent variable (i.e. mortality)?
would that just be a logistic regression?
Jim Frost says
Hi Eva, yes, that sounds like a situation where you’d use a binary logistic regression model.
Sujit says
Hello Jim,
So nice of you !
Thank you so much for the guidance 🙂
Regards
Sujit
Jim Frost says
You’re very welcome! Again, sorry for the delay! 🙂
Sujit says
You Never replied my query 🙂
Regards
Sujit
Jim Frost says
Hi Sujit,
Apologies for the delay. Sometimes life gets in the way, especially during these days.
Because you’re doing a count of walk-ins, it sounds like you might need Poisson or Negative Binomial regression, although it depends on the distribution of your count variable. If the count is high enough, the distribution can approximate the normal distribution and OLS might be okay to use. You’ll need to define a consistent period for the count. For more information, read the section in the post titled, “Regression Analysis with Count Dependent Variables”; it’s near the end.
S says
Thanks a lot for the explanation!
I’m collecting data at 2 time points, and would like my variables at Time 1 to predict Time 2. So my DV is a variable at Time 2 and as predictors, I have 1 variable at Time 1 (not continuous) as well as a 1 categorical variable from Time 2 (condition = 1 or 2). I am assuming that I’ll need to use multiple linear regression?
Regression is new to me, so I’m not sure which model to use if I have 1 categorical predictor and another predictor that isnt continuous.
Jim Frost says
Hi,
What type of variable is Time 1 and what type is Time 2?
Anwasia Anthonia says
I want to estimate farmers’ willingness to use renewable energy, so I want to find out if I can use switch regression to analyze willingness to use. Thanks, and I look forward to your response.
James Knuppenburg says
I want to test if science and technology has increased peak performance and age amongst different positions in the NFL. What type of regression do you think would be best for this type of test?
Jim Frost says
Hi James,
Choosing the type of regression analysis is not usually based on the subject area. (There are some exceptions to that rule.) In other words, there’s not a correct type for performance in the NFL. Instead, you need to determine which variable(s) are your dependent variables and the nature of the relationship they have with the independent variables. If your DVs are continuous, I’d recommend starting with least squares regression and determining whether you can obtain a satisfactory fit with it. You can always change analyses as needed. Sometimes you’ll learn things along the way that require you to change the type of regression, although advance research will help you in that regard as well.
Best of luck with your analysis!
Steven says
Hello! Hope this finds you well. I am running a multilevel multiple regression analysis for my dissertation. At the individual level, my IV is bullying experiences in schools and DVs are three different health outcomes for students (1 dichotomous, 2 continuous; all mental health). I also have a moderator (LGBTQ identity, dichotomous). This data has been identified all at the school district level for an entire state. At the group level, I have voting data for school districts (second IV; currently in percentages: % for Hillary, % for Trump, in 2016 election). My analysis is hoping to explain the relationship between school district voting records, bullying experiences, and health outcomes for LGBTQ students.
I am planning on using Mplus for analysis. I have a few questions after reading this article:
1) What would your recommended analysis plan be? After reading this article, I was unsure how to apply this to a multilevel, multiple regression analysis in Mplus.
2) What would be the best way to insert voting data into the data set for analysis, given that it is in percentage form? Currently, I was thinking of calculating the difference between voting (80% Clinton, 20% Trump = .60; 20% Clinton, 80% Trump = -.60 on a scale of -1 to 1 from most conservative to most liberal). That way, it could serve as a continuous IV at the group level (level 2) of the multilevel regression.
Lemme know what you think. Thanks for any help you can provide a novice statistician!
Sujit says
Thank you so much Jim . Have a great vacation
Regards
Sujit
Sujit says
Hello Jim,
Any suggestions on my request
Regards
Sujit
Jim Frost says
Hi Sujit, I’m on vacation now. Will reply when I return. Thanks.
Dessalegn Wudima says
I am presenting my thesis proposal, entitled Impacts of ICT on Tax Administration, for the Master’s degree in Development Economics. I have prepared a yes/no questionnaire on whether ICT has an impact on cost of taxation, tax compliance and tax revenue collection. The challenge is deciding which model is preferable. For the time being, in the proposal I mention using multiple linear regression with OLS, so my question is, is my model choice really good?
bedilu adugna says
HI JIM ,
I am going to do master thesis on determinant/factors that affect graduate youth unemployment and which regression model you advise me to use for single dependent variable
Jim Frost says
Hi Bedilu,
I’d start with linear regression and see how that works. From there, you can see how well the model satisfies the assumptions.
Best of luck with your thesis!
Dan says
Hi everybody,
I’m using panel data to examine the effect of CEO age on tone of annual reports for 5 years. I have applied this model but I am not sure whether it is the best model or no. I am using Stata, year as dummies and random effect:
xtreg tone CEOage controls i.year, re robust
Any help would be hugely appreciated.
Yours Sincerely,
dan billy
Linda says
Hi,
I want to use a regression model to find the number of charging stations in cities (which is a count data) with respect to different features of the cities (i.e., population, vehicle miles traveled, etc). I tried different models and linear regression gave me the best goodness of fit. Is it wrong to use linear regression for this purpose?
Jim Frost says
Hi Linda,
The Poisson distribution, which models count data, begins to approximate the normal distribution when the counts are large enough. So, it might be entirely appropriate to use linear regression. Be extra careful in checking those residual plots. If those look good, it’s probably OK to use linear regression. With smaller counts, it’s harder to satisfy the OLS assumptions.
Keep in mind that you can only interpret goodness-of-fit statistics when the model satisfies the assumptions, which you assess using residual plots. A good goodness-of-fit value doesn’t necessarily indicate that the residuals are good. Read my post about R-squared for more about this issue. If the residuals aren’t good, you can’t trust the model regardless of what the goodness-of-fit statistics indicate (i.e., you can’t trust them either).
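To illustrate why larger counts behave more like normally distributed data, here is a short sketch that computes the skewness of a Poisson distribution for several mean counts. A normal distribution has zero skewness, and the Poisson skewness is 1 divided by the square root of the mean, so it shrinks as the mean grows. The function names and the chosen means are just for illustration.

```python
import math

def poisson_pmf(lam, k_max):
    """Poisson probabilities for k = 0..k_max, built up iteratively."""
    probs = [math.exp(-lam)]
    for k in range(1, k_max + 1):
        probs.append(probs[-1] * lam / k)
    return probs

def poisson_skewness(lam):
    """Skewness computed numerically from the probabilities; the upper
    cutoff is far enough into the tail that the truncation is negligible."""
    k_max = int(lam + 12 * math.sqrt(lam) + 30)
    probs = poisson_pmf(lam, k_max)
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    third = sum((k - mean) ** 3 * p for k, p in enumerate(probs))
    return third / var ** 1.5

for lam in (1, 5, 25, 100):
    print(f"mean count {lam:>3}: skewness {poisson_skewness(lam):.3f}")
```

This only speaks to the shape of the distribution. You still need to check the residual plots for your own model.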
Sujit Dora says
Hello Jim,
Need your guidance for an exercise where I need to establish walk-ins to a store as a function of the number of print inserts by a brand + print inserts by competition
Regards
Sujit
PV says
Hi Jim . If my dependent variable is “Number of job switches” , it is Discrete random variable. Can I still use MLR?
Michael says
Jim, thanks for the great information. While describing PLS regression you mentioned this is a good technique for predictive models but not good for screening variables. My motivation for wanting to understand regression techniques is largely because I want to be able to make predictions about things. What about PLS regression makes it better suited for this? Can’t we make predictions from any regression technique that fits the data?
Jim Frost says
Hi Michael,
I guess I wouldn’t say that PLS is better at prediction than other methods in general. Rather, PLS is a good method to use when you face particular problems: few observations per predictor and multicollinearity. In this context, PLS is relatively good at predictions and not so good at screening variables. However, if you don’t have those problems, I wouldn’t consider using PLS and would probably start with OLS. If you have severe multicollinearity and are interested in determining the significance of particular variables, consider Ridge or LASSO regression.
Ivac says
Hi Jim,
I am a junior doctor and I need to write a paper in statistics about “Types of regression analysis used in biomedical sciences”. Can you help me with where I can find that data and which types of regression analysis are used in biomedical sciences? Everyone mentions just linear regression analysis.
Thank you in advance.
Jim Frost says
Hi,
I wouldn’t be surprised if linear regression is the most common. I’ve also heard that binary logistic regression is commonly used in things like surgical procedures where the researchers want to model the variables that affect the probability of survival. They use binary logistic regression because the outcome is binary (patient either survives or does not). I can also imagine them using binary logistic regression to model the success or failure of other treatments. I once helped a fertility doctor to model the probability of pregnancy by the treatment and other factors. Again, the outcome is binary (pregnant and not pregnant).
Hospitals could use Poisson regression to model the availability of hospital beds. The outcome is a count of available beds.
I think the medical sciences could use many different types of regression just like other fields. It depends on the nature of the dependent variable and the characteristics of their data. I would guess that linear regression and binary logistic regression are two of the more common types.
I hope this helps!
Naz says
Hi Jim,
I have a binary outcome variable (presence of disease = yes or no)
And would like to study the effect of different independent variables on the presence of disease
Examples of my independent variables are as follows:
geographic longitude
geographic latitude
age
all of which are on a continuous scale
Which regression model should I use?
Thank you
Jim Frost says
Hi Naz,
Given your binary outcome (dependent) variable, you’d need to use binary logistic regression.
Elsa Pakpahan says
Hi, Jim. I’d like to ask, I want to examine how self-compassion (IV) can predict depression (DV). Self-compassion itself is formed by 6 dimensions. Could you suggest what regression would fit my research? I already ran simple linear regression for self-compassion and depression, so if I use multiple regression for the 6 dimensions and depression, am I doing it right? Or should I run simple linear regression for each dimension and depression?
an answer would be really appreciated, thank you! 🙂
Jim Frost says
Hi Elsa,
The best place to start with is least squares regression. Use multiple regression so you can include all your IVs in the model. You don’t want to fit separate models for each IV because it won’t be controlling for the other variables.
If you had to form the 6 dimensions because you needed to reduce the number of dimensions, and now you want to perform a regression on the dimensions (which are not variables), you might need to use something like partial least squares (PLS) regression instead. That will find the dimensions and perform regression on them. I don’t have much experience with that type of regression. There might be other methodologies that I’m unfamiliar with that you can use. Generally, you’ll have a specific need for reducing the number of dimensions and working with those rather than just using the variables themselves.
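Here is a small sketch of why separate models for each IV can mislead, using made-up numbers. In this hypothetical data, y depends only on x1, but x2 is correlated with x1. A simple regression of y on x2 alone shows a sizable slope, while a multiple regression that includes both IVs gives x2 a slope of zero because it controls for x1. The helper names are hypothetical and the two-predictor formulas are the standard closed-form least squares solution.

```python
def centered_sums(a, b):
    """Sum of cross-products of a and b around their means."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))

def simple_slope(x, y):
    """Slope from a simple regression of y on a single predictor x."""
    return centered_sums(x, y) / centered_sums(x, x)

def two_predictor_slopes(x1, x2, y):
    """Slopes from a multiple regression of y on x1 and x2 (with intercept)."""
    s11 = centered_sums(x1, x1)
    s22 = centered_sums(x2, x2)
    s12 = centered_sums(x1, x2)
    s1y = centered_sums(x1, y)
    s2y = centered_sums(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

# Hypothetical data: y is driven by x1 only; x2 merely tracks x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.0, 1.0, 2.0, 3.0, 3.0, 4.0]
y = [2.0 * v for v in x1]

slope_x2_alone = simple_slope(x2, y)
b1, b2 = two_predictor_slopes(x1, x2, y)
print(f"x2 slope in its own simple regression: {slope_x2_alone:.3f}")
print(f"x1 slope in the multiple regression:   {b1:.3f}")
print(f"x2 slope in the multiple regression:   {b2:.3f}")
```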
Randunu Galhena says
Thank you for your post.
I’m doing advanced regression modeling for house price prediction, so there is a very large number of variables. Can you suggest a method for dimension reduction?
Jim Frost says
Hi Randunu,
I’ve seen regression models for house prices that use least squares regression. Unless you’re using a small sample size, I’m not sure that you need to reduce the dimensions. I’d at least start your investigation using least squares regression and move away from that only if it’s really required.
deepanshu mishra says
Hi Jim,
what if I want to apply regression if I have variables of # of complications like: dengue, Malaria etc. and have variable of deaths due to these complications. which regression should be used in this data.
thank you!!
Jim Frost says
Hi Deepanshu,
One thing I hope to have conveyed and the readers understand about this post is that the type of regression you use largely depends on the type of variables you have, particularly for the dependent variable (DV).
For your case, you don’t state how you’re recording deaths. What type of variable is the DV? Are data recorded by patient and you have a binary death/survival outcome for each patient? If so, use binary logistic regression. If it is by hospital (or geographic region) and you have counts of deaths, you could use Poisson, Negative Binomial, or possibly least squares regression if the counts are high enough to approximate the normal distribution. In that case, you might need to convert it to a rate for comparison purposes.
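As a small illustration of converting counts to rates, here is a sketch with made-up hospital numbers. Raw death counts are hard to compare when hospitals treat very different numbers of patients, so dividing by the number of patients puts them on the same footing. The hospital names, figures, and function name are all hypothetical.

```python
def rate_per_1000(deaths, patients):
    """Deaths per 1,000 patients treated."""
    return 1000 * deaths / patients

# Hypothetical (deaths, patients) pairs, made up for illustration.
hospitals = {"Hospital A": (12, 400), "Hospital B": (30, 2500)}

rates = {name: rate_per_1000(d, p) for name, (d, p) in hospitals.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} deaths per 1,000 patients")
```

In this made-up example, Hospital B has more deaths but a lower rate. In Poisson regression, the same idea is often handled by including the log of the exposure (here, the number of patients) as an offset.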
Alaa says
Hi, I am a student, am still confused which type of regression should I use, the Dependent variable is binary (Benign or Malignant).
Independent variables are three
1. Age group: 6 groups (1, 2, 3, 4, 5, 6).
2. Color: a binary variable with only two color options.
3. The third variable is also a binary variable.
What should I choose?
thanks in advance Mr. Frost.
Jim Frost says
Hi Alaa, you’d use binary logistic regression.
Zaafir says
Hi, I have 11 continuous random variables X1 . . . X10, Y. I need to create a regression model for predicting values of Y based on the values of X1 . . . X10. Which regression should I choose?
Jesse says
Hi Jim, sorry for the confusion! Each observation is a single Likert Value, so I guess Ordinal logistic regression applies. Could stepwise regression still be used with this kind of data?
Jesse
Jim Frost says
Hi Jesse,
Yes, it sounds like you’ll need to use ordinal logistic regression. Also, as I mentioned in an earlier comment, you’ll have some thinking to do about whether to treat the ordinal IVs as continuous or categorical. There are pros and cons both ways. More details in my ebook about regression analysis!
Jane says
I mean to say discrete and continuous variables…
Jane says
Hello
Thank you for your time in answering our questions.
I have a problem and I would like your help please.
I have a dependent variable that takes the value 0 and 1. This variable is explained by other independent binary and continuous variables.
Only in the survey the answers of my dependent variable are conditioned by the answer obtained to a previous question ( say Q1)
so that if Q1 = 0 “no” then Q2 (which is my outcome) is asked.
if Q1 = 1 “yes” then Q2 is not asked.
Q1 and Q2 are not exclusive and Q1 is not a selection issue. The individual interviewed can answer yes to Q1 and also answer yes to Q2. I don’t understand why the survey design excludes them and it really bothers me
My question is, apart from a multinomial model, do you see any other models that I could apply to get around this serious selection problem?
Jim Frost says
Hi Jane,
I’m not entirely clear about the scenario and what you want to analyze.
Q2 is your outcome. And, only particular respondents are presented Q2 based on responses to other questions. At first, it sounded like it depended entirely on their response to Q1, but that might not be the case. Maybe other answers as well? You’ll have to examine how the survey presents questions to respondents to understand that portion.
If you’re analyzing Q2 as your outcome variable, I’d suggest figuring out who exactly answered it. If you can identify a particular subpopulation that answers Q2, you can perform the analysis and understand that the results apply only to that specific subpopulation. If Q2 is binary, you could use binary logistic regression with Q2 as the DV.
I hope that helps!
Jesse says
Hi Jim, Thank you for the response! I’m not sure I understand your question, but I think they are the score of each item; the scores for each scale are not averaged or summed. I think because it’s a Likert style scale with an absolute zero (ratio data), we would treat it as a continuous variable. But would this be a stepwise regression method to see which variables stay in the equation (i.e., are still significant and have the highest coefficient)?
Thanks!
Jim Frost says
Hi Jesse,
I’m asking whether for each observation (i.e., row in your data sheet) the values for each variable are a single Likert value (1, 2, 3, 4, 5, 6, or 7). Or is each value an average or sum of multiple Likert scores?
It makes a difference. A straight Likert score item, as I described before, is an ordinal variable. It is not a continuous variable. However, sometimes you can treat sums or averages of multiple Likert scores as a continuous variable. In fact, if you’re using a single Likert score for each value of the DV, you’d need to use ordinal logistic regression rather than least squares regression.
Jesse says
Thank you for the response Jim! To clarify, we had customers fill out a survey, all Likert (1-7) style questions. We had an overall satisfaction question (Are you satisfied with our company overall?) and multiple satisfaction questions based on dept (e.g. How satisfied are you with our customer service?, How satisfied are you with our webpage?).
We basically wanted to see which dept was the best predictor of Overall satisfaction… essentially, which area of our company should we focus on most based on satisfaction scores?
Thanks again!!
Jim Frost says
Thanks for the extra details! I don’t use SPSS so I wasn’t familiar with their Enter Method!
One more question. Are the values for the independent variables and outcome variable for each observation the score of one item (e.g. 1, 2, 3, 4, etc) or the sum or average of multiple items? The answer to that will affect the method you can use. Least squares regression doesn’t play well with ordinal outcome variables. However, if it’s a sum of multiple Likert items, it might work fine.
For the IVs, if you’re using the value of individual items, it’s an ordinal variable which has the properties of both continuous and categorical (nominal). You’d have to go with one or the other. (I talk about that in my ebook about regression analysis.) If you use a sum or average of several items, you might be able to treat them as continuous IVs.
In theory, you should be able to do what you want to do. However, ordinal variables (such as Likert items) can be a bit tricky and you need to work out those details and choose correct analyses.
I hope this helps!
Jesse says
Hi Jim,
This is a simple regression model but I’m getting a difference of opinions on the answers. I’m looking at overall customer satisfaction (likert scale), and I want to know which area of the company is contributing most to this.. each area of the company each has their own likert scale. Would I just use overall satisfaction as a DV and put the different areas of the company as IVs in a ‘Enter’ type regression?
Thank you Jim!
Jim Frost says
Hi Jesse,
I don’t know what you mean by an “enter” type regression.
Here’s what it sounds like to me that you want to do. Please let me know if it’s incorrect. You want to see if there’s an association between area of the company and the Likert score? If so, you can do that. The dependent variable is Likert score, which is an ordinal variable if you’re just using a single Likert item for the DV values. In this case, you’d need to use ordinal logistic regression. If you want to use area of company (e.g., department) as an independent variable, that would be one categorical variable where the different values are the various departments. You can fit that model and see if there is a statistically significant association between department and the Likert score.
If you are summing multiple Likert scores and using that summed value as the DV, rather than just using the score for an individual item, you might be able to treat the DV as a continuous variable and use least squares regression or ANOVA to assess the differences by department.
I hope this helps!
Jesse says
And by “Enter Method” I meant in SPSS the “Enter” procedure in linear regression (as opposed to stepwise or backward elimination methods). I just put in all the different depts as predictors and total satisfaction as an outcome variable, then looked at which predictors were significant, and their coefficients.
James says
Hi Jim, thanks for all your great posts! Apologies if this has already been covered but perhaps you can offer some further clarification.
Let’s say I have two variables, a continuous variable (e.g. test performance) and a categorical grouping variable with three levels (e.g. group 1, group 2, and group 3).
My question relates to how to structure the regression analysis itself. We could model “test performance ~ group” (so test performance is the DV and group is the IV). Alternatively we could use a logistic model “group ~ test performance” (so group is the DV and test performance is the IV).
Are there rules of thumb here in deciding which way to structure the model? Are both valid or does it depend on the specifics of the experiment and the underlying data?
Any thoughts on this would be appreciated!
Jim Frost says
Hi James,
If you’re creating a model to explain the relationships between variables, you’ll use the explanatory variable(s) as the independent variable and the variable you want to explain as the dependent variable. Often you’ll use theory, subject-area knowledge, and literature to guide you about the direction of the relationship. In your example, if you think the group assignment might cause differences in the test scores, you’d use group as the independent variable and score as the dependent variable. You want to learn about how the different groups affect the test scores.
However, if the test scores measure some characteristic of the individuals that might affect the group that the individuals join, you could use test score as the independent variable and group as the dependent variable. You want to learn about how the characteristic that the test measures affects the group that individuals join.
So, a large part is the direction of the relationship that you are assuming to be true based on theory and subject-area knowledge.
However, if you’re creating a model to make predictions, then you’ll always use the variable you want to predict as the dependent variable. Imagine you want to predict variable A, which is correlated to variable B. Suppose the value of A causes changes in the values of B. In an explanatory model, you’d use B as the dependent variable because that fits the known nature of the relationships. However, imagine that you want to predict the value of A. In this case, you’d use A as the dependent variable and B as the independent variable.
In the end, you can really do the analysis either way depending on whether you want the model to explain the actual relationships between the variables or you’re making predictions for a variable.
Jessica Walker says
Hi Jim,
I am trying to determine the predictive power of a nominal independent variable with 8 groups (age cohorts) for a continuous count dependent variable (number of stress incidences). What test do you think would be most appropriate? My samples are also very small (<5) for some age cohorts.
Âákäsh Síñglâ says
Sir, I have this business problem:
A company collected data from 5000 customers. The objective of this case study is to understand what’s driving the total credit card spend (primary card + secondary card).
Prioritize the drivers based on their importance.
I have to build a linear regression model.
Please help me with how to deal with two dependent variables at the same time with one set of predictors.
Jim Frost says
Hi Âákäsh,
You’re asking about learning how to use, perform, and interpret multiple regression analysis. That’s way more than I can write about in a comment!
However, you’re in luck, I have written an ebook all about regression analysis, with a focus on multiple regression. Please take a look at my intuitive guide to regression ebook!
Naty says
Hi Jim,
Thanks for such an insightful article!
I am trying to predict the best time (date) to contact a customer for an offer by analysing purchase history. I want to use Regression for this but having a hard time to decide which one to use. I have considered using logistic Regression where I thought I will see if in a month it is best time or not. But on the other hand also think that date is a continuous variable so I might be better off using multiple linear regression.
Would be very helpful to hear your thoughts.
Many thanks,
Naty
Jim Frost says
Hi Naty,
In making this decision, a good place to start is to understand your dependent variable. What’s the outcome that you want to explain/predict? What type of variable is it? If it’s something binary, like Sale/No Sale, and you want to use independent variables such as date/time to predict that, you’d need to use binary logistic regression. With binary logistic regression, you can use independent variables such as time (continuous) and day of week (categorical/nominal) to predict the binary dependent variable of sales/no sales.
That sounds like the direction you might want to go, but it really depends on the nature of your dependent variable.
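To make the binary logistic regression idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the data are invented, `hours_from_noon` and `sale` are hypothetical variable names, and the model is fitted with simple gradient ascent on the log-likelihood rather than with the routines a statistics package would use.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, learning_rate=0.1, iterations=5000):
    """Fit P(y = 1) = sigmoid(intercept + slope * x) by gradient ascent."""
    intercept, slope = 0.0, 0.0
    n = len(x)
    for _ in range(iterations):
        grad_intercept = 0.0
        grad_slope = 0.0
        for xi, yi in zip(x, y):
            error = yi - sigmoid(intercept + slope * xi)
            grad_intercept += error
            grad_slope += error * xi
        intercept += learning_rate * grad_intercept / n
        slope += learning_rate * grad_slope / n
    return intercept, slope

def predict(intercept, slope, x):
    """Predicted probability of a sale at contact time x."""
    return sigmoid(intercept + slope * x)

# Invented data: contact time (hours relative to noon) and Sale (1) / No Sale (0).
hours_from_noon = [-4, -3, -2, -1, 0, 1, 2, 3, 4]
sale = [0, 0, 0, 1, 0, 1, 1, 0, 1]

intercept, slope = fit_logistic(hours_from_noon, sale)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
for hour in (-4, 0, 4):
    print(f"P(sale | hour = {hour:+d}) = {predict(intercept, slope, hour):.3f}")
```

A positive slope here would mean later contact times go with a higher predicted probability of a sale. A categorical predictor such as day of week would enter the same model as a set of 0/1 indicator columns.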
Yaynabeba says
Dear Sir,
If both independent and dependent variables are categorical and dependent variable have natural order, which test variables do you suggest?
Jim Frost says
Hi, sounds like you should use ordinal logistic regression!
I don’t know what you’re studying. You’ll have to use your subject area knowledge to determine which test variables to include. Depends on what you’re testing!
michael says
i am trying to analyse car accident data set.
there is 1 dependent and 1 independent variable in my equation
dependent variable is categorical (4 category) and represents number of wounded person
1 (1 person wounded)
2 (2 person wounded)
3 (3 person wounded)
4 (4 or more person wounded)
independent variable is also categorical (6 category) and represents crash type
1 (rear crash)
2 (side crash)
etcetera……
which type regression should i use to analyse crash type effect on wounded person number?
Jim Frost says
Hi Michael,
There would seem to be a variety of problems with modeling this data set.
The DV is sort of a count variable except for the four or more category. For count data, you can often use Poisson or Negative Binomial regression as I indicate in this blog post. However, because of the four or more category, it’s not really count data. I suppose the best option would be nominal logistic regression, treating your DV as a categorical variable.
However, there would seem to be too much information left out of your model to get meaningful results: the number of passengers in the car, speed, and seat belt usage, just to name three quickly. You might have the worst possible accident yet only have 1 person injured if there is only one person in the vehicle. I suppose if you had a large enough sample size you might get some meaningful information.
Again, you can try nominal logistic regression. Or even perform a chi-square test of independence because you have two categorical variables.
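For readers who want to see what the chi-square test of independence computes, here is a small sketch in plain Python. The counts are invented purely for illustration, only three crash types are shown, and the "head-on crash" label is an assumed category that does not come from the question.

```python
# Invented counts: rows are crash types, columns are the number of people
# wounded (1, 2, 3, 4 or more). None of these numbers come from real data.
observed = [
    [30, 12, 5, 3],   # rear crash
    [22, 15, 8, 5],   # side crash
    [10, 9, 7, 9],    # head-on crash (assumed category)
]

def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            # Expected count if row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

statistic, dof = chi_square_independence(observed)
print(f"chi-square = {statistic:.2f}, degrees of freedom = {dof}")
```

A 3 x 4 table like this one has 6 degrees of freedom, and the 5% critical value for a chi-square distribution with 6 degrees of freedom is about 12.59. A statistic above that value would suggest that crash type and number wounded are not independent, though the test is less trustworthy when expected counts are small.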
Dr. M.A Habib says
Dear Jim,
I have a scale (income) dependent variable and both scale and categorical independent variables. Which regression model is appropriate?
Abeeha Zaidi says
Dear Sir,
I have a continuous dependent variable and a categorical independent variable with three groups.
whereas my covariates are continuous and few are categorical as well. Which statistical test can i apply?
can i use linear regression or ancova?
Ranjith Kumar says
Dear Sir,
If both the independent and dependent variables are categorical, which test do you suggest other than chi-square, sir?
Jim Frost says
Hi Ranjith, you can use nominal logistic regression. This type uses categorical dependent variables with categorical and/or continuous independent variables.
Lore says
Hi Jim!
Thank you for the excellent explanations. I have plotted my data and my curve looks more like a logarithmic graph. How do I best calculate the correlation between my two variables?
Lore
Jim Frost says
Hi Lore,
It sounds like you might need a log-log relationship, which you can read about in my post about log-log plots. For details on a variety of other methods you can use, read my post about curve fitting.
Ah shoot, I noticed after replying that you’re asking about correlation rather than how to fit the curve! I need more coffee! Try using the Spearman Correlation which can assess monotonic relationships. That’s a fancy word that means that the variables tend to move in the same relative direction (either a positive or negative correlation) but the rate of change is not constant.
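The difference between the two correlations can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data are a made-up, noise-free logarithmic curve, and `pearson`, `ranks`, and `spearman` are helper names written for this example (Spearman's correlation is Pearson's correlation applied to the ranks).

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient (measures linear association)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up data following a logarithmic curve: always increasing, but the
# rate of change is not constant.
x = list(range(1, 21))
y = [math.log(v) for v in x]

print(f"Pearson:  {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```

Because the curve always rises, the ranks of x and y line up exactly and Spearman's correlation reports a perfect monotonic relationship, while Pearson's correlation falls short of 1 because the relationship is not a straight line.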
Sunita (@SunitapandeyNep) says
Hi Jim,
Thank you for posting several useful information regarding data analysis. I follow your website regularly for statistical matter.
I have one question. Can we natural log transform independent continuous variable. I have independent variable of different distance points (1 to 180 m with nineteen measured distance points with inconsistent data ranges).
My dependent variable is overdispersed count data (and I am using negative binomial with log link function).
Please could it be possible for you to answer this question.
Thank you.
Jim Frost says
Hi Sunita,
Thanks for reading! I’m happy to hear that you find my site to be helpful!
Yes, it is possible to use the natural log to transform a continuous independent variable. Of course, there’s no way I’d know whether it’ll resolve whatever problem you’re experiencing. I’d recommend fitting the model without the transformation and checking the residual plots. Then, transform your variable and fit the model again and check the residual plots to see if it helped. Transforming the IVs is particularly helpful for fitting specific types of curvature, which you can assess using residual plots.
Using transformations is a bit of a trial-and-error process. There are some guidelines that can help, but it often comes down to trying them to see if they do the trick. I write about this in detail in my regression ebook. Although, I don’t write about it in the context of negative binomial regression.
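The fit, check, transform, refit loop described above can be sketched as follows. This is a simplified illustration, not the negative binomial model from the question: it uses ordinary least squares with one predictor, the distances and responses are invented, and `fit_line` and `residuals` are helper names made up for this example.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope

def residuals(x, y):
    """Observed minus fitted values for the least squares line."""
    intercept, slope = fit_line(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

# Invented data: the response follows the natural log of distance, plus a
# small alternating wobble standing in for noise.
distance = [1, 2, 5, 10, 20, 30, 45, 60, 80, 100, 120, 140, 160, 180]
response = [2 + 3 * math.log(d) + (0.2 if i % 2 else -0.2)
            for i, d in enumerate(distance)]

raw_resid = residuals(distance, response)
log_resid = residuals([math.log(d) for d in distance], response)

sse_raw = sum(r ** 2 for r in raw_resid)
sse_log = sum(r ** 2 for r in log_resid)
print(f"SSE with raw distance:    {sse_raw:.2f}")
print(f"SSE with ln(distance):    {sse_log:.2f}")
print("Raw-fit residuals in distance order:",
      [round(r, 1) for r in raw_resid])
```

In practice you would plot the residuals against the fitted values rather than print them. A systematic curve in the raw-fit residuals that disappears after the transformation is the sign that the transformation helped.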
I hope this helps!
Elsa3335 says
Thank you for your time, Jim. Looks like i have to crack my head for some time.
Elsa
Elsa3335 says
Hi Jim, I’m completely lost when trying to choose the type of regression analysis for my problem. Is it nominal logistic regression, time series, linear (autoregressive model), nonlinear or other? Can you please give me some advice?
345
658
116
380
651
851
709
196
269
071
478
518
451
989
519
843
662
428
794
949
Suppose, everyday I’m given 20 data as above. They’re random numbers generated by three devices, namely Device A, Device B and Device C. Every device is capable to generate random numbers from 0 to 9.
For the above first data of “345”, “3” was generated by Device A, “4” was generated by Device B and “5” was generated by Device C. Similarly, for the last data of “949”, “9” was generated by Device A, “4” was generated by Device B and the last “9” was generated by Device C.
I want to be able to predict the next outcome variable after the last “949”.
I have a total of 300 historical data for 15 days, 20 data per day.
From my rough observation of the recent 300 data, every device tends to produce repeated sequences. For instance, Device A will repeat the sequence “86479” (as in the last 5 data). That means this sequence appeared at least twice within the 300 historical data for Device A. Similarly, Device B and Device C each tend to produce repeated sequences. Another example is the repetition of the sequence “51855” (the 2nd to the 6th data) in Device B.
I’d like to predict the next outcome variable after the last “949”. Is it possible to model it with nominal logistic regression, or time series (if i assume that 20 data was taken at successive equally spaced points in time)?
Otherwise, which model is more appropriate?
Thank you in advance.
Elsa
Jim Frost says
Hi Elsa,
Based on the limited knowledge that I have, this problem doesn’t sound like one that regression analysis can solve. If it’s random as you say, then you can’t predict it. However, if there are patterns in the data, it might be more of a pattern finding problem.
If there was some underlying phenomenon that influenced the numbers that the devices display, you might be able to measure the phenomenon and link it to outcomes and use regression analysis to model that process.
Sorry I couldn’t be of more help. But it sounds like you’ll need to use something other than regression.
Youssef says
Hi Jim,
Great website. I have a quick question; excuse me if you have covered it before and I did not notice. When you build a regression model, whether a multiple linear regression model or just a simple one, do you need to have all variables in the same units to conduct the regression? For example, if I want to study the relationship between stock market returns, which are in % terms, and funds flow, which is in dollar terms, do I need to transform both variables to the same unit of measure, say percentages, to study the regression? Thank you.
Sophie Holland says
Hi Jim,
Thank you so much for your reply, very helpful but I am just unsure as to how to test for a linear relationship between a categorical IV and continuous DV… Could you please advise?
When trying to do scatter plots you just get a straight line…
Thank you in advance for any help you can offer!
Jim Frost says
Hi Sophie,
When you have a categorical IV, the type of relationship you’re testing for differs from the one between two continuous variables. When you have a categorical IV and a continuous DV, the analysis tests for differences between group means. The categorical IV defines the groups in your data. It’s performing an ANOVA with your data to see if the DV mean varies between groups. And, you don’t use a scatterplot to graph that type of relationship but either a boxplot or individual value plot.
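As a rough illustration of "testing for differences between group means," here is a one-way ANOVA F-statistic computed by hand in plain Python. The scores are invented, the group labels simply reuse the three categories from the question, and `one_way_anova` is a helper name made up for this sketch.

```python
# Invented scores for three groups defined by a categorical IV.
groups = {
    "never": [4, 6, 5, 7, 5],
    "some of the time": [8, 9, 7, 10, 9],
    "all of the time": [12, 14, 11, 13, 15],
}

def one_way_anova(groups):
    """Return (F statistic, df between groups, df within groups)."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = 0.0   # variation of group means around the grand mean
    ss_within = 0.0    # variation of observations around their group mean
    for values in groups.values():
        mean = sum(values) / len(values)
        ss_between += len(values) * (mean - grand_mean) ** 2
        ss_within += sum((v - mean) ** 2 for v in values)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

for name, values in groups.items():
    print(f"{name}: mean = {sum(values) / len(values):.1f}")

f_stat, df_between, df_within = one_way_anova(groups)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

A large F-statistic means the group means differ by more than the within-group scatter would explain. A linear regression that enters the same categorical IV as indicator variables carries out the same comparison of group means.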
I don’t want to keep pushing my book, but I go through an example of a regression model with both categorical and continuous IVs in it and show how to interpret the results.
Sophie Holland says
Hi Jim,
I love the way you explain the different types of regression so I was wondering if you could help me on a project I am doing…
My IV is bullying victimisation which is categorical (never, some of the time, all of the time) and my DV is internalising problems which is on a scale from 1-20. I also have covariates, some continuous, some categorical. What statistical method do you suggest I use? I have quite a few outliers in the DV, but as I am using the MCS these are hard to remove, and due to there being quite a lot I don’t believe I have a logical reason to remove them. I am also not sure how to test for a linear relationship with my IV being categorical.
Do you have any suggestions for the above?
Look forward to hearing your thoughts!
Thanks!
Sophie
Jim Frost says
Hi Sophie,
In your other message, you clarified that the DV is a continuous variable, although constrained to a range of 1-20. Consequently, I’d start with a least squares linear model. Try that first and see if you can get a good fit by checking the residual plots. Specify a variety of models and possibly fit curvature and interaction effects. The constricted range of the DV might be a problem if it causes the scores to be extremely skewed. If your DV is very skewed, you might need to use a data transformation. The residual plots will help you determine what you need to do.
As for the outliers, if they’re a regular part of the distribution of outcomes for your population, it’s often best to leave them in as they represent variability that is inherent in the process. However, if they represent scores from people outside your defined population, or some other fluke during measuring, testing, etc., that doesn’t represent the typical process, consider excluding outliers. Just be sure to record the removed outliers and explain why. But, the default position should be to leave them in unless you have a good rationale for removing them.
I write more detail about the above in my ebook about regression. You might consider getting it because you’re using regression for your study.
Best of luck with your study!
Syed Abbas says
Hi Jim,
Could you please suggest any source where I can find why ordinal regression is preferable over multiple regression analysis. Even if you have written any thing on it that will be also great. I am in argument with my team to use ordinal but they says opposite so need some strong sources. I will be very thankful to your help!
Jim Frost says
Hi Syed,
When you have an ordinal dependent variable, you should use ordinal logistic regression. Ordinary least squares regression is unlikely to provide valid results when you have an ordinal dependent variable. Ordinal logistic regression was designed with ordinal DVs in mind. This decision is based on the type of variable you’re using for the dependent variable.
First, ensure that you have an ordinal dependent variable. If you do, I’m not sure why your team would resist ordinal logistic regression.
I don’t have a reference on hand at the moment, but I will check when I can. However, any advanced regression textbook that covers ordinal logistic regression will state this clearly!
Best of luck with your analysis!
Katy says
Hi Jim,
Thanks for your helpful blog.
I’m hoping you might be able to offer some advice on a project I’m working on to understand customer telephone contact. The sort of questions I’m looking to answer are as follows –
Are customers with certain characteristics more likely to contact us?
Do customers with certain characteristics contact us more frequently?
I have a snapshot from our customer database including customer characteristics such as age, gender, marital status, whether have children, geographical location, etc, as well as the number of telephone calls made in that month. One row for each customer.
My rough plan is to look at the data and run off some descriptive statistics (scatter plots, box plots, etc) to see how frequency of contact varies by different characteristics. Then depending on what that shows, I’m hoping to construct a regression model(s). This is where I’m a little stuck in terms of the type of model.
I’m planning to start by looking at the probability of a customer making a call (a binary logistic regression?) I think this would tell me which characteristics increase/decrease the probability of a customer making a call versus no call.
I’m then thinking an alternative would be to define the dependent variable on the basis of low/medium/high frequency callers and run a regression on that basis. I think an ordinal logistic regression?
Could I also use calls per customer per month as the dependent variable and run a standard linear regression? Or would this be classified as count data and require a different model?
Grateful for any thoughts.
Kate
Mario says
Greetings Jim
I conducted a simple multiple regression. I firstly tested all the assumptions and then created the models. I would like to know how I can validate these models? Do I have to perform additional tests or can I use information from the model analysis outputs? NB. I do not want to run complex tests that are difficult to interpret. Kind regards and many thanks,
Mario
Jim Frost says
Hi Mario,
Unlike most statistical procedures, with regression analysis you need to check the assumptions after you fit a model. You can’t check the assumptions before fitting the model. Sometimes you can get a sense that it’ll be hard to satisfy the assumptions before fitting the model. For example, if your dependent variable is highly skewed, it’ll be harder to satisfy the assumptions. So, fit the models and then check the assumptions. To learn more about that process, read my post about Checking the OLS Assumptions.
When you say “validate” the model, there are several ways I can interpret that. One is that you want to ensure that it satisfies all of the assumptions. If you mean that, read my blog post above. It doesn’t involve performing any complex statistical tests.
However, often people will validate a model when they use it to make predictions. They want to ensure that the model predicts new observations as well as it predicts the observations used during the model fitting process. If this is what you mean, you’ll need to have a training dataset and a validation dataset. Use the training dataset to fit the model. Then, apply the model to a separate validation dataset and assess how well the model predicts the new set of observations. This process is known as cross-validation.
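The training/validation idea can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the data are simulated, the model is a one-predictor least squares line, the 70/30 split is an arbitrary choice, and `fit_line`, `rmse`, and `holdout_rmse` are names made up for this example.

```python
import math
import random

def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope

def rmse(x, y, intercept, slope):
    """Root mean squared prediction error."""
    squared = [(b - (intercept + slope * a)) ** 2 for a, b in zip(x, y)]
    return math.sqrt(sum(squared) / len(squared))

def holdout_rmse(x, y, train_fraction=0.7, seed=0):
    """Fit on a random training subset; return (training RMSE, validation RMSE)."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    cut = round(len(indices) * train_fraction)
    train, valid = indices[:cut], indices[cut:]
    train_x, train_y = [x[i] for i in train], [y[i] for i in train]
    valid_x, valid_y = [x[i] for i in valid], [y[i] for i in valid]
    intercept, slope = fit_line(train_x, train_y)
    return (rmse(train_x, train_y, intercept, slope),
            rmse(valid_x, valid_y, intercept, slope))

# Simulated data: a straight line plus random noise.
rng = random.Random(42)
x = [i / 2 for i in range(60)]
y = [4 + 1.5 * v + rng.gauss(0, 1) for v in x]

train_error, valid_error = holdout_rmse(x, y)
print(f"training RMSE:   {train_error:.3f}")
print(f"validation RMSE: {valid_error:.3f}")
```

If the validation error is close to the training error, the model predicts new observations about as well as the ones it was fitted to. A validation error that is much larger is a warning sign of overfitting.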
Because your study involves regression analysis, you might consider buying my ebook: Regression Analysis: An Intuitive Guide. My ebook covers the material in more depth than the blog posts.
Best of luck with your analysis!
Min says
Hi Jim,
Thanks a lot for the excellent article!
Can I ask, if I have a mix of continuous and categorical dependent variables, can I still use partial least squares regression?
Thanks!
Jim Frost says
Hi Min,
As far as I’m aware, you can use PLS only with continuous dependent variables.
Abbas says
Thanks a lot Jim
Abbas says
Hi Jim you did not reply to me? 🙁
Jim Frost says
Hi Abbas,
So sorry about that! Sometimes when I get really busy some questions are accidentally missed. I’ll answer your previous question within a day.
Tanvir Ahammed says
Great post Jim. I really like the way you explain the different types of regression.
But I’m a little bit confused here. What if my dependent variable is a score, say having a value of 1-100, and my independent variable is a categorical one? Is Poisson regression OK in this case?
Jim Frost says
Hi Tanvir
Thanks for the kind words!
If you’re working with a score that has at least 10 values, you can often treat it as a continuous variable. Consequently, I’d try using linear regression.
Poisson regression wouldn’t really be the right choice because you’re not counting outcomes. That’s usually what you use Poisson regression for. Even then, if the mean is high enough, the Poisson distribution begins to approximate the normal distribution and you might still be able to use linear regression.
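The claim that the Poisson distribution approaches the normal distribution as the mean grows can be checked numerically. The sketch below, in plain Python, compares the Poisson probabilities with a normal curve that has the same mean and variance. `relative_gap` is a measure made up for this illustration (the largest difference between the two, divided by the peak Poisson probability), and the means 2, 10, 50, and 100 are arbitrary choices.

```python
import math

def poisson_pmf(k, mean):
    """Poisson probability of observing exactly k events."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))

def normal_pdf(x, mean, variance):
    """Normal density with the given mean and variance."""
    return (math.exp(-(x - mean) ** 2 / (2 * variance))
            / math.sqrt(2 * math.pi * variance))

def relative_gap(mean):
    """Largest |Poisson - normal| difference, relative to the Poisson peak."""
    upper = int(mean + 10 * math.sqrt(mean)) + 1
    pmf = [poisson_pmf(k, mean) for k in range(upper)]
    gap = max(abs(p - normal_pdf(k, mean, mean)) for k, p in enumerate(pmf))
    return gap / max(pmf)

for mean in (2, 10, 50, 100):
    skewness = 1 / math.sqrt(mean)   # skewness of a Poisson distribution
    print(f"mean = {mean:>3}: skewness = {skewness:.3f}, "
          f"relative gap = {relative_gap(mean):.3f}")
```

The skewness of a Poisson distribution is one over the square root of its mean, so the distribution becomes more symmetric, and closer to the matching normal curve, as the mean increases.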
At any rate, I’d start with OLS regression. If the scores are highly skewed, you might need to transform the data.
Best of luck with your analysis!
Anastasia says
Dear Jim,
What type of regression would you recommend in a case when the key independent variable is not regularly reported? My dependent variable is a firm’s investment policy reported every fiscal quarter. The independent variable of the interest is a debt covenant strictness reported just at the moment when the firm issues any loans. I’ve tried to use investment changes over next 4 quarters (8 quarters) after the loan inception as a dependent variable. But I’m not 100% sure that it is a correct empirical design. Is it possible to use as the dependent variable also the investment levels instead?
Thank you!
Best,
Anastasia
Prosper says
Hi, which one would be the best model when you want to establish the relationship between tax consultants and tax compliance?
Jim Frost says
Hi Prosper,
Usually you don’t pick the type of regression model based on the subject area. In other words, there is no model that is best for taxation studies. Take a look at the type of dependent and independent variables you will use in your model and then read through this post to see which models can handle those types of variables.
Best of luck with your study!
Aki says
Hi Jim,
Loved your post.
I wanted a suggestion from you regarding my project work.
Could you please suggest which model I can use for regression.
Following is the brief overview of the problem I am facing:
My dependent variable(y) is a nonnegative integer and independent variables are categorical as well as continuous variables.
There are 3 continuous variables and 4 categorical variables on which(y) would depend.
Could you please suggest which regression models should I test in this case?
Jim Frost says
Hi Aki,
If your nonnegative integer is a count variable, look in the section for Count Dependent Variables for ideas.
Best of luck with your analysis!
Catherine Glover says
Hi Jim,
I have been reading your articles on regression analysis – they have been super helpful for me as I am just learning about regression analysis!
I am writing a research proposal on what factors influence voters to still vote for their preferred party when they have knowledge of that party’s involvement in corruption.
Previous research has indicated that voters continue to support the party as they personally benefit from the corruption, whereas other academics conclude that loyalties to certain parties are a result of a clouded vision causing cognitive dissonance. The most frequently argued factor as to why individuals forgive a corrupt official is that the economy is secure and improving. Finally, the more extreme the individual’s party is (far left or far right), the more likely they are to vote for another party due to a lack of viable alternatives.
I wanted to use a MCQ questionnaire to collect data on all these factors and treat them as independent variables and use regression analysis. Is this possible? and if so how should I go about this. I have spent about 2 weeks on the proposal and I am not totally sure
Best wishes
Catherine
Jim Frost says
Hi Catherine,
I’m so happy to hear that my regression articles have been helpful! If you like them, you might consider my ebook, which is all about regression analysis.
Yes, it is possible to include MCQs in an analysis. The manner in which you include them depends on the type of answers. If they’re categorical (nominal) variables, such as, “What type of damage? Scratch Dent Tear,” you’d include it as a categorical IV. If it’s an ordinal variable, such as Likert scale responses, they straddle the line and you can potentially enter them in the model as either categorical variables or continuous variables. It depends on the nature of the data and the goals of your analysis. My ebook talks about all of these aspects, categorical IVs and how to handle ordinal IVs.
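Entering a nominal answer as a categorical IV usually means converting it to 0/1 indicator columns, with one answer left out as the reference level. Here is a minimal sketch in plain Python using the Scratch/Dent/Tear example above. `dummy_code` is a hypothetical helper written for this illustration, and picking "Dent" as the reference level is an arbitrary choice.

```python
def dummy_code(values, reference):
    """Turn a categorical variable into 0/1 indicators, omitting the reference level."""
    levels = sorted(set(values) - {reference})
    return [{f"is_{level}": int(value == level) for level in levels}
            for value in values]

# Invented questionnaire answers to "What type of damage?"
answers = ["Scratch", "Dent", "Tear", "Dent", "Scratch"]

for answer, row in zip(answers, dummy_code(answers, reference="Dent")):
    print(f"{answer:<8} -> {row}")
```

Each indicator's regression coefficient is then read as the difference in the mean of the DV between that answer and the reference answer. Most statistical software creates these columns automatically when you declare a variable as categorical.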
Best of luck with your analysis. It sounds interesting!
Mario says
Hello Jim
I trust you are well.
Could you perhaps tell me how I should go about testing the assumptions for generalized linear models.
Thanks,
Mario
Josh says
Hi, does your book cover all the regression analyses you covered in this blog post?
Jim Frost says
Hi Josh,
The ebook focuses on linear regression. I cover many aspects of that including continuous and categorical IVs, main and interaction effects, modeling curvature, checking assumptions, goodness-of-fit, predictions, etc. I offer a few glimpses into the other types of regression, but they’re not the focus.
I can foresee a future book going beyond linear regression and covering the other types in more detail.
To get a better idea of exactly what the book covers, go to my blog post about my ebook. At the bottom you’ll see screenshots of the full TOC.
You can also go to My Store and download a free sample of the ebook, which has the complete first two chapters.
I hope this helps!
Sidney Gueye says
Great thread, this was really informative and has made me rethink my current regression model.
My undergraduate dissertation project is on how perfectionist concerns, strivings, and coping skills (7 subscales) influence sport acute injury occurrence and frequency. The perfectionism data came in the form of a composite score made of two subscales from different questionnaires with different numbers of items (5 and 7 items). Thus, a standardised composite score was created. Which regression analysis tests would you recommend for an SPSS newbie?
Thanks
Frank says
Hi Jim,
It would be great to see the kinds of estimates that can be obtained from each of these models, and possibly some of the interpretations that might come with using one of these models. Thanks!
Frank
Jim Frost says
Hi Frank,
In future posts, I’ll do that! For now, you can see examples of nonlinear regression and binary logistic regression. I’m always adding posts and will get to the others at some point.
Thanks for writing!
rksd says
Hi Jim,
I’m having a bit of trouble selecting the right nonlinear regression for my analysis. I have data that looks like an inhibition dose response curve with a variable slope. The X is not dose however it is time and I am trying to compare the rate of closure between multiple samples over time. If I do a nonlinear fit of [inhibitor] vs response with a variable slope, the curve fitting fits my data very well. I was wondering if in this scenario using a generated hill slope would be sufficient to evaluate rate? Would there be a more appropriate regression or should I consider an alternative such as area under the curve?
Thank you so much for any advice!
Jim Frost says
Hi, I often find that in nonlinear regression models, subject-area knowledge becomes even more important than it usually is. In nonlinear models, you’re often trying to incorporate underlying physical or biological processes into the regression equation itself. I once saw biologists debate over which nonlinear model is best for plant growth. It turns out that the best model depends on a plant’s specific growth pattern. All of this is to say, I don’t feel like I’m qualified to answer your question. You should consult with subject-area experts and see how similar studies have tackled the problem. Sorry I couldn’t provide a more specific answer!
Paola Gabriela Villa Paro says
Hi Jim,
I’m searching for a regression model for the following: I have a categorical dependent variable which comes from a multiple options question, I mean, the individual was being asked for his preference about six different saving alternatives. He could choose all the alternatives that he prefers (he can choose even all the six alternatives). So, I don’t know which could be the type of regression that I could apply.
In conclusion, my dependent categorical variable is non-excluding so I can’t apply a multinomial logit or multinomial probit. Which other models could I use?
Thanks in advance.
Lingyun Huang says
Thanks Jim. Very appreciate it
Franco says
Hello Jim
I have a question about my study. Both the IV and DV in my data are categorical. IVs (nominal) are 5 types of behaviors and DVs (ordinal) are 3 types of reasoning. I am interested in how IVs predict DVs. I am wondering if logistic regression fit my study? Thanks in advance for your efforts.
Jim Frost says
Hi Franco,
It sounds like you need to use ordinal logistic regression!
Psy says
Hey Jim,
I got a query.
Suppose we are given voting turnout results (1 for yes, 0 for no) of 4 previous years. And we want to predict it for the next year, and hence we do not have the 2019 result column. Logistic regression need X train and Y train, but in this case we do not have a Y train. How should I proceed with this? Or should I use some other model?
Anu says
Dear Jim, I came across this article of yours. Very informative. I am a student doing chiasma count on chromosomes. Now I want to compare and contrast different varieties just based on this count number. The count range is limited, ranging from 2 to 8. I cannot go for Poisson distribution as it is known that the chiasma occurrence on chromosomes is not random. In fact, two pathways occur, one is non-random and the second is random, but most of the counts are accounted for by the non-random pathway. I do not know this, just based on the count data. Can you suggest any analysis which can be done to find out differences between varieties?
Allie Hicks says
Hi Jim!
Fantastic blog post. Thank you for laying this out so clearly.
For the regression I am trying to run, the independent variable is number of classes attended. The possible options being 1, 2, 3, and 4 classes attended. To me, this seems like an ordinal variable, but I am not sure.
Is it okay to proceed with a regular linear regression? The output variable is change in symptoms of depression (a continuous variable).
Thank you in advance!
Allie
Jim Frost says
Hi Allie,
That’s a count variable. Count variables can have nonnegative integers. This type of variable is different from an ordinal variable.
An ordinal variable does have a natural order. You can represent ordinal variables using numbers. So, that looks similar to a count variable. But, suppose you are recording how runners finish in a race. You have the first, second, and third place runners that you record as 1, 2, and 3. However, the second place finisher doesn’t necessarily take twice as long as the first place finisher even though 2 is twice the value of 1. You only know that 2nd place is after 1st place. 3rd place is after 2nd and 1st, but it’s not necessarily three times as long as 1st.
Contrast that with counts of 1, 2, and 3. A count of 2 is twice a count of 1. A count of 3 is three times 1. The difference between 1 and 2 is the same as the difference between 2 and 3. Etc.
When you have count IVs and ordinal IVs, neither are quite continuous (ratio) nor are they quite categorical (nominal). You’ll have to decide which way to enter them into your model. That depends on the nature of your data and the goals of your analysis. I write about this in my regression ebook that I just released. But, that’s the issue in a nutshell.
It’s often OK to include this type of variable in a linear model. Just be sure, as always, to check those residual plots to be sure that you’re fitting an adequate model!
Shohreh Kariminezhad says
Hi Jim!
First of all many thanks for your helpful, comprehensive tutorial on regression. I have a question that I really appreciate if you reply to it. What is the best possible way to choose the most relevant features in a linear regression model? How we can check the features?
Thanks in advance for your time
Jim Frost says
Hi Shohreh,
I’m not sure what you mean by “relevant features.” That can be anything from the type of data that a type of regression can model to key statistical output. Can you be more precise about what you’d like to know? Thanks!
Abbas says
Hi Jim,
Great and very helpful blog. the way you explain is really helpful.
I want to run a regression on SPSS using customer satisfaction scale (5 point scale) as dependent variable. and other variables as independent variables such as some are on 5 or 10 point scale, some are yes/no. so basically I want to check what are the variables that impact customer satisfaction. Could you please help me to find out which regression analysis I suppose to use?
Jim Frost says
Hi Abbas,
Sorry about the delay in replying!
Because your dependent variable is an ordinal variable, you need to use ordinal logistic regression. In SPSS, you should use either the PLUM (easier to use) or GENLIN procedure. You can find PLUM in the Output Management System Control Panel under Command Identifiers.
Best of luck with your analysis!
Chris says
Hello Jim,
First and foremost, thanks a lot for writing this blog – it’s pure gold.
However, I have a dilemma.
I have the data with 5 variables and with 100 units of observation. Every 10 units in the data encompass specific time frame – so first 10 units encompass one year, then next 10 units encompass the following year etc. For each of those 10 units, I have a distribution of 100% for each variable. This means that each variable in the first 10 units makes 100%, then the next 10 units also 100% etc. My dilemma is that units reflect the distribution of something (which makes a sum of 100% for each year) so I am not even sure can this be considered a continuous variable. Here is an example: https://i.imgur.com/dxCQfHT.png
But, to be concrete, I would like to know whether I can use multivariate regression analysis with this data where I would focus on one variable as DV and the other as IV in order to see which IV impact the most DV.
Jim Frost says
Hi Chris,
Thanks for the kind words about my blog! 🙂
It sounds like you want to perform regression analysis with time series data. There are no inherent problems with that but there are special issues you’ll need to address. For example, you’ll need to include time related independent variables in your model, such as the date and lagged variables. You’ll need more specific date information than just the year. You’ll need to know the time between observations. You’ll also need to pay particular attention to correlation between adjacent residuals. See assumption #4 in my post about OLS assumptions.
I don’t see a problem with having the total within a year sum up to 100% of that year. That’s going to be true with all time series data! At least if I understand what you’re saying.
I don’t see an inherent problem with doing what you want to do as long as you know the time between the observations. If there’s a constant time between observations (observations occurred at fixed intervals) you can probably use the ID column as an IV. Just be aware of the issues I mention. In addition to the time related variables, it’s OK to include other IVs. Regression with time series data is not my specialty, so that’s about as far as I can take you. But, it is doable.
If you can fit a good model, here’s a post that can help you determine which IV is the most important.
Bella says
Hello Jim, thank you for your helpful posts. I am trying to find out influence statistics for 32 states in a country and the states are in scale, nominal variables. How do I conduct spss with that kind of variable?
Jim Frost says
Hi Bella,
I’m not completely sure what you’re referring to by influence statistics. Are you referring to variables that influence an outcome? Or, perhaps statistics that identify unusual observations in a regression model that have a large influence on the results if they were to be removed from the dataset?
Bkr says
Thanks a lot
Another questions:
Can I use the Spearman coefficient to check the correlation between a binary variable and a continuous variable?
And can I use the Pearson chi-square test between nominal and nominal or ordinal variables?
Jim Frost says
Hi, no, you can’t use Spearman’s correlation for that. With binary data, you have two groups. You can use a 2-sample t-test to determine whether the difference between the means of the continuous variable for these two groups is statistically significant.
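As a rough illustration of the comparison described above, here is a minimal sketch in Python that computes a 2-sample t statistic by hand. It assumes the Welch (unequal variances) form of the test, which is one common variant and not necessarily the one intended in the reply. The function name `welch_t` and the data are hypothetical.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's 2-sample t-test."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    # Sample variances (n - 1 in the denominator), each divided by its group size.
    se2_a = statistics.variance(group_a) / n_a
    se2_b = statistics.variance(group_b) / n_b
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical continuous measurements, split by the two levels of a binary variable.
group_0 = [10.1, 11.3, 9.8, 10.6, 11.0]
group_1 = [12.4, 13.1, 12.0, 12.9, 13.5]

t_stat, df = welch_t(group_0, group_1)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

To get a p-value, compare the t statistic against a t distribution with the reported degrees of freedom, which statistical software does for you.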
Samuli Junttila says
Hi Jim,
first, thanks for this great website! Very helpful.
I have a question: I have a dependent variable that has been recorded as nominal categories (i.e. defoliation of a tree 0-25%, 26-50%, 51-75%, 76-100%). Should this be treated as nominal, with nominal logistic regression used, even though the intervals are equal?
I have also computed a tree attack level score based on similar categories (defoliation, discoloration, amount of resin flow etc.) by summing them, resulting in scores between 5 and 16. Can this score be treated as a continuous variable and OLS be used?
Thank you very much if you have time to answer!
Best,
Samuli Junttila
Jim Frost says
Hi Samuli,
I’m glad to hear my website has been helpful!
Your dependent variable has categories that have a natural order. Consequently, your DV contains ordinal data and you should use ordinal logistic regression.
I’m assuming the tree attack level score is the independent variable? If so, it is most likely fine to use it as a continuous predictor. It appears you have a good range of possible values (5-16), as long as the actual scores use most of the values. However, you can’t use OLS because you have an ordinal DV.
Be sure to check those residual plots! That’s always important but even more so when you’re combining non-continuous variables to form a different continuous variable.
Best of luck with your analysis!
Bkr says
Hi Jim
Have a nice day
Thank you for above great article, I have some questions I hope to give me advice.
I will conducting study about Risk factors of Anemia among children (< 6 years ) in Sudan, the DV is childhood anemia I will classifying according to hemoglobin concentration level (HCL) as anemic (HCL=11 g/dl)
IV’s are: sex of the child, Birth order, child’s birth weight, child’s age in months, mother’s education level, mother’s age at birth …. etc
this data I will analysis in two stage, first stage I will using bivariate correlation to identify the association between DV (childhood anemia) and IV’s to check whether the IV’s are significantly associated with anemia or not.
In the second stage I will use ordinal logistic regression between the DV and the IVs (those associated with the DV in the first stage) to build a model, and choose the best model based on the Akaike Information Criterion (AIC) and -2 log-likelihood.
The End.
My question Is this method it is correct?
Jim Frost says
Hi,
It sounds like your dependent variable is binary–anemic yes or no. If that’s true, you need to use binary logistic regression.
Additionally, assuming the DV is binary, you can’t use correlation to assess the relationship between the independent variables and the dependent variable. If you have the raw continuous HCL data, you can use correlation between the continuous IVs and HCL. Even though you’re using a binary form of HCL, that correlation with the continuous form can provide meaningful information. I would actually use scatterplots to graph the IVs by HCL. That will allow you to see whether there is curvature you need to model.
You won’t be able to use correlation with your IVs like education level because that’s a categorical variable. However, you can use graphs like boxplots or individual value plots to see whether there are differences between levels.
Finally, assuming your DV is binary, use binary logistic regression. Assess the CI for the odds ratio of each IV and determine whether it excludes 1. Check the residual plots!
Best of luck with your analysis!
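To make the odds ratio check concrete, here is a minimal sketch in Python. It assumes you already have a coefficient and its standard error from binary logistic regression output, and it uses a Wald-type 95% interval, which is one common choice. The function names and the numbers are hypothetical.

```python
import math


def odds_ratio_ci(coef, std_err, z=1.96):
    """Convert a logistic regression coefficient into an odds ratio with a Wald CI."""
    odds_ratio = math.exp(coef)
    lower = math.exp(coef - z * std_err)
    upper = math.exp(coef + z * std_err)
    return odds_ratio, lower, upper


def excludes_one(lower, upper):
    """True when the whole interval lies above 1 or below 1."""
    return lower > 1 or upper < 1


# Hypothetical coefficient and standard error for one independent variable.
or_est, ci_low, ci_high = odds_ratio_ci(coef=0.8, std_err=0.3)
print(f"OR = {or_est:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
print("CI excludes 1:", excludes_one(ci_low, ci_high))
```

An odds ratio of 1 means the variable does not change the odds, so an interval that excludes 1 corresponds to a statistically significant effect at that confidence level.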
Skip says
Jim,
If you have say 6 IVs and they are all categorical, doesn’t it limit the conclusions somewhat? Wouldn’t a mix of continuous and categorical be better? In other words, isn’t there more information in using interval variables or at least keeping the number of categorical variables to a minimum?
thanks!
Skip
Jim Frost says
Hi Skip,
In general, yes, continuous data provide more information than categorical data. However, you need to determine which variables actually explain your dependent variable. If for some reason categorical variables explain your DV better, well you’re pretty much stuck using categorical predictors!
You might have other reasons to use categorical variables or indicator variables rather than continuous data. For example, if you have a model that you’re using mainly for prediction and you need to use data that are easy to collect, you might use a series of indicator variables (aka dummy variables). This type of variable is simply a binary variable that indicates the presence or absence of a characteristic. This type of data might be easier/cheaper to collect for predictive purposes than gathering continuous measurements that are potentially difficult/expensive to obtain.
Consequently, it depends on your subject area and the purpose/context for your model.
However, when analysts have continuous data and they are considering converting it to a categorical variable (perhaps based on ranges), I’d recommend against that unless there is a very good reason to do so.
I hope this answers your question.
Mario says
Hi Jim.
I have a mixture of categorical and continuous independent variables. My dependent variable is continuous based on a score. I have 4 dependent variables measuring quality of life. The 4 dependent variables are correlated so I can’t use general linear models. Hope this helps
Jim Frost says
Hi Mario, having a mix of categorical and continuous IVs is not a problem for OLS. And, if you mean that the IVs are correlated, generalized linear models won’t fix that problem. Run your model using OLS and check the VIFs. If VIFs are high, you might need to use Ridge or LASSO regression. Read my post on multicollinearity for more information.
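For anyone curious what a VIF actually measures, here is a minimal sketch in Python using NumPy. Each predictor is regressed on the other predictors, and the VIF is 1 / (1 - R-squared) from that regression. The function name `vif` and the simulated data are hypothetical, and statistical packages report these values for you.

```python
import numpy as np


def vif(X):
    """Return the variance inflation factor for each column of X (rows = observations)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        y = X[:, j]
        # Regress predictor j on all the other predictors plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        factors.append(float(1.0 / (1.0 - r2)))
    return factors


# Simulated predictors: x3 is nearly a copy of x1, while x2 is unrelated to both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)

vifs = vif(np.column_stack([x1, x2, x3]))
print([round(v, 1) for v in vifs])
```

In this made-up example, the two nearly identical predictors get large VIFs while the unrelated predictor stays close to 1.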
Mario says
Hi Jim
I’m having a slight problem. I ran regression using Generalized linear models. How do I interpret the deviance or Pearson chi-square statistic for Goodness of fit? Are there ranges? My deviance statistic is quite high = 151.233.
Thanking you in advance
Jim Frost says
Hi Mario, first, help me understand why you’re using Generalized Linear Models rather than OLS? Thanks!
Rui Fang says
Hi Jim,
I have some questions:
1) Let say I have dependent variable which consists of ordinal data, 3 independent variable consists of ordinal data and 1 independent variable consists of nominal data, can I use ordinal logistic regression? Besides ordinal logistic regression, what is the regression analysis that I can use?
2) Before conducting the regression analysis, do I need to do any test on my data? For example the collinearity between independent variables.
Jim Frost says
Hi Rui,
Because you have an ordinal dependent variable, you’ll probably need to use ordinal logistic regression. For the ordinal independent variables, you’ll need to enter them in the model as either categorical (nominal) variables or continuous variables, depending on which form provides the best fit and the goals of your study.
As for collinearity, I usually fit the model and check the VIFs. However, VIFs only work for continuous data. If you enter the ordinal variables as categorical variables, you might need to use something like the chi-squared test of independence to determine if they’re correlated. Only strong correlations are problematic.
Best of luck with your analysis!
shahd says
Hi
I need to compare two curves to find if they are similar or not. My question is what is the most appropriate statistical procedure to compare them. I used the compare function in SAS. What about comparative density or kernel distribution can I use that for my comparison. Could you please advise with this
Mario says
Hello, Jim! Thank for this great post.
I am conducting a study which predicts happiness (continuous dv based on a score). I have a multitude of IV’s (20) which are both categorical and continuous. My readings point to using ‘generalized linear models’ which can accommodate both types of IV’s, or, should I firstly try and reduce the amount of IV’s (can I use univariate regression?) and then only use variables which were found to be significant in the GLM? Am I on the right track?
Thanks Jim
Jim Frost says
Hi Mario, unless you have count data or binary data for the dependent variable, I’d think that OLS linear regression would be your first choice. If your DV is continuous start with OLS. OLS can accommodate both continuous and categorical independent variables.
Simple and multiple regression are really the same analysis (OLS regression) but have different names based on the number of IVs. But, it’s the same analysis really.
I’d start by creating a matrix of scatterplots that graphs all of your IVs by the DV. Use those graphs along with subject-area knowledge to identify candidates. The scatterplots will also show you whether you need to fit curvature as well. Then, include your candidates in the model and go from there. For help on this subject, read my post about choosing the correct regression model.
Thanuja Mallikaarachchi says
Thank you, this article was really helpful for me.
Doe Godwin says
This is great and it has been of good use to me. Thank you for sharing. I have a concern, how will you interpret estimation done by using the PLS procedure.
Jim Frost says
Hi Doe,
PLS reduces the independent variables down to a smaller number of components that are not correlated. Then, it performs OLS on these components rather than the original data. Typically, you use PLS more for prediction rather than understanding the role of specific variables because the estimate relates to the components rather than the variables. If your goal is to understand the role of the independent variables using an approach more akin to least squares regression when you have problematic levels of multicollinearity, try using Ridge regression.
Susan H. says
Hello Jim,
Great article. I am attempting to measure the influence of culture strength on institutional performance. Two questions: 1) My IV is currently a percent change measure, and ranges from negative to positive, less than zero and greater than one. Is there any regression I can run or do I need to recode into a categorical variable for logistic regression? Although that would be challenging as well since I have three values and not two values – negative, no change, positive.
2) Can my DV be a measure of variance such as standard deviation?
Thanks,
Susan H.
Abdi says
am glad that I found your page. Thank you so much for this informative post.
Actually, I have questions regarding data analysis;I have both scale and nominal variables as independent and 4 categorical dependent variables,how can i analys by which model?
Thank you!!
Abdi
Jim Frost says
Hi, it sounds like you need nominal logistic regression assuming that your DVs don’t have a natural order. If they have a natural order, then ordinal logistic regression. You’ll need to fit separate models for each DV.
Best of luck with your analysis!
Harriet noble says
Hi jim,
Thank you for sharing your statistical Knowledge its very helpful indeed.
I just wondered, is there an easy way to tell from reading a research article whether a statistician has used a linear or logistic regression model? And if so, how?
Thank you
Jim Frost says
Hi Harriet,
The key giveaway would be the dependent variable. If the dependent variable is binary, ordinal, or nominal, the analyst should be using logistic regression. Also, if the article refers to odds ratios, link functions, or the study uses deviance R-squared, pseudo R-squared, or McFadden’s R-squared, they’re likely using logistic regression. Also, hopefully they’re transparent about which analysis they’re using!
shahd says
Hi Jim
How can I perform jackknife regression in SPSS, and what is the difference between jackknife and bootstrapping?
Sylvia Burgess says
Thanks Jim this is very helpful
Sylvia Burgess says
Jim, I was reading your comments and they seem helpful…so here goes. I have ELLESL Training as an ordinal variable and Math Scores as a continuous variable. Should I run linear regression or logistic regression?
Jim Frost says
Hi Sylvia,
I’m not sure which variable is your dependent and independent variable. It makes all the difference! From a similar comment you made, I’ll assume that the ELLESL training is the independent variable and you want to determine the impact on math scores as the dependent variable.
Assuming that is true, you’ll want to use linear regression because your dependent variable is continuous. You have an ordinal independent variable, which can be tricky to use. Ordinal variables are a bit like continuous variables and a bit like categorical (nominal) variables. Yet, it’s not quite either. You’ll need to include the ELLESL training as either a continuous or categorical independent variable. If you included it as a continuous variable you might need to use either a polynomial or log transformation to fit curvature that might be present.
My suggestion is to start by using linear regression and fit different models that include the training variable as a continuous variable and then as a categorical variable and see which provides the best fit. Categorical variables use more degrees of freedom than a continuous variable, which can be problematic depending on your sample size and the number of levels in your categorical variable.
Best of luck with your analysis!
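The suggestion to fit the ordinal predictor both ways and compare the fits can be sketched as follows in Python with NumPy. The data, the variable names, and the helper `fit_summary` are all hypothetical, and adjusted R-squared is used here only as one simple way to compare models that use different numbers of terms.

```python
import numpy as np


def fit_summary(design, y):
    """Fit OLS by least squares and return (R-squared, adjusted R-squared)."""
    n, k = design.shape  # k counts the intercept column too
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    r2 = 1.0 - float(resid @ resid) / float(np.sum((y - y.mean()) ** 2))
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k)
    return r2, adj_r2


# Hypothetical data: an ordinal training level (1-4) and a continuous math score.
levels = np.repeat([1, 2, 3, 4], 5)
scores = np.array([60, 62, 58, 61, 59,
                   70, 72, 69, 71, 68,
                   72, 71, 73, 70, 74,
                   73, 75, 72, 74, 71], dtype=float)
intercept = np.ones(len(levels))

# Option 1: treat the ordinal variable as continuous (one slope term).
design_cont = np.column_stack([intercept, levels])

# Option 2: treat it as categorical (one indicator per level, level 1 as the baseline).
design_cat = np.column_stack(
    [intercept] + [(levels == k).astype(float) for k in (2, 3, 4)]
)

r2_cont, adj_cont = fit_summary(design_cont, scores)
r2_cat, adj_cat = fit_summary(design_cat, scores)
print(f"continuous:  R2 = {r2_cont:.3f}, adjusted R2 = {adj_cont:.3f}")
print(f"categorical: R2 = {r2_cat:.3f}, adjusted R2 = {adj_cat:.3f}")
```

The categorical version uses three terms for the predictor instead of one, which is the extra degrees-of-freedom cost mentioned above. Residual plots still matter more than any single summary number.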
Charles says
Jim thank you for making statistics easy to learn. However, do you also have one for discrete time survival analysis?
Emma Kooij says
Hi Jim,
I was hoping you might help me shed some light on a math paper I’m doing for my Calculus class. I’m trying to find a regression that might predict the shape of a circle based on a rainbow arc.
Rainbows are actually circles, we just can’t see them fully from our perspective and I wanted to provide more evidence that a circle is in fact the best way to predict a rainbow’s curve using a regression. However, when I tried several regressions based on the points I had found using an outline of the visible arc, the end result looks more like very wide parabola rather than the narrow curvature of the actual circle it’s supposed to be. My professor is trying to push the point of having our paper sound as if we were explaining it to a high school freshman and that makes it a lot harder to explain and the mathematics somewhat tedious as we have to explain every step and justify our choices as well.
Is there any light you can shed on how I can make this a little more feasible?
Thank you,
Emma
Jim Frost says
Hi Emma,
You’ll most likely need to use a different type of regression analysis, such as nonlinear regression and then choose the correct form of the expectation function. At that point the analysis can fit the data to the model. I think that’s the answer to your problem. Use nonlinear regression and then find the correct expectation function–which I’d guess would be the formula for a circle.
I hope this helps!
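As one possible way to approach the circle problem, here is a minimal sketch in Python with NumPy. Note that it swaps in a different technique from the nonlinear regression suggested above: it rewrites the circle equation (x - a)^2 + (y - b)^2 = r^2 as x^2 + y^2 = 2ax + 2by + c, where c = r^2 - a^2 - b^2, which is linear in the unknowns a, b, and c and can be solved with ordinary least squares. The function name `fit_circle` and the arc points are hypothetical.

```python
import numpy as np


def fit_circle(x, y):
    """Algebraic least-squares circle fit; returns (center_x, center_y, radius)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Solve x^2 + y^2 = 2*a*x + 2*b*y + c for a, b, c.
    design = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
    target = x ** 2 + y ** 2
    (a, b, c), *_ = np.linalg.lstsq(design, target, rcond=None)
    radius = np.sqrt(c + a ** 2 + b ** 2)
    return float(a), float(b), float(radius)


# Hypothetical points traced along a visible arc (30 to 150 degrees)
# of a circle centered at (3, -2) with radius 5.
theta = np.linspace(np.radians(30), np.radians(150), 25)
arc_x = 3 + 5 * np.cos(theta)
arc_y = -2 + 5 * np.sin(theta)

center_x, center_y, radius = fit_circle(arc_x, arc_y)
print(f"center = ({center_x:.3f}, {center_y:.3f}), radius = {radius:.3f}")
```

A polynomial such as a parabola can look reasonable over a short arc, which may be why the earlier attempts resembled a wide parabola. Fitting the circle equation itself forces the result to be a circle.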
Seun Opaleye says
Hi Jim, a quick one.
I sent a paper to a scopus index journal.
My model is a panel data with n=4 & T=21
I used fixed effects model to analyse the data.
The journal sent it back saying we have to test for unit root and stationarity because the data is 21years x 4countries. And one reviewer said try ardl.
What is your advise and how do I take care of this?
Jim Frost says
Hi Seun,
I’m not an expert in time series regression models. However, I can confirm that those are legitimate concerns. I have not used ARDL (autoregressive distributed lag) myself. I have used ARIMA (autoregressive integrated moving average) a bit. I don’t know enough about the analyses or your subject area to make a recommendation, but yes, you’ll probably need to use something like those analyses. The problem is that you need to account for time order effects and avoid serially correlated residuals.
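As a small illustration of what serially correlated residuals look like numerically, here is a minimal sketch in Python with NumPy. It uses the Durbin-Watson statistic, a common diagnostic that is not mentioned in the reply above and is added here as an assumption about what a reader might use. The residual values are made up.

```python
import numpy as np


def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences over sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))


# Hypothetical residuals in time order.
# Values near 2 suggest little lag-1 correlation, values toward 0 suggest
# positive serial correlation, and values toward 4 suggest negative serial correlation.
trending = [1, 1, 1, 1, -1, -1, -1, -1]   # neighbors tend to share a sign
alternating = [1, -1] * 5                  # neighbors tend to flip sign

print("trending:   ", durbin_watson(trending))
print("alternating:", durbin_watson(alternating))
```

This only checks correlation between adjacent residuals. It does not replace the unit root and stationarity tests the reviewers asked for.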
shahd says
So does the multiple regression model differ from the least squares model? Because both gave me the same results.
Jim Frost says
There are different types of multiple regression, but, yes, it usually refers to OLS. If the results match, that’s what you’re using.
Sam says
ah ok thanks for all your help!
Sam says
For my positive data, it is split into ‘yes’ and ‘unsure’ and ‘no’ so I assumed that is a nominal variable? I may be wrong though!
Jim Frost says
Those are ordinal data. Two-way ANOVA might still work, but be sure to check the assumptions on the residual plots.
Sam says
Hi Jim,
thank you for your feedback. I have double checked and my variables are nominal! I am a bit confused about whether it is possible to carry out a two-way ANOVA on just two nominal variables, as I have looked online and the assumptions say I need two independent variables and a dependent variable?
If so, I would do the two-way ANOVA and then the post hoc test?
Jim Frost says
If they are nominal, then your comment about “more positive reviews” is not consistent with that.
Yes, two-way ANOVA is designed exactly for that scenario. Your two independent variables are your two nominal/categorical variables. The dependent variable is your continuous outcome.
Yes, do the two-way ANOVA. If you find that one or both of the independent variables are significant, perform the post hoc test.
Shahd says
Hi Jim,
I have a salary data set with 10 variables. I checked for multicollinearity and there is none. I need to use an advanced modeling technique other than regression analysis to get a higher grade in my assignment. Could you please advise on this? What about principal component analysis? Will this improve my prediction model?
Jim Frost says
Hi Shahd,
It all depends on the nature and characteristics of your data. There is no inherently better analysis. In fact, if a linear model fits your data, then OLS is the best possible analysis to use. To see why, see my post about the Gauss-Markov Theorem and OLS Estimates.
However, if your data has specific problems, such as multicollinearity, other methods can be better. Typically, you’d use PCA when you have a large number of correlated predictors, often in conjunction with a small sample size. In that scenario, you could also use Partial Least Squares (PLS) regression.
OLS is a great starting place and if you find that it somehow fails to fit your data adequately, use that information about how/why it fails to find another analysis that addresses the problem. And, I wouldn’t say the alternatives are more advanced, they just address various problems/characteristics of your data. So, I really can’t make a recommendation. It depends on the specifics of your dataset.
Sam says
Also, I am a bit unsure how I would analyse the data once the anova tests have been run because even if it tells me there is a difference, I don’t think it tells you the relationship between the two variables e.g. I am trying to find out if those with more positive views buy more Fair Trade. I am assuming the test would tell me there is a difference between views and goods bought, but not whether those with positive view bought more? I may be wrong!
Thanks again for your help!!
Jim Frost says
Hi Sam,
For two nominal variables, you’d need two-way ANOVA. Typically, you’d perform ANOVA first and that tells you whether there is a statistically significant difference between group means. However, as you indicate, it doesn’t tell you the nature of those differences. To determine whether differences between specific pairs of groups are significant, you’ll need to perform a post hoc test, such as Tukey’s. I need to write a post about post hoc tests! I don’t have one yet!
Now, one thing I notice in what you wrote. You mention “more positive reviews.” That sounds like it’s potentially not a nominal variable. Possibly ordinal or continuous? If so, that changes things. ANOVA is for nominal (categories). Although, it might possibly work for groups based on ordinal data. It can be tricky figuring out how to analyze data when you have ordinal data as a predictor/IV.
Sam says
Hi Jim,
That’s very helpful thank you! Would I be able to use a one-way ANOVA for two nominal variables?
Thanks again!
Sam
Sam says
Hi Jim!
I would be very grateful for your help with my data analysis.
I have a dependent continuous variable and an independent nominal variable. I wish to use correlation analysis to examine the relationship between the two variables but I do not think this is possible. How else could I analyse the two variables to see if there is a correlation?
Thank you!
Sam
Jim Frost says
Hi Sam,
It sounds like you need to use one-way ANOVA. This analysis will tell you whether the mean of your continuous DV is significantly different across the levels of your nominal IV.
Because the IV is nominal, regular correlation is not possible in the normal sense. In other words, you can’t increase or decrease a nominal variable to see how the other variable tends to change. Nominal variables are a difference in type. Such as type of damage: scratch, dent, and tear. ANOVA will tell you whether the different nominal values (types) are associated with different mean values of your DV.
I hope this helps!
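To show what one-way ANOVA computes, here is a minimal sketch in plain Python. It borrows the damage types from the reply above as group labels, but the repair-cost numbers and the function name `one_way_anova_f` are made up for illustration.

```python
def one_way_anova_f(groups):
    """Return (F statistic, df between groups, df within groups) for one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical continuous DV (repair cost) for each level of a nominal IV (type of damage).
scratch = [120.0, 135.0, 110.0, 128.0, 122.0]
dent = [210.0, 190.0, 225.0, 205.0, 198.0]
tear = [310.0, 295.0, 330.0, 305.0, 318.0]

f_stat, df_between, df_within = one_way_anova_f([scratch, dent, tear])
print(f"F = {f_stat:.2f} with ({df_between}, {df_within}) degrees of freedom")
```

A large F statistic means the group means differ by more than the within-group variation would suggest. Software converts it to a p-value using the F distribution with those degrees of freedom.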
Ellie Sharaki says
Hi, Jim.
If my parameter estimates in ordinal regression show negative values, do I need to report them one by one or does it suffice if I report the statistical (in)significance in the sig. column?
Thanks,
Ellie
shahd says
I need to use advanced modeling to get an A in this class. Can I use the lasso regression method? If I don’t include the interaction term, will this be a problem?
Jim Frost says
Hi, they might if you can show that lasso fits a better model than OLS. Typically, you’d use lasso when you have multicollinearity. It does introduce a bit of bias to reduce the variance. That can be worthwhile in some scenarios. I have no idea if that will produce a better model for your data, but you can give it a try! If it doesn’t produce a better model, it’s not worthwhile doing.
As for the interaction term, again, that depends on your data and study area. If an interaction effect actually occurs in the study area, not including it in the model can produce invalid results. However, not all study areas have interaction effects. Read the link I shared with you in my previous reply.
Best of luck with your analysis!
shahd says
I did that and I got a good fit. And I need to show what other possible regression model for the salary data set as I was wondering if I have to include interaction term because 3 of those variables are dummy. could you please advise with this
Jim Frost says
Hi Shahd,
Fitting the correct model is a balance between statistics and subject-area knowledge. There’s no way I can tell you exactly what you need to do. However, I write about the things you need to consider in my post about specifying the correct regression model. I think that post will be particularly helpful.
If you need to learn more about interaction effects, I wrote about that as well. You can include interaction terms for dummy variables. Would that make sense theoretically for your study area?
Best of luck with your analysis!
shahd says
Hi
What if I have a salary data set and my dependent variable is continuous and I have 10 independent variables one of them is not linear and I transformed it to be exponential
Jim Frost says
Sounds like you should start with OLS linear regression and see if you can get a good fit. That’s where I’d start.
Elisa says
Sorry, I forgot to mention that variables in factor 1 and 3 can follow linear, quadratic… different models! That is why I also wanted to analyse them individually.
Elisa says
Thank you Jim! I am getting a clearer idea! I have a model which looks like this: factor 1 –> factor 2 –> factor 3. Factors 1 and 3 are formed by different ordinal variables and factor 2 is an ordinal variable! So I am performing two analyses. One says factor 1 is independent and factor 2 is dependent. The other says factor 2 is independent and factor 3 is dependent. Should I perform logistic regression in both cases? Or can I consider the independent variable as continuous and the dependent variable as continuous/categorical, depending on the better fit for both analyses? In this way, I would perform a normal regression analysis. Another question I have is: if I say factor 2 is continuous in analysis one, must I say it is continuous for the second analysis? Or can I use it as continuous for analysis one and categorical for the second one? I think I should treat it the same in both analyses, so in this case the option would be to assume all the factors are continuous and perform OLS in both cases.
Thank you again!
Elisa says
Hi Jim,
I am quite confused about which model to use. I have some items that affect a variable. The items follow a Likert scale (from 1 to 5), and the variable is an ordinal variable (there are 4 groups that go from highest customization to lowest customization). First I would like to study independently how each item affects the variable, to see if my hypotheses are correct. Can I use a fitted line plot? I am not sure which model I should use. And what about analysing the impact of all the items on the ordinal variable? The data do not follow a normal distribution.
Jim Frost says
Hi Elisa,
First you need to figure out which variable is your dependent variable and which are your independent variables. It’s not clear from your comment which are which.
If your dependent variable is ordinal data, you’ll need to use ordinal logistic regression. If the DV is continuous, probably OLS linear regression is a good place to start.
It sounds like your independent variables are ordinal data. That can be tricky because ordinal data are a bit like categorical/nominal data and a bit like continuous data. You’ll have to try including them both ways to see which one produces a better fit. All things being equal, you’d prefer to include them as continuous data. When you include them as categorical data, behind the scenes, the analysis is actually fitting many variables that relate to the levels of each categorical variable. In short, you’re using many more variables going the categorical route, which can cause problems if you treat multiple variables as categorical. However, categorical can sometimes provide a more flexible fit.
You might be able to sum the ordinal variables to create a pseudo-continuous variable. I don’t have experience doing that myself. It seems like ordinal variables are common in some fields, but not others. And, they weren’t in mine. Of course, if you combine them, they might be easier to analyze but you won’t be able to isolate the role of each.
Basically, there’s a lot going on here that goes beyond what I can address in comments. But, that’s the situation in a nutshell! It’ll probably require some research and experimentation on your part to find the best solution out of several possibilities.
Finally, we’re getting to several clear cut answers now! If you include your variables in a regression model, the simple fact that they’re in the model means that the analysis controls for them, and you can assess the independent effect of each variable. In other words, when you assess the effect of each variable, the model is holding constant the values of all the other variables that are in the model. That makes interpretation easy!
As for how to display the effect, you can’t use a fitted line plot when you have more than one independent variable, because you’d need more than two dimensions to display it! However, you can display the effects on a main effects plot. These are specialized plots that graph the fitted values for values of each variable while holding the other variables constant. How to create main effects plots depends on the software you’re using, but they’re a nice feature because you can use them when you have more than one independent variable. It really helps you to graphically isolate the role of each variable.
I show an example of a main effects plot in my post about interaction effects (most of the graphs in that post are interaction plots but there is one example of a main effects plot). I should probably write a post about those some time soon!
Best of luck with your analysis!
Al says
Hi Jim,
Do you have a tutorial on variable selection for logistic regression? I am studying from a textbook that discusses the topic for MLR — the software produces a best-subsets table with adjusted R-squared and Mallows’ Cp for all the different variable combinations. For logistic regression the table contains RSS and Mallows’ Cp, along with something labeled “probability.”
Thanks,
Al
Marcelo Ribeiro says
Hallo Jim,
I have a DV, normally distributed, in percentage (the % of women on the Governing Board in different companies around the world) and my independent variables (national gender imbalance indicators) are in %, continuous and ordinal.
Which model should I use? (And do you have a post about the chosen model that would help me test its assumptions in Stata to obtain consistent results?)
I checked that my DV and IVs are significantly correlated but there’s no significance when using a GLM model…
thank you,
Marcelo
Mridha says
I appreciate your work. I have learnt a lot from your blog.
Jim Frost says
Thanks–I’m glad it’s been helpful!
Faruque says
Hello Jim,
I am glad that I found your page. Thank you so much for this informative post.
Actually, I have questions regarding the difference-in-differences (DID) method. I want to employ the DID method in my paper to find out the pre-post treatment effect of adopting a technology. When I discussed it with a colleague, he recommended that I not choose the treatment and control groups randomly but instead choose them from a project. For instance, if the government decides to impose a new policy for X region in Y year and the whole population has no alternative other than to adopt, then we can randomly choose the treatment group from region X while the control group should come from other provinces.
But what if the government says it’s not mandatory but people who adopt will get many benefits? In this case, if we want to find out the pre-post benefit effects between adopters and non-adopters, can we also employ the DID method?
Thank you very much.
Jim Frost says
Hi Faruque,
I haven’t used the difference-in-differences method myself. I know it is a method that tries to create an experiment using observational data. So, I don’t have firsthand insights.
One difficulty I see in your scenario where it is not mandatory but people choose is that there might be (probably is) a difference between those who choose to participate versus those who do not. If the program provides many benefits, there must be some reason why particular people do not join the program. Whatever that difference is, you’re starting out with it as a selection bias for adopters/non-adopters. This preexisting difference could bias the results. It’s entirely possible that those who don’t join do so because they won’t receive as many benefits as the average person. If that’s the case, this bias will artificially inflate the difference in benefits between adopters and non-adopters.
It might be that your colleague is making that suggestion about different provinces because it prevents that decision as a source of bias. Maybe. Of course, you then have to worry about whether the citizens of the various provinces are different in some other systematic manner. Such is the nature of observational data!
I always hesitate to offer suggestions based on limited information. But, these are the types of issues to consider.
Ellie Sharaki says
Dear Jim,
I can’t thank you enough for your advice. I’ll redo the analysis using ordinal regression.
All the best,
Ellie
Ellie Sharaki says
Hi, Jim,
I already sought your advice regarding the multinomial regression I ran for my IVs (teachers’ gender, years of experience, qualifications, grade they teach) and DV (English teachers’ agreement to promote learner autonomy, with four values: (1) agree and practice, (2) agree but not practice, (3) unsure, (4) disagree). Among the IVs, gender produced very odd Exp(b) values (e.g., 3972393.841!). So, I was thinking of running ordinal regression, though the four values defined for the DV are not naturally ordered. This yielded reasonable results, all less than 1.
What do you think? Should I report the results of ordinal regression for gender separately or is something wrong in the multinomial analysis which I should fix?
I deeply appreciate your time and advice.
Best,
Ellie
Jim Frost says
Hi Ellie,
I looked back at your first comment to refresh my memory. I see that it wasn’t totally clear about the nature of your DV. If your DV is discrete and has a natural order, yes, you should use ordinal logistic regression. If the DV is discrete and there is no natural order (i.e., nominal/categorical), you should use nominal logistic regression.
It appears with the new information that there is a natural order, so you should probably use ordinal logistic regression and disregard the other results.
E says
Hi Jim,
This blog is great! If you don’t mind, I was hoping you could help me with something. I’m looking at employing a cross-sectional study to determine health needs in a community. With a couple of exceptions (like age, and the distance it takes someone to drive to the hospital), the majority of my variables are nominal (the questions ask about perceived needs, opinions on community strengths, etc.). I believe I need to do a regression analysis but I’m not entirely sure which one. I think I need to use nominal logistic regression but am curious about your thoughts. I might be missing something… Thank you so much!
Sincerely,
E
Jim Frost says
Hi, it really depends on the type of dependent variable that you have. I couldn’t tell what variable is your dependent variable, so it’s hard to say. You can include nominal variables as independent variables in quite a few different types of regression analysis, which doesn’t narrow it down. If you can clarify the dependent variable specifically, I can give you a better answer. Thanks!
Levan Mumladze says
Thanks for your post. That is exactly what is needed for non-statistician researchers. But I think it would be even more helpful if one could find other cases as well. For instance, when the dependent variable is a proportion between 0 and 1, and the independent variables are either continuous or categorical. I found beta regression suggested for that case. Also, I am pretty sure there are other varieties of regression.
best regards
Jim Frost says
Hi Levan,
I’m sure there are many additional types of regression analysis. This post was meant to cover the most common types. Additional research might be required to determine the correct type for more specialized cases.
Timothy Dickson says
Hi Jim !
So basically I have confused myself.
I have an independent variable of age
And the dependent variable is a total of 8 yes/no decision tasks. (yes=1 and no=0, so total scores 0-8)
Can you please clarify my understanding? If I was testing each question individually I would use a binomial logistic regression, but by totalling scores I have created a continuous dependent variable and should proceed with a linear regression ? (and if the model is not adequate potentially a nonlinear)
Jim Frost says
Hi Timothy,
Yes, if you analyzed each question individually, you would use binary logistic regression.
Now, summing those eight items together to create a continuous variable might be a bit debatable. The rule of thumb that I’m aware of is that if you have a discrete variable that has 10 equally spaced values and your distribution covers that range, you can consider it a continuous variable. You’re close to that with 8. And, it appears like you satisfy the equally spaced aspect. However, I don’t have a good reference for my old rule of thumb! I’m not sure how widely accepted the rule is.
What I would do is try fitting the model using it as a continuous variable. Then, be very sure to check the residuals. That’s always a good practice. If the residuals look good, you’re probably fine fitting the model that way. In your write up, you might spend a bit more time explaining how the model satisfies the assumption despite the somewhat unusual dependent variable to allay any fears about that aspect. Also be aware that if you use the model for prediction, you’ll get decimal values and predictions that are potentially past the ends of the data range.
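To make that concrete, here is a minimal sketch that fits a simple least squares line to made-up age and total-score data, lists the residuals, and shows how predictions come out as decimals and can land outside the 0-8 range. The data values are invented purely for illustration and are not from the study described in the question.

```python
from statistics import mean

# Made-up data: age and total score (0-8) on eight yes/no items.
ages = [20, 30, 40, 50, 60, 70]
scores = [2, 3, 4, 6, 7, 8]

# Simple least squares estimates for score = intercept + slope * age.
x_bar, y_bar = mean(ages), mean(scores)
sxx = sum((x - x_bar) ** 2 for x in ages)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, scores))
slope = sxy / sxx
intercept = y_bar - slope * x_bar


def predict(age):
    """Predicted total score for a given age."""
    return intercept + slope * age


# Residuals: inspect these (ideally in plots) before trusting the model.
residuals = [y - predict(x) for x, y in zip(ages, scores)]
print([round(r, 2) for r in residuals])

# Predictions are decimals and can fall outside the 0-8 range.
print(round(predict(33), 2))
print(round(predict(90), 2))
```

In practice you would look at residual plots rather than a printed list, but the numbers are the same ones those plots display.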
Best of luck with your analysis!
Tim says
HI Jim!
I am using age as my independent variable (my stats professor said not to code the groups into older and younger) and a total score from a yes/no decision task as my dependent variable (yes=1, no=0) over 8 related questions (so the total score can be anywhere between 0 and 8). Does this become a continuous dependent variable? And if my understanding is correct, should I use linear regression rather than binomial logistic regression (which I would use if I was testing each question individually)?
Dave says
Jim, I’m a novice trying to recall old business school stats. How do I interpret a high p value for the intercept (0.069) but a significant p (0.01) for the dependent variable?
Jim Frost says
Hi Dave,
A nonsignificant p-value for the constant technically means that you fail to reject the null hypothesis that the constant equals zero. In other words, you have insufficient evidence to conclude that the constant is different from zero. However, you typically should not interpret the constant and its statistical significance for a variety of reasons. I talk about these reasons in my post about the regression constant.
However, the significant p-value for the independent variable (IV) (and I’m assuming you do mean independent rather than dependent because there are no p-values for the DV) is much more important. First off, there’s no need to attempt to draw a connection or explanation between the lack of significance for the constant and the significance for the IV. Those are independent things. The significant IV indicates that you have sufficient evidence to conclude that there is a relationship between the IV and the DV. For more information about this interpretation, read my post about regression coefficients and p-values.
I hope this helps!
immaculate says
Hello, I am grateful for your detailed work. I am a student doing my first research, and it is really hard for me right now. My topic is “determinants of male participation in family planning decision making,” and my dependent variable has three categories (mainly respondent, mainly husband, joint decision). The independent variables are demographic and socioeconomic determinants. Could you please help me with which type of regression analysis I should use in SPSS?
Ellie says
Hi Jim,
I need to see if there is any association between 4 IVs (teachers’ gender, age (four groups coded nominally), grade they teach, years of experience (three groups coded nominally)) and one DV (belief in learner autonomy at four levels).
I am using multinomial logistic regression, and of course I haven’t observed any significant association between my predictors and dependent variable. Have I chosen the correct stat?
Thanks.
Jim Frost says
Hi Ellie,
For your dependent variable, do the four groups have a natural order? It sounds like they might.
If they do, that order contains some information that you’re missing out on. Instead, you should use ordinal logistic regression.
However, if the groups do not have any type of natural order, then it sounds like you’re using the correct analysis.
I hope this helps!
Ellie says
Agree with you. Thanks, Jim.
Nikola says
Hi Jim. Please, can you tell me which statistical method I should use to analyze the impact of an independent variable (a numerical variable that has the same value over the period 2012-2016) on dependent variables (numerical variables that have different values in each year of the period 2012-2016)? Thank you.
ab says
Hi Jim, I have both independent and dependent variable in likert type (strongly agree, agree, somewhat agree, disagree and strongly disagree). What kind of regression method is helpful in order to find the effect of predictor variable on response variable. Thank you
Rajesh Kavediya says
Dear Jim. Really a good post explaining the types of regression to be used in various situations. I need your help. I am working on analysing the determinants of inflation expectations. My dependent variable is categorical, i.e., between 0-1, 1-3, 3-5, 5-10, 10-15, and above 15 (inflation expectations), and the independent variables are either categorical (like age group, income, and education level) or macroeconomic variables like inflation, unemployment, and growth. I will be grateful if you could suggest the appropriate regression framework/model.
vivi says
Hi Jim,
I wanna ask about type of data for multiple regression
I use google to get a rating from a place, for example the ranking for the Eiffel Tower is 4.6. I know, on Google itself, this value is the result of processing between the ratings given by the review of the place being assessed and other factors.
What I want to ask, is a value of 4.6 called ordinal data or numerical data (scale or interval)? so that it can be used as a variable from multiple regression. Because other variables are types of intervals or ratios.
Even if the ranking value is ordinal, should it be changed to numeric first so that it can be used in multiple regression models?
Jim Frost says
Hi Vivian,
Sorry about the delay in replying to your question. I’ve been away traveling.
Ratings are usually ordinal data. For example, if diners can rate a restaurant from 1 to 5 stars, it’s an ordinal scale. You can average the number of stars and obtain an average of 4.6 or other value. But the data points are ordinal.
However, I don’t know how Google determines the rankings. If it’s more complex than users simply entering an ordinal value, Google’s rating might not be ordinal. I don’t know enough about it to really say for sure. But ratings are generally ordinal data.
If the rating is ordinal, you can’t just change it to numeric data. You can represent ordinal data using numbers, but it’s still ordinal data. Imagine that we use the numbers 1, 2, and 3 to represent first, second, and third in a race. Even though we are using numbers, they are not numeric or continuous data. For example, numerically, the number 2 is twice the value of 1. However, that does not mean that the second place finisher took twice as long as the first place finisher. And, the third place finisher didn’t necessarily take three times as long. Etc. You can change how ordinal data are represented, but it doesn’t change the underlying fact that they are ordinal data.
If you have ordinal data and it’s the dependent variable, use ordinal logistic regression. If it’s an independent variable, it can be tricky. You can try using it as an independent variable, but pay extra attention to the residual plots. The model may or may not provide a good fit for reasons that I describe in the race example!
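The race example can be put into numbers. The short sketch below uses invented finishing times to show that the numeric codes 1, 2, and 3 preserve the order of the finishers but not the size of the gaps between them.

```python
# Made-up finishing times (in seconds) for the top three runners.
finish_times = {"first": 600.0, "second": 605.0, "third": 640.0}
places = {"first": 1, "second": 2, "third": 3}

# Ratio of the place numbers versus ratio of the actual times.
place_ratio = places["second"] / places["first"]
time_ratio = finish_times["second"] / finish_times["first"]
print(place_ratio, round(time_ratio, 3))

# Gaps between adjacent places are equal as codes but not as times.
gap_1_to_2 = finish_times["second"] - finish_times["first"]
gap_2_to_3 = finish_times["third"] - finish_times["second"]
print(gap_1_to_2, gap_2_to_3)
```

A model that treats the place codes as ordinary numbers assumes those gaps are equal, which is why the residual plots deserve extra attention.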
Alex says
Hi Jim
Thanks for this constructive post. I have a question.
I run a multiple regression model in which my dependent variable is the vote for Social Democratic parties and my independent variables are associated with a range of factors.
Do you think that I should keep this single multiple regression model or could I divide this model into more multiple regression models? The advantage of dividing the model into more multiple regression models is that I acquire a better R squared value.
However, my argument in favour of keeping the single multiple regression model is that the vote is a complicated phenomenon, which is affected by many factors, and by controlling for more factors, we can explain more of the variation in y.
Thank you in advance.
Jim Frost says
Hi Alex,
I’m not quite sure how you would be dividing up the different models if you’re using the same dependent variable? Maybe by election, region, or year?
To make this determination, you’ll really have to use your subject area knowledge. If you think the relationships between the independent variables and the dependent variable are likely to be constant across the entire large model, that’s a good reason to use just one model. However, if those relationships change based on however you are dividing the models, that’s an argument for either dividing the model or modeling those changes themselves in the large model–possibly by including interaction effects.
Best of luck with your analysis!
Linch dan says
First of all, I want to thank you to maintain an excellent blog and it is very helpful for everyone.
I am a student who is struggling to find the best regression type for my study. In brief, animals were fed a supplement at different doses (0, 0.5%, 1%, 2%, and 3%). Each treatment group has 9 replicates (e.g., the 0.5% group has 9 animals). After the feeding trial, different blood parameters (e.g., immune cells) are measured along with the supplement concentration in blood. Now I want to correlate the blood parameters (e.g., immune cell counts) with the supplement concentration in blood.
For this experiment, What is the Correct Type of Regression Analysis?
What is Regression type that I need to work on is it linear regression or nonlinear regression? I am still in the learning curve and your help is highly appreciated.
Jim Frost says
Hi Linch,
It could be either linear or nonlinear regression. It depends on the nature of the relationships between the variables. There’s no way I could possibly guess about that. Definitely start with linear regression and determine whether you can obtain an adequate fit by checking the residual plots. If you can get a good fit using linear regression, you can avoid nonlinear regression, which is often more complicated.
Best of luck with your analysis!
Geo says
Hi jim,
Thanks for the post. I have a set of categorical and continuous variables that need to predict the success of an event (fail/pass). The categorical variables in the data set take multiple values (q1 to q400), which may be related to region code or some other parameter that may impact the final output. What may be the best model to fit here?
Jim Frost says
Hi Geo, because of the binary dependent variable, you’d need to use binary logistic regression. Using this type of regression allows you to determine which variables are correlated with changes in the probabilities of the success of an event. You can use both continuous and categorical independent variables with this type of analysis.
richard sadaka says
Hi, Jim
I am trying to fit a regression with 1 dependent variable and 35 independent variables (all categorical), but I ran into a problem: 33 of the 35 coefficients are not significant.
I would really appreciate any help
Jim Frost says
Hi Richard,
That’s a very difficult question to answer. The explanation can range from no relationship existing between those variables to an effect that is too small given your sample size to be detectable (i.e., your statistical power is too low).
Additionally, for 35 categorical independent variables, you probably need at least 700 observations. Possibly higher depending on the number of levels per categorical variable and the distribution of observations across those levels.
Can you narrow those variables down to a few that theory strongly suggests?
richard sadaka says
actually i can’t since they are the components of the consumption (i.e revenue = C1 + C2 +C3 ……..C35 + S)
Lis Bittencourt says
Hi, Jim! Thank you for your post.
My doubt is: if I have a continuous dependent variable and a count independent variable, what is the most suited regression analysis?
I understand that if the count variable were my dependent variable, Poisson regression would be appropriate. But I am not sure whether the inverse situation calls for a regular simple linear regression.
Thank you,
Lis
Seun Opaleye says
Thank you Jim. I will increase the time period for the model to include period of growth together with the recession, then use structural breaks to identify effects during the period of recession. Right?
Seun Opaleye says
Thanks for your response Jim!
Recall that recession is measured on a quarterly basis and my country experienced recession for a period of 5 quarters which gives us 15 months.
Will it be proper to run an analysis based on data where T=5 (five quarters) & five independent variables.
Will GMM capture this data size? Or what do you suggest?
Jim Frost says
I think you might have a problem there. I’m not the most familiar with the Generalized Method of Moments (GMM). However, my understanding is that this method trades off efficiency in order to obtain more robust estimates using fewer assumptions. In other words, you need a larger sample size using this method compared to OLS. And, you’d have a problem with using OLS for your study. In OLS, you generally need at least 10 observations per term in your model. You’re nowhere near that. I don’t know the guidelines for GMM, but it is less efficient so presumably you’d need more observations per term.
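The arithmetic behind that rule of thumb is simple enough to sketch. The snippet below assumes the model has exactly five terms (one per independent variable, with no lags, polynomials, or interactions) and compares the required count against both ways of counting the recession period mentioned in this thread. The function name is made up for illustration.

```python
# Rule of thumb from the reply above: about 10 observations per model term.
OBS_PER_TERM = 10


def minimum_observations(n_terms):
    """Minimum sample size suggested by the 10-per-term rule of thumb."""
    return OBS_PER_TERM * n_terms


n_terms = 5  # five independent variables, assuming no extra terms
for label, n_obs in (("quarterly", 5), ("monthly", 15)):
    needed = minimum_observations(n_terms)
    # Prints the label, observations available, observations needed,
    # and whether the available count meets the rule of thumb.
    print(label, n_obs, needed, n_obs >= needed)
```

Adding lagged variables or interaction terms increases the number of terms, and with it the number of observations the rule of thumb calls for.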
You probably have an additional issue as well. If you have data only from times of recession, it limits the variability in your dependent variable and possibly the independent variables. This situation weakens the ability of the analysis to detect relationships in the data. It’s much better if you have data from strong and weak economic times because that allows the analysis to more easily determine which independent variables covary with the dependent variable. It’s much harder to determine how the variables covary when at least one of them (dependent variable) doesn’t vary that much.
I think you need more data and particularly include a variety of economic conditions.
Best of luck with your analysis!
Seun Obed says
Hi Jim,.
Nice work you’re doing here. I’d like to find out which model to use in running a regression where the time period is in months, we are looking at 15 months, and we have five independent variables. The research aims to look into factors that significantly increased unemployment during a period of recession that lasted 15 months in my country.
Jim Frost says
Thanks Seun! Performing regression analysis with time series data can be tricky but it’s possible. At some point, I might write a post about that topic! I’d start out with linear regression. You’ll almost certainly need to include time information along with your independent variables. Very possibly include time-lagged variables as well. For instance, the state of variable X in the previous time period might affect the current time period. You should always check the residual plots. However, when you’re working with time series data, be sure to check the Residuals versus Time Order plot to ensure that you’re accounting for all the time-related effects.
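For readers unsure what a time-lagged variable looks like in a worksheet, here is a minimal sketch. It shifts a made-up monthly series by one period so that each row pairs the current value of X with its value in the previous month. The helper name `lag` and the data are invented for illustration; most statistical packages have a built-in lag function.

```python
# Made-up monthly values of an independent variable X.
x = [5.1, 5.3, 5.0, 4.8, 4.9, 5.2]


def lag(series, k=1):
    """Shift a series by k >= 1 periods; the first k values have no history."""
    return [None] * k + series[:-k]


x_lag1 = lag(x, 1)

# Build rows of (time index, X, lagged X), dropping rows without a lag value.
rows = [(t, x[t], x_lag1[t]) for t in range(len(x)) if x_lag1[t] is not None]
for row in rows:
    print(row)
```

Note that each lag costs you observations at the start of the series, which matters when the series is short.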
Torsha says
Hi Jim,
I found this extremely helpful!
I have a very elementary doubt. Is it practically feasible for both the dependent and independent variables (all of them) to be dummies, i.e., binary in nature?
Jim Frost says
Hi Torsha,
Yes, you can certainly do this using binary logistic regression. That type of regression allows you to use the binary dependent variable, which the other types of regression don’t allow. Then, you can add the binary independent variables, which isn’t unique to binary logistic regression.
I think the type of model you describe is relatively common in the health care field. That’s not my field but I attended a presentation by someone in the field who talked about how they use that type of model to assess the risk of a surgical procedure for different patients. All the independent variables are patient traits (e.g., high blood pressure, etc.) and the dependent variable was survival. The model allowed doctors to enter patient attributes and estimate survival probability. This type of model can also be used in other fields.
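As a hedged illustration of how such a model turns patient traits into a probability, the sketch below applies the logistic function to a linear combination of binary predictors. The intercept, coefficients, and trait names are entirely hypothetical and do not come from any real clinical model.

```python
import math

# Hypothetical coefficients on the log-odds scale (not from a real model).
INTERCEPT = 2.0
COEFS = {"high_blood_pressure": -0.8, "diabetes": -0.5}


def survival_probability(patient):
    """Convert the linear predictor (log-odds) into a probability."""
    log_odds = INTERCEPT + sum(COEFS[k] * patient[k] for k in COEFS)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Binary (0/1) traits for two made-up patients.
healthy = {"high_blood_pressure": 0, "diabetes": 0}
at_risk = {"high_blood_pressure": 1, "diabetes": 1}
print(round(survival_probability(healthy), 3))
print(round(survival_probability(at_risk), 3))
```

With negative coefficients, switching a trait from 0 to 1 lowers the estimated probability, and the result always stays between 0 and 1.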
Barney says
Thank you Jim!
I’ve been fiddling around with Minitab trying to get it to include CIs for the parameter estimates, but I just can’t find information online to learn to enable it specifically for parameter estimates. Would you know if this is possible on Minitab, or could you please name some software that I might be able to use to do the CIs? It’s for asymptotic regressions.
Jim Frost says
Hi Barney, yes, Minitab can display the CIs for parameter estimates. On the main dialog box, click the Results button. Under Display of Results, choose Expanded. When you rerun the analysis, it will now display those CIs along with various other additional results.
pipi says
Thanks a lot Jim, it does help…..
Jim Frost says
You’re very welcome! Best of luck with your analysis!
pipi says
Thanks a lot for your reply, Jim,
Actually I’m kind of confused, because my supervisor said that my response variable, which is travel time, is discrete data because of the way I collected it: in every time interval, I only have 1 data point for every day.
Ex. on Sunday: 6am-6.59am = 42 minutes, 7am-7.59am = 32 minutes, and so on
on Monday: 6am-6.59am = 40 minutes, 7am-7.59am = 30 minutes, and so on
on Tuesday: 6am-6.59am = 30 minutes, 7am-7.59am = 20 minutes, and so on
is it still continuous data or discrete? Because at first I choose continuous too…
And my next question: if I choose time as my predictor variable, what would it look like?
Because, as in my example above, it is in intervals, can it be like:
Y (mean of travel time at 6am-6.59am) = 37.33,
x (predictor variable) = 6
??
And Jim, can I contact you in private because I really need some suggestions about my research or about regression. Thank you so much
Jim Frost says
Hi Pipi,
Time is usually a continuous variable even if you collect it once per interval per day. The type of data doesn’t change based on how often you collect measurements. I suppose you could make the case that it’s a count of minutes if you only recorded whole minutes. In that case, you could try Poisson regression. But, generally time is considered a continuous variable. Personally, I’d try least squares regression first and see how well you can model the data using it.
The response variable, you’d just use the travel time.
For the predictor variables, you’d include time-related variables and you could possibly include other variables as well if you have that data. For example, you can try including Hour of Day and Day of Week. And, if you have the data, you could try weather conditions too.
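One possible way to set up those predictors is sketched below. It turns each observation into a row with the travel time as the response, the hour of day as a numeric predictor, and indicator variables for day of week. The travel times echo the example in the question, but the calendar dates, the choice of Monday as the baseline, and the helper name `to_row` are assumptions made for illustration.

```python
from datetime import datetime

# Observations: (start of the one-hour interval, travel time in minutes).
# The calendar dates are made up; only the weekday and hour matter here.
observations = [
    (datetime(2018, 7, 1, 6), 42),  # a Sunday
    (datetime(2018, 7, 1, 7), 32),
    (datetime(2018, 7, 2, 6), 40),  # a Monday
    (datetime(2018, 7, 2, 7), 30),
]

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def to_row(start, minutes):
    """One regression row: response plus hour and day-of-week indicators."""
    row = {"travel_time": minutes, "hour": start.hour}
    # Indicator (dummy) variables for day of week, with Monday as baseline.
    for i, day in enumerate(DAYS[1:], start=1):
        row[f"is_{day}"] = 1 if start.weekday() == i else 0
    return row


rows = [to_row(start, minutes) for start, minutes in observations]
for row in rows:
    print(row)
```

Most statistical software will create the indicator variables for you if you declare Day of Week as a categorical predictor.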
I wish I could help more in depth, but if I did that for everyone who asks, I wouldn’t have time for my own life! I get a lot of requests for that. As it is, answering comments of a more general nature takes a lot of time! I hope you understand. But, I do try to provide general tips and pointers–like I am here! 🙂
pipi says
Hi Jim,
I am now doing research on regression models. I was wrong before because I used polynomial regression and trend analysis (time series) to predict my data, in which the response variable is discrete and the predictor is continuous.
Actually I want to estimate and predict the travel time, so I’ll describe the way I collect the data…
I collected the travel time data (response variable) every day for about 1 month. Each day I collected the data from 6am to 10.59pm, in 1-hour intervals.
Ex. on Sunday: 6am-6.59am = 42 minutes, then 7am-7.59am = 40 minutes, and so on
So I want to estimate and predict using regression.
Is it right if I use Poisson regression to solve my problem?
Can I combine it with trend analysis, especially quadratic? If I can combine them, what would it look like?
I really need your suggestions,
Thanks Jim
Jim Frost says
Hi Pipi,
Time series analyses require data that are collected at consistent intervals and that do not have any gaps. Your data have gaps (midnight – 5:59AM), so you can’t use time series analysis. Also, you use Poisson regression when your response variable is a count. Your response seems to be a continuous variable.
I would try using linear regression analysis and then including predictors such as time of day, maybe day of week, etc. Include the time components as predictors. You can include polynomials if needed. Regression with time related data can be tricky but it can work.
Try fitting the models, checking the residuals, and adjusting as necessary. In addition to the regular residual plots, be sure to pay extra attention to the residuals vs. order plot because you have time-ordered data (assuming you record them in your worksheet in time order).
I have not done much regression with time related data, so I don’t have much more to suggest than trying that approach. Best of luck with your analysis!
Barney says
Hi Jim,
Very informative article.
Pardon me for perhaps a simplistic question, but is it considered regression analysis if the function is known, and I want to test the correlation of experimental data against the function? If so, what should I research to learn more about it?
Jim Frost says
Hi Barney,
Thanks! I’m glad it was helpful!
If you have a theoretical function and want to compare it to the fit you obtain for your data when you specify the same model as the theoretical function, you can use the confidence intervals (CIs) for the parameter estimates. If these CIs do not include the parameter values from the known function, you have sufficient evidence to conclude that the differences between your parameter estimates and the known function are statistically significant. Most statistical software should be able to produce this type of CI, although it might not always be included in the default output.
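Here is a minimal sketch of that comparison for the slope of a simple linear regression. It uses made-up data scattered around a known function y = 2x, and it builds an approximate 95% CI with a normal critical value because Python’s standard library has no t distribution. Statistical software would use the t distribution, which gives a somewhat wider interval with a sample this small. The helper name `consistent_with` is invented for illustration.

```python
import math
from statistics import NormalDist, mean

# Made-up experimental data scattered around a known function y = 2x.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.1, -0.3, 0.2, -0.1, 0.1]
y = [2 * xi + e for xi, e in zip(x, noise)]

# Least squares estimates for y = intercept + slope * x.
x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar

# Standard error of the slope from the residual variance (n - 2 df).
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
se_slope = math.sqrt(sse / (len(x) - 2) / sxx)

# Approximate 95% CI using a normal critical value (an assumption here;
# software would normally use the t distribution instead).
z = NormalDist().inv_cdf(0.975)
ci = (slope - z * se_slope, slope + z * se_slope)


def consistent_with(theoretical_slope):
    """True if the theoretical value falls inside the approximate CI."""
    return ci[0] <= theoretical_slope <= ci[1]


print(ci, consistent_with(2.0), consistent_with(2.5))
```

The same check applies to each parameter in the model: a theoretical value outside the interval is evidence that the data differ from the known function.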
So, that’s what you should look into: CIs for the parameter estimates (coefficients) in a regression equation.
Kelly Parris Yeldham says
Hi Jim,
Cool article!
How should I proceed when I want to compare a hypothetical nonlinear graph with an experimental graph, when the function and the shape of graph is unknown? I want to quantify the two graph shapes, and compare their equations, as opposed to visually overlapping the two graphs.
I have a continuous independent variable of time, and the dependent variable of velocity calculated from previous iterations of velocity starting from 0. Increasing the order of the polynomial that I use brings me closer and closer to the shape, but I do not believe that my function is polynomial.
Thank you!
Jim Frost says
Hi Kelly,
I’m not 100% sure what you’re asking, but I think you might be asking how to specify a model that fits the curve in your data and how to determine whether that model adequately fits the curvature. If so, I’ve got the perfect blog post for you: Curve Fitting Using Linear and Nonlinear Regression. I cover the different methods you can use to fit curves and how to determine which provides the best fit.
For your data, if the polynomials don’t provide a good fit, you might well need to transform your data or use a nonlinear model. Note that nonlinear models are different than linear models that use polynomials to model curvature, which is what you’re doing. I talk about all of that in that post!
I hope that helps!
Kirti says
Hi. Thanks for this post. It is very informative.
I am a student. I have a dataset with 300,00 rows and 77 columns. How do I approach the data?
Also I have to do some predictive analysis. My independent variables are a mix of continuous and nominal categorical variables and my dependent variable is continuous. Which regression model should I use?
Jim Frost says
Hi Kirti, with a few exceptions, the type of regression analysis you should use doesn’t depend on the size of the dataset and number of variables. Usually, it’s the type of variables that you have.
For your case, I’d start with multiple linear regression. See if you can get a good fit to your data using that procedure.
Tony says
Hey Jim! Once I’ve trained a logistic model and know which predictors are important, is there a way that I can define an optimal range for my input variables? For example if I’m trying to adjust three settings on a machine to minimize my probability of introducing a defect, how could I use my coefficients from a logistic model to decide what the mean setting should be for all three to maximize probability of no defects? Thanks!
Jim Frost says