Regression analysis mathematically describes the relationship between a set of independent variables and a dependent variable. There are numerous types of regression models that you can use. This choice often depends on the kind of data you have for the dependent variable and the type of model that provides the best fit. In this post, I cover the more common types of regression analyses and how to decide which one is right for your data.
I’ll provide an overview along with information to help you choose. I organize the types of regression by the different kinds of dependent variable. If you’re not sure which procedure to use, determine which type of dependent variable you have, and then focus on that section in this post. This process should help narrow the choices! I’ll cover regression models that are appropriate for dependent variables that measure continuous, categorical, and count data.
Related posts: Nominal, Ordinal, Interval, and Ratio Scales, Guide to Data Types and How to Graph Them, and Independent and Dependent Variables Explained
Regression Analysis with Continuous Dependent Variables
Regression analysis with a continuous dependent variable is probably the first type that comes to mind. While this is the primary case, you still need to decide which type of model to use.
Continuous variables are measurements on a continuous scale, such as weight, time, and length.
Linear regression
Linear regression, also known as ordinary least squares (OLS) and linear least squares, is the real workhorse of the regression world. Use linear regression to understand the mean change in a dependent variable given a one-unit change in each independent variable. You can also use polynomials to model curvature and include interaction effects. Despite the term “linear model,” this type can model curvature.
This analysis estimates parameters by minimizing the sum of the squared errors (SSE). Linear models are the most common and most straightforward to use. If you have a continuous dependent variable, linear regression is probably the first type you should consider.
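To make the idea of minimizing the SSE concrete, here is a minimal sketch for the simplest case of one independent variable. The data and the function names (fit_ols, sse) are made up for illustration; statistical software does the same thing for many predictors using matrix algebra.

```python
# Hypothetical data: one independent variable (x) and a continuous
# dependent variable (y). These numbers are invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_ols(x, y):
    """Return the (intercept, slope) that minimize the sum of squared errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sse(x, y, intercept, slope):
    """Sum of squared errors for a given line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


intercept, slope = fit_ols(x, y)
best_sse = sse(x, y, intercept, slope)

# Any other line has a larger SSE than the least squares line.
worse_sse = sse(x, y, intercept, slope + 0.1)
print(f"intercept={intercept:.3f}, slope={slope:.3f}, SSE={best_sse:.4f}")
```

For these made-up numbers the slope is about 1.99, which you would read as the mean change in y for a one-unit change in x.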
There are some special options available for linear regression.
- Fitted line plots: If you have one independent variable and the dependent variable, use a fitted line plot to display the data along with the fitted regression line and essential regression output. These graphs make understanding the model more intuitive.
- Stepwise regression and Best subsets regression: These automated methods can help identify candidate variables early in the model specification process.
Advanced types of linear regression
Linear models are the oldest type of regression. They were designed so that statisticians could do the calculations by hand. However, OLS has several weaknesses, including a sensitivity to both outliers and multicollinearity, and it is prone to overfitting. To address these problems, statisticians have developed several advanced variants:
- Ridge regression allows you to analyze data even when severe multicollinearity is present and helps prevent overfitting. This type of model reduces the large, problematic variance that multicollinearity causes by introducing a slight bias in the estimates. The procedure trades away much of the variance in exchange for a little bias, which produces more useful coefficient estimates when multicollinearity is present.
- Lasso regression (least absolute shrinkage and selection operator) performs variable selection that aims to increase prediction accuracy by identifying a simpler model. It is similar to Ridge regression, but its penalty can shrink some coefficients all the way to zero, which removes those variables from the model.
- Partial least squares (PLS) regression is useful when you have very few observations compared to the number of independent variables or when your independent variables are highly correlated. PLS reduces the independent variables to a smaller number of uncorrelated components using an approach similar to Principal Component Analysis. Then, the procedure performs linear regression on these components rather than the original data. PLS emphasizes developing predictive models and is not used for screening variables. Unlike OLS, you can include multiple continuous dependent variables. PLS uses the correlation structure to identify smaller effects and model multivariate patterns in the dependent variables.
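The bias-for-variance trade that Ridge regression makes is easier to see with numbers. The sketch below covers only Ridge (not Lasso or PLS), uses two made-up, nearly collinear predictors that are already centered, and solves the 2x2 penalized normal equations by hand. The helper names and the penalty value lam=1.0 are assumptions for illustration, not recommendations.

```python
def solve_2x2(a, b, c, d, e, f):
    """Solve [[a, b], [c, d]] @ [u, v] = [e, f] with Cramer's rule."""
    det = a * d - b * c
    return (e * d - b * f) / det, (a * f - e * c) / det


def ridge_two_predictors(x1, x2, y, lam):
    """Coefficients minimizing SSE + lam * (b1**2 + b2**2).

    Assumes x1, x2, and y are centered, so no intercept is needed.
    Setting lam=0 gives ordinary least squares.
    """
    s11 = sum(u * u for u in x1)
    s22 = sum(v * v for v in x2)
    s12 = sum(u * v for u, v in zip(x1, x2))
    s1y = sum(u * t for u, t in zip(x1, y))
    s2y = sum(v * t for v, t in zip(x2, y))
    return solve_2x2(s11 + lam, s12, s12, s22 + lam, s1y, s2y)


# Hypothetical centered data; x2 is almost a copy of x1 (multicollinearity).
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-1.9, -1.1, 0.1, 0.9, 2.0]
y = [-4.1, -1.9, 0.0, 2.1, 3.9]

ols_b1, ols_b2 = ridge_two_predictors(x1, x2, y, lam=0.0)
ridge_b1, ridge_b2 = ridge_two_predictors(x1, x2, y, lam=1.0)

print(f"OLS:   b1={ols_b1:.3f}, b2={ols_b2:.3f}")
print(f"Ridge: b1={ridge_b1:.3f}, b2={ridge_b2:.3f}")
```

With these invented numbers, OLS splits the effect erratically between the two predictors (roughly 2.82 and -0.83), while the penalized fit gives two similar, smaller coefficients (roughly 1.00 and 0.92). In practice you would standardize the predictors and choose the penalty by cross-validation.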
Nonlinear regression
Nonlinear regression also requires a continuous dependent variable, but it provides a greater flexibility to fit curves than linear regression.
Like OLS, nonlinear regression estimates the parameters by minimizing the SSE. However, nonlinear models use an iterative algorithm rather than the linear approach of solving them directly with matrix equations. What this means for you is that you need to worry about which algorithm to use, specifying good starting values, and the possibility of either not converging on a solution or converging on a local minimum rather than a global minimum SSE. And that’s in addition to specifying the correct functional form!
Most nonlinear models have one continuous independent variable, but it is possible to have more than one. When you have one independent variable, you can graph the results using a fitted line plot.
My advice is to fit a model using linear regression first and then determine whether the linear model provides an adequate fit by checking the residual plots. If you can’t obtain a good fit using linear regression, then try a nonlinear model because it can fit a wider variety of curves. I always recommend that you try OLS first because it is easier to perform and interpret.
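To show what the iterative algorithm and the starting values look like in practice, here is a minimal Gauss-Newton sketch for a one-parameter model, y = exp(b * x). The model, the noise-free data, and the starting value are all assumptions chosen to keep the example short; real software handles several parameters and uses more robust algorithms.

```python
import math

# Hypothetical, noise-free data generated from y = exp(0.5 * x).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [math.exp(0.5 * xi) for xi in x]


def gauss_newton_exp(x, y, b_start, max_iter=50, tol=1e-12):
    """Estimate b in y = exp(b * x) by iteratively reducing the SSE."""
    b = b_start
    for iteration in range(1, max_iter + 1):
        fitted = [math.exp(b * xi) for xi in x]
        residuals = [yi - fi for yi, fi in zip(y, fitted)]
        # Derivative of the fitted value with respect to b.
        jacobian = [xi * fi for xi, fi in zip(x, fitted)]
        step = (sum(j * r for j, r in zip(jacobian, residuals))
                / sum(j * j for j in jacobian))
        b += step
        if abs(step) < tol:
            return b, iteration
    raise RuntimeError("did not converge; try a different starting value")


b_hat, n_iter = gauss_newton_exp(x, y, b_start=0.3)
print(f"b = {b_hat:.6f} after {n_iter} iterations")
```

Unlike OLS, there is no single formula that returns the answer. The estimate is refined step by step from the starting value, which is why poor starting values or a poorly chosen functional form can prevent convergence.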
I’ve written quite a bit about the differences between linear and nonlinear models. Read the following posts to learn the differences between these two types, how to choose which one is best for your data, and how to interpret the results.
- What is the Difference Between Linear and Nonlinear Models?
- How to Choose Between Linear and Nonlinear Regression?
- Curve Fitting with Linear and Nonlinear Regression
Regression Analysis with Categorical Dependent Variables
So far, we’ve looked at models that require a continuous dependent variable. Next, let’s move on to categorical dependent variables. A categorical variable has values that you can put into a countable number of distinct groups based on a characteristic. Logistic regression transforms the dependent variable and then uses Maximum Likelihood Estimation, rather than least squares, to estimate the parameters.
Logistic regression describes the relationship between a set of independent variables and a categorical dependent variable. Choose the type of logistic model based on the type of categorical dependent variable you have.
Binary Logistic Regression
Use binary logistic regression to understand how changes in the independent variables are associated with changes in the probability of an event occurring. This type of model requires a binary dependent variable. A binary variable has only two possible values, such as pass and fail.
Example: Political scientists assess the odds of the incumbent U.S. President winning reelection based on stock market performance.
Read my post about a binary logistic model that estimates the probability of House Republicans belonging to the Freedom Caucus.
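As a rough illustration of how a binary logistic model is estimated, the sketch below maximizes the log-likelihood with plain gradient ascent. The pass/fail data, the learning rate, and the number of iterations are assumptions for illustration; statistical packages use faster algorithms and also report standard errors.

```python
import math

# Hypothetical data: hours studied and whether the student passed (1) or failed (0).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 1, 0, 1, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, learning_rate=0.05, n_iter=20000):
    """Maximum likelihood estimates (b0, b1) for P(y=1) = sigmoid(b0 + b1 * x)."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        # Gradient of the average log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p)) / n
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p)) / n
        b0 += learning_rate * g0
        b1 += learning_rate * g1
    return b0, b1


b0, b1 = fit_logistic(hours, passed)
odds_ratio = math.exp(b1)      # multiplicative change in the odds per extra hour
p_low = sigmoid(b0 + b1 * 1)   # predicted probability of passing at 1 hour
p_high = sigmoid(b0 + b1 * 8)  # predicted probability of passing at 8 hours
print(f"odds ratio per hour = {odds_ratio:.2f}")
```

The exponentiated slope is the odds ratio for a one-unit increase in the independent variable, which is the usual way to report these models.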
Ordinal Logistic Regression
Ordinal logistic regression models the relationship between a set of predictors and an ordinal response variable. An ordinal response has at least three groups which have a natural order, such as hot, medium, and cold. Learn more about Ordinal Data.
Example: Market analysts want to determine which variables influence the decision to buy large, medium, or small popcorn at the movie theater.
Nominal Logistic Regression
Nominal logistic regression, also known as multinomial logistic regression, models the relationship between a set of independent variables and a nominal dependent variable. A nominal variable has at least three groups which do not have a natural order, such as scratch, dent, and tear.
Example: A quality analyst studies the variables that affect the odds of the type of product defects: scratches, dents, and tears.
Learn more with my Logistic Regression Overview with Example.
Regression Analysis with Count Dependent Variables
If your dependent variable is a count of items, events, results, or activities, you might need to use a different type of regression model. Counts are nonnegative integers (0, 1, 2, etc.). Count data with higher means tend to be normally distributed and you can often use OLS. However, count data with smaller means can be skewed, and linear regression might have a hard time fitting these data. For these cases, there are several types of models you can use.
Poisson regression
Count data frequently follow the Poisson distribution, which makes Poisson Regression a good possibility. Poisson variables are a count of something over a constant amount of time, area, or another consistent length of observation. With a Poisson variable, you can calculate and assess a rate of occurrence. A classic example of a Poisson dataset is provided by Ladislaus Bortkiewicz, a Russian economist, who analyzed annual deaths caused by horse kicks in the Prussian Army from 1875-1894.
Use Poisson regression to model how changes in the independent variables are associated with changes in the counts. Poisson models are similar to logistic models because they use Maximum Likelihood Estimation and transform the dependent variable using the natural log. Poisson models can be suitable for rate data, where the rate is a count of events divided by a measure of that unit’s exposure (a consistent unit of observation). For example, homicides per month.
Example: An analyst uses Poisson regression to model the number of calls that a call center receives daily.
Learn in depth with my Poisson Regression Analysis Overview with Example.
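The log link is easier to follow with a small worked example. This sketch fits log(mean count) = b0 + b1 * x by Newton-Raphson, a standard way to maximize the Poisson likelihood. The data are made up, and the fixed number of iterations is an assumption that happens to be ample here.

```python
import math

# Hypothetical data: x is a predictor, y is the count observed at that x.
x = [0, 1, 2, 3, 4, 5]
y = [1, 2, 2, 4, 6, 9]


def fit_poisson(x, y, n_iter=25):
    """Maximum likelihood estimates (b0, b1) for log(mean) = b0 + b1 * x."""
    b0 = math.log(sum(y) / len(y))  # start at the log of the overall mean
    b1 = 0.0
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix entries (negative Hessian).
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


b0, b1 = fit_poisson(x, y)
fitted = [math.exp(b0 + b1 * xi) for xi in x]
rate_ratio = math.exp(b1)  # multiplicative change in the mean count per unit of x
print(f"rate ratio = {rate_ratio:.3f}")
```

Because the model works on the log scale, the exponentiated coefficient is a rate ratio: each one-unit increase in x multiplies the expected count by that factor.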
Alternatives to Poisson regression for count data
Not all count data follow the Poisson distribution because this distribution has some stringent restrictions. Fortunately, there are alternative analyses you can perform when you have count data.
Negative binomial regression: Poisson regression assumes that the variance equals the mean. When the variance is greater than the mean, your model has overdispersion. A negative binomial model, also known as NB2, can be more appropriate when overdispersion is present.
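A quick first check for overdispersion is to compare the sample variance to the sample mean. The counts below are made up. Note the limitation: this looks at the raw counts only, whereas the Poisson assumption really concerns the variance around the fitted values, so treat it as a screening step rather than a formal test.

```python
import statistics

# Hypothetical counts, e.g. events recorded per unit of observation.
counts = [0, 1, 0, 2, 7, 0, 3, 12, 1, 0, 5, 9]

mean_count = statistics.mean(counts)
variance = statistics.variance(counts)  # sample variance (n - 1 denominator)
dispersion_ratio = variance / mean_count

# A ratio near 1 is consistent with a Poisson distribution;
# a ratio well above 1 suggests overdispersion.
print(f"mean={mean_count:.2f}, variance={variance:.2f}, ratio={dispersion_ratio:.2f}")
```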
Zero-inflated models: Your count data might have too many zeros to follow the Poisson distribution. In other words, there are more zeros than Poisson regression predicts. Zero-inflated models assume that two separate processes work together to produce the excessive zeros. One process determines whether there are zero events or more than zero events. The other is the Poisson process that determines how many events occur, some of which can be zero. An example makes this clearer!
Suppose park rangers count the number of fish caught by each park visitor as they exit the park. A zero-inflated model might be appropriate for this scenario because there are two processes for catching zero fish:
- Some park visitors catch zero fish because they did not go fishing.
- Other visitors went fishing, and some of these people caught zero fish.
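One simple way to see whether there are more zeros than a Poisson distribution predicts is to compare the observed share of zeros with exp(-mean), the Poisson probability of a zero count. The fish counts below are invented for illustration, and the comparison ignores any predictors, so it is only a rough screen.

```python
import math

# Hypothetical number of fish caught by each of 20 park visitors.
fish = [0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 3, 0, 4, 2, 0, 1, 0, 5, 0, 3]

mean_fish = sum(fish) / len(fish)
observed_zero_share = fish.count(0) / len(fish)
# Under a Poisson distribution with this mean, P(count = 0) = exp(-mean).
expected_zero_share = math.exp(-mean_fish)

print(f"observed zeros: {observed_zero_share:.2f}, "
      f"Poisson expects: {expected_zero_share:.2f}")
```

When the observed share is clearly larger than the expected share, as in this made-up sample, a zero-inflated model is worth considering.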
Whew! That’s many different types of regression analysis! If you’re trying to figure out which one to choose, I hope you will use this information to point yourself in the right direction!
If you’re learning regression and like the approach I use in my blog, check out my Intuitive Guide to Regression Analysis book! You can find it on Amazon and other retailers.
Abubakar Sadiq Kasum says
Thank you Jim for this concise piece.
Yen Phuong says
Hi Jim,
Thank you for your post. I’m currently a university student conducting research on the impact of institutional quality on green economic growth from 2014-2023 in 50 Asian countries. Green economic growth (GEG) is our dependent variable, and the independent variables are as follows: Institutional Quality, Domestic Credit, Trade Openness, Industry Value Added. We are having a hard time determining which type of regression to use as T<<N. I'm wondering if you have any suggestions for us. Thank you for your effort and time.
Kasuni says
Hi Jim,
Great post and clearly explained the concepts !
I am struggling with my analysis and it would be great to get some direction from you.
My broad study aim is to analyze what cause pine cone death before they mature. For this, I have the no. of cones aborted from each individual tree from 2011-2021 from the same site. I converted this to the percentage of cones aborted per each tree since different tree had different starting cone numbers. Each tree might have been used in multiple years, but I do not know that information.
I want to know how environmental factors affect cone abortion. For that I have temperature and rainfall as my two main predictors. In temperature, I counted the no. of frost days (less than 2.2 celsius) , no. of heat days (greater than 35 celsius) in the time considered where most cone abortion happens based on literature. I also have high rainfall days and no. of zero rainfall days for this time.
My questions are
1. Is it correct to use percentage of cone abortion as the dependent variable?
2. Are my predictor variables (counts) suitable to answer my question?
3. What would be the best model to analyze this data (I tried simple linear models and got very low (0.02) adjusted R sq)
4. Can I use generalized mixed linear models? and if so what would be my random and fixed variables?
Thank you so much for your time and support.
Ghanimah says
Hi Jim, I’ve been running an ordinal regression (ologit) on Stata as I also have a categorical/ordinal dep vari, but how do I run it as a continuous variable/why would I choose to do that? My understanding is that Stata automatically considers it to be categorical. Also, my ‘scale’ is actually made up of numbers that are actually ordered (0, 1, 25, 50, 75, 100). Running an OLS has also been suggested but I don’t see how that would work.
Thanks!
Moksh says
Hi Jim,
I am trying to understand the number of children a woman decides to have. So, the dependent variable is of course the number of children. Kindly guide me which regression is most suited for my regression.
Regards,
Moksh
Jim Frost says
Hi Moksh,
Your dependent variable is a count, so Poisson regression or Negative binomial regression would be a good place to start.
Sam says
Hi Jim-
We run an 80 store retail chain across the US. We want a regression model for projected Weeks of supply based on history sales and current week of supply for each store location in order to allocate product . We were using a linear regression model with R2 . Although seem in some scenarios to be getting negative projected weeks of supply for stores that are selling high and are on low weeks of supply.
Che Hannigan says
Hi Jim,
Thank you for the article it is very helpful, but I am a bit lost with my regression analysis.
I am completing my undergrad dissertation on targeted poverty alleviation in China, and I have panel data from 30 Chinese provinces, from a time period of 2013-2022. I am attempting to analyse the effects of variable 1 (e.g., local government expenditure on education) against poverty (measured in GDP per capita), holding constant for other means of poverty alleviation (investment in infrastructure, access to clean water, etc.). But I am unsure how I am meant to regress these variables with the time series aspect if that makes sense.
From what I have read, would it be possible to hold time variation as a constant, as the date of policy implementation was 2015, so that I can compare the data in 2015, 2019, and 2022?
Another aspect of the dissertation was comparing the regional disparities of urban and rural provinces, but I’m unsure if I would also implement this into the regression? Apologies in advance if I haven’t explained this very well. I am a bit rusty on data analysis, but a reply would be more than appreciated. If you need me to clarify anything, please let me know.
p.s. All my data is currently in separate .csv files, as this was the only way I could save it from the databank. It is currently in the format of row: year, column: province, and the cells having the corresponding data for the specific variable. Any advice on how to clean and merge this data would also be very helpful
Thanks, Che.
MT says
Hi. I am working on a hospital readmission data set where the outcome is binary but rare (only 4 out of 70). Which regression or analysis test is recommended here, because logistic is not yielding any outputs on STATA. Thanks.
Jim Frost says
Hi MT,
Thank you for your question! Analyzing rare events can be challenging with traditional logistic regression. When the event of interest is rare (as in your case with only 4 out of 70 cases), the logistic regression model can sometimes fail to converge or provide unreliable estimates due to the limited number of positive outcomes.
For such situations, you might consider a few alternative approaches that are available in STATA:
Exact Logistic Regression: This method is specifically designed for small sample sizes or when the event is rare. It doesn’t rely on the large sample approximations of traditional logistic regression, making it a robust choice for datasets like yours. STATA can perform exact logistic regression using the exlogistic command.
Penalized Logistic Regression: This shrinkage approach, which includes ridge and lasso penalties, can help stabilize the estimates in models where the number of events is very low. This method adds a penalty to the likelihood function to reduce the variance of the estimates, which can be particularly useful in your case. In STATA, check the documentation for your version: recent releases provide the lasso logit and elasticnet logit commands for penalized logistic models, rather than an option on the logistic command. Note: penalized methods like this are often used for models with many variables to help reduce them, but they can also be helpful in cases like yours with a low prevalence of outcomes.
Before switching methods, it’s also a good idea to double-check your data handling and coding in STATA, ensuring there are no data entry errors or misconfigurations affecting the analysis.
I hope these suggestions help you progress with your analysis!
Iryna Zablotska says
Hi Jim,
Thanks for a very informative site, and for your very helpful and clear advice.
May I ask for some help as well?
I am an epidemiologist currently investigating the prevalence of infertility in Africa. My outcome of interest is obviously binary, and logistic regression would be a natural first choice, but the outcome event is not rare, and my model produces very inflated odds ratios. For instance, ORs of 17.62, 25.96, 93.42. What would you advise in this case?
Thanks,
Iryna
Jim Frost says
Hi Iryna,
Logistic regression can produce ORs that look inflated when the outcome is common because, in such scenarios, an OR is further from 1 than the corresponding risk ratio and so exaggerates the effect size when it is read as a risk ratio.
While logistic regression is a common choice for binary outcomes, it may not be the most suitable model when dealing with common outcomes. The inflated ORs suggest that an alternative modeling approach might be needed. Before doing that, be sure that your logistic regression model fits the data well and check for issues like multicollinearity or overfitting, which could contribute to inflated ORs.
Sometimes using risk ratios rather than odds ratios can help because they’re less prone to inflation for common outcomes.
In that vein, consider using Modified Poisson regression, which is an adaptation of the Poisson regression model that allows for direct estimation of relative risks (risk ratios) for binary outcomes. It is particularly useful when the outcome is not rare, as it avoids the inflation of odds ratios that can occur in such situations. It’s also known as Poisson regression with robust error variance.
Modified Poisson regression involves fitting a Poisson regression model but adjusting the standard errors to account for the binary nature of the outcome. This adjustment is typically done using robust standard errors.
I’ve read several articles in the literature recently indicating the effectiveness of the Modified Poisson model.
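A little arithmetic shows why odds ratios look inflated for common outcomes. The sketch below compares the risk ratio and the odds ratio for two hypothetical 2x2 tables, one with a common outcome and one with a rare outcome. All counts are invented, and this is only the raw calculation, not the Modified Poisson model itself.

```python
def risk_ratio(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Ratio of the event probabilities in the two groups."""
    return (events_exposed / n_exposed) / (events_unexposed / n_unexposed)


def odds_ratio(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Ratio of the event odds in the two groups."""
    odds_exposed = events_exposed / (n_exposed - events_exposed)
    odds_unexposed = events_unexposed / (n_unexposed - events_unexposed)
    return odds_exposed / odds_unexposed


# Common outcome: 80% vs. 40% of each group experience the event.
rr_common = risk_ratio(80, 100, 40, 100)
or_common = odds_ratio(80, 100, 40, 100)

# Rare outcome: 0.8% vs. 0.4% of each group experience the event.
rr_rare = risk_ratio(8, 1000, 4, 1000)
or_rare = odds_ratio(8, 1000, 4, 1000)

print(f"common outcome: RR={rr_common:.2f}, OR={or_common:.2f}")
print(f"rare outcome:   RR={rr_rare:.2f}, OR={or_rare:.2f}")
```

In both tables the risk doubles, but the odds ratio is about 6 for the common outcome and about 2 for the rare one. The odds ratio approximates the risk ratio only when the outcome is rare.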
ROBERT says
Thanks, Jim, for the wonderful website.
If I have predictors which are binary (for example participation in a skilling program) and an outcome that is continuous (income), what kind of regression do I have to use? Thanks again. Robert
Jim Frost says
Hi Robert,
Because you have a continuous outcome, you’d want to start with least squares regression. It can handle the binary predictors with no problem. You could perform ANOVA because your predictors are all categorical. You’d get equivalent results as the regression approach.
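To see why least squares handles a binary predictor without trouble, note that with a single 0/1 predictor the fitted slope is the difference between the two group means, and the intercept is the mean of the group coded 0. The incomes below are made up for illustration.

```python
# Hypothetical data: program participation (1 = yes, 0 = no) and income
# in thousands.
program = [0, 0, 0, 0, 1, 1, 1, 1]
income = [30.0, 32.0, 28.0, 34.0, 38.0, 41.0, 36.0, 45.0]


def fit_ols(x, y):
    """Return the least squares (intercept, slope) for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def group_mean(values, groups, level):
    selected = [v for v, g in zip(values, groups) if g == level]
    return sum(selected) / len(selected)


intercept, slope = fit_ols(program, income)
mean_no = group_mean(income, program, 0)
mean_yes = group_mean(income, program, 1)

print(f"intercept={intercept:.2f} (mean of non-participants={mean_no:.2f})")
print(f"slope={slope:.2f} (difference in means={mean_yes - mean_no:.2f})")
```

This is the sense in which the regression and ANOVA approaches give equivalent results for categorical predictors: both compare group means.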
Isabelle Kramer says
Hi!
Thank you for all this helpful information!
I am also trying to find a suitable regression model. I am working with three moderating variables which were measured on a five point likert scale. I have a binary independent variable, which is a yes or no question (the usage of chatbots), and my dependent variable is customer satisfaction, which is also measured on a five point likert scale. What would you suggest I do?
Thanks already,
Isabelle
Jim Frost says
Hi Isabelle,
You should try ordinal logistic regression because your Likert scale response is an ordinal variable. You’ll need to figure out whether you should enter your Likert scale moderators as continuous or categorical variables. That’ll depend on the type of fit you can get using them as continuous variables even though they’re ordinal vs. the amount of data you have to enter them as categorical (which uses more degrees of freedom) but might give you a better fit.
Nor Tadala says
Hello Jim,
Thank you for the insights on choosing the right regression. However, I am still confused looking at the kind of data I have for my thesis.
I have one time (post-intervention data) cross-sectional data, and am looking to assess the impact of the intervention by comparing those who received (treated) vs the control. The dependent variable in binary and the key explanatory variable is also binary, taking either 1 or zero. Most papers I have read have done either logit or probit regression models however, one thing common about these papers is that they had before (baseline) and after intervention data or used panel data, which in my case I do not have either. They applied propensity score matching and Difference in Difference to estimate the treatments effects.
Later I started thinking of the Linear Probability Model, but now it seems there is consensus that it's a wrong model since the probability can be more than 1.
Now I am stuck and not sure which model to use? At least one that will be easier to use and interpret Please assist.
Joe Vondrasek says
Hi Jim,
What is the best statistical test to assess the non-monotonic relationship between 2 continuous variables. I saw in your explanation of spearman’s and tau that these tests can assess only monotonic curvilinear relationships.
The independent variable is age and the dependent variable is central arterial stiffness. I don’t necessarily want to predict with these data. The point of the paper I am writing is to assess the presence of a relationship and compare the data points to available normative data. For example, a conclusion could be that central stiffness does indeed increase with age.
Admittedly before I read your posts, I thought I could just fit a non-linear line and use the R2. Now I’m thinking that won’t be best. When I did fit the curvilinear line the residuals were not normally distributed.
I appreciate your help and I can clarify as needed.
-Joe
Rahul says
Thanks so much JIM for the neat explanation. I am now learning many things.
Model_F_R <- glmer (F_size ~ tmin_M + ppt_M + vpdmin_M + vpdmax_M + years + (1|farms), family = poisson(link = "log"), data = Data_Flam_DF_Z)
This is the formula I have used for fitting a GLMM (Poisson). I have flock size as response variable, and 4 climatic factors + years from 1975-2020 as covariates.
My question is that “Whether the covariates influence the flock size, and whether there is Zoo specific influence of covariates on flock size”
I am confused about the following things.
1. I have 10 farms to analyse. But each farm has varying time-spans for which the data is available. For example, farm 1 : 1975-2021, farm 2: 1983-2021 etc. I have used these data as it is. Is that correct?
2. I am not getting a neat QQ plot. I applied transformations. But it did not help. Outlier plots of dependent variable shows outliers, and for some variable, there is a cluster on the one side, and after a gap, some on the other end. I used Cleveland plot to do it. I tried square root and log transformation. But nothing helped.
Am I doing the right thing?
Thanks
Rahul
Jim Frost says
Hi Rahul,
Some answers to your questions!
Using data from different time spans for each farm, as you have done, is generally acceptable in longitudinal and mixed-model analyses. This approach allows you to utilize all available data despite the different start and end points, which is particularly useful in ecological or agricultural studies where data collection periods may vary by location.
However, it is crucial to ensure that this variation in data availability does not introduce biases related to temporal trends in the covariates or the response variable. For instance, if earlier years had systematically different flock sizes or climatic conditions, this might influence the model outcomes. One way to address this potential bias is by including random slopes for years within farms if the data supports this complexity.
Also, Poisson regression data don’t need to follow the normal distribution. Instead of using QQ plots, use goodness-of-fit tests tailored to count data to determine whether the predicted numbers of events deviate from the observed numbers of events in a way that the Poisson distribution does not predict. The tests I’m familiar with for this purpose are the Deviance Test and the Pearson Test.
If those tests indicate a significant lack of fit, you might need to change any of the following: the model specification, the link function, or possibly use a different type of model. For Poisson models, check if the variance exceeds the mean significantly, which indicates overdispersion. Poisson models can’t handle overdispersion. If it’s present, consider using a negative binomial model, which can handle overdispersion better.
Rahul says
Hi Jim:
I used beta regression model for modeling. When i used the years as grouping variables, the phi coefficient for the intercept was -83.45. Can the phi coefficient be negative ??
Jim Frost says
Hi Rahul,
The phi coefficient is a measure of association between two binary variables. Yes, it can be negative. In fact, they range from +1 to -1. Negative values indicate an inverse relationship between two variables.
However, because they are binary variables, positive or negative really depends on how your data are structured. If you picture a 2X2 table, a positive or negative value depends on the direction of the diagonal with the most values. A diagonal like this \ produces a positive phi coefficient, while a diagonal like this / produces a negative value.
Consequently, if you switched the order of outcomes in the margins of the table, you could potentially switch the direction of association.
Rahul says
Hi Jim:
Thanks so much for the suggestion. It works !!!…
I have one more question. I performed two models: one with farms as random effect and the other with years as random effect. The model with years as random effect has lower AIC, lower BIC, and slightly higher pseudo R2 value, indicating a good fit. But the precision parameter for model with random component years is lower (0.03) compared to that of farms (0.1). Moreover, only two predictors are significant in the model with years as random component, whereas all predictors were significant in the model with farms as random component. Log likelihood of model with years as random effect is slightly lesser (995) than that of model with farms (1003). As a measure of goodness of fit, what strategy I may choose.
Thanks a lot JIM
Jim Frost says
Hi Rahul,
You’re very welcome! It’s been interesting sharing this journey with you! I hope the results you’re obtaining so far are informative.
In terms of fixed and random effects, you don’t make those decisions based on goodness-of-fit measures but rather the nature of the factor and what you want to learn. So a quick refresher on those:
Use random effects when the levels of a factor represent a random sample from a larger population, and you are interested in generalizing findings beyond these specific levels. Random effects are useful for controlling for the variability introduced by these levels without focusing on the effect of each specific level.
Use fixed effects when you deliberately chose the specific levels of a factor, those levels are of primary interest, and you want to estimate and test the effect of each level individually.
In your case, it sounds like farms might be a good candidate for a random factor. You probably want to control for the variability between farms but the goal of your study isn’t to learn about those farms specifically.
On the other hand, I wouldn’t include years as a random factor. Usually, time related variables are fixed factors. That’s true even though you probably want to control for the year to year variability and might not be interested in those specific years. The reason is that you didn’t randomly sample the years.
If what I wrote about farms is correct (i.e., you’re not interested in those specific farms), consider fitting a model where farm is a random effect and year is a fixed effect. The reasoning I discuss above should trump any of the other goodness-of-fit issues because the decision is based on the nature of the data.
Rahul says
Thanks Jim.
That’s an interesting solution. I will use the beta regression, as you suggested.
Rahul
Rahul says
Thanks Jim:
I have 0 for some farms and hence beta regression fails. May I know what can I do in this case ?
Rahul
Jim Frost says
Ah, right. You can’t have exactly 0 or 1. Just make those values slightly different, say 0.000001 or something like that instead of exactly zero.
Also, I updated my previous answer after replying to include another possibility, GLMM.
Rahul says
Hi Jim, Thanks for the reply. Can I use linear mixed effects model for Male proportion ?
Rahul
Jim Frost says
Hi Rahul,
Unfortunately, you can’t due to the fact that the dependent variable is a proportion. Beta Regression is your best bet.
Or you might be able to use a generalized linear mixed model (GLMM) with a logistic (or other appropriate) link function and a binomial error distribution. This type of model can handle the bounded nature of proportions and also account for random effects due to data clustering (e.g., different farms).
Rahul says
Hi Jim:
I have a question. I have two dependent variables (flock size and male proportion), and seven climatic variables for various farms, located at different parts of the country. May I know which is the right regression model for this type of data ?
Thanks Rahul
Jim Frost says
Hi Rahul,
For flock size, you could try either Poisson regression or linear least squares. That really depends on whether the flock sizes are large enough to approximate a continuous normal distribution, in which case regular multiple regression is fine. However, Poisson regression specifically models counts and so is probably a good place to start. There are other types that also model counts, such as a negative binomial model. You’d need to see which type best satisfies the assumptions.
The proportion of males is a more complicated problem. First, I’ll assume that the dependent variable itself records the proportions and that you’ve recorded one value for each flock. In that case, you’d probably want to use Beta Regression because it models proportions that are constrained between 0 and 1. It’s not common in many statistical software packages but is available in R (betareg package) and Python (statsmodels library). I haven’t used this type of regression analysis myself so, unfortunately, I can’t offer any practical advice but it is worth checking out.
ellie says
Hi Jim,
I am writing my thesis at the minute and i am really confused about which statistical test to use.
I am looking at how your social motives and your socioeconomic status (my IV’s) affect your achievement (DV) .
BUT, all of my variables have been tested on a Likert scale so I don’t have any nominal data. I thought i needed a simple multiple regression but Ive read over your website and I think Im wrong. At my university we havent got too in depth with regressions (only multiple regression, linear regression and briefly hierarchical regression). Please could you give me some advice? I am going round in circles, thank you
Adam says
Hello Jim! I have a question about conducting a moderation analysis using a categorical independent variable, a continuous moderator, and a categorical dependent variable. What regression/moderation analysis on SPSS would you recommend for this type of analysis and why? Thank you kindly for your time.
Jim Frost says
Hi Adam,
The type of analysis depends on the type of categorical DV. You’d use one of the following types of regression analysis and include interaction terms to model the moderation effect.
Binary logistic regression: If your DV has two levels.
Multinomial logistic regression: If the DV has at least 3 categories that don’t have a natural order.
Ordinal logistic regression: If the DV categories have a natural ordering (e.g., low, medium, and high).
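For the two-level case, a moderation model might be sketched as follows with the statsmodels formula interface. The variable names (group, moderator, outcome) and the simulated effect sizes are assumptions for illustration only. The `*` in the formula expands to both main effects plus their interaction, and the interaction term carries the moderation effect.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500

# Simulated data: a two-level categorical IV and a continuous moderator
group = rng.choice(["a", "b"], size=n)
moderator = rng.normal(size=n)
is_b = (group == "b").astype(float)

# The effect of the moderator differs by group (that is the interaction)
log_odds = -0.5 + 0.8 * is_b + 0.3 * moderator + 0.9 * is_b * moderator
outcome = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

df = pd.DataFrame({"group": group, "moderator": moderator, "outcome": outcome})

# Binary logistic regression with main effects and the interaction term
fit = smf.logit("outcome ~ C(group) * moderator", data=df).fit(disp=False)
print(fit.params)
```

The same formula pattern carries over to multinomial or ordinal models, although those use different estimation routines.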
Yiwen Tu says
Hello Jim,
I have a problem of statistics on game analysis.
For example we have a multiplayer fps game, and we need minimum 40 people to start a game, each game will last 30min at most. Assume that probability for each player to start a game is 20%. Now we want to know how many people shall be online at same time minimum to ensure that matchmaking time is less than 20 seconds?
Which statistical model or algorithm can we use to analyse this problem?
Thank you.
Matthias Klumm says
Dear Mr Frost,
I am a postdoc in English linguistics and I am looking for the best statistical test to use for a quantitative analysis of data that I have recently collected.
I have collected texts from three different genres (i.e. news reports, commentaries and student stories), and I want to investigate in how many sentences within the texts of each genre a particular word (i.e. ‘however’) occurs. And finally, I want to test whether there is a statistically significant difference in the use of ‘however’ across the three genres.
So my independent variable is categorical (i.e. (i) number of sentences in news reports, (ii) number of sentences in commentaries and (iii) number of sentences in student stories), and my dependent variable is binary (i.e. presence or absence of ‘however’).
Could you please let me know which statistical test is the best to analyze the data described above? I have tried a chi-square test and came up with a very low p-value, but maybe there is a better test for my data (ANOVA, binary logistic regression…)?
Thank you very much for your help!
Best regards,
Matthias
Maja Wójcik says
Hello Jim!
Your material is really helpful, but I couldn’t find answer for my type of problem.
I have 2 dependent variables which are continuous (scores from questionnaires) and I want to do regression for each of them separately or both at once if possible.
BUT
most of my predictors (independent variables) are either nominal or ordinal.
What should I do then?
Thanks in advance for your help 🙂
Caleb says
Hello Mr.Frost,
I want to predict salary based on 3 Categorical Variable as predictors :
– Experience Level (4 level)
– Expertise Level (4 level)
– Company Size (3 Size)
I am not sure which regression model i can use here
Jim Frost says
Hi Caleb,
The type of regression model that you should use depends more heavily on the type of response/dependent variable. Because you’re talking about salary, that would be a continuous variable. Your predictors are actually ordinal variables. Ordinal data is a combination of categorical (groups) and continuous (order).
I’d recommend starting with least squares regression. You’ll need to decide whether to include your set of ordinal variables as categorical, continuous, or a mix. That’ll depend on what provides a better fit and your research goals. Entering variables as categorical variables also uses up more degrees of freedom, so your sample size might become an issue. Pay extra attention to the residual plots because with ordinal predictors it can be hard to get a good fit, particularly when entering them in the model as continuous variables.
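To make the degrees-of-freedom tradeoff concrete, here is a small sketch that fits the same simulated salary data both ways. The integer level codes, the salary formula, and the sample size are made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 300

# Ordinal predictors stored as integer level codes (simulated)
df = pd.DataFrame({
    "experience": rng.integers(1, 5, n),    # 4 levels
    "expertise": rng.integers(1, 5, n),     # 4 levels
    "company_size": rng.integers(1, 4, n),  # 3 levels
})
df["salary"] = (40000 + 8000 * df["experience"] + 5000 * df["expertise"]
                + 3000 * df["company_size"] + rng.normal(0, 5000, n))

# Option 1: treat each ordinal predictor as categorical (indicator variables)
as_categorical = smf.ols(
    "salary ~ C(experience) + C(expertise) + C(company_size)", data=df).fit()

# Option 2: treat the level codes as continuous (one slope per predictor)
as_continuous = smf.ols(
    "salary ~ experience + expertise + company_size", data=df).fit()

print(as_categorical.df_model, as_continuous.df_model)
print(as_categorical.rsquared_adj, as_continuous.rsquared_adj)
```

The categorical coding spends 3 + 3 + 2 = 8 model degrees of freedom versus 3 for the continuous coding, which is why sample size matters more for the first option.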
I hope that helps! Best of luck with your analysis!
Federico says
Hello Mr. Frost,
Thank you for the blog article. Just a quick clarification:
We are accustomed to think that linear regression is the suitable model if the output response variable is numerical and continuous. However, I believe that the output response variable can also be discrete. In reality, cost, height, etc. are only theoretically continuous and practically numerical and discrete…Cost is discrete (the cent is the smallest amount. the mm is the smallest division on a meter stick)
“Count” is a discrete variable and Poisson regression is the suitable model. But Poisson regression is not the suitable model for every discrete output response variable, correct?
Thank you!
Federico
Jim Frost says
Hi Federico,
I’d generally agree with you. There’s a rule of thumb that if you have at least 10 discrete values that are equally spaced, you can count it as a continuous variable. Because of measurement accuracy limitations in every device and the resulting significant digits, all continuous variables are effectively discrete at that level. But generally they easily meet and far exceed the rule of thumb criteria, so we treat them as continuous. For convenience, we still refer to those theoretically continuous variables as continuous variables rather than discrete.
Technically, the Poisson distribution is for non-negative integers where the mean equals the variance, which are typically count data. Theoretically, you could use Poisson regression for discrete data that fit those criteria even if they are not counts. I’m not aware of it actually being used that way, but I wouldn’t want to say that it couldn’t.
However, you are correct in that it is generally not suitable for most discrete variables other than count data. Indeed, it’s not even appropriate for all count data. If the variance is greater than the mean, known as overdispersion, you might use the negative binomial regression instead. Or a zero inflated model if your data have too many zeros.
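A quick way to see what overdispersion looks like is to compare the variance-to-mean ratio of two simulated count samples with the same mean. This is only a descriptive check on made-up data, not a formal test, and the mean and shape values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
mean_count = 4.0

# Poisson counts: variance should be close to the mean
poisson_counts = rng.poisson(mean_count, size=5000)

# Negative binomial counts with the same mean but extra variance
shape_k = 2.0
success_prob = shape_k / (shape_k + mean_count)
negbin_counts = rng.negative_binomial(shape_k, success_prob, size=5000)

def variance_to_mean(counts):
    """Sample variance divided by sample mean (about 1 for Poisson data)."""
    return counts.var(ddof=1) / counts.mean()

poisson_ratio = variance_to_mean(poisson_counts)
negbin_ratio = variance_to_mean(negbin_counts)
print(poisson_ratio, negbin_ratio)
```

A ratio far above 1, as in the second sample, is the pattern that points toward negative binomial regression.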
To summarize, Poisson regression is applicable for some count data and generally nothing else.
I hope that helps!
Amar says
Hi Jim,
Thank you so much! I will try to read more about what you said and try out your suggestions. This information will be so helpful in my future research studies as well. Thank you once again.
Have a nice day! 🙂
Engelbert Buxbaum says
And to think that many years ago I chose to study biology over chemistry because the maths-requirements were easier! Thanks for your input
Engelbert
Jim Frost says
You’re very welcome! Even as a statistician, there are all these very specialized analyses, often field specific, that are hard to keep up with.
Engelbert Buxbaum says
Hi,
my data set is a continuous outcome variable against time of day (i.e., the independent variable is circular). The data are measured when practical and are therefore irregularly spaced, so FFT cannot be used. There is also a lot of scatter. So far, I have sorted the data by time of day (irrespective of date) and then used a sliding average to filter out most of the noise. This actually works surprisingly well (especially as circular data have no ends). There clearly are two maxima (early morning and afternoon), so I cannot use regression to a c-association (Fisher 1993). Is there a similar method for data with multiple peaks? And is my filtering method acceptable? I have tried Vaníček-Lomb-Scargle periodograms (using the R-package Lomb) of the original data, but that method often does not identify any significant periodicity where I can clearly see it in the filtered data, so I am a little worried…
Thanks in advance
Engelbert
Jim Frost says
Hi Engelbert,
Thank you for reaching out with your detailed question about your time-series data analysis. I’m not specialized in circular data nor have I used the following analyses for this purpose, so I don’t have a lot of personal insight to offer. But based on my research, I’d look into the following approaches.
Your use of a sliding average is a rational starting point for trying to understand the underlying patterns in your data. However, there are intrinsic limitations to this method, especially when dealing with irregularly spaced observations, as it can potentially distort the true nature of the data patterns due to its assumption of uniformly spaced data points.
Given the circular and irregular nature of your data, I would recommend exploring circular kernel smoothing as a more suitable alternative. Kernel smoothing, in general, can be especially beneficial as it assigns weights to observations based on their proximity in time, making it more suited for irregularly spaced data. It also adapts well to the irregular intervals between observations, providing a more accurate representation of underlying trends and patterns.
Using circular kernel smoothing will accommodate the circular nature of your independent variable and weights the contribution of nearby points based on their circular proximity, respecting the cyclic boundaries of the data. This is particularly advantageous for data like yours, allowing for more accurate representation of underlying trends and patterns and preserving localized features better than a sliding average would.
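One way to sketch circular kernel smoothing is a kernel-weighted average with von Mises weights, which depend only on angular distance and therefore wrap around midnight. The function name, the concentration parameter `kappa`, and the simulated two-peak signal are all assumptions for illustration. In practice the bandwidth would need to be tuned.

```python
import numpy as np

def circular_kernel_smooth(hours, values, grid_hours, kappa=8.0, period=24.0):
    """Kernel-weighted average with von Mises weights on a circle."""
    theta = 2 * np.pi * np.asarray(hours) / period
    grid = 2 * np.pi * np.asarray(grid_hours) / period
    # Weight is largest when an observation is close to the grid point
    # on the circle; 23:30 and 00:30 count as neighbors.
    weights = np.exp(kappa * (np.cos(grid[:, None] - theta[None, :]) - 1.0))
    return (weights @ np.asarray(values)) / weights.sum(axis=1)

rng = np.random.default_rng(1)

# Irregularly spaced observation times over the day (simulated)
hours = rng.uniform(0, 24, 400)

def bump(h, center, height):
    return height * np.exp(6 * (np.cos(2 * np.pi * (h - center) / 24) - 1))

# Two peaks: early morning (06:00) and afternoon (15:00), plus noise
values = bump(hours, 6, 3.0) + bump(hours, 15, 2.0) + rng.normal(0, 0.5, 400)

grid_hours = np.linspace(0, 24, 49)  # every half hour, endpoints included
smoothed = circular_kernel_smooth(hours, values, grid_hours)
```

Because the weights use the cosine of the angular difference, the smoothed curve at 0:00 and at 24:00 coincide, which a plain sliding average over sorted times does not guarantee.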
Additionally, consider the following methods for multimodal data.
Gaussian Mixture Models (GMM) ideally employ circular distributions as components to account for the cyclicity and avoid misinterpretations inherent with linear Gaussian distributions in a circular setting.
Non-Parametric Density Estimation, such as Kernel Density Estimation (KDE), can utilize a circular kernel and account for the circular nature during density estimation to accurately identify multiple peaks in circular datasets.
Clustering Algorithms can incorporate distance metrics suitable for circular data, such as circular distance, to ensure correct cluster identification and separation associated with different peaks.
Wavelet Analysis for Circular Data: The application of Periodic Wavelet Transform can help analyze your circular data by treating it as a period of a periodic signal. This approach maintains continuity at the boundaries and is crucial for effective analysis of circular data.
I hope that at least provides you with some potentially useful options to consider. Best of luck with your analyses!
Amar says
Hi Jim,
Thank you for replying!
The scale for measuring attitudes is a categorical scale (3 sub scales having 5 point likert type items) and the outcome variable is measured on a continuous scale (11 point likert type items).
Posting this comment again since I’m not sure if the previous one got successfully posted or not 😅
Jim Frost says
Hi Amar,
I did see and reply to your previous comment/question. I had asked for additional information about your DV before answering in detail. Thanks for supplying that!
If you can truly count your outcome DV as a continuous variable, a good place to start would be Least Squares Regression. However, while having an 11 point scale does help in terms of counting the DV as a continuous variable rather than a trickier ordinal variable, there are some cautions. The 11 points must all be equally spaced in terms of differences in the underlying construct. For instance, is the difference between 9 and 10 the same as the difference between 2 and 3? When you’re dealing with a more subjective attribution, which often occurs when using a Likert scale, that requirement is less likely to be true. So, just something to consider.
For more information on this issue, read a comment I’ve recently written to another reader about expanding a 5-point Likert scale to a 20-point scale. Some of the issues apply to you in terms of whether you can treat your DV as continuous or ordinal.
That all said, I’d try least squares regression and treat the DV as continuous. See if you can get a good fit. That may or may not work given the issues I discussed but it’s worth a try.
If you can’t get a good fit, you’ll need to try a different method, such as ordinal logistic regression. That treats your DV as an ordinal variable, reflecting its true nature. However, 11 outcomes for an ordinal variable might be too many. Give it a try but you might need to collapse some of the categories.
You’ll also need to determine how you’ll handle all those ordinal IVs. That’s another issue in addition to how to treat the DV as discussed above. Either include the IVs in your model as categorical or continuous variables. There are pros and cons for both methods. And it’ll depend on the nature and quantity of your data, the number of IVs, and the goals of your research.
Definitely see what other researchers have done with similar studies. See how they’ve handled this situation. There might be standard procedures in your field of which I’m unfamiliar. But those are my thoughts given what you’ve provided.
Best of luck!
Amar says
Hi Jim,
I am trying to analyse my data and I’m not good with statistics, so could you please help me out.
I want to examine how different dimensions of the same scale impact the dependent variable, I’m trying to study the effect of different types of attitudes about something on well-being. Could you help me with the correct type of regression for this analysis?
Thank you so much!
Jim Frost says
Hi Amar,
I’d really need to know more about your outcome variable(s) to know which type might work best for you. Are the outcomes measured using a continuous scale? Ordinal? Categorical?
Joe says
Hi Jim
i am truly thankful for your answer.
I am really trying to find which factors could be significant für using alternative medicine (for example: if the person has more than 2000 dollars in income, they are more likely to use alternativ medicine)
using cross-tables i already have shown that alternativ medicine doesnt help significantly for the disease. So its really about the using and not the healing.
i thank you again for your outstanding effort in this article. If more people could be like you we’d have much less problems in life.
Jim Frost says
Hi Joe,
Thanks so much for your kind words!
Definitely sounds like Alternative Medicine is your Outcome/Dependent variable. Binary logistic regression will model the probability of someone using alternative medicine given the independent/predictor variables in your model. Those predictors can be continuous and/or categorical.
joe says
after looking for so long on the internet i am only getting more confused.
i stumple on this article and i think i almot got salvation
my data is mostly yes/no based
mz dependent variable is alternative medicine and i want to run univariate regression analysis the a multivariate
would i get correct rasaults if i use the following:
logistic binominal regression
dependent variable: Alternativ medicine (yes/no)
covariate: none
factors: each variable individually (infections, improvment, use of acupuncture etc) (yes/no)
my program (jamovi) always does the math without considering if the variate are nominal, continuous or ordinal. so i cant know if what im doing is making sence.
finding the answer will help me not get fired 🙂
Jim Frost says
Hi Joe,
Well, I certainly don’t want you to get fired!
If your dependent variable is binary (yes/no), then binary logistic regression is a good place to start.
However, I have a question for you. You say that “alternative medicine” is yes/no and that it is a dependent variable (DV). Is that truly the outcome you’re modeling/predicting? Based on the rest of your message, it sounds like it is one of the independent variables (factors).
If you’re truly using the factors you list to predict whether someone is receiving alternative medicine, then it’s the binary DV in your model and binary logistic regression is a good choice.
However, if you want to use alternative medicine as a yes/no factor to predict a different outcome, then I’d need more information about the other outcome.
Katerina Giullana says
Thank you so much for your appreciated help. I used binomial logistic regression and I have my results. Which do you believe is the best graph/plot to use to visualize my results for my thesis?
Jim Frost says
Hi Katerina,
For the IVs, you should report the odds ratios (OR), the OR confidence intervals, and the p-values. ORs are essentially the effect size for the IVs. The key is that you want the CIs to exclude the value of 1, which represents no effect in this context. When the OR CIs exclude 1, you’ll also have significant p-values. Read more about Odds Ratios in my article about them. I discuss the regression context towards the end but you’ll learn more about them throughout the article if you’re not already familiar.
In terms of graphs, I’ve used contour plots to illustrate the results effectively. Although, that works better with continuous IVs and you have categorical IVs. But take a look at this example of where I use binary logistic regression and illustrate the results with graphs.
But the most important results I’d say are the ORs and their CIs. Followed by the p-values.
Waqar Ul Hassan says
Hello Jim,
Nice piece of reading. But I’m still confused can you help me.
I am conducting a research for my master dissertation, and for that I have to identify the “determinants of antenatal care (ANC) and skilled birth attendants (SBA) childbearing women in Pakistan” for that i have to develop the two different models;
1) ANC = f(age of women, education of women, income, partner living with, distance to health facility, parity, no. of pregnancies, region………………)
construct three categories of dependent variable;
0-3 ANC Visits and coded as 0
4-7 ANC Visits and coded as 1
8-and above ANC Visits and coded as 2
2) SBA = f(age, edu, income, partner living, parity, no. of pregnancies, ANC visits………………)
construct three categories of dependent variable;
No one assisting at the time of delivery and coded as 0
traditional birth attendant assisting at the time of delivery and coded as 1
skilled birth attendant assisting at the time of delivery and coded as 2
I’m confused what regression model will be suitable for it? Ordinal or multinomial logistic regression or else?
kindly guide, it will be appreciated.
Katerina Giullana says
Hello, thank you for spending time on posting and answering our questions.
Personally, after reading your articles I am still confused on what test to choose for my data analysis. I would provide some info and I would be happy to receive help.
I want to test with regression if there is an effect on the survival rate of my earthworms, which is my dependent variable. The survival rate can have the following values: 0, 0.33, 0.67, and 1. My two independent variables are the earthworm species (L. rubellus, A. caliginosa) and the flooding conditions of my experiment (Yes, No). My observations are 80 in total, based on the number of earthworms I found in each one of my 80 pots at the end of my experiment. I started with three earthworms in each pot.
At the beggining I tried to analyze my data with the chi-squared test, but after looking online I believe that regression is better.
Jim Frost says
Hi Katerina,
First, I agree that regression is probably the better choice. More on that later.
Initially, I was confused by one aspect of your description. How can the survival rate be set to those four values? Survival should be an outcome that you observe. It’s not something you set. Then I realized you’re assessing the survival rate in each pot with three earthworms per pot. So, your sample consists of 80 pots in your description.
There are two potential ways to model this setup using regression. You need to decide whether you will model the count of survivors per pot (n = 80) or the survival rate of all the worms (n = 240). Use Poisson Regression (or maybe negative binomial regression) to model the count of survivors for each pot given the IVs. Or, use binary logistic regression to model the probability of survival of worms given the IVs.
If many worms have the same combination of IV values, the worms count as replicates and are great for experiments. Replicates are different experimental units (worms in your case) that undergo the same conditions and allow the model to assess random error. In that case, I might lean towards using binary logistic regression. However, if most/all worms have different combinations of IV values (no replicates), you might consider using Poisson regression (or negative binomial) because each pot is its own unique thing in your experiment. It’s even possible that both approaches are legitimate and might give the same overall results.
If both models are usable, I’d recommend binary logistic regression because the results are more interpretable for readers. The difference would be binary logistic regression modeling survival probabilities given your model versus Poisson regression modeling the count of survivors when you start specifically with 3 worms in a pot given your model.
However, there are probably other subject-area considerations of which I’m unaware. It might be worthwhile consulting with experts in your field or reviewing the literature for similar studies. One concern you should consider is whether the various pots have experienced different environmental conditions that you couldn’t control or measure. They would be potential confounding variables. Those would be concerns with either type of regression model.
A chi-square test is another possible analysis but it provides less information. It’ll tell you whether there are statistically significant relationships between categorical variables. However, chi-square can’t estimate the strength of the relationship between each IV and the DV, it can’t model the probabilities, and it can’t determine the significance of individual variables the way binary logistic regression can. So, I agree with trying binary logistic regression.
I hope that helps! I’d be very interested in knowing how you proceed and the results, if you’d care to share later on in the process!
Mara says
Hi Jim,
This is extremely helpful. I have just discovered your blog and it already became my favorite resource for understanding statistics. Like many others here, I got stuck too while working on my PhD. I’m not quite sure how to determine the type of my dependent variable, so I can apply an accurate regression method in my analysis. The scale of answers include: no, yes, yes – more than once (have you ever experienced any of the following…). Would that be an ordinal or cathegorical variable? I’ve seen some articles using logistic regression for similar kind of analysis I’m attempting to perform, but their dependent variables are binary with yes and no answers.
Thank you for this post and taking your time to review my question,
Mara
Jim Frost says
Hi Mara,
Typically, the choice of regression analysis depends on the dependent variable (DV). If your DV has only Yes or No answers, it is a binary variable. A binary variable is a special case of a categorical variable. With a binary DV, consider using binary logistic regression.
Shaad says
Hi Jim,
I am very lucky that I found your blogs which helped me as a beginner to understand the concepts of regression very clearly. your blog is literally a life saver for me in my research.
I have some doubts in my work. I am trying to identify various factors affecting the snowfall. For that snowfall rate is taken as the DV and other topographical and climate parameters are taken as IVs. I checked for individual correlation between each of the IVs with the DV but getting very low values.. it’s observed that multicorrelation present between IVs. Can you help me to select the right method to understand the influence of each parameters. Can I use PCA? I have 35 observations representing 35 locations, 2 DVs and 8 IVs of which one IV is nominal.
Thanking you
Shad
Jim Frost says
Hi Shaad,
It sounds like an interesting study but I have several concerns about it.
For starters, you have only one observation per location. Imagine fitting a separate model for each location. You’d have n=1. You’re not capturing any of the variability within each location for the various conditions.
Also, 8 IVs for 35 observations is probably too many. You might be overfitting your data.
I don’t mean to sound like a downer, but those are real problems to consider.
As for the multicollinearity (correlated IVs), that can be a problem. Try using LASSO or Ridge Regression, which can handle multicollinearity. However, you’d still be overfitting the data, so you might try reducing the number of variables. With 35 observations, you’d safely be able to include 3 or 4 continuous predictors. The nominal IV complicates things because it uses more degrees of freedom.
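To show what ridge regression does with correlated predictors, here is a small NumPy sketch using the closed-form ridge solution on standardized data. The simulated predictors, the sample size of 35, and the penalty value are assumptions; in practice the penalty is usually chosen by cross-validation.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 35

# x2 is nearly a copy of x1, which creates severe multicollinearity
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
x3 = rng.normal(size=n)
y = 2.0 * x1 + 1.0 * x3 + rng.normal(scale=0.5, size=n)

# Standardize predictors and center the response so no intercept is needed
X = np.column_stack([x1, x2, x3])
X = (X - X.mean(axis=0)) / X.std(axis=0)
y_centered = y - y.mean()

def ridge_coefficients(X, y, penalty):
    """Closed-form ridge estimate: solve (X'X + penalty * I) b = X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)

ols_coefs = ridge_coefficients(X, y_centered, penalty=0.0)    # ordinary least squares
ridge_coefs = ridge_coefficients(X, y_centered, penalty=5.0)  # shrunken estimates
print(ols_coefs, ridge_coefs)
```

The least squares coefficients for the two correlated predictors can swing widely from sample to sample, while the penalty pulls the ridge estimates toward smaller, more stable values.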
PCA probably isn’t the best approach for several reasons. Yes, it can reduce the number of independent variables (fixing the overfitting and multicollinearity). And you can use PCA with regression analysis in a procedure called Principal Component Regression, which uses the components as predictors. (Partial Least Squares Regression is a related method that builds its components using the DV as well.) You’ll at least find out if the information in your sample provides predictive abilities. However, it’s not good for understanding the role of each IV because it “mushes” multiple variables together in the components. But, if you can only access that particular dataset it might be the best option because it fixes both the problems I mention.
I hope that helps!
Khalid says
Hi Jim
Given that I’m attempting to understand the relationship between various independent variables such as the number of faculty with PhDs, the number of publications, and the number of patents, and the dependent variable of university ranking,
what would be the most suitable regression analysis method to use in this context?
Karly says
HI, Jim!
I am also doing a thesis and got stuck:). I have a dichotomous categorical variable as IV (gender) and an assessment score as DV. I was going to do a simple linear regression. However, when I check for linearity I get smith weird: two vertical lines for gender 1 and gender 2:). What would that mean? My dataset is not too small. I know that many other similar studies do regression in these cases. If you share your opinion, I would greatly appreciate it !
Jim Frost says
Hi Karly
It sounds like you should be performing a 2-sample t-test to compare the two means. Regression isn’t the best analysis for this case, although it can be done. A 2-sample t-test will tell you if the difference between male and female mean scores is statistically significant.
Nikita says
Dear Jim,
In my thesis, I have made the following hypotheses…
IV–> Perceived Risk of COVID-19 (Continuous variable) DV –> Anxiety (Continuous variable)
Possible moderators–> age, education level, income (categorical variables)
I would like to run three separate regression analysis to test whether the three possible moderators really significantly moderates the relationship between Perceived Risk of COVID-19 and Anxiety.
I know how to run hierarchical linear regression with continuous variable or dichotomous variable as moderator. However, I have difficulty in conducting the regression analysis with polytomous categorical variable as moderator…..
Would you please help?
Thanks a million.
Nikita
Jim Frost says
Hi Nikita,
I’m not entirely sure which part of the process you’re having difficulties with. It sounds like you’re comfortable with fitting models that have interaction effects and using dichotomous variables. The process is essentially the same when your moderator variable has more than two levels. Are you having difficulties coding your categorical variables? Or interpreting the results? I don’t want to get into a long involved reply without knowing more about where the problem lies.
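As one possible illustration of the multi-level case, the sketch below fits a model with and without the interaction between a continuous IV and a three-level moderator, then tests the interaction terms jointly. The variable names, levels, and simulated slopes are assumptions, not the reader's data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 300

# Simulated data: the slope of perceived risk differs by education level
education = rng.choice(["low", "medium", "high"], size=n)
perceived_risk = rng.normal(size=n)
slope = np.where(education == "high", 0.2,
                 np.where(education == "medium", 0.5, 0.9))
anxiety = 1.0 + slope * perceived_risk + rng.normal(scale=1.0, size=n)

df = pd.DataFrame({"education": education,
                   "perceived_risk": perceived_risk,
                   "anxiety": anxiety})

# Step 1: main effects only. Step 2: add the interaction (moderation) terms.
main_effects = smf.ols("anxiety ~ perceived_risk + C(education)", data=df).fit()
with_interaction = smf.ols("anxiety ~ perceived_risk * C(education)", data=df).fit()

# Joint F-test of all interaction terms at once
f_stat, p_value, df_diff = with_interaction.compare_f_test(main_effects)
print(f_stat, p_value, df_diff)
```

A moderator with three levels adds two interaction terms, so the moderation effect is assessed with a joint test rather than a single coefficient.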
Dorwa says
Hi Jim, I am working on a project related to perception (traffic safety) and the perception is rated on 1 to 10 points (visual analogue scale) for 8 videos by the same group of participants (n=300). Each video has different streetscape elements. I am confused as to which regression method to adopt. Reading through your blog, I think linear regression seems appropriate. What do you suggest?
Phumelela Peace Mwelase (@peacegates) says
Thank you, Jim Frost, for such a very useful platform and your unmatched dedication to answer questions and provide help freely.
I am tackling a financial accounting research with panel data. Research on the effect of corporate governance on environmental reporting disclosures. I have formulated a conceptual model which has four main hypotheses (Hypotheses on Corporate Governance Regimes; Audit Committee Attributes, Board Characteristics and CEO Attributes).
Corporate Governance Regimes (CGRs): Based on the main hypothesis on CGRs, a total of 2 sub-hypotheses assumed of the dimensions of CGRs effect (2 factors) on the dependent variable “Environmental Reporting Disclosures, E-Score).
Audit Commit Attributes (ACAs): Based on the main hypothesis on ACAs, a total of 4 sub-hypotheses assumed of the dimensions of ACAs effect (4 factors) on the dependent variable “Environmental Reporting Disclosures, E-Score)
Board Committee Attributes (BCAs): Based on the main hypothesis on BCAs, a total of 4 sub-hypotheses assumed of the dimensions of BACs effect (4 factors) on the dependent variable “Environmental Reporting Disclosures, E-Score)
CEO Attributes (CEO-As): Based on the main hypothesis on CEO-As, a total of 3 sub-hypotheses assumed of the dimensions of CEO-As effect (3 factors) on the dependent variable “Environmental Reporting Disclosures, E-Score)
An article such as this one: (dx.doi.org/10.1108/CG-08-2019-0250) only describes a model on board characteristics.
(a) Would Panel Data Analysis be suitable for this scenario? (b) I have written four (4) main models for each main hypothesis. How do I go about providing the model for each sub-hypothesis, if this makes sense (c) What would be the best trajectory to test these sub-hypotheses?
Claude says
Hi Jim,
I want to understand the relationship between 9 phytochemical variables (binay, presence = 1 and absence = 0) and feeding rate of insects (values expressed in %). If I want to explain the feeding rate from the 9 phytochemicals (independent variables are 9 phytochemical and dependent variable is feeding rate). Which regression model should be used for my data. I am thinking of multiple linear regressions because my dependent variable is quantitative, is it okay? I need your advise on this as soon as possible! Thanks
Jim Frost says
Hi Claude,
Multiple linear regression is a possibility. It’s a good choice if the percentages for your dependent variable don’t tend to be close to either 0 or 100%–the upper and lower limits. Multiple linear regression doesn’t “know” that those are limits. Hence, if many values are near a limit, the model might make predictions for values below 0% and above 100%.
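A tiny made-up example shows the boundary problem. The data are percentages that climb and then sit at the 100% ceiling, and a straight-line fit extrapolates past it.

```python
import numpy as np

# Hypothetical predictor values and percentages capped at 100%
x = np.arange(0, 11)
percent = np.minimum(100.0, 60.0 + 8.0 * x)

# Ordinary least squares line through the data
slope, intercept = np.polyfit(x, percent, 1)

# The fitted line does not "know" about the 100% ceiling
prediction_at_10 = slope * 10 + intercept
print(prediction_at_10)
```

The fitted value at the largest x exceeds 100%, which is the kind of impossible prediction that signals a bounded-response model may be a better fit.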
pavlos zournatzidis says
Hi Jim,
Thanks again
So, my n = 65.
Thanks for elaborating on your reply.
Please let me make another question … when considering a binomial logistic regression how can I decide if my sample size suffices? this is a power analysis or is different for binomial logistic regression (and perhaps all regression models belonging to the generalized linear models family? )
Jim Frost says
Hi Pavlos,
With 65 observations, you’re good to go with binary logistic regression with at least several predictors.
As for your new question, in regression, you need to consider the number of observations per predictor variable. With linear regression, it’s recommended that you have at least 10-15 observations per predictor. If you don’t, you run the risk of overfitting your model. Click the link to learn more about that. I’m not sure offhand what the recommendations are for logistic regression. However, click the overfitting link and I have a reference at the bottom of the post that should get you your answer. Unfortunately, I don’t have easy access to it right now.
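The linear regression rule of thumb is simple arithmetic, sketched below for n = 65. The helper name is hypothetical, and the 10 to 15 guideline is the one quoted above for linear regression, not a logistic regression recommendation.

```python
def max_predictors(n_observations, observations_per_predictor):
    """Largest number of predictors allowed by an observations-per-predictor rule."""
    return n_observations // observations_per_predictor

n = 65
lenient = max_predictors(n, 10)  # 10 observations per predictor
strict = max_predictors(n, 15)   # 15 observations per predictor
print(strict, lenient)
```

So a sample of 65 supports roughly 4 to 6 predictors under that guideline.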
But with n=65, you can definitely fit some binary logistic models!
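As a rough illustration of the observations-per-predictor guideline, the sketch below applies the 10-15 rule quoted above for linear regression to n = 65. The function name is hypothetical, and the rule is used here only as a ballpark; the reply above notes that the recommendation for logistic regression may differ.

```python
def max_predictors(n_observations: int, per_predictor: int) -> int:
    """Largest number of predictors that leaves `per_predictor` observations for each."""
    return n_observations // per_predictor

n = 65  # sample size mentioned in the thread
lenient = max_predictors(n, 10)  # 10 observations per predictor
strict = max_predictors(n, 15)   # 15 observations per predictor
print(f"n={n}: roughly {strict} to {lenient} predictors")
```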
pavlos zournatzidis says
Hi Jim,
Thank you for all the posts – they are very helpful.
What if I have a binary outcome variable, but I have a small sample size for binary logistic regression. Any alternatives?
Thank you in advance,
Pavlos
Jim Frost says
Hi Pavlos,
You don’t mention how small your sample is. I’ll assume it’s very small. But feel free to provide additional details.
When you have a very small sample, you’ll have difficulties using any method. It’s just harder to find relationships with smaller samples. And if you happen to see apparent relationships in your small sample, they are less likely to be statistically significant. And you’re also more likely to get flukey results with small samples. Those are some of the fundamental hurdles you’ll face with a small sample size.
With a small sample, consider sticking with exploratory data analysis. You have a binary outcome and presumably some data for potential explanatory variables. You could see whether the events and non-events of your binary outcome are associated with different means for the potential explanatory variables. For example, are events associated with a higher mean for variable X than non-events? You could create individual value plots showing where the data points occur for events vs non-events. That sort of thing. Essentially, you’re looking for candidates for further study.
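The comparison of means described above can be sketched in a few lines. The records are hypothetical, with each pair holding the binary outcome and the value of one candidate explanatory variable X. This is exploratory only and is not a significance test.

```python
from statistics import mean

# Hypothetical small sample: (binary outcome, value of candidate variable X).
records = [(1, 7.2), (0, 4.1), (1, 6.8), (0, 5.0), (0, 3.9), (1, 8.1), (0, 4.6)]

# Split the X values by whether the event occurred.
events = [x for outcome, x in records if outcome == 1]
non_events = [x for outcome, x in records if outcome == 0]

mean_events = mean(events)
mean_non_events = mean(non_events)
print(f"mean X for events:     {mean_events:.2f} (n={len(events)})")
print(f"mean X for non-events: {mean_non_events:.2f} (n={len(non_events)})")
print(f"difference:            {mean_events - mean_non_events:.2f}")
```

A large gap between the two means flags X as a candidate for further study. With so few observations, it should not be read as evidence of a real effect.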
zelalem T. says
Hi, I have three response variables, and they are associated with one another. My first and second responses are binary, and the third is ordinal. What type of model can I use?
Alejandra says
Hi Jim,
Hope you are ok! Thank u so much for your information.
I’m wondering which tests are available for relationships between a continuous predictor and a categorical response variable. Can you give me an example?
Jim Frost says
Hi Alejandra,
In this post, look for the subsection titled, “Regression Analysis with Categorical Dependent Variables.” Dependent variable and response variable are synonyms. I provide examples and the types of regression analyses that answer your questions there.
John Kitt says
Hi Jim
Can you help?
I have temperature and humidity data for several months.
I have three different types of plants in four different soil types.
I want to see and compare if life survival of plants is dependent on soil type and weather data.
How can I get all this into a single test? Can I do this in a hierarchical ANOVA?
Jim Frost says
Hi John,
This sounds like a case for binary logistic regression. The outcome variable is binary–survival yes or no. You can include the type of plant as a categorical predictor and temperature and humidity as continuous predictors. That gets everything into one model.
You could split it into three models, one for each plant type. Consider doing that if the relationship between the predictors (temperature & humidity) and plant survival changes depending on plant type. If that is true, a single model would need a number of interaction terms, making it more complex and less intuitive to interpret.
But if you can get a good fit with a single model, that’s fine too! A single model allows you to determine whether type of plant affects survivability while controlling for temperature and humidity.
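To show how a categorical predictor and continuous predictors sit side by side in one model, here is a minimal sketch of how a single row of predictors might be laid out before it is handed to a logistic regression routine. The plant-type labels "A", "B", and "C" and the function name are hypothetical, and the sketch builds the predictors only; it does not fit a model.

```python
def design_row(plant_type, temperature, humidity, levels=("A", "B", "C")):
    """Build one row of predictors: intercept, plant-type dummies, continuous terms."""
    if plant_type not in levels:
        raise ValueError(f"unknown plant type: {plant_type!r}")
    # The first level is the reference category, so it gets no column of its own.
    dummies = [1.0 if plant_type == level else 0.0 for level in levels[1:]]
    return [1.0, *dummies, float(temperature), float(humidity)]

row_a = design_row("A", 21.5, 60)
row_c = design_row("C", 18.0, 72)
print(row_a)  # reference plant type: both dummy columns are 0
print(row_c)  # plant type C: the second dummy column is 1
```

Three plant types need only two indicator columns because the reference type is absorbed by the intercept. Each dummy coefficient is then interpreted relative to that reference type.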
Komal Sarbuland says
Hello Sir,
I got an assignment where we need to select the statistical analysis ourselves. The research has the following variables.
1 : Six year graduation (0 = No, 1 = Yes)
2 : URM (0 = No, 1 = Yes) Underrepresented Minority
3 : Pell Eligible (0 = No, 1 = Yes)
4 : Gender 0 = Female, 1 = Male
5 : Participation in LLC (0 = No, 1 = Yes)
6 : Course unit earned (Continuous)
7: SAT scores (Continuous)
8 : Pell Eligible (0 = No, 1 = Yes)
Research Question:
When controlling for gender, URM and Pell, what is the best predictor for reducing equity gaps between URM and non-URM students graduating from State University in six years? Is it participating in a living learning community, the number of units earned during the first year, or the SAT score?
Ebenezer Esaah says
Hi Jim
I am predicting consumer purchase behaviour in microinsurance (either health or life policies) with Python. The independent variables are Age, Income level (medium or low), Region, and Type of City (urban/rural).
Given the mixed nature of my independent variables, which type of regression would be appropriate?
1991 says
Hi Jim,
I am making cheddar. I have one dependent variable: the grade it is given (1-5) and then independent variables such as acidity, time from rennet to press etc…
Is this a simple example of count regression? If so, can I use OLS?
Thanks in advance,
Sebastian
Jim Frost says
Hi, that’s an example of ordinal logistic regression because your dependent variable seems to be an ordinal variable.
Faten says
Hello Jim
I hope you are doing great!
My questions is:
I have one independent variable (teacher autonomy support) and one dependent variable with 4 subscales (student engagement, which has four types). My data are ordinal, which means I used a Likert scale (always, sometimes, ..., never).
– Which type of regression can help me to know the impact of teacher autonomy support on each type of student engagement?
Thank you in advance
Pradip Kumar Nanda says
Dear Jim Frost
Hope you are doing well
I wish to forecast employees performance as DV on Likert scale (1-5) ,ordinal variable, having 8 dimensions(items).
There are 4 IVs (construct) as ordinal variable measured in a Likert Scale(1-5) having each 8 items(questions) each .
The sample size is 516
Plz offer suggestions on
1. What statistics to be chosen Mean/Median/Mode.
2. whether sample mean (for both IVs and DV) can be derived to treat it as continuous data and apply parametric regression such as multivariate regression.
3. If not what type of ordinal regression can be applied ,suggest in brief.
4. Whether it is possible to enhance the statistical power of regression on ordinal variables by using the advantages of parametric tests that assume a normal distribution.
5. My data set is skewed as the variable are ordinal, and how to transform the distribution to normal distribution.
Thanks
Pradip Kumar Nanda
Nor hanieza says
Hello. I have a question. The objective of my study is “factors influencing food security among adults,” and there are more than two independent variables. What is the appropriate analysis to use? Is it multiple regression or logistic regression? And in my questionnaire, what form of question should I use? Should I use a Likert scale or “yes”/“no”?
Jim Frost says
Hi Nor,
The type of regression model depends more on your dependent variable. What type of variable is that? How do you measure food security?
You can use Likert scale and yes/no questions. Likert scale items can be a bit difficult to model because they are ordinal variables, so you’ll need to include them either as continuous or as categorical variables. There are reasons to go either way. I cover that in my regression book. Yes/No questions are not a problem.
Filipa Carvalho says
Hello Jim!
So I have one continuous dependent variable (engagement) and 3 categorical independent variables, each with 2 levels: information (yes/no), entertainment (yes/no), and platform (Facebook/Instagram). First, do I consider it a 2x2x2 factorial design?
Second, can I do a multiple regression?
Third, assuming I can make the regression, I would also like to see if entertainment together with information and entertainment together with no information have any effect on the DV. How would I do it?
Thank you so much.
Best,
Fil
Jim Frost says
Hi Filipa,
Using standard notation, you’d say that you have a 2³ factorial design. You have three factors, each with two levels.
Yes, you can definitely use regression for that or ANOVA.
Typically, you’ll look at the p-values to identify which factors have a statistically significant relationship with your dependent variable. You can also include interaction effects to see if the value of one factor affects the relationship between a different factor and the DV.
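For illustration, three two-level factors give 2 × 2 × 2 = 8 cells, and an interaction effect can be represented as the product of two coded factors. The sketch below enumerates the cells and builds a hypothetical information-by-entertainment interaction column. The 0/1 coding and the column names are assumptions made for this example.

```python
from itertools import product

factors = {
    "information": ("no", "yes"),
    "entertainment": ("no", "yes"),
    "platform": ("facebook", "instagram"),
}

# Every combination of levels: 2 * 2 * 2 = 8 cells.
cells = list(product(*factors.values()))

def coded(cell):
    """0/1 code each factor and add an information x entertainment interaction column."""
    info, ent, plat = (factors[name].index(level) for name, level in zip(factors, cell))
    return {
        "information": info,
        "entertainment": ent,
        "platform": plat,
        "information_x_entertainment": info * ent,
    }

for cell in cells:
    print(cell, coded(cell))
```

With this coding, the interaction column equals 1 only when both information and entertainment are present. Its coefficient therefore captures how the entertainment effect differs between posts with and without information.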
Alice Jackson says
Hi Jim ,
I have a question for you about how to interpret the intercept of a lm() model with two categorical predictors (IVs).
m1 = lm(DV ~ IV1 + IV2, data)
m2 = lm(DV ~ IV1 * IV2, data)
I noticed that the intercept in m2 (which includes interaction effect) gives the mean of the baseline categories of IV1 and IV2, however, the intercept is different in m1 without interaction effect. Could you explain to me how to interpret the intercept of models like m1 which include the main effects of the two categorical IVs but not their interaction effect? What exactly does its intercept tell us?
thank you very much.
Best wishes,
Alice
Jim Frost says
Hi Alice,
For starters, m1 and m2 are completely different models. So, it’s no surprise that the intercepts are different. You should check the residual plots for both models to see whether either of them produces good residual plots.
As for interpreting the intercept, I’ve written a post about that: How to Interpret the Constant (Y Intercept) in Regression Analysis. That should answer your questions. In general, the intercept does not provide useful information and it’s usually not worthwhile interpreting. Click the link to read why.
ed says
what kind of model can I use when the dependent variables are categorical and continuous?
Michael says
Dear Jim,
I was wondering what you would do in the following situation.
The data is panel data, with country-years as the unit of analysis. There are 20 countries over 20 years.
The dependent variable is a quasi-count. It has 64% of zero values and 36% of positive values. These positive values range from 1 to 10383 (70 unique values). It so happens that they are all non-negative integers (although theoretically, they could also be non-negative NON-integers, i.e. they could theoretically have a decimal point).
The main problem has been with the issue of “too many zeros”.
We have discarded zero-inflated models for 3 reasons: a) we know there are no “true” zeros in our database, because all cases could be positive non-zeros, b) we do not have theoretical or empirical reasons to suspect a two-step process as implied by zero-inflated models where a first step distinguishes zeros from non-zeros and a second step models variation among non-zeros, c) we know the difference between zero and low values (e.g. 1 or 2 or 3) are not really significant theoretically or empirically. For the same reasons, we did not go into Heckman selection models such as a Tobit-2 model.
Considering the overdispersed nature of our dependent variable, that it is a quasi-count ranging from 0 to 10383 with 64% of zero values, we ended up with a negative binomial model. Because our data is country-year panel data, we first ran a random effects negative binomial model, and then a fixed effects negative binomial model. Because some people might question whether our dependent variable is a true “count variable” or not, we also ran a GEE model (i.e. a population-average panel-data model using generalized estimating equations) where we specified a gamma distribution with a log link function.
What do you think?
We are in two minds. On the one hand, we do not have a typical count variable (often, count variables do not go from 0 to 10383, so it is maybe not your typical count and the negative binomial might struggle with such high values) so some might take issue that our non-negative integer variable should not be modelled through a negative binomial, whilst gee models with a gamma family and a log link function are not so common in my field (and additionally, they often fail to converge for models with too many variables). On the other hand, we are not sure what type of model could cope with 64% zero values, especially considering that zero inflated models (such as zero inflated poisson or zero inflated negative binomial) and heckman selection models do not seem empirically or theoretically appropriate. We substantively know that we do not have a two-step process going on.
Do you have any feedback or advice?
Or do you think that what we did sounds ok?
Sorry for the long post.
Jovelyn says
Hello Jim,
May I ask what to use if my independent variables are ranked data, the data is in percentile rank, can I still use logistic regression?
Jim Frost says
Hi Jovelyn,
Unfortunately, ranked variables are difficult to incorporate as IVs. Or, I should say, there are more complexities involved. You’ll need to decide to include them either as continuous variables or categorical variables. Ranked data are ordinal variables, which share properties of both continuous and categorical variables. The decision to include them either as continuous or categorical variables depends on both the goals of your study and the nature of your data (e.g., sample size, which choice provides a better fit, and number of ranks per variable). I write about this in more detail in my regression analysis book.
Samuel Frimpong says
hello Jim,
I would appreciate your help with the kind of regression model suitable for my experimental design, I am somewhat conflicted.
Okay, so my experimental design is a 3x2x2 factorial design. The three factors comprise one continuous independent variable and two categorical independent variables. Your input as to the most suitable regression model for the data analysis will be greatly appreciated.
NB: The response variables are all continuous.
Thank you very much and hoping for your kind and swift response.
Samuel
Jim Frost says
Hi Samuel, I’d start with a linear model. They can handle a mix of continuous and categorical IVs with a continuous DV.
Nathan says
Hello Jim,
I’ve read your book, Linear Regression: An Intuitive Guide, and really enjoyed it. It is clear and well written. One question that I did not see covered in your book or on your website (or others) concerns Deming regression. Specifically, do the Gauss-Markov assumptions apply to this type of regression? Put another way, is a Deming regression the BLUE for cases where there are errors in both x and y measurements?
Thanks for your time and keep up the good work!
– Nathan
Fatimah says
Hi Jim! I hope you are doing well. thanks for this awesome post.
I need a suggestion on selecting the statistical test for my data, as it has become too complicated for me.
Hypothesis 1: there is a positive correlation between self-concept and ideal self
Hypothesis 2: self-concept and ideal self (negatively) impact the preference towards social norms
I want to see if “self-concept and Ideal self impacts the social norm perception”, and if it does, what kind of relationship can be seen. More specifically, I want to see does “A and B (self concept) and C and D (Ideal self) impact [E1, E2], and [F1, F2, F3, F4, F5, and F6] (social norm perceptions)”, and if it does what kind of relationship can be seen.
Details:
The self-concept is measured by 2 items, item A and B. Item A has 4 sub-items and item B has 5 sub-items, each measured as 0 (no) or 1 (yes). The scale is calculated for each item (e.g., the scale for item A = 0 to 4). Then the scale for the whole self-concept is calculated as 0 to 9. [A separate variable is also built by applying min-max normalization to each item and variable, which transforms the data to a 0-to-1 scale (does it become a continuous variable?).] It is used as the independent variable(s).
The ideal self is measured by 2 items, item C and D. Each item has 2 sub-items (respondents were asked to select one option each for 1st priority and 2nd priority among several options), measured as 0 (brave) or 1 (calm) for item C and 0 (home) or 1 (profession) for item D. The scale is calculated for each item (e.g., the scale for item C = 0 to 2). Then the scale for the whole ideal self is calculated as 0 to 4. [A separate variable is also built by applying min-max normalization to each item and variable, which transforms the data to a 0-to-1 scale.] It is used as the independent variable(s).
The norm perception is measured across 3 domains: domain E, domain F, and domain G. Domain E consists of 2 variables, “E1” and “E2”, with 5 items for E1 and 3 items for E2. Domain F has 5 variables: “F1”, “F2”, “F3”, “F4”, and “F5”. F1 = 4 items, F2 = 2, F3 = 2, F4 = 4, F5 = 2, and G1 = 2 only. Each item is measured as 0 (not inclined) or 1 (inclined). The scale is calculated for each variable (e.g., the scale for variable E1 = 0 to 5). [A separate variable is also built by applying min-max normalization to each variable, which transforms the data to a 0-to-1 scale.] They are used as the (3) dependent variables.
Issue 1:
The Cronbach’s alpha (internal consistency) for self-concept and ideal self is low: 0.377 and 0.035, respectively. This is because the children did not give the ideal self much consideration (I think).
Issue 2:
Data for ALL variables are skewed, and highly skewed for some, so the distributions are non-normal (children tended to answer similarly).
My thoughts (But need your advice for better selection of test):
For Hyp1: Bivariate correlation between self-concept and ideal self (although it seems un-associated or even negatively associate.)
For Hyp2: Hierarchical linear regression, first adding self-concept and then ideal self; OR moderator analysis, to check whether ideal self moderates the relationship between self-concept and preference in norms, and vice versa; OR multivariate analysis, to run everything together. [Note: I have no skills in any of the mentioned statistical tests.]
I hope to get some valuable suggestion and guidance from you.
Best regards,
Jim Frost says
Hi Fatimah,
If all of these measurements are individual values for something like a Likert scale, you’d need to use ordinal logistic regression because the dependent variable is ordinal. However, if the scale has 10 possible values, there is some justification for treating it as a continuous scale. There is some debate over that.
It’s trickier including ordinal variables as independent variables. You’ll need to include them as either categorical or continuous. There are several pros and cons for both ways. That’s too long for an explanation in the comments section, but I write about it in my regression book.
It seems like you need to use regression analysis, but all the ordinal variables will make it trickier. I’d suggest consulting with a statistician at your institution because it is a tricky scenario and there isn’t always agreement on how you can handle Likert items.
Cecilia says
HI Jim,
Thank you for this. I am trying to run different machine learning models to predict renewable energy generation (continuous), with total salary, price of energy, government incentives, and price of transportation (all continuous variables) as predictors. I did a linear regression. I don’t seem to be able to do a logistic regression (maybe because it’s for categorical variables, as you just explained), and then I will try regression trees, random forest, lqa, lda, and knn. Is this correct or is this not possible?
ivan says
Hi Jim, I would like to ask which analysis I should use to assess the influence of 1 independent variable (with 3 categories) on a dependent variable.
My problem is this:
What is the impact of basic psychological needs (autonomy, competence, and relatedness) on psychological well-being?
thanks in advance.
Jim Frost says
Hi Ivan,
If your DV is continuous and the IV has three groups, you can use least squares regression. However, if you have only that one categorical IV, you could just use one-way ANOVA. To see an example of that procedure, read my post about using Excel to perform one-way ANOVA.
Varun@23 says
Hello Jim,
Thank you for your insights.
My dependent variable is Fire incidence rates (Fire incidences divided by the population in the corresponding area), which is a ratio, hence it is a continuous variable (am I right?). I have 16 independent variables, most of them are count variables. Some examples of my independent variables are: Number of Households in an area, Number of people who smoke in an area, etc. Can I use multiple linear regression for this scenario, despite my independent variables not being continuous?
Your help regarding this will be massive, thank you!!
prunelle says
Thank you for the information. I would like to know if I can use a composite variable in an ordinal logistic regression. I have 1 dependent variable, which is a Likert scale with four questions. I take the mean of the 4 questions to combine them into one item for my ordinal dependent variable.
Jim Frost says
Hi Prunelle,
Yes, researchers will often combine Likert items by either summing or averaging them and create a continuous variable. Amongst statisticians, there is some debate over whether that’s valid. If you believe that the differences between your Likert values represent a constant difference (i.e., the magnitude of the difference between a Likert score 5 and 4 is the same as the difference between 4 and 3, and so on), you’re on more solid ground.
If you use that approach and your DV is the composite variable (the mean of multiple Likert items), you’d use linear regression or other regression that can use a continuous DV. Given the nature of the composite variable, you’ll need to be extra careful about assessing the residuals for lack of fit!
Hilman says
Hi Jim,
I have 1 DV and 33 IV (26 dichotomous, 6 continuous and 1 ordinal).
Have done the correlation using spearman coefficient and the linear regression for the model.
But it seems that it’s not really adequate fit from the residual check of linear regression.
Which one should I try for the next step, a multinomial or polynomial?
Any advice will be greatly appreciated.
Thank you.
Hilman
Jim Frost says
Hi Hilman,
That’s a lot of variables. Be sure that you have a large enough sample size to avoid overfitting your model, which can produce results you can’t trust.
As for next steps, if you’re seeing patterns in your residuals, you definitely need to make changes. You can try graphing residuals by specific variables to determine if there’s curvature you should be fitting. That includes adding polynomial terms to your model to fit the curvature. You should also check and see if your DV is highly skewed. That can make it hard to get good residuals–you might need a transformation in that case. Although, I always recommend transformations as a last resort.
For more advice and tips about fitting models, read my post about fitting the correct regression model.
Pritam says
Hello Jim,
I am working on a linear regression model (a very simple dataset with only two features). I got an R-squared value of nearly 77%, and I have to improve that value. Could you please tell me which transformations can improve my R-squared value (or the accuracy of a linear regression model)?
Jim Frost says
Hi Pritam,
R-squared isn’t something that you can just crank up as needed. Trying to go beyond the natural limit of the unexplained variability in your data will cause problems–namely results that you can’t trust. If at all possible, you should conduct some research and determine what other studies in the same area have obtained for their R-squared values. If they have gotten higher values, then you have reason to see if you can increase yours.
There are some things you can legitimately try. First, check the residual plots. If you see patterns in the residuals, then you know there are changes you can make to get a better fit. Maybe include polynomials for curvature? Or an interaction effect. If your DV is highly skewed, transforming it might be in order. Try a Box-Cox transformation. However, I always recommend data transformations as a last resort because they complicate interpretation and make the model less intuitive.
For more information, read my post about fitting the correct regression model. Lots of tips and practical advice in it!
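For reference, the Box-Cox transformation of a positive value y is (y^λ − 1) / λ for λ ≠ 0 and log(y) for λ = 0. The sketch below applies it to hypothetical right-skewed data for a few fixed values of λ. Statistical software normally estimates λ from the data; that step is not shown here.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a strictly positive value y for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam

skewed = [1, 2, 2, 3, 4, 6, 9, 15, 40]  # hypothetical right-skewed DV
for lam in (1, 0.5, 0):
    print(lam, [round(box_cox(v, lam), 2) for v in skewed])
```

With λ = 1 the data are only shifted by one unit, while smaller values of λ pull in the long right tail more strongly. Coefficients from a model fit to the transformed DV are on the transformed scale, which is part of why interpretation gets harder.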
Fikadu Tola Seyoum says
My dependent variable is continuous, but I’m categorizing it into four groups:
serum bilirubin as 0.21-0.51, 0.52-0.71, 0.72-0.99 and 0.99-1.00. Which regression model should I use to analyse the data?
Jim Frost says
Hi Fikadu,
If you recode your DV in that manner, you’ll be converting continuous data to ordinal data. Hence, you’d need to use ordinal logistic regression. However, I’d strongly recommend against recoding your data like that, if at all possible, because you lose so much valuable information.
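To illustrate the information loss, here is a minimal sketch that recodes hypothetical continuous measurements into four ordered groups. The cut points and values are assumptions made for this example.

```python
from bisect import bisect_right

# Hypothetical cut points separating four ordered groups.
cut_points = [0.52, 0.72, 0.99]

def to_group(value):
    """Return the ordinal group (1 to 4) that a continuous value falls into."""
    return bisect_right(cut_points, value) + 1

values = [0.21, 0.30, 0.49, 0.55, 0.70, 0.80, 0.95, 1.00]
groups = [to_group(v) for v in values]
print(groups)
print(len(set(values)), "distinct values collapse into", len(set(groups)), "groups")
```

After recoding, 0.21 and 0.49 are indistinguishable even though one is more than twice the other. That within-group variation is the information the model no longer sees.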
Gamachu Diriba says
Hi Dear Jim
I have a dichotomous dependent variable and five-point Likert independent variables. Can I analyze them with binary logistic regression?
My objective is to understand contributing factors to dependent.
Please 🙏
Jim Frost says
Hi Gamachu,
Your dichotomous DV is right for binary logistic regression. However, Likert scale IVs pose a problem for all forms of regression, assuming you’re talking about individual Likert scores for each observation rather than an average. Likert data, and ordinal data in general, have characteristics of both continuous and categorical data, and you’ll need to choose one of those ways to enter them in the model. That’ll depend on the nature of your data and the goals of your analysis. I cover that in my book about regression analysis.
Hannah says
Hi Jim! May I ask what can you best recommend if I were to correlate multiple discrete ID and one discrete D? In my case, I want to correlate informal settlers, sq km of impervious pavements, and amount of waste disposal (ID) to number of flood occurrences per year (D)
Thank you! and I wish you well!
Jim Frost says
Hi Hannah,
I’ll assume that by ID and D, you mean independent variables (IVs) and the dependent variable (DV), respectively.
I don’t know what informal settlers are. However, it appears like your DV is a count variable, flood occurrences per year. In that case, you might want to try Poisson Regression or the related negative binomial regression. They can model counts of occurrences. If there are enough floods per year, say 30 or more, the Poisson distribution approximates the normal distribution, and you might be able to use linear least squares regression. However, smaller counts will be skewed, and the other types would tend to provide a better fit because they’re designed to fit skewed counts.
I hope that helps!
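One way to see why small counts are the problem: the skewness of a Poisson distribution is 1 / sqrt(mean), so it shrinks toward zero (the skewness of a normal distribution) as the mean count grows. A minimal sketch, with mean counts chosen only for illustration:

```python
import math

def poisson_skewness(mean_count):
    """Skewness of a Poisson distribution, which equals 1 / sqrt(mean)."""
    return 1 / math.sqrt(mean_count)

for mean_count in (1, 5, 30, 100):
    print(f"mean={mean_count:>3}: skewness={poisson_skewness(mean_count):.2f}")
```

At a mean of about 30, the skewness is already below 0.2. At a mean of 1, the distribution is strongly right-skewed.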
Laurine says
Hi Jim,
I would like to investigate a mediation effect and will do a regression analysis for it. I have two categorical independent variables, both having 2 categories. My dependent variable has 3 (or maybe 4; I still have to do my research) categories. Is nominal logistic regression the right type of analysis?
Hope you can help me!
Jim Frost says
Hi Laurine,
That sounds like the right choice based on what you wrote as long as the DV categories don’t have a natural order. If they have an order, such as high, medium, and low, then you’d need ordinal logistic regression. But, if not, then nominal logistic is correct.
Judian says
Hi Mr. Frost
I have a continuous dependent variable and 4 continuous independent variable and 3 categorical independent variable.
Will multiple linear regression handle all 7 of the predictors?
Jim Frost says
Hi Judian,
Multiple linear regression can handle that combination. Of course, whether it can provide an adequate fit depends on the nature of the relationship between the variables. But it’s a good place to start.
I’ll add that with so many predictors you’ll need a good sample size to avoid overfitting–at least 70 observations but potentially many more depending on how many levels each categorical variable has.
Muddasir Khan says
Hi Jim, I hope you’re doing well. I have one independent variable and 5 dependent variables. I would like to find the impact of FDI on company performance. Here FDI is the independent variable, and for performance I took five dependent variables: return on assets, return on equity, net profit margin, total assets, and total equity. Please suggest the best model for this analysis. Thank you so much.
Tewodros says
I want to conduct an analysis with community support for ecotourism development as the dependent variable and independent variables like
Personal economic benefit
Social and Environmental benefit
Community attachment
Local benefit
Which regression analysis should I employ? Thank you in advance for your help.
Jim Frost says
Hi, if your DV is a continuous variable, I’d recommend starting with linear least squares. If your DV is a different type of variable, look for that type in this post and see which type of regression will work with it.
Justine says
Hello Mr. Frost! I am a student currently working on a research project with one dependent variable, 5 independent variables and 1 moderating variable. A survey will be conducted using a 5-point Likert scale on agreeableness. The research objectives include determining the dominant factors that affect the dependent variable and assessing the significance of the relationships among the variables. Is PLS-SEM analysis applicable for this? I am also confused about whether my data should be classified as ordinal or interval. Thank you in advance!
Henrik Herrebrøden says
Dear Jim,
Thank you so much for this wonderful website! Great to have someone able to break down stats into plain language.
I have a question about adding categorical covariates into a regression model. Here is a sports study I am working on, investigating relationships between games played for various national teams in soccer:
Outcome variable: Number of games played for senior national team
Predictor variable 1: Number of games played for under-21 national team
Predictor variable 2: Number of games played for under-19 national team
Predictor variable 3: Number of games played for under-17 national team
However, I have a dataset with players from six different countries. I want to investigate whether nationality makes a difference to the relationships here, i.e. if the predictive value (of my predictors) changes depending on which country the player represents.
So my question is: Does it make sense to run a multiple regression model with nationality as covariate? If so, do you know how to run this in SPSS (that is, where to enter the nationality variable)?
Your help would be highly appreciated!
Best wishes,
Henrik Herrebrøden,
PhD Student
Rajiv says
I have one categorical independent variable (work configuration) and two dependent variables: one is stress (metric) and the other is stress coping (categorical). What is the right approach to analyze this?
Jim Frost says
Depending on the number of levels in the categorical IV, you can use either a 2-sample t-test or one-way ANOVA for the continuous DV.
For the categorical IV and categorical DV, you can use either the chi-square test of independence or nominal logistic regression.
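For the chi-square test of independence, the statistic compares the observed counts in a contingency table with the counts expected if the two variables were unrelated. Below is a minimal sketch with hypothetical counts and a hypothetical function name. It computes only the statistic and its degrees of freedom; a p-value would additionally require the chi-square distribution from a statistics package.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Hypothetical counts: rows are work configurations, columns are coping styles.
counts = [[20, 10], [10, 20]]
stat, dof = chi_square_statistic(counts)
print(f"chi-square = {stat:.2f}, degrees of freedom = {dof}")
```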
Egla says
Hello Jim,
Thank you for your explanation.
I have 3 independent variables (1 of them is demographic info, so it can be a mixture of continuous and categorical, and the other 2 IVs are continuous, as they both measure time spent on the phone).
My dependent variable is parenting behaviour, measured on a Likert scale. What’s the best regression analysis to use?
Your insight would be greatly appreciated.
Egla
Dugasa Tesfaye says
Is it possible to run regression analysis for dependent variable data categorized by interval? I made intervals to make the encoding easier, but I am stuck on whether it will lead to an interpretable result.
John Santoyo says
Hi Jim, I’m excited to begin reading Regression Analysis! I hope it’ll equip me to answer questions such as the following for regression:
In regression analysis, is there such a thing as an “unrelated” variable? For example, can I use a marketing campaign total spend amount (Pinterest spend) to predict revenue in a separate campaign channel (Facebook revenue) that also has its own spend independent variable (Facebook spend)? So in this example, would Pinterest spend be “unrelated” to Facebook revenue, and thus not useful as an independent variable for revenue?
I can clarify, if needed. Thanks again for this resource! It’s really helping calm my nerves as I start my career as a Data Analyst…
Jim Frost says
Hi John!
I’m so glad you’re reading my website!
Yes, there is definitely such a thing as an unrelated variable. If an independent variable (IV) has a coefficient near zero and a non-significant p-value (usually defined as greater than 0.05), that indicates the IV might be unrelated. However, it’s also possible that it is related but your sample is too small, the data are too noisy, or you drew a fluky sample, so the analysis can’t detect the effect.
I’m not exactly sure what you’re asking. I can’t tell you whether the Pinterest spend predicts Facebook revenue or not. What do the data say?
Shamine Macwan says
Hi Jim,
Since the dataset has 8000+ columns, is it feasible to use least squares regression?
Jim Frost says
Hi Shamine,
Yes, that is no problem at all for least squares regression.
Shamine Macwan says
Hi Jim,
Your post is very informative. I want to predict salary so my predictors are Industry, Role, Technical Skills and average experience. Industry, Role, Technical Skills are categorical variables so after converting them and creating dummy variables these predictors will have binary values(0/1). So I am not sure which regression model can be used here.
Jim Frost says
Hi Shamine,
Because you have a continuous dependent variable (salary), you can use least squares regression. It can handle the continuous and categorical variables in your model. I actually use a very similar example in my regression book to illustrate how to interpret the different types of variables.
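The dummy-variable step described in the question can be sketched as follows. This is a minimal illustration in plain Python, assuming one level is held out as the reference so the indicators are not redundant with the intercept. The function name `dummy_code` and the category values are hypothetical and not taken from the original data.

```python
def dummy_code(values, reference):
    """Return (column_names, rows) of 0/1 indicators, omitting the reference level."""
    levels = sorted(set(values) - {reference})
    names = [f"is_{level}" for level in levels]
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return names, rows


# Hypothetical category values, invented for illustration only.
industry = ["Finance", "Tech", "Retail", "Tech", "Finance"]
names, rows = dummy_code(industry, reference="Finance")

print(names)
for value, row in zip(industry, rows):
    print(value, row)
```

Each indicator column can then sit next to a continuous predictor such as average experience in an ordinary least squares model. The coefficient on an indicator is read as the mean difference from the reference level.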
Abhinandan Chakraborty says
Hello, my IV is Income which is continuous variable but in my DV my respondents can write 3 bicycle brand names which appears at the top of their mind. So now I have majorly 6 brand names and for each I have 0 or 1. I wish to check whether Income plays any role in Brand recall. Since I have now 6 data columns I am unsure about which regression I can run and how ?
Jim Frost says
Hi Abhinandan,
Because you have a categorical DV, you should try using nominal logistic regression. It’s also known as multinomial logistic regression. In your case, you can use it to determine whether income predicts the probability of an observation falling into the different brands.
Veronica says
Hi Jim,
Thanks for the great info! I have a question, I am trying to determine which combination of methods gives me the best results for an analysis. I have a vector of enrichment values for my dependent variable (the output of my analysis), and I have a binary matrix for my independent variable (combinations of methods: 1 if something is used and 0 if not, the matrix is 36 rows by 10 columns). What regression model would you recommend and how should I apply it on these variables?
Allan says
Hello,
May i ask if it’s possible to conduct a moderation analysis with an independent, dependent and moderating variables that are all ordinal? Which regression would work best in that situation.
Thank you.
Jim Frost says
Hi Allan,
Yes, you can. However, there are some additional considerations. The ordinal dependent variable requires that you use ordinal logistic regression, which can handle moderation (interaction) effects. However, entering ordinal variables as independent variables will require you to make some choices. You’ll need to enter them either as continuous or categorical variables. The decision depends on the nature and amount of your data along with goals of your analysis. Unfortunately, I don’t have a blog post to point you towards about all those details, but I do write about it in my regression analysis book.
I hope that helps!
Bab says
Hi Jim,
Many thanks for your informative blog.
I have a large data set (more than 25,000 observations) investigating the effect of one categorical variable (in this case, whether the window is open or closed) on the building temperature (which is my dependent variable). There are other independent variables that affect the building temperature, including solar radiation, air temperature, and illuminance (these are three numerical independent variables). Now I want to find out the effect of my categorical variable (window open or closed) on the building temperature while the other three variables are affecting it as well. It would be much appreciated if you could provide me with your advice on this matter.
Jim Frost says
Hi Bab,
I’d start with least squares regression because temperature is a continuous variable. It can also handle your mix of categorical and continuous independent variables.
Yeshambel Azmeraw says
Dear Jim, my DV is savings and my independent variables are age, sex, family size, employment level, education level, marital status, and income. What type of regression would you recommend?
Jim Frost says
Hi, because savings is a continuous variable, I’d start with linear regression and see if you can get a good fit with that. You have a mix of continuous and categorical IVs, which is just fine for linear regression.
james mcloughlin says
Good morning Jim You have a remarkable talent as an educator and as a welcomer. I’m reminded of the multitude of scenarios ….statistical scenarios …that have emerged from the Titanic tragedy. PERHAPS YOU CAN REFRESH OUR MEMORY WITH RESPECT TO ONE DICHOTOMOUS INDEPENDENT AND ONE DICHOTOMOUS DEPENDENT VARIABLE USING GENDER AFFECTING SURVIVAL …ON THAT FATEFUL VOYAGE. WHEN YOU HAVE HUNDREDS OF MALE/FEMALE OPTIONS FOLLOWED BY HUNDREDS OF SURVIVAL AND NON-SURVIVAL. ARE ANY OF THE REGRESSION APPLICATIONS APPROPRIATE? THANKS, JIM
Jim Frost says
Hi James,
Thanks so much!! I appreciate that!
There are several ways to handle that type of analysis. One would be to use a chi-square test of independence using gender and survival. That analysis will tell you whether an association exists between the two variables. However, that’s not regression.
In terms of regression, I’d use binary logistic regression. I’d set up the analysis so that survival (yes/no) is the dependent variable. You can then include various independent variables, including gender, class, age, etc., to determine which variables played a role in survival and the nature of that role. Here’s a link to a post I wrote that uses binary logistic regression for a different scenario, in case you’d like an idea of how it works.
I hope that helps!
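For the chi-square test of independence mentioned above, the test statistic can be sketched in a few lines of plain Python. The counts below are hypothetical placeholders, not actual Titanic figures, and the function name `chi_square_statistic` is made up for this illustration.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are the two groups, columns are outcome yes / no.
counts = [[30, 10],
          [20, 40]]
stat = chi_square_statistic(counts)
dof = (len(counts) - 1) * (len(counts[0]) - 1)
print(f"chi-square = {stat:.2f}, df = {dof}")
```

With 1 degree of freedom, a statistic above roughly 3.84 corresponds to significance at the 0.05 level. As noted, this only tests for an association. Binary logistic regression is what lets you add more predictors.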
Deepanwita purohit says
Hi Jim,
Very useful post. I am studying the effect of genetic diversity (independent variable) on lifetime breeding success (LBS, dependent variable) in a wild pig species. The dependant variables are all count data ranging from 6 to 45, whereas the independent variables are continuous data. Should I use poisson regression or negative binomial regression (assuming over-dispersion in the dependant variables) to test the correlations?
Thanks
Deepanwita
Troy says
Jim,
I am trying to predict interval y with ordinal level x1, x2, and x3 as well as interval level x4, x5, etc. I suspect (with good reason) that the ordinal level IVs are not very reliable, yet they seem to be more predictive (higher adjusted R-squared) than the presumably reliable interval level IVs. Could the ordinal variables be overestimated in the model? If so, how can I correct the problem?
Troy
waid kaled says
Hi Jim,
I am conducting research using two psychological instruments to assess the quality of life of 150 patients with chronic illness. the instruments are self-assessment Likert Scale with final result dependent variable calculated in scale from 0 to 100 measuring 4 different domains. the independent variables; gender, age, education level, marital status, income, residency, number of family members, and number of complications due to illness. what is the model type that is useful to compare the independent variables as predictors of outcome dependent instrument scale? what statistical test should be used to compare between these two instruments knowing that they are using completely different scaling technique ?
appreciate your kind help
best regards
Lara says
Hi Jim!
I’m helping conduct research using a categorical binary independent variable and a discrete numerical dependent variable. The categorical binary independent variable is originally a nominal variable, but I can choose to transform the data into binary form as the research allows for it.
I was wondering what the best method of analysing would be. How can I go about understanding the correlation in this case in terms of statistical tests?
Keturah says
Hi Jim!
I am currently doing my Psychology Honours thesis and am investigating the relationship between parental separation and conduct disorder (cross-sectional in children).
My main IV in binary (parental separation/non-parental separation) and my DV is continuous (conduct disorder rating scale). I am also hoping to control for other variables / other IVs (economic status, peer-related problems, gender, ethnicity and social support).
I have 3 aims so far. Firstly, I want to explore the inter-correlations between these variables on the DV and examine their one by one relationship. Secondly, I think I want to do a multiple regression with conduct disorder as the outcome for each of these IVs. And lastly, I think I want to use a moderation analysis to examine which of these IVs dampen or strengthen the relationship between parental separation and conduct disorder.
Would you have any advice or feedback regarding if I am on the right track?
As well, I would really appreciate any advice on what statistical analyses I might be able to use to measure these aims or how I might go about re-wording these aims so it sounds better.
Thank you for your help!
Keturah
Jim Frost says
Hi Keturah,
Most rating scales are ordinal. If that’s the case, you’ll need to use ordinal logistic regression rather than linear regression. So, be sure you understand the nature of your DV. If it’s truly continuous, you can use multiple linear regression.
It sounds like regression, whether ordinal logistic or multiple linear regression, would be appropriate. And, with either type you can model moderation effects. I’ve written a post about including and interpreting interaction effects in regression models. Interaction effects and moderation are different names for the same thing.
It sounds like you’re on the right track!
RABIA NOUSHEEN says
Hi Jim
I would like to know that if my response variable is development time, measured in days, then what family of GLM is suitable?
Thanks a lot for your help
Jim Frost says
Hi Rabia,
Generally speaking, time is considered a continuous variable and you can treat it as such. OLS would be a good place to start!
Louisa Balster says
Hi Jim,
great post – thank you for that! But I couldn’t really find the right regression type for me (maybe you could add some examples). My DV is a variable with a value between 0 and 6 (a company gets a value of 6 if they have an environmental strength in each of the six areas). Do you have a recommendation on what regression type would be suitable? That would be highly appreciated 🙂
Cheers,
Louisa
Jim Frost says
Hi Louisa,
Is your DV an ordinal variable? It sounds like it but the info is sparse. If it is an ordinal variable, use ordinal logistic regression. Look at that section in this post for an example.
Mohammad says
Hi Jim,
First, thank you for all your help. This page provides quite a comprehensive framework for regression analysis.
I need a little bit of help with my current research about the behaviour of households.
– I have a dependent variable (attitude change) with 3 subvariables (let’s call them a, b, and c), all of them categorical (5-point Likert scale).
– There are also various independent variables, which are mostly categorical (5-point Likert scale), and a few nominal (gender, etc.).
– In addition, there are 3 independent groups of observations (households) (each group > 100 samples).
Ideally, I would like to explore the influence of the independent variables on each of the dependent subvariables (a,b,c) for each group, separately. But not sure if it is feasible.
There are two concerns here:
1- Is it better to merge the dependent subvariables together and investigate only the main DV, or it is better to model the subvariables separately? Both are possible due to the nature of the topic, but not sure if it is recommended for Regression analysis.
2- As there are 3 different groups, will there be 3 different models for each group separately, or all of them come in only one model?
I guess the right model to use is the ordinal logistic regression analysis. The initial idea was to use mann whitney u test & kruskal Wallis test, but I am lost on choosing the right method.
Thank you very much in advance..
Chaitra Deshpande says
Hello Jim, thank you for the very helpful and crisp, lucid article. I am working on a data where the dependent variable has 10 categories and the independent variables are all categorical. Could you help me understand whether multinomial logistic would be a good option? If not, what other kind of regression can I use?
Thank you
Jim Frost says
Hi Chaitra,
If the levels of the DV have no natural order (i.e., they’re categorical, aka nominal), then use nominal logistic regression. However, if they have a natural order, your DV is an ordinal variable and you should use ordinal logistic regression.
Tizita says
Hello Jim, I need to know which regression is suitable for my title, “The effect of HRM practice on employees’ retention.” Also, I have demographic and HRM variables, so can I test them with logit in one block, or should I measure them independently? Thank you for your response and with regards.
Miduna says
Well i am comparing the population with the co2 emission worldwide the population is in the x axis and the co2 emission is the y axis so may i know what regression to use?
Jim Frost says
Hi Miduna,
Least squares regression would be a good place to start because you have a continuous dependent variable (CO2 emission). However, you might have some added difficulties. I’m assuming you’re measuring both population and CO2 over time. If so, as the CO2 tends to increase, you might experience heteroscedasticity. Additionally, if you’re tracking over time, you’ll need to assess autocorrelation in the residuals and likely need to include time related independent variables to model it.
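One common way to assess autocorrelation in the residuals is the Durbin-Watson statistic. The sketch below computes it in plain Python. The residual values are hypothetical and chosen only to show the two extremes, and this is an illustration rather than a full diagnostic workflow.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences over sum of squares."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals in time order.
trending = [-2, -1, -1, 0, 1, 1, 2]   # neighbours are similar: positive autocorrelation
alternating = [1, -1, 1, -1]          # neighbours flip sign: negative autocorrelation

dw_trending = durbin_watson(trending)
dw_alternating = durbin_watson(alternating)
print(round(dw_trending, 3), round(dw_alternating, 3))
```

Values near 2 suggest little first-order autocorrelation. Values well below 2 point to positive autocorrelation, which is the usual concern with data tracked over time.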
Jag Smith says
Hello sir, what if the type of dependent variable is interval? What regression model do I need to use?
Jim Frost says
Hi,
Interval scale variables are a form of continuous variables. Consequently, a good place to start would be least squares regression.
hari says
Hi Jim, I am dealing with a regression problem predicting the rejection rate of a product after it is ordered.
Because 95% of products were not returned, my data are highly inflated towards 0. Can you suggest an approach or algorithm to handle it more efficiently than building a normal OLS regression?
Jim Frost says
Hi Hari, if your dependent variable is binary, such as rejected/not rejected, you can use binary logistic regression.
Lukenge M. says
Which data is most suitable for linear regression?
Jim Frost says
Hi, you’ll find that answer in this post. Look in the early sections.
Alina says
Dear Jim,
Thank you for your great and valuable insight. I am still a bit confused on what model to use for my regression. I want to look at the relationship between financial ratios and company bankruptcy for the year 2020. I am thinking a binary logistic regression model would be the right approach but would like to confirm with you whether I am on the right track or if I should switch the model.
Thank you!
Jim Frost says
Hi Alina,
If your dependent variable is binary (bankruptcy, no bankruptcy) then binary logistic regression is a good choice!
Mirwaise Khan says
Hi,
Sir Jim, I have One DV such as GDP and two IV such as External Debt and Debt Servicing. Would you please guide me, Which type of Regression analysis should I use?
Hanadi says
Hi Jim,
I have 5 independent variables (continuous), 1 dependent variable (continuous), sample size is 63, i did normality test and i found some variables not normally distributed based on q-q plots.
1- Initially, I compute the correlation coefficient to see the correlation between variables
2- because i have a small sample size i entered (only) the correlated variables in the linear regression and i excluded the non correlated variables and the variables with a high correlation coefficient above or equal (.8)
3- I used enter method: the model is significant but each variable was not
4- then, i used a stepwise method: i got one varible significant.
My question is the previous process true? and which method could i choose in linear regression (enter or stepwise)?
jasper says
Hi Jim,
I have 5 variables: 1 DV, 1 IV, and three presumed mediators. 4 out of the 5 variables are not normally distributed. Which regression is best suited to test this relationship?
Jim Frost says
Hi Jasper,
Least squares regression is a good place to start! The IVs do not need to be normally distributed. Be sure to check the residuals for normality. If they aren’t normally distributed, there are various corrective measures you can try.
FARZANEH CHANGIZI says
Hi Jim
I need your help
I have a dependent variable which is continuous (between zero to twenty). it is about negative feelings and emotions.
I used multinomial regression analysis and generalized linear mixed effect models but I didn’t get a good result.
What is your recommendation for me? to use other types of regression modelling?
By the way, I have several dependent variables which are categorical or continuous.
Jim Frost says
Hi Farzaneh,
Unless your DV values are nicely in the center of that 0 to 20 range, you’ll probably need to use a two limit Tobit model. The problem is that if your data are too close to one limit or the other, they’ll be highly skewed, making it hard to get normal residuals. Also, the model will predict outside the possible range, giving nonsensical results. Those aren’t necessarily problems for least squares regression when your data are more in the center of the range. Check those residual plots.
However, if it is a problem, I hear that a two limit Tobit model is a good option. Unfortunately, I don’t have any practical experience with that type of model to share with you. But it might point you in the right direction.
nur says
Hi Jim,
I have 4 independent numerical (integer) data and 1 dependent continuous (float) data . Which regression algorithm should I use for classification?
Jim Frost says
Hi Nur,
Because you have a continuous DV, a good place to start is least squares regression!
Dewidine says
Thanks, Jim!
Dewidine says
Good day Jim, this was very useful, but I do not see multinomial logistic regression. I think that is the analysis I need to use. I want to know whether the two categorical variables influence what group a taxon falls into. This is my data: I have a dependent variable, Group (three categories), and my independent variables are again categorical, Range (single vs. multiple sites) and Life form (three categories). I have been running the model separately for the two and also doing a full model with both variables. I want to know whether this type of test is correct for my data. And are there any important things to be aware of when using this type of test, like assumptions that need to be met? I really don’t understand the results that I’m getting.
Jim Frost says
Hi Dewidine,
I have that type of regression listed as nominal logistic regression. It’s an alternate name for multinomial logistic regression. And, yes, that is the correct type for your data!
As for assumptions, you want to check the residual plots as normal. Look for things such as homoscedasticity (constant variance) in the residuals. Generally speaking, it’s better to include both IVs in the model rather than fitting separate models. That way each variable is estimated while holding the other variable constant. It sounds like you’re on the right track!
Arnþór Freyr Sigþórsson says
Thanks a lot for the reply Jim, my DV ranges from VERY close to 0 to around 8%, and mostly it’s extremely low.
Jim Frost says
Hi, in that case, I’d guess that least squares regression won’t work and you’ll need to use one of the other methods. You can always try OLS. Just be sure to check those residual plots! But, that’s awfully close to the lower bound, which is likely a problem.
Jane says
Hi Jim,
If I were to use a continuous independent variable (e.g. emotion regulation) with two levels (e.g. cognitive reappraisal, expressive suppression), with one continuous dependent variable (e.g. relationship satisfaction), would it be possible to use simple linear regression? Thanks so much!
Jim Frost says
Hi Jane,
It sounds like your independent variable is actually a categorical variable with two levels. Continuous variables don’t have levels like that. You can perform simple regression with one categorical independent variable. However, for simplicity, you can perform a 2-sample t-test and assess the mean difference between your two groups.
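The link between simple regression with a two-level categorical IV and a comparison of two group means can be sketched directly. In this minimal plain-Python illustration, the 0/1 coding, the variable names, and the satisfaction scores are all hypothetical.

```python
from statistics import mean


def simple_ols(x, y):
    """Least squares intercept and slope for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: 0 = first group, 1 = second group.
group = [0, 0, 0, 1, 1, 1]
satisfaction = [4.0, 5.0, 6.0, 6.0, 7.0, 8.0]

intercept, slope = simple_ols(group, satisfaction)
mean_0 = mean(s for g, s in zip(group, satisfaction) if g == 0)
mean_1 = mean(s for g, s in zip(group, satisfaction) if g == 1)

print(intercept, slope)          # intercept is the group-0 mean, slope is the difference
print(mean_0, mean_1 - mean_0)
```

The intercept equals the mean of the group coded 0 and the slope equals the difference between the two group means. That difference is the quantity a 2-sample t-test assesses.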
Arnthor F. Sigthors says
Hi Jim. I am working with a continuous dependent variable (engagement rate of posts) and 2 categorical independent variables(type – 4 categories and type2 – 4 categories) which represent what kind of post it is. Is logistic regression the way to go for me?
Jim Frost says
Hi Arnthor,
When it comes to selecting the type of regression analysis, the nature of the DV is front and center. In your case, you actually have a proportion or percentage, which isn’t quite a continuous variable. If it’s not too close to 0 or 100%, you might be able to model it using least squares regression. I’d try that first. Check the results and residual plots. If it works out for your data, that’s the best route to go.
However, if the percentages tend to be near 0 or 100, the model will predict values outside the possible range, which don’t make sense. In that case, you might not be able to use least squares regression and might instead need to fit a two limit Tobit model to account for the range of 0-100%.
Logistic regression is not a good choice for you because that requires categorical, ordinal, or binary DVs. You don’t have that. You do have categorical IVs, but that doesn’t mean you should use logistic regression. Again, it’s more about the type of DV.
Best of luck with your study!
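The out-of-range problem is easy to reproduce with a straight-line fit. This is a minimal plain-Python sketch. The numeric predictor `x` and the percentages are hypothetical, chosen so the values cluster near the lower bound of 0.

```python
from statistics import mean

# Hypothetical predictor and percentage outcome (bounded between 0 and 100).
x = [1, 2, 3, 4, 5]
rate_pct = [8.0, 3.0, 1.0, 0.5, 0.2]

# Ordinary least squares fit of a straight line.
x_bar, y_bar = mean(x), mean(rate_pct)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, rate_pct))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Fitted values, and any that fall outside the possible 0-100 range.
predictions = [intercept + slope * xi for xi in x]
out_of_range = [p for p in predictions if p < 0 or p > 100]

print([round(p, 2) for p in predictions])
print("impossible predictions:", [round(p, 2) for p in out_of_range])
```

Here the fitted line dips below zero at the largest `x`. This is the kind of nonsensical prediction to look for before settling on least squares for a bounded DV.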
Uzair says
Hi Jim,
Thanks for all your work explaining different models and answering questions. I am currently working on statistical analysis with a categorical dependent variable and two categorical independent variables (based on ethnicity and religion). The dependent variable has three categories (“Yes”, “No”, “I don’t know”). I have tried the multinomial logistic regression model with these variables but I am getting extremely high p-values to my surprise. I have considered converting the dependent variable into a binary variable (removing “I don’t know” category) to see if that can improve the results. Any advice on what I should do here? Thanks
Jim Frost says
Hi Uzair,
You can certainly try that. It is a bit of cherry picking the analysis because the first one didn’t give you results that you expected. Ideally, you’d state up front that you were going to process and analyze the data that way (just use two categories). In the write up, be upfront about that process (the why and how).
You don’t want to do excessive cherry picking because if you do that, you’ll always find something significant even if it’s just a chance correlation. However, if that’s the only post hoc change you’ve made to your plan, it doesn’t sound too bad.
jayden says
Hi Jim!
Would you know why a logit regression model with interval variables would be favored over a logit regression model with categorical variables and vice versa?
Jim Frost says
Hi Jayden, logit regression models are designed to model the probability of an event for binary or categorical variables (more than two levels). Events are categorical outcomes such as pass or fail, or scratch, dent, and tear.
james mcloughlin says
Hi jim…such a gifted teacher! Thank you.
Can you do a regression with one categorical independent variable predicting one categorical dependent variable?
EG. Gender predicting admission to MIT..
Thanks, Jim
Jim Frost says
Hi James,
Yes you can! In your case, you’re talking about a binary dependent variable because it has only two levels (presumably), admitted and not admitted. In that case, you’d use binary logistic regression and it’s fine to use a binary (or categorical) independent variable.
If your dependent variable had more than two levels, you’d use nominal logistic regression.
Nimra Azhar says
Hi. Thank you so much for this. I have a question. Is there a regression analysis that can be used for overlapping categories of dependent variable?
Example: I am using a sample of rural agricultural households where individuals are engaged in multiple paid (business or wage work) and unpaid activities
So my y variable is broadly classified into 3 categories: business, wage, unpaid
Can Maximum Likelihood models be used in this case?
ntombi skosana says
Hi Jim
Thank you so much for your response. It has helped a lot.
I really appreciate your assistance.
Jim Frost says
You’re very welcome, Ntombi!
Denton says
Hi Jim,
Thanks for the reply. I thought about using ANOVA but the data isn’t normally distributed and the variances are unequal so normal ANOVA and Kruskal-Wallis don’t work. I asked my supervisor about using Welch’s ANOVA and was told to use regression instead as it is easier to interpret and relatively robust to non normal data.
The dependent variable is the number of people who have a disease. So across 10,000 doctors, each doctor uses one OS, and records how many patients they have with e.g. diabetes. I’m trying to see if the choice of OS affects the recording of diabetes.
The data isn’t normally distributed. Taking log() does normalise the data, but I would also need to add 0.1 (an arbitrarily chosen constant; does it matter much what the constant is?) to every data point, as there would otherwise be log(0)s that prevent further analysis, e.g. with qqnorm. Should I normalise the data and do all analysis (e.g. regression, at my supervisor’s suggestion) on the log(x + 0.1) data?
Thank you.
Jim Frost says
Hi Denton,
F-test ANOVA and regression are two sides of the same coin because they use the same math “under the hood.” It’s true that ANOVA can be robust against deviations from normality, but it depends on how many samples you have per group. Read my post about parametric vs. nonparametric analyses. That has pros and cons for both. But look particularly for a table with samples for different analyses. It’ll show you how large they need to be so that you generally don’t need to worry about the normality assumption.
However, regular ANOVA and regression are susceptible to violations of the assumption of equal variances. I talk about that in this post about Welch’s ANOVA. You can’t get out of that assumption by having a larger sample size. If your variability is substantially different, you really should be using Welch’s ANOVA and NOT the F-test ANOVA, regression, or nonparametric analyses.
You might not need to transform the data with large enough sample sizes unless your data are very skewed. If your smallest group is 50, you have a good enough sample to not worry about nonnormality unless it’s very skewed. I suppose the transformation could also solve your non-constant variance problem too. I could see going that way as you suggest. However, you can still consider Welch’s ANOVA and potentially analyze your data in the raw data units, which I like to do whenever possible. You’re probably fine either way.
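On the question of the added constant, a quick check shows that the choice does change how far the zeros sit from the rest of the data after the log transformation. This plain-Python sketch uses hypothetical counts and three arbitrary constants. It is an illustration, not a recommendation for any particular value.

```python
import math

# Hypothetical counts per doctor, including a true zero.
counts = [0, 1, 5, 20, 100]

gaps = {}
for constant in (0.1, 0.5, 1.0):
    transformed = [math.log(value + constant) for value in counts]
    # Distance between a count of 0 and a count of 1 on the transformed scale
    gaps[constant] = transformed[1] - transformed[0]
    print(constant, [round(t, 2) for t in transformed], round(gaps[constant], 3))
```

A smaller constant pushes the zeros much further from the other values. If you transform, it is worth rerunning the analysis with a couple of constants to see whether the conclusions hold.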
ntombi skosana says
Hi Jim
My DV is the % exports for sales of volumes. Is this a continuous variable?
There are a lot of zeros for those who did not export for my time period of interest. I do not want to drop these as they are true zeros – as they represent firms that did not export anything for whatever reason. Can I just go ahead and run my OLS with this DV ?
What can I do with the ” I do not know” responses?
Jim Frost says
Hi Ntombi,
Percentages as a DV can be a bit tricky. If the percentages are close to zero or 100 percent, the distribution tends to be skewed. If so, you might need to transform it. I’d graph the percentages to see what they look like in a histogram. Are they very skewed? If not, OLS might be fine. Try fitting a model and see how the residuals look. If they look good, that’s a great sign you can use OLS. If not, try transforming the DV to get a more normal distribution.
I hope that helps!
Denton says
Hi Jim,
I’m not sure how to go about analysing my data statistically.
I have 4 computer Operating Systems that doctors can choose from, that all record 5 different disease conditions.
These 4 OS’ have different amounts of data – one OS is only used by 50 doctors whereas another is used by 3000 doctors(unsure if this makes a difference).
I want to identify variation in the prevalence rates (how common the condition is) of the 5 different conditions and if this is due to the different OS’. i.e. Does the OS used cause a difference in whether or not a disease is recorded. The descriptive statistics are all pretty different.
Thanks!
Jim Frost says
Hi Denton,
It sounds like you need to use one-way ANOVA. You have one factor (OS) with four levels (the different OS). It sounds like you have a continuous dependent variable, although I wasn’t totally clear about that. If you want to compare the means between the groups and determine whether they are statistically significant, one-way ANOVA is what you should use. I write about using one-way ANOVA in Excel. Even if you’re not using Excel, it should give you an idea of what it can do.
I hope that helps!
Pranab Chatterjee says
Hi Jim! Thanks for this great post! Just wondering why you did not include the log-binomial model in this post. Could be a good fit right after the binary logistic regression point, perhaps?
Aishwarya Kanitkar says
Hello Jim,
Thank you very much for this article and the article on multicollinearity, they really helped me.
I am struggling with my Master Thesis. I am analyzing the impact of food product attributes on organic food buying tendency.
On one hand, (for independent variable) I have a constant sum scale for 5 food attributes and for the dependent variable I have the number of organic food people buy out of 10 grocery products (so 1 to 10 – higher the number of organic products higher the buying tendency)
I am confused as to which type of regression to use, given that my independent variables would be correlated.
Any help is appreciated!!
Thank you,
Aishwarya
Jim Frost says
Hi Aishwarya,
I’ve never used a constant sum scale before. However, I’d imagine you’d need to pick one or several but not all of the attributes and use them for the IV(s). I don’t really know if there is a recommended practice in terms of how many attributes to include. Maybe there are one or two that are particularly correlated with the DV? You can’t use all of the attributes because that produces perfect multicollinearity for a constant sum scale.
Your DV is a count variable if I understand it correctly. I’d try Poisson regression or possibly negative binomial. Look in the count DV section for more details about that type. I’d try those types first. Being a count variable, it’s likely to be skewed but those regression types can handle it. If the DV isn’t too skewed, you could conceivably treat the DV as continuous and use linear least squares regression. But, that depends on the properties of the data.
I hope that helps!
Claudia says
Yes! It really does 🙂 thank you so much!
Mark says
Thanks for the excellent post. It seems that regression works for both continuous and binary target labels. However, suppose the target label is continuous but lies between 0 and 1. What will happen if we transform the target to binary values (0 and 1) with an indicator and fit the regression model? How will the model parameters and standard errors change compared to the regression fitted with the original continuous label?
claudia says
Hi Jim,
Super useful page- thank you! I am struggling to find examples that describe what to do with a binary categorical dependent AND independent variable. I want to know if the mating status of a father (mated/unmated) predicts the mating status of his son (mated/unmated). Would this still be a binary logistic regression?
Any help greatly appreciated!
Thank you 🙂
Jim Frost says
Hi Claudia,
You can certainly use binary logistic regression with both a binary dependent variable and a binary independent variable.
However, if you have only those two variables, you can also/instead perform a 2-sample proportions test. That’s easier to interpret (for you and your readers). It answers the question: does the proportion of mated sons differ by the mating status of the father?
If you want to see a 2-sample proportions test in action, read my post about Flu Shot Effectiveness, where I use a 2-sample proportions test on real data to determine whether the proportion of flu infections differs by vaccination status.
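As a rough sketch of what that test computes, the snippet below implements the standard pooled two-proportion z-test using only the Python standard library. The counts are hypothetical, and the function name is just for illustration. Statistical software would normally handle this for you.

```python
import math

def two_proportion_z_test(successes_1, n_1, successes_2, n_2):
    """Pooled two-sample z-test for a difference in proportions (two-sided)."""
    p1 = successes_1 / n_1
    p2 = successes_2 / n_2
    pooled = (successes_1 + successes_2) / (n_1 + n_2)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p1 - p2) / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p1 - p2, z, p_value

# Hypothetical data: 30 of 50 sons of mated fathers mated,
# versus 18 of 50 sons of unmated fathers.
difference, z, p_value = two_proportion_z_test(30, 50, 18, 50)
print(f"difference = {difference:.2f}, z = {z:.2f}, p = {p_value:.4f}")
```

The normal approximation behind this test assumes reasonably large counts in each group.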
If you have additional IVs that you didn’t mention, then you’d need to use binary logistic regression.
I hope that helps!
Santhosh Srinivas says
Greetings Jim
Great post , I have a query.
I am working on a study where the dependent variable is the number of records (values range from 0 to a million), with around 6500 rows of data. Which is the best model to analyze this? Please help.
Bewketu says
Hi, I have a question about my research preparation. I have two dependent variables, which are categorical (yes/no). I used binary logistic regression, but my question is: how should I do the analysis? Separately or combined? How? Thanks to anyone who can help me!
rukky says
Hi Jim
Very happy to have found you.
What is the best regression to determine asset allocation for a group of firms
Dependent variable- equity asset allocation
independent variables – leverage ratio, market cap, stdev of operating cashflow, book value of equity etc
Jim Frost says
Hi Rukky,
Typically, you don’t choose regression models based on the subject area. Instead, start by looking at the type of DV and using that to choose a type of regression model as I describe in this post. If equity asset allocation is a continuous variable, I’d focus on that section in this post.
Adila says
Hi Jim.
I am currently doing my study on technology investment decision.
The DV is binary (1 = firm invests in technology; 0 = firm does not invest in technology). There are about 6 IVs with different types of data (i.e., dummy, ordinal, continuous).
The time horizon is longitudinal (i.e panel data firm=n and year=i)
I am not sure which method is suitable. Is it panel data logistic regression? I also read some articles that use Generalized Method of Moment analysis.
Please help me. I need your opinion.
Thank you
Martin says
Hi Jim,
thanks a lot for the great post!
I am struggling with finding the appropriate statistical method for my master thesis. The purpose of my study is to identify reasons why farmers invest in or refrain from investing in new agriculture technologies on farms in Portugal. I want to test different hypotheses to understand what drives the adoption of technologies among farmers in Portugal. This means that my dependent variable is binary (adoption yes/no).
Now, I have different types of independent variables:
– Binary variables (e.g. did you participate in any training program related to new technologies? yes/no)
– Ordinal variables (e.g. I find new technologies easy to use. likert-scale)
I will probably need two statistical models. For independent binary variables, I could use the logistic regression. For independent ordinal variables, I am not sure which regression method to use? Or should I use a correlation method like chi square test for independence or Spearman’s?
I would be grateful for your input!
Looking forward to hearing from you soon.
Best regards,
Martin
Jim Frost says
Hi Martin,
As I mention in this post, you often choose the type of regression model based on the type of dependent variable. The types of independent variables don’t typically influence the choice in regression model.
Because you have a binary dependent variable, you’ll need to use binary logistic regression regardless of the types of independent variables. You’ll be able to predict the probability that a farmer will adopt technologies based on the values of the independent variables.
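To make the "predict the probability" part concrete, here is a minimal sketch of how a fitted binary logistic model turns independent variable values into a probability. The coefficients are hypothetical placeholders, not estimates from any real data; in practice your statistical software estimates them.

```python
import math

def predicted_probability(intercept, coefficients, values):
    """Convert the model's log-odds into a probability with the logistic function."""
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical fitted values: intercept, then coefficients for
# training participation (0/1) and ease-of-use rating (entered as numeric).
intercept = -1.0
coefficients = [0.8, 0.4]

# A farmer who attended training and rated ease of use as 4.
probability = predicted_probability(intercept, coefficients, [1, 4])
odds_ratio_training = math.exp(coefficients[0])  # odds multiplier for training = yes
print(f"P(adopt) = {probability:.3f}, odds ratio for training = {odds_ratio_training:.2f}")
```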
However, ordinal IVs present a challenge in regression analysis. You’ll need to enter each ordinal variable into your model as either a categorical (nominal) variable or a continuous (numeric) variable. The decision depends on the nature and amount of your data combined with your research goals. I write about that in my book about regression analysis. In fact, because your thesis depends on regression analysis, I’d recommend it for that reason too.
By the way, I don’t think a chi-squared test or correlation is the way to go. Those analyses won’t give you the answers you need.
I hope that helps!
RABIA NOUSHEEN says
Hi Jim
Thanks for the very useful post. I have a question: can we use a square root link rather than a log link in Poisson regression? Poisson regression with my data results in p-values = 1 for all coefficients when I use the log link.
I want to ask another thing. I am confused about how to treat my variables in regression analysis: as numeric or as factors? My variables are Concentration (500, 1000, 2000, 10000 particles), Exposure time (1, 2, 3, 24 hours), Size of particles (3 microns, 10 microns), and particle type (aged, unaged, with additives, without additives). I feel that all should be entered as factors, but I’m not sure. When I treat the variables that have numbers as numeric, the software (R) by default considers them continuous variables. I learned this when I did a post hoc test and observed that the software chooses values between the set levels (like 7 hours between 3 and 24 hours) for comparison rather than the ones tested.
Please guide me, as I have spent much time trying to get answers to these questions. I have not seen any question go unanswered on this forum, and that gives me great hope. I shall be thankful for your time and valuable suggestions.
Packartt Tuga says
Hey there Jim,
I need your help with the regression analysis portion of my paper, which deals with the effectiveness of debt relief programs in the Sub-Saharan African region. My dependent variables will be macroeconomic factors like GNI, inflation, etc., and the independent variable is the amount of relief received.
With the help of this blog, I think Lasso regression will suit my research best.
Can you suggest other regression methods or statistical tools that can be used?
Most times I’m not able to comprehend the statistical language/formulae that most guides use. Do you know of a step-by-step guide that provides a basic framework for regression/quant analysis?
Thanks a lot.
Degarege Asefa says
Hi Jim, greetings
my research title is: “Determinant factors for effective donor utilization by NGOs working in Ethiopia: the case of ORDA”
my Dependent variable is ” effective donor fund utilization”
and independent variables are: managerial factors, technical factors, operational environment, financial management and donor behavior. so, could you tell me the metric and non metric of both variables (dep. & Indpen.) ?
What type of regression should be used: multiple regression or continuous linear regression?
thank you for your fast and cheerful professional advice!
Jim Frost says
Hi Degarege,
It takes subject area expertise and research to determine the best metrics for your study. Unfortunately, I’m not an expert in donor fund utilization. So, I don’t know which metrics are best. You’ll probably need to do a bit of research to identify them. For more information about this process, please read my post about 5 Steps for Conducting Studies with Statistics. The earlier stages of the process explain how you need to operationalize your variables.
The type of regression largely depends on the type of metric you use for the dependent variables, which is unclear at this point. If it’s a continuous variable, then multiple regression is probably a good place to start. I believe what you describe as “continuous linear regression” is synonymous with multiple regression.
Thanks for writing!
Maria Martin says
Really clear and useful post, Jim. Thank you!
May I ask, if the dependent variable is slightly right skewed, is it still appropriate to use a linear regression? would I necessarily have to transform the data by taking natural log of the variable? If I do this, I lose around n=100 (total n=1004) participants.
I ask because I received a comment to create a histogram or illustration of the dependent variable distribution to demonstrate that it is acceptable to model it with linear regression. Are there other factors that can make the dependent variable more or less appropriate?
I think I should plot the independent and dependent variable with a scatterplot – but this is not so informative because my independent variable is a categorical item with five levels (gender identity).
Many thanks in advance for your help,
Jim Frost says
Hi Maria,
The real question to ask is: are the residuals normally distributed? Regression assumptions center on the properties of the residuals. So, the question about the distribution of the dependent variable is largely irrelevant. I say “largely” because if you have a highly skewed DV, then it can be hard to obtain normally distributed residuals. For more information on this aspect, please read my post about OLS assumptions.
In that post, I indicate you should use a normal probability plot of the residuals to determine whether they follow a normal distribution. For more information about that type of plot, please read my post about histograms vs. normal probability plots for assessing normality. In that post, I’m not writing about residuals specifically, but the information applies. One thing to note: with 1000 observations, you have a fairly large dataset. You could theoretically use a normality test to assess normality. However, with so many data points, the test will be very powerful and could potentially detect inconsequential deviations from the normal distribution. More reason to use a normal probability plot!
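For readers curious about what a normal probability plot actually draws, here is a minimal sketch using only the Python standard library. It pairs each sorted residual with a theoretical normal quantile; points that fall close to a straight line suggest approximate normality. The residuals are simulated, the plotting position (i - 0.5)/n is one common convention among several, and the correlation is only a rough numeric summary of straightness, not a formal test.

```python
import random
from statistics import NormalDist, mean

def normal_plot_points(residuals):
    """Return (theoretical normal quantiles, sorted residuals) for a probability plot."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, ordered

def correlation(xs, ys):
    """Pearson correlation, used here as a rough measure of how straight the plot is."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Simulated residuals (assumptions for illustration only).
rng = random.Random(0)
normal_resid = [rng.gauss(0, 1) for _ in range(500)]
skewed_resid = [rng.expovariate(1.0) - 1.0 for _ in range(500)]

r_normal = correlation(*normal_plot_points(normal_resid))
r_skewed = correlation(*normal_plot_points(skewed_resid))
print(f"straightness, normal residuals: {r_normal:.3f}")
print(f"straightness, skewed residuals: {r_skewed:.3f}")
```

Plotting the two returned lists against each other with any charting tool gives the familiar plot; the skewed residuals bend away from the line at the tails.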
Before trying anything else, I’d check the residuals. Determine whether they’re normally distributed. You might not even have a problem. If they’re not normally distributed, or you see other patterns, there are other steps you can take that depend on the nature of the problem.
Because your research involves regression analysis, you might be interested in my regression analysis book. I write about all of these issues in that book along with various ways to resolve problems.
If your IV with five levels is an ordinal scale, such as a Likert scale, you might have some difficulties with that because ordinal data have a mix of properties of continuous and categorical variables. You’ll have to choose one way or the other to enter that variable into the model. I don’t have a post for that, but I do write about it in my book.
I hope this helps!
Chandra Shekhar Tiwari says
Just to clarify further the above question:
Apologies for a long message again. I have been advised to do like this by my project supervisor:
I am going to do Multiple regression after taking mean (or average of my ordinal variables).
1. I have total 10 independent variables out of 3 constructs (for independent variable)
and 3 dependent variables (CA1, CA2, CA3) out of 1 construct (for dependent variable- for competitive advantage) and all are ordinal variables. So basically, I have total 13 variables, therefore will I have 13 columns of average if I need to calculate?
2. Then I need to take average of all 10 independent variables (because they are on ordinal scale) and take average of 1 dependent variable (e.g. CA1) at a time, and then run the multiple regression analysis. Because dependent variable can be only one at a time.
Then repeat the step for CA2 and CA3.
I will have 3 equations Y1 = m1X1+m2X2+m3X3+…+m10X10 +C, this is for CA1
Y2= m1X1+m2X2+m3X3+…+m10X10 +C, this is for CA2
Y3 = m1X1+m2X2+m3X3+…+m10X10 +C, this is for CA3
to interpret from output of SPSS. Output is in different tables.
And all these three equations should satisfy separately.
Am I doing this alright? Or shall I just keep only CA and delete other two?
Or shall I just run ordinal regression 3 times for 3 dependent variables as SPSS allows ordinal regression so no need to take average (Interpretation will be a bit difficult)?
I am going to try on paper the linear regression after taking average 3 times and then see if there is a relationship as mentioned in 3 equations.
Looking forward to hearing from you soon.
Regards,
Chandra
Jim Frost says
Hi Chandra,
Ah, in my reply to your other comment, I didn’t realize you were boiling down all your IVs to one average. Averaging works well when you have a good number of Likert items to combine.
Are your Y variables going to be averages or the actual Likert values (ordinal)? If you use the ordinal values, you’ll need to use ordinal logistic regression. If the Y variables are an average of multiple Likert items, you might be able to treat it as continuous and use linear least squares.
So, I’m not totally sure what your final model will be. Will it have the 10 separate IVs (Likert items) or are you averaging them down to one? And, will the DV be ordinal or continuous?
But, in general, yes, I’d fit a model for each DV as you describe. As I mentioned in my other reply, check those residual plots to be sure that you’re not missing curvature or otherwise misspecifying your model! That’s always important but it’ll be extra important in your case because you’re working with averaging Likert items, which can work but not always linearly.
Chandra Shekhar Tiwari says
Hello Jim,
Thanks for your great posts!!
Need a little bit of help if you could oblige for my MBA project. I went through your various posts already.
Here is the scenario of my project:
I have total 10 Independent variables all in Likert scale from 1-7 (ordinal), these variables are coming out of my 3 constructs.
And I have 3 dependent variables (part of one main dependent variable) again on Likert scale of 1-7(ordinal).
My questions:
1.Which regression method do I use?
2. Can I take all independent variables and only one dependent variable at a time? And then analyze?
3. Can I take average of all my independent variables(as they are on Likert) and average of my dependent variables (again on Likert)? and run the analysis? But I am not sure how do I do it.
4. How do I interpret the result?
5. My preference is to use Linear regression.
Just to briefly describe about my project, I am analyzing Impact of innovation on competitive advantage. so 10 IVs from Impact of innovations and 3 DVs(Competitive advantage).
I would be grateful for your inputs. This is a bit of struggle at the moment.
Looking forward to hearing from you soon.
Regards,
Chandra
Jim Frost says
Hi Chandra,
You’ll need to use ordinal logistic regression because of the ordinal dependent variable. Yes, it’s fine to use separate models as you indicate.
Many analysts will average (or sum) your Likert items together and treat them as continuous variables. That can potentially avoid the problems I discuss below. Just be sure that averaging or summing is appropriate for your variables. You’ll need to be extra sure to check your residual plots to be sure your model has an adequate fit.
Ordinal independent variables can be problematic. I write more about ordinal IVs in my book about regression analysis. They have properties of both categorical and continuous variables. You’ll need to enter them one way or the other. The correct decision depends on a combination of the characteristics of your data and dataset, and the goals of your study. But, if you can sum or average scores, you might be able to avoid these difficulties.
You can’t use linear regression when you have an ordinal DV. Although, if you can average/sum the DV to produce a continuous variable, you might be able to use least squares linear regression. Again, be extra sure to check those residual plots!
Because your study depends on regression analysis, you might consider getting my regression book. You can get it as an ebook from my website (best price) or in print from Amazon and other sellers.
Patabedige P.M. says
Hello sir,
I have a problem with my research work. My dependent variable is ordinal (5-point Likert scale) and all of my independent variables are also ordinal (Likert scale). What is the most suitable regression type for me? As far as I know, because all of my independent variables are ordinal, ordinal regression cannot be used: according to the assumptions, the model needs continuous or nominal independent variables, and I do not have any.
Jim Frost says
Hi, because of your ordinal dependent variable, you’ll need to use ordinal logistic regression. However, as you note, the ordinal independent variables are potentially problematic. I write more about ordinal IVs in my book about regression analysis. They have properties of both categorical and continuous variables. You’ll need to enter them one way or the other. The correct decision depends on a combination of the characteristics of your data and dataset, and the goals of your study.
If it’s appropriate, you can average (or sum) your Likert items together and treat them as continuous variables.
Arya Devi K.S. says
Hi Sir,
I have more than one categorical dependent variable and one continuous independent variable. Which regression can I use for it?
Jenny says
Hi Jim,
thank you very much for the article. I was wondering if you could help me find the fitting method for my thesis. I am doing a survey. My independent variable as well as my dependent variable are both binary (“yes” and “no”, which I coded in 1 and 0). Furthermore I have a lot of control variables which are either binary as well like for example sex or continuous (e.g. age) or categorical (1-4 or 1-5). Can I use a normal binomial logistic regression, even though not only my dependent but also my independent variable is binary/ a dummy variable? Or what else should I use? And how could I then interpret the results, just like with normal logistic regression?
I also want to test the mediation effect of two variables which are categorical (measured on a 4 or 5 point Likert Scale). How can I do that?
I would really appreciate if you could help me! Thank you so much in advance!
Jim Frost says
Hi Jenny,
Yes, that’s absolutely fine to use binary logistic regression in that context of also having binary variables as IVs along with the DV. What might actually cause some difficulty are the ordinal (Likert) IVs. You’ll have to model them as either continuous or categorical. The best method depends on which method gives you a good fit, the number of observations versus categorical levels (if you use that approach), and your research goals. I write about that in my book about regression analysis, although not in the binary logistic context, but the issues are the same.
I hope that helps!
Michael Dean Green says
Hi Jim, Thanks for helping us statistically challenged folk! I manually corrected my dependent variable by adjusting for time and plugged the data into some software that does Backward Stepwise OLS and got a final answer, but it failed the Normality (P<0.001) and Constant Variance (P<0.001) tests. When I took the square root of my dependent variable and adjusted for time, the results were entirely different, but they passed the Normality (P=0.980) and Constant Variance (P=0.144) tests. I read one can normalize data simply by doing a log transformation or taking the square root. Is this practice legit? I feel like I had manipulated the data.
Jim Frost says
Hi Michael,
It is true that you can transform your data as you describe to solve both those problems. I mention this in my post about heteroscedasticity (non-constant variance). It can also fix the normality problems associated with a very skewed dependent variable. Just be sure that the non-normality is not caused by a misspecified model, such as incorrectly modeling curvature.
While it is an acceptable solution to these problems, it should be your solution of last resort. Try solving the problems other ways first. I mention some in my post about heteroscedasticity. And, you can read about other reasons for non-normal residuals in my post about OLS assumptions. However, if you can’t fix those problems using other methods, then you can use the data transformation. Just be aware that you’re then describing the relationship between your IVs and the transformed DV, which makes the interpretation less intuitive than using the raw data. The goodness of fit statistics, such as R-squared, apply to the transformed DV as well. I write more about that in my book about regression analysis.
So, yes, it can be a good solution but try resolving the problems other ways first!
Shannon says
Hi Jim,
Thanks for the great overview! I’m wondering if you could explain a little about how orthogonal linear regression fits into the mix?
Thanks!
Jim Frost says
Hi Shannon,
Orthogonal regression is useful when you have two different measurement systems and you want to determine the relationship between the two. For example, you might have an old and new way to measure weight, blood pressure, etc. and you want to understand the relationship between the two systems. Most regression analysis assumes that measurement error exists only in the dependent variable. In orthogonal regression, the analysis can handle error in both the IV and DV. However, you do need to know the ratio of measurement error for the two systems to be able to use orthogonal regression.
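As an illustration, here is a minimal sketch of the closed-form Deming (orthogonal) regression fit. One assumption to flag: the ratio is expressed here as the variance of the measurement error in Y divided by the variance of the measurement error in X, which is a common convention, but your software may define it differently. A ratio of 1 gives classic orthogonal regression. The measurements are hypothetical.

```python
import math
from statistics import mean

def deming_regression(x, y, error_variance_ratio=1.0):
    """Closed-form Deming fit; ratio = var(error in y) / var(error in x)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x) / (n - 1)
    syy = sum((yi - my) ** 2 for yi in y) / (n - 1)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    d = error_variance_ratio
    # Requires sxy != 0, i.e. the two measurement systems are correlated.
    slope = (syy - d * sxx + math.sqrt((syy - d * sxx) ** 2 + 4 * d * sxy ** 2)) / (2 * sxy)
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical weights measured with an old and a new instrument.
old_system = [50.1, 60.3, 70.2, 80.4, 90.0]
new_system = [51.0, 61.5, 70.9, 81.8, 91.2]

slope, intercept = deming_regression(old_system, new_system, error_variance_ratio=1.0)
print(f"new = {intercept:.2f} + {slope:.3f} * old")
```

A slope near 1 and an intercept near 0 would suggest the two systems give interchangeable readings.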
I hope that helps!
Caroline Öhman says
Hi Jim! I’m struggling with the statistical method for my thesis in bioscience. I have only taken a basic course in statistics (we use R) and need to use another analysis than what I’ve done so far, for this project.
I have a number of sites where bat activity has been recorded. The data is in the form of “activity” or “no activity”. Now I want to compare the activity with qualities of the nature at the site. This will be measured in type of nature (alternatives like deciduous forest, grassland, planted forest, edge zone), close to water (yes or no), slope (facing south, west, east, north or no slope) and maybe more factors. Someone suggested regression trees, but the information I have found is at too difficult a level for my knowledge, and I’m not that good at mathematical English… And I’m not sure how many sites I would need to get relevant results from my partly nominal data. I have about 15-35 sites with “activity” and the total number of sites is 35, if I count “no activity”.
Do you have any idea of a good analysis or how to use my data? Please ask if my explanation is unclear or you need more information. Thank you for doing this website!
Jim Frost says
Hi Caroline,
It sounds like you can use binary logistic regression, which I write about in this post. Your response variable is binary, activity or no activity. And, you can include all the variables you mention. Using the model, you can learn how each factor relates to the binary outcome. You can even predict the probability of an event (one of the outcomes) occurring for a set of predictor values.
Michael Dean Green says
Hi Jim,
Thanks very much for your explanations of the types of regressions. I was wondering if you could advise me on the type of regression that is suitable to the data I have collected.
I would like to know which of the supplements I have been taking has any influence on my Parkinson’s Disease symptoms.
The severity of my symptoms vary from day to day. I can measure the severity by doing a simple tennis ball rolling exercise to get “rotations per minute” or RPM. I record the time of day (T) I do the exercise and record the combination of supplements I took the night before.
So my dependent variable is RPM and my independent variables are T and the supplements I took the night before (1=yes, 0=no).
What statistical method would you recommend?
Thanks,
Mike
Jim Frost says
Hi Michael,
If your outcome data are continuous (it sounds like they are), then linear least squares (OLS) regression would be a great place to start. Because your model involves time (daily measurements), you’ll need to be extra careful looking for autocorrelation in the residuals. Read my post about OLS assumptions for more information. Look for the autocorrelation assumption.
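One common numeric check for autocorrelation, alongside plotting residuals in time order, is the Durbin-Watson statistic. Below is a minimal sketch with made-up residuals. Values near 2 suggest little first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order."""
    squared_diffs = sum(
        (current - previous) ** 2
        for previous, current in zip(residuals, residuals[1:])
    )
    return squared_diffs / sum(e ** 2 for e in residuals)

# Hypothetical residuals in daily order.
runs_of_same_sign = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]  # neighbors look alike
alternating_sign = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]   # neighbors flip sign

dw_positive = durbin_watson(runs_of_same_sign)
dw_negative = durbin_watson(alternating_sign)
print(dw_positive, dw_negative)
```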
John Grenci says
Hi Jim, thank you for this site. It is well written, and I am sure it takes up much of your time. Your passion shines through. I am hung up a bit on one of the assumptions for binary logistic regression, in particular the one where independent variables are linearly related to the log odds of the dependent variable. How would we go about checking that? Thanks, John
Laura Kukkonen says
Hi Jim!
Thank you a lot for your posts, they are very useful and well-written.
I’m trying to find a way to analyse multiple hypotheses based on questionnaire data. Each hypothesis requires the analysis of multiple dependent and independent variables. The independent variables are of continuous, categorical and binary types, while the dependent variables are in this case all categorical (Likert scale). If possible, I would like to find a single model to analyse each hypothesis, but I’m having a hard time figuring out if there is any statistics available to fit my needs. I have been considering a multinomial logistic regression, but as I understand it only allows for one dependent variable. Is there an alternative where multiple dependent variables are available? If all of my different types of independent variables cannot be included in a single test, which types of separate tests would you recommend? I am thinking of a MANOVA for the categorical independent variables, but I don’t know of any suitable tests for the continuous and binary variables where I can include more than one dependent variable. Thank you a lot in advance!
Kind regards,
Laura
Jim Frost says
Hi Laura,
Because of the ordinal DVs, I think you’ll need to use ordinal logistic regression and analyze each DV in a separate model. Those models can each contain all of your continuous and binary IVs. The regular MANOVA isn’t suitable for ordinal DVs. I’m not aware of a similar analysis that can handle multiple ordinal DVs.
Girma Gedamu says
Hi Jim, I’m interested in predicting drought using categorical independent variables on a continuous dependent variable.
For example, my data contain five variables, one dependent and four independent, but the dependent variable has 5 categories. Which regression model should I use? Please help me.
Jim Frost says
Hi Girma,
If the DV variable’s categories have no natural ordering, then you’d use nominal logistic regression. However, if they do have a natural ordering (e.g., a five point Likert scale), you’d need to use ordinal logistic regression.
Tim says
Hello Jim,
Thanks for your post.
I have been tasked to perform an OLS regression on wage of a certain population. We are asked to have a close look at the definition of the variable education (1:low and 5:high). What kind of conceptual problem might there be in using it in an OLS?
Thanks in advance!
Jim Frost says
Hi Tim,
One potential issue is that you have an ordinal independent variable. You’ll have to determine whether to include it as a categorical variable or a continuous variable. That depends on the goals of your analysis, the nature and amount of data, and the adequacy of the fit using either method. If you have my book about regression, I write more about the decision and challenges of both approaches.
Also, regarding the definition of the variable, that sounds fairly subjective/vague. You should have a clear definition of how you’ll measure education level. Ideally, come up with an operational definition. Then test it on sample data to be sure that it provides consistent, accurate results in representing education level. I talk about creating operational definitions of your variables in my post about including statistical analyses in scientific studies.
İrem says
Hi Jim,
Hello, I’m trying to build a regression model. I have 122 independent variables and I am trying to predict the legal proceeding amount. I also have many zeros in my dependent variable. These zeros indicate that the person is not under legal prosecution. Most of the independent variables are the frequency of loan usage by year. I also have a variable for how much credit they use in total. But there are too many zeros in these variables, because zero was entered for any year in which no credit was used. I also have other variables such as gender and age. Which regression model would be appropriate for me in this context?
Sanjay Mali says
I have a table showing readings of a variable that depends on 4 different yes/no variables, which are independent of each other. In addition, the description says that these “+/–” type variables can themselves vary. For example, suppose there is a circular button. The table gives whether this button is on or off. If it is on, its range of rotation is also given, and that influences part of the question. What type of regression model should I use?
Abeer says
Hi ,,
thank you for the valuable information ..
I have a question about ordinal logistic regression.
I have a Likert-scale questionnaire with several subscales. I have calculated the composite score for each subscale. First, I did comparisons between males and females in my sample using the mean scores of each. Further, I focused on one of the subscales and wish to do ordinal logistic regression for each item under that subscale separately to test its strength of association with other IVs (gender, GPA, etc.).
Would that be a correct approach? Or is it contraindicated to treat items separately using ordinal logistic regression if they have already been combined as a composite score in the same study?
Hugo says
Hi, I was wondering if my DV is numerical (ratio), what is the best regression analysis? The IVs are both ratio (numerical) and continuous.
Jim Frost says
Hi Hugo, the best place to start would be linear least squares!
Ian Berryman says
Hi, great informative webpage. I was curious. If the dependent variable is natural logged and the key independent variable is not (semi-log form), can your controls be natural logged? My dependent variable is percent change in poverty, my key independent variable is the number of new churches (so it must be measured in units), and my controls are all measured in percents (i.e., percent of the population that’s black, percent of the population that earned a HS diploma, etc.). So do I need to have these controls natural logged to interpret my regression as “a 1% increase in a control variable leads to an x% change in poverty”? Or, since they are already measured in my data set as percentages, is that already how the regression interprets them?
Thanks!
Jim Frost says
Hi Ian,
Yes, you can “mix and match” as needed with using natural logs or not! It really depends on the nature of the data and theories about what is appropriate for it.
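For reference, here is a short sketch of the standard way coefficients are read when the DV is natural logged, depending on whether a given IV is logged. The coefficient values are hypothetical and the function names are just for illustration. Note that for an unlogged IV measured as a percentage, a one-unit increase means one percentage point, not a 1% relative increase.

```python
import math

def percent_change_logged_dv(coef, delta_x=1.0):
    """ln(y) regressed on x: percent change in y for a delta_x increase in x."""
    return 100 * (math.exp(coef * delta_x) - 1)

def percent_change_log_log(coef, percent_increase_x=1.0):
    """ln(y) regressed on ln(x): percent change in y for a given percent increase in x."""
    return 100 * ((1 + percent_increase_x / 100) ** coef - 1)

# Hypothetical coefficients for illustration only.
unlogged_iv_effect = percent_change_logged_dv(0.05)  # one-unit increase in x
logged_iv_effect = percent_change_log_log(0.5)       # 1% increase in x
print(f"{unlogged_iv_effect:.3f}% and {logged_iv_effect:.3f}%")
```

For small coefficients, the first quantity is close to 100 times the coefficient, and the second is close to the coefficient itself.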
Sajeeka Gunasekara says
Thank you for replying to me! If I want to investigate the impact of happiness on GPA, what would be suitable? Happiness would be collected as an average score between 0 and 6. What regression would be suitable?
Jim Frost says
Hi Sajeeka,
Is happiness an average of a set of scores that are something like Likert scale items? If so, taking the average is a method to use it as a continuous independent variable in your model. The coefficient for this variable indicates the average increase in GPA for a one-unit increase in the happiness score. That’s still least squares regression.
Sajeeka Gunasekara says
Hi Jim!
Thank you for your posts; they are really useful. I have a problem with my scenario. My DV is grade point average (GPA). It is a continuous value between 0 and 4. My IVs are happiness score, gender, and academic level (it has 4 levels: 1000, 2000, 3000, and 4000). What type of regression would you recommend for this?
Thank you so much!
Jim Frost says
Hi Sajeeka,
I’d start with linear least squares regression (OLS). You have a continuous DV. Gender and academic level are categorical independent variables, which OLS can handle. For academic level, the regression model will determine whether the mean GPA for each academic level is significantly different from the baseline academic level (which is something you have to pick).
I hope that helps!
Helen says
Hi Jim,
My dependent variable is a score 1 – 5, with 5 being the best rating, and 1 being the worst. I want to see if gender has an association with a higher rating. I have been looking into doing an ordinal logistic regression analysis, but not sure if this is correct. I know I could create a binary outcome of (1-3) vs (4,5) but not sure if I want to lose that data? What type of regression would you recommend for this analysis?
Thanks!
Jim Frost says
Hi Helen,
Your dependent variable is an ordinal variable. Consequently, you’ll need to use ordinal logistic regression. You won’t need to collapse any values using that method. That method will tell you if gender relates to those ordinal values.
Bidisha Chakraborty says
Hi Jim! Your posts are really helpful. My DV is a happiness index lying between 0 and 1 (continuous) that has been constructed from various responses in categorical form. Hence the DV is ordered. Would ordered logit or ordered probit be the ideal regression model to fit? Earlier, I fitted OLS to the happiness scores lying between 10 and 45. R-squared is coming out very low. In ordered logit, chi-square is highly significant, but the pseudo R-squared is very low. Waiting for your valuable comments.
Thanks and regards,
Bidisha
Jim Frost says
Hi Bidisha,
It sounds like you have a continuous DV that is restricted in range between 0 and 1. Because it is continuous, you cannot use binary or ordinal logistic regression. However, you can use a logistic sigmoid function to model it. This process forces the model to respect the limits of the DV. I’ve never performed this type of analysis, so I can’t offer much help. But that should point you in the right direction!
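One simple way to read the "logistic sigmoid" suggestion, offered here only as a sketch under assumptions, is to logit-transform the DV, fit an ordinary least squares line on that scale, and back-transform the predictions. The data and helper names below are invented for illustration. Values of exactly 0 or 1 cannot be logit-transformed, and dedicated methods such as beta regression also exist for this kind of outcome.

```python
import math

def logit(p):
    """Map a value in (0, 1) onto the whole real line."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Logistic sigmoid: map any real number back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Made-up data: one predictor and an index strictly inside (0, 1).
x = [1, 2, 3, 4, 5, 6]
y = [0.10, 0.18, 0.35, 0.55, 0.74, 0.88]

# Fit on the logit scale, then predict on the original scale.
b0, b1 = fit_line(x, [logit(v) for v in y])

def predict(x_new):
    return inv_logit(b0 + b1 * x_new)

preds = [predict(v) for v in (-10, 0, 3.5, 20)]
print(preds)  # every prediction stays between 0 and 1, even far outside the data
```

The point of the design is the back-transformation: a straight line fitted directly to the index could predict values below 0 or above 1, while the sigmoid cannot.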
Louis G. Daily says
Jim, I have one continuous predictor (IV) variable in my multiple regression and one ordinal predictor (IV). Can I do this? Can I incorporate this ordinal variable which comes from a Likert scale into the multiple regression?
Jim Frost says
Hi Louis,
I write about this in more detail in my regression book. Go to my webstore for more details about it.
Ordinal variables have a mix of attributes of categorical and continuous variables, which makes including them in a model a bit more complex. You’ll need to enter your ordinal variable as either a continuous variable or a categorical variable. The decision depends on the nature of your data, the goals of your research, and the quality of fit each approach provides.
Abel Chipeta says
Hey Jim
Thanks for the clarifications on the models…they really help.
I am conducting a study on gender decomposition of the effect of education on household savings. Savings is my dependent variable, while education, gender, sex, age, income, marital status, and occupation are the independent variables. Please help me choose the best model to use, since saving is a categorical variable while the explanatory variables are a mix of both categorical and continuous variables.
Looking forward to your help.
Best regards
Jim Frost says
Hi Abel,
I would start with linear least squares regression (aka ordinary least squares). Your dependent variable is continuous (savings), which this type of regression can handle. Linear least squares can also handle a mix of continuous and categorical independent variables, such as you have.
Meriem says
Hi Jim,
Could you recommend any references?
Thank you in advance
joel says
Hi Jim,
I’m interested to find out the impacts of categorical independent variables on a continuous dependent variable.
For example, what are the effects of the type of news (financials, dividend payout) on the volatility of stocks.
Jim Frost says
Hi Joel, that’s possible in linear least squares regression along with other types of regression. Assessing the role of categorical IVs is a fairly common and basic usage for many types of regression models. If you have only categorical variables in your model, that’s often called ANOVA (analysis of variance). However, ANOVA uses the same mathematics “under the hood” as regression.
Gemechu says
I have two dependent variables (aboveground biomass and carbon content) and three independent variables (wood density, diameter, and height). Which model is best for analysing my data? Awaiting your reply. Thank you
Meriem says
Thank you for your answer. What about fixed and random effects? Is it OK to use the mix of binary and continuous variables?
Jim Frost says
Yes, just be sure to use a type of model where you can specify both fixed and random effects, such as the MIXED procedure in SPSS. That aspect does affect the type of procedures you can use.
Meriem says
Hi Jim,
Your content has helped me a lot in my work, Thank you!
I’m conducting a regression analysis using panel data on a sample of 74 individuals over an 8-year period. I’ve been wondering if I can use a mix of binary and continuous independent variables to explain a continuous dependent variable. Would that be OK? How would it affect the type of regression I use?
Thanks
Meriem
Jim Frost says
Hi Meriem,
Yes, it’s entirely OK to use a mix of binary and continuous independent variables for ordinary least squares. You still need to satisfy the same set of OLS assumptions, but there are no additional requirements. Binary independent variables are also known as indicator variables, and analysts frequently use them in linear regression. Typically, the 1s and 0s of an indicator variable represent the presence or absence of a characteristic. You just need to interpret their coefficients differently. The coefficient represents the mean difference between observations with and without the characteristic. The p-value indicates whether that difference is statistically significant.
I hope this helps!
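To make the interpretation of an indicator coefficient concrete, here is a minimal sketch with invented numbers. With a single 0/1 predictor and nothing else in the model, the least squares slope equals the difference between the two group means and the intercept equals the mean of the group coded 0. With other predictors in the model, the coefficient becomes an adjusted mean difference instead.

```python
def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Made-up data: indicator = 1 when the characteristic is present.
indicator = [0, 0, 0, 0, 1, 1, 1]
outcome = [3.0, 2.5, 3.5, 3.0, 3.8, 3.4, 3.6]

intercept, slope = fit_line(indicator, outcome)

# Group means computed directly, for comparison with the regression output.
group0 = [y for x, y in zip(indicator, outcome) if x == 0]
group1 = [y for x, y in zip(indicator, outcome) if x == 1]
mean0 = sum(group0) / len(group0)
mean1 = sum(group1) / len(group1)

print(f"intercept = {intercept:.3f}, mean of group 0 = {mean0:.3f}")
print(f"slope     = {slope:.3f}, difference in means = {mean1 - mean0:.3f}")
```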
Erick Loetz says
Hello Jim.
When using logistic regression, how critical is it to have balanced treatment groups? I have a binary response variable evaluated from two treatments at n=40 and n=14. How does the need for balance (approximately equal n’s for each group) compare to OLS analysis? Is there any published information on the subject? I appreciate your comments and expertise. Many thanks.
Jim Frost says
Hi Erick,
I don’t have a reference specific to binary logistic regression. However, in general, it’s OK to have unbalanced groups like that. Having balanced groups helps you maximize power for any given number of subjects. But, it’s OK to have an unequal number. I’m not sure for logistic regression, but for most analyses I’d say that having 14 in the smallest group is fine.
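The power point can be illustrated with the simpler case of comparing two group means, used here only as an analogy. With a common standard deviation, the standard error of the difference is proportional to sqrt(1/n1 + 1/n2), which is smallest when the groups are equal. Logistic regression standard errors also depend on the event rates, so the sketch below shows the direction of the effect, not its exact size.

```python
import math

def se_factor(n1, n2):
    """Standard error of a difference in two group means,
    expressed in units of the common standard deviation."""
    return math.sqrt(1.0 / n1 + 1.0 / n2)

unbalanced = se_factor(40, 14)   # group sizes from the question
balanced = se_factor(27, 27)     # the same 54 subjects split evenly
ratio = unbalanced / balanced

print(f"unbalanced: {unbalanced:.3f}")
print(f"balanced:   {balanced:.3f}")
print(f"the unbalanced design has a standard error {100 * (ratio - 1):.0f}% larger")
```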
Perry Gonen says
When doing multiple regression analysis, as opposed to a simple OLS, where we have a number of independent variables, do you recommend plotting each independent variable against the dependent variable, one at a time, to see what the plot of each variable on its own (without the other variables) against the dependent variable looks like? And then, after analyzing each plot on its own, going forward with the statistical analysis?
Sarah says
Hi Jim,
I am hoping to do a regression analysis on social posts. The question I am usually trying to answer is does a certain variable (for example a photo vs no photo) play a role in engagement rate.
I’ve been going about this as coding the post: “Does this post have a photo or not?” Yes = 1, No=0 , but then I’m unsure what type of analysis would make the most sense?
(I would replicate this analysis on a bunch of different categories as well, like is the post light vs. dark, morning or night etc )
Any help/ thoughts would be much appreciated!!
Sarah
Nick Dekkers says
Dear Jim,
I would like to describe the correlation between 1 categorical factor (3 levels) and 2 continuous responses which are non-linear over time. I have a continuous factor (concentration) which I would like to add to the model. But I’m not sure if I should use MANCOVA or a non-linear regression model. What would you suggest?
Thanks in advance!
Daphne says
Dear Jim,
I’ve been reading some of your blog posts and they are very helpful! However, I do have some remaining doubts about the regression I should run for my thesis. The doubt is mainly related to the setup of my dependent variable. I am constructing a measure of corporate social responsibility performance by summing strengths and concerns across several dimensions for each firm per year. There are about 50 dimensions in total, and a firm is given a 1 if it is known to perform well/poorly on a dimension and a 0 otherwise. Hence the dependent variable is constructed by summing these binary observations. The minimum value of both the total strengths and the total concerns variable is 0. The maximum value of the strengths is 44 and of the concerns is 36. I thus have only non-negative integer values. The mean for both strengths and concerns is around 3.0; however, the standard deviation of the strengths is 3.8 and of the concerns is 4.9. I’m hoping to be able to use OLS regression; however, I was wondering whether there is enough variance in the DV and whether it can be considered a continuous variable. After reading this blog post I also started wondering about the count data dependent variable and whether something like a Poisson regression isn’t better suited. I was also considering ordinal logistic regression. For a bit more clarity: I am trying to find the effect of several owner types on CSR performance. I hope you can help me out! In case you have some questions about my dependent variable, please let me know, and I will elaborate further!
Many many thanks in advance!
Jayson says
Hello Sir Jim,
What if there are 12 variables in 2 variants (Variables 1 to 6 measure well-being and Variables 7 to 12 measure positive thinking)?
Should I run simple linear regression or multiple? How can I know if the variables are affecting each other? Thank you
Jessica says
Hello Mr Jim Frost,
Thank you so much for the information you have provided on your website, as well as the answers to the questions above they are very helpful!
I am currently designing a research proposal that will assess medication adherence for rural vs urban groups in Canada. I plan on using an adherence scale that will give the participants a score (from 1-8) and based on the score, their adherence will be categorised. So >7 =high adherence. 5-8 =moderate adherence. <5 is low adherence.
I am planning the statistical analysis for the adherence.
Based on the responses I have read, would it be correct to conduct an ordinal logistic regression for the two groups? We’re also planning on conducting an independent samples t-test between the two groups and a paired samples t-test to compare the same group at the start and end of the study. Would these t-tests be possible if we transformed the data to continuous variables?
We also wanted to conduct a Chi-square test; is this possible if we don’t have estimated adherence scales?
How would you go about the statistical analysis for this study?
Thanks!
Jessica
Jim Frost says
Hi Jessica,
Yes, your plan sounds like a sound one. Ordinal logistic regression is a good choice. You can use t-tests for Likert scale data. Your data aren’t exactly Likert scale. It’s an 8 point ordinal scale. My guess (though I don’t have a reference to cite) is that you could not use a t-test based on the recategorized scale but probably on the full eight point scale. If you really want to use the recategorized scale, you might need to use a Mann-Whitney test instead. Transformation is probably not necessary. As an ordinal scale has more points, it becomes more like continuous data. Not perfectly so, but studies have shown that at least 5 point scales are close enough to use t-tests. Read my post about using t-tests for Likert data for more information.
About the chi-square test, I’m not sure what you mean by not having estimated adherence scales?
I think any of those approaches are valid. Ordinal logistic regression has the benefit of being designed for ordinal data. While studies have shown that t-tests and Mann Whitney tests can both work, it avoids a potential debate about the results if you just use a test designed for that data type! Consequently, I’d probably lean in that direction myself.
Geska says
Hi Jim,
If one respondent answers the same set of questions for different brands, when I run the regression analysis, should I compare both answers and use a dummy variable, or should I run the regressions separately? One of my aims is to look at how customer trust affects customer loyalty.
Jim Frost says
Hi Geska,
Let me make sure I’m understanding your scenario correctly. You’ll have the respondents answer the same questions about two or more brands and you’ll have some dependent/outcome variable that you’re measuring? And, you want to determine how the responses relate to the brands?
In that case, you’ll need to use a mixed ANOVA model. You’ll need to include a subject ID as a random factor, survey responses as fixed factors or covariates depending on the type of response, and the brand itself as a fixed factor.
I hope that helps! Unfortunately, I haven’t written a post about mixed models yet. But, that’s the direction it sounds like you need to go!
Dave says
Hi Jim
I am wondering whether regression analysis can be used to predict the strength of relationships between multiple income variables and bottom line expenditure. I have a conundrum in how I compare business performance pre-Covid (Feb) to future performance in, say, October, and how I might model and estimate future income and expenditure. Can you enlighten me?
Fei says
Hi Jim, thank you for sharing this information. It is very helpful! I have a question regarding gender. There are only a few males in my sample, the female – male ratio is 14:1. Will this affect my analysis if I run a multiple linear regression? How about a simple mediation or moderation analysis (IV, DV, mediator/moderator). I plan to use gender as a control variable for these three types of analyses.
Thank you! I look forward to your reply.
Jim Frost says
Hi Fei, because there are so few males, it’ll be hard for the model to estimate the effect of gender and the statistical test will have low statistical power. However, if the effect of gender is large, it might still be a valuable addition and you could possibly get significant results. Honestly, I would just try it and see how it works. Don’t expect too much from gender, but if the effect is large it might still be important to include it.
Michaela says
Hi Jim,
I hope you can help me, I am riding the clock a bit.
I hope you are well and safe with all the madness of COVID-19 in the world. I was hoping you could please help me. I am a student currently in need of a lot of help. Statistical analysis is not my forte but when needs must, we grin and bear it and look to the professionals for help.
I have just received my data back from my survey and I have found that my independent variables are all 1-5 Likert scales. I am looking at consumer attitude and intention; however, I believe, if I am not mistaken, that my dependent variable is my ‘Intention’ as that is what I am looking to find out overall. From my research, I think ordinal logistic regression is most appropriate to use. I ended up looking at YouTube for how to run the test and what goes where, but a lot of the videos contradict each other.
My independent variables are;
Relative advantage/ Compatibility/ Environmental Impact / Psychological ownership.
Each of these are subcategories with between 4 to 8 questions under each – All 5 point Likert
I need to determine these expectancy values against attitude and intention but I have no idea how to.
So my questions are:
Firstly, am I using the right regression?
Do the IVs go into the ‘factors’ or the ‘Covariate’ box?
Is it Test of Parallel lines where I am receiving my null hypothesis data?
Is there another test I should be running with it?
Also when analysing the output, if I run it through ‘factor’ box, in the parameter estimates, under the location information, is the significance data suggesting that my data is proving that – say for instance – Relative advantage significance is linked to consumer intention or does it mean that there is no link between RA and intention?
Thank you very much for your time. I hope you can help.
Jim Frost says
Hi Michaela,
Choosing the correct type of regression depends on the dependent variable, and I’m not sure what your DV is. If your DV is also Likert scale, then, yes, ordinal logistic regression is the correct type.
For the IVs, Likert scale items can be tricky if you’re using the individual item scores for your values. They’re not continuous but they’re not categorical. However, you’ll need to model them one of those ways. That gets to your question about whether they are “factors” (categorical) or covariates (continuous). There are pros and cons for both ways. And, the goals of your study also play a role. I don’t have an article to refer you to about that, but I do write about it in my regression ebook.
If you’re summing or averaging Likert items together and using those values in your model, you might be able to treat them as continuous variables (covariates).
The test of parallel lines for ordinal (or ordered) logistic regression tests whether the coefficients are the same for all levels of your DV, which I’m assuming are the Likert scale values. This test really is about determining whether you have a good model or not. If this test is significant, it’s not good. It means that there is something wrong with your model, but it doesn’t tell you what exactly. You could just have a poor fitting model or you might be using the incorrect link function. You want this p-value to be greater than your significance level.
You should also look at the chi-squared goodness-of-fit tables. Again, you want p-values that are higher than your significance levels. Low p-values are problematic here because they indicate your model doesn’t fit the data well.
You should also look at the pseudo R-squared values. Like the regular R-squared in linear regression, they give you an idea of how well the model fits the data. Low values don’t necessarily mean the model is incorrect but they do mean that it won’t be good for prediction.
For the location parameter estimates, low p-values indicate that the IV is statistically significant. This suggests that the effect of the IV does not equal zero (i.e., there is an effect). It’s a good thing!
I hope that helps! Best of luck with your study!
Robbie says
Hi Jim
I have read your books and a lot of your blog and have found all incredibly useful. One area I am repeatedly confused by (perhaps as it is not in your books) is the interpretation of logistic regressions.
I am looking at various negative outcomes for children (child labour, begging, etc.) and comparing them with a range of households characteristics (child headed, large households, elderly headed households etc)
The results show some interesting findings. For instance, regressing the binary dependent variable of ‘begging’ or not gives me an odds ratio of 2.5 for child headed households. For the same regression, the margins command shows probabilities of 0.57 for child headed households and 0.35 for adult headed households.
I am at a loss as the best way to interpret these findings in a manner which makes sense for the data…
Is it more intuitive to say child headed households are 2.5 times more likely to beg than adult headed households (or is this not correct interpretation), or is the better interpretation that child headed households are 1.6 times more likely to beg than adult headed households (0.57 v 0.35).
Can you assist?
Robbie says
Hi Jim
Thanks so much for your useful blog and posts. It’s really helped me understand regressions and do some really interesting analysis. And inevitably led me to buy your books on linear regressions and statistics… I only wish there was one on logistic regressions too 🙂
I have run a simple logistic regression using a child labour survey in an East African country.
I ran the number of households that reported begging against those households that were child headed.
The logistic regression showed that child headed households had an odds ratio of 2.51. The constant was 0.54. I interpret this as meaning that child headed households are 2.5 times more likely to beg than non-child headed households (although I am not sure if I need to subtract the constant?)
However the margins command shows the predicted probability of begging in a child-headed household is 0.57 while in a non-child headed household it is 0.35. The odds ratio and margins do not speak to each other as I understand. How can the odds of begging in a child headed household be 2.5 times greater using the odds ratio but the probability of begging in a child-headed household be far less at 1.6 times greater?
Apologies for the no doubt obvious question, but I am struggling to find any answers.
. logistic begging i.child_head
Logistic regression                             Number of obs     =      1,288
                                                LR chi2(1)        =       3.95
                                                Prob > chi2       =     0.0470
Log likelihood = -836.89941                     Pseudo R2         =     0.0024
------------------------------------------------------------------------------
     begging | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.child_head |    2.51981   1.180173     1.97   0.048     1.006238    6.310081
       _cons |    .545676    .032052   -10.31   0.000     .4863365    .6122557
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
.
end of do-file
. margin child_head
Adjusted predictions                            Number of obs     =      1,288
Model VCE    : OIM
Expression   : Pr(begging), predict()
------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  child_head |
          0  |   .3530339   .0134158    26.31   0.000     .3267393    .3793285
          1  |   .5789474    .113269     5.11   0.000     .3569443    .8009505
------------------------------------------------------------------------------
Varma says
Jim,
Thanks for the valuable advice!
jamesey10 (@jamesey10) says
Hi Jim,
Maybe you can tell me if I’m on the right track or getting derailed.
I think an Ordinal Logistic Regression or Poisson Regression is right for me. I’m still obtaining my counts and learning RStudio, but I’ve been reading your site and others to figure out what I’ll do with the data I collect.
I have over 1200 observations from content analysis (text mining)
My dependent variable is a count of words relating to one of 5 levels (a,b,c,d,e.)
My independent variable is a count of words relating to one of 4 typologies (y,z,w,x-order does not matter.)
I want to determine if any of the 4 typologies correlate to any of the 5 levels.
Dan says
Hi Jim,
Thanks for your article, it is very helpful!
I have a question about my research: I want to examine if a correlation exists between clinical characteristics (amount of pain, degree of physical disability, etc.) and radiological scoring of an X-ray. Most clinical parameters are continuous variables, but one is binary (success/failure) and another is Likert scale with 7 categories (ordinal of course). The X-ray scoring is ordinal with 4 categories (from no degeneration to extensive degeneration).
I figure I could use Spearman’s rho for continuous variables; Kendall’s tau, Somers’ d or Goodman and Kruskal’s gamma for Likert; and rank-biserial correlation coefficient for binary variable (success/failure). This way I can correlate these variables to the ordinal X-ray scoring.
The next step is to do regression analysis to determine which independent variables account most for the X-ray scoring. Since my dependent variable is ordinal I was thinking of ordinal logistic regression. But is it possible to incorporate continuous, ordinal and binary independent variables into this model?
Kind regards,
Dan
Jim Frost says
Hi Dan,
Yes, ordinal logistic regression is the way to go for your data. It can handle continuous and binary data with no problem. It can handle categorical data with recoding as indicator variables, which most software should do automatically these days. Your ordinal independent variable is a bit problematic. You’ll need to include it either as a continuous variable or a categorical variable. There are pros and cons for each way. The correct answer depends upon a combination of the goals of your analysis and the nature of your data. I don’t have an article to point you towards but I do write about it in my regression ebook. I write about it in the context of least squares regression, but the same principles apply to ordinal logistic regression.
Best of luck with your analysis!
Varma says
Great post, Jim.
I am dealing with a challenge. I am not sure what regression analysis method to use to analyze a data set where the independent variables are in Likert scale (1 through 5 – completely disagree through completely agree) and the dependent variables are Likert items (1 through 5 – completely disagree through completely agree).
Your advice would be greatly appreciated.
Thanks in advance
Jim Frost says
Hi Varma, because your dependent variable is ordinal (Likert scale), you’ll need to use ordinal logistic regression. However, including ordinal independent variables is problematic. You’ll need to include them as either categorical or continuous variables. Each approach has strengths and weaknesses. The correct choice depends on the goals of your analysis and the characteristics of your data. I write about it in my ebook about regression analysis.
Best of luck with your analysis!
YUSUF JAMAL says
Hello Jim,
I want some help in determining a threshold population density. I have two data sets. One is the population density in many counties of the US and the other is the number of cholera cases in those counties. I want to figure out the population density at which there is a 50% chance of getting infected with cholera. Can regression help me here?
Thank You so much
Jim Frost says
Hi Yusuf,
It sounds like you need to use binary logistic regression because that’ll allow you to calculate the probability of an event occurring (cholera infection). You’d need the number of events (infections) and opportunities (county population) along with the population density values. Although, I’m not sure if any US county ever got up to 50% cholera infections.
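As a rough sketch of how such a threshold falls out of a fitted binary logistic model: the predicted probability is 1 / (1 + exp(-(b0 + b1 * density))), so it equals 50% where b0 + b1 * density = 0, that is, at density = -b0 / b1. The coefficients below are hypothetical placeholders, not estimates from any cholera data, and the function names are invented for the example.

```python
import math

def predicted_probability(b0, b1, x):
    """Probability of the event from a one-predictor logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def x_at_probability(b0, b1, p=0.5):
    """Predictor value at which the model predicts probability p."""
    return (math.log(p / (1.0 - p)) - b0) / b1

# Hypothetical coefficients (intercept, slope per unit of density).
b0, b1 = -6.0, 0.004

threshold = x_at_probability(b0, b1)            # same as -b0 / b1 when p = 0.5
check = predicted_probability(b0, b1, threshold)

print(f"density at 50% probability: {threshold:.1f}")
print(f"predicted probability there: {check:.3f}")
```

If the estimated threshold lies far outside the observed range of densities, it is an extrapolation and should be treated with caution.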
ronnie says
Hello Jim, my doctor in college gave me the task of doing a regression analysis on the impact of smoking and exercise level, and I don’t know how to approach it.
EJ says
Hi there! I am trying to find the right correlation test for my data. I have a continuous independent variable, and a categorical dependent variable (with 4 possible categories). Do you have any suggestions for this type of data?
Jim Frost says
Hi EJ,
I’m not sure that finding a correlation is what you’d want to do with that type of data. Instead, use a one-way ANOVA to determine whether the mean of your continuous variable is different between the four groups. Click the link to see an example of that analysis.
Alexis says
Hello Jim, I need your help with my master’s project. I have a set of audio recordings from which I extracted the MFCC matrix, and the corresponding image sequence from which the mouth landmarks were detected for each MFCC frame. My goal ultimately is to give an audio clip to a system that will predict the mouth landmarks for each MFCC vector, but I literally have no idea what that system should be or how to proceed. Can you give me any advice concerning that?
Yusuf Jamal says
Hello Sir,
Is the Hosmer-Lemeshow statistic really important in binary logistic regression? Is a Hosmer-Lemeshow p-value of less than 0.05 sufficient to discard an otherwise good binary logistic regression model?
Thank You
Jim Frost says
Hi Yusuf,
In binary logistic regression, the Hosmer-Lemeshow goodness-of-fit test compares the observed and expected frequencies of events to determine whether the model adequately fits the data. It evaluates whether the differences between the expected probabilities and the observed probabilities are statistically significant. Consequently, your low p-value indicates that your model does not fit the data well because these differences are significant.
In this case, try different link functions and/or change your model.
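To show what the test compares, here is a minimal sketch of the Hosmer-Lemeshow statistic itself. Cases are sorted by predicted probability and split into groups, and within each group the observed number of events is compared with the number the model expects. The data are invented, and the grouping rule is a simplification: statistical packages may form groups and handle ties differently, so their values can differ from this sketch.

```python
def hosmer_lemeshow(y, p, groups=10):
    """Return (statistic, degrees of freedom) for observed outcomes y (0/1)
    and predicted probabilities p, using equal-sized groups."""
    pairs = sorted(zip(p, y))              # order cases by predicted probability
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        observed = sum(yi for _, yi in chunk)   # events that actually occurred
        expected = sum(pi for pi, _ in chunk)   # events the model predicts
        stat += (observed - expected) ** 2 / (expected * (1.0 - expected / size))
    return stat, groups - 2

# Made-up predictions: five cases at each of four probability levels.
p = [0.2] * 5 + [0.4] * 5 + [0.6] * 5 + [0.8] * 5

# Outcomes that agree with the predictions (1, 2, 3, 4 events per group) ...
y_good = [1, 0, 0, 0, 0,  1, 1, 0, 0, 0,  1, 1, 1, 0, 0,  1, 1, 1, 1, 0]
# ... and outcomes that run the opposite way (4, 3, 2, 1 events per group).
y_bad = [1, 1, 1, 1, 0,  1, 1, 1, 0, 0,  1, 1, 0, 0, 0,  1, 0, 0, 0, 0]

stat_good, df = hosmer_lemeshow(y_good, p, groups=4)
stat_bad, _ = hosmer_lemeshow(y_bad, p, groups=4)
print(stat_good, stat_bad, df)
```

The statistic is compared against a chi-squared distribution with the returned degrees of freedom. A large statistic, and therefore a small p-value, signals that observed and expected counts disagree.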
haneen says
Another question, sorry, but I am a beginner in statistics. Now I have to check the residual plot (same as the scatterplot, right?) for positive correlation, right? If there is a positive correlation, does it mean the dependent variable is normally distributed? Or shall I do another test to check for its normal distribution? One more question: do I have to check the residuals and the normal distribution for the dependent variable “for the means” or for the original responses?
Thanks a lot for your response
Jim Frost says
Hi, I have an article about residual plots. I put a link in my other reply. That’ll answer your questions.
haneen says
Dear Jim,
Yes, there is a positive relationship. By the residual plot, you mean the scatter plot, right? Can I send a screenshot of the scatter plot as a message on your fb page just to make sure that all is OK?
Jim Frost says
Hi Haneen,
You use residual plots to check the OLS assumptions. Read those posts for more information.
Haneen says
Dear Jim,
I have a sample of 140 instructors and 21 tools on the Moodle website, and I want to show statistically that the awareness of Moodle tools (in general) influences their usage.
I conducted a 5-item Likert scale survey, with two questions for each tool: one asks about awareness of the tool and the other about usage of the tool. Then I calculated the overall usage of each tool (by finding the mean of the responses for each tool) and the overall awareness of each tool (by finding the mean of the responses for each tool). Then I used a simple linear regression between the overall awareness and the overall usage (i.e., between the means), resulting in 21 points in the scatter plot. Is this right? Or shall I run the linear regression on all responses for all tools (21*140 points in the scatter plot)?
Can you kindly reply a.s.a.p?
regards,
Jim Frost says
Hi Haneen,
I’m not quite clear about your model. For Tool A, are you looking at the mean awareness of tool A by all subjects and the mean usage of tool A for all subjects? And, you have 21 tools, so for your model, your 21 observations are basically 21 means for the DV and the IV? And each mean is calculated using 140 values. Is that right?
You should be able to fit the model that way. Typically, each observation is an individual but in your case each observation is a mean. But, it should tell you whether higher awareness leads to higher usage. You should find a positive relationship. Just be sure to check the residual plots. Using 140 values to calculate each average should allow you to treat the underlying ordinal variables as continuous data. However, because they are based on ordinal data with a constricted range and the difference between values might not be constant, you might have curved relationships and nonnormal distributions. Pay particular attention to the residual plots. If you satisfy the assumptions, you should be good. If the distribution of the DV is highly skewed, you might need to use a transformation. Or possibly fit curvature.
Rose says
Hi Jim,
I really liked your article, I only have one question. My dependent variable is a categorical variable with three levels, and my independent variable is a discrete variable with 5 levels. Is ordinal logistic regression then the right way to analyse my data?
Thanks in advance!
Rose
Vanessa says
Hi Jim,
Thank you so much for all your valuable posts! I was going over them in order to prepare for my own analysis.
I decided to run both an ordered logit regression (ordered categorical dependent variable) and a multinomial logit regression (unordered categorical dependent variable). My question right now is: what is the best way to interpret the coefficients from the logit regressions?
Should I use the odds ratio? The RRR? Or simply use conditional or average marginal effects? Is there anything I need to take into account?
Many thanks in advance for your reply! I am so looking forward to it!
Jamaldeen Nurudeen says
Please, which type of regression is appropriate to quantify the influence of funding on academic success, and why?
Jim Frost says
Hi Jamaldeen, you don’t pick the type of regression based on the subject area but by the types of variables you have, particularly the type of dependent variable. So, determine what your dependent variable is and then look through this post for the types of regression that can analyze that type of dependent variable.
David Lee says
Do you discuss the use of the information-theoretic approach in your blog? In particular, do you discuss the use of AICc in comparing models? I recently purchased your Regression Analysis book but don’t see this discussed.
Gaia Fiordispini says
Thank you very much, Jim! I really appreciate your responsiveness.
I know it would be better to treat it as a nested dataset, but this makes the analysis more complex and difficult for me to understand. I am a marketing student and not a statistician, and my research question is whether there is a relationship between the maturity levels (fixed values for each brand, which I want to use as continuous independent variables) and the consumer mindset metrics (different for each respondent, who answered for one brand only, which I want to use as my dependent variable). Wouldn’t it be sufficient to run a correlation? It would come with limitations, but I have to answer my research question within my competencies.
Thank you and all the best!
Jim Frost says
There are several considerations. If your DV is Likert, you can’t use the regular Pearson’s correlation, you’d need to use Spearman’s correlation.
However, there is a larger issue with the correlation approach. When you include multiple independent variables in your model, it automatically controls for all the variables in the model. Consequently, the effect of each variable is calculated while controlling for (i.e., holding constant) the other variables in the model. You lose that with correlation because you’re assessing IVs one at a time. You could end up with biased estimates–basically omitted variable bias, which might or might not be a problem with your data. That’s a risk to weigh and involves subject-area knowledge that I don’t have. Read here to see an example of this problem in action.
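The omitted variable concern described here can be sketched with a small simulation. This is only an illustration under stated assumptions: the data are simulated, the true effects (1.0 for `x1`, 2.0 for `x2`) and the correlation between the two predictors are chosen arbitrarily, and `centered_cross` is a hypothetical helper.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the simulated data are reproducible
n = 5000

# Two correlated predictors; both truly affect y (assumed effects: 1.0 and 2.0).
x2 = [random.gauss(0, 1) for _ in range(n)]
x1 = [0.8 * v + random.gauss(0, 0.6) for v in x2]
y = [1.0 * a + 2.0 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

def centered_cross(u, v):
    """Sum of cross-products of u and v after centering each on its mean."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))

s11, s22, s12 = centered_cross(x1, x1), centered_cross(x2, x2), centered_cross(x1, x2)
s1y, s2y = centered_cross(x1, y), centered_cross(x2, y)

# Assessing x1 on its own (as a pairwise correlation would) ignores x2.
slope_alone = s1y / s11

# Least squares with both predictors: solve the 2x2 normal equations.
det = s11 * s22 - s12 ** 2
slope_controlled = (s22 * s1y - s12 * s2y) / det

print(f"x1 slope, x2 omitted:        {slope_alone:.2f}")
print(f"x1 slope, x2 held constant:  {slope_controlled:.2f}")
```

In this simulated setup, the slope for `x1` estimated on its own absorbs part of the effect of `x2`, while the two-predictor fit lands near the assumed true value of 1.0. How much this matters for real data depends on how strongly the predictors are related.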
I’m rethinking the nested design as I think I’m understanding your design better. Nesting occurs on the IV side of things. And you don’t have that there. Respondents are answering for one brand, but their responses are on the DV side. So, disregard that. I’d perform ordinal logistic regression with no nesting.
You could perform Spearman’s correlation but with the caveat I mentioned earlier.
Gaia Fiordispini says
Thank you! My supervisor suggested that I think about CLUSTERED DATA and perform a REPEATED MEASURES MULTIVARIATE ANOVA. Does this mean I need to transform my dataset? Also, I do not understand whether it is the proper way to analyze my data, or whether a specific type of correlation (e.g., I tried Spearman) would be enough!
Jim Frost says
Hi Gaia,
Unless I misunderstood your design, which is possible (see my other comment), it doesn’t sound like you’re using repeated measures. It does sound like you have a nested design because each respondent is answering about one brand.
My other comment has thoughts about the type of regression and other details about your model. It’s a rather complex mix of things, at least if I’m understanding correctly.
Gaia Fiordispini says
Hi Jim! Could you please reply to my question? It is important for my Master’s thesis!
Jim Frost says
Hi Gaia,
So sorry about the delay. I’ll get to it soon. It unintentionally fell through the cracks!
Lauren says
Hi Jim,
Great article! I am still a bit confused. Would a type of regression still be best if the data included 1 categorical independent variable (two groups), 2 continuous dependent variables, and 1 categorical dependent variable?
Jim Frost says
Hi Lauren,
Typically, you have just one dependent variable per regression model. If you have multiple dependent variables, you’ll need to fit separate models.
You can use least squares models to fit models for your continuous dependent variables with the categorical independent variables. Although, if you have a continuous DV and just two groups, you can use a 2-sample t-test for that instead of regression.
For the categorical dependent variable, you’d need to use a binary logistic model if it has two levels or a nominal logistic model if it has more than two levels and assuming there is no natural order to the levels.
I hope that helps!
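The remark above that a 2-sample t-test can stand in for regression when there is a continuous DV and just two groups can be checked numerically. The sketch below is a minimal illustration on simulated data (the group sizes, means, and spread are assumptions). It compares the pooled-variance t-statistic with the t-statistic for the slope of a least squares model that uses a 0/1 indicator for group.

```python
import math
import random
from statistics import mean, variance

random.seed(1)  # fixed seed so the simulated data are reproducible
group_a = [random.gauss(10.0, 2.0) for _ in range(30)]  # simulated continuous DV
group_b = [random.gauss(11.5, 2.0) for _ in range(35)]

# Two-sample t-test with pooled variance (equal variances assumed).
na, nb = len(group_a), len(group_b)
pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
t_test = (mean(group_b) - mean(group_a)) / math.sqrt(pooled_var * (1 / na + 1 / nb))

# Least squares regression of the DV on a 0/1 indicator for group membership.
x = [0] * na + [1] * nb
y = group_a + group_b
x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx                      # equals the difference between group means
intercept = y_bar - slope * x_bar      # equals the mean of group A
sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
se_slope = math.sqrt(sse / (len(y) - 2) / sxx)
t_regression = slope / se_slope

print(f"t from 2-sample t-test:    {t_test:.6f}")
print(f"t from regression slope:   {t_regression:.6f}")
```

The two statistics agree up to floating-point rounding, which is why either procedure works for this case. Note that the match relies on the equal-variances form of the t-test.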
Diego says
Hi Jim,
Great article! Thanks for the info.
I have a question about which regression to use for the dataset I’m currently working on.
I’m trying to analyze the impact of HIV on MSK health among women aging with HIV. So, I assessed body composition measures (fat mass, lean mass, etc.) in a cohort of HIV-positive women and HIV-negative women to see if there are differences between them. I also ran some physical tests (grip and chair standing, among others). I want to see whether body composition measurements affect the scores on those physical tests, and how HIV influences that.
In this case, since the body composition measurements are continuous variables and the physical test scores are also continuous variables, should I use a linear regression model? I’m not sure how to include the HIV impact in that model. Thanks!!
Best,
Diego
Gabriel Villas says
Hi Rahma!
If you want to investigate the level of attendance, your dependent variable will be categorical with a natural order (Low, Medium, and High).
Considering this, the regression model I suggest is ordinal logistic regression.
Kind regards,
Gabriel Villas.
Jim Frost says
Hi Gabriel,
A categorical variable with a natural order is an ordinal variable, and I agree with the recommended analysis. Even better would be to record the number of days attended and use either least squares regression or Poisson (or related) regression depending on how well the count approximates the normal distribution. You’d probably get more information that way.
Gaia Fiordispini says
Hello I have a question regarding a dataset I must analyze. I did two research steps:
1. Based on a managerial maturity model, I ranked 8 brands according to their level of usage of social media tools. For example, how many different content formats they use, scored from -3 to +5. There are a total of 7 components that, summed together, give the total maturity score of each brand.
2. I made a survey among consumers to measure their level of brand awareness, consideration and purchase intent. Each respondent was randomly assigned to a brand for which he had to give his rankings, so I have 8 conditions in the dataset.
3. Now my hypothesis is that there is a relationship between the level of maturity and the consumer mindset metrics (namely higher levels of maturity should score higher levels of mindset metrics).
So the questions are:
– I am analyzing the full sample because each condition has only 50 respondents. Should I split the dataset per condition for some analysis?
– The variables I want to compare are: interval Likert scales 1-5 for the mindset metrics (dependent) and ratio variables (I will probably rescale them to 0-100%, for example) that are fixed for each condition (independent). How should I do the correlation and regression analysis? Specifically, which regression is more appropriate?
Sorry for the long request but I tried to be as precise as possible.
Jim Frost says
Hi Gaia,
This goes much deeper into the details of a study than I can usually go in blog comments. As you’ll see, your design is a complex mix of elements. From your description, I’m not sure that I have a completely clear understanding of your study, which makes it difficult to comment. However, I think I’ve got it, and here are some thoughts.
If your dependent variable consists of ordinal data (Likert scale), you’ll probably need to use something like ordinal logistic regression–unless the DV is a sum of multiple Likert items, which you can sometimes treat as continuous. If the DV is a sum of multiple Likert items, you can try using least squares regression but you’ll need to pay particular attention to the residual plots to ensure you’re getting an unbiased fit.
It sounds like each respondent is nested within one condition (brand). If that’s the case, you’ll need to use a nested design. You’ll probably need to include respondent as a random factor (as opposed to a fixed factor) in your design.
That’s a complex model to run and you might need to consult with a statistician at your institution to help you out.
I hope that helps! Best of luck with your analysis!
Naduth Must says
Hi
I have a dataset with 3 continuous variables in which the dependent variable includes negative values too. How can I decide what type of regression to apply (other than linear)? By looking at the scatterplot or something?
Rahma says
Hi!! I have a question!
I have data on the number of classes that 86 students attended.
I have grouped those students into LOW attendance, MEDIUM attendance and HIGH attendance, based on the number of classes they attend.
And I have many independent variables (age, gender, BMI, smoking status, employment status, marital status), both categorical and continuous.
1) Is my Dependent Variable categorical or continuous?
2) Which regression do I use?
Waiting for your answer!!! Thank you
Maggie says
Hi, Jim.
I really need your help on the appropriate test for my undergraduate thesis. I controlled 3 categorical variables (A, B, C). Each has 3 levels. This is a within-subject experiment where each subject was tested under 12 conditions (e.g., condition 1: A1, B1, C2; condition 2: A1, B2, C2…). Could you please suggest a test to analyse the contribution of each variable? I was thinking of binary logistic regression since the response is binary.
Thank you so much for your attention!!
Jim Frost says
Hi Maggie,
In general, that sounds like the right approach. One potential hiccup is that, given the within-subjects nature of the study, you’ll need to include subject as a random factor. I discuss this a bit in my post about repeated measures designs. I’ve never seen a similar type of analysis with a binary dependent variable and a random independent variable. So, I’m not sure if there’s software out there that can do that analysis. I’d consult with a statistician who can look into the special needs of your study and find the best solution. If it weren’t for the random factor, I’d say that binary logistic regression would meet your needs.
Katuta says
Hi Jim
I want to analyze visual acuity outcome, which is primarily a continuous variable, but I have split its range into three groups: good outcome, borderline, and poor outcome. Can I use ordinal or multinomial regression, or am I still required to use linear regression?
My independent variables are age, sex, type of cataract, comorbidities, and level of education.
Thanks.
Jim Frost says
Hi Katuta,
You can use ordinal logistic regression. However, if you have the continuous data for acuity outcome, you might consider analyzing it using linear regression on the continuous data rather than converting it to an ordinal variable. But, if you do transform it as you describe, use ordinal logistic regression.
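One way to see why keeping the continuous measurement can be worthwhile is to compare how strongly a predictor relates to an outcome before and after the outcome is cut into three groups. The sketch below is purely illustrative: the data are simulated, the variable names `age` and `acuity` are stand-ins, the cut points are sample terciles chosen for convenience, and Pearson's correlation on the group codes is used only as a rough yardstick, not as the recommended analysis.

```python
import math
import random
from statistics import mean

random.seed(2)  # fixed seed so the simulated data are reproducible
n = 5000

age = [random.gauss(0, 1) for _ in range(n)]            # standardized predictor (simulated)
acuity = [0.5 * a + random.gauss(0, 1) for a in age]    # continuous outcome (simulated)

# Cut the continuous outcome into three ordered groups at its sample terciles.
ranked = sorted(acuity)
low_cut, high_cut = ranked[n // 3], ranked[2 * n // 3]
acuity_group = [0 if v < low_cut else (1 if v < high_cut else 2) for v in acuity]

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    u_bar, v_bar = mean(u), mean(v)
    suv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    suu = sum((ui - u_bar) ** 2 for ui in u)
    svv = sum((vi - v_bar) ** 2 for vi in v)
    return suv / math.sqrt(suu * svv)

r_continuous = pearson(age, acuity)
r_grouped = pearson(age, acuity_group)

print(f"correlation with continuous outcome:   {r_continuous:.3f}")
print(f"correlation with three-group outcome:  {r_grouped:.3f}")
```

In this simulation the grouped outcome shows a weaker association with the predictor than the continuous one, reflecting the detail discarded by grouping. If the outcome is grouped anyway, ordinal logistic regression remains the appropriate model, as stated above.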
Priya Mohan says
Hi Jim,
Your post is very helpful and very clear, thank you. In my observational study, the outcome variable is categorical and dichotomous (yes/no), and the independent or explanatory variables are also categorical, for example age group, gender, education, smoking, etc. I understood that we need to use binary logistic regression. Is this right? And how would we interpret the beta coefficient?
Fads says
Hi Jim, I have a question about the type of regression analysis that would be most appropriate for me to use. I’m investigating the impact that factors such as ethnicity, geographical region, and gender have on educational attainment at GCSE level.
I have 3 categorical variables (ethnicity, geographical region, and gender), which I’m measuring against a continuous dependent variable.
Thank you!
ariel says
Hi Jim,
Thank you again for the prompt response.
My regression lacks the offset variable.
So, in the Parameter Estimates table (SPSS) of this regression there is an Exp(B) column, and I wonder whether to refer to its values as IRR or RR?
thanks,
ariel
ariel says
Hey Jim,
You are right, my dependent variable is not a count variable, but it is not a pure rating variable either. It is the difference (delta) between the student rating before and after a course, and it has a Poisson distribution. Is it wrong to use Poisson regression for this type of variable?
Thank you for the prompt answer,
Ariel