Regression analysis mathematically describes the relationship between independent variables and the dependent variable. It also allows you to predict the mean value of the dependent variable when you specify values for the independent variables. In this regression tutorial, I gather together a wide range of posts that Iโve written about regression analysis. My tutorial helps you go through the regression content in a systematic and logical order.
This tutorial covers many facets of regression analysis including selecting the correct type of regression analysis, specifying the best model, interpreting the results, assessing the fit of the model, generating predictions, and checking the assumptions. I close the post with examples of different types of regression analyses.
If youโre learning regression analysis, you might want to bookmark this tutorial!
When to Use Regression and the Signs of a High-Quality Analysis
Before we get to the regression tutorials, Iโll cover several overarching issues.
Why use regression at all? What are common problems that trip up analysts? And, how do you differentiate a high-quality regression analysis from a less rigorous study? Read these posts to find out:
- When Should I Use Regression Analysis?: Learn what regression can do for you and when you should use it.
- Five Regression Tips for a Better Analysis: These tips help ensure that you perform a top-quality regression analysis.
Tutorial: Choosing the Right Type of Regression Analysis
There are many different types of regression analysis. Choosing the right procedure depends on your data and the nature of the relationships, as these posts explain.
- Choosing the Correct Type of Regression Analysis: Reviews different regression methods by focusing on data types.
- How to Choose Between Linear and Nonlinear Regression: Determining which one to use by assessing the statistical output.
- The Difference between Linear and Nonlinear Models: Both kinds of models can fit curves, so whatโs the difference?
Tutorial: Specifying the Regression Model
Selecting the right type of regression analysis is just the start of the process. Next, you need to specify the model. Model specification is the process of determining which independent variables belong in the model and whether modeling curvature and interaction effects are appropriate.
Model specification is an iterative process. The interpretation and assumption confirmation sections of this tutorial explain how to assess your model and how to change the model based on the statistical output and graphs.
- Model Specification: Choosing the Correct Regression Model: I review standard statistical approaches, difficulties you may face, and offer some real-world advice.
- Using Data Mining to Select Your Regression Model Can Create Problems: This approach to choosing a model can produce misleading results. Learn how to detect and avoid this problem.
- Guide to Stepwise Regression and Best Subsets Regression: Two common tools for identifying candidate variables during the investigative stages of model building.
- Overfitting Regression Models: Overly complicated models can produce misleading R-squared values, regression coefficients, and p-values. Learn how to detect and avoid this problem.
- Curve Fitting Using Linear and Nonlinear Regression: When your data donโt follow a straight line, the model must fit the curvature. This post covers various methods for fitting curves.
- Understanding Interaction Effects: When the effect of one variable depends on the value of another variable, you need to include an interaction effect in your model otherwise the results will be misleading.
- When Do You Need to Standardize the Variables?: In specific situations, standardizing the independent variables can uncover statistically significant results.
- Confounding Variables and Omitted Variable Bias: The variables that you leave out of the model can bias the variables that you include.
- Proxy Variables: The Good Twin of Confounding Variables: Find ways to incorporate valuable information in your models and avoid confounders.
Tutorial: Interpreting Regression Results
After choosing the type of regression and specifying the model, you need to interpret the results. The next set of posts explain how to interpret the results for various regression analysis statistics:
- Coefficients and p-values
- Constant (Y-intercept)
- Comparing regression slopes and constants with hypothesis tests
- R-squared and the goodness-of-fit
- How high does R-squared need be?
- Interpreting a model with a low R-squared
- Adjusted R-squared and Predicted R-squared
- Standard error of the regression (S) vs. R-squared
- Five Reasons Your R-squared can be Too High: A high R-squared can occasionally signify a problem with your model.
- F-test of overall significance
- Identifying the Most Important Independent Variables: After settling on a model, analysts frequently ask, โWhich variable is most important?โ
Tutorial: Using Regression to Make Predictions
Analysts often use regression analysis to make predictions. In this section of the regression tutorial, learn how to make predictions and assess their precision.
- Making Predictions with Regression Analysis: This guide uses BMI to predict body fat percentage.
- Predicted R-squared: This statistic evaluates how well a model predicts the dependent variable for new observations.
- Understand Prediction Precision to Avoid Costly Mistakes: Research shows that presentation affects the number of interpretation mistakes. Covers prediction intervals.
- Prediction intervals versus other intervals: Prediction intervals indicate the precision of the predictions. I compare prediction intervals to different types of intervals.
Tutorial: Checking Regression Assumptions and Fixing Problems
Like other statistical procedures, regression analysis has assumptions that you need to meet, or the results can be unreliable. In regression, you primarily verify the assumptions by assessing the residual plots. The posts below explain how to do this and present some methods for fixing problems.
- The Seven Classical Assumptions of OLS Linear Regression
- Residual plots: Shows what the graphs should look like and why they might not!
- Heteroscedasticity: The residuals should have a constant scatter (homoscedasticity). Shows how to detect this problem and various methods of fixing it.
- Multicollinearity: Highly correlated independent variables can be problematic, but not always! Explains how to identify this problem and several ways of resolving it.
Examples of Different Types of Regression Analyses
The last part of the regression tutorial contains regression analysis examples. Some of the examples are included in previous tutorial sections. Most of these regression examples include the datasets so you can try it yourself! Also, try using Excel to perform regression analysis with a step-by-step example!
- Linear regression with a double-log transformation: Models the relationship between mammal mass and metabolic rate using a fitted line plot.
- Understanding Historiansโ Rankings of U.S. Presidents using Regression Models: Models rankings of U.S. Presidents to various predictors.
- Modeling the relationship between BMI and Body Fat Percentage with linear regression.
- Curve fitting with linear and nonlinear regression.
If you’re learning regression and like the approach I use in my blog, check out my eBook!
Din says
Hi Jim
I am looking for some scientific papers to do the validation among the goodness of fit parameters in terms of atmospheric conditions using polynomial model. Even the benchmark would be helpful as well to determine the range of parameter’s threshold values.
Thanks
Mary K. Reinders says
Hello Jim, Finding your site super-helpful. Wondering if you can provide some examples to simply illustrate multiple regression output (from spss). I would like to illustrate the overall effects of the independent variables on the dependent without creating a histogram. Ideally something that shows the strength, direction and significance in a box plot, line graph, bubble chart or other smart graphic. So appreciate your guidance.
Anthony Orzechowski says
Hi Jim, I just bought all 3 of your books via your Website Store that took me to Amazon – $66.72 plus tax. I don’t see how to get the PDF versions though without an additional cost to buy them additionally. Can you help me with how to get access to the PDF’s? Also, I am reviewing these to see if I want to add them to the courses I teach in Data Analytics. Do you have academic pricing available for students? Both hardcopy and e-copy?
Jim Frost says
Hi Anthony,
Look for an email from me.
Edited to add: I just sent an email to the address you used for the comment, but it bounced back saying it was “Blocked.” Please provide a method for me to contact you. You can use the contact form on my website and provide a different email address. Thanks!
gebretsadikshibre says
Dear Dr, how are you?
Thank you very much for your contribution.
I have one question, which might not be related to this post
in My cross tabulation between two categorical variables, the cell of one of the variables have just 8 observations for the ” no” , and 33 observations for the “yes” of the of the second variable. Can I continue with is for the descriptive statistics or should collapse the categories to increase the sample size? Do I use the new variable with a fewer categories in my regression analysis?
your help is much appreciated
David says
Thanks for teaching us about Stats intuitively.
Is your book Regression Analysis available in PDF format?
I’m a student learning Stats and would like it only in PDF format (no Kindle)
Thanks.
Jim Frost says
Hi David,
Yes, if you buy it through My Website Store, you’ll get it in PDF format.
RABIA NOUSHEEN says
Hi Jim
Thnak you for your valuable advice. Change is in the way I am putting data into the software. When I put average data, glm output shows ingestion has no significant effect on mortality. When I input data with replications, glm out shows significant effect of ingestion on mortality. My standard deviations are large but data shows homogscedacity and normal distribution.
Your comments wil really be helpful in this regard.
Jim Frost says
Hi Rabia,
If you have replications, I’d enter that data to maintain the separate data points and NOT use the average. That provides the model with more information!
Rabia says
Hi Jim
I have a question about Generaliลพed linear model. I am getting different outputs of same response variable when I apply glm using 1) data with replications 2) using average data. Mortality is My response variable and no. Of particles ingested is My predictor variable, other two predictors are categorical.
Looking forward to your advice.
Jim Frost says
Hi Rabia,
I’m not sure what you’re changing in your analysis to get the different outputs?
Replications are good because they help the model estimate pure error.
Average data is okay too but just be aware of incorporating that into the interpretations.
Reza M. Tamanani says
Hi Jim,
I know that we can use linear or nonlinear models to fit a line to a dataset with curvature. My question is that when we have too many independent variable, how could we understand if there is a curvature?
Do you think we should start with simple linear regression, then model polynomial, Reciprocal, log, and nonlinear regression and compare the result for all of them to find which model works the best?
Thanks a lot for your very good and easy to understand book.
Reza
Jen says
Jim,
I wonder if you have any recommended articles as to how to interpret the actual p-values and confidence interval for multiple regressions? I am struggling to find examples/templates of reporting these results.
I truly appreciate your help.
Jen
Jim Frost says
Hi Jen,
I’ve written a post about interpreting p-value for regression coefficients, which I think would be helpful.
For confidence intervals of regression coefficients, think about sample means and CIs for means as a starting point. You can use the mean as the sample estimate of the population mean. However, because we’re working with a sample, we know there is a margin of error around that estimate. The CI captures that margin of error. If you have a CI for a sample mean, you know that the true population parameter is likely to be in that range.
In the regression context, the coefficient is also a mean. It’s a mean effect or the mean change in the dependent variable given a one-unit change in the independent variable. However, because we’re working with a sample, we know there is a margin of error around that mean effect. Consequently, with a CI for a regression coefficient, we know that the true mean effect of that coefficient is likely to fall within that coefficient CI.
I hope that helps!
nasr says
Hi Jim
first of all, thanks for all your great work.
I’m setting linear regression analysis, in which the standard coefficient is considered, but the problem is my dependent variable that is Energy usage intensity so it means the lower value is the better than a higher value. correct me if I’m wrong, I think SPSS evaluates high value as the best and lower one as the worst so in my case, it could lead to effect reversely on the result (standard coefficient beta).
is it right? and what is your suggestion?
raeda says
hello Jim
i want help with an econometric model or equation that can be used if there is one independent variable (dam) and dependent variable(5 livelihood outcomes). here i am confused if i can use binary regression model considering the 5 outcomes as a indicators of the dependent variable which is livelihood outcomes or i have to consider the 5 livelihood outcomes as 5 dependent variables and use multivariate regression.please reply a.a.p
thank you so much
Jim Frost says
Hi Raeda,
It really depends on the nature of the variables. I don’t know what you’re assessing, but here are two possibilities.
You label the independent variable, which I’m assuming is continuous but I don’t know for sure, and the 5 indicator/binary outcomes. This is appropriate if you think the IV affects, or at least predicts, those five indicators. Use this aproach if the goal of your analysis is to use the IV to predict the probability of those binary outcomes. Use binary logistic regression. You’ll need to run five different models. In each model, one of the binary outcomes/indicators is your DV and you’d use the same IV for each model. This type of model allows you to use the value of the IV to predict the probability of the binary outcome.
However, if you instead want to use the binary indicators to predict the continuous variable, you’d need to use multiple regression. The continuous variable is your DV and the five indicators are your IVs. This type of model allows you to use the values of the five indicators to predict the mean value of the continuous variable.
Which approach you take is a mix of theory and want your study needs to learn.
I hope that helps!
Behnaz says
Hi Jim,
Is this normal that the signs in “Regression equation in uncoded units” are sometimes different from the signs in the “Coded coefficients table”? In my regression results, for some terms, while the sign of a coefficient is negative in “Coded coefficients table”, it is positive in the regression equation. I am a little confused here. I thought the signs should be the same.
Thanks,
Behnaz
Jim Frost says
Hi Behnaz,
There is nothing unusual about the coded and uncoded coefficients having different signs. Suppose a coded coefficient has a negative sign but the uncoded coefficient has a positive sign. Your software using one of several processes that translates the raw data (uncoded) into coded values that help the model estimate process. Sometimes that conversion process causes data values to switch signs.
Jonathan Hedvat says
Hello Jim,
I am lookign to do a Rsquared line for a multiple regression series. i’m not so confident that the 3rd,4th,5th number in the correlations will help make a better line.
i’m basically looking at data to predict stock prices (getting a better R2)
so for example Enterprise Value/Sales to growth rate has a high R2 of like .48
but we know for sure that Free cash flow/revenue to percent down from 52 week high is like .299
i have no clue how to get this to work in a 3d chart or to make a formula and find the new r2.
any help would be great.
i dont have excel..not a programmer..just have some google sheets experience.
Jim Frost says
Hi Jonathan,
I’m not 100% sure what you mean by an R-squared line? Or, by the 3rd, 4th, 5th, number in the correlations? Are you fitting several models where each one has just one independent variable?
It sounds to me like you’ll to learn more about multiple regression. Fortunately, I’ve written an ebook about it that will take you from a novice to being able to perform multiple regression effectively. Learn about my intuitive guide to regression ebook.
It also sounds like you’ll need to obtain some statistical software! I’m not sure what statistics if any you can perform in Google Sheets.
Kevin says
Forgive me if these questions have obvious answers but I could not find the answers yet. Still reading and learning. Is 3 the minimum number of samples needed to calculate regression? Why? I’m guessing the equations used require at least 3 sets of X,Y data to calculate a regression but I do not see a good explanation of why. I’m am not wondering about how many sets make the strongest fit. And with only two sets we would get a straight line and no chance of curvature……
I am working on a stability analysis report. For some lots we only have two time points. Zero and Three months. The software will not calculate the regression. Obviously, it needs three time points…..but why? For example: standard error cannot be calculated with only two results and therefore the rest of the equations will not work…or maybe it is related to degrees of freedom? (in the meantime what I will do is run through the equations by hand. the problem is I’m so heavily relying on the software. in other words being lazy. at least I’m questioning though. i’ve been told not to worry about it and just submit the report with “regression requires three data points”.)
Thank you.
Taibet says
Hello, Pls how can I construct a model on carbon pricing? Thanks in anticipation of a timely response.
Jim Frost says
Hi Taibet,
The first step is to do a lot of research to see what others have done. That’ll get you started in the right direction. It’ll help you identify the data you’ll need to collect, variables to include in the model, and the type and form of model that is likely to fit your data. I’ve also written a post about choosing the correct regression model that you should read. That post describes the model fitting process and how to determine which model is the best.
Best of luck with your analysis!
Jane says
Hi Jim: I recently read your book Regression Analysis and found it very helpful. It covered a lot of material but I continue to have some questions about basic workflow when conducting regression analysis in social science research. For someone who wants to create an explanatory multiple regression model(s) as part of an observational study in anthropology, what are the basic chronological steps one should follow to analyze the data (eg: choose model type based on type of data collected; create scatterplots between Y and X’s; calculate correlation coefficients; specify model . . .)? I am looking for the basic steps to follow in the order that they should be completed. Once a researcher has identified a research question and collected and stored data in a dataset, what should the step-by-step work flow look like for a regression / model building analysis? Having a basic chronology of steps will help me better organize (and use) the material in your book. Thanks!
Jim Frost says
Hi Jane,
First, thanks so much for buying me ebook. I’m so happy to hear that it was helpful. You ask a great question. And, it my next book I tackle the actual process of performing statistical studies that use the scientific method. For now, I can point you towards a blog post that covers this topic: Five Steps for Conducting Studies with Statistical Analyses
And, because you’re talking about an observation study, I recommend my post about observational studies. It talks about how they’re helpful, what to watch out for, and some tips. Also be sure to read about confounding variables in regression analysis, which starts on page 158 in the book.
Additionally, starting on p. 150 in the ebook, I talk about how to determine which variables to include in the model.
Throughout all of those posts and the ebook, you’ll notice a common theme. That you need to do a lot of advance research to figure out what you need to measure and how to measure it. Also important to ensure that you don’t accidently not measure a variable and have omitted variable bias affect your results. That’s where all the literature research will be helpful.
Now, in terms of analyzing the data, it’s hard to come up with one general approach. Hopefully, the literature review will tell you what has worked and hasn’t worked for similar studies. For example, maybe you’ll see that similar studies use OLS but need to use a log transformation. It also strongly depends on the nature of your data. The type of dependent variable(s) plays a huge role in what type of model you should use. See page 315 for more about that. It’s really a mix of what type of data you have (particularly the DVs) and what has worked/not worked for similar studies.
Sometimes, even armed with all that advanced knowledge, you’ll go to fit the model with what seems to be the best choice, and it just doesn’t fit your data. Then, you need to go back to the drawing board and try something else. It’s definitely an iterative process. But, looking at what similar studies have done and understanding your data can give you a better chance of starting out with the right type of model. And, then use the tips starting on page 150 to see about the actual process of specifying the model, which is also an iterative process. You might well start out with the correct type of model, but have to go through several iterations to settle on the best form of it.
So:
Best of luck with your study!
svend ulstein says
Thank you! Much appreciated!!
Jim Frost says
You’re very welcome, Svend! Because you’re study uses regression, you might consider buying my ebook about regression. I cover a lot more in it than I do on the blog.
svend ulstein says
Hi Jim! Did you notice my question from 28. May…??
Svend
Jim Frost says
Hi Svend, Sorry about the delay in replying. Sometimes life gets busy! I will reply to your previous comment right now.
Aisha says
Thank you so much for such timely responses! They helped clarify a lot of things for me ๐
Svend says
Thank you for a very informative blog! I have a question regarding “overfitting” of a multivariable regression analysis that I have performed; 368 patients (ACL-reconstructed + concomitant cartilage lesions) with 5-year FU after ACL-reconstruction. The dependent variable was continuous (PROM). I have included 14 independent variables (sex/age/time from surgery etc, all of which formerely shown to be clinically important for the outcome) including two different types of surgery for the concomitant cartilage injury. No surgery to the concomitant lesions was used as reference (n=203), debridement (n=70), and Microfracture (n=95). My main objective was to investigate the effect on PROMs of those 2 treatments. My initial understanding was that it was OK to include that many independent variables as long as there were 368 patients included/PROMs at FU. But I have had comments that as long as the number of patients for some of the independent variables, ex. (debridement and microfracture) is lower than the model as a whole, the number of independent variables should be based on the variable with least observations…?
I guess my question is: does the lowest number of observations for an independent variable dictate the size of the model/how many predictors you can use..? -And also the power..?
Thanks!
Jim Frost says
Hi Svend,
I’m not sure if you’ve read my post about overfitting. If you haven’t, you should read it. It’ll answer some of your questions.
For your specific case, in general, yes, I think you have enough observations. In my blog post, I’m talking mainly about continuous variables. However, if I’m understanding correctly, you’re referring to a categorical variable for reference/debridement? If so, the rules are a bit different but I still think you’re good.
Regression and ANOVA are really the same analysis. So, you can thinking of your analysis as an ANOVA where you’re comparing groups in your data. And, it’s true that groups with smaller numbers will produce less precise estimates than groups with larger numbers. And, you generally require more observations for categorical variables than you do for continuous variables. However, it appears like your smallest group has an n=70 and that’s a very good sample size. In ANOVA, having more than 15-20 observations per group is usually good from a assumptions point of view (might not be produce sufficient statistical power depending on the effect size). So, you’re way over that. If some of your groups had very few observation, you might have need to worry about the estimates for that variable–but that’s not the case.
And, given your number of observations (368) and number of model terms requiring estimates overall (14), I don’t see any obvious reason to worry about overfitting on that basis either. Just be sure that you’re counting interaction terms and polynomials in the number model terms. Additionally, a categorical variable can use more degrees of freedom than a single continuous variable.
In short, I don’t see any reason for concern about overfitting given what you have written. Power depends on the effect size, which I don’t know. However, based on the number of observations/terms in model, I again don’t see an obvious problem.
I hope this helps! Best of luck with your analysis!
Aisha says
Also, another query. I want to run a multiple regression but my demographics and one of my IVs weren’t significant in the initial correlation I ran. What variables should I put in my regression test now? Should I skip all those that weren’t significant? Or just the demographics? I have read that if you have literature backing up the relationship, you can run a regression analysis regardless of how it appeared in your preliminary analysis. How true it that? What would be the best approach in this case?
would mean a lot if you help me out on this one
Jim Frost says
Hi again Aisha,
Two different answers for you. One, be wary of the correlation results. The problem is, again, the potential for confounding variables. Correlation doesn’t factor in other variables. Confounding variables can mess up the correlation results just like it can bias a regression model as I explained in my other comment. You have reason to believe that some of your demographic variables won’t be significant until you add your main IVs. So, you should try that to see what happens. Read the post about confounding variables and keep that in mind as you work through this!
And, yes, if you have strong theory or evidence from other studies for including IVs in the model, it’s ok to include them in your model even if it’s not significant. Just explain that in the write up.
For more about that, and model building in general, read my post about specifying the correct model!
Aisha says
Hi! I can’t believe I didn’t find this blog earlier, would have saved me a lot of trouble for my research ๐
Anyway, I have a question. Is it possible for your demographic variables to become significant predictors in the final model of a Hierarchical regression? I cant seem to understand why it is the case with mine when they came out to be non significant in the first model (even in the correlation test when tested earlier) but became significant when I put them with the rest of my (main) IVs.
Are there practical reasons for that or is it poor statistical skills? :-/
Jim Frost says
Hi Aisha,
Thanks for writing with a fantastic question. It really touches on a number of different issues.
Statistics is a funny field. There’s the field of statistics, but then many scientists/researchers in different fields use statistics within their own fields. And, I’ve observed in different fields that there are different terminology and practices for statistical procedures. Often I’ll hear a term for a statistical procedure and at first I won’t know what it is. But, then the person will describe it to me and I’ll know it by another name.
At one point, hierarchical regression was like this for me. I’ve never used it myself, but it appears to be common in social sciences research. The idea is you add variables to model in several groups, such as the demographic variables in one group, and then some other variables in the next group. There’s usually a logic behind the grouping. The idea is to see how much the model improves with the addition of each group.
I have some issues with this practice, and I think your case illustrates them. The idea behind this method is that each model in the process isn’t as good as the subsequent model, but it’s still a valid comparison. Unfortunately, if you look at a model knowing that you’re leaving out significant predictors, there’s a chance that the model with fewer IVs is biased. This problem occurs more frequently with observational studies, which I believe are more common in the social sciences. It’s the problem of confounding variables. And, what you describe is consistent with there being confounding variables that are not in the model with demographic variables until you add the main IVs. For more details, read my post about how confounding variables that are not in the model can bias your results.
Chances are that some of your main IVs are correlated with one more demographic variables and the DV. That condition will bias coefficients in your demographic IV model because that model excludes the confounding variables.
So, that’s the likely practical reason for what you’re observing. Not poor statistical skills! And, I’m not a fan of hierarchical regression for that reason. Perhaps there’s value to it that I’m not understanding. I’ve never used it in practice. But there doesn’t seem to be much to gain by assessing that first (in your case) demographic IV model when it appears to be excluding confounding variables and is, consequently, biased!
However, I know that methodology is common in some fields, so it’s probably best to roll with it! ๐ But, that’s what I think is happening.
IBRAHIM BUMADIAN says
Hello Jim
I need your help please.
I have this eq: Can you perform a multiple regression with two independent variables but one of them constant ? for example I have this data
Angle (Theta) Length ratio (%) Force (kn)
0 1 52.1
0.174444444 1 52.9
0.261666667 1 53.3
0.348888889 1 55.5
0.436111111 1 58.1
Jim Frost says
Hi Ibrahim,
Thanks for writing with the good question!
The heart of regression analysis is determining how changes in an independent variable correlates with changes in the dependent variable. However, if an independent variable does not change (i.e., it is constant), there is no way for the analysis to determine how changes in it correlate to changes in the DV. It’s just not possible. So, to answer your question, you can’t perform regression with a constant variable.
I hope this helps!
Angie Manfredo-Thomas says
Thank you very much for this awesome site!
Harri says
Hello sir, i need to know about regression and anova could you help me please.
Jim Frost says
Hi Hari,
You’re in the right spot! Read through my blog posts and you’ll learn about these topics. Additionally, within a couple of weeks, I’ll be releasing an ebook that’s all about learning regression!
SM says
Very nice tutorial. I’m reading them all! Are there any articles explaining how the regression model gets trained? Something about gradient descent?
Mani says
Thanks alot for your precious time sir
Jim Frost says
You’re very welcome! ๐
Mani says
Hey sir,hope you will be fine.It is really wonderful platform to learn regression.
Sir i have some problem as I’m using cross sectional data and dependent variable is continuous.Its basically MICS data and I’m using OLS but the problem is that there are some missing observation in some variables.So the sample size is not equal across all the variables.So its make sense in OLS?
Jim Frost says
Hi Mani,
In the normal course of events, yes, when an observation has a missing value in one of the variables, OLS will exclude the entire observation when it fits the model. If observations with missing values are a small portion of your dataset, it’s probably not a problem. You do have to be aware of whether certain types of respondents are more likely to have missing values because that can skew your results. You want the missing values to occur randomly through the observations rather than systematically occurring more frequently in particular types of observations. But, again, if the vast majority of your observations don’t have missing values, OLS can still be a good choice.
Assuming that OLS make sense for your data, one difficulty with missing values is that there really is no alternative analysis that you can use to handle them. If OLS is appropriate for your data, you’re pretty much stuck with it even if you have problematic missing values. However, there are methods of estimating the missing values so you can use those observations. This process is particularly helpful if the missing values don’t occur randomly (as I describe above). I don’t know which software you are using, but SPSS has a particularly good method for imputing missing values. If you think missing values are a problem for your dataset, you should investigate ways to estimate those missing values, and then use OLS.
Best of luck with your analysis!
Antonio Padua says
Hi Jim, I was quite excited to see you post this, but then there was no following article, only related subjects.
Binary logistic regression
By Jim Frost
Binary logistic regression models the relationship between a set of predictors and a binary response variable. A binary response has only two possible values, such as win and lose. Use a binary regression model to understand how changes in the predictor values are associated with changes in the probability of an event occurring.
Is the lesson on binary logistic regression to follow, or what am I missing?
Thank you for your time.
Antonio Padua
Jim Frost says
Hi Antonio,
That’s a glossary term. On my blog, glossary terms have a special link. If you hover the pointer over the link, you’ll see a tooltip that displays the glossary term. Or, if you click the link, you go to the glossary term itself. You can also find all the glossary terms by clicking Glossary in the menu across the top of the screen. It seems like you probably clicked the link to get to the glossary term for binary logistic regression.
I’ve had several requests for articles about this topic. So, I’m putting it on my to-do list! Although, it probably won’t be for a number of months. In the mean time, you can read my post where I show an example of binary logistic regression.
Thanks for writing!
Hanna Kerstin says
Hi Jim,
Thanks so much, your blog is really helpful! I was wondering whether you have some suggestions on published articles that use OLS (nothing fancy, just very plain OLS) and that could be used in class for learning interpreting regression outputs. I’d love to use “real” work and make students see that what they learn is relevant in academia. I mostly find work that is too complicated for someone just starting to learn regression techniques, so any advice would be appreciated!
Thanks,
Hanna
Tran Trong Phong says
Hi Jim. Did you write on Instrumental variable and 2 SLS method? I am interested in them. Thanks so all excellent things you did on this site.
Jim Frost says
Hi,
I haven’t yet, but those might be good topics for the future!
David says
Jim. Thank you so much. Especially for such a prompt response! The slopes are coming from IT segment stock valuations over 150 years. The slopes are derived from valuation troughs and peaks. So it is a graph like you’d see for the S&P. Sorry I was not clear on this.
David says
Jim, could you recommend a model based on the following:
1. I can see a strong visual correlation between the left side trough and peak and the right side. When the left has a steep vector so does the left, for example.
2. This does not need to be the case, the left could provide a much steeper slope compared to right or a much more narrow slope.
3. The parallels intrigue me and I would like to measure if the left slope can be explained by the right to any degree.
4. I am measuring the rise and fall of industry valuations over time. (it is the rise and fall in these valuations over time that create these ~ parallel slopes.
5. My data set since 1886 only provides 6 events, but they are consistent as described.
6. I attempted correlate rising slope against declining.
Jim Frost says
Hi David,
I’m having time figuring out what you’re describing. I’m not sure what slopes you’re referring and I don’t know what you mean by the left versus right slopes?
If you only have 6 data points, you’ll only be able to fit an extremely simple model. You’ll usually need at least 10 data points (absolute minimum but probably more) to even include one independent variable.
If you have two slopes for something and you want to see if one slope explains the other, you could try using linear regression. Use one slope as an independent variable and another as a dependent variable. Slopes would be a continuous variable and so that might work. The underlying data for each slope would have to be independent from data used for other slopes. And, you’ll have to worry about time order effects such as autocorrelation.
Raju says
Thank you Jim.
Raju Pavithran says
Hi Jim,
I have a doubt regarding which regression analysis is to be conducted. The data set consists of categorical independent variables (ordinal) and one dependent variable which is of continuous type. Moreover, most of the data pertaining to an independent variable is concentrated towards first category (70%). My objective is to capture the factors influencing the dependent variable and its significance. In that case should I consider the ind. variables to be continuous or as categorical? Thanks in advance.
Raju.
Jim Frost says
Hi Raju,
I think I already answered your question on this. Although, it looks like you’re now saying that you have an ordinal independent variable rather than a categorical variable. Ordinal data can be difficult. I’d still try using linear regression to fit the data.
You have two options that you can try.
1) You can include the ordinal data as continuous data. Doing this assumes that going from 1 to 2 is the same scale change as going from 2 to 3 and so on. Just like with actual continuous data. Although, you can add polynomials and transformations to improve the fit.
2) However, that doesn’t always work. Sometimes ordinal data don’t behave like continuous data. For example, the 2nd place finisher in a race doesn’t necessarily take twice as long as the 1st place finisher. And the difference between 3rd and 2nd isn’t the same as between 1st and 2nd. Etc. In that case, you can include it as a categorical variable. Using this approach, you estimate the mean differences between the different ordinal levels and you don’t have to assume they’ll be the same.
There’s an important caveat about including them as categorical variables. When you include categorical variables, you’re actually using indicator variables. A 5 point Likert scale (ordinal) actually includes 4 indicator variables. If you have many Likert variables, you’re actually including 4 variables for each one. That can be problematic. If you add enough of these variables, it can lead to overfitting. Depending on your software, you might not even see these indicator variables because they code and include them behind the scenes. It’s something to be aware of. If you have many such variables, it’s preferable to include them as continuous variables if possible.
You’ll have to think about whether your data seems more like continuous or categorical data. And, try both methods if you’re not sure. Check the residuals to make sure the model provides a good fit.
Ordinal data can be tricky because they’re not really continuous data nor categorical data–a bit of both! So, you’ll have to experiment and assess how well the different approaches work.
Good luck with your analysis!
Raju says
Hello Jim,
I have a set of data consisting of dependent variable which is of continuous type and independent variables which are of categorical type. The interesting thing which I found is that majority (more than 70%)of the independent variables belong to the category 1. The category values range from scale 1 to 5. I would like to know the appropriate sampling technique to be used. Is it appropriate to use linear regression or should I use other alternatives? Or any preprocessing of data is required? Please help me with the above.
Thanks in advance
Raju.
Jim Frost says
Hi Raju,
I’d try linear regression first. You can include that categorical variable as the independent variable with no problem. As always, be sure to check the residual plots. You can also use one-way ANOVA, which would be the more usual choice for this type of analysis. But, linear regression and ANOVA are really the same analysis “under the hood.” So, you can go either way.
I hope this helps!
sarkhani says
Hello Jim
Iโd like to
Know what your suggestions are with regards to choice of regression for predicting:
dependent variable is count data but it does not follow a poisson distribution
independent variables include categorical and continuous data
Iโd appreciate your thoughts on it โฆ.
thanks!
Jim Frost says
Hi Sarkhani,
Having count data that don’t follow the Poisson happens fairly often. The top alternatives that I’m aware of are negative binomial regression and zero inflated models. I talk about those options a bit in my post about choosing the correct type of regression analysis. The count data section is near the end. I hope this information points you in the right direction!
mohamadhosein says
Hi jim
i’m really happy to find your blog
Arnab Paul says
Independent variables range from 0 to 1 and corresponding dependent variables range from 1 to 5 . If we apply regression analysis to above and predict the value of y for any value of x that also ranges from 0 to 1, whether the value of y will always lie in the range 1 to 5?
Jim Frost says
In my experience, the predicted values will fall outside the range of the actual dependent variable. Assuming that you are referring to actual limits at 1 and 5, the regression analysis does not “understand” that those are hard limits. The extent that the predicted values fall outside these limits depends on the amount of error in the model.
RAJKUMAR R says
Very Good Explanation about regression ….Thank you sir for such a wonderful post….
Patrik Silva says
Hi Jim, I would like to see you writing something about Cross Validation (Training and test).
Patrik
Lisa says
thank you Jim this is helpful
Jim Frost says
You’re very welcome, Lisa! I’m glad you found it to be helpful!
Yud says
Hello Jim
Iโd like to
Know what your suggestions are with regards to choice of regression for predicting:
the likelihood of participants falling into
One of two categories (low Fear group codes 1 and high Fear 2 … when looking at scores from several variables ( e.g. external
Other locus of control, external social locus of control , internal locus of control and social phobia and sleep quality )
It was suggested that I break the question up to smaller components … Iโd appreciate your thoughts on it …. thanks!
Jim Frost says
Because you have a binary response (dependent variable), you’ll need to use binary logistic regression. I don’t know what types of predictors you have. If they’re continuous, you can just use them in the model and see how it works.
If they’re ordinal data, such as a Likert scale, you can still try using them as predictors in the model. However, ordinal data are less likely to satisfy all the assumptions. Check the residual plots. If including the ordinal data in the model doesn’t work, you can recode them as indicator variables (1s and 0s only based on whether an observation meets a criteria or not. For example, if you have a scale of -2, -1. 0, 1, 2 you could recode it so observations with a positive score get a 1 while all other scores get a 0.
Those are some ideas to try. Of course, what works best for your case depends on the subject area and types of data that you have.
I hope this helps!
Md zishan hussain says
Hello Jim,
I am using Step-wise regression to select significant variables in the model for prediction.how to interpret BIC in variable selection?
regards,
Zishan
Jim Frost says
Hi, when comparing candidate models, you look for models with a lower BIC. A lower BIC indicates that a model is more likely to be the true model. BIC identifies the model that is more likely to have generated the observed data.
Aftab Siddiqui says
yes.the language of the topic is very easy , i would appreciate you sir ,if you let me know that ,If rank
correlation is r =0.8,sum of “D”square=33.how we will calculate /find no. observations (n).
Jim Frost says
I’m not sure what you mean by “D” square, but I believe you’ll need more information for that.
Dina says
Hi, Jim!
I’m really happy to find your blog. It’s really helping, especially that you use basic English so non-native speaker can understand it better than reading most textbooks. Thanks!
Jim Frost says
Hi Dina, you’re welcome! And, thanks so much for your kind words–you made my day!
Nivedan says
Hi Jim!
Can you write on Logistic regression please!
Thank you
Jim Frost says
Hi! You bet! I plan to write about it in the near future!
Farmanullah says
great work by great man,, it is easily accessible source to access the scholars,, sir i am going to analyse data plz send me guidlines for selection of best simple linear/ multiple linear regression model, thanks
Jim Frost says
Hi, thank you so much for your kind words. I really appreciate it! I’ve written a blog post that I think is exactly what you need. It’ll help you choose the best regression model.
bwbjlt says
such a splendid compilation, Thanks Jim
Jim Frost says
Thank you!
Tobden says
would you also throw some ideas on Instrumental variable and 2 SLS method please?
Jim Frost says
Those are great ideas! I’ll write about them in future posts.