Regression Tutorial with Analysis Examples

By Jim Frost

Regression analysis mathematically describes the relationship between independent variables and the dependent variable. It also allows you to predict the mean value of the dependent variable when you specify values for the independent variables. In this regression tutorial, I gather together a wide range of posts that I’ve written about regression analysis. My tutorial helps you go through the regression content in a systematic and logical order.

This tutorial covers many facets of regression analysis including selecting the correct type of regression analysis, specifying the best model, interpreting the results, assessing the fit of the model, generating predictions, and checking the assumptions. I close the post with examples of different types of regression analyses.
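
As a minimal sketch of what "describing the relationship" and "predicting the mean" look like in practice, the Python snippet below fits a simple linear regression by ordinary least squares using only the standard library. The data values and the function names (`fit_simple_ols`, `predict_mean`) are made up for illustration and do not come from any of the posts.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products around the means
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict_mean(intercept, slope, x_new):
    """Predicted mean of the dependent variable at x_new."""
    return intercept + slope * x_new


# Hypothetical data: one independent variable (x), one dependent variable (y)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_simple_ols(x, y)
predicted = predict_mean(intercept, slope, 6)

print(f"Fitted equation: y = {intercept:.2f} + {slope:.2f} * x")
print(f"Predicted mean of y at x = 6: {predicted:.2f}")
```

The slope is the mean change in the dependent variable for a one-unit change in the independent variable, and the prediction is a mean value, not a guarantee for any single observation.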

If you’re learning regression analysis, you might want to bookmark this tutorial!

When to Use Regression and the Signs of a High-Quality Analysis

Before we get to the regression tutorials, I’ll cover several overarching issues.

Why use regression at all? What are common problems that trip up analysts? And, how do you differentiate a high-quality regression analysis from a less rigorous study? Read these posts to find out:

When Should I Use Regression Analysis?: Learn what regression can do for you and when you should use it.
Five Regression Tips for a Better Analysis: These tips help ensure that you perform a top-quality regression analysis.

Tutorial: Choosing the Right Type of Regression Analysis

There are many different types of regression analysis. Choosing the right procedure depends on your data and the nature of the relationships, as these posts explain.

Choosing the Correct Type of Regression Analysis: Reviews different regression methods by focusing on data types.
How to Choose Between Linear and Nonlinear Regression: Determining which one to use by assessing the statistical output.
The Difference between Linear and Nonlinear Models: Both kinds of models can fit curves, so what’s the difference?

Tutorial: Specifying the Regression Model

Selecting the right type of regression analysis is just the start of the process. Next, you need to specify the model. Model specification is the process of determining which independent variables belong in the model and whether modeling curvature and interaction effects are appropriate.

Model specification is an iterative process. The interpretation and assumption confirmation sections of this tutorial explain how to assess your model and how to change the model based on the statistical output and graphs.

Model Specification: Choosing the Correct Regression Model: I review standard statistical approaches, difficulties you may face, and offer some real-world advice.
Using Data Mining to Select Your Regression Model Can Create Problems: This approach to choosing a model can produce misleading results. Learn how to detect and avoid this problem.
Guide to Stepwise Regression and Best Subsets Regression: Two common tools for identifying candidate variables during the investigative stages of model building.
Overfitting Regression Models: Overly complicated models can produce misleading R-squared values, regression coefficients, and p-values. Learn how to detect and avoid this problem.
Curve Fitting Using Linear and Nonlinear Regression: When your data don’t follow a straight line, the model must fit the curvature. This post covers various methods for fitting curves.
Understanding Interaction Effects: When the effect of one variable depends on the value of another variable, you need to include an interaction effect in your model; otherwise, the results will be misleading.
When Do You Need to Standardize the Variables?: In specific situations, standardizing the independent variables can uncover statistically significant results.
Confounding Variables and Omitted Variable Bias: The variables that you leave out of the model can bias the variables that you include.
Proxy Variables: The Good Twin of Confounding Variables: Find ways to incorporate valuable information in your models and avoid confounders.
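
To illustrate what an interaction effect means for interpretation, here is a small Python sketch. The coefficients `b0` through `b3` are hypothetical values chosen for the example, not estimates from any dataset. With an interaction term in the model, the effect of `x1` is no longer a single number: it depends on the value of `x2`.

```python
# Hypothetical fitted model with an interaction term:
#   y = b0 + b1*x1 + b2*x2 + b3*(x1*x2)
b0, b1, b2, b3 = 10.0, 2.0, -1.0, 0.5


def predict(x1, x2):
    """Predicted mean of y for the hypothetical model above."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


def effect_of_x1(x2):
    """Change in predicted y for a one-unit increase in x1, holding x2 fixed."""
    return b1 + b3 * x2


for x2 in (0, 5, 10):
    print(f"x2 = {x2:>2}: a one-unit increase in x1 changes y by {effect_of_x1(x2):.1f}")
```

If the model omitted the `x1*x2` term, it would report one averaged effect for `x1`, which is the kind of misleading result the interaction post warns about.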

Tutorial: Interpreting Regression Results

After choosing the type of regression and specifying the model, you need to interpret the results. The next set of posts explains how to interpret the results for various regression analysis statistics:

Coefficients and p-values
Constant (Y-intercept)
Comparing regression slopes and constants with hypothesis tests
R-squared and the goodness-of-fit
- How high does R-squared need to be?
- Interpreting a model with a low R-squared
- Adjusted R-squared and Predicted R-squared
- Standard error of the regression (S) vs. R-squared
- Five Reasons Your R-squared can be Too High: A high R-squared can occasionally signify a problem with your model.
F-test of overall significance
Identifying the Most Important Independent Variables: After settling on a model, analysts frequently ask, “Which variable is most important?”
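
For readers who want to see how the goodness-of-fit statistics above relate to each other, the following Python sketch computes R-squared, adjusted R-squared, and the standard error of the regression (S) from observed and fitted values. The data and the helper name `goodness_of_fit` are assumptions made for this example; the fitted values come from a hypothetical one-predictor model.

```python
import math


def goodness_of_fit(y, fitted, n_predictors):
    """Return (R-squared, adjusted R-squared, S) for a fitted regression model."""
    n = len(y)
    y_bar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained variation
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
    df_error = n - n_predictors - 1                         # error degrees of freedom
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / df_error
    s = math.sqrt(sse / df_error)
    return r2, adj_r2, s


# Hypothetical observed values and fitted values from a one-predictor model
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fitted = [2.04, 4.03, 6.02, 8.01, 10.00]

r2, adj_r2, s = goodness_of_fit(y, fitted, n_predictors=1)
print(f"R-squared:          {r2:.4f}")
print(f"Adjusted R-squared: {adj_r2:.4f}")
print(f"S (in units of y):  {s:.4f}")
```

Adjusted R-squared is always lower than R-squared because it penalizes the number of predictors, and S is expressed in the units of the dependent variable, which makes it easier to judge practical precision.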

Tutorial: Using Regression to Make Predictions

Analysts often use regression analysis to make predictions. In this section of the regression tutorial, learn how to make predictions and assess their precision.

Making Predictions with Regression Analysis: This guide uses BMI to predict body fat percentage.
Predicted R-squared: This statistic evaluates how well a model predicts the dependent variable for new observations.
Understand Prediction Precision to Avoid Costly Mistakes: Research shows that presentation affects the number of interpretation mistakes. Covers prediction intervals.
Prediction intervals versus other intervals: Prediction intervals indicate the precision of the predictions. I compare prediction intervals to different types of intervals.
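
The difference between a confidence interval for the mean response and a prediction interval for a new observation can be sketched in a few lines of Python. The data are made up, and the t critical value is hard-coded for this specific case (95% two-sided, 3 degrees of freedom) because the standard library has no t-distribution; treat both as assumptions of the example.

```python
import math

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Standard error of the regression (S) with n - 2 error degrees of freedom
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

T_CRIT = 3.182  # assumed: two-sided 95% t critical value for 3 degrees of freedom


def intervals(x0):
    """Return (confidence interval for the mean, prediction interval) at x0."""
    fit = intercept + slope * x0
    leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
    ci_half = T_CRIT * s * math.sqrt(leverage)      # uncertainty in the mean only
    pi_half = T_CRIT * s * math.sqrt(1 + leverage)  # adds scatter of a single observation
    return (fit - ci_half, fit + ci_half), (fit - pi_half, fit + pi_half)


ci, pi = intervals(4)
print(f"95% CI for the mean at x = 4: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% prediction interval:      ({pi[0]:.2f}, {pi[1]:.2f})")
```

The prediction interval is always wider than the confidence interval at the same point because it must also cover the random scatter of an individual observation around the mean.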

Tutorial: Checking Regression Assumptions and Fixing Problems

Like other statistical procedures, regression analysis has assumptions that you need to meet, or the results can be unreliable. In regression, you primarily verify the assumptions by assessing the residual plots. The posts below explain how to do this and present some methods for fixing problems.

The Seven Classical Assumptions of OLS Linear Regression
Residual plots: Shows what the graphs should look like and why they might not!
Heteroscedasticity: The residuals should have a constant scatter (homoscedasticity). Shows how to detect this problem and various methods of fixing it.
Multicollinearity: Highly correlated independent variables can be problematic, but not always! Explains how to identify this problem and several ways of resolving it.
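
One common way to quantify multicollinearity is the variance inflation factor (VIF). The sketch below covers only the simplest case, a model with exactly two predictors, where the VIF for each predictor equals 1 / (1 - r²) and r is the correlation between them. The data and the helper name `correlation` are hypothetical; statistical software computes VIFs for larger models by regressing each predictor on all the others.

```python
import math


def correlation(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    s_ab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    s_aa = sum((ai - a_bar) ** 2 for ai in a)
    s_bb = sum((bi - b_bar) ** 2 for bi in b)
    return s_ab / math.sqrt(s_aa * s_bb)


# Hypothetical predictors that tend to move together
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]

r = correlation(x1, x2)
vif = 1 / (1 - r ** 2)  # valid as written only for a two-predictor model

print(f"Correlation between predictors: {r:.3f}")
print(f"VIF for each predictor:         {vif:.2f}")
```

A VIF of 1 means no inflation of the coefficient variance; larger values mean less precise coefficient estimates. Which threshold counts as problematic is a judgment call that depends on the goals of the analysis.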

Examples of Different Types of Regression Analyses

The last part of the regression tutorial contains regression analysis examples. Some of the examples are included in previous tutorial sections. Most of these regression examples include the datasets so you can try them yourself! Also, try using Excel to perform regression analysis with a step-by-step example!

Linear regression with a double-log transformation: Models the relationship between mammal mass and metabolic rate using a fitted line plot.
Understanding Historians’ Rankings of U.S. Presidents using Regression Models: Relates rankings of U.S. Presidents to various predictors.
Modeling the relationship between BMI and Body Fat Percentage with linear regression.
Curve fitting with linear and nonlinear regression.
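
As a rough illustration of the double-log idea in the first example, the Python sketch below takes logs of both variables and fits a straight line, which is equivalent to fitting a power curve y = a * x^b. The data are synthetic, generated from an assumed curve with a = 3 and b = 0.75, so this is not the mammal dataset from the post.

```python
import math

# Synthetic data generated from the assumed curve: rate = 3 * mass ** 0.75
mass = [1, 10, 100, 1000, 10000]
rate = [3 * m ** 0.75 for m in mass]

# Double-log transformation: take logs of both variables
log_x = [math.log10(m) for m in mass]
log_y = [math.log10(r) for r in rate]

# Ordinary least squares on the transformed data
n = len(log_x)
x_bar = sum(log_x) / n
y_bar = sum(log_y) / n
sxx = sum((xi - x_bar) ** 2 for xi in log_x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(log_x, log_y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Back-transform: log10(y) = intercept + slope * log10(x)  =>  y = 10**intercept * x**slope
exponent = slope
multiplier = 10 ** intercept

print(f"Fitted curve: rate = {multiplier:.2f} * mass^{exponent:.2f}")
```

The model is still linear in its parameters after the transformation, which is why a curved relationship like this can be fit with linear regression.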

If you’re learning regression and like the approach I use in my blog, check out my eBook!

Comments

Mary K. Reinders says

April 10, 2024 at 5:29 pm

Hello Jim, Finding your site super-helpful. Wondering if you can provide some examples to simply illustrate multiple regression output (from spss). I would like to illustrate the overall effects of the independent variables on the dependent without creating a histogram. Ideally something that shows the strength, direction and significance in a box plot, line graph, bubble chart or other smart graphic. So appreciate your guidance.

Anthony Orzechowski says

January 2, 2023 at 10:02 am

Hi Jim, I just bought all 3 of your books via your Website Store, which took me to Amazon – $66.72 plus tax. I don’t see how to get the PDF versions, though, without paying extra to buy them separately. Can you help me with how to get access to the PDFs? Also, I am reviewing these to see if I want to add them to the courses I teach in Data Analytics. Do you have academic pricing available for students? Both hardcopy and e-copy?

- Jim Frost says
  
  January 3, 2023 at 12:17 am
  
  Hi Anthony,
  
  Look for an email from me.
  
  Edited to add: I just sent an email to the address you used for the comment, but it bounced back saying it was “Blocked.” Please provide a method for me to contact you. You can use the contact form on my website and provide a different email address. Thanks!
  
gebretsadikshibre says

July 18, 2022 at 7:15 am

Dear Dr, how are you?
Thank you very much for your contribution.
I have one question, which might not be related to this post.
In my cross tabulation between two categorical variables, one category of the first variable has just 8 observations in the “no” cell and 33 observations in the “yes” cell of the second variable. Can I continue with this for the descriptive statistics, or should I collapse the categories to increase the sample size? Do I then use the new variable with fewer categories in my regression analysis?
Your help is much appreciated.

David says

May 5, 2021 at 1:46 pm

Thanks for teaching us about Stats intuitively.

Is your book Regression Analysis available in PDF format?
I’m a student learning Stats and would like it only in PDF format (no Kindle)

Thanks.

- Jim Frost says
  
  May 5, 2021 at 1:49 pm
  
  Hi David,
  
  Yes, if you buy it through My Website Store, you’ll get it in PDF format.
  
RABIA NOUSHEEN says

February 22, 2021 at 4:08 pm

Hi Jim

Thank you for your valuable advice. The change is in the way I am putting data into the software. When I input averaged data, the GLM output shows that ingestion has no significant effect on mortality. When I input data with replications, the GLM output shows a significant effect of ingestion on mortality. My standard deviations are large, but the data show homoscedasticity and a normal distribution.

Your comments will really be helpful in this regard.

- Jim Frost says
  
  February 22, 2021 at 4:11 pm
  
  Hi Rabia,
  
  If you have replications, I’d enter that data to maintain the separate data points and NOT use the average. That provides the model with more information!
  
Rabia says

February 22, 2021 at 6:39 am

Hi Jim

I have a question about generalized linear models. I am getting different outputs for the same response variable when I apply a GLM using 1) data with replications and 2) averaged data. Mortality is my response variable and the number of particles ingested is my predictor variable; the other two predictors are categorical.

Looking forward to your advice.

- Jim Frost says
  
  February 22, 2021 at 3:44 pm
  
  Hi Rabia,
  
  I’m not sure what you’re changing in your analysis to get the different outputs?
  
  Replications are good because they help the model estimate pure error.
  
  Average data is okay too but just be aware of incorporating that into the interpretations.
  
Reza M. Tamanani says

September 7, 2020 at 1:02 pm

Hi Jim,

I know that we can use linear or nonlinear models to fit a line to a dataset with curvature. My question is: when we have many independent variables, how can we tell whether there is curvature?

Do you think we should start with simple linear regression, then try polynomial, reciprocal, log, and nonlinear regression models, and compare the results for all of them to find which model works best?

Thanks a lot for your very good and easy to understand book.

Reza

Jen says

August 6, 2020 at 8:49 pm

Jim,

I wonder if you have any recommended articles as to how to interpret the actual p-values and confidence interval for multiple regressions? I am struggling to find examples/templates of reporting these results.

I truly appreciate your help.

Jen

- Jim Frost says
  
  August 6, 2020 at 10:29 pm
  
  Hi Jen,
  
  I’ve written a post about interpreting p-values for regression coefficients, which I think would be helpful.
  
  For confidence intervals of regression coefficients, think about sample means and CIs for means as a starting point. You can use the mean as the sample estimate of the population mean. However, because we’re working with a sample, we know there is a margin of error around that estimate. The CI captures that margin of error. If you have a CI for a sample mean, you know that the true population parameter is likely to be in that range.
  
  In the regression context, the coefficient is also a mean. It’s a mean effect or the mean change in the dependent variable given a one-unit change in the independent variable. However, because we’re working with a sample, we know there is a margin of error around that mean effect. Consequently, with a CI for a regression coefficient, we know that the true mean effect of that coefficient is likely to fall within that coefficient CI.
  
  I hope that helps!
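
To make the idea of a confidence interval for a regression coefficient concrete, here is a minimal Python sketch with made-up data. The t critical value is hard-coded for this particular case (95% two-sided, 3 degrees of freedom), which is an assumption of the example; statistical software looks it up automatically.

```python
import math

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx                    # sample estimate of the mean effect
intercept = y_bar - slope * x_bar

# Standard error of the slope: S / sqrt(Sxx), with n - 2 error degrees of freedom
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))
slope_se = s / math.sqrt(sxx)

T_CRIT = 3.182  # assumed: two-sided 95% t critical value for 3 degrees of freedom
margin_of_error = T_CRIT * slope_se
lower, upper = slope - margin_of_error, slope + margin_of_error

print(f"Slope estimate: {slope:.3f}")
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```

Because this interval excludes zero, the corresponding coefficient would also be statistically significant at the 0.05 level, which is how the CI and the p-value tie together.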
  
nasr says

June 28, 2020 at 2:57 am

Hi Jim
First of all, thanks for all your great work.
I’m setting up a linear regression analysis in which I consider the standardized coefficients. The problem is that my dependent variable is energy usage intensity, so a lower value is better than a higher value. Correct me if I’m wrong, but I think SPSS treats a high value as the best and a low one as the worst, so in my case this could reverse the result (the standardized coefficient, beta).
Is that right? And what is your suggestion?

raeda says

June 23, 2020 at 3:51 am

Hello Jim,
I want help with an econometric model or equation that can be used when there is one independent variable (dam) and one dependent variable (5 livelihood outcomes). I am confused about whether I can use a binary regression model, treating the 5 outcomes as indicators of the dependent variable (livelihood outcomes), or whether I have to treat the 5 livelihood outcomes as 5 dependent variables and use multivariate regression. Please reply a.a.p.
Thank you so much.

- Jim Frost says
  
  June 28, 2020 at 12:31 am
  
  Hi Raeda,
  
  It really depends on the nature of the variables. I don’t know what you’re assessing, but here are two possibilities.
  
  You have one independent variable, which I’m assuming is continuous but I don’t know for sure, and the 5 indicator/binary outcomes. This setup is appropriate if you think the IV affects, or at least predicts, those five indicators. Use this approach if the goal of your analysis is to use the IV to predict the probability of those binary outcomes. Use binary logistic regression. You’ll need to run five different models. In each model, one of the binary outcomes/indicators is your DV and you’d use the same IV for each model. This type of model allows you to use the value of the IV to predict the probability of the binary outcome.
  
  However, if you instead want to use the binary indicators to predict the continuous variable, you’d need to use multiple regression. The continuous variable is your DV and the five indicators are your IVs. This type of model allows you to use the values of the five indicators to predict the mean value of the continuous variable.
  
  Which approach you take is a mix of theory and what your study needs to learn.
  
  I hope that helps!
  
Behnaz says

June 3, 2020 at 5:17 pm

Hi Jim,

Is this normal that the signs in “Regression equation in uncoded units” are sometimes different from the signs in the “Coded coefficients table”? In my regression results, for some terms, while the sign of a coefficient is negative in “Coded coefficients table”, it is positive in the regression equation. I am a little confused here. I thought the signs should be the same.

Thanks,
Behnaz

- Jim Frost says
  
  June 3, 2020 at 8:07 pm
  
  Hi Behnaz,
  
  There is nothing unusual about the coded and uncoded coefficients having different signs. Suppose a coded coefficient has a negative sign but the uncoded coefficient has a positive sign. Your software uses one of several processes to translate the raw (uncoded) data into coded values that help the model estimation process. Sometimes that conversion process causes data values to switch signs.
  
Jonathan Hedvat says

October 17, 2019 at 5:31 am

Hello Jim,
I am looking to do an R-squared line for a multiple regression series. I’m not so confident that the 3rd, 4th, and 5th numbers in the correlations will help make a better line.
I’m basically looking at data to predict stock prices (getting a better R2).
So, for example, Enterprise Value/Sales to growth rate has a high R2 of about .48,
but we know for sure that Free cash flow/revenue to percent down from 52-week high is about .299.

I have no clue how to get this to work in a 3D chart or how to make a formula and find the new R2.
Any help would be great.

I don’t have Excel. I’m not a programmer; I just have some Google Sheets experience.

- Jim Frost says
  
  October 17, 2019 at 3:42 pm
  
  Hi Jonathan,
  
  I’m not 100% sure what you mean by an R-squared line? Or, by the 3rd, 4th, 5th, number in the correlations? Are you fitting several models where each one has just one independent variable?
  
  It sounds to me like you’ll need to learn more about multiple regression. Fortunately, I’ve written an ebook about it that will take you from a novice to being able to perform multiple regression effectively. Learn about my intuitive guide to regression ebook.
  
  It also sounds like you’ll need to obtain some statistical software! I’m not sure what statistics if any you can perform in Google Sheets.
  
Kevin says

July 18, 2019 at 9:44 am

Forgive me if these questions have obvious answers but I could not find the answers yet. Still reading and learning. Is 3 the minimum number of samples needed to calculate regression? Why? I’m guessing the equations used require at least 3 sets of X,Y data to calculate a regression but I do not see a good explanation of why. I am not wondering about how many sets make the strongest fit. And with only two sets we would get a straight line and no chance of curvature…

I am working on a stability analysis report. For some lots we only have two time points: zero and three months. The software will not calculate the regression. Obviously, it needs three time points… but why? For example, is it that the standard error cannot be calculated with only two results, so the rest of the equations will not work? Or maybe it is related to degrees of freedom? (In the meantime, I will run through the equations by hand. The problem is that I’m relying so heavily on the software; in other words, being lazy. At least I’m questioning, though. I’ve been told not to worry about it and just submit the report with “regression requires three data points.”)

Thank you.

Taibet says

July 12, 2019 at 5:34 pm

Hello, Pls how can I construct a model on carbon pricing? Thanks in anticipation of a timely response.

- Jim Frost says
  
  July 15, 2019 at 11:12 am
  
  Hi Taibet,
  
  The first step is to do a lot of research to see what others have done. That’ll get you started in the right direction. It’ll help you identify the data you’ll need to collect, variables to include in the model, and the type and form of model that is likely to fit your data. I’ve also written a post about choosing the correct regression model that you should read. That post describes the model fitting process and how to determine which model is the best.
  
  Best of luck with your analysis!
  
Jane says

June 25, 2019 at 11:47 am

Hi Jim: I recently read your book Regression Analysis and found it very helpful. It covered a lot of material but I continue to have some questions about basic workflow when conducting regression analysis in social science research. For someone who wants to create an explanatory multiple regression model(s) as part of an observational study in anthropology, what are the basic chronological steps one should follow to analyze the data (eg: choose model type based on type of data collected; create scatterplots between Y and X’s; calculate correlation coefficients; specify model . . .)? I am looking for the basic steps to follow in the order that they should be completed. Once a researcher has identified a research question and collected and stored data in a dataset, what should the step-by-step work flow look like for a regression / model building analysis? Having a basic chronology of steps will help me better organize (and use) the material in your book. Thanks!

- Jim Frost says
  
  June 25, 2019 at 10:03 pm
  
  Hi Jane,
  
  First, thanks so much for buying my ebook. I’m so happy to hear that it was helpful. You ask a great question. And, in my next book I tackle the actual process of performing statistical studies that use the scientific method. For now, I can point you towards a blog post that covers this topic: Five Steps for Conducting Studies with Statistical Analyses
  
  And, because you’re talking about an observational study, I recommend my post about observational studies. It talks about how they’re helpful, what to watch out for, and some tips. Also be sure to read about confounding variables in regression analysis, which starts on page 158 in the book.
  
  Additionally, starting on p. 150 in the ebook, I talk about how to determine which variables to include in the model.
  
  Throughout all of those posts and the ebook, you’ll notice a common theme: you need to do a lot of advance research to figure out what you need to measure and how to measure it. It’s also important to ensure that you don’t accidentally fail to measure a variable and have omitted variable bias affect your results. That’s where all the literature research will be helpful.
  
  Now, in terms of analyzing the data, it’s hard to come up with one general approach. Hopefully, the literature review will tell you what has worked and hasn’t worked for similar studies. For example, maybe you’ll see that similar studies use OLS but need to use a log transformation. It also strongly depends on the nature of your data. The type of dependent variable(s) plays a huge role in what type of model you should use. See page 315 for more about that. It’s really a mix of what type of data you have (particularly the DVs) and what has worked/not worked for similar studies.
  
  Sometimes, even armed with all that advanced knowledge, you’ll go to fit the model with what seems to be the best choice, and it just doesn’t fit your data. Then, you need to go back to the drawing board and try something else. It’s definitely an iterative process. But, looking at what similar studies have done and understanding your data can give you a better chance of starting out with the right type of model. And, then use the tips starting on page 150 to see about the actual process of specifying the model, which is also an iterative process. You might well start out with the correct type of model, but have to go through several iterations to settle on the best form of it.
  
  So:
  
  1. See what other studies have done.
  2. Understand your own data.
  3. Use information from steps 1 and 2 to settle on a good type of model to start with and what variables to include in it.
  4. Try to obtain a good fit using that type of model. This step is an iterative process of fitting models, assessing the fit and significance, and possibly making adjustments.
  5. If you can obtain a good fit in step 4, you’re done after settling on the best form.
  6. If you cannot obtain a good fit in step 4, do more research to find another type of model you can try and go back to step 3.
  
  Best of luck with your study!
  
svend ulstein says

June 5, 2019 at 11:50 am

Thank you! Much appreciated!!

- Jim Frost says
  
  June 5, 2019 at 12:24 pm
  
  You’re very welcome, Svend! Because your study uses regression, you might consider buying my ebook about regression. I cover a lot more in it than I do on the blog.
  
svend ulstein says

June 4, 2019 at 6:17 am

Hi Jim! Did you notice my question from 28. May…??
Svend

- Jim Frost says
  
  June 4, 2019 at 11:06 am
  
  Hi Svend, Sorry about the delay in replying. Sometimes life gets busy! I will reply to your previous comment right now.
  
Aisha says

June 2, 2019 at 5:12 am

Thank you so much for such timely responses! They helped clarify a lot of things for me 🙂

Svend says

May 28, 2019 at 3:53 am

Thank you for a very informative blog! I have a question regarding “overfitting” of a multivariable regression analysis that I have performed: 368 patients (ACL-reconstructed + concomitant cartilage lesions) with 5-year FU after ACL-reconstruction. The dependent variable was continuous (PROM). I have included 14 independent variables (sex/age/time from surgery etc., all of which were formerly shown to be clinically important for the outcome), including two different types of surgery for the concomitant cartilage injury. No surgery to the concomitant lesions was used as the reference (n=203), compared with debridement (n=70) and microfracture (n=95). My main objective was to investigate the effect on PROMs of those 2 treatments. My initial understanding was that it was OK to include that many independent variables as long as there were 368 patients included/PROMs at FU. But I have had comments that as long as the number of patients for some of the independent variables (e.g., debridement and microfracture) is lower than for the model as a whole, the number of independent variables should be based on the variable with the fewest observations…?
I guess my question is: does the lowest number of observations for an independent variable dictate the size of the model/how many predictors you can use..? -And also the power..?
Thanks!

- Jim Frost says
  
  June 4, 2019 at 11:23 am
  
  Hi Svend,
  
  I’m not sure if you’ve read my post about overfitting. If you haven’t, you should read it. It’ll answer some of your questions.
  
  For your specific case, in general, yes, I think you have enough observations. In my blog post, I’m talking mainly about continuous variables. However, if I’m understanding correctly, you’re referring to a categorical variable for reference/debridement? If so, the rules are a bit different but I still think you’re good.
  
  Regression and ANOVA are really the same analysis. So, you can think of your analysis as an ANOVA where you’re comparing groups in your data. And, it’s true that groups with smaller numbers will produce less precise estimates than groups with larger numbers. And, you generally require more observations for categorical variables than you do for continuous variables. However, it appears that your smallest group has an n=70 and that’s a very good sample size. In ANOVA, having more than 15-20 observations per group is usually good from an assumptions point of view (though it might not produce sufficient statistical power, depending on the effect size). So, you’re way over that. If some of your groups had very few observations, you might need to worry about the estimates for that variable, but that’s not the case.
  
  And, given your number of observations (368) and number of model terms requiring estimates overall (14), I don’t see any obvious reason to worry about overfitting on that basis either. Just be sure that you’re counting interaction terms and polynomials in the number of model terms. Additionally, a categorical variable can use more degrees of freedom than a single continuous variable.
  
  In short, I don’t see any reason for concern about overfitting given what you have written. Power depends on the effect size, which I don’t know. However, based on the number of observations/terms in model, I again don’t see an obvious problem.
  
  I hope this helps! Best of luck with your analysis!
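
The point that regression and ANOVA are the same analysis can be checked with a small Python sketch. The group values below are made up, and `fit_simple_ols` is a helper written for this example. Regressing the outcome on a 0/1 indicator for group membership returns the reference group mean as the intercept and the difference between group means as the coefficient.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical outcome scores for a reference group and a treatment group
reference = [10, 12, 14]
treatment = [15, 17, 19]

# Indicator (dummy) variable: 0 = reference group, 1 = treatment group
x = [0] * len(reference) + [1] * len(treatment)
y = reference + treatment

intercept, slope = fit_simple_ols(x, y)
mean_reference = sum(reference) / len(reference)
mean_treatment = sum(treatment) / len(treatment)

print(f"Intercept: {intercept:.1f} (reference group mean: {mean_reference:.1f})")
print(f"Slope:     {slope:.1f} (difference in means: {mean_treatment - mean_reference:.1f})")
```

A categorical variable with three levels, such as reference/debridement/microfracture, would need two indicator variables, which is why it uses more degrees of freedom than a single continuous predictor.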
  
Aisha says

May 26, 2019 at 5:25 am

Also, another query. I want to run a multiple regression but my demographics and one of my IVs weren’t significant in the initial correlation I ran. What variables should I put in my regression test now? Should I skip all those that weren’t significant? Or just the demographics? I have read that if you have literature backing up the relationship, you can run a regression analysis regardless of how it appeared in your preliminary analysis. How true is that? What would be the best approach in this case?
It would mean a lot if you could help me out on this one.

- Jim Frost says
  
  May 27, 2019 at 10:25 pm
  
  Hi again Aisha,
  
  Two different answers for you. One, be wary of the correlation results. The problem is, again, the potential for confounding variables. Correlation doesn’t factor in other variables. Confounding variables can mess up the correlation results just like they can bias a regression model, as I explained in my other comment. You have reason to believe that some of your demographic variables won’t be significant until you add your main IVs. So, you should try that to see what happens. Read the post about confounding variables and keep that in mind as you work through this!
  
  And, yes, if you have strong theory or evidence from other studies for including IVs in the model, it’s ok to include them in your model even if they’re not significant. Just explain that in the write up.
  
  For more about that, and model building in general, read my post about specifying the correct model!
  
Aisha says

May 25, 2019 at 8:13 am

Hi! I can’t believe I didn’t find this blog earlier, would have saved me a lot of trouble for my research 😀
Anyway, I have a question. Is it possible for your demographic variables to become significant predictors in the final model of a hierarchical regression? I can’t seem to understand why that is the case with mine: they came out non-significant in the first model (and in the correlation test I ran earlier) but became significant when I put them in with the rest of my (main) IVs.
Are there practical reasons for that or is it poor statistical skills? :-/

- Jim Frost says
  
  May 27, 2019 at 10:19 pm
  
  Hi Aisha,
  
  Thanks for writing with a fantastic question. It really touches on a number of different issues.
  
  Statistics is a funny field. There’s the field of statistics, but then many scientists/researchers in different fields use statistics within their own fields. And, I’ve observed in different fields that there are different terminology and practices for statistical procedures. Often I’ll hear a term for a statistical procedure and at first I won’t know what it is. But, then the person will describe it to me and I’ll know it by another name.
  
  At one point, hierarchical regression was like this for me. I’ve never used it myself, but it appears to be common in social sciences research. The idea is you add variables to model in several groups, such as the demographic variables in one group, and then some other variables in the next group. There’s usually a logic behind the grouping. The idea is to see how much the model improves with the addition of each group.
  
  I have some issues with this practice, and I think your case illustrates them. The idea behind this method is that each model in the process isn’t as good as the subsequent model, but it’s still a valid comparison. Unfortunately, if you look at a model knowing that you’re leaving out significant predictors, there’s a chance that the model with fewer IVs is biased. This problem occurs more frequently with observational studies, which I believe are more common in the social sciences. It’s the problem of confounding variables. And, what you describe is consistent with there being confounding variables that are not in the model with demographic variables until you add the main IVs. For more details, read my post about how confounding variables that are not in the model can bias your results.
  
  Chances are that some of your main IVs are correlated with one or more demographic variables and the DV. That condition will bias coefficients in your demographic IV model because that model excludes the confounding variables.
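  
  To make that concrete, here is a minimal simulation sketch. The data are made up, not yours: I assume one hypothetical demographic variable (`demo`) and one hypothetical main IV (`main_iv`) that is negatively correlated with it, and both truly affect the DV. Leaving out the main IV pushes the demographic coefficient toward zero, and adding it back recovers the real effect.

```python
import random

def cross_products(x, y):
    """Sum of cross-products of x and y around their means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))

def slope_one_predictor(x, y):
    """OLS slope for the model y ~ x."""
    return cross_products(x, y) / cross_products(x, x)

def slopes_two_predictors(x1, x2, y):
    """OLS slopes for the model y ~ x1 + x2 (closed form for two predictors)."""
    s11, s22, s12 = cross_products(x1, x1), cross_products(x2, x2), cross_products(x1, x2)
    s1y, s2y = cross_products(x1, y), cross_products(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

random.seed(1)
n = 5000
# Hypothetical data: the main IV is negatively related to the demographic variable.
demo = [random.gauss(0, 1) for _ in range(n)]
main_iv = [-d + random.gauss(0, 1) for d in demo]
# True model: both variables have a coefficient of 1.
dv = [d + m + random.gauss(0, 1) for d, m in zip(demo, main_iv)]

b_demo_only = slope_one_predictor(demo, dv)            # first block: demographics only
b_demo_full, b_main_full = slopes_two_predictors(demo, main_iv, dv)  # final model

print("demographic coefficient, first model:", round(b_demo_only, 3))
print("demographic coefficient, final model:", round(b_demo_full, 3))
```

  In this sketch the first model hides the demographic effect almost entirely, which is the same pattern you describe.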
  
  So, that’s the likely practical reason for what you’re observing. Not poor statistical skills! And, I’m not a fan of hierarchical regression for that reason. Perhaps there’s value to it that I’m not understanding. I’ve never used it in practice. But there doesn’t seem to be much to gain by assessing that first (in your case) demographic IV model when it appears to be excluding confounding variables and is, consequently, biased!
  
  However, I know that methodology is common in some fields, so it’s probably best to roll with it! 🙂 But, that’s what I think is happening.
  
  Reply
IBRAHIM BUMADIAN says

May 19, 2019 at 6:30 am

Hello Jim
I need your help please.
I have this question: Can you perform a multiple regression with two independent variables when one of them is constant? For example, I have this data:

Angle (Theta)    Length ratio (%)    Force (kN)
0                1                   52.1
0.174444444      1                   52.9
0.261666667      1                   53.3
0.348888889      1                   55.5
0.436111111      1                   58.1

Reply
- Jim Frost says
  
  May 20, 2019 at 2:42 pm
  
  Hi Ibrahim,
  
  Thanks for writing with the good question!
  
  The heart of regression analysis is determining how changes in an independent variable correlate with changes in the dependent variable. However, if an independent variable does not change (i.e., it is constant), there is no way for the analysis to determine how changes in it correlate with changes in the DV. It’s just not possible. So, to answer your question, you can’t include a constant variable as a predictor in a regression model.
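  
  Here is a minimal sketch using the five rows you posted. It covers only the one-predictor case, and the function names are my own. The slope estimate divides by the variation in the predictor, and that variation is exactly zero for your length ratio. In a multiple regression, the same constant column would simply duplicate the intercept.

```python
# Data copied from the comment above.
angle = [0.0, 0.174444444, 0.261666667, 0.348888889, 0.436111111]
length_ratio = [1.0, 1.0, 1.0, 1.0, 1.0]
force = [52.1, 52.9, 53.3, 55.5, 58.1]

def sum_of_squares(x):
    """Sum of squared deviations of x from its mean."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x)

def simple_slope(x, y):
    """OLS slope of y on x, or None when x has no variation to learn from."""
    sxx = sum_of_squares(x)
    if sxx == 0:
        return None  # constant predictor: the slope is undefined
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx

print("slope for angle:", simple_slope(angle, force))
print("slope for length ratio:", simple_slope(length_ratio, force))
```

  You can still fit force against angle alone. To learn anything about length ratio, you would need data collected at more than one ratio.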
  
  I hope this helps!
  
  Reply
Angie Manfredo-Thomas says

February 27, 2019 at 6:13 pm

Thank you very much for this awesome site!

Reply
Harri says

February 27, 2019 at 11:01 am

Hello sir, I need to know about regression and ANOVA. Could you help me, please?

Reply
- Jim Frost says
  
  February 27, 2019 at 11:46 am
  
  Hi Harri,
  
  You’re in the right spot! Read through my blog posts and you’ll learn about these topics. Additionally, within a couple of weeks, I’ll be releasing an ebook that’s all about learning regression!
  
  Reply
SM says

February 20, 2019 at 12:09 pm

Very nice tutorial. I’m reading them all! Are there any articles explaining how the regression model gets trained? Something about gradient descent?

Reply
Mani says

February 11, 2019 at 11:55 am

Thanks a lot for your precious time, sir.

Reply
- Jim Frost says
  
  February 11, 2019 at 11:58 am
  
  You’re very welcome! 🙂
  
  Reply
Mani says

February 10, 2019 at 5:05 am

Hey sir, I hope you are well. This is a really wonderful platform for learning regression.
Sir, I have a problem. I’m using cross-sectional data and the dependent variable is continuous. It’s basically MICS data and I’m using OLS, but some variables have missing observations, so the sample size is not equal across all the variables. Does OLS still make sense?

Reply
- Jim Frost says
  
  February 11, 2019 at 11:40 am
  
  Hi Mani,
  
  In the normal course of events, yes, when an observation has a missing value in one of the variables, OLS will exclude the entire observation when it fits the model. If observations with missing values are a small portion of your dataset, it’s probably not a problem. You do have to be aware of whether certain types of respondents are more likely to have missing values because that can skew your results. You want the missing values to occur randomly through the observations rather than systematically occurring more frequently in particular types of observations. But, again, if the vast majority of your observations don’t have missing values, OLS can still be a good choice.
  
  Assuming that OLS makes sense for your data, one difficulty with missing values is that there really is no alternative analysis that you can use to handle them. If OLS is appropriate for your data, you’re pretty much stuck with it even if you have problematic missing values. However, there are methods of estimating the missing values so you can use those observations. This process is particularly helpful if the missing values don’t occur randomly (as I describe above). I don’t know which software you are using, but SPSS has a particularly good method for imputing missing values. If you think missing values are a problem for your dataset, you should investigate ways to estimate those missing values, and then use OLS.
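  
  As a rough sketch of the two checks described above, the code below drops incomplete observations the way OLS does and then compares the share of incomplete rows across groups. The rows, column layout, and function names are hypothetical and are not taken from MICS.

```python
# Hypothetical rows: (region, income, education_years, outcome).
# None marks a missing value.
rows = [
    ("urban", 42.0, 12, 3.1),
    ("urban", 55.0, 16, 4.0),
    ("urban", 38.0, None, 2.7),
    ("rural", None, 8, 2.2),
    ("rural", 21.0, 10, 2.5),
    ("rural", None, 9, 2.0),
]

def complete_cases(data):
    """Keep only the rows with no missing values (what OLS ends up fitting)."""
    return [r for r in data if all(v is not None for v in r)]

def missing_rate_by_group(data):
    """Share of rows with at least one missing value, per value of the first column."""
    totals, missing = {}, {}
    for r in data:
        group = r[0]
        totals[group] = totals.get(group, 0) + 1
        if any(v is None for v in r):
            missing[group] = missing.get(group, 0) + 1
    return {g: missing.get(g, 0) / totals[g] for g in totals}

kept = complete_cases(rows)
rates = missing_rate_by_group(rows)
print(f"{len(kept)} of {len(rows)} rows would be used in the fit")
print("missing rate by region:", rates)
```

  If the rates differ a lot between groups, as they do in this made-up example, the missing values are probably not occurring randomly.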
  
  Best of luck with your analysis!
  
  Reply
Antonio Padua says

January 20, 2019 at 10:33 am

Hi Jim, I was quite excited to see you post this, but then there was no following article, only related subjects.

Binary logistic regression

By Jim Frost

Binary logistic regression models the relationship between a set of predictors and a binary response variable. A binary response has only two possible values, such as win and lose. Use a binary regression model to understand how changes in the predictor values are associated with changes in the probability of an event occurring.

Is the lesson on binary logistic regression to follow, or what am I missing?

Thank you for your time.

Antonio Padua

Reply
- Jim Frost says
  
  January 20, 2019 at 1:20 pm
  
  Hi Antonio,
  
  That’s a glossary term. On my blog, glossary terms have a special link. If you hover the pointer over the link, you’ll see a tooltip that displays the glossary term. Or, if you click the link, you go to the glossary term itself. You can also find all the glossary terms by clicking Glossary in the menu across the top of the screen. It seems like you probably clicked the link to get to the glossary term for binary logistic regression.
  
  I’ve had several requests for articles about this topic. So, I’m putting it on my to-do list! Although, it probably won’t be for a number of months. In the meantime, you can read my post where I show an example of binary logistic regression.
  
  Thanks for writing!
  
  Reply
Hanna Kerstin says

November 2, 2018 at 1:24 pm

Hi Jim,

Thanks so much, your blog is really helpful! I was wondering whether you have some suggestions on published articles that use OLS (nothing fancy, just very plain OLS) and that could be used in class for learning interpreting regression outputs. I’d love to use “real” work and make students see that what they learn is relevant in academia. I mostly find work that is too complicated for someone just starting to learn regression techniques, so any advice would be appreciated!

Thanks,
Hanna

Reply
Tran Trong Phong says

October 25, 2018 at 7:52 pm

Hi Jim. Have you written about instrumental variables and the 2SLS method? I am interested in them. Thanks for all the excellent things you’ve done on this site.

Reply
- Jim Frost says
  
  October 25, 2018 at 10:29 pm
  
  Hi,
  
  I haven’t yet, but those might be good topics for the future!
  
  Reply
David says

October 23, 2018 at 2:33 pm

Jim. Thank you so much. Especially for such a prompt response! The slopes are coming from IT segment stock valuations over 150 years. The slopes are derived from valuation troughs and peaks. So it is a graph like you’d see for the S&P. Sorry I was not clear on this.

Reply
David says

October 23, 2018 at 12:14 pm

Jim, could you recommend a model based on the following:

1. I can see a strong visual correlation between the left side trough and peak and the right side. When the left has a steep vector so does the right, for example.

2. This does not need to be the case, the left could provide a much steeper slope compared to right or a much more narrow slope.

3. The parallels intrigue me and I would like to measure if the left slope can be explained by the right to any degree.

4. I am measuring the rise and fall of industry valuations over time. (It is the rise and fall in these valuations over time that create these roughly parallel slopes.)

5. My data set since 1886 only provides 6 events, but they are consistent as described.

6. I attempted to correlate the rising slope against the declining slope.

Reply
- Jim Frost says
  
  October 23, 2018 at 2:04 pm
  
  Hi David,
  
  I’m having a hard time figuring out what you’re describing. I’m not sure which slopes you’re referring to, and I don’t know what you mean by the left versus right slopes.
  
  If you only have 6 data points, you’ll only be able to fit an extremely simple model. You’ll usually need at least 10 data points (absolute minimum but probably more) to even include one independent variable.
  
  If you have two slopes for something and you want to see if one slope explains the other, you could try using linear regression. Use one slope as an independent variable and another as a dependent variable. Slopes would be a continuous variable and so that might work. The underlying data for each slope would have to be independent from data used for other slopes. And, you’ll have to worry about time order effects such as autocorrelation.
  
  Reply
Raju says

October 2, 2018 at 1:37 am

Thank you Jim.

Reply
Raju Pavithran says

October 2, 2018 at 1:31 am

Hi Jim,
I have a doubt regarding which regression analysis is to be conducted. The data set consists of categorical independent variables (ordinal) and one dependent variable which is of continuous type. Moreover, most of the data pertaining to an independent variable is concentrated towards first category (70%). My objective is to capture the factors influencing the dependent variable and its significance. In that case should I consider the ind. variables to be continuous or as categorical? Thanks in advance.

Raju.

Reply
- Jim Frost says
  
  October 2, 2018 at 2:26 am
  
  Hi Raju,
  
  I think I already answered your question on this. Although, it looks like you’re now saying that you have an ordinal independent variable rather than a categorical variable. Ordinal data can be difficult. I’d still try using linear regression to fit the data.
  
  You have two options that you can try.
  
  1) You can include the ordinal data as continuous data. Doing this assumes that going from 1 to 2 is the same scale change as going from 2 to 3 and so on. Just like with actual continuous data. Although, you can add polynomials and transformations to improve the fit.
  
  2) However, that doesn’t always work. Sometimes ordinal data don’t behave like continuous data. For example, the 2nd place finisher in a race doesn’t necessarily take twice as long as the 1st place finisher. And the difference between 3rd and 2nd isn’t the same as between 1st and 2nd. Etc. In that case, you can include it as a categorical variable. Using this approach, you estimate the mean differences between the different ordinal levels and you don’t have to assume they’ll be the same.
  
  There’s an important caveat about including them as categorical variables. When you include categorical variables, you’re actually using indicator variables. A 5 point Likert scale (ordinal) actually includes 4 indicator variables. If you have many Likert variables, you’re actually including 4 variables for each one. That can be problematic. If you add enough of these variables, it can lead to overfitting. Depending on your software, you might not even see these indicator variables because they code and include them behind the scenes. It’s something to be aware of. If you have many such variables, it’s preferable to include them as continuous variables if possible.
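  
  To show what the software typically does behind the scenes, here is a small sketch of indicator coding for a 5-point Likert item. The function name and the choice of the first level as the reference level are my assumptions. Your software may pick a different reference level.

```python
def indicator_columns(values, levels):
    """Code each value as len(levels) - 1 indicator variables.

    The first level is treated as the reference level, so it is coded as all zeros.
    """
    non_reference = levels[1:]
    return [[1 if v == level else 0 for level in non_reference] for v in values]

likert_levels = [1, 2, 3, 4, 5]
responses = [1, 3, 5, 2]  # hypothetical answers from four respondents

coded = indicator_columns(responses, likert_levels)
for answer, row in zip(responses, coded):
    print(answer, "->", row)
```

  Each Likert item becomes four columns, so ten such items would add forty terms to the model. That is why overfitting becomes a concern.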
  
  You’ll have to think about whether your data seems more like continuous or categorical data. And, try both methods if you’re not sure. Check the residuals to make sure the model provides a good fit.
  
  Ordinal data can be tricky because they’re not really continuous data nor categorical data–a bit of both! So, you’ll have to experiment and assess how well the different approaches work.
  
  Good luck with your analysis!
  
  Reply
Raju says

October 1, 2018 at 2:32 am

Hello Jim,
I have a set of data consisting of a dependent variable, which is continuous, and independent variables, which are categorical. The interesting thing I found is that the majority (more than 70%) of the independent variables belong to category 1. The category values range from 1 to 5. I would like to know the appropriate technique to use. Is it appropriate to use linear regression, or should I use other alternatives? Or is any preprocessing of the data required? Please help me with the above.

Thanks in advance
Raju.

Reply
- Jim Frost says
  
  October 1, 2018 at 9:40 pm
  
  Hi Raju,
  
  I’d try linear regression first. You can include that categorical variable as the independent variable with no problem. As always, be sure to check the residual plots. You can also use one-way ANOVA, which would be the more usual choice for this type of analysis. But, linear regression and ANOVA are really the same analysis “under the hood.” So, you can go either way.
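  
  Here is a small sketch of what “the same analysis under the hood” means. It is simplified to two groups with made-up numbers so that one indicator variable is enough. The regression intercept equals the mean of the reference group, and the slope equals the difference between the group means, which is exactly what ANOVA compares.

```python
def fit_line(x, y):
    """Return (intercept, slope) from a simple OLS fit of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical responses for two categories.
group_a = [10.0, 12.0, 14.0]
group_b = [20.0, 22.0, 24.0]

# Indicator variable: 0 for category A (reference), 1 for category B.
indicator = [0] * len(group_a) + [1] * len(group_b)
response = group_a + group_b

intercept, slope = fit_line(indicator, response)
mean_a = sum(group_a) / len(group_a)
mean_b = sum(group_b) / len(group_b)

print("intercept:", intercept, "| mean of A:", mean_a)
print("slope:", slope, "| mean of B minus mean of A:", mean_b - mean_a)
```

  With five categories you would have four indicator variables instead of one, but the interpretation of each coefficient as a difference from the reference group stays the same.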
  
  I hope this helps!
  
  Reply
sarkhani says

September 23, 2018 at 4:28 am

Hello Jim
I’d like to know what your suggestions are with regard to the choice of regression for prediction when the dependent variable is count data that does not follow a Poisson distribution and the independent variables include categorical and continuous data.
I’d appreciate your thoughts on it.
Thanks!

Reply
- Jim Frost says
  
  September 24, 2018 at 11:08 pm
  
  Hi Sarkhani,
  
  Having count data that don’t follow the Poisson happens fairly often. The top alternatives that I’m aware of are negative binomial regression and zero inflated models. I talk about those options a bit in my post about choosing the correct type of regression analysis. The count data section is near the end. I hope this information points you in the right direction!
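  
  One quick check you can run before choosing among those alternatives: a Poisson distribution has a variance equal to its mean, so a variance far above the mean (overdispersion) is a common reason count data don’t fit it. The sketch below uses made-up counts and ignores the predictors, so treat it as a rough first look rather than a formal test.

```python
import statistics

# Hypothetical counts, for illustration only.
counts = [0, 0, 1, 0, 7, 2, 0, 12, 1, 0]

mean_count = statistics.mean(counts)
variance = statistics.variance(counts)  # sample variance
dispersion_ratio = variance / mean_count
zero_share = counts.count(0) / len(counts)

print("mean:", mean_count)
print("variance:", round(variance, 2))
print("variance / mean:", round(dispersion_ratio, 2))
print("share of zeros:", zero_share)
```

  A ratio well above 1 points toward negative binomial regression, and a large share of zeros is a reason to look at zero-inflated models.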
  
  Reply
mohamadhosein says

August 29, 2018 at 9:38 am

Hi Jim,
I’m really happy to have found your blog.

Reply
Arnab Paul says

August 11, 2018 at 1:42 pm

The independent variable ranges from 0 to 1 and the corresponding dependent variable ranges from 1 to 5. If we apply regression analysis to these data and predict the value of y for any value of x between 0 and 1, will the value of y always lie in the range 1 to 5?

Reply
- Jim Frost says
  
  August 11, 2018 at 4:18 pm
  
  In my experience, some predicted values can fall outside the range of the actual dependent variable. Assuming that you are referring to actual limits at 1 and 5, the regression analysis does not “understand” that those are hard limits. The extent to which the predicted values fall outside these limits depends on the amount of error in the model.
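  
  Here is a small sketch with made-up data in which every observed y lies between 1 and 5 and every x lies between 0 and 1. The fitted line still predicts values below 1 and above 5 at the ends of the x range.

```python
# Hypothetical data: x in [0, 1], y limited to the range 1 to 5.
x = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]
y = [1, 1, 2, 4, 5, 5]

def fit_line(xs, ys):
    """Return (intercept, slope) from a simple OLS fit of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

intercept, slope = fit_line(x, y)

def predict(value):
    """Predicted mean of y at the given x value."""
    return intercept + slope * value

print("prediction at x = 0:", round(predict(0.0), 2))
print("prediction at x = 1:", round(predict(1.0), 2))
```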
  
  Reply
RAJKUMAR R says

August 8, 2018 at 4:18 am

Very Good Explanation about regression ….Thank you sir for such a wonderful post….

Reply
Patrik Silva says

March 29, 2018 at 11:43 am

Hi Jim, I would like to see you writing something about Cross Validation (Training and test).

Patrik

Reply
Lisa says

February 20, 2018 at 8:30 am

thank you Jim this is helpful

Reply
- Jim Frost says
  
  February 21, 2018 at 4:08 pm
  
  You’re very welcome, Lisa! I’m glad you found it to be helpful!
  
  Reply
Yud says

January 21, 2018 at 10:39 am

Hello Jim
I’d like to know what your suggestions are with regard to the choice of regression for predicting the likelihood of participants falling into one of two categories (low fear group, coded 1, and high fear group, coded 2) when looking at scores from several variables (e.g., external other locus of control, external social locus of control, internal locus of control, social phobia, and sleep quality).
It was suggested that I break the question up into smaller components. I’d appreciate your thoughts on it. Thanks!

Reply
- Jim Frost says
  
  January 22, 2018 at 2:30 pm
  
  Because you have a binary response (dependent variable), you’ll need to use binary logistic regression. I don’t know what types of predictors you have. If they’re continuous, you can just use them in the model and see how it works.
  
  If they’re ordinal data, such as a Likert scale, you can still try using them as predictors in the model. However, ordinal data are less likely to satisfy all the assumptions. Check the residual plots. If including the ordinal data in the model doesn’t work, you can recode them as indicator variables (1s and 0s only, based on whether an observation meets a criterion or not). For example, if you have a scale of -2, -1, 0, 1, 2, you could recode it so observations with a positive score get a 1 while all other scores get a 0.
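  
  The recoding in that example takes only a line or two. This sketch uses a hypothetical function name and the cutoff from the example above (positive scores become 1).

```python
def recode_positive(scores):
    """Recode ordinal scores as an indicator: 1 for positive scores, 0 otherwise."""
    return [1 if s > 0 else 0 for s in scores]

scale = [-2, -1, 0, 1, 2]
print(recode_positive(scale))  # prints [0, 0, 0, 1, 1]
```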
  
  Those are some ideas to try. Of course, what works best for your case depends on the subject area and types of data that you have.
  
  I hope this helps!
  
  Reply
Md zishan hussain says

January 21, 2018 at 5:04 am

Hello Jim,

I am using stepwise regression to select significant variables in the model for prediction. How do I interpret BIC in variable selection?

Regards,
Zishan

Reply
- Jim Frost says
  
  January 22, 2018 at 5:36 pm
  
  Hi, when comparing candidate models, you look for models with a lower BIC. A lower BIC indicates that a model is more likely to be the true model. BIC identifies the model that is more likely to have generated the observed data.
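  
  As a rough sketch of that comparison: for a linear model with normally distributed errors, BIC can be computed (up to a constant that is the same for every model fit to the same data) as n·ln(RSS/n) + k·ln(n), where RSS is the residual sum of squares and k is the number of estimated parameters. The sample size and RSS values below are made up to show how the penalty for extra terms works.

```python
import math

def bic_linear(n, rss, k):
    """BIC for an OLS model, up to an additive constant shared by all candidate models.

    n: number of observations, rss: residual sum of squares,
    k: number of estimated parameters.
    """
    return n * math.log(rss / n) + k * math.log(n)

n = 100
# Hypothetical candidates: model B adds three terms but barely reduces the RSS.
bic_a = bic_linear(n, rss=250.0, k=3)
bic_b = bic_linear(n, rss=245.0, k=6)

print("BIC, model A:", round(bic_a, 2))
print("BIC, model B:", round(bic_b, 2))
print("preferred:", "A" if bic_a < bic_b else "B")
```

  Here the smaller model wins because the small drop in RSS doesn’t offset the penalty for the extra parameters.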
  
  Reply
Aftab Siddiqui says

January 18, 2018 at 2:44 pm

Yes, the language of the topic is very easy. I would appreciate it, sir, if you could let me know: if the rank correlation is r = 0.8 and the sum of “D” squared = 33, how do we calculate the number of observations (n)?

Reply
- Jim Frost says
  
  January 18, 2018 at 3:00 pm
  
  I’m not sure what you mean by “D” square, but I believe you’ll need more information for that.
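  
  That said, if “D” is the difference between paired ranks, as in Spearman’s rank correlation, and there are no tied ranks (both are assumptions on my part), then the formula r = 1 − 6ΣD² / (n(n² − 1)) can be rearranged to n(n² − 1) = 6ΣD² / (1 − r) and solved for n. Here is a minimal sketch under those assumptions:

```python
from fractions import Fraction

def spearman_n(r, sum_d_squared, max_n=1000):
    """Find the integer n that satisfies n(n^2 - 1) = 6 * sum(D^2) / (1 - r).

    Assumes Spearman's formula without ties. Returns None if no integer n fits.
    """
    target = Fraction(6 * sum_d_squared) / (1 - r)
    for n in range(2, max_n + 1):
        if n * (n * n - 1) == target:
            return n
    return None

# r = 0.8 is written as the exact fraction 4/5 to avoid rounding problems.
print(spearman_n(Fraction(4, 5), 33))  # prints 10
```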
  
  Reply
Dina says

January 6, 2018 at 11:08 pm

Hi, Jim!
I’m really happy to find your blog. It’s really helping, especially that you use basic English so non-native speaker can understand it better than reading most textbooks. Thanks!

Reply
- Jim Frost says
  
  January 7, 2018 at 12:49 am
  
  Hi Dina, you’re welcome! And, thanks so much for your kind words–you made my day!
  
  Reply
Nivedan says

December 21, 2017 at 12:30 am

Hi Jim!

Can you write on Logistic regression please!

Thank you

Reply
- Jim Frost says
  
  December 21, 2017 at 12:45 am
  
  Hi! You bet! I plan to write about it in the near future!
  
  Reply
Farmanullah says

December 16, 2017 at 2:33 am

Great work by a great man. It is an easily accessible source for scholars. Sir, I am going to analyse data. Please send me guidelines for selecting the best simple linear/multiple linear regression model. Thanks.

Reply
- Jim Frost says
  
  December 17, 2017 at 12:21 am
  
  Hi, thank you so much for your kind words. I really appreciate it! I’ve written a blog post that I think is exactly what you need. It’ll help you choose the best regression model.
  
  Reply
bwbjlt says

December 8, 2017 at 8:47 am

such a splendid compilation, Thanks Jim

Reply
- Jim Frost says
  
  December 8, 2017 at 11:09 am
  
  Thank you!
  
  Reply
Tobden says

December 3, 2017 at 10:00 pm

Would you also share some ideas on instrumental variables and the 2SLS method, please?

Reply
- Jim Frost says
  
  December 3, 2017 at 10:40 pm
  
  Those are great ideas! I’ll write about them in future posts.
  
  Reply