Regression analysis mathematically describes the relationship between independent variables and the dependent variable. It also allows you to predict the mean value of the dependent variable when you specify values for the independent variables. In this regression tutorial, I gather together a wide range of posts that I’ve written about regression analysis. My tutorial helps you go through the regression content in a systematic and logical order.

This tutorial covers many facets of regression analysis including selecting the correct type of regression analysis, specifying the best model, interpreting the results, assessing the fit of the model, generating predictions, and checking the assumptions. I close the post with examples of different types of regression analyses.

If you’re learning regression analysis, you might want to bookmark this tutorial!

## When to Use Regression and the Signs of a High-Quality Analysis

Before we get to the regression tutorials, I’ll cover several overarching issues.

Why use regression at all? What are common problems that trip up analysts? And, how do you differentiate a high-quality regression analysis from a less rigorous study? Read these posts to find out:

- When Should I Use Regression Analysis?: Learn what regression can do for you and when you should use it.
- Five Regression Tips for a Better Analysis: These tips help ensure that you perform a top-quality regression analysis.

## Tutorial: Choosing the Right Type of Regression Analysis

There are many different types of regression analysis. Choosing the right procedure depends on your data and the nature of the relationships, as these posts explain.

- Choosing the Correct Type of Regression Analysis: Reviews different regression methods by focusing on data types.
- How to Choose Between Linear and Nonlinear Regression: Determining which one to use by assessing the statistical output.
- The Difference between Linear and Nonlinear Models: Both kinds of models can fit curves, so what’s the difference?
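
To illustrate how a linear model can fit a curve, here is a minimal sketch using NumPy. The data are made up and noise-free (an assumption, purely for clarity). The point is that adding a squared term keeps the model linear in its parameters, so ordinary least squares still applies.

```python
import numpy as np

# Hypothetical data that follow a curve: y = 2 + 3x - 0.5x^2 (no noise)
x = np.linspace(0, 10, 21)
y = 2 + 3 * x - 0.5 * x**2

# The design matrix contains x and x^2, but the model is still linear
# in the parameters (b0, b1, b2), so least squares can estimate them.
X = np.column_stack([np.ones_like(x), x, x**2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(coefs, 3))  # estimates of b0, b1, b2
```

A nonlinear model, by contrast, has parameters that enter the equation in a way that cannot be written as a sum of parameter-times-term, and it generally needs an iterative fitting routine.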

## Tutorial: Specifying the Regression Model

Selecting the right type of regression analysis is just the start of the process. Next, you need to specify the model. Model specification is the process of determining which independent variables belong in the model and whether modeling curvature and interaction effects are appropriate.

Model specification is an iterative process. The interpretation and assumption confirmation sections of this tutorial explain how to assess your model and how to change the model based on the statistical output and graphs.

- Model Specification: Choosing the Correct Regression Model: I review standard statistical approaches, difficulties you may face, and offer some real-world advice.
- Using Data Mining to Select Your Regression Model Can Create Problems: This approach to choosing a model can produce misleading results. Learn how to detect and avoid this problem.
- Guide to Stepwise Regression and Best Subsets Regression: Two common tools for identifying candidate variables during the investigative stages of model building.
- Overfitting Regression Models: Overly complicated models can produce misleading R-squared values, regression coefficients, and p-values. Learn how to detect and avoid this problem.
- Curve Fitting Using Linear and Nonlinear Regression: When your data don’t follow a straight line, the model must fit the curvature. This post covers various methods for fitting curves.
- Understanding Interaction Effects: When the effect of one variable depends on the value of another variable, you need to include an interaction effect in your model otherwise the results will be misleading.
- When Do You Need to Standardize the Variables?: In specific situations, standardizing the independent variables can uncover statistically significant results.
- Confounding Variables and Omitted Variable Bias: The variables that you leave out of the model can bias the variables that you include.
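
As a rough sketch of what an interaction effect looks like in practice, the example below fits a model with an `x1 * x2` term using NumPy. The variables, coefficients, and the helper `slope_of_x1` are all hypothetical, and the data are noise-free so the estimates are easy to check.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, 200)
x2 = rng.uniform(0, 1, 200)

# Hypothetical relationship: the effect of x1 depends on x2
y = 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

# Include the product x1*x2 as its own column to model the interaction
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

def slope_of_x1(x2_value):
    """Effect of a one-unit change in x1 at a given value of x2."""
    return b[1] + b[3] * x2_value

print(slope_of_x1(0.0), slope_of_x1(1.0))
```

Because the slope of `x1` changes with `x2`, interpreting the `x1` coefficient alone would be misleading, which is the problem the interaction term addresses.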

## Tutorial: Interpreting Regression Results

After choosing the type of regression and specifying the model, you need to interpret the results. The next set of posts explains how to interpret the results for various regression analysis statistics:

- Coefficients and p-values
- Constant (Y-intercept)
- Comparing regression slopes and constants with hypothesis tests
- R-squared and the goodness-of-fit
- How high does R-squared need to be?
- Interpreting a model with a low R-squared
- Adjusted R-squared and Predicted R-squared
- Standard error of the regression (S) vs. R-squared
- Five Reasons Your R-squared can be Too High: A high R-squared can occasionally signify a problem with your model.
- F-test of overall significance
- Identifying the Most Important Independent Variables: After settling on a model, analysts frequently ask, “Which variable is most important?”
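
For readers who want to see how several of these goodness-of-fit statistics relate, here is a small sketch that computes R-squared, adjusted R-squared, and the standard error of the regression (S) from the standard textbook formulas. The data and the function name `fit_statistics` are hypothetical.

```python
import numpy as np

def fit_statistics(y, fitted, k):
    """Return R-squared, adjusted R-squared, and S for a model with
    k independent variables (not counting the constant)."""
    y = np.asarray(y, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    n = len(y)
    ss_res = np.sum((y - fitted) ** 2)    # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variation
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    s = np.sqrt(ss_res / (n - k - 1))     # in the units of y
    return r2, adj_r2, s

# Hypothetical data and a simple one-predictor least squares fit
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
slope, intercept = np.polyfit(x, y, 1)
r2, adj_r2, s = fit_statistics(y, intercept + slope * x, k=1)
print(r2, adj_r2, s)
```

Adjusted R-squared is never larger than R-squared because it penalizes each additional independent variable, while S expresses the typical size of the residuals in the units of the dependent variable.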

## Tutorial: Using Regression to Make Predictions

Analysts often use regression analysis to make predictions. In this section of the regression tutorial, learn how to make predictions and assess their precision.

- Making Predictions with Regression Analysis: This guide uses BMI to predict body fat percentage.
- Predicted R-squared: This statistic evaluates how well a model predicts the dependent variable for new observations.
- Understand Prediction Precision to Avoid Costly Mistakes: Research shows that presentation affects the number of interpretation mistakes. Covers prediction intervals.
- Prediction intervals versus other intervals: Prediction intervals indicate the precision of the predictions. I compare prediction intervals to different types of intervals.
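
The difference between a confidence interval for the mean and a prediction interval for a new observation can be sketched with the standard simple-regression formulas. The data below are made up, the function name `intervals` is hypothetical, and the critical value 2.306 is the two-sided 95% t-value for 8 degrees of freedom, taken from a t-table rather than computed.

```python
import numpy as np

def intervals(x, y, x0, t_crit):
    """Fitted value at x0 with a confidence interval for the mean
    and a prediction interval for a single new observation."""
    n = len(x)
    x_bar = x.mean()
    sxx = np.sum((x - x_bar) ** 2)
    b1 = np.sum((x - x_bar) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x_bar
    resid = y - (b0 + b1 * x)
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))  # standard error of the regression
    y_hat = b0 + b1 * x0
    se_mean = s * np.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    se_pred = s * np.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
    ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
    pi = (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)
    return y_hat, ci, pi

# Hypothetical data: 10 observations, so df = 10 - 2 = 8
x = np.arange(1, 11, dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9])
y_hat, ci, pi = intervals(x, y, x0=5.5, t_crit=2.306)
print(y_hat, ci, pi)
```

The prediction interval is always wider than the confidence interval because it accounts for the scatter of individual observations around the mean, not just the uncertainty in the estimated mean.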

## Tutorial: Checking Regression Assumptions and Fixing Problems

Like other statistical procedures, regression analysis has assumptions that you need to meet, or the results can be unreliable. In regression, you primarily verify the assumptions by assessing the residual plots. The posts below explain how to do this and present some methods for fixing problems.

- The Seven Classical Assumptions of OLS Linear Regression
- Residual plots: Shows what the graphs should look like and why they might not!
- Heteroscedasticity: The residuals should have a constant scatter (homoscedasticity). Shows how to detect this problem and various methods of fixing it.
- Multicollinearity: Highly correlated independent variables can be problematic, but not always! Explains how to identify this problem and several ways of resolving it.
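
One common diagnostic for multicollinearity is the variance inflation factor (VIF). The sketch below is an illustration under assumptions, not necessarily the method the linked post uses: the data are simulated, the function name `vif` is hypothetical, and the rule of thumb that values well above 5 or 10 deserve a closer look is a convention rather than a hard cutoff.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X.
    X holds one column per independent variable (no constant column)."""
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        # Regress column j on a constant plus all the other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coefs, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coefs
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = rng.normal(size=500)             # unrelated to a
c = a + 0.1 * rng.normal(size=500)   # nearly a copy of a
vifs = vif(np.column_stack([a, b, c]))
print(np.round(vifs, 1))
```

Here `a` and `c` get large VIFs because each is almost perfectly predicted by the other, while `b` stays close to 1.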

## Examples of Different Types of Regression Analyses

The last part of the regression tutorial contains regression analysis examples. I’ll be adding more. Some of the examples are included in previous tutorial sections. Most of these regression examples include the datasets so you can try them yourself!

- Linear regression with a double-log transformation: Models the relationship between mammal mass and metabolic rate using a fitted line plot.
- Modeling the relationship between BMI and Body Fat Percentage with linear regression.
- Curve fitting with linear and nonlinear regression.

Mani says

Thanks a lot for your precious time, sir.

Jim Frost says

You’re very welcome! 🙂

Mani says

Hey sir, I hope you are doing fine. This is a really wonderful platform for learning regression.

Sir, I have a problem. I’m using cross-sectional data, and the dependent variable is continuous. It’s basically MICS data, and I’m using OLS, but some variables have missing observations, so the sample size is not equal across all the variables. Does it make sense to use OLS?

Jim Frost says

Hi Mani,

In the normal course of events, yes, when an observation has a missing value in one of the variables, OLS will exclude the entire observation when it fits the model. If observations with missing values are a small portion of your dataset, it’s probably not a problem. You do have to be aware of whether certain types of respondents are more likely to have missing values because that can skew your results. You want the missing values to occur randomly through the observations rather than systematically occurring more frequently in particular types of observations. But, again, if the vast majority of your observations don’t have missing values, OLS can still be a good choice.

Assuming that OLS makes sense for your data, one difficulty with missing values is that there really is no alternative analysis that you can use to handle them. If OLS is appropriate for your data, you’re pretty much stuck with it even if you have problematic missing values. However, there are methods of estimating the missing values so you can use those observations. This process is particularly helpful if the missing values don’t occur randomly (as I describe above). I don’t know which software you are using, but SPSS has a particularly good method for imputing missing values. If you think missing values are a problem for your dataset, you should investigate ways to estimate those missing values, and then use OLS.
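
As a small illustration of the row exclusion described above, the sketch below drops every observation that has a missing value in any variable and then makes a crude comparison of complete and incomplete rows. The dataset is made up, and the column layout (`y`, `x1`, `x2`) is an assumption for the example.

```python
import numpy as np

# Hypothetical data: columns are y, x1, x2; np.nan marks a missing value
data = np.array([
    [10.0, 1.0, 5.0],
    [12.0, 2.0, np.nan],
    [13.0, 3.0, 4.0],
    [np.nan, 4.0, 6.0],
    [17.0, 5.0, 3.0],
    [19.0, 6.0, 2.0],
])

# An observation is used only if every variable is present
complete = ~np.isnan(data).any(axis=1)
used = data[complete]
share_dropped = 1 - complete.mean()

# Crude check for systematic missingness: does x1 differ between
# rows that are kept and rows that are dropped?
x1_kept = data[complete, 1].mean()
x1_dropped = data[~complete, 1].mean()
print(used.shape, share_dropped, x1_kept, x1_dropped)
```

A large gap between the kept and dropped groups on an observed variable is a hint, not proof, that the missing values are not occurring randomly.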

Best of luck with your analysis!

Antonio Padua says

Hi Jim, I was quite excited to see you post this, but then there was no following article, only related subjects.

Binary logistic regression

By Jim Frost

Binary logistic regression models the relationship between a set of predictors and a binary response variable. A binary response has only two possible values, such as win and lose. Use a binary regression model to understand how changes in the predictor values are associated with changes in the probability of an event occurring.

Is the lesson on binary logistic regression to follow, or what am I missing?

Thank you for your time.

Antonio Padua

Jim Frost says

Hi Antonio,

That’s a glossary term. On my blog, glossary terms have a special link. If you hover the pointer over the link, you’ll see a tooltip that displays the glossary term. Or, if you click the link, you go to the glossary term itself. You can also find all the glossary terms by clicking Glossary in the menu across the top of the screen. It seems like you probably clicked the link to get to the glossary term for binary logistic regression.

I’ve had several requests for articles about this topic. So, I’m putting it on my to-do list! Although, it probably won’t be for a number of months. In the meantime, you can read my post where I show an example of binary logistic regression.

Thanks for writing!

Hanna Kerstin says

Hi Jim,

Thanks so much, your blog is really helpful! I was wondering whether you have some suggestions on published articles that use OLS (nothing fancy, just very plain OLS) and that could be used in class for learning interpreting regression outputs. I’d love to use “real” work and make students see that what they learn is relevant in academia. I mostly find work that is too complicated for someone just starting to learn regression techniques, so any advice would be appreciated!

Thanks,

Hanna

Tran Trong Phong says

Hi Jim. Have you written about instrumental variables and the 2SLS method? I am interested in them. Thanks for all the excellent things you’ve done on this site.

Jim Frost says

Hi,

I haven’t yet, but those might be good topics for the future!

David says

Jim. Thank you so much. Especially for such a prompt response! The slopes are coming from IT segment stock valuations over 150 years. The slopes are derived from valuation troughs and peaks. So it is a graph like you’d see for the S&P. Sorry I was not clear on this.

David says

Jim, could you recommend a model based on the following:

1. I can see a strong visual correlation between the left side trough and peak and the right side. When the left has a steep vector, so does the right, for example.

2. This does not need to be the case, the left could provide a much steeper slope compared to right or a much more narrow slope.

3. The parallels intrigue me and I would like to measure if the left slope can be explained by the right to any degree.

4. I am measuring the rise and fall of industry valuations over time. (It is the rise and fall in these valuations over time that create these ~ parallel slopes.)

5. My data set since 1886 only provides 6 events, but they are consistent as described.

6. I attempted to correlate the rising slope against the declining slope.

Jim Frost says

Hi David,

I’m having a hard time figuring out what you’re describing. I’m not sure which slopes you’re referring to, and I don’t know what you mean by the left versus right slopes.

If you only have 6 data points, you’ll only be able to fit an extremely simple model. You’ll usually need at least 10 data points (absolute minimum but probably more) to even include one independent variable.

If you have two slopes for something and you want to see if one slope explains the other, you could try using linear regression. Use one slope as an independent variable and another as a dependent variable. Slopes would be a continuous variable and so that might work. The underlying data for each slope would have to be independent from data used for other slopes. And, you’ll have to worry about time order effects such as autocorrelation.

Raju says

Thank you Jim.

Raju Pavithran says

Hi Jim,

I have a doubt regarding which regression analysis to conduct. The data set consists of categorical independent variables (ordinal) and one dependent variable, which is continuous. Moreover, most of the data for an independent variable are concentrated in the first category (70%). My objective is to capture the factors influencing the dependent variable and their significance. In that case, should I treat the independent variables as continuous or as categorical? Thanks in advance.

Raju.

Jim Frost says

Hi Raju,

I think I already answered your question on this. Although, it looks like you’re now saying that you have an ordinal independent variable rather than a categorical variable. Ordinal data can be difficult. I’d still try using linear regression to fit the data.

You have two options that you can try.

1) You can include the ordinal data as continuous data. Doing this assumes that going from 1 to 2 is the same scale change as going from 2 to 3 and so on. Just like with actual continuous data. Although, you can add polynomials and transformations to improve the fit.

2) However, that doesn’t always work. Sometimes ordinal data don’t behave like continuous data. For example, the 2nd place finisher in a race doesn’t necessarily take twice as long as the 1st place finisher. And the difference between 3rd and 2nd isn’t the same as between 1st and 2nd. Etc. In that case, you can include it as a categorical variable. Using this approach, you estimate the mean differences between the different ordinal levels and you don’t have to assume they’ll be the same.

There’s an important caveat about including them as categorical variables. When you include categorical variables, you’re actually using indicator variables. A 5-point Likert scale (ordinal) actually includes 4 indicator variables. If you have many Likert variables, you’re actually including 4 variables for each one. That can be problematic. If you add enough of these variables, it can lead to overfitting. Depending on your software, you might not even see these indicator variables because it codes and includes them behind the scenes. It’s something to be aware of. If you have many such variables, it’s preferable to include them as continuous variables if possible.

You’ll have to think about whether your data seems more like continuous or categorical data. And, try both methods if you’re not sure. Check the residuals to make sure the model provides a good fit.

Ordinal data can be tricky because they’re not really continuous data nor categorical data–a bit of both! So, you’ll have to experiment and assess how well the different approaches work.
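
The two options above can be sketched side by side. In this made-up example, a 5-point Likert variable is entered once as a continuous predictor (one slope) and once as a categorical predictor (four indicator variables, with level 1 as the reference). The data and the helper `fit_sse` are hypothetical.

```python
import numpy as np

likert = np.array([1, 1, 2, 3, 3, 4, 5, 5, 2, 4])
y = np.array([2.0, 2.2, 2.9, 3.1, 3.0, 5.2, 5.9, 6.1, 3.1, 5.0])

# Option 1: treat the scale as continuous (constant + one slope)
X_cont = np.column_stack([np.ones(len(likert)), likert])

# Option 2: treat it as categorical (constant + 4 indicator variables)
levels = np.unique(likert)
indicators = (likert[:, None] == levels[1:]).astype(float)
X_cat = np.column_stack([np.ones(len(likert)), indicators])

def fit_sse(X, y):
    """Least squares coefficients and the sum of squared residuals."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return b, float(r @ r)

b_cont, sse_cont = fit_sse(X_cont, y)
b_cat, sse_cat = fit_sse(X_cat, y)
print(X_cont.shape[1], sse_cont)  # 2 parameters
print(X_cat.shape[1], sse_cat)    # 5 parameters
```

The categorical version always fits the sample at least as closely, but it spends three more parameters to do so, which is the overfitting risk mentioned above when many such variables are in the model.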

Good luck with your analysis!

Raju says

Hello Jim,

I have a set of data consisting of a dependent variable, which is continuous, and independent variables, which are categorical. The interesting thing I found is that the majority (more than 70%) of the values of the independent variables belong to category 1. The category values range from 1 to 5. I would like to know the appropriate technique to use. Is it appropriate to use linear regression, or should I use other alternatives? Or is any preprocessing of the data required? Please help me with the above.

Thanks in advance

Raju.

Jim Frost says

Hi Raju,

I’d try linear regression first. You can include that categorical variable as the independent variable with no problem. As always, be sure to check the residual plots. You can also use one-way ANOVA, which would be the more usual choice for this type of analysis. But, linear regression and ANOVA are really the same analysis “under the hood.” So, you can go either way.
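
To show what "the same analysis under the hood" means, this sketch computes the overall F-statistic two ways on a tiny made-up dataset: once from a regression with indicator variables and once from the usual one-way ANOVA sums of squares. The group labels and values are hypothetical.

```python
import numpy as np

group = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
y = np.array([4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1.0, 2.0, 3.0])
n, n_groups = len(y), 3

# Regression: constant plus indicators for groups 2 and 3 (group 1 is the reference)
X = np.column_stack([np.ones(n), group == 2, group == 3]).astype(float)
b, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = np.sum((y - X @ b) ** 2)
ssr = np.sum((y - y.mean()) ** 2) - sse
f_regression = (ssr / (n_groups - 1)) / (sse / (n - n_groups))

# One-way ANOVA computed directly from the group means
means = {g: y[group == g].mean() for g in (1, 2, 3)}
ss_between = sum(np.sum(group == g) * (m - y.mean()) ** 2 for g, m in means.items())
ss_within = sum(np.sum((y[group == g] - m) ** 2) for g, m in means.items())
f_anova = (ss_between / (n_groups - 1)) / (ss_within / (n - n_groups))

print(b, f_regression, f_anova)
```

The constant equals the mean of the reference group, and each indicator coefficient is the difference between that group's mean and the reference group's mean.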

I hope this helps!

sarkhani says

Hello Jim

I’d like to know what your suggestions are regarding the choice of regression for prediction when:

- the dependent variable is count data but does not follow a Poisson distribution
- the independent variables include categorical and continuous data

I’d appreciate your thoughts on it.

Thanks!

Jim Frost says

Hi Sarkhani,

Having count data that don’t follow the Poisson happens fairly often. The top alternatives that I’m aware of are negative binomial regression and zero inflated models. I talk about those options a bit in my post about choosing the correct type of regression analysis. The count data section is near the end. I hope this information points you in the right direction!
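
A quick screening check for whether raw counts look Poisson-like can be sketched as follows. It relies on two properties of the Poisson distribution: the variance equals the mean, and the expected share of zeros is exp(-mean). The counts are made up, and the check ignores the independent variables, so treat it as a rough first look rather than a formal test.

```python
import math
import numpy as np

# Hypothetical count data with many zeros and a few large values
counts = np.array([0, 0, 0, 0, 1, 0, 2, 0, 0, 15, 0, 3, 0, 0, 22, 1, 0, 0, 9, 0])

mean = counts.mean()
variance = counts.var(ddof=1)

# A ratio far above 1 suggests overdispersion, which is one situation
# where negative binomial regression is often considered
dispersion_ratio = variance / mean

# Many more zeros than a Poisson distribution with this mean would produce
# is one situation where zero-inflated models are often considered
observed_zero_share = np.mean(counts == 0)
poisson_zero_share = math.exp(-mean)

print(dispersion_ratio, observed_zero_share, poisson_zero_share)
```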

mohamadhosein says

Hi Jim,

I’m really happy to have found your blog.

Arnab Paul says

The independent variable ranges from 0 to 1, and the corresponding dependent variable ranges from 1 to 5. If we apply regression analysis to these data and predict the value of y for any value of x that also ranges from 0 to 1, will the value of y always lie in the range 1 to 5?

Jim Frost says

In my experience, the predicted values can fall outside the range of the actual dependent variable. Assuming that you are referring to actual limits at 1 and 5, the regression analysis does not “understand” that those are hard limits. The extent to which the predicted values fall outside these limits depends on the amount of error in the model.
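
Here is a minimal sketch of that behavior with made-up data. The dependent variable is bounded at 1 and 5 and flattens out at both ends, yet a straight-line fit predicts values beyond those limits even for x values inside the observed 0 to 1 range.

```python
import numpy as np

# Hypothetical data: x runs from 0 to 1, y is bounded between 1 and 5
x = np.linspace(0, 1, 11)
y = np.array([1, 1, 1, 1.5, 2.5, 3.5, 4.5, 5, 5, 5, 5])

# Ordinary least squares line; polyfit returns the slope first for degree 1
slope, intercept = np.polyfit(x, y, 1)
predictions = intercept + slope * x

below = predictions.min() < 1  # prediction under the lower limit
above = predictions.max() > 5  # prediction over the upper limit
print(predictions.min(), predictions.max(), below, above)
```

Whether this happens with real data depends on how the observations are distributed near the limits, so it is worth checking the fitted values rather than assuming they stay in range.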

RAJKUMAR R says

Very Good Explanation about regression ….Thank you sir for such a wonderful post….

Patrik Silva says

Hi Jim, I would like to see you writing something about Cross Validation (Training and test).

Patrik

Lisa says

thank you Jim this is helpful

Jim Frost says

You’re very welcome, Lisa! I’m glad you found it to be helpful!

Yud says

Hello Jim

I’d like to know what your suggestions are regarding the choice of regression for predicting the likelihood of participants falling into one of two categories (low Fear group, coded 1, and high Fear, coded 2) when looking at scores from several variables (e.g., external other locus of control, external social locus of control, internal locus of control, social phobia, and sleep quality).

It was suggested that I break the question up into smaller components. I’d appreciate your thoughts on it. Thanks!

Jim Frost says

Because you have a binary response (dependent variable), you’ll need to use binary logistic regression. I don’t know what types of predictors you have. If they’re continuous, you can just use them in the model and see how it works.

If they’re ordinal data, such as a Likert scale, you can still try using them as predictors in the model. However, ordinal data are less likely to satisfy all the assumptions. Check the residual plots. If including the ordinal data in the model doesn’t work, you can recode them as indicator variables (1s and 0s only, based on whether an observation meets a criterion or not). For example, if you have a scale of -2, -1, 0, 1, 2, you could recode it so observations with a positive score get a 1 while all other scores get a 0.
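
The recoding in that example takes one line. The scores below are made up for illustration.

```python
# Hypothetical responses on a -2 to 2 scale
scores = [-2, -1, 0, 1, 2, 1, -1, 0, 2]

# Indicator variable: 1 for a positive score, 0 for everything else
indicator = [1 if s > 0 else 0 for s in scores]
print(indicator)
```

Note that this kind of recoding discards information (a 1 and a 2 become identical), so it is a fallback rather than a first choice.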

Those are some ideas to try. Of course, what works best for your case depends on the subject area and types of data that you have.

I hope this helps!

Md zishan hussain says

Hello Jim,

I am using stepwise regression to select significant variables in the model for prediction. How do I interpret BIC in variable selection?

regards,

Zishan

Jim Frost says

Hi, when comparing candidate models, you look for models with a lower BIC. A lower BIC indicates that a model is more likely to be the true model. BIC identifies the model that is more likely to have generated the observed data.
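
As a sketch of how that comparison works, the function below computes BIC for least squares models using a common form for normally distributed errors, n·ln(SSE/n) + k·ln(n), which drops constants that are identical across models fit to the same data. The function name `bic_ols` and the dataset are hypothetical, and statistical software may report values that differ by such a constant.

```python
import numpy as np

def bic_ols(y, X):
    """BIC (up to an additive constant) for a least squares fit.
    k counts every estimated coefficient, including the constant."""
    n, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = float(np.sum((y - X @ b) ** 2))
    return n * np.log(sse / n) + k * np.log(n)

# Hypothetical data: y depends linearly on x, plus a small alternating error
x = np.arange(20.0)
y = 3 + 2 * x + np.tile([0.5, -0.5], 10)
ones = np.ones_like(x)

bic_constant_only = bic_ols(y, np.column_stack([ones]))
bic_linear = bic_ols(y, np.column_stack([ones, x]))
bic_quadratic = bic_ols(y, np.column_stack([ones, x, x**2]))
print(bic_constant_only, bic_linear, bic_quadratic)
```

The linear model has the lowest BIC here: the constant-only model fits far worse, and the quadratic model adds a parameter without improving the fit enough to offset the ln(n) penalty.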

Aftab Siddiqui says

Yes, the language of the topic is very easy. I would appreciate it, sir, if you could let me know: if the rank correlation is r = 0.8 and the sum of “D” squared = 33, how do we calculate the number of observations (n)?

Jim Frost says

I’m not sure what you mean by “D” square, but I believe you’ll need more information for that.
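
One possible reading of the question, offered as an assumption because the comment doesn't say so: it refers to Spearman's rank correlation with no tied ranks, where D is the difference between paired ranks and r = 1 − 6ΣD² / (n(n² − 1)). Under that reading, the two numbers are enough to solve for n, as this sketch shows. The function name `spearman_n` is hypothetical.

```python
def spearman_n(r_s, sum_d_squared, max_n=1000):
    """Find the integer n satisfying r_s = 1 - 6*sum(D^2) / (n*(n^2 - 1)),
    assuming Spearman's rank correlation with no ties."""
    target = 6 * sum_d_squared / (1 - r_s)  # equals n * (n^2 - 1)
    for n in range(2, max_n + 1):
        if abs(n * (n**2 - 1) - target) < 1e-6 * target:
            return n
    return None  # no integer n matches the inputs

print(spearman_n(0.8, 33))
```

With r = 0.8 and ΣD² = 33, the equation becomes n(n² − 1) = 990, which n = 10 satisfies.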

Dina says

Hi, Jim!

I’m really happy to have found your blog. It’s really helpful, especially because you use basic English, so non-native speakers can understand it better than most textbooks. Thanks!

Jim Frost says

Hi Dina, you’re welcome! And, thanks so much for your kind words–you made my day!

Nivedan says

Hi Jim!

Can you write on Logistic regression please!

Thank you

Jim Frost says

Hi! You bet! I plan to write about it in the near future!

Farmanullah says

Great work by a great man. It is an easily accessible source for scholars. Sir, I am going to analyse data. Please send me guidelines for selecting the best simple linear or multiple linear regression model. Thanks.

Jim Frost says

Hi, thank you so much for your kind words. I really appreciate it! I’ve written a blog post that I think is exactly what you need. It’ll help you choose the best regression model.

bwbjlt says

such a splendid compilation, Thanks Jim

Jim Frost says

Thank you!

Tobden says

Would you also share some ideas on instrumental variables and the 2SLS method, please?

Jim Frost says

Those are great ideas! I’ll write about them in future posts.