Regression analysis mathematically describes the relationship between a set of independent variables and a dependent variable. There are numerous types of regression models that you can use. This choice often depends on the kind of data you have for the dependent variable and the type of model that provides the best fit. In this post, I cover the more common types of regression analyses and how to decide which one is right for your data.

I’ll provide an overview along with information to help you choose. I organize the types of regression by the different kinds of dependent variable. If you’re not sure which procedure to use, determine which type of dependent variable you have, and then focus on that section in this post. This process should help narrow the choices! I’ll cover regression models that are appropriate for dependent variables that measure continuous, categorical, and count data.

**Related post**: Guide to Data Types and How to Graph Them

## Regression Analysis with Continuous Dependent Variables

Regression analysis with a continuous dependent variable is probably the first type that comes to mind. While this is the primary case, you still need to decide which type of model to use.

Continuous variables are a measurement on a continuous scale, such as weight, time, and length.

### Linear regression

Linear regression, also known as ordinary least squares (OLS) and linear least squares, is the real workhorse of the regression world. Use linear regression to understand the mean change in a dependent variable given a one-unit change in each independent variable. You can also use polynomials to model curvature and include interaction effects. Despite the term “linear model,” this type can model curvature.

This analysis estimates parameters by minimizing the sum of the squared errors (SSE). Linear models are the most common and most straightforward to use. If you have a continuous dependent variable, linear regression is probably the first type you should consider.
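To make the "minimizing the SSE" idea concrete, here is a minimal sketch for the one-predictor case. The data values and the function names (`fit_simple_ols`, `sum_squared_errors`) are made up for illustration; real analyses would normally use statistical software rather than hand-rolled code.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimize the sum of squared errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_errors(x, y, intercept, slope):
    """SSE: squared vertical distances between the data and the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_simple_ols(x, y)
best_sse = sum_squared_errors(x, y, intercept, slope)

# Any other line has a larger SSE than the least squares line.
worse_sse = sum_squared_errors(x, y, intercept, slope + 0.1)
print(intercept, slope, best_sse, worse_sse)
```

The slope is the estimated mean change in the dependent variable for a one-unit change in the independent variable.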

There are some special options available for linear regression.

- **Fitted line plots**: If you have one independent variable and the dependent variable, use a fitted line plot to display the data along with the fitted regression line and essential regression output. These graphs make understanding the model more intuitive.
- **Stepwise regression and Best subsets regression**: These automated methods can help identify candidate variables early in the model specification process.

### Advanced types of linear regression

Linear models are the oldest type of regression. They were designed so that statisticians could do the calculations by hand. However, OLS has several weaknesses, including a sensitivity to both outliers and multicollinearity, and it is prone to overfitting. To address these problems, statisticians have developed several advanced variants:

- **Ridge regression** allows you to analyze data even when severe multicollinearity is present and helps prevent overfitting. This type of model reduces the large, problematic variance that multicollinearity causes by introducing a slight bias in the estimates. The procedure trades away much of the variance in exchange for a little bias, which produces more useful coefficient estimates when multicollinearity is present.
- **Lasso regression** (least absolute shrinkage and selection operator) performs variable selection that aims to increase prediction accuracy by identifying a simpler model. It is similar to Ridge regression, but it can shrink some coefficients all the way to zero, which removes those variables from the model.
- **Partial least squares (PLS) regression** is useful when you have very few observations compared to the number of independent variables or when your independent variables are highly correlated. PLS reduces the independent variables to a smaller number of uncorrelated components, similar to Principal Components Analysis. Then, the procedure performs linear regression on these components rather than the original data. PLS emphasizes developing predictive models and is not used for screening variables. Unlike OLS, you can include multiple continuous *dependent* variables. PLS uses the correlation structure to identify smaller effects and model multivariate patterns in the dependent variables.
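The bias-for-variance trade in ridge regression can be sketched with two nearly collinear predictors. This is an illustration under stated assumptions, not a production implementation: the data are invented, the variables are only centered (software usually standardizes them as well), and `ridge_two_predictors` is a hypothetical helper that solves the penalized normal equations for exactly two predictors. A penalty of zero reproduces OLS.

```python
def center(values):
    """Subtract the mean so that no intercept term is needed."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def ridge_two_predictors(x1, x2, y, penalty):
    """Solve (X'X + penalty * I) b = X'y for two centered predictors."""
    x1, x2, y = center(x1), center(x2), center(y)
    a11 = sum(a * a for a in x1) + penalty
    a22 = sum(b * b for b in x2) + penalty
    a12 = sum(a * b for a, b in zip(x1, x2))
    c1 = sum(a * t for a, t in zip(x1, y))
    c2 = sum(b * t for b, t in zip(x2, y))
    det = a11 * a22 - a12 * a12
    return (a22 * c1 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det


# Hypothetical data: x2 is almost a copy of x1 (severe multicollinearity).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
y = [2.2, 3.9, 6.1, 8.2, 9.8, 12.1]

ols_coefs = ridge_two_predictors(x1, x2, y, penalty=0.0)
ridge_coefs = ridge_two_predictors(x1, x2, y, penalty=1.0)
print("OLS:  ", ols_coefs)
print("Ridge:", ridge_coefs)
```

With collinear predictors, OLS tends to assign the shared effect erratically between the two variables, while the penalty pulls the coefficients toward smaller, more stable values. The penalty value of 1.0 is arbitrary here; in practice it is usually chosen by cross-validation.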

### Nonlinear regression

Nonlinear regression also requires a continuous dependent variable, but it provides a greater flexibility to fit curves than linear regression.

Like OLS, nonlinear regression estimates the parameters by minimizing the SSE. However, nonlinear models use an iterative algorithm rather than the linear approach of solving them directly with matrix equations. What this means for you is that you need to worry about which algorithm to use, specifying good starting values, and the possibility of either not converging on a solution or converging on a local minimum rather than a global minimum SSE. And, that’s in addition to specifying the correct functional form!
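The iterative search described above can be sketched with the Gauss-Newton algorithm, one common choice among several. Everything specific here is an assumption made for illustration: the functional form y = a·exp(b·x), the noise-free data generated from a = 2 and b = 0.5, the starting values, and the simple step-halving safeguard.

```python
import math


def sse(xs, ys, a, b):
    """Sum of squared errors for the model y = a * exp(b * x)."""
    return sum((y - a * math.exp(b * x)) ** 2 for x, y in zip(xs, ys))


def gauss_newton(xs, ys, a, b, max_iter=50, tol=1e-12):
    """Iteratively refine (a, b) from the starting values to reduce the SSE."""
    for _ in range(max_iter):
        s11 = s12 = s22 = g1 = g2 = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(b * x)
            resid = y - a * e
            da = e              # derivative of the model with respect to a
            db = a * x * e      # derivative of the model with respect to b
            s11 += da * da
            s12 += da * db
            s22 += db * db
            g1 += da * resid
            g2 += db * resid
        det = s11 * s22 - s12 * s12
        if det == 0.0:
            break
        step_a = (s22 * g1 - s12 * g2) / det
        step_b = (s11 * g2 - s12 * g1) / det
        # Halve the step until it no longer increases the SSE.
        current = sse(xs, ys, a, b)
        scale = 1.0
        while scale > 1e-8 and sse(xs, ys, a + scale * step_a, b + scale * step_b) > current:
            scale /= 2.0
        a += scale * step_a
        b += scale * step_b
        if abs(scale * step_a) + abs(scale * step_b) < tol:
            break
    return a, b


# Hypothetical noise-free data generated from a = 2, b = 0.5.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

a_hat, b_hat = gauss_newton(xs, ys, a=1.5, b=0.4)   # starting values are guesses
print(a_hat, b_hat)
```

Unlike the direct OLS solution, the result depends on the starting values. Poor starting values can lead to non-convergence or to a local minimum, which is why they deserve attention.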

Most nonlinear models have one continuous independent variable, but it is possible to have more than one. When you have one independent variable, you can graph the results using a fitted line plot.

My advice is to fit a model using linear regression first and then determine whether the linear model provides an adequate fit by checking the residual plots. If you can’t obtain a good fit using linear regression, then try a nonlinear model because it can fit a wider variety of curves. I always recommend that you try OLS first because it is easier to perform and interpret.

I’ve written quite a bit about the differences between linear and nonlinear models. Read the following posts to learn the differences between these two types, how to choose which one is best for your data, and how to interpret the results.

- What is the Difference Between Linear and Nonlinear Models?
- How to Choose Between Linear and Nonlinear Regression?
- Curve Fitting with Linear and Nonlinear Regression

## Regression Analysis with Categorical Dependent Variables

So far, we’ve looked at models that require a continuous dependent variable. Next, let’s move on to categorical dependent variables. A categorical variable has values that you can put into a countable number of distinct groups based on a characteristic. Logistic regression transforms the dependent variable and then uses Maximum Likelihood Estimation, rather than least squares, to estimate the parameters.

Logistic regression describes the relationship between a set of independent variables and a categorical dependent variable. Choose the type of logistic model based on the type of categorical dependent variable you have.

### Binary Logistic Regression

Use binary logistic regression to understand how changes in the independent variables are associated with changes in the probability of an event occurring. This type of model requires a binary dependent variable. A binary variable has only two possible values, such as pass and fail.

**Example:** Political scientists assess the odds of the incumbent U.S. President winning reelection based on stock market performance.

Read my post about a binary logistic model that estimates the probability of House Republicans belonging to the Freedom Caucus.
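For readers who want to see the mechanics, here is a minimal sketch of fitting a binary logistic model with one predictor by maximum likelihood, using Newton-Raphson iterations. The pass/fail data, the variable names, and the fixed iteration count are assumptions for illustration; statistical software handles convergence checks and standard errors that this sketch omits.

```python
import math


def fit_logistic(xs, ys, iterations=25):
    """Maximum likelihood fit of P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x)))."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p             # score for the intercept
            g1 += (y - p) * x       # score for the slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical data: hours studied and whether the student passed (1) or failed (0).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(hours, passed)
odds_ratio = math.exp(b1)   # multiplicative change in the odds per extra hour
fitted = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in hours]
print(b0, b1, odds_ratio)
```

The exponentiated slope is the odds ratio: how the odds of the event change for a one-unit increase in the independent variable.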

### Ordinal Logistic Regression

Ordinal logistic regression models the relationship between a set of predictors and an ordinal response variable. An ordinal response has at least three groups which have a natural order, such as hot, medium, and cold.

**Example:** Market analysts want to determine which variables influence the decision to buy large, medium, or small popcorn at the movie theater.
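One common form of ordinal logistic regression is the proportional odds model, which works with cumulative probabilities. The sketch below assumes that form and uses invented cut points and an invented coefficient for a hypothetical "hunger" predictor; sign conventions differ between software packages, so treat it as an illustration of the idea rather than any package's parameterization.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def ordinal_probabilities(linear_predictor, thresholds):
    """Category probabilities under a proportional odds model.

    P(Y <= j) = logistic(threshold_j - linear_predictor). Each category's
    probability is the difference between adjacent cumulative probabilities.
    """
    cumulative = [logistic(t - linear_predictor) for t in thresholds] + [1.0]
    probs = []
    previous = 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs


sizes = ["small", "medium", "large"]
thresholds = [-0.5, 1.0]    # hypothetical cut points, must be increasing
slope_hunger = 0.8          # hypothetical coefficient

not_hungry = ordinal_probabilities(slope_hunger * 0.0, thresholds)
very_hungry = ordinal_probabilities(slope_hunger * 2.0, thresholds)
print(dict(zip(sizes, not_hungry)))
print(dict(zip(sizes, very_hungry)))
```

Because the categories are ordered, a single coefficient shifts probability toward the higher categories as the predictor increases.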

### Nominal Logistic Regression

Nominal logistic regression models the relationship between a set of independent variables and a nominal dependent variable. A nominal variable has at least three groups which do not have a natural order, such as scratch, dent, and tear.

**Example**: A quality analyst studies the variables that affect the odds of the type of product defects: scratches, dents, and tears.

## Regression Analysis with Count Dependent Variables

If your dependent variable is a count of items, events, results, or activities, you might need to use a different type of regression model. Counts are nonnegative integers (0, 1, 2, etc.). Count data with higher means tend to be normally distributed and you can often use OLS. However, count data with smaller means can be skewed, and linear regression might have a hard time fitting these data. For these cases, there are several types of models you can use.

### Poisson regression

Count data frequently follow the Poisson distribution, which makes Poisson Regression a good possibility. Poisson variables are a count of something over a constant amount of time, area, or another consistent length of observation. With a Poisson variable, you can calculate and assess a rate of occurrence. A classic example of a Poisson dataset is provided by Ladislaus Bortkiewicz, a Russian economist, who analyzed annual deaths caused by horse kicks in the Prussian Army from 1875 to 1894.

Use Poisson regression to model how changes in the independent variables are associated with changes in the counts. Poisson models are similar to logistic models because they use Maximum Likelihood Estimation and work on a transformed scale: they model the natural log of the expected count. Poisson models can also be suitable for rate data, where the rate is a count of events divided by a measure of that unit’s *exposure* (a consistent unit of observation), such as homicides per month.

**Example**: An analyst uses Poisson regression to model the number of calls that a call center receives daily.
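A minimal sketch of a Poisson regression fit with one predictor follows. It assumes the usual log link, log(expected count) = b0 + b1·x, and maximizes the likelihood with Newton-Raphson steps. The counts, the predictor, and the fixed number of iterations are all made up for illustration.

```python
import math


def fit_poisson(xs, ys, iterations=25):
    """Maximum likelihood fit of log(expected count) = b0 + b1 * x."""
    b0 = math.log(sum(ys) / len(ys))   # start from the overall mean count
    b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            mu = math.exp(b0 + b1 * x)
            g0 += y - mu
            g1 += (y - mu) * x
            h00 += mu
            h01 += mu * x
            h11 += mu * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical data: a predictor and an observed count for each unit.
xs = [0, 1, 2, 3, 4, 5]
counts = [1, 2, 2, 4, 6, 9]

b0, b1 = fit_poisson(xs, counts)
rate_ratio = math.exp(b1)   # multiplicative change in the expected count per unit of x
expected = [math.exp(b0 + b1 * x) for x in xs]
print(b0, b1, rate_ratio)
```

Because the model is linear on the log scale, the exponentiated coefficient is interpreted as a multiplicative effect on the expected count.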

### Alternatives to Poisson regression for count data

Not all count data follow the Poisson distribution because this distribution has some stringent restrictions. Fortunately, there are alternative analyses you can perform when you have count data.

**Negative binomial regression**: Poisson regression assumes that the variance equals the mean. When the variance is greater than the mean, your model has overdispersion. A negative binomial model, also known as NB2, can be more appropriate when overdispersion is present.
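A quick, rough screen for overdispersion is to compare the variance of the counts with their mean. This sketch uses invented counts and ignores the independent variables, so it is only a first look. A fuller check would assess dispersion after fitting the model.

```python
from statistics import mean, variance

# Hypothetical counts, for illustration only.
counts = [0, 0, 1, 1, 2, 3, 5, 8, 13, 0]

count_mean = mean(counts)
count_variance = variance(counts)          # sample variance
dispersion_ratio = count_variance / count_mean

# A ratio well above 1 hints that the variance exceeds the mean.
print(count_mean, count_variance, dispersion_ratio)
```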

**Zero-inflated models**: Your count data might have too many zeros to follow the Poisson distribution. In other words, there are more zeros than the Poisson regression predicts. Zero-inflated models assume that two separate processes work together to produce the excessive zeros. One process determines whether there are zero events or more than zero events. The other is the Poisson process that determines how many events occur, some of which can be zero. An example makes this clearer!

Suppose park rangers count the number of fish caught by each park visitor as they exit the park. A zero-inflated model might be appropriate for this scenario because there are two processes for catching zero fish:

- Some park visitors catch zero fish because they did not go fishing.
- Other visitors went fishing, and some of these people caught zero fish.
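The two processes can be written as a zero-inflated Poisson probability function. In this sketch, the share of visitors who did not fish (`p_no_fishing`) and the average catch among those who did (`mean_catch`) are invented numbers, and the function names are hypothetical.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k events under a Poisson distribution."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def zero_inflated_poisson_pmf(k, rate, p_structural_zero):
    """Mix a point mass at zero with an ordinary Poisson count process."""
    base = (1.0 - p_structural_zero) * poisson_pmf(k, rate)
    return p_structural_zero + base if k == 0 else base


p_no_fishing = 0.4   # hypothetical share of visitors who never fished
mean_catch = 1.5     # hypothetical Poisson mean for visitors who fished

p_zero_zip = zero_inflated_poisson_pmf(0, mean_catch, p_no_fishing)
p_zero_poisson = poisson_pmf(0, mean_catch)
total = sum(zero_inflated_poisson_pmf(k, mean_catch, p_no_fishing) for k in range(60))
print(p_zero_zip, p_zero_poisson)
```

The zero-inflated probability of catching nothing combines both sources of zeros, so it exceeds what a plain Poisson model with the same mean would predict.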

Whew! That’s many different types of regression analysis! If you’re trying to figure out which one to choose, I hope you will use this information to point yourself in the right direction!

If you’re learning regression, check out my Regression Tutorial!

roy hampton says

Great post Jim. I really like the way you explain the different types of regression.

Jim Frost says

Thank you, Roy! I’m glad that you found it helpful!

Nicol says

Technically, regression examines a relationship between predictor and response variables. I wish people will stop using IV and DV incorrectly. There’s nothing the researchers are manipulating in your examples either.

Jim Frost says

Hi Nicol,

Predictor and response variables are synonyms for independent and dependent variables, respectively. You can use them interchangeably. Also, you’re correct that none of the examples have researchers setting the values for the independent (predictor) variables. However, that’s just fine in regression analysis. These examples are observational studies where you measure data and observe the relationships.

You can also use regression analysis in designed experiments where you use random assignment and the researchers set the values of the experimental variables. The designed experiment approach is particularly good when you want to establish causality (rather than just correlation) and it helps rule out confounding variables. However, this type of experiment isn’t always feasible, and it’s OK to use observational studies as long as you are aware of the limitations and potential problems.

Thanks for writing!

Jim

Mukesh Bishnoi says

Very knowledgeable points

Jim Frost says

Thank you, Mukesh!

Abhishek Singh (@abhi121289) says

Very intuitive. Loved the way you explained it. Thanks Jim.

Jim Frost says

Thank you, Abhishek! I really appreciate the kind words and I’m glad you found it to be helpful!

John Petroda says

For the count example (number of calls an analyst receives daily), curious about using Log transformation of the dependent count variable and using random forest on that?

Would that work?

Thank you…

Jim Frost says

Hi John, unfortunately I’m not overly familiar with random forest models. That’s something I should learn more about!

Renee Sartin says

Hi. I am a student, and I am having grave difficulty in determining what types of variables I have for my study. (still in the learning phase). This is my problem statement. It is not known if and to what extent a positive correlation exists between organizational commitment of supervisors and practicum success among students, and whether student intrinsic motivation moderates the relationship.

Please, if you were me what analysis would you use and why. And to your best knowledge what types of variables are these? I look so forward to receiving your response.

Jim Frost says

Hi Renee, most likely you are working with either continuous or ordinal variables. To determine which type of variable, check out my glossary definitions for both:

Continuous variables

Ordinal variables

For pairs of continuous variables, you can use the Pearson correlation. Be sure to create a scatterplot and determine whether the relationship is linear.

For pairs of ordinal variables, you can use Spearman’s correlation.

Best of luck!

nasim says

hi, i am a student and i have a problem, i want to predict bankruptcy in IRAN . and i want to use LASSO regression to choose more effective independent variables, i select dependent variable y(0 , 1), and i have 50 independent variables that are financial ratios , and i do analysis on Spss, but i have many problem with result, so i have a main question, can i use lasso to predictive with 0 and 1 dependent variable, can i use Spss to do it?

thank you Jim.

Jim Frost says

Hi Nasim, I haven’t done this myself but apparently it is possible. I recommend that you read this about using Lasso with logistic regression. This example uses R, but I’m not sure about SPSS.

Shiji says

Hai Jim,

It is very informative. I found it very useful for the researcher. I have a doubt in my study, i wish test the relationship between domestic tourists and foreign tourists. when we look at the total number the same trend is observed by the two . so I wish to know which method can be used to prove that the pattern of change of domestic and total are the same or the movement of total tourist is same as the domestic.

thanking you

Shiji

Jim Frost says

Hi Shiji, I’m not 100% sure I understand what you are studying. However, it sounds like you might need to include one or more interaction terms in your model to determine whether the relationships between your independent variables and dependent variables depend on whether a tourist is a domestic or foreign tourist. I write about comparing regression lines in an article. Read that article and, in the graphs where I show the regression lines for two different groups, imagine that one group represents domestic tourists and the other represents foreign tourists. That might be what you’re looking for. I hope this helps!

CMO says

Hi Jim,

Thanks for posting this.

I would appreciate your thoughts on my analyses. I have an independent variable that is a count variable (number of days at work). My dependent variables are all continuous variables. Can I use a simple linear regression model to test a moderated mediation relationship with the IV as a count variable?

Thanks!

Jim Frost says

Hi, I’d give the model a try but check the residual plots to be sure that the model satisfies the assumptions. If you’re fitting just the one independent variable, you can use a fitted line plot and really just see at a glance if it provides a good fit. I show an example early in this post.

Raof says

Thanks Jim for this informative Blog

I want to examine the influence of predictor variables such as Physical activity (low, moderate,high), sedentary time and dietary habits ( fruits, vegetables, junk food etc.) on a dependent variable BMI, collapsed into lower level ordinal categories like underweight, normal, overweight and obese. If I have to see the odds of being overweight/obese for a person based on these behavioural practices. What would be the appropriate regression analysis. Or am i required to dichotomize (1.underweight/normal and 2.overweight/ obese) my dependent variable and use binary logistic regression. Your views will be much appreciated.

Jim Frost says

Hi Raof,

It sounds like you need to use Ordinal Logistic Regression. Your dependent variable is an ordinal variable. Unfortunately, I don’t have a good example of this type of regression to share with you, but it can do what you describe.

The one problem I see is that you also have an ordinal predictor (physical activity–high, medium, and low). That can be problematic. You can try to fit the model and check the residuals to see if you satisfy the assumptions. If it doesn’t work, you can try converting those three levels to two indicator variables. Indicator variables are binary variables where you have one for each level–however you need to leave one out of your model (e.g. High, Moderate). You need to leave one level out for the analysis to run so I intentionally didn’t include Low–but you can leave any level out.
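As a rough illustration of the indicator-variable coding described above, the sketch below turns a three-level predictor into two 0/1 columns and leaves out one level as the reference. The observations and the `indicator_columns` helper are hypothetical; most statistical software does this coding automatically.

```python
def indicator_columns(values, levels, reference):
    """Return {level: [0 or 1, ...]} for every level except the reference."""
    return {
        level: [1 if v == level else 0 for v in values]
        for level in levels
        if level != reference
    }


# Hypothetical observations of physical activity.
activity = ["Low", "High", "Moderate", "Low", "High"]

coded = indicator_columns(activity, ["Low", "Moderate", "High"], reference="Low")
print(coded)
```

Each coefficient on an indicator column is then interpreted relative to the omitted level.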

But, for your ordinal response variable, use ordinal logistic regression.

I hope this helps!

Pankaj Kumar says

Hello Mr. Jim

I hope you are doing very well.

I am in confusion in the testing of regression analysis. Well, as we read in basic Statistics that F test is a two tailed test whereas when we use F test in testing of regression analysis then we always treat it as a one tailed test. Why so?

Thanks

Pankaj

Jim Frost says

Hi Pankaj,

That’s a great question. As it turns out, for regression and ANOVA, the F-test is always a one-tailed test. The F-test tests the ratio of two variances (technically mean squares rather than variances). In regression and ANOVA, it’s a one-tailed test because of the nature of what you’re testing. In One-Way ANOVA, you’re determining whether the between group variance is greater than the within group variance. In regression, you want to determine whether the model with all of your predictors is better than the model with no predictors (only the constant). Those are one-tailed tests by the definition of how the hypotheses are specified–you are determining whether one variance is significantly larger than the other variance.

To see how the F-test works in detail I suggest you read my post about the F-test and One-Way ANOVA. Regression analysis uses the F-test in a similar way but changes the variances in the ratio. You’re testing the model with all of your predictors compared to the model with no predictors (just the constant). You can also read my post about the F-test of overall significance.

You do use two-tailed F-tests for Variance Tests. In this case, you require the ability to determine whether the variance in the numerator is larger than or less than the variance in the denominator. You’re testing both directions (larger and smaller), hence it’s two-tailed.
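To show where the regression F-value comes from, here is a small sketch that computes the overall F-statistic as a ratio of two mean squares. The observed and fitted values are invented, and the sketch stops at the statistic itself. Converting it to a p-value requires the upper tail of the F-distribution, which is not computed here.

```python
def overall_f_statistic(y, fitted, n_predictors):
    """F = (regression mean square) / (error mean square)."""
    n = len(y)
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # unexplained variation
    sst = sum((yi - mean_y) ** 2 for yi in y)                # constant-only model
    ms_regression = (sst - sse) / n_predictors
    ms_error = sse / (n - n_predictors - 1)
    return ms_regression / ms_error


# Hypothetical observed values and fitted values from a one-predictor model.
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fitted = [2.04, 4.03, 6.02, 8.01, 10.00]

f_value = overall_f_statistic(y, fitted, n_predictors=1)
print(f_value)
```

Only large F-values count as evidence against the constant-only model, which is why the test looks at the upper tail alone.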

I hope this helps!

Hassan Elkatawneh says

That is very helpful, but did not answer my own need. If you can advice my, I have 2 IV and one DV, in addition I have one moderator variable. What is the best test, all variables are ratio scale? thanks in advance for your help

Jim Frost says

Moderator variables are commonly used in psychology–which isn’t my field. However, from my understanding, they are essentially interaction effects. That is, the effect between an independent variable and a dependent variable depends on the value of another variable. To fit this type of model, you can use OLS multiple regression. You just need to include the appropriate interaction term in the model. For more information, read my post about understanding interaction effects.
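A small numeric sketch may help show what an interaction term does. The coefficients below are invented rather than estimated from data. The point is only that, once the product term is in the model, the slope for one variable changes with the value of the moderator.

```python
def predicted_outcome(x, moderator, b0, b1, b2, b3):
    """Linear model with an interaction: b0 + b1*x + b2*m + b3*(x*m)."""
    return b0 + b1 * x + b2 * moderator + b3 * x * moderator


def slope_of_x(moderator, b1, b3):
    """Effect of a one-unit change in x at a given moderator value."""
    return b1 + b3 * moderator


# Hypothetical coefficients, for illustration only.
coef = {"b0": 1.0, "b1": 0.5, "b2": 0.2, "b3": 0.3}

slope_low = slope_of_x(0.0, coef["b1"], coef["b3"])    # moderator = 0
slope_high = slope_of_x(2.0, coef["b1"], coef["b3"])   # moderator = 2
change = predicted_outcome(4.0, 2.0, **coef) - predicted_outcome(3.0, 2.0, **coef)
print(slope_low, slope_high, change)
```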

Ahmad says

Hi Jim

Im student, have problem with how can choose which regression model i need to use in my case.

i have many variables with one response like mix design variables and the response is compressive strength of concrete

Jim Frost says

Hi Ahmad, choosing the best regression model is a very important task. In statistics, we call that process model specification. I’ve written an entire blog post about it that will help you. Model Specification: Choosing the Correct Regression Model

Best of luck with your analysis!

Sebastian says

Hello Mr. Frost,

first of all great website. Wish I knew the existence back when I was in my bachelors studies. My question is concerned with log-linear models and binary variables. I developed a model for a thesis that looks like this:

log y_t – log y_t-1 = beta_0 + beta_1 A + beta_2 B + u. The dependent variable is the percentage change of the Treasury yield and A and B are binary events like FOMC meetings. Is this example considered a log-linear regression model? Thanks in advance.

CMO says

Thank you, Jim. This is helpful

Jim Frost says

Hi, I’m so glad that is helpful!

Antoine says

Hey Jim,

Thank you for this post, really like the way you explain things.

I am working on a project where I am assessing the relation between discriminative attitude and healthcare provision in health care workers:

– Discriminative attitude: is the independent variable and will be measured using a series of 10 scaled question (scaled from 1 to 5). In that way any respondent will have a score somewhere between 1 and 5, hence i am assuming it is a continuous ordered variable right?

– Healthcare provision: is the dependent variable and will also be measured using a series of 10 scaled question (scaled from 1 to 4) – similarly to the independent variable, i am assuming this is a continuous ordered variable.

In your opinion, what analytical model would be most suitable for that purpose?

Thanks!

Kai says

This is fantastic! I’m a 3rd year statistics major at university, and it is so refreshing having this overview of regression set out in such a clear way. Major kudos for all your work!

Jim Frost says

Thank you, Kai!

Sandeep says

Sir which model is best for stock market short time prediction

Jim Frost says

Hi Sandeep, ah, I get asked that question many times. And, if I knew the answer to that one, I’d be so rich that I’d have more money than I’d know what to do with! The fact is that the stock market is fairly unpredictable. If you could predict it in the short term, then everyone would know exactly where to put their money at any given point. And, then the advantage is gone. So, it doesn’t work that way.

Prashant Dey says

Hey Jim,

Thanks a lot.

This is very good explanation of regression techniques.

This post will gain another boost if a flow chart or map for choosing the right technique is provided.

Just a suggestion.

Thank You again!

This is really helpful.

Lin says

Hi Jim,

I am currently doing a research on behavior pattern in Peru.

My dependent variable are binary but my independent variables is a mix between binary and continuous variables . I have to use data from previous round to predict the later round . For example dependent variable is smokes at age 15 which is binary and some of the independent variables are mathscore standardize at age 12 (continuous) and drinks alcohol at age 12 (binary). I also think that there is also an endogeneity problem in this setting. Hence, I do not know what regression in STATA is best for this kind problems? Also how do i solve the endogenous problem ?

I thought of using linear probability model, since it seems the easiest but I don’t think this is the best method .

Thanks in advance

Kind regard,

J.zhong

SK says

Hi Jim,

Really nice article

I am stuck in a problem where i have to do regression but I am unable to decide with which regression model i should proceed. To give you a background, I have sales (dependent variable) and lets say a, b, c,d and e are independent variable(The sales is triggered by these independent variable). My objective is to find the importance of each independent variable, so that we can prioritize on that channel. Now the values in independent variable can be either “Open”, “close” or blank. I thought of using Binary regression model but here i have three type of values but Binary takes only two.

Pls suggest

Thanks in advance

SK

Jim Frost says

Hi SK,

Assuming that sales is a continuous dependent variable, you do not want to use binary logistic regression or other specialized type. Binary logistic is for cases where the DV is binary. You’re talking about independent variables.

You have categorical independent variables, which you can include in linear regression. Most statistical software will code those as indicator variables automatically. Does “blank” represent a missing value or is that an actual value for the IVs? If you have “Open”, “Close”, and missing values, you’d just need one indicator variable for each IV. The indicator variable could be something like Open_A, which is a 1 if variable A = Open and 0 if variable A = Close. Repeat for the other channels. But, again, most software will do that automatically.

Finding the relative importance of each IV is a separate matter that I write about: Identifying the Most Important IVs

I hope this helps!

WW says

Hi Jim, Thanks for this post. It clears the regression clouds haunted me for a loooong time! It is really helpful! Statistics had been my nightmare since Uni but I guess no more. Look forward to your next posts!

Jim Frost says

Hi WW, I’m so happy to read your comment! I strive to make these posts as easy and intuitive to understand as possible. It makes my day to read that they’ve helped you!

Stay tuned, I am writing more posts but taking a short break at the moment.

Anudeep Venkata says

Hi Jim,

I read your blog on regression and it was lucid. But I am confused about the regression testing with the model with below mentioned variables

I have control and test samples which have discrete quantitative variables.

If I have to model this with another independent variables like gender (dummy variable or independent), and one ordinal variable.

Since I have my control and test with other factors like gender and housing. I want to determine the link and analyze the effect of Housing and gender on my control and test samples.

Can you please suggest which regression model would be appropriate. Is that Binary logistic or Multi nominal logistic regression??

Thank you in advance.

Jim Frost says

Hi Anudeep,

It really depends on what type of dependent variable you have. Can you clarify the nature of your dependent variable?

Lokesh Gupta says

Hi Jim,

Is there any way we can mention a categorical dependent variable to be ordinal variable before passing into the logistic regression model as it might provide extra edge to the model output

Jim Frost says

Assuming that I understand your question correctly, if you have an ordinal dependent variable, you should use ordinal logistic regression to analyze your data.

Maro says

Hi Jim – Thank you so much for this clear post! This is very helpful.

I have a question. My dependent variable is categorical (3 categories). when I tried nominal logistic regression using minitab, the model didn’t converge. After some research I found out that my data has a collinearity problem making it difficult for the nominal logistic model to converge.

Instead, I tried both linear regression and partial least squares, using minitab, on the same data and the results seem reasonable. My question is, can I use linear or PLS regression if my response variable is categorical? or do I have to do nominal logistic regression?

Thanks again!

Jim Frost says

Hi Maro,

Thanks for writing with the great question. Unfortunately, when you have a true categorical variable, you cannot treat it as a numeric variable. Suppose you have three groups. You can label each one with numbers: 1, 2, 3. However, that doesn’t mean you can analyze them as numbers. Those numbers might represent: scratch, dent, and tear. Or maybe gold, silver and bronze. The numeric labels don’t measure/represent the actual characteristic that the groups are based on. To illustrate this, the value 2 doesn’t indicate that it is exactly twice the value of 1. The numbers also suggest a logical order to the groups that just doesn’t exist (otherwise you’d be using ordinal logistic regression).

Consequently, you can’t use linear or PLS regression. I don’t know what your model is or your other variables, but if you have only categorical variables, you can try the chi-square test of independence to look for relationships among categorical variables. Otherwise, I think nominal logistic regression is your best bet. To address the collinearity, you might try removing or linearly combining variables and using them in nominal logistic regression. By linearly combining them yourself, you’re incorporating some aspects of PLS into nominal logistic regression.

I hope this helps!

Maro says

Hi Jim

Thank you so much! This is very helpful. Here’s more information on what I’m trying to do.

The problem:

I’m studying the impact of adopting technological capabilities (independent variables) on teams performance (dependent variables) in IT.

The approach:

Identify significant performance clusters between teams, and understand how the adoption of technological capabilities impact teams performance clusters.

My dataset consists of:

1) 17 independent variables, one variable is ordinal (1 to 7 scale), 7 variables are categorical (true/false), 9 variables are continuous/numbers.

2) 3 dependent variables, these are continuous/numbers.

3) Dataset size is 36 subjects.

My analysis is two steps:

1) Run the 3 dependent variables (performance measures) through clustering algorithm and find out if there are significant clusters. This test was successful and I found 3 significant clusters (high, medium, low).

2) Now I want to test the influence of the 17 independent variables (technological capabilities) on the 3 clusters. I planned to use multinomial regression but it didn’t work due to the issue mentioned in the earlier post.

My questions:

1) Given the number of independent variable (17), is there a recommended data size for the multinomial logistic regression to work successfully? How many subject can be good enough?

2) Since my clusters consist of the 3 dependent variable, I’m thinking of testing the impact of the 17 independent variables on the 3 dependent variables that make the clusters instead of the clusters themselves using PLS or multiple linear regressions? This is still a workaround but I may consider it just in case my logistic regression model continues to fail.

3) Any other recommendations?

Alternatives:

I checked the linear discriminant function and it seems promising. I think the problem is I don’t know how to interpret its results to find out how the independent variables influence the 3 clusters. I’m not planning to build a prediction model with either logistic regression or discriminant function, I only need the “inferential” piece not the “prediction” piece since I just want to understand the influence of the independent variables on the cluster not interested in building a prediction model.

Your help is much appreciated! Thanks again!

Tony says

Hey Jim! Once I’ve trained a logistic model and know which predictors are important, is there a way that I can define an optimal range for my input variables? For example if I’m trying to adjust three settings on a machine to minimize my probability of introducing a defect, how could I use my coefficients from a logistic model to decide what the mean setting should be for all three to maximize probability of no defects? Thanks!

Jim Frost says

Hi Tony,

There are ways to optimize your inputs. The process for doing this depends on the statistical software package that you are using. So, it’s hard for me to give any practical advice about it. Essentially the process takes the model that you settled on and then uses an optimization routine to determine which input values optimize the output. So, it’s a separate process from the model specification and fitting process–although your software might tie them together. Typically, you can specify whether you want to obtain a specific target response value, minimize the response value, or maximize the response value. In your case, you probably want to minimize the probability of defects.

Another approach is to perform a Monte Carlo simulation where you generate random data for the input values in your regression equation. The data for each input follow a distribution that you specify. You input these randomly generated data in your regression equation, which produces a distribution of outputs that you can then study. Additionally, you can change the distributions of the inputs to determine how that affects the distribution of the outputs. That allows you to answer “what if” types of questions about changing the inputs.
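The Monte Carlo idea can be sketched in a few lines. Everything specific below is an assumption for illustration: the logistic coefficients, the two machine settings (temperature and pressure), and the normal distributions chosen for them. In practice the coefficients would come from the fitted model.

```python
import math
import random


def defect_probability(temperature, pressure, coef):
    """Predicted probability of a defect from a (hypothetical) logistic model."""
    z = (coef["intercept"]
         + coef["temperature"] * temperature
         + coef["pressure"] * pressure)
    return 1.0 / (1.0 + math.exp(-z))


def simulate(mean_temp, sd_temp, mean_pressure, sd_pressure, coef, draws=10_000, seed=1):
    """Draw random input settings and return the resulting defect probabilities."""
    rng = random.Random(seed)
    results = []
    for _ in range(draws):
        t = rng.gauss(mean_temp, sd_temp)
        p = rng.gauss(mean_pressure, sd_pressure)
        results.append(defect_probability(t, p, coef))
    return results


# Hypothetical coefficients, not estimated from any data.
coef = {"intercept": -6.0, "temperature": 0.02, "pressure": 0.5}

baseline = simulate(200.0, 5.0, 3.0, 0.3, coef)
cooler = simulate(180.0, 5.0, 3.0, 0.3, coef)   # "what if" we lower the temperature?

mean_baseline = sum(baseline) / len(baseline)
mean_cooler = sum(cooler) / len(cooler)
print(mean_baseline, mean_cooler)
```

Comparing the two output distributions answers one "what if" question about changing the inputs. Changing the standard deviations instead would show how variability in the settings carries through to the defect probability.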

I hope this helps!

Kirti says

Hi. Thanks for this post. It is very informative.

I am a student. I have a dataset with 300,00 rows and 77 columns. How do I approach the data?

Also I have to do some predictive analysis. My independent variables are a mix of continuous and nominal categorical variables and my dependent variable is continuous. Which regression model should I use?

Jim Frost says

Hi Kirti, with a few exceptions, the type of regression analysis you should use doesn’t depend on the size of the dataset and number of variables. Usually, it’s the type of variables that you have.

For your case, I’d start with multiple linear regression. See if you can get a good fit to your data using that procedure.