What is Logistic Regression?
Logistic regression statistically models the probabilities of categorical outcomes, which can be binary (two possible values) or have more than two categories. These models use a linear combination of independent variables to show how each variable relates to the likelihood of the outcomes and to predict those outcomes for specific conditions you enter into the model.
Analysts often use a logistic regression model when the goal is to assess event outcomes, such as the following examples:
- Whether a customer will purchase a product (Yes/No).
- Whether a patient has a particular disease (Positive/Negative).
- Whether a loan applicant will default (Default/No Default).
Logistic regression determines which independent variables have statistically significant relationships with the categorical outcome. For example, in the loan default model, logistic regression can assess the likelihood of default based on factors such as income, credit score, and loan amount, helping predict future defaults.
Unlike linear regression, logistic regression focuses on predicting probabilities rather than direct values. It models how changes in independent variables affect the odds of an event occurring.
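To make that concrete, the odds of an event are the probability it occurs divided by the probability it doesn’t. Here’s a quick illustrative sketch in Python (the 0.75 probability is an arbitrary value chosen for the example):

```python
# Convert between probability and odds.
def probability_to_odds(p):
    """Odds = chance the event happens relative to it not happening."""
    return p / (1 - p)

def odds_to_probability(odds):
    return odds / (1 + odds)

p = 0.75
odds = probability_to_odds(p)  # 0.75 / 0.25 = 3.0, i.e., "3 to 1"
print(f"p = {p} -> odds = {odds:.1f}")
print(f"odds = {odds:.1f} -> p = {odds_to_probability(odds):.2f}")
```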
Later in this post, we’ll perform a logistic regression and interpret the results, highlighting what you can learn!
Learn more about Independent vs. Dependent Variables: Differences & Examples.
Types of Logistic Regression Models
Logistic regression models come in three main types based on the number and nature of the categories in the dependent variable:
Binary: Binary logistic regression models dependent variables with only two possible outcomes, such as Win or Lose, Healthy or Sick, Yes or No.
Multinomial: Multinomial logistic regression handles dependent variables with three or more unordered outcomes. For example, the outcomes could be college majors like “political science,” “psychology,” and “statistics.”
Ordinal: Ordinal logistic regression assesses dependent variables with three or more ordered outcomes, such as “Beginner,” “Intermediate,” and “Advanced.”
Why Use Logistic Regression?
Why would you use logistic regression rather than ordinary least squares (OLS) regression? The answer lies in the nature of the dependent variable.
Logistic regression models are designed for categorical dependent variables and use a logit function to model the probability of the outcome. On the other hand, OLS regression is inappropriate for categorical outcomes because it can predict probabilities outside the valid 0 – 1 range and does not account for the nonlinear relationship between the independent variables and the outcome probabilities.
Logistic regression is part of the generalized linear model family, which allows it to handle various types of dependent variables by using a link function. This function mathematically connects the combination of input variables and their coefficients (known as the linear predictor) to the expected value, transforming the data so that the analysis can model the relationship linearly.
In logistic regression, the link function varies depending on the model type. For binary and ordinal logistic regression, the standard link function is the logit, which applies the natural logarithm to the odds of an event occurring. In multinomial logistic regression, the generalized logit function models the log odds of each category relative to a reference category.
The logit function transforms the nonlinear relationship between the independent variables and the probability of the outcome into a linear relationship with the log odds, which the analysis can model. This transformation is critical because the probability relationships typically follow an S-shaped curve rather than a straight line. The logistic function then converts the log odds back into probabilities, compressing the values into the 0 – 1 range. This process ensures the predicted probabilities are valid and provides a good fit for the typical S-shaped curve associated with binary outcomes.
The graph below displays the characteristic sigmoid shape in a binary logistic regression model for the relationship between antibiotic dosage and the probability of observing no bacteria. As dosage increases, the probability of no bacteria also increases but nonlinearly.
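If you’d like to see that S-shaped curve emerge yourself, here is a minimal Python sketch of the logit and logistic functions. The intercept and slope are made-up values for illustration, not estimates from real dosage data:

```python
import math

def logit(p):
    # The log odds: this is the scale on which the model is linear.
    return math.log(p / (1 - p))

def logistic(x):
    # Inverse of the logit: maps any real number back into (0, 1).
    return 1 / (1 + math.exp(-x))

# Hypothetical model: log odds of "no bacteria" = -4 + 0.5 * dosage.
b0, b1 = -4.0, 0.5
for dosage in [0, 4, 8, 12, 16]:
    log_odds = b0 + b1 * dosage  # linear on the log odds scale
    p = logistic(log_odds)       # S-shaped on the probability scale
    print(f"dosage={dosage:2d}  log odds={log_odds:+.1f}  p={p:.3f}")
```

Notice that equal steps in dosage produce equal steps in the log odds but unequal steps in probability, which is exactly the sigmoid pattern described above.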
Logistic Regression Model Assumptions
Consider using logistic regression when your data meet the following assumptions.
Categorical Dependent Variable
As discussed in the previous sections, the outcome you’re modeling should be categorical. Logistic regression models the probability of one of these outcomes occurring, and the model assumes this probability follows the logistic function of the linear predictor.
Independence
Each observation in your dataset should be independent. The outcome of one observation should not affect or depend on the outcomes of others.
No Perfect Multicollinearity
Logistic regression assumes that the independent variables are not perfectly correlated. Perfect multicollinearity prevents the model from estimating the coefficients accurately.
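You can screen for this numerically before fitting the model. Here’s a minimal sketch using the variance inflation factor (VIF) from statsmodels; the predictor values below are invented placeholders that echo the loan default example:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Invented placeholder data; substitute your own independent variables.
X = pd.DataFrame({
    "income":       [55, 72, 38, 90, 64, 47, 81, 59],
    "credit_score": [680, 720, 610, 750, 700, 640, 735, 690],
    "loan_amount":  [12, 20, 9, 25, 15, 11, 22, 14],
})
X = sm.add_constant(X)  # VIF calculations expect an intercept column

# Rules of thumb vary, but VIFs above roughly 5-10 signal strong
# multicollinearity; perfect multicollinearity yields infinite VIFs.
for i, name in enumerate(X.columns):
    if name != "const":
        print(f"{name}: VIF = {variance_inflation_factor(X.values, i):.2f}")
```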
Linearity in the Logit
Logistic models assume that the independent variables have a linear relationship with the log odds of the outcome. If this linearity does not exist, the model’s predictions could be off.
Other Models for Binary Outcomes
While logistic regression using the logit link function is the most common model for binary outcomes, alternative link functions exist. The logit function is preferred when the model fits well because its outputs, odds ratios, are easy to interpret. However, the following alternative models can provide more suitable interpretations depending on the characteristics of your data.
Probit Regression (Probit Link Function)
Probit regression uses the probit link function to model the cumulative probability based on a standard normal distribution. Unlike logistic regression, which models log odds, the probit function works in standard deviations from the mean of a standard normal distribution. This model is often preferred when you can assume an underlying normal distribution and want to interpret probabilities in that context.
Complementary Log-Log (Cloglog Link Function)
The complementary log-log link is helpful for skewed binary data, particularly when the event is either rare or almost certain—most observations fall into one category. This link function focuses on the extremes, and analysts frequently use it for survival analysis or when analyzing time-to-event data. The cloglog model helps when the probability of an event increases rapidly and then levels off, offering a different interpretation from the symmetrical logit model.
Learn more about Choosing the Correct Type of Regression Analysis.
Log-Log Link Function
The log-log link function is another option for modeling binary outcomes, especially when the probability of an event decreases as the independent variable increases. It’s similar to the cloglog link function but focuses on situations where lower independent variable values correspond with higher event probabilities.
These different link functions offer alternatives to the logit link function, providing flexibility when binary outcomes don’t fit the assumptions of traditional logistic regression. Each approach allows you to model probabilities in a way that better suits your data’s distribution and characteristics.
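In practice, trying a different link function is often a one-line change. Here’s a hedged sketch using statsmodels’ binomial GLM with the logit, probit, and cloglog links; the data are simulated purely for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Simulated toy data: one continuous predictor, binary outcome.
rng = np.random.default_rng(42)
x = rng.normal(size=200)
true_p = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))
y = rng.binomial(1, true_p)
X = sm.add_constant(x)

# Fit the same binary model under three different link functions.
for link in (sm.families.links.Logit(),
             sm.families.links.Probit(),
             sm.families.links.CLogLog()):
    result = sm.GLM(y, X, family=sm.families.Binomial(link=link)).fit()
    print(type(link).__name__, result.params.round(3))
```

The coefficients differ across links because each link measures effects on a different transformed scale, so compare models on fit and interpretability rather than raw coefficient size.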
Logistic Regression Example
Let’s perform an example logistic regression analysis!
In this example, we’re assessing the effectiveness of cereal ads. Does viewing the ads increase the probability of buying the cereal? We’ll include two categorical independent variables. However, you could include continuous IVs as well.
This table summarizes the variables in the logistic regression model:
| Variable | Role | Type |
|----------|------|------|
| Bought | Dependent Variable | Binary (1 = Yes, 0 = No) |
| Children | Independent Variable | Binary (Yes or No) |
| ViewAd | Independent Variable | Binary (Yes or No) |
Download the CSV data file to try it yourself: LogisticRegression.
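If you’d like to reproduce the analysis in code rather than in point-and-click statistical software, here is a minimal sketch using Python’s statsmodels. It assumes the downloaded file is named LogisticRegression.csv and that its columns match the table above:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Assumed file and column names, matching the variable table above.
df = pd.read_csv("LogisticRegression.csv")

# C(...) treats the Yes/No predictors as categorical; Bought is coded 0/1.
model = smf.logit("Bought ~ C(Children) + C(ViewAd)", data=df)
result = model.fit()

print(result.summary())       # coefficients, p-values, fit statistics
print(np.exp(result.params))  # exponentiated coefficients = odds ratios
```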
Below are the results.
Interpreting the Coefficients and P-value
The coefficients for a logistic regression model are difficult to interpret directly because they involve transformed data units (i.e., log odds). Specifically, the coefficient for a categorical IV represents the change in the log odds of the outcome occurring for that category compared to the reference category while holding all other variables constant. The low p-values indicate the variables have a statistically significant relationship with the log odds of the event. That’s helpful but doesn’t provide intuitive information about the nature of the relationship.
Learn more about Regression Coefficients and P-values.
Odds Ratios
Fortunately, when your model uses the logit link function, you can easily get more helpful information from the coefficients using a calculator! By exponentiating a coefficient, you obtain the odds ratio (OR) for the term. For a continuous independent variable, the OR is the multiplicative factor by which the odds change for a one-unit increase in that variable. Alternatively, for categorical variables such as those in our model, the OR represents how the odds of the outcome occurring change for that category relative to the reference category. Typically, you’ll interpret the odds ratios rather than the coefficients.
Let’s calculate and interpret the odds ratio for the Children variable, which has a parameter estimate of 1.641 in our logistic regression model.
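Exponentiating the estimate takes a calculator or a single line of code:

```python
import math

# Odds ratio for Children: exponentiate the log odds coefficient.
print(math.exp(1.641))  # ≈ 5.16
```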
This odds ratio indicates that the odds of buying the cereal are 5.16 times greater for families with children than for those without. By including this variable in the logistic regression model, we’re controlling for it when we assess the ad’s effectiveness.
This statistical software automatically calculates the odds ratios and confidence intervals for logistic regression. Because the p-values for both coefficients are statistically significant, we know the confidence intervals for both odds ratios exclude the null value (no effect) of 1.
The odds ratio for viewing the ad is 3.0219. This result indicates the odds of buying the cereal are about 3 times greater among those who viewed it.
At first glance, that sounds effective! However, notice the extremely wide confidence interval [1.0037, 9.0985], which ranges from barely effective to highly effective. We can conclude there’s at least a small effect, but the logistic regression model is too imprecise to tell us its actual size.
Goodness-of-Fit
The p-values for the logistic regression goodness-of-fit tests are greater than the significance level of 0.05. These results suggest that the model fits the data. However, the deviance R-squared value is only 12.10%, indicating the model accounts for a small amount of the deviance. Our model contains substantial random error, explaining the wide confidence intervals.
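If your software doesn’t report deviance R-squared directly, you can compute it from the fitted model. Continuing the assumed statsmodels sketch from earlier (for ungrouped binary data like ours, deviance R-squared equals McFadden’s pseudo R-squared, which statsmodels reports):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Refit the model from the earlier sketch (same assumed CSV and columns).
df = pd.read_csv("LogisticRegression.csv")
result = smf.logit("Bought ~ C(Children) + C(ViewAd)", data=df).fit()

# 1 - (model log-likelihood / null log-likelihood); for ungrouped binary
# data this matches deviance R-squared = 1 - deviance / null deviance.
print(f"Deviance R-squared: {result.prsquared:.2%}")
```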
Learn more about Goodness-of-Fit: Definition & Tests.
This logistic regression model example included two categorical predictors. For an example that interprets continuous predictors, read my post, Using Logistic Regression to Assess the Republican Establishment Split.
Logistic Regression Models in Machine Learning and AI
Logistic regression is crucial in machine learning (ML) and artificial intelligence (AI). ML models are software systems that learn from data to perform tasks autonomously without constant human intervention. When built using logistic regression, these models can drive predictive analysis that reduces operational costs, boosts efficiency, and enables businesses to scale more quickly.
In machine learning, logistic regression is one of the most widely used algorithms for supervised learning, particularly for binary classification. While logistic regression models probabilities, it can be the foundation for classification tasks by incorporating a probability cutoff value. This process assigns cases with probabilities above the cutoff to one class and those below it to another. ML applications use this approach for tasks such as predicting whether a financial transaction is risky and whether a customer will buy a product.
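Here’s a minimal scikit-learn sketch of that cutoff idea. The data are simulated, and the 0.5 cutoff is simply the conventional default; in practice, you’d tune it to reflect the costs of each type of misclassification:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated toy data: two features and a binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# Classification = predicted probability compared against a cutoff.
probabilities = clf.predict_proba(X)[:, 1]  # P(class = 1) for each case
cutoff = 0.5                                # raise or lower to trade off errors
predictions = (probabilities >= cutoff).astype(int)
print("Share classified as 1:", predictions.mean())
```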
By estimating the probability that an observation belongs to a specific category, logistic regression provides a probabilistic framework that supports decision-making, making it an essential tool for AI and ML applications.