What is Poisson Regression?
Poisson regression statistically models events that you count within a specified observation space. Frequently, analysts define the observation space using time, but it can also relate to a volume, area, or item. These models allow you to understand the independent variables that affect the counts and predict them given specific conditions you enter into the model.
Frequently, statisticians use Poisson regression to analyze rates over a timeframe (counts/time). For example, you can use it to model the following count outcomes:
- Number of customer arrivals per hour.
- Number of website clicks per day.
- Number of Help Desk calls per day.
Poisson regression models determine which variables have statistically significant relationships with the counts. For instance, the help desk example can assess the count outcome (calls) using factors like time of day, day of the week, or seasonal trends to help predict future call volume.
However, Poisson regression doesn’t necessarily relate to time. For example, it can model the number of defects that quality control inspectors find on a car tire. Here, the observation space is physical—one tire. The outcome is a count of defects on each tire, and the model can analyze how factors such as the type of rubber, production line, and manufacturing conditions affect the number of defects per tire.
It’s also an excellent statistical analysis tool for assessing count data in a contingency table. Analysts often refer to it as a log-linear model in this setting.
Later in this post, we’ll perform a Poisson regression and interpret the results, highlighting what you can learn from it!
Learn more about Independent vs. Dependent Variables: Differences & Examples.
Why Use Poisson Regression?
Why would you use Poisson regression rather than least squares regression? It comes down to the nature of your dependent variable.
Poisson regression assumes your dependent variable follows a Poisson distribution. These distributions can’t have values less than zero and tend to be right-skewed. Additionally, as the expected value of a Poisson distribution increases, so does its variance. Ordinary least squares regression cannot adequately model these conditions.
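You can check these properties numerically. Here is a minimal sketch using only Python's standard library. The function names and the two example means (2 and 20) are my own illustrative choices, not part of the analysis later in this post.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing exactly k events when the mean count is lam."""
    # Work on the log scale so a large k does not overflow the factorial.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def poisson_mean_and_variance(lam: float, max_k: int = 500) -> tuple[float, float]:
    """Approximate the mean and variance by summing over counts 0..max_k."""
    probs = [poisson_pmf(k, lam) for k in range(max_k + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, variance


for lam in (2.0, 20.0):  # illustrative means
    mean, variance = poisson_mean_and_variance(lam)
    print(f"lam={lam}: mean={mean:.3f}  variance={variance:.3f}")
```

For each mean, the variance comes out equal to it, so a process with a higher expected count also has more spread. A constant-variance model like ordinary least squares can't capture that.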
Learn more about the Poisson Distribution: Definition & Uses.
Poisson regression is in the family of generalized linear models that use a link function to expand the types of dependent variables that a linear model can analyze.
Poisson regression handles these non-negative, right-skewed data with a changing variance by using a link function. This function mathematically connects the combination of input variables and their coefficients (known as the linear predictor) to the expected value of the response variable using a scale that produces a linear relationship which the analysis can model.
In Poisson regression, the link function is typically the natural logarithm. Because the model works on the log scale and then exponentiates to obtain the expected count, its predictions are always positive. The log scale also accommodates both the right-skew and increasing variance in Poisson-distributed data by compressing larger values more than smaller ones.
In summary, the link function allows the Poisson regression model to appropriately handle the non-negative, skewed, and changing variance data characteristics.
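Written out, the log link takes the following standard form. The symbols are notation I'm introducing for illustration: μ is the expected count, x_1 through x_p are the independent variables, and the β values are the coefficients.

```latex
\log(\mu) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p
\quad\Longleftrightarrow\quad
\mu = \exp\left(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p\right)
```

The right-hand side of the first equation is the linear predictor. Because the exponential function is always positive, the predicted mean count stays above zero whatever values the predictors take.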
Poisson Regression Assumptions
Consider using Poisson regression when your data satisfy the following assumptions.
Poisson Dependent Variable
The outcome you are trying to predict should be a count of something (like the number of events happening) within a specific time frame or space. This count should follow a pattern that fits the Poisson distribution.
Independence
Each observation in your data should not affect the others. In other words, what happens with one observation should not influence or depend on what happens with another.
Mean = Variance
In Poisson regression, the model’s mean outcome should approximately equal the variance. If this doesn’t hold, the model might not fit the data well and might require adjustments. Sometimes, you can resolve this issue in Poisson regression by adding terms to the model. In other cases, you might need to use a different analysis.
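One common way to check this assumption after fitting a model is the Pearson dispersion statistic: the Pearson chi-square divided by the residual degrees of freedom. It is not necessarily the check used elsewhere in this post. The sketch below assumes you already have the model's fitted means. The function name and the data are hypothetical.

```python
def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-square statistic divided by residual degrees of freedom.

    Values near 1 are consistent with mean = variance; values well above 1
    suggest overdispersion.
    """
    if len(observed) != len(fitted):
        raise ValueError("observed and fitted must have the same length")
    df_resid = len(observed) - n_params
    if df_resid <= 0:
        raise ValueError("need more observations than estimated parameters")
    chi_square = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    return chi_square / df_resid


# Hypothetical counts and model-fitted means, for illustration only.
observed = [1, 4, 2, 5, 0, 3]
fitted = [1.5, 3.5, 2.0, 4.5, 0.8, 2.7]
print(round(pearson_dispersion(observed, fitted, n_params=2), 3))
```

Here `n_params` counts every estimated coefficient, including the intercept.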
Linearity
The model assumes that the relationship between the independent variables (x values) and the log of the expected count is a straight line. If this linearity doesn’t exist, the model’s predictions can be inaccurate. Changing the link function can help in some cases.
Other Models for Count Data
If your data don’t satisfy these assumptions even after tweaking the Poisson regression model, you’ll need to consider a different analysis. Fortunately, several others can model count data.
If the variance is much larger than the expected value (overdispersion), it violates the Poisson assumption that they are equal. Fortunately, negative binomial regression can adequately model overdispersion.
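The difference between the two models shows up in their variance functions. The form below uses one common negative binomial parameterization (often called NB2). The dispersion parameter α is notation I'm introducing here, and your software may define it differently.

```latex
\text{Poisson: } \operatorname{Var}(Y) = \mu
\qquad
\text{Negative binomial: } \operatorname{Var}(Y) = \mu + \alpha\mu^{2}, \quad \alpha > 0
```

The extra term lets the variance exceed the mean, and as α approaches zero the negative binomial variance shrinks back to the Poisson case.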
Some data contain too many zeros to follow a Poisson distribution. Zero-inflated models assume that two separate processes work together to produce the excessive zeros. One process assesses if there are zero events or more than zero events. The other is the Poisson process, which determines how many events occur, some of which can be zero. Here is an example!
Imagine park rangers count the fish that park visitors caught during their visits as they exit. A zero-inflated model might be appropriate because there are two processes for catching zero fish:
- Some park visitors caught zero fish because they did not go fishing.
- Other visitors went fishing, and some of these people caught zero fish.
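The two processes can be sketched as a zero-inflated Poisson probability function. This is a minimal illustration rather than a fitted model. The 40% share of non-anglers and the average of 2 fish per angler are made-up values, and the function names are my own.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of exactly k events with mean lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def zip_pmf(k, lam, p_structural_zero):
    """Zero-inflated Poisson probability of exactly k events.

    p_structural_zero is the chance an observation comes from the
    always-zero process (visitors who never went fishing).
    """
    poisson_part = (1.0 - p_structural_zero) * poisson_pmf(k, lam)
    if k == 0:
        # Zeros arise from both processes: non-anglers plus unlucky anglers.
        return p_structural_zero + poisson_part
    return poisson_part


# Hypothetical values: 40% of visitors do not fish; anglers average 2 fish.
p_zero_plain = poisson_pmf(0, 2.0)
p_zero_inflated = zip_pmf(0, 2.0, 0.4)
print(f"P(0 fish), Poisson only:  {p_zero_plain:.3f}")
print(f"P(0 fish), zero-inflated: {p_zero_inflated:.3f}")
```

The zero-inflated probability of catching nothing is much higher than the plain Poisson probability, which is the "too many zeros" pattern described above.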
Learn more about Choosing the Correct Type of Regression Analysis.
Poisson Regression Example
Let’s perform an example Poisson regression analysis!
In this example, the Poisson rate we’re assessing is the count of discoloration defects per hour-long inspection session. The company recorded various manufacturing information along with the number of defects the inspectors found during each session.
This example is cool because it illustrates the flexibility of Poisson regression. It can model a combination of continuous and categorical independent variables along with interaction effects. Also, I’ll center the continuous variables to reduce the structural multicollinearity the interaction term creates.
Download the CSV data file to try it yourself: PoissonRegression.
This table summarizes the variables in the model:
| Variable | Role | Type |
| --- | --- | --- |
| Discoloration Defects | Dependent Variable | Count |
| Hours Since Cleanse | Independent Variable | Continuous |
| Temperature | Independent Variable | Continuous |
| Size of Screw | Independent Variable | Categorical (Large/Small) |
I’ll also include the Temperature * Size of Screw interaction in the Poisson regression model.
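If you want to see what centering and the interaction term look like as data, here is a minimal sketch in Python. The rows are hypothetical stand-ins for the CSV values, and the 1/0 indicator coding for screw size is an assumption. Statistical software may code categorical variables differently.

```python
from statistics import mean


def center(values):
    """Subtract the mean so the centered variable averages zero."""
    avg = mean(values)
    return [v - avg for v in values]


# Hypothetical rows; the real values live in the CSV file.
hours_since_cleanse = [2.0, 5.0, 8.0, 11.0]
temperature = [60.0, 65.0, 70.0, 75.0]
screw_is_large = [1, 0, 1, 0]  # assumed coding: 1 = Large, 0 = Small

hours_c = center(hours_since_cleanse)
temp_c = center(temperature)

# Interaction column: centered temperature times the screw-size indicator.
temp_x_size = [t * s for t, s in zip(temp_c, screw_is_large)]

# One design-matrix row per session: intercept, main effects, interaction.
design = [
    [1.0, h, t, s, ts]
    for h, t, s, ts in zip(hours_c, temp_c, screw_is_large, temp_x_size)
]
for row in design:
    print(row)
```

Centering subtracts each continuous variable's mean before the interaction column is formed, which is what reduces the correlation between the interaction term and its component variables.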
Below are the results!
Interpreting the Coefficients and P-values
In our Poisson regression example, the p-values for all the terms are less than the standard significance level of 0.05, making them statistically significant. Centering the variables kept the VIFs low. Otherwise, some would’ve been too high due to excessive structural multicollinearity.
Learn more about Regression Coefficients and P-values.
The Poisson regression coefficients are difficult to interpret directly because they are on the log scale. Specifically, each coefficient represents the change in the natural log of the mean count given a one-unit increase in the independent variable. Generally, positive coefficients indicate that increasing the IV corresponds with higher expected counts, while negative values suggest lower counts.
Fortunately, you can easily get more useful information from them using a calculator! By exponentiating the coefficient, you obtain the relative risk ratio for the term. Relative risk is the multiplicative factor by which the mean count changes for a one-unit increase in the independent variable.
Let’s calculate and interpret the relative risk for the Hours Since Cleanse variable, which has a parameter estimate of 0.01798 in our Poisson regression model.
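The calculation is a single exponentiation. Here it is as a short Python sketch using the coefficient quoted above. The variable names are my own.

```python
import math

coefficient = 0.01798  # Hours Since Cleanse estimate from the model

relative_risk = math.exp(coefficient)
percent_change = (relative_risk - 1.0) * 100.0

print(f"relative risk:  {relative_risk:.4f}")   # 1.0181
print(f"percent change: {percent_change:.1f}%")  # 1.8%
```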
Exponentiating 0.01798 yields a relative risk of approximately 1.018. This value indicates that for every one-hour increase in the time since the last cleaning, the mean defect count increases by about 1.8%. Because the term is statistically significant, we know the confidence interval for the relative risk excludes 1, the null value that represents no effect.
Keep reading and we’ll get to a graphical interpretation for the interaction effect!
Goodness-of-Fit
The p-values for the goodness-of-fit tests are greater than the significance level of 0.05. These results suggest that the model fits the data. Additionally, the deviance R-squared value of 85.99% suggests that the model accounts for most of the deviance.
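For intuition about where a deviance R-squared comes from, here is a minimal sketch. It assumes the common definition of 1 minus the ratio of the model's deviance to the deviance of an intercept-only model. Your software may use an adjusted or otherwise different version. The function names and the data are hypothetical and unrelated to the defect dataset.

```python
import math


def poisson_deviance(observed, fitted):
    """Poisson deviance: 2 * sum(y * ln(y / mu) - (y - mu))."""
    total = 0.0
    for y, mu in zip(observed, fitted):
        # The y * ln(y / mu) term is taken as 0 when y is 0.
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += log_term - (y - mu)
    return 2.0 * total


def deviance_r_squared(observed, fitted):
    """Share of the null (intercept-only) deviance explained by the model."""
    overall_mean = sum(observed) / len(observed)
    null_deviance = poisson_deviance(observed, [overall_mean] * len(observed))
    return 1.0 - poisson_deviance(observed, fitted) / null_deviance


# Hypothetical counts and fitted means, for illustration only.
observed = [0, 2, 3, 7, 12]
fitted = [0.6, 1.8, 3.5, 6.9, 11.2]
print(f"{deviance_r_squared(observed, fitted):.3f}")
```

A value near 1 means the fitted means track the observed counts far better than the overall average does.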
Learn more about Goodness-of-Fit: Definition & Tests.
Poisson Regression Interaction Effect
Let’s investigate the interaction effect because it is statistically significant in our Poisson regression model. The best way to do that is with an interaction plot! The plot below uses natural data units because the software back-transformed the results.
The interaction plot for the Poisson regression indicates that temperature has a stronger relationship with defects for the large screws. As you lower the temperature by moving left on the X-axis, the number of defects increases rapidly with large screws, while it barely increases for small screws. Because the p-value for the interaction term is less than 0.05, the pattern on the interaction plot is statistically significant. Switching to small screws would make the process more robust to temperature drops.
Learn more about Understanding Interaction Effects.
After reading this post, I hope you see why Poisson regression should be a part of your statistical toolkit. It can handle the unique nature of count data and allows you to analyze various types of independent variables and effects.
Reference
Paul Roback and Julie Legler (2021), Beyond Multiple Linear Regression: Applied Generalized Linear Models and Multilevel Models in R.