Discrete probability distributions are based on discrete variables, which have a finite or countable number of values. In this post, I show you how to perform goodness-of-fit tests to determine how well your data fit various discrete probability distributions.
How do you recognize discrete distributions? For discrete distributions, you can create a table that lists every possible value along with a non-zero probability for each value. The sum of all probabilities must equal 1. In contrast, continuous distributions are based on continuous variables, which can take any value within a range. Because a continuous variable has an uncountably infinite number of possible values, you cannot list them in a table.
The following are examples of different types of discrete distributions.
- Binary: For each customer that enters a dealership, there are two possible outcomes—sale or no sale. Each outcome has a probability.
- Poisson: The number of cars that a dealership sells in a day can follow the Poisson distribution. You can create a table with the counts (0, 1, 2, 3, etc.) along with the probability for each daily count.
- Categorical: The color of the car is a categorical variable. You can list all possible colors along with their probabilities.
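To make the idea of a probability table concrete, here is a minimal Python sketch. The color probabilities and the average of two sales per day are made-up illustration values, not data from this post. The Poisson table is cut off at 10 sales, so its probabilities sum to slightly less than 1.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k events when the average count is `mean`."""
    return math.exp(-mean) * mean**k / math.factorial(k)

# Hypothetical categorical distribution: car colors and their probabilities.
color_probs = {"white": 0.35, "black": 0.25, "silver": 0.20, "other": 0.20}

# Hypothetical Poisson distribution: daily car sales averaging 2 per day.
mean_sales = 2.0
sales_table = {k: poisson_pmf(k, mean_sales) for k in range(11)}

print("Sum of color probabilities:", round(sum(color_probs.values()), 6))
for k, prob in sales_table.items():
    print(f"{k:>2} sales: {prob:.4f}")
print("Sum of listed Poisson probabilities:", round(sum(sales_table.values()), 6))
```

The Poisson distribution has a countably infinite set of possible values, so any printed table has to stop somewhere. The remaining probability sits in the unlisted upper tail.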
Each type of discrete probability distribution requires a different type of data and allows you to model different characteristics. Before you can use these distributions, you need to determine whether your data follow one of them.
Before proceeding, I need to clarify that there are two different approaches based on the type of discrete probability distribution you are testing:
- For binary data, you need to check the assumptions.
- For other types of discrete variables, you need to perform a goodness-of-fit test.
Check the Assumptions for Discrete Distributions Based On Binary Data
If you want to use a discrete probability distribution based on binary data to model a process, you only need to determine whether your data satisfy the assumptions. You don’t need to perform a goodness-of-fit test. If you are confident that your binary data meet the assumptions, you’re good to go!
I’ll walk you through the assumptions for the binomial distribution. You use the binomial distribution to model the number of times an event occurs within a constant number of trials.
The binomial distribution has the following four assumptions:
- There are only two possible outcomes per trial. For example, yes or no, pass or fail, etc.
- Each trial is independent. The outcome of one trial does not influence the outcome of another trial. For example, when you flip a coin, the result of one flip doesn’t affect the next flip.
- The probability remains constant over time. For some cases, this assumption is true based on the physical properties, such as flipping a coin. However, if there is a chance the probability can change over time, you can use the P chart (a control chart) to confirm this assumption. For example, it’s possible that the probability that a process produces defective products can change over time.
- The number of trials is fixed. The binomial distribution models the frequency of events over a fixed number of trials. If you need to model a different characteristic, use a different distribution.
Typically, you must have good knowledge about the process, data collection methodology, and your goals to determine whether you should use the binomial distribution. If you can meet all four of these assumptions, you can use the binomial distribution.
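For the constant-probability assumption, the P chart mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the defective counts and the subgroup size of 200 are made up, all subgroups are the same size, and the limits use the conventional three-standard-error rule rather than the output of any particular software.

```python
import math

def p_chart_limits(defectives, sample_size):
    """Return (center line, lower limit, upper limit) for equal-sized subgroups."""
    p_bar = sum(defectives) / (len(defectives) * sample_size)
    std_error = math.sqrt(p_bar * (1 - p_bar) / sample_size)
    lower = max(0.0, p_bar - 3 * std_error)  # a proportion cannot go below 0
    upper = min(1.0, p_bar + 3 * std_error)  # or above 1
    return p_bar, lower, upper

# Hypothetical data: number of defectives found in 10 subgroups of 200 items.
defectives = [4, 2, 5, 3, 6, 4, 3, 5, 2, 6]
sample_size = 200

center, lcl, ucl = p_chart_limits(defectives, sample_size)
out_of_control = [
    i for i, d in enumerate(defectives) if not lcl <= d / sample_size <= ucl
]

print(f"Center line: {center:.4f}, limits: [{lcl:.4f}, {ucl:.4f}]")
print("Subgroups outside the limits:", out_of_control)
```

If no subgroup proportion falls outside the limits, the chart gives no signal that the probability is shifting over time. A full P chart analysis typically applies additional run rules that this sketch omits.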
Other distributions that use binary data
There are several other discrete distributions that use binary data. I list several of them below along with how they differ from the binomial distribution. Each distribution has assumptions or goals that vary a bit from the binomial distribution.
|Distribution|Main differentiation from the binomial distribution|
|---|---|
|Negative binomial|Models the number of trials to produce a fixed number of events.|
|Geometric|Models the number of trials to produce the first event.|
|Hypergeometric|Assumes that you are drawing samples from a small population with no replacements, which causes the probabilities to change.|
If you are working with binary variables, the choice of binary distribution depends on the population, constancy of the probability, and your goals. When you confirm the assumptions, there typically is no need to perform a goodness-of-fit test.
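As a rough illustration of how these alternatives differ, the sketch below computes one probability from each distribution using only the Python standard library. The event probability of 0.10 and the population of 20 items with 5 event items are hypothetical values. The negative binomial function counts trials, matching the description in the table above; some software counts non-events instead.

```python
import math

def geometric_pmf(trials, p):
    """P(the first event occurs on trial number `trials`)."""
    return (1 - p) ** (trials - 1) * p

def negative_binomial_pmf(trials, events, p):
    """P(event number `events` occurs on trial number `trials`)."""
    if trials < events:
        return 0.0
    return (math.comb(trials - 1, events - 1)
            * p**events * (1 - p) ** (trials - events))

def hypergeometric_pmf(k, population, event_items, draws):
    """P(k event items among `draws` items drawn without replacement)."""
    return (math.comb(event_items, k)
            * math.comb(population - event_items, draws - k)
            / math.comb(population, draws))

p = 0.10  # hypothetical probability of the event on each trial

first_on_third = geometric_pmf(3, p)
second_on_fifth = negative_binomial_pmf(5, 2, p)
# Hypothetical small population: 20 items, 5 of them event items, 4 drawn.
hyper_table = [hypergeometric_pmf(k, 20, 5, 4) for k in range(5)]

print(f"First event on trial 3:  {first_on_third:.4f}")
print(f"Second event on trial 5: {second_on_fifth:.4f}")
print("Hypergeometric table:", [round(q, 4) for q in hyper_table])
```

Setting the number of events to 1 in the negative binomial function reproduces the geometric probabilities, which is one way to see that the geometric distribution is a special case.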
Example Use of the Binomial Distribution
In a future post, I will show you ways you can use the various discrete probability distributions for binary data. For now, I’ll include an example use of only the binomial distribution to give you an idea. The graph below shows us that if the probability of a defective product is 1.5% and you are modeling a sample size of 30, you’d expect just over 60% of the samples to have zero defective products. Additionally, the binomial distribution predicts that about 7.4% of the samples will have two or more defective products.
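You can check the two percentages quoted above directly from the binomial formula. The short Python sketch below uses only the values stated in the text (a 1.5% defect probability and a sample size of 30); the function and variable names are my own.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 30, 0.015  # sample size and defect probability from the example

p_zero = binomial_pmf(0, n, p)
p_two_or_more = 1 - binomial_pmf(0, n, p) - binomial_pmf(1, n, p)

print(f"P(0 defectives)  = {p_zero:.3f}")
print(f"P(2+ defectives) = {p_two_or_more:.3f}")
```

The results come out at roughly 63.5% and 7.4%, which agrees with the figures in the paragraph above.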
Performing a Goodness-of-Fit Test for Other Discrete Distributions
If you are working with discrete data that are not binary data, chances are you’ll need to perform a Chi-square goodness-of-fit test to decide if your data fit a particular discrete probability distribution. These tests compare the theoretical frequencies to the frequencies of the observed values. If the difference is statistically significant, you can conclude that your data do not follow that specific discrete distribution.
- H0: The sample data follow the hypothesized distribution.
- H1: The sample data do not follow the hypothesized distribution.
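For goodness-of-fit tests, small p-values indicate that you can reject the null hypothesis and conclude that your data were not drawn from a population with the specified distribution. Consequently, goodness-of-fit tests are a rare case where you look for high p-values to identify candidate distributions. Keep in mind that a high p-value means only that the test found insufficient evidence against the hypothesized distribution. It does not prove that your data follow it.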
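The comparison of observed and theoretical frequencies is usually summarized with the Pearson chi-square statistic. The notation here is conventional rather than taken from this post: O_i is the observed count in category i, E_i is the expected count under the hypothesized distribution, k is the number of categories, and m is the number of distribution parameters estimated from the sample.

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},
\qquad
\text{df} = k - 1 - m
```

Larger values of the statistic indicate a bigger gap between observed and expected counts, and therefore produce smaller p-values.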
I’ll show you how to test whether your discrete data follow the Poisson distribution and a distribution based on a categorical variable. You can download the CSV file that contains the data for both examples: DiscreteGOF.
Testing the Goodness-of-Fit for a Poisson Distribution
The Poisson distribution is a discrete probability distribution that models the count of events or characteristics over a constant observation space. Values must be integers that are greater than or equal to zero. For example, the number of sales per day in a store can follow the Poisson distribution. If these data follow the Poisson distribution, you can use this distribution to make predictions.
I’ll use an accident count example to show you how to determine whether your data follow the Poisson distribution.
Suppose a safety inspector needs to monitor the number of car accidents per month at a specific intersection. The inspector enters the number of monthly accidents in a worksheet like this:
Each value denotes the count of accidents in one month. The actual dataset has 50 values that cover 50 months.
To determine whether these data follow the Poisson distribution, we need to use the Chi-Squared Goodness-of-Fit Test for the Poisson distribution. The statistical output for this test is below.
This test compares the observed counts to the expected counts based on the Poisson distribution. The p-value is larger than the common significance level of 0.05, so we fail to reject the null hypothesis. Consequently, the test result suggests that the Poisson distribution is a reasonable model for these data. You can use the Poisson distribution to make predictions about the probabilities associated with different counts. You can also use analyses that assume the data follow the Poisson distribution, including the 1- and 2-sample Poisson rate analyses and the U chart.
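If you want to see the mechanics behind this kind of test, here is a minimal Python sketch that uses only the standard library. It is not the procedure that produced the output above, and the 50 monthly counts are made up for illustration rather than taken from the DiscreteGOF file. The sketch estimates the mean from the data, pools counts of 4 or more into one category so that expected counts do not get too small, and uses degrees of freedom equal to the number of categories minus 2 (one for the fixed total, one for the estimated mean). The `chi2_sf` helper is my own implementation of the chi-square upper-tail probability.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability via the incomplete gamma series."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 1
    while term > total * 1e-15:
        term *= half_x / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def poisson_pmf(k, mean):
    return math.exp(-mean) * mean**k / math.factorial(k)

def poisson_gof(counts, max_category):
    """Test categories 0..max_category-1 plus a pooled '>= max_category'."""
    n = len(counts)
    mean = sum(counts) / n  # Poisson mean estimated from the sample
    observed = [sum(1 for c in counts if c == k) for k in range(max_category)]
    observed.append(sum(1 for c in counts if c >= max_category))
    probs = [poisson_pmf(k, mean) for k in range(max_category)]
    probs.append(1 - sum(probs))  # probability of the pooled upper tail
    expected = [n * q for q in probs]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 2  # categories - 1 - one estimated parameter
    return statistic, df, chi2_sf(statistic, df), observed, expected

# Hypothetical data: accident counts for 50 months (made up for this sketch).
counts = [0] * 7 + [1] * 13 + [2] * 14 + [3] * 9 + [4] * 5 + [5] * 2

statistic, df, p_value, observed, expected = poisson_gof(counts, max_category=4)
print("Observed:", observed)
print("Expected:", [round(e, 2) for e in expected])
print(f"Chi-square = {statistic:.3f}, df = {df}, p-value = {p_value:.3f}")
```

For these made-up counts the p-value is well above 0.05, so the test does not reject the Poisson distribution. The choice of where to pool the upper tail is a judgment call, and different cutoffs can change the result slightly.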
Categorical Variables and Discrete Distributions
A categorical variable also has a discrete probability distribution of values. Each level of the categorical variable is associated with a probability. To determine whether the distribution of categorical data follows the values that you expect, you can perform the Chi-Square Goodness-of-Fit test. This test is very similar to the Poisson version except that you must specify the test proportions.
I’ll walk you through an example. It is fairly easy to perform this test.
Car color example of a discrete distribution
PPG Industries studied the paint color of new cars bought in 2012 for the entire world. We want to assess whether the distribution of car colors in our local area follows the global distribution. In this example, the PPG data are real but I’m making up our local data. The car color is our categorical variable and the levels are the individual colors.
After gathering a random sample of the color of cars sold in our state, we enter the observed data and global proportions in a worksheet like this:
The OurState column contains the tally for each color that we observed. The PPG Industries data are in the Global Proportions column. We’ll perform the Chi-square goodness-of-fit test to determine whether our local distribution is different from the global distribution. We’ll use the PPG proportions as the test proportions.
The Chi-square goodness-of-fit test results
This goodness-of-fit test compares the observed proportions to the test proportions to see if the differences are statistically significant. The p-value is less than the significance level of 0.05. Therefore, we can conclude that the discrete probability distribution of car colors in our state is different from the global distribution.
The Contribution to Chi-squared column tells us which paint colors contribute the most to the statistical significance. Gray and Red are the top two colors, but we don’t know the nature of how they contribute to the difference.
Let’s look at the observed and expected values chart to see how these values are different.
The chart indicates that the observed number of gray cars is higher than expected. On the other hand, the observed number of red cars is less than expected.
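The same calculation can be sketched in Python with only the standard library. The tallies and test proportions below are hypothetical values chosen to mimic the pattern described above; they are not the PPG proportions or the tallies from the worksheet. The `chi2_sf` helper is my own implementation of the chi-square upper-tail probability.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability via the incomplete gamma series."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 1
    while term > total * 1e-15:
        term *= half_x / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

# Hypothetical tallies and test proportions (not the PPG figures).
observed = {"White": 45, "Black": 38, "Silver": 35,
            "Gray": 45, "Red": 10, "Other": 27}
test_proportions = {"White": 0.25, "Black": 0.20, "Silver": 0.20,
                    "Gray": 0.15, "Red": 0.10, "Other": 0.10}

total_cars = sum(observed.values())
expected = {color: total_cars * q for color, q in test_proportions.items()}

# Each category's contribution to the chi-square statistic.
contributions = {
    color: (observed[color] - expected[color]) ** 2 / expected[color]
    for color in observed
}
statistic = sum(contributions.values())
df = len(observed) - 1  # no parameters are estimated from the sample
p_value = chi2_sf(statistic, df)

ranked = sorted(contributions, key=contributions.get, reverse=True)
for color in ranked:
    direction = "higher" if observed[color] > expected[color] else "lower"
    print(f"{color:<7} observed {observed[color]:>3}, "
          f"expected {expected[color]:5.1f} ({direction}), "
          f"contribution {contributions[color]:.3f}")
print(f"Chi-square = {statistic:.3f}, df = {df}, p-value = {p_value:.4f}")
```

Sorting by contribution shows which categories drive the result, and comparing observed to expected counts shows the direction of each difference, which the contribution alone does not reveal.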
There are several different types of discrete variables that can produce different types of discrete probability distributions. The process by which you test your data to determine whether they follow a specific distribution depends on the type of discrete variable.
- Binary data: Check the assumptions.
- Count data: Use the Poisson Goodness-of-Fit Test.
- Categorical variable: Use the Chi-Square Goodness-of-Fit Test and designate the test proportions.