Discrete probability distributions are based on discrete variables, which have a finite or countable number of values. In this post, I show you how to perform goodness-of-fit tests to determine how well your data fit various discrete probability distributions.
How do you recognize discrete distributions? For discrete distributions, you can create a table that contains all possible values and a non-zero probability for each value. The sum of all probabilities must equal 1. In contrast, continuous distributions are based on continuous variables, which can take any value within a range, so you cannot list every possible value in a table.
The following are examples of different types of discrete distributions.
- Binary: For each customer that enters a dealership, there are two possible outcomes—sale or no sale. Each outcome has a probability.
- Poisson: The number of cars that a dealership sells in a day can follow the Poisson distribution. You can create a table with the counts (0, 1, 2, 3, etc.) along with the probability for each daily count.
- Categorical: The color of the car is a categorical variable. You can list all possible colors along with their probabilities.
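To make the idea of a table of values and probabilities concrete, here is a minimal Python sketch. The color probabilities and the average of 2 sales per day are made-up values for illustration, not data from this post.

```python
import math

# Hypothetical probabilities for a categorical variable (car color).
color_probs = {"White": 0.30, "Black": 0.25, "Silver": 0.20, "Red": 0.15, "Other": 0.10}

def poisson_pmf(k, mean):
    """Probability of observing exactly k events when the average count is `mean`."""
    return math.exp(-mean) * mean**k / math.factorial(k)

# Hypothetical average of 2 cars sold per day; tabulate counts 0 through 10.
daily_mean = 2.0
poisson_table = {k: poisson_pmf(k, daily_mean) for k in range(11)}

print("Color probabilities sum to", round(sum(color_probs.values()), 10))
for k, prob in poisson_table.items():
    print(f"{k:>2} sales: {prob:.4f}")

# The Poisson table is open-ended, so the listed rows sum to slightly less than 1.
print("Poisson rows 0-10 sum to", round(sum(poisson_table.values()), 6))
```

The categorical table is complete, so its probabilities sum to 1. The Poisson table has no upper limit, so any finite list of counts leaves a small remainder for the larger counts.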
Each type of discrete probability distribution requires a different type of data and allows you to model different characteristics. Before you can use these distributions, you need to determine whether your data follows one of them.
Before proceeding, I need to clarify that there are two different approaches based on the type of discrete probability distribution you are testing:
- For binary data, you need to check the assumptions.
- For other types of discrete variables, you need to perform a goodness-of-fit test.
Related posts: Understanding Probability Distributions and Identify the Distribution of Your Continuous Data
Check the Assumptions for Discrete Distributions Based On Binary Data
If you want to use a discrete probability distribution based on binary data to model a process, you only need to determine whether your data satisfy the assumptions. You don’t need to perform a goodness-of-fit test. If you are confident that your binary data meet the assumptions, you’re good to go!
I’ll walk you through the assumptions for the binomial distribution. You use the binomial distribution to model the number of times an event occurs within a constant number of trials.
The binomial distribution has the following four assumptions:
- There are only two possible outcomes per trial. For example, yes or no, pass or fail, etc.
- Each trial is independent. The outcome of one trial does not influence the outcome of another trial. For example, when you flip a coin, the result of one flip doesn’t affect the next flip.
- The probability remains constant over time. For some cases, this assumption is true based on the physical properties, such as flipping a coin. However, if there is a chance the probability can change over time, you can use the P chart (a control chart) to confirm this assumption. For example, it’s possible that the probability that a process produces defective products can change over time.
- The number of trials is fixed. The binomial distribution models the frequency of events over a fixed number of trials. If you need to model a different characteristic, use a different distribution.
Typically, you must have good knowledge about the process, data collection methodology, and your goals to determine whether you should use the binomial distribution. If you can meet all four of these assumptions, you can use the binomial distribution. Statisticians refer to the 2nd and 3rd assumptions as independent and identically distributed (IID) data.
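If the constant-probability assumption is in doubt, the P chart check mentioned in the list above can be sketched in a few lines. The defect counts and the subgroup size below are hypothetical, and the limits use the conventional three-sigma formula for equal subgroup sizes. Treat this as an illustration, not a replacement for control chart software.

```python
import math

# Hypothetical data: number of defective units found in each sample of 200 units.
subgroup_size = 200
defectives = [4, 2, 5, 3, 6, 1, 4, 3, 5, 2, 4, 3]

proportions = [d / subgroup_size for d in defectives]
p_bar = sum(defectives) / (subgroup_size * len(defectives))  # overall proportion

# Conventional three-sigma limits for a P chart with equal subgroup sizes.
sigma = math.sqrt(p_bar * (1 - p_bar) / subgroup_size)
upper_limit = p_bar + 3 * sigma
lower_limit = max(0.0, p_bar - 3 * sigma)  # a proportion cannot fall below zero

# Collect (sample number, proportion) pairs that fall outside the limits.
out_of_control = [
    (i + 1, p) for i, p in enumerate(proportions)
    if p > upper_limit or p < lower_limit
]

print(f"center = {p_bar:.4f}, LCL = {lower_limit:.4f}, UCL = {upper_limit:.4f}")
print("Samples outside the limits:", out_of_control or "none")
```

When no sample proportion falls outside the limits, the chart gives no sign that the probability is drifting, which supports the constant-probability assumption.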
Other distributions that use binary data
There are several other discrete distributions that use binary data. I list several of them below along with how they differ from the binomial distribution. Each distribution has assumptions or goals that vary a bit from the binomial distribution.
| Distribution | Main differentiation from the binomial distribution |
| --- | --- |
| Negative binomial | Models the number of trials to produce a fixed number of events. |
| Geometric | Models the number of trials to produce the first event. |
| Hypergeometric | Assumes that you are drawing samples from a small population with no replacements, which causes the probabilities to change. |
If you are working with binary variables, the choice of binary distribution depends on the population, constancy of the probability, and your goals. When you confirm the assumptions, there typically is no need to perform a goodness-of-fit test.
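The differences between these distributions are easier to see as formulas. The sketch below implements the three probability functions with the Python standard library, following the "number of trials" descriptions in the table. The function names and all input values are illustrative assumptions. Note that some software parameterizes the negative binomial as the number of non-events before the final event instead of the total number of trials.

```python
from math import comb

def geometric_pmf(n, p):
    """P(the first event occurs on trial n), for n = 1, 2, 3, ..."""
    return (1 - p) ** (n - 1) * p

def negative_binomial_pmf(n, r, p):
    """P(the r-th event occurs on trial n), for n = r, r + 1, ..."""
    return comb(n - 1, r - 1) * p**r * (1 - p) ** (n - r)

def hypergeometric_pmf(k, population, events_in_population, draws):
    """P(k events among `draws` items sampled without replacement)."""
    return (
        comb(events_in_population, k)
        * comb(population - events_in_population, draws - k)
        / comb(population, draws)
    )

p = 0.015  # hypothetical per-trial event probability
print("First event on trial 10:", round(geometric_pmf(10, p), 5))
print("Third event on trial 100:", round(negative_binomial_pmf(100, 3, p), 5))

# Hypothetical lot of 50 items containing 5 defectives; sample 10 without replacement.
print("Exactly 1 defective in sample:", round(hypergeometric_pmf(1, 50, 5, 10), 5))
```

The geometric distribution is the special case of the negative binomial with a single required event, so `negative_binomial_pmf(n, 1, p)` and `geometric_pmf(n, p)` agree.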
Example Use of the Binomial Distribution
In a future post, I will show you ways you can use the various discrete probability distributions for binary data. For now, I’ll include an example use of only the binomial distribution to give you an idea. The graph below shows us that if the probability of a defective product is 1.5% and you are modeling a sample size of 30, you’d expect just over 60% of the samples to have zero defective products. Additionally, the binomial distribution predicts that about 7.4% of the samples will have two or more defective products.
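You can reproduce the two percentages in this example directly from the binomial probability formula. The following is a minimal sketch using only the Python standard library; the function and variable names are my own.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k events in n independent trials with event probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_trials = 30        # sample size from the example
p_defective = 0.015  # probability that a product is defective

p_zero = binomial_pmf(0, n_trials, p_defective)
p_one = binomial_pmf(1, n_trials, p_defective)
p_two_or_more = 1 - p_zero - p_one  # complement of "zero or one defective"

print(f"P(0 defectives)  = {p_zero:.3f}")         # about 0.635
print(f"P(2+ defectives) = {p_two_or_more:.3f}")  # about 0.074
```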
Related post: Learn more about the various discrete probability distributions for binary data
Performing a Goodness-of-Fit Test for Other Discrete Distributions
If you are working with discrete data that are not binary data, chances are you’ll need to perform a Chi-square goodness-of-fit test to decide if your data fit a particular discrete probability distribution. These tests compare the theoretical frequencies to the frequencies of the observed values. If the difference is statistically significant, you can conclude that your data do not follow that specific discrete distribution.
Like any statistical hypothesis test, Chi-square goodness-of-fit tests have a null hypothesis and an alternative hypothesis.
- H0: The sample data follow the hypothesized distribution.
- H1: The sample data do not follow the hypothesized distribution.
For goodness-of-fit tests, small p-values indicate that you can reject the null hypothesis and conclude that your data were not drawn from a population with the specified distribution. Consequently, goodness-of-fit tests are a rare case where you look for high p-values to identify candidate distributions.
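For reference, the standard chi-square goodness-of-fit statistic can be written as follows. The symbols are my notation and do not come from the software output: O_i is the observed count in category i, E_i is the expected count, n is the sample size, p_i is the hypothesized probability of category i, k is the number of categories, and m is the number of distribution parameters estimated from the data.

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},
\qquad E_i = n \, p_i,
\qquad \text{df} = k - 1 - m
```

The p-value is the probability that a chi-square variable with those degrees of freedom equals or exceeds the observed statistic. Larger gaps between the observed and expected counts produce a larger statistic and a smaller p-value.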
I’ll show you how to test whether your discrete data follow the Poisson distribution and a distribution based on a categorical variable. You can download the CSV file that contains the data for both examples: DiscreteGOF.
Testing the Goodness-of-Fit for a Poisson Distribution
The Poisson distribution is a discrete probability distribution that models the count of events or characteristics over a constant observation space. Values must be integers that are greater than or equal to zero. For example, the number of sales per day in a store can follow the Poisson distribution. If these data follow the Poisson distribution, you can use this distribution to make predictions.
I’ll use an accident count example to show you how to determine whether your data follow the Poisson distribution.
Suppose a safety inspector needs to monitor the number of car accidents per month at a specific intersection. The inspector enters the number of monthly accidents in a worksheet like this:
Each value denotes the count of accidents in one month. The actual dataset has 50 values that cover 50 months.
To determine whether these data follow the Poisson distribution, we need to use the Chi-Squared Goodness-of-Fit Test for the Poisson distribution. The statistical output for this test is below.
This test compares the observed counts to the expected counts based on the Poisson distribution. The p-value is larger than the common significance level of 0.05. Consequently, we fail to reject the null hypothesis, and the test provides no evidence that these data depart from the Poisson distribution. You can use the Poisson distribution to make predictions about the probabilities associated with different counts. You can also use analyses that assume the data follow the Poisson distribution. These analyses include the 1- and 2-sample Poisson rate analyses, and the U Chart.
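If you want to see the mechanics behind a test like this, here is a rough Python sketch that uses only the standard library. The tally of monthly counts is hypothetical (it is not the DiscreteGOF data), the choice to pool counts of 4 or more is an assumption, and `chi2_sf` is a helper I wrote to compute the chi-square upper-tail probability from a series expansion. Statistical software may pool categories differently, so expect small differences in the output.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution (series expansion)."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def poisson_pmf(k, mean):
    """Probability of exactly k events for a Poisson distribution."""
    return math.exp(-mean) * mean**k / math.factorial(k)

# Hypothetical tally: accident count (key) -> number of months with that count.
months_by_count = {0: 7, 1: 13, 2: 14, 3: 9, 4: 5, 5: 2}
n_months = sum(months_by_count.values())
mean_rate = sum(k * m for k, m in months_by_count.items()) / n_months

# Pool counts of 4 or more into one category so no expected count is tiny.
pool_from = 4
observed = [months_by_count.get(k, 0) for k in range(pool_from)]
observed.append(sum(m for k, m in months_by_count.items() if k >= pool_from))

probs = [poisson_pmf(k, mean_rate) for k in range(pool_from)]
probs.append(1 - sum(probs))  # P(count >= 4)
expected = [n_months * p for p in probs]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1 - 1  # one extra df is lost because the mean was estimated
p_value = chi2_sf(chi_square, df)

print(f"mean = {mean_rate:.2f}, chi-square = {chi_square:.3f}, df = {df}, p = {p_value:.3f}")
```

If SciPy is available, `scipy.stats.chisquare(observed, f_exp=expected, ddof=1)` should give the same statistic and p-value, where `ddof=1` accounts for the estimated mean.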
Related post: Using the Poisson Distribution
Categorical Variables and Discrete Distributions
A categorical variable also has a discrete probability distribution of values. Each level of the categorical variable is associated with a probability. To determine whether the distribution of categorical data follows the values that you expect, you can perform the Chi-Square Goodness-of-Fit test. This test is very similar to the Poisson version except that you must specify the test proportions.
I’ll walk you through an example. It is fairly easy to perform this test.
Car color example of a discrete distribution
PPG Industries studied the paint color of new cars bought in 2012 for the entire world. We want to assess whether the distribution of car colors in our local area follows the global distribution. In this example, the PPG data are real, but I’m making up our local data. The car color is our categorical variable and the levels are the individual colors.
After gathering a random sample of the color of cars sold in our state, we enter the observed data and global proportions in a worksheet like this:
The OurState column contains the tally for each color that we observed. The PPG Industries data are in the Global Proportions column. We’ll perform the Chi-square goodness-of-fit test to determine whether our local distribution differs from the global distribution. We’ll use the PPG proportions as the test proportions.
This table displays both a frequency distribution and relative frequency distribution. For more information, read my post, Relative Frequencies and Their Distributions.
The Chi-square goodness-of-fit test results
This goodness-of-fit test compares the observed proportions to the test proportions to see if the differences are statistically significant. The p-value is less than the significance level of 0.05. Therefore, we can conclude that the discrete probability distribution of car colors in our state differs from the global proportions.
The Contribution to Chi-squared column tells us which paint colors contribute the most to the statistical significance. Gray and Red are the top two colors, but we don’t know the nature of how they contribute to the difference.
Let’s look at the observed and expected values chart to see how these values are different.
The chart indicates that the observed number of gray cars is higher than expected. On the other hand, the observed number of red cars is less than expected.
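The expected counts and the contributions to Chi-squared are simple to compute yourself. The sketch below uses invented proportions and tallies (not the PPG figures or the worksheet values from this post) that I chose so the pattern resembles the one described above. The critical value is the standard chi-square cutoff for a 0.05 significance level with 6 degrees of freedom.

```python
# Hypothetical test proportions and observed tallies (not the PPG figures).
test_proportions = {
    "White": 0.25, "Black": 0.20, "Silver": 0.15, "Gray": 0.15,
    "Red": 0.10, "Blue": 0.08, "Other": 0.07,
}
observed = {
    "White": 48, "Black": 36, "Silver": 28, "Gray": 46,
    "Red": 8, "Blue": 18, "Other": 16,
}

n = sum(observed.values())

# Expected count = sample size x test proportion for each color.
expected = {color: n * p for color, p in test_proportions.items()}

# Each color's contribution to the chi-square statistic.
contributions = {
    color: (observed[color] - expected[color]) ** 2 / expected[color]
    for color in observed
}
chi_square = sum(contributions.values())
df = len(observed) - 1   # no parameters are estimated from the data
CRITICAL_VALUE = 12.592  # chi-square cutoff for alpha = 0.05, df = 6

ranked = sorted(contributions, key=contributions.get, reverse=True)
for color in ranked:
    direction = "more" if observed[color] > expected[color] else "fewer"
    print(f"{color:<7} observed {observed[color]:>3}, expected {expected[color]:5.1f}, "
          f"contribution {contributions[color]:.3f} ({direction} than expected)")

print(f"chi-square = {chi_square:.3f}, df = {df}, "
      f"significant at 0.05: {chi_square > CRITICAL_VALUE}")
```

Sorting by contribution shows which categories drive the result, and comparing the observed and expected counts shows the direction of each difference.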
You can use this analysis to test Benford’s law, which is a fascinating discrete probability distribution that describes how often numbers in datasets start with each digit from 1 to 9. Learn more about Benford’s law and its distribution.
Read my more in-depth post about the chi-square goodness of fit test.
Closing Thoughts
There are several different types of discrete variables that can produce different types of discrete probability distributions. The process by which you test your data to determine whether they follow a specific distribution depends on the type of discrete variable.
In summary:
- Binary data: Check the assumptions.
- Count data: Use the Poisson Goodness-of-Fit Test.
- Categorical variable: Use the Chi-Square Goodness-of-Fit Test and designate the test proportions.
Kylie says
Hi Jim, thank you for your posts. As a Statistics Undergraduate, I have learnt all these in school but because it is so technical, I find that I am struggling to apply them in real life projects. Your posts helped me to understand the application and see the beauty in stats, I am so grateful for it!
I have a question though, you mentioned earlier that for binomial distribution, there is a possibility that probability may change over time. For example, the probability that a process produces defective products can change over time, and in this scenario we should use a p-chart to help us determine. I went to look it up and a p-chart is a quality control chart used to monitor the proportion of nonconforming units in different samples of size n which shows how the process changes over time. In this case, how do we use the probability for our Binomial distribution? Do we continuously update the probability when it changes?
I am working on a project which involves warranty claims over time and I am trying to look at the claim of the product as a Binomial distribution X~(n,p) where P(X=1) = claimed and P(X=0) = unclaimed. However, I am unsure if I can use the proportion as the probability since there may be more claims occurring in the future. Also, there are different factors affecting whether a claim will occur eg. different manufacturing period (since there will be changes and improvement for each period), type of customers that will file a complaint etc. I am unsure how or if I can even factor these into determining the probability. I am unsure if I am on the right track, is it possible to use a Binomial distribution for this?
Atul Patil says
Hello Jim
I have a discrete data set of 1000 data points between 0 and 3
Mean = 1.97
Standard Error = 0.02101289
Median = 2
Mode = 2
Standard Deviation = 0.664485923
Sample Variance = 0.441541542
Kurtosis = -0.207838243
Skewness = -0.151354514
Range = 3
Minimum = 0
Maximum = 3
Sum = 1970
Count = 1000
I am trying to fit a distribution using chi-square test but not able to get p-value above 0.05
Can you help me figure out which distribution can fit this data?
Thank you
Jim Frost says
Hi Atul, unfortunately there’s really no way for me to tell from summary values like that. With just that limited number of discrete values in such a tight range, there might not be a good distribution for it.
Nowshad says
Hi
For the number of deaths due to COVID-19, which model is best?
Stephen Sams says
Hi Jim, I understand that the number of cars that a dealership sells in a day can follow the Poisson distribution since it’s a discrete distribution. What about the price of each car that a dealership sells in a day? I’m guessing that’s a continuous distribution, or is it also discrete?
Jim Frost says
Hi Stephen,
Yes, that would be a great example for the Poisson distribution. The Poisson distribution consists of non-negative integers (0, 1, 2, 3, etc.) and is often used to model count data, such as the number of cars sold in a day. However, the Poisson distribution doesn’t adequately model all count data. This distribution assumes that the mean equals the variance. If that’s not true for a dataset, then the Poisson distribution isn’t a good fit.
Price is a continuous variable.
VEENA G says
Which software did you use to conduct the chi-square test?
Jim Frost says
Hi Veena,
I used Minitab Statistical Software.
Ali Al-Khafaji says
Hi, thank you for this great website! I am using multiple linear regression to predict a certain response. Now, I am trying to conduct statistical significance tests to assess the relative importance of each independent variable on that response. My issue is that I don’t know what tests I should use. I have read quite a few papers in my field and still haven’t figured it out. They all mention ANOVA, but they don’t say how.
What I know is that Regression is used when both the input and output data are continuous. On the other hand, ANOVA is used when the input is discrete and the output is continuous, so how can I use ANOVA to assess the relative importance of a continuous independent variable (the input) and that test is not even made for continuous inputs? Thank you again!!
Jim Frost says
Hi Ali,
I have written a blog post that covers this topic exactly! Identifying the Most Important Predictors.
I think that post will answer your questions. And, because your study involves regression analysis, you might consider getting my ebook about regression because it covers this topic and many others in more detail!
Best of luck with your analysis!
phani says
Dear Jim,
You are a life saver. Keep the good work coming. I am definitely gonna buy your books! You make stats look like fun
:)) Cheers!
Jim Frost says
Thank you so much! I strive to make stats fun, so your comment means a lot to me!
Neha says
While finding the tabulated value of the chi-square distribution, how do you determine the degrees of freedom for different distributions in a goodness-of-fit test?
Sanket Agrawal says
Hi, while testing for the goodness of fit, if the count involves people (suppose) or any other entity that should be a whole number, should we round off our expected frequencies to the nearest integer.
I have seen many of my classmates doing so and arguing that number of people should be a whole number, however I am rather skeptical about this approach.
What is your suggestion on this?
Jim Frost says
Hi Sanket,
Personally, I would not bother doing so. The expected counts with decimal places for the Poisson distribution represent the theoretical distribution. Rounding these values off can actually increase or bias the error. I don’t see any reason to risk that when you can just leave the values alone!
Sanket Agrawal says
When testing the goodness of fit for discrete or categorical distributions for example Poisson in this case, do we have to round off the expected frequencies to the nearest integers, since Poisson distribution takes values on the range of positive integers only?
I have seen many of my classmates rounding off the expected frequencies for questions that involve people’s count arguing that they have to be whole numbers, however I am myself skeptical about this proposal.
What do you think about this?
Stan Alekman says
Thanks for the prompt reply to my question. Sterile drug manufacturers trend the count data they generate for microbial observations. Most of the time, the counts are zero. Occasionally or when there is a problem, counts are observed, which are evidence of contamination. These data are trended following zero inflated models.
Stan Alekman says
Often discrete data have so many zero values for events, the data do not fit the distributions and are addressed as over-dispersed fits. Can you comment and write about these data sets.
Jim Frost says
Hi Stan, I’ve heard about this issue mainly in the context of count data where there are too many zeros for the Poisson distribution. Zero inflated regression models account for this. I do write a little about this problem and potential solution in a post about Choosing the Correct Type of Regression Analysis. I’m not sure if you’re working in the regression context. I can add this to my list of things to write about in more detail!
Jerry Tuttle, FCAS says
Hi. One of the assumptions in the car accidents problem that the data follows a Poisson distribution is that the mean equals the variance. If the variance is greater than the mean, perhaps another distribution should be used such as Negative Binomial.
Jim Frost says
Hi Jerry, you raise an excellent point, and I agree with it. It’s not uncommon to have count data that don’t follow the Poisson distribution for the reason you state. The Poisson Goodness-of-Fit test should detect this condition. In other words, if you have count data where the variance is greater than the mean (or less than), you should get a statistically significant test result for the Poisson GOF test. This tells you that your data don’t follow the Poisson distribution and you should consider a different distribution.
Jim
Ricardo Garza-Mendiola says
Hello, can you show me (remind me) how you estimated the expected value to calculate chi-square? Greetings.
Jim Frost says
Hi Ricardo,
Thanks for the great question. The expected value depends upon which Chi-squared test you perform. In both cases, the expected values are the hypothesized values for the null hypothesis. You’re testing to see if your actual values are significantly different from the hypothesized values.
For the Poisson goodness-of-fit test, the expected values are based on the Poisson distribution. If your data followed the Poisson distribution exactly, these are the values you’d observe in your data.
For the Chi-squared goodness-of-fit test for the categorical variable, the expected values are based on the values that you specify. In this case, I entered the proportions that PPG found in their study. The software uses these proportions and applies them to the sample size to calculate the expected values.
So, the expected values really depend upon the distribution you are testing.
I hope this helps!
Jim