Binary data occur when you can place an observation into only one of two categories. They tell you whether an event occurred or whether an item has a particular characteristic. For instance, an inspection process produces binary pass/fail results. Or, when a customer enters a store, there are two possible outcomes: sale or no sale. In this post, I show you how to use the binomial, geometric, negative binomial, and hypergeometric probability distributions to glean more information from your binary data.
Probability distributions are functions that describe the likelihood of obtaining the possible values that a random variable can assume. To learn more about them, read my post Understanding Probability Distributions.
At a basic level, binary data allow you to calculate proportions and percentages easily. What is the proportion of items that pass the inspection? What percentage of customers make a purchase? Additionally, you can use hypothesis tests with binary data to assess the statistical significance of the difference between group proportions. For example, I have used the 2 Proportions test to evaluate the effectiveness of flu shots. In that case, the binary outcomes for the human subjects are either “infected” or “not infected” with the flu.
However, you can also use binary data and their probability distributions to model probabilities and the frequency of occurrences. For example:
- How many times is an event likely to occur?
- When is the first instance probable?
- How many opportunities do I need to produce a specific number of events?
Being able to answer these questions can be quite valuable. How do you provide these answers? All you need are several convenient discrete probability distributions that are designed for binary data. In this blog post, I’ll show you the benefits of using the binomial, geometric, negative binomial, and hypergeometric distributions. Each of these probability distributions allows you to answer different questions about your binary data.
Assumptions for Using Probability Distributions for Binary Data
To use the binomial, geometric, negative binomial, and the hypergeometric distributions, you need to satisfy the following assumptions.
- There are only two possible outcomes per trial. For example, accept or reject, sale or no sale, etc.
- Each trial is independent (except for hypergeometric). The result of one trial does not affect the results of another trial. For instance, when flipping a coin, the outcome of a coin toss doesn’t influence the next coin toss. Learn more about Independent Events.
- The probability remains constant over time (except for hypergeometric). In some cases, this assumption is valid based on the physical properties, such as flipping a coin. However, if there is a chance the probability can change over time, you can use the P chart (a control chart) to confirm this assumption. For example, it’s possible that the probability that a process produces defective products can change over time.
Learn more about the assumption of independent and identically distributed (IID) data, which relates to the second and third assumptions above.
These distributions are examples of Probability Mass Functions (PMFs) because they calculate likelihoods for discrete random variables.
For most of these binary probability distributions, I’ll use a die rolling example. Assume that we’re playing a game where rolling a 6 is very advantageous. In this scenario, rolling a 6 is binary because an observation can be either a 6 or not a 6. The probability of rolling a 6 is 1/6, or about 0.1667. We’ll use the binomial, geometric, and negative binomial distributions to calculate probabilities for how many 6s we’ll roll, when they’ll first appear, and the likelihood of observing a certain number of 6s.
In the examples, I’ll show you when to use each distribution and how to interpret the results. I’ll cover both how to interpret the probability for a specific outcome and the cumulative probability for a range of outcomes. Consequently, it’s important to notice the difference between the probability for each discrete value (each bar in the graphs) and the cumulative probabilities for shaded regions.
Related post: Probability Fundamentals
Binomial Distribution
Use the binomial probability distribution to calculate probabilities that an event occurs a certain number of times in a set number of trials. Specifically, it calculates the probability of X events happening within N trials. This distribution expands on the Bernoulli distribution, which models only a single trial.
Suppose you want to determine how many 6s you are likely to roll when you roll a die ten times. Additionally, let’s find the cumulative probability of rolling a 6 four or more times.
In statistical software, I’ll graph the binomial distribution by entering a probability of 0.1667 and specifying ten trials.
The graph displays the probability of rolling each possible number of 6s when you roll the die ten times. For example, the highest probability (0.32) occurs with rolling a 6 exactly one time in ten rolls. We have a 16% chance of rolling no 6s. We also want to determine the probability of rolling 6s four or more times. The shaded area sums the probabilities for four events and higher to calculate this cumulative probability. The cumulative probability of rolling at least four 6s is 0.06977.
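If you don’t have statistical software handy, the binomial numbers above can be reproduced with a few lines of code. The following is a minimal sketch in Python using only the standard library; the function and variable names are my own for illustration, and it uses the exact probability 1/6 rather than the rounded 0.1667, so the last digits can differ slightly from the graph.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n_rolls = 10
p_six = 1 / 6  # exact value; the post enters the rounded 0.1667

# Probability of each possible number of 6s, from 0 through 10.
pmf = [binomial_pmf(k, n_rolls, p_six) for k in range(n_rolls + 1)]

p_no_six = pmf[0]               # about 0.16
p_one_six = pmf[1]              # about 0.32, the tallest bar
p_four_or_more = sum(pmf[4:])   # about 0.07, the shaded region

print(f"P(no 6s)          = {p_no_six:.4f}")
print(f"P(exactly one 6)  = {p_one_six:.4f}")
print(f"P(four or more)   = {p_four_or_more:.4f}")
```

Summing the individual bars from four upward is exactly what the shaded region in the graph represents.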
For more detailed information, read my post, The Binomial Distribution.
Geometric Distribution
Use the geometric probability distribution when you know the probability of an event occurring and want to calculate the probability of the event first occurring during a specific trial. In other words, if you keep drawing random samples, what is the probability of the event/characteristic first appearing on each draw?
With the die example, we’ll use the geometric distribution to determine the probability of rolling the first 6 on different numbers of rolls. Additionally, we want to learn the cumulative probability that the first 6 appears on the 7th roll or later.
Each bar in the graph represents the probability of rolling the first 6 on a specific trial. For instance, the probability of rolling the first 6 on the third roll specifically is about 0.12. Interestingly, you might think you’re virtually guaranteed to get a 6 when you roll the die six times. However, the red shaded region indicates that you have a 33% cumulative chance of rolling the first 6 on the 7th roll or later.
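The geometric probabilities are simple enough to compute by hand: the first 6 lands on roll k only if the previous k − 1 rolls all miss. Here is a small sketch in Python under that reasoning; the helper names are hypothetical and chosen just for this example.

```python
def geometric_pmf(k, p):
    """Probability that the first event occurs exactly on trial k (k = 1, 2, ...)."""
    return (1 - p) ** (k - 1) * p

def geometric_tail(k, p):
    """Probability that the first event occurs on trial k or later.

    This happens when the first k - 1 trials all fail.
    """
    return (1 - p) ** (k - 1)

p_six = 1 / 6

p_first_on_third = geometric_pmf(3, p_six)      # (5/6)^2 * (1/6), about 0.1157
p_seventh_or_later = geometric_tail(7, p_six)   # (5/6)^6, about 0.3349

print(f"P(first 6 on roll 3)        = {p_first_on_third:.4f}")
print(f"P(first 6 on roll 7 or later) = {p_seventh_or_later:.4f}")
```

The tail probability is the shaded region: six misses in a row is less rare than intuition suggests.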
For more detailed information, read my post, The Geometric Distribution: Uses, Calculator & Formula.
Negative Binomial Distribution
Use the negative binomial probability distribution to calculate probabilities for the number of trials that are required to observe the event a specific number of times. In other words, given a known probability of an event occurring and the number of events that you specify, this distribution calculates the probability that you reach that number of events on the Nth trial.
For the die example, suppose we want to determine the probability of rolling 6s five times based on the number of total rolls. Additionally, we want to assess the cumulative probability to determine the number of rolls necessary to have a 50% chance of rolling five 6s.
In the statistical software, I enter the probability and specify 5 events.
In the plot, each bar represents the probability that the fifth 6 occurs on the specified roll. For example, the highest probability (about 0.04) occurs at 24 rolls, which is the peak of the distribution. Additionally, the shaded area indicates that the cumulative probability of obtaining five 6s within the first 27 rolls is nearly 0.5.
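To check these negative binomial values yourself, here is a rough sketch in Python. It assumes the "number of trials" form of the distribution, where the random variable is the roll on which the fifth 6 appears; some software instead counts the number of non-events, so compare definitions before matching numbers. The function names are illustrative, and the code uses the exact probability 1/6.

```python
from math import comb

def negbinom_pmf(n, r, p):
    """Probability that the r-th event occurs exactly on trial n (n >= r)."""
    return comb(n - 1, r - 1) * p**r * (1 - p) ** (n - r)

def negbinom_cdf(n, r, p):
    """Probability that the r-th event occurs on or before trial n."""
    return sum(negbinom_pmf(i, r, p) for i in range(r, n + 1))

r_sixes = 5
p_six = 1 / 6

p_fifth_on_24 = negbinom_pmf(24, r_sixes, p_six)   # roughly 0.036
p_within_27 = negbinom_cdf(27, r_sixes, p_six)     # a little under 0.5

# Smallest number of rolls giving at least a 50% chance of five 6s.
rolls_for_half = next(
    n for n in range(r_sixes, 500) if negbinom_cdf(n, r_sixes, p_six) >= 0.5
)

print(f"P(fifth 6 on roll 24)      = {p_fifth_on_24:.4f}")
print(f"P(five 6s within 27 rolls) = {p_within_27:.4f}")
print(f"Rolls needed for >= 50%    = {rolls_for_half}")
```

The search at the end answers the planning question directly: it reports the first roll count at which the cumulative probability reaches one half.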
For more information, read my post The Negative Binomial Distribution: Uses, Calculator & Formula.
Hypergeometric Distribution
Use the hypergeometric probability distribution when you are drawing from a small population without replacement, and you want to calculate probabilities that an event occurs a certain number of times in a set number of trials. Like the binomial distribution, the hypergeometric distribution calculates the probability of X events given N trials. However, unlike the binomial distribution, it does not assume that the likelihood of an event’s occurrence is constant. In fact, the hypergeometric distribution assumes that the probability changes because you are drawing from a small population without replacement.
To illustrate how to use this distribution, we have to move away from rolling 6s on our die! Instead, we’ll draw candy blindly from a jar. Suppose there are 15 candies of various colors in the jar and our favorite candies are red.
For this scenario, the binary data values are “red” and “not red”. At the start, 5 out of the 15 (33%) candies are red. We’ll use the hypergeometric distribution to calculate the probabilities of drawing red candies when we draw five candies from the jar.
The probabilities in this scenario are not constant because each draw from the jar affects the probabilities for the next draw. For instance, if you draw a red candy, that reduces the total number of red candies remaining in the jar, which reduces the probability of drawing another one the next time. The hypergeometric distribution accounts for these changing probabilities.
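To make the changing probabilities concrete, the short Python sketch below works out the chance of drawing a red candy on the second draw, depending on what came out first. It uses exact fractions from the standard library; the variable names are just for illustration.

```python
from fractions import Fraction

total_candies = 15
red_candies = 5

# First draw: 5 red candies out of 15.
p_red_first = Fraction(red_candies, total_candies)

# Second draw, if the first candy was red: 4 red candies left out of 14.
p_red_second_after_red = Fraction(red_candies - 1, total_candies - 1)

# Second draw, if the first candy was not red: still 5 red out of 14.
p_red_second_after_other = Fraction(red_candies, total_candies - 1)

print(p_red_first)               # 1/3
print(p_red_second_after_red)    # 2/7
print(p_red_second_after_other)  # 5/14
```

With a die, the equivalent probabilities would all be 1/6 no matter what was rolled before, which is why the binomial distribution does not fit the candy jar.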
Hypergeometric example
In the statistical software, I specify the following:
- The population size is 15.
- There are five red candies.
- We’ll draw 5 candies from the jar randomly.
The graph displays the probability of drawing each possible number of red candies when you draw 5 candies altogether. For example, the highest probability is approximately 0.4 and occurs with obtaining two red candies. There is an 8% chance of getting no red candies. And, you have a cumulative probability of 0.1668 for drawing at least three red candies. More likely than not, you’ll have to be content with 1 or 2 red candies!
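The candy jar probabilities can also be computed by counting combinations: the number of ways to pick k red candies and 5 − k others, divided by the number of ways to pick any 5 candies. The following is a minimal Python sketch of that calculation; the function and argument names are my own, not those of any particular statistical package.

```python
from math import comb

def hypergeom_pmf(k, population, successes, draws):
    """Probability of exactly k successes when drawing without replacement."""
    return (
        comb(successes, k)
        * comb(population - successes, draws - k)
        / comb(population, draws)
    )

population = 15   # candies in the jar
red = 5           # red candies in the jar
draws = 5         # candies drawn

# Probability of drawing 0, 1, ..., 5 red candies.
pmf = [hypergeom_pmf(k, population, red, draws) for k in range(draws + 1)]

p_no_red = pmf[0]               # 252/3003, about 0.08
p_two_red = pmf[2]              # 1200/3003, about 0.40
p_three_or_more = sum(pmf[3:])  # 501/3003, about 0.1668

print(f"P(no red)            = {p_no_red:.4f}")
print(f"P(exactly two red)   = {p_two_red:.4f}")
print(f"P(three or more red) = {p_three_or_more:.4f}")
```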
Learn more about the Hypergeometric Distribution: Uses, Calculator & Formula.
Get the Most Out of Your Binary Data!
The binomial, geometric, negative binomial, and hypergeometric distributions describe the probabilities associated with the number of events and when they occur. This information can be invaluable for planning purposes. I kept the die and candy examples intentionally simple so they are easy to understand. However, for a science-based, real-world example, read my blog post about the effectiveness of flu shots.
In that post, the binary data values indicate whether the human subjects are “infected” or “not infected” with the flu. I use the 2 Proportions hypothesis test to determine whether the proportion of flu infections is significantly different between the vaccinated and unvaccinated groups. Then, I use the discrete probability distributions in this blog post to compare the expected number of flu cases and when they first occur by vaccination status.
Hi Jim
Definitely enjoyed your flu vaccine post and this follow up
Question – how do you assess the fragility of a study for categorical events when those events are rare?
Take the covid vaccine as an example – in one vaccine trial there were 43,000 participants, 162 positive in placebo group and 8 positive in test group – I understand the risk reduction etc.
But with events this rare (0.4% across groups) how does one assess that the study was sufficiently powered? I was thinking something similar to coefficient of variation but unsure how to apply it to this type of data.
Thanks
Hi Anthony,
Read my post about Moderna’s phase 3 vaccination trials. That’ll answer your questions about how they calculated power and how the data are analyzed. I haven’t read in depth about Pfizer’s, but Moderna’s study was sufficiently powered.
Hi Jim,
Your explanations are really useful and easy to understand. Thank you so much.
You’re very welcome, Shiska!
Hello sir,
I am a PG student. I need to do correlation for 4 continuous and 3 ordinal independent variables and 4 binary dependent variables. Which technique is suitable for this? Thank you
Hi Chitra,
Because of the binary dependent variables, you’ll need to use binary logistic regression. Read about it in my post about using the right kind of regression, which covers that type and more. Also, read an example where I use binary logistic regression.
Thank you sir
Hello Jim,
I am interested in logistic regression analysis because I am a Ph.D. student and the objectives of my research work are achievable with a logistic model. So I would like a detailed study of logistic regression written by you. Thanks, sir, for your cooperation.
Hi Sami, I’ll be writing about that in a future blog post. It’s definitely on my list!