Binary data occur when you can place an observation into only two categories. These data tell you that an event occurred or that an item has a particular characteristic. For instance, an inspection process produces binary pass/fail results. Or, when a customer enters a store, there are two possible outcomes: sale or no sale. In this post, I show you how to use the binomial, geometric, negative binomial, and hypergeometric distributions to glean more information from your binary data.

At a basic level, binary data allow you to calculate proportions and percentages easily. What is the proportion of items that pass the inspection? What percentage of customers make a purchase? Additionally, you can use hypothesis tests with binary data to assess the statistical significance of the difference between group proportions. For example, I have used the 2 Proportions test to evaluate the effectiveness of flu shots. In that case, the binary outcomes for the human subjects are either “infected” or “not infected” with the flu.

However, you can also use binary data to model probabilities and the frequency of occurrences. For example:

- How many times is an event likely to occur?
- When is the first instance probable?
- How many opportunities do I need to produce a specific number of events?

Being able to answer these questions can be quite valuable. How do you provide these answers? All you need are several convenient discrete probability distributions that are designed for binary data. In this blog post, I’ll show you the benefits of using the binomial, geometric, negative binomial, and hypergeometric distributions. Each of these distributions allows you to answer different questions about your binary data.

## Assumptions for Using Probability Distributions for Binary Data

To use the binomial, geometric, negative binomial, and the hypergeometric distributions, you need to satisfy the following assumptions.

- **There are only two possible outcomes per trial**. For example, accept or reject, sale or no sale, etc.
- **Each trial is independent (except for hypergeometric)**. The result of one trial does not affect the results of another trial. For instance, when flipping a coin, the outcome of a coin toss doesn’t influence the next coin toss.
- **The probability remains constant over time (except for hypergeometric)**. In some cases, this assumption is valid based on the physical properties, such as flipping a coin. However, if there is a chance the probability can change over time, you can use the P chart (a control chart) to confirm this assumption. For example, it’s possible that the probability that a process produces defective products can change over time.

Throughout most of these distributions, I’ll use a die rolling example. Assume that we’re playing a game where rolling a 6 is very advantageous. In this scenario, rolling a 6 is binary because an observation can be either a 6 or not a 6. The probability of rolling a 6 is 1/6, or about 0.1667. We’ll use the binomial, geometric, and negative binomial distributions to calculate probabilities for how many 6s we’ll roll, when they’ll first appear, and the likelihood of observing a certain number of 6s.

In the examples, I’ll show you when to use each distribution and how to interpret the results. I’ll cover both how to interpret the probability for a specific outcome and the cumulative probability for a range of outcomes. Consequently, it’s important to notice the difference between the probability for each discrete value (each bar in the graphs) and the cumulative probabilities for shaded regions.

## Binomial Distribution

Use the binomial distribution to calculate probabilities that an event occurs a certain number of times in a set number of trials. Specifically, it calculates the probability of X events happening within N trials.

Suppose you want to determine how likely it is to roll a 6 on a die when you roll the die ten times. Additionally, let’s learn the cumulative probability of rolling a 6 four or more times.

I’ll use statistical software to graph the binomial distribution, entering a probability of 0.1667 and specifying ten trials.

The graph displays the probability of rolling a 6 each possible number of times when you roll the die ten times. For example, the highest probability (0.32) occurs with rolling a 6 exactly one time in ten rolls. We have a 16% chance of rolling no 6s. We also want to determine the probability of rolling 6s four or more times. The shaded area sums the probabilities for four events and higher to calculate this cumulative probability. The cumulative probability of rolling at least four 6s is 0.06977.
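If you don’t have statistical software handy, the same binomial probabilities can be sketched in a few lines of Python. This is only an illustration using the standard library; the function and variable names (`binomial_pmf`, `p_six`, and so on) are my own and are not from any particular package.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k events in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_rolls = 10
p_six = 1 / 6  # exact value; the post enters the rounded 0.1667

# One probability per bar in the graph: 0 sixes, 1 six, ..., 10 sixes.
pmf = [binomial_pmf(k, n_rolls, p_six) for k in range(n_rolls + 1)]
for k, prob in enumerate(pmf):
    print(f"P(exactly {k} sixes) = {prob:.4f}")

# Cumulative probability for the shaded region: four or more sixes.
p_at_least_four = sum(pmf[4:])
print(f"P(at least 4 sixes) = {p_at_least_four:.4f}")
```

With the exact probability of 1/6, the cumulative result is about 0.0697. Entering the rounded 0.1667 instead nudges it up to the 0.06977 reported above.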

## Geometric Distribution

Use the geometric distribution when you know the probability of an event occurring and want to calculate the probability of the event first occurring during a specific trial. In other words, if you keep drawing random samples, what is the probability of the event/characteristic first appearing on each draw?

With the die example, we’ll use the geometric distribution to determine the probability of rolling the first 6 on different numbers of rolls. Additionally, we want to learn the cumulative probability that the first 6 appears on the 7th roll or later.

Each bar in the graph represents the probability of rolling the first 6 on a specific trial. For instance, the probability of rolling the first 6 on the third roll specifically is about 0.12. Interestingly, you might think you’re virtually guaranteed to get a 6 when you roll the die six times. However, the red shaded region indicates that you have a 33% cumulative chance of rolling the first 6 on the 7th roll or later.
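Here is a minimal Python sketch of the geometric calculations, assuming a fair die. The helper name `geometric_pmf` is hypothetical. The first 6 lands on roll k only if the first k - 1 rolls all miss and roll k hits.

```python
def geometric_pmf(k: int, p: float) -> float:
    """Probability that the first event occurs on exactly trial k (k >= 1)."""
    return (1 - p) ** (k - 1) * p


p_six = 1 / 6

# One probability per bar: first 6 on roll 1, roll 2, and so on.
for roll in range(1, 11):
    print(f"P(first 6 on roll {roll}) = {geometric_pmf(roll, p_six):.4f}")

p_first_six_on_third = geometric_pmf(3, p_six)

# "7th roll or later" means the first six rolls all fail to show a 6.
p_seventh_or_later = (1 - p_six) ** 6

print(f"P(first 6 on roll 3)        = {p_first_six_on_third:.4f}")
print(f"P(first 6 on roll 7 or later) = {p_seventh_or_later:.4f}")
```

The cumulative tail needs no summation here because it is simply the chance of six misses in a row, which works out to roughly one in three.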

## Negative Binomial Distribution

Use the negative binomial distribution to calculate probabilities for the number of trials that are required to observe the event a specific number of times. In other words, given a known probability of an event occurring and the number of events that you specify, this distribution calculates the probability that the last of those events occurs on the Nth trial.

For the die example, suppose we want to determine the probability of rolling 6s five times based on the number of total rolls. Additionally, we want to assess the cumulative probability to determine the number of rolls necessary to have a 50% chance of rolling five 6s.

In the statistical software, I enter the probability and specify 5 events.

In the plot, each bar represents the probability that the fifth 6 occurs on exactly the specified roll. For example, the maximum probability (0.04) of rolling the fifth 6 occurs at 24 rolls, which is the peak of the distribution. Additionally, the shaded area indicates that the cumulative probability of obtaining five 6s within the first 27 rolls is nearly 0.5.
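The negative binomial results can be approximated with a short Python sketch as well. I’m assuming a fair die and the "number of trials" form of the distribution, where the fifth 6 lands on roll n when exactly four 6s appear in the first n - 1 rolls and roll n is a 6. The function name `prob_rth_event_on_trial` is my own.

```python
from math import comb


def prob_rth_event_on_trial(n: int, r: int, p: float) -> float:
    """Probability that the r-th event occurs on exactly trial n."""
    if n < r:
        return 0.0  # you cannot see r events in fewer than r trials
    return comb(n - 1, r - 1) * p**r * (1 - p) ** (n - r)


p_six = 1 / 6
target_sixes = 5

# Height of the bar at 24 rolls.
p_fifth_six_on_24 = prob_rth_event_on_trial(24, target_sixes, p_six)

# Cumulative probability that the fifth 6 has appeared by roll 27.
p_within_27 = sum(
    prob_rth_event_on_trial(n, target_sixes, p_six)
    for n in range(target_sixes, 28)
)

# Smallest number of rolls with at least a 50% chance of five 6s.
cumulative = 0.0
rolls_for_half = target_sixes - 1
while cumulative < 0.5:
    rolls_for_half += 1
    cumulative += prob_rth_event_on_trial(rolls_for_half, target_sixes, p_six)

print(f"P(fifth 6 on roll 24)    = {p_fifth_six_on_24:.4f}")
print(f"P(five 6s within 27)     = {p_within_27:.4f}")
print(f"Rolls for a 50% chance   = {rolls_for_half}")
```

With an exact probability of 1/6, the cumulative probability at 27 rolls is about 0.48, and it first crosses 0.5 at 28 rolls.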

## Hypergeometric Distribution

Use the hypergeometric distribution when you are drawing from a small population without replacement, and you want to calculate probabilities that an event occurs a certain number of times in a set number of trials. Like the binomial distribution, the hypergeometric distribution calculates the probability of X events given N trials. However, unlike the binomial distribution, it does not assume that the likelihood of an event’s occurrence is constant. In fact, the hypergeometric distribution assumes that the probability changes because you are drawing from a small population without replacement.

To illustrate how to use this distribution, we have to move away from rolling 6s on our die! Instead, we’ll draw candy blindly from a jar. Suppose there are 15 candies of various colors in the jar and our favorite candies are red.

For this scenario, the binary data values are “red” and “not red”. At the start, 5 out of the 15 (33%) candies are red. We’ll use the hypergeometric distribution to calculate the probabilities of drawing red candies when we draw five candies from the jar.

The probabilities in this scenario are not constant because each draw from the jar affects the probabilities for the next draw. For instance, if you draw a red candy, that reduces the total number of red candies remaining in the jar, which reduces the probability of drawing another one the next time. The hypergeometric distribution accounts for these changing probabilities.
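To make the changing probabilities concrete, here is a tiny Python sketch of the first two draws from the jar described above (5 red candies out of 15). The variable names are just for illustration.

```python
from fractions import Fraction

red_candies = 5
total_candies = 15

# First draw: 5 of the 15 candies are red.
p_first_red = Fraction(red_candies, total_candies)

# Second draw, if the first candy was red: 4 reds remain among 14 candies.
p_second_red_after_red = Fraction(red_candies - 1, total_candies - 1)

# Second draw, if the first candy was not red: 5 reds remain among 14 candies.
p_second_red_after_other = Fraction(red_candies, total_candies - 1)

print(f"P(red on draw 1)                   = {p_first_red}")
print(f"P(red on draw 2 | draw 1 red)      = {p_second_red_after_red}")
print(f"P(red on draw 2 | draw 1 not red)  = {p_second_red_after_other}")
```

The second-draw probability depends on what came out first, which is exactly the dependence between trials that the binomial distribution does not allow.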

### Hypergeometric example

In the statistical software, I specify the following:

- The population size is 15.
- There are five red candies.
- We’ll draw 5 candies from the jar randomly.

The graph displays the probability of drawing each possible number of red candies when you draw 5 candies altogether. For example, the highest probability is approximately 0.4 and occurs with obtaining two red candies. There is an 8% chance of getting no red candies. And, you have a cumulative probability of 0.1668 for drawing at least three red candies. More likely than not, you’ll have to be content with 1 or 2 red candies!
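As a rough check on these numbers, the hypergeometric probabilities can be computed directly by counting combinations. This is a minimal sketch using only the Python standard library; `hypergeom_pmf` and its parameter names are my own.

```python
from math import comb


def hypergeom_pmf(k: int, population: int, events: int, draws: int) -> float:
    """Probability of exactly k events when drawing without replacement."""
    ways_to_pick_events = comb(events, k)
    ways_to_pick_others = comb(population - events, draws - k)
    return ways_to_pick_events * ways_to_pick_others / comb(population, draws)


population_size = 15  # candies in the jar
red_candies = 5       # candies with the characteristic of interest
candies_drawn = 5     # sample size

# One probability per bar: 0 red candies, 1 red candy, ..., 5 red candies.
pmf = [
    hypergeom_pmf(k, population_size, red_candies, candies_drawn)
    for k in range(candies_drawn + 1)
]
for k, prob in enumerate(pmf):
    print(f"P(exactly {k} red) = {prob:.4f}")

# Cumulative probability of drawing at least three red candies.
p_at_least_three = sum(pmf[3:])
print(f"P(at least 3 red) = {p_at_least_three:.4f}")
```

The peak at two red candies comes out to roughly 0.40, and the cumulative probability of at least three is about 0.1668, in line with the graph.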

## Get the Most Out of Your Binary Data!

The binomial, geometric, negative binomial, and hypergeometric distributions describe the probabilities associated with the number of events and when they occur. This information can be invaluable for planning purposes. I kept the die and candy examples intentionally simple so they are easy to understand. However, for a science-based, real-world example, read my blog post about the effectiveness of flu shots.

In that post, the binary data values indicate whether the human subjects are “infected” or “not infected” with the flu. I use the 2 Proportions hypothesis test to determine whether the proportion of flu infections is significantly different between the vaccinated and unvaccinated groups. Then, I use the discrete probability distributions in this blog post to compare the expected number of flu cases and when they first occur by vaccination status.

Anthkny says

Hi Jim

Definitely enjoyed your flu vaccine post and this follow up

Question – how do you assess the fragility of a study for categorical events when those events are rare?

Take the covid vaccine as an example – in one vaccine trial there were 43,000 participants, 162 positive in placebo group and 8 positive in test group – I understand the risk reduction etc.

But with events this rare (0.4% across groups) how does one assess that the study was sufficiently powered? I was thinking something similar to coefficient of variation but unsure how to apply it to this type of data.

Thanks

Jim Frost says

Hi Anthony,

Read my post about Moderna’s phase 3 vaccination trials. That’ll answer your questions about how they calculated power and how the data are analyzed. I haven’t read in depth about Pfizer’s, but Moderna’s study was sufficiently powered.

Shiska Raut says

Hi Jim,

Your explanations are really useful and easy to understand. Thank you so much.

Jim Frost says

You’re very welcome, Shiska!

Chitra kannan says

Hello sir,

I am a pg student. I need to do correlation for 4 continuous and 3 ordinal type independent variable and 4 binary type dependent variable. Which technique is suitable to do this? Thank you

Jim Frost says

Hi Chitra,

Because of the binary dependent variables, you’ll need to use binary logistic regression. Read about it in my post about using the right kind of regression, which covers that type and more. Also, read an example where I use binary logistic regression.

Sami econ says

Thank you sir

Sami econ says

Hello jims,

I am interested in logistic regression analysis because I am a Ph.D. student and the objectives of my research work are achievable with a logistic model. So I want some detailed study of logistic regression written by you. Thanks sir for cooperation.

Jim Frost says

Hi Sami, I’ll be writing about that in a future blog post. It’s definitely on my list!