What is a Hypergeometric Distribution?
The hypergeometric distribution is a discrete probability distribution that calculates the likelihood an event happens k times in n trials when you are sampling from a small population without replacement.
This distribution resembles the binomial distribution, except that it samples without replacement. When you sample without replacement, the probabilities change with each subsequent trial. Conversely, the binomial distribution assumes the chances remain constant over the trials.
For instance, when you draw an ace from a standard 52-card deck, the probability of drawing another ace on the next draw decreases from 4/52 to 3/51 because the deck has fewer aces.
The hypergeometric distribution can answer questions like the following. What is the probability of:
- Drawing two red candies when we draw five candies from a jar containing 5 red candies and 10 white candies?
- Drawing five cards of the same suit from a regular deck of cards?
- Seating 8 women on a jury of 13 people when drawing randomly from a jury pool of 50 people evenly split between men and women?
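As a quick illustration of the card question, here is a minimal Python sketch. It assumes a standard 52-card deck with four suits of 13 cards, and the function name `same_suit_probability` is my own invention for this example. For one particular suit, this is a hypergeometric probability with N = 52, K = 13, n = 5, and k = 5, and the sketch multiplies by four because any suit will do.

```python
from math import comb

def same_suit_probability(hand_size: int = 5) -> float:
    """Probability that every card in a random hand shares one suit.

    Assumes a standard deck: 52 cards, 4 suits, 13 cards per suit.
    """
    suits, cards_per_suit, deck = 4, 13, 52
    # Ways to take the whole hand from one suit, times the 4 possible suits,
    # divided by the number of possible hands.
    return suits * comb(cards_per_suit, hand_size) / comb(deck, hand_size)

print(f"{same_suit_probability():.5f}")  # about 0.00198
```

Note that this count includes straight flushes, because the question only asks for five cards of the same suit.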
As the population size increases relative to the sample size, the hypergeometric distribution more closely approximates the binomial distribution. The hypergeometric distribution is described by a Probability Mass Function (PMF) because it calculates likelihoods for a discrete random variable.
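To see that convergence numerically, here is a rough Python sketch. It keeps the sample at 13 draws with 8 successes, holds the population at half successes, and grows the population size. The helper names (`hypergeom_pmf`, `binom_pmf`, `gap`) are illustrative choices of mine, not functions from any statistics package.

```python
from math import comb

def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """P(X = k) when drawing n items without replacement."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for n independent trials with constant probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def gap(N: int, n: int = 13, k: int = 8) -> float:
    """Absolute difference between the two models for a half-success population."""
    return abs(hypergeom_pmf(k, N, N // 2, n) - binom_pmf(k, n, 0.5))

for N in (50, 500, 5000, 50000):
    print(f"N = {N:>6}: gap = {gap(N):.6f}")
```

The gap shrinks as N grows because removing 13 items from a very large population barely changes its makeup.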
In this post, learn how to use the hypergeometric distribution and its cumulative form, when you can use it, and how to calculate probabilities by hand using its formula. I also include a hypergeometric distribution calculator that you can use with what you learn.
For more information about other ways to use binary data, read my posts: Maximize the Value of Your Binary Data, and the Bernoulli, Binomial, Negative Binomial, and Geometric distributions.
Hypergeometric Probabilities
The hypergeometric distribution models the probabilities for exactly k events occurring in n trials when you know the composition of a small population. Let’s look at an example to bring it to life!
I’ll start by using statistical software to calculate the hypergeometric probabilities and create distribution plots. This process will help you understand what you can learn from it. Then we’ll move on to the hypergeometric distribution formula.
Suppose we’re interested in the possible outcomes for a jury selection. We want to know the probability of drawing 8 women for a jury of 13 when there are 25 female and 25 male candidates. For this example, assume the jurors are randomly selected from the pool of candidates.
We’ll need the following information to solve this problem:
- Total population size is 50 candidates (N).
- Number of events (women) in the population (all candidates) is 25 (K).
- The jury size is 13 (n).
- Outcome of interest is selecting 8 women (k).
The hypergeometric distribution accounts for how the probabilities change with each selection. As we select men and women from the candidate pool, it affects the makeup of the remaining population in the pool because there are no replacements. Each woman we choose reduces the number of women in the candidate pool, thus lowering the likelihood that the following selection will be a woman. Conversely, selecting a man increases the chances that the next juror will be a woman.
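A small Python sketch makes the shifting odds concrete. It uses exact fractions and a hypothetical helper, `prob_next_is_woman`, that I made up for this illustration. It assumes each remaining candidate is equally likely to be picked.

```python
from fractions import Fraction

def prob_next_is_woman(women_left: int, men_left: int) -> Fraction:
    """Chance the next random pick is a woman, given who remains in the pool."""
    return Fraction(women_left, women_left + men_left)

# Before any selections: 25 women and 25 men.
print(prob_next_is_woman(25, 25))  # 1/2

# After selecting one woman, fewer women remain.
print(prob_next_is_woman(24, 25))  # 24/49

# After selecting one man instead, women make up more of the pool.
print(prob_next_is_woman(25, 24))  # 25/49
```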
Example Results
My statistical software tells me that the likelihood is:
The hypergeometric probability distribution calculates a likelihood of 0.161934 for selecting eight women in 13 draws.
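If you don't have statistical software handy, you can reproduce this value with a few lines of Python. This is only a sketch using the standard library's `math.comb`, and it is not how my software produced the output. The variable names follow the N, K, n, and k defined above.

```python
from math import comb

N = 50  # candidates in the pool
K = 25  # women in the pool
n = 13  # jury size
k = 8   # women on the jury (outcome of interest)

# Ways to pick k women and n - k men, divided by all possible juries.
p = comb(K, k) * comb(N - K, n - k) / comb(N, n)
print(f"{p:.6f}")  # 0.161934
```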
That’s interesting but perhaps not so helpful by itself. We’re also interested in the chances of selecting other numbers of female jurors. Seeing the distribution of probabilities for different numbers is much more helpful.
Related post: Understanding Probability Distributions
Hypergeometric Distribution Graph
The hypergeometric distribution graph is helpful because it displays the probability of differing numbers of successes (k) out of the total number of trials (n). In the chart below, the distribution plot shows the likelihood of selecting exactly 0 women, 1 woman, 2 women, 3 women, . . ., and up to 13 women in the 13 selections. With this approach, the hypergeometric distribution graph covers the complete range of possible successes up to the total number of trials.
I like these graphs because they emphasize how we’re working with a distribution, and it’s easy to see which values happen more frequently. The graph below does not show the chances for fewer than 2 or more than 11 because those likelihoods are too low to display on the chart.
In the chart, each bar represents the probability of selecting a specific number of women during the 13 selections. The bar for 8 corresponds with the probability (0.161934) shown in the output above. At a glance, we can see that 6 and 7 female jurors are the most likely outcomes, each with a probability of approximately 0.24.
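To get the same picture without graphing software, the sketch below tabulates the whole distribution for the jury example and draws a crude text bar for each outcome. It is a rough stand-in for the chart rather than a recreation of it, and the dictionary name `pmf` is just a label I chose.

```python
from math import comb

N, K, n = 50, 25, 13  # pool size, women in pool, jury size

# Probability of exactly k women for every possible k from 0 to 13.
pmf = {k: comb(K, k) * comb(N - K, n - k) / comb(N, n) for k in range(n + 1)}

for k, p in pmf.items():
    bar = "#" * round(p * 100)  # one '#' per percentage point
    print(f"{k:>2} women: {p:.4f} {bar}")
```

The probabilities sum to 1, and the outcomes near the ends produce no visible bar at this scale, just as the chart omits them.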
Hypergeometric Cumulative Distribution Function
The hypergeometric distribution is excellent for understanding the likelihood of obtaining an exact number of events (k) within a certain number of trials (n) for a small population without replacement. However, you’re often not interested in just one specific number of outcomes.
For example, in the jury selection example above, you might want to learn the probability of selecting at least eight women.
Let me introduce you to the hypergeometric cumulative distribution function.
Technically, the hypergeometric cumulative probability calculates the likelihood of obtaining less than or equal to k events in n trials. Use the complementary (upper-tail) cumulative probability when you need to get a ≥ probability. These days, most statistical software will let you indicate the direction of the cumulative function for the hypergeometric distribution. I’ll use the hypergeometric distribution graph again to show you how it works.
For example, we want to know the chances of selecting ≥ 8 women in 13 attempts. Below, the shaded region shows the upper-tail cumulative probability of choosing at least eight women in 13 draws.
The likelihood of randomly choosing eight or more women in 13 selections is 0.2601, approximately 1 in 4.
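Here is a short Python sketch of that calculation, assuming the same jury values as before. It sums the probabilities for 0 through 7 women to get the cumulative probability and then subtracts from 1 to get the "at least eight" tail. The function name `jury_pmf` is hypothetical.

```python
from math import comb

def jury_pmf(k: int, N: int = 50, K: int = 25, n: int = 13) -> float:
    """Probability of exactly k women on the jury."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Cumulative probability: P(X <= 7).
p_at_most_7 = sum(jury_pmf(k) for k in range(0, 8))

# Upper tail: P(X >= 8) = 1 - P(X <= 7).
p_at_least_8 = 1 - p_at_most_7

print(f"P(X <= 7) = {p_at_most_7:.4f}")   # 0.7399
print(f"P(X >= 8) = {p_at_least_8:.4f}")  # 0.2601
```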
Learn more about Cumulative Distribution Functions: Uses, Graphs & vs PDF.
Hypergeometric Distribution Calculator
Use this hypergeometric distribution calculator to calculate probabilities and cumulative probabilities. Note that it uses “successes” to indicate the number of events (K and k).
Let’s use this calculator to recreate the preceding jury selection examples. In the calculator, enter Population size (N) = 50, Number of success states in population (K) = 25, Sample size (n) = 13, and Number of success states in sample (k) = 8. The calculator displays a hypergeometric probability of 0.16193, matching our results above for eight women.
Next, in What to compute, change P(X = k) to P(X ≥ k). The calculator displays 0.2601 for selecting at least eight women. This result matches our graphical example of the upper-tail cumulative probability.
Now, try one yourself. Imagine you’re drawing from a jar that contains 5 red candies and 10 non-red candies. What is the probability that you’ll draw two red candies with five random draws from the jar?
See the correct answer at the end of this post. Now, onto the formula for those who want to calculate the probabilities manually.
Hypergeometric Distribution Formula
Typically, you’ll use statistical software or online calculators to calculate the probabilities for the hypergeometric distribution. However, I’ll explain the hypergeometric distribution formula so you can calculate them manually and I’ll walk you through a worked example.
The hypergeometric distribution formula is the following:

$$P(X = k) = \frac{\binom{K}{k}\,\binom{N-K}{n-k}}{\binom{N}{n}}$$
Where:
- N is the number in the population.
- n is the sample size.
- K is the number of events or successes in the population.
- k is the number of events in the sample.
Use this hypergeometric formula to calculate the probability of k successes occurring in n trials for a small population without replacement. Notice how the formula incorporates the population’s characteristics (N and K) and the properties of the sample we’re assessing (n and k).
The formulation uses combinations. For example, using standard notation, $\binom{K}{k}$ (also written C(K, k)) is the number of ways you can start with K successes in the population and end up with k successes in your sample where the order of successes does not matter. For more information, read my post about Finding Combinations.
Let’s cover what each term in the equation means.
Given a population size of N that contains K successes, the hypergeometric distribution formula takes the number of ways to have k successes in a sample and multiplies that by the number of ways for n-k failures in the sample. Then it divides that product by the total number of outcomes for sample size n.
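That description translates almost word for word into code. The following Python function is a minimal sketch and not a replacement for a tested statistics library. The name `hypergeom_pmf` and the choice to return 0 for impossible outcomes are my own assumptions.

```python
from math import comb

def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """Probability of exactly k successes in n draws without replacement.

    N: population size, K: successes in the population,
    n: sample size, k: successes in the sample.
    """
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("Need 0 <= K <= N and 0 <= n <= N")
    # Outcomes that cannot happen (for example, more successes than draws).
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    ways_successes = comb(K, k)         # ways to get k successes
    ways_failures = comb(N - K, n - k)  # ways to get n - k failures
    total_samples = comb(N, n)          # all possible samples of size n
    return ways_successes * ways_failures / total_samples

# Example: exactly two aces in a five-card hand from a 52-card deck.
print(f"{hypergeom_pmf(k=2, N=52, K=4, n=5):.4f}")  # about 0.0399
```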
Let’s work through an example calculation to bring the formula to life!
Worked Example of Finding a Hypergeometric Probability
We’ll use the hypergeometric distribution formula to calculate the likelihood of choosing red candies from a jar. The jar contains 5 red candies and 10 non-red candies for a total of 15 candies. We’ll randomly draw five candies from the jar.
Let’s calculate our chances of getting two red candies in our five draws!
We’ll enter the following values in the hypergeometric distribution formula:
- N = 15 total candies in the jar.
- n = 5 draws for the sample.
- K = 5 red candies in the jar.
- k = 2 red candies in the sample.
For these calculations, I’ll use a combinations calculator to obtain the number of combinations for each term in the equation. To learn how to calculate the number of combinations by hand, click the link above about finding combinations.
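Plugging the candy values into the standard hypergeometric formula gives the calculation below. I've written it out in LaTeX as a reconstruction, so treat the layout as mine, but the combination counts (10, 120, and 3003) are easy to confirm with any combinations calculator.

```latex
P(X = 2)
  = \frac{\binom{5}{2}\,\binom{15-5}{5-2}}{\binom{15}{5}}
  = \frac{\binom{5}{2}\,\binom{10}{3}}{\binom{15}{5}}
  = \frac{10 \times 120}{3003}
  = \frac{1200}{3003}
  \approx 0.3996
```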
The probability of drawing precisely two red candies in our five random draws from the jar is 0.3996. This result also answers the hypergeometric calculator problem I gave you earlier!