What is Simpson’s Paradox?
Simpson’s Paradox is a statistical phenomenon that can occur when you combine subgroups into one group. Aggregating the data can change the apparent strength, and even the direction, of the relationship between two variables.
For example, in 1973, data seemed to show that men applying to the graduate school at UC Berkeley were more likely to be admitted than women. In other words, there appeared to be a positive association between being male and being admitted. However, when analysts assessed admission rates by department, they found that women had slightly better rates in most departments.
Read on to see why that happens and learn which answer is correct!
Understanding Simpson’s Paradox is crucial because it can completely flip your results. If you don’t watch out for it, you might accidentally report completely inaccurate outcomes!
Why Does Simpson’s Paradox Occur?
Simpson’s Paradox occurs because a third variable can affect the relationship between a pair of variables. Statisticians refer to this type of third variable as a confounder or confounding variable. To understand the correct relationship between two variables, you must factor in the influence of confounders.
Simpson’s Paradox occurs when the process of aggregating data excludes confounding variables. Usually, this happens unintentionally. The researchers might not realize the consequences of their actions.
Simpson’s Paradox is essentially the same concept as omitted variable bias in regression analysis, except that it is specific to cases where you combine data and ignore subgroup information. Learn more about Confounding Variables and Omitted Variable Bias.
Graphical Example
Let’s look at this graphically before returning to the admissions example. The data below show one group that seems to have a negative correlation between the X and Y variable. As X increases, Y tends to decrease.
Now we’ll factor in the subgroups. Below, it’s easy to see how there is actually a positive relationship between X and Y after including the subgroups. Combining the data and ignoring the subgroups obscured that relationship.
In the context of Simpson’s Paradox, the subgroups capture the confounding variable. By aggregating the data, you are effectively removing the confounder from the analysis, and it distorts the results.
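To make the graphical idea concrete, here is a minimal sketch in Python. The three subgroups and all of their values are invented for illustration and are not the data behind the plots. Within each subgroup Y rises with X, but the subgroups sit at progressively lower Y levels, so the pooled correlation comes out negative.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical subgroups (made-up values): Y increases with X inside
# every subgroup, but each subgroup starts at a lower Y than the last.
groups = {
    "A": ([1, 2, 3, 4], [10, 11, 12, 13]),
    "B": ([5, 6, 7, 8], [6, 7, 8, 9]),
    "C": ([9, 10, 11, 12], [2, 3, 4, 5]),
}

# Pooled analysis: ignore the subgroup labels entirely.
pooled_x = [x for xs, _ in groups.values() for x in xs]
pooled_y = [y for _, ys in groups.values() for y in ys]
pooled_r = pearson(pooled_x, pooled_y)

# Subgroup analysis: one correlation per subgroup.
within_r = {name: pearson(xs, ys) for name, (xs, ys) in groups.items()}

print(f"Pooled correlation: {pooled_r:.2f}")
for name, r in within_r.items():
    print(f"Correlation within subgroup {name}: {r:.2f}")
```

The pooled coefficient is negative while every within-subgroup coefficient is positive, which is the reversal that Simpson’s Paradox describes.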
Explaining the Admissions Example
If you want to compare the admissions rates for men and women at UC Berkeley, it seems logical that you can just look at the overall rates. I show the actual acceptance rates below:
Applicants | Acceptance rate
Men | 44%
Women | 35%
It sure appears that Berkeley prefers men and disadvantages women. However, there is more to the story thanks to Simpson’s Paradox!
Unfortunately, aggregating the data from all departments removes departmental differences from the analysis. It turns out that some departments have much lower acceptance rates than others. They’re more selective. The following two factors create the misleading, unbalanced acceptance rates in the previous table:
- Women tended to apply for the harder departments, lowering their overall acceptance rate.
- Men were inclined to apply for the easier departments, boosting their rates.
To determine whether the selection process favors men, we need to assess the data at the departmental level and compare acceptance rates within each department. This method holds each department’s acceptance rate constant, allowing for valid comparisons.
Let’s look at the data! There are 85 departments. The table shows the largest six.
Comparing the rates within departments paints a different picture. Women have a slight advantage over men in most departments.
The subgroup analysis accounts for the confounding variable of the varying admission rates.
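The mechanics are easy to reproduce with a toy example. The sketch below uses two hypothetical departments with invented application counts (these are not the Berkeley figures). Women have the higher acceptance rate in both departments, yet the lower overall rate, because most of them apply to the more selective department.

```python
# Hypothetical counts invented for illustration; NOT the Berkeley data.
# Each entry is (number applied, number admitted).
applications = {
    "Less selective dept": {"men": (800, 480), "women": (100, 65)},
    "More selective dept": {"men": (200, 40), "women": (900, 225)},
}

def rate(applied, admitted):
    """Acceptance rate as a fraction between 0 and 1."""
    return admitted / applied

# Acceptance rates within each department (the subgroup analysis).
within = {
    dept: {sex: rate(*counts) for sex, counts in by_sex.items()}
    for dept, by_sex in applications.items()
}

# Overall acceptance rates after pooling both departments.
overall = {}
for sex in ("men", "women"):
    applied = sum(by_sex[sex][0] for by_sex in applications.values())
    admitted = sum(by_sex[sex][1] for by_sex in applications.values())
    overall[sex] = rate(applied, admitted)

for dept, rates in within.items():
    print(f"{dept}: men {rates['men']:.0%}, women {rates['women']:.0%}")
print(f"Overall: men {overall['men']:.0%}, women {overall['women']:.0%}")
```

Comparing within departments holds selectivity fixed, so the within-department rates are the ones that speak to how applicants were treated.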
Simpson’s Paradox Example
Simpson’s Paradox occurs in numerous contexts. More recently, analysts observed it in media reports of more COVID deaths among the vaccinated than among the unvaccinated. In September 2022, 12,593 COVID deaths occurred in the United States. Of those who died, 39% were unvaccinated, while 61% were vaccinated. What?!
It turns out that the relationship between being vaccinated and having a higher percentage of deaths is a fiction created by aggregating data and tossing out relevant information—Simpson’s Paradox.
In the United States, the COVID vaccinated population tends to be older and has more risk factors. This group naturally tends to have worse COVID outcomes. However, when you adjust for age and other risk factors, the CDC finds that COVID vaccinated and boosted individuals have an 18.6 times lower risk of dying from COVID. The vaccines are working!
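Here is a small sketch of how that can happen. Every number in it is hypothetical and chosen only to illustrate the mechanism; none of it comes from the CDC. It assumes two age groups, a much higher vaccination rate among older people, and a vaccine that cuts the death rate fivefold within each age group.

```python
# Hypothetical populations and death counts invented for illustration;
# these are NOT CDC figures. Each entry is (population, deaths).
strata = {
    "younger": {"vaccinated": (50_000, 2), "unvaccinated": (50_000, 10)},
    "older": {"vaccinated": (95_000, 380), "unvaccinated": (5_000, 100)},
}

def death_rate(population, deaths):
    """Deaths per person in the group."""
    return deaths / population

# Death rates within each age group (the stratified comparison).
stratified = {
    age: {status: death_rate(*counts) for status, counts in by_status.items()}
    for age, by_status in strata.items()
}

# Pooled totals that ignore age (the aggregated comparison).
totals = {}
for status in ("vaccinated", "unvaccinated"):
    population = sum(by_status[status][0] for by_status in strata.values())
    deaths = sum(by_status[status][1] for by_status in strata.values())
    totals[status] = (population, deaths)

all_deaths = sum(deaths for _, deaths in totals.values())
vaccinated_share = totals["vaccinated"][1] / all_deaths
crude = {status: death_rate(*counts) for status, counts in totals.items()}

print(f"Share of all deaths among the vaccinated: {vaccinated_share:.0%}")
for status, r in crude.items():
    print(f"Crude death rate, {status}: {r:.4%}")
for age, rates in stratified.items():
    ratio = rates["unvaccinated"] / rates["vaccinated"]
    print(f"{age}: unvaccinated death rate is {ratio:.1f}x the vaccinated rate")
```

In this made-up population, vaccinated people account for most of the deaths and even have the higher crude death rate, yet within each age group their risk is five times lower. Age is the confounder that the pooled numbers hide.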
To wrap up, Simpson’s Paradox occurs when you aggregate data and lose essential details in the process. In the admissions example, you get opposite results when you look at the overall acceptance rates by gender but don’t consider the varying departmental acceptance rates. In the COVID example, you get confusing results when you assess the overall COVID death percentages by vaccination status without accounting for underlying risk factors.
It shouldn’t be surprising that discounting relevant factors will distort your results. But it is surprising how easily it can happen if you don’t watch for it!
To avoid this type of confusion, researchers must carefully consider the level of data aggregation and examine the data for potential confounding variables that could influence the results. By doing this, they can better ensure that their study results reflect the underlying trends and patterns in the data.
References
Sex Bias in Graduate Admissions: Data from Berkeley
Why Do Vaccinated People Represent Most COVID-19 Deaths Right Now?
CDC COVID Data Tracker: Rates of COVID-19 Cases and Deaths by Vaccination Status