Statistics By Jim

Making statistics intuitive

Simpson’s Paradox Explained

By Jim Frost

What is Simpson’s Paradox?

Simpson’s Paradox is a statistical phenomenon that occurs when you combine subgroups into one group. The process of aggregating data can cause the apparent direction and strength of the relationship between two variables to change.

For example, in 1973, data seemed to show that men applying to the graduate school at UC Berkeley were more likely to be admitted than women. In other words, being male appeared to be positively associated with admission. However, when analysts assessed admission rates by department, they found that women had slightly better rates.

Read on to see why that happens and learn which answer is correct!

Understanding Simpson’s Paradox is crucial because it can reverse the direction of your results. If you don’t watch out for it, you might report conclusions that are the opposite of what the data actually show!

Why Does Simpson’s Paradox Occur?

Simpson’s Paradox occurs because a third variable can affect the relationship between a pair of variables. Statisticians refer to this type of third variable as a confounder or confounding variable. To understand the correct relationship between two variables, you must factor in the influence of confounders.

Simpson’s Paradox occurs when the process of aggregating data excludes confounding variables. Usually, this happens unintentionally. The researchers might not realize the consequences of their actions.

Simpson’s Paradox is essentially the same concept as omitted variable bias in regression analysis, except that it is specific to cases where you combine data and ignore subgroup information. Learn more about Confounding Variables and Omitted Variable Bias.

Graphical Example

Let’s look at this graphically before returning to the admissions example. The data below show one group that seems to have a negative correlation between the X and Y variables. As X increases, Y tends to decrease.

[Figure: Simpson’s Paradox example data that is aggregated.]

Now we’ll factor in the subgroups. Below, it’s easy to see how there is actually a positive relationship between X and Y after including the subgroups. Combining the data and ignoring the subgroups obscured that relationship.

[Figure: Dataset broken down by subgroups to reveal the true relationship.]

In the context of Simpson’s Paradox, the subgroups capture the confounding variable. By aggregating the data, you are effectively removing the confounder from the analysis, and it distorts the results.
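The pattern in the two graphs can be reproduced with a small, made-up dataset. The sketch below is only an illustration: the subgroup names, centers, and offsets are hypothetical values chosen so that Y rises with X inside each subgroup while subgroups further right on X sit lower on Y. They are not the data behind the figures.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical subgroup centers (x, y): groups further right on X sit lower on Y.
centers = {"A": (2.0, 10.0), "B": (5.0, 6.0), "C": (8.0, 2.0)}
offsets = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Within each subgroup, X and Y shift by the same offset, so Y rises with X.
groups = {
    name: ([cx + t for t in offsets], [cy + t for t in offsets])
    for name, (cx, cy) in centers.items()
}

# Correlation computed separately inside each subgroup.
within_r = {name: pearson(xs, ys) for name, (xs, ys) in groups.items()}

# Aggregating: pool every point and ignore the subgroup label.
pooled_x = [x for xs, _ in groups.values() for x in xs]
pooled_y = [y for _, ys in groups.values() for y in ys]
pooled_r = pearson(pooled_x, pooled_y)

print("within-group r:", {k: round(v, 2) for k, v in within_r.items()})
print("pooled r:", round(pooled_r, 2))
```

Every within-group correlation is positive, yet the pooled correlation is strongly negative, because pooling lets the differences between the subgroups dominate.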

Explaining the Admissions Example

If you want to compare the admissions rates for men and women at UC Berkeley, it seems logical that you can just look at the overall rates. I show the actual acceptance rates below:

Men    Women
45%    30%

It sure appears that Berkeley prefers men and disadvantages women. However, there is more to the story thanks to Simpson’s Paradox!

Unfortunately, aggregating the data from all departments removes departmental differences from the analysis. It turns out that some departments have much lower acceptance rates than others. They’re more selective. The following two factors create the misleading, unbalanced acceptance rates in the previous table:

  • Women tended to apply for the harder departments, lowering their overall acceptance rate.
  • Men were inclined to apply for the easier departments, boosting their rates.
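One way to see why these two factors matter is to write each overall rate as a weighted average of departmental rates. The notation here is an assumption introduced for illustration and does not appear in the original analysis: $n_{g,d}$ is the number of applicants of gender $g$ to department $d$, $r_{g,d}$ is their acceptance rate there, and $w_{g,d}$ is the share of gender $g$'s applicants who chose department $d$.

```latex
R_g = \sum_{d} w_{g,d}\, r_{g,d},
\qquad
w_{g,d} = \frac{n_{g,d}}{\sum_{d'} n_{g,d'}},
\qquad
\sum_{d} w_{g,d} = 1
```

Even if $r_{\text{women},d} \ge r_{\text{men},d}$ in every department, $R_{\text{women}}$ can fall below $R_{\text{men}}$ when women's weights are concentrated on departments with low acceptance rates.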

To determine whether the selection process favors men, we need to assess the data at the departmental level and compare acceptance rates within each department. This method holds each department’s acceptance rate constant, allowing for valid comparisons.

Let’s look at the data! There are 85 departments. The table shows the largest six.

[Table: UC Berkeley admissions data broken down by department to resolve Simpson’s Paradox.]

Comparing the rates within departments paints a different picture. Women have a slight advantage over men in most departments.

The subgroup analysis accounts for the confounding variable of the varying admission rates.
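The comparison can be sketched in a few lines of code. The counts below are an assumption: they are the widely reproduced figures for the six largest departments as distributed in R's `UCBAdmissions` dataset, with the department labels A to F taken from that dataset, and they are not copied from the table above. The helper names `rate` and `pooled_rate` are hypothetical.

```python
# (applicants, admitted) by department and sex for the six largest departments.
# Assumption: counts as distributed in R's UCBAdmissions dataset.
data = {
    "A": {"men": (825, 512), "women": (108, 89)},
    "B": {"men": (560, 353), "women": (25, 17)},
    "C": {"men": (325, 120), "women": (593, 202)},
    "D": {"men": (417, 138), "women": (375, 131)},
    "E": {"men": (191, 53), "women": (393, 94)},
    "F": {"men": (373, 22), "women": (341, 24)},
}

def rate(applicants, admitted):
    """Acceptance rate as a fraction between 0 and 1."""
    return admitted / applicants

def pooled_rate(sex):
    """Acceptance rate after aggregating all departments for one sex."""
    applicants = sum(d[sex][0] for d in data.values())
    admitted = sum(d[sex][1] for d in data.values())
    return rate(applicants, admitted)

# Aggregated view: department information is discarded.
overall = {sex: pooled_rate(sex) for sex in ("men", "women")}

# Departmental view: compare men and women within each department.
by_dept = {
    dept: {sex: rate(*counts) for sex, counts in d.items()}
    for dept, d in data.items()
}
women_ahead = [dept for dept, r in by_dept.items() if r["women"] > r["men"]]

print({sex: f"{r:.0%}" for sex, r in overall.items()})
for dept, r in by_dept.items():
    print(dept, f"men {r['men']:.0%}", f"women {r['women']:.0%}")
print("departments where women have the higher rate:", women_ahead)
```

With these counts, the pooled rates come out near 45% for men and 30% for women, while women have the higher rate in four of the six departments. In this dataset, most women applied to departments C through F, where acceptance rates are lowest, which is what pulls down their pooled rate.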

Simpson’s Paradox Example

Simpson’s Paradox occurs in numerous contexts. More recently, analysts observed it in media reports of more COVID deaths among the vaccinated than the unvaccinated. In September 2022, 12,593 COVID deaths occurred in the United States. Of those, 39% were unvaccinated, while 61% were vaccinated. What?!

It turns out that the relationship between being vaccinated and having a higher percentage of deaths is a fiction created by aggregating data and tossing out relevant information—Simpson’s Paradox.

In the United States, the COVID vaccinated population tends to be older and has more risk factors. This group naturally tends to have worse COVID outcomes. However, when you adjust for age and other risk factors, the CDC finds that COVID vaccinated and boosted individuals have an 18.6 times lower risk of dying from COVID. The vaccines are working!
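A toy calculation shows how this can happen. Every number below is HYPOTHETICAL and chosen only to illustrate the mechanism. None of it is CDC data, and the two age groups stand in for the many risk factors a real adjustment would use.

```python
# HYPOTHETICAL counts for illustration only; these are not CDC figures.
# Each row: (age_group, vaccination_status, population, deaths)
rows = [
    ("older", "vaccinated", 190_000, 190),
    ("older", "unvaccinated", 10_000, 50),
    ("younger", "vaccinated", 600_000, 12),
    ("younger", "unvaccinated", 200_000, 20),
]

def per_100k(deaths, population):
    """Death rate per 100,000 people."""
    return 100_000 * deaths / population

# Aggregated view: what share of all deaths occurred among the vaccinated?
total_deaths = sum(d for *_, d in rows)
vacc_deaths = sum(d for _, status, _, d in rows if status == "vaccinated")
vacc_share = vacc_deaths / total_deaths

# Stratified view: death rate within each age group and vaccination status.
rates = {(age, status): per_100k(d, pop) for age, status, pop, d in rows}

print(f"share of deaths among the vaccinated: {vacc_share:.0%}")
for (age, status), r in rates.items():
    print(f"{age:8s} {status:13s} {r:6.1f} deaths per 100,000")
```

In this made-up population, most deaths occur among the vaccinated because most older, higher-risk people are vaccinated. Within each age group, however, the vaccinated death rate is one fifth of the unvaccinated rate.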

To wrap up, Simpson’s Paradox occurs when you fail to account for relevant information when analyzing data. This paradox occurs when you aggregate data and lose essential details in the process. With the admissions example, you get opposite results when you look at the overall acceptance rates by gender but don’t consider the varying departmental acceptance rates. For the COVID example, you get confusing results when you assess the overall COVID death percentages by vaccination status without accounting for underlying risk factors.

It shouldn’t be surprising that discounting relevant factors will distort your results. But it is surprising how easily it can happen if you don’t watch for it!

To avoid this type of confusion, researchers must carefully consider the level of data aggregation and examine the data for potential confounding variables that could influence the results. By doing this, they can help ensure that their study results accurately reflect the underlying trends and patterns in the data.

References

Sex Bias in Graduate Admissions: Data from Berkeley

Why Do Vaccinated People Represent Most COVID-19 Deaths Right Now?

CDC COVID Data Tracker: Rates of COVID-19 Cases and Deaths by Vaccination Status

Copyright © 2023 · Jim Frost