The distinction between correlation and causation is critical in statistics. You’ve undoubtedly heard that correlation doesn’t imply causation. Why is that the case, how do the two differ, and why does the difference matter? Those are the topics of this post!
Correlation vs Causation Definitions
Let’s first compare the definitions of correlation vs causation.
What is Correlation?
Correlation means that as one variable changes, another tends to change in a specific direction. In other words, two variables move together. Positive and negative correlations exist.
- Positive correlation: X increases and Y tends to increase.
- Negative correlation: X increases and Y tends to decrease.
For example, taller people tend to weigh more, creating the positive correlation below. Conversely, as school absences increase, grades tend to decrease. That’s a negative correlation.
Learn more about Understanding Correlation Coefficients.
What is Causation?
Causation indicates that changes in one variable trigger changes in another variable. For example, increasing the dosage of a medicine causes the severity of the symptoms to decrease.
Relationship between Correlation vs Causation
At first glance, those definitions certainly seem consistent, which is why they’re so frequently misunderstood. What is the relationship between correlation and causation?
Correlation doesn’t imply causation, but causation suggests that correlation exists. The Venn diagram shows the relationship between the two.
Understanding why causation implies correlation is intuitive. If increasing medicine dosage decreases the symptoms, you’ll find a negative correlation between those variables. The causation creates the correlation.
Unfortunately, it’s less intuitive to understand how you can observe a correlation but not be sure about causation. Let’s dig into that issue.
Why Doesn’t Correlation Imply Causation?
Suppose you find a positive correlation between X and Y. How could it not be causal? After all, when X goes up, Y also goes up. It sounds like cause and effect, but it might not be. Statisticians refer to a non-causal association between variables as a spurious correlation.
Spurious correlations exist for multiple reasons. The fact that they exist at all is why you can’t be sure that correlation indicates causation. Only a subset of correlations reflects a causal relationship. Hence the importance of understanding correlation vs causation.
Let’s cover three potential explanations for spurious correlations.
Related post: Understanding Spurious Correlations
Third Variable Problem
A third variable can create a spurious relationship between two variables. It depends on the pattern of correlations between the two variables you’re considering and a third variable.
Did you know a positive correlation exists between ice cream sales and shark attacks?
Now, ice cream sales do not cause an increase in shark attacks. So, what is going on?
It turns out that outside temperature positively correlates with ice cream sales and shark attack opportunities (because more people go to the beach). So, when temperature increases, both sales and attacks increase in unison, creating a spurious correlation between them. In this scenario, we call temperature a lurking variable or a confounder.
Direction of Causation
Sometimes two variables might have a causal relationship, but it’s unclear which variable is the cause and which is the effect.
For example, suppose researchers find a negative correlation between the number of hours students spend on social media and their academic grades.
Scenario 1: It could be that spending more time on social media distracts students from their studies, leading to lower grades.
Scenario 2: Conversely, students struggling with their studies might turn to social media for escapism, meaning lower grades lead to increased social media usage.
In this situation, it’s ambiguous whether social media usage causes lower grades or if the reverse is true.
Random Chance
Random chance in sample data can produce an apparent relationship between variables. If you collect enough random samples, randomness will occasionally create the appearance of a correlation where none actually exists. In my post about Data Dredging, I show how entirely randomly generated data, which should not have any correlations, can produce them when you sift through enough data.
The graph below shows what appears to be a negative correlation. However, I had computer software randomly generate many variables and then systematically dredge through the data to find correlations.
For these reasons, you might see a correlation in your data when there is no cause and effect. Or at least you might not be sure about it.
Why Establishing Correlation vs Causation is Important
Correlation only indicates that two variables move together, but it doesn’t tell us if one causes changes in the other. Relying solely on correlation can lead to misguided conclusions and ineffective or even harmful actions. Establishing causality ensures that we’re targeting the root cause of an issue rather than just an associated symptom.
For instance, a study published in the journal Language Sciences found a correlation between individuals who use taboo words (swear words) and higher levels of verbal intelligence. However, it’s essential to approach such findings with nuance. The correlation doesn’t suggest that swearing enhances intelligence. Instead, it might indicate that individuals with a richer vocabulary, encompassing both standard and taboo words, have a more extensive linguistic repertoire to express themselves. Misunderstanding this correlation could lead to the mistaken belief that increasing one’s use of swear words would boost intelligence, which is not what the study implies.
Alternatively, suppose you unknowingly find a spurious correlation between vitamins and improved health outcomes. Believing that the vitamins cause those improvements when it’s merely correlation leads to poor decision-making. After all, if the vitamins don’t cause health gains, then consuming more vitamins won’t produce better outcomes despite the correlation.
Scientists have found that people who regularly take vitamins have many pre-existing health habits and conditions that differ from non-vitamin consumers. Those differences are the likely causes for the improved health outcomes rather than the vitamins themselves. Read my posts about Observational Studies to see the long list of differences in the example.
These examples underscore the critical importance of distinguishing between correlation and causation in decision-making.
How to Identify Causal vs. Correlational Relationships
The process of distinguishing correlation from causation is often misunderstood. While correlation can provide hints about potential relationships between variables, it doesn’t prove that one variable causes another to change. Proving causation is an entirely different matter. The variables might not be causally linked at all. Unfortunately, spurious correlations occur frequently, and there’s no statistical test for detecting them!
So, how do you distinguish between correlation and causation?
To truly establish causality, researchers need specially designed experiments—randomized controlled trials (RCTs). RCTs randomly assign participants to either a treatment or control group. This random assignment helps ensure all groups start the same except for the treatment. If the outcomes differ at the end, analysts can attribute them to the treatment with high confidence. Learn more about Randomized Controlled Trials and Random Assignment in Experiments.
Conversely, Correlational Studies are well suited to finding relationships quickly and inexpensively in preliminary research, but they are not suitable for establishing causality.
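To see why random assignment matters, consider this minimal simulation. It borrows the vitamin scenario from earlier as a hypothetical: health depends only on a person’s habits, and the true vitamin effect is set to exactly zero. All numbers are arbitrary assumptions. The sketch compares the treated-versus-untreated gap when people choose vitamins themselves (observational) against the gap when a coin flip decides (randomized).

```python
import random
from statistics import mean

random.seed(7)
n = 5000

# Health habits act as the confounder in the observational setting.
habits = [random.gauss(0, 1) for _ in range(n)]

# Health depends only on habits; the simulated vitamin effect is exactly zero.
health = [50 + 10 * h + random.gauss(0, 5) for h in habits]

# Observational: people with healthier habits are more likely to choose vitamins.
chose = [h + random.gauss(0, 1) > 0 for h in habits]

# Randomized: a coin flip decides, so habits cannot influence assignment.
assigned = [random.random() < 0.5 for _ in range(n)]

def group_gap(takes_vitamins):
    """Mean health of vitamin takers minus mean health of non-takers."""
    treated = [y for y, t in zip(health, takes_vitamins) if t]
    control = [y for y, t in zip(health, takes_vitamins) if not t]
    return mean(treated) - mean(control)

observational_gap = group_gap(chose)
randomized_gap = group_gap(assigned)

print(f"observational gap: {observational_gap:+.2f}")
print(f"randomized gap:    {randomized_gap:+.2f}")
```

In this simulation, the observational comparison shows a large apparent benefit that comes entirely from habits, while the randomized comparison should land near the true effect of zero. Random assignment balances the groups on habits, including confounders you never measured.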
Sir Austin Bradford Hill proposed a set of nine criteria to help determine if a relationship is genuinely causal rather than merely correlational. These criteria are an exercise in critical thought. They prompt you to think about causation by highlighting vital properties to consider and how to apply your subject-area knowledge. The objective is to fulfill as many criteria as possible. No single criterion is sufficient on its own, and it’s usually impossible to meet all of them. For a deeper dive, read my post about Causation in Statistics: Hill’s Criteria.
Regardless of the method used, it’s crucial to approach the question of correlation vs causation with caution, ensuring that you base your conclusions on solid evidence, sound methodology, and critical thinking.