What is a Spurious Correlation?
A spurious correlation occurs when two variables are correlated but don’t have a causal relationship. In other words, it appears like values of one variable cause changes in the other variable, but that’s not actually happening.
If you look up the definition of spurious, you’ll see explanations about something being fake or having a deceitful nature. It has the outward appearance of genuineness, but it’s an imitation. With this definition in mind, spurious correlations look like causal relationships in both their statistical measures and their graphs, but they’re not real.
For example, ice cream sales and shark attacks correlate positively at a beach. As ice cream sales increase, there are more shark attacks. However, common sense tells us that ice cream sales do not cause shark attacks. Hence, it’s a spurious correlation.
Spurious correlations can appear as non-zero correlation coefficients and as patterns in a graph. For instance, in an example from tylervigen.com, U.S. crude oil imports from Norway and drivers killed in a collision with a railway train have a very high correlation coefficient of +0.95, representing a strong, positive relationship. Graphing these data supports the correlation. Learn more about Correlation Coefficients.
Of course, there is no causal relationship between the two!
Researchers need to identify genuinely causal relationships, which involves ruling out the possibility of spurious correlations.
In this post, learn how spurious correlations occur, how to identify them, and methods for preventing them.
What Causes a Spurious Correlation?
Spurious correlations occur for several reasons. All the explanations below can create a spurious correlation that produces a non-zero correlation coefficient and a graph that displays a relationship.
Confounding occurs when a third variable causes changes in two other variables, creating a spurious correlation between the other two variables. For example, imagine that the following two positive causal relationships exist.
- A → B
- A → C
As A increases, both B and C will increase together. Hence, it appears that B → C.
For example, higher temperatures cause more people to buy ice cream and swim at the beach, increasing the opportunities for shark attacks. Hence, even though there is no relationship between ice cream sales and shark attacks, they tend to rise and drop together. The confounding variable of temperature causes this spurious correlation.
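To make the A → B and A → C pattern concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the variable names, the effect sizes, the noise levels, and the use of a swimmer count as a stand-in for shark-attack opportunities. It compares the correlation across all days with the correlation on days when temperature is held nearly constant.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000

# Hypothetical data: temperature (A) drives both ice cream sales (B) and swimmers (C).
# Ice cream sales never enter the formula for swimmers, or vice versa.
temperature = [rng.gauss(25, 5) for _ in range(n)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
swimmers = [3.0 * t + rng.gauss(0, 8) for t in temperature]

raw_r = pearson(ice_cream, swimmers)

# Hold the confounder roughly constant: keep only days within half a degree of 25.
band = [(b, c) for t, b, c in zip(temperature, ice_cream, swimmers) if abs(t - 25) < 0.5]
band_r = pearson([b for b, _ in band], [c for _, c in band])

print(f"all days: r = {raw_r:.2f}")
print(f"days near 25 degrees: r = {band_r:.2f} (n = {len(band)})")
```

In this simulated setting, the correlation across all days should be strong, while the correlation within the narrow temperature band should sit close to zero. That gap is the signature of confounding.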
Related post: Confounding Variables Can Bias Your Results
In other cases, a chain of causal relationships running through mediating variables produces a spurious correlation. For example, imagine that A causes B and B causes C, as shown below.
A → B → C.
If you have only measurements of A and C, you’ll find a spurious correlation. It appears to be causal. In reality, A causes B, and then B causes C. There is no direct connection between A and C.
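The chain A → B → C can be sketched with simulated data. The coefficients and noise levels below are assumptions chosen only for illustration. The partial correlation formula is the standard first-order one, which estimates the correlation between A and C after removing the linear effect of B.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / ((1 - rxz ** 2) * (1 - ryz ** 2)) ** 0.5

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical chain: A affects B, and only B affects C.
a = [rng.gauss(0, 1) for _ in range(n)]
b = [value + rng.gauss(0, 1) for value in a]
c = [value + rng.gauss(0, 1) for value in b]

r_ac = pearson(a, c)
r_ac_given_b = partial_corr(a, c, b)

print(f"A and C, ignoring B: r = {r_ac:.2f}")
print(f"A and C, holding B fixed: r = {r_ac_given_b:.2f}")
```

If the mediator B had not been measured, only the first number would be available, and A would look directly linked to C.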
Random Sampling Error
Samples don’t always accurately reflect the population due to chance. Random sampling error can produce the appearance of effects in the sample that don’t exist in the population. A correlation is one possible effect.
For studies using samples, the correlations you find might not exist in the population. Hypothesis testing can help sort that out.
When a correlation in a sample doesn’t exist in the population, it’s a phantom that random error produced and, hence, cannot be a causal relationship. Consequently, it’s a spurious correlation. Samples aren’t perfect.
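As a rough illustration of random sampling error, the sketch below repeatedly draws small samples of two variables that are independent in the population and records each sample correlation. The sample size, the number of trials, and the normal distributions are all assumptions. The cutoff comes from the usual t-test for a correlation coefficient, with the 5% two-tailed critical value for 8 degrees of freedom (2.306) taken from a standard t table.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(2)  # fixed seed so the run is repeatable
n, trials = 10, 2000

# Convert the critical t value (df = n - 2 = 8) into a critical |r|.
t_crit = 2.306
r_crit = t_crit / (t_crit ** 2 + (n - 2)) ** 0.5

sample_rs = []
for _ in range(trials):
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]  # independent of x: population r is 0
    sample_rs.append(pearson(x, y))

false_alarms = sum(abs(r) > r_crit for r in sample_rs) / trials
largest = max(abs(r) for r in sample_rs)

print(f"critical |r| for n = {n}: {r_crit:.3f}")
print(f"share of samples beyond it: {false_alarms:.1%}")
print(f"largest |r| seen: {largest:.2f}")
```

Even though no relationship exists in the population, some samples produce sizable correlations, and roughly one in twenty should pass the 5% significance threshold by chance alone. Hypothesis testing limits how often this happens, but it does not eliminate it.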
In some cases, it’s just pure chance that two disparate variables follow a similar pattern that looks like a relationship. This condition is slightly different from random sampling error. In this case, values of the two variables correlate in the population. It’s not a mirage caused by a sample. However, there is no causal relationship between the two variables. The patterns of changes match by chance.
By adjusting graph scales, the patterns of changes in the two variables can be exaggerated or diminished so that the two patterns appear to match. But it’s just the result of careful manipulation of the scale. This process creates a spurious correlation.
Detecting and Preventing Spurious Correlations
The best way to detect a spurious correlation is through subject-area knowledge. Establishing causal relationships can be tricky. There is no statistical test that can prove it. Instead, analysts frequently need to rule out other causes and spuriousness. Learn more about Correlation vs. Causation: Understanding the Differences.
Use your subject-area knowledge to assess correlations and ask lots of questions:
- Do they make sense as causal relationships?
- Do they fit established theory?
- Can you find a mechanism for causation?
- Is there a direct link, or are mediator variables involved?
Many criteria can help you evaluate correlations. For some examples, read my post about Hill’s Criteria for Causation.
Various statistical and experimental methods can help reduce spurious correlations. In particular, these methods can prevent confounding variables from creating spurious correlations.
In a randomized study, randomization tends to equalize confounders between experimental groups and, thereby, reduce the risk of a spurious correlation. Additionally, you can use control variables to keep the experimental conditions as consistent as possible. Learn more about Random Assignment in Experiments.
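Here is a small sketch of why randomization helps, under assumed conditions: age is a hypothetical confounder, and in the self-selected scenario older subjects are more likely to opt into the treatment. The code compares how unbalanced the two groups are on age under self-selection versus random assignment.

```python
import random

rng = random.Random(3)  # fixed seed so the run is repeatable
ages = [rng.uniform(20, 70) for _ in range(2000)]  # hypothetical confounder

def mean(values):
    return sum(values) / len(values)

def age_gap(in_treatment):
    """Mean age of the treatment group minus mean age of the control group."""
    treated = [a for a, flag in zip(ages, in_treatment) if flag]
    control = [a for a, flag in zip(ages, in_treatment) if not flag]
    return mean(treated) - mean(control)

# Self-selection: the older the subject, the more likely they opt in.
self_selected = [rng.random() < (a - 20) / 50 for a in ages]

# Random assignment: a fair coin flip that ignores age entirely.
randomized = [rng.random() < 0.5 for _ in ages]

gap_self = age_gap(self_selected)
gap_random = age_gap(randomized)

print(f"age gap with self-selection: {gap_self:.1f} years")
print(f"age gap with random assignment: {gap_random:.1f} years")
```

With self-selection, the groups differ substantially in age, so any outcome that depends on age will appear to depend on the treatment. With random assignment, the gap should shrink toward zero, and the same holds for confounders you never measured.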
Matching is another technique that can lessen the risk of spurious correlations due to confounders. This process involves selecting participants for the treatment and control groups who share similar characteristics other than the variable of interest. Learn more about matching in my article about observational studies.
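The idea behind matching can be sketched with a simple nearest-neighbor approach. This is only one of several matching methods, and the data are simulated under assumptions I chose for illustration: the treatment has no real effect, the outcome depends only on age, and older subjects are more likely to be treated.

```python
import random

rng = random.Random(4)  # fixed seed so the run is repeatable

def mean(values):
    return sum(values) / len(values)

# Hypothetical observational data stored as (age, outcome) pairs.
# The treatment has NO effect here; the outcome depends only on age.
treated, control = [], []
for _ in range(1000):
    age = rng.uniform(20, 70)
    outcome = 0.5 * age + rng.gauss(0, 2)
    group = treated if rng.random() < (age - 20) / 50 else control
    group.append((age, outcome))

# Naive comparison: difference in mean outcome between the groups.
naive_diff = mean([o for _, o in treated]) - mean([o for _, o in control])

# Nearest-neighbor matching on age, with replacement: pair each treated
# subject with the control subject closest in age, then compare outcomes.
matched_diffs = []
for age, outcome in treated:
    _, match_outcome = min(control, key=lambda c: abs(c[0] - age))
    matched_diffs.append(outcome - match_outcome)
matched_diff = mean(matched_diffs)

print(f"naive difference: {naive_diff:.2f}")
print(f"matched difference: {matched_diff:.2f}")
```

The naive comparison suggests a sizable treatment effect that does not exist, while the matched comparison should land much closer to zero. Matching only balances the characteristics you match on, so unmeasured confounders can still cause trouble.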
Multiple regression analysis can prevent a spurious correlation by using models that account for confounding variables. This approach statistically controls for confounding. Learn more about how regression controls confounding variables.
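To show what statistically controlling for a confounder looks like, the sketch below fits two least-squares models to simulated data: one with ice cream sales alone and one that adds temperature. The data, the names, and the effect sizes are assumptions, and the two-predictor fit solves the normal equations by hand so the example needs no libraries. In practice, you would use statistical software.

```python
import random

rng = random.Random(5)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical data: temperature drives both ice cream sales and swimmers.
temperature = [rng.gauss(25, 5) for _ in range(n)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
swimmers = [3.0 * t + rng.gauss(0, 8) for t in temperature]

def centered(values):
    """Subtract the mean so the intercept drops out of the equations."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z, y = centered(ice_cream), centered(temperature), centered(swimmers)

# Simple regression: swimmers on ice cream sales alone.
simple_slope = dot(x, y) / dot(x, x)

# Multiple regression: swimmers on ice cream sales and temperature.
# Solve the 2x2 normal equations with Cramer's rule.
sxx, szz, sxz = dot(x, x), dot(z, z), dot(x, z)
sxy, szy = dot(x, y), dot(z, y)
det = sxx * szz - sxz ** 2
b_ice = (sxy * szz - sxz * szy) / det
b_temp = (sxx * szy - sxz * sxy) / det

print(f"ice cream coefficient, alone: {simple_slope:.2f}")
print(f"ice cream coefficient, controlling for temperature: {b_ice:.2f}")
print(f"temperature coefficient: {b_temp:.2f}")
```

On its own, ice cream sales appear to predict swimmers. Once temperature is in the model, the ice cream coefficient should fall to about zero, and the temperature coefficient should land near the value of 3 used to generate the data.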
Conversely, correlational studies find relationships quickly and easily in preliminary research, but they are not suitable for establishing causality.