## What is a Spurious Correlation?

A spurious correlation occurs when two variables are correlated but don’t have a causal relationship. In other words, it looks as though values of one variable cause changes in the other variable, but that’s not actually happening.

If you look up the definition of spurious, you’ll see explanations about something being fake or having a deceitful nature. It has the outward appearance of genuineness, but it’s an imitation. With this definition in mind, spurious correlations look like causal relationships both in their statistical measures and in graphs, but they’re not real.

For example, ice cream sales and shark attacks correlate positively at a beach. As ice cream sales increase, there are more shark attacks. However, common sense tells us that ice cream sales do not cause shark attacks. Hence, it’s a spurious correlation.

Spurious correlations can appear in the form of non-zero correlation coefficients and as patterns in a graph. For instance, in an example from tylervigen.com, U.S. crude oil imports from Norway and drivers killed in a collision with a railway train have a very high correlation coefficient of +0.95, representing a strong, positive relationship. Graphing these data supports the correlation. Learn more about Correlation Coefficients.

Of course, there is no causal relationship between the two!

Researchers need to identify genuinely causal relationships, which involves ruling out the possibility of spurious correlations.

In this post, learn how spurious correlations occur, how to identify them, and methods for preventing them.

## What Causes a Spurious Correlation?

Spurious correlations occur for several reasons. All the explanations below can create a spurious correlation that produces a non-zero correlation coefficient and a graph that displays a relationship.

### Confounding Variables

Confounding occurs when a third variable causes changes in two other variables, creating a spurious correlation between the other two variables. For example, imagine that the following two positive causal relationships exist.

- A → B
- A → C

As A increases, both B and C will increase together. Hence, it appears that B → C.

For example, higher temperatures cause more people to buy ice cream and swim at the beach, increasing the opportunities for shark attacks. Hence, even though there is no causal relationship between ice cream sales and shark attacks, they tend to rise and drop together. The confounding variable of temperature causes this spurious correlation.

**Related post**: Confounding Variables Can Bias Your Results

### Mediating Variables

In other cases, a chain of causal relationships running through a mediating variable produces a spurious correlation. For example, imagine that A and B have a causal relationship, and so do B and C, as shown below.

A → B → C.

If you have only measurements of A and C, you’ll find a spurious correlation. It appears to be causal. In reality, A causes B, and then B causes C. There is no direct connection between A and C.

### Random Sampling Error

Due to chance, samples don’t always accurately reflect the population. Random sampling error can produce the appearance of effects in the sample that don’t exist in the population. A correlation is one possible effect.

For studies using samples, the correlations you find might not exist in the population. Hypothesis testing can help sort that out.

When a correlation in a sample doesn’t exist in the population, it’s a phantom that random error produced and, hence, cannot be a causal relationship. Consequently, it’s a spurious correlation. Samples aren’t perfect.

Learn more about Sampling Error and Hypothesis Testing.

### Chance

In some cases, it’s just pure chance that two disparate variables follow a similar pattern that looks like a relationship. This condition is slightly different from random sampling error. In this case, values of the two variables correlate in the population. It’s not a mirage caused by a sample. However, there is no causal relationship between the two variables. The patterns of changes match by chance.

### Graphical Manipulation

By adjusting graph scales, the patterns of changes in the two variables can be exaggerated or diminished so that the two patterns appear to match. But it’s just the result of careful manipulation of the scale. This process creates a spurious correlation.

## Detecting and Preventing Spurious Correlations

The best way to detect a spurious correlation is through subject-area knowledge. Establishing causal relationships can be tricky. There is no statistical test that can prove it. Instead, analysts frequently need to rule out other causes and spuriousness.

Use your subject-area knowledge to assess correlations and ask lots of questions:

- Do they make sense as causal relationships?
- Do they fit established theory?
- Can you find a mechanism for causation?
- Is there a direct link, or are mediator variables involved?

Many criteria can help you evaluate correlations. For some examples, read my post about Hill’s Criteria for Causation.

Various statistical and experimental methods can help reduce spurious correlations. In particular, these methods can prevent confounding variables from creating spurious correlations.

In a randomized study, randomization tends to equalize confounders between experimental groups and, thereby, reduce the risk of a spurious correlation. Additionally, you can use control variables to keep the experimental conditions as consistent as possible. Learn more about Random Assignment in Experiments.
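The effect of randomization can be sketched with a simulation. Everything here is hypothetical: a `health` variable stands in for a confounder, the treatment has no true effect on the outcome, and the effect sizes and seed are arbitrary. When healthier subjects choose the treatment themselves, the groups differ. When a coin flip assigns treatment, the confounder balances out.

```python
import random
from statistics import mean

rng = random.Random(7)  # arbitrary seed
n = 10_000

# Hypothetical confounder that improves the outcome.
health = [rng.gauss(0, 1) for _ in range(n)]
# The treatment does not appear here: its true effect is zero.
outcome = [h + rng.gauss(0, 1) for h in health]

# Observational setting: healthier subjects tend to opt in to treatment.
self_selected = [h + rng.gauss(0, 1) > 0 for h in health]
# Randomized setting: a fair coin flip decides, ignoring health.
randomized = [rng.random() < 0.5 for _ in range(n)]

def group_difference(treated, y):
    """Mean outcome of the treated group minus that of the untreated group."""
    treated_y = [v for flag, v in zip(treated, y) if flag]
    untreated_y = [v for flag, v in zip(treated, y) if not flag]
    return mean(treated_y) - mean(untreated_y)

diff_self_selected = group_difference(self_selected, outcome)
diff_randomized = group_difference(randomized, outcome)

print(f"self-selected groups: difference = {diff_self_selected:.2f}")
print(f"randomized groups:    difference = {diff_randomized:.2f}")
```

In this setup, the self-selected comparison should show a sizable apparent treatment effect that is entirely due to health, while the randomized comparison should sit near zero.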

Matching is another technique that can lessen the risk of spurious correlations due to confounders. This process involves selecting study participants with similar characteristics outside the variable of interest for the treatment and control groups. Learn more about matching in my article about observational studies.

Multiple regression analysis can prevent a spurious correlation by using models that account for confounding variables. This approach statistically controls for confounding. Learn more about how regression controls confounding variables.

Ankit says

Amazing topnotch content

Noor says

Hi!

Very helpful. Can multicollinearity in IVs “cause” spurious correlations as well? I have a set of data with 4 IVs, 2 mediators, and 2 DVs, and even though the theory and literature suggest that the IVs predict the DV and I have significant correlations, there is no predicted effect seen in a multiple regression analysis 😅

Jim Frost says

Hi Noor,

Typically, when we talk about spurious correlations, we’re referring specifically to apparent correlations that are not causal. In a regression context, this would mean that you found a significant independent variable, but the IV does not have a causal relationship with the DV. In other words, the IV and DV are correlated, but there is no causation (i.e., the correlation is spurious).

So that doesn’t fit what you’re describing. You’re talking about expecting to see a relationship (a significant IV) but you don’t see it, at least not in the regression model. It sounds like you’re seeing it in the pairwise correlations though. There are several possible reasons why you might see significant pairwise correlations but no significance in the regression model.

Multicollinearity, as you mention, is one possibility. By inflating the variance of your estimates, it can sap the significance from your p-values in your regression model. To learn more, including how to determine if this is a problem, read my post about multicollinearity.

Also, keep in mind that pairwise correlations and a regression model are different models for the data. In pairwise correlations, you’re not accounting for the other variables. Hence, there could be some omitted variable bias in the results when you’re looking at the correlations. The regression model factors in the other variables and might be finding that when you account for the other variables, there is no relationship.

Another possibility is that you might have too few observations for your regression model. For the pairwise correlation, you’re using all your data points for the one relationship whereas for the regression model, you’re using DF to estimate multiple relationships. You might just need a larger sample size for your model.

So, those are some possibilities to consider! Click the links to learn more!

kanchan Singh says

Dear Prof. Jim,

Greetings and regards,

You have been very kind to send me this important learning exercise.

I am impressed by the examples you have cited. They are very useful to comprehend the entire idea. Please keep it up in future as well.

Kanchan Singh

Jim Frost says

Hi Kanchan,

Thanks so much. I’m thrilled my website is helpful!

Thanks for writing!