A correlation between variables indicates that as one variable changes in value, the other variable tends to change in a specific direction. Understanding that relationship is useful because we can use the value of one variable to predict the value of the other variable. For example, height and weight are correlated—as height increases, weight also tends to increase. Consequently, if we observe an individual who is unusually tall, we can predict that his weight is also above the average.

In statistics, a correlation coefficient is a quantitative assessment that measures both the direction and the strength of this tendency to vary together. There are different types of correlation that you can use for different kinds of data. In this post, I cover the most common type of correlation—Pearson’s correlation coefficient.

Before we get into the numbers, let’s graph some data first so we can understand the concept behind what we are measuring.

## Graph Your Data to Find Correlations

Scatterplots are a great way to check quickly for relationships between pairs of continuous data. The scatterplot below displays the height and weight of pre-teenage girls. Each dot on the graph represents an individual girl and her combination of height and weight. These data are actual data that I collected during an experiment.

At a glance, you can see that there is a relationship between height and weight. As height increases, weight also tends to increase. However, it’s not a perfect relationship. If you look at a specific height, say 1.5 meters, you can see that there is a range of weights associated with it. You can also find short people who weigh more than taller people. However, the general tendency that height and weight increase together is unquestionably present.

Pearson’s correlation takes all of the data points on this graph and represents them as a single number. In this case, the statistical output below indicates that the Pearson’s correlation coefficient is 0.694.

What do the correlation and p-value mean? We’ll interpret the output soon. First, let’s look at a range of possible correlation coefficients so we can understand how our height and weight example fits in.

## How to Interpret Pearson’s Correlation Coefficients

Pearson’s correlation coefficient is represented by the Greek letter rho (*ρ*) for the population parameter and r for a sample statistic. This correlation coefficient is a single number that measures both the strength and direction of the linear relationship between two continuous variables. Values can range from -1 to +1.

**Strength:**The greater the absolute value of the correlation coefficient, the stronger the relationship.- The extreme values of -1 and 1 indicate a perfectly linear relationship where a change in one variable is accompanied by a perfectly consistent change in the other. For these relationships, all of the data points fall on a line. In practice, you won’t see either type of perfect relationship.
- A coefficient of zero represents no linear relationship. As one variable increases, there is no tendency in the other variable to either increase or decrease.
- When the value is in-between 0 and +1/-1, there is a relationship, but the points don’t all fall on a line. As r approaches -1 or 1, the strength of the relationship increases and the data points tend to fall closer to a line.

**Direction:**The sign of the correlation coefficient represents the direction of the relationship.- Positive coefficients indicate that when the value of one variable increases, the value of the other variable also tends to increase. Positive relationships produce an upward slope on a scatterplot.
- Negative coefficients represent cases when the value of one variable increases, the value of the other variable tends to decrease. Negative relationships produce a downward slope.

### Examples of Positive and Negative Correlation Coefficients

An example of a positive correlation is the relationship between the speed of a wind turbine and the amount of energy it produces. As the turbine speed increases, electricity production also increases.

An example of a negative correlation is the relationship between outdoor temperature and heating costs. As the temperature increases, heating costs decrease.

## Graphs for Different Correlation Coefficients

Graphs always help bring concepts to life. The scatterplots below represent a spectrum of different correlation coefficients. I’ve held the horizontal and vertical scales of the scatterplots constant to allow for valid comparisons between them.

**Correlation Coefficient = +1**: A perfect positive relationship.

**Correlation Coefficient = 0.8**: A fairly strong positive relationship.

**Correlation Coefficient = 0.6**: A moderate positive relationship.

**Correlation Coefficient = 0**: No relationship. As one value increases, there is no tendency for the other value to change in a specific direction.

**Correlation Coefficient = -1**: A perfect negative relationship.

**Correlation Coefficient = -0.8**: A fairly strong negative relationship.

**Correlation Coefficient = -0.6**: A moderate negative relationship.

## Discussion about the Scatterplots

For the scatterplots above, I created one positive relationship between the variables and one negative relationship between the variables. Then, I varied only the amount of dispersion between the data points and the line that defines the relationship. That process illustrates how correlation measures the strength of the relationship. The stronger the relationship, the closer the data points fall to the line. I didn’t include plots for weaker correlations that are closer to zero than 0.6 and -0.6 because they start to look like blobs of dots and it’s hard to see the relationship.

A common misinterpretation is assuming that negative correlation coefficients indicate that there is no relationship. After all, a negative correlation sounds suspiciously like no relationship. However, the scatterplots for the negative correlations display real relationships. For negative correlation coefficients, high values of one variable are associated with low values of another variable. For example, there is a negative correlation between school absences and grades. As the number of absences increases, the grades decrease.

Earlier I mentioned how crucial it is to graph your data to understand them better. However, a quantitative measurement of the relationship does have an advantage. Graphs are a great way to visualize the data, but the scaling can exaggerate or weaken the appearance of a relationship. Additionally, the automatic scaling in most statistical software tends to make all data look similar.

Fortunately, Pearson’s correlation coefficients are unaffected by scaling issues. Consequently, a statistical assessment is better for determining the precise strength of the relationship.

Graphs and the relevant statistical measures often work better in tandem.

## Pearson’s Correlation Coefficients Measure Linear Relationship

Pearson’s correlation coefficients measure only *linear* relationships. Consequently, if your data contain a curvilinear relationship, the correlation coefficient will not detect it. For example, the correlation for the data in the scatterplot below is zero. However, there is a relationship between the two variables—it’s just not linear.

This example illustrates another reason to graph your data! Just because the coefficient is near zero, it doesn’t necessarily indicate that there is no relationship.

## Hypothesis Test for Correlation Coefficients

Correlation coefficients have a hypothesis test. As with any hypothesis test, this test takes sample data and evaluates two mutually exclusive statements about the population from which the sample was drawn. For Pearson correlations, the two hypotheses are the following:

- Null hypothesis: There is no linear relationship between the two variables.
*ρ*= 0. - Alternative hypothesis: There is a linear relationship between the two variables.
*ρ*≠ 0.

Correlation coefficients that equal zero indicate no linear relationship exists. If your p-value is less than your significance level, the sample contains sufficient evidence to reject the null hypothesis and conclude that the correlation coefficient does not equal zero. In other words, the sample data support the notion that the relationship exists in the population.

**Related post**: Overview of Hypothesis Tests

## Interpreting our Height and Weight Correlation Example

Now that we have seen a range of positive and negative relationships, let’s see how our correlation coefficient of 0.694 fits in. We know that it’s a positive relationship. As height increases, weight tends to increase. Regarding the strength of the relationship, the graph shows that it’s not a very strong relationship where the data points tightly hug a line. However, it’s not an entirely amorphous blob with a very low correlation. It’s somewhere in between. That description matches our moderate correlation coefficient of 0.694.

For the hypothesis test, our p-value equals 0.000. This p-value is less than any reasonable significance level. Consequently, we can reject the null hypothesis and conclude that the relationship is statistically significant. The sample data support the notion that the relationship between height and weight exists in the population of preteen girls.

## Correlation Does Not Imply Causation

I’m sure you’ve heard this expression before, and it is a crucial warning. Correlation between two variables indicates that changes in one variable are associated with changes in the other variable. However, correlation does not mean that the changes in one variable actually *cause* the changes in the other variable.

Sometimes it is clear that there is a causal relationship. For the height and weight data, it makes sense that adding more vertical structure to a body *causes* the total mass to increase. Or, increasing the wattage of lightbulbs *causes* the light output to increase.

However, in other cases, a causal relationship is not possible. For example, ice cream sales and shark attacks are positively correlated. Clearly, selling more ice cream does not cause shark attacks (or vice versa). Instead, a third variable, outdoor temperatures, causes changes in the other two variables. Higher temperatures increase both sales of ice cream and the number of swimmers in the ocean, which creates the apparent relationship between ice cream sales and shark attacks.

In statistics, you typically need to perform a randomized, controlled experiment to determine that a relationship is causal rather than merely correlation.

**Related posts**: Causation versus Correlation and Using Random Assignment in Experiments and Observational Studies

## How Strong of a Correlation is Considered Good?

What is a good correlation? How high should it be? These are commonly asked questions. I have seen several schemes that attempt to classify correlations as strong, medium, and weak.

However, there is only one correct answer. The correlation coefficient should accurately reflect the strength of the relationship. Take a look at the correlation between the height and weight data, 0.694. It’s not a very strong relationship, but it accurately represents our data. An accurate representation is the best-case scenario for using a statistic to describe an entire dataset.

The strength of any relationship naturally depends on the specific pair of variables. Some research questions involve weaker relationships than other subject areas. Case in point, humans are hard to predict. Studies that assess relationships involving human behavior tend to have correlation coefficients weaker than +/- 0.6.

However, if you analyze two variables in a physical process, and have very precise measurements, you might expect correlations near +1 or -1. There is no one-size fits all best answer for how strong a relationship should be. The correct correlation value depends on your study area.

## Taking Correlation to the Next Level with Regression Analysis

Wouldn’t it be nice if instead of just describing the strength of the relationship between height and weight, we could define the relationship itself using an equation? Regression analysis does just that. That analysis finds the line and corresponding equation that provides the best fit to our dataset. We can use that equation to understand how much weight increases with each additional unit of height and to make predictions for specific heights. Read my post where I talk about the regression model for the height and weight data.

Regression analysis allows us to expand on correlation in other ways. If we have more variables that explain changes in weight, we can include them in the model and potentially improve our predictions. And, if the relationship is curved, we can still fit a regression model to the data.

Additionally, a form of the Pearson correlation coefficient shows up in regression analysis. R-squared is a primary measure of how well a regression model fits the data. This statistic represents the percentage of variation in one variable that other variables explain. For a pair of variables, R-squared is simply the square of the Pearson’s correlation coefficient. For example, squaring the height-weight correlation coefficient of 0.694 produces an R-squared of 0.482, or 48.2%. In other words, height explains about half the variability of weight in preteen girls.

If you’re learning about statistics and like the approach I use in my blog, check out my Introduction to Statistics eBook!

Georgi Georgiev says

Hi Jim,

This is an informative article and I agree with most of what is said, but this particular sentence might be misleading to readers: “R-squared is a primary measure of how well a regression model fits the data.”. R-squared is in fact based on the assumption that the regression model fits the data to a reasonable extent therefore it cannot also simultaneously be a measure of the goodness of said fit.

The rest of the claims regarding R-squared I completely agree with.

Cheers,

Georgi

Jim Frost says

Hi Georgi,

Yes, I make that exact point repeatedly throughout multiple blog posts, particularly my post about R-squared.

Additionally, R-squared is a goodness-of-fit measure, so it is not misleading to say that it measures how well the model fits the data. Yes, it is not a 100% informative measure by itself. You’d also need to assess residual plots in conjunction with the R-squared. Again, that’s a point that I make repeatedly.

I don’t mind disagreements, but I do ask that before disagreeing, you read what I write about a topic to understand what I’m saying. In this case, you would’ve found in my various topics about R-squared and residual plots that we’re saying the same thing.

Amy Do says

Thank you very much!

Amy Do says

Hi Jim, I have a question for you – and thank you in advance for responding to it 🙂

Set A has the correlation coefficient of .25 and Set B has the correlation of .9, Which set has the steeper trend line? A or B?

Jim Frost says

Hi Amy,

Set B has a stronger relationship. However, that’s not quite equivalent to saying it has a steeper trend line. It means the data points fall closer to the line.

If you look at the examples in this post, you’ll notice that all the positive correlations have roughly equal slopes despite having different correlations. Instead, you see the points moving closer to the line as the strength of the relationship increases. The only exception is that a correlation of zero has a slope of zero.

The point being that you can’t tell from the correlation alone which trend line is steeper. However, the relationship in Set B is much stronger than the relationship in Set A.

jn says

Thank you 😊. Now I understand.

JN says

hi, I’m a little confused.

What does it indicating, If there is positive correlation, but negative coefficient from multiple regression outcome? in this situation, how to interpret? the relationship is negative or positive?

Jim Frost says

Hi JN,

This is likely a case of omitted variable bias. A pairwise correlation involves just two variables. Multiple regression analysis involves three variables at a minimum (2 IVs and a DV). Correlation doesn’t control for other variables while regression analysis controls for the other variables in the model. That can explain the different relationships. Omitted variable bias occurs under specific conditions. Click the link to read about when it occurs. I include an example where I first look at a pair of variables and then three variables and shows how that changes the results, similar to your example.

Eamonn Ali says

Hi Jim,

I have 4 objective in my research and when I did the correlation between first one and others the result is:

ob1 with ob2 is (0.87) – ob1 with ob3 is (0.84) – ob1 with ob4 is ( 0.83). My question is what is that meaning and can I do Correlation Coefficient with all of them in one time.

Jolette Garcia says

Hi, Mr Jim

Which best describes the correlation coefficient for r=.08?

Jim Frost says

Hi Jolette,

I’d say that is an extremely weak correlation. I’d want to see its p-value. If it’s not significant, then you can’t conclude that the correlation is different from zero (no correlation). Is there something else particular you want to know about it?

Lakshmi Belaguli says

Correlation result between Vul and FCV

t = 3.4535, df = 306, p-value = 0.0006314

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

0.08373962 0.29897226

sample estimates:

cor

0.1936854

What does this mean?

Jim Frost says

Hi Lakshmi,

It means that your correlation coefficient is ~0.19. That’s the sample estimate. However, because you’re working with a sample, there’s always sample error and so the population correlation is probably not exactly equal to the sample value. The confidence interval indications that you can be 95% confident that the true population correlation falls between ~0.08 and 0.30. The p-value is less than any common significance level. Consequently, you can reject the null hypothesis that the population correlation equals zero and conclude that it does not equal zero. In other words, the correlation you see in the sample is likely to exist in the population.

A correlation of 0.19 is a fairly weak relationship. However, even though it is weak, you have enough evidence to conclude that it exists in the population.

I hope that helps!

Yared Tesfaye says

Hi Jim

Thank you for your support.

I have a question that is.

Testing criteria for Validity by Pearson correlation,

r table determine by formula DF=N-2

– If it is Valid the correlation value less that Pearson correlation value. (Pearson correlation > r table )

– if it is Invalid the correlation value greater that Pearson correlation value. (Pearson correlation < r table )

I got the above information on SPSS tutorial Video about Pearson correlation.

but I didn't get on other literature please

can you recommend me some literature that refers about this?

or can you clarify more about how to check Validity by Pearson correlation?

zia khan says

HI JIM i am zia from pakistan i wanna finding correlation of two factoer i have find 144.6 of 66.93 thats is postive relation?

Jim Frost says

Hi Zia, I’m sorry but I’m not clear about what you’re asking. Correlation coefficients range between -1 and +1, so those two values are not correlation coefficients. Are they regression coefficients?

Norshidah Nordin says

Dear Sir,

Warmest greetings.

My name is Norshidah Nordin and I am very grateful if you could provide me some answers to the following questions.

1) Can I used two different set of samples (for e.g. students academic performance (CGPA) as dependent variable and teacher’s self efficacy as dependent variable) to run on a Pearson correlation analysis. If yes, could you elaborate on this aspect.

2) what is the minimum sample size to use in multiple regression analysis.

Thank You

Jim Frost says

Hi Norshidah,

For correlations, you need to have multiple measurements on the same item or person. In your scenario, it sounds like you’re taking different measurements on different people. Pearson’s correlation would not be appropriate.

The minimum sample size for multiple regression depends on the number of terms you need to include in your model. Read my post about overfitting regression models, which occurs when you have too few observations for the number of model terms.

I hope this helps!

MONIQUE STEINER says

Greetings sir, question…. Can you do an accurate regression with a Pearson’s correlation coefficient of 0.10? Why or Why not?

Jim Frost says

Hi Monique,

It is possible. First, you should determine whether that correlation is statistically significant. You’re seeing a correlation in your sample, but you want to be confident that is also exists in the large population you’re studying. There’s a possibility that the correlation only exists in your sample by random chance and does not exist in the population–particularly with such a low coefficient. So, check the p-value for the coefficient. If it’s significant, you have reason to proceed with the regression analysis. Additionally, graph your data. Pearson’s only is for linear relationships. Perhaps your coefficient is low because the relationship is curved?

You can fit the regression model to your data. A correlation of 0.10 equates to an R-squared of only 0.01, which

isvery low. Perhaps adding more independent variables will increase the R-squared. Even if the r-squared stays very low, if your independent variable is significant, you’re still learning something from your regression model. To understand what you can learn in this situation, read my post about regression models with significant variables and a low R-squared values.So, it is possible to do a valid regression and learn useful information even when the correlation is so low. But, you need to check for significance along the way.

Titania says

Hello Jim, first and foremost thank you for giving us a comprehensive information regarding this! This totally help me. But I have a question; my pearson results showing that there’s a moderate positive relationship between my variables which is Parasocial Interaction and the fans’ purchase intention.

But the thing is, if I look at the answer majority of my participants are mostly answering Neutral regarding purchase intention.

What does this means? could you help me to figure out this T.T thanks you in advance! I’m a student currently doing thesis from Malaysia.

Jim Frost says

Hi Titania,

Have you graphed your data using a scatterplot? I’d highly recommend that because I think it will probably clarify what your data are telling you. Also, are both of your variables continuous variables? I’m wonder if purchase intention is ordinal if one of the values is Neutral. If that’s the case, you’d need to use Spearman’s Rank Correlation rather than Pearson’s.

Aya says

Hello Jim ! I have a question . I calculated a correlation coefficient between the scale variables and got 0.36, which is relatively weak since it gives a 0.12 if quared. What does the interpretation of correlation concern ? The sample taken or the type of data measurement ? or anything else?

I hope you got my question. Thank you for your help!!

Jim Frost says

Hi Aya,

I’m not clear what you’re asking exactly. Please clarify. The correlation measures the strength of the relationship between the two continuous variables, as I explain in this article.

Yes, that it is a weak relationship. If you’re going to include this is a regression analysis, you might want to read my article about interpreting low R-squared values.

I’m not sure what you mean by scale variables. However, if these are Likert scale items, you’ll need to use Spearman’s correlation instead of Pearson’s correlation.

Egbert N Azariah says

Hi Jim

I am very new to statistics and data analysis. I am doing a quantitative study and my sample size is 200 participants. So far I have only obtained 50 complete responses. . Using G*Power a simple linear regression with a medium effect size, an alpha of .05, and a power level of .80 can I do a data analysis with this small sample.

Jim Frost says

Hi Egbert,

Please repost your question in the comments section of the appropriate article. It has nothing to do with correlation coefficients. Use the search bar part way down in the right column and search for power. I have a post about power analysis that is a good fit.

LAI JIA PEI says

Thank you Mr.Jim, it was a great answer for me!😉

Take good care~

LAI JIA PEI says

Hi Mr.Jim,

I am a student from Malaysia.

I have a question to ask Mr.Jim about how to determine the validity (the accurate figure) of the data for analysis purpose base on the table of Pearson’s Correlation Coefficient? Do it has any method?

For example, since the coefficient between one independent variable with the other variable is below 0.7, thus the data valid for analysis purpose.

However, I have read the table there is a figure which more than 0.7. I am not sure about that.

Hope to hearing from Mr.Jim soon. Thank you.

Jim Frost says

Hi, I hope you’re doing well!

There is no single correlation coefficient value that determines whether it is valid to study. It partly depends on your subject area. I low noise physical process might often have a correlation in the very high 0.9s and 0.8 would be considered unacceptable. However, in a study of human behavior, it’s normal and acceptable to have much lower correlations. For example a correlation of 0.5 might be considered very good. Of course, I’m writing the positive values, but the same applies to negative correlations too.

It also depends on what the purpose of your study. If you’re doing something practical, such as describing the relationship between material composition and strength, there might be very specific requirements about how strong that relationship must be for it to be useful. It’s based on real-world practicalities. On the other hand, if you’re just studying something for the sake of science and expanding knowledge, lower correlations might still be interesting.

So, there’s not single answer. It depends on the subject-area you are studying and the purpose of your study.

I hope that helps!

Irene Marticio says

HI Jim, what could be the implication of my result if I obtained a weak relationship between industry experience and instructional effectiveness? thanks in advance

Jim Frost says

Hi Irene,

The best way to think of it is to look at the graphs in this article and compare the higher correlation graphs to the lower correlation graphs. In the higher correlation graphs, if you know the value of one variable, you have a more precise prediction of the value of the other variable. Look along the x-axis and pick a value. In the higher correlation graphs, the range of y-values that correspond to your x-value is narrower. That range is relatively wide for lower correlations.

For your example, I’ll assume there is a positive correlation. As industry experience increases, instructional effectiveness also increases. However, because that relationship is weak, the range of instructional effectiveness for any given value of industry experience is relatively wide.

I hope this helps!

tushar says

if correlation between X and Y is 0.8 .what is the correlation of -X and -Y

Jim Frost says

Hi Tushar,

If you take all the values of X and multiply them by -1 and do the same for Y, your correlation would still be 0.8.

Lorraine M says

This is very helpful, thank you Jim!

Lorraine M says

Hi, My data is continuous – the variables are individual shares volatility and oil prices and they were non-normal. I used Kendall’s Tau and did not rank the data or alter it in any way. Can my results be trusted?

Jim Frost says

Hi Lorraine,

Kendall’s Tau is a correlation coefficient for ranked data. Even though you might not have ranked your data, your statistical software must have created the ranks behind the scenes.

Typically, you’ll use Pearson’s correlation when you have continuous data that have a straight line relationship. If your data are ordinal, ranked, or do not have a straight line relationship, using something other than Pearson’s correlation is necessary.

You mention that your data are nonnormal. Technically, you want to graph your data and look at the shape of the relationship rather than assessing the distribution for each variable. Although, nonnormality can make a linear relationship less likely. So, graph your data on a scatterplot and see what it looks like. If it is close to a straight line, you should probably use Pearson’s correlation. If it’s not a straight line relationship, you might need to use something like Kendall’s Tau or Spearman’s rho coefficient, both of which are based on ranked data. While Spearman’s rho is more commonly used, Kendall’s Tau has preferable statistical properties.

Yohan Park says

Hi, Jim.

If correlations between continuous variables can be measured using Pearson’s, how is correlation between categorical variables measured?

Thank you.

Jim Frost says

Hi Yohan,

There are several possible methods, although unlike with continuous data, there doesn’t seem to be a consensus best approach.

But, first off, if you want to determine whether the relationship between categorical variables is statistically significant, use the chi-square test of independence. This test determines whether the relationship between categorical variables is significant, but it does not tell you the degree of correlation.

For the correlation values themselves, there are different methods, such as Goodman and Kruskal’s lambda, Cramér’s V (or phi) for categorical variables with more than 2 levels, and the Phi coefficient for binary data. There are several others that are available as well. Offhand I don’t know the relative pros and cons of each methodology. Perhaps that would be a good post for the future!

Nima says

Thanks, great explanations.

Curt Miller says

Hi Jim,

In a multi-variable regression model, is there a method for determining where two predictor variables are correlated in their impact on the outcome variable?

If so, then how is this type of scenario determined, and handled?

Thanks,

Curt

Jim Frost says

Hi Curt,

When predictors are correlated, it’s known as multicollinearity. This condition reduces the precision of the coefficient estimates. I’ve written a post about it: Multicollinearity: Detection, Problems, and Solutions. That post should answer all your questions!

Susan Murphy says

Hi Jim: Great explanations. One quick thing, because the probability distribution is asymptotic, there is no p=.000. The probability can never be zero. I see students reporting that or p<.000 all of the time. The actual number may be p <.00000001, so setting a level of p < .001 is usually the best thing to do and seems like journal editors want that when reporting data. Your thoughts?

Jim Frost says

Hi Susan, yes, you’re correct about that. You can’t have a p-value that equals zero. Sometimes software will round down when it’s a very small value. The underlying issue is that no matter how large the difference between your sample value and the null hypothesis value, there is a non-zero probability that you’d obtain the observed results when the null is true.

As for whether it’s a good practice, it probably is one because it makes it explicit. For the p-value that the software displays in this post of 0.000, that actually indicates that it is p < 0.0005. If it was greater than or equal to that value, the software would have rounded it up to 0.001. But, it must've been less than that value. It makes it a bit more clear that there is a tiny probability rather than a zero probability!

Mansoor Ahmad says

Sir you are love. Such a nice share

Kingsley Tembo says

Awesome stuff, really helpful

Patrick says

Hi there,

What do you do when you can’t perform randomized controlled experiments, like in the cases of social science or societal wide health issues? Apropos to gun violence in America, there appears to be correlation between the availability of guns in a society and the number of gun deaths in a society, where as the number of guns in the society goes up the number of gun deaths go up. This is true of individual states in the US where gun availability differs, and also in countries where gun availability differs. But, when/how can you come to a determination that lowering the number of guns available in a society could reasonably be said to lower the number of gun deaths in that society.

Thanks!

Jim Frost says

Hi Patrick,

It is difficult proving causality using observational studies rather than randomized experiments.

In my mind, the following approach can help when you’re trying to use observational studies to show that A causes B.

In observational study, you need to worry about confounding variables because the study is not randomized. These confounding variables can provide alternative explanations for the effect/correlations. If you can include all confounding variables in the analysis, it makes the case stronger because it helps rule out other causes. You must also show that A precedes B. Further, it helps if you can demonstrate the mechanism by which A causes B. That mechanism requires subject-area knowledge beyond just a statistical test.

Those are some ideas that come to my mind after brief reflection. There might well be more and, of course, there will be variations based on the study-area.

I hope this helps!

Patrik Silva says

Dear Jim,

Thank you so much, I am learning a lot of thing from you!

Please, keep doing this great job!

Best regards

PS

Jim Frost says

You bet, Patrik!

Patrik Silva says

Another question is: should I consider transform my variable before using person correlation, if they do not follow normal distribution or if the two variable do not have a clear liner relationship? What is the implication of that transformation? How to interpret the relationship if used transformed variable (let“s say log)?

Jim Frost says

Because the data need to follow the bivariate normal distribution to use the hypothesis test, I’d assume the transformation process would be more complex than transforming each variable individually. However, I’m not sure about this.

However, if you just want to make a straight line for the correlation to assess, I’d be careful about that too. The correlation of the transformed data would not apply to the untransformed data. One solution would be to use Spearman’s rank order correlation. Another would be to use regression analysis. In regression analysis, you can fit curves, use transformations, etc., and the assumption is that the residual follow a normal distribution (along with some other assumptions) is easy to check.

If you’re not sure that your data fit the assumptions for Pearson’s correlation, consider using regression instead. There are more tools there for you to use.

Patrik Silva says

Hi Jim,

I am always here following your posts.

I would like if you could clarify something to me, please!

What is the assumptions for person correlation that must hold true, in order to apply correlation coefficient?

I have read something on the internet, but there is many confusion. Some people are saying that the dependent variable (if have) must be normally distributed, other saying both (dependent and independent) must be following normal distribution. Therefore, I dont know which one I should follow. I would appreciate a lot your kind contribution. This is something that I am using for my paper.

Thank you in advance!

Jim Frost says

Hi Patrik,

I’m so glad to see that you’re hear reading and learning!

This issue turns out to be a bit complicated!

The assumption is actually that the two variables follow a bivariate normal distribution. I won’t go into that here in much detail, but a bivariate normal distribution is more complex than just each variable following a normal distribution. In a nutshell, if you plot data that follow a bivariate normal distribution on a scatterplot, it’ll appear as an elliptical shape.

In terms of the the correlation coefficient, that simply describes the relationship between the data. It is what it is and the data don’t need to follow a bivariate normal distribution as long as you are assessing a linear relationship.

On the other hand, the hypothesis test of Pearson’s correlation coefficient does assume that the data follow a bivariate normal distribution. If you want to test whether the coefficient equals zero, then you need to satisfy this assumption. However, one thing I’m not sure about is whether the test is robust to departures from normality. For example, a 1-sample t-test assumes normality, but with a large enough sample size you don’t need to satisfy this assumption. I’m not sure if a similar sample size requirement applies to this particular test.

I hope this clarifies this issue a bit!

Moritz Geisthoevel says

Hello,

thanks for the good explanation.

Do variables have to be normally distributed to be analyzed in a Pearson’s correlation?

Thanks,

Moritz

Jim Frost says

Hi Moritz,

No, the variables do not need to follow a normal distribution to use Pearson’s correlation. However, you do need to graph the data on a scatterplot to be sure that the relationship between the variables is linear rather than curved. For curved relationships, consider using Spearman’s rank correlation.

Jerry Tuttle says

Pearson’s correlation measures only linear relationships. But regression can be performed with nonlinear functions, and the software will calculate a value of R^2. What is the meaning of an R^2 value when it accompanies a nonlinear regression?

Jim Frost says

Hi Jerry, you raise an important point. R^2 is actually not a valid measure in nonlinear models. To read about why, read my post about R-squared in nonlinear models. In that post, I write about why it’s problematic that many statistical software packages do calculate R-squared values for nonlinear regression. Instead, you should use a different goodness-of-fit measure, such as the standard error of the regression.

Matt says

Hi, fantastic blog, very helpful. I was hoping I could ask a question?

You talk about correlation coefficients but I was wondering if you have a section that talks about the slope of an association? For example, am I right in thinking that the slope is equal to the standardized coefficient from a regression?

I refer to the paper of Cameron et al., (The Aging of Elastic and Muscular

Arteries. Diabetes Care 26:2133–2138, 2003) where in table 3 they report a correlation and a slope. Is the correlation the r value and the slope the beta value?

Many thanks,

Matt

Jim Frost says

Hi Matt,

Thanks and I’m glad you found the blog to be helpful!

Typically, you’d use regression analysis to obtain the slope and correlation to obtain the correlation coefficient. These statistics represent fairly different types of information. The correlation coefficient (r) is more closely related to R^2 in simple regression analysis because both statistics measure how close the data points fall to a line. Not surprisingly if you square r, you obtain R^2.

However, you can use r to calculate the slope coefficient. To do that, you’ll need some other information–the standard deviation of the X variable and the standard deviation of the Y variable.

The formula for the slope in simple regression = r(standard deviation of Y/standard deviation of X).

For more information, read my post about slope coefficients and their p-values in regression analysis. I think that will answer a lot of your questions.

I hope this helps!

Pascal Caillet says

Hi,

Nice post ! About pitfalls regarding correlation’s interpretation, here’s a funny database:

http://www.tylervigen.com/spurious-correlations

And a nice and poetic illustration of the concept of correlation:

https://www.youtube.com/watch?v=VFjaBh12C6s&t=0s&index=4&list=PLCkLQOAPOtT1xqDNK8m6IC1bgYCxGZJb_

Have a nice day

Jim Frost says

Hi Pascal,

Thanks for sharing those links! It always fun finding strange correlations like that.

The link for spurious correlations illustrates an important point. Many of those funny correlations are for time series data where both variables have a long-term trend. If you have two variables that you measure over time and they both have long term trends, those two variables will have a strong correlation even if there is no real connection between them!

Jerome E Tuttle says

Hi.

“In statistics, you typically need to perform a randomized, controlled experiment to determine that a relationship is causal rather than merely correlation.”

Would you please provide an example where you can reasonably conclude that x causes y? And how do you know there isn’t a z that you didn’t control for?

Thanks.

Jim Frost says

Hi Jerome,

That’s a great question. The trick is that when you perform an experiment, you should randomly assign subjects to treatment and control groups. This process randomly distributes any other characteristics that are related to the outcome variable (y). Suppose there is a z that is correlated to the outcome. That z gets randomly distributed between the treatment and control groups. The end result is that z should exist in all groups in roughly equal amounts. This equal distribution should occur even if you don’t know what z is. And, that’s the beautiful thing about random assignment. You don’t need to know everything that can affect the outcome, but random assignment still takes care of it all.

Consequently, if there is a relationship between a treatment and the outcome, you can be pretty certain that the treatment causes the changes in the outcome because all other correlation-only relationships should’ve been randomized away.

I’ll be writing about random assignment in the near future. And, I’ve written about the effectiveness of flu shots, which is based on randomized controlled trials.

I hope this helps!