Pearson’s correlation coefficient formula produces a number ranging from -1 to +1, quantifying the strength and direction of the linear relationship between two continuous variables. A correlation of -1 means a perfect negative relationship, +1 represents a perfect positive relationship, and 0 indicates no linear relationship.
In this post, you’ll learn about the correlation coefficient formula and gain insight into how it works. Then we’ll work through an example calculation so you learn how to find the correlation coefficient.
For more information specifically about interpretations, read my post, Interpreting Correlation Coefficients.
Pearson’s Correlation Coefficient Formula
The equation might initially seem daunting, but we’re here to demystify it.
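So, let’s take a look at the formula itself. The Greek symbol ρ (rho) represents Pearson’s correlation coefficient in a population, while r is the usual symbol for the sample value that this formula calculates.
The correlation coefficient formula is the following fraction:
r = Σ (Xᵢ – X̄)(Yᵢ – Ȳ) / [(N – 1) * sx * sy]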
- Xᵢ and Yᵢ represent the individual values of variables X and Y.
- X̄ and Ȳ denote their respective means.
- N represents the number of observations.
- sx and sy represent the sample standard deviations of X and Y.
By understanding the correlation formula and how it works as a fraction, you can gain insight into how it assesses the data.
You can also use this formula to calculate Spearman’s correlation by applying it to the ranks of the values rather than the raw data values.
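To make the fraction concrete, here is a minimal Python sketch of the formula using only the standard library. The function names `pearson_r`, `ranks`, and `spearman_rho` are my own for illustration, and the simple ranking helper assumes there are no tied values.

```python
from statistics import mean, stdev


def pearson_r(x, y):
    """Sum of deviation products divided by (n - 1) * sx * sy."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y need the same length, with at least 2 values")
    x_bar, y_bar = mean(x), mean(y)
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    denominator = (len(x) - 1) * stdev(x) * stdev(y)  # sample standard deviations
    return numerator / denominator


def ranks(values):
    """Rank values from 1 (smallest) upward. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's correlation: the same formula applied to ranks."""
    return pearson_r(ranks(x), ranks(y))


# Illustrative data: y rises with x, but along a curve (y = x cubed).
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(round(pearson_r(x, y), 3))     # below 1 because the relationship is curved
print(round(spearman_rho(x, y), 3))  # 1.0 because the ranks match exactly
```

If either variable has no variation, its standard deviation is zero and this sketch raises a `ZeroDivisionError`, which mirrors the fact that the correlation is undefined in that case.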
How the Correlation Coefficient Formula Works
The correlation formula works by comparing each variable’s observed values to their means in the numerator:
Σ (Xᵢ – X̄)(Yᵢ – Ȳ)
The product in the correlation coefficient formula’s numerator produces a greater number of positive values to add to the sum when the following conditions tend to occur:
- Above-average X values correspond with above-average Y values.
- Below-average X values correspond with below-average Y values.
A positive sum in the numerator produces a positive correlation.
Conversely, when above-average values for one variable tend to correspond with below-average values of the other, the numerator produces a greater number of negative values to subtract from the total. A negative sum in the numerator produces a negative correlation.
In this manner, the correlation formula assesses the co-variability of two variables around their respective means.
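A small sketch can show how the sign of the numerator follows the direction of the relationship. The helper name `deviation_products` and the two tiny datasets are made up purely for illustration.

```python
from statistics import mean


def deviation_products(x, y):
    """Return (Xi - mean of X) * (Yi - mean of Y) for each pair."""
    x_bar, y_bar = mean(x), mean(y)
    return [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]


# Y rises with X: above-average X pairs with above-average Y.
rising = deviation_products([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])

# Y falls as X rises: above-average X pairs with below-average Y.
falling = deviation_products([1, 2, 3, 4, 5], [50, 40, 30, 20, 10])

print(rising, sum(rising))    # no negative products, positive sum
print(falling, sum(falling))  # no positive products, negative sum
```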
The denominator of the correlation coefficient formula is the product of the degrees of freedom (N – 1) and the two standard deviations. The denominator is always positive because the degrees of freedom and the standard deviations are positive whenever both variables actually vary.
The numerator can be positive or negative but its absolute value can never be larger than the denominator, which is how the equation scales correlation coefficients to fit the range of -1 to +1.
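The reason the numerator can never outgrow the denominator is a standard result, the Cauchy–Schwarz inequality. The post doesn’t name it, so treat the following as my own sketch of the reasoning, written with the same symbols as the formula.

```latex
% Because s_x^2 = \sum (X_i - \bar{X})^2 / (N - 1), the denominator can be rewritten:
(N - 1)\, s_x s_y
  = \sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}\;\sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}

% Cauchy--Schwarz bounds the numerator by that same quantity:
\left| \sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y}) \right|
  \le \sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}\;\sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}
```

Dividing both sides by the denominator gives an absolute value of at most 1 for the coefficient.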
Covariance vs Correlation
Before working through the correlation coefficient formula, let’s look at how this equation is similar to the covariance formula and where the two crucially differ.
You find the covariance if you take the correlation coefficient formula’s numerator and only the (n – 1) in the denominator:
Covariance = Σ (Xᵢ – X̄)(Yᵢ – Ȳ) / (n – 1)
Dividing by the extra sx * sy bit in the denominator takes you from covariance to correlation. That’s the difference between the two statistical measures. That “extra bit” is the product of the standard deviations of X and Y, and it does two critical things.
First, it takes the -∞ to +∞ covariance range and scales it to the correlation coefficient’s easier-to-interpret -1 to +1 range.
Second, standard deviations use the original data units. Including both SDs in the denominator removes those units from the equation because they’re also in the numerator. Consequently, unlike the covariance, the correlation coefficient formula’s result is unitless and doesn’t change depending on the measurement units.
Suppose you are assessing the relationship between height and weight. If you were to change the height measurements from inches to centimeters, that would affect the covariance but not the correlation. You can even compare correlation coefficients between entirely dissimilar studies.
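Here is a rough sketch of that unit change in Python. The heights and weights are invented for illustration, and the function names `covariance` and `correlation` are my own.

```python
from statistics import mean, stdev


def covariance(x, y):
    """Sample covariance: sum of deviation products divided by (n - 1)."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Covariance divided by the product of the sample standard deviations."""
    return covariance(x, y) / (stdev(x) * stdev(y))


# Made-up heights (inches) and weights (pounds), for illustration only.
height_in = [60, 64, 66, 70, 72]
weight_lb = [115, 130, 150, 165, 180]
height_cm = [h * 2.54 for h in height_in]  # same heights, different units

# The covariance is 2.54 times larger after converting to centimeters.
print(covariance(height_in, weight_lb), covariance(height_cm, weight_lb))

# The correlation is the same in either unit (up to floating-point rounding).
print(correlation(height_in, weight_lb), correlation(height_cm, weight_lb))
```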
In summary, the standardized range and unitless nature make correlation far easier to interpret than covariance.
How to Find the Correlation Coefficient: Worked Example
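Let’s work through an example using the correlation formula to illustrate how to find the coefficient. Suppose we want to evaluate the relationship between the number of hours studied (X) and the test scores (Y) obtained by a group of five students. The data are below, with each position in the two lists belonging to the same student.
- Hours studied (X): 3, 5, 2, 7, 4
- Test scores (Y): 70, 80, 60, 90, 75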
For simplicity, I’ll split the calculations between the numerator and denominator and then divide them in the final step.
To start, we need to find the mean of both variables to use in the correlation formula.
X̄ = (3 + 5 + 2 + 7 + 4) / 5 = 4.2
Ȳ = (70 + 80 + 60 + 90 + 75) / 5 = 75
Then, follow these steps to calculate the numerator in the correlation coefficient formula:
- Calculate the differences between the observed X and Y values and each variable’s mean.
- Multiply those differences for each X and Y pair.
- Sum those products.
Notice that none of the products are negative: they are 6, 4, 33, 42, and 0. Above average X-values correspond with above average Y-values, and corresponding below average values similarly produce positive values because the product of two negatives is a positive. The single zero comes from the student whose score equals the mean of 75.
These products produce a positive total of 85 for the numerator. So, we know that we’ll have a positive correlation coefficient. We’ll use the total in the numerator of the correlation formula to calculate the coefficient’s value.
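The numerator steps can be sketched in a few lines of Python. The variable names are mine, and the data are the five students from this example.

```python
hours = [3, 5, 2, 7, 4]        # X
scores = [70, 80, 60, 90, 75]  # Y

x_bar = sum(hours) / len(hours)    # 4.2
y_bar = sum(scores) / len(scores)  # 75.0

products = []
for x, y in zip(hours, scores):
    dx, dy = x - x_bar, y - y_bar  # differences from each mean
    product = dx * dy
    products.append(product)
    # Adding 0.0 turns a floating-point "negative zero" into plain 0.0 for display.
    shown = round(product, 1) + 0.0
    print(f"{x:>2} {y:>3} {dx:>5.1f} {dy:>6.1f} {shown:>6.1f}")

numerator = sum(products)
print(round(numerator, 2))  # 85.0
```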
For the denominator of the correlation coefficient formula, we need to calculate the product of the degrees of freedom, the standard deviation of X, and the standard deviation of Y:
(n – 1) * sx * sy
Here, n (written as N in the formula) is the number of paired observations, usually the number of rows in your dataset without missing values. We have 5 observations, so n – 1 = 4.
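I cover how to calculate the standard deviation elsewhere. So, for this example, I’ll have Excel calculate the sample standard deviations for X and Y, which are approximately 1.92 and 11.18, respectively.
We just multiply all these values together for the denominator. To avoid rounding error, I use the standard deviations with more decimal places (the two-decimal values would give 85.86 instead).
4 * 1.9235 * 11.1803 ≈ 86.02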
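If you don’t have Excel handy, a minimal sketch with Python’s standard library reproduces the denominator. `statistics.stdev` computes the sample standard deviation, which is the version this formula needs.

```python
from statistics import stdev

hours = [3, 5, 2, 7, 4]        # X
scores = [70, 80, 60, 90, 75]  # Y

n = len(hours)
s_x = stdev(hours)   # sample standard deviation of X
s_y = stdev(scores)  # sample standard deviation of Y

denominator = (n - 1) * s_x * s_y
print(round(s_x, 2), round(s_y, 2))  # 1.92 11.18
print(round(denominator, 2))         # 86.02
```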
Calculating the Correlation
At this point, we just divide the numerator by the denominator to find the coefficient!
85 / 86.02 ≈ 0.99
For these data, the correlation between hours of studying and test scores is 0.99. That’s a strong positive relationship. The more you study, the higher your score. This correlation is unrealistically high, but these are made-up data.
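As a final check, here is a short end-to-end sketch of the whole calculation. The optional cross-check uses `statistics.correlation`, which is only available in Python 3.10 and later, so the code skips it on older versions.

```python
import statistics

hours = [3, 5, 2, 7, 4]        # X
scores = [70, 80, 60, 90, 75]  # Y

n = len(hours)
x_bar = statistics.mean(hours)
y_bar = statistics.mean(scores)

# Numerator: sum of the products of the differences from each mean.
numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))

# Denominator: degrees of freedom times both sample standard deviations.
denominator = (n - 1) * statistics.stdev(hours) * statistics.stdev(scores)

r = numerator / denominator
print(round(r, 2))  # 0.99

# Optional cross-check against the built-in function (Python 3.10+).
if hasattr(statistics, "correlation"):
    print(round(statistics.correlation(hours, scores), 2))  # 0.99
```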