What is Covariance?
Covariance in statistics measures the extent to which two variables vary together linearly. The sign of the covariance reveals whether the two variables tend to move in the same or opposite directions.
Covariance is like variance in that it measures variability. While variance focuses on the variability of a single variable around its mean, the covariance formula assesses the co-variability of two variables around their respective means. A value far from zero suggests an association exists between the variables, indicating that they tend to vary together, although its size also depends on the scales of the variables.
In this blog post, learn about the covariance formula and definition, how to interpret it, and how it differs from correlation. We’ll also delve into the formula with a worked example to calculate it.
Interpreting covariance is challenging due to its wide range of possible results. Its formula allows values to range from negative infinity to positive infinity. That’s pretty wide open! The magnitude and scale of the variables strongly affect the covariance value, making it difficult to determine the strength of the relationship.
Furthermore, comparing covariances across different datasets with varying scales can be misleading. A small value in one dataset may actually represent a stronger relationship than a large value in another dataset. In other words, a value of 3 or 3,000 might represent a strong or weak relationship, depending on the variables. In short, don’t use covariance to assess the strength of a relationship!
Let me illustrate how sensitive covariance is to scale. I have a dataset of 88 heights and weights. It contains the same observations in metric units (meters and kilograms) and imperial units (feet and pounds). They’re the same subjects—I just converted the units. Here’s the Excel file with the dataset if you want to try it: Covariance.
Excel calculated the covariance between height and weight for both measurement units:
- Metric: 0.57
- Imperial: 4.09
The covariance formula produces different values for each measurement scale. Just how strong is that relationship anyway? Hard to say!
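To see why the units matter, here is a minimal Python sketch. The heights and weights are hypothetical numbers I made up for illustration, not the 88 observations from the Excel file, and `sample_covariance` is a helper written for this example. Converting meters to feet and kilograms to pounds multiplies the covariance by the product of the two conversion factors (about 7.23), which is roughly the gap between the metric and imperial values above.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of deviation products divided by N - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical heights (m) and weights (kg); NOT the post's dataset.
heights_m = [1.50, 1.60, 1.65, 1.72, 1.80]
weights_kg = [52.0, 58.0, 63.0, 70.0, 77.0]

FEET_PER_METER = 3.28084
POUNDS_PER_KG = 2.20462

# Same subjects, different units.
heights_ft = [h * FEET_PER_METER for h in heights_m]
weights_lb = [w * POUNDS_PER_KG for w in weights_kg]

cov_metric = sample_covariance(heights_m, weights_kg)
cov_imperial = sample_covariance(heights_ft, weights_lb)

print(f"metric covariance:   {cov_metric:.3f}")
print(f"imperial covariance: {cov_imperial:.3f}")
# The ratio equals FEET_PER_METER * POUNDS_PER_KG, regardless of the data.
print(f"ratio:               {cov_imperial / cov_metric:.3f}")
```

The relationship between the subjects' heights and weights has not changed at all; only the measuring stick has.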
In short, use covariance to assess the direction of the relationship but not its strength. Its sign can be positive or negative, and each sign conveys a distinct type of association between the variables.
When the covariance between two variables is positive, they tend to move in the same direction. Higher values of one variable tend to correspond with higher values of the other variable. For instance, if we observe a positive covariance between the number of hours spent studying and the corresponding grades achieved by students, it implies that increased study hours generally coincide with higher grades. The two positive values for the height and weight dataset indicate that taller people tend to weigh more.
Conversely, a negative covariance signifies that the variables move in opposite directions. Higher values of one variable tend to correspond with lower values of the other variable. For instance, a negative value for rainfall and hours spent outdoors suggests that people tend to spend less time outside when it rains more.
Covariance vs. Correlation
You might be thinking, “Well, correlation also tells us about the relationship between variables.” Although covariance and correlation are related, they are distinct concepts. How are they different?
While both covariance and correlation assess the direction of the linear relationship between variables, correlation also tells us its strength and is comparable across different units and datasets.
Correlation standardizes the results by providing values between -1 and 1 that do not depend on the data’s scale, while the covariance formula does not include standardization. Thanks to these properties, correlation enables us to compare the direction and strength of relationships across different units, making it far more interpretable than covariance.
For comparison, when Excel calculates the correlation between height and weight, it finds the same correlation of 0.71 for both metric and imperial units. From this result, we can determine there is a moderately strong, positive relationship between the two variables.
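Here is a small Python sketch of that standardization, under the same caveat as any toy example: the heights and weights are hypothetical values, not the post's dataset, and the helper names are my own. It computes the Pearson correlation as the covariance divided by the product of the two standard deviations, then checks that a unit conversion leaves it unchanged.

```python
import math

def sample_covariance(xs, ys):
    """Sample covariance: sum of deviation products divided by N - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    sd_x = math.sqrt(sample_covariance(xs, xs))  # cov(X, X) is the variance of X
    sd_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (sd_x * sd_y)

# Hypothetical heights (m) and weights (kg); NOT the post's dataset.
heights_m = [1.50, 1.60, 1.65, 1.72, 1.80]
weights_kg = [52.0, 58.0, 63.0, 70.0, 77.0]

# Same subjects in imperial units.
heights_ft = [h * 3.28084 for h in heights_m]
weights_lb = [w * 2.20462 for w in weights_kg]

r_metric = correlation(heights_m, weights_kg)
r_imperial = correlation(heights_ft, weights_lb)

print(f"metric correlation:   {r_metric:.4f}")
print(f"imperial correlation: {r_imperial:.4f}")  # same value, up to rounding error
```

The conversion factors appear in both the covariance and the standard deviations, so they cancel in the ratio. That cancellation is what makes correlation unit-free.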
The sample covariance formula for two variables, X and Y, is as follows:
Cov(X, Y) = Σ (Xᵢ − X̄)(Yᵢ − Ȳ) / (N − 1)
Where:
- Xᵢ and Yᵢ represent the observed values of X and Y.
- X̄ and Ȳ denote their respective means.
- N is the number of observations.
By understanding the covariance formula, you can gain insight into how it assesses the data. The formula works by comparing each variable’s observed values to its mean.
The product in the formula’s numerator produces a greater number of positive values to add to the sum when the following conditions tend to occur:
- Above-average X values correspond with above-average Y values.
- Below-average X values correspond with below-average Y values.
A positive sum in the numerator produces a positive covariance.
Conversely, when above-average values for one variable tend to correspond with below-average values of the other, the numerator produces a greater number of negative values to subtract from the total. A negative sum in the numerator produces a negative covariance.
In this manner, the covariance formula assesses the co-variability of two variables around their respective means.
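The sign logic described above can be sketched in a few lines of Python. The five data pairs are hypothetical values chosen so the arithmetic is easy to follow, and `deviation_products` is a helper name I made up for this illustration.

```python
def deviation_products(xs, ys):
    """Return (x - mean_x) * (y - mean_y) for each pair of observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]

# Hypothetical data; both means are 3.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 1, 3, 5]

products = deviation_products(xs, ys)

for x, y, p in zip(xs, ys, products):
    if p > 0:
        label = "both on the same side of their means"
    elif p < 0:
        label = "on opposite sides of their means"
    else:
        label = "at least one value equals its mean"
    print(f"x={x}, y={y}: product={p:+.1f} ({label})")

# Sum the products and divide by N - 1 (sample covariance).
covariance = sum(products) / (len(xs) - 1)
print(f"covariance: {covariance}")  # 1.25
```

Only one pair lands on opposite sides of the means, so the positive products outweigh the negative one and the covariance comes out positive.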
To learn how to calculate the correlation, read my post, Correlation Coefficient Formula Walkthrough.
How to Calculate Covariance Example
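Let’s work through an example using the covariance formula to illustrate how to calculate it. Suppose we want to evaluate the relationship between the number of hours studied (X) and the test scores (Y) obtained by a group of five students. The data are below.

| Student | Hours studied (X) | Test score (Y) |
|---|---|---|
| 1 | 3 | 70 |
| 2 | 5 | 80 |
| 3 | 2 | 60 |
| 4 | 7 | 90 |
| 5 | 4 | 75 |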
To start, we need to find the mean of both variables to enter into the covariance formula.
X̄ = (3 + 5 + 2 + 7 + 4) / 5 = 4.2
Ȳ = (70 + 80 + 60 + 90 + 75) / 5 = 75
Then, follow these steps to calculate covariance:
- Calculate the differences between the observed X and Y values and each variable’s mean.
- Multiply those differences for each X and Y pair.
- Sum those products.
- Divide the sum by the degrees of freedom.
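The four steps above can be sketched in Python using the five students' data from this example. The variable names are my own, and the code uses the sample version of the formula, dividing by N − 1.

```python
hours = [3, 5, 2, 7, 4]        # X: hours studied
scores = [70, 80, 60, 90, 75]  # Y: test scores

n = len(hours)
mean_x = sum(hours) / n   # 4.2
mean_y = sum(scores) / n  # 75.0

# Step 1: differences between each observed value and its variable's mean.
dev_x = [x - mean_x for x in hours]
dev_y = [y - mean_y for y in scores]

# Step 2: multiply the differences for each X and Y pair.
products = [dx * dy for dx, dy in zip(dev_x, dev_y)]

# Step 3: sum those products.
sum_of_products = sum(products)

# Step 4: divide by the degrees of freedom (N - 1 for a sample).
covariance = sum_of_products / (n - 1)

for x, y, dx, dy, p in zip(hours, scores, dev_x, dev_y, products):
    print(f"X={x}, Y={y}: ({dx:+.1f}) * ({dy:+.1f}) = {p:.1f}")

print(f"sum of products:   {sum_of_products:.2f}")  # 85.00
print(f"sample covariance: {covariance:.2f}")       # 21.25
```

The products for the five students are 6, 4, 33, 42, and 0, which sum to 85. Dividing by 4 degrees of freedom gives 21.25.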
The positive covariance (21.25) suggests a positive association between the number of hours studied and exam scores. This result implies that as the number of study hours increases, the scores tend to increase.
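Note that Excel’s COVARIANCE.P function (and the older COVAR function) uses the population covariance formula, which divides by N, while this example uses the sample version, which divides by N − 1. So, the results will differ: for these data, the population formula gives 17 rather than 21.25. To get the sample value in Excel, use COVARIANCE.S.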
Typically, you’ll report correlations instead of covariances due to the interpretation issues. However, despite those shortcomings, covariance has specialized applications in various fields, including finance, genetics, meteorology, and oceanography. Here are a few examples.
Finance
In finance, covariance is vital in modern portfolio theory and the capital asset pricing model. Analysts calculate it between the returns of different assets to determine the relative amounts of each asset that investors should hold to achieve diversification and manage risk effectively.
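As a rough illustration of why covariance matters for diversification, here is a Python sketch of the standard two-asset portfolio variance identity: w₁²·Var(A) + w₂²·Var(B) + 2·w₁·w₂·Cov(A, B). The returns and weights are hypothetical numbers I invented for the example, and real portfolio analysis involves far more than this.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of deviation products divided by N - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical periodic returns for two assets (illustrative numbers only).
returns_a = [0.02, -0.01, 0.03, 0.01, -0.02]
returns_b = [0.00, 0.02, -0.01, 0.01, 0.01]

# Hypothetical portfolio weights that sum to 1.
weight_a, weight_b = 0.6, 0.4

var_a = sample_covariance(returns_a, returns_a)  # variance of asset A
var_b = sample_covariance(returns_b, returns_b)  # variance of asset B
cov_ab = sample_covariance(returns_a, returns_b)

# Two-asset portfolio variance from the variances and the covariance.
portfolio_variance = (
    weight_a**2 * var_a
    + weight_b**2 * var_b
    + 2 * weight_a * weight_b * cov_ab
)

# Cross-check: variance of the blended return series computed directly.
portfolio_returns = [weight_a * a + weight_b * b for a, b in zip(returns_a, returns_b)]
direct_variance = sample_covariance(portfolio_returns, portfolio_returns)

print(f"covariance between assets: {cov_ab:.6f}")
print(f"portfolio variance:        {portfolio_variance:.6f}")
print(f"direct check:              {direct_variance:.6f}")
```

In this made-up data, the two assets have a negative covariance, so the covariance term lowers the portfolio variance below what the two weighted variances alone would give. That reduction is the diversification benefit.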
Genetics and Molecular Biology
Covariance is important in genetics and molecular biology for studying the conservation of DNA sequences among species and analyzing secondary and tertiary structures of proteins and RNA. It enables the identification of necessary sequences for common structural motifs, contributing to our understanding of genetic relationships and heritability estimation of complex traits.
Meteorological and Oceanographic Data Assimilation
In meteorology and oceanography, covariance matrices are essential for estimating initial conditions in weather forecast models through data assimilation. These matrices help account for forecast and observational errors, facilitating accurate predictions by assimilating data into the model.