What is a Covariate?
Covariates are continuous independent variables (or predictors) in a regression or ANOVA model. These variables can explain some of the variability in the dependent variable.
That definition of covariates is simple enough. However, the usage of the term has changed over time. Consequently, analysts can have drastically different contexts in mind when discussing covariates.
Historically, statisticians considered covariates to be a subtype of continuous predictors that appears only in ANOVA models, usually relating to designed experiments (DOE). Originally, they were part of experimental designs where the primary variables of interest are categorical factors that the researchers control.
In these designs, most other potential explanatory variables (confounders) are addressed by controlling the experimental environment and using a randomized design. However, analysts might be aware of uncontrollable variables that could influence the outcome in some studies.
These nuisance variables are covariates. They’re a nuisance because they can increase both variability and bias.
Including these nuisance variables as covariates in the model statistically controls their impact on the dependent variable, which can increase statistical power and reduce confounder bias. Learn more about How Confounders Can Bias Your Results.
So, the historical definition of a covariate is that it is:
- In a randomized experimental design where researchers set the categorical factors of primary interest.
- A continuous, independent variable that researchers measure (as opposed to setting).
- Uncontrollable and can’t be randomized (i.e., a nuisance).
- Not a primary variable of interest even though it correlates with the outcome.
When you include a covariate in an ANOVA model, it becomes an ANCOVA model (Analysis of Covariance). Learn more about ANCOVA: Uses, Assumptions & Example.
I’ve heard long-time researchers stick steadfastly to this definition and even firmly proclaim that the analytical procedure must enter a covariate into the model last to calculate the sums of squares correctly!
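One plausible source of that "enter it last" rule is sequential (Type I) sums of squares, where each term is credited only with the variation left over after the terms entered before it. The sketch below illustrates that idea and is not a reconstruction of any particular software's procedure. The data, the three-level factor, and all effect sizes are invented, and the covariate is deliberately made to correlate with the factor so that the order of entry matters.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible

# Hypothetical data: a 3-level factor and a covariate that is partly
# related to the factor (that relationship is what makes order matter).
group = np.repeat([0, 1, 2], 10)
x = rng.normal(0, 1, 30) + 0.8 * group
y = 10 + 1.5 * (group == 1) + 3.0 * (group == 2) + 2.0 * x + rng.normal(0, 1, 30)

def rss(X, y):
    """Residual sum of squares from an ordinary least squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

ones = np.ones((30, 1))
G = np.column_stack([group == 1, group == 2]).astype(float)  # factor dummies
C = x[:, None]                                               # covariate column

rss_null = rss(ones, y)
rss_factor = rss(np.hstack([ones, G]), y)
rss_cov = rss(np.hstack([ones, C]), y)
rss_full = rss(np.hstack([ones, G, C]), y)

# Sequential SS for the covariate under the two entry orders.
ss_cov_first = rss_null - rss_cov    # covariate entered before the factor
ss_cov_last = rss_factor - rss_full  # covariate entered after the factor

print(f"Covariate SS, entered first: {ss_cov_first:.2f}")
print(f"Covariate SS, entered last:  {ss_cov_last:.2f}")
```

When the covariate is entered last, its sequential sum of squares equals its adjusted sum of squares, which is the quantity most analysts want. Software that reports adjusted sums of squares gives the same answer regardless of entry order.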
Related posts: Experimental Designs and Independent vs. Dependent Variables
Modern Usage
In current times, the historical definition of covariate has faded somewhat. Many analysts use this term as a synonym for a continuous predictor—not only for the specific subset of experimental design cases I describe above.
In current usage, a covariate might be a primary variable of interest in a non-DOE context!
In an analytical sense, the modern usage is valid. Covariates in the stricter context perform the same function as continuous predictors in the broader definition.
Just be aware that some analysts will have an extremely specific context in mind when discussing covariates. Others will be thinking in much broader terms!
Covariate Example
Let’s look at a covariate example that fits the original definition involving an experimental design.
Consider a manufacturing process where temperature and pressure are experimental factors. The experimenters set the temperature controls at A, B, and C and the pressure controls at X, Y, and Z. While temperature and pressure are continuous variables, the experiment treats them as categorical factors because the researchers set them to several specific values.
To minimize sources of variation and the effect of other variables, the researchers control the experimental environment as much as possible and use randomization to determine the settings for each experimental run. All in all, it’s a highly controlled, randomized experiment.
However, the researchers know from experience that humidity levels also affect the outcome. Unfortunately, humidity is much harder to control because it depends on outdoor conditions and is impossible to regulate throughout the manufacturing environment. Consequently, they record humidity as a covariate during each experimental run so the ANCOVA model can account for its effect.
The manufacturer is primarily interested in how temperature and pressure affect the manufacturing outcome. However, by including humidity as a covariate, the model can control for changing humidity conditions during the experiment.
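To make the example concrete, here is a minimal sketch of that kind of model, fitted with ordinary least squares in NumPy. Everything numeric is an assumption for illustration: the four runs per setting, the effect sizes, the humidity range, and the noise level are all invented, and the model includes main effects only (no temperature by pressure interaction).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical 3 x 3 factorial: temperature (A, B, C) by pressure (X, Y, Z),
# with 4 runs per combination. All effect sizes below are invented.
temp = np.repeat([0, 1, 2], 12)              # 0=A, 1=B, 2=C
pres = np.tile(np.repeat([0, 1, 2], 4), 3)   # 0=X, 1=Y, 2=Z
humidity = rng.uniform(30, 70, size=36)      # measured, not set (percent)

temp_effect = np.array([0.0, 2.0, 4.0])
pres_effect = np.array([0.0, 1.0, 3.0])
true_slope = 0.5
y = (50 + temp_effect[temp] + pres_effect[pres]
     + true_slope * humidity + rng.normal(0, 1, size=36))

def dummies(codes, levels=3):
    """Indicator columns for every level except the first (reference) level."""
    return np.column_stack([(codes == k).astype(float) for k in range(1, levels)])

def fit(X, y):
    """Ordinary least squares; returns coefficients and residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid)

intercept = np.ones((36, 1))
X_anova = np.hstack([intercept, dummies(temp), dummies(pres)])  # factors only
X_ancova = np.hstack([X_anova, humidity[:, None]])              # plus covariate

_, rss_anova = fit(X_anova, y)
beta_ancova, rss_ancova = fit(X_ancova, y)
humidity_slope = beta_ancova[-1]

print(f"Residual SS without humidity: {rss_anova:.1f}")
print(f"Residual SS with humidity:    {rss_ancova:.1f}")
print(f"Estimated humidity slope:     {humidity_slope:.2f}")
```

In this simulated data, humidity accounts for much of the run-to-run variation, so adding it as a covariate shrinks the residual sum of squares. A smaller residual is what gives the tests on temperature and pressure more power.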
Hi Jim,
Thank you for the detailed explanation. But I am confused about a dataset that I have been handed. It is a series of RCBD field experiments measuring various traits (plant height, flowering days, etc.) conducted over three years on different genotypes of a particular plant species. I have three replicates for each measurement. So, primarily, the hypothesis questions are:
1. Is the trait affected by differences in the genotype?
2. Is the trait affected by changes in the year?
3. Is the trait affected by both genotype and year?
4. Is there an impact of replicates?
So, which one should I consider as factor and which one as a covariate?
To further increase the complexity, these genotypes have different ploidy levels. I am assuming that, since ploidy can be experimentally measured but not humanly controlled, it should be included as a covariate.
Hi Swati,
From what you write, it sounds like the following are the types of variables in your study:
Outcomes: the various traits. Probably need to fit separate models for each trait/outcome.
Factors: genotype, year (you might include year as a blocking variable instead).
Covariate: ploidy levels (if it’s a continuous variable)
As for determining the role of replicates, assess the consistency of results across replicates. If there’s large variation, uncontrolled variables might be affecting the results. You’re hoping for low variation between replicates.
I hope that helps!
Can race, ethnicity, and school type be used as covariates in studies on high school students if the predictor variables are parent future expectations for students and the outcome variable is student grades? My rationale is that these variables may explain some of the variance in the model. The sample population is from grades 9-12. I have used these variables in a stepwise regression and now I am rethinking it based on your definition of covariate.
Hi,
Yes, your logic is sound for including them! The only thing is that the variables you mention are categorical variables rather than continuous. So, they can’t technically be covariates because that term is reserved for continuous variables. However, you can include those categorical variables in your model for the same reasons as you do for covariates. You can call them demographic control variables or something like that. So, again, your logic for including them is sound! 🙂
Can’t the type of experimental design, say, a randomized complete block design (RCBD) or a Latin square design, rule out the effects of a potential confounding factor? In other words, isn’t the consideration of a covariate only applicable to RCD trials?
Hi Collin,
Blocked designs, including Latin Square designs, are one way to handle nuisance variables. Blocks are essentially a categorical nuisance variable. For example, a block might represent days when you think the experimental conditions might change over the different days on which experimental runs occur. With blocks, you might not even be sure exactly what the nuisance is, or it might be a combination of variables, such as with blocking by day. That said, blocks can certainly represent known factors, such as material batches, shifts, etc. But either way, blocks are categorical.
Covariates are another method for handling continuous nuisance variables. You’ll enter the nuisance variables with continuous values. Humidity is a good example of a covariate. It’s not categorical but quite clearly a continuous variable where you’d enter the percentage.
And you can use blocks and covariates together too!
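As a rough sketch of what using both looks like in a single model, the example below enters day as a block (categorical indicator columns) and humidity as a covariate (one continuous column) alongside a treatment factor. The layout, the day-to-day shifts, and all effect sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed so the sketch is reproducible

# Hypothetical layout: 2 treatments run on each of 4 days (blocks), 3 runs each.
treatment = np.tile(np.repeat([0, 1], 3), 4)  # 0 = control, 1 = treated
day = np.repeat([0, 1, 2, 3], 6)              # block label for each run
humidity = rng.uniform(30, 70, size=24)       # continuous covariate (percent)

day_shift = np.array([0.0, 1.0, -1.0, 2.0])   # invented day-to-day shifts
y = (20 + 3.0 * treatment + day_shift[day]
     + 0.2 * humidity + rng.normal(0, 0.5, size=24))

# Design matrix: intercept, treatment, block dummies (day 0 is the reference
# level), and the covariate entered with its continuous values.
X = np.column_stack(
    [np.ones(24), treatment]
    + [(day == d).astype(float) for d in (1, 2, 3)]
    + [humidity]
)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
treatment_effect = beta[1]
humidity_slope = beta[-1]

print(f"Estimated treatment effect: {treatment_effect:.2f}")
print(f"Estimated humidity slope:   {humidity_slope:.2f}")
```

The block columns absorb whatever differs between days, known or unknown, while the humidity column adjusts for a nuisance variable you can actually measure on each run.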