What is a Covariate?
Covariates are continuous independent variables (or predictors) in a regression or ANOVA model. These variables can explain some of the variability in the dependent variable.
That definition of covariates is simple enough. However, the usage of the term has changed over time. Consequently, analysts can have drastically different contexts in mind when discussing covariates.
Historically, statisticians considered covariates to be a subtype of continuous predictors that appears only in ANOVA models, usually relating to designed experiments (DOE). Originally, they were part of experimental designs where the primary variables of interest are categorical factors that the researchers control.
In these designs, most other potential explanatory variables (confounders) are addressed by controlling the experimental environment and using a randomized design. However, in some studies, analysts might be aware of uncontrollable variables that could influence the outcome.
These nuisance variables are covariates. They’re a nuisance because they can increase both variability and bias.
Including these nuisance variables as covariates in the model statistically controls their impact on the dependent variable, which can increase statistical power and reduce confounder bias. Learn more about How Confounders Can Bias Your Results.
So, the historical definition of a covariate is that it is:
- In a randomized experimental design where researchers set the categorical factors of primary interest.
- A continuous, independent variable that researchers measure (as opposed to setting).
- Uncontrollable and can’t be randomized (i.e., a nuisance).
- Not a primary variable of interest even though it correlates with the outcome.
When you include a covariate in an ANOVA model, it becomes an ANCOVA model (Analysis of Covariance). Learn more about ANCOVA: Uses, Assumptions & Example.
I’ve heard long-time researchers stick steadfastly to this definition and even firmly proclaim that the analytical procedure must enter a covariate into the model last to calculate the sums of squares correctly!
Related posts: Experimental Designs and Independent vs. Dependent Variables
Modern Usage
In current times, the historical definition of covariate has faded somewhat. Many analysts use this term as a synonym for a continuous predictor—not only for the specific subset of experimental design cases I describe above.
In current usage, a covariate might be a primary variable of interest in a non-DOE context!
In an analytical sense, the modern usage is valid. Covariates in the stricter context perform the same function as continuous predictors in the broader definition.
Just be aware that some analysts will have an extremely specific context in mind when discussing covariates. Others will be thinking in much broader terms!
Covariate Example
Let’s look at a covariate example that fits the original definition involving an experimental design.
Consider a manufacturing process where temperature and pressure are experimental factors. The experimenters set the temperature controls at A, B, and C and the pressure controls at X, Y, and Z. While temperature and pressure are continuous variables, the experiment treats them as categorical factors because the researchers set them to several specific values.
To minimize sources of variation and the effect of other variables, the researchers control the experimental environment as much as possible and use randomization to determine the settings for each experimental run. All in all, it’s a highly controlled, randomized experiment.
However, the researchers know from experience that humidity levels also affect the outcome. Unfortunately, humidity is much harder to control because it depends on outdoor conditions and is impossible to regulate throughout the manufacturing environment. Consequently, they record humidity as a covariate during each experimental run so the ANCOVA model can account for its effect.
The manufacturer is primarily interested in how temperature and pressure affect their manufacturing outcome. However, by including humidity as a covariate, the model can control for changing humidity conditions during the experiment.
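To make the example concrete, here is a minimal sketch of that comparison in Python with NumPy. Everything numeric in it is an assumption for illustration only: the 36-run layout, the effect sizes, the humidity range, and the noise level are simulated, not data from a real process. It fits a main-effects model with temperature and pressure as dummy-coded factors, then adds humidity as a continuous covariate and compares the residual mean squares.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical 3x3 factorial with 4 replicates per cell (36 runs).
temps = ["A", "B", "C"]
pressures = ["X", "Y", "Z"]
runs = [(t, p) for t in temps for p in pressures for _ in range(4)]
temp = [t for t, _ in runs]
pressure = [p for _, p in runs]
n = len(runs)

# Made-up effects, used only to simulate an outcome.
temp_effect = {"A": 0.0, "B": 2.0, "C": 4.0}
pressure_effect = {"X": 0.0, "Y": 1.0, "Z": 3.0}
humidity = rng.uniform(30, 70, size=n)  # measured on each run, not set
y = (
    50.0
    + np.array([temp_effect[t] for t in temp])
    + np.array([pressure_effect[p] for p in pressure])
    + 0.3 * humidity
    + rng.normal(0, 1, size=n)
)


def dummies(labels, levels):
    """0/1 indicator columns for every level except the first (reference)."""
    return np.column_stack(
        [[float(lab == lev) for lab in labels] for lev in levels[1:]]
    )


def fit(X, y):
    """Ordinary least squares; returns coefficients and residual mean square."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid) / (len(y) - X.shape[1])


# Factors only (ANOVA-style main-effects model).
X_anova = np.column_stack(
    [np.ones(n), dummies(temp, temps), dummies(pressure, pressures)]
)
# Same model plus the continuous covariate (ANCOVA-style model).
X_ancova = np.column_stack([X_anova, humidity])

_, mse_anova = fit(X_anova, y)
beta_ancova, mse_ancova = fit(X_ancova, y)

print(f"Residual mean square, factors only:      {mse_anova:.2f}")
print(f"Residual mean square, with humidity:     {mse_ancova:.2f}")
print(f"Estimated humidity slope (simulated 0.3): {beta_ancova[-1]:.3f}")
```

In this simulation, the unexplained variation shrinks once humidity enters the model, which is the mechanism behind the gain in statistical power. A dedicated statistics package would also report the F-tests; this sketch only shows the model structure.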
Swati Puranik says
Hi Jim,
Thank you for the detailed explanation. But I am confused about a dataset that I have been handed. It is a series of RCBD field experiments measuring various traits (plant height, flowering days etc.) conducted over three years on different genotypes of a particular plant species. I have three replicates for each measurement. So, primarily the hypothesis questions are:
1. Is the trait affected by differences in the genotype?
2. Is the trait affected by changes in the year?
3. Is the trait affected by both genotype and year?
4. Is there an impact of replicates?
So, which one should I consider as factor and which one as a covariate?
To further increase the complexity, these genotypes have different ploidy levels. But I am assuming that since this is something that can be experimentally measured but not humanly controlled, it should be included as a covariate.
Jim Frost says
Hi Swati,
From what you write, it sounds like the following are the types of variables in your study:
Outcomes: the various traits. Probably need to fit separate models for each trait/outcome.
Factors: genotype, year (you might include year as a blocking variable instead.)
Covariate: ploidy levels (if it’s a continuous variable)
As for determining the role of replicates, assess the consistency of results across replicates. If there’s large variation, uncontrolled variables might be affecting the results. You’re hoping for low variation between replicates.
I hope that helps!
CH says
Can race, ethnicity, and school type be used as covariates in studies on high school students if the predictor variables are parent future expectations for students and the outcome variable is student grades? My rationale is that these variables may explain some of the variance in the model. The sample population is from grades 9-12. I have used these variables in a stepwise regression and now I am rethinking it based on your definition of covariate.
Jim Frost says
Hi,
Yes, your logic is sound for including them! The only thing is that the variables you mention are categorical variables rather than continuous. So, they can’t technically be covariates because that term is reserved for continuous variables. However, you can include those categorical variables in your model for the same reasons as you do for covariates. You can call them demographic control variables or something like that. So, again, your logic for including them is sound! 🙂
Collin says
Can’t the type of experimental design, say, a randomized complete block design (RCBD) or a Latin square design, rule out the effects of a potential confounding factor? In other words, isn’t the consideration of a covariate only applicable to RCD trials?
Jim Frost says
Hi Collin,
Blocked designs, including Latin Square designs, are one way to handle nuisance variables. Blocks are essentially a categorical nuisance variable. For example, a block might represent days when you think the experimental conditions might change over the different days on which experimental runs occur. With blocks, you might not even be sure exactly what the nuisance is, or it might be a combination of variables, such as with blocking by day. That said, blocks can certainly represent known factors, such as material batches, shifts, etc. But either way, blocks are categorical.
Covariates are another method for handling continuous nuisance variables. You’ll enter the nuisance variables with continuous values. Humidity is a good example of a covariate. It’s not categorical but quite clearly a continuous variable where you’d enter the percentage.
And you can use blocks and covariates together too!
Lopez says
Hi Jim,
I have a couple of questions on this topic.
As you say in one of your replies, in a blocked design, “blocks are essentially a categorical nuisance variable”. In your example, let’s say units/observations are arranged in five blocks (1…5) according to ranges of humidity (very high, high, medium, low, very low or 1…5). Clearly, if we use this variable as a blocking (nuisance) variable, the df would be 4 (5-1). My question is: if we use numbers to identify the blocks (1…5), would it be possible (or acceptable) to use the variable as a covariate (using the block number rather than the actual humidity value)? (In this case the df would be 1, and I cannot see the reason to do so.)
The other question is if in that particular example it would be possible (and would make sense) to include in the same statistical model both the block factor (as a nuisance variable with df=4) and also the actual humidity value as a covariate (with df=1). I am not sure if this approach will contribute to reducing nuisance further or, instead, would be rather redundant.
Thank you
Jim Frost says
Hi Lopez,
For your first question, blocks are categorical variables. If you identify the blocks using numbers, it’s vital that you DON’T treat them as continuous. If you have the actual humidity value, you can include it as a covariate instead. Covariates are roughly the continuous version of blocks in an experiment.
I would not include humidity as a block and covariate because that would be redundant. If you have the actual humidity reading for each experimental run, include it as a covariate. If you have it recorded as a categorical variable (as you described), include it as a block.
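To illustrate why numbered blocks shouldn't be treated as continuous, here is a small sketch in Python with NumPy. The layout (five blocks, three runs each) and the block effects are made up for illustration, and the outcome has no noise so the contrast is easy to see. Coding the blocks as categories uses 4 degrees of freedom and can match any pattern of block means. Coding the block number as a single numeric column uses 1 degree of freedom and forces a straight-line trend across block numbers.

```python
import numpy as np

# Hypothetical layout: 5 humidity blocks, 3 runs per block.
block = np.repeat([1, 2, 3, 4, 5], 3)

# Made-up block effects that do not rise steadily with the block number.
effect = {1: 0.0, 2: 3.0, 3: -2.0, 4: 5.0, 5: 1.0}
y = np.array([effect[int(b)] for b in block])

intercept = np.ones(len(block))
levels = np.unique(block)

# Categorical coding: one 0/1 column per block except the reference block.
X_categorical = np.column_stack(
    [intercept, (block[:, None] == levels[1:]).astype(float)]
)
# Numeric coding: the block number itself as one continuous column.
X_numeric = np.column_stack([intercept, block.astype(float)])


def sse(X, y):
    """Residual sum of squares from an ordinary least squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


df_categorical = X_categorical.shape[1] - 1  # columns beyond the intercept
df_numeric = X_numeric.shape[1] - 1
sse_categorical = sse(X_categorical, y)
sse_numeric = sse(X_numeric, y)

print(f"Blocks as categories: df = {df_categorical}, SSE = {sse_categorical:.2f}")
print(f"Block number as numeric: df = {df_numeric}, SSE = {sse_numeric:.2f}")
```

With these made-up effects, the categorical coding reproduces the block means, while the numeric coding leaves most of the block-to-block variation in the residuals.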
Lopez says
Thank you
It is very clear and this seems common sense