Independent variables and dependent variables are the two fundamental types of variables in statistical modeling and experimental design. Analysts use them to understand relationships between variables and to estimate effect sizes. In other words: what effect does one variable have on another?

In this post, learn the definitions of independent and dependent variables, how to identify each type, how they differ between different types of studies, and see examples of them in use.

## What is an Independent Variable?

Independent variables (IVs) are the ones that you include in the model to explain or predict changes in the dependent variable. The name helps you understand their role in statistical analysis. These variables are *independent*. In this context, independent indicates that they stand alone and other variables in the model do not influence them. The researchers are not seeking to understand what causes the independent variables to change.

Independent variables are also known as predictors, factors, treatment variables, explanatory variables, input variables, x-variables, and right-hand variables—because they appear on the right side of the equals sign in a regression equation. In notation, statisticians commonly denote them using Xs. On graphs, analysts place independent variables on the horizontal, or X, axis.

In machine learning, independent variables are known as features.

For example, in a plant growth study, the independent variables might be soil moisture (continuous) and type of fertilizer (categorical).

Statistical models will estimate effect sizes for the independent variables.

**Related post**: Effect Sizes in Statistics

### Including independent variables in studies

The nature of independent variables changes based on the type of experiment or study:

**Controlled experiments**: Researchers systematically control and set the values of the independent variables. In randomized experiments, relationships between independent and dependent variables tend to be causal. The independent variables *cause* changes in the dependent variable.

**Observational studies**: Researchers do not set the values of the explanatory variables but instead observe them in their natural environment. When the independent and dependent variables are correlated, those relationships might not be causal.

When you include one independent variable in a regression model, you are performing simple regression. For more than one independent variable, it is multiple regression. Despite the different names, it’s really the same analysis with the same interpretations and assumptions.
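To make the distinction concrete, here is a minimal sketch of simple regression (one IV) in pure Python, using made-up plant growth measurements. Multiple regression extends the same least-squares idea to several IVs, though in practice you would fit it with statistical software rather than by hand.

```python
def simple_regression(x, y):
    """Fit y = b0 + b1*x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: covariance of x and y divided by variance of x
    b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
         / sum((xi - mean_x) ** 2 for xi in x)
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical soil moisture readings (IV) and plant growth (DV)
moisture = [1.0, 2.0, 3.0, 4.0]
growth = [3.1, 5.0, 6.9, 9.0]

intercept, slope = simple_regression(moisture, growth)
print(round(intercept, 2), round(slope, 2))  # prints: 1.1 1.96
```

The slope is the estimated effect size: each additional unit of soil moisture associates with roughly 1.96 additional units of growth in this toy data.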

Determining which IVs to include in a statistical model is known as model specification. That process involves in-depth research and many subject-area, theoretical, and statistical considerations. At its most basic level, you’ll want to include the predictors you are specifically assessing in your study and confounding variables that will bias your results if you don’t add them—particularly for observational studies.

For more information about choosing independent variables, read my post about Specifying the Correct Regression Model.

**Related posts**: Randomized Experiments, Observational Studies, and Confounding Variables

## What is a Dependent Variable?

The dependent variable (DV) is what you want to use the model to explain or predict. The values of this variable *depend* on other variables. It is the outcome that you’re studying. It’s also known as the response variable, outcome variable, and left-hand variable. Statisticians commonly denote them using a Y. Traditionally, graphs place dependent variables on the vertical, or Y, axis.

For example, in the plant growth study example, a measure of plant growth is the dependent variable. That is the outcome of the experiment, and we want to determine what affects it.

## How to Identify Independent and Dependent Variables

If you’re reading a study’s write-up, how do you distinguish independent variables from dependent variables? Here are some tips!

### Identifying IVs

How statisticians discuss independent variables changes depending on the field of study and type of experiment.

In randomized experiments, look for the following descriptions to identify the independent variables:

- Independent variables cause changes in another variable.
- The researchers control the values of the independent variables. They are controlled or manipulated variables.
- Experiments often refer to them as factors or experimental factors. In areas such as medicine, they might be risk factors.
- Treatment and control groups always correspond to an independent variable. In this case, the independent variable is a categorical grouping variable that defines the experimental groups to which participants belong. Each group is a level of that variable.

In observational studies, independent variables are a bit different. While the researchers likely want to establish causation, that’s harder to do with this type of study, so they often won’t use the word “cause.” They also don’t set the values of the predictors. Some independent variables are the study’s focus, while others help keep the results valid.

Here’s how to recognize independent variables in observational studies:

- IVs explain the variability, predict, or correlate with changes in the dependent variable.
- Researchers in observational studies must include confounding variables (i.e., confounders) to keep the statistical results valid even if they are not the primary interest of the study. For example, these might include the participants’ socio-economic status or other background information that the researchers aren’t focused on but can explain some of the dependent variable’s variability.
- The write-up says the results are adjusted for, or controlled for, a variable.

Regardless of the study type, if you see an estimated effect size, it is an independent variable.

### Identifying DVs

Dependent variables are the outcome. The IVs explain the variability in, or cause changes in, the DV. Focus on the “depends” aspect: the value of the dependent variable depends on the IVs. If Y depends on X, then Y is the dependent variable. This rule applies to both randomized experiments and observational studies.

In an observational study about the effects of smoking, the researchers observe the subjects’ smoking status (smoker/non-smoker) and their lung cancer rates. It’s an observational study because they cannot randomly assign subjects to either the smoking or non-smoking group. In this study, the researchers want to know whether lung cancer rates depend on smoking status. Therefore, the lung cancer rate is the dependent variable.

In a randomized COVID-19 vaccine experiment, the researchers randomly assign subjects to the treatment or control group. They want to determine whether COVID-19 infection rates depend on vaccination status. Hence, the infection rate is the DV.

Note that a variable can be an independent variable in one study but a dependent variable in another. It depends on the context.

For example, one study might assess how the amount of exercise (IV) affects health (DV). However, another study might study the factors (IVs) that influence how much someone exercises (DV). The amount of exercise is an independent variable in one study but a dependent variable in the other!

## How Analyses Use IVs and DVs

Regression analysis and ANOVA mathematically describe the relationships between each independent variable and the dependent variable. Typically, you want to determine how changes in one or more predictors associate with changes in the dependent variable. These analyses estimate an effect size for each independent variable.

Suppose researchers study the relationship between wattage, several types of filaments, and the output from a light bulb. In this study, light output is the dependent variable because it depends on the other two variables. Wattage (continuous) and filament type (categorical) are the independent variables.

After performing the regression analysis, the researchers will understand the nature of the relationship between these variables. How much does the light output increase on average for each additional watt? Does the mean light output differ by filament types? They will also learn whether these effects are statistically significant.
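The light bulb study can be sketched as a regression with NumPy. The numbers below are invented for illustration, and the output values are constructed from a known formula so the recovered coefficients are easy to check; a real study would of course have noisy measurements.

```python
import numpy as np

# Made-up design: four wattages, each tested with two filament types
watts = np.array([40.0, 60.0, 75.0, 100.0, 40.0, 60.0, 75.0, 100.0])
filament = ["A", "A", "A", "A", "B", "B", "B", "B"]

# Encode the categorical IV as an indicator variable (1 = filament B)
is_b = np.array([1.0 if f == "B" else 0.0 for f in filament])

# Hypothetical output: 5 lumen baseline + 14 lumens per watt + 30 for filament B
output = 5.0 + 14.0 * watts + 30.0 * is_b

# Design matrix: intercept, wattage (continuous IV), filament indicator (categorical IV)
X = np.column_stack([np.ones_like(watts), watts, is_b])
coefs, *_ = np.linalg.lstsq(X, output, rcond=None)
print(coefs)  # approximately [5.0, 14.0, 30.0]
```

The second coefficient answers “how much does light output increase per additional watt?” and the third answers “how does mean output differ by filament type?”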

**Related post**: When to Use Regression Analysis

## Graphing Independent and Dependent Variables

As I mentioned earlier, graphs traditionally display the independent variables on the horizontal X-axis and the dependent variable on the vertical Y-axis. The type of graph depends on the nature of the variables. Here are a couple of examples.

Suppose you run an experiment to determine whether various teaching methods affect learning outcomes. Teaching method is a categorical predictor that defines the experimental groups. To display this type of data, you can use a boxplot, as shown below.

The groups are along the horizontal axis, while the dependent variable, learning outcomes, is on the vertical. From the graph, method 4 has the best results. A one-way ANOVA will tell you whether these results are statistically significant. Learn more about interpreting boxplots.
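A boxplot like this can be produced with matplotlib. The scores below are simulated for four hypothetical teaching methods; the point is the axis convention, with the categorical IV on the X-axis and the DV on the Y-axis.

```python
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

random.seed(1)
# Simulated learning-outcome scores for each teaching method (categorical IV)
scores = {f"Method {i}": [random.gauss(70 + 3 * i, 5) for _ in range(30)]
          for i in range(1, 5)}

fig, ax = plt.subplots()
boxes = ax.boxplot(list(scores.values()))
ax.set_xticklabels(scores.keys())
ax.set_xlabel("Teaching method (independent variable)")
ax.set_ylabel("Learning outcome (dependent variable)")
fig.savefig("teaching_methods.png")
```

One box per group makes it easy to compare the distribution of the dependent variable across the levels of the independent variable.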

Now, imagine that you are studying people’s height and weight. Specifically, do height increases cause weight to increase? Consequently, height is the independent variable on the horizontal axis, and weight is the dependent variable on the vertical axis. You can use a scatterplot to display this type of data.

It appears that as height increases, weight tends to increase. Regression analysis will tell you if these results are statistically significant. Learn more about interpreting scatterplots.


Yashika says

Hi Jim,

Thanks a lot for creating this excellent blog. This is my go-to resource for Statistics.

I had been pondering over a question for some time, and it would be great if you could shed some light on this.

In linear and non-linear regression, should the distribution of independent and dependent variables be unskewed?

When is there a need to transform the data (say, Box-Cox transformation), and do we transform the independent variables as well?

Thank You

Kelly Anderson says

If I use an independent variable (X) and it displays a low p-value (<.05), why is it that when I introduce another independent variable to the regression, the coefficient and p-value of X from the first regression change to look insignificant? The second variable that I introduced has a low p-value in the regression.

Jim Frost says

Hi Kelly,

Keep in mind that the significance of each IV is calculated after accounting for the variance that all the other variables in the model explain, assuming you’re using the standard adjusted sums of squares rather than sequential sums of squares. The sums of squares (SS) measure how much dependent variable variability each IV accounts for. In the explanation below, I’ll assume you’re using the standard adjusted SS.

So, let’s say that originally you have X1 in the model along with some other IVs. Your model estimates the significance of X1 after assessing the variability that the other IVs account for and finds that X1 is significant. Now, you add X2 to the model in addition to X1 and the other IVs. Now, when assessing X1, the model accounts for the variability of the IVs including the newly added X2. And apparently X2 explains a good portion of the variability. X1 is no longer able to account for that variability, which causes it to not be statistically significant.

In other words, X2 explains some of the variability that X1 previously explained. Because X1 no longer explains it, it is no longer significant.
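You can see this effect in a quick simulation (not your actual data, of course): below, y is really driven by X2, but X1 correlates strongly with X2, so X1 looks highly significant until X2 enters the model.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)   # X2 is highly correlated with X1
y = 1.0 + 0.5 * x2 + 0.5 * rng.normal(size=n)  # y truly depends on X2 only

def t_stats(X, y):
    """OLS coefficients and their t-statistics via the normal equations."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(XtX_inv))
    return beta, beta / se

ones = np.ones(n)
_, t_alone = t_stats(np.column_stack([ones, x1]), y)      # model: y ~ X1
_, t_both = t_stats(np.column_stack([ones, x1, x2]), y)   # model: y ~ X1 + X2

print(f"t for X1 alone: {t_alone[1]:.1f}")   # large: X1 acts as a proxy for X2
print(f"t for X1 with X2: {t_both[1]:.1f}")  # shrinks once X2 absorbs that variability
```

With X1 alone, its t-statistic is large because it soaks up the variability that X2 explains. Add X2, and X1’s t-statistic collapses toward zero.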

Additionally, the significance of IVs is more likely to change when you add or remove IVs that are correlated with each other. This condition of correlated IVs is known as multicollinearity, and it can be a problem when there is too much of it. Given the change in significance, I’d check your model for multicollinearity just to be safe! You can read my post about multicollinearity for more detail.

I hope that helps!
