An indicator variable is a binary variable that identifies whether an observation belongs to a specific category. It takes the value 1 if the condition is true and 0 otherwise. Indicator variables are essential for including categorical predictors in models that require numeric inputs, such as linear regression. Statisticians also refer to them less formally as dummy variables.
For a categorical variable with n distinct levels, only n–1 indicator variables are created for use in a regression model that includes an intercept. One category is designated as the reference (or baseline) level. The remaining categories each receive their own indicator variable, allowing the model to estimate their effects relative to the baseline. Leaving the baseline out of the model prevents perfect collinearity with the intercept (the full set of n indicators always sums to 1 in every row) while still capturing the full set of group comparisons.
Any level can serve as the baseline. Choose the one that makes the most sense for your research question or comparison of interest.
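The collinearity point can be checked by hand. The sketch below is a minimal illustration in plain Python, using a small made-up sample of weather values (the sample and variable names are assumptions, not data from this entry). It shows that the full set of n indicators sums to the intercept column in every row, and that this no longer holds once the baseline is dropped.

```python
# Hypothetical sample of the three weather levels, for illustration only.
weather = ["Sunny", "Cloudy", "Rainy", "Sunny", "Rainy"]
levels = ["Sunny", "Cloudy", "Rainy"]

# One indicator column per level (the full set of n = 3 indicators).
full = {lvl: [1 if w == lvl else 0 for w in weather] for lvl in levels}

# The intercept column of a regression is a column of ones.
intercept = [1] * len(weather)

# Summing all n indicators row by row reproduces the intercept column exactly.
# That exact linear dependence is the perfect collinearity described above.
row_sums = [sum(full[lvl][i] for lvl in levels) for i in range(len(weather))]
print(row_sums == intercept)  # True

# With the baseline (Rainy here) dropped, the remaining indicators
# no longer sum to the intercept column.
reduced_sums = [full["Sunny"][i] + full["Cloudy"][i] for i in range(len(weather))]
print(reduced_sums == intercept)  # False
```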
Imagine a variable representing weather type with three levels: Sunny, Cloudy, and Rainy. The indicator variable coding with two variables looks like the following:
| Weather Type | Sunny Indicator | Cloudy Indicator |
|---|---|---|
| Sunny | 1 | 0 |
| Cloudy | 0 | 1 |
| Rainy | 0 | 0 |
In this example, Rainy serves as the baseline level. Only Sunny and Cloudy receive indicator variables. When used in regression, the model’s intercept captures the expected value for rainy days (with any other predictors set to zero), and the coefficient on each indicator shows how that condition differs from the baseline. Most statistical software automatically performs this indicator variable encoding when you include a categorical variable in a regression model.
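For readers who want to see the encoding done by hand, here is a minimal sketch in plain Python. The helper `encode_indicators` is a hypothetical function written for this illustration, not part of any library. It builds one 0/1 column for each non-baseline level.

```python
def encode_indicators(values, levels, baseline):
    """Return one 0/1 column per non-baseline level (n - 1 columns in total)."""
    if baseline not in levels:
        raise ValueError(f"baseline {baseline!r} is not one of the levels")
    # Keep every level except the baseline, preserving the given order.
    kept = [lvl for lvl in levels if lvl != baseline]
    return {f"{lvl} Indicator": [1 if v == lvl else 0 for v in values] for lvl in kept}


levels = ["Sunny", "Cloudy", "Rainy"]
encoded = encode_indicators(["Sunny", "Cloudy", "Rainy"], levels, baseline="Rainy")
for name, column in encoded.items():
    print(name, column)
# Sunny Indicator [1, 0, 0]
# Cloudy Indicator [0, 1, 0]
```

Passing a different `baseline` re-expresses the same comparisons relative to that level. The Rainy row is all zeros in both columns, matching the coding described above.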
Suppose the coefficient for the Cloudy indicator is –3.50 in a model predicting daily coffee sales in dollars, with Rainy as the baseline. This means that expected sales on cloudy days are $3.50 lower than on rainy days, holding all other variables constant.
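The interpretation can be written out as a short derivation. The symbols below are notation assumed for this sketch: β₀ is the intercept, β_S and β_C are the coefficients on the Sunny and Cloudy indicators, and any other predictors are omitted for simplicity.

```latex
\begin{aligned}
\widehat{\text{sales}} &= \beta_0 + \beta_S\,x_{\text{Sunny}} + \beta_C\,x_{\text{Cloudy}} \\
\widehat{\text{sales}}_{\text{Rainy}} &= \beta_0 \\
\widehat{\text{sales}}_{\text{Cloudy}} &= \beta_0 + \beta_C = \beta_0 - 3.50 \\
\widehat{\text{sales}}_{\text{Cloudy}} - \widehat{\text{sales}}_{\text{Rainy}} &= \beta_C = -3.50
\end{aligned}
```

On a rainy day both indicators are 0, so only the intercept remains. On a cloudy day the Cloudy indicator is 1, so the prediction shifts by exactly the Cloudy coefficient.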