A dummy variable is a binary variable indicating the presence or absence of a condition. It takes the value 1 if the observation belongs to a specific category and 0 otherwise. Dummy variables allow you to include categorical variables in regression models by translating qualitative groupings into a numeric format. Statisticians also refer to them more formally as indicator variables.
To include a categorical variable with n levels in a regression model, you create n–1 dummy variables. Each represents one of the categories except for the baseline level, which is absorbed into the model intercept. Leaving the baseline out of the model prevents perfect collinearity while still capturing the full set of group comparisons.
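The collinearity problem can be stated in one line. As a sketch, write $D_{ij}$ for the dummy that equals 1 when observation $i$ belongs to level $j$ (this notation is introduced here for illustration and is not part of the original entry). If all n dummies were included, they would sum to 1 for every observation:

```latex
\sum_{j=1}^{n} D_{ij} = 1 \quad \text{for every observation } i
```

That sum reproduces the constant column the intercept already supplies, so the model could not separate the intercept from the dummy coefficients. Dropping one dummy removes the redundancy.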
Any level can serve as the baseline. Choose the one that makes the most sense for your research question or comparison of interest.
For example, suppose you have a categorical variable for job type with three levels: Manager, Technician, and Clerk. The dummy coding with two variables looks like the following:
| Job Type | Manager Dummy | Technician Dummy |
|---|---|---|
| Manager | 1 | 0 |
| Technician | 0 | 1 |
| Clerk | 0 | 0 |
Here, Clerk is the baseline category. It is not assigned its own dummy variable because its effect is captured by the model intercept. In a regression, the coefficient on each indicator estimates how the mean outcome for that category differs from the mean for the baseline category. Most statistical software performs this indicator variable encoding automatically when you include a categorical variable in a regression model.
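The coding scheme in the table can be sketched in a few lines of plain Python. The function name `dummy_code` and the sample list `job_types` are hypothetical, chosen only for this illustration; statistical packages provide their own encoders.

```python
def dummy_code(values, baseline):
    """Return (levels, rows): one 0/1 column per level other than the baseline."""
    # Keep levels in first-seen order and skip the baseline level.
    levels = [lvl for lvl in dict.fromkeys(values) if lvl != baseline]
    # Each row holds a 1 in the column matching the observation's level, else 0.
    rows = [[1 if v == lvl else 0 for lvl in levels] for v in values]
    return levels, rows


# Hypothetical observations for illustration.
job_types = ["Manager", "Technician", "Clerk", "Manager"]
levels, rows = dummy_code(job_types, baseline="Clerk")

for job, row in zip(job_types, rows):
    print(job, dict(zip(levels, row)))
```

A Clerk row comes out as all zeros, matching the last row of the table, because the baseline has no column of its own.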
Suppose a regression model predicting salary includes these dummy variables, and the Manager dummy has a coefficient of 5,000. This result indicates that managers earn, on average, $5,000 more than clerks, holding all other variables constant.
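To make the interpretation concrete, here is a minimal sketch for the simplest case, where the dummies are the only predictors. In that case the least-squares intercept equals the baseline group's mean, and each dummy coefficient equals that group's mean minus the baseline mean. The salary figures below are invented for illustration and are not from any real data set.

```python
from statistics import mean

# Hypothetical salaries, invented for illustration only.
salaries = {
    "Clerk": [40000, 42000],
    "Technician": [43000, 45000],
    "Manager": [45000, 47000],
}

baseline = "Clerk"

# With dummies as the only predictors, the intercept is the baseline mean.
intercept = mean(salaries[baseline])

# Each dummy coefficient is that group's mean minus the baseline mean.
coefficients = {
    job: mean(values) - intercept
    for job, values in salaries.items()
    if job != baseline
}

print("Intercept (Clerk mean):", intercept)
for job, coef in coefficients.items():
    print(f"{job} dummy coefficient:", coef)
```

When the model includes other predictors, the coefficients are no longer simple differences in raw group means. They become differences adjusted for those predictors, which is what "holding all other variables constant" refers to.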