Introduction
Categorical variables are an important part of research and modeling. They arise anytime we have observations that fall into discrete groups, rather than on a continuous scale.
Some everyday examples include:
- Marriage status (Not Married, Married)
- Transportation choices (Car, Subway, Bus, Other)
- Performance ratings (Poor, Fair, Average, Good, Excellent)
In today’s blog, we look more closely at what categorical variables are and how these variables are treated in estimation.
What is a categorical variable?
A categorical variable is a discrete variable that captures qualitative outcomes by placing observations into fixed groups (or levels). The groups are mutually exclusive, which means that each individual fits into only one category.
Types of categorical variables
Categorical variables can be used to represent different types of qualitative data. For example:
- Ordinal data - represents outcomes for which the order of the groups is relevant.
- Nominal data - represents outcomes for which the order of the groups does not matter.
- Binary data - data with only two possible outcomes.
Who uses categorical variables?
Categorical variables are used widely across fields:
Example | Field | Type |
---|---|---|
Income range | Economics, Sociology | Ordinal |
Blood pressure | Epidemiology | Ordinal |
Commute method | Transportation modeling | Nominal |
Marriage status | Economics, Sociology | Binary |
How are categorical variables treated in estimation?
How we treat categorical variables in estimation depends on whether they are used as dependent or independent variables. In either case, categorical variables require special treatment before being used in estimation if we want to capture the impacts we are interested in. We cannot use them the same way we use continuous variables; they must be recoded.
Categorical data as dependent variables
When the dependent variable is categorical, we use a special branch of estimation models called discrete choice models. Some examples of discrete choice models include:
Model | Potential applications |
---|---|
Logit model | Estimating what factors impact election results. |
Ordered probit model | Modeling treatment outcomes for patients. |
Conditional logit model | Modeling occupational choice. |
Nested logit | Modeling travel mode selection. |
Discrete choice modeling is an important and broad field. It won't be the focus of our blog today.
Categorical data as independent variables
Dummy variables
The most common method for including categorical data in regressions is to create dummy variables for each possible category. When using this method:
- One reference category must be excluded to avoid perfect multicollinearity.
- The estimated impact of each level on the dependent variable is measured relative to the reference level.
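A short derivation may help show why one category has to be dropped. The notation here is introduced only for illustration: suppose there are $K$ categories and $D_{ik}$ equals 1 when observation $i$ falls in category $k$ and 0 otherwise. Because the categories are mutually exclusive, every observation has exactly one dummy equal to 1:

$$ D_{i1} + D_{i2} + \dots + D_{iK} = 1 \quad \text{for every observation } i $$

The right-hand side is the same column of ones that represents the intercept, so a model with an intercept and all $K$ dummies has perfectly collinear regressors. Treating category $K$ as the reference and dropping its dummy gives a model that can be estimated:

$$ y_i = \beta_0 + \beta_1 * D_{i1} + \dots + \beta_{K-1} * D_{i,K-1} + \epsilon_i $$

Under this setup, $\beta_0$ is the expected outcome for the reference category and each $\beta_k$ is the difference between category $k$ and the reference.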
To generate dummy variables we:
- Create a new variable for each possible category.
- Assign a 1 to the category variable if an observation falls in that category and a 0 otherwise.
For example, consider data recording the region an individual lives in. The possible categories are:
- Northwest
- Southwest
- Midwest
- South
- Northeast
- Southeast
The first six observations are:
ID | Region |
---|---|
1 | Northwest |
2 | Southwest |
3 | Midwest |
4 | Midwest |
5 | Northeast |
6 | South |
After adding dummy variables to represent the categories, the first six observations are:
ID | Region | Northwest | Southwest | Midwest | South | Northeast | Southeast |
---|---|---|---|---|---|---|---|
1 | Northwest | 1 | 0 | 0 | 0 | 0 | 0 |
2 | Southwest | 0 | 1 | 0 | 0 | 0 | 0 |
3 | Midwest | 0 | 0 | 1 | 0 | 0 | 0 |
4 | Midwest | 0 | 0 | 1 | 0 | 0 | 0 |
5 | Northeast | 0 | 0 | 0 | 0 | 1 | 0 |
6 | South | 0 | 0 | 0 | 1 | 0 | 0 |
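The recoding in the table can be sketched in a few lines. The example below is written in plain Python rather than GAUSS, purely as a language-neutral illustration; the function name `dummy_code` and the choice of Northwest as the reference category are assumptions made for this sketch, not part of the original example.

```python
# The six observations from the example table, in ID order.
regions = ["Northwest", "Southwest", "Midwest", "Midwest", "Northeast", "South"]

# All possible categories, in the order used in the table.
levels = ["Northwest", "Southwest", "Midwest", "South", "Northeast", "Southeast"]


def dummy_code(values, levels, reference=None):
    """Return (column_names, rows) of 0/1 indicators (hypothetical helper).

    If `reference` is given, that level's column is omitted, so a row of
    all zeros identifies the reference category.
    """
    columns = [lev for lev in levels if lev != reference]
    rows = [[1 if value == col else 0 for col in columns] for value in values]
    return columns, rows


# Full set of dummies: one column per category, as in the table.
full_cols, full_rows = dummy_code(regions, levels)

# Regression-ready set: Northwest is (arbitrarily) chosen as the reference.
reg_cols, reg_rows = dummy_code(regions, levels, reference="Northwest")

for row in full_rows:
    print(row)
```

In the full set every row sums to 1, which is exactly the collinearity with the intercept that dropping a reference category avoids. In the regression-ready set, the Northwest observation becomes a row of zeros.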
Other coding methods
Since dummy variable coding is the most common coding method, we won't spend time exploring other methods. However, it is worth noting that there are many other methods, and the coding method used has a direct impact on how we interpret our results.
Example | Description |
---|---|
Sum coding | Used to compare the mean of the dependent variable for a given level to the mean of the dependent variable across all levels. |
Helmert coding | Used to compare each level of an ordinal categorical variable to the mean of the subsequent levels of the category. |
Difference coding | Used to compare each level of an ordinal categorical variable to the mean of the previous levels of the category. |
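As a rough illustration of how an alternative scheme changes the design columns, here is a minimal Python sketch of sum coding. It follows the usual convention in which the last level is coded as -1 in every column; the helper name `sum_code`, the choice of omitted level, and the sample observations are all assumptions made for this example.

```python
def sum_code(values, levels):
    """Sum (deviation) coding: K levels become K-1 columns (hypothetical helper).

    Each of the first K-1 levels gets a 1 in its own column and 0 elsewhere;
    the last level gets -1 in every column.
    """
    *coded_levels, omitted = levels
    rows = []
    for value in values:
        if value not in levels:
            raise ValueError(f"unknown level: {value!r}")
        if value == omitted:
            rows.append([-1] * len(coded_levels))
        else:
            rows.append([1 if value == lev else 0 for lev in coded_levels])
    return coded_levels, rows


levels = ["Car", "Subway", "Bus", "Other"]
observed = ["Car", "Bus", "Other", "Subway"]  # hypothetical observations

cols, rows = sum_code(observed, levels)
for value, row in zip(observed, rows):
    print(value, row)
```

Compared with dummy coding, the only change is the row of -1 values for the omitted level, yet it shifts the interpretation: coefficients are read against an overall mean rather than against a reference category.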
How are category parameters interpreted in estimation?
When interpreting dummy variable coefficients:
- The parameters estimate the effects of a category relative to the reference (base case) category.
- If the coefficient of a dummy variable is statistically significant, then the difference in impact between the corresponding level and the reference group is statistically significant.
For example, suppose we want to model MPG for a vehicle using weight and whether the car is foreign or domestic:
$$ MPG = \beta_0 + \beta_1 * weight + \beta_2 * foreign $$
The coefficient $\beta_2$ tells us, after accounting for weight, how much higher or lower MPG is when a car is foreign than when it is domestic.
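To make the interpretation concrete, assume $foreign$ is coded 1 for foreign cars and 0 for domestic cars, so that domestic is the reference category. Holding weight fixed, the expected MPG for each group is:

$$ E[MPG | weight, foreign = 1] = \beta_0 + \beta_1 * weight + \beta_2 $$

$$ E[MPG | weight, foreign = 0] = \beta_0 + \beta_1 * weight $$

Subtracting the second line from the first leaves only the coefficient on the dummy variable:

$$ E[MPG | weight, foreign = 1] - E[MPG | weight, foreign = 0] = \beta_2 $$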
Example: Using categorical variables in ordinary least squares
Loading the data
Let's estimate our linear regression MPG model from earlier. We will start by loading the auto2.dta dataset from the GAUSS example directory. When loading data for this model we:
- Load the MPG, Weight, and Foreign variables.
- Specify that Foreign is a categorical variable.
The code for this action is auto-generated:
auto2 = loadd("C:/gauss21/examples/auto2.dta", "mpg + weight + cat(foreign)");
The cat keyword indicates to GAUSS that foreign is a categorical variable.
Running the regression
Next, we will call olsmt to estimate our model. Using our categorical variable with olsmt is easy and requires no extra steps:
call olsmt(auto2, "mpg ~ weight + foreign");
The results are printed:
                            Standard               Prob      Std.  Cor with
Variable          Estimate     Error   t-value     >|t|      Est.   Dep Var
---------------------------------------------------------------------------
CONSTANT             41.68     2.166     19.25     0.00       ---       ---
weight            -0.00659  0.000637    -10.34     0.00   -0.8860    -0.807
foreign: Foreign    -1.650     1.076    -1.534     0.13   -0.1313     0.393
Interpreting our results
There are a few notable components to our linear regression results:
- The estimated coefficient on the Foreign level is -1.650. This tells us that, after accounting for weight, the estimated MPG of foreign cars is 1.65 lower than that of domestic cars.
- Our p-value of 13% tells us that this difference is not statistically significant at conventional levels (such as 5% or 10%).
- GAUSS automatically identifies the categories and labels them appropriately in our results table. The variable name foreign: Foreign tells us that the coefficient in the table is for the category Foreign of the variable foreign.
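The arithmetic behind this interpretation can be checked with a small Python sketch that plugs the rounded estimates from the results table into the fitted equation. The weight of 3000 is a hypothetical value chosen only for illustration, and `predict_mpg` is a helper name introduced for this example.

```python
# Rounded estimates copied from the results table.
b_const = 41.68
b_weight = -0.00659
b_foreign = -1.650
se_foreign = 1.076


def predict_mpg(weight, foreign):
    """Fitted MPG; `foreign` is the 0/1 dummy (1 = Foreign, 0 = Domestic)."""
    return b_const + b_weight * weight + b_foreign * foreign


weight = 3000  # hypothetical weight, chosen only for illustration

domestic = predict_mpg(weight, 0)
foreign = predict_mpg(weight, 1)
gap = foreign - domestic          # equals the dummy coefficient
t_stat = b_foreign / se_foreign   # estimate divided by its standard error

print(round(domestic, 2), round(foreign, 2), round(gap, 2))
print(round(t_stat, 2))
```

At this hypothetical weight, the fitted values are about 21.91 MPG for a domestic car and 20.26 MPG for a foreign car. The gap between them is the dummy coefficient, -1.65, whatever weight is chosen. The ratio of the estimate to its standard error agrees with the reported t-value up to rounding of the table entries.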
Conclusions
Categorical variables have an important role in modeling, as they offer a quantitative way to include qualitative outcomes in our models. However, it is important to know how to appropriately use them and how to appropriately interpret models that include them.
After today's blog, you should have the foundation to begin working with categorical variables and a better knowledge of:
- What categorical variables are.
- How to include categorical variables in models.
- How to interpret results when categorical variables are used in linear regression.
Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.