Introduction to Categorical Variables

by Eric · Published March 9, 2021

Introduction

Categorical variables are an important part of research and modeling. They arise anytime we have observations that fall into discrete groups, rather than on a continuous scale.

Some everyday examples include:

Marriage status (Not Married, Married)
Transportation choices (Car, Subway, Bus, Other)
Performance ratings (Poor, Fair, Average, Good, Excellent)

In today’s blog, we look more closely at what categorical variables are and how these variables are treated in estimation.

What is a categorical variable?

A categorical variable is a discrete variable that captures qualitative outcomes by placing observations into fixed groups (or levels). The groups are mutually exclusive, which means that each individual fits into only one category.

Types of categorical variables

Categorical variables can be used to represent different types of qualitative data. For example:

Ordinal data - represents outcomes for which the order of the groups is relevant.
Nominal data - represents outcomes for which the order of the groups does not matter.
Binary data - data with only two possible outcomes.

Who uses categorical variables?

Categorical variables are used widely across fields.
How are categorical variables treated in estimation?

How we treat categorical variables in estimation depends on whether the data is used as a dependent or an independent variable. In either case, categorical variables require special treatment before they can be used in estimation. We cannot use them the same way we use continuous variables; they must be recoded first.

Categorical data as dependent variables

When the dependent variable is categorical, we use a special branch of estimation models called discrete choice models. Some examples of discrete choice models include:

Model	Potential applications
Logit model	Estimating what factors impact election results.
Ordered probit model	Modeling treatment outcomes for patients.
Conditional logit model	Modeling occupational choice.
Nested logit	Modeling travel mode selection.

Discrete choice modeling is an important and broad field. It won't be the focus of our blog today.

Categorical data as independent variables

Dummy variables

The most common method for including categorical data in regressions is to create dummy variables for each possible category. When using this method:

One reference category must be excluded to avoid perfect multicollinearity.
The impact of each level on the dependent variable is measured relative to the reference level.
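
To see why one category has to be dropped, here is a short sketch in notation that is ours rather than the post's: assume $K$ categories and let $D_{ik}$ be the dummy for observation $i$ and category $k$. Because each observation falls into exactly one category, the dummies for any observation sum to one:

$D_{i1} + D_{i2} + \dots + D_{iK} = 1$

The right-hand side is exactly the constant (intercept) column, so a model that includes both a constant and all $K$ dummies has linearly dependent regressors. Dropping one dummy breaks that dependence, and the dropped category becomes the reference.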

To generate dummy variables we:

Create a new variable for each possible category other than the reference category.
Assign a 1 to the category variable if an observation falls in that category and a 0 otherwise.

For example, consider data recording the region an individual lives in. The possible categories are:

Northwest
Southwest
Midwest
South
Northeast
Southeast

The first six observations are:

ID	Region
1	Northwest
2	Southwest
3	Midwest
4	Midwest
5	Northeast
6	South

After adding dummy variables to represent the categories, with Southeast left out as the reference category, the first six observations are:

ID	Region	Northwest	Southwest	Midwest	South	Northeast
1	Northwest	1	0	0	0	0
2	Southwest	0	1	0	0	0
3	Midwest	0	0	1	0	0
4	Midwest	0	0	1	0	0
5	Northeast	0	0	0	0	1
6	South	0	0	0	1	0

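This type of coding is sometimes referred to as treatment coding or, more loosely, one-hot encoding. Strictly speaking, one-hot encoding keeps a column for every category, while treatment coding drops the reference category.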
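
The dummy columns in the table above could be built by hand along the following lines. This is only a minimal sketch: the numeric region codes and the variable names `region`, `keep_codes`, and `dummies` are assumptions made for illustration, not part of the original data.

```gauss
// Hypothetical numeric codes for the six observations in the table:
// 1 = Northwest, 2 = Southwest, 3 = Midwest,
// 4 = South, 5 = Northeast, 6 = Southeast
region = { 1, 2, 3, 3, 5, 4 };

// Codes that get their own dummy column.
// Southeast (6) is left out so it acts as the reference category.
keep_codes = { 1 2 3 4 5 };

// Comparing a 6x1 column with a 1x5 row element-by-element
// expands to a 6x5 matrix of 0's and 1's
dummies = region .== keep_codes;

print dummies;
```

Each row of `dummies` should contain a single 1 in the column of that observation's region, and a row of all zeros would indicate the reference category.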

Other coding methods

Since dummy variable coding is the most common coding method, we won't spend time exploring other methods. However, it is worth noting that many other methods exist, and the coding method used directly affects how we interpret our results.

Coding method	Description
Sum coding	Used to compare the mean of the dependent variable for a given level to the mean of the dependent variable across all levels.
Helmert coding	Used to compare each level of an ordinal categorical variable to the mean of the subsequent levels of the category.
Difference coding	Used to compare each level of an ordinal categorical variable to the mean of the previous levels of the category.

How are category parameters interpreted in estimation?

When interpreting dummy variable coefficients:

The parameters estimate the effects of a category relative to the reference (base case) category.
If the coefficient of a dummy variable is statistically significant, then the difference in impact between the corresponding level and the reference group is statistically significant.

For example, suppose we want to model MPG for a vehicle using weight and whether the car is foreign or domestic:

$MPG = \beta_0 + \beta_1 * weight + \beta_2 * foreign$

The coefficient $\beta_2$ tells us, after accounting for weight, how much higher or lower MPG is when a car is foreign than when it is domestic.
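
A short derivation makes the role of the dummy coefficient explicit. It assumes that foreign is coded 1 for foreign cars and 0 for domestic cars, and that the error term has mean zero given the regressors. For a foreign car:

$E[MPG \mid weight, foreign = 1] = \beta_0 + \beta_1 * weight + \beta_2$

For a domestic car of the same weight:

$E[MPG \mid weight, foreign = 0] = \beta_0 + \beta_1 * weight$

Subtracting the second line from the first leaves only $\beta_2$, which is why the coefficient on the dummy measures the foreign-versus-domestic gap at any given weight.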

Example: Using categorical variables in ordinary least squares

Loading the data

Let's estimate our linear regression MPG model from earlier. We will start by loading the auto2.dta dataset from the GAUSS example directory. When loading data for this model we:

Load the MPG, Weight, and Foreign variables.
Specify that Foreign is a categorical variable.

The code for this action is auto-generated:


auto2 = loadd("C:/gauss21/examples/auto2.dta", "mpg + weight + cat(foreign)");

The cat keyword indicates to GAUSS that foreign is a categorical variable.

Running the regression

Next, we will call olsmt to estimate our model. Using our categorical variable with olsmt is easy and requires no extra steps:


call olsmt(auto2, "mpg ~ weight + foreign");

The results are printed:

                            Standard             Prob      Std.    Cor with
Variable          Estimate    Error    t-value   >|t|      Est.    Dep Var
---------------------------------------------------------------------------

CONSTANT             41.68     2.166     19.25    0.00     ---       ---
weight            -0.00659  0.000637    -10.34    0.00    -0.8860   -0.807
foreign: Foreign    -1.650     1.076    -1.534    0.13    -0.1313    0.393

Interpreting our results

There are a few notable components to our linear regression results:

The estimated coefficient on the Foreign level is -1.650. This tells us that, after accounting for weight, foreign cars are estimated to have an MPG 1.65 lower than domestic cars.
Our p-value of 0.13 tells us that this difference is not statistically significant at conventional levels (such as 5% or 10%).
GAUSS automatically identifies the categories and labels them appropriately in our results table. The variable name foreign: Foreign tells us that the coefficient in the table is for the category Foreign of the variable foreign.
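
As a worked illustration of these estimates, consider a hypothetical car with a weight of 3,000 (the weight value is an assumption chosen for the example, and the coefficients are the rounded values from the table). For a domestic car:

$\widehat{MPG} = 41.68 - 0.00659 * 3000 = 41.68 - 19.77 = 21.91$

For a foreign car of the same weight:

$\widehat{MPG} = 41.68 - 0.00659 * 3000 - 1.650 = 20.26$

The two predictions differ by exactly the dummy coefficient, 1.65, regardless of the weight chosen.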

Conclusions

Categorical variables have an important role in modeling, as they offer a quantitative way to include qualitative outcomes in our models. However, it is important to know how to appropriately use them and how to appropriately interpret models that include them.

After today's blog, you should have the foundation to begin working with categorical variables and a better knowledge of:

What categorical variables are.
How to include categorical variables in models.
How to interpret results when categorical variables are used in linear regression.

Eric (Director of Applications and Training at Aptech Systems, Inc.)

Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.