Introduction
Dummy variables are a common econometric tool, whether working with time series, cross-sectional, or panel data. Unfortunately, raw datasets rarely come formatted with dummy variables that are regression ready.
In today's blog, we explore several options for creating dummy variables from categorical data in GAUSS, including:
- Creating dummy variables from a file using formula strings.
- Creating dummy variables from an existing vector of categorical data.
- Creating dummy variables from an existing vector of continuous variables.
Creating Dummy Variables from a File
Dummy variables can be conveniently created from files at the time of loading data or calling procedures using formula string notation. Formula string notation is a powerful GAUSS tool that allows you to represent a model or collection of variables in a compact and intuitive manner, using the variable names in the dataset.
The factor Keyword
The factor keyword is used in formula strings to:
- Specify that a variable contains numeric categorical data.
- Create dummy variables (which are not present in the raw data) while loading data from a dataset.
- Include dummy variables in estimation functions such as olsmt, glm, or gmmFit.
Let's consider the model
$$mpg = \alpha + \beta_1 weight + \beta_2 rep78$$
We will use ordinary least squares to estimate this model with data from the auto2.dta file, which can be found in the GAUSSHOME/examples directory.
The variable rep78 is a categorical, 5-point variable that measures a car's repair record in 1978. To estimate the effects of the repair record on mpg, we can include dummy variables representing the different categories.
// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/auto2.dta";
// Perform OLS estimation, creating dummy variables from 'rep78'
call olsmt(fname, "mpg ~ weight + factor(rep78)");
The printed output table includes coefficients for rep78=fair, average, good, excellent. Note that rep78=poor is automatically excluded from the regression as the base level.
                                   Standard               Prob  Standardized   Cor with
Variable              Estimate        Error    t-value    >|t|      Estimate    Dep Var
---------------------------------------------------------------------------------------
CONSTANT               38.0594      3.09336    12.3036   0.000           ---        ---
weight             -0.00550304  0.000601001   -9.15645   0.000     -0.743741   -0.80552
rep78: Fair          -0.478604      2.76503  -0.173092   0.863    -0.0263109  -0.134619
rep78: Average       -0.471562      2.55314  -0.184699   0.854    -0.0401403  -0.279593
rep78: Good          -0.599032       2.6066  -0.229814   0.819    -0.0451669  0.0384391
rep78: Excellent       2.08628      2.72482   0.765657   0.447      0.131139   0.454192
The cat Keyword
Some common file types, such as XLS and CSV, do not have a robust method of determining the variable types. In these cases, the cat keyword is used to:
- Denote a variable in a file as categorical text data.
- Instruct GAUSS to reclassify the string data to integer categories.
The cat keyword can be combined with the factor keyword to instruct GAUSS to load a column as string data, reclassify it to integers, and then create dummy variables:
// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/yarn.xlsx";
// Reclassify 'load' variable from 'high, low, med'
// to '0, 1, 2', then create dummy variables from
// integer categories and create OLS estimates
call olsmt(fname, "cycles ~ factor(cat(load))");
Using factor(cat(load)) in the formula string tells GAUSS to create dummy variables representing the different categories of the load variable. This is seen in the printed output table, which now includes coefficients for load=low, med. Note that load=high is automatically excluded from the regression as the base level.
                              Standard               Prob  Standardized   Cor with
Variable         Estimate        Error    t-value    >|t|      Estimate    Dep Var
----------------------------------------------------------------------------------
CONSTANT       534.444444   292.474662   1.827319   0.080           ---        ---
load: low      621.555556   413.621634   1.502715   0.146      0.338504   0.240716
load: med      359.111111   413.621634   0.868212   0.394      0.195575   0.026323
Creating Dummy Variables Using loadd
In our previous two examples, we used the factor and cat keywords directly in calls to estimation procedures. However, we can also use these keywords when loading data to create dummy variables in our data matrices.
For example, let's load the dummy variables associated with the rep78 variable:
// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/auto2.dta";
// Load the data, creating dummy variables from 'rep78'
reg_data = loadd(fname, "mpg + weight + factor(rep78)");
The reg_data matrix is a 74 x 6 matrix. It contains the mpg and weight data, as well as 4 columns of dummy variables for rep78=fair, average, good, excellent.
The first five rows look like this:
mpg   weight   rep78:fair   rep78:avg   rep78:good   rep78:exc
 22     2930            0           1            0           0
 17     3350            0           1            0           0
 22     2640            .           .            .           .
 20     3250            0           1            0           0
 15     4080            0           0            1           0
Note that, again, rep78=poor is automatically excluded as the base level.
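As a rough sketch of how the loaded matrix might then be used, the dummy columns in reg_data can be passed to olsmt in matrix form. This assumes the column order described above (mpg first, weight second, dummy variables last) and uses packr to drop the rows where rep78 is missing. The names y and x are illustrative only.

```gauss
// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/auto2.dta";

// Load 'mpg', 'weight' and the dummy variables created from 'rep78'
reg_data = loadd(fname, "mpg + weight + factor(rep78)");

// Remove rows with missing values (cars with no repair record)
reg_data = packr(reg_data);

// Dependent variable: 'mpg' (assumed to be the first column)
y = reg_data[., 1];

// Independent variables: 'weight' plus the dummy columns
x = reg_data[., 2:cols(reg_data)];

// Estimate the model using OLS
call olsmt("", y, x);
```

If the columns are ordered as assumed, the estimates should agree with the formula string regression shown earlier, although the variable labels in the printed table may differ.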
Creating Dummy Variables from a Categorical Vector
In the previous section, we looked at creating dummy variables at the time of loading data or running procedures. In this section, we consider how to create dummy variables from an existing GAUSS vector.
The GAUSS design procedure provides a convenient method for creating dummy variables from a vector of discrete categories.
Let's load the data from the auto2.dta dataset used in our earlier regression example. This time we won't load rep78 using factor:
// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/auto2.dta";
// Load auto data for regression
reg_data = loadd(fname, "mpg + weight + rep78");
// Remove missing values
reg_data = packr(reg_data);
The first five rows of reg_data look like this:
22   2930   3
17   3350   3
20   3250   3
15   4080   4
18   3670   3
Our third column now contains discrete, categorical data with values ranging from 1-5, which represent poor, fair, average, good, and excellent.
The auto2.dta file specifies the preferred order for the string categories.
// Compute the unique values found
// in the third column of 'reg_data'
print unique(reg_data[., 3]);
1
2
3
4
5
The design procedure creates a matrix with a column of indicator variables for each positive integer in the input. For example:
cats = { 1, 2, 1, 3 };
print design(cats);
will return:
1 0 0
0 1 0
1 0 0
0 0 1
Therefore, if we pass the third column of reg_data to design, we will get a matrix with a column for all five categories. However, we want to drop the base case column for our regression:
$$mpg = \alpha + \beta_1 weight + \beta_2 rep78_{fair} + \beta_3 rep78_{avg} + \beta_4 rep78_{good} + \beta_5 rep78_{excl}$$
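To do this, we shift the range of the categorical data from 1-5 to 0-4 by subtracting 1. Because design only creates columns for positive integers, observations in the base category (now 0) receive a zero in every column.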
// Create dummy variables. Subtract one
// to remove the base case.
dummy_vars = design(reg_data[., 3] - 1);
This creates a 69x4 matrix, dummy_vars, which contains dummy variables representing the final four levels of rep78.
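The subtract-one trick always drops the lowest category. A possible alternative, sketched here on a small made-up vector rather than the auto data, is to build the full design matrix and then remove whichever column should serve as the base case. The variable names cats, full, dv_shift, and dv_drop are illustrative only, and the comparison relies on design treating a 0 as described above.

```gauss
// Hypothetical categorical vector with levels 1, 2 and 3
cats = { 1, 2, 1, 3 };

// Full set of indicator columns, one per level
full = design(cats);

// Approach from the text: shift the levels so that the
// first level becomes 0 and gets no indicator column
dv_shift = design(cats - 1);

// Alternative: keep every column except the first
dv_drop = full[., 2:cols(full)];

print "Shifted levels:";
print dv_shift;
print "First column dropped:";
print dv_drop;

// Prints 1 if the two matrices are identical
print (dv_shift == dv_drop);
```

Indexing a different set of columns, for example full[., 1 3], would make the second level the base case instead.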
Now we can estimate our model as shown below.
// Select the 'mpg' data as the dependent variable
y = reg_data[., 1];
// Independent variables:
// 'weight' is in the second column of 'reg_data'.
// 'rep78'= Fair, Average, Good and Excellent
// are represented by the 4 columns
// of 'dummy_vars'.
x = reg_data[., 2]~dummy_vars;
// Estimate model using OLS
call olsmt("", y, x);
Our printed results are the same as earlier, except our table no longer includes variable names:
                            Standard               Prob  Standardized   Cor with
Variable       Estimate        Error    t-value    >|t|      Estimate    Dep Var
--------------------------------------------------------------------------------
CONSTANT      38.059415     3.093361  12.303578   0.000           ---        ---
X1            -0.005503     0.000601  -9.156447   0.000     -0.743741  -0.805520
X2            -0.478604     2.765035  -0.173092   0.863     -0.026311  -0.134619
X3            -0.471562     2.553145  -0.184699   0.854     -0.040140  -0.279593
X4            -0.599032     2.606599  -0.229814   0.819     -0.045167   0.038439
X5             2.086276     2.724817   0.765657   0.447      0.131139   0.454192
Creating Dummy Variables from Continuous Variables
The design procedure works well when our data already contains categorical data. However, there may be cases when we want to create dummy variables based on ranges of continuous data. The GAUSS dummybr, dummydn, and dummy procedures can be used to achieve this.
Consider a simple example:
x = { 1.53,
8.41,
3.81,
6.34,
0.03 };
// Breakpoints
v = { 1, 5, 7 };
All three procedures create a set of dummy (0/1) variables by breaking up a data vector into categories based on specified breakpoints. These procedures differ in how they treat boundary cases as shown below.
Procedure | Category Boundaries | # dummies ($K$ breakpoints) | Call | Result |
---|---|---|---|---|
dummybr | $$x \leq 1$$ $$1 \lt x \leq 5$$ $$5 \lt x \leq 7$$ | $$K$$ | dm = dummybr(x, v); | $$dm = \begin{matrix} 0 & 1 & 0\\ 0 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\\ 1 & 0 & 0 \end{matrix}$$ |
dummy | $$x \leq 1$$ $$1 \lt x \leq 5$$ $$5 \lt x \leq 7$$ $$x \gt 7 $$ | $$K+1$$ | dm = dummy(x, v); | $$dm = \begin{matrix} 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 1 & 0 & 0 & 0 \end{matrix}$$ |
dummydn | $$x \leq 1$$ $$1 \lt x \leq 5$$ $$5 \lt x \leq 7$$ $$x \gt 7 $$ | $$K$$ | // Column to drop p = 2; dm = dummydn(x, v, p); | $$dm = \begin{matrix} 0 & 0 & 0\\ 0 & 0 & 1\\ 0 & 0 & 0\\ 0 & 1 & 0\\ 1 & 0 & 0 \end{matrix}$$ |
Let's look a little closer at how these procedures work.
Using dummybr
When creating dummy variables with dummybr:
- All categories are:
- Open on the left (i.e., do not contain their left boundaries).
- Closed on the right (i.e., do contain their right boundaries).
- $K$ breakpoints are required to specify $K$ dummy variables.
- Missings are deleted before the dummy variables are created.
dm = dummybr(x, v);
The code above produces three dummies based upon the breakpoints in the vector v:
x <= 1
1 < x <= 5
5 < x <= 7
The matrix dm contains:
       0 1 0         1.53
       0 0 0         8.41
dm =   0 1 0    x =  3.81
       0 0 1         6.34
       1 0 0         0.03
Notice that in this case, the second row of dm does not contain a 1 because x = 8.41 does not fall into any of our specified categories.
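To make the boundary rules concrete, the same three columns can be built by hand with element-wise comparisons. This is only an illustrative sketch of the category definitions listed above, not a description of how dummybr is implemented, and dm_manual is a made-up name.

```gauss
x = { 1.53, 8.41, 3.81, 6.34, 0.03 };
v = { 1, 5, 7 };

// Dummy variables from dummybr
dm = dummybr(x, v);

// Hand-built versions of the three categories:
//   x <= 1,   1 < x <= 5,   5 < x <= 7
dm_manual = (x .<= v[1]) ~
            ((x .> v[1]) .and (x .<= v[2])) ~
            ((x .> v[2]) .and (x .<= v[3]));

// Prints 1 if the two matrices are identical
print (dm == dm_manual);
```

The parentheses around each comparison matter here, because horizontal concatenation (~) binds more tightly than the relational operators.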
Using dummy
Now, let's compare our results from dummybr above to the dummy procedure. When we use the dummy procedure:
- All categories are:
- Open on the left (i.e., do not contain their left boundaries).
- Closed on the right (i.e., do contain their right boundaries), except the highest (rightmost) category because it extends to $+\infty$.
- $K-1$ breakpoints are required to specify $K$ dummy variables.
- Missings are deleted before the dummy variables are created.
dm = dummy(x, v);
The code above produces four dummies based upon the breakpoints in the vector v:
x <= 1
1 < x <= 5
5 < x <= 7
x > 7
The matrix dm contains:
       0 1 0 0         1.53
       0 0 0 1         8.41
dm =   0 1 0 0    x =  3.81
       0 0 1 0         6.34
       1 0 0 0         0.03
These results vary from our previous example:
- The dummy procedure results in 4 columns of dummy variables. It adds a new column for the case where x > 7.
- The second row now contains a 1 in the final column to indicate that x = 8.41 falls into the category x > 7.
Using dummydn
Our final function is dummydn, which behaves just like dummy, except that the pth column of the matrix of dummies is dropped. This is convenient for specifying a base case to ensure that these variables will not be collinear with a vector of ones.
// Column to drop
p = 2;
// Create matrix of dummy variables
dm_dn = dummydn(x, v, p);
The code above produces three dummies based upon the breakpoints in the vector v:
x <= 1
1 < x <= 5   // Since p = 2, this column is dropped
5 < x <= 7
x > 7
The matrix dm_dn contains:
       0 1 0 0             0 0 0         1.53
       0 0 0 1             0 0 1         8.41
dm =   0 1 0 0   dm_dn =   0 0 0    x =  3.81
       0 0 1 0             0 1 0         6.34
       1 0 0 0             1 0 0         0.03
Note that the matrix dm_dn is the same as dm except the second column has been removed.
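One way to check this relationship is to drop the column from the dummy output by hand and compare the result with dummydn. The sketch below reuses x, v, and p from the example, and dm_drop is an illustrative name.

```gauss
x = { 1.53, 8.41, 3.81, 6.34, 0.03 };
v = { 1, 5, 7 };
p = 2;

// Full set of K+1 dummy variables
dm = dummy(x, v);

// Same set with the pth column dropped by dummydn
dm_dn = dummydn(x, v, p);

// Drop column 2 by hand, keeping columns 1, 3 and 4
dm_drop = dm[., 1 3 4];

// Prints 1 if the two matrices are identical
print (dm_dn == dm_drop);

// Row sums of the full set: each observation falls in exactly
// one category, which is why the full set is collinear with
// a column of ones
print sumr(dm);
```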
Conclusion
Dummy variables are an important tool for data analysis whether we are working with time series data, cross-sectional data, or panel data. In today's blog, we have explored three GAUSS tools for generating dummy variables:
- Creating dummy variables from a file using formula strings.
- Creating dummy variables from an existing vector of categorical data using the design procedure.
- Creating dummy variables from an existing vector of continuous variables using the dummy, dummybr, and dummydn procedures.
Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.