How To Create Dummy Variables in GAUSS

by Eric · Published June 11, 2020 · Updated November 20, 2020

Introduction

Dummy variables are a common econometric tool, whether working with time series, cross-sectional, or panel data. Unfortunately, raw datasets rarely come formatted with dummy variables that are regression ready.

In today's blog, we explore several options for creating dummy variables in GAUSS, including:

- Creating dummy variables from a file using formula strings.
- Creating dummy variables from an existing vector of categorical data.
- Creating dummy variables from an existing vector of continuous data.

Creating Dummy Variables from a File

Dummy variables can be conveniently created from files at the time of loading data or calling procedures using formula string notation. Formula string notation is a powerful GAUSS tool that allows you to represent a model or collection of variables in a compact and intuitive manner, using the variable names in the dataset.

The `factor` Keyword

The factor keyword is used in formula strings to:

- Specify that a variable contains numeric categorical data.
- Create dummy variables (which are not present in the raw data) while loading data from a dataset.
- Include dummy variables in estimation functions such as olsmt, glm, or gmmFit.

Let's consider the model

$$mpg = \alpha + \beta_1 weight + \beta_2 rep78$$

We will use ordinary least squares to estimate this model with data from the auto2.dta file, which can be found in the GAUSSHOME/examples directory.

The variable rep78 is a categorical, 5-point variable that measures a car's repair record in 1978. To estimate the effects of the repair record on mpg we can include dummy variables representing the different categories.

// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/auto2.dta";

// Perform OLS estimation, creating dummy variables from 'rep78'
call olsmt(fname, "mpg ~ weight + factor(rep78)");

The printed output table includes coefficients for rep78=fair, average, good, excellent. Note that rep78=poor is automatically excluded from the regression as the base level.

                                 Standard                 Prob   Standardized  Cor with
Variable             Estimate      Error      t-value     >|t|     Estimate    Dep Var
---------------------------------------------------------------------------------------
CONSTANT              38.0594     3.09336     12.3036     0.000       ---         ---
weight            -0.00550304 0.000601001    -9.15645     0.000   -0.743741    -0.80552
rep78: Fair         -0.478604     2.76503   -0.173092     0.863  -0.0263109   -0.134619
rep78: Average      -0.471562     2.55314   -0.184699     0.854  -0.0401403   -0.279593
rep78: Good         -0.599032      2.6066   -0.229814     0.819  -0.0451669   0.0384391
rep78: Excellent      2.08628     2.72482    0.765657     0.447    0.131139    0.454192

The `cat` Keyword

Some common file types, such as XLS and CSV, do not have a robust method of determining variable types. In these cases, the cat keyword is used to:

- Denote a variable in a file as categorical text data.
- Instruct GAUSS to reclassify the string data to integer categories.

The cat keyword can be combined with the factor keyword to instruct GAUSS to load a column as string data, reclassify it to integers and then create dummy variables:

// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/yarn.xlsx";

// Reclassify 'load' variable from 'high, low, med'
// to '0, 1, 2', then create dummy variables from 
// integer categories and create OLS estimates
call olsmt(fname, "cycles ~ factor(cat(load))");

For datasets that do not specify a category order, the categories will be ordered alphabetically by default.

Using factor(cat(load)) in the formula string tells GAUSS to create dummy variables representing the different categories of the load variable. This is seen in the printed output table, which now includes coefficients for load=low and load=med. Note that load=high is automatically excluded from the regression as the base level.

                          Standard                 Prob   Standardized  Cor with
Variable      Estimate      Error      t-value     >|t|     Estimate    Dep Var
--------------------------------------------------------------------------------
CONSTANT    534.444444  292.474662    1.827319     0.080       ---         ---
load: low   621.555556  413.621634    1.502715     0.146    0.338504    0.240716
load: med   359.111111  413.621634    0.868212     0.394    0.195575    0.026323

Creating Dummy Variables Using `loadd`

In our previous two examples, we used the factor and cat keywords directly in calls to estimation procedures. However, we can also use these keywords when loading data to create dummy variables in our data matrices.

For example, let's load the dummy variables associated with the rep78 variable:

// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/auto2.dta";

// Load 'mpg' and 'weight', creating dummy variables from 'rep78'
reg_data = loadd(fname, "mpg + weight + factor(rep78)");

The reg_data matrix is a 74 x 6 matrix. It contains the mpg and weight data, as well as 4 columns of dummy variables for rep78=fair, average, good, excellent.

The first five rows look like this:

mpg  weight  rep78:fair  rep78:avg rep78:good rep78:exc
 22    2930           0          1          0         0
 17    3350           0          1          0         0
 22    2640           .          .          .         .
 20    3250           0          1          0         0
 15    4080           0          0          1         0

Note that, again, rep78=poor is automatically excluded as the base level.
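
As a minimal sketch of how the loaded matrix might then be used, the dummy columns can be passed straight to olsmt as regressors. This assumes the rows with missing values are removed first with packr, and that the column layout is the one described above (mpg first, weight second, dummies after that):

```gauss
// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/auto2.dta";

// Load 'mpg', 'weight' and the four 'rep78' dummy variables
reg_data = loadd(fname, "mpg + weight + factor(rep78)");

// Remove rows that contain missing values
reg_data = packr(reg_data);

// Dependent variable: 'mpg' is in the first column
y = reg_data[., 1];

// Independent variables: 'weight' plus the dummy columns
x = reg_data[., 2:cols(reg_data)];

// Estimate the model using OLS
call olsmt("", y, x);
```

The estimates should match those from the formula string call to olsmt in the earlier example, since both use the same data and the same base level.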

Creating Dummy Variables from a Categorical Vector

In the previous section, we looked at creating dummy variables at the time of loading data or running procedures. In this section, we consider how to create dummy variables from an existing GAUSS vector.

The GAUSS design procedure provides a convenient method for creating dummy variables from a vector of discrete categories.

Let's load the data from the auto2.dta dataset used in our earlier regression example. This time we won't load rep78 using factor:

// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/auto2.dta";

//  Load auto data for regression
reg_data = loadd(fname, "mpg + weight + rep78");

// Remove missing values
reg_data = packr(reg_data);

The first five rows of reg_data look like this:

22     2930        3
17     3350        3
20     3250        3
15     4080        4
18     3670        3

Our third column now contains discrete, categorical data with values ranging from 1-5, which represent poor, fair, average, good, and excellent.

The categories are not ordered alphabetically this time, because the auto2.dta file specifies the preferred order for the string categories.

// Compute the unique values found
// in the third column of 'reg_data'
print unique(reg_data[., 3]);

design creates a matrix with a column of indicator variables for each positive integer in the input. For example:

cats = { 1, 2, 1, 3 };
print design(cats);

will return:

       1        0        0
       0        1        0
       1        0        0
       0        0        1

Therefore, if we pass the third column of reg_data to design we will get a matrix with a column for all five categories. However, we want to drop the base case column for our regression.

$$mpg = \alpha + \beta_1 weight + \beta_2 rep78_{fair} + \beta_3 rep78_{avg} + \beta_4 rep78_{good} + \beta_5 rep78_{excl}$$

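To do this, we shift the range of the categorical data from 1-5 to 0-4 by subtracting 1. Because design only creates columns for positive integers, observations in the base category (rep78=poor, now coded 0) receive a row of zeros and no column of their own.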

// Create dummy variables. Subtract one
// to remove the base case.
dummy_vars = design(reg_data[., 3] - 1);

This creates a 69x4 matrix, dummy_vars, which contains dummy variables representing the final four levels of rep78.
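
To see the effect of the shift on a small scale, here is a short sketch that reuses the four-element cats vector from the earlier design example. The variable name shifted is introduced here for illustration only:

```gauss
// Small example vector with categories 1, 2 and 3
cats = { 1, 2, 1, 3 };

// Shift the categories to 0, 1 and 2 so that
// the first category becomes the base case
shifted = cats - 1;

// Rows where 'shifted' equals 0 get no indicator column
print design(shifted);
```

The result should have two columns rather than three. The first and third rows contain only zeros, because they belong to the base category.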

Now we can estimate our model as shown below.

// Select the 'mpg' data as the dependent variable
y = reg_data[., 1];

// Independent variables:
//     'weight' is in the second column of 'reg_data'.
//     'rep78'= Fair, Average, Good and Excellent
//              are represented by the 4 columns 
//              of 'dummy_vars'.
x = reg_data[., 2]~dummy_vars;

// Estimate model using OLS
call olsmt("", y, x);

Our printed results are the same as earlier, except our table no longer includes variable names:

                         Standard                 Prob   Standardized  Cor with
Variable     Estimate      Error      t-value     >|t|     Estimate    Dep Var
-------------------------------------------------------------------------------
CONSTANT    38.059415    3.093361   12.303578     0.000       ---         ---
X1          -0.005503    0.000601   -9.156447     0.000   -0.743741   -0.805520
X2          -0.478604    2.765035   -0.173092     0.863   -0.026311   -0.134619
X3          -0.471562    2.553145   -0.184699     0.854   -0.040140   -0.279593
X4          -0.599032    2.606599   -0.229814     0.819   -0.045167    0.038439
X5           2.086276    2.724817    0.765657     0.447    0.131139    0.454192

Creating Dummy Variables from Continuous Variables

The design procedure works well when our data already contains categorical data. However, there may be cases when we want to create dummy variables based on ranges of continuous data. The GAUSS dummybr, dummydn, and dummy procedures can be used to achieve this.

Consider a simple example:

x = { 1.53,
      8.41,
      3.81,
      6.34,
      0.03 };

// Breakpoints
v = { 1, 5, 7 };

All three procedures create a set of dummy (0/1) variables by breaking up a data vector into categories based on specified breakpoints. These procedures differ in how they treat boundary cases as shown below.

| Procedure | Category boundaries | # dummies ($K$ breakpoints) | Call | Result |
|---|---|---|---|---|
| dummybr | $x \leq 1$, $1 \lt x \leq 5$, $5 \lt x \leq 7$ | $K$ | dm = dummybr(x, v); | $dm = \begin{matrix} 0 & 1 & 0\\ 0 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\\ 1 & 0 & 0 \end{matrix}$ |
| dummy | $x \leq 1$, $1 \lt x \leq 5$, $5 \lt x \leq 7$, $x \gt 7$ | $K+1$ | dm = dummy(x, v); | $dm = \begin{matrix} 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 1 & 0 & 0 & 0 \end{matrix}$ |
| dummydn | $x \leq 1$, $1 \lt x \leq 5$, $5 \lt x \leq 7$, $x \gt 7$ | $K$ | p = 2; dm = dummydn(x, v, p); (p is the column to drop) | $dm = \begin{matrix} 0 & 0 & 0\\ 0 & 0 & 1\\ 0 & 0 & 0\\ 0 & 1 & 0\\ 1 & 0 & 0 \end{matrix}$ |

Let's look a little closer at how these procedures work.

Using `dummybr`

When creating dummy variables with dummybr:

- All categories are:
  - Open on the left (i.e., do not contain their left boundaries).
  - Closed on the right (i.e., do contain their right boundaries).
- $K$ breakpoints are required to specify $K$ dummy variables.
- Missing values are deleted before the dummy variables are created.

dm = dummybr(x, v);

The code above produces three dummies based upon the breakpoints in the vector v:

x <= 1
1 < x <= 5
5 < x <= 7

The matrix dm contains:

     0 1 0       1.53
     0 0 0       8.41
dm = 0 1 0   x = 3.81
     0 0 1       6.34
     1 0 0       0.03

Notice that in this case, the second row of dm does not contain a 1 because x = 8.41 does not fall into any of our specified categories.

Using `dummy`

Now, let's compare our results from dummybr above to the dummy procedure. When we use the dummy procedure:

- All categories are:
  - Open on the left (i.e., do not contain their left boundaries).
  - Closed on the right (i.e., do contain their right boundaries), except the highest (rightmost) category because it extends to $+\infty$.
- $K-1$ breakpoints are required to specify $K$ dummy variables.
- Missing values are deleted before the dummy variables are created.

dm = dummy(x, v);

The code above produces four dummies based upon the breakpoints in the vector v:

x <= 1
1 < x <= 5
5 < x <= 7
x > 7

The matrix dm contains:

     0 1 0 0       1.53
     0 0 0 1       8.41
dm = 0 1 0 0   x = 3.81
     0 0 1 0       6.34
     1 0 0 0       0.03

These results vary from our previous example:

- The dummy procedure results in 4 columns of dummy variables. It adds a new column for the case where x > 7.
- The second row now contains a 1 in the final column to indicate that x = 8.41 falls into the category x > 7.
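
One way to check this difference in code is to sum each row of the two dummy matrices. The sketch below assumes the same x and v as above. The names dm_br and dm_all are chosen here for illustration:

```gauss
// Data and breakpoints from the example above
x = { 1.53, 8.41, 3.81, 6.34, 0.03 };
v = { 1, 5, 7 };

// Dummy variables with and without the open-ended top category
dm_br = dummybr(x, v);
dm_all = dummy(x, v);

// Row sums: 1 if the observation falls in a category, 0 otherwise
print sumr(dm_br);
print sumr(dm_all);
```

With dummybr, the row sum for x = 8.41 is 0 because no category covers it. With dummy, every row sums to 1.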

Using `dummydn`

Our final function is dummydn which behaves just like dummy, except that the pth column of the matrix of dummies is dropped. This is convenient for specifying a base case to ensure that these variables will not be collinear with a vector of ones.

// Column to drop
p = 2;

// Create matrix of dummy variables
dm_dn = dummydn(x, v, p);

The code above produces three dummies based upon the breakpoints in the vector v:

x <= 1
1 < x <= 5 // Since p = 2, this column is dropped
5 < x <= 7
x > 7

The matrix dm_dn contains:

     0 1 0 0           0 0 0       1.53
     0 0 0 1           0 0 1       8.41
dm = 0 1 0 0   dm_dn = 0 0 0   x = 3.81
     0 0 1 0           0 1 0       6.34
     1 0 0 0           1 0 0       0.03

Note that the matrix dm_dn is the same as dm except the second column has been removed.
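
This relationship can be checked directly. The following sketch, again assuming the x and v defined earlier, drops the second column of the dummy output by hand and compares the result with the dummydn output. The names keep, dm_drop and is_same are illustrative:

```gauss
// Data and breakpoints from the example above
x = { 1.53, 8.41, 3.81, 6.34, 0.03 };
v = { 1, 5, 7 };

// Column to drop
p = 2;

// Full set of dummies and the set with column 'p' dropped
dm = dummy(x, v);
dm_dn = dummydn(x, v, p);

// Keep columns 1, 3 and 4 of 'dm', i.e. drop column 2
keep = { 1, 3, 4 };
dm_drop = dm[., keep];

// '==' returns 1 only if every element matches
is_same = dm_drop == dm_dn;
print is_same;
```

If the two matrices are identical, as the example above indicates, is_same is 1.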

Conclusion

Dummy variables are an important tool for data analysis whether we are working with time series data, cross-sectional data, or panel data. In today's blog, we have explored three GAUSS tools for generating dummy variables:

- Creating dummy variables from a file using formula strings.
- Creating dummy variables from an existing vector of categorical data using the design procedure.
- Creating dummy variables from an existing vector of continuous variables using the dummy, dummybr, and dummydn procedures.

Eric (Director of Applications and Training at Aptech Systems, Inc.)

Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.
