Easy Management of Categorical Variables

by Eric · Published March 19, 2021 · Updated March 22, 2021

Introduction

Categorical variables offer an important opportunity to capture qualitative effects in statistical modeling. Unfortunately, it can be tedious and cumbersome to manage categorical variables in statistical software.

The new GAUSS category type, introduced in GAUSS 21, makes it easy and intuitive to work with categorical data.

In today's blog we use real-life housing data to explore the numerous advantages of the GAUSS category type including:

Easy set-up and viewing of categorical data.
Simple renaming of category labels.
Easy changing of the reference base case and reordering of categories.
Single-line frequency plots and tables.
Internal creation of dummy variables for regressions.
Proper labeling of categories in regression output.

Want to see these advantages for yourself? Contact us for a GAUSS 21 demo!

The data

Throughout today's blog, we will be using the Melbourne Housing Snapshot dataset.

The dataset contains the following variables:

Variable	Description
Suburb	Number of the suburb.
Address	House address.
Rooms	Number of rooms.
Type	Type of house.
Price	Sale price.
Method	Method of sale.
SellerG	Real estate agent.
Date	Date sold.
Distance	Distance from CBD.
Postcode	Postal code.
Bedroom2	Number of bedrooms.
Bathrooms	Number of bathrooms.
Car	Number of carspots.
Landsize	Land size.
BuildingArea	Building size.
YearBuilt	Year the house was built.
CouncilArea	Governing council for the area.
Latitude	Location latitude.
Longitude	Location longitude.
Regionname	General region.
Propertycount	Number of properties that exist in the suburb.

Loading our Data

Let's start by loading our dataset using the GAUSS Data Import window:

When we open the dataset for loading we can see quickly from the color-coded columns what type GAUSS is assigning to our variables. For example:

Type and Method (highlighted orange) are categorical variables.
Suburb, Address, and SellerG (highlighted yellow) are listed as string variables.
Rooms and Price (highlighted blue) are numbers.

GAUSS uses an algorithm based on the number of repeated labels and the ratio of varying labels to total observations to detect categorical variables.

Changing a Variable to a Category

Suppose that we also want the variable Suburb to be loaded as a categorical variable. This can be easily done using the Variables tab.

When we change the variable Suburb type to Category, the Modify Column Mapping window opens:

This window is very useful and gives us a number of insights. For example, from this window we can tell that:

The category Abbotsford is the base case.
All labels and category mapping.
There are 314 different categories.

If we want to explore the categories more we can use the Label Filter:

Once we are done changing the type to category, if we click OK, the preview for Suburb is changed to Category with orange highlighting:

Importing our Data

Now that we've set Suburb to be a categorical variable, we're ready to load our data. Clicking Import loads our data and auto-generates the GAUSS code for all import steps.

Managing the Properties of Categorical Variables

One of the advantages of using the category type is that the category properties can be easily modified using the Data Management pane.

Changing Category Labels Interactively

Suppose that after loading our data, we realize that our current labels for the Type variable are not very clear. Instead, we wish to rename the labels such that :

Original label	New label
h	House
t	Townhouse
u	Duplex unit

Once we open the dataframe in the data editor, this is easy to do. We simply:

Select the Manage button to open the Data Management pane:
Click the drop-down button to the right of the variable name and select Properties to open the Modify Column Mapping dialog.
Enter the new label in the Renamed Label textbox next to the category label we want to change.

Let's use the same process to also change the category labels on for the Method variable:

Once we've done this, the variable names Type and Method are highlighted in red. This indicates that we have unsaved changes.

When we click Apply the changes are saved and the commands GAUSS uses to change the category labels are generated in the Program Input/Output window.


melb_data = setcollabels(move(melb_data), "House"$|"Townhouse"$|"Duplex unit", 0|1|2, "Type");
melb_data = setcollabels(move(melb_data), "Passed in"$|"Sold"$|"Sold after auction"$|"Sold prior auction"$|"Vendor bid", 0|1|2|3|4, "Method");

Changing the Base Case

When working with categorical variables, it is useful to know that GAUSS always treats the variable with the key value "0" as the base case. For example, we saw earlier that the base case for the Suburb variable was Abbotsford.

If you want to change the assigned base case (or otherwise reorder the categories), this can quickly be done using the Modify Column Mapping dialog.

For example, suppose we want to replicate a study that uses the category Chelsea as the base case. To do this we:

Locate and select the category Chelsea using the Label Filter
Use the Double Arrow button to move Chelsea to the top of the list.
Apply our changes.

More detailed information on how to manage categorical variables, both interactively and programmatically, can be found in our Data Management Guide.

Exploring our Categorical Variables

To dive deeper into our categorical variables we can use the dstatmt, frequency, and plotFreq procedures.

General Summary Statistics

First, let's get a general overview of our data, including the categorical variables, using the dstatmt function:


// Print descriptive statistics for all variables in 'melb_data'
call dstatmt(melb_data);

This prints a table of descriptive statistics to the Program Input/Output window:

----------------------------------------------------------------------------------------------
Variable              Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
----------------------------------------------------------------------------------------------

Suburb               -----       -----         -----     Chelsea  Yarraville     13580    0
Address              -----       -----         -----       -----       -----     13580    0
Rooms                2.938      0.9557        0.9135           1          10     13580    0
Type                 -----       -----         -----       House  Duplex uni     13580    0
Price            1.076e+06      639311     4.087e+11     8.5e+04       9e+06     13580    0
Method               -----       -----         -----   Passed in  Vendor bid     13580    0
SellerG              -----       -----         -----       -----       -----     13580    0
Date                 -----       -----         -----  28/01/2016  23/09/2017     13580    0
Distance             10.14       5.869         34.44           0        48.1     13580    0
Postcode              3105       90.68          8222        3000        3977     13580    0
Bedroom2             2.915      0.9659         0.933           0          20     13580    0
Bathroom             1.534      0.6917        0.4785           0           8     13580    0
Car                   1.61      0.9626        0.9267           0          10     13518   62
Landsize             558.4        3991     1.593e+07           0    4.33e+05     13580    0
BuildingArea           152         541        292697           0   4.452e+04      7130 6450
YearBuilt             1965       37.27          1389        1196        2018      8205 5375
CouncilArea          -----       -----         -----     Banyule  Yarra Rang     12211 1369
Latitude            -37.81     0.07926      0.006282      -38.18      -37.41     13580    0
Longitude              145      0.1039        0.0108       144.4       145.5     13580    0
Regionname           -----       -----         -----  Eastern Me  Western Vi     13580    0
Propertycount         7454        4379     1.917e+07         249   2.165e+04     13580    0

Though traditional summary statistics aren't valid for categorical data, the descriptive statistics still provide some insights:

The minimum category is always the base case for the variable.
We can identify if there are any missing observations.

Frequency Table of Categories

The frequency procedure was introduced in GAUSS 21 specifically to provide frequency count tables. The procedure requires two inputs:

x: Data matrix or data frame.
varlist: String, names or indices of variables to be counted. If names, should be entered as a formula string e.g "rep78 + foreign".

To see frequency counts for both Method and Type in our dataframe melb_data we enter:


// Print frequency tables for the 'Type' and 'Method' variables
// in the 'melb_data' dataframe
frequency(melb_data, "Type + Method");

This prints the category names along with:

Total counts.
Frequency percentages.
Cumulative percentages.

             Label      Count   Total %    Cum. %
             House       9449     69.58     69.58
         Townhouse       1114     8.203     77.78
       Duplex unit       3017     22.22       100
             Total      13580       100


             Label      Count   Total %    Cum. %
         Passed in       1564     11.52     11.52
              Sold       9022     66.44     77.95
Sold after auction         92    0.6775     78.63
Sold prior auction       1703     12.54     91.17
        Vendor bid       1199     8.829       100
             Total      13580       100

Frequency Plot of Categories

The information provided with the frequency procedure can be quickly visualized using the plotFreq procedure.

For example, let's plot the frequencies for the Method variable:


plotFreq(melb_data, "Method");

Estimation with Categorical Variables

The final area we will explore today is the use of the GAUSS category type in estimation. Estimation of one of the areas that the GAUSS category type offers the greatest advantages.

GAUSS category variables can be used in estimation routines, such as olsmt or glm without taking any additional step.

When category variables are detected in estimation routines GAUSS will automatically:

Create and use dummy variables during estimation.
Exclude the base case category.
Print output tables using specified category labels.

Using categorical data in your models? See how the new category type works for you!

As an example, let's run a simple linear regression model estimating the role of Method, Bedroom2, and Bath on Price:


call olsmt(melb_data, "price ~ method + bedroom2 + bath");

                                           Standard                 Prob   Standardized  Cor with
Variable                       Estimate      Error      t-value     >|t|     Estimate    Dep Var
-------------------------------------------------------------------------------------------------

CONSTANT                        59797.1     20407.1     2.93022     0.003       ---         ---

Method: Sold                    40895.7     14857.5     2.75252     0.006   0.0302079   0.0256233
Method: Sold after auction     -54790.4     57921.6    -0.94594     0.344 -0.00703034  -0.0064481
Method: Sold prior auction      -109114     18989.1    -5.74614     0.000  -0.0565257   -0.104125
Method: Vendor bid              77144.6     20737.2      3.7201     0.000   0.0342371   0.0442128
Bedroom2                         202813     5924.64     34.2321     0.000    0.306426    0.475951
Bathroom                         263854      8314.3      31.735     0.000    0.285481    0.467038

Notice that the results include estimated coefficients for each of the Method categories, except the base case, Passed In.

For more information on how to use and interpret categorical variables in linear regression see our earlier blog, "Introduction to Categorical Variables".

Conclusion

Today we've seen how fast and easy it can be to work with categorical variables using the new GAUSS category type. Whether you're just getting started exploring your data or you're in the final stage of estimation, the category type can speed up your work and get you to your results faster.

	melb_data = setcollabels(move(melb_data), "House"$\|"Townhouse"$\|"Duplex unit", 0\|1\|2, "Type");
	melb_data = setcollabels(move(melb_data), "Passed in"$\|"Sold"$\|"Sold after auction"$\|"Sold prior auction"$\|"Vendor bid", 0\|1\|2\|3\|4, "Method");

	// Print descriptive statistics for all variables in 'melb_data'
	call dstatmt(melb_data);

	// Print frequency tables for the 'Type' and 'Method' variables
	// in the 'melb_data' dataframe
	frequency(melb_data, "Type + Method");