Easy Management of Categorical Variables

Introduction

Categorical variables offer an important opportunity to capture qualitative effects in statistical modeling. Unfortunately, it can be tedious and cumbersome to manage categorical variables in statistical software.

The new GAUSS category type, introduced in GAUSS 21, makes it easy and intuitive to work with categorical data.

In today's blog we use real-life housing data to explore the numerous advantages of the GAUSS category type including:

  • Easy set-up and viewing of categorical data.
  • Simple renaming of category labels.
  • Easy changing of the reference base case and reordering of categories.
  • Single-line frequency plots and tables.
  • Internal creation of dummy variables for regressions.
  • Proper labeling of categories in regression output.

Want to see these advantages for yourself? Contact us for a GAUSS 21 demo!

The data

Throughout today's blog, we will be using the Melbourne Housing Snapshot dataset.

The dataset contains the following variables:

VariableDescription
SuburbNumber of the suburb.
AddressHouse address.
RoomsNumber of rooms.
TypeType of house.
PriceSale price.
MethodMethod of sale.
SellerGReal estate agent.
DateDate sold.
DistanceDistance from CBD.
PostcodePostal code.
Bedroom2Number of bedrooms.
BathroomsNumber of bathrooms.
CarNumber of carspots.
LandsizeLand size.
BuildingAreaBuilding size.
YearBuiltYear the house was built.
CouncilAreaGoverning council for the area.
LatitudeLocation latitude.
LongitudeLocation longitude.
RegionnameGeneral region.
PropertycountNumber of properties that exist in the suburb.

Loading our Data

Let's start by loading our dataset using the GAUSS Data Import window:

When we open the dataset for loading we can see quickly from the color-coded columns what type GAUSS is assigning to our variables. For example:

  • Type and Method (highlighted orange) are categorical variables.
  • Suburb, Address, and SellerG (highlighted yellow) are listed as string variables.
  • Rooms and Price (highlighted blue) are numbers.

Changing a Variable to a Category

Suppose that we also want the variable Suburb to be loaded as a categorical variable. This can be easily done using the Variables tab.

When we change the variable Suburb type to Category, the Modify Column Mapping window opens:

This window is very useful and gives us a number of insights. For example, from this window we can tell that:

If we want to explore the categories more we can use the Label Filter:

Once we are done changing the type to category, if we click OK, the preview for Suburb is changed to Category with orange highlighting:

Importing our Data

Now that we've set Suburb to be a categorical variable, we're ready to load our data. Clicking Import loads our data and auto-generates the GAUSS code for all import steps.

Managing the Properties of Categorical Variables

One of the advantages of using the category type is that the category properties can be easily modified using the Data Management pane.

Changing Category Labels Interactively

Suppose that after loading our data, we realize that our current labels for the Type variable are not very clear. Instead, we wish to rename the labels such that :

Original labelNew label
hHouse
tTownhouse
uDuplex unit

Once we open the dataframe in the data editor, this is easy to do. We simply:

  1. Select the Manage button to open the Data Management pane:
  2. Click the drop-down button to the right of the variable name and select Properties to open the Modify Column Mapping dialog.
  3. Enter the new label in the Renamed Label textbox next to the category label we want to change.

Let's use the same process to also change the category labels on for the Method variable:

Once we've done this, the variable names Type and Method are highlighted in red. This indicates that we have unsaved changes.

When we click Apply the changes are saved and the commands GAUSS uses to change the category labels are generated in the Program Input/Output window.

melb_data = setcollabels(move(melb_data), "House"$|"Townhouse"$|"Duplex unit", 0|1|2, "Type");
melb_data = setcollabels(move(melb_data), "Passed in"$|"Sold"$|"Sold after auction"$|"Sold prior auction"$|"Vendor bid", 0|1|2|3|4, "Method");

Changing the Base Case

When working with categorical variables, it is useful to know that GAUSS always treats the variable with the key value "0" as the base case. For example, we saw earlier that the base case for the Suburb variable was Abbotsford.

If you want to change the assigned base case (or otherwise reorder the categories), this can quickly be done using the Modify Column Mapping dialog.

For example, suppose we want to replicate a study that uses the category Chelsea as the base case. To do this we:

  1. Locate and select the category Chelsea using the Label Filter
  2. Use the Double Arrow button to move Chelsea to the top of the list.
  3. Apply our changes.

Exploring our Categorical Variables

To dive deeper into our categorical variables we can use the dstatmt, frequency, and plotFreq procedures.

General Summary Statistics

First, let's get a general overview of our data, including the categorical variables, using the dstatmt function:

// Print descriptive statistics for all variables in 'melb_data'
call dstatmt(melb_data);

This prints a table of descriptive statistics to the Program Input/Output window:

----------------------------------------------------------------------------------------------
Variable              Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
----------------------------------------------------------------------------------------------

Suburb               -----       -----         -----     Chelsea  Yarraville     13580    0
Address              -----       -----         -----       -----       -----     13580    0
Rooms                2.938      0.9557        0.9135           1          10     13580    0
Type                 -----       -----         -----       House  Duplex uni     13580    0
Price            1.076e+06      639311     4.087e+11     8.5e+04       9e+06     13580    0
Method               -----       -----         -----   Passed in  Vendor bid     13580    0
SellerG              -----       -----         -----       -----       -----     13580    0
Date                 -----       -----         -----  28/01/2016  23/09/2017     13580    0
Distance             10.14       5.869         34.44           0        48.1     13580    0
Postcode              3105       90.68          8222        3000        3977     13580    0
Bedroom2             2.915      0.9659         0.933           0          20     13580    0
Bathroom             1.534      0.6917        0.4785           0           8     13580    0
Car                   1.61      0.9626        0.9267           0          10     13518   62
Landsize             558.4        3991     1.593e+07           0    4.33e+05     13580    0
BuildingArea           152         541        292697           0   4.452e+04      7130 6450
YearBuilt             1965       37.27          1389        1196        2018      8205 5375
CouncilArea          -----       -----         -----     Banyule  Yarra Rang     12211 1369
Latitude            -37.81     0.07926      0.006282      -38.18      -37.41     13580    0
Longitude              145      0.1039        0.0108       144.4       145.5     13580    0
Regionname           -----       -----         -----  Eastern Me  Western Vi     13580    0
Propertycount         7454        4379     1.917e+07         249   2.165e+04     13580    0 

Though traditional summary statistics aren't valid for categorical data, the descriptive statistics still provide some insights:

  • The minimum category is always the base case for the variable.
  • We can identify if there are any missing observations.

Frequency Table of Categories

The frequency procedure was introduced in GAUSS 21 specifically to provide frequency count tables. The procedure requires two inputs:


x
Data matrix or data frame.
varlist
String, names or indices of variables to be counted. If names, should be entered as a formula string e.g "rep78 + foreign".

To see frequency counts for both Method and Type in our dataframe melb_data we enter:

// Print frequency tables for the 'Type' and 'Method' variables
// in the 'melb_data' dataframe
frequency(melb_data, "Type + Method");

This prints the category names along with:

  • Total counts.
  • Frequency percentages.
  • Cumulative percentages.
             Label      Count   Total %    Cum. %
             House       9449     69.58     69.58
         Townhouse       1114     8.203     77.78
       Duplex unit       3017     22.22       100
             Total      13580       100
Label Count Total % Cum. % Passed in 1564 11.52 11.52 Sold 9022 66.44 77.95 Sold after auction 92 0.6775 78.63 Sold prior auction 1703 12.54 91.17 Vendor bid 1199 8.829 100 Total 13580 100

Frequency Plot of Categories

The information provided with the frequency procedure can be quickly visualized using the plotFreq procedure.

For example, let's plot the frequencies for the Method variable:

plotFreq(melb_data, "Method");

Estimation with Categorical Variables

The final area we will explore today is the use of the GAUSS category type in estimation. Estimation of one of the areas that the GAUSS category type offers the greatest advantages.

GAUSS category variables can be used in estimation routines, such as olsmt or glm without taking any additional step.

When category variables are detected in estimation routines GAUSS will automatically:

  • Create and use dummy variables during estimation.
  • Exclude the base case category.
  • Print output tables using specified category labels.
Using categorical data in your models? See how the new category type works for you!

   

As an example, let's run a simple linear regression model estimating the role of Method, Bedroom2, and Bath on Price:

call olsmt(melb_data, "price ~ method + bedroom2 + bath");
                                           Standard                 Prob   Standardized  Cor with
Variable                       Estimate      Error      t-value     >|t|     Estimate    Dep Var
-------------------------------------------------------------------------------------------------

CONSTANT                        59797.1     20407.1     2.93022     0.003       ---         ---
Method: Sold 40895.7 14857.5 2.75252 0.006 0.0302079 0.0256233 Method: Sold after auction -54790.4 57921.6 -0.94594 0.344 -0.00703034 -0.0064481 Method: Sold prior auction -109114 18989.1 -5.74614 0.000 -0.0565257 -0.104125 Method: Vendor bid 77144.6 20737.2 3.7201 0.000 0.0342371 0.0442128 Bedroom2 202813 5924.64 34.2321 0.000 0.306426 0.475951 Bathroom 263854 8314.3 31.735 0.000 0.285481 0.467038

Notice that the results include estimated coefficients for each of the Method categories, except the base case, Passed In.

Conclusion

Today we've seen how fast and easy it can be to work with categorical variables using the new GAUSS category type. Whether you're just getting started exploring your data or you're in the final stage of estimation, the category type can speed up your work and get you to your results faster.

Further reading

  1. Introduction to Categorical Variables.
  2. Easy and Fast Data Management in GAUSS 21 .
  3. Preparing and Cleaning FRED data in GAUSS/

Leave a Reply