Introduction
Categorical variables offer an important opportunity to capture qualitative effects in statistical modeling. Unfortunately, it can be tedious and cumbersome to manage categorical variables in statistical software.
The new GAUSS category type, introduced in GAUSS 21, makes it easy and intuitive to work with categorical data.
In today's blog we use real-life housing data to explore the numerous advantages of the GAUSS category type including:
- Easy set-up and viewing of categorical data.
- Simple renaming of category labels.
- Easy changing of the reference base case and reordering of categories.
- Single-line frequency plots and tables.
- Internal creation of dummy variables for regressions.
- Proper labeling of categories in regression output.
Want to see these advantages for yourself? Contact us for a GAUSS 21 demo!
The data
Throughout today's blog, we will be using the Melbourne Housing Snapshot dataset.
The dataset contains the following variables:
Variable | Description |
---|---|
Suburb | Number of the suburb. |
Address | House address. |
Rooms | Number of rooms. |
Type | Type of house. |
Price | Sale price. |
Method | Method of sale. |
SellerG | Real estate agent. |
Date | Date sold. |
Distance | Distance from CBD. |
Postcode | Postal code. |
Bedroom2 | Number of bedrooms. |
Bathrooms | Number of bathrooms. |
Car | Number of carspots. |
Landsize | Land size. |
BuildingArea | Building size. |
YearBuilt | Year the house was built. |
CouncilArea | Governing council for the area. |
Latitude | Location latitude. |
Longitude | Location longitude. |
Regionname | General region. |
Propertycount | Number of properties that exist in the suburb. |
Loading our Data
Let's start by loading our dataset using the GAUSS Data Import window:
When we open the dataset for loading we can see quickly from the color-coded columns what type GAUSS is assigning to our variables. For example:
Type
andMethod
(highlighted orange) are categorical variables.Suburb
,Address
, andSellerG
(highlighted yellow) are listed as string variables.Rooms
andPrice
(highlighted blue) are numbers.
Changing a Variable to a Category
Suppose that we also want the variable Suburb
to be loaded as a categorical variable. This can be easily done using the Variables tab.
When we change the variable Suburb
type to Category, the Modify Column Mapping window opens:
This window is very useful and gives us a number of insights. For example, from this window we can tell that:
- The category Abbotsford is the base case.
- All labels and category mapping.
- There are 314 different categories.
If we want to explore the categories more we can use the Label Filter:
Once we are done changing the type to category, if we click OK, the preview for Suburb
is changed to Category with orange highlighting:
Importing our Data
Now that we've set Suburb
to be a categorical variable, we're ready to load our data. Clicking Import loads our data and auto-generates the GAUSS code for all import steps.
Managing the Properties of Categorical Variables
One of the advantages of using the category type is that the category properties can be easily modified using the Data Management pane.
Changing Category Labels Interactively
Suppose that after loading our data, we realize that our current labels for the Type
variable are not very clear. Instead, we wish to rename the labels such that :
Original label | New label |
---|---|
h | House |
t | Townhouse |
u | Duplex unit |
Once we open the dataframe in the data editor, this is easy to do. We simply:
- Select the Manage button to open the Data Management pane:
- Click the drop-down button to the right of the variable name and select Properties to open the Modify Column Mapping dialog.
- Enter the new label in the Renamed Label textbox next to the category label we want to change.
Let's use the same process to also change the category labels on for the Method
variable:
Once we've done this, the variable names Type
and Method
are highlighted in red. This indicates that we have unsaved changes.
When we click Apply the changes are saved and the commands GAUSS uses to change the category labels are generated in the Program Input/Output window.
melb_data = setcollabels(move(melb_data), "House"$|"Townhouse"$|"Duplex unit", 0|1|2, "Type");
melb_data = setcollabels(move(melb_data), "Passed in"$|"Sold"$|"Sold after auction"$|"Sold prior auction"$|"Vendor bid", 0|1|2|3|4, "Method");
Changing the Base Case
When working with categorical variables, it is useful to know that GAUSS always treats the variable with the key value "0" as the base case. For example, we saw earlier that the base case for the Suburb
variable was Abbotsford.
If you want to change the assigned base case (or otherwise reorder the categories), this can quickly be done using the Modify Column Mapping dialog.
For example, suppose we want to replicate a study that uses the category Chelsea as the base case. To do this we:
- Locate and select the category Chelsea using the Label Filter
- Use the Double Arrow button to move Chelsea to the top of the list.
- Apply our changes.
Exploring our Categorical Variables
To dive deeper into our categorical variables we can use the dstatmt, frequency, and plotFreq procedures.
General Summary Statistics
First, let's get a general overview of our data, including the categorical variables, using the dstatmt
function:
// Print descriptive statistics for all variables in 'melb_data'
call dstatmt(melb_data);
This prints a table of descriptive statistics to the Program Input/Output window:
---------------------------------------------------------------------------------------------- Variable Mean Std Dev Variance Minimum Maximum Valid Missing ---------------------------------------------------------------------------------------------- Suburb ----- ----- ----- Chelsea Yarraville 13580 0 Address ----- ----- ----- ----- ----- 13580 0 Rooms 2.938 0.9557 0.9135 1 10 13580 0 Type ----- ----- ----- House Duplex uni 13580 0 Price 1.076e+06 639311 4.087e+11 8.5e+04 9e+06 13580 0 Method ----- ----- ----- Passed in Vendor bid 13580 0 SellerG ----- ----- ----- ----- ----- 13580 0 Date ----- ----- ----- 28/01/2016 23/09/2017 13580 0 Distance 10.14 5.869 34.44 0 48.1 13580 0 Postcode 3105 90.68 8222 3000 3977 13580 0 Bedroom2 2.915 0.9659 0.933 0 20 13580 0 Bathroom 1.534 0.6917 0.4785 0 8 13580 0 Car 1.61 0.9626 0.9267 0 10 13518 62 Landsize 558.4 3991 1.593e+07 0 4.33e+05 13580 0 BuildingArea 152 541 292697 0 4.452e+04 7130 6450 YearBuilt 1965 37.27 1389 1196 2018 8205 5375 CouncilArea ----- ----- ----- Banyule Yarra Rang 12211 1369 Latitude -37.81 0.07926 0.006282 -38.18 -37.41 13580 0 Longitude 145 0.1039 0.0108 144.4 145.5 13580 0 Regionname ----- ----- ----- Eastern Me Western Vi 13580 0 Propertycount 7454 4379 1.917e+07 249 2.165e+04 13580 0
Though traditional summary statistics aren't valid for categorical data, the descriptive statistics still provide some insights:
- The minimum category is always the base case for the variable.
- We can identify if there are any missing observations.
Frequency Table of Categories
The frequency
procedure was introduced in GAUSS 21 specifically to provide frequency count tables. The procedure requires two inputs:
- x
- Data matrix or data frame.
- varlist
- String, names or indices of variables to be counted. If names, should be entered as a formula string e.g
"rep78 + foreign"
.
To see frequency counts for both Method
and Type
in our dataframe melb_data
we enter:
// Print frequency tables for the 'Type' and 'Method' variables
// in the 'melb_data' dataframe
frequency(melb_data, "Type + Method");
This prints the category names along with:
- Total counts.
- Frequency percentages.
- Cumulative percentages.
Label Count Total % Cum. % House 9449 69.58 69.58 Townhouse 1114 8.203 77.78 Duplex unit 3017 22.22 100 Total 13580 100
Label Count Total % Cum. % Passed in 1564 11.52 11.52 Sold 9022 66.44 77.95 Sold after auction 92 0.6775 78.63 Sold prior auction 1703 12.54 91.17 Vendor bid 1199 8.829 100 Total 13580 100
Frequency Plot of Categories
The information provided with the frequency
procedure can be quickly visualized using the plotFreq
procedure.
For example, let's plot the frequencies for the Method
variable:
plotFreq(melb_data, "Method");
Estimation with Categorical Variables
The final area we will explore today is the use of the GAUSS category type in estimation. Estimation of one of the areas that the GAUSS category type offers the greatest advantages.
GAUSS category variables can be used in estimation routines, such as olsmt or glm without taking any additional step.
When category variables are detected in estimation routines GAUSS will automatically:
- Create and use dummy variables during estimation.
- Exclude the base case category.
- Print output tables using specified category labels.
As an example, let's run a simple linear regression model estimating the role of Method
, Bedroom2
, and Bath
on Price
:
call olsmt(melb_data, "price ~ method + bedroom2 + bath");
Standard Prob Standardized Cor with Variable Estimate Error t-value >|t| Estimate Dep Var ------------------------------------------------------------------------------------------------- CONSTANT 59797.1 20407.1 2.93022 0.003 --- ---
Method: Sold 40895.7 14857.5 2.75252 0.006 0.0302079 0.0256233 Method: Sold after auction -54790.4 57921.6 -0.94594 0.344 -0.00703034 -0.0064481 Method: Sold prior auction -109114 18989.1 -5.74614 0.000 -0.0565257 -0.104125 Method: Vendor bid 77144.6 20737.2 3.7201 0.000 0.0342371 0.0442128 Bedroom2 202813 5924.64 34.2321 0.000 0.306426 0.475951 Bathroom 263854 8314.3 31.735 0.000 0.285481 0.467038
Notice that the results include estimated coefficients for each of the Method
categories, except the base case, Passed In.
Conclusion
Today we've seen how fast and easy it can be to work with categorical variables using the new GAUSS category type. Whether you're just getting started exploring your data or you're in the final stage of estimation, the category type can speed up your work and get you to your results faster.
Further reading
- Introduction to Categorical Variables.
- Easy and Fast Data Management in GAUSS 21 .
- Preparing and Cleaning FRED data in GAUSS/
Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.