Exploring Categorical Data in GAUSS 25

Introduction

Categorical data plays a key role in data analysis, offering a structured way to capture qualitative relationships. Before running any models, simply examining the distribution of categorical data can provide valuable insights into underlying patterns.

Whether summarizing survey responses or exploring demographic trends, fundamental statistical tools, such as frequency counts and tabulations, help reveal these patterns.

GAUSS offers several tools for summarizing and visualizing categorical data, including:

  • tabulate: Quickly compute cross-tabulations and summary tables.
  • frequency: Generate frequency counts and relative frequencies.
  • plotFreq: Create visual representations of frequency distributions.

In GAUSS 25, these functions gained dataframe-aware output, sorting, percentage options, and by-group plotting. In this post, we'll explore these improvements and demonstrate their practical applications.

Frequency Counts

The GAUSS frequency function generates frequency tables for categorical variables. In GAUSS 25, it has been enhanced to utilize metadata from dataframes, automatically detecting and displaying variable names. Additionally, the function now includes an option to sort the frequency table, making it easier to analyze distributions.

Example: Counting Product Categories

For this example, we'll use a hypothetical dataset containing 50 observations of two categorical variables: Product_Type and Region. You can download the dataset here.

To start, we'll load the data using loadd:

/*
** Sample product sales data
*/
// Import sales dataframe
product_data = loadd(__FILE_DIR $+ "product_data.csv");

// Preview data
head(product_data);
    Product_Type           Region
     Electronics             East
      Home Goods             West
       Furniture            North
            Toys             East
      Home Goods            North

Next, we will compute the frequency counts of the Product_Type variable:

// Compute frequency counts
frequency(product_data, "Product_Type");
=============================================
   Product_Type     Count   Total %    Cum. %
=============================================

       Clothing         8        16        16
    Electronics        13        26        42
      Furniture        10        20        62
     Home Goods         7        14        76
           Toys        12        24       100
=============================================
          Total        50       100
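
The Total % and Cum. % columns follow directly from the counts. As a minimal sketch, assuming the counts are typed in by hand from the table above (the variable names are illustrative and are not part of frequency), the same figures can be reproduced with basic matrix operations:

```gauss
// Counts copied by hand from the frequency table above
// Order: Clothing, Electronics, Furniture, Home Goods, Toys
counts = { 8, 13, 10, 7, 12 };

// Share of each category, in percent
pct = 100 * (counts ./ sumc(counts));

// Running total of the percentages
cum_pct = cumsumc(pct);

// Print counts, percentages, and cumulative percentages side by side
print counts~pct~cum_pct;
```

The last entry of cum_pct should be 100, which is a quick check that no category was dropped.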

We can also generate a sorted frequency table, using the optional sorting argument:

// Compute frequency counts, sorted from most to least frequent
frequency(product_data, "Product_Type", 1);
=============================================
   Product_Type     Count   Total %    Cum. %
=============================================

    Electronics        13        26        26
           Toys        12        24        50
      Furniture        10        20        70
       Clothing         8        16        86
     Home Goods         7        14       100
=============================================
          Total        50       100  
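
The same call pattern should carry over to the other categorical column. The sketch below assumes the dataset is available at the same path as above and applies the sorted frequency table to Region:

```gauss
// Load the same dataset used above
product_data = loadd(__FILE_DIR $+ "product_data.csv");

// Sorted frequency table for the second categorical variable
frequency(product_data, "Region", 1);
```

The counts should agree with the Total row of the cross-tabulation later in this post (East 14, North 15, South 11, West 10).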

Tabulating Categorical Data

While frequency counts help us understand individual categories, the tabulate function allows us to explore relationships between categorical variables. This function performs cross-tabulations, offering deeper insights into categorical distributions. In GAUSS 25, it was enhanced with new options for calculating row and column percentages, making comparisons easier.

Example: Cross-Tabulating Product Type and Region

Now let's look at the relationship between Product_Type and Region.

// Generate cross-tabulation
call tabulate(product_data, "Product_Type ~ Region");
=====================================================================================
   Product_Type                              Region                             Total
=====================================================================================
                      East          North          South           West

       Clothing          1              5              1              1             8
    Electronics          5              1              5              2            13
      Furniture          3              3              1              3            10
     Home Goods          1              3              2              1             7
           Toys          4              3              2              3            12
          Total         14             15             11             10            50

=====================================================================================

By default, the tabulate function reports absolute counts. Relative frequencies are often easier to compare across categories, so GAUSS 25 adds options to calculate row and column percentages.

This is done using the tabControl structure and the rowPercent or columnPercent members.

  • Row percentages show how each product type is distributed across regions (each row sums to 100).
  • Column percentages show the mix of product types within each region (each column sums to 100).

/*
** Relative tabulations
*/ 
struct tabControl tCtl;
tCtl = tabControlCreate();

// Specify row percentages
tCtl.rowPercent = 1;

// Tabulate
call tabulate(product_data, "Product_Type ~ Region", tCtl);
=====================================================================================
   Product_Type                               Region                            Total
=====================================================================================
                       East          North          South           West

       Clothing        12.5           62.5           12.5           12.5          100
    Electronics        38.5            7.7           38.5           15.4          100
      Furniture        30.0           30.0           10.0           30.0          100
     Home Goods        14.3           42.9           28.6           14.3          100
           Toys        33.3           25.0           16.7           25.0          100

=====================================================================================
Table reports row percentages.
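
To see where these numbers come from, here is a minimal sketch that recomputes the row percentages by hand. It assumes the counts are copied from the cross-tabulation above into a plain matrix; tab and row_pct are illustrative names, not part of tabulate:

```gauss
// Counts copied by hand from the cross-tabulation above
// Rows: Clothing, Electronics, Furniture, Home Goods, Toys
// Columns: East, North, South, West
tab = { 1 5 1 1,
        5 1 5 2,
        3 3 1 3,
        1 3 2 1,
        4 3 2 3 };

// Divide each count by its row total and convert to percent
row_pct = 100 * (tab ./ sumr(tab));

print row_pct;
```

For example, Clothing has 5 of its 8 observations in the North region, which gives 62.5 percent.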

Alternatively, we can compute the column percentages:

/*
** Relative column tabulations
*/ 
struct tabControl tCtl;
tCtl = tabControlCreate();

// Specify column percentages
tCtl.columnPercent = 1;

// Tabulate product types
call tabulate(product_data, "Product_Type ~ Region", tCtl);
===========================================================================
   Product_Type                                  Region
===========================================================================
                       East          North          South           West

       Clothing         7.1           33.3            9.1           10.0
    Electronics        35.7            6.7           45.5           20.0
      Furniture        21.4           20.0            9.1           30.0
     Home Goods         7.1           20.0           18.2           10.0
           Toys        28.6           20.0           18.2           30.0
          Total         100            100            100            100

===========================================================================
Table reports column percentages.
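
The column percentages can be checked the same way. This sketch assumes the counts from the cross-tabulation earlier in the post are entered by hand; tab and col_pct are illustrative names:

```gauss
// Counts copied by hand from the cross-tabulation earlier in the post
// Rows: Clothing, Electronics, Furniture, Home Goods, Toys
// Columns: East, North, South, West
tab = { 1 5 1 1,
        5 1 5 2,
        3 3 1 3,
        1 3 2 1,
        4 3 2 3 };

// sumc returns the column totals as a column vector,
// so transpose it to a row before dividing
col_pct = 100 * (tab ./ sumc(tab)');

print col_pct;
```

For example, Electronics accounts for 5 of the 11 observations in the South region, which gives 45.5 percent.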

Visualizing Distributions

While tables provide numerical insights, frequency plots offer an intuitive visual representation. GAUSS 25 enhancements to the plotFreq function include:

  • Automatic category labeling for better clarity.
  • New support for the by keyword to split data by category.
  • New percentage distributions.

Example: Visualizing Product Type Percent Distribution

To start, let's look at the percentage distribution of product type. To help with interpretation, we'll sort the graph by frequency and use a percentage axis:

// Sort frequencies
sort = 1;

// Report percentage axis
pct_axis = 1;

// Generate frequency plot
plotFreq(product_data, "Product_Type", sort, pct_axis);

Product type percentage distribution plot in GAUSS.

Example: Visualizing Product Type Distribution by Region

Next, let's visualize the distribution of the product types across regions using the plotFreq function and the by keyword:

// Generate frequency plot
plotFreq(product_data, "Product_Type + by(Region)");

Product distribution frequency plot.
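
The grouping can presumably be turned around to compare regions within each product type. This is an untested sketch that assumes by() accepts any categorical column in the dataframe, as it does for Region above:

```gauss
// Load the same dataset used above
product_data = loadd(__FILE_DIR $+ "product_data.csv");

// Assumption: same formula syntax as above, with the two variables swapped
plotFreq(product_data, "Region + by(Product_Type)");
```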

Conclusion

In this post, we've demonstrated how updates to frequency, tabulate, and plotFreq in GAUSS 25 make categorical data analysis more efficient. The enhancements covered here are variable-name detection and sorting in frequency tables, row and column percentages in cross-tabulations, and automatic labels, percentage axes, and by-group splits in frequency plots.

Further Reading

  1. Introduction to Categorical Variables
  2. Easy Management of Categorical Variables
  3. What is a GAUSS Dataframe and Why Should You Care?
  4. Getting Started With Survey Data In GAUSS