Introduction
Categorical data plays a key role in data analysis, offering a structured way to capture qualitative relationships. Before running any models, simply examining the distribution of categorical data can provide valuable insights into underlying patterns.
Whether summarizing survey responses or exploring demographic trends, fundamental statistical tools, such as frequency counts and tabulations, help reveal these patterns.
GAUSS offers several tools for summarizing and visualizing categorical data, including:
- tabulate: Quickly compute cross-tabulations and summary tables.
- frequency: Generate frequency counts and relative frequencies.
- plotFreq: Create visual representations of frequency distributions.
In GAUSS 25, these functions received significant enhancements, making them more powerful and user-friendly. In this post, we'll explore these improvements and demonstrate their practical applications.
Frequency Counts
The GAUSS frequency function generates frequency tables for categorical variables. In GAUSS 25, it has been enhanced to utilize metadata from dataframes, automatically detecting and displaying variable names. Additionally, the function now includes an option to sort the frequency table, making it easier to analyze distributions.
Example: Counting Product Categories
For this example, we'll use a hypothetical dataset containing 50 observations of two categorical variables: Product_Type and Region. You can download the dataset here.
To start, we'll load the data using loadd:
/*
** Sample product sales data
*/
// Import sales dataframe
product_data = loadd(__FILE_DIR $+ "product_data.csv");
// Preview data
head(product_data);
    Product_Type           Region
     Electronics             East
      Home Goods             West
       Furniture            North
            Toys             East
      Home Goods            North
Next, we will compute the frequency counts of the Product_Type variable:
// Compute frequency counts
frequency(product_data, "Product_Type");
=============================================
   Product_Type     Count   Total %    Cum. %
=============================================
       Clothing         8        16        16
    Electronics        13        26        42
      Furniture        10        20        62
     Home Goods         7        14        76
           Toys        12        24       100
=============================================
          Total        50       100
We can also generate a sorted frequency table, using the optional sorting argument:
// Compute frequency counts, sorted by frequency
frequency(product_data, "Product_Type", 1);
=============================================
   Product_Type     Count   Total %    Cum. %
=============================================
    Electronics        13        26        26
           Toys        12        24        50
      Furniture        10        20        70
       Clothing         8        16        86
     Home Goods         7        14       100
=============================================
          Total        50       100
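To make the Total % and Cum. % columns concrete, here is a minimal sketch that recomputes them by hand from the sorted counts. It assumes the counts are typed in from the table above; the variable names are illustrative, and this is not meant to describe how frequency works internally.

```gauss
// Counts copied by hand from the sorted frequency table
// (Electronics, Toys, Furniture, Clothing, Home Goods)
cat_counts = { 13, 12, 10, 8, 7 };

// Share of each category, in percent of all 50 observations
pct = 100 * (cat_counts ./ sumc(cat_counts));

// Running total of the shares
cum_pct = cumsumc(pct);

// Print counts, percentages, and cumulative percentages side by side
print cat_counts~pct~cum_pct;
```

The second and third printed columns should match the Total % and Cum. % columns of the sorted table.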
Tabulating Categorical Data
While frequency counts help us understand individual categories, the tabulate function allows us to explore relationships between categorical variables. This function performs cross-tabulations, offering deeper insights into categorical distributions. In GAUSS 25, it was enhanced with new options for calculating row and column percentages, making comparisons easier.
Example: Cross-Tabulating Product Type and Region
Now let's look at the relationship between Product_Type and Region.
// Generate cross-tabulation
call tabulate(product_data, "Product_Type ~ Region");
=====================================================================================
   Product_Type                          Region                                 Total
=====================================================================================
                       East       North       South        West
       Clothing           1           5           1           1                     8
    Electronics           5           1           5           2                    13
      Furniture           3           3           1           3                    10
     Home Goods           1           3           2           1                     7
           Toys           4           3           2           3                    12
          Total          14          15          11          10                    50
=====================================================================================
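The Total row and Total column are the margins of the count matrix. As a rough sketch, assuming the counts are entered by hand from the table above (the variable names are illustrative), they can be recomputed with standard matrix functions:

```gauss
// Counts copied by hand from the cross-tabulation
// Rows:    Clothing, Electronics, Furniture, Home Goods, Toys
// Columns: East, North, South, West
ct_counts = { 1 5 1 1,
              5 1 5 2,
              3 3 1 3,
              1 3 2 1,
              4 3 2 3 };

// Row totals: observations per product type (5x1)
row_totals = sumr(ct_counts);

// Column totals: observations per region (1x4)
col_totals = sumc(ct_counts)';

// Grand total: all observations
grand_total = sumc(row_totals);

print row_totals;
print col_totals;
print grand_total;
```

The row totals should agree with the counts reported by frequency for Product_Type, and the grand total with the 50 observations in the dataset.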
By default, the tabulate function generates absolute counts. However, in some cases, relative frequencies provide more meaningful insights. In GAUSS 25, tabulate now includes options to calculate row and column percentages, making it easier to compare distributions across categories.
This is done using the tabControl structure and the rowPercent or columnPercent members.
- Row percentages show how each product type is distributed across regions.
- Column percentages highlight the composition of product types within each region.
/*
** Relative tabulations
*/
struct tabControl tCtl;
tCtl = tabControlCreate();
// Specify row percentages
tCtl.rowPercent = 1;
// Tabulate
call tabulate(product_data, "Product_Type ~ Region", tCtl);
=====================================================================================
   Product_Type                          Region                                 Total
=====================================================================================
                       East       North       South        West
       Clothing        12.5        62.5        12.5        12.5                   100
    Electronics        38.5         7.7        38.5        15.4                   100
      Furniture        30.0        30.0        10.0        30.0                   100
     Home Goods        14.3        42.9        28.6        14.3                   100
           Toys        33.3        25.0        16.7        25.0                    99
=====================================================================================
Table reports row percentages.
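Each row percentage is a cell count divided by its row total. The following is a minimal sketch of that arithmetic, assuming the counts are typed in from the earlier cross-tabulation; the variable names are illustrative, and it is not a description of how tabulate is implemented.

```gauss
// Counts copied by hand from the cross-tabulation
// Rows:    Clothing, Electronics, Furniture, Home Goods, Toys
// Columns: East, North, South, West
ct_counts = { 1 5 1 1,
              5 1 5 2,
              3 3 1 3,
              1 3 2 1,
              4 3 2 3 };

// Divide every cell by its row total (5x4 ./ 5x1) and scale to percent
row_pct = 100 * (ct_counts ./ sumr(ct_counts));

print row_pct;

// Before rounding, each row of percentages sums to 100
print sumr(row_pct);
```

For example, the Clothing row works out to 1/8, 5/8, 1/8, and 1/8, or 12.5, 62.5, 12.5, and 12.5 percent.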
Alternatively, we can compute the column percentages:
/*
** Relative column tabulations
*/
struct tabControl tCtl;
tCtl = tabControlCreate();
// Specify column percentages
tCtl.columnPercent = 1;
// Tabulate product types
call tabulate(product_data, "Product_Type ~ Region", tCtl);
===========================================================================
   Product_Type                          Region
===========================================================================
                       East       North       South        West
       Clothing         7.1        33.3         9.1        10.0
    Electronics        35.7         6.7        45.5        20.0
      Furniture        21.4        20.0         9.1        30.0
     Home Goods         7.1        20.0        18.2        10.0
           Toys        28.6        20.0        18.2        30.0
          Total         100         100         100         100
===========================================================================
Table reports column percentages.
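Column percentages follow the same idea, with each cell divided by its column total instead. Here is a minimal sketch under the same assumptions (hand-entered counts, illustrative variable names, not tabulate's internal implementation):

```gauss
// Counts copied by hand from the cross-tabulation
// Rows:    Clothing, Electronics, Furniture, Home Goods, Toys
// Columns: East, North, South, West
ct_counts = { 1 5 1 1,
              5 1 5 2,
              3 3 1 3,
              1 3 2 1,
              4 3 2 3 };

// Column totals as a 1x4 row vector
col_totals = sumc(ct_counts)';

// Divide every cell by its column total (5x4 ./ 1x4) and scale to percent
col_pct = 100 * (ct_counts ./ col_totals);

print col_pct;
```

For example, the West column works out to 1/10, 2/10, 3/10, 1/10, and 3/10, or 10, 20, 30, 10, and 30 percent.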
Visualizing Distributions
While tables provide numerical insights, frequency plots offer an intuitive visual representation. GAUSS 25 enhancements to the plotFreq function include:
- Automatic category labeling for better clarity.
- New support for the by keyword to split data by category.
- New percentage distributions.
Example: Visualizing Product Type Percent Distribution
To start, let's look at the percentage distribution of product type. To help with interpretation, we'll sort the graph by frequency and use a percentage axis:
// Sort frequencies
sort = 1;
// Report percentage axis
pct_axis = 1;
// Generate frequency plot
plotFreq(product_data, "Product_Type", sort, pct_axis);
Example: Visualizing Product Type Distribution by Region
Next, let's visualize the distribution of the product types across regions using the plotFreq function and the by keyword:
// Generate frequency plot
plotFreq(product_data, "Product_Type + by(Region)");
Conclusion
In this post, we've demonstrated how updates to frequency, tabulate, and plotFreq in GAUSS 25 make categorical data analysis more efficient and insightful. These enhancements provide better readability, enhanced cross-tabulations, and more intuitive visualization options.
Further Reading
- Introduction to Categorical Variables
- Easy Management of Categorical Variables
- What is a GAUSS Dataframe and Why Should You Care?
- Getting Started With Survey Data In GAUSS
Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.