Introduction
Survey data is a powerful resource for analysis, providing a window into people's thoughts, behaviors, and experiences. By collecting responses from a diverse sample of respondents on a range of topics, surveys offer invaluable insights. These can help researchers, businesses, and policymakers make informed decisions and understand diverse perspectives.
In today's blog, we'll look more closely at survey data, including:
- Fundamental characteristics of survey data.
- Data cleaning considerations.
- Data exploration using frequency tables and data visualizations.
- Managing survey data in GAUSS.
Survey Data
Survey data presents unique characteristics and challenges that require careful consideration during the data analysis process.
Characteristic | Description
---|---
Categorical Nature | Survey data often involves categorical variables, where responses are grouped into distinct categories. Understanding the nature of these categories is crucial for choosing appropriate analysis methods.
Ordinal and Nominal Variables | It is important to recognize the distinction between ordinal variables (categories with a meaningful order) and nominal variables (categories without a specific order). This impacts the choice of statistical tests and visualization techniques.
Missing Data | Surveys may have missing or incomplete responses. Strategies for handling missing data, such as imputation or excluding incomplete cases, need to be considered.
Large Sample Sizes | Surveys often involve large sample sizes, leading to statistically significant but not necessarily practically significant results. It's crucial to consider whether the observed results are meaningful or impactful in the specific context of the study.
Multivariate Nature | Surveys explore relationships among multiple variables simultaneously. Multivariate analysis allows for a more comprehensive understanding of the complex relationships between different factors.
Choice Modeling | Surveys act as a primary data collection method for understanding individuals' preferences and choices. Choice modeling techniques expand the insights gained from survey responses, providing a quantitative framework for analyzing decision-making processes in various contexts.
Data Cleaning Considerations For Analyzing Survey Data
Data cleaning allows us to identify and address errors, inconsistencies, and missing values. It is crucial for survey data and helps to:
- Ensure accuracy.
- Improve reliability.
- Make meaningful and trustworthy insights.
Cleaning survey data includes some standard steps, such as:
- Handling missing values,
- Detecting outliers,
and some steps that are more specific to survey data, such as:
- Performing consistency checks on survey responses,
- Recoding categorical variables,
- Handling open-ended responses.
Common survey data cleaning steps include:
- Handling missing data.
- Outlier detection and treatment.
- Standardizing variables.
- Checking for consistency.
- Addressing duplicate entries.
- Recoding and categorization.
- Handling open-ended responses.
- Dealing with coding errors.
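To make these steps concrete, here is a rough sketch of a few of them in Python with pandas. This is an illustrative, language-neutral example with made-up toy data, not part of the GAUSS workflow shown later:

```python
import pandas as pd
import numpy as np

# Toy survey responses with common problems baked in
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "age": [34, -9, -9, 27, 154],          # -9 = non-response code, 154 = implausible
    "beverage": ["coffee", "Coffee", "Coffee", "tea ", "soda"],
})

# 1. Address duplicate entries (keep the first record per id)
df = df.drop_duplicates(subset="id")

# 2. Recode the -9 sentinel as a true missing value
df["age"] = df["age"].replace(-9, np.nan)

# 3. Treat implausible outliers with a simple range check
df.loc[df["age"] > 110, "age"] = np.nan

# 4. Standardize categorical labels (case and whitespace)
df["beverage"] = df["beverage"].str.strip().str.lower()

print(df)
```

The same logic applies in any tool: identify the sentinel codes, duplicates, and label inconsistencies before computing any statistics.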
Exploring Survey Data
Exploratory data analysis is an important tool that can help us uncover insights from survey data without complicated computations. During this step, basic statistical tools like frequency tables, contingency tables, and summary statistics can shed light on important patterns and trends in the data.
One-Way Frequency Tables
Frequency tables provide a simple tabulation of the number of occurrences of each category in a single categorical variable. They display the counts (frequencies) of each category along with their corresponding percentages or proportions. Frequency tables are univariate, meaning they describe the distribution of one variable.
A simple frequency table can help us identify:
- Inconsistencies, coding errors, typos, and other errors in categorical labels.
- Outliers and missing values.
- General distribution characteristics. For example, we may find that one level of a categorical variable makes up 90% of our observations.
 | Count | Total % | Cum. %
---|---|---|---
Coffee | 31 | 36.0 | 36.0
Tea | 27 | 31.4 | 67.4
Soda | 28 | 32.6 | 100
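As a language-neutral illustration of how such a table is built, here is a pandas sketch that reconstructs the beverage counts above from raw responses (the response vector is fabricated to match the counts):

```python
import pandas as pd

# Rebuild the beverage responses behind the frequency table above
responses = ["Coffee"] * 31 + ["Tea"] * 27 + ["Soda"] * 28
bev = pd.Series(responses, name="beverage")

counts = bev.value_counts()                   # frequencies
pct = (100 * counts / counts.sum()).round(1)  # total %
cum = pct.cumsum()                            # cumulative %

freq_table = pd.DataFrame({"Count": counts, "Total %": pct, "Cum. %": cum})
print(freq_table)
```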
Two-Way Tables
Two-way tables, also known as contingency tables, are similar to frequency tables but offer additional information about data interactions. They display the frequency combinations of two categorical variables. This provides a snapshot of how these variables interact, and helps us uncover patterns and associations within survey data.
Two-way tables present information in a structured grid:
- The columns correspond to one variable.
- The rows correspond to the other variable.
- The intersection of a row and column represent the frequency of observations having a pair of outcomes.
 | Breakfast | Lunch | Dinner
---|---|---|---
Coffee | 20 | 8 | 3
Tea | 12 | 10 | 5
Soda | 8 | 10 | 10
As an example, consider the table above:
- The columns represent the outcomes for a variable meal_time: Breakfast, Lunch, and Dinner.
- The rows represent the outcomes for a variable beverage_choice: Coffee, Tea, and Soda.
- The bottom row contains the counts for Soda orders across all possible meal times.
- The last column contains counts for all beverage options at Dinner.
- The bottom-right corner tells us that 10 Sodas were ordered at Dinner.
Two-way tables are an efficient way to reveal the intricate relationships between two categorical variables. By presenting information in a structured grid, these tables offer a straightforward way to discern patterns, making it easier to grasp how variables interact.
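For comparison, the same contingency table can be built in pandas with `crosstab`. The long-format data below is fabricated to reproduce the counts in the table above:

```python
import pandas as pd

# Reconstruct the beverage-by-meal observations from the table above
data = pd.DataFrame({
    "beverage_choice": ["Coffee"] * 31 + ["Tea"] * 27 + ["Soda"] * 28,
    "meal_time": (["Breakfast"] * 20 + ["Lunch"] * 8 + ["Dinner"] * 3 +
                  ["Breakfast"] * 12 + ["Lunch"] * 10 + ["Dinner"] * 5 +
                  ["Breakfast"] * 8 + ["Lunch"] * 10 + ["Dinner"] * 10),
})

# Two-way (contingency) table with row and column totals
table = pd.crosstab(data["beverage_choice"], data["meal_time"], margins=True)
print(table)
```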
Data Visualizations
Data plots are a great way to understand data trends, observe outliers, and identify other data issues. When choosing a data plot, it is important to consider which plot is best suited to the type of the variable.
Chart Type | Best Use
---|---
Bar Charts | Ideal for comparing the frequency or distribution of categorical variables.
Stacked Bar Charts | Useful for comparing the composition of different groups, where each bar is divided into segments representing subcategories.
Pie Charts | Show the proportion of each category in relation to the whole.
Histograms | Depict the distribution of a continuous variable by dividing it into intervals (bins) and showing the frequency of observations in each interval.
Line Charts | Demonstrate trends or patterns over a continuous variable or time.
Scatter Plots | Visualize the relationship between two continuous variables.
Box Plots (Box-and-Whisker Plots) | Display the distribution of a variable, including median, quartiles, and outliers.
Hands-On With Survey Data: NextGen National Household Travel Survey
Let's look more closely at survey data using GAUSS and real-world transportation data.
Today's Data
Today we'll be working with the 2022 National Household Travel Survey (NHTS). This survey is designed to collect comprehensive information about travel patterns and travel behavior in the United States.
The NHTS survey:
- Gathers data on various aspects of travel, including daily commuting, recreational trips, shopping, and other activities.
- Is typically conducted at regular intervals to capture changes in travel behavior over time, though today we will only consider the 2022 survey results.
- Utilizes a combination of interviews and diaries to collect data from a representative sample of households across the country.
- Is valuable for transportation planners, policymakers, and researchers in making informed decisions regarding infrastructure development, traffic management, and other transportation-related initiatives.
The raw data from the NHTS is split into four separate CSV files containing:
- Vehicle data.
- Trip data.
- Household data.
- Person data.
Today we will work with the trip data.
Federal Highway Administration. (2022). 2022 National Household Travel Survey, U.S. Department of Transportation, Washington, DC. Available online: https://nhts.ornl.gov.
Loading The Data
Let's get started by loading the data into GAUSS using the loadd procedure. We will also compute descriptive statistics for our data:
// Load trip data
trip_data = loadd("trip_data.gdat");
// Preliminary summary stats
dstatmt(trip_data);
-------------------------------------------------------------------------------------------
Variable        Mean       Std Dev     Variance     Minimum     Maximum    Valid  Missing
-------------------------------------------------------------------------------------------
HOUSEID        9e+09      5.83e+04    3.399e+09       9e+09       9e+09    31074        0
PERSONID       1.681        0.9994       0.9989           1           9    31074        0
TRIPID         2.438         1.792        3.209           1          36    31074        0
SEQ_TRIPID     2.436          1.79        3.203           1          36    31074        0
VEHCASEID  7.619e+11     3.244e+11    1.052e+23          -1       9e+11    31074        0
FRSTHM         -----         -----        -----         Yes          No    31074        0
PARK           -----         -----        -----  Valid skip          No    31074        0
TRAVDAY        -----         -----        -----      Sunday    Saturday    31074        0
DWELTIME       95.18         164.3      2.7e+04          -9        1050    31074        0
PUBTRANS       -----         -----        -----  Used publi  Did not us    31074        0
TRIPPURP       -----         -----        -----  Not ascert  Not a home    31074        0
WHYTRP1S       -----         -----        -----        Home   Something    31074        0
TRVLCMIN       24.55         46.48         2161          -9        1425    31074        0
TRPTRANS       -----         -----        -----         Car  School bus    31074        0
NUMONTRP       1.997         3.478         12.1           1          99    31074        0
NONHHCNT      0.4141         3.388        11.48           0          98    31074        0
HHACCCNT       1.583        0.8916        0.795           1           8    31074        0
WHYTO          -----         -----        -----  Regular ac   Something    31074        0
WALK           -----         -----        -----  Valid skip  N/A - Didn    31074        0
TRPMILES       13.97         85.42         7296          -9        4859    31074        0
VMT_MILE       7.527         32.18         1035          -9        1683    31074        0
GASPRICE         398         68.46         4686       272.7       597.9    31074        0
NUMADLT        2.059        0.7616         0.58           1           8    31074        0
HOMEOWN        -----         -----        -----  Owned by h  Occupied w    31074        0
RAIL           -----         -----        -----         Yes          No    31074        0
CENSUS_D       -----         -----        -----  New Englan     Pacific    31074        0
CENSUS_R       -----         -----        -----   Northeast        West    31074        0
CDIVMSAR       -----         -----        -----  New Englan  Pacific No    31074        0
HHFAMINC       -----         -----        -----  I prefer n  $125,000 t    31074        0
HH_RACE        -----         -----        -----       White  Other race    31074        0
HHSIZE         2.822         1.447        2.093           1          10    31074        0
HHVEHCNT       2.134         1.078        1.163           0          11    31074        0
MSACAT         -----         -----        -----  MSA of 1 m  Not in MSA    31074        0
MSASIZE        -----         -----        -----   In an MSA  Not in MSA    31074        0
URBAN          -----         -----        -----  In an urba  Not in urb    31074        0
URBANSIZE      -----         -----        -----  50,000-199  Not in urb    31074        0
URBRUR         -----         -----        -----       Urban       Rural    31074        0
TDAYDATE       -----         -----        -----  2022-01-01  2023-01-01    31074        0
WRKCOUNT       1.304        0.9474       0.8976           0           6    31074        0
R_AGE           46.8         20.77        431.2           5          92    31074        0
R_SEX          -----         -----        -----      Refuse      Female    31074        0
R_RACE         -----         -----        -----       White  Other race    31074        0
EDUC           -----         -----        -----  Valid skip  Profession    31074        0
VEHTYPE        -----         -----        -----  Valid skip  Motorcycle    31074        0
There are many ways to preview dataframes in GAUSS, but for a wide dataset containing many variables, I find dstatmt the easiest to view.
The descriptive statistics themselves provide some useful information:
- Many of the continuous variables, such as TRPMILES and TRVLCMIN, have minimum values below zero. These don't make sense, and it is likely that -9 is coded to represent something else, such as a non-response.
- There are 31074 valid observations and no missing values for any variable.
The descriptive statistics report also provides insights beyond the traditional descriptive statistics:
- The data contains a mixture of categorical and numerical data.
- Observations in our dataset are defined by a set of identification variables: HOUSEID, PERSONID, TRIPID, SEQ_TRIPID, VEHCASEID.
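The -9 sentinel issue noted above is worth handling early, because it silently distorts minimums, means, and variances. As a language-neutral sketch, here is how the recoding looks in pandas, using a small fabricated slice of trip-like data (not the actual NHTS file):

```python
import pandas as pd
import numpy as np

# Hypothetical slice of trip data with -9 non-response codes
trips = pd.DataFrame({
    "TRPMILES": [4.2, -9.0, 13.1, 7.5],
    "TRVLCMIN": [15.0, 30.0, -9.0, 20.0],
})

# Recode the -9 sentinel as missing before computing statistics
clean = trips.replace(-9, np.nan)

print(trips["TRPMILES"].min())   # the sentinel distorts the minimum
print(clean["TRPMILES"].min())   # sensible minimum after recoding
```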
Checking For Duplicates
As a first step, we'll confirm that our data contains unique observations using the isunique procedure.
isunique(trip_data);
1.0000000
The return value of 1 indicates that our dataset is unique, without any duplicates.
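The same check is easy to express in other tools. For instance, a pandas sketch with hypothetical ID columns (a dataset is "unique" if no full row appears twice):

```python
import pandas as pd

# Hypothetical trip records keyed by household/person/trip IDs
trips = pd.DataFrame({
    "HOUSEID":  [1, 1, 2, 2],
    "PERSONID": [1, 2, 1, 1],
    "TRIPID":   [1, 1, 1, 1],
})

# Count fully duplicated rows, then drop them
print(trips.duplicated().sum())
trips = trips.drop_duplicates()
is_unique = not trips.duplicated().any()
print(is_unique)
```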
Examining Category Labels
Now that we have confirmed that our dataset contains no duplicates, one of the first data cleaning steps with categorical data is to examine the category labels, both to check for errors and to get an understanding of the distribution.
Let’s look at the labels of the TRIPPURP variable using a sorted frequency table.
// Print frequency table for 'TRIPPURP'
frequency(trip_data, "TRIPPURP", 1);
                                 Label      Count    Total %     Cum. %
                Home-based other (HBO)       7714      24.82      24.82
           Not a home-based trip (NHB)       7035      22.64      47.46
           Home-based shopping (HBSHP)       6884      22.15      69.62
                 Home-based work (HBW)       4871      15.68      85.29
Home-based social/recreational (HBSOC)       4546      14.63      99.92
                       Not ascertained         24    0.07723        100
                                 Total      31074        100
Using this we can see that three categories make up almost 70% of the trips: "Home-based other", "Not a home-based trip", and "Home-based shopping".
The frequency table is also useful for learning more about our labels. In this table, the labels appear to be clean and we don’t see anything that suggests typos or errors.
To clean up the labels, let's separate the abbreviations from the descriptions. We can do this using some simple string manipulation in GAUSS.
First, let’s separate the abbreviations from the full descriptions by splitting the labels at "(" and storing the new string arrays:
// Use '(' to split existing labels into 2 columns
tmp = strsplit(trip_data[. , "TRIPPURP"], "(" );
// Trim whitespace from the front and back of both variables
tmp = strtrim(tmp);
// Rename columns
tmp = setColNames(tmp , "TRIP_DESC"$|"TRIP_ABBR");
// Preview data
head(tmp);
       TRIP_DESC        TRIP_ABBR
Home-based socia           HBSOC)
Home-based socia           HBSOC)
Home-based shopp           HBSHP)
Not a home-based             NHB)
Home-based shopp           HBSHP)
The TRIP_DESC variable looks good: it stores the full description of TRIPPURP. However, the abbreviations in TRIP_ABBR don't quite look right; we still need to strip the ")".
/*
** Remove the right parenthesis
*/
// Replace ')' with an empty string
tmp[. , "TRIP_ABBR"] = strreplace(tmp[. , "TRIP_ABBR"], ")", "");
// Check frequencies for both variables
frequency(tmp, "TRIP_DESC + TRIP_ABBR");
                         Label      Count    Total %     Cum. %
              Home-based other       7714      24.82      24.82
           Home-based shopping       6884      22.15      46.98
Home-based social/recreational       4546      14.63      61.61
               Home-based work       4871      15.68      77.28
         Not a home-based trip       7035      22.64      99.92
               Not ascertained         24    0.07723        100
                         Total      31074        100

Label      Count    Total %     Cum. %
              24    0.07723    0.07723
HBO         7714      24.82       24.9
HBSHP       6884      22.15      47.06
HBSOC       4546      14.63      61.69
HBW         4871      15.68      77.36
NHB         7035      22.64        100
Total      31074        100
One final change we may want to make is to replace the missing abbreviation label for the "Not ascertained" category using the recodeCatLabels procedure.
/*
** Recode missing label
*/
// Add missing label for 'NA'
tmp[., 2] = recodecatlabels(tmp[., 2], "", "NA", "TRIP_ABBR");
// Check frequencies for both variables
frequency(tmp, "TRIP_DESC + TRIP_ABBR");
                         Label      Count    Total %     Cum. %
              Home-based other       7714      24.82      24.82
           Home-based shopping       6884      22.15      46.98
Home-based social/recreational       4546      14.63      61.61
               Home-based work       4871      15.68      77.28
         Not a home-based trip       7035      22.64      99.92
               Not ascertained         24    0.07723        100
                         Total      31074        100

Label      Count    Total %     Cum. %
NA            24    0.07723    0.07723
HBO         7714      24.82       24.9
HBSHP       6884      22.15      47.06
HBSOC       4546      14.63      61.69
HBW         4871      15.68      77.36
NHB         7035      22.64        100
Total      31074        100
We've successfully created two new variables, TRIP_DESC and TRIP_ABBR, which we can concatenate to our trip_data dataframe:
// Add the new variables to the end of 'trip_data'
trip_data = trip_data ~ tmp;
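The label-splitting workflow above translates directly to other tools. As an illustrative sketch, here is a pandas version using a few fabricated TRIPPURP labels (the column names mirror the GAUSS example but the data is made up):

```python
import pandas as pd

# Hypothetical TRIPPURP labels matching the pattern above
trip = pd.DataFrame({"TRIPPURP": [
    "Home-based work (HBW)",
    "Home-based shopping (HBSHP)",
    "Not a home-based trip (NHB)",
    "Not ascertained",
]})

# Split at '(' into description and abbreviation, then tidy up
parts = trip["TRIPPURP"].str.split("(", n=1, expand=True)
trip["TRIP_DESC"] = parts[0].str.strip()
trip["TRIP_ABBR"] = parts[1].str.replace(")", "", regex=False).str.strip()

# Recode the missing abbreviation for 'Not ascertained'
trip["TRIP_ABBR"] = trip["TRIP_ABBR"].fillna("NA")

print(trip[["TRIP_DESC", "TRIP_ABBR"]])
```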
Two-Way Tables
Frequency tables provide insights into a single categorical variable. However, if we are interested in the relationship between multiple categorical variables, we need to use two-way, or contingency, tables.
Let's use a contingency table to look at the relationship between the URBRUR and VEHTYPE variables. To do this we can use the tabulate procedure, introduced in GAUSS 24.
The tabulate function requires either a dataframe or filename input, along with a formula string to specify which variables to include in the table. It also takes an optional tabControl structure input for advanced options.
- data: A GAUSS dataframe or filename.
- formula: String, formula string, e.g. "df1 ~ df2 + df3". The "df1" categories will be reported in rows, and separate columns will be returned for each category in "df2" and "df3".
- tbctl: Optional, an instance of the tabControl structure used for advanced table options.
// Compute a two-way table with
// VEHTYPE categories in rows
// URBRUR categories in columns
// Results stored in tab_df
tab_df = tabulate(trip_data, "VEHTYPE ~ URBRUR");
===============================================================
VEHTYPE                  URBRUR                          Total
===============================================================
                          Urban        Rural
Valid skip                 4061          719            4780
Car/Stationwagon           9306         1774           11080
Van                        1438          358            1796
SUV                        8275         1935           10210
Pickup Truck               2043         1043            3086
Other Truck                  36           24              60
RV/Motorhome                  4            4               8
Motorcycle/Moped             39           15              54
Total                     25202         5872           31074
===============================================================
The initial counts provide us some insights:
- The total counts of vehicles are higher in urban areas.
- In urban areas, the most frequently occurring vehicle type is Car/Stationwagon.
- In rural areas, the most frequently occurring vehicle type is SUV.
It might be useful to see the relative percentages of the vehicle types. Because we stored the counts in tab_df, this can easily be done.
First, let's look at what percentage each category makes up of the total vehicles in the urban and rural areas, respectively.
// Compute percentages within urban and rural areas
// by dividing by column totals
tab_df[., 1]~(tab_df[., 2:3]./sumc(tab_df[., 2:3])');
VEHTYPE              URBRUR_Urban    URBRUR_Rural
Valid skip                 0.1611          0.1224
Car/Stationwagon           0.3692          0.3021
Van                        0.0571          0.0610
SUV                        0.3283          0.3295
Pickup Truck               0.0811          0.1776
Other Truck                0.0014          0.0041
RV/Motorhome               0.0002          0.0007
Motorcycle/Moped           0.0015          0.0026
These percentages help us see that:
- The distributions of Car/Stationwagon, Van, and SUV are fairly similar in urban and rural areas.
- There is a higher percentage of the Pickup Truck, Other Truck, and Motorcycle/Moped categories in rural areas.
Alternatively we can look at the distribution of each vehicle type across rural and urban areas.
// Compute percentages across urban and rural areas
// by dividing by row totals
tab_df[., 1]~(tab_df[., 2:3]./sumr(tab_df[., 2:3]));
VEHTYPE              URBRUR_Urban    URBRUR_Rural
Valid skip                 0.8496          0.1504
Car/Stationwagon           0.8399          0.1601
Van                        0.8007          0.1993
SUV                        0.8105          0.1895
Pickup Truck               0.6620          0.3380
Other Truck                0.6000          0.4000
RV/Motorhome               0.5000          0.5000
Motorcycle/Moped           0.7222          0.2778
This table tells a similar story from a different perspective:
- Urban vehicles make up 80-84% of the Car/Stationwagon, Van, and SUV categories.
- Urban vehicles make up only 66% and 60% of the Pickup Truck and Other Truck categories, respectively.
- Urban vehicles make up 72% of the Motorcycle/Moped category.
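Both normalizations above (within-area column percentages and within-type row percentages) are standard contingency-table operations. As an illustrative cross-check in pandas, using a subset of the counts copied from the table earlier:

```python
import pandas as pd

# Counts copied from the VEHTYPE-by-URBRUR table above (subset of rows)
tab = pd.DataFrame(
    {"Urban": [9306, 1438, 8275, 2043], "Rural": [1774, 358, 1935, 1043]},
    index=["Car/Stationwagon", "Van", "SUV", "Pickup Truck"],
)

# Share of each vehicle type within urban and rural areas (column %)
within_area = tab / tab.sum(axis=0)

# Urban/rural split within each vehicle type (row %)
within_type = tab.div(tab.sum(axis=1), axis=0)

print(within_area.round(4))
print(within_type.round(4))
```

Dividing by column totals answers "what do people in urban areas drive?", while dividing by row totals answers "where are pickup trucks found?"; the two views can suggest quite different stories from the same counts.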
Excluding Categories
Suppose we don't want to include the "Valid skip" responses in our contingency table. We can remove them using the exclude member of the tabControl structure.
To specify categories to exclude from the contingency table, we use a string containing the variable name and category separated by a ":".
// Declare structure
struct tabControl tbCtl;
// Fill defaults
tbCtl = tabControlCreate();
// Specify to exclude the 'Valid skip' category
// from the 'VEHTYPE' variable
tbCtl.exclude = "VEHTYPE:Valid skip";
// Find contingency table including tbCtl input
tab_df2 = tabulate(trip_data, "VEHTYPE ~ URBRUR", tbCtl);
=============================================================================
VEHTYPE                  URBRUR                          Total
=============================================================================
                          Urban        Rural
Car/Stationwagon           9306         1774           11080
Van                        1438          358            1796
SUV                        8275         1935           10210
Pickup Truck               2043         1043            3086
Other Truck                  36           24              60
RV/Motorhome                  4            4               8
Motorcycle/Moped             39           15              54
Total                     21141         5153           26294
=============================================================================
Now our table excludes the "Valid skip" category.
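In tools without a built-in exclusion option, the same result comes from filtering the data before tabulating. A pandas sketch with fabricated long-format responses:

```python
import pandas as pd

# Hypothetical long-format responses with a 'Valid skip' category
data = pd.DataFrame({
    "VEHTYPE": ["Car", "Valid skip", "SUV", "Car", "Valid skip"],
    "URBRUR":  ["Urban", "Urban", "Rural", "Rural", "Rural"],
})

# Drop the unwanted category before building the contingency table
subset = data[data["VEHTYPE"] != "Valid skip"]
table = pd.crosstab(subset["VEHTYPE"], subset["URBRUR"])
print(table)
```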
Data Visualizations
Data visualizations are one of the most useful tools for data exploration. There are several ways to utilize the plotting capabilities of GAUSS to explore survey data.
Frequency plots
First, let's use a frequency plot to explore the distribution of responses across census regions. To do this, we will utilize the plotFreq procedure.
// Census region frequencies
plotFreq(trip_data, "CENSUS_R", 1);
The sorted frequency plot allows us to quickly identify that the most frequently occurring region in our data is "South".
Plotting Contingency Tables
Like frequency tables, frequency plots are useful for visualizing the categories of one variable. However, they don't provide much insight into the relationship across categorical variables.
To visualize the relationship between VEHTYPE and URBRUR, let's create a bar plot using our stored contingency table dataframe, tab_df2.
The plotBar function requires two inputs: labels for the x-axis and the corresponding heights.
The labels for our bar plot are the vehicle types, which are stored as a dataframe in the first column of tab_df2. To use them as inputs we will need to:
- Get the category labels.
- Convert them to a string array.
// Get category labels
labels = getCategories(tab_df2, "VEHTYPE");
// Convert to string array
labels_sa = ntos(labels);
The corresponding heights will come from the tab_df2 variable. Let's find out the variable names in tab_df2:
// Print the variable names from 'tab_df2'
getcolnames(tab_df2);
VEHTYPE URBRUR_Urban URBRUR_Rural
The final two variable names were created by the tabulate function to tell us which original variable the column came from, URBRUR, and which category is being referenced. Let's change the variable names to just Urban and Rural to make them more concise.
new_names = "Urban" $| "Rural";
col_idx = { 2, 3 };
tab_df2 = setColNames(tab_df2, new_names, col_idx);
Now we're ready to use the Urban and Rural count variables to plot our data.
plotBar(labels_sa, tab_df2[., "Urban" "Rural"]);
By default, this plots our bars side-by-side. We can change this using a plotControl structure and plotSetBar.
// Declare structure
struct plotControl plt;
// Fill defaults
plt = plotGetDefaults("bar");
// Set bars to be solid and stacked
plotSetBar(&plt, 1, 1);
// Plot contingency table
plotBar(plt, labels_sa, tab_df2[., "Urban" "Rural"]);
Scatter Plots
Now suppose we wish to examine the relationship between two continuous variables across the categories of a categorical variable. We can do this using the 'by' keyword with the plotScatter function.
// Plot TRPMILES vs GASPRICE
// Sorting by color using the categories in CENSUS_R
plotScatter(trip_data, "TRPMILES ~ GASPRICE + by(CENSUS_R)");
Adding the census regions provides some interesting observations:
- The West region has higher gas prices than other regions.
- The South region seems to have lower gas prices than other regions.
Conclusion
In this blog, we've covered some fundamental concepts related to survey data and looked at some GAUSS tools for cleaning, exploring, and visualizing survey data.
Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.