Introduction
Survey data is a powerful resource for analysis, providing a window into people's thoughts, behaviors, and experiences. By collecting responses from a diverse sample of respondents on a range of topics, surveys offer invaluable insights. These can help researchers, businesses, and policymakers make informed decisions and understand diverse perspectives.
In today's blog, we'll look more closely at survey data, including:
- Fundamental characteristics of survey data.
- Data cleaning considerations.
- Data exploration using frequency tables and data visualizations.
- Managing survey data in GAUSS.
Survey Data
Survey data presents unique characteristics and challenges that require careful consideration during the data analysis process.
Characteristic | Description
---|---
Categorical Nature | Survey data often involves categorical variables, where responses are grouped into distinct categories. Understanding the nature of these categories is crucial for choosing appropriate analysis methods.
Ordinal and Nominal Variables | It is important to recognize the distinction between ordinal variables (categories with a meaningful order) and nominal variables (categories without a specific order). This impacts the choice of statistical tests and visualization techniques.
Missing Data | Surveys may have missing or incomplete responses. Strategies for handling missing data, such as imputation or excluding incomplete cases, need to be considered.
Large Sample Sizes | Surveys often involve large sample sizes, leading to statistically significant but not necessarily practically significant results. It's crucial to consider whether the observed results are meaningful or impactful in the specific context of the study.
Multivariate Nature | Surveys explore relationships among multiple variables simultaneously. Multivariate analysis allows for a more comprehensive understanding of the complex relationships between different factors.
Choice Modeling | Surveys act as a primary data collection method for understanding individuals' preferences and choices. Choice modeling techniques expand the insights gained from survey responses, providing a quantitative framework for analyzing decision-making processes in various contexts.
Data Cleaning Considerations For Analyzing Survey Data
Data cleaning allows us to identify and address errors, inconsistencies, and missing values. It is crucial for survey data and helps to:
- Ensure accuracy.
- Improve reliability.
- Make meaningful and trustworthy insights.
Cleaning survey data includes some standard steps, such as:
- Handling missing values,
- Detecting outliers,
and some steps that are more specific to survey data, such as:
- Performing consistency checks on survey responses,
- Recoding categorical variables,
- Handling open-ended responses.
Common survey data cleaning steps include:
- Handling missing data.
- Outlier detection and treatment.
- Standardizing variables.
- Checking for consistency.
- Addressing duplicate entries.
- Recoding and categorization.
- Handling open-ended responses.
- Dealing with coding errors.
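To make these steps concrete, here is a rough sketch of a few of them in Python with pandas. This is an illustrative, language-neutral example with made-up toy data, not part of the GAUSS workflow shown later:

```python
import pandas as pd
import numpy as np

# Toy survey responses with common problems baked in
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "age": [34, -9, -9, 27, 154],          # -9 = non-response code, 154 = implausible
    "beverage": ["coffee", "Coffee", "Coffee", "tea ", "soda"],
})

# 1. Address duplicate entries (keep the first record per id)
df = df.drop_duplicates(subset="id")

# 2. Recode the -9 sentinel as a true missing value
df["age"] = df["age"].replace(-9, np.nan)

# 3. Treat implausible outliers with a simple range check
df.loc[df["age"] > 110, "age"] = np.nan

# 4. Standardize categorical labels (case and whitespace)
df["beverage"] = df["beverage"].str.strip().str.lower()

print(df)
```

The same logic applies in any tool: identify the sentinel codes, duplicates, and label inconsistencies before computing any statistics.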
Exploring Survey Data
Exploratory data analysis is an important tool that can help us uncover insights from survey data without complicated computations. During this step, basic statistical tools like frequency tables, contingency tables, and summary statistics can shed light on important patterns and trends in the data.
One-Way Frequency Tables
Frequency tables provide a simple tabulation of the number of occurrences of each category in a single categorical variable. They display the counts (frequencies) of each category along with their corresponding percentages or proportions. Frequency tables are univariate, meaning they describe the distribution of one variable.
A simple frequency table can help us identify:
- Inconsistencies, coding errors, typos, and other errors in categorical labels.
- Outliers and missing values.
- General distribution characteristics. For example, we may find that one level of a categorical variable makes up 90% of our observations.
 | Count | Total % | Cum. %
---|---|---|---
Coffee | 31 | 36.0 | 36.0
Tea | 27 | 31.4 | 67.4
Soda | 28 | 32.6 | 100
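As a language-neutral illustration of how such a table is built, here is a pandas sketch that reconstructs the beverage counts above from raw responses (the response vector is fabricated to match the counts):

```python
import pandas as pd

# Rebuild the beverage responses behind the frequency table above
responses = ["Coffee"] * 31 + ["Tea"] * 27 + ["Soda"] * 28
bev = pd.Series(responses, name="beverage")

counts = bev.value_counts()                   # frequencies
pct = (100 * counts / counts.sum()).round(1)  # total %
cum = pct.cumsum()                            # cumulative %

freq_table = pd.DataFrame({"Count": counts, "Total %": pct, "Cum. %": cum})
print(freq_table)
```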
Two-Way Tables
Two-way tables, also known as contingency tables, are similar to frequency tables but offer additional information about data interactions. They display the frequency combinations of two categorical variables. This provides a snapshot of how these variables interact, and helps us uncover patterns and associations within survey data.
Two-way tables present information in a structured grid:
- The columns correspond to one variable.
- The rows correspond to the other variable.
- The intersection of a row and column represent the frequency of observations having a pair of outcomes.
 | Breakfast | Lunch | Dinner
---|---|---|---
Coffee | 20 | 8 | 3
Tea | 12 | 10 | 5
Soda | 8 | 10 | 10
As an example, consider the table above:
- The columns represent the outcomes for a variable meal_time: Breakfast, Lunch, and Dinner.
- The rows represent the outcomes for a variable beverage_choice: Coffee, Tea, and Soda.
- The bottom row contains the counts for Soda orders across all possible meal times.
- The last column contains counts for all beverage options at Dinner.
- The bottom-right corner tells us that 10 Sodas were ordered at Dinner.
Two-way tables are an efficient way to reveal the intricate relationships between two categorical variables. By presenting information in a structured grid, these tables offer a straightforward way to discern patterns, making it easier to grasp how variables interact.
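For comparison, the same contingency table can be built in pandas with `crosstab`. The long-format data below is fabricated to reproduce the counts in the table above:

```python
import pandas as pd

# Reconstruct the beverage-by-meal observations from the table above
data = pd.DataFrame({
    "beverage_choice": ["Coffee"] * 31 + ["Tea"] * 27 + ["Soda"] * 28,
    "meal_time": (["Breakfast"] * 20 + ["Lunch"] * 8 + ["Dinner"] * 3 +
                  ["Breakfast"] * 12 + ["Lunch"] * 10 + ["Dinner"] * 5 +
                  ["Breakfast"] * 8 + ["Lunch"] * 10 + ["Dinner"] * 10),
})

# Two-way (contingency) table with row and column totals
table = pd.crosstab(data["beverage_choice"], data["meal_time"], margins=True)
print(table)
```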
Data Visualizations
Data plots are a great way to understand data trends, observe outliers, and identify other data issues. When choosing a data plot, it is important to consider which plot is best suited to the type of the variable.
Chart Type | Best Use
---|---
Bar Charts | Ideal for comparing the frequency or distribution of categorical variables.
Stacked Bar Charts | Useful for comparing the composition of different groups, where each bar is divided into segments representing subcategories.
Pie Charts | Show the proportion of each category in relation to the whole.
Histograms | Depict the distribution of a continuous variable by dividing it into intervals (bins) and showing the frequency of observations in each interval.
Line Charts | Demonstrate trends or patterns over a continuous variable or time.
Scatter Plots | Visualize the relationship between two continuous variables.
Box Plots (Box-and-Whisker Plots) | Display the distribution of a variable, including median, quartiles, and outliers.
Hands-On With Survey Data: NextGen National Household Travel Survey
Let's look more closely at survey data using GAUSS and real-world transportation data.
Today's Data
Today we'll be working with the 2022 National Household Travel Survey (NHTS). This survey is designed to collect comprehensive information about travel patterns and travel behavior in the United States.
The NHTS survey:
- Gathers data on various aspects of travel, including daily commuting, recreational trips, shopping, and other activities.
- Is typically conducted at regular intervals to capture changes in travel behavior over time, though today we will only consider the 2022 survey results.
- Utilizes a combination of interviews and diaries to collect data from a representative sample of households across the country.
- Is valuable for transportation planners, policymakers, and researchers in making informed decisions regarding infrastructure development, traffic management, and other transportation-related initiatives.
The raw data from the NHTS is split into four separate CSV files containing:
- Vehicle data.
- Trip data.
- Household data.
- Person data.
Today we will work with the trip data.
Federal Highway Administration. (2022). 2022 National Household Travel Survey, U.S. Department of Transportation, Washington, DC. Available online: https://nhts.ornl.gov.
Loading The Data
Let's get started by loading the data into GAUSS using the loadd procedure. We will also compute descriptive statistics for our data:
// Load trip data
trip_data = loadd("trip_data.gdat");
// Preliminary summary stats
dstatmt(trip_data);
-------------------------------------------------------------------------------------------
Variable        Mean       Std Dev     Variance     Minimum     Maximum    Valid  Missing
-------------------------------------------------------------------------------------------
HOUSEID        9e+09      5.83e+04    3.399e+09       9e+09       9e+09    31074        0
PERSONID       1.681        0.9994       0.9989           1           9    31074        0
TRIPID         2.438         1.792        3.209           1          36    31074        0
SEQ_TRIPID     2.436          1.79        3.203           1          36    31074        0
VEHCASEID  7.619e+11     3.244e+11    1.052e+23          -1       9e+11    31074        0
FRSTHM         -----         -----        -----         Yes          No    31074        0
PARK           -----         -----        -----  Valid skip          No    31074        0
TRAVDAY        -----         -----        -----      Sunday    Saturday    31074        0
DWELTIME       95.18         164.3      2.7e+04          -9        1050    31074        0
PUBTRANS       -----         -----        -----  Used publi  Did not us    31074        0
TRIPPURP       -----         -----        -----  Not ascert  Not a home    31074        0
WHYTRP1S       -----         -----        -----        Home   Something    31074        0
TRVLCMIN       24.55         46.48         2161          -9        1425    31074        0
TRPTRANS       -----         -----        -----         Car  School bus    31074        0
NUMONTRP       1.997         3.478         12.1           1          99    31074        0
NONHHCNT      0.4141         3.388        11.48           0          98    31074        0
HHACCCNT       1.583        0.8916        0.795           1           8    31074        0
WHYTO          -----         -----        -----  Regular ac   Something    31074        0
WALK           -----         -----        -----  Valid skip  N/A - Didn    31074        0
TRPMILES       13.97         85.42         7296          -9        4859    31074        0
VMT_MILE       7.527         32.18         1035          -9        1683    31074        0
GASPRICE         398         68.46         4686       272.7       597.9    31074        0
NUMADLT        2.059        0.7616         0.58           1           8    31074        0
HOMEOWN        -----         -----        -----  Owned by h  Occupied w    31074        0
RAIL           -----         -----        -----         Yes          No    31074        0
CENSUS_D       -----         -----        -----  New Englan     Pacific    31074        0
CENSUS_R       -----         -----        -----   Northeast        West    31074        0
CDIVMSAR       -----         -----        -----  New Englan  Pacific No    31074        0
HHFAMINC       -----         -----        -----  I prefer n  $125,000 t    31074        0
HH_RACE        -----         -----        -----       White  Other race    31074        0
HHSIZE         2.822         1.447        2.093           1          10    31074        0
HHVEHCNT       2.134         1.078        1.163           0          11    31074        0
MSACAT         -----         -----        -----  MSA of 1 m  Not in MSA    31074        0
MSASIZE        -----         -----        -----   In an MSA  Not in MSA    31074        0
URBAN          -----         -----        -----  In an urba  Not in urb    31074        0
URBANSIZE      -----         -----        -----  50,000-199  Not in urb    31074        0
URBRUR         -----         -----        -----       Urban       Rural    31074        0
TDAYDATE       -----         -----        -----  2022-01-01  2023-01-01    31074        0
WRKCOUNT       1.304        0.9474       0.8976           0           6    31074        0
R_AGE           46.8         20.77        431.2           5          92    31074        0
R_SEX          -----         -----        -----      Refuse      Female    31074        0
R_RACE         -----         -----        -----       White  Other race    31074        0
EDUC           -----         -----        -----  Valid skip  Profession    31074        0
VEHTYPE        -----         -----        -----  Valid skip  Motorcycle    31074        0
There are many ways to preview dataframes in GAUSS, but for a wide dataset containing many variables, I find dstatmt the easiest to view.
The descriptive statistics themselves provide some useful information:
- Many of the continuous variables, such as TRPMILES and TRVLCMIN, have minimum values below zero. These don't make sense, and it is likely that -9 is coded to represent something else, such as a non-response.
- There are 31074 valid observations and no missing values for any variable.
The descriptive statistics report also provides insights beyond the traditional descriptive statistics:
- The data contains a mixture of categorical and numerical data.
- Observations in our dataset are defined by a set of identification variables: HOUSEID, PERSONID, TRIPID, SEQ_TRIPID, VEHCASEID.
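The -9 sentinel issue noted above is worth handling early, because it silently distorts minimums, means, and variances. As a language-neutral sketch, here is how the recoding looks in pandas, using a small fabricated slice of trip-like data (not the actual NHTS file):

```python
import pandas as pd
import numpy as np

# Hypothetical slice of trip data with -9 non-response codes
trips = pd.DataFrame({
    "TRPMILES": [4.2, -9.0, 13.1, 7.5],
    "TRVLCMIN": [15.0, 30.0, -9.0, 20.0],
})

# Recode the -9 sentinel as missing before computing statistics
clean = trips.replace(-9, np.nan)

print(trips["TRPMILES"].min())   # the sentinel distorts the minimum
print(clean["TRPMILES"].min())   # sensible minimum after recoding
```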
Checking For Duplicates
As a first step, we'll confirm that our data contains unique observations using the isunique procedure.
isunique(trip_data);
1.0000000
The return value of 1 indicates that our dataset is unique, without any duplicates.
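The same check is easy to express in other tools. For instance, a pandas sketch with hypothetical ID columns (a dataset is "unique" if no full row appears twice):

```python
import pandas as pd

# Hypothetical trip records keyed by household/person/trip IDs
trips = pd.DataFrame({
    "HOUSEID":  [1, 1, 2, 2],
    "PERSONID": [1, 2, 1, 1],
    "TRIPID":   [1, 1, 1, 1],
})

# Count fully duplicated rows, then drop them
print(trips.duplicated().sum())
trips = trips.drop_duplicates()
is_unique = not trips.duplicated().any()
print(is_unique)
```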
Examining Category Labels
Now that we have confirmed that our dataset contains no duplicates, one of the first data cleaning steps with categorical data is to examine the category labels, both to check for errors and to get an understanding of the distribution.
Let’s look at the labels of the TRIPPURP variable using a sorted frequency table.
// Print frequency table for 'TRIPPURP'
frequency(trip_data, "TRIPPURP", 1);
                                 Label      Count    Total %     Cum. %
                Home-based other (HBO)       7714      24.82      24.82
           Not a home-based trip (NHB)       7035      22.64      47.46
           Home-based shopping (HBSHP)       6884      22.15      69.62
                 Home-based work (HBW)       4871      15.68      85.29
Home-based social/recreational (HBSOC)       4546      14.63      99.92
                       Not ascertained         24    0.07723        100
                                 Total      31074        100
Using this we can see that three categories make up almost 70% of the trips: "Home-based other", "Not a home-based trip", and "Home-based shopping".
The frequency table is also useful for learning more about our labels. In this table, the labels appear to be clean and we don’t see anything that suggests typos or errors.
To clean up the labels, let's separate the abbreviations from the descriptions. We can do this using some simple string manipulation in GAUSS.
First, let’s separate the abbreviations from the full descriptions by splitting the labels at "(" and storing the new string arrays:
// Use '(' to split existing labels into 2 columns
tmp = strsplit(trip_data[. , "TRIPPURP"], "(" );
// Trim whitespace from the front and back of both variables
tmp = strtrim(tmp);
// Rename columns
tmp = setColNames(tmp , "TRIP_DESC"$|"TRIP_ABBR");
// Preview data
head(tmp);
       TRIP_DESC        TRIP_ABBR
Home-based socia           HBSOC)
Home-based socia           HBSOC)
Home-based shopp           HBSHP)
Not a home-based             NHB)
Home-based shopp           HBSHP)
The TRIP_DESC variable looks good: it stores the full description of TRIPPURP. However, the abbreviations in TRIP_ABBR don't quite look right; we still need to strip the ")".
/*
** Remove the right parenthesis
*/
// Replace ')' with an empty string
tmp[. , "TRIP_ABBR"] = strreplace(tmp[. , "TRIP_ABBR"], ")", "");
// Check frequencies for both variables
frequency(tmp, "TRIP_DESC + TRIP_ABBR");
                         Label      Count    Total %     Cum. %
              Home-based other       7714      24.82      24.82
           Home-based shopping       6884      22.15      46.98
Home-based social/recreational       4546      14.63      61.61
               Home-based work       4871      15.68      77.28
         Not a home-based trip       7035      22.64      99.92
               Not ascertained         24    0.07723        100
                         Total      31074        100

Label      Count    Total %     Cum. %
              24    0.07723    0.07723
HBO         7714      24.82       24.9
HBSHP       6884      22.15      47.06
HBSOC       4546      14.63      61.69
HBW         4871      15.68      77.36
NHB         7035      22.64        100
Total      31074        100
One final change we may want to make is to replace the missing abbreviation label for the "Not ascertained" category using the recodeCatLabels procedure.
/*
** Recode missing label
*/
// Add missing label for 'NA'
tmp[., 2] = recodecatlabels(tmp[., 2], "", "NA", "TRIP_ABBR");
// Check frequencies for both variables
frequency(tmp, "TRIP_DESC + TRIP_ABBR");
                         Label      Count    Total %     Cum. %
              Home-based other       7714      24.82      24.82
           Home-based shopping       6884      22.15      46.98
Home-based social/recreational       4546      14.63      61.61
               Home-based work       4871      15.68      77.28
         Not a home-based trip       7035      22.64      99.92
               Not ascertained         24    0.07723        100
                         Total      31074        100

Label      Count    Total %     Cum. %
NA            24    0.07723    0.07723
HBO         7714      24.82       24.9
HBSHP       6884      22.15      47.06
HBSOC       4546      14.63      61.69
HBW         4871      15.68      77.36
NHB         7035      22.64        100
Total      31074        100
We've successfully created two new variables, TRIP_DESC and TRIP_ABBR, which we can concatenate to our trip_data dataframe:
// Add the new variables to the end of 'trip_data'
trip_data = trip_data ~ tmp;
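The label-splitting workflow above translates directly to other tools. As an illustrative sketch, here is a pandas version using a few fabricated TRIPPURP labels (the column names mirror the GAUSS example but the data is made up):

```python
import pandas as pd

# Hypothetical TRIPPURP labels matching the pattern above
trip = pd.DataFrame({"TRIPPURP": [
    "Home-based work (HBW)",
    "Home-based shopping (HBSHP)",
    "Not a home-based trip (NHB)",
    "Not ascertained",
]})

# Split at '(' into description and abbreviation, then tidy up
parts = trip["TRIPPURP"].str.split("(", n=1, expand=True)
trip["TRIP_DESC"] = parts[0].str.strip()
trip["TRIP_ABBR"] = parts[1].str.replace(")", "", regex=False).str.strip()

# Recode the missing abbreviation for 'Not ascertained'
trip["TRIP_ABBR"] = trip["TRIP_ABBR"].fillna("NA")

print(trip[["TRIP_DESC", "TRIP_ABBR"]])
```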
Two-Way Tables
Frequency tables provide insights into a single categorical variable. However, if we are interested in the relationship between multiple categorical variables, we need to use two-way, or contingency, tables.
Let's use a contingency table to look at the relationship between the URBRUR and VEHTYPE variables. To do this we can use the tabulate procedure, introduced in GAUSS 24.
The tabulate function requires either a dataframe or filename input, along with a formula string to specify which variables to include in the table. It also takes an optional tabControl structure input for advanced options.
- data: A GAUSS dataframe or filename.
- formula: String, formula string, e.g. "df1 ~ df2 + df3". The "df1" categories will be reported in rows, and separate columns will be returned for each category in "df2" and "df3".
- tbctl: Optional, an instance of the tabControl structure used for advanced table options.
// Compute a two-way table with
// VEHTYPE categories in rows
// URBRUR categories in columns
// Results stored in tab_df
tab_df = tabulate(trip_data, "VEHTYPE ~ URBRUR");
===============================================================
VEHTYPE                  URBRUR                          Total
===============================================================
                          Urban        Rural
Valid skip                 4061          719            4780
Car/Stationwagon           9306         1774           11080
Van                        1438          358            1796
SUV                        8275         1935           10210
Pickup Truck               2043         1043            3086
Other Truck                  36           24              60
RV/Motorhome                  4            4               8
Motorcycle/Moped             39           15              54
Total                     25202         5872           31074
===============================================================
The initial counts provide us some insights:
- The total counts of vehicles are higher in urban areas.
- In urban areas, the most frequently occurring vehicle type is Car/Stationwagon.
- In rural areas, the most frequently occurring vehicle type is SUV.
It might be useful to see the relative percentages of the vehicle types. Because we stored the counts in tab_df, this can easily be done.
First, let's look at what percentage each category makes up of the total vehicles in the urban and rural areas, respectively.
// Compute percentages within urban and rural areas
// by dividing by column totals
tab_df[., 1]~(tab_df[., 2:3]./sumc(tab_df[., 2:3])');
VEHTYPE              URBRUR_Urban    URBRUR_Rural
Valid skip                 0.1611          0.1224
Car/Stationwagon           0.3692          0.3021
Van                        0.0571          0.0610
SUV                        0.3283          0.3295
Pickup Truck               0.0811          0.1776
Other Truck                0.0014          0.0041
RV/Motorhome               0.0002          0.0007
Motorcycle/Moped           0.0015          0.0026
These percentages help us see that:
- The distributions of Car/Stationwagon, Van, and SUV are fairly similar in urban and rural areas.
- There is a higher percentage of the Pickup Truck, Other Truck, and Motorcycle/Moped categories in rural areas.
Alternatively we can look at the distribution of each vehicle type across rural and urban areas.
// Compute percentages across urban and rural areas
// by dividing by row totals
tab_df[., 1]~(tab_df[., 2:3]./sumr(tab_df[., 2:3]));
VEHTYPE              URBRUR_Urban    URBRUR_Rural
Valid skip                 0.8496          0.1504
Car/Stationwagon           0.8399          0.1601
Van                        0.8007          0.1993
SUV                        0.8105          0.1895
Pickup Truck               0.6620          0.3380
Other Truck                0.6000          0.4000
RV/Motorhome               0.5000          0.5000
Motorcycle/Moped           0.7222          0.2778
This table tells a similar story from a different perspective:
- Urban vehicles make up 80-84% of the Car/Stationwagon, Van, and SUV categories.
- Urban vehicles make up only 66% and 60% of the Pickup Truck and Other Truck categories, respectively.
- Urban vehicles make up 72% of the Motorcycle/Moped category.
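Both normalizations above (within-area column percentages and within-type row percentages) are standard contingency-table operations. As an illustrative cross-check in pandas, using a subset of the counts copied from the table earlier:

```python
import pandas as pd

# Counts copied from the VEHTYPE-by-URBRUR table above (subset of rows)
tab = pd.DataFrame(
    {"Urban": [9306, 1438, 8275, 2043], "Rural": [1774, 358, 1935, 1043]},
    index=["Car/Stationwagon", "Van", "SUV", "Pickup Truck"],
)

# Share of each vehicle type within urban and rural areas (column %)
within_area = tab / tab.sum(axis=0)

# Urban/rural split within each vehicle type (row %)
within_type = tab.div(tab.sum(axis=1), axis=0)

print(within_area.round(4))
print(within_type.round(4))
```

Dividing by column totals answers "what do people in urban areas drive?", while dividing by row totals answers "where are pickup trucks found?"; the two views can suggest quite different stories from the same counts.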
Excluding Categories
Suppose we don't want to include the "Valid skip" responses in our contingency table. We can remove them using the exclude member of the tabControl structure.
To specify categories to exclude from the contingency table, we use a string containing the variable name and category separated by a ":".
// Declare structure
struct tabControl tbCtl;
// Fill defaults
tbCtl = tabControlCreate();
// Specify to exclude the 'Valid skip' category
// from the 'VEHTYPE' variable
tbCtl.exclude = "VEHTYPE:Valid skip";
// Find contingency table including tbCtl input
tab_df2 = tabulate(trip_data, "VEHTYPE ~ URBRUR", tbCtl);
=============================================================================
VEHTYPE                  URBRUR                          Total
=============================================================================
                          Urban        Rural
Car/Stationwagon           9306         1774           11080
Van                        1438          358            1796
SUV                        8275         1935           10210
Pickup Truck               2043         1043            3086
Other Truck                  36           24              60
RV/Motorhome                  4            4               8
Motorcycle/Moped             39           15              54
Total                     21141         5153           26294
=============================================================================
Now our table excludes the "Valid skip" category.
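In tools without a built-in exclusion option, the same result comes from filtering the data before tabulating. A pandas sketch with fabricated long-format responses:

```python
import pandas as pd

# Hypothetical long-format responses with a 'Valid skip' category
data = pd.DataFrame({
    "VEHTYPE": ["Car", "Valid skip", "SUV", "Car", "Valid skip"],
    "URBRUR":  ["Urban", "Urban", "Rural", "Rural", "Rural"],
})

# Drop the unwanted category before building the contingency table
subset = data[data["VEHTYPE"] != "Valid skip"]
table = pd.crosstab(subset["VEHTYPE"], subset["URBRUR"])
print(table)
```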
Data Visualizations
Data visualizations are one of the most useful tools for data exploration. There are several ways to utilize the plotting capabilities of GAUSS to explore survey data.
Frequency plots
First, let's use a frequency plot to explore the distribution of responses across census regions. To do this, we will utilize the plotFreq procedure.
// Census region frequencies
plotFreq(trip_data, "CENSUS_R", 1);
The sorted frequency plot allows us to quickly identify that the most frequently occurring region in our data is "South".
Plotting Contingency Tables
Like frequency tables, frequency plots are useful for visualizing the categories of one variable. However, they don't provide much insight into the relationship across categorical variables.
To visualize the relationship between VEHTYPE and URBRUR, let's create a bar plot using our stored contingency table dataframe, tab_df2.
The plotBar function requires two inputs: labels for the x-axis and the corresponding heights.
The labels for our bar plot are the vehicle types, which are stored as a dataframe in the first column of tab_df2. To use them as inputs we will need to:
- Get the category labels.
- Convert them to a string array.
// Get category labels
labels = getCategories(tab_df2, "VEHTYPE");
// Convert to string array
labels_sa = ntos(labels);
The corresponding heights will come from the tab_df2 variable. Let's find out the variable names in tab_df2:
// Print the variable names from 'tab_df2'
getcolnames(tab_df2);
VEHTYPE URBRUR_Urban URBRUR_Rural
The final two variable names were created by the tabulate function to tell us which original variable the column came from, URBRUR, and which category is being referenced. Let's change the variable names to just Urban and Rural to make them more concise.
new_names = "Urban" $| "Rural";
col_idx = { 2, 3 };
tab_df2 = setColNames(tab_df2, new_names, col_idx);
Now we're ready to use the Urban and Rural count variables to plot our data.
plotBar(labels_sa, tab_df2[., "Urban" "Rural"]);
By default, this plots our bars side-by-side. We can change this using a plotControl structure and plotSetBar.
// Declare structure
struct plotControl plt;
// Fill defaults
plt = plotGetDefaults("bar");
// Set bars to be solid and stacked
plotSetBar(&plt, 1, 1);
// Plot contingency table
plotBar(plt, labels_sa, tab_df2[., "Urban" "Rural"]);
Scatter Plots
Now suppose we wish to examine the relationship between two continuous variables across the categories of a categorical variable. We can do this using the 'by' keyword with the plotScatter function.
// Plot TRPMILES vs GASPRICE
// Sorting by color using the categories in CENSUS_R
plotScatter(trip_data, "TRPMILES ~ GASPRICE + by(CENSUS_R)");
Adding the census regions provides some interesting observations:
- The West region has higher gas prices than other regions.
- The South region seems to have lower gas prices than other regions.
Conclusion
In this blog, we've covered some fundamental concepts related to survey data and looked at some GAUSS tools for cleaning, exploring, and visualizing survey data.
Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.