Introduction
Working with strings hasn’t always been easy in GAUSS. In the past, the only option was to store strings separately from numeric data, which made it difficult to work with datasets containing mixed types.
With the introduction of GAUSS dataframes in GAUSS 21 and the enhanced string capabilities of GAUSS 23, that has all changed! I would argue that GAUSS now offers one of the best environments for managing and cleaning mixed-type data.
I recently used GAUSS to perform the very practical task of creating an email list from a string-heavy dataset – something I never would have chosen GAUSS for in the past. In this blog, we walk through this data cleaning task, highlighting several key features for handling strings.
Quick Overview of Strings in GAUSS
The GAUSS dataframe revolutionized data storage in GAUSS. It allows you to store mixed data types together, including numbers, dates, categorical data, and strings.
The GAUSS string data type can contain letters, numbers, and other characters. The string data type:
- Keeps labels with data.
- Saves additional loading steps.
- Makes data and reports easier to understand.
It isn’t difficult to see the usefulness of this in real-world data, which often includes information such as customer names, product names, or locations.
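For example, here is a minimal sketch (with made-up data) that stores a string column next to a numeric column in a single dataframe:
// Hypothetical string and numeric data
rep_names = "Alice" $| "Bob" $| "Carmen";
sales = 1250 | 987 | 1430;
// Store both types together in one dataframe
df = asDF(rep_names, "Rep") ~ asDF(sales, "Sales");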
Loading Strings in GAUSS
Strings can be programmatically loaded from multiple data file types using loadd. No special steps are required, and GAUSS automatically detects strings in XLSX, CSV, Stata, SAS, and GDAT files.
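For instance, loading a file takes a single call (the file name below is hypothetical):
// Load a CSV file; string columns
// are detected automatically
mydata = loadd("sales_data.csv");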
In addition, the Data Import window provides a great tool for interactively previewing and managing data of all types at the time of import.
Data Exercise: Building an Email List
To help demonstrate GAUSS's string capabilities, we will build and export an email contact list from a provided Excel dataset. We will break this project into several smaller tasks:
- Loading our raw data.
- Generating email addresses from the provided information.
- Combining the desired contact list information into a dataframe.
- Exporting the dataframe as a CSV file.
Provided Data
We will use a sample dataset containing sales territory information for sales representatives. The original dataset includes a mix of string, categorical, and numeric data:
Variable | Description | Type |
---|---|---|
KAR | The assigned territory sales representative. | Category |
Store | The store number. | Numeric |
Store Name | The store name. | String |
Format | Type of display found in store. | Category |
Vet | Y/N indicator of in-store vet clinics. | Category |
Nielsen Market | The assigned Nielsen market. | Category |
You can download the original dataset here.
Importing Raw Data Interactively
For this exercise, I'm going to use the interactive Data Window to load my data. For data cleaning projects like this, I often find it helpful to preview the raw data first. This allows me to make preliminary observations such as:
- The presence of unnecessary variables.
- If the dataset has a non-standard header.
- Data types.
It's useful to note that the GAUSS Data Window always generates GAUSS code that can be used to replicate the data loading programmatically in the future.
// Load the data using the code generated
// by the Data Import window
territory_info = loadd("C:/business/accounts/territory-info.xlsx");
Notice that the data loads exactly as we saw it in the preview.
// Print the first 5 rows
head(territory_info);
The printed output matches our preview:
            KAR       Store       Store Name           Format   Vet     Nielsen Market
  Larry McGuire   725.00000     NY-MIDDLETWN              RUN     N       New York, NY
  Larry McGuire   728.00000        STRATFORD      PREMIUM RUN     N       New York, NY
  Larry McGuire   752.00000       NORWALK-CT           PANTRY     N       New York, NY
  Larry McGuire   758.00000       SEEKONK-MA           PANTRY     N   Providence et al
  Larry McGuire   762.00000   SOUTHINGTON-CT   4 FT. MINI RUN     N   Hartford and New
Cleaning Our Data
Before generating our email list, we should perform some preliminary data cleaning.
First, we check for duplicates:
// Check for duplicates
getduplicates(territory_info);
Since no output is printed, we know there are no duplicate rows.
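As an extra sanity check (our own addition, not part of the original workflow), we can also confirm that the store numbers form a unique key:
// Count the unique store numbers
n_unique = rows(unique(territory_info[., "Store"]));
// Equals 1 if every store appears exactly once
is_unique = n_unique == rows(territory_info);
print is_unique;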
Next, let's review the Nielsen Market variable using the frequency command:
// Check Frequencies
frequency(territory_info, "Nielsen Market");
The frequencies aren't that interesting to us. However, the report provides us with a quick view of the categories:
Label                          Count    Total %     Cum. %
21 iowa-idaho                      1    0.09814    0.09814
Abilene-Sweetwater, TX             1    0.09814     0.1963
Abilene-Sweetwater, TX             1    0.09814     0.2944
Albany et al, NY                   2     0.1963     0.4907
Albuquerque-Santa Fe, NM           3     0.2944     0.7851
Atlanta, GA                       20      1.963      2.748
Augusta-Aiken, GA-SC               1    0.09814      2.846
Austin, TX                        13      1.276      4.122
Bakersfield, CA                    2     0.1963      4.318
Baltimore, MD                     19      1.865      6.183
Beaumont-Port Arthur, TX           2     0.1963      6.379
Bend, OR                           2     0.1963      6.575
Binghamton, NY                     1    0.09814      6.673
Boston et al, MA-NH               37      3.631       10.3
Buffalo, NY                        3     0.2944       10.6
Butte-Bozeman, MT                  2     0.1963      10.79
Charleston, SC                     1    0.09814      10.89
Charlotte, NC                      7     0.6869      11.58
Charlottesville, VA                1    0.09814      11.68
Cheyenne et al, WY-NE              1    0.09814      11.78
Chicago, IL                       45      4.416      16.19
Chico-Redding, CA                  3     0.2944      16.49
Cincinnati, OH                     1    0.09814      16.58
Cleveland et al, OH               11      1.079      17.66
Colorado Sprgs et al, CO           6     0.5888      18.25
Columbia, SC                       2     0.1963      18.45
Columbus, OH                       2     0.1963      18.65
Corpus Christi, TX                 2     0.1963      18.84
Dallas-Ft. Worth, TX              35      3.435      22.28
Dallas-Ft. Worth, TX               2     0.1963      22.47
Davenport et al, IA-IL             1    0.09814      22.57
Dayton, OH                         3     0.2944      22.87
Denver, CO                        27       2.65      25.52
Des Moines-Ames, IA                1    0.09814      25.61
Detroit, MI                       12      1.178      26.79
El Paso et al, TX-NM               3     0.2944      27.09
Elmira et al, NY                   1    0.09814      27.18
Eugene, OR                         2     0.1963      27.38
Eureka, CA                         1    0.09814      27.48
Fargo-Valley City, ND              1    0.09814      27.58
Fresno-Visalia, CA                 5     0.4907      28.07
Ft. Myers-Naples, FL               5     0.4907      28.56
Ft. Smith et al, AR                2     0.1963      28.75
Gainesville, FL                    1    0.09814      28.85
Grand Junction et al, CO           1    0.09814      28.95
Greensboro et al, NC               2     0.1963      29.15
Greenville et al, NC               2     0.1963      29.34
Greenville et al, SC-NC            6     0.5888      29.93
Harlingen et al, TX                2     0.1963      30.13
Harrisburg et al, PA               5     0.4907      30.62
Harrisonburg, VA                   2     0.1963      30.81
Hartford and New Haven, CT        17      1.668      32.48
Houston, TX                       38      3.729      36.21
Indianapolis, IN                   6     0.5888       36.8
Jacksonville, FL                   5     0.4907      37.29
Johnstown et al, PA                4     0.3925      37.68
Kansas City, MO-KS                 1    0.09814      37.78
Knoxville, TN                      1    0.09814      37.88
Lafayette, LA                      2     0.1963      38.08
Lake Charles, LA                   1    0.09814      38.17
Las Vegas, NV                      9     0.8832      39.06
Lexington, KY                      1    0.09814      39.16
Lincoln et al, NE                  2     0.1963      39.35
Los Angeles, CA                   92      9.028      48.38
Medford et al, OR                  2     0.1963      48.58
Miami-Ft. Lauderdale, FL          14      1.374      49.95
Milwaukee, WI                      7     0.6869      50.64
Minneapolis-St. Paul, MN          15      1.472      52.11
Minot et al, ND                    2     0.1963      52.31
Missoula, MT                       2     0.1963       52.5
Mobile et al, AL-FL                1    0.09814       52.6
Myrtle Beach et al, SC             3     0.2944      52.89
Nashville, TN                     10     0.9814      53.88
New Orleans, LA                    3     0.2944      54.17
New York, NY                      80      7.851      62.02
Norfolk et al, VA                  7     0.6869      62.71
Odessa-Midland, TX                 1    0.09814      62.81
Oklahoma City, OK                  6     0.5888       63.4
Omaha, NE                          1    0.09814      63.49
Orlando et al, FL                 18      1.766      65.26
Palm Springs, CA                   4     0.3925      65.65
Peoria-Bloomington, IL             3     0.2944      65.95
Philadelphia, PA                  35      3.435      69.38
Phoenix et al, AZ                 21      2.061      71.44
Pittsburgh, PA                    13      1.276      72.72
Portland, OR                      25      2.453      75.17
Portland-Auburn, ME                2     0.1963      75.37
Providence et al, RI-MA            9     0.8832      76.25
Quincy et al, IL-MO-IA             1    0.09814      76.35
Raleigh et al, NC                  6     0.5888      76.94
Rapid City, SD                     1    0.09814      77.04
Reno, NV                           3     0.2944      77.33
Richmond-Petersburg, VA            5     0.4907      77.82
Roanoke-Lynchburg, VA              2     0.1963      78.02
Rochester, NY                      4     0.3925      78.41
Rockford, IL                       1    0.09814      78.51
Sacramento et al, CA               6     0.5888       79.1
Salisbury, MD                      2     0.1963      79.29
Salt Lake City, UT                19      1.865      81.16
San Antonio, TX                   13      1.276      82.43
San Diego, CA                     28      2.748      85.18
Santa Barbara et al, CA            5     0.4907      85.67
Savannah, GA                       1    0.09814      85.77
Seattle-Tacoma, WA                 1    0.09814      85.87
Seattle-Tacoma, WA                38      3.729       89.6
Sherman-Ada, TX-OK                 2     0.1963      89.79
Sioux Falls et al, SD              1    0.09814      89.89
South Bend-Elkhart, IN             1    0.09814      89.99
Spokane- wa                        1    0.09814      90.09
Springfield-Holyoke, MA            1    0.09814      90.19
St. Louis, MO                     11      1.079      91.27
Syracuse, NY                       2     0.1963      91.46
Tallahassee et al, FL-GA           1    0.09814      91.56
Tampa et al, FL                   16       1.57      93.13
Toledo, OH                         1    0.09814      93.23
Tucson(Sierra Vista), AZ           8     0.7851      94.01
Tulsa, OK                          1    0.09814      94.11
Tyler-Longview et al, TX           2     0.1963      94.31
Utica, NY                          1    0.09814      94.41
W. Palm Beach et al, FL            8     0.7851      95.19
Waco-Temple-Bryan, TX              4     0.3925      95.58
Washington et al, DC-MD           37      3.631      99.21
Watertown, NY                      1    0.09814      99.31
Wichita Fls et al, TX-OK           2     0.1963      99.51
Yakima et al, WA                   3     0.2944       99.8
spokane- wa                        2     0.1963        100
Total                           1019        100
From this report we can identify a few issues that need addressing:
- The Spokane, WA market is entered under two different spellings, "spokane- wa" and "Spokane- wa".
- The format of the Spokane entries differs from the other entries: they use a dash rather than a comma to separate the city from the state.
- The "Abilene-Sweetwater, TX" and "Seattle-Tacoma, WA" markets each occur twice because of differing white space.
- The misalignment in the market names indicates that there are leading and trailing white spaces, which should be removed.
- It would be useful to separate the Nielsen Market variable into Nielsen City and Nielsen State.
/*
** Cleaning data
*/
// Strip leading and trailing white spaces
territory_info[., "Nielsen Market"] =
strtrim(territory_info[., "Nielsen Market"]);
// Update the Spokane listing
territory_info[., "Nielsen Market"] =
strreplace(territory_info[., "Nielsen Market"], "spokane", "Spokane");
// Replace Spokane-WA with Spokane, WA
territory_info[., "Nielsen Market"] =
strreplace(territory_info[., "Nielsen Market"], "Spokane- wa", "Spokane, WA");
// Split Nielsen Market into state and city
nielsen = asDF(strsplit(territory_info[., "Nielsen Market"], ","),
"Nielsen City", "Nielsen State");
Notice that we've used three different GAUSS string procedures above. These three are all very useful for data cleaning and are worth noting:
Procedure | Purpose |
---|---|
strtrim | Strips all white space characters from the left and right side of each element in a string array. |
strreplace | Replaces all matches of a substring with a replacement string. |
strsplit | Splits a string into individual tokens based on a specified separator. |
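To see these procedures together, here is a toy sketch using made-up strings rather than our dataset:
// A messy string with extra white space
s = "  spokane- wa ";
// Remove leading and trailing white space
s = strtrim(s);
// Standardize the capitalization
s = strreplace(s, "spokane", "Spokane");
// Split a market name into city and state tokens
tokens = strsplit("New York, NY", ",");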
Generating Email Addresses
Now that we've cleaned up the Nielsen Market data, we can generate the email addresses for our list. The email address for each store takes the general form storenumber + "d" + "@petpeople.com". For example, the email address for store number 548 is "548d@petpeople.com".
To generate our email addresses we need to:
- Convert the store numbers to strings.
- Add the suffix of the email address to the new strings.
To do this in GAUSS we will:
- Convert the store numbers to strings using the GAUSS function itos.
- Add the string suffix to form the email using $+.
- Change the string array to a dataframe using asDF.
/*
** Create email addresses
*/
// Convert store number to string
str_store = itos(territory_info[., "Store"]);
// Add the email suffix
email_address = str_store $+ "d@petpeople.com";
// Convert to dataframe
// and name the variable "Email"
email_df = asDF(email_address, "Email");
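As a quick check, we can preview the new variable just as we did with the raw data:
// Print the first 5 email addresses
head(email_df);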
Building the Email Database
We want the final email database to include KAR, Store Name, Email, Nielsen City, and Nielsen State.
// Form dataframe containing
// email list information
email_database = territory_info[., "KAR" "Store Name"] ~ email_df ~ nielsen;
// Preview database
head(email_database);
The first five rows of our data look like:
            KAR       Store Name                Email       Nielsen City   Nielsen State
  Larry McGuire     NY-MIDDLETWN   725d@petpeople.com           New York              NY
  Larry McGuire        STRATFORD   728d@petpeople.com           New York              NY
  Larry McGuire       NORWALK-CT   752d@petpeople.com           New York              NY
  Larry McGuire       SEEKONK-MA   758d@petpeople.com   Providence et al           RI-MA
  Larry McGuire   SOUTHINGTON-CT   762d@petpeople.com   Hartford and New              CT
Filtering the Data
Now that our database is created, let's filter our data to focus on one representative, Jeff Canary, and save the email list under his name.
/*
** Filtering and saving our email list
*/
// Specify KAR
name = "Jeff Canary";
// Filter data for specified employee
email_list = selif(email_database, email_database[., "KAR"] .$== name);
Exporting to a CSV File
As a final step, we will export the email_list dataframe to a CSV file using saved.
// Create file name
fsave_name = name $+ "_store_emails.csv";
// Save file
saved(email_list, fsave_name);
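As an optional verification step (our own addition), we can reload the exported file and preview it:
// Reload the saved CSV to confirm the export
check = loadd(fsave_name);
// Preview the reloaded data
head(check);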
Extra Credit: Looping Through All Representatives
Suppose we need to export email lists for all representatives. We can do this using a fairly simple loop.
// Get list of unique
// representative names
kar_names = unique(email_database[., "KAR"]);
// Loop over all names
for i(1, rows(kar_names), 1);
/*
** Filtering and saving our email list
*/
// Specify KAR to create email list for
name = kar_names[i];
// Filter data for specified employee
email_list = selif(email_database, email_database[., "KAR"] .$== name);
// Save email list
fsave_name = name $+ "_store_emails.csv";
// Save file
saved(email_list, fsave_name);
endfor;
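One caveat worth noting: representative names contain spaces, so the exported file names will too (for example, "Jeff Canary_store_emails.csv"). If spaces in file names are a concern, a small tweak using strreplace (a sketch of our own, not part of the original workflow) avoids them:
// Replace spaces in the representative's name
// before building the file name
fsave_name = strreplace(name, " ", "_") $+ "_store_emails.csv";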
Conclusion
In today's blog we've demonstrated the improved string capabilities of GAUSS using a simple data cleaning task. Our project covered several useful tasks, including:
- Loading raw data.
- Cleaning common string data issues.
- Generating new string variables by splitting and joining strings.
- Exporting dataframes as CSV files.
Further Reading
- Preparing and Cleaning FRED data in GAUSS
- What is a GAUSS Dataframe and Why Should You Care?
- Getting Started With Survey Data In GAUSS
Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.