Managing String Data with GAUSS Dataframes

Introduction

Working with strings hasn’t always been easy in GAUSS. In the past, the only option was to store strings separately from numeric data, which made it difficult to work with datasets containing mixed types.

With the introduction of GAUSS dataframes in GAUSS 21 and the enhanced string capabilities of GAUSS 23, that has all changed! I would argue that GAUSS now offers one of the best environments for managing and cleaning mixed-type data.

I recently used GAUSS to perform the very practical task of creating an email list from a string-heavy dataset – something I never would have chosen GAUSS for in the past. In this blog, we walk through this data cleaning task, highlighting several key features for handling strings.

Quick Overview of Strings in GAUSS

The GAUSS dataframe revolutionized data storage in GAUSS. It allows you to store mixed data types together including numbers, dates, categorical data, and strings.

The GAUSS string data type can contain letters, numbers, and other characters. The string data type:

  • Keeps labels with data.
  • Saves additional loading steps.
  • Makes data and reports easier to understand.

It isn’t difficult to see the usefulness of this in real-world data which often includes information such as customer names, product names, or locations.
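For example, a string array can sit side by side with numeric data in a single dataframe. Here is a minimal sketch using hypothetical product data (asDF assigns the variable names, and ~ concatenates dataframes horizontally):

```
// Hypothetical product names stored as a string array
products = "Dog Food" $| "Cat Toy" $| "Bird Seed";

// Corresponding numeric prices
prices = { 24.99, 7.50, 12.25 };

// Combine both data types in a single dataframe
product_df = asDF(products, "Product") ~ asDF(prices, "Price");
```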

Loading Strings in GAUSS

Strings can be programmatically loaded from multiple data file types using loadd. No special steps are required, and GAUSS automatically detects strings in XLSX, CSV, Stata, SAS, and GDAT files.

In addition, the Data Import window provides a great tool for interactively previewing and managing data of all types at the time of import.
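If the automatically detected type isn't what you want, loadd also accepts a formula string that forces types at load time. A sketch with a hypothetical file, assuming the str() and cat() type-forcing keywords:

```
// Hypothetical file: force 'name' to load as a string
// and 'region' as a categorical variable
contacts = loadd("C:/business/accounts/contacts.csv", "str(name) + cat(region)");
```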

Data Exercise: Building an Email List

To help demonstrate GAUSS's string capabilities, we will build and export an email contact list from a provided Excel dataset. We will break this project into several smaller tasks:

  • Loading our raw data.
  • Generating email addresses from the provided information.
  • Combining the desired contact list information into a dataframe.
  • Exporting the dataframe as a CSV file.

Provided Data

We will use a sample dataset containing sales territory information for sales representatives. The original dataset includes a mix of string and categorical data including:

Variable         Description                                    Type
KAR              The assigned territory sales representative.   Category
Store            The store number.                              Numeric
Store Name       The store name.                                String
Format           Type of display found in the store.            Category
Vet              Y/N indicator of in-store vet clinics.         Category
Nielsen Market   The assigned Nielsen market.                   Category

You can download the original dataset here.

Importing Raw Data Interactively

For this exercise, I'm going to use the interactive Data Window to load my data. For data cleaning projects like this, I often find it helpful to have a preview of my raw data. This allows me to make preliminary observations about my raw data such as:

  • The presence of unnecessary variables.
  • If the dataset has a non-standard header.
  • Data types.

Using the GAUSS Data Window to import data.

It's useful to note that the GAUSS Data Window generates GAUSS code that can be used to replicate the data loading programmatically:

territory_info = loadd("C:/business/accounts/territory-info.xlsx");

This loads the data exactly as we saw it in the preview.

// Print the first 5 rows
head(territory_info);

Printing the first few rows confirms that it matches our preview:

             KAR            Store       Store Name           Format       Vet   Nielsen Market
   Larry McGuire        725.00000     NY-MIDDLETWN              RUN         N     New York, NY
   Larry McGuire        728.00000        STRATFORD      PREMIUM RUN         N     New York, NY
   Larry McGuire        752.00000       NORWALK-CT           PANTRY         N     New York, NY
   Larry McGuire        758.00000       SEEKONK-MA           PANTRY         N Providence et al
   Larry McGuire        762.00000   SOUTHINGTON-CT   4 FT. MINI RUN         N Hartford and New

Cleaning Our Data

Before generating our email list, we should perform some preliminary data cleaning.

First, we check for duplicates:

// Check for duplicates
getduplicates(territory_info);

No output is printed, which tells us there are no duplicate rows.
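To see what getduplicates reports when duplicates do exist, consider a small toy example with hypothetical data:

```
// Toy dataframe with a repeated row
names = "Ann" $| "Bob" $| "Ann";
x = asDF(names, "Name");

// Prints a report of the duplicated rows
getduplicates(x);
```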

Next, let's review the Nielsen Market variable using the frequency command:

// Check Frequencies
frequency(territory_info, "Nielsen Market");

The frequencies aren't that interesting to us. However, the report provides us with a quick view of the categories:

                      Label      Count   Total %    Cum. % 
              21 iowa-idaho          1   0.09814   0.09814 
     Abilene-Sweetwater, TX          1   0.09814    0.1963 
    Abilene-Sweetwater, TX           1   0.09814    0.2944 
          Albany et al, NY           2    0.1963    0.4907 
  Albuquerque-Santa Fe, NM           3    0.2944    0.7851 
               Atlanta, GA          20     1.963     2.748 
      Augusta-Aiken, GA-SC           1   0.09814     2.846 
                Austin, TX          13     1.276     4.122 
            Bakersfield, CA          2    0.1963     4.318 
             Baltimore, MD          19     1.865     6.183 
  Beaumont-Port Arthur, TX           2    0.1963     6.379 
                  Bend, OR           2    0.1963     6.575 
            Binghamton, NY           1   0.09814     6.673 
       Boston et al, MA-NH          37     3.631      10.3 
               Buffalo, NY           3    0.2944      10.6 
         Butte-Bozeman, MT           2    0.1963     10.79 
            Charleston, SC           1   0.09814     10.89 
             Charlotte, NC           7    0.6869     11.58 
       Charlottesville, VA           1   0.09814     11.68 
      Cheyenne et al, WY-NE          1   0.09814     11.78 
               Chicago, IL          45     4.416     16.19 
         Chico-Redding, CA           3    0.2944     16.49 
            Cincinnati, OH           1   0.09814     16.58 
       Cleveland et al, OH          11     1.079     17.66 
   Colorado Sprgs et al, CO          6    0.5888     18.25 
              Columbia, SC           2    0.1963     18.45 
              Columbus, OH           2    0.1963     18.65 
        Corpus Christi, TX           2    0.1963     18.84 
       Dallas-Ft. Worth, TX         35     3.435     22.28 
      Dallas-Ft. Worth, TX           2    0.1963     22.47 
    Davenport et al, IA-IL           1   0.09814     22.57 
                Dayton, OH           3    0.2944     22.87 
                 Denver, CO         27      2.65     25.52 
       Des Moines-Ames, IA           1   0.09814     25.61 
               Detroit, MI          12     1.178     26.79 
      El Paso et al, TX-NM           3    0.2944     27.09 
          Elmira et al, NY           1   0.09814     27.18 
                Eugene, OR           2    0.1963     27.38 
                Eureka, CA           1   0.09814     27.48 
     Fargo-Valley City, ND           1   0.09814     27.58 
        Fresno-Visalia, CA           5    0.4907     28.07 
      Ft. Myers-Naples, FL           5    0.4907     28.56 
       Ft. Smith et al, AR           2    0.1963     28.75 
           Gainesville, FL           1   0.09814     28.85 
   Grand Junction et al, CO          1   0.09814     28.95 
      Greensboro et al, NC           2    0.1963     29.15 
      Greenville et al, NC           2    0.1963     29.34 
   Greenville et al, SC-NC           6    0.5888     29.93 
       Harlingen et al, TX           2    0.1963     30.13 
      Harrisburg et al, PA           5    0.4907     30.62 
          Harrisonburg, VA           2    0.1963     30.81 
Hartford and New Haven, CT          17     1.668     32.48 
               Houston, TX          38     3.729     36.21 
          Indianapolis, IN           6    0.5888      36.8 
          Jacksonville, FL           5    0.4907     37.29 
       Johnstown et al, PA           4    0.3925     37.68 
        Kansas City, MO-KS           1   0.09814     37.78 
             Knoxville, TN           1   0.09814     37.88 
             Lafayette, LA           2    0.1963     38.08 
          Lake Charles, LA           1   0.09814     38.17 
             Las Vegas, NV           9    0.8832     39.06 
              Lexington, KY          1   0.09814     39.16 
         Lincoln et al, NE           2    0.1963     39.35 
           Los Angeles, CA          92     9.028     48.38 
         Medford et al, OR           2    0.1963     48.58 
  Miami-Ft. Lauderdale, FL          14     1.374     49.95 
             Milwaukee, WI           7    0.6869     50.64 
  Minneapolis-St. Paul, MN          15     1.472     52.11 
           Minot et al, ND           2    0.1963     52.31 
              Missoula, MT           2    0.1963      52.5 
       Mobile et al, AL-FL           1   0.09814      52.6 
    Myrtle Beach et al, SC           3    0.2944     52.89 
             Nashville, TN          10    0.9814     53.88 
           New Orleans, LA           3    0.2944     54.17 
              New York, NY          80     7.851     62.02 
         Norfolk et al, VA           7    0.6869     62.71 
         Odessa-Midland, TX          1   0.09814     62.81 
         Oklahoma City, OK           6    0.5888      63.4 
                 Omaha, NE           1   0.09814     63.49 
         Orlando et al, FL          18     1.766     65.26 
          Palm Springs, CA           4    0.3925     65.65 
    Peoria-Bloomington, IL           3    0.2944     65.95 
          Philadelphia, PA          35     3.435     69.38 
         Phoenix et al, AZ          21     2.061     71.44 
            Pittsburgh, PA          13     1.276     72.72 
              Portland, OR          25     2.453     75.17 
       Portland-Auburn, ME           2    0.1963     75.37 
   Providence et al, RI-MA           9    0.8832     76.25 
    Quincy et al, IL-MO-IA           1   0.09814     76.35 
         Raleigh et al, NC           6    0.5888     76.94 
            Rapid City, SD           1   0.09814     77.04 
                   Reno, NV          3    0.2944     77.33 
   Richmond-Petersburg, VA           5    0.4907     77.82 
     Roanoke-Lynchburg, VA           2    0.1963     78.02 
             Rochester, NY           4    0.3925     78.41 
              Rockford, IL           1   0.09814     78.51 
      Sacramento et al, CA           6    0.5888      79.1 
             Salisbury, MD           2    0.1963     79.29 
        Salt Lake City, UT          19     1.865     81.16 
           San Antonio, TX          13     1.276     82.43 
             San Diego, CA          28     2.748     85.18 
   Santa Barbara et al, CA           5    0.4907     85.67 
              Savannah, GA           1   0.09814     85.77 
         Seattle-Tacoma, WA          1   0.09814     85.87 
        Seattle-Tacoma, WA          38     3.729      89.6 
        Sherman-Ada, TX-OK           2    0.1963     89.79 
     Sioux Falls et al, SD           1   0.09814     89.89 
    South Bend-Elkhart, IN           1   0.09814     89.99 
                Spokane- wa          1   0.09814     90.09 
   Springfield-Holyoke, MA           1   0.09814     90.19 
             St. Louis, MO          11     1.079     91.27 
              Syracuse, NY           2    0.1963     91.46 
  Tallahassee et al, FL-GA           1   0.09814     91.56 
           Tampa et al, FL          16      1.57     93.13 
                Toledo, OH           1   0.09814     93.23 
   Tucson(Sierra Vista), AZ          8    0.7851     94.01 
                 Tulsa, OK           1   0.09814     94.11 
   Tyler-Longview et al, TX          2    0.1963     94.31 
                 Utica, NY           1   0.09814     94.41 
   W. Palm Beach et al, FL           8    0.7851     95.19 
     Waco-Temple-Bryan, TX           4    0.3925     95.58 
   Washington et al, DC-MD          37     3.631     99.21 
             Watertown, NY           1   0.09814     99.31 
  Wichita Fls et al, TX-OK           2    0.1963     99.51 
          Yakima et al, WA           3    0.2944      99.8 
                spokane- wa          2    0.1963       100 
                      Total       1019       100     

From this report we can identify a few issues that need addressing:

  • The Spokane market is entered inconsistently, appearing as both spokane- wa and Spokane- wa.
  • The format of the Spokane-WA market differs from the other entries. It uses a dash rather than a comma to separate the city from the state.
  • The Abilene-Sweetwater, TX and Seattle-Tacoma, WA markets occur twice because of differing white spaces.
  • The misalignment in the market names indicates that there are leading and trailing white spaces which should be removed.
  • It would be useful to separate the Nielsen Market variable into Nielsen City and Nielsen State.

/*
** Cleaning data
*/
// Strip leading and trailing white spaces
territory_info[., "Nielsen Market"] = 
strtrim(territory_info[., "Nielsen Market"]);

// Update the Spokane listing
territory_info[., "Nielsen Market"] = 
strreplace(territory_info[., "Nielsen Market"], "spokane", "Spokane");

// Replace Spokane-WA with Spokane, WA
territory_info[., "Nielsen Market"] = 
strreplace(territory_info[., "Nielsen Market"], "Spokane- wa", "Spokane, WA");

// Split Nielsen Market into state and city
nielsen = asDF(strsplit(territory_info[., "Nielsen Market"], ","), 
          "Nielsen City", "Nielsen State");

Notice that we've used three different GAUSS string procedures above. These three are all very useful for data cleaning and are worth noting:

Procedure    Purpose
strtrim      Strips all white space characters from the left and right side of each element in a string array.
strreplace   Replaces all matches of a substring with a replacement string.
strsplit     Splits a string into individual tokens based on a specified separator.
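A few one-line examples illustrate how these behave, using the same cleaning steps as above (the commented results assume standard trim-and-replace behavior):

```
// Strip leading and trailing white space
s = strtrim("  spokane- wa ");               // "spokane- wa"

// Replace every match of a substring
s = strreplace(s, "spokane", "Spokane");     // "Spokane- wa"

// Split a string into tokens on a separator;
// returns a string array with one column per token
tokens = strsplit("New York, NY", ",");
```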

Generating Email Addresses

Now that we've cleaned up the Nielsen Market data, we can generate the email addresses for our list. The email address for each store takes the general form storenumber + "d" + "@petpeople.com". For example, the email address for store number 548 is "548d@petpeople.com".

To generate our email addresses we need to:

  • Convert the store numbers to strings.
  • Add the suffix of the email address to the new strings.

To do this in GAUSS we will:

  • Convert the store numbers to strings using the GAUSS function itos.
  • Append the email suffix using the string concatenation operator $+.
  • Change the string array to a dataframe using asDF.
/*
** Create email addresses
*/
// Convert store number to string
str_store = itos(territory_info[., "Store"]);

// Add email suffix
email_address = str_store $+ "d@petpeople.com";

// Convert to dataframe
// and name the variable "Email"
email_df = asDF(email_address, "Email");

Build Email Database

We want the final email database to include KAR, Store Name, Email, Nielsen City, and Nielsen State.

// Form dataframe containing
// email list information
email_database = territory_info[., "KAR" "Store Name"] ~ email_df ~ nielsen;

// Preview database
head(email_database);

The first five rows of our data look like:

             KAR       Store Name                Email       Nielsen City   Nielsen State
   Larry McGuire     NY-MIDDLETWN   725d@petpeople.com           New York              NY
   Larry McGuire        STRATFORD   728d@petpeople.com           New York              NY
   Larry McGuire       NORWALK-CT   752d@petpeople.com           New York              NY
   Larry McGuire       SEEKONK-MA   758d@petpeople.com   Providence et al           RI-MA
   Larry McGuire   SOUTHINGTON-CT   762d@petpeople.com   Hartford and New              CT

Filtering the Data

Now that our database is created, let's filter our data to focus on one representative, Jeff Canary, and save the email list under his name.

/*
** Filtering and saving our email list
*/
// Specify KAR 
name = "Jeff Canary";

// Filter data for specified employee
email_list = selif(email_database, email_database[., "KAR"] .$== name);

Export to CSV file

As a final step, we will export the email_list dataframe to a CSV file using saved.

// Create file name
fsave_name = name $+ "_store_emails.csv";

// Save file
saved(email_list, fsave_name);

Extra Credit: Looping Through All Representatives

Suppose we need to export email lists for all representatives. We can do this using a fairly simple loop.

// Get list of unique 
// representative names
kar_names  = unique(email_database[., "KAR"]);

// Loop over all names
for i(1, rows(kar_names), 1);
  /*
  ** Filtering and saving our email list
  */
  // Specify KAR to create email list for
  name = kar_names[i];

  // Filter data for specified employee
  email_list = selif(email_database, email_database[., "KAR"] .$== name);

  // Save email list
  fsave_name = name $+ "_store_emails.csv";

  // Save file
  saved(email_list, fsave_name);
endfor;

Conclusion

In today's blog we've demonstrated the improved string capabilities of GAUSS using a simple data cleaning task. Our project covered several useful tasks including:

  • Loading raw data.
  • Cleaning common string data issues.
  • Generating new string variables by splitting and joining strings.
  • Exporting dataframes as CSV files.

Further Reading

  1. Preparing and Cleaning FRED data in GAUSS
  2. What is a GAUSS Dataframe and Why Should You Care?
  3. Getting Started With Survey Data In GAUSS