Getting to Know Your Data With GAUSS 22

Introduction

There is no getting around the fact that data wrangling, cleaning, and exploring play an important role in any empirical research. Data management can be time-consuming and error-prone, and it can make or break results.

GAUSS 22 is built to take the pain out of dealing with your data and to let you move seamlessly towards tackling your important research questions.

In today’s blog, we walk through how to efficiently prepare and explore real-world data before modeling or estimation. We'll look at:

  • Loading data.
  • Cleaning data to eliminate misentries, missing values, and more.
  • Exploring data.

Throughout the blog, we will use Kickstarter 2018 data available on Kaggle.

Loading Data

Let's get started by opening up our data in the Data Import window. This can be done in two ways:

  • Selecting File > Import Data from the main GAUSS menu.
  • Double-clicking the filename in the Project Folders window.

The data preview shows us exactly how our data will be loaded and is useful for identifying data issues before we even load our data.

We can tell from the data preview if:

  • Our data has any additional headers that shouldn't be loaded.
  • There are unintended symbols or unacceptable characters.
  • Observations are missing.
  • Data is loading as the type we intend.

For example, in the data preview of our Kickstarter data, we can see that GAUSS is detecting that the variable Category is a string. In the next section, we'll look at how to change this to a categorical variable.

Managing Data in the Data Import Window

Let's look at how to complete preliminary data cleaning using the Import Window. Specifically, we will:

  1. Change the dataframe name for more convenient referencing.
    • For the Kickstarter data, the default symbol name is ks_projects_201801.
    • Let's change this to a simpler name, ks_projects, using the Symbol Name textbox.
  2. Change the Category variable type from String to Category.
    • Variable types are changed using the Type drop-down menu in the Variables list.
    • If necessary, we can change the categorical mappings using the Modify Column Mapping dialog.
  3. Remove unneeded variables from the import list to avoid workspace clutter.
    • Raw datasets often contain excess variables that we don't need.
    • Our Kickstarter data contains the fundraising pledges in local currency and in real USD.
    • We can deselect the pledged and currency variables in the Variables list.
  4. Filter the data to build the appropriate model.
    • Datasets often also contain excess observations that we don't want to include in our model.
    • The Filter tab is a quick and easy way to remove observations.
    • Let's filter our data to remove all observations with launch dates earlier than "2015".

After performing our preliminary data cleaning steps, we click Import to load our data.

Reusable Auto-generated Code

One of the most powerful features of the Data Import tool is that it auto-generates code for all of the steps we perform.

/*
** Perform import
*/
ks_projects = loadd("C:/research/data/ks-projects-201801.csv", "ID + str(name) + cat(category) + cat(main_category) + date($deadline) + date($launched) + pledged + cat(state) + backers + cat(country) + usd pledged + usd_pledged_real + usd_goal_real");

/*
** Filter the data
*/
ks_projects = selif(ks_projects, ks_projects[., "launched"] .>= "2015");

The autogenerated code is printed to the Command window after import and is listed in the Command History. It can be copied and pasted for later use.

Cleaning Up Our Data

Now that we have loaded our data, let's get a better feel for it. We'll do a few quick checks for common data problems such as:

  • Duplicate observations.
  • Missing values.
  • Data misentries.

Viewing the Data

Let's open our ks_projects data in the Symbol Editor to confirm that it loaded correctly.

Checking for Duplicate Data

After loading my data, one of the first steps I take is to check for duplicate data. Duplicate data can distort results and should be eliminated before modeling.

GAUSS includes three functions, introduced in GAUSS 22, to address duplicate data:

Procedure        Description                                                                        Example
isunique         Checks if a dataframe or matrix contains unique observations, with the option      isunique(mydata, "time"$|"country")
                 to specify which variables to examine.
getDuplicates    Generates a report of duplicate observations found in the specified variables.     getDuplicates(mydata, "time"$|"country")
dropDuplicates   Drops identified duplicates for the specified variables.                           dropDuplicates(mydata, "time"$|"country")

As a preliminary check, let's see if our data has any duplicate ID observations:

// Check for unique ID numbers
isunique(ks_projects, "ID");

The isunique procedure returns 1.0000000, which tells us that our ID values are unique.
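
Had isunique returned 0 instead, we could inspect the offending rows and then remove them with the other two procedures. A minimal sketch of that workflow:

// Report the rows that share an ID value
getDuplicates(ks_projects, "ID");

// Keep only unique observations based on ID
ks_projects = dropDuplicates(ks_projects, "ID");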



Checking for Misentries and Typos

Categorical and string data present the opportunity for errors in spellings, abbreviations, and more. It can sometimes be difficult to locate those errors in data.

The frequency procedure is a great way to check the quality of categorical variables.

Let's look at the frequency report for the country variable:

// Check frequencies of countries
frequency(ks_projects, "country");
Label      Count   Total %    Cum. %
   AT        597    0.3196    0.3196
   AU       5385     2.883     3.202
   BE        617    0.3303     3.533
   CA       9679     5.181     8.714
   CH        760    0.4068     9.121
   DE       4171     2.233     11.35
   DK        996    0.5332     11.89
   ES       2276     1.218      13.1
   FR       2939     1.573     14.68
   GB      19889     10.65     25.32
   HK        618    0.3308     25.66
   IE        736     0.394     26.05
   IT       2878     1.541     27.59
   JP         40   0.02141     27.61
   LU         62   0.03319     27.64
   MX       1752    0.9379     28.58
 N,0"       3028     1.621      30.2
   NL       2081     1.114     31.32
   NO        633    0.3389     31.66
   NZ        994    0.5321     32.19
   SE       1589    0.8506     33.04
   SG        555    0.2971     33.34
   US     124533     66.66       100
Total     186808       100       

In this list, we see one odd category, N,0". However, given the frequency of this category, it's unlikely that this is a misentry and we probably don't want to remove this label.

It is reasonable, though, to remove the non-standard characters , and " from the label:

// Replace the N,0" with N_0 
ks_projects[., "country"] = strreplace(ks_projects[., "country"], "N,0\"", "N_0");

Now if we check our frequencies, we see our updated country label:

// Recheck frequencies of countries
frequency(ks_projects, "country");
Label      Count   Total %    Cum. %
   AT        597    0.3196    0.3196
   AU       5385     2.883     3.202
   BE        617    0.3303     3.533
   CA       9679     5.181     8.714
   CH        760    0.4068     9.121
   DE       4171     2.233     11.35
   DK        996    0.5332     11.89
   ES       2276     1.218      13.1
   FR       2939     1.573     14.68
   GB      19889     10.65     25.32
   HK        618    0.3308     25.66
   IE        736     0.394     26.05
   IT       2878     1.541     27.59
   JP         40   0.02141     27.61
   LU         62   0.03319     27.64
   MX       1752    0.9379     28.58
  N_0       3028     1.621      30.2
   NL       2081     1.114     31.32
   NO        633    0.3389     31.66
   NZ        994    0.5321     32.19
   SE       1589    0.8506     33.04
   SG        555    0.2971     33.34
   US     124533     66.66       100
Total     186808       100     

Generating New Variables

GAUSS dataframes make it easy to create new variables using existing variables, even with strings, categorical data, or dates.

Let's suppose we are interested in the impact that the amount of time between the campaign launch and deadline has on pledges. The raw data doesn't include any variable that measures the amount of time between launch and deadline.

Fortunately, GAUSS date handling and variable name referencing make it easy to create our variable.

// Create variable that measures total days between
// campaign launch and deadline
total_days = timeDiffPosix(ks_projects[., "deadline"], 
                           ks_projects[., "launched"], "days");

One important thing to note is that our new variable total_days is a standard GAUSS matrix, not a dataframe. We can use the asDF procedure to convert total_days to a GAUSS dataframe and add a variable name.

Let's do this and concatenate the new variable to our original data:

// Convert `total_days` to a dataframe
// and concatenate with original data
ks_projects = ks_projects ~ asDF(total_days, "Total Time");

When we open ks_projects in the Symbol Editor, we now see our new variable, Total Time, and its type listed in the variable list.
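
We can also confirm the new column programmatically. Here is one quick check, using the getColNames procedure, which we assume returns a dataframe's column names:

// List the column names of the updated dataframe;
// "Total Time" should now appear as the last entry
// (getColNames is assumed to be available for dataframes)
getColNames(ks_projects);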

Getting to Know Our Data

Now that we've performed some preliminary data cleaning steps, it's time to get to know our data better with some simple data exploration.

Summary Statistics

Quick summary statistics give us insight into our data and can be helpful for finding outliers and other data abnormalities.

The dstatmt procedure allows us to compute descriptive statistics in a single line. It computes the following statistics:

  • Mean
  • Standard deviation
  • Variance
  • Minimum
  • Maximum
  • Valid cases
  • Missing cases

There are a few things to note about the dstatmt procedure:

  • For string variables, only the valid and missing case counts are reported.
  • Mean, Std Dev, and Variance are not computed for categorical variables. However, the minimum and maximum observations are computed based on the underlying key values.

Let's look at the summary statistics for our ks_projects data:

// Compute summary statistics
call dstatmt(ks_projects);
-------------------------------------------------------------------------------------------------
Variable                 Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
-------------------------------------------------------------------------------------------------
ID                  1.075e+09   6.189e+08     3.831e+17   1.852e+04   2.147e+09    186808    0
name                    -----       -----         -----       -----       -----    186807    1
category                -----       -----         -----       -----       -----    186808    0
main_category           -----       -----         -----         Art     Theater    186808    0
deadline                -----       -----         -----  2015-01-05  2018-03-03    186808    0
launched                -----       -----         -----  2015-01-01  2018-01-02    186808    0
pledged             1.192e+04   1.167e+05     1.361e+10           0   2.034e+07    186808    0
state                   -----       -----         -----    canceled   undefined    186808    0
backers                 114.1        1029      1.06e+06           0   2.194e+05    186808    0
country                 -----       -----         -----          AT          US    186808    0
usd pledged              6423   8.693e+04     7.557e+09           0   2.034e+07    183780 3028
usd_pledged_real    1.054e+04   1.084e+05     1.176e+10           0   2.034e+07    186808    0
usd_goal_real       6.052e+04   1.429e+06     2.041e+12        0.49   1.514e+08    186808    0
Total Time              32.71       11.61         134.7    0.005058       89.57    186808    0

One of the first things that stands out from these descriptive statistics is that there are a lot of missing observations for usd pledged.

One solution is to use usd_pledged_real and drop the usd pledged variable using the delcols procedure:

// Eliminate usd_pledged column 
// because of excess missing values
ks_projects = delcols(ks_projects, "usd pledged");

Statistics by Groups

Now, let's see how the mean of the variable usd_pledged_real varies across the variable main_category using the aggregate procedure.

// Find mean of `usd_pledged_real` across `main_category`
aggregate(ks_projects[., "main_category" "usd_pledged_real"], "mean");
   main_category usd_pledged_real
             Art        3472.4776
          Comics        6507.6264
          Crafts        1692.6939
           Dance        3344.5054
          Design        27245.928
         Fashion        5906.9240
    Film & Video        6065.4645
            Food        4736.2799
           Games        21032.577
      Journalism        2403.8985
           Music        3767.5486
     Photography        4571.2224
      Publishing        3792.7673
      Technology        20029.666
         Theater        3886.6224 

This is useful and we can see that there are clear differences in the mean amount pledged across project categories. However, it may be easier to see these differences if we sort our results using sortc:

// Sort aggregated data based on its second column
sortc(aggregate(ks_projects[., "main_category" "usd_pledged_real"], "mean"), 2);
   main_category usd_pledged_real
          Crafts        1692.6939
      Journalism        2403.8985
           Dance        3344.5054
             Art        3472.4776
           Music        3767.5486
      Publishing        3792.7673
         Theater        3886.6224
     Photography        4571.2224
            Food        4736.2799
         Fashion        5906.9240
    Film & Video        6065.4645
          Comics        6507.6264
      Technology        20029.666
           Games        21032.577
          Design        27245.928 

Now we can clearly see that the mean pledged amount is lowest for Crafts projects and largest for Design projects.

Exploring our Data with Visualizations

Data visualizations are one of the most powerful tools for preliminary data exploration. They can help us:

  • Identify outliers and abnormalities in data.
  • Find relationships between variables.
  • Identify time-series dynamics.

All of these provide insights that can help guide modeling.

Unlike publication graphics, preliminary data visualizations for exploration don't need much custom formatting and annotation. Instead, it is more important that we can make quick and clear plots.

In GAUSS 22, plots were enhanced with intelligent graph attributes which make data visualization substantially easier and more useful, including:

  • The ability to specify variables to plot by name with convenient formula strings.
  • Automatic use of variable names and category labels.
  • Data splitting using the new by keyword (see the sketch just below).
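
For instance, the by keyword splits a plot by the levels of a categorical variable directly in the formula string. A minimal sketch, assuming the by(variable) syntax and using our state variable as the grouping factor:

// Scatter plot of pledged amounts against fundraising goals,
// split by the categorical `state` variable
// (the by() keyword syntax inside the formula string is assumed here)
plotScatter(ks_projects, "usd_pledged_real ~ usd_goal_real + by(state)");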

Now, let's use a scatter plot to see how usd_pledged_real is related to the Total Time variable that we created:

// Create scatter plot of `usd_pledged_real` against `Total Time` 
plotScatter(ks_projects, "usd_pledged_real ~ Total Time");

Our plot provides some quick data insights:

  • The majority of campaigns are between 0-60 days in length.
  • There appear to be clusters of campaigns around approximately 30, 45, and 60 days in length.
  • There doesn't appear to be much of a correlation between campaign length and pledged dollars.
  • The majority of campaigns earned less than 3e+06, and there are a lot of campaigns that didn't receive any pledges.

Based on the last observation, it might be useful to:

  1. Filter out the observations that received no pledges (we won't address the topic of truncated data here, and I'm not suggesting that this is in general a good modeling step):
// Filter out observations with no pledges
ks_projects = selif(ks_projects, (ks_projects[., "usd_pledged_real"] .!= 0));
  2. Look at the percentiles of usd_pledged_real using the quantile procedure:
// Create a sequence of percentiles from 0.1 to 0.9
e = seqa(0.1, 0.1, 9);

// Compute quantiles
quantile(ks_projects[., "usd_pledged_real"], e);
 quantile    usd_pledged_real
      0.1           12.000000
      0.2           51.720000
      0.3           151.00000
      0.4           390.60800
      0.5           896.00000
      0.6           1868.1440
      0.7           3664.7240
      0.8           7442.5860
      0.9           18309.240 

This is helpful: it tells us that 50% of the usd_pledged_real observations fall below about $900, 80% fall below about $7,400, and 90% fall below about $18,300.

Given these observations, let's look at a histogram of the pledges that fall between the 10th and 50th percentiles:

// Filter our data to keep
// pledges between the 10th and 50th percentiles
ks_projects = selif(ks_projects, 
                   (ks_projects[., "usd_pledged_real"] .> 12 .and 
                    ks_projects[., "usd_pledged_real"] .< 1000));

// Histogram of remaining data
plotHist(ks_projects[., "usd_pledged_real"], 100); 

Conclusion

Data cleaning is tedious and time-consuming, but we've seen in this blog how the right tools can make it less painful. We've looked at how to efficiently prepare and explore real-world data before modeling or estimation, including:

  • Loading data.
  • Cleaning data to eliminate misentries, missing values, and more.
  • Exploring data.