Descriptive Statistics from a Dataset

Introduction

The GAUSS formula string syntax allows us to compute descriptive statistics directly on a dataset without loading the dataset into GAUSS. In this tutorial, we will explore a number of ways the formula string syntax can be used to find descriptive statistics with the GAUSS procedure dstatmt. The dstatmt procedure takes one to three inputs. It always requires a dataset; a variable specification and a control structure are optional.

Descriptive Statistics for All Variables

The simplest case we will consider is finding descriptive statistics for the entire dataset. In this case, no variable specification is required and we only need to specify the dataset name. For example, consider finding descriptive statistics for the demographics and crime statistics in Detroit stored in the SAS dataset detroit.sas7bdat:

//Create the file name with the full path to the example dataset
fname = getGAUSSHome() $+ "examples/detroit.sas7bdat";

//Compute statistics for all variables in the dataset
//The 'call' keyword disregards return values from the function
call dstatmt(fname);

The output from the above code is:

-------------------------------------------------------------------------------------------------------
Variable                       Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
-------------------------------------------------------------------------------------------------------

year                      1967.0000      3.8944       15.1667   1961.0000   1973.0000        13    0
ft_police                  304.5115     46.8117     2191.3312    260.3500    390.1900        13    0
unemployment                 5.7923      2.3592        5.5658      3.2000     11.0000        13    0
manufacture_employ         556.4462     49.8222     2482.2477    455.5000    613.5000        13    0
gun_license                537.5069    316.4151   100118.5406    156.4100   1131.2100        13    0
gun_registration           545.6592    311.0316    96740.6634    180.4800   1029.7500        13    0
homicide_clearance          81.4462     12.6592      160.2560     58.9000     94.4000        13    0
num_white_males         452507.5385  64568.1239 4169042623.43 359647.0000 558724.0000        13    0
non_manufacture_employ     673.9231     94.7734     8981.9969    538.1000    819.8000        13    0
govt_employ                185.7692     37.0362     1371.6790    133.9000    230.9000        13    0
hourly_earn                  3.9477      0.9666        0.9342      2.9100      5.7600        13    0
weekly_earn                169.9708     42.5112     1807.2053    117.1800    258.0500        13    0
homicide                    25.1269     16.3854      268.4825      8.5200     52.3300        13    0
accident_death              46.9231      5.1396       26.4155     39.1700     55.0500        13    0
assault                    311.9500     73.0912     5342.3166    217.9900    473.0100        13    0 
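The example above discards the return value with the 'call' keyword. If the statistics are needed later in a program, the return value can be assigned to an output structure instead. The following is a minimal sketch, assuming the dstatmtOut structure and its vnames, mean and std members behave as described in the GAUSS command reference; the instance name dout is our own choice.

```gauss
//Create the file name with the full path to the example dataset
fname = getGAUSSHome() $+ "examples/detroit.sas7bdat";

//Declare an output structure to hold the results
//(the instance name 'dout' is arbitrary)
struct dstatmtOut dout;

//Assign the return value instead of discarding it with 'call'
dout = dstatmt(fname);

//Print the variable names, then the stored means and standard deviations
print dout.vnames;
print dout.mean;
print dout.std;
```

Each member holds one entry per variable, in the same order as the rows of the printed report.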

Using a Subset of Variables

In some cases, we may be interested in just a small group of variables within the dataset. This can be done using a GAUSS formula string. To find descriptive statistics for a subset of the variables in a dataset, we pass a string that lists the variable names separated by +, in the form variable_1 + variable_2 + … + variable_k. For example, suppose our variables of interest are homicide, accident_death and gun_license:

call dstatmt(fname, "homicide + accident_death + gun_license");

This time the output reads:

-----------------------------------------------------------------------------------------------
Variable               Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
-----------------------------------------------------------------------------------------------

homicide            25.1269     16.3854      268.4825      8.5200     52.3300        13    0
accident_death      46.9231      5.1396       26.4155     39.1700     55.0500        13    0
gun_license        537.5069    316.4151   100118.5406    156.4100   1131.2100        13    0 
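The third, optional input mentioned in the introduction is a control structure. As a hedged sketch, the code below assumes that the dstatmtControl structure is initialized with dstatmtControlCreate and that its output member switches the printed report on (1) or off (0), as described in the GAUSS command reference. The instance names dc0 and dout are our own choices.

```gauss
//Create the file name with the full path to the example dataset
fname = getGAUSSHome() $+ "examples/detroit.sas7bdat";

//Declare a control structure and fill it with default settings
struct dstatmtControl dc0;
dc0 = dstatmtControlCreate();

//Assumption: setting 'output' to 0 suppresses the printed report
dc0.output = 0;

//Declare an output structure to hold the results
struct dstatmtOut dout;

//Pass the dataset, the formula string and the control structure
dout = dstatmt(fname, "homicide + accident_death + gun_license", dc0);

//Print only the means of the three selected variables
print dout.mean;
```

This pattern can be useful inside a larger program, where the statistics feed a later computation and the full report is not needed on screen.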

Transformed Variables

The GAUSS formula string syntax allows us to transform variables at the same time we perform an analysis. The syntax is the same as calling the function outside of the formula string, but without an assignment: write the name of the function to apply, followed by the name of the variable inside a pair of parentheses. For example, suppose that, in addition to the variables in the previous example, we also wish to find descriptive statistics for the natural log of num_white_males.

call dstatmt(fname, "ln(num_white_males) + homicide + accident_death + gun_license");

The output now includes descriptive statistics for four variables: ln(num_white_males), homicide, accident_death, and gun_license.

----------------------------------------------------------------------------------------------------
Variable                    Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
----------------------------------------------------------------------------------------------------

ln(num_white_males)      13.0131      0.1430        0.0204     12.7929     13.2334        13    0
homicide                 25.1269     16.3854      268.4825      8.5200     52.3300        13    0
accident_death           46.9231      5.1396       26.4155     39.1700     55.0500        13    0
gun_license             537.5069    316.4151   100118.5406    156.4100   1131.2100        13    0 
