Introduction
The GAUSS formula string syntax allows us to compute descriptive statistics directly on a dataset without first loading the dataset into GAUSS. In this tutorial, we will explore a number of ways the formula string syntax can be used to find descriptive statistics with the GAUSS procedure dstatmt. The dstatmt procedure takes one to three inputs: a dataset (required), a formula string specifying the variables (optional), and a control structure (optional).
Descriptive Statistics for All Variables
The simplest case we will consider is finding descriptive statistics for the entire dataset. In this case, no variable specification is required; we only need to specify the dataset name. For example, consider finding descriptive statistics for the demographics and crime statistics in Detroit stored in the SAS dataset detroit.sas7bdat:
//Compute statistics for all variables in the dataset
fname = getGAUSSHome() $+ "examples/detroit.sas7bdat";
//The 'call' keyword disregards return values from the function
call dstatmt(fname);
The output from the above code is:
---------------------------------------------------------------------------------------------------------
Variable                       Mean     Std Dev        Variance      Minimum      Maximum  Valid  Missing
---------------------------------------------------------------------------------------------------------
year                      1967.0000      3.8944         15.1667    1961.0000    1973.0000     13        0
ft_police                  304.5115     46.8117       2191.3312     260.3500     390.1900     13        0
unemployment                 5.7923      2.3592          5.5658       3.2000      11.0000     13        0
manufacture_employ         556.4462     49.8222       2482.2477     455.5000     613.5000     13        0
gun_license                537.5069    316.4151     100118.5406     156.4100    1131.2100     13        0
gun_registration           545.6592    311.0316      96740.6634     180.4800    1029.7500     13        0
homicide_clearance          81.4462     12.6592        160.2560      58.9000      94.4000     13        0
num_white_males         452507.5385  64568.1239   4169042623.43  359647.0000  558724.0000     13        0
non_manufacture_employ     673.9231     94.7734       8981.9969     538.1000     819.8000     13        0
govt_employ                185.7692     37.0362       1371.6790     133.9000     230.9000     13        0
hourly_earn                  3.9477      0.9666          0.9342       2.9100       5.7600     13        0
weekly_earn                169.9708     42.5112       1807.2053     117.1800     258.0500     13        0
homicide                    25.1269     16.3854        268.4825       8.5200      52.3300     13        0
accident_death              46.9231      5.1396         26.4155      39.1700      55.0500     13        0
assault                    311.9500     73.0912       5342.3166     217.9900     473.0100     13        0
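The call keyword above throws the results away after printing them. When the statistics are needed later in a program, the return value can be captured instead. The following is a minimal sketch, assuming the dstatmtOut structure type and its mean and std members as described in the GAUSS command reference for dstatmt; those names do not appear elsewhere in this tutorial, so check them against your GAUSS version.

```gauss
// Assumption: dstatmt returns a dstatmtOut structure
// (type and member names taken from the GAUSS command reference)
struct dstatmtOut dout;

fname = getGAUSSHome() $+ "examples/detroit.sas7bdat";

// Assign the return value instead of discarding it with 'call'
dout = dstatmt(fname);

// Each member holds one value per variable, in dataset order
print dout.mean;
print dout.std;
```

The table is still printed as before; the structure simply makes the same numbers available for further computation.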
Use A Subset of Variables
In some cases, we may be interested in only a small group of variables within the dataset. We can select them with a GAUSS formula string: a string that lists the variable names separated by +, as in "variable_1 + variable_2 + … + variable_k". For example, suppose our variables of interest are homicide, accident_death and gun_license:
call dstatmt(fname, "homicide + accident_death + gun_license");
This time the output reads:
---------------------------------------------------------------------------------------------------------
Variable                       Mean     Std Dev        Variance      Minimum      Maximum  Valid  Missing
---------------------------------------------------------------------------------------------------------
homicide                    25.1269     16.3854        268.4825       8.5200      52.3300     13        0
accident_death              46.9231      5.1396         26.4155      39.1700      55.0500     13        0
gun_license                537.5069    316.4151     100118.5406     156.4100    1131.2100     13        0
Transformed Variables
The GAUSS formula string syntax allows us to transform variables at the same time we perform an analysis. The syntax is the same as calling the function outside of a formula string, but without an assignment: write the name of the function, followed by the name of the variable it should be applied to inside a pair of parentheses. For example, suppose that in addition to the variables in the previous example we also wish to find descriptive statistics for the natural log of num_white_males:
call dstatmt(fname, "ln(num_white_males) + homicide + accident_death + gun_license");
The output now includes descriptive statistics for four variables, ln(num_white_males), homicide, accident_death, and gun_license:
---------------------------------------------------------------------------------------------------------
Variable                       Mean     Std Dev        Variance      Minimum      Maximum  Valid  Missing
---------------------------------------------------------------------------------------------------------
ln(num_white_males)         13.0131      0.1430          0.0204      12.7929      13.2334     13        0
homicide                    25.1269     16.3854        268.4825       8.5200      52.3300     13        0
accident_death              46.9231      5.1396         26.4155      39.1700      55.0500     13        0
gun_license                537.5069    316.4151     100118.5406     156.4100    1131.2100     13        0
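The optional third input mentioned in the introduction, the control structure, can be combined with a formula string. The sketch below assumes the dstatmtControl structure, the dstatmtControlCreate procedure, and an output member that suppresses the printed table when set to 0, as described in the GAUSS command reference for dstatmt. None of these names come from this tutorial, so treat them as assumptions to verify against your GAUSS version.

```gauss
// Assumption: structure types and member names follow the
// GAUSS command reference for dstatmt
struct dstatmtControl dc;
struct dstatmtOut dout;

fname = getGAUSSHome() $+ "examples/detroit.sas7bdat";

// Fill the control structure with default settings
dc = dstatmtControlCreate();

// Assumed behavior: 0 turns off the printed table
dc.output = 0;

// Dataset, formula string and control structure together
dout = dstatmt(fname, "ln(num_white_males) + homicide", dc);

// Work with the results directly
print dout.mean;
```

Starting from the defaults and changing only one member keeps the remaining settings at their documented values.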