What is a GAUSS Dataframe and Why Should You Care?

by Eric · Published May 21, 2021 · Updated January 17, 2022

Introduction

The GAUSS dataframe, introduced in GAUSS 21, is a powerful tool for storing data. In today's blog, we explain what a GAUSS dataframe is and discuss the advantages of making it a part of your everyday GAUSS use.

What is a GAUSS dataframe?

The GAUSS dataframe is a tool for storing two-dimensional data. The dataframe allows you to store:

Data in rows and columns, similar to the GAUSS matrix.
Metadata about the data type and type-related properties.
Different data types together in one place.

GAUSS dataframes keep your data together and familiar. They offer a number of advantages including:

Intuitive display and management of strings, dates, and categorical data.
Easy to read results and output.
Simple data referencing with variable names and labels.

How do I create a GAUSS dataframe?

Creating a GAUSS dataframe is easy to do and often does not require extra steps. Let's look at a few different cases of dataframe creation.

Loading data interactively

The GAUSS Data Import window allows us to interactively import data from:

CSV and other text delimited files.
Excel files (XLS, XLSX).
GAUSS datasets (DAT) and matrix files (FMT).
SAS, Stata and SPSS datasets.

When loading data interactively, the Import Options tab lets you specify various aspects of data import, including the option to Keep Metadata:

When Keep Metadata is checked:

The data will be imported as a dataframe.
Variable names, column types, and associated properties will be stored.
Internal algorithms are used to determine variable types and the associated properties.

Questions? Get an answer from our experts or request a quote!

Programmatically loading a dataframe


// Create file name with full path
dataset = getGAUSSHome() $+ "examples/housing.csv";
 
// Load all variables from the file
housing = loadd(dataset);

Creating a GAUSS dataframe programmatically is also very simple and is done by default when using the loadd function.

For datasets that do not have variable type information, such as XLS/XLSX and CSV files, key words are used to tell GAUSS how to interpret the columns:

Keyword	Data Type	Example
`cat`	Category	`yarn = loadd(dataset, "cat(amplitude) + cycles")`
`date`	Date	`eur_usd = loadd(dataset, "date(date) + bid + ask")`
`str`	String	`nba = loadd(dataset, "str(player) + height + weight")`

Changing a GAUSS matrix to a dataframe

A GAUSS matrix can be changed to a dataframe using a number of functions designed for assigning metadata to a matrix. Any time metadata is assigned to a matrix, the matrix is converted into a dataframe.

For example, the setColMetaData function allows us to assign a variable name and type to a column.


x_dataframe = setColMetadata(X, varnames, types);

X: N x K data matrix.
varnames: K x 1 string array, variable names to assign to columns.
types: Kx1 vector, Specifies types to be assigned to names in varnames. Valid options include 0: string, 1: number, 2: categorical 3: date.

Examples of functions that assign metadata

Function	Purpose
`setColMetadata`	Set columns in a matrix to have variable names and types.
`setColNames`	Set column variable names.
`setColTypes`	Set column types.
`setColLabels`	Set categorical labels for a column.

What are the Data Types used in Dataframes?

The GAUSS dataframes allow you to work with 4 different data types:

Numeric
Category
String
Date

The GAUSS `Numeric` data type

The Numeric data type is directly analogous to the data stored in a standard GAUSS matrix. The Numeric data type:

Houses continuous numeric data.
Can be directly used in all GAUSS functions that accept data matrices.
Examples: Daily temperatures, real GDP, weight.

The GAUSS `Category` data type

The Category data type houses discrete variables that capture qualitative outcomes stored in fixed groups (or levels). The Category data type:

Stores categorical labels and underlying key values.
Can be directly used in a number of GAUSS functions with automatic dummy variable creation for estimation.
Seamlessly integrates with categorical data in SAS/Stata/SPSS data files.
Examples: Marriage status, performance ratings, weight class

Our blog, Easy Management of Categorical Variables covers in depth how to use the GAUSS category type.

The GAUSS `String` data type

The String data type can contain letters, numbers, and other characters. The String data type:

Keeps labels with data.
Saves additional loading steps.
Makes data and reports easier to understand.
Examples: Customer names, product name, location

The GAUSS `Date` type

The Date data type houses and displays dates and times. The Date data type:

Saves time, frustration and errors when working with dates and times.
Allows you to display dates in readable formats.
Improves labeling in graphics.

Using Metadata

The information stored in GAUSS dataframes can easily be used to simplify data management, cleaning, and accessing, along with estimation.

For more detailed coverage of how to use GAUSS dataframes, see our online GAUSS Data Management Guide.

Reference your data by name

One of the advantages of the GAUSS dataframe is that:

Variables names.
Categorical and string labels.
Dates.

can be used to reference your data. This makes programming in GAUSS more intuitive and readable.

For example, suppose we want to extract just the mpg and rep78 variable from the auto dataframe. These can be indexed directly using variable names:


// Index by variable name
auto = auto[., "mpg" "rep78"];

We can use the category label "C" to filter the data in our dataframe, NBA by the variable Pos:


// Select if position is center
nba = selif(nba, nba[., "Pos"] .$== "C");

We can also filter dataframes using an easy-to-read date string.


// Select observations on or after June 20th, 2017
xle = selif(xle, xle[.,"date"] .>= "2017-06-20")

Enhanced data exploration

Dataframes also improve data exploration with detailed, readable reports.

For example, when viewed in the Data Editor or printed to screen, a dataframe:

Is readable and clear.
Can be easily filtered and navigated using variable names, dates, etc.
Keeps all your data in one place


>> print lending[1:5,.];

        int_rate             date          purpose
       12.620000       2014-12-14      credit_card
       9.4400000       2014-05-18             auto
       10.420000       2014-02-01          medical
       15.050000       2014-04-05      credit_card
       9.9300000       2014-05-17      credit_card


>> print lending[1:5,"purpose"];

           purpose
       credit_card
              auto
           medical
       credit_card
       credit_card

Summary statistics are more insightful with clearer tables.


>> dstatmt(auto2);

---------------------------------------------------------------------------------------------
Variable             Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
---------------------------------------------------------------------------------------------
make                -----       -----         -----       -----       -----        74    0
price                6165        2949       8.7e+06        3291   1.591e+04        74    0
mpg                  21.3       5.786         33.47          12          41        74    0
rep78               -----       -----         -----        Poor   Excellent        69    5
headroom            2.993       0.846        0.7157         1.5           5        74    0
trunk               13.76       4.277          18.3           5          23        74    0
weight               3019       777.2        604030        1760        4840        74    0
length              187.9       22.27         495.8         142         233        74    0
turn                39.65       4.399         19.35          31          51        74    0
displacement        197.3       91.84          8434          79         425        74    0
gear_ratio          3.015      0.4563        0.2082        2.19        3.89        74    0
foreign             -----       -----         -----    Domestic     Foreign        74    0

Using metadata in estimation

GAUSS dataframes makes estimation easy to implement and easy to interpret with no additional steps needed. GAUSS internally accesses metadata to use variable names, category labels, and dates when appropriate.

For example, if we include a categorical variable, like purpose from the dataset lending.csv dataset, GAUSS automatically:

Creates the dummy variables needed for estimation.
Excludes the base case.
Includes category labels in the output table.


// Load data
lending = loadd("lending.csv");
 
// Estimate in one line
call olsmt(lending, "int_rate ~ .");

                                            Std.
Variable                     Estimate      Error   t-value
-----------------------------------------------------------

CONSTANT                       11.988     1.298      9.235
term: 60_months                 3.987    0.6446      6.184
annual_inc                 -0.0092475  0.005473     -1.689
purpose: credit_card         -0.12827     1.320   -0.09714
         renewable_energy     -1.9637     4.229    -0.4643
         small_business        3.9262     1.799      2.182
         vacation             -0.7038     1.809    -0.3890

Want to see these advantages for yourself? Contact us for a GAUSS 21 demo!

Conclusion

The GAUSS dataframe is a powerful new data storage tool that offers many advantages. Whether you're just getting started with data exploration or you're in the final stages of estimation, you'll find that dataframes make your work easier and more intuitive.

	// Create file name with full path
	dataset = getGAUSSHome() $+ "examples/housing.csv";

	// Load all variables from the file
	housing = loadd(dataset);

	// Select if position is center
	nba = selif(nba, nba[., "Pos"] .$== "C");

	// Select observations on or after June 20th, 2017
	xle = selif(xle, xle[.,"date"] .>= "2017-06-20")

	// Load data
	lending = loadd("lending.csv");

	// Estimate in one line
	call olsmt(lending, "int_rate ~ .");