Using random forest regression to predict GDP

This tutorial explores the use of random forests to predict Gross Domestic Product (GDP). It shows how to:

Load data using loadd.
Split data into testing and training subsets based on time periods.
Specify parameters for random forest models using the rfControl structure.
Fit a random forest regression model from training data using rfRegressFit.
Plot variable importance using plotVariableImportance.
Use rfRegressPredict to make predictions from a random forest model.

Data descriptions

The data for this tutorial is stored in the GAUSS dataset rf_gdp.dat. The starting variables, their descriptions and their sources can be found here. All monthly data is aggregated to quarterly data using beginning-of-quarter observations. In addition, the percentage change from the preceding period is computed for unemployment, the industrial production index, housing starts, PMI, CCI, CLI, and CPI. Finally, the first through fourth lags are computed for each of the predictor variables, for the change in real GDP growth from the preceding period, and for real GDP growth. Lagged variables are named with the prefix "L" followed by the lag number, and percent change variables are named with the suffix "pc". For example, the variable "L1_housing_pc" is the first lag of the percent change from the preceding period of housing starts.
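The transformations described above can be sketched on a toy series. This is only an illustration, not the code used to build rf_gdp.dat: the series values and the variable names x, x_pc, L1_x_pc and L2_x_pc are hypothetical, and expressing the percent change as a fraction rather than multiplying by 100 is an assumption.

```gauss
// Hypothetical quarterly series (illustrative values only)
x = { 100, 102, 101, 105, 108, 110 };

// Change from the preceding period, expressed as a fraction (assumption)
// The first element is missing because it has no predecessor
x_pc = (x - lagn(x, 1)) ./ lagn(x, 1);

// First and second lags of the percent change,
// following the "L<lag>_<name>_pc" naming pattern
L1_x_pc = lagn(x_pc, 1);
L2_x_pc = lagn(x_pc, 2);

// Columns: level, percent change, lag 1, lag 2
print x~x_pc~L1_x_pc~L2_x_pc;
```

Each lag adds one more leading missing value, so the rows at the start of the sample that contain missing values would need to be dropped before model fitting.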

Loading the dataset

In this example, the dataset is loaded using the loadd function. The loadd function uses the GAUSS formula string syntax, which allows data to be loaded and transformed in a single line. Detailed information on using formula strings is available in the formula string tutorials.

The loadd function requires two inputs:

A dataset specification.
A formula which specifies how to load the data.

The data for this tutorial is stored in the file rf_gdp.dat. For this example, the response variable is loaded separately from the predictor variables. First, the response variable rgdp_pc:

//GDP data quarterly
path = "C:/svn/apps/gml/examples/gdp_tutorial";
gdp_q = loadd(path $+ "/rf_gdp.dat", "rgdp_pc");

All variables, excluding rgdp_pc, are then loaded as the predictor variables:

//Load all other features
features_q = loadd(path $+ "/rf_gdp.dat", ". -rgdp_pc" );

Finally, the variable names are loaded using the getHeaders function:

//Load variable names
vnames = getHeaders(path $+ "/rf_gdp.dat");
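Because later steps index these objects by fixed positions, a quick dimension check after loading can catch a mismatch early. The sketch below assumes gdp_q, features_q and vnames were created exactly as shown above. It only prints sizes and does not change any data.

```gauss
// Assumes gdp_q, features_q and vnames were loaded as shown above

// The response and the predictors should have the same number of rows
print "Rows in response matrix:" rows(gdp_q);
print "Rows in predictor matrix:" rows(features_q);

// The predictor matrix includes the date in its first column
print "Columns in predictor matrix:" cols(features_q);

// getHeaders returns one name per variable in the dataset
print "Number of variable names:" rows(vnames);
```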

Construct training and testing subsets

The data from 1961Q2 through 1999Q4 is used as the training set, and the data from 2000Q1 through 2017Q4 is used as the testing set. To make this range easy to adjust, the testT variable specifies the index of the last observation in the training period. It is then used to split the full data matrices, gdp_q and features_q, into y_train, y_test, x_train and x_test. In addition, the split dates are printed to the screen using the GAUSS function dttostr, which converts the numeric DT scalar format to a more readable string:

/***************************************
Split data for training and testing
Training : 1961Q2 to 1999Q4
Testing  : 2000Q1 to 2017Q4
*****************************************/
testT = 155;

//Print date ranges
print "Start date of training data:" dttostr(features_q[1,1], "YYYY-QQ");
print "End date of training data:" dttostr(features_q[testT,1], "YYYY-QQ");
print "Start date of testing data:" dttostr(features_q[testT+1,1], "YYYY-QQ");
print "End date of testing data:" dttostr(features_q[rows(features_q),1], "YYYY-QQ");

//Split dataset
y_train = gdp_q[1:testT, .];
x_train = features_q[1:testT, 2:45];
y_test = gdp_q[testT+1:rows(gdp_q), .];
x_test = features_q[testT+1:rows(features_q), 2:45];
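As an alternative to hard-coding the cutoff index, it could be derived from the dates themselves. The sketch below is not part of the original example. It assumes that the first column of features_q holds dates in DT scalar format (YYYYMMDDHHMMSS), that the rows are sorted in ascending date order, and that each quarter is stamped with its first day. The name cutoff is hypothetical.

```gauss
// Assumes column 1 of features_q holds DT scalar dates
// (YYYYMMDDHHMMSS), sorted in ascending order

// First date of the testing period: 2000Q1 (assumed stamped as January 1)
cutoff = 20000101000000;

// Count the observations dated before the cutoff;
// this count is the index of the last training observation
testT = sumc(features_q[., 1] .< cutoff);

print "Index of last training observation:" testT;
```

If these assumptions hold for rf_gdp.dat, this should reproduce the value of testT set by hand above, and changing the split only requires editing cutoff.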

Specify model parameters

The random forest model parameters are specified using the rfControl structure. The rfControl structure contains the following members:

Member	Description
numTrees	Scalar, number of trees (must be integer). Default = 100.
obsPerTree	Scalar, observations per tree. Default = 1.0.
featuresPerNode	Scalar, number of features considered at a node. Default = nvars/3.
maxTreeDepth	Scalar, maximum tree depth. Default = unlimited.
minObsNode	Scalar, minimum observations per node. Default = 1.
oobError	Scalar, 1 to compute OOB error, 0 otherwise. Default = 0.
variableImportanceMethod	Scalar, method of calculating variable importance. 0 = none, 1 = mean decrease in impurity, 2 = mean decrease in accuracy (MDA), 3 = scaled MDA. Default = 0.

Using the rfControl structure to change model parameters requires three steps:

Declare an instance of the rfControl structure:
```
struct rfControl rfc;
```
Fill the members in the rfControl structure with default values using rfControlCreate:
```
rfc = rfControlCreate;
```
Change the desired members from their default values:
```
rfc.oobError = 1;
```
For this model both the out-of-bag error and variable importance will be computed. Putting the three steps to do this together:

//Use control structure for settings
struct rfControl rfc;
rfc = rfControlCreate;

//Turn on variable importance
rfc.variableImportanceMethod = 1;

//Turn on OOB error
rfc.oobError = 1;
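The same three-step pattern applies to any other member listed in the table above. The sketch below uses a second control structure, rfc2, purely for illustration. The name and the values chosen are hypothetical and are not tuned for this dataset.

```gauss
// Hypothetical second control structure, for illustration only
struct rfControl rfc2;
rfc2 = rfControlCreate;

// Grow more trees than the default of 100
rfc2.numTrees = 500;

// Require at least 5 observations per node (default is 1)
rfc2.minObsNode = 5;

// Consider 10 features at each node instead of nvars/3
rfc2.featuresPerNode = 10;
```

Any member that is not explicitly changed keeps the default value assigned by rfControlCreate.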

Fitting the random forest regression model

Random forest regression models are fit using the GAUSS procedure rfRegressFit. The rfRegressFit procedure takes two required inputs: the training response matrix and the training predictor matrix. In addition, an rfControl structure may optionally be included to specify model parameters.
The rfRegressFit procedure returns all output in an rfModel structure. An instance of the rfModel structure must be declared prior to calling rfRegressFit. Each instance of the rfModel structure contains the following members:

Member	Description
variableImportance	Matrix, 1 x p, variable importance measure if computation of variable importance is specified, zero otherwise.
oobError	Scalar, out-of-bag error if OOB error computation is specified, zero otherwise.
numClasses	Scalar, number of classes if classification model, zero otherwise.
opaqueModel	Matrix, contains model details for internal use only.

The code below fits the random forest model to the training data, y_train and x_train, which were generated earlier. In addition, the inclusion of the previously created rfControl structure named rfc results in the computation of both the out-of-bag error and the variable importance.

//Output structure
struct rfModel out;

//Fit training data using random forest
out = rfRegressFit(y_train, x_train, rfc);

//OOB Error
print "Out-of-bag error:" out.oobError;

The output from the code above:

Out-of-bag error:   0.00088895283

Plotting variable importance

A useful aspect of the random forest model is the variable importance measure. This measure provides a tool for understanding the relative importance of each predictor in the model. The procedure plotVariableImportance plots a pre-formatted bar graph of the variable importance. The procedure takes two inputs, the rfModel structure and a string array of variable names.

//Plot variable importance
plotVariableImportance(out, vnames[3:46]);

The resulting plot:

[Figure: Variable Importance]
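The importance values can also be inspected numerically. The sketch below assumes out and vnames exist as created above and that variable importance was turned on in the control structure. The names imp, names and idx are hypothetical, and showing five predictors is an arbitrary choice.

```gauss
// Assumes out and vnames exist as created above

// Transpose the 1 x p importance measure into a p x 1 column
imp = out.variableImportance';

// Predictor names matching the columns used for fitting
names = vnames[3:46];

// Indices that order importance from largest to smallest
idx = rev(sortind(imp));

// Five most important predictors, then their importance values
print names[idx[1:5]];
print imp[idx[1:5]];
```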

Make predictions

The rfRegressPredict function is used after rfRegressFit to make predictions from the random forest regression model. The function requires a filled rfModel structure and a test set of predictors. The code below computes the predictions, prints the first 10 predictions alongside the observed values, and compares the random forest MSE to the OLS MSE on the test set:

//Make predictions using test data
predictions = rfRegressPredict(out, x_test);

//Print first 10 predictions next to the observed values
print predictions[1:10,.]~y_test[1:10,.];
print "random forest MSE: " meanc((predictions - y_test).^2);

//Fit OLS with an intercept on the training data
//and print the MSE on the test data
b_hat = y_train / (ones(rows(x_train), 1)~x_train);
y_hat = (ones(rows(x_test),1)~x_test) * b_hat;
print "OLS MSE using test data  : " meanc((y_hat - y_test).^2);

The output:

0.054363536      0.012000000
0.045101603      0.078000000
0.040693139     0.0050000000
0.041476425      0.023000000
0.025337385     -0.011000000
0.012130298      0.021000000
0.021908774     -0.013000000
0.031871651      0.011000000
0.038505552      0.037000000
0.052293718      0.022000000
random forest MSE:     0.0010788232
OLS MSE using test data  :     0.0021904250
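To put the two error measures on the scale of the response and compare them directly, the reported values can be converted to root mean squared errors and a ratio. The sketch below uses only the two MSE values printed above. The names mse_rf and mse_ols are hypothetical.

```gauss
// MSE values copied from the output above
mse_rf = 0.0010788232;
mse_ols = 0.0021904250;

// Root mean squared error, in the same units as rgdp_pc
print "random forest RMSE:" sqrt(mse_rf);
print "OLS RMSE:" sqrt(mse_ols);

// A ratio below 1 means the random forest has the lower test error
print "MSE ratio (RF / OLS):" mse_rf / mse_ols;
```

On this test set the random forest MSE is roughly half the OLS MSE.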

The original GDP series and the predictions are plotted below (code available here):

[Figure: GDP Predictions]

Find the full code for this example here