Using Random Forests to Predict GDP
This tutorial explores the use of random forests to predict Gross Domestic Product (GDP). It shows how to:
- Load data using loadd.
- Split data into training and testing subsets based on time periods.
- Specify parameters for random forest models using the rfControl structure.
- Fit a random forest regression model from training data using rfRegressFit.
- Plot variable importance using plotVariableImportance.
- Use rfRegressPredict to make predictions from a random forest model.
Data descriptions
The data for this tutorial is stored in the GAUSS dataset rf_gdp.dat. The starting variables, their descriptions and their sources can be found here. All monthly data is aggregated to quarterly data using beginning-of-quarter observations. In addition, the percentage change from the preceding period is computed for unemployment, the industrial production index, housing starts, PMI, CCI, CLI, and CPI. Finally, the first through fourth lags of each of the predictor variables, of the change in real GDP growth from the preceding period, and of real GDP growth are computed. Lagged variables are named with the prefix "L" followed by the lag number, and percent change variables are named with the suffix "pc". For example, the variable "L1_housing_pc" is the first lag of the percent change from the preceding period of housing starts.
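The dataset already contains these transformed variables, so nothing below is needed to follow the tutorial. As a rough illustration of how a percent change and its lags could be built, the sketch below uses a made-up series. The values and variable names are hypothetical, and expressing the percent change as a fraction (rather than multiplying by 100) is an assumption.

```gauss
// Toy series standing in for a quarterly predictor (hypothetical values)
x = { 100, 102, 101, 105, 110, 108 };

// Percent change from the preceding period, expressed as a fraction
// (the first element is missing because it has no predecessor)
x_pc = (x - lagn(x, 1)) ./ lagn(x, 1);

// First and second lags of the percent change
L1_x_pc = lagn(x_pc, 1);
L2_x_pc = lagn(x_pc, 2);

// Drop the leading rows that contain missing values created by lagging
clean = packr(x_pc ~ L1_x_pc ~ L2_x_pc);

print clean;
```

Each lag costs one observation at the start of the sample, which is why a dataset with four lags of a percent change begins several quarters after the raw series does.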
Loading the dataset
In this example, the dataset is loaded using the loadd function. The loadd function uses the GAUSS formula string syntax, which allows data to be loaded and transformed in a single line. Detailed information on formula strings is available in the formula string tutorials.
The loadd function requires two inputs:
- A dataset specification.
- A formula which specifies how to load the data.
The data for this tutorial is stored in the file rf_gdp.dat. For this example, the response variable is loaded separately from the predictor variables. First, the response variable rgdp_pc:
//GDP data quarterly
path = "C:/svn/apps/gml/examples/gdp_tutorial";
gdp_q = loadd(path $+ "/rf_gdp.dat", "rgdp_pc");
All variables, excluding rgdp_pc, are then loaded as the predictor variables:
//Load all other features
features_q = loadd(path $+ "/rf_gdp.dat", ". -rgdp_pc" );
Finally, the variable names are loaded using getHeaders:
//Load variable names
vnames = getHeaders(path $+ "/rf_gdp.dat");
Construct training and testing subsets
The data ranging from 1960Q1 to 1999Q4 will be used as the training set, and the data ranging from 2000Q1 to 2017Q4 will be used as the testing set. To make this range easy to adjust, the testT variable is used to specify the cutoff index of the training period. This is then used to split the full data matrices, gdp_q and features_q, into y_train, y_test, x_train and x_test. In addition, the split dates are printed to the screen using the GAUSS function dttostr, which converts the numeric DT scalar format to a more readable string:
/***************************************
Split data for training and testing
Training : 1961Q2 to 1999Q4
Testing  : 2000Q1 to 2017Q4
*****************************************/
testT = 155;
//Print date ranges
print "Start date of training data:" dttostr(features_q[1,1], "YYYY-QQ");
print "End date of training data:" dttostr(features_q[testT,1], "YYYY-QQ");
print "Start date of testing data:" dttostr(features_q[testT+1,1], "YYYY-QQ");
print "End date of testing data:" dttostr(features_q[rows(features_q),1], "YYYY-QQ");
//Split dataset
y_train = gdp_q[1:testT, .];
x_train = features_q[1:testT, 2:45];
y_test = gdp_q[testT+1:rows(gdp_q), .];
x_test = features_q[testT+1:rows(features_q), 2:45];
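Instead of hard-coding the cutoff index, it can be derived from the time column. The sketch below shows the idea on toy data. The time index, the cutoff value, and all variable names are hypothetical. In the tutorial data the first column holds DT scalar dates, so the comparison would be against a DT scalar rather than a small integer.

```gauss
// Toy data: 10 observations, first column is a time index (hypothetical)
t = seqa(1, 1, 10);
x = t ~ rndn(10, 2);
y = rndn(10, 1);

// Last time index that belongs to the training period
cutoff = 7;

// Number of observations at or before the cutoff
nTrain = sumc(t .<= cutoff);

// Split by position, dropping the time column from the predictors
y_tr = y[1:nTrain, .];
x_tr = x[1:nTrain, 2:cols(x)];
y_te = y[nTrain+1:rows(y), .];
x_te = x[nTrain+1:rows(x), 2:cols(x)];

print "Training rows:" rows(x_tr);
print "Testing rows:" rows(x_te);
```

Splitting by position rather than at random keeps every test observation later in time than every training observation, which matters when the predictors are lags of the series being forecast.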
Specify model parameters
The random forest model parameters are specified using the rfControl structure. The rfControl structure contains the following members:
Member | Description |
---|---|
numTrees | Scalar, number of trees (must be integer). Default = 100 |
obsPerTree | Scalar, observations per tree. Default = 1.0. |
featuresPerNode | Scalar, number of features considered at a node. Default = nvars/3. |
maxTreeDepth | Scalar, maximum tree depth. Default = unlimited. |
minObsNode | Scalar, minimum observations per node. Default = 1. |
oobError | Scalar, 1 to compute OOB error, 0 otherwise. Default = 0. |
variableImportanceMethod | Scalar, method of calculating variable importance. 0 = none, 1 = mean decrease in impurity, 2 = mean decrease in accuracy (MDA), 3 = scaled MDA. Default = 0. |
Using the rfControl structure to change model parameters requires three steps:
- Declare an instance of the rfControl structure:
struct rfControl rfc;
- Fill the members of the rfControl structure with default values using rfControlCreate:
rfc = rfControlCreate;
- Change the desired members from their default values:
rfc.oobError = 1;
For this model, both the out-of-bag error and variable importance will be computed. Putting the three steps together:
//Use control structure for settings
struct rfControl rfc;
rfc = rfControlCreate;
//Turn on variable importance
rfc.variableImportanceMethod = 1;
//Turn on OOB error
rfc.oobError = 1;
Fitting the random forest regression model
Random forest regression models are fit using the GAUSS procedure rfRegressFit. The rfRegressFit procedure takes two required inputs, the training response matrix and the training predictor matrix. In addition, the rfControl structure may optionally be included to specify model parameters.
The rfRegressFit procedure returns all output in an rfModel structure. An instance of the rfModel structure must be declared prior to calling rfRegressFit. Each instance of the rfModel structure contains the following members:
Member | Description |
---|---|
variableImportance | Matrix, 1 x p, variable importance measure if computation of variable importance is specified, zero otherwise. |
oobError | Scalar, out-of-bag error if OOB error computation is specified, zero otherwise. |
numClasses | Scalar, number of classes if classification model, zero otherwise. |
opaqueModel | Matrix, contains model details for internal use only. |
The code below fits the random forest model to the training data, y_train and x_train, which were generated earlier. In addition, passing the previously created rfControl structure named rfc results in the computation of both the out-of-bag error and the variable importance.
//Output structure
struct rfModel out;
//Fit training data using random forest
out = rfRegressFit(y_train, x_train, rfc);
//OOB Error
print "Out-of-bag error:" out.oobError;
The output from the code above:
Out-of-bag error: 0.00088895283
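Because the out-of-bag error is computed from the training data alone, it can be used to compare settings without touching the test set. The sketch below loops over a few values of numTrees on simulated data. The data, the candidate values, and the variable names are hypothetical, and the library name gml in the first line is an assumption about how the random forest procedures are made available.

```gauss
// Assumption: the random forest procedures are provided by a library named gml
library gml;

// Simulated data standing in for y_train and x_train (hypothetical)
rndseed 2345;
x = rndn(200, 5);
y = x[., 1] + 0.5 * x[., 2].^2 + 0.1 * rndn(200, 1);

// Control structure with OOB error turned on
struct rfControl rfc;
rfc = rfControlCreate;
rfc.oobError = 1;

// Output structure
struct rfModel mdl;

// Candidate numbers of trees (illustrative values only)
nTrees = { 50, 100, 200 };

for i(1, rows(nTrees), 1);
    nt = nTrees[i];
    rfc.numTrees = nt;

    // Refit with the current setting and report the OOB error
    mdl = rfRegressFit(y, x, rfc);
    err = mdl.oobError;
    print "numTrees:" nt "OOB error:" err;
endfor;
```

The same loop could be written over any other rfControl member from the table above, such as minObsNode or maxTreeDepth.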
Plotting variable importance
A useful aspect of the random forest model is the variable importance measure. This measure provides a tool for understanding the relative importance of each predictor in the model. The procedure plotVariableImportance plots a pre-formatted bar graph of the variable importance. The procedure takes two inputs, the rfModel structure and a string array of variable names.
//Plot variable importance
plotVariableImportance(out, vnames[3:46]);
The resulting plot:
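With many predictors, a ranked text listing can be easier to scan than a bar graph. The sketch below sorts a small set of made-up importance values from largest to smallest. The values and names are hypothetical. With the tutorial's objects, imp would correspond to out.variableImportance and names to vnames[3:46].

```gauss
// Hypothetical 1 x p importance values and matching variable names
imp = { 0.10 0.45 0.05 0.40 };
names = "x1" $| "x2" $| "x3" $| "x4";

// sortind returns ascending order; rev flips it to descending
idx = rev(sortind(imp'));

// Print each name with its importance, most important first
for i(1, rows(idx), 1);
    j = idx[i];
    nm = names[j];
    val = imp[1,j];
    print nm val;
endfor;
```

For these toy values the listing runs x2, x4, x1, x3.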
Make predictions
The rfRegressPredict function is used after rfRegressFit to make predictions from the random forest regression model. The function requires a filled rfModel structure and a test set of predictors. The code below computes the predictions, prints the first 10 predictions alongside the observed values, and compares the random forest MSE to the OLS MSE:
//Make predictions using test data
predictions = rfRegressPredict(out, x_test);
//Print predictions
print predictions[1:10,.]~y_test[1:10,.];
print "random forest MSE: " meanc((predictions - y_test).^2);
//Print ols MSE
b_hat = y_train / (ones(rows(x_train), 1)~x_train);
y_hat = (ones(rows(x_test),1)~x_test) * b_hat;
print "OLS MSE using test data : " meanc((y_hat - y_test).^2);
The output:
0.054363536 0.012000000
0.045101603 0.078000000
0.040693139 0.0050000000
0.041476425 0.023000000
0.025337385 -0.011000000
0.012130298 0.021000000
0.021908774 -0.013000000
0.031871651 0.011000000
0.038505552 0.037000000
0.052293718 0.022000000
random forest MSE: 0.0010788232
OLS MSE using test data : 0.0021904250
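The OLS benchmark relies on the GAUSS / operator, which returns the least squares solution when the right-hand matrix has more rows than columns. As a quick check of that behavior, the sketch below compares it with the textbook normal-equations formula on simulated data. The data, coefficients, and variable names are hypothetical.

```gauss
// Simulated regression data (hypothetical)
rndseed 777;
X = ones(50, 1) ~ rndn(50, 2);
b_true = { 1, 2, -1 };
y = X * b_true + 0.1 * rndn(50, 1);

// Least squares via the / operator, as in the tutorial code
b_slash = y / X;

// Least squares via the normal equations
b_normal = invpd(X'X) * X'y;

// The two estimates should agree up to rounding error
print "Max abs difference:" maxc(abs(b_slash - b_normal));
```

The printed difference should be very close to zero. The / form is shorter and avoids forming the inverse explicitly.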
The original GDP series and the predictions are plotted below (code available here):
Find the full code for this example here