Introduction
This tutorial explores the use of random forests (also called decision forests) to predict baseball players' salaries.
The example builds on the examples in Chapter 8 of G. James, et al. (2013). The model will include 16 predictors: AtBat, Hits, HmRun, Runs, RBI, Walks, Years, CAtBat, CHits, CHmRun, CRuns, CRBI, CWalks, PutOuts, Assists, and Errors.
This tutorial shows how to:
- Load the data and transform the target variable.
- Use trainTestSplit to split a dataset into random training and testing subsets.
- Specify parameters for random forest models using the dfControl structure.
- Fit a random forest regression model from training data using decForestRFit.
- Plot variable importance using plotVariableImportance.
- Use decForestPredict to make predictions from a random forest model.
Load the Data
The data for this tutorial is stored in the file hitters.xlsx, which ships with the GML examples. The model will use the natural log of salary as the response variable and the 16 previously mentioned variables as predictors.
We will use loadd with GAUSS's formula string syntax, which allows for loading and transforming data in a single line:
new;
library gml;
// Load hitters dataset
dataset = getGAUSSHome $+ "pkgs/gml/examples/hitters.xlsx";
// Load salary and perform natural log transform
y = loadd(dataset, "ln(salary)");
// Load all variables except 'salary'
X = loadd(dataset, ". - salary");
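As an aside, the same formula string syntax can be used to load just a subset of columns. The snippet below is a hypothetical illustration using two of the predictors:
// Hypothetical example: load the log of salary and
// two predictors using the same formula string syntax
subset = loadd(dataset, "ln(salary) + AtBat + Hits");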
Construct Training and Test Sets
The GAUSS Machine Learning (GML) module includes the trainTestSplit function for splitting full datasets into randomly drawn training and test subsets. trainTestSplit requires three inputs when using matrix inputs:
- y - The dependent variable.
- X - The independent variables.
- pct - The proportion of data to include in the training dataset.
The trainTestSplit procedure returns four outputs: y_train, y_test, X_train, and X_test. These outputs contain the response and the predictors for the training and test datasets:
// Set seed for repeatable sampling
rndseed 234234;
// Split data into training and test sets
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7);
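As a quick sanity check, not part of the original workflow, we can confirm that roughly 70% of the observations landed in the training set:
// Check the sizes of the training and test sets
print "Training observations:" rows(y_train);
print "Test observations:" rows(y_test);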
Specify Model Parameters
Since Random Forests is a trademarked term, the GAUSS functions use the term decision forest. All of the related structures and functions use the prefix df.
The decision forest model parameters are specified using the dfControl structure, which contains the following members:
Member | Description |
---|---|
numTrees | Scalar, number of trees (must be an integer). Default = 100. |
obsPerTree | Scalar, proportion of observations sampled per tree. Default = 1.0. |
featuresPerNode | Scalar, number of features considered at a node. Default = nvars/3. |
maxTreeDepth | Scalar, maximum tree depth. Default = unlimited. |
minObsNode | Scalar, minimum observations per node. Default = 1. |
oobError | Scalar, 1 to compute out-of-bag (OOB) error, 0 otherwise. Default = 0. |
variableImportanceMethod | Scalar, method for calculating variable importance. 0 = none, 1 = mean decrease in impurity, 2 = mean decrease in accuracy (MDA), 3 = scaled MDA. Default = 0. |
Using the dfControl structure to change the model parameters requires three steps:
- Declare an instance of the dfControl structure: struct dfControl dfc;
- Fill the members of the dfControl structure with default values using dfControlCreate: dfc = dfControlCreate();
- Change the desired members from their default values: dfc.oobError = 1;
The code below puts these three steps together to turn on both the out-of-bag error and variable importance computation:
// Declare 'dfc' to be a dfControl structure
// and fill with default settings.
struct dfControl dfc;
dfc = dfControlCreate();
// Turn on variable importance
dfc.variableImportanceMethod = 1;
// Turn on OOB error
dfc.oobError = 1;
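Any of the other members listed in the table above can be set the same way. As a hypothetical illustration (the values below are examples, not recommendations), this snippet fills a separate dfControl instance with a larger forest and a minimum node size of five:
// Hypothetical tuning example using a separate control structure
struct dfControl dfcTuned;
dfcTuned = dfControlCreate();
// Grow 500 trees instead of the default 100
dfcTuned.numTrees = 500;
// Require at least 5 observations per node
dfcTuned.minObsNode = 5;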
Fitting the Random Forest Regression Model
Random forest (or decision forest) regression models are fit using the GAUSS procedure decForestRFit.
The decForestRFit procedure takes two required inputs: the training response vector and the training predictor matrix. In addition, the dfControl structure may optionally be included to specify model parameters.
The decForestRFit procedure returns all output to a dfModel structure. An instance of the dfModel structure must be declared prior to calling decForestRFit. Each instance of the dfModel structure contains the following members:
Member | Description |
---|---|
variableImportance | Matrix, 1 x p, variable importance measure if the computation of variable importance is specified, zero otherwise. |
oobError | Scalar, out-of-bag error if OOB error computation is specified, zero otherwise. |
numClasses | Scalar, number of classes if classification model, zero otherwise. |
opaqueModel | Matrix, contains model details for internal use only. |
The code below fits the random forest model to the training data, y_train and X_train, which were generated earlier using trainTestSplit. In addition, the inclusion of dfc, the instance of the previously created dfControl structure, results in the computation of both the out-of-bag error and the variable importance.
// Structure to hold model results
struct dfModel mdl;
// Fit training data using random forest
mdl = decForestRFit(y_train, X_train, dfc);
// OOB Error
print "Out-of-bag error:" mdl.oobError;
The output from the code above:
Out-of-bag error: 0.22886297
Plotting Variable Importance
A useful aspect of the random forest model is the variable importance measure. This measure provides a tool for understanding the relative importance of each predictor in the model. The procedure plotVariableImportance plots a pre-formatted bar graph of the variable importance. It takes a dfModel structure as its only input.
/*
** Plot variable importance
*/
// Load variable names from dataset
// and assign to dfModel structure
mdl.varNames = getHeaders(dataset);
// Draw variable importance plot
plotVariableImportance(mdl);
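If you would like to keep a copy of the chart, GAUSS's plotSave function can export the most recently drawn graph; the file name below is just an example:
// Save the most recent graph to disk
plotSave("variable_importance.png");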
Make Predictions
The decForestPredict function is used after decForestRFit to make predictions from the random forest regression model. The function requires a filled dfModel structure and a test set of predictors. The code below computes the predictions, prints the first five predictions alongside the observed values, and compares the random forest test MSE to the OLS test MSE:
// Make predictions using test data
predictions = decForestPredict(mdl, X_test);
// Print predictions and decision forest test MSE
print predictions[1:5,.]~y_test[1:5,.];
print "";
print "random forest test MSE: " meanc((predictions - y_test).^2);
// Estimate OLS coefficients on the training data;
// in GAUSS, y/X computes the least squares solution
b_hat = y_train / (ones(rows(X_train), 1)~X_train);
alpha_hat = b_hat[1];
b_hat = trimr(b_hat, 1, 0);
y_hat = alpha_hat + X_test * b_hat;
print "OLS test MSE :" meanc((y_hat - y_test).^2);
The output:
6.8781181 6.0402547
6.3337887 6.3630281
6.0366998 5.7838252
6.2387501 6.6200732
5.5766902 5.2522734

random forest test MSE: 0.23044959
OLS test MSE : 0.47283085
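One follow-up worth noting: because the model was fit to the natural log of salary, the predictions are on the log scale. To recover salary levels (in thousands of dollars, the units of the original ISLR data), apply the exponential function; salary_hat below is a hypothetical variable name:
// Convert log-salary predictions back to salary levels
salary_hat = exp(predictions);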
Conclusion
In this post, you've seen how to:
- Load and transform data.
- Create randomly sampled training and test sets.
- Estimate a basic random forest model.
- Draw a variable importance plot.
With just the default settings, the random forest model substantially outperformed OLS, with a test MSE of 0.23 versus 0.47. Find the full code for this example here.