This example uses the red wine quality dataset from Cortez et al. (2009) to fit a random forest classification model and then make predictions from the fitted model. The dataset contains 1599 observations of 12 variables: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality.
Split the dataset
Prior to fitting the model, the testTrainSplit function is used to split the data into training and test sets. The testTrainSplit function is compatible with the GAUSS formula string syntax, so the training and test sets can be created without first loading the full dataset. In this classification model, the formula string "quality ~ ." selects quality as the basis for the response variable and all remaining variables as predictors. The response variable is an indicator variable equal to 1 if quality is greater than 6:
// Load wine quality dataset
dataset = getGAUSSHome() $+ "pkgs/gml/examples/winequality-red.csv";
// Split data into training and test sets
{y_train, y_test, x_train, x_test} = testTrainSplit(dataset, "quality ~ . ", 0.7);
// Create indicator variable equal to 1 if quality is greater than 6
y_test = y_test .> 6;
y_train = y_train .> 6;
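As an optional sanity check, the share of observations in the positive class can be inspected before fitting, since the mean of a 0/1 indicator equals the proportion of ones. This is a minimal sketch that assumes y_train and y_test are the 0/1 column vectors created above:
// Proportion of wines with quality greater than 6 in each set
print "share of high-quality wines (train): " meanc(y_train);
print "share of high-quality wines (test): " meanc(y_test);
A small share means that a model predicting 0 for every wine would already score a high accuracy, which is worth keeping in mind when reading the results.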
Estimate the model
The rfClassifyFit function is used on the y_train and x_train datasets to fit a random forest classification model. All results are stored in an rfModel structure:
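// Control structure with default settings
// (assumption: rfControlCreate is available in the installed GML version)
struct rfControl rfc;
rfc = rfControlCreate();
// Output structure
struct rfModel rfm;
// Fit training data using random forest
rfm = rfClassifyFit(y_train, x_train, rfc);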
Make predictions
Once the model is fit, predictions can be made from the x_test dataset using the rfClassifyPredict function, which requires two inputs: an rfModel structure and a data matrix of predictors:
// Make predictions using test data
predictions = rfClassifyPredict(rfm, x_test);
// Print predictions next to the observed classes
print predictions~y_test;
// Print the share of correct predictions
print "accuracy: " meanc(predictions .== y_test);
Output
The output from the code above looks similar to the following, where the first column holds the predictions and the second column holds the observed classes:
0.00000000 0.00000000
0.00000000 0.00000000
0.00000000 0.00000000
0.00000000 1.0000000
0.00000000 0.00000000
0.00000000 0.00000000
0.00000000 1.0000000
0.00000000 0.00000000
0.00000000 0.00000000
0.00000000 0.00000000
accuracy: 0.88541667
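Accuracy alone does not show how the errors are split between the two classes. A minimal sketch of a confusion-matrix tally follows, assuming predictions and y_test are the 0/1 column vectors from the code above. The names n_tp, n_tn, n_fp, n_fn, conf_mat, acc, and base_acc are illustrative only and are not part of GML:
// Count each combination of predicted and observed class
n_tp = sumc((predictions .== 1) .* (y_test .== 1));
n_tn = sumc((predictions .== 0) .* (y_test .== 0));
n_fp = sumc((predictions .== 1) .* (y_test .== 0));
n_fn = sumc((predictions .== 0) .* (y_test .== 1));
// Rows: observed 0, observed 1; columns: predicted 0, predicted 1
conf_mat = (n_tn~n_fp)|(n_fn~n_tp);
print conf_mat;
// Accuracy recovered from the counts; should match meanc(predictions .== y_test)
acc = (n_tp + n_tn) / rows(y_test);
print "accuracy from counts: " acc;
// Accuracy of always predicting class 0, for comparison
base_acc = 1 - meanc(y_test);
print "baseline accuracy: " base_acc;
Expressions are assigned to variables before printing because the GAUSS print statement treats spaces outside parentheses as separators between items. Comparing acc with base_acc indicates how much the forest improves on a rule that never predicts a high-quality wine.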