Introduction
Machine learning algorithms often rely on hyperparameters that can significantly impact model performance. These hyperparameters are external to the data and are part of the modeling choices that practitioners must make.
An important step in machine learning modeling is optimizing these hyperparameters to improve prediction accuracy.
In today's blog, we will cover some fundamentals of hyperparameter tuning using our previous decision forest, or random forest, model.
Model Performance
Before we consider how to fit the best machine learning model, we need to look at what it means to be the best model.
First, we must keep in mind that the most common goal in machine learning is to create an algorithm that will create accurate predictions based on unseen data. How successful an algorithm is at achieving this goal is reflected in the out-of-sample, or generalization, error.
The error of a machine learning model can be broken into two main categories: bias and variance.
Error Source | Description
---|---
Bias | The error that occurs when we fit a simple model to a more complex data-generating process. A model with high bias will underfit the training data, as we see in the far left panel of the plot above.
Variance | The expected prediction error that occurs when we apply our model to a new dataset that the model has not seen. A model with high variance will usually overfit the training data, which results in lower training set error but higher error on any data not used for training.
Because of these two sources of error, fitting machine learning models requires finding the right model complexity without overfitting our training data.
Model Performance Measures
There are a number of methods for evaluating the performance of machine learning models. Ultimately, which performance measure is used should be based on business or research objectives.
Common Performance Measures

Method | Description | Uses
---|---|---
Mean Squared Error (MSE) | The average of the squared distance between the target value and the value predicted by the model. | Regression models
Mean Absolute Error (MAE) | The average of the absolute distance between the target value and the value predicted by the model. | Regression models
Root Mean Squared Error (RMSE) | The square root of the mean squared error. | Regression models
Accuracy | The number of correct predictions divided by the total number of predictions. | Classification models
Precision | The ratio of true positives to the total number of predicted positives. | Classification models
Recall | The proportion of true positives divided by the sum of true positives and false negatives. | Classification models
F1-score | The harmonic mean of precision and recall. | Classification models
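For reference, the regression measures above can be written explicitly. With $y_i$ the observed target, $\hat{y}_i$ the model prediction, and $n$ the number of observations:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad \text{RMSE} = \sqrt{\text{MSE}}$$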
Tuning Parameters
Adjusting hyperparameters is one important way that we can impact the performance of machine learning models. Hyperparameters are parameters that:
- Are set before the model is trained and are not learned from the data.
- Determine how the model learns from the data.
- May need to be readjusted to maintain optimal performance as more data is collected.
Example Hyperparameters

Model | Hyperparameter
---|---
K-nearest neighbors | The number of neighbors, $k$, used for classification.
Ridge regression | $\lambda$, the weight on the L2 penalty.
Gradient boosting machines | The number of trees, the shrinkage parameter, and the number of splits in each tree.
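For context, the $\lambda$ hyperparameter listed for ridge regression controls how heavily large coefficients are penalized during estimation:

$$\hat{\beta}_{\text{ridge}} = \underset{\beta}{\arg\min}\; \sum_{i=1}^{n}\left(y_i - x_i'\beta\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2$$

Larger values of $\lambda$ shrink the coefficients more aggressively, so $\lambda$ must be chosen before estimation, just like the other hyperparameters in the table.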
Hyperparameters can have a big impact on how well a model performs. For this reason, it is important to optimize them systematically and strategically through hyperparameter tuning.
Some popular methods for hyperparameter tuning include:
- Grid Search: This is a simple but effective method where you specify a set of values for each hyperparameter, and the algorithm tries all possible combinations of values. This can be time-consuming, but it guarantees that you'll find the best set of hyperparameters within the specified options.
- Random Search: This method randomly selects values for each hyperparameter from a specified range (see the sketch after this list). This can be faster than grid search, especially if you have a large number of hyperparameters, but it's not guaranteed to find the best set of hyperparameters.
- Bayesian Optimization: This is a more advanced method that uses probability models to choose the next set of hyperparameters to test. It takes into account the results of previous tests to choose values that are more likely to result in better performance.
- Evolutionary Algorithms: This method simulates evolution by creating a population of potential solutions (sets of hyperparameters) and selecting the best ones to "breed" new solutions. This process continues until a good solution is found.
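To make the contrast with grid search concrete, here is a minimal sketch of the random search idea in GAUSS. The variable names and ranges below are hypothetical and are not used in the examples later in this post.
// Hypothetical random search sketch: draw 10 candidate settings
// rather than evaluating every grid combination
n_draws = 10;
// Random integer candidates for features per split, between 1 and 12
cand_features = ceil(12 * rndu(n_draws, 1));
// Random candidates for the percentage of observations per tree,
// drawn uniformly between 0.70 and 1.00
cand_pctObs = 0.7 + 0.3 * rndu(n_draws, 1);
// Each row is one candidate hyperparameter combination to fit and evaluate
candidates = cand_features ~ cand_pctObs;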
Examples
Today we will consider two examples of hyperparameter tuning. For each example we:
- Use a decision forest model, similar to the one we previously built to predict the U.S. output gap.
- Perform a grid search to determine the best hyperparameter value or values.
- Use mean squared error as our model performance measure.
The Model
Our model:
- Uses a combination of common economic indicators and GDP subcomponents as predictors of the CBO-based U.S. output gap.
- Uses a 70/30 training and testing split without shuffling.
- Is estimated using the GAUSS Machine Learning (GML) library.
When tuning a decision forest model, there are several hyperparameters that can be considered.
Decision Forest Hyperparameters

Parameter | Description | Impact
---|---|---
Number of trees | The number of decision trees that will be trained and combined to make predictions. | Increasing the number of trees can lead to better performance, but can also increase training time and memory requirements.
Maximum depth | The maximum depth, or number of splits, of each decision tree. | A deeper tree can capture more complex relationships in the data, but can also overfit the data and perform poorly on new data.
Observations per tree | The percentage of observations used per tree. | Increasing the percentage of observations used in each tree can improve accuracy, but it can also increase computational cost, reduce interpretability, and lead to overfitting or loss of diversity.
Minimum observations per node | The minimum number of observations required to be at a leaf node. | Increasing this value can help prevent overfitting, but can also result in a less complex model.
Maximum features | The maximum number of features that can be used to split each node. | Limiting the number of features can help prevent overfitting and reduce training time, but can also result in a less accurate model.
Example One: Tuning a Single Parameter
In our first example, we will use a grid search to tune the number of features used for splitting each node. We will hold all other parameters constant at the GAUSS default values.
Parameter | GAUSS Default
---|---
Number of trees | 100
Maximum tree depth | Unlimited
Percentage of observations per tree | 100%
Minimum observations per leaf | 1
Maximum features | $\frac{\text{Number of Variables}}{3}$
The dfControl Structure
The dfControl structure is an optional argument used to pass hyperparameter values to the decForestRFit and decForestCFit procedures.
Using the structure to change hyperparameters requires three steps:
- Declare an instance of the dfControl structure using the struct keyword.
- Fill in the default values for the structure members using the dfControlCreate procedure.
- Set the desired parameter value using GAUSS "dot" (.) notation.
// Declare an instance of the
// dfControl structure
struct dfControl dfc;
// Set default values for
// structure members
dfc = dfControlCreate();
// Specify features per node
dfc.featuresPerSplit = 4;
Loading and Splitting our Data
The first step for our hyperparameter tuning example is to load our data and split it into training and testing datasets. We can do this using the loadd procedure to load our data and the trainTestSplit procedure to split it.
/*
** Load and split
*/
library gml;
// Load dataset
dataset = __FILE_DIR $+ "reg_data.gdat";
data = loadd(dataset);
/*
** Extract outcome and features
*/
// Extract outcome variable
y = data[., "CBO_GAP"];
// Extract features
X = delcols(data, "date"$|"CBO_GAP");
/*
** Split data into 70% training and 30% testing sets
** without shuffling.
*/
shuffle = "False";
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7, shuffle);
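Since we split without shuffling, the chronological ordering of the data is preserved. As an optional sanity check (not part of the original example), we can confirm the split sizes:
// Optional check: number of observations in each split
print "Training observations:";; rows(y_train);
print "Testing observations:";; rows(y_test);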
Setting Non-Tuning Parameters
Next, we will set the non-tuning hyperparameters to the GAUSS defaults using the dfControl structure.
/*
** Settings for decision forest
*/
// Declare an instance of the
// dfControl structure
struct dfControl dfc;
// Set default values for
// structure members
dfc = dfControlCreate();
Performing Grid Search
Now that we've set our default non-tuning parameters, we will perform our grid search to tune the features per node. The first step is to initialize our grid and storage matrices.
/*
** Initialize grid and
** storage matrices
*/
// Create vector of possible
// features per node values
featuresPerSplit = seqa(1, 1, cols(X));
// Create storage dataframe for MSE
// with one column for training mse
// and one column for testing mse
mse = asDF(zeros(rows(featuresPerSplit), 2), "Train", "Test");
Next, we will loop over each possible value of features per split. For each potential value we:
- Fit the decision forest model using the training data.
- Predict outcomes using the training data.
- Predict outcomes using the testing data.
- Compute the MSE for both the training and testing predictions.
- Store the MSE values.
// Loop over all potential values
// of features per node
for i(1, rows(featuresPerSplit), 1);
// Set featuresPerSplit parameter
dfc.featuresPerSplit = featuresPerSplit[i];
/*
** Decision Forest Model
*/
// Declare 'mdl' to be an instance of a
// dfModel structure to hold the estimation results
struct dfModel mdl;
// Fit the model with default settings
mdl = decForestRFit(y_train, X_train, dfc);
// Make predictions using training data
df_prediction_train = decForestPredict(mdl, X_train);
// Make predictions using testing data
df_prediction_test = decForestPredict(mdl, X_test);
/*
** Compute and store mse
*/
// Training set MSE
mse[i, "Train"] = meanSquaredError(y_train, df_prediction_train);
// Testing set MSE
mse[i, "Test"] = meanSquaredError(y_test, df_prediction_test);
endfor;
Note that within our loop we use the GML procedure meanSquaredError to compute our MSE.
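The meanSquaredError procedure implements the same calculation as the MSE formula shown earlier. As an optional check (not part of the original example), the testing MSE for the most recently fitted model could also be computed directly:
// Optional check: compute the testing MSE by hand
resid_test = y_test - df_prediction_test;
mse_manual = meanc(resid_test .* resid_test);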
Results
A visualization of our MSE values gives us some insight into what happens as we increase the features per node in our decision forest model:
- As we increase the features per node up to about 5 or 6, we see a general downward trend in both the testing and training MSE. Over this range, the additional features per node allow the model to capture more complex interactions and dependencies in the data.
- Increasing the features per node beyond 6 results in a general upward trend in testing MSE and a downward trend in training MSE. This points to overfitting: the model fits the training data too well, capturing noise and irrelevant patterns, which leads to decreased performance on the unseen testing data.
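The plot referenced above is not reproduced here, but a sketch along the following lines could generate it. The title and plot settings are illustrative rather than taken from the original post:
// Illustrative sketch of the MSE visualization
struct plotControl plt;
plt = plotGetDefaults("xy");
// Hypothetical title
plotSetTitle(&plt, "MSE by features per split");
// Plot training and testing MSE against the candidate values
plotXY(plt, featuresPerSplit, mse[., "Train"] ~ mse[., "Test"]);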
To confirm our optimal features per node parameter setting, we can locate the minimum testing MSE:
// Find the row index of the lowest MSE
idx = minindc(mse[., "Test"]);
// NOTE: two semi-colons at the end of a print statement
// prevents it from printing a newline at the end
print "Optimal features per node: ";; featuresPerSplit[idx];
print "Minimum test MSE:";; asmatrix(mse[idx, "Test"]);
This confirms that the optimal features per node is 6, with a testing MSE of 3.212.
Optimal features per node: 6.0000000
Minimum test MSE: 3.2122050
Example Two: Simultaneously Tuning Hyperparameters
Now that we've seen how to tune a single hyperparameter, let's look at tuning two hyperparameters simultaneously. We will use the same data and set up from our previous example:
Data loading and preliminary setup
/*
** Load and split
*/
// Load dataset
dataset = __FILE_DIR $+ "reg_data.gdat";
data = loadd(dataset);
/*
** Extract outcome and features
*/
// Extract outcome variable
y = data[., "CBO_GAP"];
// Extract features
X = delcols(data, "date"$|"CBO_GAP");
/*
** Split data into 70% training and 30% testing sets
** without shuffling
*/
shuffle = "False";
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7, shuffle);
/*
** Settings for decision forest
*/
// Declare an instance of the
// dfControl structure
struct dfControl dfc;
// Set default values for
// structure members
dfc = dfControlCreate();
// Set features per split
dfc.featuresPerSplit = 6;
Note that we set featuresPerSplit to the optimal value found in the previous section. The optimal value of one hyperparameter depends on the values of the others, so in practice, you should not optimize them separately.
Performing Grid Search
In this example, we will tune:
- The minimum observations per leaf, ranging from 1 to 20.
- The percentage of the observations per tree, ranging from 70% to 100%.
First, we initialize our grid and storage matrices. For this example, we will focus only on our testing MSE.
/*
** Initialize grid and
** storage matrices
*/
// Set potential values for
// minimum observations per node
minObsLeaf = seqa(1, 1, 20);
// Set potential values for
// percentage of observations
// in tree
pctObs = seqa(0.7, 0.1, 4);
// Storage matrices
test_mse = zeros(rows(minObsLeaf), rows(pctObs));
Next, we use nested for loops to search over all potential values of the minimum observations per leaf and the percentage of observations per tree.
for i(1, rows(minObsLeaf), 1);
// Set the minimum obs per leaf
dfc.minObsLeaf = minObsLeaf[i];
for j(1, rows(pctObs), 1);
// Set percentage of obs used for each tree
dfc.pctObsPerTree = pctObs[j];
/*
** Decision Forest Model
*/
// Declare 'mdl' to be an instance of a
// dfModel structure to hold the estimation results
struct dfModel mdl;
// Estimate the model with default settings
mdl = decForestRFit(y_train, X_train, dfc);
// Make predictions using testing data
df_prediction_test = decForestPredict(mdl, X_test);
/*
** Compute and store mse
*/
// Testing set MSE
test_mse[i, j] = meanSquaredError(y_test, df_prediction_test);
endfor;
endfor;
Note that in this loop:
- We use i, from the outer loop, to index the minObsLeaf vector.
- We use j, from the inner loop, to index the pctObs vector.
- Each row in our storage matrix represents a constant minimum observations per leaf.
- Each column in our storage matrix represents a constant percentage of observations per tree.
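One optional way (not part of the original example) to make the stored results easier to read is to attach column labels corresponding to each pctObs value; the label names below are hypothetical:
// Optional: label the columns of the storage matrix by the
// percentage of observations per tree they correspond to
test_mse_df = asDF(test_mse, "pct70", "pct80", "pct90", "pct100");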
Results
The above plot shows us that with the GAUSS default settings for a random forest and featuresPerSplit set to 6:
- Taking a sample of 100% of the data for the creation of each tree is almost always best.
- Setting minObsLeaf to between 5 and 10 seems best, with the minimum at about 7.
- We did not get much of an improvement in our test MSE over the first example.
Optional: Finding the minimum MSE value in the output matrix
The final step is to find our optimal hyperparameter settings by locating the combination of parameters that yields the lowest MSE.
We can break this into two steps. First, we find the column that contains the minimum value.
// Create a column vector with the minimum MSE
// values for each column
mse_col_mins = minc(test_mse);
// Find the index of the smallest
// value in 'mse_col_mins'
idx_col_min = minindc(mse_col_mins);
Now that we have found which column contains the minimum MSE value, we use minindc to find the row index of the smallest value in that column.
// Find the row that contains the smallest MSE value
idx_row_min = minindc(test_mse[.,idx_col_min]);
// Extract the lowest MSE across all
// combinations of tuning parameters
MSE_optimal = test_mse[idx_row_min, idx_col_min];
// Print results
sprintf( "Minimum testing MSE: %4f", MSE_optimal);
print "Minimum MSE occurs with";
sprintf(" minimum samples per leaf : %d", minObsLeaf[idx_row_min]);
sprintf(" percentage of samples per tree: %g%%", 100 * pctObs[idx_col_min]);
This prints our results:
Minimum testing MSE: 3.151047
Minimum MSE occurs with
  minimum observations per leaf : 7
  percentage of observations per tree: 100%
Conclusion
Today's blog demonstrates how practitioners can tune hyperparameters to improve machine learning models. It is important to remember that taking the time to systematically and strategically determine model hyperparameters can greatly improve model performance.
Stay tuned, because next time we will take a deeper dive into how to think about the data and which hyperparameter settings make sense to try out.
Further Machine Learning Reading
- Predicting Recessions with Machine Learning Techniques
- Applications of Principal Components Analysis in Finance
- Predicting The Output Gap With Machine Learning Regression Models
- Classification with Regularized Logistic Regression
- Understanding Cross-Validation
- Machine Learning With Real-World Data
Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.