Introduction
If you've ever done empirical work, you know that real-world data rarely, if ever, arrives clean and ready for modeling. No data analysis project consists solely of fitting a model and making predictions.
In today's blog, we walk through a machine learning project from start to finish. We'll give you a foundation for completing your own machine learning project in GAUSS, working through:
- Data exploration and cleaning.
- Splitting data for training and testing.
- Model fitting and prediction.
- Basic feature engineering.
Background
Our Data
Today we will be working with the California Housing Dataset from Kaggle.
This dataset is built from 1990 Census data. Though it is older, it remains a great demonstration dataset and has been used in many machine learning examples.
The dataset contains 10 variables measured in California at the block group level:
Variable | Description |
---|---|
longitude | Measure of how far west a house is. |
latitude | Measure of how far north a house is. |
housing_median_age | Median age of a house within a block. |
total_rooms | Total number of rooms within a block. |
total_bedrooms | Total number of bedrooms within a block. |
population | Total number of people residing within a block. |
households | Total number of households, a group of people residing within a home unit, for a block. |
median_income | Median income for households within a block of houses (measured in tens of thousands of US Dollars). |
median_house_value | Median house value for households within a block. |
ocean_proximity | Location of the house w.r.t ocean/sea. |
GAUSS Machine Learning
We will use the new GAUSS Machine Learning (GML) library, which provides user-friendly tools for implementing fundamental machine learning models.
To access these tools, we need to load the library:
// Clear workspace and load library
new;
library gml;
// Set random seed
rndseed 8906876;
Data Exploration and Cleaning
With GML loaded, we are now ready to import and clean our data. The first step is to use the loadd procedure to import our data into GAUSS.
/*
** Import datafile
*/
load_path = "data/";
fname = "housing.csv";
// Load all variables
housing_data = loadd(load_path $+ fname);
Descriptive Statistics
Exploratory data analysis allows us to identify important data anomalies, like outliers and missing values.
Let's start by looking at standard descriptive statistics using the dstatmt
procedure:
// Find descriptive statistics
// for all variables in housing_data
dstatmt(housing_data);
This prints a summary table of statistics for all variables.
--------------------------------------------------------------------------------------------------
Variable                   Mean     Std Dev    Variance     Minimum     Maximum    Valid  Missing
--------------------------------------------------------------------------------------------------
longitude                -119.6       2.004       4.014      -124.3      -114.3    20640        0
latitude                  35.63       2.136       4.562       32.54       41.95    20640        0
housing_median_age        28.64       12.59       158.4           1          52    20640        0
total_rooms                2636        2182   4.759e+06           2   3.932e+04    20640        0
total_bedrooms            537.9       421.4   1.776e+05           1        6445    20433      207
population                 1425        1132   1.282e+06           3   3.568e+04    20640        0
households                499.5       382.3   1.462e+05           1        6082    20640        0
median_income             3.871         1.9       3.609      0.4999          15    20640        0
median_house_value    2.069e+05   1.154e+05   1.332e+10     1.5e+04       5e+05    20640        0
ocean_proximity           -----       -----       -----   <1H OCEAN  NEAR OCEAN    20640        0
--------------------------------------------------------------------------------------------------
These statistics allow us to quickly identify several data issues that we need to address prior to fitting our model:
- There are 207 missing observations of the total_bedrooms variable (you may need to scroll to the right of the output).
- Many of our variables show potential outliers, with high variance and large ranges. These should be further explored.
Missing Values
To get a better idea of how to best deal with the missing values, let's check the descriptive statistics for the observations with and without missing values separately.
// Conditional check
// for missing values
e = housing_data[., "total_bedrooms"] .== miss();
// Get descriptive statistics
// for dataset with missing values
dstatmt(selif(housing_data, e));
------------------------------------------------------------------------------------------------
Variable                   Mean     Std Dev    Variance     Minimum     Maximum    Valid  Missing
------------------------------------------------------------------------------------------------
longitude                -119.5       2.001       4.006      -124.1      -114.6      207        0
latitude                   35.5       2.097       4.399       32.66       40.92      207        0
housing_median_age        29.27       11.96       143.2           4          52      207        0
total_rooms                2563        1787   3.194e+06         154   1.171e+04      207        0
total_bedrooms            -----       -----       -----        +INF        -INF        0      207
population                 1478        1057   1.118e+06          37        7604      207        0
households                  510       386.1   1.491e+05          16        3589      207        0
median_income             3.822       1.956       3.824      0.8527          15      207        0
median_house_value     2.06e+05   1.116e+05   1.246e+10    4.58e+04       5e+05      207        0
ocean_proximity           -----       -----       -----   <1H OCEAN  NEAR OCEAN      207        0
------------------------------------------------------------------------------------------------
// Get descriptive statistics
// for dataset without missing values
dstatmt(delif(housing_data, e));
-------------------------------------------------------------------------------------------------
Variable                   Mean     Std Dev    Variance     Minimum     Maximum    Valid  Missing
-------------------------------------------------------------------------------------------------
longitude                -119.6       2.004       4.014      -124.3      -114.3    20433        0
latitude                  35.63       2.136       4.564       32.54       41.95    20433        0
housing_median_age        28.63       12.59       158.6           1          52    20433        0
total_rooms                2637        2185   4.775e+06           2   3.932e+04    20433        0
total_bedrooms            537.9       421.4   1.776e+05           1        6445    20433        0
population                 1425        1133   1.284e+06           3   3.568e+04    20433        0
households                499.4       382.3   1.462e+05           1        6082    20433        0
median_income             3.871       1.899       3.607      0.4999          15    20433        0
median_house_value    2.069e+05   1.154e+05   1.333e+10     1.5e+04       5e+05    20433        0
ocean_proximity           -----       -----       -----   <1H OCEAN  NEAR OCEAN    20433        0
-------------------------------------------------------------------------------------------------
From visual inspection, the descriptive statistics for the observations with missing values are very similar to those for the observations without missing values.
In addition, the missing values make up less than 1% of the total observations. Given this, we will delete the rows containing missing values rather than imputing them.
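As a quick check on that "less than 1%" figure, here is a minimal sketch that computes the share of flagged rows directly. It assumes the indicator vector e created in the conditional check above is still in memory.

// Share of observations with a missing total_bedrooms value
// (e is the 0/1 indicator vector created above)
missing_share = sumc(e) / rows(housing_data);
print "Share of missing observations:" missing_share;

With 207 missing rows out of 20,640 observations, this should print a value of roughly 0.01.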
We can delete the rows with missing values using the packr
procedure:
// Remove rows with missing values
// from housing_data
housing_data = packr(housing_data);
Outliers
Now that we've removed missing values, let's look for other data outliers. Data visualizations like histograms and box plots are a great way to identify potential outliers.
First, let's create a grid plot of histograms for all of our continuous variables:
/*
** Data visualizations
*/
// Get variables names
vars = getColNames(housing_data);
// Set up plotControl
// structure for formatting graphs
struct plotControl plt;
plt = plotGetDefaults("bar");
// Set fonts
plotSetFonts(&plt, "title", "Arial", 14);
plotSetFonts(&plt, "ticks", "Arial", 12);
// Loop through the continuous variables and draw histograms
// (we skip the last column, the categorical ocean_proximity)
for i(1, rows(vars)-1, 1);
plotSetTitle(&plt, vars[i]);
plotLayout(3, 3, i);
plotHist(plt, housing_data[., vars[i]], 50);
endfor;
From our histograms, it appears that several variables suffer from outliers:
- The total_rooms variable, with the majority of the data distributed between 0 and 10,000.
- The total_bedrooms variable, with the majority of the data distributed between 0 and 2000.
- The households variable, with the majority of the data distributed between 0 and 2000.
- The population variable, with the majority of the data falling well below its maximum of roughly 35,700.

A quick numeric check of these ranges follows below.
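To put rough numbers on what the histograms suggest, we can compare an upper percentile with the maximum for each of these variables; a maximum far above the 99th percentile is consistent with a long right tail. This is a minimal sketch using the base GAUSS quantile and maxc procedures (indexing the dataframe with a string array of the column names above is assumed to work as it does for single names).

// Variables flagged by the histograms
check_vars = "total_rooms" $| "total_bedrooms" $| "households" $| "population";

// 99th percentile of each variable
q99 = quantile(housing_data[., check_vars], 0.99);

// Maximum of each variable
mx = maxc(housing_data[., check_vars]);

print "99th percentile:";
print q99;
print "Maximum:";
print mx;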
Box plots of these variables confirm that there are indeed outliers.
plt = plotGetDefaults("box");
// Set fonts
plotSetFonts(&plt, "title", "Arial", 14);
plotSetFonts(&plt, "ticks", "Arial", 12);
string box_vars = { "total_rooms", "total_bedrooms", "households", "population" };
// Loop through the variables and draw boxplots
for i(1, rows(box_vars), 1);
plotLayout(2, 2, i);
plotBox(plt, box_vars[i], housing_data[., box_vars[i]]);
endfor;
Let's filter the data to eliminate these outliers:
/*
** Filter to remove outliers
**
** Delete:
** - total_rooms greater than or equal to 10000
** - total_bedrooms greater than or equal to 2000
** - households greater than or equal to 2000
** - population greater than or equal to 6000
*/
mask = housing_data[., "total_rooms"] .>= 10000;
mask = mask .or housing_data[., "total_bedrooms"] .>= 2000;
mask = mask .or housing_data[., "households"] .>= 2000;
mask = mask .or housing_data[., "population"] .>= 6000;
housing_data = delif(housing_data, mask);
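Before moving on, it can be reassuring to see how much data the filter discarded. Since mask is still in memory (and was built from the pre-filter data), a quick check only needs sumc:

// Count the rows flagged by the outlier filter
n_outliers = sumc(mask);
print "Rows removed as outliers:" n_outliers;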
Data Truncation
The histograms also point to truncation issues with housing_median_age and median_house_value. Let's look into this a little further:
- We'll confirm that these are the most frequently occurring observations using modec. This provides evidence for our suspicion that these are truncation points.
- We'll count the number of observations at these points.
// House value
mode_value = modec(housing_data[., "median_house_value"]);
print "Most frequent median_house_value:" mode_value;
print "Counts:";
sumc(housing_data[., "median_house_value"] .== mode_value);
// House age
mode_age = modec(housing_data[., "housing_median_age"]);
print "Most frequent housing_median_age:" mode_age;
print "Counts:";
sumc(housing_data[., "housing_median_age"] .== mode_age);
We use modec because we can't tell from the histogram alone that these points occur at the variable maximums. It seems reasonable to assume that they do, but we can't be certain.

Most frequent median_house_value:       500001.00
Counts:
       935.00000
Most frequent housing_median_age:       52.000000
Counts:
       1262.0000
These combined observations make up about 10% of the total observations. Because we have no further information about what is occurring at these points, let's remove them from our model.
// Create binary vector with a 1 if either
// 'housing_median_age' or 'median_house_value'
// equal their mode value.
mask = (housing_data[., "housing_median_age"] .== mode_age)
.or (housing_data[., "median_house_value"] .== mode_value);
// Delete the rows if they meet our above criteria
housing_data = delif(housing_data, mask);
Feature Modifications
Our final data cleaning step is to make feature modifications including:
- Rescaling the median_house_value variable to be measured in tens of thousands of US dollars (the same scale as median_income).
- Generating dummy variables to account for the categories of ocean_proximity.
First, we rescale the median_house_value
:
// Rescale median house value variable
housing_data[., "median_house_value"] =
housing_data[., "median_house_value"] ./ 10000;
Next we generate dummy variables for ocean_proximity
.
Let's get a feel for our categorical data using the frequency
procedure:
// Check frequency of
// ocean_proximity categories
frequency(housing_data, "ocean_proximity");
This prints a convenient frequency table:
     Label      Count    Total %     Cum. %
 <1H OCEAN       8095      44.89      44.89
    INLAND       6136      34.03      78.93
    ISLAND          2    0.01109      78.94
  NEAR BAY       1525      8.458      87.39
NEAR OCEAN       2273      12.61        100
     Total      18031        100
We can see from this table that the ISLAND category is very small, containing only 2 observations. We'll exclude it from our modeling dataset.
Now let's create our dummy variables using the oneHot
procedure:
/*
** Generate dummy variables for
** the ocean_proximity using
** one hot encoding
*/
dummy_matrix = oneHot(housing_data[., "ocean_proximity"]);
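Since we'll reference the generated dummy columns by name in the next step (for example, ocean_proximity_ISLAND), it's worth confirming what oneHot named them. A quick check with getColNames, which we used earlier when drawing the histograms:

// Inspect the column names of the one-hot encoded dummy matrix
dummy_names = getColNames(dummy_matrix);
print dummy_names;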
Finally, we'll save our modeling dataset in a GAUSS .gdat file using saved
so we can directly access our clean data in the future:
/*
** Build matrix of features
** Note we exclude:
** - ISLAND dummy variable
** - Original ocean_proximity variable
*/
model_data = delcols(housing_data, "ocean_proximity") ~
delcols(dummy_matrix, "ocean_proximity_ISLAND");
// Save data matrix
saved(model_data, load_path $+ "model_data.gdat");
Data Splitting
In machine learning, it's customary to use separate datasets to fit the model and to evaluate model performance. Since the objective of machine learning models is to provide predictions for unseen data, using a testing set provides a more realistic measure of how our model will perform.
To prepare our data for training and testing, we're going to take two steps:
- Separate our target variable, median_house_value, from our feature set.
- Split our data into a 70% training dataset and a 30% testing dataset using trainTestSplit.
new;
library gml;
rndseed 896876;
/*
** Load datafile
*/
load_path = "data/";
fname = "model_data.gdat";
// Load data
housing_data = loadd(load_path $+ fname);
/*
** Feature management
*/
// Separate dependent and independent data
y = housing_data[., "median_house_value"];
X = delcols(housing_data, "median_house_value");
// Split into 70% training data
// and 30% testing data
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7);
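If you'd like to confirm the split, a quick check of the resulting sample sizes (a simple sketch relying only on rows):

// Confirm the 70/30 split
n_train = rows(y_train);
n_test = rows(y_test);
print "Training observations:" n_train;
print "Testing observations:" n_test;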
Fitting Our Model
Now that we've completed our data cleaning, we're finally ready to fit our model. Today we'll use a LASSO regression model to predict our target variable. LASSO is a form of regularization that has found relative success in economic and financial modeling. It offers a data-driven approach to dealing with high-dimensionality in linear models.
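Concretely, for a given penalty weight $\lambda$, LASSO chooses the coefficients that minimize a penalized least squares objective. The exact scaling of the penalty term varies across implementations, so treat this as the general form rather than the precise objective used by any particular estimator:

$$\hat{\beta} = \underset{\beta}{\arg\min}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - x_i'\beta\right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert$$

The $\ell_1$ penalty shrinks the coefficients toward zero and, as $\lambda$ grows, sets some of them exactly to zero. This is the variable selection behavior we'll see in the results below; at $\lambda = 0$ the objective reduces to ordinary least squares.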
Model Fitting
To fit the LASSO model to our target variable, median_house_value, we'll use lassoFit from the GAUSS Machine Learning library.
/*
** LASSO Model
*/
// Set lambda values
lambda = { 0, 0.1, 0.3 };
// Declare 'mdl' to be an instance of a
// lassoModel structure to hold the estimation results
struct lassoModel mdl;
// Estimate the model with default settings
mdl = lassoFit(y_train, X_train, lambda);
The lassoFit
procedure prints a model description and results:
==============================================================================
Model:                        Lasso
Target Variable:              median_house_value
Number observations:          12622
Number features:              12
==============================================================================

===========================================================
Lambda                              0        0.1        0.3
===========================================================
longitude                      -2.347     -1.013   -0.02555
latitude                       -2.192    -0.9269          0
housing_median_age            0.07189    0.06384    0.03977
total_rooms                 -0.001004          0          0
total_bedrooms                0.01165   0.006107   0.004828
population                  -0.004317  -0.003396  -0.001232
households                   0.006808   0.005119          0
median_income                   3.872      3.569      3.457
ocean_proximity__1H OCEAN      -5.509          0          0
ocean_proximity_INLAND         -9.437     -5.639     -6.575
ocean_proximity_NEAR BAY       -7.083    -0.6395          0
ocean_proximity_NEAR OCEAN     -5.198     0.6378     0.6981
CONST.                         -193.5     -82.98      3.451
===========================================================
DF                                 12         10          7
Training MSE                     33.7       34.7       37.4
The results highlight the variable selection function of LASSO. With $\lambda = 0$, a full least squares model, all features are included in the model. By the time we reach $\lambda = 0.3$, the LASSO regression has removed 5 of our 12 variables (the coefficients set to zero in the last column above):
- latitude
- total_rooms
- households
- ocean_proximity__1H OCEAN
- ocean_proximity_NEAR BAY
As we would expect, median_income has a large positive impact. However, there are a few noteworthy observations about the coefficients of the location-related variables.
As we add more regularization to the model by increasing the value of $\lambda$, ocean_proximity__1H OCEAN and ocean_proximity_NEAR BAY are removed from the model, while the effect of ocean_proximity_INLAND grows in magnitude. latitude is also removed from the model. This could be because these effects are largely captured by the remaining location dummy variables and median_income.
Prediction
We can now test our model's prediction capability on the testing data using lmPredict
:
// Predictions
predictions = lmPredict(mdl, X_test);
// Get MSE
testing_MSE = meanSquaredError(predictions, y_test);
print "Testing MSE"; testing_MSE;
Testing MSE
   33.814993    34.726144    37.199771
As expected, the test MSE values are slightly above the corresponding training MSE for the first two models, while the test MSE for the model with the highest $\lambda$ value is actually a bit lower than its training MSE. This suggests that our model is not overfitting.
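For reference, the mean squared error reported above is just the average squared prediction error over the $n$ test observations:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

Because the target is measured in tens of thousands of dollars, a test MSE in the 34 to 37 range corresponds to a root mean squared error of roughly 5.8 to 6.1, i.e. typical prediction errors on the order of $58,000 to $61,000.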
Feature Engineering
Since our model is not overfitting, we have room to add more variables. One option would be to collect additional variables. However, it is likely that our current data contains more information that we can make accessible to our estimator. We are going to create new features from combinations of our existing features. This is part of a process called feature engineering, which can make a substantial contribution to your machine learning models.
We will start by generating per capita variables for total_rooms
, total_bedrooms
, and households
.
/*
** Create per capita variables
** using population
*/
pc_data = housing_data[., "total_rooms" "total_bedrooms" "households"]
./ housing_data[., "population"];
// Convert to a dataframe and add variable names
pc_data = asdf(pc_data, "rooms_pc"$|"bedrooms_pc"$|"households_pc");
Next we will create a variable representing the percentage of total_rooms
made up by total_bedrooms
:
beds_per_room = X[.,"total_bedrooms"] ./ X[.,"total_rooms"];
and add these columns to X
:
X = X ~ pc_data ~ asdf(beds_per_room, "beds_per_room");
Fit and Predict the New Model
// Reset the random seed so we get the
// same test and train splits as our previous model
rndseed 896876;
// Split our new X into train and test splits
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7);
// Set lambda values
lambda = { 0, 0.1, 0.3 };
// Declare 'mdl' to be an instance of a
// lassoModel structure to hold the estimation results
struct lassoModel mdl;
// Estimate the model with default settings
mdl = lassoFit(y_train, X_train, lambda);
// Predictions
predictions = lmPredict(mdl, X_test);
// Get MSE
testing_MSE = meanSquaredError(predictions, y_test);
print "Testing MSE"; testing_MSE;
==============================================================================
Model:                        Lasso
Target Variable:              median_house_value
Number observations:          12622
Number features:              16
==============================================================================

===========================================================
Lambda                              0        0.1        0.3
===========================================================
longitude                      -2.495     -1.008          0
latitude                        -2.36    -0.9354          0
housing_median_age             0.0808    0.07167    0.04316
total_rooms                -0.0001714          0          0
total_bedrooms               0.005301   0.001517  0.0008104
population                 -0.0004661          0          0
households                  -0.001611          0          0
median_income                   3.947      4.011      3.675
ocean_proximity__1H OCEAN      -5.171          0          0
ocean_proximity_INLAND         -8.635     -4.963     -6.235
ocean_proximity_NEAR BAY       -6.966     -0.875          0
ocean_proximity_NEAR OCEAN     -5.219     0.2927     0.1798
rooms_pc                        2.678     0.1104          0
bedrooms_pc                    -11.68          0          0
households_pc                   22.23      21.47      20.23
beds_per_room                   33.03      17.03      8.029
CONST.                         -221.9     -95.55     -3.059
===========================================================
DF                                 16         11          7
Training MSE                     31.6       32.5       34.3

Testing MSE
   31.505169    32.457936    34.155290
Our train and test MSE have improved for all values of $\lambda$. Of our new variables, households_pc and beds_per_room seem to have the strongest effects.
Extensions
We used a linear regression model, LASSO, to model home values. This choice was somewhat ad hoc, and there are a number of alternatives and extensions that could help improve our predictions.
For example, we could:
- Use clustering or K-nearest neighbors to capture more location information.
- Use principal component analysis to capture the variation in our features, then estimate the linear relationship between the median home values and the principal components.
- Use a random forest model, which generally provides good accuracy for tabular datasets (see the sketch after this list).
- Split the home values into bins and perform classification rather than regression.
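As one example, a random forest version of our regression could look something like the sketch below. Treat this as a hedged sketch rather than a tested recipe: it assumes GML provides the decForestRFit and decForestPredict procedures and a decForestModel structure (check the GML documentation for the exact names and signatures in your version), and it reuses the train/test split created above.

/*
** Hypothetical sketch: random forest regression with GML
** (assumes decForestRFit / decForestPredict and the
** decForestModel structure are available in your GML version)
*/

// Declare an instance of the decForestModel structure
struct decForestModel rf_mdl;

// Fit the random forest with default settings
rf_mdl = decForestRFit(y_train, X_train);

// Predict on the test set and compute the test MSE
rf_predictions = decForestPredict(rf_mdl, X_test);
rf_mse = meanSquaredError(rf_predictions, y_test);
print "Random forest testing MSE:" rf_mse;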
Conclusion
In today's blog we've seen the important role that data exploration and cleaning play in developing a machine learning model. Rarely do we obtain data that we can plug directly into our models. It's best practice to make time for data exploration and cleaning, because any machine learning model is only as reliable as its data.
Further Machine Learning Reading
- Predicting Recessions with Machine Learning Techniques
- Applications of Principal Components Analysis in Finance
- Predicting The Output Gap With Machine Learning Regression Models
- Fundamentals of Tuning Machine Learning Hyperparameters
- Understanding Cross-Validation
- Classification with Regularized Logistic Regression
Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.