Introduction
Logistic regression has been a long-standing popular tool for modeling categorical outcomes. It's widely used across fields like epidemiology, finance, and econometrics.
In today's blog we'll look at the fundamentals of logistic regression. We'll use a real-world survey data application and provide a step-by-step guide to implementing your own regularized logistic regression models using the GAUSS Machine Learning library, including:
- Data preparation.
- Model fitting.
- Classification predictions.
- Evaluating predictions and model fit.
What is Logistic Regression?
Logistic regression is a statistical method that can be used to predict the probability of an event occurring based on observed features or variables. The predicted probabilities can then be used to classify the data based on probability thresholds.
For example, if we are modeling a "TRUE" and "FALSE" outcome, we may predict that an outcome will be "TRUE" for all predicted probabilities of 0.5 and higher.
Mathematically, logistic regression models the relationship between the probability of an outcome as a logistic function of the independent variables:
$$ Pr(Y = 1 | X) = p(X) = \frac{e^{B_0 + B_1X}}{1 + e^{B_0 + B_1X}} $$
This log-odds representation is sometimes more common because it is linear in our independent variables:
$$ \log \bigg( \frac{p(X)}{1 + p(X)} \bigg) = B_0 + B_1X $$
There are some important aspects of this model to keep in mind:
- The logistic regression model always yields a prediction between 0 and 1.
- The magnitude of the coefficients in the logistic regression model cannot be as directly interpreted as in the classic linear model.
- The signs of the coefficients in the logistic regression model can be interpreted as expected. For example, if the coefficient on $X_1$ is negative we can conclude that increasing $X_1$ decreases $p(X)$.
Logistic Regression with Regularization
One potential pitfall of logistic regression is its tendency for overfitting, particularly with high dimensional feature sets.
Regularization with L1 and/or L2 penalty parameters with can help prevent overfitting and improve prediction.
Comparison of L1 and L2 Regularization |
||
---|---|---|
$L1$ penalty (Lasso) | $L2$ penalty (Ridge) | |
Penalty term | $\lambda \sum_{j=1}^p |\beta_j|$ | $\lambda \sum_{j=1}^p \beta_j^2$ |
Robust to outliers | ✓ | |
Shrinks coefficients | ✓ | ✓ |
Can select features | ✓ | |
Sensitive to correlated features | ✓ | |
Useful for preventing overfitting | ✓ | ✓ |
Useful for addressing multicollinearity | ✓ | |
Requires hyperparameter selection (λ) | ✓ | ✓ |
Predicting Customer Satisfaction Using Survey Data
Today we will use airline passenger satisfaction data to demonstrate logistic regression with regularization.
Our task is to predict passenger satisfaction using:
- Available survey answers.
- Flight information.
- Passenger characteristics.
Variable | Description |
---|---|
id | Responder identification number |
Gender | Gender identification: Female or Male. |
Customer Type | Loyal or disloyal customer. |
Age | Customer age in years. |
Type of travel | Personal or business travel. |
Class | Eco or business class seat. |
Flight Distance | Flight distance in miles. |
Wifi service | Customer rating on 0-5 scale. |
Schedule convenient | Customer rating on 0-5 scale. |
Ease of Online booking | Customer rating on 0-5 scale. |
Gate location | Customer rating on 0-5 scale. |
Food and drink | Customer rating on 0-5 scale. |
Seat comfort | Customer rating on 0-5 scale. |
Online boarding | Customer rating on 0-5 scale. |
Inflight entertainment | Customer rating on 0-5 scale. |
On-board service | Customer rating on 0-5 scale. |
Leg room service | Customer rating on 0-5 scale. |
Baggage handling | Customer rating on 0-5 scale. |
Checkin service | Customer rating on 0-5 scale. |
Inflight service | Customer rating on 0-5 scale. |
Cleanliness | Customer rating on 0-5 scale. |
Departure Delay in minutes | Minutes delayed when departing. |
Arrival Delay in minutes | Minutes delayed when arriving. |
satisfaction | Overall airline satisfaction. Possible responses include "satisfied" or "neutral or dissatisfied". |
The first step in our analysis is to load our data using loadd
:
new;
library gml;
rndseed 8906876;
/*
** Load datafile
*/
// Set path and filename
load_path = "data/";
fname = "airline_satisfaction.gdat";
// Load data
airline_data = loadd(load_path $+ fname);
// Split data
y = airline_data[., "satisfaction"];
X = delcols(airline_data, "satisfaction"$|"id");
Data Exploration
Before we begin modeling, let's do some preliminary data exploration. First, let's check for common issues that can arise with survey data.
We'll check for:
- Duplicate observations using
isunique
. - Missing values using
dstatmt
.
First, we'll check for duplicates, so any duplicates can be removed prior to checking our summary statistics:
// Check for duplicates
isunique(airline_data);
The isunique
procedure returns a 1 if the data is unique and 0 if there are duplicates.
1.00000000
In this case, it indicates that we have no duplicates in our data.
Next, we'll check for missing values:
/*
** Check for data cleaning
** issues
*/
// Summary statistics
call dstatmt(airline_data);
This prints summary statistics for all variables:
Variable Mean Std Dev Variance Minimum Maximum Valid Missing ------------------------------------------------------------------------------------------------------- Gender ----- ----- ----- Female Male 103904 0 Customer Type ----- ----- ----- Loyal Cust disloyal C 103904 0 Age 39.38 15.11 228.5 7 85 103904 0 Type of Travel ----- ----- ----- Business t Personal T 103904 0 Class ----- ----- ----- Business Eco Plus 103904 0 Flight Distance 2108 1266 1.603e+06 0 3801 103904 0 Wifi service ----- ----- ----- 0 5 103904 0 Schedule convenient ----- ----- ----- 0 5 103904 0 Ease of Online booking ----- ----- ----- 0 5 103904 0 Gate location ----- ----- ----- 0 5 103904 0 Food and drink ----- ----- ----- 0 5 103904 0 Online boarding ----- ----- ----- 0 5 103904 0 Seat comfort ----- ----- ----- 0 5 103904 0 Inflight entertainment ----- ----- ----- 0 5 103904 0 Onboard service ----- ----- ----- 0 5 103904 0 Leg room service ----- ----- ----- 0 5 103904 0 Baggage handling ----- ----- ----- 1 5 103904 0 Checkin service ----- ----- ----- 0 5 103904 0 Inflight service ----- ----- ----- 0 5 103904 0 Cleanliness ----- ----- ----- 0 5 103904 0 Departure Delay in Minutes 14.82 38.23 1462 0 1592 103904 0 Arrival Delay in Minutes 15.25 38.81 1506 0 1584 103904 0 satisfaction ----- ----- ----- neutral or satisfied 103904 0
The summary statistics give us some useful insights:
- There are no missing values in our dataset.
- The summary statistics of our numerical variables don't indicate any obvious outliers.
- All categorical survey data ranges from 0 to 5 with the exception of
Baggage handling
which ranges from 1 to 5. All categorical variables will need to be converted to dummy variables prior to modeling.
One other observation from our summary statistics is that many of the variable names are longer than necessary. Long variable names can be:
- Difficult to remember.
- Prone to typos
- Cutoff when printing results.
(Not to mention they can be annoying to type!).
Let's streamline our variable names using dfname
:
/*
** Update variable names
*/
// Create string array of short names
string short_names = {"Loyalty", "Reason", "Distance", "Wifi",
"Schedule", "Booking", "Gate", "Boarding",
"Entertainment", "Leg room", "Baggage", "Checkin",
"Departure Delay", "Arrival Delay" };
// Create string array of original names to change
string original_names = { "Customer Type", "Type of Travel", "Flight Distance", "Wifi service",
"Schedule convenient", "Ease of Online booking", "Gate location", "Online boarding",
"Inflight entertainment", "Leg room service", "Baggage handling", "Checkin service",
"Departure Delay in Minutes", "Arrival Delay in Minutes" };
// Change names
airline_data = dfname(airline_data, short_names, original_names);
Data Visualization
Data visualization is a great way to get a feel for the relationships between our target variable and our features.
Let's explore the relationship between the customer and flight characteristics and reported satisfaction.
In particular, we'll look at how satisfaction relates to:
- Age.
- Gender.
- Flight distance.
- Seat class.
- Customer type.
Preparing Our Data for Plotting
Today we'll use bar graphs to explore the relationships in our data. In particular, we will sort our data into subgroups and examine how those subgroups report satisfaction.
For categorical variables, we have naturally defined subgroups. However, For the continuous variables, Age
and Distance
, we first need to generate bins based on ranges of these variables.
First, let's place the Age
variable in bins. To do this we will use the reclassifycuts
and reclassify
procedures:
/*
** Create bins for age
*/
// Set age categories cut points
// Class 0: 20 and Under
// Class 1: 21 - 30
// Class 2: 31 - 40
// Class 3: 41 - 50
// Class 4: 51 - 60
// Class 5: 61 - 70
// Class 6: Over 70
cut_pts = { 20,
30,
40,
50,
60,
70};
// Create numeric classes
age_new = reclassifycuts(airline_data[., "Age"], cut_pts);
// Generate labels to recode to
to = "20 and Under"$|
"21-30"$|
"31-40"$|
"41-50"$|
"51-60"$|
"61-70"$|
"Over 70";
// Recode to categorical variable
age_cat = reclassify(age_new, unique(age_new), to);
// Convert to dataframe
age_cat = asDF(age_cat, "Age Group");
For a quick frequency count of this categorical variable, we can use the frequency
procedure:
// Check frequency of age groups
frequency(age_cat, "Age Group");
Label Count Total % Cum. % 20 and Under 11333 10.91 10.91 21-30 21424 20.62 31.53 31-40 21203 20.41 51.93 41-50 23199 22.33 74.26 51-60 18769 18.06 92.32 61-70 7220 6.949 99.27 Over 70 756 0.7276 100 Total 103904 100
Now we will do the same for Distance
.
/*
** Create bins for light distance
*/
// Set distance categories
// Cut points for data
cut_pts = { 1000,
1500,
2000,
2500,
3000,
3500};
// Create numeric classes
distance_new = reclassifycuts(airline_data[., "Distance"], cut_pts);
// Generate labels to recode to
to = "1000 and Under"$|
"1001-1500"$|
"1501-2000"$|
"2001-2500"$|
"2501-3000"$|
"3000-3500"$|
"Over 3500";
// Recode to categorical variable
distance_cat = reclassify(distance_new, unique(distance_new), to);
// Convert to dataframe
distance_cat = asDF(distance_cat, "Flight Range");
// Check frequencies
frequency(distance_cat, "Flight Range");
Label Count Total % Cum. % 1000 and Under 28017 26.96 26.96 1001-1500 10976 10.56 37.53 1501-2000 9331 8.98 46.51 2001-2500 7834 7.54 54.05 2501-3000 8053 7.75 61.8 3000-3500 24815 23.88 85.68 Over 3500 14878 14.32 100 Total 103904 100
Age
We can see from the plot above that passengers 20 and under and passengers over 60 are less likely to be satisfied than other age groups.
Gender
The plot suggests that gender has little impact on reported satisfaction.
Flight Distance
The flight distance plot shows that there are slightly lower rates of satisfaction for flight lengths 3000 miles and over and flight lengths 1000 miles and under.
Seat Class
There is a clear discrepancy in satisfaction between passengers that fly business class and other passengers. Business class customers have a much higher rate of satisfaction than those in economy or economy plus.
Customer Type
Finally, it also appears that loyal passengers are more often satisfied customers than disloyal passengers.
Feature Engineering
As is common with survey data, a number of our variables are categorical. We need to represent these as dummy variables before modeling.
We'll do this using the oneHot
procedure. However, oneHot
only accepts single variables, so we will need to loop through all the categorical variables.
To do this, we first create a list of all categorical variables.
/*
** Create dummy variables
*/
// Get all variable names
col_names = getColNames(X);
// Get types of all variables
col_types = getColTypes(X);
// Select names of variables
// that are categorical
cat_names = selif(col_names, col_types .== "category");
Next, we loop through all categorical variables and create dummy variables for each one using oneHot
.
// Loop through categorical variables
// to create dummy variables
dummy_vars = {};
for i(1, rows(cat_names), 1);
dummy_vars = dummy_vars~oneHot(X[., cat_names[i]]);
endfor;
// Delete original categorical variables
// and replace with dummy variables
X = delcols(x, cat_names)~dummy_vars;
Model Evaluation
There are a number of classification metrics that are reported using the classificationMetrics
procedure. These metrics provide information about how well the model meets different objectives.
Model Comparison Measures |
||
---|---|---|
Tool | Description | |
Accuracy | Overall model accuracy. Equal to the number of correct predictions divided by the total number of predictions. | |
Precision | How good a model is at correctly identifying the class outcomes. Equal to the number of true positives divided by the number of false positives plus true positives. | |
Recall | How good a model is at correctly predicting all the class outcomes. Equal to the number of true positives divided by the number of false negatives plus true positives. | |
F1-score | The harmonic mean of the precision and recall, it gives a more balanced picture of how our model performs. A score of 1 indicates perfect precision and recall. |
We'll keep these in mind as we fit and test our model.
Logistic Regression Model Fitting
We're now ready to begin fitting our models. To start, we will prepare our data by:
Creating training and testing datasets using trainTestSplit
.
// Split data into 70% training and 30% test set
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7);
Scaling our data using rescale
.
/*
** Data rescaling
*/
// Number of variables to rescale
numeric_vars = 4;
// Rescale training data
{ X_train[.,1:numeric_vars], x_mu, x_sd } = rescale(X_train[.,1:numeric_vars], "standardize");
// Rescale test data using same scaling factors as x_train
X_test[.,1:numeric_vars] = rescale(X_test[.,1:numeric_vars], x_mu, x_sd);
Unlike Random Forest models, logistic regression models are sensitive to large differences in the scale of the variables. Standardizing the variables as we do here is a good choice, but is not unequivocally the best option in all cases.
As you can see above, we compute the mean and standard deviation from the training set and use those parameters to scale the test set. This is important.
The purpose of our test set is to give us an estimate of how our model will do on unseen data. Using the mean and standard deviation from the entire dataset, before the train/test split would allow information from the test set to "leak" into our model. Information leakage is beyond the scope of this blog post, but in general the test set should be treated like information that is not available until after the model fit is complete.
Now we're ready to start fitting our models.
Case One: Logistic Regression Without Regularization
As a base case, we'll consider a logistic regression model without any regularization. For this case, we'll use all default settings, so our only inputs are the dependent and independent data.
Using our training data we will:
- Train our model using
logisticRegFit
. - Make predictions on our training data using
lmPredict
. - Evaluate our training model predictions using
classificationMetrics
.
/*************************************
** Base case model
** No regularization
*************************************/
/*
** Training
*/
// Declare 'lr_mdl' to be
// an 'logisticRegModel' structure
// to hold the trained model
struct logisticRegModel lr_mdl;
// Train the logistic regression classifier
lr_mdl = logisticRegFit(y_train, X_train);
// Check training set performance
y_hat_train = lmPredict(lr_mdl, X_train);
// Model evaluations
print "Training Metrics";
call classificationMetrics(y_train, y_hat_train);
The classificationMetrics
procedure prints an evaluation table:
No regularization Training Metrics ============================================================== Classification metrics ============================================================== Class Precision Recall F1-score Support neutral or dissatisfied 0.93 0.92 0.93 41102 satisfied 0.90 0.91 0.90 31631 Macro avg 0.91 0.92 0.91 72733 Weighted avg 0.92 0.92 0.92 72733 Accuracy 0.92 72733
/*
** Testing
*/
// Make predictions on the test set, from our trained model
y_hat_test = lmPredict(lr_mdl, X_test);
/*
** Model evaluation
*/
print "Testing Metrics";
call classificationMetrics(y_test, y_hat_test);
This code prints the following to screen:
Testing Metrics ============================================================== Classification metrics ============================================================== Class Precision Recall F1-score Support neutral or dissatisfied 0.93 0.92 0.92 17777 satisfied 0.90 0.91 0.90 13394 Macro avg 0.91 0.91 0.91 31171 Weighted avg 0.91 0.91 0.91 31171 Accuracy 0.91 31171
There are some good observations comparing our training data and testing data performance:
- First, there is little difference in accuracy across our training and testing data set, with a training accuracy of 0.92 and a testing accuracy of 0.91.
- Our model provides the same average F1-score, which provides a balanced measure of performance, across the testing and training dataset.
Why is this important? This comparison provides a good indication that we aren't overfitting our training set. Since the main purpose of regularization is to address overfitting the model to the training data, we don't have much reason to use it. However, for demonstration purposes, we'll show how to implement L2 regularization.
Case Two: Logistic Regression With L2 Regularization
To implement regularization with the logisticRegFit
, we'll use a logisticRegControl
structure.
/*************************************
** L2 Regularization
*************************************/
/*
** Training
*/
// Declare 'lrc' to be a logisticRegControl
// structure and fill with default settings
struct logisticRegControl lrc;
lrc = logisticRegControlCreate();
// Set L2 regularization parameter
lrc.l2 = 0.05;
// Declare 'lr_mdl' to be
// a 'logisticRegModel' structure
// to hold the trained model
struct logisticRegModel lr_mdl;
// Train the logistic regression classifier
lr_mdl = logisticRegFit(y_train, X_train, lrc);
/*
** Testing
*/
// Make predictions on the test set
y_hat_l2 = lmPredict(lr_mdl, X_test);
/*
** Model evaluation
*/
call classificationMetrics(y_test, y_hat_l2);
The classification metrics are printed:
L2 regularization ============================================================== Classification metrics ============================================================== Class Precision Recall F1-score Support neutral or dissatisfied 0.89 0.93 0.91 17777 satisfied 0.90 0.84 0.87 13394 Macro avg 0.90 0.89 0.89 31171 Weighted avg 0.89 0.89 0.89 31171 Accuracy 0.89 31171
Note that with the L2 penalty, our model performance drops from the base case model, with lower accuracy (0.89) and lower average F1-score (0.89). This isn't surprising, given that we didn't find support of overfitting in our model.
Conclusion
In today's blog, we've looked at logistic regression and regularization.
Using a real-world airline passenger satisfaction data application we've:
- Performed preliminary data and setup.
- Trained logistic regression models with and without regularization.
- Made classification predictions.
- Interpreted classification metrics.
Further Machine Learning Reading
- Predicting Recessions with Machine Learning Techniques
- Applications of Principal Components Analysis in Finance
- Predicting The Output Gap With Machine Learning Regression Models
- Fundamentals of Tuning Machine Learning Hyperparameters
- Understanding Cross-Validation
- Machine Learning With Real-World Data
Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.