Classification with Regularized Logistic Regression

Introduction

Logistic regression has been a long-standing popular tool for modeling categorical outcomes. It's widely used across fields like epidemiology, finance, and econometrics.

In today's blog we'll look at the fundamentals of logistic regression. We'll use a real-world survey data application and provide a step-by-step guide to implementing your own regularized logistic regression models using the GAUSS Machine Learning library, including:

  1. Data preparation.
  2. Model fitting.
  3. Classification predictions.
  4. Evaluating predictions and model fit.

What is Logistic Regression?

Logistic regression is a statistical method that can be used to predict the probability of an event occurring based on observed features or variables. The predicted probabilities can then be used to classify the data based on probability thresholds.

For example, if we are modeling a "TRUE" and "FALSE" outcome, we may predict that an outcome will be "TRUE" for all predicted probabilities of 0.5 and higher.

Mathematically, logistic regression models the relationship between the probability of an outcome as a logistic function of the independent variables:

$$ Pr(Y = 1 | X) = p(X) = \frac{e^{B_0 + B_1X}}{1 + e^{B_0 + B_1X}} $$

This log-odds representation is sometimes more common because it is linear in our independent variables:

$$ \log \bigg( \frac{p(X)}{1 + p(X)} \bigg) = B_0 + B_1X $$

There are some important aspects of this model to keep in mind:

  • The logistic regression model always yields a prediction between 0 and 1.
  • The magnitude of the coefficients in the logistic regression model cannot be as directly interpreted as in the classic linear model.
  • The signs of the coefficients in the logistic regression model can be interpreted as expected. For example, if the coefficient on $X_1$ is negative we can conclude that increasing $X_1$ decreases $p(X)$.

Logistic Regression with Regularization

One potential pitfall of logistic regression is its tendency for overfitting, particularly with high dimensional feature sets.

Regularization with L1 and/or L2 penalty parameters with can help prevent overfitting and improve prediction.

Comparison of L1 and L2 Regularization

$L1$ penalty (Lasso)$L2$ penalty (Ridge)
Penalty term$\lambda \sum_{j=1}^p |\beta_j|$$\lambda \sum_{j=1}^p \beta_j^2$
Robust to outliers
Shrinks coefficients
Can select features
Sensitive to correlated features
Useful for preventing overfitting
Useful for addressing multicollinearity
Requires hyperparameter selection (λ)

Predicting Customer Satisfaction Using Survey Data

Today we will use airline passenger satisfaction data to demonstrate logistic regression with regularization.

Our task is to predict passenger satisfaction using:

  • Available survey answers.
  • Flight information.
  • Passenger characteristics.
VariableDescription
idResponder identification number
GenderGender identification: Female or Male.
Customer TypeLoyal or disloyal customer.
AgeCustomer age in years.
Type of travelPersonal or business travel.
ClassEco or business class seat.
Flight DistanceFlight distance in miles.
Wifi serviceCustomer rating on 0-5 scale.
Schedule convenientCustomer rating on 0-5 scale.
Ease of Online bookingCustomer rating on 0-5 scale.
Gate locationCustomer rating on 0-5 scale.
Food and drinkCustomer rating on 0-5 scale.
Seat comfortCustomer rating on 0-5 scale.
Online boardingCustomer rating on 0-5 scale.
Inflight entertainmentCustomer rating on 0-5 scale.
On-board serviceCustomer rating on 0-5 scale.
Leg room serviceCustomer rating on 0-5 scale.
Baggage handlingCustomer rating on 0-5 scale.
Checkin serviceCustomer rating on 0-5 scale.
Inflight serviceCustomer rating on 0-5 scale.
CleanlinessCustomer rating on 0-5 scale.
Departure Delay in minutesMinutes delayed when departing.
Arrival Delay in minutesMinutes delayed when arriving.
satisfactionOverall airline satisfaction. Possible responses include "satisfied" or "neutral or dissatisfied".


The first step in our analysis is to load our data using loadd:

new;
library gml;
rndseed 8906876;

/*
** Load datafile
*/
// Set path and filename
load_path = "data/";
fname = "airline_satisfaction.gdat";

// Load data
airline_data = loadd(load_path $+ fname);

// Split data
y = airline_data[., "satisfaction"];
X = delcols(airline_data, "satisfaction"$|"id");

Data Exploration

Before we begin modeling, let's do some preliminary data exploration. First, let's check for common issues that can arise with survey data.

We'll check for:

First, we'll check for duplicates, so any duplicates can be removed prior to checking our summary statistics:

// Check for duplicates
isunique(airline_data);

The isunique procedure returns a 1 if the data is unique and 0 if there are duplicates.

1.00000000

In this case, it indicates that we have no duplicates in our data.

Next, we'll check for missing values:

/*
** Check for data cleaning
** issues
*/
// Summary statistics
call dstatmt(airline_data);

This prints summary statistics for all variables:

Variable                       Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
-------------------------------------------------------------------------------------------------------

Gender                        -----       -----         -----      Female        Male    103904    0
Customer Type                 -----       -----         -----  Loyal Cust  disloyal C    103904    0
Age                           39.38       15.11         228.5           7          85    103904    0
Type of Travel                -----       -----         -----  Business t  Personal T    103904    0
Class                         -----       -----         -----    Business    Eco Plus    103904    0
Flight Distance                2108        1266     1.603e+06           0        3801    103904    0
Wifi service                  -----       -----         -----           0           5    103904    0
Schedule convenient           -----       -----         -----           0           5    103904    0
Ease of Online booking        -----       -----         -----           0           5    103904    0
Gate location                 -----       -----         -----           0           5    103904    0
Food and drink                -----       -----         -----           0           5    103904    0
Online boarding               -----       -----         -----           0           5    103904    0
Seat comfort                  -----       -----         -----           0           5    103904    0
Inflight entertainment        -----       -----         -----           0           5    103904    0
Onboard service               -----       -----         -----           0           5    103904    0
Leg room service              -----       -----         -----           0           5    103904    0
Baggage handling              -----       -----         -----           1           5    103904    0
Checkin service               -----       -----         -----           0           5    103904    0
Inflight service              -----       -----         -----           0           5    103904    0
Cleanliness                   -----       -----         -----           0           5    103904    0
Departure Delay in Minutes    14.82       38.23          1462           0        1592    103904    0
Arrival Delay in Minutes      15.25       38.81          1506           0        1584    103904    0
satisfaction                  -----       -----         -----  neutral or   satisfied    103904    0 

The summary statistics give us some useful insights:

  • There are no missing values in our dataset.
  • The summary statistics of our numerical variables don't indicate any obvious outliers.
  • All categorical survey data ranges from 0 to 5 with the exception of Baggage handling which ranges from 1 to 5. All categorical variables will need to be converted to dummy variables prior to modeling.

One other observation from our summary statistics is that many of the variable names are longer than necessary. Long variable names can be:

  • Difficult to remember.
  • Prone to typos
  • Cutoff when printing results.

(Not to mention they can be annoying to type!).

Let's streamline our variable names using dfname:

/*
** Update variable names
*/
// Create string array of short names
string short_names = {"Loyalty", "Reason", "Distance", "Wifi", 
                      "Schedule", "Booking", "Gate", "Boarding", 
                      "Entertainment", "Leg room", "Baggage", "Checkin", 
                      "Departure Delay", "Arrival Delay" };

// Create string array of original names to change                      
string original_names = { "Customer Type", "Type of Travel", "Flight Distance", "Wifi service",
                          "Schedule convenient", "Ease of Online booking", "Gate location", "Online boarding",
                          "Inflight entertainment", "Leg room service", "Baggage handling", "Checkin service",
                          "Departure Delay in Minutes", "Arrival Delay in Minutes" };

// Change names
airline_data = dfname(airline_data, short_names, original_names);

Data Visualization

Data visualization is a great way to get a feel for the relationships between our target variable and our features.

Let's explore the relationship between the customer and flight characteristics and reported satisfaction.

In particular, we'll look at how satisfaction relates to:

  • Age.
  • Gender.
  • Flight distance.
  • Seat class.
  • Customer type.

Preparing Our Data for Plotting

Today we'll use bar graphs to explore the relationships in our data. In particular, we will sort our data into subgroups and examine how those subgroups report satisfaction.

For categorical variables, we have naturally defined subgroups. However, For the continuous variables, Age and Distance, we first need to generate bins based on ranges of these variables.

First, let's place the Age variable in bins. To do this we will use the reclassifycuts and reclassify procedures:

/*
** Create bins for age
*/
// Set age categories cut points
// Class 0: 20 and Under
// Class 1: 21 - 30
// Class 2: 31 - 40
// Class 3: 41 - 50
// Class 4: 51 - 60
// Class 5: 61 - 70
// Class 6: Over 70
cut_pts = { 20, 
            30, 
            40, 
            50, 
            60, 
            70};

// Create numeric classes
age_new = reclassifycuts(airline_data[., "Age"], cut_pts);

// Generate labels to recode to
to = "20 and Under"$|
       "21-30"$|
       "31-40"$|
       "41-50"$|
       "51-60"$|
       "61-70"$|
       "Over 70";

// Recode to categorical variable
age_cat = reclassify(age_new, unique(age_new), to);

// Convert to dataframe
age_cat = asDF(age_cat, "Age Group");

For a quick frequency count of this categorical variable, we can use the frequency procedure:

// Check frequency of age groups
frequency(age_cat, "Age Group");
       Label      Count   Total %    Cum. %
20 and Under      11333     10.91     10.91
       21-30      21424     20.62     31.53
       31-40      21203     20.41     51.93
       41-50      23199     22.33     74.26
       51-60      18769     18.06     92.32
       61-70       7220     6.949     99.27
     Over 70        756    0.7276       100
       Total     103904       100     

Now we will do the same for Distance.

/*
** Create bins for light distance
*/       
// Set distance categories
// Cut points for data 
cut_pts = { 1000, 
            1500, 
            2000, 
            2500, 
            3000,
            3500};

// Create numeric classes
distance_new = reclassifycuts(airline_data[., "Distance"], cut_pts);

// Generate labels to recode to
to = "1000 and Under"$|
       "1001-1500"$|
       "1501-2000"$|
       "2001-2500"$|
       "2501-3000"$|
       "3000-3500"$|
       "Over 3500";

// Recode to categorical variable
distance_cat = reclassify(distance_new, unique(distance_new), to);

// Convert to dataframe
distance_cat = asDF(distance_cat, "Flight Range");

// Check frequencies
frequency(distance_cat, "Flight Range");
         Label      Count   Total %    Cum. %
1000 and Under      28017     26.96     26.96
     1001-1500      10976     10.56     37.53
     1501-2000       9331      8.98     46.51
     2001-2500       7834      7.54     54.05
     2501-3000       8053      7.75      61.8
     3000-3500      24815     23.88     85.68
     Over 3500      14878     14.32       100
         Total     103904       100    

Age

We can see from the plot above that passengers 20 and under and passengers over 60 are less likely to be satisfied than other age groups.

Gender

The plot suggests that gender has little impact on reported satisfaction.

Flight Distance

The flight distance plot shows that there are slightly lower rates of satisfaction for flight lengths 3000 miles and over and flight lengths 1000 miles and under.

Seat Class

There is a clear discrepancy in satisfaction between passengers that fly business class and other passengers. Business class customers have a much higher rate of satisfaction than those in economy or economy plus.

Customer Type

Finally, it also appears that loyal passengers are more often satisfied customers than disloyal passengers.

Feature Engineering

As is common with survey data, a number of our variables are categorical. We need to represent these as dummy variables before modeling.

We'll do this using the oneHot procedure. However, oneHot only accepts single variables, so we will need to loop through all the categorical variables.

To do this, we first create a list of all categorical variables.

/*
** Create dummy variables
*/
// Get all variable names
col_names = getColNames(X);

// Get types of all variables
col_types = getColTypes(X);

// Select names of variables
// that are categorical
cat_names = selif(col_names, col_types .== "category");

Next, we loop through all categorical variables and create dummy variables for each one using oneHot.

// Loop through categorical variables
// to create dummy variables
dummy_vars = {};
for i(1, rows(cat_names), 1); 
    dummy_vars = dummy_vars~oneHot(X[., cat_names[i]]);
endfor;

// Delete original categorical variables
// and replace with dummy variables
X = delcols(x, cat_names)~dummy_vars;

Model Evaluation

There are a number of classification metrics that are reported using the classificationMetrics procedure. These metrics provide information about how well the model meets different objectives.

Model Comparison Measures

ToolDescription
AccuracyOverall model accuracy. Equal to the number of correct predictions divided by the total number of predictions.
PrecisionHow good a model is at correctly identifying the class outcomes. Equal to the number of true positives divided by the number of false positives plus true positives.
RecallHow good a model is at correctly predicting all the class outcomes. Equal to the number of true positives divided by the number of false negatives plus true positives.
F1-scoreThe harmonic mean of the precision and recall, it gives a more balanced picture of how our model performs. A score of 1 indicates perfect precision and recall.

We'll keep these in mind as we fit and test our model.

Logistic Regression Model Fitting

We're now ready to begin fitting our models. To start, we will prepare our data by:

Creating training and testing datasets using trainTestSplit.

// Split data into 70% training and 30% test set
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7);

Scaling our data using rescale.

/*
** Data rescaling
*/
// Number of variables to rescale
numeric_vars = 4;

// Rescale training data
{ X_train[.,1:numeric_vars], x_mu, x_sd } = rescale(X_train[.,1:numeric_vars], "standardize");

// Rescale test data using same scaling factors as x_train
X_test[.,1:numeric_vars] = rescale(X_test[.,1:numeric_vars], x_mu, x_sd);

Unlike Random Forest models, logistic regression models are sensitive to large differences in the scale of the variables. Standardizing the variables as we do here is a good choice, but is not unequivocally the best option in all cases.

As you can see above, we compute the mean and standard deviation from the training set and use those parameters to scale the test set. This is important.

The purpose of our test set is to give us an estimate of how our model will do on unseen data. Using the mean and standard deviation from the entire dataset, before the train/test split would allow information from the test set to "leak" into our model. Information leakage is beyond the scope of this blog post, but in general the test set should be treated like information that is not available until after the model fit is complete.

Now we're ready to start fitting our models.

Case One: Logistic Regression Without Regularization

As a base case, we'll consider a logistic regression model without any regularization. For this case, we'll use all default settings, so our only inputs are the dependent and independent data.

Using our training data we will:

  1. Train our model using logisticRegFit.
  2. Make predictions on our training data using lmPredict.
  3. Evaluate our training model predictions using classificationMetrics.
/*************************************
** Base case model
** No regularization
*************************************/

/*
** Training
*/
// Declare 'lr_mdl' to be 
// an 'logisticRegModel' structure
// to hold the trained model
struct logisticRegModel lr_mdl;

// Train the logistic regression classifier
lr_mdl = logisticRegFit(y_train, X_train);

// Check training set performance
y_hat_train = lmPredict(lr_mdl, X_train);

// Model evaluations
print "Training Metrics";
call classificationMetrics(y_train, y_hat_train);

The classificationMetrics procedure prints an evaluation table:

No regularization
Training Metrics
==============================================================
                                        Classification metrics
==============================================================
                  Class   Precision  Recall  F1-score  Support

neutral or dissatisfied        0.93    0.92      0.93    41102
              satisfied        0.90    0.91      0.90    31631

              Macro avg        0.91    0.92      0.91    72733
           Weighted avg        0.92    0.92      0.92    72733

               Accuracy                          0.92    72733
/*
** Testing
*/
// Make predictions on the test set, from our trained model
y_hat_test = lmPredict(lr_mdl, X_test);

/*
** Model evaluation
*/
print "Testing Metrics";
call classificationMetrics(y_test, y_hat_test);

This code prints the following to screen:

Testing Metrics
==============================================================
                                        Classification metrics
==============================================================
                  Class   Precision  Recall  F1-score  Support

neutral or dissatisfied        0.93    0.92      0.92    17777
              satisfied        0.90    0.91      0.90    13394

              Macro avg        0.91    0.91      0.91    31171
           Weighted avg        0.91    0.91      0.91    31171

               Accuracy                          0.91    31171

There are some good observations comparing our training data and testing data performance:

  • First, there is little difference in accuracy across our training and testing data set, with a training accuracy of 0.92 and a testing accuracy of 0.91.
  • Our model provides the same average F1-score, which provides a balanced measure of performance, across the testing and training dataset.

Why is this important? This comparison provides a good indication that we aren't overfitting our training set. Since the main purpose of regularization is to address overfitting the model to the training data, we don't have much reason to use it. However, for demonstration purposes, we'll show how to implement L2 regularization.

Case Two: Logistic Regression With L2 Regularization

To implement regularization with the logisticRegFit, we'll use a logisticRegControl structure.

/*************************************
** L2 Regularization
*************************************/

/*
** Training
*/
// Declare 'lrc' to be a logisticRegControl
// structure and fill with default settings 
struct logisticRegControl lrc;
lrc = logisticRegControlCreate();

// Set L2 regularization parameter
lrc.l2 = 0.05;

// Declare 'lr_mdl' to be 
// a 'logisticRegModel' structure
// to hold the trained model
struct logisticRegModel lr_mdl;

// Train the logistic regression classifier
lr_mdl = logisticRegFit(y_train, X_train, lrc);

/*
** Testing
*/
// Make predictions on the test set
y_hat_l2 = lmPredict(lr_mdl, X_test);

/*
** Model evaluation
*/
call classificationMetrics(y_test, y_hat_l2);

The classification metrics are printed:

L2 regularization
==============================================================
                                        Classification metrics
==============================================================
                  Class   Precision  Recall  F1-score  Support

neutral or dissatisfied        0.89    0.93      0.91    17777
              satisfied        0.90    0.84      0.87    13394

              Macro avg        0.90    0.89      0.89    31171
           Weighted avg        0.89    0.89      0.89    31171

               Accuracy                          0.89    31171

Note that with the L2 penalty, our model performance drops from the base case model, with lower accuracy (0.89) and lower average F1-score (0.89). This isn't surprising, given that we didn't find support of overfitting in our model.

Conclusion

In today's blog, we've looked at logistic regression and regularization.

Using a real-world airline passenger satisfaction data application we've:

  1. Performed preliminary data and setup.
  2. Trained logistic regression models with and without regularization.
  3. Made classification predictions.
  4. Interpreted classification metrics.

Further Machine Learning Reading

  1. Predicting Recessions with Machine Learning Techniques
  2. Applications of Principal Components Analysis in Finance
  3. Predicting The Output Gap With Machine Learning Regression Models
  4. Fundamentals of Tuning Machine Learning Hyperparameters
  5. Understanding Cross-Validation
  6. Machine Learning With Real-World Data

Leave a Reply