Introduction to Handling Missing Values

Introduction

Handling missing values is an important step in data cleaning that can impact model validity and reliability.

Despite this, it can be difficult to find examples and resources about how to deal with missing values. This blog helps to fill that void and covers:

  • Types of missing values.
  • Dealing with missing values.
  • Missing values in practice.

Why do missing values occur in data?

Missing values can occur in data for a number of reasons, such as survey non-responses or errors in data entry.

While it may seem that a missing value is a missing value, not all missing data is the same.

Missing data is grouped into three broad categories:

  • Missing completely at random.
  • Missing at random.
  • Missing not at random.

Missing completely at random (MCAR)

Data is missing completely at random if all observations have the same likelihood of being missing.

Some hypothetical examples of MCAR data include:

  • Electronic time observations are missing, independent of what lane a swimmer is in.
  • A scale is equally likely to produce missing values when placed on a soft surface or a hard surface (Van Buren, 2018).

Missing at random (MAR)

When data is missing at random (MAR) the likelihood that a data point is missing is not related to the missing data but may be related to other observed data.

Some hypothetical examples of MAR data include:

  • A certain swimming lane is more likely to have missing electronic time observations but the missing data isn't directly related to the actual time.
  • A scale produces more missing values when placed on a soft surface than a hard surface ( Van Buren, 2018), independent of the weight.
  • Childhood health assessment data is more likely to be missing in lower median income counties.

Missing not at random (MNAR)

When data is missing not at random (MNAR) the likelihood of a missing observation is related to its values. It can be difficult to identify MNAR data because the values of missing data are unobserved. This can result in distorted data.

Some hypothetical examples of MNAR data include:

  • When surveyed people with more income are less likely to report their incomes.
  • On a health survey, illicit drug users are less likely to respond to a question about illicit drug use.
  • Individuals surveyed about their age are more likely to leave the age question blank when they are older.
Type of missing valueDescriptionExamplesAcceptable solutions
Missing completely at randomAll observations have the same likelihood of being missing.
  • Electronic time observations are missing, independent of what lane a swimmer is in.
  • A scale is equally likely to produce missing values when placed on a soft surface or a hard surface ( Van Buren, 2018).
  • Geographical location data is eqaully likely to be missing for all locations.
Deletion, Imputation
Missing at randomLikelihood that a data point is missing is not related to the missing data but may be related to other observed data.
  • A certain swimming lane is more likely to have missing electronic time observations.
  • A scale produces more missing values when placed on a soft surface than a hard surface ( Van Buren, 2018).
  • Childhood health assessment data is more likely to be missing in lower median income counties.
Deletion, Imputation
Missing not at randomLikelihood of a missing observation is related to its values.
  • When surveyed people with more income are less likely to report their income.
  • On a health survey illicit drug users are less likely to respond to a question about illicit drug use.
  • Individuals surveyed about their age are more likely to leave the age question blank when they are older.
Imputation

Dealing with missing values

How we should deal with missing data depends both on the cause of the missing values and the characteristics of the data set. For example, we cannot deal with missing categorical data in the same manner that we deal with missing time series data.

Deletion

When data is MCAR and MAR deletion may be a suitable method for dealing with missing values. However, when data is MNAR, deletion of missing observations can lead to bias.

In this section we cover three methods of data deletion for missing values:

  • Listwise deletion
  • Pairwise deletion
  • Variable deletion

Methods of deletion for missing values

MethodDescriptionAdvantagesDisadvantages
Listwise deletionDelete all observations where the missing values occur.Easy to implement.
  • Can result in biased estimates if the missing values are not MCAR.
  • Wastes useful information.
  • Can disrupt time series analysis by creating gaps in dates used for analysis.
Pairwise deletionUses all available data when computing means and covariances.
  • Simple to implement.
  • Uses all available information.
  • Can result in biased estimates if the missing values are not MCAR.
  • Results in different sample sizes being used for different computations.
  • Requires that data follow a normal distribution.
Variable DeletionEliminate a variable from analysis if it is missing a large percentage of observations.
  • Easy to implement.
  • Significant loss in information.
  • May result in missing variable bias.

Example

Consider a small sample of data from the Nelson-Plosser macroeconomic dataset:

year       gnp       ip       emp       cpi
1906 . 9.8 33749 28 1907 . 10 34371 29 1908 . . 33246 28 1909 116.8 8.5 35072 28 1910 120.1 10 35762 28

Listwise deletion

  • Uses only the observations from 1909 and 1910 for all parts of analysis.
  • It eliminates the all data in 1906-1908 because of the missing values in gnp and ip.

Pairwise deletion:

  • Uses the observations in 1906-1910 when computing the means and covariances of emp and cpi.
  • Uses the observations in 1906-1907 and 1909-1910 when computing the means and covariances of ip.
  • Uses the observations in 1909-1910 when computing the means and covariances of gnp.
  • Uses the observations in 1909-1910 for model estimation other than means and covariances.

Variable deletion

  • Uses only emp and cpi for all parts of analysis.

Imputation

Imputing data replaces missing values with statistically determined values. Methods of imputation can vary from simply replacing missing values with the mean to sophisticated multiple imputation processes.

Which method of imputation should be used depends on the characteristics of the data.

Examples of imputation methods

MethodDescriptionAdvantagesDisadvantages
Replacement with mean, median, modeAll missing values are replaced with the variable mean, median or mode.Easy to implement.
  • Distorts the distribution of the data.
  • Reduces data variance.
  • Results in biased estimates.
Linear RegressionMissing values are predicted using a linear model and the other variables in the dataset.
  • Simple to implement.
  • Uses all available information.
  • Biased correlations between variables.
  • Underestimated variability.
  • Falsely strengthens relationship between variables.
Last observation carried forward (LOCF)
  • Used for time series data.
  • Use last observed data value as a replacement for missing data
  • Appropriate for time series data.
  • Easy implementation.
  • Can results in biased estimates even when data is MCAR.
  • Modeling techniques should address that data has been imputed by LOCF.
  • May incorrectly suggest stability across time stretches if used to fill successions of missing data.
Predictive mean matchingReplacements for missing values are drawn randomly from a group of nearby candidate values.
  • Appropriate for time series data.
  • Easy to use and versatile.
  • Robust to data transformations.
  • Valid for discrete data.
  • Will always produce replacements within the observed data range.
  • May duplicate values, especially if sample is small.
  • Not suited for small samples, skewed data, or sparse data.
  • Cannot be used to extrapolate beyond range of the data.

Imputing time series data
Not all imputation methods are appropriate for time series data. Time series data may contain time trends or seasonality, all of which should be addressed when imputing missing data.

Common methods for imputing time series data include:

Missing values in practice

Now that we have some background on missing values in data, let's consider some of the practicalities of dealing with missing values.

In particular, we will look more closely at:

  • Identifying missing values.
  • Deleting missing values.
  • Replacing missing values.

Throughout this section, we will use a sample dataset yellowstone_gaps.csv.

Identifying missing values

Interactively viewing missing values
When loading data, missing values are easy to identify in the data preview and show up as . within grey cells:

Once loaded, the missing values still appear as . in the data editor.

Checking for missing values in a matrix or dataframe
Now suppose we didn't see any missing values right away when previewing our data but want to be certain. The procedure ismiss tests if a matrix or dataframe contains any missing values:

// Load dataset
data = loadd("yellowstone_gaps.csv");

// Check for any missing values
ismiss(data);

This prints:

       1.0000000 

Which indicates that our data does contain missing values.

Counting missing values by variables
This is a great start but now we need to know more about where our missing values are located. Let's use dstatmt to count missing values by variable name:

dstatmt(data);

This prints a detailed descriptive statistics report to screen:

------------------------------------------------------------------------------------
Variable          Mean   Std Dev    Variance     Minimum     Maximum   Valid Missing
------------------------------------------------------------------------------------
Date             -----     -----       -----  1986/01/01  2016/12/01     372    0
Visits          255660    291753   8.512e+10        6261   9.959e+05     372    0
LowtTemp         3.595     20.33       413.2         -46          38     368    4
HighTemp         67.92     17.75       315.1          34          98     367    5
TotalPrecip      1.334     0.798      0.6368           0        5.78     368    4
TotalSnowfall    7.496     8.828       77.94           0        40.7     366    6
CPI             0.2116    0.3272      0.1071        -1.9         1.2     372    0
MCSI             87.34     11.94       142.5        55.3         112     372    0
Unemployment     6.039     1.468       2.155         3.8          10     371    1 

The final column of our report shows that LowTemp, HighTemp, TotalPrecip, TotalSnowfall and Unemployment all have missing values. The most missing values occur for TotalSnowfall.

Identifying the location of missing values
Let's take a closer look at our TotalSnowfall variable to identify when our missing values occur.

The indexcat function finds the indices of the elements of a vector that fall into a specified range or are equal to a specified value. It requires two inputs, a data vector, and the value or range to search for:

// Find indices of missing values
missing = { . };
indx = indexcat(data[., "TotalSnowfall"], missing);

// Print dates associated with missing values
data[indx, "Date"];

This prints the dates of the missing TotalSnowfall to screen:m

            Date
      1986/07/01
      1986/09/01
      1998/07/01
      1998/10/01
      2005/12/01
      2009/12/01 

Checking rows for missing values Suppose that instead of finding which dates contain missing values we want to see if some specific dates have missing values. Combining the information in our dataframe with the procedure rowcontains makes this relatively easy.

// First create indicator of rows with missing values
missing = { . };
row_missing = rowcontains(data[., 2:9], missing);

// Concatenate with date vector
row_missing = data[., "Date"] ~ row_missing;

// Print first ten observations
print row_missing[1:10, .]; 

Our results tell us if each date has any missing data. A 0 indicates no missing values and 1 indicates there is at least one missing value:

      1986/01/01        0.0000000
      1986/02/01        0.0000000
      1986/03/01        0.0000000
      1986/04/01        0.0000000
      1986/05/01        0.0000000
      1986/06/01        0.0000000
      1986/07/01        1.0000000
      1986/08/01        0.0000000
      1986/09/01        1.0000000
      1986/10/01        0.0000000 

Deleting missing values

Now that we've seen how to locate missing values in our data, let's look more closely at methods for deleting missing values from our data.

We will start with the simple procedure, packr. The packr procedure removes any rows with missing values:

// Count rows before removing missing values
"Rows before removing missing data:";
rows(data);

// Remove missing values
data_nomiss = packr(data);

// Count rows after removing missing values
"Row after removing missing values:";
rows(data_nomiss);

Our full dataset has 372 observations before removing missing values. After using packr to remove any rows with missing values, we are left with 364 observations:

Rows before removing missing data:
       372.00000

Row after removing missing values:
       364.00000 

This is a convenient method to remove all missing values from a dataset. However, if we want to remove data more discerningly, we can use the interactive data filtering tool. With the interactive data filtering tool, we can filter out missing values by variable.

For example, suppose we only want to drop observations that are missing TotalSnowFall. To do this we can:

  1. Open the data dataframe in the Data Editor.
  2. Select TotalSnowFall from the Variable name drop-down list on the Filter tab.
  3. Select Is Not Missing from the Operation drop-down list.
  4. Click the + button to add the filter.
  5. Click Apply to remove missing values from Data.

When we click Apply, GAUSS auto-generates the code used to remove all missing values of TotalSnowFall from the Data matrix:

data = selif(data, data[., "TotalSnowfall"] .!= error(0));

We can see that the GAUSS procedure, selif is a convenient programmatic tool for selecting data that is not missing.

Alternatively, we could use the GAUSS procedure delif to delete all observations where TotalSnowFall is missing:

data = delif(data, data[., "TotalSnowfall"] .== error(0));

Replacing missing values

As we discussed earlier, deleting missing values isn't always the most appropriate choice. GAUSS has two functions that can be used to replace missing values:

  1. The missrv function.
  2. The impute function.

The missrv function replaces all missing values in a matrix with a user-specified value(s) and is the opposite of the miss function.

For example, suppose we want to replace all missing values . with -999:

data_replaced = missrv(data, -999);

This may be useful for creating consistent coding of missing data. However, it isn't very useful for setting our data up for modeling.

The impute procedure allows us to replace data in a manner more appropriate for modeling. It is compatible with a number of imputation methods including:

Method Description
"mean" Replace missing values with the mean of the column (default).
"median" Replace missing values with the median of the column.
"mode" Replace missing values with the mode of the column.
"pmm" Replace missing values using predictive mean matching.
"lrd" Replace missing values using local residual draws.
"predict" Replace missing values using linear regression prediction.

Let's take the simple approach of replacing all missing values of the TotalSnowFall variable with its median value

// Replace missing values of 'TotalSnowfall'
// with median value of 'TotalSnowfall'
data[., "TotalSnowfall"] = impute(data[., "TotalSnowfall"], "median");

// Descriptive statistics 
dstatmt(data[., "TotalSnowfall");
--------------------------------------------------------------------------------
Variable         Mean   Std Dev   Variance   Minimum    Maximum   Valid  Missing
--------------------------------------------------------------------------------
TotalSnowfall   7.448     8.764      76.81         0       40.7       372     0

There are a few things to note from our descriptive statistics for TotalSnowfall:

  1. There are no longer any missing observations of TotalSnowfall.
  2. The variance has decreased from 77.94 to 76.81.
  3. The standard deviation has decreased from 8.828 to 8.764.
  4. The mean has risen from 7.496 to 7.448.

Conclusion

Missing values are a fact of life in real-world data. Dealing with missing values can be time-consuming and error-prone. This blog helps with both and provides tools for handling missing values in practice.

Leave a Reply