Introduction
Handling missing values is an important step in data cleaning that can impact model validity and reliability.
Despite this, it can be difficult to find examples and resources on how to deal with missing values. This blog helps fill that void and covers:
- Types of missing values.
- Dealing with missing values.
- Missing values in practice.
Why do missing values occur in data?
Missing values can occur in data for a number of reasons, such as survey non-responses or errors in data entry.
While it may seem that a missing value is a missing value, not all missing data is the same.
Missing data is grouped into three broad categories:
- Missing completely at random.
- Missing at random.
- Missing not at random.
Missing completely at random (MCAR)
Data is missing completely at random if all observations have the same likelihood of being missing.
Some hypothetical examples of MCAR data include:
- Electronic time observations are missing, independent of what lane a swimmer is in.
- A scale is equally likely to produce missing values when placed on a soft surface or a hard surface (van Buuren, 2018).
Missing at random (MAR)
When data is missing at random (MAR), the likelihood that a data point is missing is not related to the missing value itself but may be related to other observed data.
Some hypothetical examples of MAR data include:
- A certain swimming lane is more likely to have missing electronic time observations but the missing data isn't directly related to the actual time.
- A scale produces more missing values when placed on a soft surface than on a hard surface (van Buuren, 2018), independent of the weight being measured.
- Childhood health assessment data is more likely to be missing in lower median income counties.
Missing not at random (MNAR)
When data is missing not at random (MNAR) the likelihood of a missing observation is related to its value. MNAR data can be difficult to identify because the values of the missing data are unobserved, and analyses that ignore the problem can be badly distorted.
Some hypothetical examples of MNAR data include:
- In surveys, people with higher incomes are less likely to report their incomes.
- On a health survey, illicit drug users are less likely to respond to a question about illicit drug use.
- Individuals surveyed about their age are more likely to leave the age question blank when they are older.
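To see why MNAR data is troublesome, consider a minimal, hypothetical GAUSS sketch. The simulated incomes and the 80th-percentile cutoff are illustrative assumptions, not real data: when the highest incomes go unreported, the observed mean understates the true mean.

```
// Hypothetical MNAR sketch: high incomes go missing more often,
// so statistics computed from the observed data are biased.
rndseed 8675309;

// Simulate 10,000 "true" incomes from a log-normal distribution
income = exp(10 + 0.5 * rndn(10000, 1));

// MNAR mechanism (illustrative): incomes above the 80th
// percentile are never reported
cutoff = quantile(income, 0.8);
observed = delif(income, income .> cutoff);

// The observed mean understates the true mean
print "True mean:     " meanc(income);
print "Observed mean: " meanc(observed);
```

No deletion or simple imputation applied to `observed` can recover the true mean here, because the missingness depends on the unobserved values themselves.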
Type of missing value | Description | Examples | Acceptable solutions |
---|---|---|---|
Missing completely at random | All observations have the same likelihood of being missing. | A scale is equally likely to produce missing values on a soft or a hard surface. | Deletion, Imputation |
Missing at random | Likelihood that a data point is missing is not related to the missing data but may be related to other observed data. | A scale produces more missing values on soft surfaces, independent of the weight. | Deletion, Imputation |
Missing not at random | Likelihood of a missing observation is related to its values. | Higher earners are less likely to report their incomes. | Imputation |
Dealing with missing values
How we should deal with missing data depends both on the cause of the missing values and the characteristics of the data set. For example, we cannot deal with missing categorical data in the same manner that we deal with missing time series data.
Deletion
When data is MCAR or MAR, deletion may be a suitable method for dealing with missing values. However, when data is MNAR, deleting missing observations can introduce bias.
In this section we cover three methods of data deletion for missing values:
- Listwise deletion
- Pairwise deletion
- Variable deletion
Methods of deletion for missing values

Method | Description | Advantages | Disadvantages |
---|---|---|---|
Listwise deletion | Delete all observations where the missing values occur. | Easy to implement. | Discards observed data, reducing sample size and statistical power. |
Pairwise deletion | Uses all available data when computing means and covariances. | Retains more data than listwise deletion. | Different statistics are computed from different subsets of observations, and the resulting covariance matrix may not be positive definite. |
Variable deletion | Eliminate a variable from analysis if it is missing a large percentage of observations. | Easy to implement. | Discards all information in the deleted variable. |
Example
Consider a small sample of data from the Nelson-Plosser macroeconomic dataset:
year      gnp     ip     emp   cpi
1906        .    9.8   33749    28
1907        .     10   34371    29
1908        .      .   33246    28
1909    116.8    8.5   35072    28
1910    120.1     10   35762    28
Listwise deletion
- Uses only the observations from 1909 and 1910 for all parts of analysis.
- Eliminates all data from 1906-1908 because of the missing values in `gnp` and `ip`.
Pairwise deletion
- Uses the observations from 1906-1910 when computing the means and covariances of `emp` and `cpi`.
- Uses the observations from 1906-1907 and 1909-1910 when computing the means and covariances of `ip`.
- Uses the observations from 1909-1910 when computing the means and covariances of `gnp`.
- Uses the observations from 1909-1910 for model estimation other than means and covariances.
Variable deletion
- Uses only `emp` and `cpi` for all parts of analysis.
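The deletion methods above can be sketched directly in GAUSS on this sample. This is a minimal sketch: `packr` performs listwise deletion, and variable deletion is just column selection.

```
// The Nelson-Plosser sample above; '.' is the GAUSS missing value
// Columns: year, gnp, ip, emp, cpi
x = { 1906     .    9.8  33749  28,
      1907     .   10    34371  29,
      1908     .    .    33246  28,
      1909  116.8   8.5  35072  28,
      1910  120.1  10    35762  28 };

// Listwise deletion: packr keeps only the complete rows (1909, 1910)
x_listwise = packr(x);

// Variable deletion: keep year, emp and cpi; drop gnp and ip
x_vardel = x[., 1 4 5];
```

Pairwise deletion has no single one-line equivalent, since each statistic is computed from the rows that are complete for the variables involved.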
Imputation
Imputing data replaces missing values with statistically determined values. Methods of imputation can vary from simply replacing missing values with the mean to sophisticated multiple imputation processes.
Which method of imputation should be used depends on the characteristics of the data.
Examples of imputation methods

Method | Description | Advantages | Disadvantages |
---|---|---|---|
Replacement with mean, median, mode | All missing values are replaced with the variable mean, median or mode. | Easy to implement. | Shrinks the variance and can distort relationships between variables. |
Linear regression | Missing values are predicted using a linear model and the other variables in the dataset. | Uses the information in the other variables. | Can overstate the strength of relationships; assumes a linear model is appropriate. |
Last observation carried forward (LOCF) | Each missing value is replaced with the most recent observed value. | Easy to implement; natural for time series data. | Can introduce bias when the series is trending. |
Predictive mean matching | Replacements for missing values are drawn randomly from a group of nearby candidate values. | Imputed values are realistic because they are drawn from observed data. | More computationally intensive than simple replacement. |
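As a concrete example, mean replacement can be done by hand with `packr`, `meanc`, and `missrv`. This is a minimal sketch on a toy vector:

```
// Toy vector with two missing values
x = { 2, 4, ., 6, . };

// Mean of the observed values: (2 + 4 + 6) / 3 = 4
mu = meanc(packr(x));

// Replace each missing value with the mean
x_imputed = missrv(x, mu);
print x_imputed;
```

Every missing element becomes 4, which is exactly why this method shrinks the variance: the imputed points add no spread of their own.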
Imputing time series data
Not all imputation methods are appropriate for time series data. Time series data may contain trends or seasonality, both of which should be addressed when imputing missing data.
Common methods for imputing time series data include:
- Carrying the last observation forward (LOCF) or carrying the next observation backward (NOCB).
- Linear or spline interpolation.
- Kalman or moving average smoothing.
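For instance, LOCF can be written as a short GAUSS procedure. This is a hypothetical sketch; the `locf` name and loop are ours for illustration, not a built-in:

```
// Hypothetical LOCF sketch: replace each missing value with the
// most recent observed value. A missing first observation is
// left unchanged because there is nothing to carry forward.
proc (1) = locf(x);
    local i;
    for i (2, rows(x), 1);
        if ismiss(x[i]);
            x[i] = x[i-1];
        endif;
    endfor;
    retp(x);
endp;

y = { 1.2, ., ., 1.5, 1.7, . };
print locf(y);
```

NOCB is the mirror image: loop from the end of the series and carry the next observed value backward.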
Missing values in practice
Now that we have some background on missing values in data, let's consider some of the practicalities of dealing with missing values.
In particular, we will look more closely at:
- Identifying missing values.
- Deleting missing values.
- Replacing missing values.
Throughout this section, we will use a sample dataset, `yellowstone_gaps.csv`.
Identifying missing values
Interactively viewing missing values
When loading data, missing values are easy to identify in the data preview, where they show up as `.` within grey cells.

Once loaded, the missing values still appear as `.` in the data editor.
Checking for missing values in a matrix or dataframe
Now suppose we didn't see any missing values right away when previewing our data but want to be certain. The procedure `ismiss` tests whether a matrix or dataframe contains any missing values:
// Load dataset
data = loadd("yellowstone_gaps.csv");
// Check for any missing values
ismiss(data);
This prints:
1.0000000
which indicates that our data does contain missing values.
Counting missing values by variables
This is a great start, but now we need to know more about where our missing values are located. Let's use `dstatmt` to count missing values by variable name:
dstatmt(data);
This prints a detailed descriptive statistics report to screen:
------------------------------------------------------------------------------------------
Variable             Mean     Std Dev    Variance     Minimum     Maximum  Valid  Missing
------------------------------------------------------------------------------------------
Date                -----       -----       -----  1986/01/01  2016/12/01    372        0
Visits             255660      291753   8.512e+10        6261   9.959e+05    372        0
LowTemp             3.595       20.33       413.2         -46          38    368        4
HighTemp            67.92       17.75       315.1          34          98    367        5
TotalPrecip         1.334       0.798      0.6368           0        5.78    368        4
TotalSnowfall       7.496       8.828       77.94           0        40.7    366        6
CPI                0.2116      0.3272      0.1071        -1.9         1.2    372        0
MCSI                87.34       11.94       142.5        55.3         112    372        0
Unemployment        6.039       1.468       2.155         3.8          10    371        1
The final column of our report shows that `LowTemp`, `HighTemp`, `TotalPrecip`, `TotalSnowfall` and `Unemployment` all have missing values. The most missing values occur for `TotalSnowfall`.
Identifying the location of missing values
Let's take a closer look at our `TotalSnowfall` variable to identify when our missing values occur.

The `indexcat` function finds the indices of the elements of a vector that fall into a specified range or are equal to a specified value. It requires two inputs: a data vector and the value or range to search for:
// Find indices of missing values
missing = { . };
indx = indexcat(data[., "TotalSnowfall"], missing);
// Print dates associated with missing values
data[indx, "Date"];
This prints the dates of the missing `TotalSnowfall` observations to screen:

Date
1986/07/01
1986/09/01
1998/07/01
1998/10/01
2005/12/01
2009/12/01
Note: the `error` function creates scalar error codes, which are equivalent to missing values.

Checking rows for missing values
Suppose that instead of finding which dates contain missing values, we want to check whether some specific dates have missing values. Combining the information in our dataframe with the procedure `rowcontains` makes this relatively easy.
// First create indicator of rows with missing values
missing = { . };
row_missing = rowcontains(data[., 2:9], missing);
// Concatenate with date vector
row_missing = data[., "Date"] ~ row_missing;
// Print first ten observations
print row_missing[1:10, .];
Our results tell us whether each date has any missing data. A 0 indicates no missing values and a 1 indicates at least one missing value:

1986/01/01        0.0000000
1986/02/01        0.0000000
1986/03/01        0.0000000
1986/04/01        0.0000000
1986/05/01        0.0000000
1986/06/01        0.0000000
1986/07/01        1.0000000
1986/08/01        0.0000000
1986/09/01        1.0000000
1986/10/01        0.0000000
Note: the `miss` procedure can be used to replace specified values with the GAUSS missing value code. For example, if -999 was coded to represent missing observations in the data matrix `X`, we can use:

X = miss(X, -999);

to replace all occurrences of -999 with a missing value.
Deleting missing values
Now that we've seen how to locate missing values in our data, let's look more closely at methods for deleting them.

We will start with the simple procedure `packr`, which removes any rows with missing values:
// Count rows before removing missing values
"Rows before removing missing data:";
rows(data);
// Remove missing values
data_nomiss = packr(data);
// Count rows after removing missing values
"Rows after removing missing values:";
rows(data_nomiss);
Our full dataset has 372 observations before removing missing values. After using `packr` to remove any rows with missing values, we are left with 364 observations:

Rows before removing missing data:
   372.00000
Rows after removing missing values:
   364.00000
This is a convenient method to remove all missing values from a dataset. However, if we want to remove data more discerningly, we can use the interactive data filtering tool. With the interactive data filtering tool, we can filter out missing values by variable.
For example, suppose we only want to drop observations that are missing `TotalSnowfall`. To do this we can:
- Open the `data` dataframe in the Data Editor.
- Select `TotalSnowfall` from the Variable name drop-down list on the Filter tab.
- Select Is Not Missing from the Operation drop-down list.
- Click the `+` button to add the filter.
- Click Apply to remove the missing values from `data`.
When we click Apply, GAUSS auto-generates the code used to remove all missing values of `TotalSnowfall` from the `data` dataframe:
data = selif(data, data[., "TotalSnowfall"] .!= error(0));
As we can see, the GAUSS procedure `selif` is a convenient programmatic tool for selecting data that is not missing.
Alternatively, we could use the GAUSS procedure `delif` to delete all observations where `TotalSnowfall` is missing:
data = delif(data, data[., "TotalSnowfall"] .== error(0));
Replacing missing values
As we discussed earlier, deleting missing values isn't always the most appropriate choice. GAUSS has two functions that can be used to replace missing values:
The `missrv` function replaces all missing values in a matrix with a user-specified value (or values) and is the opposite of the `miss` function.
For example, suppose we want to replace all missing values, `.`, with -999:
data_replaced = missrv(data, -999);
This may be useful for creating consistent coding of missing data. However, it isn't very useful for setting our data up for modeling.
The `impute` procedure allows us to replace missing data in a manner more appropriate for modeling. It supports a number of imputation methods, including:
Method | Description |
---|---|
`"mean"` | Replace missing values with the mean of the column (default). |
`"median"` | Replace missing values with the median of the column. |
`"mode"` | Replace missing values with the mode of the column. |
`"pmm"` | Replace missing values using predictive mean matching. |
`"lrd"` | Replace missing values using local residual draws. |
`"predict"` | Replace missing values using linear regression prediction. |
The `pmm`, `lrd`, and `predict` methods are more sophisticated imputation methods that we won't cover here in detail. However, it is worth noting that `impute` provides a number of controls for fine-tuning them.

Let's take the simple approach of replacing all missing values of the `TotalSnowfall` variable with its median value:
// Replace missing values of 'TotalSnowfall'
// with median value of 'TotalSnowfall'
data[., "TotalSnowfall"] = impute(data[., "TotalSnowfall"], "median");
// Descriptive statistics
dstatmt(data[., "TotalSnowfall"]);
------------------------------------------------------------------------------------------
Variable             Mean     Std Dev    Variance    Minimum    Maximum  Valid  Missing
------------------------------------------------------------------------------------------
TotalSnowfall       7.448       8.764       76.81          0       40.7    372        0
There are a few things to note from our descriptive statistics for `TotalSnowfall`:
- There are no longer any missing observations of `TotalSnowfall`.
- The variance has decreased from 77.94 to 76.81.
- The standard deviation has decreased from 8.828 to 8.764.
- The mean has fallen from 7.496 to 7.448.
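The shrinkage in dispersion is easy to reproduce on a toy vector. A minimal sketch using the same `impute` procedure:

```
// Toy vector: observed values are 1, 3, 7, 9 with two gaps
x = { 1, 3, ., 7, ., 9 };

// Standard deviation of the observed values only
print stdc(packr(x));

// Median imputation fills both gaps with 5 (the median of the
// observed values), pulling the standard deviation down
print stdc(impute(x, "median"));
```

The imputed points sit exactly at the center of the distribution, so they add observations without adding spread. This is the price of simple imputation, and one reason methods like predictive mean matching, which draw from realistic observed values, are often preferred for modeling.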
Conclusion
Missing values are a fact of life in real-world data, and dealing with them can be time-consuming and error-prone. This blog covered the types of missing data, common strategies for handling them, and practical tools for identifying, deleting, and replacing missing values.
Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.