Introduction
Principal components analysis (PCA) is a useful tool that can help practitioners streamline data while retaining most of the information it contains. In today's blog, we'll examine the use of principal components analysis in finance using an empirical example.
Specifically, we’ll look more closely at:
- What PCA is.
- How PCA works.
- How to use the GAUSS Machine Learning library to perform PCA.
- How to interpret PCA results.
What is Principal Components Analysis?
Principal components analysis (PCA) is an unsupervised learning method that produces a low-dimensional representation of a dataset. The intuition behind PCA is that the most important information can be drawn from the features by eliminating redundancy and noise. The resulting dataset captures the components of the data that explain the most variance.
| PCA Snapshot |
|---|
| Uses linear transformations to capture the most important characteristics of a set of features. |
| Uses variance of the features to distinguish relevant features from pure noise. |
| Identifies and removes redundancy in features. |
How Do We Find Principal Components?
Principal components are found by identifying the normalized, linear combination of features
$$Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \ldots + \phi_{p1}X_p$$
which has the largest variance.
The coefficients $\phi_{11}, \phi_{21}, \ldots, \phi_{p1}$ are referred to as the loadings and are restricted such that their sum of squares is equal to one.
To compute the first principal component we:
- Center our feature data to have a mean of zero.
- Find the loadings that give $Z_1$ the largest sample variance, subject to the constraint that $\sum_{j=1}^p \phi_{j1}^2 = 1$.
Once the first principal component is found, we can find a second principal component, $Z_2$, which is constrained to be uncorrelated with $Z_1$.
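To make these steps concrete, here is a minimal sketch of computing the first principal component by hand from the eigendecomposition of the sample covariance matrix. This is purely illustrative: the data and variable names are made up, and it is not meant to reflect how pcaFit is implemented internally.
/*
** Illustrative sketch: first principal component by hand
*/
// Placeholder feature matrix: 100 observations, 4 features
x_demo = rndn(100, 4);

// Center each feature to have a mean of zero
x_c = x_demo - meanc(x_demo)';

// Sample covariance matrix of the centered features
s = x_c'x_c / (rows(x_c) - 1);

// Eigenvalues and eigenvectors of the covariance matrix
{ va, ve } = eigv(s);

// The eigenvector paired with the largest eigenvalue holds the
// loadings of the first component; its squared loadings sum to one
idx = sortind(-va);
phi_1 = ve[., idx[1]];

// Scores of the first principal component, Z_1
z_1 = x_c * phi_1;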
When Should You Use PCA?
The most common use of PCA is to reduce the size of a feature set without losing too much information. The reduced feature set can then be used in a second stage of modeling. However, this is not the only use of PCA, and there are a number of insightful ways PCA can be applied.
Real World Applications of PCA

| Application | Description |
|---|---|
| Reducing the size of images. | PCA can be used to reduce the size of an image without significantly impacting the quality. Beyond just reducing the size, this is useful for image classification algorithms. |
| Visualizing multidimensional data. | PCA allows us to represent the information contained in multidimensional data in reduced dimensions which are more compatible with visualization. |
| Finding patterns in high-dimensional datasets. | Examining the relationships between principal components and original features can help uncover patterns in the data that are harder to identify in our full dataset. |
| Stock price prediction in finance. | Many models of stock price prediction rely on estimating covariance matrices. However, this can be difficult with high-dimensional data. PCA can be used for data reduction to help remedy this issue. |
| Dataset reduction in healthcare models. | Healthcare models use high-dimensional datasets because there are many factors that influence healthcare outcomes. PCA provides a method to reduce the dimensionality while still capturing the relevant variance. |
Empirical Example
Let's take a look at principal components analysis in action! We'll start by extending the PCA application to U.S. Treasury bills and bonds from Introductory Econometrics for Finance by Chris Brooks.
In our example we will:
- Update the dataset to use current data.
- Use the pcaFit and pcaTransform functions available in the GAUSS Machine Learning library (GML).
Loading FRED Data
Our initial dataset includes 6 variables capturing short-term and long-term yields on U.S. bonds and bills.
Variable | Description |
---|---|
GS3M | Market yield on 3 month US Treasury bill. |
GS6M | Market yield on 6 month US Treasury bill. |
GS1 | Market yield on 1 year US Treasury bond. |
GS3 | Market yield on 3 year US Treasury bond. |
GS5 | Market yield on 5 year US Treasury bond. |
GS10 | Market yield on 10 year US Treasury bond. |
This data can be directly imported into GAUSS from the FRED database.
/*
** Import U.S. bond and bill data
** directly from FRED
*/
// Set the observation_start and observation_end parameters
// to use data from 1990-01-01 through 2023-03-01
params = fred_set("observation_start", "1990-01-01", "observation_end", "2023-03-01");
// Load data from FRED
data = fred_load("GS3M + GS6M + GS1 + GS3 + GS5 + GS10", params);
// Reorder data to match the organization in original example
data = order(data, "date"$|"GS3M"$|"GS6M"$|"GS1"$|"GS3"$|"GS5"$|"GS10");
// Preview the first 5 rows
head(data);
The data preview printed to the Command Window helps verify that our data has loaded correctly:
            date    GS3M    GS6M     GS1     GS3     GS5    GS10
      1990-01-01    7.90    7.96    7.92    8.13    8.12    8.21
      1990-02-01    8.00    8.12    8.11    8.39    8.42    8.47
      1990-03-01    8.17    8.28    8.35    8.63    8.60    8.59
      1990-04-01    8.04    8.27    8.40    8.78    8.77    8.79
      1990-05-01    8.01    8.19    8.32    8.69    8.74    8.76
Normalizing Yields
Following Brooks' example, we will normalize the yields to have a mean of zero and a standard deviation of one using the rescale procedure.
/*
** Normalizing the yield
*/
// Create a dataframe that contains
// the yields, but not the 'Date' variable
yields = delcols(data, "date");
// Standardize the yields using rescale
{ yields_norm, location, scale_factor } = rescale(yields, "standardize");
head(yields_norm);
This prints a preview of our normalized yields:
           GS3M         GS6M          GS1          GS3          GS5         GS10
      2.3153725    2.2469720    2.1773318    2.0802078    2.0025703    1.9626705
      2.3591880    2.3159905    2.2593350    2.1936833    2.1395985    2.0916968
      2.4336745    2.3850090    2.3629181    2.2984298    2.2218155    2.1512474
      2.3767142    2.3806953    2.3844979    2.3638964    2.2994648    2.2504985
      2.3635696    2.3461861    2.3499702    2.3246164    2.2857620    2.2356108
Fitting the PCA Model
Next, we will use the pcaFit procedure from GML to fit our principal components analysis model. The pcaFit procedure requires two inputs: a data matrix and the number of components to compute.
struct pcaModel mdl;
mdl = pcaFit(x, n_components);
- x: $N \times P$ matrix, feature data to be reduced.
- n_components: Scalar, the number of components to compute.
The pcaFit procedure stores all output in a pcaModel structure. The most relevant members of the pcaModel structure include:
- mdl.singular_values: $n_{components} \times 1$ vector, the largest singular values of x, equal to the square roots of the eigenvalues.
- mdl.components: $P \times n_{components}$ matrix, the principal component vectors, which represent the directions of greatest variance. Also known as the factor loadings.
- mdl.explained_variance_ratio: $n_{components} \times 1$ vector, the proportion of total variance explained by each of the returned component vectors.
/*
** Perform PCA on normalized yields
*/
// Specify number of components
n_components = 6;
// `pcaModel` structure for holding
// output from model
struct pcaModel mdl;
mdl = pcaFit(yields_norm, n_components);
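After the call to pcaFit, the structure members described above can be inspected directly. For example:
// Inspect key members of the fitted pcaModel structure
print "Singular values:";
print mdl.singular_values;

print "Factor loadings:";
print mdl.components;

print "Proportion of variance explained:";
print mdl.explained_variance_ratio;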
Dissecting Results
After running the pcaFit procedure, results are printed to the Command Window. These results include:
- A general summary of the model.
- The proportion of variance explained by each component.
- The loadings for all variables in each component.
General Summary
The general summary provides basic information about the model setup, including the number of variables in the original data and the number of components found.
==================================================
Model:                                         PCA
Number observations:                           399
Number variables:                                6
Number components:                               6
==================================================
Proportion of Variance
The proportion of variance table tells us how much of the total variance in the data is described by each principal component.
Component     Proportion     Cumulative
             Of Variance     Proportion
PC1                0.960          0.960
PC2                0.038          0.997
PC3                0.002          1.000
PC4                0.000          1.000
PC5                0.000          1.000
PC6                0.000          1.000
For the Treasury bills and bonds yields, the first component captures 96.0% of the total variance, while the first three components explain nearly all of the total variance. If our goal was data reduction for use in a later model, this is quite promising. We could capture 96% of the variance of all 6 of our original variables using just the first principal component.
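These proportions can also be recovered programmatically from the explained_variance_ratio member of the model structure. As a quick sketch, the cumulative column of the table above is just the cumulative sum of the individual ratios:
// Proportion of variance explained by each component
prop = mdl.explained_variance_ratio;

// Individual and cumulative proportions, matching the table above
print prop~cumsumc(prop);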
The Factor Loadings
===========================================================================
                           Principal components
            PC1        PC2        PC3        PC4        PC5        PC6
===========================================================================
GS3M    -0.4079     0.4111     0.4863    -0.5416     0.3029     0.2076
GS6M    -0.4094     0.3883     0.1535     0.2221    -0.5448    -0.5585
GS1     -0.4122     0.2970    -0.2404     0.6120     0.1557     0.5342
GS3     -0.4154    -0.0855    -0.5911    -0.1926     0.4744    -0.4567
GS5     -0.4102    -0.3607    -0.2806    -0.3932    -0.5725     0.3750
GS10    -0.3939    -0.6742     0.5040     0.3020     0.1856    -0.1024
The factor loadings indicate how much each variable contributes to each component. As noted in the Brooks example, they also offer some insight into the yield curve:

| Component | Interpretation |
|---|---|
| PC1 | Loadings are roughly equal across all maturities, so the first component captures the overall level of interest rates. |
| PC2 | Loadings fall monotonically from positive at the short end to negative at the long end, so the second component captures the slope of the yield curve. |
| PC3 | Loadings are positive at the short and long ends and negative for intermediate maturities, so the third component captures the curvature of the yield curve. |
Transforming Original Data
After fitting the PCA model, we can use the results to transform our original data into its principal components using the pcaTransform procedure.
// Transform original data
x_trans = pcaTransform(yields_norm, mdl);
Since the first three components capture most of the variation in our data, let's look at them in a plot:
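A minimal sketch of how such a plot could be produced with GAUSS's plotting functions is shown below. The plotControl setup and the use of the date column from the FRED dataframe are illustrative assumptions, not part of the original example.
/*
** Sketch: plot the first three principal components over time
*/
struct plotControl plt;
plt = plotGetDefaults("xy");

// Add a descriptive title
plotSetTitle(&plt, "First three principal components");

// Plot PC1, PC2, and PC3 against the observation dates
plotXY(plt, data[., "date"], x_trans[., 1:3]);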
If you're familiar with U.S. interest rates, this plot likely seems to contradict what we observe in the real world. As we said earlier, the first principal component represents the overall level of interest rates. However, our plot of the first principal component shows an overall upward trend through 2022, followed by a sharp downtick, which is exactly opposite the overall trend in U.S. interest rates.
This highlights an important feature of PCA: the signs of the factor loadings are arbitrary. They can all be flipped without any change to our analysis. For example, if we multiply all of our factor loadings by -1, our principal components look like:
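Because each component score is a linear combination of the features and their loadings, multiplying every loading by -1 simply negates every score. A small sketch of the sign flip, reusing the illustrative plot setup from above:
// Flipping the sign of all loadings flips the sign of all scores
x_trans_flipped = -1 * x_trans;

// Re-plot the first three components with flipped signs
plotXY(plt, data[., "date"], x_trans_flipped[., 1:3]);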
Conclusion
In today's blog, we've seen that PCA is a powerful data analysis tool with uses beyond data reduction. We've also explored how to use the GAUSS Machine Learning library to fit a PCA model and transform data.
Further Reading
- Predicting Recessions with Machine Learning Techniques
- Predicting The Output Gap With Machine Learning Regression Models
- Fundamentals of Tuning Machine Learning Hyperparameters
- Understanding Cross-Validation
- Machine Learning With Real-World Data
- Classification with Regularized Logistic Regression
Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.