Introduction
Classical linear regression estimates the conditional mean of the dependent variable given the independent variables. In many cases, such as skewed data, multimodal data, or data with outliers, the conditional mean alone fails to fully capture the patterns in the data.
In these cases, quantile regression provides a useful alternative to linear regression that:
- Can be used to study the distributional relationships of variables.
- Can help detect heteroscedasticity.
- Is useful for dealing with censored variables.
- Is more robust to outliers.
Today we will use quantile regression to analyze Major League Baseball Salary data at the 10%, 25%, 50%, 75%, and 90% quantiles. We will consider the model
$$ \ln(salary) = \beta_0 + \beta_1 AtBat + \beta_2 Hits + \beta_3 HmRun + \beta_4 Walks\\ + \beta_5 Years + \beta_6 PutOuts $$
The intuition of quantile regression
To understand the intuition of quantile regression, let's start with the intuition of ordinary least squares. Given the model
$$ y_i = \beta'X_i + \epsilon_i ,$$
the least squares estimate minimizes the sum of the squared error terms
$$ \sum_{i=1}^N (y_i - \hat{y}_i)^2 .$$
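As a quick numerical illustration (a Python sketch on simulated data, not the Hitters data used later in this post), the least squares minimizer can be computed directly with a standard linear algebra routine:

```python
import numpy as np

# Simulated data (hypothetical, not the Hitters data): y = 2 + 0.5x + noise.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
X = np.column_stack([np.ones(200), x])
y = X @ np.array([2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# np.linalg.lstsq returns the coefficient vector that minimizes
# the sum of squared residuals.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # approximately [2.0, 0.5]
```

With small noise and 200 observations, the estimate lands very close to the true coefficients used to generate the data.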
Comparatively, quantile regression minimizes a weighted sum of the positive and negative error terms:
$$ \tau\sum_{y_i \gt \hat{\beta_{\tau}}'X_i} | y_i - \hat{\beta_{\tau}}'X_i |\ +\ (1 - \tau)\sum_{y_i \lt \hat{\beta_{\tau}}'X_i} | y_i - \hat{\beta_{\tau}}'X_i | $$
where $\tau$ is the quantile level.
If we assume that $\tau$ is equal to 0.9, we can compute the quantile regression loss for the data in the image above, with residuals $d_1 = -1.3$, $d_2 = 0.4$, and $d_3 = -0.4$, like this:
$$ \tau|d_2| + (1 - \tau)|d_1 + d_3| = 0.9 \times 0.4 + 0.1 \times |-1.3 - 0.4| = 0.53 $$
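The same arithmetic can be checked with a short sketch of the loss (Python here for illustration; the residuals are the $d_1$, $d_2$, $d_3$ values from the example above):

```python
# Quantile regression loss: residuals above the fit are weighted by tau,
# residuals below the fit by (1 - tau).
def quantile_loss(residuals, tau):
    return sum(tau * r if r > 0 else (1 - tau) * -r for r in residuals)

# d1 = -1.3, d2 = 0.4, d3 = -0.4 with tau = 0.9, as in the example above.
loss = quantile_loss([-1.3, 0.4, -0.4], tau=0.9)
print(round(loss, 2))  # 0.53
```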
Optimizing this loss function results in an estimated linear relationship between $y_i$ and $x_i$ where a portion of the data, $\tau$, lies below the line and the remaining portion of the data, $1-\tau$, lies above the line as shown in the graph below (Leeds, 2014).
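We can verify this property numerically. The sketch below (Python, on a simulated sample) fits only a constant, so minimizing the weighted loss should land at the sample $\tau$-quantile, leaving roughly a $\tau$ share of the observations below the fitted value:

```python
import numpy as np

# Simulated sample; for an intercept-only model, the minimizer of the
# quantile loss is the sample tau-quantile.
rng = np.random.default_rng(0)
y = rng.normal(size=1001)
tau = 0.9

def loss(c):
    r = y - c
    return tau * r[r > 0].sum() - (1 - tau) * r[r < 0].sum()

# Brute-force search over a fine grid of candidate constants.
grid = np.linspace(y.min(), y.max(), 5001)
c_hat = grid[np.argmin([loss(c) for c in grid])]

# The minimizer matches the sample 90th percentile, and about 90% of
# the observations lie below it.
print(c_hat, np.quantile(y, tau), np.mean(y < c_hat))
```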
Estimating a quantile regression with GAUSS
Today we will use the GAUSS function quantileFit to estimate our salary model at the 10%, 25%, 50%, 75%, and 90% quantiles. This gives us insight into what factors impact salaries at the extremes of the salary distribution, in addition to the quantiles in between those extremes.
The quantileFit function uses formula string syntax and takes the following inputs:
- dataset: String, name of the data set.
- formula: String, the formula of the model, e.g. "y ~ X1 + X2".
- tau: Optional argument, Mx1 vector of quantile levels. Default = {0.05, 0.5, 0.95}.
- w: Optional argument, Nx1 vector containing observation weights. Default = uniform weights.
- qCtl: Optional argument, an instance of the qfitControl structure containing members for controlling parameters of the quantile regression.
We will also use the qfitControl structure to specify variable names and set up a bootstrap for standard errors and confidence intervals:
// Load variables
y = loadd("islr_hitters.xlsx", "ln(salary)");
x = loadd("islr_hitters.xlsx", "AtBat + Hits + HmRun + Walks + Years + PutOuts");
/*
** Estimate the model
*/
// Set up tau for regression
tau = 0.10 | 0.25 | 0.50 | 0.75 | 0.90;
// Declare control structure
// and fill with default values
struct qfitControl qCtl;
qCtl = qfitControlCreate();
// Add variable names
qCtl.varnames = "AtBat" $| "Hits" $| "HmRun" $| "Walks" $| "Years" $| "PutOuts";
// Turn on bootstrapped confidence intervals
qCtl.bootstrap = 1000;
// Call quantileFit
struct qfitOut qOut;
qOut = quantileFit(y, x, tau, qCtl);
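For readers without GAUSS, the same $\tau$-weighted loss can be minimized directly. The sketch below is a minimal iteratively reweighted least squares (IRLS) routine in Python on simulated data; it illustrates the objective only and is not the algorithm quantileFit uses internally, and the data are hypothetical, not the Hitters data:

```python
import numpy as np

# Simulated data (not the Hitters data): y = 1 + 0.5x + noise.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 500)
X = np.column_stack([np.ones(500), x])
y = 1.0 + 0.5 * x + rng.normal(size=500)

def quantile_fit_irls(X, y, tau, iters=100, eps=1e-6):
    """Minimize the tau-weighted absolute-error loss by iteratively
    reweighted least squares: |r| is approximated by r^2 / |r_prev|."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS starting values
    for _ in range(iters):
        r = y - X @ b
        w = np.where(r > 0, tau, 1 - tau) / np.maximum(np.abs(r), eps)
        b = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return b

# Median regression (tau = 0.5) on the simulated sample.
b_median = quantile_fit_irls(X, y, tau=0.5)
print(b_median)  # near [1.0, 0.5] for this symmetric-noise sample
```

Production implementations typically use interior point or simplex-type methods for the non-smooth objective, but IRLS is a compact way to see the weighted loss at work.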
Interpreting our results
Coefficient estimates
| Variable | OLS | 10% | 25% | 50% | 75% | 90% |
|---|---|---|---|---|---|---|
| Constant | 4.37*** | 3.69*** | 3.72*** | 4.078*** | 4.663*** | 5.304*** |
| | (0.133) | (0.107) | (0.105) | (0.277) | (0.157) | (0.483) |
| AtBat | -0.00258** | -0.00324** | -0.00256** | -0.00253* | -0.00173 | -0.00179 |
| | (0.001) | (0.00156) | (0.00113) | (0.00143) | (0.00124) | (0.00157) |
| Hits | 0.01366*** | 0.01811*** | 0.01576*** | 0.01503*** | 0.01106*** | 0.008907** |
| | (0.003) | (0.00597) | (0.00377) | (0.00441) | (0.00374) | (0.00384) |
| HmRun | 0.0051 | -0.00289 | 0.000219 | 0.002443 | 0.01687*** | 0.01416* |
| | (0.0054) | (0.00801) | (0.00583) | (0.00906) | (0.00605) | (0.00821) |
| Walks | 0.0071*** | 0.006536* | 0.009025*** | 0.007767** | 0.006164** | 0.007038** |
| | (0.0023) | (0.00341) | (0.00284) | (0.00365) | (0.0025) | (0.00325) |
| Years | 0.0932*** | 0.09149*** | 0.1039*** | 0.1054*** | 0.08664*** | 0.07418*** |
| | (0.008) | (0.00691) | (0.00877) | (0.0154) | (0.0143) | (0.0269) |
| PutOuts | 0.0003** | -7.322e-5 | -0.00015 | 0.000462* | 0.000398** | 0.000388** |
| | (0.0001) | (0.00019) | (0.00028) | (0.00025) | (0.0002) | (0.00018) |
We can see in the table of our results that both the magnitude and statistical significance of the coefficients on our predictors change across the quantiles.
Looking at our table alone, the most interesting results are the coefficients on Hits and HmRun. There are several notable things about these results:
- The magnitude of the impact that Hits has on salary decreases as players' salaries move from the 10% quantile to the 90% quantile.
- Hits is less statistically significant for the 90% quantile than for the lower quantiles.
- HmRun is only statistically significant for the 75% and 90% quantiles.
This suggests that players with the highest salaries aren't necessarily paid just to hit balls but rather to hit home runs.
Confidence intervals
This paints a nice picture. However, it is inappropriate to draw any conclusions without first considering whether these differences are statistically significant (Leeds, 2014).
The graph above provides a visualization of the difference in coefficients across the quantiles with the bootstrapped confidence intervals. It also includes the OLS estimates, which are constant across all quantiles, and their confidence intervals.
From this graph, we can see that the OLS coefficients fall within the confidence intervals of the quantile regression coefficients. This implies that our quantile regression results are not statistically different from the OLS results.
Conclusions
Today we've learned the basics of quantile regression and seen an application to Major League Baseball Salary data. After today you should have a better understanding of:
- The intuition of quantile regression.
- How to estimate a quantile regression model in GAUSS.
- How to interpret the results from quantile regression estimates.
Code and data from this blog can be found here.
References
Leeds, M. (2014). "Quantile Regression for Sports Economics." International Journal of Sport Finance, 9, 346-359.