Introduction
Real-world data analysis is rarely as clean and tidy as the textbooks suggest. Consider linear regression -- data rarely meets the stringent assumptions required for OLS. Failing to recognize this and applying OLS anyway can lead to embarrassing, inaccurate conclusions.
In today's blog, we'll look at how to use feasible generalized least squares to deal with data that does not meet the OLS assumption of Independent and Identically Distributed (IID) error terms.
What Is Feasible Generalized Least Squares (FGLS)?
FGLS is a flexible and powerful tool that provides a reliable approach for regression analysis in the presence of non-constant variances and correlated errors.
Feasible Generalized Least Squares (FGLS):
- Is an extension of the traditional Ordinary Least Squares (OLS) regression method.
- Accommodates heteroscedasticity and serial correlation in data.
- Allows for more robust parameter estimates by considering the structure of the error terms.
Why is this important?
Recall the fundamental OLS IID assumption, which implies that the error terms have constant variance and are uncorrelated. When this assumption is violated:
- OLS estimators are no longer efficient.
- The usual estimator of the coefficient covariance matrix is inconsistent.
- Standard inference will be incorrect.
Unfortunately, many real-world cross-sectional, panel data, and time series datasets do violate this fundamental assumption.
FGLS allows for a more accurate modeling of complex and realistic data structures by accommodating the heteroscedasticity and autocorrelation in the error terms.
How Does FGLS Work?
FGLS uses a weighting matrix that captures the structure of the variance-covariance matrix of the errors.
This allows FGLS to:
- Give more weight to observations with smaller variances.
- Account for correlations.
- Provide more efficient estimates and valid inference in the presence of non-constant variance and serial correlation.
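To see why weighting helps, here is a small simulation, sketched in Python/numpy for illustration rather than GAUSS. It uses synthetic data (not the FRED series) where the error standard deviation is known to grow with the regressor, and compares plain OLS with weighted least squares:

```python
import numpy as np

# Synthetic heteroscedastic data (illustrative only):
# the error standard deviation grows with the regressor
rng = np.random.default_rng(7)
n = 4000
x = rng.uniform(1, 5, n)
X = np.column_stack([np.ones(n), x])
sigma = 0.5 * x
y = 1.0 + 2.0 * x + rng.normal(size=n) * sigma

# OLS treats every observation equally
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Weighted LS divides each row by its error sd, down-weighting
# the noisy observations; this is the GLS idea with known weights
w = 1.0 / sigma
b_wls = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)[0]
```

Dividing each row by its error standard deviation "whitens" the errors back to constant variance. FGLS does exactly this, except the weighting matrix is estimated from the residuals rather than known in advance.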
The method uses a relatively simple iterative process:
1. Pick a method for estimating the covariance matrix based on the believed data structure.
2. Compute initial OLS parameter estimates.
3. Use the OLS residuals and the chosen method to estimate an initial covariance matrix.
4. Compute FGLS estimates using the estimated covariance matrix for weighting.
5. Calculate residuals and refine the weighting matrix.
6. Repeat steps 3, 4, and 5 until convergence.
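For intuition, here is a minimal numpy sketch of that loop for an AR(1) error structure (a Cochrane-Orcutt-style iteration). The helper name is purely illustrative -- the GAUSS `fgls` procedure handles all of this internally:

```python
import numpy as np

def fgls_ar1(y, X, max_iter=20, tol=1e-8):
    """Illustrative iterative FGLS assuming AR(1) errors."""
    # Step 2: initial OLS estimates
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rho = 0.0
    for _ in range(max_iter):
        # Steps 3/5: estimate the AR(1) coefficient from residuals
        e = y - X @ beta
        rho_new = (e[:-1] @ e[1:]) / (e[:-1] @ e[:-1])
        # Step 4: quasi-difference (whiten) the data and re-estimate
        y_star = y[1:] - rho_new * y[:-1]
        X_star = X[1:] - rho_new * X[:-1]
        beta = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
        # Step 6: stop once rho has converged
        if abs(rho_new - rho) < tol:
            break
        rho = rho_new
    return beta, rho
```

The quasi-differencing step transforms the model so its errors are (approximately) IID, at which point OLS on the transformed data is efficient again.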
How Do I Know If I Should Use FGLS?
We've already noted that you should use FGLS when you encounter heteroscedasticity and/or autocorrelation. That's easy to say, but how do you identify these problems in practice?
There are a number of tools that can help.
| Tool | Description | Used to Identify |
|---|---|---|
| Scatter plots | Plot the dependent variable against the independent variables and look for patterns in the spread of the observations. | Heteroscedasticity and autocorrelation. |
| Residual plot | Plot the residuals against the fitted values, the independent variables, or time and look for fan shapes, cycles, or trends. | Heteroscedasticity and autocorrelation. |
| Histogram of residuals | Plot a histogram of the residuals. If the histogram is skewed or has unequal spread, it could suggest heteroscedasticity or a non-normal distribution. | Heteroscedasticity. |
| Durbin-Watson statistic | The Durbin-Watson statistic tests for first-order autocorrelation in the residuals. The test statistic ranges from 0 to 4, with values around 2 indicating no autocorrelation. | Autocorrelation. |
| Breusch-Pagan test | The Breusch-Pagan test considers the null hypothesis of homoscedasticity against the alternative of heteroscedasticity. | Heteroscedasticity. |
| Breusch-Godfrey test | The Breusch-Godfrey test extends the Durbin-Watson test to higher-order autocorrelation. The test assesses whether longer lags of the residuals and independent variables help explain the current residuals. | Autocorrelation. |
| White test | Similar to the Breusch-Pagan test, the White test considers the null hypothesis of homoscedasticity. | Heteroscedasticity. |
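Two of these diagnostics are easy to compute by hand. Below is a minimal numpy sketch of the Durbin-Watson statistic and the Breusch-Pagan LM statistic (both are available as built-ins in GAUSS and most statistics packages; the helper names here are illustrative):

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic: values near 2 suggest no
    first-order autocorrelation."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def breusch_pagan_lm(e, X):
    """Breusch-Pagan LM statistic: n * R^2 from regressing e^2 on X.

    Compare against a chi-square critical value with (k - 1) degrees
    of freedom, where k counts the columns of X (incl. the constant).
    """
    e2 = e ** 2
    b = np.linalg.lstsq(X, e2, rcond=None)[0]
    fitted = X @ b
    r2 = 1 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    return len(e) * r2
```

With one regressor plus a constant, the Breusch-Pagan statistic is compared against the chi-square(1) 5% critical value of about 3.84; larger values reject homoscedasticity.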
Example One: US Consumer Price Index (CPI)
Let's get a better feel for FGLS using real-world data. In this application, we will:
- Find OLS estimates and examine the results for signs of heteroscedasticity and autocorrelation.
- Compute FGLS estimates and discuss results.
Data
For this example, we will use publicly available FRED time series data:
- Consumer Price Index for All Urban Consumers: All Items in U.S. City Average (CPIAUCSL), seasonally adjusted.
- Compensation of employees, paid (COE), seasonally adjusted.
Both variables are quarterly, continuously compounded rates of change spanning from 1947Q2 to 2023Q3.
// Load data
fred_fgls = loadd("fred_fgls.gdat");
// Preview data
head(fred_fgls);
tail(fred_fgls);
            date            COE       CPIAUCSL
      1947-04-01    0.013842900    0.014184600
      1947-07-01    0.015131100    0.021573900
      1947-10-01    0.030381600    0.027915600
      1948-01-01    0.025448400    0.020966300
      1948-04-01    0.011788800    0.015823300

            date            COE       CPIAUCSL
      2022-07-01    0.023345300    0.013491800
      2022-10-01   0.0048207000    0.010199500
      2023-01-01    0.021001700   0.0093545000
      2023-04-01    0.013436000   0.0066823000
      2023-07-01    0.013337800   0.0088013000
OLS Estimation
Let's start by using OLS to examine the relationship between COE and CPI returns. We'll be sure to have GAUSS save our residuals so we can use them to evaluate OLS performance.
// Declare 'ols_ctl' to be an olsmtControl structure
// and fill with default settings
struct olsmtControl ols_ctl;
ols_ctl = olsmtControlCreate();
// Set the 'res' member of the olsmtControl structure
// so that 'olsmt' will compute residuals and the
// Durbin-Watson statistic
ols_ctl.res = 1;
// Declare 'ols_out' to be an olsmtOut structure
// to hold the results of the computations
struct olsmtOut ols_out;
// Perform estimation, using settings in the 'ols_ctl'
// control structure and store the results in 'ols_out'
ols_out = olsmt(fred_fgls, "CPIAUCSL ~ COE", ols_ctl);
Valid cases:                   306      Dependent variable:       CPIAUCSL
Missing cases:                   0      Deletion method:              None
Total SS:                    0.019      Degrees of freedom:            304
R-squared:                   0.197      Rbar-squared:                0.195
Residual SS:                 0.016      Std error of est:            0.007
F(1,304):                   74.673      Probability of F:            0.000
Durbin-Watson:               0.773

                            Standard                Prob   Standardized  Cor with
Variable        Estimate       Error     t-value    >|t|     Estimate    Dep Var
-------------------------------------------------------------------------------
CONSTANT      0.00397578  0.000678566    5.85909    0.000       ---         ---
COE             0.303476    0.0351191    8.64133    0.000     0.444067   0.444067
Evaluating the OLS Results
Taken at face value, these results look good. The standard errors on both estimates are small and both variables are statistically significant. We may be tempted to stop there. However, let's look more closely using some of the tools mentioned earlier.
Checking For Heteroscedasticity
First, let's create some plots using our residuals to check for heteroscedasticity. We will look at:
- A histogram of the residuals.
- The residuals versus the independent variable.
/*
** Plot a histogram of the residuals
** Check for skewed distribution
*/
plotHist(ols_out.resid, 50);
Our histogram indicates that the residuals from our OLS regression are asymmetric and slightly skewed. While the results aren't dramatic, they warrant further exploration to check for heteroscedasticity.
/*
** Plot residuals against COE
** Check for increasing or decreasing variance
** as the independent variable changes.
*/
plotScatter(fred_fgls[., "COE"], ols_out.resid);
It's hard to determine if these results are indicative of heteroscedasticity or not. Let's add random normal observations to our scatter plot and see how they compare.
// Add random normal observations to our scatter plot
// scale by 100 to put on same scale as residuals
rndseed 897680;
plotAddScatter(fred_fgls[., "COE"], rndn(rows(ols_out.resid), 1)/100);
Our residual plot doesn't vary substantially from the random normal observations and there isn't strong visual evidence of heteroscedasticity.
If we did have heteroscedasticity, our residuals would exhibit a fan-like shape, indicating a change in the spread between residuals as our observed data changes. For example, consider this plot of hypothetical residuals against COE:
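You can reproduce that fan shape with a quick simulation. The sketch below uses hypothetical data (not the FRED series), drawing residuals whose spread grows with the regressor:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical regressor standing in for COE (illustrative values only)
x = np.sort(rng.uniform(0.0, 0.05, 300))

# Heteroscedastic "residuals": the spread grows with x, which
# produces the fan shape in a scatter plot of residuals vs. x
resid = rng.normal(size=300) * (0.002 + 0.2 * x)

# The spread in the upper half of x is clearly larger than in the lower half
spread_low, spread_high = resid[:150].std(), resid[150:].std()
```

Plotting `resid` against `x` here shows the fan: a narrow band of residuals at small `x` that widens steadily as `x` grows.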
Checking For Autocorrelation
As you may have noticed, we don't have to look further than our OLS results for signs of autocorrelation. The olsmt procedure reports the Durbin-Watson statistic as part of the printed output. For this regression, the Durbin-Watson statistic is 0.773, which is significantly below 2, suggesting positive autocorrelation.
We can find further support for this conclusion by inspecting our residual plots, starting with a plot of the residuals against time.
// Checking for autocorrelation
/*
** Plot the residuals over time and
** look for cycles or trends to
** check for autocorrelation.
*/
plotXY(fred_fgls[., "date"], ols_out.resid);
Our time plot of residuals:
- Has extended periods of large residuals (roughly 1970-1977, 1979-1985, and 2020-2022).
- Suggests positive autocorrelation.
Now let's examine the plot of our residuals against lagged residuals:
/*
** Plot residuals against lagged residuals
** look for relationships and trends
*/
// Lag residuals
lagged_res = lagn(ols_out.resid, 1);

// Trim the first observation (its lag is missing) and
// plot residuals against lagged residuals
plotScatter(trimr(lagged_res, 1, 0), trimr(ols_out.resid, 1, 0));
This plot gives an even clearer picture of our autocorrelation issue, demonstrating:
- A clear linear relationship between the residuals and their lags.
- Larger residuals in the previous period lead to larger residuals in the current period.
FGLS Estimation
After examining the OLS estimation results more closely, we have clear support for using FGLS. We can do this using the fgls procedure, introduced in GAUSS 24.
The GAUSS fgls Procedure
The fgls procedure allows for model specification in one of two styles. The first style requires a dataframe input and a formula string:
// Calling fgls using a dataframe and formula string
out = fgls(data, formula);
The second option requires an input matrix or dataframe containing the dependent variable and an input matrix or dataframe containing the independent variables:
// Calling fgls using dependent variable
// and independent variable inputs
out = fgls(depvar, indvars);
Both options also allow for:
- An optional input specifying the computation method for the weighting matrix. GAUSS includes 7 pre-programmed options for the weighting matrix or allows for a user-specified weighting matrix.
- An optional fglsControl structure input for advanced estimation settings.
out = fgls(data, formula [, method, ctl])
The results from the FGLS estimation are stored in an fglsOut structure containing the following members:

| Member | Description |
|---|---|
| out.beta_fgls | The feasible generalized least squares estimates of the parameters. |
| out.sigma_fgls | Covariance matrix of the estimated parameters. |
| out.se_fgls | Standard errors of the estimated parameters. |
| out.ci | Confidence intervals of the estimated parameters. |
| out.t_stats | The t-statistics of the estimated parameters. |
| out.pvts | The p-values of the t-statistics of the estimated parameters. |
| out.resid | The estimated residuals. |
| out.df | Degrees of freedom. |
| out.sse | Sum of squared errors. |
| out.sst | Total sum of squares. |
| out.std_est | Standard deviation of the residuals. |
| out.fstat | Model F-statistic. |
| out.pvf | P-value of the model F-statistic. |
| out.rsq | R-squared. |
| out.dw | Durbin-Watson statistic. |
Running FGLS
Let's use FGLS and see if it helps with autocorrelation. We'll start with the default weighting matrix, which is an AR(1) structure.
// Estimate FGLS parameters using
// default setting
struct fglsOut fOut;
fOut = fgls(fred_fgls, "CPIAUCSL ~ COE");
Valid cases:                   306      Dependent variable:            COE
Total SS:                    0.019      Degrees of freedom:            304
R-squared:                   0.140      Rbar-squared:                0.137
Residual SS:                 0.017      Std error of est:            0.007
F(1,304)                    49.511      Probability of F:            0.000
Durbin-Watson                0.614

--------------------------------------------------------------------------------
                         Standard                Prob
Variable     Estimates      Error     t-value    >|t|    [95% Conf. Interval]
--------------------------------------------------------------------------------
Constant       0.00652   0.000908        7.19    0.000     0.00474     0.0083
CPIAUCSL          0.14     0.0286         4.9    0.000       0.084      0.196
The FGLS estimates using the AR(1) weighting matrix differ from our OLS estimates in both the coefficients and standard errors.
Let's look at a plot of our residuals:
// Plot FGLS residual
lagged_resid = lagn(fOut.resid, 1);
plotScatter(lagged_resid, fOut.resid);
Our residuals suggest that FGLS hasn't fully addressed our autocorrelation. What should we take from this?
This likely means that we need to consider higher-order autocorrelation. We may want to extend this analysis by:
- Running the Breusch-Godfrey test to check for higher-order autocorrelation.
- Examining the autocorrelation function (ACF) and partial autocorrelation functions (PACF).
- Estimating an alternative time series model, such as an ARIMA model.
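As a starting point for that follow-up work, the sample ACF is straightforward to compute. Here is a minimal numpy sketch (the helper name is illustrative; the Further Reading link on the PACF and ACF covers the GAUSS tools):

```python
import numpy as np

def sample_acf(e, max_lag=8):
    """Sample autocorrelation of residuals at lags 1..max_lag.

    Illustrative helper: for AR(1) residuals with coefficient rho,
    the ACF decays geometrically as rho, rho^2, rho^3, ...
    """
    e = e - e.mean()
    denom = np.sum(e ** 2)
    return np.array([np.sum(e[k:] * e[:-k]) / denom
                     for k in range(1, max_lag + 1)])
```

If the ACF of the FGLS residuals dies off slowly or shows spikes beyond lag one, that supports moving to a higher-order error structure or a full time series model.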
Ready to give FGLS a try? Get started with a GAUSS demo today!
Example Two: American Community Survey
Now let's consider a second example using a subset of data from the 2019 American Community Survey (ACS).
Data
The 2019 ACS data subset was cleaned and provided by the Social Science Computing Cooperative at the University of Wisconsin-Madison.
The survey data subset contains 5000 observations of the following variables:
| Variable | Census Codebook Name | Description |
|---|---|---|
| household | SERIALNO | Housing unit identifier. |
| person | SPORDER | Person number. |
| state | ST | State. |
| age | AGEP | Age in years. |
| other_languages | LANX | Another language is spoken at home. |
| english | ENG | Self-rated ability to speak English, if another language is spoken. |
| commute_time | JWMNP | Travel time to work in minutes, top-coded at 200. |
| marital_status | MAR | Marital status. |
| education | SCHL | Educational attainment, collapsed into categories. |
| sex | SEX | Sex (male or female). |
| hours_worked | WKHP | Usual hours worked per week in the past 12 months. |
| weeks_worked | WKHN | Weeks worked per year in the past 12 months. |
| race | RAC1P | Race. |
| income | PINCP | Total income in current dollars, rounded. |
Let's run a naive model of income against two independent variables, age and hours_worked.
Our first step is loading our data:
/*
** Step One: Data Loading
** Using the 2019 ACS
*/
// Load data
acs_fgls = loadd("acs2019sample.dta", "income + age + hours_worked");
// Review the summary statistics
dstatmt(acs_fgls);
---------------------------------------------------------------------------------------------
Variable             Mean     Std Dev      Variance     Minimum     Maximum    Valid  Missing
---------------------------------------------------------------------------------------------
income          4.062e+04   5.133e+04     2.634e+09       -8800   6.887e+05     4205      795
age                 43.38       24.17           584           0          94     5000        0
hours_worked        38.09       13.91         193.5           1          99     2761     2239
Based on our descriptive statistics there are a few data cleaning steps that will help our model:
- Remove missing values using the packr procedure.
- Transform income to thousands of dollars to improve data scaling.
- Remove cases with negative incomes.
// Remove missing values
acs_fgls = packr(acs_fgls);
// Transform income
acs_fgls[., "income"] = acs_fgls[., "income"]/1000;
// Filter out cases with negative incomes
acs_fgls = delif(acs_fgls, acs_fgls[., "income"] .< 0);
OLS Estimation
Now we're ready to run a preliminary OLS estimation.
// Declare 'ols_ctl' to be an olsmtControl structure
// and fill with default settings
struct olsmtControl ols_ctl;
ols_ctl = olsmtControlCreate();
// Set the 'res' member of the olsmtControl structure
// so that 'olsmt' will compute residuals and the Durbin-Watson statistic
ols_ctl.res = 1;
// Declare 'ols_out' to be an olsmtOut structure
// to hold the results of the computations
struct olsmtOut ols_out;
// Perform estimation, using settings in the 'ols_ctl'
// control structure and store the results in 'ols_out'
ols_out = olsmt(acs_fgls, "income ~ age + hours_worked", ols_ctl);
Valid cases:                  2758      Dependent variable:         income
Missing cases:                   0      Deletion method:              None
Total SS:              8771535.780      Degrees of freedom:           2755
R-squared:                   0.147      Rbar-squared:                0.146
Residual SS:           7481437.527      Std error of est:           52.111
F(2,2755):                 237.536      Probability of F:            0.000
Durbin-Watson:               1.932

                              Standard                Prob   Standardized  Cor with
Variable          Estimate       Error     t-value    >|t|     Estimate    Dep Var
-----------------------------------------------------------------------------------
CONSTANT          -31.0341     3.91814   -7.92062     0.000       ---         ---
age               0.762573   0.0620066    12.2983     0.000     0.216528   0.227563
hours_worked       1.25521   0.0715453    17.5443     0.000     0.308893   0.316628
Our results make intuitive sense and suggest that:
- Both age and hours_worked are statistically significant.
- Increases in age lead to increases in income.
- Increases in hours_worked lead to increases in income.
Evaluating the OLS Results
As we know from our previous example, we need to look beyond the estimated coefficients and standard errors when evaluating our model results. Let's start with the histogram of our residuals:
/*
** Plot a histogram of the residuals
** Check for skewed distribution
*/
plotHist(ols_out.resid, 50);
The histogram of our residuals is right-skewed, with a long tail of large positive values.
However, because our initial data is truncated, residual scatter plots will be more useful for checking for heteroscedasticity.
/*
** Plot residuals against independent variables
** Check for increasing or decreasing variance
** as the independent variable changes.
*/
plotScatter(acs_fgls[., "age"], ols_out.resid);
// Open second plot window
plotOpenWindow();
plotScatter(acs_fgls[., "hours_worked"], ols_out.resid);
Both plots show signs of heteroscedasticity:
- The age scatter plot demonstrates the tell-tale fan-shaped relationship with residuals. This indicates that variance in residuals increases as age increases.
- The hours_worked scatter plot is less obvious but does seem to indicate higher variance in the residuals at the middle ranges (40-60) than the lower and higher ends.
FGLS estimation
To address the issues of heteroscedasticity, let's use FGLS. This time we'll use the "HC0" weighting matrix (White, 1980).
// Estimate FGLS parameters
// using the HC0 weighting matrix
struct fglsOut fOut;
fOut = fgls(acs_fgls, "income ~ age + hours_worked", "HC0");
Valid cases:                  2758      Dependent variable:            age
Total SS:              8771535.780      Degrees of freedom:           2755
R-squared:                   0.147      Rbar-squared:                0.146
Residual SS:           7481440.027      Std error of est:           52.111
F(2,2755)                  237.535      Probability of F:            0.000
Durbin-Watson                1.932

-------------------------------------------------------------------------------------
                           Standard                Prob
Variable       Estimates      Error     t-value    >|t|    [95% Conf. Interval]
-------------------------------------------------------------------------------------
Constant           -30.9     0.0743        -416    0.000       -31.1      -30.8
hours_worked       0.762   0.000475    1.61e+03    0.000       0.761      0.763
income              1.25    0.00181         694    0.000        1.25       1.26
While using FGLS results in slightly different coefficient estimates, it has a big impact on the estimated standard errors. In this case, these changes don't alter our inferences -- all of our regressors are still statistically significant.
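For intuition about where heteroscedasticity-robust standard errors come from, White's (1980) HC0 estimator, the idea behind this weighting option, replaces the classical covariance s²(X'X)⁻¹ with a "sandwich" form built from the squared OLS residuals. A minimal numpy sketch (an illustrative helper, not the GAUSS implementation):

```python
import numpy as np

def ols_with_hc0(y, X):
    """OLS coefficients with White (1980) HC0 robust standard errors.

    Sandwich form: (X'X)^-1 (X' diag(e^2) X) (X'X)^-1,
    where e are the OLS residuals.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = X.T @ (X * (e ** 2)[:, None])      # X' diag(e^2) X
    cov = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(cov))
```

Under homoscedastic errors the sandwich collapses to approximately the classical formula; under heteroscedasticity it stays consistent where the classical formula does not, which is why the robust standard errors can differ sharply from the OLS ones.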
Conclusion
Today we've seen how FGLS offers a potential solution for data that doesn't satisfy the restrictive IID assumption of OLS.
After today, you should have a better understanding of how to:
- Identify heteroscedasticity and autocorrelation.
- Compute OLS and FGLS estimates using GAUSS.
Further Reading
- Introduction to the Fundamentals of Time Series Data and Analysis.
- Finding the PACF and ACF.
- Generating and Visualizing Regression Residuals.
- OLS diagnostics: Heteroscedasticity.
Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.