OLS diagnostics: Multicollinearity

Goals

This tutorial builds on the first five econometrics tutorials. It is suggested that you complete those tutorials prior to starting this one.

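This tutorial demonstrates how to test for multicollinearity after OLS regression. After completing this tutorial, you should be able to:

  • Compute the correlation between regressors.
  • Find the variance inflation factor (VIF) for each regressor.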

Introduction

Multicollinearity between regressors does not directly violate OLS assumptions. However, it can complicate regression, and exact multicollinearity will make estimation impossible. Signs of multicollinearity include large standard errors combined with high R-squared, high correlation between independent variables, and high correlation between estimated coefficients. We will check for multicollinearity by examining the correlation between regressors and calculating the variance inflation factor (VIF).

The OLS Model

Multicollinearity becomes a concern only when we have multiple regressors in our model. For this reason, this tutorial uses a data generating process with three independent variables:

$ y_{i} = 1.3 + 5.7 x_{i,1} + 0.5 x_{i,2} + 1.9 x_{i,3} + \epsilon_{i} $

where $ \epsilon_{i} $ is the random disturbance term. However, for demonstration we will make $x_3$ a function of $x_1$ and $x_2$:

//Introduce a potential source of multicollinearity
x_3 = 0.4*x[.,1] + 0.8*x[.,2] + rndn(num_obs, 1);

Once we've created the data, we estimate the model parameters using the GAUSS function ols and store the results as we did in previous tutorials.

Compute the Correlation Matrix

We will first look for signs of multicollinearity in the correlation matrix of the independent variables. This is done using the GAUSS command corrx.

//Test correlation between independent variables
print "corr(x):" corrx(indepvars);

The above code will print the following output:

corr(x):
  1.0000   0.0933   0.3042
  0.0933   1.0000   0.6982
  0.3042   0.6982   1.0000

As we should expect given our data generating process, the correlation matrix shows a fairly high correlation between x3 and x2 (0.6982) and a more moderate one between x3 and x1 (0.3042). These correlations don't seem to be impacting our regression: we don't see any unusually large standard errors.
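As a rough sanity check, the population correlations implied by the data generating process can be worked out by hand. This is a sketch that assumes $x_1$, $x_2$ and the noise added to $x_3$ are independent draws from N(0,1), as in the full program at the end of this tutorial:

$$Var(x_3) = 0.4^2 + 0.8^2 + 1 = 1.8$$

$$corr(x_1, x_3) = \frac{0.4}{\sqrt{1.8}} \approx 0.298, \qquad corr(x_2, x_3) = \frac{0.8}{\sqrt{1.8}} \approx 0.596, \qquad corr(x_1, x_2) = 0$$

The sample values in the matrix above differ from these figures because they are computed from only 100 random observations.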

Variance Inflation Factor (VIF)

The variance inflation factor for $x_j$ is given by

$$VIF(x_j) = \frac{1}{1-\hat{R}_j^2}$$

where $\hat{R}_j^2$ is the R-squared that results when $x_j$ is regressed, with an intercept, on all the other explanatory variables. As a rule of thumb, VIF values over 10 are concerning.
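To see where the name comes from, it may help to recall the standard textbook expression for the sampling variance of an OLS slope coefficient under homoskedastic errors. This is a known result rather than something derived in this tutorial; $\sigma^2$ denotes the error variance and $\bar{x}_j$ the sample mean of $x_j$, both notation introduced here for illustration:

$$Var(\hat{\beta}_j) = \frac{\sigma^2}{\sum_{i}(x_{i,j} - \bar{x}_j)^2} \cdot \frac{1}{1-\hat{R}_j^2} = \frac{\sigma^2}{\sum_{i}(x_{i,j} - \bar{x}_j)^2} \cdot VIF(x_j)$$

The VIF is therefore the factor by which the variance of $\hat{\beta}_j$ is inflated relative to the case where $x_j$ is uncorrelated with the other regressors. At the rule-of-thumb threshold:

$$VIF(x_j) = 10 \iff \hat{R}_j^2 = 1 - \frac{1}{10} = 0.9, \qquad \sqrt{10} \approx 3.16$$

so a VIF of 10 corresponds to a standard error roughly 3.16 times larger than it would be without the collinearity.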

To run the VIF test, we will first create a GAUSS procedure that:

  1. Runs the appropriate OLS regression for the inputs y and x.
  2. Computes the VIF using the OLS results.

//proc (number of return values) = procedure_name(input1, input2)
proc (1) = vif(y,x);
    //local variables only exist in this procedure
    local nam, m, b, stb, vc, std,
        sig, cx, rsq, resid, dbw, VIF_x;

    //Turn off printing of 'ols' report
    __output = 0;

    //Run regression
    { nam, m, b, stb, vc, std,
      sig, cx, rsq, resid, dbw } = ols("", y, x);

    //Calculate the VIF
    VIF_x = 1/(1 - rsq);

    //Return the VIF
    retp(VIF_x);

    //The 'endp' keyword ends the procedure
endp;

We then use this new procedure to find the VIF for x1, x2 and x3.

//Call 'vif' procedure for each variable

//Y = column 1, X = column 2 and 3
vif_x1 = vif(indepvars[.,1], indepvars[.,2:3]);
print "VIF for x_1 = " vif_x1;

//Y = column 2, X = column 1 and 3
vif_x2 = vif(indepvars[.,2], indepvars[.,1 3]);
print "VIF for x_2 = " vif_x2;

//Y = column 3, X = column 1 and 2
vif_x3 = vif(indepvars[.,3], indepvars[.,1:2]);
print "VIF for x_3 = " vif_x3;

The above code should print the following output:

VIF for x_1 =   1.1367
VIF for x_2 =   2.0126
VIF for x_3 =   2.1986
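
As a cross-check, the same VIFs can be obtained in one step from the correlation matrix, using the identity that the VIF of $x_j$ equals the j-th diagonal element of the inverse of the regressors' correlation matrix. This is a minimal sketch, not part of the original program: it assumes indepvars already exists as created in this tutorial, and the variable name vif_all is introduced here for illustration.

```gauss
//Cross-check: VIFs are the diagonal elements of the
//inverse of the correlation matrix of the regressors
//'vif_all' is an illustrative name, not from the original program
vif_all = diag(inv(corrx(indepvars)));

//Transpose so the three VIFs print on one row
print "VIF for x_1, x_2, x_3 = " vif_all';
```

The identity holds when each auxiliary regression includes an intercept, which is the default behavior of ols, so the result should reproduce the three values printed above up to rounding.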

Conclusion

Congratulations! You have:

  • Computed correlation between variables.
  • Found the VIF for each variable.

Further reading on diagnosing issues related to multicollinearity is available in our blog post "Diagnosing a Singular Matrix".

The next tutorial examines model specification.

For convenience, the full program text is below.

//Clear the work space
new;

//Set seed to replicate results
rndseed 23423;

//Create 100 observations of two variables
//which are each distributed as N(0,1)
num_obs = 100;
x = rndn(num_obs,2);

//Introduce a potential source of multicollinearity
x_3 = 0.4*x[.,1] + 0.8*x[.,2] + rndn(num_obs, 1);

//Independent variables
//The tilde operator performs horizontal concatenation
indepvars = x ~ x_3;

//Generate error terms
error_term = rndn(num_obs, 1);

//Generate y from indepvars and error_term
y = 1.3 + 5.7*indepvars[.,1] + 0.5*indepvars[.,2] + 1.9*indepvars[.,3] + error_term;

//Turn on residuals computation
_olsres = 1;

//Estimate model and store results in variables
//Use all three regressors (indepvars), not just the first two (x)
{ nam, m, b, stb, vc, std, sig, cx, rsq, resid, dbw } = ols("", y, indepvars);

//Test correlation between independent variables
print "corr(x):" corrx(indepvars);

//Call 'vif' procedure for each variable

//Y = column 1, X = column 2 and 3
vif_x1 = vif(indepvars[.,1], indepvars[.,2:3]);
print "VIF for x_1 = " vif_x1;

//Y = column 2, X = column 1 and 3
vif_x2 = vif(indepvars[.,2], indepvars[.,1 3]);
print "VIF for x_2 = " vif_x2;

//Y = column 3, X = column 1 and 2
vif_x3 = vif(indepvars[.,3], indepvars[.,1:2]);
print "VIF for x_3 = " vif_x3;

//proc (number of return values) = procedure_name(input1, input2)
proc (1) = vif(y,x);
    //local variables only exist in this procedure
    local nam, m, b, stb, vc, std,
        sig, cx, rsq, resid, dbw, VIF_x;

    //Turn off printing of 'ols' report
    __output = 0;

    //Run regression
    { nam, m, b, stb, vc, std,
      sig, cx, rsq, resid, dbw } = ols("", y, x);

    //Calculate the VIF
    VIF_x = 1/(1 - rsq);

    //Return the VIF
    retp(VIF_x);

    //The 'endp' keyword ends the procedure
endp;
