Difference between coefficients from ols vs olsqr?

Hi,

I used the olsqr command to get the OLS coefficients of a linear regression. I just wanted to know whether there is any difference between the two that I should be worried about.

Thanks!

1 Answer






With respect to the estimated coefficients, there are two main differences between ols and olsqr.

Constant term

The first is that the ols procedure automatically adds the constant term for you, whereas for olsqr you need to append a column of 1's to your X matrix to estimate the constant term. For example:

// Simulate data with a known constant and slope coefficients
X = rndn(100, 4);
b_true = { 0.8, -1.1, 0.1, 0.6 };
alpha = 2;
y = alpha + X * b_true + rndn(100, 1);

// ols adds the constant term automatically
call ols("", y, X);

// Add column of 1's to estimate constant
X_c = ones(rows(X), 1) ~ X;

// Estimate with constant; olsqr returns the coefficient vector
b_hat = olsqr(y, X_c);
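
As a quick check (this comparison is our addition, not part of the original example), you can print the true parameters next to the olsqr estimates; the constant is stacked on top so it lines up with the first coefficient:

// True values (constant first) next to the olsqr estimates
print (alpha|b_true) ~ b_hat;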

Algorithm

The second difference, which is more important, is the estimation algorithm. By default, the ols procedure uses the normal equations method, i.e. invpd(X'X)*X'y. olsqr uses the QR decomposition method.

The normal equations method is less numerically stable and produces less accurate results for ill-conditioned matrices. Most sources recommend using the QR method.

As mentioned in the documentation, ols will use the QR method if you set:

__olsalg = "QR";
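
Set this global before the call; for example, reusing the simulated y and X from the first example:

// Tell ols to use the QR algorithm
__olsalg = "QR";
call ols("", y, X);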

The olsmt control structure has an analogous structure member (ctl.olsalg).
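
If you are using olsmt, a minimal sketch looks like this (assuming the standard olsmtControlCreate setup; check your documentation for the exact structure members):

// Fill the olsmt control structure with default settings
struct olsmtControl ctl;
ctl = olsmtControlCreate();

// Select the QR algorithm
ctl.olsalg = "qr";

// Estimate the model
struct olsmtOut oOut;
oOut = olsmt(ctl, "", y, X);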

When to use QR

We've already mentioned that most experts recommend always using QR, and we don't disagree. However, it can be helpful to know when it will make the most difference.

The condition number of a matrix tells us how sensitive computations such as the inverse will be to numerical error. Greene's Econometric Analysis textbook notes that a condition number above 20 could indicate a problem. The GAUSS function cond computes the condition number.
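
As a quick illustration (the small matrix here is our own toy example), the condition number equals the ratio of the largest to the smallest singular value:

x = { 1 2,
      3 4 };

// Built-in condition number
c1 = cond(x);

// Equivalent: ratio of largest to smallest singular value
s = svds(x);
c2 = maxc(s) / minc(s);

print c1~c2;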

Here is an example with a dataset known for being difficult to estimate:

new;
rndseed 2434;

// NIST Longley data
y = { 60323,
    61122,
    60171,
    61187,
    63221,
    63639,
    64989,
    63761,
    66019,
    67857,
    68169,
    66513,
    68655,
    69564,
    69331,
    70551 };

X = { 83    234289     2356     1590   107608     1947,
    88.5    259426     2325     1456   108632     1948,
    88.2    258054     3682     1616   109773     1949,
    89.5    284599     3351     1650   110929     1950,
    96.2    328975     2099     3099   112075     1951,
    98.1    346999     1932     3594   113270     1952,
    99      365385     1870     3547   115094     1953,
    100     363112     3578     3350   116219     1954,
    101.2   397469     2904     3048   117388     1955,
    104.6   419180     2822     2857   118734     1956,
    108.4   442769     2936     2798   120445     1957,
    110.8   444546     4681     2637   121950     1958,
    112.6   482704     3813     2552   123366     1959,
    114.2   502601     3931     2514   125368     1960,
    115.7   518173     4806     2572   127852     1961,
    116.9   554894     4007     2827   130081     1962 };

b_true = { -3482258.63459582,
            15.0618722713733,
      -0.358191792925910E-01,
           -2.02022980381683,
           -1.03322686717359,
      -0.511041056535807E-01,
            1829.15146461355 };

// Add column of 1's to estimate constant
X_c = ones(rows(X), 1) ~ X;

// Normal equations estimate
b_hat_ne = invpd(X_c'X_c)*X_c'y;

// QR estimate
b_hat_qr = olsqr(y, X_c);

// Find condition number of X
cx = cond(X);

// Find condition number of random matrix of same size
X_r = rndn(rows(X), cols(X));
cxr = cond(X_r);

/*
** Find number of accurate digits.
** Note: GAUSS log is the base-10 logarithm (ln is the natural
** log), so this gives decimal digits of accuracy.
*/

nads_qr = -log(abs((b_true-b_hat_qr)./b_true));
nads_ne = -log(abs((b_true-b_hat_ne)./b_true));

After running the above code:

cx = 456037.68    cxr = 2.823

nads_ne = 8.0229  nads_qr = 12.2611
          7.5016            10.9613
          7.5455            11.8447
          8.1336            12.5838
          8.3716            13.2114
          7.2266            11.6944
          8.0484            12.2677

In this case, we are using a dataset that is known to be difficult to estimate due to high multicollinearity. The least accurate coefficient from the normal equations method has about 7 accurate digits, while the least accurate coefficient from olsqr has nearly 11.

A difference of this size is unlikely to impact the results of most projects. However, researchers do run into datasets that exhibit extreme levels of multicollinearity and even higher condition numbers than we see here.

The solution to this problem is beyond the scope of this answer, but it generally involves scaling the data, dropping variables, collecting more data, or some combination of the three.
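
As a small illustration of the scaling option (this snippet is our addition; it reuses the Longley X from above):

// Standardize each column of X: subtract the column mean and
// divide by the column standard deviation
X_s = (X - meanc(X)') ./ stdc(X)';

// The condition number drops substantially after scaling
print cond(X) ~ cond(X_s);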

aptech



