Hi,
I used the olsqr command to get the OLS coefficients of a linear regression. I just wanted to know if there is any difference between olsqr and ols that I should be worried about.
Thanks!
1 Answer
With respect to the estimated coefficients, there are two main differences between ols and olsqr.
Constant term
The first is that the ols procedure will automatically add the constant term for you, whereas for olsqr you need to add a column of 1's to your X matrix to estimate the constant term. For example:
// Simulate regression data
X = rndn(100, 4);
b_true = { 0.8, -1.1, 0.1, 0.6 };
alpha = 2;
y = alpha + X * b_true + rndn(100, 1);

// 'ols' adds the constant term automatically
call ols("", y, X);

// Add a column of 1's to estimate the constant with 'olsqr'
X_c = ones(rows(X), 1) ~ X;

// 'olsqr' returns the coefficient estimates (constant first)
b_hat = olsqr(y, X_c);
Algorithm
The second difference, which is more important, is the estimation algorithm. By default, the ols procedure uses the normal equations method, i.e. invpd(X'X)*X'y. olsqr uses the QR decomposition method.
The normal equations method is less numerically stable and loses accuracy on ill-conditioned matrices. Most sources recommend using the QR method.
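To make the contrast concrete, here is a minimal sketch, reusing y and X_c from the simulated example above, of how the two approaches arrive at the same coefficients on well-conditioned data:

// Normal equations: form and invert X'X (the default ols approach)
b_ne = invpd(X_c'X_c) * X_c'y;

// QR decomposition: the approach olsqr uses
b_qr = olsqr(y, X_c);

// On a well-conditioned X the two agree to many digits
print maxc(abs(b_ne - b_qr));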
As mentioned in the documentation, ols will use the QR method if you set:
__olsalg = "QR";
The olsmt control structure has an analogous structure member (ctl.olsalg).
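For example, assuming the __olsalg global behaves as documented, you can re-run the simulated regression from above with the QR solver:

// Tell ols to use the QR solver instead of the normal equations
__olsalg = "QR";
call ols("", y, X);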
When to use QR
We've already mentioned that most experts recommend always using QR, and we don't disagree. However, it can be helpful to know when it will make the most difference.
The condition number of a matrix tells us how numerically fragile the matrix will be when computing its inverse. Greene's Econometric Analysis textbook says that a condition number above 20 could indicate a problem. The GAUSS function cond will compute the condition number.
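As a quick toy illustration (hypothetical matrices, not the dataset below), the condition number jumps as soon as two columns become nearly collinear:

// Two independent columns: condition number stays small
A = rndn(50, 2);
print cond(A);

// Make the second column an almost exact copy of the first
B = A[., 1] ~ (A[., 1] + 1e-6 * rndn(50, 1));
print cond(B);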
Here is an example with a dataset known for being difficult to estimate:
new;
rndseed 2434;
// NIST Longley data
y = { 60323,
61122,
60171,
61187,
63221,
63639,
64989,
63761,
66019,
67857,
68169,
66513,
68655,
69564,
69331,
70551 };
X = { 83 234289 2356 1590 107608 1947,
88.5 259426 2325 1456 108632 1948,
88.2 258054 3682 1616 109773 1949,
89.5 284599 3351 1650 110929 1950,
96.2 328975 2099 3099 112075 1951,
98.1 346999 1932 3594 113270 1952,
99 365385 1870 3547 115094 1953,
100 363112 3578 3350 116219 1954,
101.2 397469 2904 3048 117388 1955,
104.6 419180 2822 2857 118734 1956,
108.4 442769 2936 2798 120445 1957,
110.8 444546 4681 2637 121950 1958,
112.6 482704 3813 2552 123366 1959,
114.2 502601 3931 2514 125368 1960,
115.7 518173 4806 2572 127852 1961,
116.9 554894 4007 2827 130081 1962 };
b_true = { -3482258.63459582,
15.0618722713733,
-0.358191792925910E-01,
-2.02022980381683,
-1.03322686717359,
-0.511041056535807E-01,
1829.15146461355 };
// Add column of 1's to estimate constant
X_c = ones(rows(X), 1) ~ X;
// Normal equations estimate
b_hat_ne = invpd(X_c'X_c)*X_c'y;
// QR estimate
b_hat_qr = olsqr(y, X_c);
// Find condition number of X
cx = cond(X);
// Find condition number of random matrix of same size
X_r = rndn(rows(X), cols(X));
cxr = cond(X_r);
/*
** Find the number of accurate decimal digits in each estimate.
** GAUSS log is base 10, so -log10 of the relative error
** gives the number of correct digits.
*/
nads_qr = -log(abs((b_true-b_hat_qr)./b_true));
nads_ne = -log(abs((b_true-b_hat_ne)./b_true));
After running the above code:
cx = 456037.68
cxr = 2.823

nads_ne    nads_qr
 8.0229    12.2611
 7.5016    10.9613
 7.5455    11.8447
 8.1336    12.5838
 8.3716    13.2114
 7.2266    11.6944
 8.0484    12.2677
In this case, we are using a dataset that is known to be difficult to estimate due to high multicollinearity. We see that the least accurate estimate from the normal equations approach has about 7 accurate digits, while the least accurate estimate from olsqr has nearly 11 accurate digits.
A difference at this level is unlikely to affect the conclusions of most projects. However, researchers do run into datasets that exhibit extreme levels of multicollinearity and even higher condition numbers than we see here.
The solution to this is beyond the scope of this answer, but it generally involves scaling the data, dropping variables, collecting more data, or some combination of the three.
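As a rough sketch of the scaling idea (using the Longley X from above; scaling alone will not remove the multicollinearity, but it usually improves the conditioning), you can standardize each column and compare condition numbers:

// Standardize each column: subtract column means, divide by column standard deviations
X_s = (X - meanc(X)') ./ stdc(X)';

// Compare conditioning before and after scaling
print cond(X);
print cond(X_s);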