Introduction
Both ordinary least squares and generalized linear models can be computed directly from a dataset using the GAUSS formula string syntax. In addition, the ability to transform variables, including factor variables, makes for compact and efficient modeling.
In this tutorial, we will examine several ways to use formula strings in OLS. When using formula strings with the GAUSS procedure ols, two inputs are required: the dataset name and the formula.
Represent a model with formula strings
In a model with a dependent (or response) variable, the formula lists the dependent variable first, followed by a tilde ~ and then the independent variables. For example, to represent the model
$$weight = \alpha + \beta*height$$
the correct formula string would be "weight ~ height".
Descriptive Statistics
In this example, we will again use the auto2.dta dataset. To learn a little more about the dataset, let’s first look at the descriptive statistics using dstatmt.
//Create file name with full path
fname = getGAUSSHome() $+"examples/auto2.dta";
//Descriptive statistics
dstatmt(fname);
The output from this is:
---------------------------------------------------------------------------------------
Variable             Mean     Std Dev      Variance       Min       Max   Valid Missing
---------------------------------------------------------------------------------------
make                -----       -----         -----     -----     -----      74       0
price           6165.2568   2949.4959  8699525.9743   3291.00  15906.00      74       0
mpg               21.2973      5.7855       33.4720     12.00     41.00      74       0
rep78               -----       -----         -----     -----     -----      74       0
headroom           2.9932      0.8460        0.7157      1.50      5.00      74       0
trunk             13.7568      4.2774       18.2962      5.00     23.00      74       0
weight          3019.4595    777.1936   604029.8408   1760.00   4840.00      74       0
length           187.9324     22.2663      495.7899    142.00    233.00      74       0
turn              39.6486      4.3994       19.3543     31.00     51.00      74       0
displacement     197.2973     91.8372     8434.0748     79.00    425.00      74       0
gear_ratio         3.0149      0.4563        0.2082      2.19      3.89      74       0
foreign             -----       -----         -----     -----     -----      74       0
There are a few important things to note from this output. First, the full row of missing values for make tells us that make is not compatible with dstatmt. When this occurs, it is most likely because the variable is a string variable.
Second, note that rep78 and foreign also show no summary statistics in this output. This occurs because GAUSS recognizes them as categorical variables, for which a mean or variance is not meaningful. We can preview the data in the data import wizard to confirm that make is a string variable and that rep78 and foreign are categorical variables.
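Because the string and categorical columns produce no statistics, it can be convenient to request only the numeric variables of interest. The following is a minimal sketch, assuming that dstatmt accepts a formula string as an optional second input in the same way ols does; the choice of variables is only an illustration.

```gauss
// Create file name with full path
fname = getGAUSSHome() $+ "examples/auto2.dta";

// Assumption: dstatmt takes an optional formula string that lists
// the variables to summarize, separated by '+'
call dstatmt(fname, "mpg + weight + length");
```

If the assumption holds, the printed table contains only the rows for mpg, weight and length.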
OLS With A Subset of Variables
Now that we know a little more about our data, let’s set up our linear model. For our first model, let’s run a regression of mpg on weight and length.
$$mpg = \alpha + \beta_1*weight + \beta_2*length$$
The GAUSS formula string representing this model is "mpg ~ weight + length".
call ols(fname, "mpg ~ weight + length");
The output from this regression reads:
Valid cases:                    74      Dependent variable:                 mpg
Missing cases:                   0      Deletion method:                   None
Total SS:                 2443.459      Degrees of freedom:                  71
R-squared:                   0.661      Rbar-squared:                     0.652
Residual SS:               827.379      Std error of est:                 3.414
F(2,71):                    69.341      Probability of F:                 0.000

                                 Standard                 Prob Standardized    Cor with
Variable             Estimate       Error     t-value     >|t|     Estimate     Dep Var
---------------------------------------------------------------------------------------
CONSTANT            47.884873    6.087870    7.865620    0.000          ---         ---
weight              -0.003851    0.001586   -2.428452    0.018    -0.517387   -0.807175
length              -0.079593    0.055358   -1.437802    0.155    -0.306327   -0.795779
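The call keyword discards the values that ols returns. When the estimates are needed later in a program, they can be captured instead. The sketch below assumes the standard eleven-output form of ols; the variable names on the left-hand side are just labels chosen for this example.

```gauss
// Create file name with full path
fname = getGAUSSHome() $+ "examples/auto2.dta";

// Capture the return values instead of discarding them with 'call'.
// b holds the coefficient estimates and rsq the R-squared.
{ vnam, m, b, stb, vc, stderr, sigma, cx, rsq, resid, dwstat } = ols(fname, "mpg ~ weight + length");

// Print the estimated coefficients (constant first)
print "Coefficient estimates:";
print b;

// Print the R-squared of the regression
print "R-squared:";
print rsq;
```

The printed coefficients should correspond to the Estimate column of the table above.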
Include Factor Variables
We now wish to extend our previous model to include the levels of rep78. To specify that a variable is categorical in a formula, we use factor followed by the name of the variable inside a pair of parentheses. The formula for our extended model will be "mpg ~ weight + length + factor(rep78)".
call ols(fname, "mpg ~ weight + length + factor(rep78)");
Using factor in the formula string tells GAUSS that dummy variables representing the different categories of rep78 should be included in the regression. This is seen in the printed output table, which now includes coefficients for rep78 = Fair, Average, Good and Excellent. Note that rep78 = Poor is automatically excluded from the regression as the base level.
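Written out, the factor term stands for one dummy variable per non-base level. The notation below is introduced here for illustration only: $D_{k}$ equals 1 when rep78 takes the level $k$ and 0 otherwise, and $\gamma_k$ is its coefficient.

$$mpg = \alpha + \beta_1*weight + \beta_2*length + \gamma_1*D_{Fair} + \gamma_2*D_{Average} + \gamma_3*D_{Good} + \gamma_4*D_{Excellent}$$

Each $\gamma_k$ measures the difference in expected mpg relative to the excluded base level, holding weight and length fixed.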
                                 Standard                 Prob Standardized    Cor with
Variable             Estimate       Error     t-value     >|t|     Estimate     Dep Var
---------------------------------------------------------------------------------------
CONSTANT            49.954158    6.734554    7.417590    0.000          ---         ---
weight              -0.002299    0.001724   -1.333602    0.187    -0.310730   -0.805520
length              -0.115486    0.058422   -1.976769    0.053    -0.447805   -0.803676
rep78: Fair         -0.093428    2.710368   -0.034471    0.973    -0.005136   -0.134619
rep78: Average      -0.531709    2.496377   -0.212992    0.832    -0.045260   -0.279593
rep78: Good         -0.343326    2.551735   -0.134546    0.893    -0.025887    0.038439
rep78: Excellent     2.403347    2.668859    0.900515    0.371     0.151069    0.454192
Include Interaction Effects
Now let’s look at extending our model one step further to include interaction effects using formula strings. Two different operators are available for adding interaction terms. The colon operator, :, adds only the pure interaction term, while the asterisk, *, adds each individual term as well as the interaction term.
Let’s first consider using : to add the interaction of length and weight to our model. In this case the formula for our model is "mpg ~ weight + length + length:weight + factor(rep78)".
//Case one with ":"
call ols(fname, "mpg ~ weight + length + length:weight + factor(rep78)");
In the output from this call we see that the coefficient for the interaction term length:weight has been added to our output table just below the coefficient for length.
Valid cases:                    69      Dependent variable:                 mpg
Missing cases:                   5      Deletion method:               Listwise
Total SS:                 2340.203      Degrees of freedom:                  61
R-squared:                   0.702      Rbar-squared:                     0.668
Residual SS:               697.753      Std error of est:                 3.382
F(7,61):                    20.513      Probability of F:                 0.000

                                 Standard                 Prob Standardized    Cor with
Variable             Estimate       Error     t-value     >|t|     Estimate     Dep Var
---------------------------------------------------------------------------------------
CONSTANT            69.748294   14.957072    4.663232    0.000          ---         ---
weight              -0.009885    0.005407   -1.828120    0.072    -1.336016   -0.805520
length              -0.212137    0.087303   -2.429906    0.018    -0.822577   -0.803676
length:weight        0.000037    0.000025    1.478611    0.144     1.378839   -0.802872
rep78: Fair         -0.313868    2.688941   -0.116726    0.907    -0.017255   -0.134619
rep78: Average      -0.939373    2.488154   -0.377538    0.707    -0.079961   -0.279593
rep78: Good         -1.203796    2.593793   -0.464107    0.644    -0.090766    0.038439
rep78: Excellent     1.765676    2.678632    0.659171    0.512     0.110986    0.454192
Now we will estimate the same model using *. In this case, the formula for our model is "mpg ~ weight*length + factor(rep78)".
//Case two with "*"
call ols(fname, "mpg ~ weight*length + factor(rep78)");
The resulting output table shows that coefficients for weight, length, and weight:length are estimated.
Valid cases:                    69      Dependent variable:                 mpg
Missing cases:                   5      Deletion method:               Listwise
Total SS:                 2340.203      Degrees of freedom:                  61
R-squared:                   0.702      Rbar-squared:                     0.668
Residual SS:               697.753      Std error of est:                 3.382
F(7,61):                    20.513      Probability of F:                 0.000

                                 Standard                 Prob Standardized    Cor with
Variable             Estimate       Error     t-value     >|t|     Estimate     Dep Var
---------------------------------------------------------------------------------------
CONSTANT            69.748294   14.957072    4.663232    0.000          ---         ---
weight              -0.009885    0.005407   -1.828120    0.072    -1.336016   -0.805520
length              -0.212137    0.087303   -2.429906    0.018    -0.822577   -0.803676
weight:length        0.000037    0.000025    1.478611    0.144     1.378839   -0.802872
rep78: Fair         -0.313868    2.688941   -0.116726    0.907    -0.017255   -0.134619
rep78: Average      -0.939373    2.488154   -0.377538    0.707    -0.079961   -0.279593
rep78: Good         -1.203796    2.593793   -0.464107    0.644    -0.090766    0.038439
rep78: Excellent     1.765676    2.678632    0.659171    0.512     0.110986    0.454192
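The two calls produce identical estimates because both formula strings describe the same model. As a sketch, using $D_k$ as illustrative notation for the rep78 dummy variables and $\gamma_k$ for their coefficients:

$$mpg = \alpha + \beta_1*weight + \beta_2*length + \beta_3*(weight \times length) + \sum_{k} \gamma_k*D_{k}$$

In other words, "weight*length" is shorthand for "weight + length + weight:length". The only visible difference between the two tables is the label of the interaction row, which follows the order in which the variables were written.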
OLS Without a Constant
As a final adjustment to our model, let’s remove the constant from our regression. By default, ols includes a constant in the model when a GAUSS formula string is used. To run the model without a constant, we must add a -1 after the ~ in our formula. The -1 should be the first item in our list of independent variables. To remove the constant from our earlier model with weight, length and factor(rep78), we use the formula "mpg ~ -1 + weight + length + factor(rep78)".
call ols(fname, "mpg ~ -1 + weight + length + factor(rep78)");
The output from this line reads:
Valid cases:                    69      Dependent variable:                 mpg
Missing cases:                   5      Deletion method:               Listwise
Total SS:                33615.000      Degrees of freedom:                  63
R-squared:                   0.959      Rbar-squared:                     0.956
Residual SS:              1364.160      Std error of est:                 4.653
F(6,63):                   248.236      Probability of F:                 0.000

                                 Standard                 Prob Standardized    Cor with
Variable             Estimate       Error     t-value     >|t|     Estimate     Dep Var
---------------------------------------------------------------------------------------
weight              -0.011862    0.001560   -7.604068    0.000    -1.683494    0.880216
length               0.271706    0.035757    7.598726    0.000     2.334454    0.932449
rep78: Fair          4.735923    3.585777    1.320752    0.191     0.073061    0.295039
rep78: Average       5.855229    3.193495    1.833486    0.071     0.174919    0.580552
rep78: Good          5.490387    3.308433    1.659513    0.102     0.127049    0.501374
rep78: Excellent     8.676492    3.449911    2.514990    0.014     0.156955    0.494998
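For reference, the model estimated here drops $\alpha$ from the earlier specification. As before, $D_k$ and $\gamma_k$ are illustrative notation for the rep78 dummy variables and their coefficients:

$$mpg = \beta_1*weight + \beta_2*length + \gamma_1*D_{Fair} + \gamma_2*D_{Average} + \gamma_3*D_{Good} + \gamma_4*D_{Excellent}$$

The much larger Total SS and R-squared should be read with care. The printed values are consistent with the total sum of squares being taken about zero rather than about the mean once the constant is removed. This is an inference from the numbers in the output, not something the output states:

$$R^2 = 1 - \frac{1364.160}{33615.000} \approx 0.959$$

Under that reading, this R-squared is not directly comparable with the R-squared values from the models that include a constant.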