Introduction
The aggregate
function, first available in GAUSS version 20, computes statistics within data groups. This is particularly useful for panel data.
In today's blog, we take a closer look at aggregate
. We will:
- Introduce the basics of the
aggregate
function. - Explain how to use the
aggregate
function - Demonstrate a real-world application of the
aggregate
function using current account data from the International Monetary Fund.
The GAUSS Aggregate Function
The GAUSS aggregate
function computes statistics within a group based upon a specified group identifier. The function supports a variety of GAUSS statistics including:
- mean
- median
- mode
- min
- max
- sample standard deviation
- sum
- sample variance
For example, consider a panel dataset which includes observed weights for three individuals across a 6-month time span:
Name | Jan. Weight | Feb. Weight | Mar. Weight | Apr. Weight | May Weight | June Weight |
---|---|---|---|---|---|---|
Sarah | 135 | 134 | 138 | 142 | 144 | 145 |
Tom | 196 | 192 | 182 | 183 | 184 | 181 |
Nikki | 143 | 144 | 146 | 147 | 145 | 143 |
We can use the aggregate
function to find the 6-month mean weights for Sarah, Tom and Nikki:
Name | Jan. Weight | Feb. Weight | Mar. Weight | Apr. Weight | May Weight | June Weight | Mean Weight |
---|---|---|---|---|---|---|---|
Sarah | 135 | 134 | 138 | 142 | 144 | 145 | 139.7 |
Tom | 196 | 192 | 182 | 183 | 184 | 181 | 186.3 |
Nikki | 143 | 144 | 146 | 147 | 145 | 143 | 144.7 |
Alternatively, we could find the monthly standard deviation of the weights across Sarah, Tom and Nikki:
Name | Jan. Weight | Feb. Weight | Mar. Weight | Apr. Weight | May Weight | June Weight |
---|---|---|---|---|---|---|
Sarah | 135 | 134 | 138 | 142 | 144 | 145 |
Tom | 196 | 192 | 182 | 183 | 184 | 181 |
Nikki | 143 | 144 | 146 | 147 | 145 | 143 |
Monthly Std. Dev. | 33.2 | 31.0 | 23.4 | 22.4 | 22.8 | 21.4 |
How to Use The Aggregate Function
The aggregate
function takes two required inputs:
x_a = aggregate(x, method);
- x
- NxK data matrix, must have group identifiers in the first column.
- method
- String, method to use. Valid methods include:
"mean"
,"median"
,"mode"
,"max"
,"min"
,"sd"
,"sum"
,"variance"
.
The Input Data Matrix
The aggregate
function requires the data matrix input to:
- Have numerical group identifiers in the first column.
- Be in stacked panel data format.
Let's consider our example dataset from above. In order to use this data as an input to the GAUSS aggregate
function we need to:
- Recode our group identifiers from names to numbers.
- Stack our data into a pooled dataset.
Name | Jan. Weight | Feb. Weight | Mar. Weight | Apr. Weight | May Weight | June Weight |
---|---|---|---|---|---|---|
Sarah | 135 | 134 | 138 | 142 | 144 | 145 |
Tom | 196 | 192 | 182 | 183 | 184 | 181 |
Nikki | 143 | 144 | 146 | 147 | 145 | 143 |
$$\text{Sarah} \rightarrow 1$$ $$\text{Tom} \rightarrow 2$$ $$\text{Nikki} \rightarrow 3$$
$$\Downarrow$$
Group | Jan. Weight | Feb. Weight | Mar. Weight | Apr. Weight | May Weight | June Weight |
---|---|---|---|---|---|---|
1 | 135 | 134 | 138 | 142 | 144 | 145 |
2 | 196 | 192 | 182 | 183 | 184 | 181 |
3 | 143 | 144 | 146 | 147 | 145 | 143 |
$$\Downarrow$$
Group | Month | Weight |
---|---|---|
1 | 1 | 135 |
1 | 2 | 134 |
1 | 3 | 138 |
1 | 4 | 142 |
1 | 5 | 144 |
1 | 6 | 145 |
2 | 1 | 196 |
2 | 2 | 192 |
⁞ | ⁞ | ⁞ |
3 | 6 | 143 |
The Method Input
The method input into the aggregate
function should always be a string indicating which statistic you wish to compute.
Each method works on groups within the panel the same way the analogous pooled data function would work, including its handling of missing values.
Method | Pooled Function |
---|---|
mean | meanc |
median | median |
mode | modec |
max | maxc |
min | minc |
sum | sumc |
sd | stdc |
variance | varCovXS |
Example of How to Use Aggregate
Let's use aggregate
to find the means by group for weight data:
weights = { 1 1 135,
1 2 134,
1 3 138,
1 4 142,
1 5 144,
1 6 145,
2 1 196,
2 2 192,
2 3 182,
2 4 183,
2 5 184,
2 6 181,
3 1 143,
3 2 144,
3 3 146,
3 4 147,
3 5 145,
3 6 143 };
/*
** Find the mean by person.
** We will use the first column
** as the group indicator and will find
** the mean of the weights.
*/
print aggregate(weights[., 1 3], "mean");
This prints the group means to the output window:
1 139.67
2 186.33
3 144.67
We can also use the month identifiers to find the sample standard deviation by month:
/*
** Find the standard deviation by month.
** We will use the second column of weights
** as the group indicator and will find
** the standard deviation of the weights.
*/
print aggregate(weights[., 2 3], "sd");
Now the standard deviations, along with their associated months will be printed to the output window:
1 33.151
2 31.005
3 23.438
4 22.368
5 22.811
6 21.385
Using Aggregate to Examine Trends in Current Account Balances
Our simple example dataset is useful for demonstrating the basics of the aggregate
function. However, a real-world panel dataset better demonstrates its true power. In this section, we will use aggregate
to examine some of the trends in international current account balances.
The Data
We will use current account balance measured as a percentage of GDP. This unbalanced panel data is a modified version of a dataset from the International Monetary Fund and spans 1953-Q3 to 2019-Q4 and includes a total of 46 countries, across 5 different regions.
It contains the following variables:
Variable name | Description |
---|---|
Country | String, name of the country. |
Country ID | Integer country identifier. |
World Region | String, name of the corresponding world region. |
Region ID | Integer world region identifier. |
Time | String, the date of the observation. |
CAB | Decimal numeric, the Current Account Balance. |
Mean and Median Current Account Balances By Country
We first examine variations in the mean and median current account balances across countries. Using aggregate
we find the mean and median current account balances for each country in the panel across all observations.
/*
** Load the 'Country ID' and 'CAB' (Current Account Balance) variables.
** Notice that the grouping variable will be in the first column of 'X'.
*/
X = loadd("imf_cab_mod.xlsx", "Country Id + CAB");
// Compute mean and median current
// account balances by Country ID
mean_cab_cid = aggregate(X, "mean");
median_cab_cid = aggregate(X, "median");
After the above code mean_cab_cid
and median_cab_cid
will be both be $46\times2$ matrices. Each element in the first column will be a unique country ID. The corresponding element in the second column will be the average (mean or median) current account balance for that country.
We include a graph of this data below, where we see that Germany leads the pack with the highest average current account balance while Finland has the lowest average current account balances.
Mean and Median Current Account Balances By Region
We can similarly consider the mean and median current account balances across geographical regions with the code below.
/*
** Load the 'Region ID' and 'CAB' (Current Account Balance) variables.
** Notice that the grouping variable will be in the first column of 'X'.
*/
X = loadd("imf_cab_mod.xlsx", "Region ID + CAB");
// Compute mean and median current
// account balances by world region
mean_cab_wreg = aggregate(X, "mean");
median_cab_wreg = aggregate(X, "median");
mean_cab_wreg
and median_cab_wreg
will be two column matrices with unique world region IDs in the first column and the corresponding statistics in the second column.
Mean and Median Current Account Balances Time Series
Finally, we consider how the mean and median current account balances vary across time in the time series plot below.
/*
** Load the 'Time' and 'CAB' (Current Account Balance) variables.
** Notice that the grouping variable will be in the first column of 'X'.
** Wrapping 'Time' in 'date($)' tells GAUSS that 'Time' is a string
** variable that we want GAUSS to convert to a date.
*/
X = loadd("imf_cab_mod.xlsx", "date($Time) + CAB");
mean_cab_date = aggregate(X, "mean");
median_cab_date = aggregate(X, "median");
This time the first column of our resulting matrices, mean_cab_date
and median_cab_date
, will contain each unique date from our dataset. The second column will contain the statistic computed for each unique date.
Check out this tutorial to learn more.
Below is a graph of the Current Account Balance data grouped by quarter.
Conclusion
In today's blog, we examined the fundaments of the aggregate
procedure. After reading you should have a better understanding of:
- The basics of the
aggregate
function. - How to use the
aggregate
function. - How to examine trends in real-world panel data using
aggregate
.
Further Reading
- Panel data, structural breaks and unit root testing
- Panel Data Basics: One-way Individual Effects
- Introduction to the Fundamentals of Panel Data
- Panel Data Stationarity Test With Structural Breaks
- Transforming Panel Data to Long Form in GAUSS
Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.