How to Aggregate Panel Data in GAUSS

by Eric · Published November 22, 2019 · Updated April 25, 2024

Introduction

The aggregate function, first available in GAUSS version 20, computes statistics within data groups. This is particularly useful for panel data.

In today's blog, we take a closer look at aggregate. We will:

Introduce the basics of the aggregate function.
Explain how to use the aggregate function
Demonstrate a real-world application of the aggregate function using current account data from the International Monetary Fund.

The GAUSS Aggregate Function

The GAUSS aggregate function computes statistics within a group based upon a specified group identifier. The function supports a variety of GAUSS statistics including:

mean
median
mode
min
max
sample standard deviation
sum
sample variance

For example, consider a panel dataset which includes observed weights for three individuals across a 6-month time span:

Name	Jan. Weight	Feb. Weight	Mar. Weight	Apr. Weight	May Weight	June Weight
Sarah	135	134	138	142	144	145
Tom	196	192	182	183	184	181
Nikki	143	144	146	147	145	143

We can use the aggregate function to find the 6-month mean weights for Sarah, Tom and Nikki:

Name	Jan. Weight	Feb. Weight	Mar. Weight	Apr. Weight	May Weight	June Weight	Mean Weight
Sarah	135	134	138	142	144	145	139.7
Tom	196	192	182	183	184	181	186.3
Nikki	143	144	146	147	145	143	144.7

Alternatively, we could find the monthly standard deviation of the weights across Sarah, Tom and Nikki:

Name	Jan. Weight	Feb. Weight	Mar. Weight	Apr. Weight	May Weight	June Weight
Sarah	135	134	138	142	144	145
Tom	196	192	182	183	184	181
Nikki	143	144	146	147	145	143
Monthly Std. Dev.	33.2	31.0	23.4	22.4	22.8	21.4

How to Use The Aggregate Function

The aggregate function takes two required inputs:


x_a = aggregate(x, method);

x: NxK data matrix, must have group identifiers in the first column.
method: String, method to use. Valid methods include: "mean", "median", "mode", "max", "min", "sd", "sum", "variance".

The Input Data Matrix

The aggregate function requires the data matrix input to:

Have numerical group identifiers in the first column.
Be in stacked panel data format.

Let's consider our example dataset from above. In order to use this data as an input to the GAUSS aggregate function we need to:

Recode our group identifiers from names to numbers.
Stack our data into a pooled dataset.

Name	Jan. Weight	Feb. Weight	Mar. Weight	Apr. Weight	May Weight	June Weight
Sarah	135	134	138	142	144	145
Tom	196	192	182	183	184	181
Nikki	143	144	146	147	145	143

Sarah \to 1

$\text{Sarah} \rightarrow 1$

Tom \to 2

$\text{Tom} \rightarrow 2$

Nikki \to 3

$\text{Nikki} \rightarrow 3$

⇓

$\Downarrow$

Group	Jan. Weight	Feb. Weight	Mar. Weight	Apr. Weight	May Weight	June Weight
1	135	134	138	142	144	145
2	196	192	182	183	184	181
3	143	144	146	147	145	143

⇓

$\Downarrow$

Group	Month	Weight
1	1	135
1	2	134
1	3	138
1	4	142
1	5	144
1	6	145
2	1	196
2	2	192
⁞	⁞	⁞
3	6	143

The Method Input

The method input into the aggregate function should always be a string indicating which statistic you wish to compute.

Each method works on groups within the panel the same way the analogous pooled data function would work, including its handling of missing values.

Method	Pooled Function
mean	`meanc`
median	`median`
mode	`modec`
max	`maxc`
min	`minc`
sum	`sumc`
sd	`stdc`
variance	`varCovXS`

Example of How to Use Aggregate

Let's use aggregate to find the means by group for weight data:


weights = { 1 1 135,
            1 2 134,
            1 3 138,
            1 4 142,
            1 5 144,
            1 6 145,
            2 1 196,
            2 2 192,
            2 3 182,
            2 4 183,
            2 5 184,
            2 6 181,
            3 1 143,
            3 2 144,
            3 3 146,
            3 4 147,
            3 5 145,
            3 6 143 };
 
/*
** Find the mean by person.
** We will use the first column
** as the group indicator and will find
** the mean of the weights.
*/
print aggregate(weights[., 1 3], "mean");

This prints the group means to the output window:


1   139.67 
2   186.33 
3   144.67

We can also use the month identifiers to find the sample standard deviation by month:


/*
** Find the standard deviation by month.
** We will use the second column of weights
** as the group indicator and will find
** the standard deviation of the weights.
*/
print aggregate(weights[., 2 3], "sd");

Now the standard deviations, along with their associated months will be printed to the output window:

Using Aggregate to Examine Trends in Current Account Balances

Our simple example dataset is useful for demonstrating the basics of the aggregate function. However, a real-world panel dataset better demonstrates its true power. In this section, we will use aggregate to examine some of the trends in international current account balances.

The Data

We will use current account balance measured as a percentage of GDP. This unbalanced panel data is a modified version of a dataset from the International Monetary Fund and spans 1953-Q3 to 2019-Q4 and includes a total of 46 countries, across 5 different regions.

It contains the following variables:

Variable name	Description
Country	String, name of the country.
Country ID	Integer country identifier.
World Region	String, name of the corresponding world region.
Region ID	Integer world region identifier.
Time	String, the date of the observation.
CAB	Decimal numeric, the Current Account Balance.

Mean and Median Current Account Balances By Country

We first examine variations in the mean and median current account balances across countries. Using aggregate we find the mean and median current account balances for each country in the panel across all observations.


/*
** Load the 'Country ID' and 'CAB' (Current Account Balance) variables.
** Notice that the grouping variable will be in the first column of 'X'.
*/
X = loadd("imf_cab_mod.xlsx", "Country Id + CAB");
 
// Compute mean and median current
// account balances by Country ID
mean_cab_cid = aggregate(X, "mean");
median_cab_cid = aggregate(X, "median");

After the above code mean_cab_cid and median_cab_cid will be both be $46\times2$ matrices. Each element in the first column will be a unique country ID. The corresponding element in the second column will be the average (mean or median) current account balance for that country.

We include a graph of this data below, where we see that Germany leads the pack with the highest average current account balance while Finland has the lowest average current account balances.

Mean and Median Current Account Balances By Region

We can similarly consider the mean and median current account balances across geographical regions with the code below.


/*
** Load the 'Region ID' and 'CAB' (Current Account Balance) variables.
** Notice that the grouping variable will be in the first column of 'X'.
*/
X = loadd("imf_cab_mod.xlsx", "Region ID + CAB");
 
// Compute mean and median current
// account balances by world region
mean_cab_wreg = aggregate(X, "mean");
median_cab_wreg = aggregate(X, "median");

mean_cab_wreg and median_cab_wreg will be two column matrices with unique world region IDs in the first column and the corresponding statistics in the second column.

Mean and Median Current Account Balances Time Series

Finally, we consider how the mean and median current account balances vary across time in the time series plot below.


/*
** Load the 'Time' and 'CAB' (Current Account Balance) variables.
** Notice that the grouping variable will be in the first column of 'X'.
** Wrapping 'Time' in 'date($)' tells GAUSS that 'Time' is a string
** variable that we want GAUSS to convert to a date.
*/
X = loadd("imf_cab_mod.xlsx", "date($Time) + CAB");
 
mean_cab_date = aggregate(X, "mean");
median_cab_date = aggregate(X, "median");

This time the first column of our resulting matrices, mean_cab_date and median_cab_date, will contain each unique date from our dataset. The second column will contain the statistic computed for each unique date.

Interested in learning more about loading dates in GAUSS?
Check out this tutorial to learn more.

Below is a graph of the Current Account Balance data grouped by quarter.

Conclusion

In today's blog, we examined the fundaments of the aggregate procedure. After reading you should have a better understanding of:

The basics of the aggregate function.
How to use the aggregate function.
How to examine trends in real-world panel data using aggregate.

	1 33.151
	2 31.005
	3 23.438
	4 22.368
	5 22.811
	6 21.385

	weights = { 1 1 135,
	1 2 134,
	1 3 138,
	1 4 142,
	1 5 144,
	1 6 145,
	2 1 196,
	2 2 192,
	2 3 182,
	2 4 183,
	2 5 184,
	2 6 181,
	3 1 143,
	3 2 144,
	3 3 146,
	3 4 147,
	3 5 145,
	3 6 143 };

	/*
	** Find the mean by person.
	** We will use the first column
	** as the group indicator and will find
	** the mean of the weights.
	*/
	print aggregate(weights[., 1 3], "mean");

	/*
	** Find the standard deviation by month.
	** We will use the second column of weights
	** as the group indicator and will find
	** the standard deviation of the weights.
	*/
	print aggregate(weights[., 2 3], "sd");

	/*
	** Load the 'Country ID' and 'CAB' (Current Account Balance) variables.
	** Notice that the grouping variable will be in the first column of 'X'.
	*/
	X = loadd("imf_cab_mod.xlsx", "Country Id + CAB");

	// Compute mean and median current
	// account balances by Country ID
	mean_cab_cid = aggregate(X, "mean");
	median_cab_cid = aggregate(X, "median");

	/*
	** Load the 'Region ID' and 'CAB' (Current Account Balance) variables.
	** Notice that the grouping variable will be in the first column of 'X'.
	*/
	X = loadd("imf_cab_mod.xlsx", "Region ID + CAB");

	// Compute mean and median current
	// account balances by world region
	mean_cab_wreg = aggregate(X, "mean");
	median_cab_wreg = aggregate(X, "median");

	/*
	** Load the 'Time' and 'CAB' (Current Account Balance) variables.
	** Notice that the grouping variable will be in the first column of 'X'.
	** Wrapping 'Time' in 'date($)' tells GAUSS that 'Time' is a string
	** variable that we want GAUSS to convert to a date.
	*/
	X = loadd("imf_cab_mod.xlsx", "date($Time) + CAB");

	mean_cab_date = aggregate(X, "mean");
	median_cab_date = aggregate(X, "median");