Introduction
The posterior probability distribution is the heart of Bayesian statistics and a fundamental tool for Bayesian parameter estimation. Naturally, how to infer and build these distributions is a widely examined topic, the scope of which cannot fit in a single blog post.
We can, however, start to build a better understanding of sampling by examining three basic, but fundamental techniques:
- The importance sampler
- Metropolis-Hastings sampler
- Gibbs sampler
These samplers fall into the family of Monte Carlo samplers. Monte Carlo samplers:
- Use repeated random draws to approximate a target probability distribution.
- Produce a sequence of draws that can be used to estimate unknown parameters.
The Metropolis-Hastings and Gibbs samplers are also both Markov Chain Monte Carlo (MCMC) samplers. Markov Chain Monte Carlo samplers:
- Are a type of Monte Carlo sampler.
- Build a Markov Chain by starting at a specific state and making random changes to the state during each iteration.
- Result in a Markov Chain that converges to the target distribution.
The importance sampler
Importance samplers use weighted draws from a proposed importance distribution to approximate characteristics of a different target distribution.
Importance sampling is useful when the area we are interested in may lie in a region that has a small probability of occurrence. In these cases, other sampling techniques may fail to even draw from that area.
Importance sampling overcomes this issue by sampling from a distribution which overweights the region of interest.
Preparing for sampling
In order to implement the importance sampler we must decide on our importance distribution and our target distribution. The importance distribution is what we will draw from during each iteration.
There are a few things to consider when deciding on our importance distribution (Koop, Poirier & Tobias, 2007):
- The fewer draws you make, the more crucial it is to have an importance distribution that is a close approximation to the target distribution.
- In general, importance distributions should have fatter tails than the target distribution. For example, today we will use a t-distribution as our importance distribution for estimating the mean and variance of a standard normal distribution.
- Choosing an importance distribution is not always as obvious or simple as our toy example suggests. In practice, there are statistical techniques to help choose the optimal importance distribution.
The algorithm
The importance sampler algorithm iterates over the following steps:
- Draw a single sample, $X$, from the importance distribution. For example, let's suppose we draw -0.01014 from a $t(0, 1, 2)$ distribution.
- Calculate the importance sampling weight. This is the ratio of the density of $X$ under the target distribution, $p(X)$, to the density of $X$ under the importance distribution, $q(X)$.
$$p(X) = \text{Density of -0.010139 under } N(0, 1) = 0.3989$$
$$q(X) = \text{Density of -0.010139 under } t(0, 1, 2) = 0.3535$$
$$w = \frac{p(X)}{q(X)} = \frac{0.3989}{0.3535} = 1.1284$$
- Multiply draw $X$ by the importance weight.
$$ w * X = 1.1284 * -0.01014 = -0.01144 $$
- Repeat steps 1-3 until a sample of the desired size is created.
- Use the sample collected in steps 1-4 to compute the expected value and variance of the target distribution.
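Before turning to the library function, it may help to see these steps written out directly. The following is a minimal GAUSS sketch of the algorithm (not the samplerlib implementation): it draws from a $t(0, 1, 2)$ importance distribution by hand, computes the t density from its closed form for 2 degrees of freedom, and uses only built-in functions such as rndn, pdfn, and meanc:
new;
rndseed 34532;
// Number of draws from the importance distribution
keep_draws = 50000;
// Storage for draws and importance weights
theta = zeros(keep_draws, 1);
w = zeros(keep_draws, 1);
for i (1, keep_draws, 1);
    // Step 1: draw X from the t(0, 1, 2) importance distribution
    // (standard normal divided by the square root of a chi-squared(2)/2 draw)
    chi2 = sumc(rndn(2, 1).^2);
    x = rndn(1, 1) / sqrt(chi2/2);
    // Step 2: importance weight w = p(X)/q(X)
    // p(X) is the standard normal target density;
    // q(X) is the t(0, 1, 2) density, 1/(2*sqrt(2)) * (1 + x^2/2)^(-3/2)
    p_x = pdfn(x);
    q_x = (1/(2*sqrt(2))) * (1 + (x^2)/2)^(-1.5);
    theta[i] = x;
    w[i] = p_x / q_x;
endfor;
// Steps 3-5: weight the draws and average them to estimate
// the mean and variance of the target distribution
theta_mean = meanc(w .* theta);
theta_var = meanc(w .* (theta.^2)) - theta_mean^2;
print "Estimated mean    :" theta_mean;
print "Estimated variance:" theta_var;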
Example*
Now let's use the iterative importance sampler to estimate the mean and variance of the standard normal distribution, $N(0,1)$.
We can implement this importance sampler using the importanceSamplerTDist function found in the GAUSS samplerlib library.
This function takes one required input and three optional inputs:
- keep_draws
- Scalar, the total number of draws to be kept.
- dof_is
- Optional input, Scalar, the degrees of freedom of the t-distribution. Default = 2.
- mean_is
- Optional input, Scalar, the mean of the importance function. Default = 0.
- scale_is
- Optional input, Scalar, the scale factor of the importance function. Default = 1.
We'll run the importanceSamplerTDist function using 50,000 repetitions and the default dof_is, mean_is, and scale_is:
new;
library samplerlib;
rndseed 34532;
// Number of iterations
keep_draws = 50000;
// Call the function
struct importanceOut iOut;
iOut = importanceSamplerTdist(keep_draws);
The importanceSamplerTDist function returns a filled instance of the importanceOut structure. The structure iOut contains the following relevant members:
- iOut.theta_draw_is
- Vector, sequence of draws from the importance sampler.
- iOut.theta_mean_is
- Scalar, importance sampler posterior mean.
- iOut.theta_std_is
- Scalar, importance sampler posterior standard deviation.
- iOut.w
- Vector, the importance weights.
- iOut.wmean
- Scalar, the mean of importance sampling weights.
- iOut.wstd
- Scalar, standard deviation of importance sampling weights.
The function prints the posterior mean and standard deviation, along with the mean and standard deviation of the importance weights, as output:
Importance Sampling Posterior Mean and Standard Deviation
    -0.00064283983       0.99621825
Mean and standard deviation of importance sampling weights
         1.0000216       0.38137929
We can see that our estimated mean of -0.0006 and standard deviation of 0.996 are fairly close to the true mean of 0 and standard deviation of 1.
Additionally, the mean and standard deviation of our importance weights provide some insight into the relationship between our importance distribution and the target distribution.
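If you want to work with the raw output rather than the printed summary, the members above can be accessed directly from iOut. For example, a short follow-up to the code above (assuming iOut.w holds the raw, unnormalized importance weights) might look like:
// Posterior summaries stored by the sampler
print "Posterior mean   :" iOut.theta_mean_is;
print "Posterior std dev:" iOut.theta_std_is;
// Recompute the weighted mean directly from the draws and weights
// (assumes the weights are unnormalized; their mean is roughly 1 here)
print "Weighted mean of draws:" meanc(iOut.w .* iOut.theta_draw_is);
// Weight diagnostics: a mean near 1 and a modest standard deviation
// suggest the importance distribution covers the target reasonably well
print "Mean of weights   :" iOut.wmean;
print "Std dev of weights:" iOut.wstd;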
Advantages
The importance sampler:
- Allows us to solve problems that may not be feasible using other sampling methods.
- Can be used to study one distribution using samples generated from another distribution.
- By choosing the appropriate importance distribution, we can sample from interesting or important regions that other Monte Carlo sampling techniques may overlook.
- Examples include Bayesian inference, rare event simulation in finance or insurance, and high energy physics.
Disadvantages
There are a few disadvantages of the importance sampler to consider:
- It does not work well in high dimensions. The variance of the samples increases as the dimensionality increases.
- It can be difficult to identify a suitable importance distribution that is easy to sample from.
Gibbs Sampling
Gibbs sampling is a Markov Chain Monte Carlo technique used to sample from distributions with at least two dimensions.
The Gibbs sampler draws iteratively from posterior conditional distributions rather than drawing directly from the joint posterior distribution. By iteration, we build a chain of draws, with each current draw depending on the previous draw.
The Gibbs sampler is particularly useful because the joint posterior is not always easy to work with. By drawing from the conditionals, we can solve the joint estimation problem as a series of smaller, easier estimation problems.
Preparing for sampling
Before running the Gibbs sampler we must derive the posterior conditional distributions. Today we are going to use the Gibbs sampler to estimate two parameters from a bivariate normal posterior distribution such that
$$ \begin{eqnarray*} \begin{pmatrix} \theta_1\\ \theta_2 \end{pmatrix} & \sim & N\left[\left(\begin{array}{c} 0\\ 0 \end{array}\right),\left(\begin{array}{ccc} 1 & \rho\\ \rho & 1 \end{array}\right)\right]\\ \end{eqnarray*} $$
where $\theta_1$ and $\theta_2$ are unknown parameters of the model, while $\rho$ is the known posterior correlation between $\theta_1$ and $\theta_2$.
Using the properties of the multivariate normal distribution we can define the conditional distributions necessary for our sampler
$$\theta_1|\theta_2,\: y \sim N(\rho\theta_2,\: 1-\rho^2) \sim \rho\theta_2 + \sqrt{1-\rho^2}N(0,\:1)$$
and
$$\theta_2|\theta_1,\: y \sim N(\rho\theta_1,\: 1-\rho^2) \sim \rho\theta_1 + \sqrt{1-\rho^2}N(0,\:1) .$$
The algorithm
The Gibbs sampler algorithm is dependent on the number of dimensions in our model. Let's consider a bivariate normal sampler with two parameters, $\theta_1$ and $\theta_2$.
- Determine the conditional distributions for each variable, $\theta_1$ and $\theta_2$.
- Choose starting values $\theta_1^{(0)}$ and $\theta_2^{(0)}$. $$\theta_1^{(0)} = 0$$ $$\theta_2^{(0)} = 0$$
- Draw $\theta_2^{(r)}$ from $\:p(\theta_2|y,\:\theta_1^{(r-1)})$.
$$\theta_2^{(1)}|\theta_1^{(0)} = \rho\theta_1^{(0)} + \sqrt{1-\rho^2}N(0,\:1) = \rho*0 + \sqrt{1-\rho^2}N(0,\:1) = -0.15107$$
- Draw $\theta_1^{(r)}$ from $\:p(\theta_1|y,\:\theta_2^{(r)})$.
$$\theta_1^{(1)}|\theta_2^{(1)} = \rho\theta_2^{(1)} + \sqrt{1-\rho^2}N(0,\:1) = \rho*(-0.15107) + \sqrt{1-\rho^2}N(0,\:1) = -0.65117$$
- Repeat steps 3 and 4 for the desired number of iterations.
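A minimal GAUSS sketch of this two-step loop (again, not the samplerlib implementation) simply alternates between the two conditional draws derived above. Here we use $\rho = 0.6$ and discard a burn-in period:
new;
rndseed 45352;
// Known correlation, burn-in length, and number of kept draws
rho = 0.6;
burn_in = 1000;
keep_draws = 10000;
// Storage for the kept draws (one column per parameter)
theta_keep = zeros(keep_draws, 2);
// Step 2: starting values
theta_1 = 0;
theta_2 = 0;
for r (1, burn_in + keep_draws, 1);
    // Step 3: draw theta_2 | theta_1 ~ N(rho*theta_1, 1 - rho^2)
    theta_2 = rho*theta_1 + sqrt(1 - rho^2)*rndn(1, 1);
    // Step 4: draw theta_1 | theta_2 ~ N(rho*theta_2, 1 - rho^2)
    theta_1 = rho*theta_2 + sqrt(1 - rho^2)*rndn(1, 1);
    // Keep draws only after the burn-in period
    if r > burn_in;
        theta_keep[r - burn_in, .] = theta_1 ~ theta_2;
    endif;
endfor;
// Posterior mean and variance-covariance matrix of the kept draws
print "Sample mean:" meanc(theta_keep)';
print "Sample variance-covariance matrix:" vcx(theta_keep);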
Example*
To implement this Gibbs sampler we can use the gibbsSamplerBiN function found in the GAUSS samplerlib library.
This function takes two required inputs and three optional inputs:
- keep_draws
- Scalar, the total number of draws to be kept.
- rho
- Scalar, the correlation parameter.
- burn_in
- Optional input, Scalar, the number of burn-in iterations. Default = 10% of kept draws.
- theta_1_init
- Optional input, Scalar, the initial value of $\theta_1$. Default = 0.
- theta_2_init
- Optional input, Scalar, the initial value of $\theta_2$. Default = 0.
We'll run the gibbsSamplerBiN function using $\rho = 0.6$, 1000 burn-in iterations, and keeping 10,000 repetitions:
new;
library samplerlib;
// Set random seed for repeatable random numbers
rndseed 45352;
/*
** Gibbs Sampler Specifications
** Known correlation parameter
*/
rho = 0.6;
// Burn-in for the Gibbs sampler
burn_in = 1000;
// Draws to keep from sampler
keep_draws = 10000;
// Call Gibbs sampler
struct gibbsBiNOut gOut;
gOut = gibbsSamplerBiN(keep_draws, rho, burn_in);
The gibbsSamplerBiN function returns a filled instance of the gibbsBiNOut structure. The structure gOut contains the following relevant members:
- gOut.theta_keep_gibbs
- Matrix, sequence of kept draws from the Gibbs sampler.
- gOut.theta_mean_gibbs
- Matrix, Gibbs sampler posterior mean.
- gOut.theta_std_gibbs
- Scalar, Gibbs sampler posterior standard deviation.
The function prints the posterior mean and variance-covariance matrix as output:
---------------------------------------------------
Variance-Covariance Matrix from Gibbs Sampler
---------------------------------------------------
Sample Mean:
       0.0060170250       0.0032377421
Sample Variance-Covariance Matrix:
         0.97698091       0.58179239
         0.58179239       0.98180547
In this case, we actually know the true values of our correlation parameter and means, which gives us some insight into the performance of our sampler.
Note that our sample estimate is $\hat{\rho} = 0.58179239$, while the true correlation parameter is $\rho = 0.6$. In addition, the true $\theta_1$ and $\theta_2$ were both equal to 0, while our sampler estimates $\hat{\theta}_1=0.006$ and $\hat{\theta}_2=0.003$.
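Because gOut.theta_keep_gibbs stores the full matrix of kept draws, these summaries can also be reproduced from the chain itself. A short sketch, assuming the members behave as described above:
// Kept draws from the sampler (one column per parameter)
theta = gOut.theta_keep_gibbs;
// Posterior means and variance-covariance matrix of the chain
v = vcx(theta);
print "Sample mean:" meanc(theta)';
print "Sample variance-covariance matrix:" v;
// Implied correlation between theta_1 and theta_2
print "Estimated correlation:" v[1, 2] / sqrt(v[1, 1] * v[2, 2]);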
Advantages
The Gibbs sampler is a powerful sampler with a number of advantages:
- Joint distributions may be complex to draw directly from but we may be able to sample directly from less complicated conditional distributions.
- Conditional distributions are lower in dimension than joint distributions and may be more suitable for applying other sampling techniques.
Disadvantages
The Gibbs sampler is not without disadvantages, though:
- To perform the Gibbs sampler we must be able to find posterior conditionals for each of the variables.
- As the correlation between variables increases, the performance of the Gibbs sampler decreases. This is because the sequence of draws from the Gibbs sampler becomes more correlated.
- Even if we can derive the conditional distributions, they may not be of known forms, making it impossible to draw from them directly.
- Drawing from multiple conditional distributions may be slow and inefficient.
Metropolis-Hastings sampler
Like the Gibbs sampler, the Metropolis-Hastings sampler is an MCMC sampler. While the Gibbs sampler relies on conditional distributions, the Metropolis-Hastings sampler uses the full joint density to generate candidate draws.
Candidate draws are not automatically added to the chain; instead, an acceptance probability is used to accept or reject each candidate draw.
Preparing for sampling
In order to implement the Metropolis-Hastings sampler two things must be chosen:
- The proposal distribution which defines the probability of a candidate draw given the previous draw. In our example we will use the normal distribution, $ N\big(0,\text{ }d^2\big)$, to generate the candidate draws.
- The acceptance probability which is the probability that we accept the candidate draw. In our example we will use
$$ \alpha\big(\theta^{(r-1)}, \theta^*\big) = min \bigg[exp\Big[\frac{1}{2}\Big( | \theta^{(r-1)}| - | \theta^*| - \bigg(\frac{\theta^{(r-1)}}{d}\bigg)^2 + \bigg(\frac{\theta^*}{d}\bigg)^2\Big) \Big],1\bigg]$$
where $d$ is the performance parameter, which we will set to $d=6$.
The algorithm
The general Metropolis-Hastings algorithm can be broken down into simple steps:
- Choose a starting value for $\theta$. $$\theta^{(0)} = 1$$
- Draw $\theta^*$ from the candidate generating density, $N\big(0,\text{ }36\big)$. For example, suppose we draw $\theta^* = -1.3847$.
- Calculate the acceptance probability $\alpha(\theta^{(r-1)}, \theta^*)$.
$$ \alpha \big( 1, -1.3847 \big) = min \bigg[ exp \Big[ \frac{1}{2} \Big( |1| - |-1.3847| - \bigg( \frac{1}{6} \bigg)^2 + \bigg( \frac{-1.3847}{6} \bigg)^2 \Big) \Big], 1 \bigg] = 0.8356$$
- Set $\theta^{(r)} = \theta^*$ with probability $\alpha(\theta^{(r-1)}, \theta^*)$, or else set $\theta^{(r)} = \theta^{(r-1)}$.
$$ 0.8356 > U(0,1) \rightarrow \theta^{(1)} = -1.3847$$
- Repeat steps 2-4 for $r = 1, ... , R$.
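A minimal GAUSS sketch of this independence chain (not the samplerlib implementation) is shown below. The acceptance probability follows the formula above with $d = 6$, which corresponds to the Laplace-type posterior used in the example that follows:
new;
rndseed 97980;
// Proposal standard deviation, burn-in, and number of kept draws
d = 6;
burn_in = 100;
keep_draws = 10000;
theta_keep = zeros(keep_draws, 1);
accepted = 0;
// Step 1: starting value
theta = 1;
for r (1, burn_in + keep_draws, 1);
    // Step 2: candidate draw from the N(0, d^2) proposal
    theta_star = d * rndn(1, 1);
    // Step 3: acceptance probability
    log_alpha = 0.5 * (abs(theta) - abs(theta_star)
                - (theta/d)^2 + (theta_star/d)^2);
    alpha = minc(exp(log_alpha) | 1);
    // Step 4: accept the candidate with probability alpha
    if rndu(1, 1) < alpha;
        theta = theta_star;
        if r > burn_in;
            accepted = accepted + 1;
        endif;
    endif;
    if r > burn_in;
        theta_keep[r - burn_in] = theta;
    endif;
endfor;
print "Posterior mean    :" meanc(theta_keep);
print "Posterior variance:" vcx(theta_keep);
print "Acceptance rate   :" accepted / keep_draws;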
Example*
In this example we will use the Metropolis-Hastings sampler to make inference about a parameter, $\theta$, which we assume has a posterior distribution given by
$$ p(\theta|y) \propto exp\bigg(-\frac{1}{2}|\theta|\bigg) $$
This is a special case of the Laplace distribution, which has a true mean of 0 and a variance of 8.
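To see where these values come from, recall that a Laplace distribution with location $\mu$ and scale $b$ has density proportional to $exp(-|\theta - \mu|/b)$, mean $\mu$, and variance $2b^2$. Matching this to the posterior above gives
$$ \mu = 0, \quad b = 2 \quad \Rightarrow \quad E[\theta|y] = 0, \quad Var(\theta|y) = 2b^2 = 8 .$$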
Though we know the true mean and variance, we are going to see how to use the independent chain Metropolis-Hastings sampler to estimate both.
We will implement this Metropolis-Hastings sampler using the metropolisHastingsIC function found in the GAUSS samplerlib library.
This function takes one required input and three optional inputs:
- keep_draws
- Scalar, the total number of draws to be kept.
- burn_in
- Optional input, Scalar, the number of burn-in iterations. Default = 10% of kept draws.
- sd_ic
- Optional input, Scalar, standard deviation of Metropolis-Hastings draw. Default = 1.
- theta_init
- Optional input, Scalar, the initial value of $\theta$. Default = 0.
We'll run the metropolisHastingsIC function using 100 burn-in iterations, keeping 10,000 repetitions, sd_ic = 6, and theta_init = 1:
new;
library samplerlib;
// Set random seed for repeatable random numbers
rndseed 97980;
/*
** Set up Metropolis Hastings chain
** Discard r0 burnin replications
*/
burn_in = 100;
// Specify the total number of replications
keep_draws = 10000;
/*
** Standard deviation for the increment
** in the independent chain M-H algorithm
*/
sd_ic = 6;
// Set initial theta
theta_init = 1;
// Call Metropolis-Hastings sampler
struct metropolisOut mOut;
mOut = metropolisHastingsIC(keep_draws, burn_in, sd_ic, theta_init);
The metropolisHastingsIC function returns a filled instance of the metropolisOut structure. The structure mOut contains the following relevant members:
- mOut.theta_draw_mh
- Vector, the vector of accepted draws in the sample.
- mOut.theta_mean_mh
- Scalar, the Metropolis-Hastings posterior mean.
- mOut.theta_std_mh
- Matrix, the Metropolis-Hastings posterior standard deviation.
- mOut.accepted_count_mh
- Scalar, the proportion of accepted candidate draws.
The function prints the posterior mean and variance, along with the proportion of accepted candidate draws, as output:
Posterior Mean and Variance
        0.038943707        8.1069691
Proportion of accepted candidate draws: Independence chain M-H
         0.48455446
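The estimated mean of 0.039 and variance of 8.11 are close to the true mean of 0 and variance of 8, and roughly half of the candidate draws were accepted. These quantities are also available programmatically through the structure members listed above:
// Posterior summaries and acceptance diagnostics from the output structure
print "Posterior mean      :" mOut.theta_mean_mh;
print "Posterior std dev   :" mOut.theta_std_mh;
print "Proportion accepted :" mOut.accepted_count_mh;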
Advantages
The Metropolis-Hastings sampler has a number of advantages:
- It is simple to implement. There is no need to determine conditional distributions.
- It is reasonable for sampling from correlated, high dimensional distributions.
Disadvantages
- Can have a poor convergence rate.
- Struggles with multi-modal distributions.
- The sampler is sensitive to the step size between draws. Either too large or too small of a step size can have a negative impact on convergence.
Conclusion
Congratulations! Today we've learned about three fundamental types of Bayesian samplers: the importance sampler, the Gibbs sampler, and the Metropolis-Hastings sampler.
In particular, we looked at:
- The algorithms of each.
- Some of the disadvantages and advantages of the samplers.
- Examples of how to implement the samplers using the GAUSS samplerlib library.
References
- Our examples today are based on examples provided in the Bayesian Econometric Methods textbook by Gary Koop, Dale Poirier, and Justin Tobias.
Koop, G., Poirier, D. J., & Tobias, J. L. (2007). Bayesian econometric methods. Cambridge University Press.