Introduction
Maximum likelihood is a widely used technique for estimation with applications in many areas including time series modeling, panel data, discrete data, and even machine learning.
In today's blog, we cover the fundamentals of maximum likelihood estimation.
In particular, we discuss:
- The basic theory of maximum likelihood.
- The advantages and disadvantages of maximum likelihood estimation.
- The log-likelihood function.
- Modeling applications.
In addition, we consider a simple application of maximum likelihood estimation to a linear regression model.
What is Maximum Likelihood Estimation?
Maximum likelihood estimation is a statistical method for estimating the parameters of a model. In maximum likelihood estimation, the parameters are chosen to maximize the likelihood that the assumed model results in the observed data.
This implies that in order to implement maximum likelihood estimation we must:
- Assume a model, also known as a data generating process, for our data.
- Be able to derive the likelihood function for our data, given our assumed model (we will discuss this more later).
Once the likelihood function is derived, maximum likelihood estimation is nothing more than a simple optimization problem.
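To preview what that optimization step looks like in code, here is a minimal sketch in Python (the simulated data, the known-variance normal model, and the `neg_log_likelihood` helper are all assumptions of this illustration, not part of the original article). It estimates the mean of normally distributed data by minimizing the negative log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

# Toy model: normally distributed data with unknown mean and known variance (sigma = 1).
def neg_log_likelihood(mu, data):
    # Negative of the sum of log normal densities; optimizers minimize, so we negate.
    return -np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (data - mu) ** 2)

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=100)  # simulated sample

result = minimize(neg_log_likelihood, x0=np.array([0.0]), args=(data,))
print(result.x, data.mean())  # the MLE of the mean equals the sample mean
```

Because optimizers conventionally minimize, the negative of the log-likelihood is passed to the routine; maximizing the likelihood and minimizing its negative are equivalent.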
What are the Advantages and Disadvantages of Maximum Likelihood Estimation?
At this point, you may be wondering why you should pick maximum likelihood estimation over other methods such as least squares regression or the generalized method of moments. The reality is that we shouldn't always choose maximum likelihood estimation. Like any estimation technique, maximum likelihood estimation has advantages and disadvantages.
Advantages of Maximum Likelihood Estimation
There are many advantages of maximum likelihood estimation:
- If the model is correctly specified, the maximum likelihood estimator is asymptotically the most efficient estimator.
- It provides a consistent and flexible framework, making it suitable for a wide variety of applications, including cases where the assumptions of other estimators are violated.
- Estimates are asymptotically unbiased: any bias shrinks as the sample size grows.
Disadvantages of Maximum Likelihood Estimation
- It relies on the assumption of a model and the derivation of the likelihood function which is not always easy.
- Like other optimization problems, maximum likelihood estimation can be sensitive to the choice of starting values.
- Depending on the complexity of the likelihood function, the numerical estimation can be computationally expensive.
- Estimates can be biased in small samples.
What is the Likelihood Function?
Maximum likelihood estimation hinges on the derivation of the likelihood function. For this reason, it is important to have a good understanding of what the likelihood function is and where it comes from.
Let's start with the very simple case where we have one series $y$ with 10 independent observations: 5, 0, 1, 1, 0, 3, 2, 3, 4, 1.
The Probability Density
The first step in maximum likelihood estimation is to assume a probability distribution for the data. A probability density function measures the probability of observing the data given a set of underlying model parameters.
In this case, we will assume that our data follow a Poisson distribution, a common assumption for nonnegative count data.
The Poisson probability density function for an individual observation, $y_i$, is given by
$$f(y_i | \theta ) = \frac{e^{-\theta}\theta^{y_i}}{y_i!}$$
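As a quick sanity check (illustrative Python; the particular values of $\theta$ and $y_i$ are arbitrary), the density can be evaluated by hand and compared against `scipy.stats.poisson.pmf`:

```python
import math
from scipy.stats import poisson

theta, y_i = 2.0, 3                                        # arbitrary illustrative values
by_hand = math.exp(-theta) * theta**y_i / math.factorial(y_i)
library = poisson.pmf(y_i, mu=theta)
print(by_hand, library)                                    # both approximately 0.180
```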
Because the observations in our sample are independent, the probability density of our observed sample can be found by taking the product of the probability of the individual observations:
$$f(y_1, y_2, \ldots, y_{10}|\theta) = \prod_{i=1}^{10} \frac{e^{-\theta}\theta^{y_i}}{y_i!} = \frac{e^{-10\theta}\theta^{\sum_{i=1}^{10}y_i}}{\prod_{i=1}^{10}y_i!} $$
We can use the probability density to answer the question of how likely it is that our data occurs given specific parameters.
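The sketch below (again an illustrative Python fragment, with the candidate $\theta$ chosen arbitrarily) evaluates the joint density of the ten observations both as a product of individual Poisson probabilities and via the simplified closed form above, confirming that the two agree:

```python
import math
import numpy as np
from scipy.stats import poisson

y = np.array([5, 0, 1, 1, 0, 3, 2, 3, 4, 1])
theta = 1.5                                   # arbitrary candidate parameter value

# Product of the individual observation probabilities.
product_form = np.prod(poisson.pmf(y, mu=theta))

# Simplified closed form: exp(-n*theta) * theta**sum(y) / prod(y_i!).
n = y.size
prod_factorials = np.prod([math.factorial(int(v)) for v in y])   # 207,360 for this sample
closed_form = math.exp(-n * theta) * theta ** y.sum() / prod_factorials

print(product_form, closed_form)   # identical up to floating-point rounding
```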
The Likelihood Function
The differences between the likelihood function and the probability density function are nuanced but important.
- A probability density function expresses the probability of observing our data given the underlying distribution parameters. It assumes that the parameters are known.
- The likelihood function expresses the likelihood of parameter values occurring given the observed data. It assumes that the parameters are unknown.
Mathematically the likelihood function looks similar to the probability density:
$$L(\theta|y_1, y_2, \ldots, y_{10}) = f(y_1, y_2, \ldots, y_{10}|\theta)$$
For our Poisson example, we can fairly easily derive the likelihood function
$$L(\theta|y_1, y_2, \ldots, y_{10}) = \frac{e^{-10\theta}\theta^{\sum_{i=1}^{10}y_i}}{\prod_{i=1}^{10}y_i!} = \frac{e^{-10\theta}\theta^{20}}{207,360}$$
The maximum likelihood estimate of the unknown parameter, $\theta$, is the value that maximizes this likelihood.
The Log-Likelihood Function
In practice, the likelihood function, a product across observations, can be difficult to work with, so the natural logarithm of the likelihood is used instead. In the case of our Poisson dataset the log-likelihood function is:
$$\ln L(\theta|y) = -n\theta + \ln (\theta) \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \ln (y_i!) = -10\theta + 20 \ln(\theta) - \ln(207,360)$$
The log-likelihood is usually easier to optimize than the likelihood function.
The Maximum Likelihood Estimator
Plotting the likelihood and log-likelihood for our dataset, or setting the derivative of the log-likelihood to zero ($-10 + 20/\theta = 0$), shows that the maximum occurs at $\theta = 2$. This means that our maximum likelihood estimator is $\hat{\theta}_{MLE} = 2$, which is simply the sample mean of the data.
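This can be verified numerically. The sketch below (a Python illustration, not code from the original post) maximizes the Poisson log-likelihood for the sample and recovers the same estimate:

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([5, 0, 1, 1, 0, 3, 2, 3, 4, 1])

# Poisson log-likelihood up to the constant -ln(prod(y_i!)),
# which does not depend on theta and so does not affect the maximizer.
def log_likelihood(theta):
    return -y.size * theta + y.sum() * np.log(theta)

# Minimize the negative log-likelihood over a reasonable bracket for theta.
result = minimize_scalar(lambda t: -log_likelihood(t), bounds=(1e-6, 10), method="bounded")
print(result.x, y.mean())   # both are (approximately) 2.0
```

For the Poisson model the first-order condition can also be solved analytically, giving $\hat{\theta}_{MLE} = \bar{y}$; the numerical optimizer simply confirms it.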
The Conditional Maximum Likelihood
In the simple example above, we use maximum likelihood estimation to estimate the parameters of our data's density. We can extend this idea to estimate the relationship between our observed data, $y$, and other explanatory variables, $x$. In this case, we work with the conditional maximum likelihood function:
$$L(\theta | y, x)$$
We will look more closely at this in our next example.
Example Applications of Maximum Likelihood Estimation
The versatility of maximum likelihood estimation makes it useful across many empirical applications. It can be applied to everything from the simplest linear regression models to advanced choice models.
In this section we will look at two applications:
- The linear regression model
- The probit model
Maximum Likelihood Estimation and the Linear Model
In linear regression, we assume that the model residuals are independent and identically normally distributed:
$$\epsilon_i = y_i - x_i\beta \sim N(0, \sigma^2)$$
Based on this assumption, the log-likelihood function for the unknown parameter vector, $\theta = \{\beta, \sigma^2\}$, conditional on the observed data, $y$ and $x$, is given by:
$$\ln L(\theta|y, x) = - \frac{1}{2}\sum_{i=1}^n \Big[ \ln \sigma^2 + \ln (2\pi) + \frac{(y_i-x_i\beta)^2}{\sigma^2} \Big] $$
The maximum likelihood estimates of $\beta$ and $\sigma^2$ are those that maximize the likelihood.
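As a concrete illustration, the sketch below (simulated data; the regressor design, sample size, and the log-variance parameterization are all choices of this example rather than anything prescribed by the model) maximizes this log-likelihood numerically and compares the resulting $\hat{\beta}$ to the ordinary least squares solution, which coincides with the maximum likelihood estimate under normal errors:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
n = 500
x = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept plus one regressor
beta_true = np.array([1.0, 2.0])
y = x @ beta_true + rng.normal(scale=1.5, size=n)        # true sigma = 1.5

# Negative log-likelihood in theta = (beta_0, beta_1, log_sigma);
# optimizing over log(sigma) keeps the variance positive.
def neg_log_likelihood(theta):
    beta, log_sigma = theta[:2], theta[2]
    sigma2 = np.exp(2.0 * log_sigma)
    resid = y - x @ beta
    return 0.5 * np.sum(np.log(sigma2) + np.log(2.0 * np.pi) + resid**2 / sigma2)

result = minimize(neg_log_likelihood, x0=np.zeros(3))
beta_mle, sigma_mle = result.x[:2], np.exp(result.x[2])

beta_ols = np.linalg.lstsq(x, y, rcond=None)[0]
print(beta_mle, beta_ols)   # the ML and OLS estimates of beta agree
print(sigma_mle)            # close to 1.5 (the ML variance estimate divides by n, not n - k)
```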
Maximum Likelihood Estimation and the Probit Model
The probit model is a fundamental discrete choice model.
The probit model assumes that there is an underlying latent variable, $y_i^*$, driving the discrete outcome. The latent variable follows a normal distribution such that:
$$y_i^* = x_i\theta + \epsilon_i, \qquad \epsilon_i \sim N(0,1)$$
where
$$ y_i = \begin{cases} 0 \text{ if } y_i^* \le 0\\ 1 \text{ if } y_i^* \gt 0\\ \end{cases} $$
The probability of observing $y_i = 1$ is then
$$P(y_i = 1|x_i) = P(y_i^* \gt 0|x_i) = P(x_i\theta + \epsilon_i \gt 0|x_i) = P(\epsilon_i \gt -x_i\theta|x_i) = 1 - \Phi(-x_i\theta) = \Phi(x_i\theta)$$
where $\Phi$ represents the standard normal cumulative distribution function.
The log-likelihood for this model is
$$\ln L(\theta) = \sum_{i=1}^n \Big[ y_i \ln \Phi (x_i\theta) + (1 - y_i) \ln (1 - \Phi(x_i\theta)) \Big] $$
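The sketch below (simulated data and illustrative variable names; Python with NumPy and SciPy assumed, not code from the original article) estimates a probit model by maximizing this log-likelihood directly:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(7)
n = 2000
x = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept plus one regressor
theta_true = np.array([0.5, 1.0])

y_star = x @ theta_true + rng.normal(size=n)             # latent variable
y = (y_star > 0).astype(float)                           # observed binary outcome

# Negative probit log-likelihood; probabilities are clipped to avoid log(0).
def neg_log_likelihood(theta):
    p = np.clip(norm.cdf(x @ theta), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

result = minimize(neg_log_likelihood, x0=np.zeros(2))
print(result.x)   # close to theta_true = [0.5, 1.0]
```

In practice a packaged routine or an analytic gradient would typically be used, but nothing beyond the log-likelihood itself is required.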
Conclusions
Congratulations! After today's blog, you should have a better understanding of the fundamentals of maximum likelihood estimation. In particular, we've covered:
- The basic theory of maximum likelihood estimation.
- The advantages and disadvantages of maximum likelihood estimation.
- The log-likelihood function.
- The conditional maximum likelihood function.
Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.