Over the last few sections, we have looked at how the parameters of probability distributions affect their overall properties. In this section, we are going to introduce a generic approach for estimating the parameters of a given distribution from data.
The basic idea is as follows. Suppose we have some data that we want to model using a specific probability distribution (e.g. a multivariate normal) with parameters $\theta$. Our goal is to choose the parameters so that the data is as likely as possible under the resulting distribution. This is called Maximum Likelihood Estimation (MLE).
Let's walk through a simple example. Imagine that your dataset is a collection of indoor temperature measurements (in Celsius) in a supermarket. Assuming that a normal distribution is an appropriate choice, let's look at how suitable different parameters are. Let's start with $\mu = 0$ and $\sigma = 1$: how likely is it that a normal distribution with $\mu = 0$ and $\sigma = 1$ generated the temperature data we observed? Not very likely at all, since such a normal distribution has most of its probability density between -1 and 1, which would be a very low value for an indoor temperature. Now consider another normal distribution with $\mu = 20$ and $\sigma = 1$: this distribution looks a lot more plausible, since most of its probability density lies in the interval $[18, 22]$, a much more realistic range for indoor temperatures. The latter choice of parameters makes the data much more likely.
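To make this concrete, here is a minimal sketch that compares the two candidate distributions by their log-likelihood. The temperature values are made up purely for illustration:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical indoor temperature readings in Celsius (illustrative values)
temps = np.array([19.2, 20.1, 20.8, 19.7, 21.3, 20.5])

# Log-likelihood of the data under the two candidate normal distributions
ll_standard = norm(loc=0, scale=1).logpdf(temps).sum()   # N(0, 1)
ll_indoor   = norm(loc=20, scale=1).logpdf(temps).sum()  # N(20, 1)

print(f"log-likelihood under N(0, 1):  {ll_standard:.1f}")   # very negative
print(f"log-likelihood under N(20, 1): {ll_indoor:.1f}")     # much higher
```

The second distribution assigns a far higher (log-)likelihood to the data, matching the intuition above.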
What we want to know is: what are the parameters that maximise the likelihood of the data? There are different ways to estimate parameters using MLE: in some cases, the estimates can be derived analytically in closed form; in other cases, they require iterative optimisation methods such as gradient descent. In any case, the departure point for Maximum Likelihood Estimation can be written as:

$$\hat{\theta} = \underset{\theta}{\operatorname{argmax}} \; p(\mathcal{D} \mid \theta)$$

where $\mathcal{D}$ denotes the observed data and $p(\mathcal{D} \mid \theta)$ is its likelihood under parameters $\theta$.
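As an illustration of the iterative route, here is a small sketch that fits a univariate normal to the made-up temperature readings from above by numerically minimising the negative log-likelihood (the data values and the use of `scipy.optimize.minimize` are assumptions for illustration, not part of the original example):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

temps = np.array([19.2, 20.1, 20.8, 19.7, 21.3, 20.5])  # same illustrative data

# Negative log-likelihood of a normal distribution; minimising it
# is equivalent to maximising the likelihood
def nll(params):
    mu, log_sigma = params  # optimise log(sigma) to keep sigma positive
    return -norm(loc=mu, scale=np.exp(log_sigma)).logpdf(temps).sum()

result = minimize(nll, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"MLE: mu = {mu_hat:.2f}, sigma = {sigma_hat:.2f}")
```

The optimiser recovers a mean close to the sample average, as we would expect.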
Now let's look at a simple example of MLE. We won't cover the proof here, but it turns out that the MLE parameters for a multivariate normal distribution can be computed in closed form from the data: the empirical mean and the empirical covariance matrix are the maximum likelihood estimates. Let's see this in action using the Iris dataset once again:
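The following is a minimal sketch, assuming the Iris data is loaded via scikit-learn:

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the Iris measurements (150 samples, 4 features)
X = load_iris().data

# MLE for a multivariate normal: the empirical mean and covariance.
# bias=True gives the MLE covariance (dividing by N rather than N - 1).
mu_hat = X.mean(axis=0)
sigma_hat = np.cov(X, rowvar=False, bias=True)

print("mean:", mu_hat)
print("covariance:\n", sigma_hat)
```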
Now let's plot the distribution with the parameters above along with the data to make sure we have a sensible fit:
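Since we can't visualise the full four-dimensional fit, the sketch below restricts the fit to the first two Iris features (sepal length and width) and overlays the fitted density on the data; the feature choice and plotting details are assumptions for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
from sklearn.datasets import load_iris

# Use the first two Iris features so we can plot in 2D
X2 = load_iris().data[:, :2]
mu_hat = X2.mean(axis=0)
sigma_hat = np.cov(X2, rowvar=False, bias=True)

# Evaluate the fitted density on a grid covering the data
xs = np.linspace(X2[:, 0].min() - 1, X2[:, 0].max() + 1, 200)
ys = np.linspace(X2[:, 1].min() - 1, X2[:, 1].max() + 1, 200)
xx, yy = np.meshgrid(xs, ys)
density = multivariate_normal(mean=mu_hat, cov=sigma_hat).pdf(np.dstack([xx, yy]))

# Density contours with the data points on top
plt.contourf(xx, yy, density, levels=20, cmap="viridis")
plt.scatter(X2[:, 0], X2[:, 1], c="white", edgecolors="black", s=20)
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.title("Multivariate normal MLE fit to Iris data")
plt.show()
```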
As expected, we find pretty good agreement between the distribution and the data, with most of the data points concentrated in areas of high probability density.