Multivariate distributions, covariance and correlation

So far, we have been dealing with univariate probability distributions: that is, the distributions we looked at depend on a single scalar value $x$. For instance, we saw how a Gaussian distribution may be used to model the height of a given population. In turn, this allows us to calculate the probability of observing an individual with height greater than e.g. 190 cm in that population.

What if we wanted to quantify the probability of observing an individual with a height greater than 190 cm and a weight smaller than 80 kg? Here we are introducing a new variable (weight), so we are now in the territory of multivariate distributions, which we will explore in this and the next sections.

But before we begin, why can't we just treat weight and height separately? This is because they are not independent. If they were, we could compute the respective probabilities and multiply them to obtain the desired joint probability. In reality, we know that measurements of height and weight are correlated, so we cannot compute the joint probability assuming independence.

The way forward is to work with multivariate distributions which take into account correlations between variables. Often this is done by incorporating a covariance matrix into the probability distribution, as we will see later.

Covariance

Consider a collection of $n$ random variables $X_1, X_2, \cdots, X_n$. The covariance matrix for these random variables can be written as

\begin{equation} \Sigma_{\mathbf{X}} = \begin{bmatrix} \text{var}[X_1] & \text{cov}[X_1, X_2] & \cdots & \text{cov}[X_1, X_n] \\ \text{cov}[X_2, X_1] & \text{var}[X_2] & \cdots & \text{cov}[X_2, X_n] \\ \vdots & \vdots & \ddots & \vdots \\ \text{cov}[X_n, X_1] & \text{cov}[X_n, X_2] & \cdots & \text{var}[X_n] \end{bmatrix} \end{equation}

where $\text{var}[X_i]$ represents the variance of the random variable $X_i$ and $\text{cov}[X_i, X_j]$ represents the covariance between $X_i$ and $X_j$. In turn, the covariance between $X_i$ and $X_j$ is defined as:

\begin{equation} \text{cov}[X_i, X_j] = \mathbb{E} \left[ (X_i - \mathbb{E}[X_i]) (X_j - \mathbb{E}[X_j]) \right] \end{equation}

where $\mathbb{E}[X_i]$ denotes the expected value of $X_i$.

In the absence of true values for the expectations of $X_i$ and $X_j$, we can compute an empirical (or sample) covariance matrix. Assume that $\mathbf{x}_i$ and $\mathbf{x}_j$ are two samples of size $N$ from the random variables $X_i$ and $X_j$, respectively. Their covariance is defined as:

\begin{equation} C_{ij} = \frac{1}{N-1} \sum_{k=1}^N (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j) \end{equation}

We can also write the equation above in vector form:

\begin{equation} C_{ij} = \frac{1}{N-1} (\mathbf{x}_i - \bar{x}_i) \cdot (\mathbf{x}_j - \bar{x}_j) \end{equation}

where $\cdot$ represents the dot product operation. Now let's demonstrate this in practice: we will load data from the Iris dataset, which is a dataset of measurements from different species of Iris flowers (read this for more details on the dataset). The dataset is preloaded and available as a dataframe via our built-in Dataset object (a custom object we preload in our environment, not part of the Python standard library). Let's plot the measurements of petal length and petal width for the versicolor species.
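
Since the built-in Dataset object only exists inside this environment, here is a minimal sketch of the plotting step using scikit-learn's copy of the Iris dataset as a stand-in (the column names below follow scikit-learn's conventions, which may differ from those of the preloaded dataframe):

```python
# A stand-in for the preloaded Iris dataframe: scikit-learn's copy of the
# Iris dataset, loaded as a pandas DataFrame.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame
versicolor = iris[iris["target"] == 1]  # target 1 corresponds to versicolor

plt.scatter(versicolor["petal length (cm)"], versicolor["petal width (cm)"])
plt.xlabel("Petal length (cm)")
plt.ylabel("Petal width (cm)")
plt.title("Iris versicolor")
plt.show()
```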

Now let's compute the sample covariance between petal width and petal length, as defined in (3):
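
A sketch of that computation, reusing the versicolor dataframe from the previous snippet; both the sum form of equation (3) and the dot-product form of equation (4) are shown:

```python
import numpy as np

# Petal length and petal width for the versicolor examples (as above).
x_i = versicolor["petal length (cm)"].to_numpy()
x_j = versicolor["petal width (cm)"].to_numpy()
N = len(x_i)

# Equation (3): sum of products of deviations from the sample means,
# normalised by N - 1.
cov_sum = np.sum((x_i - x_i.mean()) * (x_j - x_j.mean())) / (N - 1)

# Equation (4): the same quantity written as a dot product.
cov_dot = (x_i - x_i.mean()) @ (x_j - x_j.mean()) / (N - 1)

print(cov_sum, cov_dot)  # the two forms agree
```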

While the above works, we would need to write a bit of code every time we wanted to compute covariances between multiple variables. Fortunately, NumPy provides the cov function, which computes the entire covariance matrix given a multivariate dataset. Let's compute the covariance matrix across all variables for the versicolor data.
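
A sketch using np.cov (the column order of the preloaded dataframe isn't shown here, so the columns below are ordered with petal length and petal width first, matching the discussion that follows):

```python
# Covariance matrix across all four measurement columns; rowvar=False tells
# np.cov to treat columns as variables and rows as observations.
cols = ["petal length (cm)", "petal width (cm)",
        "sepal length (cm)", "sepal width (cm)"]
C = np.cov(versicolor[cols].to_numpy(), rowvar=False)
print(C)
print(C[1, 0])  # covariance between petal width and petal length
```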

The matrix above is the covariance matrix for the versicolor examples in the Iris dataset. Along the diagonal we have the sample variances for each of the variables, and the off-diagonal entries tell us the degree to which different variables covary. For instance, the first element of the second row gives us the covariance between the second variable (petal width) and the first variable (petal length). As you can see, this matches the value we got by computing the covariance manually using equation (3).

The other thing to note is that the covariance matrix is symmetric. That is, the covariance between $\mathbf{x}_1$ and $\mathbf{x}_2$ is the same as the covariance between $\mathbf{x}_2$ and $\mathbf{x}_1$. Thus, for any valid covariance matrix, $C_{ij} = C_{ji}$.
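
We can check this numerically on the matrix computed above:

```python
print(np.allclose(C, C.T))  # True: C[i, j] == C[j, i] for all i, j
```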

So what does covariance actually mean? We know it expresses the degree to which two variables covary, but let's look at equation (4) for a moment: we are defining covariance as (a scaled) dot product between two mean-centred vectors. Back in the Linear Algebra module, we noted that the dot product is a measure of similarity, but we also noted that it is heavily dependent on the magnitude of the vectors. This is where correlation comes in.
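
A quick illustration of that magnitude dependence, reusing the petal measurements from the earlier snippet: rescaling one variable rescales the covariance, even though the underlying relationship is unchanged.

```python
# Expressing petal length in millimetres instead of centimetres multiplies
# its covariance with petal width by 10, although the relationship between
# the two variables is exactly the same.
petal_length_mm = 10 * x_i
print(np.cov(x_i, x_j)[0, 1])              # covariance in cm units
print(np.cov(petal_length_mm, x_j)[0, 1])  # 10x larger
```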

Correlation

Correlation is a normalized version of covariance. It can be defined as:

\begin{equation} \text{corr}[X_1, X_2] = \frac{\text{cov}[X_1, X_2]}{\sqrt{\text{var}[X_1]\,\text{var}[X_2]}} \end{equation}

The nice thing about this normalisation is that correlation values are now limited between -1 (perfect anticorrelation) and 1 (perfect correlation), regardless of the magnitude of the vectors involved. As with covariance, we can build a correlation matrix whose entries $ij$ represent the correlation between $\mathbf{x}_i$ and $\mathbf{x}_j$.
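
NumPy provides corrcoef for this, mirroring cov. A sketch on the same four versicolor columns, together with a check that normalising the covariance matrix as in equation (5) gives the same result:

```python
# Correlation matrix for the same four variables (columns as variables).
R = np.corrcoef(versicolor[cols].to_numpy(), rowvar=False)
print(R)

# Equation (5) applied entry-wise to the covariance matrix C from above.
d = np.sqrt(np.diag(C))    # standard deviations of each variable
print(C / np.outer(d, d))  # matches R
```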

One thing to keep in mind is that covariance and correlation only capture linear relationships. For instance, it is possible that variables $x_i$ and $x_j$ are deterministically related by some non-linear transformation and yet their correlation may be 0. A correlation value of 0 only indicates the absence of a linear relationship between two variables.
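
A small illustration of this point with synthetic data: below, $y$ is completely determined by $x$ through a non-linear (quadratic) transformation, yet the sample correlation is close to zero.

```python
# y = x**2 with x symmetric around 0: there is no linear trend, so the
# correlation is (approximately) zero even though y is fully determined by x.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=10_000)
y = x ** 2
print(np.corrcoef(x, y)[0, 1])  # close to 0
```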