
Ridge regression


As we mentioned in the previous section, the least squares minimisation problem has multiple solutions when input features are correlated with one another. To see why that happens, consider the extreme example of a linear model with 3 perfectly correlated features:

\begin{equation} \hat{y} = w_{1} x_1 + w_{2} x_2 + w_{3} x_3 + b, \end{equation}

where $x_2 = 2x_1$ and $x_3 = 5x_1$. You can check that for any choice of $(x_1, x_2, x_3)$ that respects the two previous equalities, you would get the same prediction under the following weights:

\begin{align} \hat{y} &= 10 x_1 + 5 x_2 + 2 x_3 + b\\[3 ex] \hat{y} &= 5 x_1 + 10 x_2 + 1 x_3 + b\\[3 ex] \hat{y} &= 50 x_1 - 5 x_2 - 2 x_3 + b. \end{align}

The weights above are clearly very different, and there are infinitely many combinations of weights that would produce the same output: for instance, we could keep increasing $w_1$ while decreasing the values of $w_2$ and $w_3$ accordingly to produce the same result. This is problematic from an interpretability viewpoint.
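To make the non-uniqueness concrete, here is a quick numerical check. This is a minimal sketch using NumPy; the specific values of $x_1$ and the bias are arbitrary choices for illustration.

```python
import numpy as np

# Any x1 works; x2 and x3 are fixed by the correlations x2 = 2*x1 and x3 = 5*x1.
x1 = np.linspace(-3, 3, 7)
x2, x3 = 2 * x1, 5 * x1
b = 0.5  # arbitrary bias

# Three very different weight vectors, taken from the equations above.
weight_sets = [(10, 5, 2), (5, 10, 1), (50, -5, -2)]

for w1, w2, w3 in weight_sets:
    y_hat = w1 * x1 + w2 * x2 + w3 * x3 + b
    print(w1, w2, w3, "->", y_hat)  # identical rows: each equals 30*x1 + b
```

All three weight vectors print exactly the same predictions, which is the ambiguity described above.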

One way to avoid the issues introduced by highly correlated input features is to penalise the choice of large weights—this is known as regularisation. As we will see later, there are a number of regularisation techniques for linear regression, and they differ in the penalty they impose on the weights.

Ridge regression

Ridge regression is a regularisation technique that gives preference to weights with smaller magnitude, i.e. a smaller $\ell_2$-norm. In particular, the loss function for ridge regression is

\begin{equation} L(\mathbf{w}) = (\mathbf{y} - \mathbf{X}\mathbf{w})^T (\mathbf{y} - \mathbf{X}\mathbf{w}) + \lambda||\mathbf{w}||_2^2, \end{equation}

where $\lambda$ is a regularisation constant (also called the ridge parameter) which controls the extent to which larger weight vectors are penalised. Note that $\lambda||\mathbf{w}||_2^2 = ||\sqrt{\lambda}\,\mathbf{I}\mathbf{w}||_2^2$, where $\mathbf{I}$ is the identity matrix. Thus, the loss can be written as:

\begin{equation} L(\mathbf{w}) = (\mathbf{y} - \mathbf{X}\mathbf{w})^T (\mathbf{y} - \mathbf{X}\mathbf{w}) + (\sqrt{\lambda}\mathbf{I}\mathbf{w})^T(\sqrt{\lambda}\mathbf{I}\mathbf{w}). \end{equation}

Now we set the derivative of the loss to zero and solve for $\mathbf{w}$:

\begin{align} &\frac{\partial}{\partial \mathbf{w}}\left((\mathbf{y} - \mathbf{X}\mathbf{w})^T (\mathbf{y} - \mathbf{X}\mathbf{w}) + (\sqrt{\lambda}\mathbf{I}\mathbf{w})^T(\sqrt{\lambda}\mathbf{I}\mathbf{w}) \right) = 0 \\[4 ex] \implies &\frac{\partial}{\partial \mathbf{w}} \left( \mathbf{y}^T\mathbf{y} - 2\mathbf{w}^T\mathbf{X}^T\mathbf{y} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} + \lambda\mathbf{w}^T\mathbf{I}\mathbf{w} \right) = 0 \\[4 ex] \implies &-2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\mathbf{w} + 2\lambda\mathbf{I}\mathbf{w} = 0\\[4 ex] \implies &\mathbf{w} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y} \end{align}

Adding the diagonal matrix $\lambda\mathbf{I}$ to $\mathbf{X}^T\mathbf{X}$ effectively reduces the relative size of the off-diagonal correlation terms, making the resulting matrix better conditioned and, for $\lambda > 0$, always invertible.
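As a concrete illustration of the closed-form solution above, here is a minimal NumPy sketch. The data is synthetic, the bias term is omitted for brevity, and the helper name `ridge_fit` is ours rather than part of any library:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X^T X + lam*I)^(-1) X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    # Solving the linear system is more stable than forming the inverse explicitly.
    return np.linalg.solve(A, X.T @ y)

# Synthetic data with perfectly correlated features, as in the example above.
rng = np.random.default_rng(0)
x1 = rng.normal(size=(100, 1))
X = np.hstack([x1, 2 * x1, 5 * x1])
y = 30 * x1[:, 0] + rng.normal(scale=0.1, size=100)

w = ridge_fit(X, y, lam=1.0)
print(w)  # a single, well-defined weight vector despite the collinearity
```

Even though ordinary least squares has no unique solution for this collinear design, the ridge system is invertible and returns one definite weight vector.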

Example

Let's take the dataset we used in the previous section and apply ridge regression to it. We will consider varying values of $\lambda$ and we will see what happens to the magnitude of the vector $\mathbf{w}$.
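Since that dataset is not reproduced here, the sketch below uses synthetic, perfectly collinear data of the same kind as above and simply reports $||\mathbf{w}||_2$ for a range of $\lambda$ values; the data and the chosen $\lambda$ grid are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=(100, 1))
X = np.hstack([x1, 2 * x1, 5 * x1])          # perfectly correlated features
y = 30 * x1[:, 0] + rng.normal(scale=0.1, size=100)

for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    # Closed-form ridge solution for each regularisation constant.
    w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    print(f"lambda = {lam:7.2f}   ||w||_2 = {np.linalg.norm(w):.4f}")
```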

As expected, the norm of the estimated weight vector decreases as we increase the regularisation constant $\lambda$.

But how do we choose an appropriate value for $\lambda$? In a sense, $\lambda$ is also a parameter of the model, but we are not fitting it to data. Such parameters are called hyperparameters.

Hyperparameter selection is commonly done via cross-validation. The idea behind it is quite simple:

  1. We pick a value for $\lambda$
  2. We split the data into two sets, a training set and a validation set
  3. We fit the model to the training set, using the selected value of $\lambda$
  4. We check how well the model predicts the validation set

We repeat these steps with different values of $\lambda$ and different splits of training and validation data. In the end, we pick the value of $\lambda$ that best predicted the validation set. If this sounds a bit confusing, don't worry: we will see cross-validation in action in subsequent modules.
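For a rough picture of what that loop looks like in code, here is a minimal sketch. It uses synthetic data, repeated random 80/20 splits rather than the full cross-validation procedure covered later, and the helper name `validation_error` is ours:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

def validation_error(lam, n_splits=5):
    """Average validation MSE of the closed-form ridge fit over random splits."""
    errors = []
    for _ in range(n_splits):
        # Random 80/20 split into training and validation indices.
        idx = rng.permutation(len(y))
        train, val = idx[:160], idx[160:]
        w = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(3),
                            X[train].T @ y[train])
        errors.append(np.mean((y[val] - X[val] @ w) ** 2))
    return np.mean(errors)

# Try a grid of candidate values and keep the one with the lowest validation error.
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
best = min(candidates, key=validation_error)
print("chosen lambda:", best)
```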