
Ridge regression


As we mentioned in the previous section, the least squares minimisation problem has multiple solutions when input features are correlated with one another. To see why that happens, consider the extreme example of a linear model with 3 perfectly correlated features:

\begin{equation} \hat{y} = w_{1} x_1 + w_{2} x_2 + w_{3} x_3 + b, \end{equation}

where $x_2 = 2x_1$ and $x_3 = 5x_1$. You can check that for any choice of $(x_1, x_2, x_3)$ that respects the two previous equalities, you would get the same prediction under the following weights:

\begin{align} \hat{y} &= 10 x_1 + 5 x_2 + 2 x_3 + b\\[3 ex] \hat{y} &= 5 x_1 + 10 x_2 + 1 x_3 + b\\[3 ex] \hat{y} &= 50 x_1 - 5 x_2 - 2 x_3 + b. \end{align}

The weights above are clearly very different, and there are infinitely many combinations of weights that would produce the same output: for instance, we could keep increasing $w_1$ while decreasing the values of $w_2$ and $w_3$ accordingly to produce the same result. This is problematic from an interpretability viewpoint.
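To make the non-uniqueness concrete, here is a quick numerical check. This is a minimal sketch using NumPy; the specific values of $x_1$ and the bias are arbitrary choices for illustration.

```python
import numpy as np

# Any x1 works; x2 and x3 are fixed by the correlations x2 = 2*x1 and x3 = 5*x1.
x1 = np.linspace(-3, 3, 7)
x2, x3 = 2 * x1, 5 * x1
b = 0.5  # arbitrary bias

# Three very different weight vectors, taken from the equations above.
weight_sets = [(10, 5, 2), (5, 10, 1), (50, -5, -2)]

for w1, w2, w3 in weight_sets:
    y_hat = w1 * x1 + w2 * x2 + w3 * x3 + b
    print(w1, w2, w3, "->", y_hat)  # identical rows: each equals 30*x1 + b
```

All three weight vectors print exactly the same predictions, which is the ambiguity described above.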

One way to avoid the issues introduced by highly correlated input features is to penalise the choice of large weights—this is known as regularisation. As we will see later, there are a number of regularisation techniques for linear regression, and they differ in the penalty they impose on the weights.

Ridge regression

Ridge regression is a regularisation technique that gives preference to weights with smaller magnitude, i.e. a smaller $\ell_2$-norm. In particular, the loss function for ridge regression is

\begin{equation} L(\mathbf{w}) = (\mathbf{y} - \mathbf{X}\mathbf{w})^T (\mathbf{y} - \mathbf{X}\mathbf{w}) + \lambda||\mathbf{w}||_2^2, \end{equation}

where $\lambda$ is a regularisation constant (also called the ridge parameter) which controls the extent to which larger weight vectors are penalised. Note that $\lambda||\mathbf{w}||_2^2 = ||\sqrt{\lambda}\,\mathbf{I}\mathbf{w}||_2^2$, where $\mathbf{I}$ is the identity matrix. Thus, the loss can be written as:

\begin{equation} L(\mathbf{w}) = (\mathbf{y} - \mathbf{X}\mathbf{w})^T (\mathbf{y} - \mathbf{X}\mathbf{w}) + (\sqrt{\lambda}\mathbf{I}\mathbf{w})^T(\sqrt{\lambda}\mathbf{I}\mathbf{w}). \end{equation}

Now we set the derivative of the loss to zero and solve for $\mathbf{w}$:

\begin{align} &\frac{\partial}{\partial \mathbf{w}}\left((\mathbf{y} - \mathbf{X}\mathbf{w})^T (\mathbf{y} - \mathbf{X}\mathbf{w}) + (\sqrt{\lambda}\mathbf{I}\mathbf{w})^T(\sqrt{\lambda}\mathbf{I}\mathbf{w}) \right) = 0 \\[4 ex] \implies &\frac{\partial}{\partial \mathbf{w}} \left( \mathbf{y}^T\mathbf{y} - 2\mathbf{w}^T\mathbf{X}^T\mathbf{y} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} + \lambda\mathbf{w}^T\mathbf{I}\mathbf{w} \right) = 0 \\[4 ex] \implies &-2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\mathbf{w} + 2\lambda\mathbf{I}\mathbf{w} = 0\\[4 ex] \implies &\mathbf{w} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y} \end{align}

Adding the diagonal matrix $\lambda\mathbf{I}$ to $\mathbf{X}^T\mathbf{X}$ effectively reduces the relative size of the off-diagonal correlation terms, making the resulting matrix better conditioned and, for $\lambda > 0$, always invertible.
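As a concrete illustration of the closed-form solution above, here is a minimal NumPy sketch. The data is synthetic, the bias term is omitted for brevity, and the helper name `ridge_fit` is ours rather than part of any library:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X^T X + lam*I)^(-1) X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    # Solving the linear system is more stable than forming the inverse explicitly.
    return np.linalg.solve(A, X.T @ y)

# Synthetic data with perfectly correlated features, as in the example above.
rng = np.random.default_rng(0)
x1 = rng.normal(size=(100, 1))
X = np.hstack([x1, 2 * x1, 5 * x1])
y = 30 * x1[:, 0] + rng.normal(scale=0.1, size=100)

w = ridge_fit(X, y, lam=1.0)
print(w)  # a single, well-defined weight vector despite the collinearity
```

Even though ordinary least squares has no unique solution for this collinear design, the ridge system is invertible and returns one definite weight vector.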

Example

Let's take the dataset we used in the previous section and apply ridge regression to it. We will consider varying values of $\lambda$ and we will see what happens to the magnitude of the vector $\mathbf{w}$.
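Since that dataset is not reproduced here, the sketch below uses synthetic, perfectly collinear data of the same kind as above and simply reports $||\mathbf{w}||_2$ for a range of $\lambda$ values; the data and the chosen $\lambda$ grid are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=(100, 1))
X = np.hstack([x1, 2 * x1, 5 * x1])          # perfectly correlated features
y = 30 * x1[:, 0] + rng.normal(scale=0.1, size=100)

for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    # Closed-form ridge solution for each regularisation constant.
    w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    print(f"lambda = {lam:7.2f}   ||w||_2 = {np.linalg.norm(w):.4f}")
```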

As expected, the norm of the estimated weight vector decreases as we increase the regularisation constant $\lambda$.

But how do we choose an appropriate value for $\lambda$? In a sense, $\lambda$ is also a parameter of the model, but we are not fitting it to data. Such parameters are called hyperparameters.

Hyperparameter selection is commonly done via cross-validation. The idea behind it is quite simple:

  1. We pick a value for $\lambda$
  2. We split the data into two sets, a training set and a validation set
  3. We fit the model to the training set, using the selected value of $\lambda$
  4. We check how well the model predicts the validation set

We repeat these steps with different values of $\lambda$ and different splits of training and validation data. In the end, we pick the value of $\lambda$ that best predicted the validation set. If this sounds a bit confusing, don't worry: we will see cross-validation in action in subsequent modules.
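For a rough picture of what that loop looks like in code, here is a minimal sketch. It uses synthetic data, repeated random 80/20 splits rather than the full cross-validation procedure covered later, and the helper name `validation_error` is ours:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

def validation_error(lam, n_splits=5):
    """Average validation MSE of the closed-form ridge fit over random splits."""
    errors = []
    for _ in range(n_splits):
        # Random 80/20 split into training and validation indices.
        idx = rng.permutation(len(y))
        train, val = idx[:160], idx[160:]
        w = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(3),
                            X[train].T @ y[train])
        errors.append(np.mean((y[val] - X[val] @ w) ** 2))
    return np.mean(errors)

# Try a grid of candidate values and keep the one with the lowest validation error.
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
best = min(candidates, key=validation_error)
print("chosen lambda:", best)
```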