
Lasso and elastic net


In addition to ridge regression, two other regularisation techniques are commonly applied to linear regression: (i) Lasso, which stands for Least Absolute Shrinkage and Selection Operator, and (ii) elastic net.

Both methods attempt to discourage large values for the weights of the model, but they differ in how they define the penalty applied in the loss function.

Lasso

Lasso attempts to keep the $\ell_1$-norm of the weight vector as small as possible. The corresponding loss function is:

\begin{equation} L(\mathbf{w}) = (\mathbf{y} - \mathbf{X}\mathbf{w})^T (\mathbf{y} - \mathbf{X}\mathbf{w}) + \lambda ||\mathbf{w}||_1, \end{equation}

where $||\mathbf{w}||_1 = \sum_{i=1}^N |w_i|$ is the $\ell_1$-norm of the vector $\mathbf{w}$. Note that the loss is no longer differentiable everywhere, because of the absolute value in $||\mathbf{w}||_1$. Thus, a closed-form solution for the parameter estimates is generally not available, and we must instead rely on numerical optimisation to arrive at the Lasso estimates. In practice, this is often done using an optimisation algorithm called coordinate descent, which updates one weight at a time while holding the others fixed.
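To make this concrete, here is a minimal sketch of coordinate descent for the Lasso loss above. It is a pedagogical illustration rather than the implementation used by any particular library: it assumes the data are centred so that no intercept is needed, and it cycles through the coordinates for a fixed number of passes rather than testing for convergence.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Soft-thresholding operator: the closed-form minimiser of the
    one-dimensional Lasso problem for a single coordinate."""
    if rho < -lam:
        return rho + lam
    elif rho > lam:
        return rho - lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimise (y - Xw)^T (y - Xw) + lam * ||w||_1 by cycling
    through the coordinates of w, updating one weight at a time."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iters):
        for j in range(n_features):
            # Residual with feature j's current contribution removed
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j
            z = X[:, j] @ X[:, j]
            # The threshold is lam / 2 because the squared-error term
            # in our loss is not halved; its derivative carries a factor of 2.
            w[j] = soft_threshold(rho, lam / 2) / z
    return w

# Illustrative usage on synthetic data with a sparse true weight vector
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
w_true = np.array([3.0, -2.0] + [0.0] * 8)
y = X @ w_true + 0.1 * rng.standard_normal(100)
print(lasso_coordinate_descent(X, y, lam=10.0).round(2))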

Elastic net

Elastic net imposes a combination of $\ell_1$ and $\ell_2$ penalties, as follows:

\begin{equation} L(\mathbf{w}) = (\mathbf{y} - \mathbf{X}\mathbf{w})^T (\mathbf{y} - \mathbf{X}\mathbf{w}) + \lambda_1 ||\mathbf{w}||_1 + \lambda_2 ||\mathbf{w}||_2^2. \end{equation}

Instead of a single regularisation parameter, we now have two parameters, $\lambda_1$ and $\lambda_2$, which control the extent to which we penalise the $\ell_1$- and $\ell_2$-norms, respectively. As with Lasso, we rely on numerical optimisation algorithms to estimate the parameters $\mathbf{w}$ that minimise the loss.
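As a quick illustration, the sketch below fits an elastic net with scikit-learn. Note that scikit-learn's ElasticNet uses a different parameterisation from the loss above: a single strength alpha and a mixing weight l1_ratio, with the squared-error term scaled by $1/(2n)$, so the pair (alpha, l1_ratio) plays the role of $\lambda_1$ and $\lambda_2$ only up to that rescaling.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic data with a sparse true weight vector
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
w_true = np.array([3.0, -2.0] + [0.0] * 8)
y = X @ w_true + 0.1 * rng.standard_normal(100)

# alpha sets the overall penalty strength; l1_ratio splits it between
# the l1 part (l1_ratio) and the l2 part (1 - l1_ratio).
model = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False)
model.fit(X, y)
print(model.coef_.round(2))
```

With a moderate l1_ratio, the fitted coefficients typically show the characteristic elastic-net behaviour: irrelevant weights are driven to (or near) zero by the $\ell_1$ part, while the $\ell_2$ part shrinks the remaining weights smoothly.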