
Multinomial logistic regression


In this module, we will focus on modifying the binary logistic regression algorithm to solve multiclass classification problems. The first thing we will need to change is the output of the model: in the binary case, we had a single output which we thresholded to obtain a binary prediction, either 0 or 1. So, what sort of output do we use if we want to discriminate between $K$ different classes?

The solution is to use a technique called one-hot encoding, which consists of representing class $i$ by a vector of length $K$ where the $i^{th}$ element equals 1 and all other elements equal 0. For instance, in the Iris dataset, we can one-hot encode the Iris species as follows (a short code sketch is given after the list):

  • Setosa: $[1, 0, 0]$
  • Versicolor: $[0, 1, 0]$
  • Virginica: $[0, 0, 1]$
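
As an illustration, here is a minimal NumPy sketch of one-hot encoding; the integer labels and the species-to-index mapping are assumptions made for this example only:

```python
import numpy as np

# Hypothetical integer labels: 0 = Setosa, 1 = Versicolor, 2 = Virginica
labels = np.array([0, 2, 1, 0])
K = 3  # number of classes

# Row i of the result has a 1 in column labels[i] and 0 elsewhere
one_hot = np.eye(K)[labels]
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]
```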

Given instances of the classes above, we want the model to produce an output that is as close as possible to the one-hot encoded vectors. However, we want the values of the different elements of the vector to be bounded between 0 and 1, so that we can interpret them as probabilities. This leads us to the following modifications:

  1. The model weights should map the inputs to a $K$-dimensional vector as opposed to a scalar.
  2. The sigmoid function must be replaced by a function that takes as input a $K$-dimensional vector and squashes all the values in it so that
    • the resulting values are in the interval $[0, 1]$
    • the sum of the values is 1, since we assume that the instance must belong to one of the $K$ classes.

To address the first change, we can simply replace the weight vector by an $n$-by-$K$ weight matrix, where $n$ is the number of features and $K$ is the number of classes. The weight matrix $\mathbf{W} \in \mathbb{R}^{n \times K}$ can be written as

\begin{equation} \mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \cdots, \mathbf{w}_K] \end{equation}

where $\mathbf{w}_k$ represents the weights attributed to class $k$.

As for the second modification, we can make use of the softmax function:

\begin{equation} \text{softmax}(x)_i = \frac{\exp x_i}{\sum_{k=1}^K \exp x_k} \end{equation}
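
As a quick sanity check, here is a minimal NumPy sketch of the softmax; subtracting the maximum score is an assumed numerical-stability trick and is not part of the definition above (it leaves the result unchanged):

```python
import numpy as np

def softmax(x):
    """Softmax of a 1-D vector of class scores x."""
    # Subtracting the maximum avoids overflow in exp without changing the result
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))        # approximately [0.659 0.242 0.099]
print(softmax(scores).sum())  # 1.0
```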

The multiclass logistic regression model can thus be rewritten as:

\begin{equation} \mathbf{\hat{y}} = \text{softmax}(\mathbf{W}^T\mathbf{x}) \end{equation}

where $\mathbf{\hat{y}}$ is a $K$-by-1 vector containing the estimated class probabilities for each of the $K$ classes, $\mathbf{W}$ is the $n$-by-$K$ weight matrix and $\mathbf{x}$ is an $n$-by-1 vector containing the input data for a single example. We can modify equation (3) so that we process multiple input examples at once, in which case we can write

\begin{equation} \mathbf{\hat{Y}} = \text{softmax}(\mathbf{X}\mathbf{W}), \end{equation}

where $\mathbf{X} \in \mathbb{R}^{m \times n}$ is a matrix where rows represent different instances and columns represent features, and $\mathbf{\hat{Y}} \in \mathbb{R}^{m \times K}$ contains the softmax predictions for the $m$ instances and $K$ classes (the softmax operation is applied row-wise).
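
Putting equations (1)–(4) together, a batched forward pass can be sketched as follows; the random data, the dimensions and the variable names are assumptions chosen to match the notation in the text:

```python
import numpy as np

m, n, K = 4, 2, 3              # examples, features, classes
rng = np.random.default_rng(0)

X = rng.normal(size=(m, n))    # one example per row
W = rng.normal(size=(n, K))    # one weight vector per class (one column per class)

scores = X @ W                 # m-by-K matrix of class scores

# Row-wise softmax: each row of Y_hat sums to 1
Z = np.exp(scores - scores.max(axis=1, keepdims=True))
Y_hat = Z / Z.sum(axis=1, keepdims=True)

print(Y_hat.shape)             # (4, 3)
print(Y_hat.sum(axis=1))       # [1. 1. 1. 1.]
```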

Loss, gradient and Hessian

Let's recall the definition of the loss function in the binary case, where $y_i$ was a scalar:

\begin{equation} L(\mathbf{w}) = - \sum_{i=1}^m \left[ y_i\log \left(\sigma(\mathbf{w}^T\mathbf{x}_i)\right) + \left(1-y_i\right)\log\left( 1-\sigma(\mathbf{w}^T\mathbf{x}_i) \right) \right]. \end{equation}

Now let's rearrange (5) so that it applies to the multiclass formulation, of which the binary case is simply the special case $K=2$. The target $\mathbf{y}_i$ is now a $K$-dimensional one-hot vector, so we can write

\begin{equation} L(\mathbf{W}) = - \sum_{i=1}^m \sum_{k=1}^K y_{ik}\log \hat{y}_{ik}, \end{equation}

where $y_{ik}$ and $\hat{y}_{ik}$ are the target and predicted probabilities for example $i$ and class $k$.
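
As a sketch, the loss in (6) can be computed directly from the one-hot target matrix and the predicted probabilities; the names `Y` and `Y_hat` and the small epsilon added for numerical safety are assumptions:

```python
import numpy as np

def cross_entropy(Y, Y_hat, eps=1e-12):
    """Multiclass cross-entropy loss, equation (6). Y and Y_hat are m-by-K."""
    # Only the entries where Y == 1 contribute, because Y is one-hot
    return -np.sum(Y * np.log(Y_hat + eps))

Y = np.array([[1, 0, 0],
              [0, 0, 1]])
Y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
print(cross_entropy(Y, Y_hat))   # -(log 0.7 + log 0.6) ≈ 0.867
```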

Just as the loss follows naturally from the binary case, so does the gradient. In the binary case, the partial derivative of the loss (for a single example) with respect to weight $w_j$ is given by

\begin{equation} \frac{\partial L(\mathbf{w})}{\partial w_j} = (\hat{y} - y) x_j. \end{equation}

To generalise to the multiclass scenario, we need the gradients for the weights of each of the $K$ classes, $\mathbf{w}_k$ where $k \in \{1,2,\dots,K\}$, as per equation (1). This takes the following form:

\begin{equation} \frac{\partial L(\mathbf{W})}{\partial w_{kj}} = (\hat{y}_k - y_k) x_j. \end{equation}
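
Summing equation (8) over all $m$ examples gives the full-batch gradient, which in matrix form is $\mathbf{X}^T(\mathbf{\hat{Y}} - \mathbf{Y})$. A minimal sketch, with variable names that are assumptions:

```python
import numpy as np

def grad_W(X, Y, Y_hat):
    """Full-batch gradient of the cross-entropy loss with respect to W.

    X     : m-by-n input matrix
    Y     : m-by-K one-hot targets
    Y_hat : m-by-K softmax predictions
    Returns an n-by-K matrix, i.e. equation (8) summed over the m examples.
    """
    return X.T @ (Y_hat - Y)
```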

Example

Let's now take a look at how to implement multinomial logistic regression using the Iris dataset. This time, we will attempt to discriminate between all 3 species of Iris flowers: Setosa, Versicolor and Virginica. We will also include two more features: sepal width and sepal length.
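
The module's original code is not reproduced here; below is a minimal sketch of how such an implementation could look using NumPy and scikit-learn's copy of the Iris dataset. The learning rate, the number of iterations and the use of plain batch gradient descent are assumptions made for this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load all four features (sepal length/width, petal length/width) and the 3 species
iris = load_iris()
X = iris.data                          # shape (150, 4)
y = iris.target                        # integer labels 0, 1, 2

m, n = X.shape
K = 3

# One-hot encode the targets and prepend a bias column to X
Y = np.eye(K)[y]
X = np.hstack([np.ones((m, 1)), X])    # shape (150, 5)

def softmax_rows(S):
    """Row-wise softmax of an m-by-K score matrix."""
    Z = np.exp(S - S.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

# Batch gradient descent on the cross-entropy loss
W = np.zeros((n + 1, K))
learning_rate = 0.01                   # assumed value
for _ in range(5000):                  # assumed number of iterations
    Y_hat = softmax_rows(X @ W)        # equation (4)
    grad = X.T @ (Y_hat - Y)           # equation (8), summed over examples
    W -= learning_rate * grad / m

accuracy = np.mean(np.argmax(X @ W, axis=1) == y)
print(f"Training accuracy: {accuracy:.2f}")  # expected to be high: Iris is close to linearly separable
```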

In summary, multinomial logistic regression extends binary logistic regression by (i) using one weight vector per class and (ii) using a softmax operation instead of a sigmoid function. One thing to note is that we are still in the linear domain: unless non-linear transformations are applied to the inputs (e.g. by manual feature engineering), the logistic regression model is only capable of discriminating between linearly separable classes, since the decision boundaries are linear.