
Multinomial logistic regression


In this module, we will focus on modifying the binary logistic regression algorithm to solve multiclass classification problems. The first thing we will need to change is the output of the model: in the binary case, we had a single output which we thresholded to obtain a binary prediction, either 0 or 1. So, what sort of output do we use if we want to discriminate between $K$ different classes?

The solution is to use a technique called one-hot encoding, which consists of representing class $i$ by a vector of length $K$ where the $i^{th}$ element equals 1 and all other elements equal 0. For instance, in the Iris dataset, we can one-hot encode the Iris species as follows (a short code sketch is given after the list):

  • Setosa: $[1, 0, 0]$
  • Versicolor: $[0, 1, 0]$
  • Virginica: $[0, 0, 1]$
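
As an illustration, here is a minimal NumPy sketch of one-hot encoding; the integer labels and the species-to-index mapping are assumptions made for this example only:

```python
import numpy as np

# Hypothetical integer labels: 0 = Setosa, 1 = Versicolor, 2 = Virginica
labels = np.array([0, 2, 1, 0])
K = 3  # number of classes

# Row i of the result has a 1 in column labels[i] and 0 elsewhere
one_hot = np.eye(K)[labels]
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]
```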

Given instances of the classes above, we want the model to produce an output that is as close as possible to the one-hot encoded vectors. However, we want the values of the different elements of the vector to be bounded between 0 and 1, so that we can interpret them as probabilities. This leads us to the following modifications:

  1. The model weights should map the inputs to a $K$-dimensional vector as opposed to a scalar.
  2. The sigmoid function must be replaced by a function that takes as input a $K$-dimensional vector and squashes all the values in it so that
    • the resulting values are in the interval $[0, 1]$
    • the sum of the values is 1, since we assume that the instance must belong to one of the $K$ classes.

To address the first change, we can simply replace the weight vector by an $n$-by-$K$ weight matrix, where $n$ is the number of features and $K$ is the number of classes. The weight matrix $\mathbf{W} \in \mathbb{R}^{n \times K}$ can be written as

\begin{equation} \mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \cdots, \mathbf{w}_K] \end{equation}

where $\mathbf{w}_k$ represents the weights attributed to class $k$.

As for the second modification, we can make use of the softmax function:

\begin{equation} \text{softmax}(x)_i = \frac{\exp x_i}{\sum_{k=1}^K \exp x_k} \end{equation}
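
As a quick sanity check, here is a minimal NumPy sketch of the softmax; subtracting the maximum score is an assumed numerical-stability trick and is not part of the definition above (it leaves the result unchanged):

```python
import numpy as np

def softmax(x):
    """Softmax of a 1-D vector of class scores x."""
    # Subtracting the maximum avoids overflow in exp without changing the result
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))        # approximately [0.659 0.242 0.099]
print(softmax(scores).sum())  # 1.0
```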

The multiclass logistic regression model can thus be rewritten as:

\begin{equation} \mathbf{\hat{y}} = \text{softmax}(\mathbf{W}^T\mathbf{x}) \end{equation}

where $\mathbf{\hat{y}}$ is a $K$-by-1 vector containing the estimated class probabilities for each of the $K$ classes, $\mathbf{W}$ is the $n$-by-$K$ weight matrix and $\mathbf{x}$ is an $n$-by-1 vector containing the input data for a single example. We can modify equation (3) so that we process multiple input examples at once, in which case we can write

\begin{equation} \mathbf{\hat{Y}} = \text{softmax}(\mathbf{X}\mathbf{W}), \end{equation}

where $\mathbf{X} \in \mathbb{R}^{m \times n}$ is a matrix where rows represent different instances and columns represent features, and $\mathbf{\hat{Y}} \in \mathbb{R}^{m \times K}$ contains the softmax predictions for the $m$ instances and $K$ classes (the softmax operation is applied row-wise).
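
Putting equations (1)–(4) together, a batched forward pass can be sketched as follows; the random data, the dimensions and the variable names are assumptions chosen to match the notation in the text:

```python
import numpy as np

m, n, K = 4, 2, 3              # examples, features, classes
rng = np.random.default_rng(0)

X = rng.normal(size=(m, n))    # one example per row
W = rng.normal(size=(n, K))    # one weight vector per class (one column per class)

scores = X @ W                 # m-by-K matrix of class scores

# Row-wise softmax: each row of Y_hat sums to 1
Z = np.exp(scores - scores.max(axis=1, keepdims=True))
Y_hat = Z / Z.sum(axis=1, keepdims=True)

print(Y_hat.shape)             # (4, 3)
print(Y_hat.sum(axis=1))       # [1. 1. 1. 1.]
```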

Loss, gradient and Hessian

Let's recall the definition of the loss function in the binary case, where $y_i$ was a scalar:

\begin{equation} L(\mathbf{w}) = - \sum_{i=1}^m \left[ y_i\log \left(\sigma(\mathbf{w}^T\mathbf{x}_i)\right) + \left(1-y_i\right)\log\left( 1-\sigma(\mathbf{w}^T\mathbf{x}_i) \right) \right]. \end{equation}

Now let's rearrange (5) so that it applies to the multiclass formulation, of which the binary case is simply the special case $K=2$. The target $\mathbf{y}_i$ is now a $K$-dimensional one-hot vector, so we can write

\begin{equation} L(\mathbf{W}) = - \sum_{i=1}^m \sum_{k=1}^K y_{ik}\log \hat{y}_{ik}, \end{equation}

where $y_{ik}$ and $\hat{y}_{ik}$ are the target and predicted probabilities for example $i$ and class $k$.
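
As a sketch, the loss in (6) can be computed directly from the one-hot target matrix and the predicted probabilities; the names `Y` and `Y_hat` and the small epsilon added for numerical safety are assumptions:

```python
import numpy as np

def cross_entropy(Y, Y_hat, eps=1e-12):
    """Multiclass cross-entropy loss, equation (6). Y and Y_hat are m-by-K."""
    # Only the entries where Y == 1 contribute, because Y is one-hot
    return -np.sum(Y * np.log(Y_hat + eps))

Y = np.array([[1, 0, 0],
              [0, 0, 1]])
Y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
print(cross_entropy(Y, Y_hat))   # -(log 0.7 + log 0.6) ≈ 0.867
```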

Just as the loss follows naturally from the binary case, so does the gradient. In the binary case, the partial derivative of the loss (for a single example) with respect to weight $w_j$ is given by

\begin{equation} \frac{\partial L(\mathbf{w})}{\partial w_j} = (\hat{y} - y) x_j. \end{equation}

To generalise to the multiclass scenario, we need the gradients for the weights of each of the $K$ classes, $\mathbf{w}_k$ where $k \in \{1,2,\dots,K\}$, as per equation (1). This takes the following form:

\begin{equation} \frac{\partial L(\mathbf{W})}{\partial w_{kj}} = (\hat{y}_k - y_k) x_j. \end{equation}
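
Summing equation (8) over all $m$ examples gives the full-batch gradient, which in matrix form is $\mathbf{X}^T(\mathbf{\hat{Y}} - \mathbf{Y})$. A minimal sketch, with variable names that are assumptions:

```python
import numpy as np

def grad_W(X, Y, Y_hat):
    """Full-batch gradient of the cross-entropy loss with respect to W.

    X     : m-by-n input matrix
    Y     : m-by-K one-hot targets
    Y_hat : m-by-K softmax predictions
    Returns an n-by-K matrix, i.e. equation (8) summed over the m examples.
    """
    return X.T @ (Y_hat - Y)
```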

Example

Let's now take a look at how to implement multinomial logistic regression using the Iris dataset. This time, we will attempt to discriminate between all 3 species of Iris flowers: Setosa, Versicolor and Virginica. We will also include two more features: sepal width and sepal length.
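
The module's original code is not reproduced here; below is a minimal sketch of how such an implementation could look using NumPy and scikit-learn's copy of the Iris dataset. The learning rate, the number of iterations and the use of plain batch gradient descent are assumptions made for this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load all four features (sepal length/width, petal length/width) and the 3 species
iris = load_iris()
X = iris.data                          # shape (150, 4)
y = iris.target                        # integer labels 0, 1, 2

m, n = X.shape
K = 3

# One-hot encode the targets and prepend a bias column to X
Y = np.eye(K)[y]
X = np.hstack([np.ones((m, 1)), X])    # shape (150, 5)

def softmax_rows(S):
    """Row-wise softmax of an m-by-K score matrix."""
    Z = np.exp(S - S.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

# Batch gradient descent on the cross-entropy loss
W = np.zeros((n + 1, K))
learning_rate = 0.01                   # assumed value
for _ in range(5000):                  # assumed number of iterations
    Y_hat = softmax_rows(X @ W)        # equation (4)
    grad = X.T @ (Y_hat - Y)           # equation (8), summed over examples
    W -= learning_rate * grad / m

accuracy = np.mean(np.argmax(X @ W, axis=1) == y)
print(f"Training accuracy: {accuracy:.2f}")  # expected to be high: Iris is close to linearly separable
```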

In summary, multinomial logistic regression extends binary logistic regression by (i) using one weight vector per class and (ii) using a softmax operation instead of a sigmoid function. One thing to note is that we are still in the linear domain: unless non-linear transformations are applied to the inputs (e.g. by manual feature engineering), the logistic regression model is only capable of discriminating between linearly separable classes, since the decision boundaries are linear.