In this module, we will focus on modifying the binary logistic regression algorithm to solve multiclass classification problems. The first thing we will need to change is the output of the model: in the binary case, we had a single output which we thresholded to obtain a binary prediction, either 0 or 1. So, what sort of output do we use if we want to discriminate between more than two classes?
The solution is to use a technique called one-hot encoding, which consists of representing class $k$ by a vector of length $K$ in which the $k$-th element equals 1 and all other elements equal 0. For instance, in the Iris dataset, we can one-hot encode the three Iris species as follows: Setosa becomes [1, 0, 0], Versicolor becomes [0, 1, 0] and Virginica becomes [0, 0, 1].
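If we want to build these encodings programmatically, a minimal NumPy sketch could look like the following (the one_hot helper and the integer labelling of the species are our own choices for illustration):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Convert integer class labels (0..num_classes-1) into one-hot row vectors."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1
    return encoded

# 0 = Setosa, 1 = Versicolor, 2 = Virginica
print(one_hot(np.array([0, 1, 2, 1]), num_classes=3))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```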
Given instances of the classes above, we want the model to produce an output that is as close as possible to the one-hot encoded vectors. However, we want the values of the different elements of the vector to be bounded between 0 and 1, so that we can interpret them as probabilities. This leads us to two modifications: (i) the model must produce one output per class rather than a single output, and (ii) those outputs must be valid probabilities, i.e. bounded between 0 and 1 and summing to 1.
To address the first change, we can simply replace the weights vector $\mathbf{w}$ by a $d$-by-$K$ weights matrix $\mathbf{W}$, where $d$ is the number of features and $K$ is the number of classes. The weight matrix can be written as

$$\mathbf{W} = \begin{bmatrix} \mathbf{w}_1 & \mathbf{w}_2 & \cdots & \mathbf{w}_K \end{bmatrix} \tag{1}$$

where $\mathbf{w}_k$ represents the weights attributed to the class $k$.
As for the second modification, we can make use of the softmax function:

$$\text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} \tag{2}$$

which maps a vector of $K$ real-valued scores to $K$ values that all lie between 0 and 1 and sum to 1.
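As a sketch of how the softmax could be implemented with NumPy (the subtraction of the maximum is a standard numerical-stability trick, not part of equation (2) itself):

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis; subtracting the max avoids overflow in exp."""
    z = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.659 0.242 0.099], sums to 1
```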
The multiclass logistic regression model can thus be rewritten as:

$$\hat{\mathbf{y}} = \text{softmax}(\mathbf{W}^{\top}\mathbf{x}) \tag{3}$$
where $\hat{\mathbf{y}}$ is a $K$-by-1 vector containing the estimated class probabilities for each of the $K$ classes, $\mathbf{W}$ is a $d$-by-$K$ weight matrix and $\mathbf{x}$ is a $d$-by-1 vector containing the input data for a single example. We can modify equation (3) so that we process multiple input examples at once, in which case we can write

$$\hat{\mathbf{Y}} = \text{softmax}(\mathbf{X}\mathbf{W}) \tag{4}$$
where $\mathbf{X}$ is an $N$-by-$d$ matrix whose rows represent different instances and whose columns represent features, and $\hat{\mathbf{Y}}$ is an $N$-by-$K$ matrix containing the softmax predictions for the $N$ instances and $K$ classes (the softmax operation is applied row-wise).
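To illustrate equation (4), a small sketch of the batched forward pass, assuming $\mathbf{X}$ has shape (N, d) and $\mathbf{W}$ has shape (d, K):

```python
import numpy as np

def predict_proba(X, W):
    """Row-wise softmax of the linear scores; X is (N, d), W is (d, K)."""
    scores = X @ W                                       # (N, K) linear scores
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))   # 2 instances, 4 features
W = np.zeros((4, 3))          # 3 classes; zero weights give probability 1/3 each
print(predict_proba(X, W))
```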
Let's recall the definition of the loss function in the binary case, where the prediction $\hat{y}_i$ was a scalar:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right] \tag{5}$$
Now let's try to rearrange (5) so that it works in the equivalent multiclass case where $K > 2$. The prediction $\hat{\mathbf{y}}_i$ is now a $K$-dimensional vector, so we can write

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k} \log \hat{y}_{i,k} \tag{6}$$
where $y_{i,k}$ and $\hat{y}_{i,k}$ are the target and predicted probabilities for example $i$ and class $k$.
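Equation (6) translates almost directly into code; in this sketch the small eps term is our own addition to avoid taking the log of zero, not part of the formula:

```python
import numpy as np

def cross_entropy(Y, Y_hat, eps=1e-12):
    """Mean cross-entropy loss; Y (one-hot targets) and Y_hat are both (N, K)."""
    return -np.mean(np.sum(Y * np.log(Y_hat + eps), axis=1))

Y = np.array([[1, 0, 0], [0, 1, 0]])
Y_hat = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(cross_entropy(Y, Y_hat))  # approx. 0.29
```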
Just as the loss follows naturally from the binary case, so does the gradient. In the binary case, the partial derivative of the loss (for a single example) with respect to weight $w_j$ is given by

$$\frac{\partial \mathcal{L}}{\partial w_j} = (\hat{y} - y)\,x_j \tag{7}$$
To generalise to the multiclass scenario, we need the gradients for the weights $\mathbf{w}_k$ of each of the $K$ classes, where $k = 1, \ldots, K$, as per equation (1). This takes the following form:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_k} = (\hat{y}_k - y_k)\,\mathbf{x} \tag{8}$$
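Over a whole batch of examples, equation (8) vectorises neatly; a minimal sketch, where the division by N corresponds to the mean over examples in equation (6):

```python
import numpy as np

def gradient(X, Y, Y_hat):
    """Gradient of the mean cross-entropy w.r.t. W; X is (N, d), Y and Y_hat are (N, K)."""
    return X.T @ (Y_hat - Y) / X.shape[0]   # shape (d, K), one column per class
```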
Let's now take a look at how to implement multinomial logistic regression using the Iris dataset. This time, we will attempt to discriminate between all 3 species of Iris flowers: Setosa, Versicolor and Virginica. We will also be including two more features: sepal width and sepal length.
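The module's original code listing is not reproduced here, so the following is only a minimal sketch of the whole procedure, assuming NumPy, scikit-learn's load_iris for the data, and plain batch gradient descent:

```python
import numpy as np
from sklearn.datasets import load_iris

# Load all 3 species and all 4 features (sepal length/width, petal length/width)
iris = load_iris()
X, y = iris.data, iris.target
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardise the features
X = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
Y = np.eye(3)[y]                              # one-hot encode the targets

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)      # numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

W = np.zeros((X.shape[1], 3))                 # d-by-K weight matrix
lr, n_iters = 0.1, 1000
for _ in range(n_iters):
    Y_hat = softmax(X @ W)                    # forward pass, equation (4)
    W -= lr * X.T @ (Y_hat - Y) / len(X)      # gradient step, equation (8)

accuracy = np.mean(np.argmax(softmax(X @ W), axis=1) == y)
print(f"Training accuracy: {accuracy:.2%}")
```

Standardising the features, appending a bias column, and the particular learning rate and iteration count are implementation choices for this sketch, not requirements of the algorithm.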
In summary, multinomial logistic regression extends binary logistic regression by (i) using one weight vector per class and (ii) using a softmax operation instead of a sigmoid function. One thing to note is that we are still in the linear domain: unless non-linear transformations are applied to the inputs (e.g. by manual feature engineering), the logistic regression model is only capable of discriminating between linearly separable classes, since its decision boundaries are linear.