
Binary logistic regression


Binary logistic regression is a model that allows us to discriminate between two classes, which means that the desired output of the model will be one out of two possible values. Typically, we use the values 0 and 1 as indicators of the classes we are trying to predict.

For instance, let's assume that we are building a model to identify defects in a product manufacturing line, based on images collected in a quality control station within the factory. One possibility would be to assign the target value 0 to all the examples which do not have defects, and the value 1 to all the examples which do have a defect. It would be just as valid to assign the values vice-versa, but it is somewhat more common to use the values 0 and 1 to encode the absence and presence of something, respectively.

Thus, we are asking our classifier to produce an output equal to 0 or 1. Reaching out to our probability toolbox, we could say that the output of the model is thus a Bernoulli variable. As a refresher, the probability mass function for the Bernoulli distribution is given by:

\begin{equation} p(y|\theta) = \text{Ber}(y|\theta) = \left\{ \begin{array}{ll} \theta & \textrm{if }\, y=1 \\ 1 - \theta & \textrm{if }\, y=0 \end{array} \right. \end{equation}

where $\theta$ is the only parameter of the model and represents the probability that the outcome equals 1. Values of $\theta$ close to 1 mean that we are very likely to observe the output 1, while if $\theta$ is close to 0 we are very likely to observe the output 0.

For our classifier to be useful, we need its outputs to depend on the input features in some way. Under our definition above, we control the output of the classifier by varying the value of $\theta$. Thus, we can make $\theta$ a function of our input data:

\begin{equation} \theta = f(\mathbf{x}) \end{equation}

Now, some features may be more important than others in discerning between the two classes. Thus, similarly to linear regression, we use a set of parameters $\mathbf{w}$ to help us weigh our features appropriately:

\begin{equation} \theta = f(\mathbf{w}^T\mathbf{x}) \end{equation}

We haven't quite defined what the function $f$ should be like, but one key requirement is that it is bounded between 0 and 1, since $\theta \in [0, 1]$. In binary logistic regression, we use the sigmoid function

\begin{equation} \sigma(x) = \frac{1}{1 + \exp(-x)}. \end{equation}

Here is what the sigmoid function looks like:
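If you would like to reproduce the plot yourself, a minimal sketch using NumPy and Matplotlib (assumed to be available in your environment) might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-8, 8, 200)
plt.plot(z, sigmoid(z))
plt.xlabel("z")
plt.ylabel("sigmoid(z)")
plt.show()
```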

Putting it all together, the binary logistic regression model can be defined as:

\begin{equation} p(y|\mathbf{w}, \mathbf{x}) = \text{Ber}(y|\sigma(\mathbf{w}^T\mathbf{x})) \end{equation}

Let's reiterate what we've got so far. We are saying that the predictions of our classifier can be interpreted as a Bernoulli random variable with a variable probability of success. In turn, the probability of success depends on the input features and is given by $\sigma(\mathbf{w}^T\mathbf{x})$.
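To make this concrete, here is a small sketch (reusing the sigmoid helper and NumPy import from above) of how the class probabilities could be computed for a single example; the weights and feature values are purely illustrative:

```python
# Hypothetical weights and a single feature vector; the last element of x is a
# constant 1 so that the last weight plays the role of an intercept (bias) term.
w = np.array([1.5, -0.8, 0.2])
x = np.array([2.0, 1.0, 1.0])

p_class1 = sigmoid(w @ x)    # P(y = 1 | w, x) = sigma(w^T x)
p_class0 = 1.0 - p_class1    # P(y = 0 | w, x)
print(p_class1, p_class0)
```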

Now that our model is specified, we need to find the parameters $\mathbf{w}$ that make our predictions as accurate as possible given a particular dataset. For that, we will again revisit a concept we touched on during the Introduction to Probability module: Maximum Likelihood Estimation (MLE).

Going back to the Bernoulli likelihood, it is convenient to rewrite (1) as

\begin{equation} p(y|\theta) = \theta^{\mathbf{1}(y=1)} (1-\theta)^{\mathbf{1}(y=0)}, \end{equation}

where $\mathbf{1}$ is the indicator function, which in this case is defined as

\begin{equation} \mathbf{1}(y) = \left\{ \begin{array}{ll} 1 & \textrm{if }\, y=1 \\ 0 & \textrm{if }\, y=0 \end{array} \right. \end{equation}

Note that the role of this indicator function is simply to select which factor to use as $p(y|\theta)$:

\begin{align} p(y=1|\theta) &= \theta^1 (1-\theta)^0 = \theta \\[3ex] p(y=0|\theta) &= \theta^0 (1-\theta)^1 = 1-\theta \end{align}

Coming back to equation (6), and recalling that $\theta = \sigma(\mathbf{w}^T\mathbf{x})$, we can write

\begin{equation} p(y|\mathbf{w}, \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x})^{\mathbf{1}(y=1)} (1-\sigma(\mathbf{w}^T\mathbf{x}))^{\mathbf{1}(y=0)}. \end{equation}

In MLE, we want to find the parameters $\mathbf{w}$ that maximise the likelihood of observing the data under the specified model. For explanatory purposes, let's imagine that our dataset contains a single example where $y=0$ (a silly scenario in practice, but helpful for building understanding). In this case, we want the second factor to be very close to 1, which means that $\sigma(\mathbf{w}^T\mathbf{x})$ needs to be very close to zero. In turn, $\mathbf{w}^T\mathbf{x}$ should be as negative as possible.

Assuming all examples in the dataset are independent, the likelihood of observing the dataset under the model is given by:

\begin{equation} p(\mathbf{y}| \mathbf{w}, \mathbf{X}) = \prod_{i=1}^m p(y_i|\mathbf{w}, \mathbf{x}_i) \end{equation}

Coming back to our goal, we want to find the parameters $\mathbf{w}$ that maximise the likelihood defined in equation (11). As we have seen before, this is equivalent to minimising the negative log-likelihood

\begin{align} L(\mathbf{w}) &= - \log p(\mathbf{y}| \mathbf{w}, \mathbf{X}) \\[3ex] &= - \log \prod_{i=1}^m p(y_i|\mathbf{w}, \mathbf{x}_i) \\[3ex] &= - \sum_{i=1}^m \log p(y_i|\mathbf{w}, \mathbf{x}_i) \\[3ex] &= - \sum_{i=1}^m \log \left( \sigma(\mathbf{w}^T\mathbf{x}_i)^{\mathbf{1}(y_i=1)} (1-\sigma(\mathbf{w}^T\mathbf{x}_i))^{\mathbf{1}(y_i=0)} \right) \\[3ex] &= - \sum_{i=1}^m \left[ y_i \log \left(\sigma(\mathbf{w}^T\mathbf{x}_i)\right) + \left(1-y_i\right)\log\left( 1-\sigma(\mathbf{w}^T\mathbf{x}_i) \right) \right] \end{align}

A few notes on how we got to equation (16). First, we obtained equation (14) by applying the logarithm product rule. We then substituted $p(y_i|\mathbf{w}, \mathbf{x}_i)$ according to (10) to arrive at equation (15). Finally, we again applied the logarithm rules and replaced exponentiation by the indicator function with multiplication by $y_i$ and $1-y_i$ as the mechanism for selecting the correct term depending on whether $y_i=0$ or $y_i=1$.
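For later reference, here is one way equation (16) could be written in NumPy, reusing the sigmoid helper from above; the small epsilon guarding against log(0) is an implementation detail added here, not part of the derivation:

```python
def negative_log_likelihood(w, X, y):
    # Equation (16): -sum_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ],
    # where p_i = sigma(w^T x_i). X has shape (m, n) and y has shape (m,).
    eps = 1e-12                     # guards against taking log(0)
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```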

Gradient and Hessian

Now that we have written down the negative log-likelihood, we proceed by computing its gradient with respect to the parameters $\mathbf{w}$. We take the derivative of (16) with respect to $\mathbf{w}$ and, by application of the chain rule a few times, we arrive at the following expression for the gradient:

\begin{equation} \nabla_w L(\mathbf{w}) = \frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} = \mathbf{X}^T (\sigma(\mathbf{X}\mathbf{w}) - \mathbf{y}), \end{equation}

where $\mathbf{X}$ is the matrix of $m$ examples by $n$ features and the sigmoid is applied element-wise.
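Assuming the same conventions as in the sketches above (X of shape (m, n), y of shape (m,), and the sigmoid helper already defined), the gradient in (17) could be implemented in one line:

```python
def gradient(w, X, y):
    # Equation (17): grad L(w) = X^T (sigma(Xw) - y).
    return X.T @ (sigmoid(X @ w) - y)
```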

When working with linear regression, we simply set the gradient to 0 and solved for $\mathbf{w}$ to arrive at the maximum likelihood estimate of the parameters. Unfortunately, that is not possible in the logistic regression case, as no closed-form solution exists; instead, we need to resort to numerical optimisation. For that, we will be using the gradient in (17) as well as the Hessian. As we saw in the differential calculus course, the Hessian is the matrix of second-order partial derivatives, which can be computed by differentiating the gradient. By taking the derivative of (17) and applying the chain rule again, we arrive at the following expression

\begin{equation} \mathbf{H}(\mathbf{w}) = \frac{\partial \nabla_w L(\mathbf{w})}{\partial \mathbf{w}} = \mathbf{X}^T \text{diag}(\sigma(\mathbf{X}\mathbf{w})(1 - \sigma(\mathbf{X}\mathbf{w}))) \mathbf{X} \end{equation}

Both the gradient and the Hessian can then be used by second-order optimisation methods to arrive at the values of $\mathbf{w}$ that minimise the loss.
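For completeness, here is a sketch of the Hessian in (18) and of a single Newton update, one example of such a second-order method. This is illustrative rather than efficient (it forms the full diagonal matrix explicitly) and reuses the helpers defined above:

```python
def hessian(w, X):
    # Equation (18): H(w) = X^T diag(p * (1 - p)) X, with p = sigma(Xw).
    p = sigmoid(X @ w)
    return X.T @ np.diag(p * (1 - p)) @ X

def newton_step(w, X, y):
    # One iteration of Newton's method: w <- w - H(w)^{-1} grad L(w).
    return w - np.linalg.solve(hessian(w, X), gradient(w, X, y))
```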

Example

Now we will look at a simple classification example using a dataset we introduced in the Introduction to Linear Algebra module: the Iris dataset. The dataset contains a series of morphological measurements of species of Iris flowers. In this example, we will use measurements of petal length and width to discriminate between the Setosa and Versicolor species. Let's start by plotting the data:
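One way to load and plot this subset of the data, assuming scikit-learn and Matplotlib are available, is sketched below; the details of how the data is loaded may differ from the original setup:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
mask = iris.target < 2                 # keep Setosa (0) and Versicolor (1) only
X = iris.data[mask][:, 2:4]            # petal length and petal width (cm)
y = iris.target[mask].astype(float)    # 0 = Setosa, 1 = Versicolor

plt.scatter(X[y == 0, 0], X[y == 0, 1], label="Setosa")
plt.scatter(X[y == 1, 0], X[y == 1, 1], label="Versicolor")
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.legend()
plt.show()
```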

Our job is to build a logistic regression model which is able to predict what species a given Iris flower is likely to be, given the length and width of its petals.

In this example, we will use plain gradient descent—we will ignore the Hessian for the sake of brevity. The purpose here is to demonstrate the algorithm in action rather than arrive at the best possible estimate of the model weights in an efficient manner. In practice, we recommend using ML frameworks such as sklearn, since these already provide optimised methods for training different types of machine learning models.

That being said, let's get back to our example. In the cell below we implement gradient descent to fit the parameters of the logistic regression model to the dataset above. We use the expressions of the loss and the gradient as defined in equations (16) and (17).
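A minimal sketch of such a training loop, reusing X, y, and the helper functions from the earlier sketches, might look like this; the learning rate and number of iterations are illustrative values, not tuned choices:

```python
# Append a constant column so that the third weight acts as the intercept w_3.
Xb = np.hstack([X, np.ones((X.shape[0], 1))])

w = np.zeros(Xb.shape[1])
learning_rate = 0.01     # illustrative value
n_iterations = 5000      # illustrative value

for i in range(n_iterations):
    # Gradient descent step using equation (17).
    w -= learning_rate * gradient(w, Xb, y)
    if i % 500 == 0:
        loss = negative_log_likelihood(w, Xb, y)
        accuracy = np.mean((sigmoid(Xb @ w) > 0.5) == y)
        print(f"iteration {i}: loss = {loss:.3f}, accuracy = {accuracy:.3f}")
```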

Reassuringly, we see that during gradient descent the loss decreases, while the training set accuracy increases. This is a good sign: a non-decreasing loss could indicate some form of bug in our code or a learning rate that is too large.

However, keep in mind that, in any sort of practical application, one must check that the model performs well on held-out data, i.e. a test set.

In this simple example, we can also easily inspect the decision boundary of the model by plotting the line defined by $w_1x_1 + w_2x_2 + w_3 = 0$. This helps us understand how different areas of the input space are classified by the model.
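A sketch of how this boundary could be plotted, using the weights fitted in the training loop above (where the third weight plays the role of the intercept):

```python
# The decision boundary is the set of points where w_1*x_1 + w_2*x_2 + w_3 = 0,
# i.e. x_2 = -(w_1*x_1 + w_3) / w_2.
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x2 = -(w[0] * x1 + w[2]) / w[1]

plt.scatter(X[y == 0, 0], X[y == 0, 1], label="Setosa")
plt.scatter(X[y == 1, 0], X[y == 1, 1], label="Versicolor")
plt.plot(x1, x2, "k--", label="decision boundary")
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.legend()
plt.show()
```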

We saw a similar example in the Introduction to Linear Algebra module, but it might be worth reiterating the logic behind the decision boundary. Let's assume that we now have to classify a new example. A new data point can be seen as a new vector and, depending on its coordinates, the following holds:

  • If the dot product between the new vector $[x_1, x_2, 1]$ and the model weights is greater than 0 (i.e. the new data point lies above the decision boundary), then $\sigma(\mathbf{w}^T\mathbf{x}) > 0.5$, which means that after thresholding at 0.5 we would predict class 1 (Versicolor, the orange class).
  • If the dot product between the new vector $[x_1, x_2, 1]$ and the model weights is negative (i.e. the new data point lies below the decision boundary), then $\sigma(\mathbf{w}^T\mathbf{x}) < 0.5$, which means that after thresholding at 0.5 we would predict class 0 (Setosa, the blue class).

In this example, the decision boundary completely separates the two classes. These two classes are thus said to be linearly separable.

Final notes

As we have mentioned above, we don't recommend using gradient descent as implemented above for fitting logistic regression models: other algorithms are better and faster for this purpose. Moreover, there are mature frameworks written in Python that already implement these algorithms for you, so you can use them out of the box.
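For example, scikit-learn's LogisticRegression fits the same kind of model with an efficient solver. A minimal usage sketch on the X and y arrays from above:

```python
from sklearn.linear_model import LogisticRegression

# scikit-learn uses an efficient solver (lbfgs by default) and applies
# L2 regularisation out of the box, controlled by the parameter C.
clf = LogisticRegression()
clf.fit(X, y)
print(clf.coef_, clf.intercept_)
print(clf.score(X, y))    # training accuracy
```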

In addition, the implementation above does not include regularisation of any kind. This is a problem because we do not want the weight vector to keep growing indefinitely during the optimisation process. This is especially relevant when the classes are linearly separable, as the gradient will keep pushing the input to the sigmoid further away from 0, so that the outputs of the sigmoid operation become increasingly closer to the target values 0 and 1.
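As an illustration of how this could be addressed (this is not part of the implementation above), an L2 penalty can be added to the gradient; lam is an illustrative regularisation strength and the intercept is left unpenalised:

```python
def regularised_gradient(w, X, y, lam=1.0):
    # Gradient of the loss in (16) plus the gradient of an L2 penalty
    # (lam / 2) * ||w||^2, which stops the weights from growing without bound.
    # The intercept (assumed to be the last weight here) is left unpenalised.
    penalty = lam * np.concatenate([w[:-1], [0.0]])
    return X.T @ (sigmoid(X @ w) - y) + penalty
```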

In the next section, we will look at how we go from the binary classification scenario we covered here to a multiclass scenario.