As we mentioned before, derivatives are heavily used to help us optimise the parameters of models. This is because they tell us how the loss function of a model changes as the parameters change, so we can adjust the parameters in the direction that makes the loss decrease.
Let's start by illustrating this process with the function $f(x) = (x-2)^2 + 1$. Here, we are assuming that $x$ is the only parameter of our model and $f(x)$ is the loss function of the model. Let's plot $f(x)$:
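A minimal sketch of this plot, assuming the quadratic loss $f(x) = (x-2)^2 + 1$ above and using matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

# Loss function assumed above: a simple quadratic with its minimum at x = 2
def f(x):
    return (x - 2) ** 2 + 1

xs = np.linspace(-2, 6, 200)
plt.plot(xs, f(xs))
plt.xlabel("x")
plt.ylabel("f(x)")
plt.title("f(x) = (x - 2)^2 + 1")
plt.show()
```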
In a machine learning model, the parameters are often randomly initialised, so let's say we initialised $x$ to be zero. Our task is to iteratively adjust the value of $x$ so that we get closer to the minimum of $f(x)$ with every iteration.
So let's start with our initial value of $x = 0$. How do we adjust $x$ so that $f(x)$ decreases? Do we increase or decrease $x$? That is what the derivative tells us. By applying some basic derivation rules (we will cover those later), we know that the derivative of $f(x) = (x-2)^2 + 1$ is $f'(x) = 2(x-2)$. At the point $x = 0$, the derivative is equal to:

$$f'(0) = 2 \times (0 - 2) = -4$$
This number has a geometric interpretation: it represents the slope of the line that is tangent to the function at the point $x = 0$. Let's confirm that by plotting a line that passes through the point $x = 0$, $y = 5$ and has a slope of $-4$.
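Here is one way this tangent line could be plotted, again assuming the quadratic $f(x)$ from above:

```python
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return (x - 2) ** 2 + 1

xs = np.linspace(-2, 6, 200)
# Tangent line at x = 0: passes through (0, f(0)) = (0, 5) with slope f'(0) = -4
tangent = 5 - 4 * xs

plt.plot(xs, f(xs), label="f(x)")
plt.plot(xs, tangent, "--", label="tangent at x = 0")
plt.scatter([0], [5])
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```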
The slope of the line (i.e. the derivative) tells us in which direction the function is decreasing: if the derivative is negative, we are going downhill with increasing values of $x$; if the derivative is positive, we are going uphill with increasing values of $x$; if the derivative is zero, then we have reached a minimum or maximum of the function.
Now let's try to use the derivative to find the minimum of the function $f(x)$. The idea is to start at the point $x = 0$ and then take steps in the direction that decreases $f(x)$. That means that if the derivative is negative, we increase $x$ by some small amount, and if it is positive, we decrease it by a small amount. One possible rule to adjust the value of $x$ at each iteration is:

$$x_{n+1} = x_n - \alpha \, f'(x_n)$$

where $\alpha$ is a small positive constant called the learning rate.
As we will see later in more detail, this update rule is gradient descent. Let's implement this so we can see it in action.
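Below is a minimal sketch of such an implementation for our example function, using the derivative $f'(x) = 2(x-2)$ from above. The name `gradient_descent` matches the function referred to later, but its exact signature (starting point, learning rate, number of iterations) is an assumption:

```python
def f_prime(x):
    """Derivative of f(x) = (x - 2)^2 + 1."""
    return 2 * (x - 2)

def gradient_descent(x0=0.0, learning_rate=0.1, n_iterations=20):
    """Repeatedly apply the update x <- x - learning_rate * f'(x)."""
    x = x0
    history = [x]
    for _ in range(n_iterations):
        x = x - learning_rate * f_prime(x)  # step against the slope
        history.append(x)
    return history

print(gradient_descent())  # the values should approach the minimum at x = 2
```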
The speed at which we get closer to the minimum depends on the learning rate $\alpha$. If this parameter is too small, it might take many iterations to get to the minimum. Conversely, if the learning rate is too large, we might end up taking steps that are too large, effectively "jumping over" the minimum point.
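For instance, with the `gradient_descent` sketch from the previous cell (and learning-rate values chosen purely for illustration), the two failure modes look like this:

```python
# Too small: after 20 iterations we are still far from the minimum at x = 2
print(gradient_descent(learning_rate=0.01)[-1])

# Reasonable: converges close to x = 2
print(gradient_descent(learning_rate=0.1)[-1])

# Too large: the updates overshoot and oscillate away from the minimum
print(gradient_descent(learning_rate=1.1)[-1])
```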
In the next cell, play around with the inputs to the gradient_descent function and execute the code. See if you can produce the following behaviors:
Now that we have some intuition on what a derivative is and how it can be useful, let's introduce the formal definition and some common derivatives. The derivative of a single-variable function $f(x)$ is defined as:

$$\frac{df}{dx}(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} \tag{2}$$
It is worth taking some time to interpret the definition above: as $h$ tends to zero, the derivative is the ratio between how much the function changes and $h$, which is effectively the rate of change of the function.
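We can check this definition numerically with the quadratic $f(x)$ assumed above: as $h$ shrinks, the difference quotient approaches the derivative $f'(0) = -4$.

```python
def f(x):
    return (x - 2) ** 2 + 1

for h in [1.0, 0.1, 0.01, 0.001]:
    # Difference quotient (f(x + h) - f(x)) / h evaluated at x = 0
    print(h, (f(0 + h) - f(0)) / h)
# The printed values approach -4, the derivative of f at x = 0
```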
Luckily, many explicit functions have known rules for derivation, so in practice we do not often have to use equation (2) to compute derivatives. Here are a few rules that are often useful to know:
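For reference, a few standard derivatives of this kind are listed below (a generic selection; the exact set of rules shown in the original table may differ):

$$
\begin{aligned}
\frac{d}{dx}\, c &= 0 \quad (c \text{ constant}) \\
\frac{d}{dx}\, x^n &= n x^{n-1} \\
\frac{d}{dx}\, e^{x} &= e^{x} \\
\frac{d}{dx}\, \ln x &= \frac{1}{x} \\
\frac{d}{dx}\, \sin x &= \cos x \\
\frac{d}{dx}\, \cos x &= -\sin x
\end{aligned}
$$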
As we will see later, knowing the expressions for these derivatives (together with the rules we will cover in the next section) is incredibly powerful: it is, in fact, the crux of the backpropagation algorithm that is used to optimise deep neural networks.