
Derivatives


As we mentioned before, derivatives are heavily used to help us optimise the parameters of models. This is because they tell us how the loss function of a model varies, so we can adjust the parameters in a way that makes the loss decrease.

Let's start by illustrating this process with the function $f(x) = (x - 2)^2 + 1$. Here, we are assuming that $x$ is the only parameter of our model and $f(x)$ is the loss function of the model. Let's plot $f(x)$:
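The plot is rendered in the original notebook; here is a minimal sketch of what that cell might look like, assuming numpy and matplotlib are available (neither library is named in the text):

```python
# A minimal sketch (assumed, not from the original notebook): plot
# f(x) = (x - 2)^2 + 1 over a range of x values around its minimum at x = 2.
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return (x - 2) ** 2 + 1

xs = np.linspace(-2, 6, 200)
plt.plot(xs, f(xs))
plt.xlabel("x")
plt.ylabel("f(x)")
plt.title("f(x) = (x - 2)^2 + 1")
plt.show()
```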

In a machine learning model, the parameters are often randomly initialised, so let's say we initialised $x$ to be zero. Our task is to iteratively adjust the value of $x$ so that we get closer to the minimum with every iteration.

So let's start with our initial value of $x = 0$. How do we adjust $x$ so that $f(x)$ decreases? Do we increase or decrease $x$? That is what the derivative tells us. By applying some basic differentiation rules (we will cover those later), we know that the derivative of $f$ is $f'(x) = 2(x - 2)$. At the point $x = 0$, the derivative is equal to

$$f'(x)\big|_{x=0} = 2(0 - 2) = -4$$

This number has a geometric interpretation: it represents the slope of the line that is tangent to the function $f$ at the point $x = 0$. Let's confirm that by plotting a line that passes through the point $(0, 5)$ (note that $f(0) = 5$) and has a slope of $-4$.
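Again, the plot itself lives in the notebook; a possible sketch of that cell, under the same matplotlib assumption as above:

```python
# Sketch (assumed): overlay the tangent line at x = 0 on top of f(x).
# The tangent passes through (0, f(0)) = (0, 5) and has slope f'(0) = -4.
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return (x - 2) ** 2 + 1

xs = np.linspace(-2, 6, 200)
tangent = 5 - 4 * xs  # line through (0, 5) with slope -4

plt.plot(xs, f(xs), label="f(x)")
plt.plot(xs, tangent, linestyle="--", label="tangent at x = 0")
plt.legend()
plt.show()
```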

The slope of the line (i.e. the derivative) tells us in which direction the function $f$ is decreasing: if the derivative is negative, we are going downhill with increasing values of $x$; if the derivative is positive, we are going uphill with increasing values of $x$; if the derivative is zero, then we have reached a minimum or maximum of the function.

Now let's try to use the derivative to find the minimum of the function $f$. The idea is to start at the point $x = 0$ and then take steps in the direction that decreases $f$. That means that if the derivative is negative, we increase $x$ by some small amount, and if it is positive, we decrease it by a small amount. One possible rule for adjusting the value of $x$ at each iteration is:

\begin{equation} x_{i+1} = x_i - \alpha f'(x)\big|_{x=x_i} \end{equation}

As we will see later in more detail, this update rule is known as gradient descent. Let's implement it so we can see it in action.
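The notebook's implementation is not reproduced here; the following is a minimal sketch of what a `gradient_descent` function for this example could look like. The function name matches the one referenced below, but its exact signature is an assumption.

```python
# Sketch of gradient descent for f(x) = (x - 2)^2 + 1.
# The signature (x_init, learning_rate, num_iterations) is assumed,
# not taken from the original notebook.
def f_prime(x):
    return 2 * (x - 2)

def gradient_descent(x_init=0.0, learning_rate=0.1, num_iterations=20):
    x = x_init
    for i in range(num_iterations):
        # Update rule from equation (1): step against the derivative.
        x = x - learning_rate * f_prime(x)
        print(f"iteration {i + 1}: x = {x:.4f}")
    return x
```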

The speed at which we get closer to the minimum depends on the learning rate $\alpha$. If this parameter is too small, it might take many iterations to get to the minimum. Conversely, if the learning rate is too large, we might end up taking steps that are too large, effectively "jumping over" the minimum point.

In the next cell, play around with the inputs to the gradient_descent function and execute the code. See if you can produce the following behaviours (some starting points are sketched after the list):

  • $x$ decreases towards the minimum, but it never reaches the point at which the function has its minimum value ($x = 2$).
  • $x$ goes above and below $x = 2$ and does not seem to converge to the minimum.
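As a starting point, here are two hypothetical calls (using the assumed signature from the sketch above) that should produce the two behaviours:

```python
# With a very small learning rate, x creeps towards 2 but is still far from
# it after 20 iterations.
gradient_descent(x_init=0.0, learning_rate=0.01, num_iterations=20)

# With learning_rate = 1.0, each step exactly overshoots the minimum:
# x bounces between 0 and 4 forever and never settles at 2.
gradient_descent(x_init=0.0, learning_rate=1.0, num_iterations=20)
```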

Common derivatives

Now that we have some intuition on what a derivative is and how it can be useful, let's introduce the formal definition and some common derivatives. The derivative of a single-variable function $f(x)$ is defined as:

\begin{equation} f'(x) = \frac{df}{dx} = \lim_{\epsilon \rightarrow 0} \frac{f(x + \epsilon) - f(x)}{\epsilon} \end{equation}

It is worth taking some time to interpret the definition above: as $\epsilon$ tends to zero, the derivative is the ratio between how much the function changes and $\epsilon$, which is effectively the rate of change of the function.
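To make this concrete, we can approximate the derivative of our earlier function numerically by plugging a small (but nonzero) $\epsilon$ into the definition; the lines below are a sketch of that idea, and the estimate should approach the analytical value $f'(0) = -4$ as $\epsilon$ shrinks.

```python
# Finite-difference approximation of f'(0) for f(x) = (x - 2)^2 + 1,
# using the definition of the derivative with decreasing epsilon.
def f(x):
    return (x - 2) ** 2 + 1

for eps in (1.0, 0.1, 0.01, 0.001):
    approx = (f(0 + eps) - f(0)) / eps
    print(f"epsilon = {eps}: approximate derivative = {approx}")
# epsilon = 1.0   -> -3.0
# epsilon = 0.1   -> -3.9
# epsilon = 0.01  -> -3.99  (approaching the true value of -4)
```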

Luckily, many explicit functions have known differentiation rules, so in practice we do not often have to use equation (2) to compute derivatives. Here are a few rules that are often useful to know:

  • Constant rule: $\frac{d}{dx}C = 0$, for any constant $C$.
  • Linear rule: $\frac{d}{dx}(ax) = a$.
  • Power rule: $\frac{d}{dx}x^a = ax^{a-1}$. This implies that $\frac{d}{dx}x = 1$.
  • Exponential: $\frac{d}{dx}a^x = a^x \ln(a)$. This implies that $\frac{d}{dx}e^x = e^x$.
  • Logarithm: $\frac{d}{dx}\log_a(x) = \frac{1}{x\ln(a)}$. This implies that $\frac{d}{dx}\ln(x) = \frac{1}{x}$.
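If you want to sanity-check these rules yourself, one option (not part of the original text) is to differentiate a few functions symbolically with sympy and compare against the expressions above:

```python
# Symbolic check of a few of the rules above, assuming sympy is installed.
import sympy as sp

x = sp.symbols("x", positive=True)
a = sp.symbols("a", positive=True)

print(sp.diff(x ** 3, x))     # power rule: 3*x**2
print(sp.diff(sp.exp(x), x))  # exponential: exp(x)
print(sp.diff(sp.log(x), x))  # natural logarithm: 1/x
print(sp.diff(a ** x, x))     # general exponential: a**x*log(a)
```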

As we will see later, knowing the expressions for these derivatives (together with the rules we will cover in the next section) is incredibly powerful: it is, in fact, the crux of the backpropagation algorithm that is used to optimise deep neural networks.