
Derivatives and local function approximators


We will now look at derivatives from a slightly different perspective. Earlier in this module, we saw how the derivative of a function at a given point corresponds to the slope of the tangent line at that point. Because differentiable functions vary smoothly, the tangent approximates the value of the function over a small enough interval around that point.

For example, let's have a look at the function $f(x) = \sin x$.

By the rules of differentiation, we know that the derivative of $f$ is

\begin{equation} \frac{df}{dx} = \frac{d(\sin x)}{dx} = \cos x. \end{equation}

Now let's plot the tangent to the function $f$ at $x=0$. We know that its slope must be $\left. \frac{df}{dx} \right|_{x=0} = \cos(0) = 1$ and its intercept must be 0 since $f(0) = \sin(0) = 0$. Let's plot it together with the function so we can see how well the approximation works in the interval $[-1, 1]$.
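As a minimal sketch (assuming numpy and matplotlib are available), the function and its tangent can be drawn like this:

```python
import numpy as np
import matplotlib.pyplot as plt

# f(x) = sin(x) and its tangent at x = 0 (slope 1, intercept 0)
x = np.linspace(-1, 1, 200)
f = np.sin(x)
tangent = x  # 1 * x + 0

plt.plot(x, f, label=r"$f(x) = \sin x$")
plt.plot(x, tangent, "--", label="tangent at $x = 0$")
plt.xlabel("$x$")
plt.legend()
plt.show()
```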

It looks like the tangent is a reasonable approximation around $x=0$, but as we get further away from it the function deviates significantly from the tangent line. The reason for this is that as we move away from $x=0$, the derivative at $x=0$ no longer represents how the function is varying. In other words, the derivative of $f$ is itself changing over $x$, and we need to account for that. The rate of change of the derivative of $f$ over $x$ is given by its derivative, which we refer to as the second derivative of $f$, $\frac{d^2f}{dx^2}$. In turn, the second derivative may also be varying over $x$, and its variation is characterized by the third derivative, $\frac{d^3f}{dx^3}$. This can go on indefinitely for some functions, as is the case for $f(x) = \sin x$.

The Taylor series, also referred to as the Taylor expansion, tells us how to use these successive derivatives to approximate a function. Formally, the Taylor series is defined as:

\begin{equation} T(x) = f(a) + \sum_{n=1}^{\infty} \frac{(x - a)^n}{n!} \left. \frac{d^nf}{dx^n}\right|_{x=a} \end{equation}

where $a$ is an anchor or reference point (the point at which we evaluate the function and its derivatives). In practice, we cannot evaluate the expression above indefinitely, so we need to restrict ourselves to a finite number of derivatives:

\begin{equation} T_N(x) = f(a) + \sum_{n=1}^{N} \frac{(x - a)^n}{n!} \left. \frac{d^nf}{dx^n}\right|_{x=a} \end{equation}

The equation above defines a polynomial of degree $N$: this is known as the $N^{th}$ degree Taylor polynomial. Let's implement an approximation of $f(x) = \sin x$ using the 5th degree Taylor polynomial:
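One possible implementation is sketched below (not necessarily the original code); it uses the fact that the $n$-th derivative of $\sin x$ is $\sin(x + n\pi/2)$, so every derivative can be evaluated directly at the anchor point:

```python
import math

import numpy as np
import matplotlib.pyplot as plt

def taylor_sin(x, a=0.0, degree=5):
    """Degree-N Taylor polynomial of sin around the anchor point a.

    Uses the identity d^n(sin x)/dx^n = sin(x + n * pi / 2) to
    evaluate each derivative at the anchor point.
    """
    result = np.sin(a)  # f(a)
    for n in range(1, degree + 1):
        deriv_at_a = np.sin(a + n * np.pi / 2)
        result = result + deriv_at_a * (x - a) ** n / math.factorial(n)
    return result

x = np.linspace(-2, 2, 400)
plt.plot(x, np.sin(x), label=r"$\sin x$")
plt.plot(x, taylor_sin(x, degree=5), "--", label="5th degree Taylor")
plt.legend()
plt.show()
```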

This looks like a much better approximation. However, if we look over a wider range, we see that the function clearly deviates from the polynomial again.


Increasing the degree of the Taylor polynomial widens the range over which the approximation works well. Taylor polynomials are commonly used to approximate functions because (i) they can provide good approximations and (ii) polynomials are easy to work with. For instance, they are easy to differentiate, which is helpful if we are looking to minimise a function.
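Reusing the taylor_sin helper from the sketch above, we can see this effect by plotting polynomials of increasing degree over a wider interval:

```python
# Compare Taylor polynomials of increasing degree over a wider range
x = np.linspace(-8, 8, 800)
plt.plot(x, np.sin(x), "k", label=r"$\sin x$")
for degree in (1, 5, 9, 13):
    plt.plot(x, taylor_sin(x, degree=degree), "--", label=f"degree {degree}")
plt.ylim(-2, 2)  # polynomials grow quickly outside their useful range
plt.legend()
plt.show()
```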


Multivariate Taylor Series

The Taylor series can also be used to approximate multivariate functions. This section might feel a bit more difficult than the previous ones, so we'll move a little more slowly. To start with, let's write down the Taylor polynomial of degree 1 for approximating a univariate function $f(x)$:

\begin{align} T_1(x) &= f(a) + \sum_{n=1}^{1} \frac{(x - a)^n}{n!} \left. \frac{d^nf}{dx^n}\right|_{x=a} \\[3 ex] &= f(a) + \frac{(x - a)^1}{1!} \left. \frac{d^1f}{dx^1}\right|_{x=a} \\[3 ex] &= f(a) + (x - a) \left. \frac{df}{dx}\right|_{x=a} \end{align}
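For instance, with $f(x) = \sin x$ and anchor $a = 0$, this gives $T_1(x) = \sin(0) + x \cos(0) = x$, which is exactly the tangent line we plotted earlier.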

The logic is the same in the multivariate case, but we have to account for the contribution of each partial derivative. For instance, let's assume that we are approximating a function $f(\mathbf{x})$ with $\mathbf{x} = \left[ x_1, x_2 \right]^\top$ using a Taylor polynomial of degree 1 with an anchor point $\mathbf{a} = [a_1, a_2]^\top$. Pretty much as in the univariate case, we can approximate the function by using the first derivative in each dimension, i.e. the gradient, as follows:

\begin{equation} T_1(\mathbf{x}) = f(\mathbf{a}) + \left[ (x_1 - a_1)\left. \frac{\partial f}{\partial x_1}\right|_{x=a} + (x_2 - a_2)\left. \frac{\partial f}{\partial x_2}\right|_{x=a} \right] \end{equation}

We can rewrite this using vector notation:

\begin{equation} T_1(\mathbf{x}) = f(\mathbf{a}) + \left. \nabla_x f\right|_{\mathbf{x=a}} (\mathbf{x} - \mathbf{a}) \end{equation}

where $\nabla_x f = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2} \right]$ is the gradient of $f$.

Now let's say we would like to improve our approximation using the second derivatives of $f$, thereby adding a third term to the expansion ($n=2$). Because $f$ is a function of two variables, we have the following second derivatives at our disposal:

\begin{align} \frac{\partial^2 f}{\partial x_1^2} &= \frac{\partial}{\partial x_1}\left( \frac{\partial f}{\partial x_1} \right)\\[3 ex] \frac{\partial^2 f}{\partial x_2^2} &= \frac{\partial}{\partial x_2}\left( \frac{\partial f}{\partial x_2} \right)\\[3 ex] \frac{\partial^2 f}{\partial x_1 \partial x_2} &= \frac{\partial}{\partial x_1}\left( \frac{\partial f}{\partial x_2} \right)\\[3 ex] \frac{\partial^2 f}{\partial x_2 \partial x_1} &= \frac{\partial}{\partial x_2}\left( \frac{\partial f}{\partial x_1} \right) \end{align}

There are two things to note before we move forward. The first is that each of these partials contributes to the overall change in $f$. The second is that we are dealing with the term of degree 2, which means that the change in $\mathbf{x}$ is now quadratic. Thus, the term of degree 2 can be written as:

\begin{equation} \frac{1}{2} \left[ (x_1-a_1)^2 \left. \frac{\partial^2 f}{\partial x_1^2} \right|_{x=a} + (x_2-a_2)^2 \left. \frac{\partial^2 f}{\partial x_2^2} \right|_{x=a} + (x_1-a_1)(x_2 - a_2) \left. \frac{\partial^2 f}{\partial x_1 \partial x_2} \right|_{x=a} + (x_2-a_2)(x_1-a_1) \left. \frac{\partial^2 f}{\partial x_2 \partial x_1} \right|_{x=a} \right] \end{equation}

We can now use this term to extend equation (7), and the Taylor expansion of degree 2 becomes:

\begin{align} T_2(\mathbf{x}) =& f(\mathbf{a}) + \left[ (x_1 - a_1)\left. \frac{\partial f}{\partial x_1}\right|_{x=a} + (x_2 - a_2)\left. \frac{\partial f}{\partial x_2}\right|_{x=a} \right]\nonumber\\[3 ex] &+ \frac{1}{2} \left[ (x_1-a_1)^2 \left. \frac{\partial^2 f}{\partial x_1^2} \right|_{x=a} + (x_2-a_2)^2 \left. \frac{\partial^2 f}{\partial x_2^2} \right|_{x=a} \right. \nonumber\\[3 ex] &+ \left. (x_1-a_1)(x_2 - a_2) \left. \frac{\partial^2 f}{\partial x_1 \partial x_2} \right|_{x=a} + (x_2-a_2)(x_1-a_1) \left. \frac{\partial^2 f}{\partial x_2 \partial x_1} \right|_{x=a} \right] \end{align}

We are only working with a function of two variables, but this expression is already getting a bit complicated. However, expressing it in vector notation makes it much more readable:

\begin{equation} T_2(\mathbf{x}) = f(\mathbf{a}) + \left. \nabla_x f\right|_{\mathbf{x=a}} (\mathbf{x} - \mathbf{a}) + \frac{1}{2} (\mathbf{x} - \mathbf{a})^\top \left.\mathbf{H}\right|_{\mathbf{x=a}} (\mathbf{x} - \mathbf{a}) \end{equation}

In the equation above, the matrix $\mathbf{H}$ is a 2-by-2 matrix containing the second-order partial derivatives of $f$:

\begin{equation} \mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} \\[3 ex] \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} \end{bmatrix} \end{equation}

The matrix $\mathbf{H}$ is called the Hessian. In general, the Hessian of a function of $k$ variables is a $k$-by-$k$ matrix, and it is symmetric whenever the second partial derivatives are continuous (Schwarz's theorem), which means that $H_{ij} = H_{ji}$.

Example

Let's put all this theory into practice by approximating the multivariate function $f(x_1, x_2) = \sin(x_1^2 + x_2^2)$ around the anchor point $\mathbf{a} = [0, 0]$. Before we dive into the calculations, let's visualize this function as a 3D surface plot.
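A minimal sketch of such a plot (again assuming numpy and matplotlib):

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid of (x1, x2) values around the anchor point
x1, x2 = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))
f = np.sin(x1**2 + x2**2)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(x1, x2, f, cmap="viridis")
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.set_zlabel("$f(x_1, x_2)$")
plt.show()
```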

Let's see how well we can approximate this function using a Taylor polynomial of degree 2. As we have seen above, it takes the following form:

\begin{equation} T_2(\mathbf{x}) = f(\mathbf{a}) + \left. \nabla_x f\right|_{\mathbf{x=a}} (\mathbf{x} - \mathbf{a}) + \frac{1}{2} (\mathbf{x} - \mathbf{a})^\top \left.\mathbf{H}\right|_{\mathbf{x=a}} (\mathbf{x} - \mathbf{a}) \end{equation}

All we need to do is find the gradient $\nabla_x f$ and the Hessian $\mathbf{H}$. We won't cover the analytical calculations step-by-step here for brevity, but if you would like to have a go at it, remember to use the chain rule and the product rule! Let's start by calculating the gradient:

\begin{align} \frac{\partial f}{\partial x_1} &= 2 x_1 \cos(x_1^2 + x_2^2) \\[3 ex] \frac{\partial f}{\partial x_2} &= 2 x_2 \cos(x_1^2 + x_2^2) \\[3 ex] \nabla_x f &= \left[ 2 x_1 \cos(x_1^2 + x_2^2), 2 x_2 \cos(x_1^2 + x_2^2) \right] \end{align}

Next, we calculate the second order partials for the Hessian:

\begin{align} \frac{\partial^2 f}{\partial x_1^2} &= \frac{\partial}{\partial x_1} \left( \frac{\partial f}{\partial x_1} \right) = 2 \cos(x_1^2 + x_2^2) - 4x_1^2\sin(x_1^2 + x_2^2) \\[3 ex] \frac{\partial^2 f}{\partial x_2^2} &= \frac{\partial}{\partial x_2} \left( \frac{\partial f}{\partial x_2} \right) = 2 \cos(x_1^2 + x_2^2) - 4x_2^2\sin(x_1^2 + x_2^2) \\[3 ex] \frac{\partial^2 f}{\partial x_1 \partial x_2} &= \frac{\partial}{\partial x_1} \left( \frac{\partial f}{\partial x_2} \right) = -4 x_1 x_2 \sin(x_1^2 + x_2^2) \\[3 ex] \frac{\partial^2 f}{\partial x_2 \partial x_1} &= \frac{\partial}{\partial x_2} \left( \frac{\partial f}{\partial x_1} \right) = -4 x_1 x_2 \sin(x_1^2 + x_2^2) \end{align}

The Hessian is thus

\begin{equation} \mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} \\[3 ex] \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} \end{bmatrix} = \begin{bmatrix} 2 \cos(x_1^2 + x_2^2) - 4x_1^2\sin(x_1^2 + x_2^2) & -4 x_1 x_2 \sin(x_1^2 + x_2^2) \\[3 ex] -4 x_1 x_2 \sin(x_1^2 + x_2^2) & 2 \cos(x_1^2 + x_2^2) - 4x_2^2\sin(x_1^2 + x_2^2) \end{bmatrix} \end{equation}
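If you would like to sanity-check these analytical results, one option is to differentiate symbolically, for instance with sympy (assuming it is installed):

```python
import sympy as sp

x1, x2 = sp.symbols("x1 x2")
f = sp.sin(x1**2 + x2**2)

# First-order partials (the gradient)
grad = [sp.diff(f, v) for v in (x1, x2)]
print(grad)  # [2*x1*cos(x1**2 + x2**2), 2*x2*cos(x1**2 + x2**2)]

# Matrix of second-order partials (the Hessian)
H = sp.hessian(f, (x1, x2))
print(sp.simplify(H))
```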

Now let's check that the results above are correct by approximating the function $f$ around the anchor point $\mathbf{a} = [0, 0]$. In the implementation below, we will use the np.vectorize decorator to help us apply the gradient, the Hessian and the Taylor approximation functions throughout a grid of values for $x_1$ and $x_2$. That means that we write the functions as if x1 and x2 were scalar values and decorate them with np.vectorize. The resulting vectorized function is then able to accept arrays, applying the original function to their individual elements. Note: using np.vectorize is not the most efficient way of performing this computation, but it is often convenient and can make the calculations easier to follow.
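Here is a sketch of what such an implementation could look like (the helper names f, gradient, hessian and taylor_2 are our own, not from the original):

```python
import numpy as np
import matplotlib.pyplot as plt

def f(x1, x2):
    return np.sin(x1**2 + x2**2)

def gradient(x1, x2):
    """Gradient of f at a single point (x1, x2)."""
    c = np.cos(x1**2 + x2**2)
    return np.array([2 * x1 * c, 2 * x2 * c])

def hessian(x1, x2):
    """Hessian of f at a single point (x1, x2)."""
    c = np.cos(x1**2 + x2**2)
    s = np.sin(x1**2 + x2**2)
    return np.array([
        [2 * c - 4 * x1**2 * s, -4 * x1 * x2 * s],
        [-4 * x1 * x2 * s, 2 * c - 4 * x2**2 * s],
    ])

# Evaluate everything once at the anchor point a = [0, 0]
a = np.array([0.0, 0.0])
f_a, grad_a, hess_a = f(*a), gradient(*a), hessian(*a)

@np.vectorize
def taylor_2(x1, x2):
    """Degree-2 Taylor approximation of f around a, for scalar x1, x2."""
    d = np.array([x1, x2]) - a
    return f_a + grad_a @ d + 0.5 * d @ hess_a @ d

# Apply the vectorized approximation over a grid and plot both surfaces
x1, x2 = np.meshgrid(np.linspace(-1.5, 1.5, 50), np.linspace(-1.5, 1.5, 50))
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(x1, x2, f(x1, x2), cmap="viridis", alpha=0.6)
ax.plot_wireframe(x1, x2, taylor_2(x1, x2), color="purple",
                  rstride=5, cstride=5)
plt.show()
```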

In the plot above, the purple wireframe plot depicts the approximation of our original function. As you can see, the approximation is pretty good around the anchor point $\mathbf{a} = [0, 0]$. However, as we move further from it, the approximation naturally becomes worse, for the reasons we highlighted in the univariate scenario above.

Conclusion

Congratulations on reaching the end of this section! You should now have a good understanding of how derivatives can be used to approximate univariate and multivariate functions via the Taylor series. This approach is sometimes used for optimisation purposes: when we cannot optimise a given function directly, we may be able to approximate it using a Taylor polynomial and optimise the approximation instead!