manyspikes

Partial derivatives and the gradient


In the previous sections, we have restricted ourselves to working with univariate functions, i.e. functions of a single variable. In this section, we will start looking at functions of multiple variables. These are called multivariate functions.

In single-variable functions, the output of a function f(x) depends only on the value of x. If x does not change, then the value of f(x) will remain the same. In multivariate functions, changes in the output of a function can be caused by changes in one or more of its variables.

For instance, consider a function f(x_1, x_2). This function can vary by changing x_1 and/or x_2. Moreover, changes in x_1 most likely affect the function in a different way than changes in x_2. Thus, we now need to be able to characterize how the function f(x_1, x_2) changes with respect to both x_1 and x_2.

This is where partial derivatives come in. A partial derivative characterises how the output of a multivariate function changes if we vary a single variable, keeping all the others fixed. Thus, a multivariate function has as many partial derivatives as it has variables, and they are defined as:

\begin{equation} \frac{\partial f}{\partial x_i} = \lim_{\epsilon \rightarrow 0} \frac{f(x_1, \ldots, x_i + \epsilon, \ldots, x_n) - f(x_1, x_2, \ldots, x_n)}{\epsilon} \end{equation}
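To make this definition concrete, we can evaluate the difference quotient numerically for smaller and smaller values of ε. The sketch below uses a toy function f(x_1, x_2) = x_1^2 x_2 (chosen here purely for illustration, not one of the examples that follow), whose partial derivative with respect to x_1 is 2 x_1 x_2:

```python
def f(x1, x2):
    return x1**2 * x2

def partial_x1(f, x1, x2, eps):
    # Difference quotient from the definition of the partial derivative:
    # only x1 is perturbed; x2 is held fixed.
    return (f(x1 + eps, x2) - f(x1, x2)) / eps

# The true partial at (3, 2) is 2 * 3 * 2 = 12
for eps in [1e-1, 1e-3, 1e-5]:
    print(eps, partial_x1(f, 3.0, 2.0, eps))
```

As ε shrinks, the quotient approaches the true value of 12 at the point (3, 2).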

The gradient of the function f is a vector whose elements are all the partial derivatives of f:

\nabla_x f = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right]

The gradient is also commonly denoted by \frac{df}{d\mathbf{x}}, and less commonly referred to as the Jacobian (for a scalar-valued function, the Jacobian is a single row containing the partial derivatives).

Example

Consider the function f(x_1, x_2) = (x_1+4)^2 + 3(x_2 - 1)^2. Let's compute the gradient for this function. First, let's compute the partial derivative with respect to x_1:

\begin{equation} \frac{\partial f}{\partial x_1} = 2(x_1 + 4) \end{equation}

The second term in f(x_1, x_2) vanishes because it is treated as a constant, i.e. by definition, the partial derivative with respect to x_1 implies that x_2 does not vary. Now let's compute the partial with respect to x_2:

\begin{equation} \frac{\partial f}{\partial x_2} = 6(x_2 - 1) \end{equation}

Here, the first term vanishes for the very same reason: the partial with respect to x_2 implies that x_1 does not vary, so any term that depends on x_1 alone vanishes.

Finally, we can assemble our gradient:

\begin{equation} \nabla_x f = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2} \right] = \left[ 2(x_1 + 4), 6(x_2 - 1) \right] \end{equation}
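We can sanity-check this gradient with the method of finite differences. The sketch below compares the analytic gradient against central-difference approximations at an arbitrarily chosen point:

```python
import numpy as np

def f(x1, x2):
    return (x1 + 4)**2 + 3 * (x2 - 1)**2

def grad_f(x1, x2):
    # Analytic gradient derived above
    return np.array([2 * (x1 + 4), 6 * (x2 - 1)])

def fd_grad(f, x1, x2, eps=1e-6):
    # Central differences: perturb one variable at a time
    return np.array([
        (f(x1 + eps, x2) - f(x1 - eps, x2)) / (2 * eps),
        (f(x1, x2 + eps) - f(x1, x2 - eps)) / (2 * eps),
    ])

print(grad_f(1.0, 2.0))      # [10.  6.]
print(fd_grad(f, 1.0, 2.0))  # approximately the same
```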

Sum, Product and Chain rules

The rules we covered when looking at derivatives of univariate functions still apply in the multivariate case (you may have noticed that we have applied the sum rule in the previous example). The main difference is that the rules are now expressed in terms of partial derivatives with respect to vectors, i.e. the vectors of partial derivatives. We can write them as follows:

\begin{align} \text{Sum rule:}& \qquad \frac{\partial}{\partial \mathbf{x}}\left(g(\mathbf{x}) + h(\mathbf{x})\right) = \frac{\partial g}{\partial \mathbf{x}} + \frac{\partial h}{\partial \mathbf{x}}\\[3ex] \text{Product rule:}& \qquad \frac{\partial}{\partial \mathbf{x}}\left(g(\mathbf{x}) \cdot h(\mathbf{x})\right) = \frac{\partial g}{\partial \mathbf{x}}h(\mathbf{x}) + \frac{\partial h}{\partial \mathbf{x}}g(\mathbf{x})\\[3ex] \text{Chain rule:}& \qquad \frac{\partial}{\partial \mathbf{x}}g(h(\mathbf{x})) = \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial \mathbf{x}} \end{align}
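As a quick numerical sanity check of the product rule, the sketch below compares its right-hand side against a finite-difference gradient of g(\mathbf{x}) \cdot h(\mathbf{x}), for two toy functions chosen purely for illustration:

```python
import numpy as np

def g(x1, x2):
    return x1**2 + x2

def h(x1, x2):
    return x1 * x2

def grad_g(x1, x2):
    return np.array([2 * x1, 1.0])

def grad_h(x1, x2):
    return np.array([x2, x1])

def fd_grad(f, x1, x2, eps=1e-6):
    # Central-difference gradient of a two-variable function
    return np.array([
        (f(x1 + eps, x2) - f(x1 - eps, x2)) / (2 * eps),
        (f(x1, x2 + eps) - f(x1, x2 - eps)) / (2 * eps),
    ])

x1, x2 = 2.0, 3.0
# Product rule: grad(g * h) = grad(g) * h + grad(h) * g
lhs = fd_grad(lambda a, b: g(a, b) * h(a, b), x1, x2)
rhs = grad_g(x1, x2) * h(x1, x2) + grad_h(x1, x2) * g(x1, x2)
print(lhs, rhs)  # both approximately [45. 20.]
```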

Example 1: Sum rule

Let's go through some examples to make this more concrete. Consider a vector \mathbf{x} = [x_1, x_2] and a function f(\mathbf{x}) = (x_1+4)^2 x_2 + 3(x_2 - 1)^2 x_1. This function can be rewritten as f(\mathbf{x}) = g(\mathbf{x}) + h(\mathbf{x}) with g(\mathbf{x}) = (x_1 + 4)^2 x_2 and h(\mathbf{x}) = 3(x_2 - 1)^2 x_1.

Now let's apply the sum rule. We start by calculating the partials of g with respect to the vector \mathbf{x}.

\begin{align} \frac{\partial g}{\partial \mathbf{x}} &= \left[ \frac{\partial g}{\partial x_1}, \frac{\partial g}{\partial x_2} \right] \\[3ex] &= \left[ 2x_2(x_1+4), (x_1+4)^2 \right] \end{align}

Now we do the same for h:

\begin{align} \frac{\partial h}{\partial \mathbf{x}} &= \left[ \frac{\partial h}{\partial x_1}, \frac{\partial h}{\partial x_2} \right] \\[3ex] &= \left[ 3(x_2-1)^2, 6x_1(x_2-1) \right] \end{align}

Finally, we can perform vector addition to arrive at the final gradient for f:

\begin{align} \frac{\partial f}{\partial \mathbf{x}} &= \frac{\partial g}{\partial \mathbf{x}} + \frac{\partial h}{\partial \mathbf{x}} \\[3ex] &= \left[ 2x_2(x_1+4), (x_1+4)^2 \right] + \left[ 3(x_2-1)^2, 6x_1(x_2-1) \right]\\[3ex] &= \left[ 2x_2(x_1+4) + 3(x_2-1)^2, (x_1+4)^2 + 6x_1(x_2-1)\right] \end{align}

Again, let's check that we are correct using the method of finite differences:
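A check using central differences might look like the following sketch (the evaluation point is an arbitrary choice):

```python
import numpy as np

def f(x1, x2):
    return (x1 + 4)**2 * x2 + 3 * (x2 - 1)**2 * x1

def grad_f(x1, x2):
    # Analytic gradient obtained via the sum rule above
    return np.array([2 * x2 * (x1 + 4) + 3 * (x2 - 1)**2,
                     (x1 + 4)**2 + 6 * x1 * (x2 - 1)])

def fd_grad(f, x1, x2, eps=1e-6):
    # Central differences: perturb one variable at a time
    return np.array([
        (f(x1 + eps, x2) - f(x1 - eps, x2)) / (2 * eps),
        (f(x1, x2 + eps) - f(x1, x2 - eps)) / (2 * eps),
    ])

print(grad_f(0.5, -1.0))      # [ 3.   14.25]
print(fd_grad(f, 0.5, -1.0))  # approximately the same
```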

Example 2: Chain rule

Consider the functions:

\begin{align} f(u, v) &= u - v \\ u(a) &= e^{-a}\\ v(b) &= e^{-b}\\ a(x_1, x_2) &= x_1^2 + x_2^2 \\ b(x_1, x_2) &= (x_1-1)^2 + (x_2-1)^2 \end{align}

Now let's apply the chain rule to compute the partial derivatives of f with respect to x_1 and x_2. Starting with x_1, we see that f depends on it via u and v, since both functions depend on it (via a and b, respectively). Thus, we can start by writing:

\begin{equation} \frac{\partial f}{\partial x_1} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial x_1} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial x_1} \end{equation}

If we continue applying the chain rule to the partials of u and v with respect to x_1, we get:

\begin{equation} \frac{\partial f}{\partial x_1} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial a}\frac{\partial a}{\partial x_1} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial b}\frac{\partial b}{\partial x_1} \end{equation}

We can follow the same logic to compute the partials with respect to x_2:

\begin{align} \frac{\partial f}{\partial x_2} &= \frac{\partial f}{\partial u}\frac{\partial u}{\partial x_2} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial x_2}\\[3ex] &= \frac{\partial f}{\partial u}\frac{\partial u}{\partial a}\frac{\partial a}{\partial x_2} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial b}\frac{\partial b}{\partial x_2} \end{align}

Now let's calculate the factors above:

\begin{align} \frac{\partial a}{\partial x_1} &= 2x_1\\ \frac{\partial a}{\partial x_2} &= 2x_2\\ \frac{\partial b}{\partial x_1} &= 2x_1 - 2\\ \frac{\partial b}{\partial x_2} &= 2x_2 - 2\\ \frac{\partial u}{\partial a} &= -e^{-a}\\ \frac{\partial v}{\partial b} &= -e^{-b}\\ \frac{\partial f}{\partial u} &= 1\\ \frac{\partial f}{\partial v} &= -1 \end{align}

Finally, we can substitute these into the chain-rule expressions for the two partial derivatives above:

\begin{align} \frac{\partial f}{\partial x_1} &= \frac{\partial f}{\partial u}\frac{\partial u}{\partial a}\frac{\partial a}{\partial x_1} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial b}\frac{\partial b}{\partial x_1}\\[3ex] &= 1 \times \left(-e^{-\left(x_1^2+x_2^2\right)}\right)\times 2x_1 - 1 \times \left(-e^{-\left(\left(x_1-1\right)^2+\left(x_2-1\right)^2\right)}\right)\times(2x_1 - 2)\\[3ex] &= -2x_1e^{-\left(x_1^2+x_2^2\right)} + (2x_1 - 2)e^{-\left(\left(x_1-1\right)^2+\left(x_2-1\right)^2\right)} \end{align}

In this specific case, the expression for x_2 is fairly similar:

\begin{align} \frac{\partial f}{\partial x_2} &= \frac{\partial f}{\partial u}\frac{\partial u}{\partial a}\frac{\partial a}{\partial x_2} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial b}\frac{\partial b}{\partial x_2}\\[3ex] &= 1 \times \left(-e^{-\left(x_1^2+x_2^2\right)}\right)\times 2x_2 - 1 \times \left(-e^{-\left(\left(x_1-1\right)^2+\left(x_2-1\right)^2\right)}\right)\times(2x_2 - 2)\\[3ex] &= -2x_2e^{-\left(x_1^2+x_2^2\right)} + (2x_2 - 2)e^{-\left(\left(x_1-1\right)^2+\left(x_2-1\right)^2\right)} \end{align}

Our gradient is thus:

\begin{equation} \nabla_x f = \left[ -2x_1e^{-\left(x_1^2+x_2^2\right)} + (2x_1 - 2)e^{-\left(\left(x_1-1\right)^2+\left(x_2-1\right)^2\right)},\; -2x_2e^{-\left(x_1^2+x_2^2\right)} + (2x_2 - 2)e^{-\left(\left(x_1-1\right)^2+\left(x_2-1\right)^2\right)} \right] \end{equation}

Note that the similarity between the partial derivatives with respect to x_1 and x_2 is not coincidental. This happens because x_1 and x_2 affect the function f in the very same way: you could swap one for the other and you would obtain the same expression for f.

The calculations above are a bit involved, so we should definitely run some gradient checking to verify that they are correct:
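A gradient check via central differences might look like the following sketch (the evaluation points are arbitrary choices):

```python
import numpy as np

def f(x1, x2):
    return np.exp(-(x1**2 + x2**2)) - np.exp(-((x1 - 1)**2 + (x2 - 1)**2))

def grad_f(x1, x2):
    # Analytic gradient obtained via the chain rule above
    g = np.exp(-(x1**2 + x2**2))
    h = np.exp(-((x1 - 1)**2 + (x2 - 1)**2))
    return np.array([-2 * x1 * g + (2 * x1 - 2) * h,
                     -2 * x2 * g + (2 * x2 - 2) * h])

def fd_grad(f, x1, x2, eps=1e-6):
    # Central differences, one variable at a time
    return np.array([
        (f(x1 + eps, x2) - f(x1 - eps, x2)) / (2 * eps),
        (f(x1, x2 + eps) - f(x1, x2 - eps)) / (2 * eps),
    ])

print(grad_f(0.3, -0.2))
print(fd_grad(f, 0.3, -0.2))
```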

Happy days! The method of finite differences returns a value that is indeed very close to the value we computed analytically.

Multivariate gradient descent

In the last part of this section, we refocus on the problem at hand. In the context of training a machine learning model, we need the gradient because we want to find parameters that minimise a (multivariate) function that tells us how wrong a model is on a particular dataset, i.e. the loss function. While we are not yet ready to solve that problem (this will be covered in our ML courses), we can now see how the gradient can be used to minimise a multivariate function.

Let's go back to our function f, now written directly as a function of (x_1, x_2):

f(x_1, x_2) = e^{-\left(x_1^2 + x_2^2\right)} - e^{-\left(\left(x_1-1\right)^2 + \left(x_2-1\right)^2\right)}

Since f is a scalar-valued function of two variables, we can easily plot it as a 3-D surface plot:

The plot above is interactive, so you can use your mouse to visualize the function from different viewpoints. Another way to visualise it is through contour plots, which use lines and colors to encode the value of the function:

In the graph above, regions colored in red represent larger values of f, while regions colored in blue represent smaller values of f, as shown in the color bar on the right of the plot.
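The interactive figures themselves are produced by the notebook environment; a static sketch of both views using matplotlib (the grid range and colormap are illustrative choices) could look like this:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

def f(x1, x2):
    return np.exp(-(x1**2 + x2**2)) - np.exp(-((x1 - 1)**2 + (x2 - 1)**2))

# Evaluate f on a grid of (x1, x2) values
x1, x2 = np.meshgrid(np.linspace(-2, 3, 100), np.linspace(-2, 3, 100))
z = f(x1, x2)

fig = plt.figure(figsize=(10, 4))

# 3-D surface plot
ax1 = fig.add_subplot(1, 2, 1, projection="3d")
ax1.plot_surface(x1, x2, z, cmap="RdBu_r")

# Contour plot: red for larger values, blue for smaller ones
ax2 = fig.add_subplot(1, 2, 2)
contours = ax2.contourf(x1, x2, z, levels=20, cmap="RdBu_r")
fig.colorbar(contours, ax=ax2)

fig.savefig("surface_and_contour.png")
```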

We will try to find the minimum of this function using gradient descent. The process is similar to the one we followed to minimise a univariate function, but now we need to work with two variables, x_1 and x_2.

We will start at an initial point \mathbf{x} = [0.5, -1] and then take steps in the direction in which f decreases most rapidly, i.e. the direction opposite to that of the gradient. Thus, the update rule at a given iteration becomes:

\begin{equation} \mathbf{x}_{i+1} = \mathbf{x}_i - \alpha \nabla_x f|_{\mathbf{x}=\mathbf{x}_i} \end{equation}

Let's implement it so we can see it in action. For every iteration, we will plot the current value of \mathbf{x} = [x_1, x_2] so we can visualise how the gradients are pushing \mathbf{x} to the point at which the function is minimised.
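A minimal sketch of the update loop might look like this (the learning rate \alpha and iteration count are illustrative choices; instead of plotting, this sketch simply prints a few intermediate values):

```python
import numpy as np

def f(x):
    x1, x2 = x
    return np.exp(-(x1**2 + x2**2)) - np.exp(-((x1 - 1)**2 + (x2 - 1)**2))

def grad_f(x):
    # Analytic gradient derived in the chain rule example
    x1, x2 = x
    g = np.exp(-(x1**2 + x2**2))
    h = np.exp(-((x1 - 1)**2 + (x2 - 1)**2))
    return np.array([-2 * x1 * g + (2 * x1 - 2) * h,
                     -2 * x2 * g + (2 * x2 - 2) * h])

alpha = 0.1                  # learning rate (assumed value)
x = np.array([0.5, -1.0])    # initial point from the text

for i in range(100):
    x = x - alpha * grad_f(x)   # step against the gradient
    if i % 20 == 0:
        print(i, x, f(x))
```

Note that how far the iterates travel depends heavily on \alpha: the gradient of this particular f vanishes in its flat regions, so too small a step can leave \mathbf{x} stranded far from the minimum.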

To summarize this section, we have seen how to apply the rules of derivation to multivariate functions using partial derivatives. By doing so, we obtain the gradient of a function, which is the vector of partial derivatives for each of the variables of the multivariate function. Finally, we have also seen how to use the gradient to minimize multivariate functions.

The concepts we covered here are heavily used in many different ML techniques, where we aim to minimise the loss function of a model. For instance, deep learning relies on applying the chain rule throughout the layers of a neural network, thereby computing the gradient of the loss function with respect to each of the parameters of the model. Then, we use gradient descent to update the values of the parameters in an iterative process. In this section, we have applied the chain rule manually, but fortunately it is possible to do this automatically via automatic differentiation, as we will see in later sections.