In the previous sections, we have restricted ourselves to working with univariate functions, i.e. functions of a single variable. In this section, we will start looking at functions of multiple variables. These are called multivariate functions.
In single-variable functions, the output of a function $f(x)$ depends only on the value of $x$. If $x$ does not change, then the value of $f(x)$ will remain the same. In multivariate functions, changes in the output of a function can be caused by changes in one or more of its variables.
For instance, consider a function $f(x, y)$. This function can vary by changing $x$ and/or $y$. Moreover, changes in $x$ most likely affect the function in a different way than changes in $y$. Thus, we now need to be able to characterise how the function changes with respect to both $x$ and $y$.
This is where partial derivatives come in. A partial derivative characterises how the output of a multivariate function changes if we vary a single variable, keeping all the others fixed. Thus, a multivariate function $f(x_1, \dots, x_n)$ has as many partial derivatives as it has variables, and they are defined as:

$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \dots, x_i + h, \dots, x_n) - f(x_1, \dots, x_i, \dots, x_n)}{h}$$
The gradient of the function $f$ is a vector whose elements are all the partial derivatives of $f$:

$$\left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right]$$

The gradient is also commonly denoted by $\nabla f$ and less commonly referred to as the Jacobian.
Consider the function $f(x, y)$. Let's compute the gradient for this function. First, let's compute the partial derivative with respect to $x$:
The second term in $f$ vanishes because it is treated as a constant, i.e. by definition, the partial derivative with respect to $x$ implies that $y$ does not vary. Now let's compute the partial with respect to $y$:
Here, the first term vanishes for the very same reason: the partial with respect to $y$ implies that $x$ does not vary, so any term that depends on $x$ alone vanishes.
Finally, we can assemble our gradient:
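To make this procedure concrete in code, here is a minimal SymPy sketch with a stand-in function $f(x, y) = x^2 + y^3$ (an assumption for illustration only, not the function used above); it shows how terms that depend only on the other variable vanish when taking each partial derivative.

```python
import sympy as sp

# Stand-in function, assumed purely for illustration: f(x, y) = x**2 + y**3.
x, y = sp.symbols('x y')
f = x**2 + y**3

# Partial with respect to x: the y**3 term is treated as a constant and vanishes.
df_dx = sp.diff(f, x)   # 2*x

# Partial with respect to y: the x**2 term is treated as a constant and vanishes.
df_dy = sp.diff(f, y)   # 3*y**2

# Assemble the gradient as a vector of partial derivatives.
gradient = sp.Matrix([df_dx, df_dy])
print(gradient)         # Matrix([[2*x], [3*y**2]])
```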
The rules we covered when looking at derivatives of univariate functions still apply in the multivariate case (you may have noticed that we have applied the sum rule in the previous example). The main difference is that the rules are now expressed in terms of partial derivatives with respect to vectors, i.e. the vectors of partial derivatives. We can write them as follows:

$$\nabla_{\mathbf{x}} \big( f(\mathbf{x}) + g(\mathbf{x}) \big) = \nabla_{\mathbf{x}} f(\mathbf{x}) + \nabla_{\mathbf{x}} g(\mathbf{x}) \qquad \text{(sum rule)}$$

$$\nabla_{\mathbf{x}} \big( f(\mathbf{x})\, g(\mathbf{x}) \big) = g(\mathbf{x})\, \nabla_{\mathbf{x}} f(\mathbf{x}) + f(\mathbf{x})\, \nabla_{\mathbf{x}} g(\mathbf{x}) \qquad \text{(product rule)}$$

$$\frac{\partial}{\partial x_i} f\big(g(\mathbf{x})\big) = \frac{\partial f}{\partial g} \frac{\partial g}{\partial x_i} \qquad \text{(chain rule)}$$
Let's go through some examples to make this more concrete. Consider a vector $\mathbf{x}$ and a function $f(\mathbf{x})$. This function can be rewritten as a sum $f(\mathbf{x}) = g(\mathbf{x}) + h(\mathbf{x})$ of two simpler functions $g$ and $h$.
Now let's apply the sum rule. We start by calculating the partials of $g$ with respect to the vector $\mathbf{x}$.
Now we do the same for $h$:
Finally, we can perform vector addition to arrive at the final gradient of $f$:
Again, let's check that we are correct using the method of finite differences:
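The check itself is not reproduced here, but a minimal sketch of the method of finite differences could look like the following. It uses a central-difference approximation of each partial derivative and compares it against an analytic gradient obtained with the sum rule; the functions `g` and `h` below are stand-ins chosen for illustration, not the ones used in the example above.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Approximate the gradient of f at x with central differences."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        step = np.zeros_like(x, dtype=float)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

# Stand-in functions, assumed for illustration: g(x) = x1**2 + x2**2 and h(x) = 3*x1.
g = lambda x: x[0]**2 + x[1]**2
h = lambda x: 3 * x[0]
f = lambda x: g(x) + h(x)

# Analytic gradient via the sum rule: grad f = grad g + grad h.
analytic_gradient = lambda x: np.array([2 * x[0], 2 * x[1]]) + np.array([3.0, 0.0])

x0 = np.array([1.0, -2.0])
print(numerical_gradient(f, x0))   # approximately [ 5. -4.]
print(analytic_gradient(x0))       # [ 5. -4.]
```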
Consider the functions:
Now let's apply the chain rule to compute the partial derivatives of $f$ with respect to $x$ and $y$. Starting with $x$, we see that $f$ depends on it through both of the intermediate functions defined above, since each of them depends on $x$. Thus, we can start by writing:
If we continue applying the chain rule to the partials of these intermediate functions with respect to $x$, we get:
We can follow the same logic to compute the partials with respect to $y$:
Now let's calculate the factors above:
Finally, we can substitute these into equations (21) and (23):
In this specific case, the expression for the partial derivative with respect to $y$ is fairly similar:
Our gradient is thus:
Note that the similarity between the partial derivatives with respect to $x$ and $y$ is not coincidental. This happens because $x$ and $y$ affect the function in the very same way: you could swap one for the other and you would obtain the same expression for $f$.
The calculations above are a bit involved, so we should definitely run some gradient checking to verify that they are correct:
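As above, the actual check is not shown here; the sketch below assumes a stand-in composition, with intermediate functions $u(x, y) = x + y$ and $v(x, y) = x y$ and $f = u^2 - 2v$ (symmetric in $x$ and $y$ like the example above, but otherwise an illustration only), and compares the chain-rule gradient against central differences.

```python
import numpy as np

# Stand-in composition, assumed for illustration only (the functions used in
# the example above are different): u = x + y, v = x * y, f = u**2 - 2 * v.
u = lambda x, y: x + y
v = lambda x, y: x * y
f = lambda x, y: u(x, y)**2 - 2 * v(x, y)

# Chain rule: df/dx = df/du * du/dx + df/dv * dv/dx = 2*u * 1 + (-2) * y
#             df/dy = df/du * du/dy + df/dv * dv/dy = 2*u * 1 + (-2) * x
def analytic_gradient(x, y):
    return np.array([2 * u(x, y) - 2 * y,
                     2 * u(x, y) - 2 * x])

# Central-difference approximation of the same gradient.
def numerical_gradient(x, y, eps=1e-6):
    return np.array([(f(x + eps, y) - f(x - eps, y)) / (2 * eps),
                     (f(x, y + eps) - f(x, y - eps)) / (2 * eps)])

x0, y0 = 1.5, -0.5
print(analytic_gradient(x0, y0))    # [ 3. -1.]
print(numerical_gradient(x0, y0))   # very close to [ 3. -1.]
```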
Happy days! The method of finite differences returns a value that is indeed very close to the value we computed analytically.
In the last part of this section, we refocus on the problem at hand. In the context of training a machine learning model, the reason we need the gradient is that we want to find parameters that minimise a (multivariate) function that tells us how wrong a model is on a particular dataset, i.e. the loss function. While we are not yet ready to solve that problem (this will be covered in our ML courses), we can now see how the gradient can be used to minimise a multivariate function.
Let's go back to our function $f$, now written directly as a function of $x$ and $y$.
Since $f$ is a scalar-valued function of two variables, we can easily visualise it with a 3-D surface plot:
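The plotting code is not reproduced here. The sketch below uses Plotly and a stand-in surface $f(x, y) = x^2 + y^2$; both the library and the function are assumptions for illustration, chosen because Plotly produces the kind of interactive figure described in the text.

```python
import numpy as np
import plotly.graph_objects as go

# Stand-in function, assumed for illustration: f(x, y) = x**2 + y**2.
xs = np.linspace(-4, 4, 100)
ys = np.linspace(-4, 4, 100)
X, Y = np.meshgrid(xs, ys)
Z = X**2 + Y**2

# Interactive 3-D surface plot of f over a grid of (x, y) values.
fig = go.Figure(data=[go.Surface(x=X, y=Y, z=Z)])
fig.update_layout(scene=dict(xaxis_title="x", yaxis_title="y", zaxis_title="f(x, y)"))
fig.show()
```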
The plot above is interactive, so you can use your mouse to visualise the function from different viewpoints. Another way to visualise it is through contour plots, which use lines and colors to encode the value of the function:
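Again, the original code is not shown; a minimal Plotly sketch for the contour view of the same stand-in surface might look like this, with the colour scale reversed so that larger values appear in red and smaller values in blue.

```python
import numpy as np
import plotly.graph_objects as go

# Same stand-in function as above: f(x, y) = x**2 + y**2.
xs = np.linspace(-4, 4, 100)
ys = np.linspace(-4, 4, 100)
X, Y = np.meshgrid(xs, ys)
Z = X**2 + Y**2

# Contour plot: lines and colours encode the value of f, with a colour bar.
fig = go.Figure(data=go.Contour(x=xs, y=ys, z=Z, colorscale="RdBu", reversescale=True))
fig.update_layout(xaxis_title="x", yaxis_title="y")
fig.show()
```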
In the graph above, regions colored in red represent larger values of $f$, while regions colored in blue represent smaller values of $f$, as shown in the color bar on the right of the plot.
We will try to find the minimum of this function using gradient descent. The process is similar to the one we followed to minimise a univariate function, but now we need to work with two variables, $x$ and $y$.
We will start at an initial point and then take steps in the direction that decreases $f$ most rapidly, i.e. the direction opposite to that of the gradient. Thus, the update rule at iteration $t$ becomes:

$$\begin{bmatrix} x_{t+1} \\ y_{t+1} \end{bmatrix} = \begin{bmatrix} x_t \\ y_t \end{bmatrix} - \alpha \, \nabla f(x_t, y_t)$$

where $\alpha$ is the step size (also known as the learning rate).
Let's implement it so we can see it in action. At every iteration, we will plot the current value of $(x, y)$ so we can visualise how the gradients push it towards the point at which the function is minimised.
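The original implementation is not reproduced here; the sketch below applies the update rule above to the same stand-in function $f(x, y) = x^2 + y^2$ and its gradient $[2x, 2y]$ (assumptions for illustration), printing the current point at each iteration instead of plotting it.

```python
import numpy as np

# Stand-in function and gradient, assumed for illustration:
# f(x, y) = x**2 + y**2, with gradient [2x, 2y].
f = lambda x, y: x**2 + y**2
grad_f = lambda x, y: np.array([2 * x, 2 * y])

point = np.array([3.0, -2.5])   # initial value of (x, y)
learning_rate = 0.1
num_iterations = 25

# Gradient descent: repeatedly step in the direction opposite to the gradient.
for i in range(num_iterations):
    point = point - learning_rate * grad_f(*point)
    print(f"iteration {i + 1:2d}: (x, y) = {point}, f(x, y) = {f(*point):.6f}")
```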
To summarise this section, we have seen how to apply the rules of differentiation to multivariate functions using partial derivatives. By doing so, we obtain the gradient of a function, which is the vector of the partial derivatives with respect to each of the variables of the multivariate function. Finally, we have also seen how to use the gradient to minimise multivariate functions.
The concepts we covered here are heavily used in many different ML techniques, where we aim to minimise the loss function of a model. For instance, deep learning relies on applying the chain rule throughout the layers of a neural network, thereby computing the gradient of the loss function with respect to each of the parameters of the model. Then, we use gradient descent to update the values of the parameters in an iterative process. In this section, we have applied the chain rule manually, but fortunately it is possible to do this automatically via automatic differentiation, as we will see in later sections.