In the previous sections, we have restricted ourselves to working with univariate functions, i.e. functions of a single variable. In this section, we will start looking at functions of multiple variables. These are called multivariate functions.
In single-variable functions, the output of a function $f(x)$ depends only on the value of $x$. If $x$ does not change, then the value of $f(x)$ will remain the same. In multivariate functions, changes in the output of a function can be caused by changes in one or more of its variables.
For instance, consider a function $f(x, y)$. This function can vary by changing $x$ and/or $y$. Moreover, changes in $x$ most likely affect the function in a different way than changes in $y$. Thus, we now need to be able to characterise how the function changes with respect to both $x$ and $y$.
This is where partial derivatives come in. A partial derivative characterises how the output of a multivariate function changes if we vary a single variable, keeping all the others fixed. Thus, a multivariate function $f(x_1, \dots, x_n)$ has as many partial derivatives as it has variables, and they are defined as:

$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \dots, x_i + h, \dots, x_n) - f(x_1, \dots, x_i, \dots, x_n)}{h}$$
The gradient of the function $f$ is a vector whose elements are all the partial derivatives of $f$:

$$\left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right]$$

The gradient is also commonly denoted by $\nabla f$ and less commonly referred to as the Jacobian.
Consider the function $f(x, y)$. Let's compute the gradient for this function. First, let's compute the partial derivative with respect to $x$:
The second term in $f$ vanishes because it is treated as a constant, i.e. by definition, the partial derivative with respect to $x$ implies that $y$ does not vary. Now let's compute the partial with respect to $y$:
Here, the first term vanishes for the very same reason: the partial with respect to $y$ implies that $x$ does not vary, so any term that depends on $x$ alone vanishes.
Finally, we can assemble our gradient:
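To make this procedure concrete in code, here is a minimal SymPy sketch with a stand-in function $f(x, y) = x^2 + y^3$ (an assumption for illustration only, not the function used above); it shows how terms that depend only on the other variable vanish when taking each partial derivative.

```python
import sympy as sp

# Stand-in function, assumed purely for illustration: f(x, y) = x**2 + y**3.
x, y = sp.symbols('x y')
f = x**2 + y**3

# Partial with respect to x: the y**3 term is treated as a constant and vanishes.
df_dx = sp.diff(f, x)   # 2*x

# Partial with respect to y: the x**2 term is treated as a constant and vanishes.
df_dy = sp.diff(f, y)   # 3*y**2

# Assemble the gradient as a vector of partial derivatives.
gradient = sp.Matrix([df_dx, df_dy])
print(gradient)         # Matrix([[2*x], [3*y**2]])
```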
The rules we covered when looking at derivatives of univariate functions still apply in the multivariate case (you may have noticed that we have applied the sum rule in the previous example). The main difference is that the rules are now expressed in terms of partial derivatives with respect to vectors, i.e. the vectors of partial derivatives. We can write them as follows:

$$\nabla_{\mathbf{x}} \big( f(\mathbf{x}) + g(\mathbf{x}) \big) = \nabla_{\mathbf{x}} f(\mathbf{x}) + \nabla_{\mathbf{x}} g(\mathbf{x}) \qquad \text{(sum rule)}$$

$$\nabla_{\mathbf{x}} \big( f(\mathbf{x})\, g(\mathbf{x}) \big) = g(\mathbf{x})\, \nabla_{\mathbf{x}} f(\mathbf{x}) + f(\mathbf{x})\, \nabla_{\mathbf{x}} g(\mathbf{x}) \qquad \text{(product rule)}$$

$$\frac{\partial}{\partial x_i} f\big(g(\mathbf{x})\big) = \frac{\partial f}{\partial g} \frac{\partial g}{\partial x_i} \qquad \text{(chain rule)}$$
Let's go through some examples to make this more concrete. Consider a vector $\mathbf{x}$ and a function $f(\mathbf{x})$. This function can be rewritten as a sum $f(\mathbf{x}) = g(\mathbf{x}) + h(\mathbf{x})$ of two simpler functions $g$ and $h$.
Now let's apply the sum rule. We start by calculating the partials of $g$ with respect to the vector $\mathbf{x}$.
Now we do the same for $h$:
Finally, we can perform vector addition to arrive at the final gradient of $f$:
Again, let's check that we are correct using the method of finite differences:
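The check itself is not reproduced here, but a minimal sketch of the method of finite differences could look like the following. It uses a central-difference approximation of each partial derivative and compares it against an analytic gradient obtained with the sum rule; the functions `g` and `h` below are stand-ins chosen for illustration, not the ones used in the example above.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Approximate the gradient of f at x with central differences."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        step = np.zeros_like(x, dtype=float)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

# Stand-in functions, assumed for illustration: g(x) = x1**2 + x2**2 and h(x) = 3*x1.
g = lambda x: x[0]**2 + x[1]**2
h = lambda x: 3 * x[0]
f = lambda x: g(x) + h(x)

# Analytic gradient via the sum rule: grad f = grad g + grad h.
analytic_gradient = lambda x: np.array([2 * x[0], 2 * x[1]]) + np.array([3.0, 0.0])

x0 = np.array([1.0, -2.0])
print(numerical_gradient(f, x0))   # approximately [ 5. -4.]
print(analytic_gradient(x0))       # [ 5. -4.]
```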
Consider the functions:
Now let's apply the chain rule to compute the partial derivatives of $f$ with respect to $x$ and $y$. Starting with $x$, we see that $f$ depends on it through both of the intermediate functions defined above, since each of them depends on $x$. Thus, we can start by writing:
If we continue applying the chain rule to the partials of these intermediate functions with respect to $x$, we get:
We can follow the same logic to compute the partials with respect to $y$:
Now let's calculate the factors above:
Finally, we can substitute these into equations (21) and (23):
In this specific case, the expression for the partial derivative with respect to $y$ is fairly similar:
Our gradient is thus:
Note that the similarity between the partial derivatives with respect to $x$ and $y$ is not coincidental. This happens because $x$ and $y$ affect the function in the very same way: you could swap one for the other and you would obtain the same expression for $f$.
The calculations above are a bit involved, so we should definitely run some gradient checking to verify that they are correct:
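As above, the actual check is not shown here; the sketch below assumes a stand-in composition, with intermediate functions $u(x, y) = x + y$ and $v(x, y) = x y$ and $f = u^2 - 2v$ (symmetric in $x$ and $y$ like the example above, but otherwise an illustration only), and compares the chain-rule gradient against central differences.

```python
import numpy as np

# Stand-in composition, assumed for illustration only (the functions used in
# the example above are different): u = x + y, v = x * y, f = u**2 - 2 * v.
u = lambda x, y: x + y
v = lambda x, y: x * y
f = lambda x, y: u(x, y)**2 - 2 * v(x, y)

# Chain rule: df/dx = df/du * du/dx + df/dv * dv/dx = 2*u * 1 + (-2) * y
#             df/dy = df/du * du/dy + df/dv * dv/dy = 2*u * 1 + (-2) * x
def analytic_gradient(x, y):
    return np.array([2 * u(x, y) - 2 * y,
                     2 * u(x, y) - 2 * x])

# Central-difference approximation of the same gradient.
def numerical_gradient(x, y, eps=1e-6):
    return np.array([(f(x + eps, y) - f(x - eps, y)) / (2 * eps),
                     (f(x, y + eps) - f(x, y - eps)) / (2 * eps)])

x0, y0 = 1.5, -0.5
print(analytic_gradient(x0, y0))    # [ 3. -1.]
print(numerical_gradient(x0, y0))   # very close to [ 3. -1.]
```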
Happy days! The method of finite differences returns a value that is indeed very close to the value we computed analytically.
In the last part of this section, we refocus on the problem at hand. In the context of training a machine learning model, the reason we need the gradient is that we want to find parameters that minimise a (multivariate) function that tells us how wrong a model is on a particular dataset, i.e. the loss function. While we are not yet ready to solve that problem (this will be covered in our ML courses), we can now see how the gradient can be used to minimise a multivariate function.
Let's go back to our function $f$, now written directly as a function of $x$ and $y$.
Since $f$ is a scalar-valued function of two variables, we can easily visualise it with a 3-D surface plot:
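The plotting code is not reproduced here. The sketch below uses Plotly and a stand-in surface $f(x, y) = x^2 + y^2$; both the library and the function are assumptions for illustration, chosen because Plotly produces the kind of interactive figure described in the text.

```python
import numpy as np
import plotly.graph_objects as go

# Stand-in function, assumed for illustration: f(x, y) = x**2 + y**2.
xs = np.linspace(-4, 4, 100)
ys = np.linspace(-4, 4, 100)
X, Y = np.meshgrid(xs, ys)
Z = X**2 + Y**2

# Interactive 3-D surface plot of f over a grid of (x, y) values.
fig = go.Figure(data=[go.Surface(x=X, y=Y, z=Z)])
fig.update_layout(scene=dict(xaxis_title="x", yaxis_title="y", zaxis_title="f(x, y)"))
fig.show()
```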
The plot above is interactive, so you can use your mouse to visualise the function from different viewpoints. Another way to visualise it is through contour plots, which use lines and colors to encode the value of the function:
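Again, the original code is not shown; a minimal Plotly sketch for the contour view of the same stand-in surface might look like this, with the colour scale reversed so that larger values appear in red and smaller values in blue.

```python
import numpy as np
import plotly.graph_objects as go

# Same stand-in function as above: f(x, y) = x**2 + y**2.
xs = np.linspace(-4, 4, 100)
ys = np.linspace(-4, 4, 100)
X, Y = np.meshgrid(xs, ys)
Z = X**2 + Y**2

# Contour plot: lines and colours encode the value of f, with a colour bar.
fig = go.Figure(data=go.Contour(x=xs, y=ys, z=Z, colorscale="RdBu", reversescale=True))
fig.update_layout(xaxis_title="x", yaxis_title="y")
fig.show()
```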
In the graph above, regions colored in red represent larger values of $f$, while regions colored in blue represent smaller values of $f$, as shown in the color bar on the right of the plot.
We will try to find the minimum of this function using gradient descent. The process is similar to the one we followed to minimise a univariate function, but now we need to work with two variables, $x$ and $y$.
We will start at an initial point and then take steps in the direction that decreases $f$ most rapidly, i.e. the direction opposite to that of the gradient. Thus, the update rule at iteration $t$ becomes:

$$\begin{bmatrix} x_{t+1} \\ y_{t+1} \end{bmatrix} = \begin{bmatrix} x_t \\ y_t \end{bmatrix} - \alpha \, \nabla f(x_t, y_t)$$

where $\alpha$ is the step size (also known as the learning rate).
Let's implement it so we can see it in action. At every iteration, we will plot the current value of $(x, y)$ so we can visualise how the gradients push it towards the point at which the function is minimised.
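The original implementation is not reproduced here; the sketch below applies the update rule above to the same stand-in function $f(x, y) = x^2 + y^2$ and its gradient $[2x, 2y]$ (assumptions for illustration), printing the current point at each iteration instead of plotting it.

```python
import numpy as np

# Stand-in function and gradient, assumed for illustration:
# f(x, y) = x**2 + y**2, with gradient [2x, 2y].
f = lambda x, y: x**2 + y**2
grad_f = lambda x, y: np.array([2 * x, 2 * y])

point = np.array([3.0, -2.5])   # initial value of (x, y)
learning_rate = 0.1
num_iterations = 25

# Gradient descent: repeatedly step in the direction opposite to the gradient.
for i in range(num_iterations):
    point = point - learning_rate * grad_f(*point)
    print(f"iteration {i + 1:2d}: (x, y) = {point}, f(x, y) = {f(*point):.6f}")
```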
To summarise this section, we have seen how to apply the rules of differentiation to multivariate functions using partial derivatives. By doing so, we obtain the gradient of a function, which is the vector of the partial derivatives with respect to each of the variables of the multivariate function. Finally, we have also seen how to use the gradient to minimise multivariate functions.
The concepts we covered here are heavily used in many different ML techniques, where we aim to minimise the loss function of a model. For instance, deep learning relies on applying the chain rule throughout the layers of a neural network, thereby computing the gradient of the loss function with respect to each of the parameters of the model. Then, we use gradient descent to update the values of the parameters in an iterative process. In this section, we have applied the chain rule manually, but fortunately it is possible to do this automatically via automatic differentiation, as we will see in later sections.