When we covered the cosine similarity between two vectors, we briefly introduced the dot product, which we said is an operation that maps two vectors $\vec{a}$ and $\vec{b}$ to a scalar. We provided the following definition:

$$\vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i b_i$$

This states that the dot product is the sum over the element-wise multiplication of vectors $\vec{a}$ and $\vec{b}$. We also provided the following definition for the cosine similarity:

$$\cos\theta = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|}$$

where $\theta$ is the angle formed between vectors $\vec{a}$ and $\vec{b}$. By rearranging the previous equation, we can express the dot product as:

$$\vec{a} \cdot \vec{b} = \|\vec{a}\| \, \|\vec{b}\| \cos\theta$$
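As a quick numerical sanity check, here is a small NumPy sketch (with arbitrary example vectors) showing that the element-wise definition and the length-and-angle form give the same value:

```python
import numpy as np

a = np.array([2.0, 1.0])
b = np.array([1.0, 3.0])

# Element-wise definition: sum of the products of matching components.
dot_elementwise = np.sum(a * b)

# Length-and-angle form: ||a|| * ||b|| * cos(theta), where theta is the
# angle between the two vectors (computed here without using the dot product).
theta = np.arctan2(b[1], b[0]) - np.arctan2(a[1], a[0])
dot_from_angle = np.linalg.norm(a) * np.linalg.norm(b) * np.cos(theta)

print(dot_elementwise)  # 5.0
print(dot_from_angle)   # 5.0 (up to floating point error)
```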
This form is quite useful to build some intuition. Let's plot out two vectors $\vec{a}$ and $\vec{b}$:
Now let's look at the equation again. What do the last two factors of the right hand side of the equation represent? Trigonometry tells us that the length of $\vec{b}$ times the cosine of the angle $\theta$ is the length of the projection of $\vec{b}$ onto $\vec{a}$. Thus, one can think of the dot product as the length of the projection of the vector $\vec{b}$ onto the vector $\vec{a}$, scaled by the length of $\vec{a}$.
Let's represent this visually by adding the projection of $\vec{b}$ onto $\vec{a}$ to the previous graph:
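To make the projection interpretation concrete, here is a small sketch (using the same arbitrary example vectors as above) that computes the length of the projection of $\vec{b}$ onto $\vec{a}$ and scales it by the length of $\vec{a}$:

```python
import numpy as np

a = np.array([2.0, 1.0])
b = np.array([1.0, 3.0])

# ||b|| * cos(theta): the (signed) length of the projection of b onto a.
projection_length = np.dot(a, b) / np.linalg.norm(a)

# Scaling the projection length by ||a|| recovers the dot product.
print(projection_length * np.linalg.norm(a))  # 5.0
print(np.dot(a, b))                           # 5.0
```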
But what happens if $\vec{b}$ is not pointing in the same direction as $\vec{a}$? For instance, if $\theta$ were greater than $90^\circ$, then the vectors would be pointing in quite different directions. So what happens in this case?
It turns out that we were not entirely accurate when we stated that $\|\vec{b}\| \cos\theta$ is the length of the projection of $\vec{b}$ onto $\vec{a}$. Note that $\cos\theta$ can be negative if the two vectors point in different directions, so $\|\vec{b}\| \cos\theta$ would also be negative. According to our previous interpretation, this would constitute a negative length, which doesn't really make much sense.
Perhaps a better way of phrasing it is that $\|\vec{b}\| \cos\theta$ represents the extent to which $\vec{b}$ projects onto $\vec{a}$. If the vectors point in different directions, then $\vec{b}$ would project onto $-\vec{a}$ instead (the negative of $\vec{a}$), and hence $\|\vec{b}\| \cos\theta$ would be negative.
Another interesting case to consider is when $\vec{a}$ and $\vec{b}$ are orthogonal, i.e. $\theta = 90^\circ$. What happens to the projection of $\vec{b}$ onto $\vec{a}$? Because $\cos 90^\circ = 0$, the dot product is zero and therefore the projection vanishes.
The final thing to note is that the sign of the dot product depends entirely on $\cos\theta$, since $\|\vec{a}\|$ and $\|\vec{b}\|$ are positive by definition. Thus, we can draw the following conclusions:

- If the vectors point in the same direction ($\theta < 90^\circ$), the dot product is positive.
- If the vectors are orthogonal ($\theta = 90^\circ$), the dot product is zero.
- If the vectors point in different directions ($\theta > 90^\circ$), the dot product is negative.
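A quick NumPy illustration of these three cases (the vectors below are arbitrary examples chosen to point in the same direction as, orthogonally to, and away from $\vec{a}$):

```python
import numpy as np

a = np.array([2.0, 1.0])

same_direction = np.array([4.0, 2.0])   # a scaled copy of a
orthogonal = np.array([-1.0, 2.0])      # perpendicular to a
opposite = np.array([-2.0, -1.0])       # points away from a

print(np.dot(a, same_direction))  # 10.0  (positive)
print(np.dot(a, orthogonal))      # 0.0   (zero)
print(np.dot(a, opposite))        # -5.0  (negative)
```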
We have seen how the dot product can help us determine if an arbitrary vector points in the direction of some other vector. In this section, we will see how that can be used in the context of a machine learning classification problem.
We will start by introducing the Iris dataset. It consists of measurements of various features of iris flowers, such as sepal length, sepal width, petal length, and petal width. In this section we will be looking at discriminating between two species of iris flowers based on petal length and petal width measurements.
Let's start by loading the dataset. The code below extracts the petal length, petal width, and species data from the pre-loaded Iris dataset, and then keeps only the measurements for the setosa and versicolor species.
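The original code listing isn't shown here, but a minimal sketch of this step, assuming the dataset is loaded with scikit-learn's `load_iris`, could look like this:

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Columns 2 and 3 of iris.data are petal length and petal width (in cm).
petal = iris.data[:, 2:4]
species = iris.target  # 0 = setosa, 1 = versicolor, 2 = virginica

# Keep only the two species we want to discriminate between.
setosa = petal[species == 0]
versicolor = petal[species == 1]
```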
Let's visualise the dataset by plotting the petal width against the petal length measurements, and color coding it according to the species. Here, we use red for versicolor and blue for setosa species.
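Continuing from the `setosa` and `versicolor` arrays in the previous snippet, the scatter plot could be produced with matplotlib along these lines:

```python
import matplotlib.pyplot as plt

# Petal length on the x-axis, petal width on the y-axis.
plt.scatter(setosa[:, 0], setosa[:, 1], color="blue", label="setosa")
plt.scatter(versicolor[:, 0], versicolor[:, 1], color="red", label="versicolor")
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.legend()
plt.show()
```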
Note that even though we have plotted it as a scatter plot, each dot in the plot still represents a vector—in this case, a 2-dimensional vector. We are not drawing the lines and arrows to avoid making the plot cluttered, but in reality all the data points above are just 2-d vectors.
Now we ask the question: if we are given the petal length and width of a new iris flower, collected in a vector $\vec{x}$, how could we predict whether it is versicolor or setosa?
One possibility would be to define a vector $\vec{w}$ for which the dot product $\vec{w} \cdot \vec{x}$ returns values of a different sign depending on the species. But how do we pick that vector? Let's start by defining an imaginary line that separates the data from the two species. Here's one option:
There are an infinite number of lines we could pick, and we wouldn't know which one is the best without doing some more maths, but we've picked this one as it seems to be a reasonable way to separate the two classes.
If a new data point lands further up and to the right relative to the boundary, we can probably say it is likely to be a versicolor flower: in this case, let's say we would like the dot product to return a positive number.
Conversely, if it lands further down and to the left relative to the boundary, it is likely to be a setosa: in this case, we would expect the dot product to produce a negative number.
Remember that the dot product between two vectors is 0 if the vectors are orthogonal, negative if they point in different directions, and positive if they point in the same direction. Thus, we can pick a vector that is orthogonal to the decision boundary we came up with. One such vector would be a vector $\vec{w}$ with its origin at a point $\vec{c}$ on the boundary.
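As an illustration, if the hand-drawn boundary ran along a direction vector such as $(1, -1)$ (a hypothetical choice, since the boundary above was picked by eye), a candidate $\vec{w}$ orthogonal to it can be checked with the dot product:

```python
import numpy as np

# Hypothetical direction along the hand-drawn decision boundary.
boundary_direction = np.array([1.0, -1.0])

# A candidate w orthogonal to that direction (swap components, flip one sign).
w = np.array([1.0, 1.0])

print(np.dot(boundary_direction, w))  # 0.0, confirming w is orthogonal to the boundary
```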
Now, before we can use the dot product between $\vec{w}$ and $\vec{x}$ for classification purposes, we need to remember that $\vec{w}$ is defined relative to its origin $\vec{c}$. Thus, we need to define the incoming data point relative to the same origin (as opposed to the actual origin $(0, 0)$). That can be achieved by subtracting $\vec{c}$: we compute a new vector $\vec{x}' = \vec{x} - \vec{c}$.
The expression of $\vec{x}$ relative to the origin of $\vec{w}$ might have been slightly confusing, so let's discuss that in more detail. All we have done is express all vectors relative to a different origin $\vec{c}$, rather than relative to the canonical origin $(0, 0)$. That is equivalent to subtracting the vector $\vec{c}$ from all the other vectors (apart from $\vec{w}$, which was already defined relative to $\vec{c}$). This has the effect of moving $\vec{c}$ to the origin, shifting the remaining vectors accordingly.
Note how the origin of $\vec{w}$ is now at $(0, 0)$. The data point that we want to classify, $\vec{x}'$, is plotted in green.
With $\vec{w}$ and $\vec{x}'$ defined relative to the same origin, we can simply compute the dot product $\vec{w} \cdot \vec{x}'$:
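A sketch of this computation, using hypothetical values for $\vec{w}$, its origin $\vec{c}$, and the new data point $\vec{x}$ (the actual numbers used in this example are not shown here):

```python
import numpy as np

# Hypothetical values: c is a point on the decision boundary (the origin of w),
# and w is orthogonal to the boundary, pointing towards the versicolor region.
c = np.array([2.5, 0.75])
w = np.array([1.0, 1.0])

# A new flower, e.g. petal length 4.5 cm and petal width 1.5 cm (illustrative values).
x = np.array([4.5, 1.5])

# Express x relative to the origin of w, then take the dot product.
x_shifted = x - c
print(np.dot(w, x_shifted))  # 2.75, a positive number
```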
For this data point, the dot product is positive, which means that (according to our definition above) we can classify this example as an instance of versicolor.
Now let's look at another example of a flower with a petal length of 2 and a petal width of 0.5. In this case, $\vec{x} = (2, 0.5)$.
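Repeating the same computation with the hypothetical $\vec{w}$ and $\vec{c}$ from the sketch above:

```python
import numpy as np

c = np.array([2.5, 0.75])
w = np.array([1.0, 1.0])

# The new flower: petal length 2 cm, petal width 0.5 cm.
x = np.array([2.0, 0.5])

print(np.dot(w, x - c))  # -0.75, a negative number
```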
The dot product is now negative, meaning that we would classify this example as a setosa instance. This makes sense when we look at where the new vector lands.
That's it: we now have a simple classifier that allows us to predict whether a flower from the Iris dataset is likely to be a setosa or a versicolor. However, we engineered this classifier in a completely manual way, i.e. we handpicked the value of the vector $\vec{w}$ just by visually inspecting the data.
The purpose of machine learning is to find the values of the vector $\vec{w}$ that best discriminate between the classes, based on some input data. In two dimensions, we aim to find a vector $\vec{w}$ that is perpendicular to a line that separates the data from different classes as well as possible. In three dimensions, we look for a vector $\vec{w}$ that is perpendicular to the plane that best separates the data from different classes. In higher dimensions, we aim to find the vector $\vec{w}$ that is perpendicular to the hyperplane that best separates the data from different classes.
Once appropriate values of $\vec{w}$ are found, the mechanics of prediction are often as simple as in this example: the dot product between the input $\vec{x}$ and the vector $\vec{w}$ provides the output needed to discriminate between the different classes.
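Putting the pieces together, a minimal sketch of such a prediction rule (again with hypothetical values for $\vec{w}$ and $\vec{c}$ standing in for learned parameters) might look like this:

```python
import numpy as np

def predict(X, w, c):
    """Classify each row of X by the sign of its dot product with w,
    after expressing it relative to the boundary point c."""
    return np.sign((X - c) @ w)

# Hypothetical parameters; in practice these would be learned from the data.
w = np.array([1.0, 1.0])
c = np.array([2.5, 0.75])

X_new = np.array([[4.5, 1.5],    # versicolor-like measurements
                  [2.0, 0.5]])   # setosa-like measurements

print(predict(X_new, w, c))  # [ 1. -1.]  ->  versicolor, setosa
```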