When we covered the cosine similarity between two vectors, we briefly introduced the dot product, which we said is an operation that maps two vectors $\vec{a}$ and $\vec{b}$ to a scalar. We provided the following definition:

$$\vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i b_i$$

This states that the dot product is the sum over the element-wise multiplication of vectors $\vec{a}$ and $\vec{b}$. We also provided the following definition for the cosine similarity:

$$\cos\theta = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|}$$

where $\theta$ is the angle formed between vectors $\vec{a}$ and $\vec{b}$. By rearranging the previous equation, we can express the dot product as:

$$\vec{a} \cdot \vec{b} = \|\vec{a}\| \, \|\vec{b}\| \cos\theta$$
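As a quick numerical sanity check, here is a small NumPy sketch (with arbitrary example vectors) showing that the element-wise definition and the length-and-angle form give the same value:

```python
import numpy as np

a = np.array([2.0, 1.0])
b = np.array([1.0, 3.0])

# Element-wise definition: sum of the products of matching components.
dot_elementwise = np.sum(a * b)

# Length-and-angle form: ||a|| * ||b|| * cos(theta), where theta is the
# angle between the two vectors (computed here without using the dot product).
theta = np.arctan2(b[1], b[0]) - np.arctan2(a[1], a[0])
dot_from_angle = np.linalg.norm(a) * np.linalg.norm(b) * np.cos(theta)

print(dot_elementwise)  # 5.0
print(dot_from_angle)   # 5.0 (up to floating point error)
```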
This form is quite useful to build some intuition. Let's plot out two vectors $\vec{a}$ and $\vec{b}$:
Now let's look at the equation again. What do the last two factors of the right hand side of the equation represent? Trigonometry tells us that the length of $\vec{b}$ times the cosine of the angle $\theta$ is the length of the projection of $\vec{b}$ onto $\vec{a}$. Thus, one can think of the dot product as the length of the projection of the vector $\vec{b}$ onto the vector $\vec{a}$, scaled by the length of $\vec{a}$.
Let's represent this visually by adding the projection of $\vec{b}$ onto $\vec{a}$ to the previous graph:
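To make the projection interpretation concrete, here is a small sketch (using the same arbitrary example vectors as above) that computes the length of the projection of $\vec{b}$ onto $\vec{a}$ and scales it by the length of $\vec{a}$:

```python
import numpy as np

a = np.array([2.0, 1.0])
b = np.array([1.0, 3.0])

# ||b|| * cos(theta): the (signed) length of the projection of b onto a.
projection_length = np.dot(a, b) / np.linalg.norm(a)

# Scaling the projection length by ||a|| recovers the dot product.
print(projection_length * np.linalg.norm(a))  # 5.0
print(np.dot(a, b))                           # 5.0
```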
But what happens if $\vec{b}$ is not pointing in the same direction as $\vec{a}$? For instance, if $\theta$ were greater than $90^\circ$, then the vectors would be pointing in quite different directions. So what happens in this case?
It turns out that we were not entirely accurate when we stated that $\|\vec{b}\| \cos\theta$ is the length of the projection of $\vec{b}$ onto $\vec{a}$. Note that $\cos\theta$ can be negative if the two vectors point in different directions, so $\|\vec{b}\| \cos\theta$ would also be negative. According to our previous interpretation, this would constitute a negative length, which doesn't really make much sense.
Perhaps a better way of phrasing it is that $\|\vec{b}\| \cos\theta$ represents the extent to which $\vec{b}$ projects onto $\vec{a}$. If the vectors point in different directions, then $\vec{b}$ would project onto $-\vec{a}$ instead (the negative of $\vec{a}$), and hence $\|\vec{b}\| \cos\theta$ would be negative.
Another interesting case to consider is when $\vec{a}$ and $\vec{b}$ are orthogonal, i.e. $\theta = 90^\circ$. What happens to the projection of $\vec{b}$ onto $\vec{a}$? Because $\cos 90^\circ = 0$, the dot product is zero and therefore the projection vanishes.
The final thing to note is that the sign of the dot product depends entirely on $\cos\theta$, since $\|\vec{a}\|$ and $\|\vec{b}\|$ are positive by definition. Thus, we can draw the following conclusions:

- If the vectors point in the same direction ($\theta < 90^\circ$), the dot product is positive.
- If the vectors are orthogonal ($\theta = 90^\circ$), the dot product is zero.
- If the vectors point in different directions ($\theta > 90^\circ$), the dot product is negative.
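A quick NumPy illustration of these three cases (the vectors below are arbitrary examples chosen to point in the same direction as, orthogonally to, and away from $\vec{a}$):

```python
import numpy as np

a = np.array([2.0, 1.0])

same_direction = np.array([4.0, 2.0])   # a scaled copy of a
orthogonal = np.array([-1.0, 2.0])      # perpendicular to a
opposite = np.array([-2.0, -1.0])       # points away from a

print(np.dot(a, same_direction))  # 10.0  (positive)
print(np.dot(a, orthogonal))      # 0.0   (zero)
print(np.dot(a, opposite))        # -5.0  (negative)
```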
We have seen how the dot product can help us determine if an arbitrary vector points in the direction of some other vector. In this section, we will see how that can be used in the context of a machine learning classification problem.
We will start by introducing the Iris dataset. It consists of measurements of various features of iris flowers, such as sepal length, sepal width, petal length, and petal width. In this section we will be looking at discriminating between two species of iris flowers based on petal length and petal width measurements.
Let's start by loading the dataset. The code below extracts the petal length, petal width, and species data from the pre-loaded Iris dataset, and then keeps only the measurements for the setosa and versicolor species.
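The original code listing isn't shown here, but a minimal sketch of this step, assuming the dataset is loaded with scikit-learn's `load_iris`, could look like this:

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Columns 2 and 3 of iris.data are petal length and petal width (in cm).
petal = iris.data[:, 2:4]
species = iris.target  # 0 = setosa, 1 = versicolor, 2 = virginica

# Keep only the two species we want to discriminate between.
setosa = petal[species == 0]
versicolor = petal[species == 1]
```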
Let's visualise the dataset by plotting the petal width against the petal length measurements, and color coding it according to the species. Here, we use red for versicolor and blue for setosa species.
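Continuing from the `setosa` and `versicolor` arrays in the previous snippet, the scatter plot could be produced with matplotlib along these lines:

```python
import matplotlib.pyplot as plt

# Petal length on the x-axis, petal width on the y-axis.
plt.scatter(setosa[:, 0], setosa[:, 1], color="blue", label="setosa")
plt.scatter(versicolor[:, 0], versicolor[:, 1], color="red", label="versicolor")
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.legend()
plt.show()
```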
Note that even though we have plotted it as a scatter plot, each dot in the plot still represents a vector—in this case, a 2-dimensional vector. We are not drawing the lines and arrows to avoid making the plot cluttered, but in reality all the data points above are just 2-d vectors.
Now we ask the question: if we are given the petal length and width of a new iris flower, collected in a vector $\vec{x}$, how could we predict whether it is versicolor or setosa?
One possibility would be to define a vector $\vec{w}$ for which the dot product $\vec{w} \cdot \vec{x}$ returns values of a different sign depending on the species. But how do we pick that vector? Let's start by defining an imaginary line that separates the data from the two species. Here's one option:
There are an infinite number of lines we could pick, and we wouldn't know which one is the best without doing some more maths, but we've picked this one as it seems to be a reasonable way to separate the two classes.
If a new data point lands further up and to the right relative to the boundary, we can probably say it is likely to be a versicolor flower: in this case, let's say we would like the dot product to return a positive number.
Conversely, if it lands further down and to the left relative to the boundary, it is likely to be a setosa: in this case, we would expect the dot product to produce a negative number.
Remember that the dot product between two vectors is 0 if the vectors are orthogonal, negative if they point in different directions, and positive if they point in the same direction. Thus, we can pick a vector that is orthogonal to the decision boundary we came up with. One such vector would be a vector $\vec{w}$ with its origin at a point $\vec{c}$ on the boundary.
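As an illustration, if the hand-drawn boundary ran along a direction vector such as $(1, -1)$ (a hypothetical choice, since the boundary above was picked by eye), a candidate $\vec{w}$ orthogonal to it can be checked with the dot product:

```python
import numpy as np

# Hypothetical direction along the hand-drawn decision boundary.
boundary_direction = np.array([1.0, -1.0])

# A candidate w orthogonal to that direction (swap components, flip one sign).
w = np.array([1.0, 1.0])

print(np.dot(boundary_direction, w))  # 0.0, confirming w is orthogonal to the boundary
```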
Now, before we can use the dot product between $\vec{w}$ and $\vec{x}$ for classification purposes, we need to remember that $\vec{w}$ is defined relative to its origin $\vec{c}$. Thus, we need to define the incoming data point relative to the same origin (as opposed to the actual origin $(0, 0)$). That can be achieved by subtracting $\vec{c}$: we compute a new vector $\vec{x}' = \vec{x} - \vec{c}$.
The expression of $\vec{x}$ relative to the origin of $\vec{w}$ might have been slightly confusing, so let's discuss that in more detail. All we have done is express all vectors relative to a different origin $\vec{c}$, rather than relative to the canonical origin $(0, 0)$. That is equivalent to subtracting the vector $\vec{c}$ from all the other vectors (apart from $\vec{w}$, which was already defined relative to $\vec{c}$). This has the effect of moving $\vec{c}$ to the origin, shifting the remaining vectors accordingly.
Note how the origin of $\vec{w}$ is now at $(0, 0)$. The data point that we want to classify, $\vec{x}'$, is plotted in green.
With $\vec{w}$ and $\vec{x}'$ defined relative to the same origin, we can simply compute the dot product $\vec{w} \cdot \vec{x}'$:
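A sketch of this computation, using hypothetical values for $\vec{w}$, its origin $\vec{c}$, and the new data point $\vec{x}$ (the actual numbers used in this example are not shown here):

```python
import numpy as np

# Hypothetical values: c is a point on the decision boundary (the origin of w),
# and w is orthogonal to the boundary, pointing towards the versicolor region.
c = np.array([2.5, 0.75])
w = np.array([1.0, 1.0])

# A new flower, e.g. petal length 4.5 cm and petal width 1.5 cm (illustrative values).
x = np.array([4.5, 1.5])

# Express x relative to the origin of w, then take the dot product.
x_shifted = x - c
print(np.dot(w, x_shifted))  # 2.75, a positive number
```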
For this data point, the dot product is positive, which means that (according to our definition above) we can classify this example as an instance of versicolor.
Now let's look at another example of a flower with a petal length of 2 and a petal width of 0.5. In this case, $\vec{x} = (2, 0.5)$.
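Repeating the same computation with the hypothetical $\vec{w}$ and $\vec{c}$ from the sketch above:

```python
import numpy as np

c = np.array([2.5, 0.75])
w = np.array([1.0, 1.0])

# The new flower: petal length 2 cm, petal width 0.5 cm.
x = np.array([2.0, 0.5])

print(np.dot(w, x - c))  # -0.75, a negative number
```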
The dot product is now negative, meaning that we would classify this example as a setosa instance. This makes sense when we look at where the new vector lands.
That's it: we now have a simple classifier that allows us to predict whether a flower from the Iris dataset is likely to be a setosa or a versicolor. However, we engineered this classifier in a completely manual way, i.e. we handpicked the value of the vector $\vec{w}$ just by visually inspecting the data.
The purpose of machine learning is to find the values of the vector $\vec{w}$ that best discriminate between the classes, based on some input data. In two dimensions, we aim to find a vector $\vec{w}$ that is perpendicular to a line that separates the data from different classes as well as possible. In three dimensions, we look for a vector $\vec{w}$ that is perpendicular to the plane that best separates the data from different classes. In higher dimensions, we aim to find the vector $\vec{w}$ that is perpendicular to the hyperplane that best separates the data from different classes.
Once appropriate values of $\vec{w}$ are found, the mechanics of prediction are often as simple as in this example: the dot product between the input $\vec{x}$ and the vector $\vec{w}$ provides the output needed to discriminate between the different classes.
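Putting the pieces together, a minimal sketch of such a prediction rule (again with hypothetical values for $\vec{w}$ and $\vec{c}$ standing in for learned parameters) might look like this:

```python
import numpy as np

def predict(X, w, c):
    """Classify each row of X by the sign of its dot product with w,
    after expressing it relative to the boundary point c."""
    return np.sign((X - c) @ w)

# Hypothetical parameters; in practice these would be learned from the data.
w = np.array([1.0, 1.0])
c = np.array([2.5, 0.75])

X_new = np.array([[4.5, 1.5],    # versicolor-like measurements
                  [2.0, 0.5]])   # setosa-like measurements

print(predict(X_new, w, c))  # [ 1. -1.]  ->  versicolor, setosa
```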