Linear regression

In this section, we will start looking at linear regression based on a single feature: this is called univariate linear regression.

The dataset we will work with comes from the World Happiness Report, which collects data on happiness levels across different countries.

Here, our task will be to predict happiness levels in different countries based on their GDP (Gross Domestic Product) per capita, which is a measure of a country's economic output per person.

The linear regression model for this problem can be written as:

\begin{equation} \text{Happiness} = w \times \text{GDP} + b \end{equation}

where $w$ is a parameter that scales the value of GDP linearly and $b$ is a constant offset, often called a bias term. If $w$ is large, then happiness grows more rapidly with increases in GDP. If $w$ is small, then happiness grows slowly as the GDP increases.

Note: The model above defines a linear relationship between happiness and GDP, but it says nothing about causation! We will touch on this later, but for the time being remember that this model does not imply that e.g. increasing a country's GDP will necessarily improve happiness levels.

To make the equations less verbose, let's use $y$ to represent happiness and $x$ to represent GDP, so we can write:

\begin{equation} y = wx + b \end{equation}
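As a quick illustration, the model is nothing more than the following function (a minimal sketch; the parameter values used below are made up, not fitted):

```python
def predict(x, w, b):
    """Linear model: predicted happiness for a given log-GDP value x."""
    return w * x + b

# Made-up parameter values, purely to show how the model is evaluated
print(predict(10.996, 0.5, 1.0))  # 0.5 * 10.996 + 1.0 = 6.498
```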

The dataset above gives a collection of pairs of values for $x$ and $y$. For instance, here are 3 examples from the dataset:

| Country  | Happiness | log(GDP per capita) |
|----------|-----------|---------------------|
| Denmark  | 7.504     | 10.996              |
| Egypt    | 3.881     | 9.480               |
| Portugal | 5.954     | 10.504              |

Let's say we take an example from the dataset and plug it into equation (2), replacing $y$ with the happiness score and $x$ with the GDP per capita (note that we are actually working with the logarithm of the GDP per capita).

Imagine we selected Denmark's example, for which the happiness score was 7.504 and the GDP measure was 10.996. Substituting in equation (2) we get:

\begin{equation} 7.504 = 10.996w + b \end{equation}

We have one equation and we want to find out the values of two unknowns: $w$ and $b$. Unfortunately, that is not possible because there is an infinite number of values for $w$ and $b$ that will make equation (3) valid; for example, both $(w, b) = (0, 7.504)$ and $(w, b) = (1, -3.492)$ satisfy it. Thinking about it geometrically, there is an infinite number of lines that can pass through a single point.

Let's take another example from the dataset to give us another equation to work with, say the Egypt example. We now have:

\begin{equation} \begin{cases} 7.504 = 10.996w + b \\[2ex] 3.881 = 9.480w + b \end{cases} \end{equation}

You probably recognise this as a system of linear equations with two equations and two unknowns. We can solve for $w$ and $b$, and that would essentially give us a linear model which we could use to make a prediction.
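As a side note, this little system could also be solved numerically; here is a minimal sketch using NumPy (the variable names are ours):

```python
import numpy as np

# Each row is [x_i, 1], matching the equation y_i = w * x_i + b
A = np.array([[10.996, 1.0],   # Denmark
              [9.480, 1.0]])   # Egypt
y = np.array([7.504, 3.881])

w, b = np.linalg.solve(A, y)
print(w, b)  # approximately 2.39 and -18.77
```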

In this case, we get $w \approx 2.39$ and $b \approx -18.77$. This would be the correct solution if all the examples in the dataset were accurately described by the line above, i.e. if all the examples were precisely on top of the orange line. Unfortunately, that is not the case. Let's add Portugal to the plot:

Unfortunately, there is no way to find a line that passes through all 3 points above. We have a system of linear equations with 3 equations and 2 unknowns, but it is an inconsistent system: we can't find a set of values for $w$ and $b$ that satisfies all three equations.

We need to accept that we won't be able to find a line that runs through all the examples in our dataset, so instead we can ask: can we find the line that gets as close as possible to all the examples in our dataset?

To do that, we need to define what we mean by as close as possible. Let's define some notation to help us formalise this idea. Let $(x_i, y_i)$ represent the $i^{th}$ example of our dataset. Additionally, let $\hat{y_i}$ be the $y$-coordinate of the line evaluated at the point $x_i$, i.e.

\begin{equation} \hat{y_i} = wx_i + b \end{equation}

We want the distance between $y_i$ and $\hat{y_i}$ to be as small as possible, so one possibility would be to minimise the following loss function:

\begin{equation} L(w,b) = \sum_{i=1}^{N}(y_i - \hat{y_i})^2 = \sum_{i=1}^{N}(y_i - (wx_i + b))^2. \end{equation}

This loss function is known as the sum of squared errors (SSE).
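In code, the SSE loss might look like the following sketch (x and y here are stand-ins holding just the three examples from the table above):

```python
import numpy as np

def sse_loss(w, b, x, y):
    """Sum of squared errors between the observed y and the line w*x + b."""
    y_hat = w * x + b
    return np.sum((y - y_hat) ** 2)

# Stand-in data: the three examples from the table above
x = np.array([10.996, 9.480, 10.504])
y = np.array([7.504, 3.881, 5.954])
print(sse_loss(2.39, -18.77, x, y))
```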

Minimising the loss

Now, we want to find the values of $w$ and $b$ that minimise $L$. In general, there are two ways to go about this sort of problem:

  1. We find the parameters that minimise the loss analytically
  2. We use an iterative optimisation method to arrive at the best possible parameter estimates

In many cases it is not possible to arrive at an analytical solution, so iterative optimisation is the best we can do. However, it is possible to work out the optimal parameters analytically for linear regression, as we shall see next.

We start by computing the partial derivatives of the loss with respect to $w$ and $b$:

\begin{align} \frac{\partial L}{\partial w} &= \sum_{i=1}^{N} -2x_i(y_i - (wx_i + b))\\[3ex] \frac{\partial L}{\partial b} &= \sum_{i=1}^{N} -2(y_i - (wx_i + b)) \end{align}

To arrive at the equations above, we started from equation (6) and applied the chain rule of differentiation, $h'(x) = f'(g(x))g'(x)$ for $h(x) = f(g(x))$, with $f(x) = x^2$ and $g(x) = y_i - (wx_i + b)$.
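Written out for a single term of the sum, differentiating with respect to $w$ gives

$$ \frac{\partial}{\partial w}\big(y_i - (wx_i + b)\big)^2 = 2\big(y_i - (wx_i + b)\big) \times (-x_i) = -2x_i\big(y_i - (wx_i + b)\big), $$

and summing over the $N$ examples yields the expression for $\frac{\partial L}{\partial w}$. The same steps, with $\frac{\partial g}{\partial b} = -1$, give the expression for $\frac{\partial L}{\partial b}$.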

As we know from calculus, the minimum of a differentiable function must occur where the derivatives are zero, so we need to find the values of $w$ and $b$ which satisfy:

\begin{align} \frac{\partial L}{\partial w} &= 0\\[3ex] \frac{\partial L}{\partial b} &= 0 \end{align}

Let's start with the partial derivative with respect to $b$.

\begin{equation} \frac{\partial L}{\partial b} = \sum_{i=1}^{N} -2(y_i - (wx_i + b)) = 0. \end{equation}

We can get rid of the factor $-2$ as that does not affect the solution. Furthermore, splitting up the sum into two separate sums we get

\begin{equation} \sum_{i=1}^{N} y_i - \sum_{i=1}^{N}(wx_i + b) = 0. \end{equation}

Let's make use of the fact that $\sum_{i=1}^{N} y_i = N\bar{y}$, where $\bar{y}$ denotes the average of $y$. We can thus write

\begin{equation} N\bar{y} - \sum_{i=1}^{N}(wx_i + b) = 0. \end{equation}

Following the same logic, we can get rid of the remaining sum:

\begin{equation} N\bar{y} - (wN\bar{x} + Nb) = 0. \end{equation}

Finally, dividing through by $N$ and solving for $b$ we get

\begin{equation} b = \bar{y} - w\bar{x}. \end{equation}

Following a similar approach to solve $\frac{\partial L}{\partial w} = 0$, we get to the solution

\begin{equation} w = \frac{\sum_{i=1}^N (x_i - \bar{x}) (y_i - \bar{y})}{\sum_{i=1}^N (x_i - \bar{x})^2}. \end{equation}
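To see where this comes from, substitute $b = \bar{y} - w\bar{x}$ into $\frac{\partial L}{\partial w} = 0$. Each residual becomes $y_i - \bar{y} - w(x_i - \bar{x})$, so

$$ \sum_{i=1}^{N} x_i\big(y_i - \bar{y} - w(x_i - \bar{x})\big) = 0 \quad\Longrightarrow\quad w = \frac{\sum_{i=1}^N x_i(y_i - \bar{y})}{\sum_{i=1}^N x_i(x_i - \bar{x})}, $$

which matches the expression above because $\sum_{i=1}^N \bar{x}(y_i - \bar{y}) = 0$ and $\sum_{i=1}^N \bar{x}(x_i - \bar{x}) = 0$, so $x_i$ can be replaced by $x_i - \bar{x}$ in both the numerator and the denominator.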

Equations (15) and (16) are called the ordinary least squares (OLS) solutions because they follow from minimising the sum of squared errors.

Example

Now let's see this in action. We have added more data to our dataset in the cell below, and we will compute the estimates for $w$ and $b$ as per the OLS solutions above.
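The computation in that cell might look something like the sketch below; the arrays x and y stand in for the dataset's log-GDP and happiness columns, and here they only contain the three examples from the table above.

```python
import numpy as np

# Stand-in data: log(GDP per capita) and happiness scores for the
# three examples shown earlier (the full dataset has many more rows).
x = np.array([10.996, 9.480, 10.504])
y = np.array([7.504, 3.881, 5.954])

# OLS estimates from equations (16) and (15)
w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - w * x.mean()

print(f"w = {w:.3f}, b = {b:.3f}")
```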

As expected, we get a pretty good fit to the data. We would find a very similar solution if we were to use an iterative optimisation approach (e.g. gradient descent), but when possible we favour analytical solutions.
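For comparison, a minimal gradient descent sketch for this loss could look as follows; the learning rate and iteration count are arbitrary illustrative choices, not tuned values.

```python
import numpy as np

def fit_gradient_descent(x, y, lr=1e-3, n_iters=500_000):
    """Minimise the SSE loss L(w, b) by plain gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(n_iters):
        residual = y - (w * x + b)
        grad_w = -2.0 * np.sum(x * residual)  # equation (7)
        grad_b = -2.0 * np.sum(residual)      # equation (8)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

x = np.array([10.996, 9.480, 10.504])
y = np.array([7.504, 3.881, 5.954])
print(fit_gradient_descent(x, y))  # slowly approaches the OLS estimates
```

Note that many iterations are needed here: because $x$ is far from zero, the slope and the bias are strongly coupled and plain gradient descent converges slowly on this data, which is one more reason to prefer the analytical solution when it exists.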

As a final note, notice the general approach we followed here: we started by defining a model, we wrote down its loss function, and finally we found the minimum of the loss function with respect to the parameters. We will be using this standard approach for pretty much every ML model.

In the next sections, we will move on to multivariate linear regression, where we use multiple features to predict the target value.