Linear regression

In this section, we will start looking at linear regression based on a single feature: this is called univariate linear regression.

The dataset we will work with comes from the World Happiness Report, which collects data on happiness levels across different countries.

Here, our task will be to predict happiness levels in different countries based on their GDP (Gross Domestic Product) per capita, which is a measure of a country's economic output per person.

The linear regression model for this problem can be written as:

\begin{equation} \text{Happiness} = w \times \text{GDP} + b \end{equation}

where $w$ is a parameter that scales the value of GDP linearly and $b$ is a constant offset, often called a bias term. If $w$ is large, then happiness grows more rapidly with increases in GDP. If $w$ is small, then happiness grows slowly as the GDP increases.

Note: The model above defines a linear relationship between happiness and GDP, but it says nothing about causation! We will touch on this later, but for the time being remember that this model does not imply that e.g. increasing a country's GDP will necessarily improve happiness levels.

To make the equations less verbose, let's use $y$ to represent happiness and $x$ to represent GDP, so we can write:

\begin{equation} y = wx + b \end{equation}
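As a quick illustration, the model is nothing more than the following function (a minimal sketch; the parameter values used below are made up, not fitted):

```python
def predict(x, w, b):
    """Linear model: predicted happiness for a given log-GDP value x."""
    return w * x + b

# Made-up parameter values, purely to show how the model is evaluated
print(predict(10.996, 0.5, 1.0))  # 0.5 * 10.996 + 1.0 = 6.498
```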

The dataset above gives a collection of pairs of values for $x$ and $y$. For instance, here are 3 examples from the dataset:

| Country  | Happiness | log(GDP per capita) |
|----------|-----------|---------------------|
| Denmark  | 7.504     | 10.996              |
| Egypt    | 3.881     | 9.480               |
| Portugal | 5.954     | 10.504              |

Let's say we take an example from the dataset and plug it into equation (2), replacing $y$ with the happiness score and $x$ with the GDP per capita (note that we are actually working with the logarithm of the GDP per capita).

Imagine we selected Denmark's example, for which the happiness score was 7.504 and the GDP measure was 10.996. Substituting in equation (2) we get:

\begin{equation} 7.504 = 10.996w + b \end{equation}

We have one equation and we want to find out the values of two unknowns: $w$ and $b$. Unfortunately, that is not possible because there is an infinite number of values for $w$ and $b$ that will make equation (3) valid; for example, both $(w, b) = (0, 7.504)$ and $(w, b) = (1, -3.492)$ satisfy it. Thinking about it geometrically, there is an infinite number of lines that can pass through a single point.

Let's take another example from the dataset to give us another equation to work with, say the Egypt example. We now have:

\begin{equation} \begin{cases} 7.504 = 10.996w + b \\[2ex] 3.881 = 9.480w + b \end{cases} \end{equation}

You probably recognise this as a system of linear equations with two equations and two unknowns. We can solve for $w$ and $b$, and that would essentially give us a linear model which we could use to make a prediction.
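As a side note, this little system could also be solved numerically; here is a minimal sketch using NumPy (the variable names are ours):

```python
import numpy as np

# Each row is [x_i, 1], matching the equation y_i = w * x_i + b
A = np.array([[10.996, 1.0],   # Denmark
              [9.480, 1.0]])   # Egypt
y = np.array([7.504, 3.881])

w, b = np.linalg.solve(A, y)
print(w, b)  # approximately 2.39 and -18.77
```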

In this case, we get $w \approx 2.39$ and $b \approx -18.77$. This would be the correct solution if all the examples in the dataset were accurately described by the line above, i.e. if all the examples were precisely on top of the orange line. Unfortunately, that is not the case. Let's add Portugal to the plot:

Unfortunately, there is no way to find a line that passes through all 3 points above. We have a system of linear equations with 3 equations and 2 unknowns, but it is an inconsistent system: we can't find a set of values for $w$ and $b$ that satisfies all three equations.

We need to accept that we won't be able to find a line that runs through all the examples in our dataset, so instead we can ask: can we find the line that gets as close as possible to all the examples in our dataset?

To do that, we need to define what we mean by as close as possible. Let's define some notation to help us formalise this idea. Let $(x_i, y_i)$ represent the $i^{th}$ example of our dataset. Additionally, let $\hat{y_i}$ be the $y$-coordinate of the line evaluated at the point $x_i$, i.e.

\begin{equation} \hat{y_i} = wx_i + b \end{equation}

We want the distance between $y_i$ and $\hat{y_i}$ to be as small as possible, so one possibility would be to minimise the following loss function:

\begin{equation} L(w,b) = \sum_{i=1}^{N}(y_i - \hat{y_i})^2 = \sum_{i=1}^{N}(y_i - (wx_i + b))^2. \end{equation}

This loss function is known as the sum of squared errors (SSE).
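In code, the SSE loss might look like the following sketch (x and y here are stand-ins holding just the three examples from the table above):

```python
import numpy as np

def sse_loss(w, b, x, y):
    """Sum of squared errors between the observed y and the line w*x + b."""
    y_hat = w * x + b
    return np.sum((y - y_hat) ** 2)

# Stand-in data: the three examples from the table above
x = np.array([10.996, 9.480, 10.504])
y = np.array([7.504, 3.881, 5.954])
print(sse_loss(2.39, -18.77, x, y))
```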

Minimising the loss

Now, we want to find the values of $w$ and $b$ that minimise $L$. In general, there are two ways to go about this sort of problem:

  1. We find the parameters that minimise the loss analytically
  2. We use an iterative optimisation method to arrive at the best possible parameter estimates

In many cases it is not possible to arrive at an analytical solution, so iterative optimisation is the best we can do. However, it is possible to work out the optimal parameters analytically for linear regression, as we shall see next.

We start by computing the partial derivatives of the loss with respect to $w$ and $b$:

\begin{align} \frac{\partial L}{\partial w} &= \sum_{i=1}^{N} -2x_i(y_i - (wx_i + b))\\[3ex] \frac{\partial L}{\partial b} &= \sum_{i=1}^{N} -2(y_i - (wx_i + b)) \end{align}

To arrive at the equations above, we started from equation (6) and applied the chain rule of differentiation, $h'(x) = f'(g(x))g'(x)$ for $h(x) = f(g(x))$, with $f(x) = x^2$ and $g(x) = y_i - (wx_i + b)$.
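Written out for a single term of the sum, differentiating with respect to $w$ gives

$$ \frac{\partial}{\partial w}\big(y_i - (wx_i + b)\big)^2 = 2\big(y_i - (wx_i + b)\big) \times (-x_i) = -2x_i\big(y_i - (wx_i + b)\big), $$

and summing over the $N$ examples yields the expression for $\frac{\partial L}{\partial w}$. The same steps, with $\frac{\partial g}{\partial b} = -1$, give the expression for $\frac{\partial L}{\partial b}$.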

As we know from calculus, the minimum of a differentiable function must occur where the derivatives are zero, so we need to find the values of $w$ and $b$ which satisfy:

\begin{align} \frac{\partial L}{\partial w} &= 0\\[3ex] \frac{\partial L}{\partial b} &= 0 \end{align}

Let's start with the partial derivative with respect to $b$.

\begin{equation} \frac{\partial L}{\partial b} = \sum_{i=1}^{N} -2(y_i - (wx_i + b)) = 0. \end{equation}

We can get rid of the factor $-2$ as that does not affect the solution. Furthermore, splitting up the sum into two separate sums we get

\begin{equation} \sum_{i=1}^{N} y_i - \sum_{i=1}^{N}(wx_i + b) = 0. \end{equation}

Let's make use of the fact that $\sum_{i=1}^{N} y_i = N\bar{y}$, where $\bar{y}$ denotes the average of $y$. We can thus write

\begin{equation} N\bar{y} - \sum_{i=1}^{N}(wx_i + b) = 0. \end{equation}

Following the same logic, we can get rid of the remaining sum:

\begin{equation} N\bar{y} - (wN\bar{x} + Nb) = 0. \end{equation}

Finally, dividing through by $N$ and solving for $b$ we get

\begin{equation} b = \bar{y} - w\bar{x}. \end{equation}

Following a similar approach to solve $\frac{\partial L}{\partial w} = 0$, we get to the solution

\begin{equation} w = \frac{\sum_{i=1}^N (x_i - \bar{x}) (y_i - \bar{y})}{\sum_{i=1}^N (x_i - \bar{x})^2}. \end{equation}
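To see where this comes from, substitute $b = \bar{y} - w\bar{x}$ into $\frac{\partial L}{\partial w} = 0$. Each residual becomes $y_i - \bar{y} - w(x_i - \bar{x})$, so

$$ \sum_{i=1}^{N} x_i\big(y_i - \bar{y} - w(x_i - \bar{x})\big) = 0 \quad\Longrightarrow\quad w = \frac{\sum_{i=1}^N x_i(y_i - \bar{y})}{\sum_{i=1}^N x_i(x_i - \bar{x})}, $$

which matches the expression above because $\sum_{i=1}^N \bar{x}(y_i - \bar{y}) = 0$ and $\sum_{i=1}^N \bar{x}(x_i - \bar{x}) = 0$, so $x_i$ can be replaced by $x_i - \bar{x}$ in both the numerator and the denominator.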

Equations (15) and (16) are called the ordinary least squares (OLS) solutions because they follow from minimising the sum of squared errors.

Example

Now let's see this in action. We have added more data to our dataset in the cell below, and we will compute the estimates for $w$ and $b$ as per the OLS solutions above.
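The computation in that cell might look something like the sketch below; the arrays x and y stand in for the dataset's log-GDP and happiness columns, and here they only contain the three examples from the table above.

```python
import numpy as np

# Stand-in data: log(GDP per capita) and happiness scores for the
# three examples shown earlier (the full dataset has many more rows).
x = np.array([10.996, 9.480, 10.504])
y = np.array([7.504, 3.881, 5.954])

# OLS estimates from equations (16) and (15)
w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - w * x.mean()

print(f"w = {w:.3f}, b = {b:.3f}")
```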

As expected, we get a pretty good fit to the data. We would find a very similar solution if we were to use an iterative optimisation approach (e.g. gradient descent), but when possible we favour analytical solutions.
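For comparison, a minimal gradient descent sketch for this loss could look as follows; the learning rate and iteration count are arbitrary illustrative choices, not tuned values.

```python
import numpy as np

def fit_gradient_descent(x, y, lr=1e-3, n_iters=500_000):
    """Minimise the SSE loss L(w, b) by plain gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(n_iters):
        residual = y - (w * x + b)
        grad_w = -2.0 * np.sum(x * residual)  # equation (7)
        grad_b = -2.0 * np.sum(residual)      # equation (8)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

x = np.array([10.996, 9.480, 10.504])
y = np.array([7.504, 3.881, 5.954])
print(fit_gradient_descent(x, y))  # slowly approaches the OLS estimates
```

Note that many iterations are needed here: because $x$ is far from zero, the slope and the bias are strongly coupled and plain gradient descent converges slowly on this data, which is one more reason to prefer the analytical solution when it exists.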

As a final note, notice the general approach we followed here: we started by defining a model, we wrote down its loss function, and finally we found the minimum of the loss function with respect to the parameters. We will be using this standard approach for pretty much every ML model.

In the next sections, we will move on to multivariate linear regression, where we use multiple features to predict the target value.