In the previous section we learned how to build a simple linear regression model to predict happiness scores based on the GDP per capita of different countries.
We will now focus on building linear regression models based on multiple features: this is called multivariate linear regression.
In particular, we're going to add a feature which captures perceived levels of corruption. With these two features at our disposal, we can now estimate happiness as follows:

$$\widehat{\text{happiness}} = w_{\text{GDP}} \cdot \text{GDP} + w_{\text{corruption}} \cdot \text{corruption} + b \tag{1}$$
To make the notation simpler, let's define GDP as being our first feature $x_1$, and corruption as being our second feature $x_2$. We can then write:

$$\hat{y} = w_1 x_1 + w_2 x_2 + b \tag{2}$$
We can simplify this equation further by writing it as the dot product between a vector of weights $\mathbf{w}$ and a vector of features $\mathbf{x}$,

$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b \tag{3}$$

where $\mathbf{w} = [w_1, w_2]^\top$ and $\mathbf{x} = [x_1, x_2]^\top$.
Let's now incorporate the bias term in the vector multiplication. Let $\mathbf{w} = [b, w_1, w_2]^\top$ and $\mathbf{x} = [1, x_1, x_2]^\top$. We can then write

$$\hat{y} = \mathbf{w}^\top \mathbf{x} \tag{4}$$

i.e.

$$\hat{y} = \begin{bmatrix} b & w_1 & w_2 \end{bmatrix} \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix} = b + w_1 x_1 + w_2 x_2 \tag{5}$$
Equation (4) gives us the prediction $\hat{y}$ for a single example $\mathbf{x}$. To compute the estimates for $n$ examples, we can stack the vectors $\mathbf{x}_1^\top, \dots, \mathbf{x}_n^\top$ into a matrix $X$ and compute the predictions with the matrix-vector product

$$\hat{\mathbf{y}} = X \mathbf{w} \tag{6}$$

where $X$ is an $n$-by-3 matrix and $\mathbf{w}$ is a 3-by-1 vector. Thus, $\hat{\mathbf{y}}$ is an $n$-by-1 vector whose elements are the predictions for each of the $n$ examples.
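To make the matrix-vector formulation concrete, here is a minimal numpy sketch; the feature values and weights below are made up for illustration and are not taken from the happiness dataset:

```python
import numpy as np

# Toy values for three hypothetical countries (illustration only).
gdp = np.array([1.4, 0.9, 1.1])          # GDP per capita feature
corruption = np.array([0.1, 0.4, 0.2])   # perceived corruption feature

# Design matrix X: a leading column of ones takes care of the bias term,
# giving an n-by-3 matrix as in equation (6).
X = np.column_stack([np.ones_like(gdp), gdp, corruption])

# An arbitrary weight vector [b, w1, w2], just for the sketch.
w = np.array([5.0, 2.0, -1.5])

# The matrix-vector product computes all n predictions at once.
y_hat = X @ w
print(y_hat)
```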
Now that we have defined the multivariate linear model, the next step is to come up with an approach to estimate its parameters. As in the univariate case, we need to write down a loss function and look for ways of minimising it.
In the multivariate case, the loss function is pretty much identical: we use the sum of squared errors across a set of $n$ predictions,

$$L(\mathbf{w}) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \tag{7}$$
which can also be written as the dot product between the vectors $\mathbf{y} - X\mathbf{w}$ and $\mathbf{y} - X\mathbf{w}$,

$$L(\mathbf{w}) = (\mathbf{y} - X\mathbf{w})^\top (\mathbf{y} - X\mathbf{w}) \tag{8}$$
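As a quick sanity check that equations (7) and (8) compute the same quantity, here is a small sketch with made-up values (the arrays below are illustrative, not the actual data):

```python
import numpy as np

# Illustrative design matrix, weights, and observed targets (made-up values).
X = np.column_stack([np.ones(3), [1.4, 0.9, 1.1], [0.1, 0.4, 0.2]])
w = np.array([5.0, 2.0, -1.5])
y = np.array([7.2, 5.1, 6.3])

residuals = y - X @ w

loss_sum = np.sum(residuals ** 2)   # equation (7): sum of squared errors
loss_dot = residuals @ residuals    # equation (8): dot product of the residual vector with itself

assert np.isclose(loss_sum, loss_dot)
```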
Again, we want to determine the parameters $\mathbf{w}$ that minimise the loss $L(\mathbf{w})$. For the loss function above, this can be done analytically following the same approach we followed in the univariate case. We compute the derivative of the loss with respect to the parameters $\mathbf{w}$ and we aim to find the values of $\mathbf{w}$ for which this derivative is zero.
Expanding equation (8), and noting that the two cross terms $\mathbf{y}^\top X \mathbf{w}$ and $\mathbf{w}^\top X^\top \mathbf{y}$ are equal scalars, we can equivalently write

$$L(\mathbf{w}) = \mathbf{y}^\top \mathbf{y} - 2\, \mathbf{w}^\top X^\top \mathbf{y} + \mathbf{w}^\top X^\top X \mathbf{w} \tag{9}$$
Now we can use matrix calculus to compute the derivatives of the terms above (using the identities $\partial (\mathbf{w}^\top \mathbf{a}) / \partial \mathbf{w} = \mathbf{a}$ and $\partial (\mathbf{w}^\top A \mathbf{w}) / \partial \mathbf{w} = 2 A \mathbf{w}$ for a symmetric matrix $A$). Note that $\mathbf{y}^\top \mathbf{y}$ does not depend on $\mathbf{w}$, so the derivative of the first term vanishes and we get:

$$\frac{\partial L}{\partial \mathbf{w}} = -2\, X^\top \mathbf{y} + 2\, X^\top X \mathbf{w} \tag{10}$$
Setting this derivative to zero and solving with respect to $\mathbf{w}$, we get

$$\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y} \tag{11}$$
Note that the matrix $X^\top X$ must be invertible. In practice, inverting this matrix can be a bit tricky from a numerical perspective, so software packages that implement linear regression usually rely on alternative approaches to matrix inversion for the sake of numerical stability.
Now let's look at a practical example. As we indicated above, we will try to build a model that predicts happiness scores based on GDP per capita and levels of perceived corruption across different countries. We will build one model using equation (11) and another one using numpy's least squares implementation.
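A sketch of what this comparison might look like is shown below. The data-loading step is omitted, and the arrays gdp, corruption and happiness (filled here with made-up values so the snippet runs on its own) stand in for the real columns of the dataset:

```python
import numpy as np

# Stand-ins for the real data: one entry per country (made-up values).
gdp = np.array([1.4, 0.9, 1.1, 1.3, 0.7])
corruption = np.array([0.1, 0.4, 0.2, 0.15, 0.5])
happiness = np.array([7.2, 5.1, 6.3, 6.9, 4.8])

# Design matrix with a leading column of ones for the bias term.
X = np.column_stack([np.ones_like(gdp), gdp, corruption])

# Model 1: the closed-form solution of equation (11).
w_normal = np.linalg.inv(X.T @ X) @ X.T @ happiness

# Model 2: numpy's least squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, happiness, rcond=None)

print(w_normal)
print(w_lstsq)  # should match w_normal up to numerical precision
```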
Reassuringly, we get identical results with both approaches! However, you should definitely use the numpy version as it is numerically more stable.
As we have done in the univariate case, we may be tempted to look at the computed weights to infer a relationship between the input features and the target variable. In this particular case, can the weights assigned to GDP and corruption levels tell us something about how they relate to happiness scores?
Our results show a positive weight for the GDP feature and a negative weight for levels of perceived corruption, which sort of makes sense: in wealthier countries people may report higher happiness scores; similarly people should be happier if they live in a less corrupt environment.
However, beyond the remark we made about avoiding conclusions about causality based on the presence or absence of a linear relationship, there is a more subtle problem in the multivariate case. When two or more input features are perfectly collinear, the least squares problem has infinitely many solutions; even when the features are merely highly correlated, the solution becomes very unstable. In fact, you will often get very different solutions if you fit the model on two slightly different samples of a dataset with highly correlated features. Thus, we cannot reliably attribute meaning to the values of the weights. In the next sections, we will look into methods which try to address this problem.
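As a rough, entirely synthetic illustration of this instability (none of it based on the happiness data), the sketch below builds a second feature that is almost a copy of the first and fits the model on two slightly perturbed versions of the targets. The individual weights swing wildly between the two fits, even though their sum, and hence the predictions, stay essentially the same:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-4, size=n)        # almost perfectly correlated with x1
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)    # true relationship only involves x1

X = np.column_stack([np.ones(n), x1, x2])

for seed in (1, 2):
    # Refit on a slightly perturbed copy of the targets.
    y_noisy = y + np.random.default_rng(seed).normal(scale=0.1, size=n)
    w, *_ = np.linalg.lstsq(X, y_noisy, rcond=None)
    # The weights on x1 and x2 vary dramatically between runs,
    # while their sum stays close to 3.
    print(w[1], w[2], w[1] + w[2])
```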