Regression

Welcome to your first module on ML models! Linear regression is a simple and very popular model, which makes it a good place to start your journey into ML. But before we start looking into the actual model, we will define some key concepts and useful terminology.

We will start by asking the question: what is regression? Regression is the process of predicting numerical values based on some data. For instance, predicting house prices from various house attributes (e.g. area, number of bedrooms) is a regression problem. We call the attributes of the house **features** (sometimes also called regressors or covariates). The corresponding house price is called the **label** or **target**, with the latter term being the more appropriate one for regression problems. A target together with its corresponding set of features is called an **example** (sometimes also called an instance). Assuming we would like to predict house prices solely from house area and number of bedrooms, an example would consist of the following information, sketched in code after this list:

  • House Price (Target)
  • Number of bedrooms (Feature)
  • Area (Feature)
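
To make these terms concrete, here is a minimal sketch in plain Python (the field names and numbers are made up purely for illustration) of how a single example could be represented as a set of features together with a target:

```python
# A single example: a set of features plus the corresponding target value.
# The field names and numbers below are illustrative, not from a real dataset.
example = {
    "features": {
        "area_sqm": 85.0,     # area of the house (feature)
        "num_bedrooms": 2,    # number of bedrooms (feature)
    },
    "target": 315_000.0,      # house price (target)
}

# A dataset is simply a collection of such examples.
dataset = [example]  # in practice this list would contain many examples
```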

We would then need a collection of such examples in order to develop a model: this collection is called a **dataset**. For reasons that will become obvious later, we want to split our dataset into at least two sets:

  1. **Training set**: the collection of examples that will be used to develop (train) the model.
  2. **Test set**: the collection of examples that will be used to evaluate the performance of the model.

More often than not, datasets are split into three sets, with an additional **validation set** (also called **development set**) being used to optimise model hyperparameters or better guide the training process. We will see examples of this in later modules.
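
As an illustrative sketch (plain Python, with a hypothetical `split_dataset` helper and arbitrary 70/15/15 fractions), a dataset could be shuffled and split into the three sets as follows; in practice you would often reach for a library utility such as scikit-learn's `train_test_split` instead:

```python
import random

def split_dataset(examples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle a list of examples and split it into training, validation
    and test sets. The 70/15/15 fractions are illustrative only."""
    rng = random.Random(seed)     # fixed seed so the split is reproducible
    shuffled = examples[:]        # copy, so the original list is left untouched
    rng.shuffle(shuffled)

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train_set = shuffled[:n_train]
    val_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]   # everything left over
    return train_set, val_set, test_set

# Usage (assuming `all_examples` is a list of examples as defined earlier):
# train_set, val_set, test_set = split_dataset(all_examples)
```

Shuffling before splitting matters: if the examples are stored in some systematic order (e.g. by neighbourhood or by price), a naive contiguous split would give training and test sets with different distributions.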

Make sure you fully understand the meaning of the words in bold—we will use them heavily going forward, including in the next section, where we will start looking into the actual linear regression algorithm.