Extension: Selecting The Value of k

4.9. Extension: Selecting The Value of k

Changing the value of k changes the number of neighbours used to average over when making a prediction. Here is a visualisation of what changing the value of k looks like on our water intake dataset.

A few observations:

The smaller the value of k, the more complex the model is. Complexity here refers to the shape of the red line. You’ll notice that for a large k (k = 101 in this example) we have a simple model, which takes the form of a flat line. The line is flat because there are only 101 samples in the dataset, so every training sample is always a neighbour and our prediction is just the average water intake of all the training data.
As you increase the value of k, the predictions at the edges of the region get worse. This is what we call bias, and it happens because our training samples are not evenly balanced around our point of interest.
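
To make the first observation concrete, here is a minimal sketch in plain Python. The helper `knn_predict` and the synthetic data are assumptions made for illustration (they are not the water intake dataset or a library regressor). It checks that with k equal to the number of training samples, every prediction collapses to the overall mean.

```python
import random
from statistics import mean

def knn_predict(x_train, y_train, x_new, k):
    """Average the targets of the k training points closest to x_new."""
    # Rank training indices by distance to the query point
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x_new))
    return mean(y_train[i] for i in order[:k])

# Synthetic stand-in for the water intake data (made-up numbers)
random.seed(0)
calories = [random.uniform(400, 1500) for _ in range(101)]
water = [0.5 + 0.002 * c + random.gauss(0, 0.2) for c in calories]

overall_mean = mean(water)

# With k = 101 every training sample is a neighbour of every query point,
# so the prediction is the same wherever we ask: a flat line.
flat_predictions = [knn_predict(calories, water, x, k=101) for x in (400, 900, 1500)]

# With k = 1 the model simply copies the nearest training sample.
copied = knn_predict(calories, water, calories[0], k=1)
```

Here `flat_predictions` holds the same value three times (the mean of `water`), while `copied` reproduces a single training target, which is why small k gives a wiggly line and large k a flat one.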

4.9.1. Understanding The Bias

Consider a point near the centre of our dataset, e.g. someone who has burned 900 calories. In our dataset we can find many similar neighbours, i.e. other gym goers who have also burned ~900 calories, with roughly equal numbers of people who have burned less than 900 calories and more than 900 calories.

Now look towards the left end of our dataset, where we try to predict the amount of water someone drinks after burning 400 calories. We don’t have many samples representing gym goers who have burned less than 400 calories, so most of the neighbours have burned more than 400 calories and are also drinking more water. This means that at the left end of our model we tend to predict a higher water intake (red line) than we probably should (black dashed line).

Similarly, at the far right, when we try to predict the amount of water someone drinks after burning 1500 calories, most of the neighbours have burned less than 1500 calories and are drinking less water. So at the far right of our model we tend to predict a lower water intake (red line) than we probably should (black dashed line).
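
The edge effect described above can be reproduced with a small sketch. It assumes a noise-free straight-line relationship between calories and water (invented coefficients, not the real dataset) so that any gap between prediction and truth is purely the bias. `knn_predict` and `true_water` are hypothetical helpers.

```python
from statistics import mean

def knn_predict(x_train, y_train, x_new, k):
    """Average the targets of the k training points closest to x_new."""
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x_new))
    return mean(y_train[i] for i in order[:k])

def true_water(calories):
    # Assumed "true" relationship, chosen only for illustration
    return 0.5 + 0.002 * calories

# 101 evenly spaced samples from 400 to 1500 calories, no noise
x_train = [400 + 11 * i for i in range(101)]
y_train = [true_water(x) for x in x_train]

k = 21
# Prediction minus truth at the left edge, the centre and the right edge
left_gap = knn_predict(x_train, y_train, 400, k) - true_water(400)
centre_gap = knn_predict(x_train, y_train, 950, k) - true_water(950)
right_gap = knn_predict(x_train, y_train, 1500, k) - true_water(1500)
```

At 950 calories the neighbours are balanced on both sides, so `centre_gap` is essentially zero. At 400 calories all 21 neighbours lie to the right, so `left_gap` is positive (over-prediction), and at 1500 calories they all lie to the left, so `right_gap` is negative (under-prediction).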

4.9.2. Selecting k

When selecting the best value of k we want to find a balance where our model isn’t too complicated but also doesn’t have a high bias. The best way to select a value of k is to use a validation set to determine which value of k results in the best performance on new data. This is the same process that you would use to select the polynomial degree in polynomial regression.

We break our data up into:

training data: Used to fit the model (it’s what you provide when you call .fit()). This is what the model ‘sees’ as it’s trying to figure out the shape of the curve it should produce.
validation data: Used to determine the best value of k that should be used to fit the data.
test data: Used to estimate the performance of the final model.
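
One way to produce the three subsets is to shuffle the samples and slice them. The sketch below is only an illustration: the function name `train_val_test_split`, the 60/20/20 proportions and the placeholder samples are all assumptions, not something prescribed by this section.

```python
import random

def train_val_test_split(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and cut it into three disjoint parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    n_train = len(shuffled) - n_val - n_test
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Placeholder (calories, water) pairs; made-up values
samples = [(400 + 11 * i, 0.5 + 0.002 * (400 + 11 * i)) for i in range(101)]
train, val, test = train_val_test_split(samples)
```

Shuffling before slicing matters when the data is stored in some order (for example sorted by calories), otherwise the validation and test sets would only cover one end of the range.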

Essentially we would try different values of k and then calculate the mean squared error on the validation data and fill in the table below. We start at k=1 and go up to k=n, where n is the number of training samples.

k	MSE (validation data)
1
2
…
n

The value of k that we should pick is the one that results in the lowest MSE on the validation data!
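
The whole selection procedure can be sketched in a few lines of plain Python. Everything named here (`knn_predict`, `mse`, the synthetic data and the 60/20/20 split) is an assumption for illustration. In practice you would fit the regressor you are already using on your own training data.

```python
import random
from statistics import mean

def knn_predict(x_train, y_train, x_new, k):
    """Average the targets of the k training points closest to x_new."""
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x_new))
    return mean(y_train[i] for i in order[:k])

def mse(x_train, y_train, x_eval, y_eval, k):
    """Mean squared error of k-NN predictions on an evaluation set."""
    return mean((knn_predict(x_train, y_train, x, k) - y) ** 2
                for x, y in zip(x_eval, y_eval))

# Synthetic data (made-up numbers); already in random order, so we can slice
random.seed(1)
x = [random.uniform(400, 1500) for _ in range(100)]
y = [0.5 + 0.002 * c + random.gauss(0, 0.2) for c in x]

x_train, y_train = x[:60], y[:60]
x_val, y_val = x[60:80], y[60:80]
x_test, y_test = x[80:], y[80:]

# Fill in the table: one validation MSE for each k from 1 to n
validation_mse = {k: mse(x_train, y_train, x_val, y_val, k)
                  for k in range(1, len(x_train) + 1)}

# Pick the k with the lowest validation MSE
best_k = min(validation_mse, key=validation_mse.get)

# Only now touch the test data, to estimate the final model's performance
test_mse = mse(x_train, y_train, x_test, y_test, best_k)
```

The test set is used once, after `best_k` has been chosen. Reusing it to choose k would make `test_mse` an optimistic estimate of performance on new data.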
