Extension: Selecting The Value of k

4.9. Extension: Selecting The Value of k

Changing the value of k changes how many neighbours are averaged when making a prediction. The figure below shows what changing the value of k looks like on our water intake dataset.

../../_images/different_ks.png

A few observations:

  • The smaller the value of k, the more complex the model. Complexity here refers to the shape of the red line. You’ll notice that for a large k, in this example k = 101, we have a simple model, which takes the form of a flat line. The line is flat because there are only 101 samples in the dataset, so every training sample is always a neighbour and our prediction is just the average water intake of all the training data.

  • As you increase the value of k, the predictions at the edges of the region get worse. We call this bias, and it happens because our training samples are not evenly balanced around our point of interest.
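To make the first observation concrete, here is a minimal hand-rolled sketch of KNN regression on a tiny made-up dataset. It is not sklearn's implementation; the function name knn_predict and all the numbers are assumptions for illustration. With k = 1 each prediction copies the nearest training sample, and with k equal to the number of training samples every prediction is the same overall average.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Predict y at each query point by averaging the k nearest training targets."""
    predictions = []
    for xq in x_query:
        distances = np.abs(x_train - xq)       # 1-D distance to every training sample
        nearest = np.argsort(distances)[:k]    # indices of the k closest samples
        predictions.append(y_train[nearest].mean())
    return np.array(predictions)

# Made-up data: calories burned and water intake (litres)
x_train = np.array([400.0, 600.0, 800.0, 1000.0, 1200.0])
y_train = np.array([1.0, 1.4, 1.9, 2.3, 2.9])
x_query = np.array([450.0, 850.0, 1150.0])

pred_k1 = knn_predict(x_train, y_train, x_query, k=1)
pred_kn = knn_predict(x_train, y_train, x_query, k=len(x_train))

print(pred_k1)  # each value is the target of the single nearest sample
print(pred_kn)  # every value equals y_train.mean(), i.e. a flat line
```

Values of k between these two extremes trade off the jagged k = 1 curve against the flat k = n line.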

4.9.1. Understanding The Bias

Consider a point near the centre of our dataset, e.g. someone who has burned 900 calories. In our dataset we can find many similar neighbours, i.e. other gym goers who have also burned ~900 calories, with a roughly equal number of people who have burned less than 900 calories and more than 900 calories.

../../_images/no_bias.png

Now look towards the left end of our dataset, where we try to predict the amount of water someone drinks after burning 400 calories. We don’t have many samples representing gym goers who have burned less than 400 calories, so most of the neighbours have burned more than 400 calories and have also drunk more water. This means that in the left end region our model tends to predict a higher water intake (red line) than it probably should (black dashed line).

../../_images/bias_a.png

Similarly, in the far right region, when we try to predict the amount of water someone drinks after burning 1500 calories, most of the neighbours have burned less than 1500 calories and have drunk less water. So in the far right region our model tends to predict a lower water intake (red line) than it probably should (black dashed line).

../../_images/bias_b.png
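The edge effect described above can be reproduced with a small sketch. It assumes a noise-free, straight-line relationship between calories and water (a made-up stand-in for the real dataset) so that any prediction error is pure bias. The helper knn_predict_one and the choice of k = 11 are hypothetical.

```python
import numpy as np

def knn_predict_one(x_train, y_train, x_query, k):
    """Average the targets of the k training samples closest to x_query."""
    nearest = np.argsort(np.abs(x_train - x_query))[:k]
    return y_train[nearest].mean()

# Assumed noise-free data: evenly spaced calories with a linear water intake
x_train = np.linspace(400, 1500, 56)   # 400, 420, ..., 1500
y_train = 0.002 * x_train              # hypothetical true water intake (litres)
k = 11

errors = {}
for x_query in (400.0, 940.0, 1500.0):
    predicted = knn_predict_one(x_train, y_train, x_query, k)
    true_value = 0.002 * x_query
    errors[x_query] = predicted - true_value

# Positive error = prediction too high, negative error = prediction too low
for x_query, error in errors.items():
    print(x_query, round(error, 3))
```

At the left edge all neighbours lie to the right, so the error is positive. At the right edge they all lie to the left, so the error is negative. In the centre the neighbours are balanced on both sides and the error is essentially zero.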

4.9.2. Selecting k

When selecting the best value of k we want to find a balance where our model isn’t too complicated but also doesn’t have a high bias. The best way to select a value of k is to use a validation set to determine which value of k results in the best performance on new data. This is the same process that you would use to select the polynomial degree in polynomial regression.

We break our data up into:

  • training data: Used to fit the model (it’s what you provide when you call .fit()). This is what the model ‘sees’ as it’s trying to figure out the shape of the curve it should produce.

  • validation data: Used to determine the best value of k for the model.

  • test data: Used to estimate the performance of the final model.
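As a rough sketch of how such a three-way split can be made, the snippet below shuffles a made-up dataset and cuts it into non-overlapping groups. The 60/20/20 proportions, the seed and the synthetic data are all assumptions for illustration; in the exercises below the three sets are already provided as separate CSV files.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed so the split is repeatable

# Made-up dataset of 100 samples
x = rng.uniform(400, 1500, size=100)            # calories burned
y = 0.002 * x + rng.normal(0, 0.2, size=100)    # water intake (litres)

# Shuffle the indices, then cut them into three non-overlapping groups
indices = rng.permutation(len(x))
train_idx = indices[:60]
vali_idx = indices[60:80]
test_idx = indices[80:]

x_train, y_train = x[train_idx], y[train_idx]
x_vali, y_vali = x[vali_idx], y[vali_idx]
x_test, y_test = x[test_idx], y[test_idx]

print(len(x_train), len(x_vali), len(x_test))   # 60 20 20
```

Shuffling before cutting matters: if the data were sorted by calories, a plain slice would put all the high-calorie samples in the test set.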

Essentially we would try different values of k and then calculate the mean squared error on the validation data and fill in the table below. We start at k=1 and go up to k=n, where n is the number of training samples.

k | MSE on validation data
1 |
2 |
n |

The value of k that we should pick is the one that results in the lowest MSE on the validation data!
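The procedure above can be sketched as a loop over k. This is only an illustration on synthetic data, not the solution to the challenge below: the make_data helper, the curved trend, the noise level and the sample sizes are all assumptions. The sklearn calls used (KNeighborsRegressor and mean_squared_error) are the library's standard ones.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(seed=1)

def make_data(n):
    """Generate made-up (calories, water) samples: a curved trend plus noise."""
    x = rng.uniform(400, 1500, size=n)
    trend = 1.0 + 2.0 * np.sin((x - 400) / 1100 * np.pi / 2)
    y = trend + rng.normal(0, 0.2, size=n)
    return x.reshape(-1, 1), y   # sklearn expects a 2-D feature array

x_train, y_train = make_data(60)
x_vali, y_vali = make_data(30)

vali_mse = {}
for k in range(1, len(x_train) + 1):        # k = 1 up to n training samples
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(x_train, y_train)             # fit on the training data only
    vali_mse[k] = mean_squared_error(y_vali, model.predict(x_vali))

best_k = min(vali_mse, key=vali_mse.get)    # k with the lowest validation MSE
print(best_k, vali_mse[best_k])
```

The dictionary vali_mse plays the role of the table: one row per k, with the validation MSE filled in.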

Code challenge: Extension: Select The Best Value of k

We will use the validation data to determine the best value of k for our KNN regression model to predict water intake from calories burned.

Instructions

  1. Copy and paste in your code from the ‘Build a KNN Regression Model (k=1)’ challenge

  2. You will need to adapt this code so that you read the 'Calories' and 'Water' columns of water_intake_train.csv into x_train and y_train and the columns of water_intake_vali.csv into x_vali and y_vali

  3. Convert these to numpy arrays

  4. Construct a for loop to test values of k from 1 to 101 (inclusive)

  5. For each k, create a KNeighborsRegressor model to fit the training data and then calculate the mean squared error on the validation data

  6. Choose the k that corresponds to the lowest mean squared error on the validation data

  7. Rebuild your KNeighborsRegressor model using the chosen k

  8. Create x and y values to visualise the model.

    • Use np.linspace(250, 1750, 1750) to create an array of x values

    • Use .predict() to create a corresponding set of y values

  9. Produce a figure that:

    • Has figsize=(4, 4)

    • Plots the training data as a scatter plot

    • Plots the validation data as a scatter plot (just add another plt.scatter)

    • Plots the KNN regression as a line, in red

    • Has labels Calories Burned and Water Intake (litres)

Your plot should look like this:

../../_images/ext_select_best_value_of_k.png

Code challenge: Extension: Evaluate Your KNN Model

Now that we’ve chosen the best value of k, let’s evaluate the performance of this model on test data!

Generally, the more data we have to train our model, the better our model is. Now that we’ve used our validation data to determine our best value of k, we no longer need it for that purpose. But there’s no point throwing out good data, so we can add it to our training data! Remember that when we evaluate our model on the test data, the model is not allowed to have seen the test data before. Since our training and validation data are separate from the test data, we aren’t breaking any rules by combining them to make a larger training set.

Instructions

  1. Copy and paste in your code from the ‘Build a KNN Regression Model (k=1)’ challenge

  2. You will need to adapt this code so that you read the 'Calories' and 'Water' columns of water_intake_train.csv into x_train and y_train, the columns of water_intake_vali.csv into x_vali and y_vali and the columns of water_intake_test.csv into x_test and y_test

  3. Convert these to numpy arrays

  4. Use np.concatenate() to combine the training and validation data (see hint below)

  5. Using sklearn, create a KNeighborsRegressor model to fit to the combined training and validation data

  6. Set the value of k to the same value you chose for the ‘Extension: Select The Best Value of k’ exercise

  7. Predict the water intake of the samples in the test data

  8. Calculate and print the mean squared error of the model on the test data

Your output should look like this:

X.XXXXXXXXXXXXXXXXX

You do not need to produce a figure of your KNN model, but if you do, it should look like this:

../../_images/ext_evaluate_your_knn_model.png

You will notice that the model looks slightly different from the figure in the previous exercise. That’s because it is trained on both the training and the validation data, whereas in the previous exercise the model was built using only the training data. Since the model is given more data we should, in theory, get a more accurate model, so the fit should be a closer representation of the true relationship between calories burned and water intake.

Note

You can use np.concatenate((x, y)) to join numpy arrays. Here is an example:

import numpy as np
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
z = np.concatenate((x, y)) # Note that we use two sets of brackets
print(z)
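
The same call also works for column-shaped arrays. The sketch below assumes the feature arrays have been reshaped to a single column (the shape sklearn estimators expect) and uses made-up values. Because np.concatenate joins along axis 0 by default, the rows are stacked and the single column is kept.

```python
import numpy as np

# Made-up feature arrays shaped (n_samples, 1) and 1-D target arrays
x_train = np.array([[400], [900], [1300]])
x_vali = np.array([[650], [1100]])
y_train = np.array([1.0, 2.1, 2.8])
y_vali = np.array([1.5, 2.4])

# axis=0 is the default, so the validation rows are appended below the training rows
x_combined = np.concatenate((x_train, x_vali))
y_combined = np.concatenate((y_train, y_vali))

print(x_combined.shape)   # (5, 1)
print(y_combined.shape)   # (5,)
```

Concatenating x and y in the same order keeps each calorie value aligned with its water intake.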