4.12. Extension: Building a KNN Classification Model

Building a KNN classification model with sklearn is very similar to building our KNeighborsRegressor model, except that we use a classifier (KNeighborsClassifier) instead of a regressor.

Let’s first look at our dataset, views.csv. It contains the social score and economic score of each voter, along with the party each voter voted for. There are 5 parties the voters can vote for:

  • 0: The green party

  • 1: The grey party

  • 2: The red party

  • 3: The blue party

  • 4: The orange party

import pandas as pd

data = pd.read_csv("views.csv")
print(data)
Output
    economic    social  vote
0   0.500000  0.500000     2
1   0.625000  0.250000     3
2   0.375000  0.375000     2
3   0.375000  0.625000     0
4   0.625000  0.625000     1
5   0.625000  0.500000     2
6   0.750000  0.375000     3
7   0.750000  0.125000     4
8   0.500000  0.875000     0
9   0.562500  0.312500     3
10  0.750000  0.562500     2
11  0.500000  0.562500     1
12  0.187500  0.750000     0
13  0.437500  0.500000     2
14  0.812500  0.250000     3
15  0.502441  0.260759     2
16  0.255138  0.877244     0
17  0.621183  0.444795     3
18  0.809379  0.457089     3
19  0.898183  0.119172     4
20  0.514586  0.126445     2
21  0.628402  0.958780     0
22  0.478552  0.729356     1
23  0.288511  0.516631     2
24  0.138908  0.893501     0
25  0.398931  0.139958     2
26  0.685574  0.764026     3
27  0.730914  0.194496     4
28  0.482168  0.397233     2
29  0.376092  0.495733     0
30  0.738228  0.763712     1
31  0.247808  0.378422     0
32  0.813439  0.380882     3
33  0.568105  0.443347     3
34  0.459687  0.196591     2
35  0.367975  0.746852     0
36  0.634882  0.540511     1
37  0.758338  0.633532     1
38  0.735519  0.231446     2
39  0.678271  0.243186     4
40  0.253510  0.621930     2
41  0.461919  0.542602     2
42  0.610444  0.733065     1
43  0.757655  0.487665     3
44  0.623316  0.112221     4
45  0.420448  0.168019     3
46  0.632816  0.356909     2
47  0.547329  0.430936     3
48  0.891050  0.479855     3
49  0.516897  0.604805     2
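
Because the vote column stores numeric labels, it can help to translate them back into party names when reading results. The sketch below uses a plain dictionary for this. The names PARTY_NAMES and to_party_names are illustrative and not part of the original material.

```python
# Map each numeric label to its party, following the list at the top of this section.
PARTY_NAMES = {
    0: "green",
    1: "grey",
    2: "red",
    3: "blue",
    4: "orange",
}


def to_party_names(labels):
    """Translate an iterable of numeric vote labels into party names."""
    return [PARTY_NAMES[int(label)] for label in labels]


# The votes of the first five voters in the output above.
first_votes = [2, 3, 2, 0, 1]
print(to_party_names(first_votes))
```

The same helper could be applied to the array returned by a classifier's predict method, since it only needs values that convert to int.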

4.12.1. KNN Classification 1D

We’ll start by just predicting the party a voter votes for based on their social score. To build our KNN model we import the model.

from sklearn.neighbors import KNeighborsClassifier

Then we create the model and specify the number of neighbours we want, i.e. the value of k. For now we’ll set k=1.

knn = KNeighborsClassifier(n_neighbors=1)

We fit our model to the data using

knn.fit(x, y)

Note that just like with our linear regression model, x must be a 2D array, so we use .reshape(-1, 1) to get the right dimensions.
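
As a quick illustration of what .reshape(-1, 1) does, the sketch below reshapes a small 1D array and prints the shape before and after. The three values are the social scores of the first three voters in the dataset, typed in by hand.

```python
import numpy as np

# Social scores of the first three voters in views.csv.
x = np.array([0.5, 0.25, 0.375])
print(x.shape)  # 1D array: (3,)

# -1 asks NumPy to work out the number of rows; 1 is the number of columns.
x_2d = x.reshape(-1, 1)
print(x_2d.shape)  # 2D array: (3, 1)
```

Each row of the reshaped array is one sample and each column is one input variable, which is the layout sklearn expects.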

We can make predictions using

knn.predict(x)

Here is the complete example:

from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np

# Load data
data = pd.read_csv("views.csv")
x = data["social"].to_numpy()
y = data["vote"].to_numpy()

# Build KNN model
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(x.reshape(-1, 1), y)

# Make prediction
samples = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])
print(knn.predict(samples.reshape(-1, 1)))
Output
[4 3 2 2 1 0]

The model’s predictions are shown in the table below.

Social Score    Prediction
0.1             4
0.3             3
0.4             2
0.6             2
0.7             1
0.9             0

Here’s a visualisation of the predictions.

../../_images/views_1d_c.png
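
With k=1, the classifier simply copies the label of the closest training point. The sketch below mimics that idea by hand with NumPy. It is not sklearn's implementation, and predict_1nn is a hypothetical helper. To keep it self-contained it uses only six rows copied from views.csv (rows 44, 9, 28, 49, 22 and 24), which are the rows closest to the six samples used above.

```python
import numpy as np

# (social, vote) pairs copied from rows 44, 9, 28, 49, 22 and 24 of views.csv.
social = np.array([0.112221, 0.312500, 0.397233, 0.604805, 0.729356, 0.893501])
vote = np.array([4, 3, 2, 2, 1, 0])


def predict_1nn(sample):
    """Return the vote of the training point closest to `sample`."""
    distances = np.abs(social - sample)  # in 1D, distance is the absolute difference
    return int(vote[np.argmin(distances)])


samples = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
predictions = [predict_1nn(s) for s in samples]
print(predictions)  # [4, 3, 2, 2, 1, 0]
```

On this subset the hand-written version agrees with the predictions in the table above.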

4.13. KNN 2D

We can give more than 1 input variable to our KNN classifier model. This is similar to how you would build a multiple linear regression model. When we read in the data, we read multiple columns into our variable x.

DataFrame[[column_1, column_2, ...]]

We then need to reshape our data using .reshape(-1, N), where N is the number of input variables. In the 2D case, N is 2.
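
The sketch below shows the shapes involved, using a tiny hand-typed DataFrame (the first three rows of views.csv) in place of the real file. Selecting columns with double square brackets already yields a 2D array, so reshaping x leaves it unchanged. The reshape matters most for a single test sample, which starts out 1D.

```python
import numpy as np
import pandas as pd

# First three rows of views.csv, typed in by hand so the example is self-contained.
data = pd.DataFrame({
    "economic": [0.500, 0.625, 0.375],
    "social": [0.500, 0.250, 0.375],
    "vote": [2, 3, 2],
})

# Double square brackets select several columns at once.
x = data[["economic", "social"]].to_numpy()
print(x.shape)  # (3, 2): one row per voter, one column per input variable

# A single test sample is 1D and must become one row with two columns.
sample = np.array([0.45, 0.7])
print(sample.shape)                 # (2,)
print(sample.reshape(-1, 2).shape)  # (1, 2)
```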

Here’s a complete example where we give the model both the economic score and the social score. The test sample is (0.45, 0.7), i.e. an economic score of 0.45 and a social score of 0.7.

from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np

# Load data
data = pd.read_csv("views.csv")
x = data[["economic", "social"]].to_numpy()
y = data["vote"].to_numpy()

# Build KNN model
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(x.reshape(-1, 2), y)

# Make prediction
samples = np.array([0.45, 0.7])
print(knn.predict(samples.reshape(-1, 2)))
Output
[1]

Try changing the value of k from 1 to 4, and then to 14. You should see that

  • for k=1 the model predicts 1

  • for k=4 the model predicts 0

  • for k=14 the model predicts 2

Here’s a visualisation of these predictions.

../../_images/views_2d_c.png
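
To see why the prediction changes with k, it can help to do the vote by hand. The sketch below is a minimal NumPy version of the idea, assuming straight-line (Euclidean) distance and a simple majority vote. It is not sklearn's implementation, and predict_knn is a hypothetical helper. It uses only the six rows of views.csv closest to the test sample (rows 22, 35, 3, 49, 11 and 41), so it can reproduce the k=1 and k=4 results but not k=14.

```python
from collections import Counter

import numpy as np

# (economic, social) pairs copied from rows 22, 35, 3, 49, 11 and 41 of views.csv.
points = np.array([
    [0.478552, 0.729356],
    [0.367975, 0.746852],
    [0.375000, 0.625000],
    [0.516897, 0.604805],
    [0.500000, 0.562500],
    [0.461919, 0.542602],
])
votes = np.array([1, 0, 0, 2, 1, 2])


def predict_knn(sample, k):
    """Majority vote among the k training points closest to `sample`."""
    distances = np.linalg.norm(points - sample, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    counts = Counter(votes[nearest].tolist())
    return counts.most_common(1)[0][0]                   # most frequent label among them


sample = np.array([0.45, 0.7])
print(predict_knn(sample, k=1))  # 1
print(predict_knn(sample, k=4))  # 0
```

For k=4 the four nearest voters chose parties 1, 0, 0 and 2, so party 0 wins the vote even though the single closest voter chose party 1.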
Code challenge: Extension: MNIST KNN classification

The MNIST dataset is one of the most famous datasets in machine learning, and is often used to benchmark models. It contains 28x28 pixel images of handwritten digits. We have obtained a CSV copy of the dataset from Kaggle, but have provided a smaller version of the data here.

Your task is to build a KNN classification model and use it to classify the five samples provided in the test dataset. These images are provided below.

../../_images/mnist_test.png

You have been given code that reads the data from mnist_data.zip into x_train, y_train, x_test and y_test. You have also been provided with a function called visualise that you can use to view individual samples. We have provided an example visualisation of the first sample in the training set.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

def visualise(img):
   img = img.reshape(28,28)
   plt.imshow(img, cmap='binary')
   plt.axis('off')
   plt.savefig('plot')

x_train = pd.read_csv('mnist_train_x.csv').to_numpy()
y_train = pd.read_csv('mnist_train_y.csv').to_numpy().flatten()
x_test = pd.read_csv('mnist_test_x.csv').to_numpy()
y_test = pd.read_csv('mnist_test_y.csv').to_numpy().flatten()

# Visualisation of the first training sample
visualise(x_train[0])

# YOUR CODE GOES HERE

Instructions

  1. Using sklearn, create a KNeighborsClassifier model to fit the training data

  2. Set the value of k, i.e. n_neighbors, to 10

  3. Fit this model to the training data. Note that for a 28x28 pixel image the data should contain 784 columns.

  4. Use the KNN model to predict the labels of the five samples in the test data.

Your output should look like this:

[X X X X X]
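
The note about 784 columns in the instructions comes from flattening each 28x28 image into a single row. The sketch below illustrates the relationship with a made-up array (the pixel values are just the numbers 0 to 783, not real MNIST data): it flattens the image to 784 values and reshapes it back, which is what the visualise function does before plotting.

```python
import numpy as np

# A made-up 28x28 "image"; the values 0..783 stand in for pixel intensities.
image = np.arange(28 * 28).reshape(28, 28)

# Flatten to a single row of 784 values, one value per pixel.
row = image.reshape(-1)
print(row.shape)  # (784,)

# Reshape back to 28x28, as visualise does with img.reshape(28, 28).
restored = row.reshape(28, 28)
print(np.array_equal(image, restored))  # True
```

Printing x_train.shape before fitting is a quick way to confirm that the training data has the expected 784 columns.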