1.12. Measuring Error Using the MSE#
The mean squared error (MSE) is a commonly used metric for measuring the performance of a machine learning regression model. To calculate the MSE we take all of the error values, square them, and then take the average. We previously estimated the MSE on our study data.
| Time Spent Studying (hours) | Predicted Exam Mark (%) | Actual Exam Mark (%) | Error (Predicted - Actual) |
|---|---|---|---|
| 6 | 67 | 71 | -4 |
| 2 | 34 | 35 | -1 |
| 7.5 | 80 | 78 | 2 |
To calculate the mean squared error we take all of the error values, square them, and then take the average.

- Error values: -4, -1, 2
- Square the error values: 16, 1, 4 (note that all of these values are now positive)
- Take the average: (16 + 1 + 4) / 3 = 7
Hence, our MSE is 7.
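As a quick sanity check, here is a minimal Python sketch of the same calculation done by hand, using the error values from the table above:

# Error values (Predicted - Actual) from the table above
errors = [-4, -1, 2]
# Square each error, then take the average
squared_errors = [e ** 2 for e in errors]
print(sum(squared_errors) / len(squared_errors))

Output
7.0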
1.12.1. MSE in Python#
To calculate the MSE in Python we can use the mean_squared_error
function from sklearn.metrics. We can import it with the following:
from sklearn.metrics import mean_squared_error as mse
Since we have imported it as mse(), we call it with the syntax:
mse(actual_values, predicted_values)
Note
You can switch the order of the arguments to the mse() function: because the MSE squares each difference, the order of the arguments doesn't affect the result.
Here is an example using the data from the table above:
from sklearn.metrics import mean_squared_error as mse
import numpy as np
actual_mark = np.array([71, 35, 78])
predicted_mark = np.array([67, 34, 80])
print(mse(actual_mark, predicted_mark))
Output
7.0
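Under the hood, mean_squared_error simply averages the squared differences between the two arrays. Here is a minimal sketch verifying that with NumPy, and confirming that swapping the argument order makes no difference:

from sklearn.metrics import mean_squared_error as mse
import numpy as np
actual_mark = np.array([71, 35, 78])
predicted_mark = np.array([67, 34, 80])
# Average of the squared differences, computed directly
print(np.mean((actual_mark - predicted_mark) ** 2))
# Swapping the arguments gives the same result because the differences are squared
print(mse(predicted_mark, actual_mark))

Output
7.0
7.0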
Putting this all together with the code we used earlier to read in the data from a CSV file and build a linear regression model, we have:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as mse
import pandas as pd
import numpy as np
# Load data
data = pd.read_csv("study.csv")
x = data["Time Spent Studying (hours)"].to_numpy()
y = data["Exam Mark (%)"].to_numpy()
# Build linear regression model
linear_reg = LinearRegression()
linear_reg.fit(x.reshape(-1, 1), y)
# Make predictions on the test data
hours_studied = np.array([6, 2, 7.5])
actual_mark = np.array([71, 35, 78])
predicted_mark = linear_reg.predict(hours_studied.reshape(-1, 1))
# Calculate MSE
print(mse(actual_mark, predicted_mark))
Output
6.382587126526617
Note
Our earlier estimate of the MSE, which was 7, was only approximate because the predicted exam marks we used were themselves approximations.
1.12.2. Interpreting the MSE#
It’s hard to give an exact interpretation of the MSE. Since all the error values are squared, the units of the MSE are the square of the error units. In this case, the error was in the student’s exam mark as a percentage, so the units of the MSE are ‘percentage squared’. Taking the square root of the MSE gets us back to units of percentage; in this case \(\sqrt{7} \approx 2.6\). This gives us a rough estimate of how ‘wrong’ our model is in general, i.e. we can expect our model’s predictions to be within around 2.6% of a student’s actual exam mark.
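For example, here is a small sketch taking the square root of the MSE we calculated from the fitted model above, which comes out at roughly 2.5:

import numpy as np
# MSE from the fitted linear regression model above
mse_value = 6.382587126526617
# The square root brings us back to units of percentage
print(np.sqrt(mse_value))  # roughly 2.5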
Code Challenge: Measure Error
Here are the true box office results of the movies in our test data.
| Movie | Budget ($M) | Actual Box Office ($M) |
|---|---|---|
| Barbie | 145 | 1446 |
| Wicked | 150 | 752 |
| Everything Everywhere All At Once | 25 | 143 |
Instructions
1. Copy and paste your code from Making Predictions.
2. Create an array storing the movie budgets for the test data.
3. Create an array storing the movie box office results for the test data.
4. Predict the box office results for the test data.
5. Calculate and print the MSE.
Your output should look like this:
XXXXXX.XXXXXXXXXX
Try taking the square root of this number. This gives you an estimate of how far off your model’s box office predictions will be from the actual results, on average.
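If you get stuck on the structure, here is a rough sketch of how the pieces fit together. This is not the full solution: the file name movies.csv and the column names below are assumptions, so substitute whatever your code from Making Predictions actually uses:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as mse
import pandas as pd
import numpy as np
# Load the training data (file and column names are assumed here)
data = pd.read_csv("movies.csv")
x = data["Budget ($M)"].to_numpy()
y = data["Box Office ($M)"].to_numpy()
# Build the linear regression model
linear_reg = LinearRegression()
linear_reg.fit(x.reshape(-1, 1), y)
# Test data from the table above
budget = np.array([145, 150, 25])
actual_box_office = np.array([1446, 752, 143])
# Predict the box office results and calculate the MSE
predicted_box_office = linear_reg.predict(budget.reshape(-1, 1))
print(mse(actual_box_office, predicted_box_office))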