1.12. Measuring Error Using the MSE#
The mean squared error (MSE) is a commonly used metric for measuring the performance of a machine learning regression model. To calculate the MSE we take all of the error values, square them, and then take the average. We previously estimated the MSE on our study data.
| Time Spent Studying (hours) | Predicted Exam Mark (%) | Actual Exam Mark (%) | Error (Predicted - Actual) |
|---|---|---|---|
| 6 | 67 | 71 | -4 |
| 2 | 34 | 35 | -1 |
| 7.5 | 80 | 78 | 2 |
To calculate the mean squared error we take all of the error values, square them, and then take the average.

- Error values: -4, -1, 2
- Square the error values: 16, 1, 4 (note that all of these values are now positive)
- Take the average: (16 + 1 + 4) / 3 = 7
Hence, our MSE is 7.
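As a quick sanity check, here is a minimal Python sketch of the same calculation done by hand, using the error values from the table above:

# Error values (Predicted - Actual) from the table above
errors = [-4, -1, 2]
# Square each error, then take the average
squared_errors = [e ** 2 for e in errors]
print(sum(squared_errors) / len(squared_errors))

Output
7.0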
1.12.1. MSE in Python#
To calculate the MSE in Python we can use the mean_squared_error
function from sklearn.metrics. We can import it with the following:
from sklearn.metrics import mean_squared_error as mse
Since we have imported it as mse(), we call it with the syntax:
mse(actual_values, predicted_values)
Note
You can switch the order of the arguments to the mse() function: because the MSE squares each difference, the order of the arguments doesn't affect the result.
Here is an example using the data from the table above:
from sklearn.metrics import mean_squared_error as mse
import numpy as np
actual_mark = np.array([71, 35, 78])
predicted_mark = np.array([67, 34, 80])
print(mse(actual_mark, predicted_mark))
Output
7.0
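Under the hood, mean_squared_error simply averages the squared differences between the two arrays. Here is a minimal sketch verifying that with NumPy, and confirming that swapping the argument order makes no difference:

from sklearn.metrics import mean_squared_error as mse
import numpy as np
actual_mark = np.array([71, 35, 78])
predicted_mark = np.array([67, 34, 80])
# Average of the squared differences, computed directly
print(np.mean((actual_mark - predicted_mark) ** 2))
# Swapping the arguments gives the same result because the differences are squared
print(mse(predicted_mark, actual_mark))

Output
7.0
7.0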
Putting this all together with the code we used earlier to read in the data from a CSV file and build a linear regression model, we have:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as mse
import pandas as pd
import numpy as np
# Load data
data = pd.read_csv("study.csv")
x = data["Time Spent Studying (hours)"].to_numpy()
y = data["Exam Mark (%)"].to_numpy()
# Build linear regression model
linear_reg = LinearRegression()
linear_reg.fit(x.reshape(-1, 1), y)
# Make predictions on the test data
hours_studied = np.array([6, 2, 7.5])
actual_mark = np.array([71, 35, 78])
predicted_mark = linear_reg.predict(hours_studied.reshape(-1, 1))
# Calculate MSE
print(mse(actual_mark, predicted_mark))
Output
6.382587126526617
Note
Our earlier estimate of the MSE, which was 7, was only approximate because the predicted exam marks we used were themselves approximations.
1.12.2. Interpreting the MSE#
It’s hard to give an exact interpretation of the MSE. Since all the error values are squared, the units of the MSE are the square of the error units. In this case, the error was in the student’s exam mark as a percentage, so the units of the MSE are ‘percentage squared’. Taking the square root of the MSE gets us back to units of percentage; in this case \(\sqrt{7} \approx 2.6\). This gives us a rough estimate of how ‘wrong’ our model is in general, i.e. we can expect our model’s predictions to be within around 2.6% of a student’s actual exam mark.
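For example, here is a small sketch taking the square root of the MSE we calculated from the fitted model above, which comes out at roughly 2.5:

import numpy as np
# MSE from the fitted linear regression model above
mse_value = 6.382587126526617
# The square root brings us back to units of percentage
print(np.sqrt(mse_value))  # roughly 2.5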
Code Challenge: Measure Error
Here are the true box office results of the movies in our test data.
| Movie | Budget ($M) | Actual Box Office ($M) |
|---|---|---|
| Barbie | 145 | 1446 |
| Wicked | 150 | 752 |
| Everything Everywhere All At Once | 25 | 143 |
Instructions
1. Copy and paste your code from Making Predictions.
2. Create an array storing the movie budgets for the test data.
3. Create an array storing the movie box office results for the test data.
4. Predict the box office results for the test data.
5. Calculate and print the MSE.
Your output should look like this:
XXXXXX.XXXXXXXXXX
Try taking the square root of this number. This gives you an estimate of how far off your model’s box office predictions will be from the actual results, on average.
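If you get stuck on the structure, here is a rough sketch of how the pieces fit together. This is not the full solution: the file name movies.csv and the column names below are assumptions, so substitute whatever your code from Making Predictions actually uses:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as mse
import pandas as pd
import numpy as np
# Load the training data (file and column names are assumed here)
data = pd.read_csv("movies.csv")
x = data["Budget ($M)"].to_numpy()
y = data["Box Office ($M)"].to_numpy()
# Build the linear regression model
linear_reg = LinearRegression()
linear_reg.fit(x.reshape(-1, 1), y)
# Test data from the table above
budget = np.array([145, 150, 25])
actual_box_office = np.array([1446, 752, 143])
# Predict the box office results and calculate the MSE
predicted_box_office = linear_reg.predict(budget.reshape(-1, 1))
print(mse(actual_box_office, predicted_box_office))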