1.14. Multiple Linear Regression

So far, we have looked at simple linear regression, where the model has only 1 input variable. Simple linear regression models are of the form

\[y = \beta_0 + \beta_1 x\]

where $x$ is your input variable and $y$ is the value you are trying to predict.

In multiple linear regression we have multiple input variables. These linear regression models are of the form

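\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_N x_N\]

where $x_1, \ldots, x_N$ are the $N$ input variables, $\beta_0$ is the intercept, and $\beta_1, \ldots, \beta_N$ are the coefficients of the input variables.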
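To make the formula concrete, the sketch below evaluates it for a single sample with NumPy. The intercept, coefficients and inputs are made-up values chosen purely for illustration; they do not come from any fitted model.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the formula
beta_0 = 2.0                        # intercept
betas = np.array([0.5, -1.0, 3.0])  # beta_1, beta_2, beta_3
x = np.array([4.0, 1.0, 2.0])       # x_1, x_2, x_3 for one sample

# y = beta_0 + beta_1*x_1 + beta_2*x_2 + beta_3*x_3
y = beta_0 + np.dot(betas, x)
print(y)
```

The dot product multiplies each coefficient by its matching input and sums the results, which is the weighted sum in the equation.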

Let’s consider the following dataset:

Time Spent Studying (hours)	Assignment Mark (%)	Exam Mark (%)
4.5	73	60
8	89	80
1.5	65	31
3.5	66	54
5.5	67	58
3	65	30
6.5	92	78

In this case we have 2 pieces of information about each student:

The amount each student studied
The assignment mark of each student

And our goal is to predict each student’s exam mark.

When we read in the data, we want to read multiple columns into our variable x. We can do this by using

DataFrame[[column_1, column_2, ...]]

import pandas as pd

data = pd.read_csv("study_assignment.csv")
x = data[["Time Spent Studying (hours)", "Assignment Mark (%)"]].to_numpy()
y = data["Exam Mark (%)"].to_numpy()

print(x)
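If study_assignment.csv is not to hand, the same column selection can be tried on a DataFrame built directly from the table above. This is a sketch that assumes the CSV holds exactly those seven rows under those column names.

```python
import pandas as pd

# Assumption: these rows mirror the contents of study_assignment.csv
data = pd.DataFrame({
    "Time Spent Studying (hours)": [4.5, 8, 1.5, 3.5, 5.5, 3, 6.5],
    "Assignment Mark (%)": [73, 89, 65, 66, 67, 65, 92],
    "Exam Mark (%)": [60, 80, 31, 54, 58, 30, 78],
})

# Double square brackets select several columns and keep the result 2D
x = data[["Time Spent Studying (hours)", "Assignment Mark (%)"]].to_numpy()
y = data["Exam Mark (%)"].to_numpy()

print(x.shape)  # (rows, columns) of the inputs
print(y.shape)  # one exam mark per student
```

Printing the shapes is a quick way to confirm that x has one row per student and one column per input variable.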

Recall that previously, when we called linear_reg.fit(x, y), we noted the following:

Note

x: must be a 2D array with $n$ rows (one for each sample in the dataset) and 1 column. An easy way to achieve this is to use .reshape(-1, 1).

This means -1 rows and 1 column. The -1 acts as a placeholder, and numpy works out how many rows are required based on the specified number of columns, i.e. the data will automatically be reshaped into a single column.
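As a small sketch of what .reshape(-1, 1) does, here it is applied to the study hours from the table above, typed in as a 1D array:

```python
import numpy as np

hours = np.array([4.5, 8, 1.5, 3.5, 5.5, 3, 6.5])  # 1D array

# -1 lets numpy infer the number of rows; 1 fixes a single column
x = hours.reshape(-1, 1)

print(hours.shape)  # shape before reshaping
print(x.shape)      # shape after reshaping
```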

The code and process for fitting a linear regression model to this data are the same as before, except that when we call .fit(), x must be a 2D array with $n$ rows (one for each sample in the dataset) and 2 columns, since we now have 2 input variables. You’ll have noticed that x already has the correct dimensions, so we don’t need to reshape the data. If the values were instead stored in a flat array ordered sample by sample, we would use .reshape(-1, 2).
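If the two inputs were held in separate 1D arrays rather than in one DataFrame, one option for building the 2-column array is np.column_stack. This is a sketch of one approach, not the only one, and the array names hours and assignment are illustrative.

```python
import numpy as np

hours = np.array([4.5, 8, 1.5])
assignment = np.array([73, 89, 65])

# Place each 1D array in its own column, giving 3 rows and 2 columns
x = np.column_stack([hours, assignment])
print(x.shape)

# reshape(-1, 2) gives the same array only when the flat data is
# ordered sample by sample: hours, mark, hours, mark, ...
flat = np.array([4.5, 73, 8, 89, 1.5, 65])
print(np.array_equal(flat.reshape(-1, 2), x))
```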

Below is a complete example, including a prediction for a test student who studied 6 hours and has an assignment mark of 77.

from sklearn.linear_model import LinearRegression
import pandas as pd
import numpy as np

# Load data
data = pd.read_csv("study_assignment.csv")
x = data[["Time Spent Studying (hours)", "Assignment Mark (%)"]].to_numpy()
y = data["Exam Mark (%)"].to_numpy()

# Build linear regression model
linear_reg = LinearRegression()
linear_reg.fit(x, y)

print("intercept: {}".format(linear_reg.intercept_))
print("coefficients: {}".format(linear_reg.coef_))

x_test = np.array([[6, 77]])
print("predicted mark: {}".format(linear_reg.predict(x_test)))

A few things to note:

The intercept $\beta_0$ is approximately -9.22
linear_reg.coef_ is an array containing $\beta_1$ and $\beta_2$: $\beta_1 \approx 6.21$ and $\beta_2 \approx 0.49$
The equation that describes our model is:

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2\]

\[\text{Exam Mark} = -9.22 + 6.21 \times \text{Time Spent Studying} + 0.49 \times \text{Assignment Mark}\]

To make a prediction, x_test has dimensions 1 x 2, since there is 1 test sample and 2 input variables
Our model predicts that a student who studied 6 hours and had an assignment mark of 77 will get an exam mark of about 65.8.
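The prediction can also be checked by hand from the fitted intercept and coefficients. The sketch below assumes the CSV contains exactly the seven rows shown in the table, so the data is typed in directly rather than read from the file.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumption: these rows mirror study_assignment.csv
x = np.array([[4.5, 73], [8, 89], [1.5, 65], [3.5, 66],
              [5.5, 67], [3, 65], [6.5, 92]])
y = np.array([60, 80, 31, 54, 58, 30, 78])

linear_reg = LinearRegression()
linear_reg.fit(x, y)

x_test = np.array([[6, 77]])

# y = beta_0 + beta_1 * x_1 + beta_2 * x_2, written out by hand
manual = linear_reg.intercept_ + x_test @ linear_reg.coef_
predicted = linear_reg.predict(x_test)

print(manual, predicted)  # the two values should agree
```

Both lines compute the same weighted sum, so .predict() can be thought of as a shortcut for plugging the inputs into the model equation.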

Code Challenge: Multiple Linear Regression

In addition to the budget of our movies, we also have the release year. Let’s try adding this additional information into our linear regression model. Then, we can evaluate this new model on the test data shown below.

Movie	Budget ($M)	Release Year	Actual Box office ($M)
Barbie	145	2023	1446
Wicked	150	2024	752
Everything Everywhere All At Once	25	2022	143

Instructions

Copy and paste in your code from the Measuring Error Using the MSE challenge
Update x so that it is a 2D numpy array with two columns: the first corresponds to 'Budget ($M)' and the second to 'Release Year'
Fit your linear regression model, ensuring that x has the correct dimensions
Create a 2D numpy array of the test data, again with columns corresponding to budget and release year
Predict the box office results for the test data
Calculate and print the MSE

Your output should look like this:

XXXXXX.XXXXXXXXXX

Things to consider (extension):

As you add more input variables to your model, would you expect it to perform better or worse? Why?
How does the performance of this model (with 2 inputs: budget and release year) compare to the performance of the previous model (with 1 input: budget)?
Can you explain the differences in the performance you observed?
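One way to start exploring these questions, without using the movie data, is to run a similar comparison on the study dataset from earlier in this section. The sketch below is only an illustration: it assumes the seven rows from the table, the helper name training_mse is made up for this example, and the MSE is measured on the same data used for fitting rather than on separate test data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumption: these rows mirror study_assignment.csv
x = np.array([[4.5, 73], [8, 89], [1.5, 65], [3.5, 66],
              [5.5, 67], [3, 65], [6.5, 92]])
y = np.array([60, 80, 31, 54, 58, 30, 78])

def training_mse(inputs, targets):
    """Fit a linear regression and return its MSE on the same data."""
    model = LinearRegression()
    model.fit(inputs, targets)
    predictions = model.predict(inputs)
    return np.mean((targets - predictions) ** 2)

# x[:, [0]] keeps the first column as a 2D array with 1 column
mse_one_input = training_mse(x[:, [0]], y)  # study time only
mse_two_inputs = training_mse(x, y)         # study time and assignment mark

print(mse_one_input)
print(mse_two_inputs)
```

On the data used for fitting, adding an input cannot make the least-squares MSE worse, so the more telling comparison is on separate test data, as in the challenge.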