1.5. Reading in Data With Pandas
One of the most common formats for storing data is the comma-separated values (csv) file. One way to load data from a csv file into Python is to use the pandas library. To import this library we use:
import pandas as pd
Any time we want to use a function from this library we use:
pd.function
To read in a csv file, we will use the pandas function read_csv(). It will
look like this:
pd.read_csv(name_of_file)
This will read our csv file into a pandas DataFrame. You can think of a
DataFrame as a table of data with rows and columns. Here is an example
where we have read in a file called study.csv, which is stored in the folder
course.
import pandas as pd
data = pd.read_csv("study.csv")
print(data)
Output
   Time Spent Studying (hours)  Exam Mark (%)
0                          4.5             60
1                          8.0             80
2                          1.5             31
3                          3.5             54
4                          5.5             58
5                          3.0             30
6                          6.5             78
Here is a copy of study.csv.
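If you don't have study.csv to hand, you can still try read_csv() on the same values. The sketch below assumes the file contents match the output shown above and passes them in as text through io.StringIO instead of a file name. The variable name csv_text is only for illustration.

```python
import io
import pandas as pd

# Assumption: this text mirrors the contents of study.csv shown in the output above.
csv_text = """Time Spent Studying (hours),Exam Mark (%)
4.5,60
8.0,80
1.5,31
3.5,54
5.5,58
3.0,30
6.5,78
"""

# read_csv accepts a file name or a file-like object such as StringIO
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)          # (rows, columns) -> (7, 2)
print(list(data.columns))  # column names come from the first line of the file
print(data.head(3))        # first three rows of the table
```

Checking shape and columns straight after reading a file is a quick way to confirm that the data loaded the way you expected.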
To extract a single column we can use the following:
DataFrame[column_name]
You’ll have noticed that there are two columns in study.csv. These are:
Time Spent Studying (hours)
Exam Mark (%)
We can extract these into x, our independent variable, and y, our
dependent variable. x and y are pandas Series objects, which behave
like 1-dimensional arrays.
import pandas as pd
data = pd.read_csv("study.csv")
x = data["Time Spent Studying (hours)"]
y = data["Exam Mark (%)"]
print(x)
print(y)
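To see what "like a 1-dimensional array" means in practice, here is a small sketch. It assumes a shortened stand-in for study.csv (the first three rows only, supplied as text through io.StringIO), so the numbers it prints apply to that sample rather than the full file.

```python
import io
import pandas as pd

# Assumption: csv_text stands in for study.csv (first three rows only).
csv_text = """Time Spent Studying (hours),Exam Mark (%)
4.5,60
8.0,80
1.5,31
"""
data = pd.read_csv(io.StringIO(csv_text))
x = data["Time Spent Studying (hours)"]

print(type(x).__name__)  # Series
print(x.name)            # the column header is kept as the Series name
print(len(x))            # number of values, just like len() on a list
print(x.iloc[0])         # first value, selected by position
print(x.mean())          # Series come with built-in summary methods
```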
To convert these to numpy arrays we use:
series.to_numpy()
import pandas as pd
data = pd.read_csv("study.csv")
x = data["Time Spent Studying (hours)"].to_numpy()
y = data["Exam Mark (%)"].to_numpy()
print(x)
print(y)
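The sketch below shows what changes when .to_numpy() is called. It again assumes a three-row stand-in for study.csv supplied as text, and the name series is only for illustration.

```python
import io
import numpy as np
import pandas as pd

# Assumption: csv_text stands in for study.csv (first three rows only).
csv_text = """Time Spent Studying (hours),Exam Mark (%)
4.5,60
8.0,80
1.5,31
"""
data = pd.read_csv(io.StringIO(csv_text))

series = data["Time Spent Studying (hours)"]  # pandas Series
x = series.to_numpy()                         # numpy array holding the same values

print(type(series).__name__)      # Series
print(isinstance(x, np.ndarray))  # True
print(x.shape)                    # (3,) means 1D with 3 values
print(x.dtype)                    # float64
```

The values are unchanged. What differs is the container: the numpy array drops the pandas index and column name and keeps only the numbers.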
We use numpy arrays because it’s easier to manipulate their dimensions using
.reshape(rows, columns)
This allows you to quickly change your 1D vector into a 2D column or row vector. You’ll see that this will be useful later.
import numpy as np
array = np.array([1, 2, 3, 4, 5]) # 1D array
print(array.reshape(5, 1)) # 2D array with 5 rows and 1 column
You would have seen this in Multi-Dimensional Arrays in Year 11 > Python Fundamentals > Data structures.
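As a further sketch, the example below compares the shape of the same five values as a 1D array, a column vector and a row vector. It also uses -1, which asks numpy to work out that dimension from the number of values. The names column, row and inferred are only for illustration.

```python
import numpy as np

array = np.array([1, 2, 3, 4, 5])  # 1D array

column = array.reshape(5, 1)     # 2D: 5 rows, 1 column
row = array.reshape(1, 5)        # 2D: 1 row, 5 columns
inferred = array.reshape(-1, 1)  # -1 lets numpy infer the number of rows

print(array.shape)     # (5,)
print(column.shape)    # (5, 1)
print(row.shape)       # (1, 5)
print(inferred.shape)  # (5, 1)
```

Using -1 is handy when you don't know in advance how many values a column of data contains.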
Code Challenge: Read In Movie Data
You have been provided with a csv file called movies.csv,
with data obtained from StatCrunch. This file contains the following columns:
Release Year
Movie
Budget ($M)
Box Office ($M)
We will use this data to develop a linear regression model that can help film producers predict a movie’s box office result based on the movie’s budget.
Instructions
Using pandas, read the file movies.csv into a DataFrame
Extract the 'Budget ($M)' column into the variable x
Extract the 'Box Office ($M)' column into the variable y
Convert both x and y to numpy arrays
Print x and y
Your output should look like this:
[XXX. XXX. XXX. XXX. XXX. XXX. XXX. XXX. XXX. XXX.
...
X.X X.X X.X X.X X. X. X.X X.X X.X ]
[X.XXXXXXXXe+XX X.XXXXXXXXe+XX X.XXXXXXXXe+XX X.XXXXXXXXe+XX
...
X.XXXXXXXXe+XX X.XXXXXXXXe+XX X.XXXXXXXXe+XX X.XXXXXXXXe+XX
X.XXXXXXXXe+XX]