Reading in Data With Pandas

1.5. Reading in Data With Pandas

One of the most common formats for storing data is a comma-separated values (csv) file. One way to load data from a csv file into Python is to use the pandas library. To import this library we use:

import pandas as pd

Any time we want to use a function from this library, we prefix its name with pd:

pd.function_name()

To read in a csv file, we will use the pandas function read_csv(). It will look like this:

pd.read_csv(name_of_file)

This will read our csv file into a pandas DataFrame. You can think of a DataFrame as a table of data with rows and columns. Here is an example where we have read in a file called study.csv, which is stored in the folder course.

import pandas as pd

data = pd.read_csv("study.csv")
print(data)

Output

   Time Spent Studying (hours)  Exam Mark (%)
0                          4.5             60
1                          8.0             80
2                          1.5             31
3                          3.5             54
4                          5.5             58
5                          3.0             30
6                          6.5             78

Here is a copy of study.csv.

study.csv
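
If the file is not to hand, a minimal sketch like the one below rebuilds the same table in memory. The csv text is an assumption reconstructed from the printed output above, and io.StringIO stands in for the file on disk, since read_csv accepts a file-like object as well as a file name.

```python
import io
import pandas as pd

# Assumed contents of study.csv, reconstructed from the printed output
csv_text = """Time Spent Studying (hours),Exam Mark (%)
4.5,60
8.0,80
1.5,31
3.5,54
5.5,58
3.0,30
6.5,78
"""

# read_csv takes a file name or a file-like object such as StringIO
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)          # (number of rows, number of columns)
print(list(data.columns))  # column names taken from the header row
```

The first line of the csv text becomes the column names, and each remaining line becomes one row of the DataFrame.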

To extract a single column we can use the following:

DataFrame[column_name]

You’ll have noticed that there are two columns in study.csv:

  • Time Spent Studying (hours)

  • Exam Mark (%)

We can extract these into x, our independent variable, and y, our dependent variable. Here x and y are pandas Series objects, which are like 1-dimensional arrays.

import pandas as pd

data = pd.read_csv("study.csv")

x = data["Time Spent Studying (hours)"]
y = data["Exam Mark (%)"]

print(x)
print(y)
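
To see what "like a 1-dimensional array" means in practice, here is a small sketch that checks the type and shape of an extracted column. The DataFrame is built by hand from the first three rows shown for study.csv, purely so the example runs without the file.

```python
import pandas as pd

# Hand-made stand-in for study.csv (first three rows only)
data = pd.DataFrame({
    "Time Spent Studying (hours)": [4.5, 8.0, 1.5],
    "Exam Mark (%)": [60, 80, 31],
})

x = data["Time Spent Studying (hours)"]

print(type(x).__name__)  # Series
print(x.shape)           # (3,) : a single dimension
print(x[0])              # 4.5, the value stored at index label 0
```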

To convert these to numpy arrays we use:

series.to_numpy()

import pandas as pd

data = pd.read_csv("study.csv")

x = data["Time Spent Studying (hours)"].to_numpy()
y = data["Exam Mark (%)"].to_numpy()

print(x)
print(y)
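
As a quick check that the conversion has worked, the sketch below confirms that to_numpy() returns a numpy array with one dimension. The DataFrame is again a hand-made stand-in using the first three rows shown for study.csv.

```python
import numpy as np
import pandas as pd

# Hand-made stand-in for study.csv (first three rows only)
data = pd.DataFrame({
    "Time Spent Studying (hours)": [4.5, 8.0, 1.5],
    "Exam Mark (%)": [60, 80, 31],
})

x = data["Time Spent Studying (hours)"].to_numpy()
y = data["Exam Mark (%)"].to_numpy()

print(isinstance(x, np.ndarray))  # True
print(x.ndim)                     # 1
print(x.dtype, y.dtype)           # data type of each array's elements
```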

We use numpy arrays because it’s easier to manipulate their dimensions using

.reshape(rows, columns)

This allows you to quickly change your 1D vector into a 2D column or row vector. You’ll see that this will be useful later.

import numpy as np

array = np.array([1, 2, 3, 4, 5])  # 1D array
print(array.reshape(5, 1))  # 2D array with 5 rows and 1 column

You will have seen this in Multi-Dimensional Arrays in Year 11 > Python Fundamentals > Data structures.
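
The example above only shows the column case. As a sketch of both options, the block below reshapes the same array into a column vector and a row vector. The use of -1 is an addition not covered in the text: it asks numpy to work out that dimension from the length of the array.

```python
import numpy as np

array = np.array([1, 2, 3, 4, 5])  # 1D array, shape (5,)

column = array.reshape(5, 1)   # 2D column vector: 5 rows, 1 column
row = array.reshape(1, 5)      # 2D row vector: 1 row, 5 columns
auto = array.reshape(-1, 1)    # -1 lets numpy infer the number of rows

print(column.shape)  # (5, 1)
print(row.shape)     # (1, 5)
print(auto.shape)    # (5, 1)
```

Using -1 is convenient when the length of the array is not known in advance, for example when it comes from a csv file.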

Code Challenge: Read In Movie Data

You have been provided with a csv file called movies.csv with data obtained from StatCrunch. This contains the following columns:

  • Release Year

  • Movie

  • Budget ($M)

  • Box Office ($M)

We will use this data to develop a linear regression model that can help film producers predict a movie’s box office result based on its budget.

Instructions

  1. Using pandas, read the file movies.csv into a DataFrame

  2. Extract the Budget ($M) column into the variable x

  3. Extract the Box Office ($M) column into the variable y

  4. Convert both x and y to numpy arrays

  5. Print x and y

Your output should look like this:

[XXX.   XXX.   XXX.   XXX.   XXX.   XXX.   XXX.   XXX.   XXX.   XXX.
...
X.X    X.X    X.X    X.X    X.     X.     X.X    X.X    X.X ]
[X.XXXXXXXXe+XX X.XXXXXXXXe+XX X.XXXXXXXXe+XX X.XXXXXXXXe+XX
...
X.XXXXXXXXe+XX X.XXXXXXXXe+XX X.XXXXXXXXe+XX X.XXXXXXXXe+XX
X.XXXXXXXXe+XX]