1. The pandas module

NumPy is immensely flexible and powerful, and you could use it for all of your analyses. But, as you might have noticed, it is a little low-level: arrays can only contain values of a single data type, and if you have a lot of columns of data, it can be tricky to keep track of which data is in which column. More complex operations, such as computing values across certain factors or restructuring data to fit certain shapes and criteria, can be even trickier.

There is a module called pandas that is built (partly) on top of NumPy, and makes up for many of NumPy's shortcomings. It is purpose-built for handling data, and has a huge range of methods and functions to carry out complex operations on data easily. The downside is that there are more things to learn, which can be quite complex at times!

We’ll use pandas heavily in the rest of the course, but you will sometimes need to drop down into NumPy and use its functions directly. Since the two modules share similarities, many pandas methods and approaches will feel familiar, and you can use some NumPy functions with pandas too.

1.1. import pandas as pd and the DataFrame

Like NumPy, pandas has its own special data collection - the DataFrame. This is like a NumPy array on steroids - each column has a name, and each row has a label. The DataFrame can hold variables with different types.

To use pandas, we need to import it. By convention, it is imported under the alias pd.

How can we make a DataFrame? There are many ways to create one, because the pd.DataFrame constructor is very flexible. First, let’s import pandas and NumPy.

import numpy as np
import pandas as pd
# Create a dictionary with column names as keys and lists as values
my_data = {'Sex': ['Female', 'Male', 'Female', 'Male', 'Male'],
           'Age': [30, 28, 24, 38, 21],
           'Score': [8.902, 23.1291, 33.2810, 10.903, 15.3290]}

# Use the DataFrame constructor
df1 = pd.DataFrame(my_data)

# Display
display(df1)
      Sex  Age    Score
0  Female   30   8.9020
1    Male   28  23.1291
2  Female   24  33.2810
3    Male   38  10.9030
4    Male   21  15.3290
# Make a DataFrame from a NumPy array, specifying column names and row labels (the index)
random_data = np.random.randint(low=1, high=6, size=(5,3))

# Construct
df2 = pd.DataFrame(random_data, columns=['a', 'b', 'c'], index=['one', 'two', 'three', 'four', 'five'])

display(df2)
       a  b  c
one    3  4  5
two    3  3  3
three  3  2  3
four   5  3  4
five   2  5  5
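To make the "named columns, labelled rows, mixed types" description concrete, here is a minimal sketch that rebuilds the first DataFrame and inspects it. It also shows a NumPy function being applied to a pandas column. Selecting a column with square brackets (df['Age']) is used here ahead of its proper introduction, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Same illustrative data as the first example above
df = pd.DataFrame({
    'Sex': ['Female', 'Male', 'Female', 'Male', 'Male'],
    'Age': [30, 28, 24, 38, 21],
    'Score': [8.902, 23.1291, 33.2810, 10.903, 15.3290],
})

# Column names and row labels are stored as attributes
print(list(df.columns))
print(list(df.index))

# Each column keeps its own data type
print(df.dtypes)

# A single column can be handed straight to NumPy functions
mean_age = np.mean(df['Age'])
log_score = np.log(df['Score'])   # the result is still a pandas Series

# .to_numpy() gives a plain NumPy array when one is needed
age_array = df['Age'].to_numpy()
```

Because the result of np.log keeps the original row labels, it can be lined up with the rest of the DataFrame without any extra bookkeeping.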

1.2. Reading data

Creating DataFrames by hand will only get us so far. What if we need to get existing data into Python to work with, such as experimental output stored in data files, or a dataset sent to you by a collaborator or supervisor? In fact, reading from a file is the most frequent way of getting a dataset into Python.

Pandas has a host of functions to read data. We’ll explore the use of read_csv and read_excel, which handle file formats that are common in psychology. A ‘csv’ file is a comma-separated-values file: it is very similar to an Excel spreadsheet, but much simpler, and the values in it are separated by commas (‘,’). Excel allows you to save ‘.xlsx’ files as csv, and I recommend you keep your data stored in a csv file.
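If you have never opened a csv file in a text editor, the short sketch below shows what one looks like on the inside. It converts a small DataFrame to csv text and reads it straight back, using read_csv, which is introduced properly in the next section. The column names and values are made up for illustration, and io.StringIO is only used so that no file needs to exist on disk.

```python
import io

import pandas as pd

# A small, made-up dataset
df = pd.DataFrame({
    'participant': [1, 2, 3],
    'condition': ['a', 'b', 'a'],
    'rt': [0.52, 0.61, 0.48],
})

# Convert to csv text: a header line, then one line per row,
# with the values separated by commas
csv_text = df.to_csv(index=False)
print(csv_text)

# Reading the same text back gives an equivalent DataFrame
df_back = pd.read_csv(io.StringIO(csv_text))
print(df_back)
```

pd.read_excel is called in much the same way on an ‘.xlsx’ file, although it typically relies on an extra package such as openpyxl being installed.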

1.3. pd.read_csv

To read a csv file into Python, use pd.read_csv. Its first argument is what’s known as a ‘path’: where can the file be found? This is the only essential argument. The function also has many other arguments that control how your data comes in. Pandas is even capable of reading a data file straight from the internet.

# Demonstrate read_csv, grab a dataset from the internet!
mtcars = pd.read_csv('https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv')

# Use the head method to see a sneak preview of this data!
display(mtcars.head())
               model   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2

So if you know where a comma-separated-values dataset lives on the internet, you can pass its URL to pandas to read!
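The "many other arguments" of read_csv mentioned earlier are easiest to understand with a small example. The sketch below uses a handful of them on some made-up file contents, held in a string so it runs without any file or internet connection. The column names and the 'missing' marker are assumptions for illustration, and these are only a few of the available arguments.

```python
import io

import pandas as pd

# Hypothetical file contents: values separated by semicolons,
# with the word 'missing' standing in for an absent score
raw = """id;group;score
p01;control;12.5
p02;treatment;missing
p03;control;9.0
"""

df = pd.read_csv(
    io.StringIO(raw),              # a file path or URL would normally go here
    sep=';',                       # the separator between values (default is ',')
    index_col='id',                # use the 'id' column as the row labels
    na_values=['missing'],         # extra strings to treat as missing data
    dtype={'group': 'category'},   # force a particular type for a column
)

print(df)
print(df.dtypes)
```

Setting these options when reading often saves cleaning work later, because the data arrives with sensible row labels, types, and missing values already in place.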

But not all your data will be on the web, and some of it will be spread among many, many files. How do you read a file from your computer? Let’s assume your Jupyter Notebook is running in the folder ‘/Experiment Data’. If that folder contains a csv file called mtcars_data.csv, it can be read in using just its file name.

df = pd.read_csv('mtcars_data.csv')

Similarly, if a file is in another folder on your computer, you can access it by specifying the full path to it. For example, I can read a data file from a folder with the same name as above, stored on my desktop, like this:

path = '/alexjones/Desktop/Experiment Data/mtcars_data.csv'

df = pd.read_csv(path)

It’s often easier to simply be in the same directory as your data when you start coding, though other modules, such as os and pathlib in the Python standard library, offer tools for working with paths that are worth exploring.
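As one possible approach to both paths and data "spread among many, many files", here is a minimal sketch using pathlib from the standard library. It creates a temporary folder with two made-up participant files so that it runs anywhere. In practice data_dir would point at your own data folder, and the file names, column names, and the added 'source' column are all assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a real data folder on your computer
    data_dir = Path(tmp) / 'Experiment Data'
    data_dir.mkdir()

    # Write two small, made-up participant files
    pd.DataFrame({'trial': [1, 2], 'rt': [0.51, 0.47]}).to_csv(data_dir / 'p01.csv', index=False)
    pd.DataFrame({'trial': [1, 2], 'rt': [0.62, 0.58]}).to_csv(data_dir / 'p02.csv', index=False)

    # Read a single file by joining the folder and the file name
    one = pd.read_csv(data_dir / 'p01.csv')

    # Read every csv file in the folder, noting where each row came from
    frames = []
    for path in sorted(data_dir.glob('*.csv')):
        frame = pd.read_csv(path)
        frame['source'] = path.stem   # file name without the '.csv'
        frames.append(frame)

    # Stack the individual DataFrames into one
    all_data = pd.concat(frames, ignore_index=True)

print(all_data)
```

Building paths with the / operator avoids typing separators by hand, so the same code works whether the folder lives on Windows, macOS, or Linux.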

1.4. Examining your data - .describe() and .info()

With a dataset now ‘in memory’, analysis can begin. DataFrames come with a few methods that give you some at-a-glance information about your data.

  • .info() prints information about the size and shape of the DataFrame, as well as the type of the data in each column. It is very useful to check, as sometimes data will come in with the wrong type - e.g. numbers stored as strings.

  • .describe() computes summary statistics for any numeric data columns, including the count, mean, standard deviation, minimum, quartiles, and maximum.

# Get info on mtcars
mtcars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   model   32 non-null     object 
 1   mpg     32 non-null     float64
 2   cyl     32 non-null     int64  
 3   disp    32 non-null     float64
 4   hp      32 non-null     int64  
 5   drat    32 non-null     float64
 6   wt      32 non-null     float64
 7   qsec    32 non-null     float64
 8   vs      32 non-null     int64  
 9   am      32 non-null     int64  
 10  gear    32 non-null     int64  
 11  carb    32 non-null     int64  
dtypes: float64(5), int64(6), object(1)
memory usage: 3.1+ KB
# Describe mtcars
mtcars.describe()
             mpg        cyl        disp          hp       drat         wt       qsec         vs         am       gear     carb
count  32.000000  32.000000   32.000000   32.000000  32.000000  32.000000  32.000000  32.000000  32.000000  32.000000  32.0000
mean   20.090625   6.187500  230.721875  146.687500   3.596563   3.217250  17.848750   0.437500   0.406250   3.687500   2.8125
std     6.026948   1.785922  123.938694   68.562868   0.534679   0.978457   1.786943   0.504016   0.498991   0.737804   1.6152
min    10.400000   4.000000   71.100000   52.000000   2.760000   1.513000  14.500000   0.000000   0.000000   3.000000   1.0000
25%    15.425000   4.000000  120.825000   96.500000   3.080000   2.581250  16.892500   0.000000   0.000000   3.000000   2.0000
50%    19.200000   6.000000  196.300000  123.000000   3.695000   3.325000  17.710000   0.000000   0.000000   4.000000   2.0000
75%    22.800000   8.000000  326.000000  180.000000   3.920000   3.610000  18.900000   1.000000   1.000000   4.000000   4.0000
max    33.900000   8.000000  472.000000  335.000000   4.930000   5.424000  22.900000   1.000000   1.000000   5.000000   8.0000
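The warning about data arriving in the wrong format is worth seeing in action. In the made-up example below, the 'age' column holds numbers stored as text, so .describe() silently leaves it out. Converting the column with pd.to_numeric is one possible fix, not the only one, and the names and values are purely illustrative.

```python
import pandas as pd

# Made-up data in which 'age' has come in as text rather than numbers
df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cat'],
    'age': ['31', '27', 'n/a'],
    'score': [4.5, 3.0, 5.5],
})

# .info() lists 'age' as a text column, which is the clue that something is off
df.info()

# Only the genuinely numeric column is summarised
summary_before = df.describe()
print(summary_before)

# Convert to numbers; anything that cannot be parsed becomes a missing value
df['age'] = pd.to_numeric(df['age'], errors='coerce')

# 'age' now appears in the summary alongside 'score'
summary_after = df.describe()
print(summary_after)
```

Note that the count for 'age' is lower than for 'score' afterwards, because the unparseable entry became a missing value and missing values are not counted.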