The pandas module
1. The pandas module
NumPy is immensely flexible and powerful, and you could use it for every analysis you need. But, as you might have noticed, it is a little low-level: a standard array holds values of a single data type, and if you have a lot of columns of data, it can be tricky to keep track of which data is in which column. More complex operations, such as computing values across certain factors or restructuring data to fit certain shapes and criteria, can be even trickier.
There is a module called pandas that is built (partly) on top of NumPy and makes up for many of the former's shortcomings. It is purpose-built for handling data, and has a huge range of methods and functions to carry out complex operations on data easily. The downside is that there are more things to learn, which can be quite complex at times!
We’ll use pandas heavily in the rest of the course, but you will sometimes need to drop back into NumPy or use its functions. Since the two modules share similarities, a lot of pandas methods and approaches will be familiar, and you can use some NumPy functions with pandas too.
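As a brief preview of how the two modules work together, here is a minimal sketch that applies a NumPy function to a pandas DataFrame (introduced in the next section). The column name `rt_squared` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# A tiny, made-up table with one numeric column (illustrative values only)
df = pd.DataFrame({'rt_squared': [4.0, 9.0, 16.0]})

# Many NumPy functions accept a DataFrame and hand back a DataFrame
roots = np.sqrt(df)
print(roots)

# When a plain NumPy array is needed, to_numpy() converts the data
as_array = df.to_numpy()
print(type(as_array))
```

The labels on the rows and columns survive the NumPy call, which is one reason mixing the two modules tends to be painless.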
1.1. import pandas as pd and the DataFrame
Like NumPy, pandas has its own special data collection - the DataFrame. This is like a NumPy array on steroids - each column has a name, and each row has a label. The DataFrame can hold variables with different types.
To use pandas, we need to import it. By convention, it is imported under the name pd.
How can we make a DataFrame? There are many ways, as the pd.DataFrame function that creates one is very flexible. First, let's import pandas and NumPy.
import numpy as np
import pandas as pd
# Create a dictionary with keys and lists as values
my_data = {
    'Sex': ['Female', 'Male', 'Female', 'Male', 'Male'],
    'Age': [30, 28, 24, 38, 21],
    'Score': [8.902, 23.1291, 33.2810, 10.903, 15.3290],
}
# Use the DataFrame constructor
df1 = pd.DataFrame(my_data)
# Display
display(df1)
| | Sex | Age | Score |
---|---|---|---|
0 | Female | 30 | 8.9020 |
1 | Male | 28 | 23.1291 |
2 | Female | 24 | 33.2810 |
3 | Male | 38 | 10.9030 |
4 | Male | 21 | 15.3290 |
# Make a DataFrame from a NumPy array, specifying column names and row labels (the index)
random_data = np.random.randint(low=1, high=6, size=(5,3))
# Construct
df2 = pd.DataFrame(random_data, columns=['a', 'b', 'c'], index=['one', 'two', 'three', 'four', 'five'])
display(df2)
| | a | b | c |
---|---|---|---|
one | 3 | 4 | 5 |
two | 3 | 3 | 3 |
three | 3 | 2 | 3 |
four | 5 | 3 | 4 |
five | 2 | 5 | 5 |
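Since pd.DataFrame accepts several kinds of input, here is a minimal sketch of one more option: a list of dictionaries, one per row. It also checks the column names, row labels, and per-column data types described above. The name `df3` and the two records are illustrative assumptions, reusing values from the first example.

```python
import pandas as pd

# One dictionary per row; the keys become the column names
records = [
    {'Sex': 'Female', 'Age': 30, 'Score': 8.902},
    {'Sex': 'Male', 'Age': 28, 'Score': 23.1291},
]

df3 = pd.DataFrame(records)

print(df3.columns.tolist())  # column names
print(df3.index.tolist())    # row labels (the index)
print(df3.dtypes)            # each column keeps its own data type
```

When no index is given, pandas labels the rows with integers starting from zero, as in the first example.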
1.2. Reading data
Creating DataFrames by hand will only get us so far. What if we need to get existing data into Python, such as experimental output stored in data files, or a dataset sent to you by a collaborator or supervisor? In fact, this is the most frequent way of getting a dataset into Python.
Pandas has a host of functions to read data. We’ll explore the use of read_csv and read_excel, which handle file types that are common in psychology. A ‘csv’ file is a comma-separated-values file: much like an Excel spreadsheet, but far simpler, with the values in each row separated by ‘,’. Excel allows you to save ‘.xlsx’ files as csv, and I recommend you keep your data stored in a csv file.
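To make the csv format concrete, the sketch below turns a small DataFrame into csv text and prints it. The DataFrame `small` is made up for illustration; calling to_csv without a path returns the text as a string rather than writing a file.

```python
import pandas as pd

# A tiny, made-up dataset
small = pd.DataFrame({'Sex': ['Female', 'Male'], 'Age': [30, 28]})

# With no path given, to_csv returns the csv contents as a string;
# index=False leaves the row labels out of the output
csv_text = small.to_csv(index=False)
print(csv_text)
```

The first line of the text holds the column names and every following line is one row, which is all a csv file really is.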
1.3. pd.read_csv
To read a csv file into Python, use pd.read_csv. This function needs what’s known as a ‘path’ as its first argument, which tells it where the file can be found. This is essential. It also has many other arguments that control how your data is read in. Pandas can even read a data file straight from the internet.
# Demonstrate read_csv, grab a dataset from the internet!
mtcars = pd.read_csv('https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv')
# Use the head method to see a sneak preview of this data!
display(mtcars.head())
| | model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
2 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
So if you know where a comma-separated-values dataset lives on the internet, you can pass its URL to pandas to read!
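A few of the other read_csv arguments mentioned earlier can be sketched as follows. To keep the example self-contained, an in-memory text buffer stands in for a file; the column names and values are made up for illustration.

```python
import io
import pandas as pd

# Made-up semicolon-separated text standing in for a file on disk
raw = io.StringIO(
    "id;group;score;notes\n"
    "p01;control;12.5;ok\n"
    "p02;treatment;15.0;late\n"
)

scores = pd.read_csv(
    raw,
    sep=';',                           # values are separated by ';' rather than ','
    usecols=['id', 'group', 'score'],  # read only these columns
    index_col='id',                    # use the 'id' column as the row labels
)
print(scores)
```

The same arguments work unchanged when the first argument is a file path or a URL.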
But not all your data will be on the web, and some of it will be spread among many, many files. How do you read a file from your computer? Let’s assume your Jupyter Notebook is running in the folder ‘/Experiment Data’. If a csv file called mtcars_data.csv is in that folder, then it can be read in easily.
df = pd.read_csv('mtcars_data.csv')
Similarly, if a file is in another folder on your computer, you can access it by specifying the full path to it. For example, I can access a data file in a folder with the same name as above, stored on my desktop, like this:
path = '/alexjones/Desktop/Experiment Data/mtcars_data.csv'
df = pd.read_csv(path)
It’s often easier to simply be in the same directory as your data when you start coding, though there are workarounds in other modules you should explore.
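One such workaround is the standard-library pathlib module, which builds paths piece by piece. The sketch below is only an illustration: a temporary folder stands in for a real data folder, and the file name `demo_data.csv` and its contents are made up.

```python
from pathlib import Path
import tempfile

import pandas as pd

# A temporary folder stands in for a real data folder (assumption for this demo)
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / 'Experiment Data'
    data_dir.mkdir()

    # Save a tiny made-up dataset so there is something to read
    csv_path = data_dir / 'demo_data.csv'
    pd.DataFrame({'mpg': [21.0, 22.8], 'cyl': [6, 4]}).to_csv(csv_path, index=False)

    # Path objects can be passed straight to read_csv
    df_from_path = pd.read_csv(csv_path)

print(Path.cwd())  # the folder Python is currently running in
print(df_from_path)
```

Joining folder names with `/` lets pathlib insert the right separator for your operating system, so the same code can run on Windows, macOS, and Linux.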
1.4. Examining your data - .describe() and .info()
With a dataset now ‘in memory’, analysis can begin. DataFrames come with a few methods that give you some at-a-glance information about your data.
.info() returns information about the size and shape of the DataFrame, as well as the type of the data in each column. This is very useful to check, as sometimes data will arrive in the wrong format, e.g. numbers stored as strings.
.describe() computes summary statistics for any numeric data columns, including the count, mean, standard deviation, quartiles, minimum, and maximum.
# Get info on mtcars
mtcars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 model 32 non-null object
1 mpg 32 non-null float64
2 cyl 32 non-null int64
3 disp 32 non-null float64
4 hp 32 non-null int64
5 drat 32 non-null float64
6 wt 32 non-null float64
7 qsec 32 non-null float64
8 vs 32 non-null int64
9 am 32 non-null int64
10 gear 32 non-null int64
11 carb 32 non-null int64
dtypes: float64(5), int64(6), object(1)
memory usage: 3.1+ KB
# Describe mtcars
mtcars.describe()
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.0000 |
mean | 20.090625 | 6.187500 | 230.721875 | 146.687500 | 3.596563 | 3.217250 | 17.848750 | 0.437500 | 0.406250 | 3.687500 | 2.8125 |
std | 6.026948 | 1.785922 | 123.938694 | 68.562868 | 0.534679 | 0.978457 | 1.786943 | 0.504016 | 0.498991 | 0.737804 | 1.6152 |
min | 10.400000 | 4.000000 | 71.100000 | 52.000000 | 2.760000 | 1.513000 | 14.500000 | 0.000000 | 0.000000 | 3.000000 | 1.0000 |
25% | 15.425000 | 4.000000 | 120.825000 | 96.500000 | 3.080000 | 2.581250 | 16.892500 | 0.000000 | 0.000000 | 3.000000 | 2.0000 |
50% | 19.200000 | 6.000000 | 196.300000 | 123.000000 | 3.695000 | 3.325000 | 17.710000 | 0.000000 | 0.000000 | 4.000000 | 2.0000 |
75% | 22.800000 | 8.000000 | 326.000000 | 180.000000 | 3.920000 | 3.610000 | 18.900000 | 1.000000 | 1.000000 | 4.000000 | 4.0000 |
max | 33.900000 | 8.000000 | 472.000000 | 335.000000 | 4.930000 | 5.424000 | 22.900000 | 1.000000 | 1.000000 | 5.000000 | 8.0000 |
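When the data types show that a numeric column has come in as text, it needs converting before it can be summarised as numbers. Below is a minimal sketch of one way to do this with pd.to_numeric; the DataFrame `messy` and its values are made up for illustration.

```python
import pandas as pd

# Made-up data where the numeric column was read in as text
messy = pd.DataFrame({
    'participant': ['a', 'b', 'c'],
    'score': ['10.5', '12.0', 'missing'],
})

# Convert the column to numbers; errors='coerce' turns entries
# that cannot be parsed (here 'missing') into NaN
messy['score'] = pd.to_numeric(messy['score'], errors='coerce')

print(messy.dtypes)
print(messy['score'].describe())
```

After the conversion the column is treated as numeric, and the unparseable entry is counted as missing rather than raising an error.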