1. The `pandas` module

NumPy is immensely flexible and powerful, and in principle you could use it for all of your analyses. But, as you might have noticed, it is a little low-level - arrays can only contain values of a single data type, and if you have a lot of columns of data, it can be tricky to keep track of which data is in which column. More complex operations, such as computing values across certain factors or restructuring data to fit certain shapes and criteria, can be even trickier.

There is a module called pandas that is built (partly) on top of NumPy and makes up for many of NumPy's shortcomings. It is purpose-built for handling data, and has a huge range of methods and functions to carry out complex operations on data easily. The downside is that there are more things to learn, which can be quite complex at times!

We’ll use pandas heavily in the rest of the course, but you will sometimes need to drop into NumPy or use its functionality. Since the two modules share similarities, a lot of pandas methods and approaches will feel familiar, and you can use some NumPy functions with pandas too.

1.1. `import pandas as pd` and the DataFrame

Like NumPy, pandas has its own special data collection - the DataFrame. This is like a NumPy array on steroids - each column has a name, and each row has a label. The DataFrame can hold variables with different types.

To use pandas, we need to import it. By convention, it is imported under the alias pd.

How can we make a DataFrame? There are many ways, as the pd.DataFrame function that creates one is very flexible. First, let’s import pandas and NumPy.

import numpy as np
import pandas as pd

# Create a dictionary: each key is a column name, each list holds that column's values
my_data = {'Sex': ['Female', 'Male', 'Female', 'Male', 'Male'],
           'Age': [30, 28, 24, 38, 21],
           'Score': [8.902, 23.1291, 33.2810, 10.903, 15.3290]}

# Use the DataFrame constructor
df1 = pd.DataFrame(my_data)

# Display
display(df1)

	Sex	Age	Score
0	Female	30	8.9020
1	Male	28	23.1291
2	Female	24	33.2810
3	Male	38	10.9030
4	Male	21	15.3290
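As a small sketch of what "named columns with different types" means in practice, the snippet below rebuilds the same DataFrame (so that it runs on its own) and then inspects the type of each column, selects one column by name, and picks out a single value by row label and column name. The variable name `ages` is just an illustrative choice.

```python
import pandas as pd

# Same dictionary as above, rebuilt here so the snippet runs on its own
my_data = {'Sex': ['Female', 'Male', 'Female', 'Male', 'Male'],
           'Age': [30, 28, 24, 38, 21],
           'Score': [8.902, 23.1291, 33.2810, 10.903, 15.3290]}
df1 = pd.DataFrame(my_data)

# Each column keeps its own data type (text, integers, floats)
print(df1.dtypes)

# Select one column by name; the result is a pandas Series
ages = df1['Age']
print(ages.mean())

# Select a single value by row label and column name
print(df1.loc[1, 'Score'])
```

Because no index was given when the DataFrame was built, the row labels are simply the integers 0 to 4.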

# Make a DataFrame from a NumPy array, specifying column names and row labels (the index)
# Random integers from 1 to 5 (high is exclusive)
random_data = np.random.randint(low=1, high=6, size=(5,3))

# Construct
df2 = pd.DataFrame(random_data, columns=['a', 'b', 'c'], index=['one', 'two', 'three', 'four', 'five'])

display(df2)

	a	b	c
one	3	4	5
two	3	3	3
three	3	2	3
four	5	3	4
five	2	5	5
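Custom row labels can be used to pull out rows, and, as mentioned in the introduction, some NumPy functions work directly on a DataFrame. The following is a minimal sketch using fixed (rather than random) values so the results are reproducible; the names `values`, `row_two`, `roots` and `plain` are illustrative only.

```python
import numpy as np
import pandas as pd

# Fixed values, chosen so the square roots are easy to check by eye
values = np.array([[1, 4, 9], [16, 25, 36]])
df = pd.DataFrame(values, columns=['a', 'b', 'c'], index=['one', 'two'])

# Rows are selected by their label with .loc
row_two = df.loc['two']
print(row_two)

# NumPy functions accept a DataFrame and return one with the same labels
roots = np.sqrt(df)
print(roots)

# .to_numpy() hands back a plain NumPy array when the labels are not needed
plain = df.to_numpy()
print(type(plain), plain.shape)
```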

1.2. Reading data

Creating DataFrames by hand will only get us so far. What if we need to get existing data into Python to work from - perhaps experimental output stored in data files, or a dataset sent to you by a collaborator or supervisor? In practice, reading data from a file is the most frequent way of getting a dataset into Python.

Pandas has a host of functions to read data. We’ll explore read_csv and read_excel, which cover the file types most common in psychology. A ‘csv’ file is a comma-separated-values file - a plain-text file that holds the same kind of table as an Excel spreadsheet, but in a much simpler form, with the values in each row separated by ‘,’. Excel allows you to save ‘.xlsx’ files as csv, and I recommend you keep your data stored in a csv file.
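To make the format concrete, here is a rough sketch of what the raw text of a tiny csv file looks like, and how pandas turns it into a DataFrame. The data and column names are made up for illustration, and `io.StringIO` is used only so that the example does not need a real file on disk.

```python
import io
import pandas as pd

# The raw text of a tiny, made-up csv file:
# one header row, then one row per participant
csv_text = """participant,condition,rt
1,control,0.532
2,treatment,0.611
3,control,0.498
"""

# io.StringIO lets pandas treat the string as if it were a file
small = pd.read_csv(io.StringIO(csv_text))
print(small)
```

The first line of the text becomes the column names, and every following line becomes one row.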

1.3. `pd.read_csv`

To read a csv file into Python, use pd.read_csv. Its first argument is what’s known as a ‘path’ - where can the file be found? This argument is essential. The function also has many other arguments that control how your data comes in. Pandas is even capable of reading a data file straight from the internet.

# Demonstrate read_csv, grab a dataset from the internet!
mtcars = pd.read_csv('https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv')

# Use the head method to see a sneak preview of this data!
display(mtcars.head())

	model	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
0	Mazda RX4	21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
1	Mazda RX4 Wag	21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4
2	Datsun 710	22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1
3	Hornet 4 Drive	21.4	6	258.0	110	3.08	3.215	19.44	1	0	3	1
4	Hornet Sportabout	18.7	8	360.0	175	3.15	3.440	17.02	0	0	3	2

So if you know where a comma-separated-values dataset lives on the internet, you can pass its URL to pandas to read!
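The other arguments of pd.read_csv mentioned at the start of this section shape how the data comes in. As a hedged sketch, the made-up file below uses ‘;’ as its separator and the word ‘missing’ for absent values; the arguments shown (`sep`, `usecols`, `index_col`, `na_values`) are only a small selection of what the function offers.

```python
import io
import pandas as pd

# A made-up file that uses ';' as the separator and 'missing' for absent values
raw = """id;group;score;notes
p1;a;10.5;ok
p2;b;missing;late
p3;a;7.25;ok
"""

df = pd.read_csv(
    io.StringIO(raw),
    sep=';',                           # values are separated by ';' rather than ','
    usecols=['id', 'group', 'score'],  # read only these columns
    index_col='id',                    # use the 'id' column as the row labels
    na_values=['missing'],             # treat this text as a missing value
)
print(df)
```

The same arguments work in exactly the same way when the first argument is a URL or a file on your computer.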

But not all your data will be on the web, and some of it will be spread among many, many files. How do you read a file from your computer? Let’s assume your Jupyter Notebook is running in the folder ‘/Experiment Data’. If that folder contains a csv file called mtcars_data.csv, you can read it using just its file name.

df = pd.read_csv('mtcars_data.csv')

Similarly, if a file is in another folder on your computer, you can access it by specifying the full path to it. For example, I can read a data file stored in a folder of the same name on my desktop like this:

path = '/alexjones/Desktop/Experiment Data/mtcars_data.csv'

df = pd.read_csv(path)

It’s often easier to simply be in the same directory as your data when you start coding, though other modules (for example os and pathlib from the standard library) offer tools for working with paths that are worth exploring.
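One possible approach for data that lives in another folder, or that is spread over many files, is the standard-library pathlib module. The sketch below is only an illustration: it first creates a throwaway folder with two small made-up files (the file names and the `rt_ms` column are assumptions), then reads one file by path and finally reads every csv file in the folder and stacks them together.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a throwaway folder with two small csv files, standing in for real data
data_dir = Path(tempfile.mkdtemp())
pd.DataFrame({'rt_ms': [510, 470]}).to_csv(data_dir / 'participant_01.csv', index=False)
pd.DataFrame({'rt_ms': [620, 580]}).to_csv(data_dir / 'participant_02.csv', index=False)

# Path objects join folder and file names with '/', whatever the operating system
one_file = pd.read_csv(data_dir / 'participant_01.csv')

# glob finds every matching file; sorted() gives a predictable order
frames = [pd.read_csv(f) for f in sorted(data_dir.glob('*.csv'))]

# Stack the individual files into a single DataFrame
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

With real data you would replace `data_dir` with the folder that holds your files and skip the lines that create the example files.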

1.4. Examining your data - `.describe()` and `.info()`

With a dataset now ‘in memory’, analysis can begin. DataFrames come with a few methods that give you some at-a-glance information about your data.

.info() returns information about the size and shape of the DataFrame, as well as the type of the data in each column. This is very useful to check, as sometimes data will come in with the wrong type - e.g. numbers stored as strings.
.describe() computes summary statistics for any numeric data columns, including the count, mean, standard deviation, minimum, maximum, and quartiles.

# Get info on mtcars
mtcars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   model   32 non-null     object 
 1   mpg     32 non-null     float64
 2   cyl     32 non-null     int64  
 3   disp    32 non-null     float64
 4   hp      32 non-null     int64  
 5   drat    32 non-null     float64
 6   wt      32 non-null     float64
 7   qsec    32 non-null     float64
 8   vs      32 non-null     int64  
 9   am      32 non-null     int64  
 10  gear    32 non-null     int64  
 11  carb    32 non-null     int64  
dtypes: float64(5), int64(6), object(1)
memory usage: 3.1+ KB

# Describe mtcars
mtcars.describe()

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
count	32.000000	32.000000	32.000000	32.000000	32.000000	32.000000	32.000000	32.000000	32.000000	32.000000	32.0000
mean	20.090625	6.187500	230.721875	146.687500	3.596563	3.217250	17.848750	0.437500	0.406250	3.687500	2.8125
std	6.026948	1.785922	123.938694	68.562868	0.534679	0.978457	1.786943	0.504016	0.498991	0.737804	1.6152
min	10.400000	4.000000	71.100000	52.000000	2.760000	1.513000	14.500000	0.000000	0.000000	3.000000	1.0000
25%	15.425000	4.000000	120.825000	96.500000	3.080000	2.581250	16.892500	0.000000	0.000000	3.000000	2.0000
50%	19.200000	6.000000	196.300000	123.000000	3.695000	3.325000	17.710000	0.000000	0.000000	4.000000	2.0000
75%	22.800000	8.000000	326.000000	180.000000	3.920000	3.610000	18.900000	1.000000	1.000000	4.000000	4.0000
max	33.900000	8.000000	472.000000	335.000000	4.930000	5.424000	22.900000	1.000000	1.000000	5.000000	8.0000
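The mtcars data happens to come in with sensible types, so here is a small made-up example of the "wrong type" problem that checking the column types can reveal. One way to repair it (an assumption about what you would want, not the only option) is `pd.to_numeric` with `errors='coerce'`, which turns anything that cannot be read as a number into a missing value.

```python
import pandas as pd

# A made-up DataFrame where ages were entered as text by mistake
df = pd.DataFrame({'age': ['30', '28', 'unknown'],
                   'score': [8.9, 23.1, 33.3]})

# 'age' is not numeric yet, so describe() summarises only 'score'
before = df.describe().columns.tolist()
print(before)

# Convert to numbers; entries that cannot be parsed become NaN
df['age'] = pd.to_numeric(df['age'], errors='coerce')

# Both columns are now numeric and both appear in the summary
print(df.dtypes)
print(df.describe())
```

Note that the count for a column only includes its non-missing values, so it is worth checking the counts after a conversion like this.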
