1 - Introduction

1 - Introduction#

Using Python for Data Analysis#

Python is a vast programming language with almost unlimited flexibility. Our focus in this module will be to use it to:

Read data and do basic manipulations of it
Fit statistical models to this data
Visualise the data through plots

While there are a range of uses for the tools we will cover, particularly the art of ‘data wrangling’, these are outside the scope of the content. You will however get the skills to learn these more advanced approaches and there are many resources to advance you.

Part 1 - Basic syntax#

Here, we take a look at the basic syntax, or rules, we can use to get Python working for us.

# This is a comment. 
# Anything following the '#' symbol is ignored, but it lets us write notes to ourselves

Python is capable of very basic mathematical operations!

# Addition
print(10 + 5)

# Subtraction
print(10 - 5)

# Division
print(10 / 2)

# Multiplication
print(10 * 2)

Variables#

We will do more than simple maths with Python, and this involves using variables to store information, or the results of operations, or even vast collections of data, under a single name - think of it like a box you place results in.

Naming a variable is a simple but strict process:

Must begin with a letter (a-z, A-Z) or an underscore, _
Other characters may be letters, numbers or underscores
Variables are case sensitive. my_variable is different to My_variable!
Can be any length, but be sensible.
Some words you cannot use as a variable name, because Python has reserved them
- Examples include and, or, for, return, del, def, in, yield, True, False

# An example
weight = 90
height = 1.90

print(weight, height)

90 1.9

We can now work with these variables as if they were the numbers themselves.

# Make a new variable
BMI = weight / (height ** 2)
print(BMI)

24.930747922437675

Types#

There are several common data types that you see during analysis.

Integers, known as int. These are whole numbers, e.g. 1, 45.
Floating points, or float. These are numbers with extra precision, e.g. 3.894, 99.100000003
Booleans, or bool. These are special variables that are represented by True and False. These have a range of uses in comparing variables, or checking whether certain conditions are met.
Strings, or str. These represent text data and have a range of uses in data analysis.

How do you know what type your variable is? Use the type function!

# Make an integer
an_integer = 1
print(type(an_integer))

<class 'int'>

# Make a float
a_float = 5.78
print(type(a_float))

<class 'float'>

# Booleans already exist, but we can assign them to a variable
is_true = True
print(type(is_true))

<class 'bool'>

# Make a string
option1 = 'Hello world'
option2 = "Hello again"
print(type(option1), type(option2))

<class 'str'> <class 'str'>

Working with different types of data can lead to unusual outcomes, so its important to know what type your data is before you do something with it. For example…

# Adding an integer and a float makes sense
print(an_integer + a_float)

# Adding an integer and a boolean works
print(an_integer + is_true)

# Adding strings together also works
print(option1 + option2)

6.78
2
Hello worldHello again

Data Containers#

Variables are not just single numbers or strings. Often they are collections of information. There are many containers but two you must know about are lists and dictionaries.

Lists#

A list is an ordered collection of other bits of data that is all kept together in a single group. You use square brackets [] to indicate a list and everything that is to go inside it is separated with a comma.

# An example list
list1 = [1, 3.45, 'inside the list', True, False, [100, 200, 300]]

# You can get things out of a list by 'indexing' them, like so
print(list1[2])

# Or if you know something is in a list, you can ask it for the location, like this:
print(list1.index(3.45))
print(list1[list1.index(3.45)])

# You can retrieve items even from the 'nested' parts, like this
print(list1[-1][0])

inside the list
1
3.45
100

Some statistical output will be given to you in a list, or you will specify inputs to analysis programs, plots, and so on, as lists.

Dictionaries#

A dictionary is an unordered collection of other bits of data that is also kept together in a single group. You use curly brackets {} to indicate a dictionary, but the difference is that everything that goes in needs a ‘key’ that is associated with a ‘value’, like a real dictionary that has entries and definitions. The keys are strings, but the values can be anything!

# Dictionary example
dict1 = {'one': 1, 'two': ['a', 'b', 'c'], 'three': 4500}

# Getting things out requires the key
print(dict1['three'])

# You can examine all keys by asking for the .keys() attribute
print(dict1.keys())

# You can add things to a dictionary with the 'update' method
dict1.update({'four': 999})
print(dict1.keys())

4500
dict_keys(['one', 'two', 'three'])
dict_keys(['one', 'two', 'three', 'four'])

Dictionaries are again often used to give inputs to statistical methods or plots, or you will have results returned to you in a dictionary.

Part 2 - Pandas#

We come now to the workhorse of our Python journey - pandas (Panel Data). This package allows us to read in and manipulate data, and also to pass this data off to various other packages that will do our statistics and plots for us. A good way to think of it is the ‘Excel’ of Python, but the data is stored in something called a DataFrame.

In order to work with pandas, we must first import it, as it is an ‘add-on’ to basic Python.

# We import packages like so - we will often import things!
import pandas as pd # an alias to make typing easier

Now we have access to some advanced data handling techniques.

Where can we find some data? If we know a URL for a dataset, it can be read straight from the internet! Here, we read a small dataset about cars from the internet - specifically, a comma-separated-values file (csv).

# Demonstrate read_csv, grab a dataset from the internet
mtcars = pd.read_csv('https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv')

# Use the head method to see a sneak preview of this data
display(mtcars.head())

	model	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
0	Mazda RX4	21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
1	Mazda RX4 Wag	21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4
2	Datsun 710	22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1
3	Hornet 4 Drive	21.4	6	258.0	110	3.08	3.215	19.44	1	0	3	1
4	Hornet Sportabout	18.7	8	360.0	175	3.15	3.440	17.02	0	0	3	2

Alternatively, if a copy is available on our own machine, in the same folder as our notebook, we simply pass the filename:

# From file
mtcars_file = pd.read_csv('mtcars.csv')

# Display
display(mtcars_file.head(2))

	model	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
0	Mazda RX4	21.0	6	160.0	110	3.9	2.620	16.46	0	1	4	4
1	Mazda RX4 Wag	21.0	6	160.0	110	3.9	2.875	17.02	0	1	4	4

Data at a glance - .describe() and .info()#

With a dataset now ‘in memory’, analysis can begin. DataFrames come with a few methods give you some at a glance information about your data.

.info() returns information about the size and shape of the DataFrame, as well as the type of the data in each column. Very useful to check as sometimes data will be in the wrong format - e.g. strings as numbers. .describe() computes summary statistics for any numeric data columns, including mean, quartiles, and range.

# Get info on mtcars
mtcars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   model   32 non-null     object 
 1   mpg     32 non-null     float64
 2   cyl     32 non-null     int64  
 3   disp    32 non-null     float64
 4   hp      32 non-null     int64  
 5   drat    32 non-null     float64
 6   wt      32 non-null     float64
 7   qsec    32 non-null     float64
 8   vs      32 non-null     int64  
 9   am      32 non-null     int64  
 10  gear    32 non-null     int64  
 11  carb    32 non-null     int64  
dtypes: float64(5), int64(6), object(1)
memory usage: 3.1+ KB

# Describe mtcars
mtcars.describe()

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
count	32.000000	32.000000	32.000000	32.000000	32.000000	32.000000	32.000000	32.000000	32.000000	32.000000	32.0000
mean	20.090625	6.187500	230.721875	146.687500	3.596563	3.217250	17.848750	0.437500	0.406250	3.687500	2.8125
std	6.026948	1.785922	123.938694	68.562868	0.534679	0.978457	1.786943	0.504016	0.498991	0.737804	1.6152
min	10.400000	4.000000	71.100000	52.000000	2.760000	1.513000	14.500000	0.000000	0.000000	3.000000	1.0000
25%	15.425000	4.000000	120.825000	96.500000	3.080000	2.581250	16.892500	0.000000	0.000000	3.000000	2.0000
50%	19.200000	6.000000	196.300000	123.000000	3.695000	3.325000	17.710000	0.000000	0.000000	4.000000	2.0000
75%	22.800000	8.000000	326.000000	180.000000	3.920000	3.610000	18.900000	1.000000	1.000000	4.000000	4.0000
max	33.900000	8.000000	472.000000	335.000000	4.930000	5.424000	22.900000	1.000000	1.000000	5.000000	8.0000

# You can check the datatypes using the `.dtypes` accessor
print(mtcars.dtypes)

model     object
mpg      float64
cyl        int64
disp     float64
hp         int64
drat     float64
wt       float64
qsec     float64
vs         int64
am         int64
gear       int64
carb       int64
dtype: object

Accessing data in Pandas#

You can access data in pandas much like a dictionary, and sometimes you’ll need to do this. You can get a single column, or multiple columns, via using column names in a list. You’ll need to do this sometimes for some applications!

# Get model and mpg
display(mtcars[['model', 'mpg']])

	model	mpg
0	Mazda RX4	21.0
1	Mazda RX4 Wag	21.0
2	Datsun 710	22.8
3	Hornet 4 Drive	21.4
4	Hornet Sportabout	18.7
5	Valiant	18.1
6	Duster 360	14.3
7	Merc 240D	24.4
8	Merc 230	22.8
9	Merc 280	19.2
10	Merc 280C	17.8
11	Merc 450SE	16.4
12	Merc 450SL	17.3
13	Merc 450SLC	15.2
14	Cadillac Fleetwood	10.4
15	Lincoln Continental	10.4
16	Chrysler Imperial	14.7
17	Fiat 128	32.4
18	Honda Civic	30.4
19	Toyota Corolla	33.9
20	Toyota Corona	21.5
21	Dodge Challenger	15.5
22	AMC Javelin	15.2
23	Camaro Z28	13.3
24	Pontiac Firebird	19.2
25	Fiat X1-9	27.3
26	Porsche 914-2	26.0
27	Lotus Europa	30.4
28	Ford Pantera L	15.8
29	Ferrari Dino	19.7
30	Maserati Bora	15.0
31	Volvo 142E	21.4

Filtering a dataframe#

Selecting specific rows is easy. We use the .query method on a dataframe, and using a short expression to say what we want variables to be equal to, or less or more than:

# Get only the cars with 4 or more gears
gears = mtcars.query('gear > 3')
display(gears)

	model	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
0	Mazda RX4	21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
1	Mazda RX4 Wag	21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4
2	Datsun 710	22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1
7	Merc 240D	24.4	4	146.7	62	3.69	3.190	20.00	1	0	4	2
8	Merc 230	22.8	4	140.8	95	3.92	3.150	22.90	1	0	4	2
9	Merc 280	19.2	6	167.6	123	3.92	3.440	18.30	1	0	4	4
10	Merc 280C	17.8	6	167.6	123	3.92	3.440	18.90	1	0	4	4
17	Fiat 128	32.4	4	78.7	66	4.08	2.200	19.47	1	1	4	1
18	Honda Civic	30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
19	Toyota Corolla	33.9	4	71.1	65	4.22	1.835	19.90	1	1	4	1
25	Fiat X1-9	27.3	4	79.0	66	4.08	1.935	18.90	1	1	4	1
26	Porsche 914-2	26.0	4	120.3	91	4.43	2.140	16.70	0	1	5	2
27	Lotus Europa	30.4	4	95.1	113	3.77	1.513	16.90	1	1	5	2
28	Ford Pantera L	15.8	8	351.0	264	4.22	3.170	14.50	0	1	5	4
29	Ferrari Dino	19.7	6	145.0	175	3.62	2.770	15.50	0	1	5	6
30	Maserati Bora	15.0	8	301.0	335	3.54	3.570	14.60	0	1	5	8
31	Volvo 142E	21.4	4	121.0	109	4.11	2.780	18.60	1	1	4	2

# Combinations of filtering is simple
gears2 = mtcars.query('gear > 3 and am == 0')
display(gears2)

	model	mpg	cyl	disp	hp	drat	wt	qsec	vs	gear	carb
7	Merc 240D	24.4	4	146.7	62	3.69	3.19	20.0	1	4	2
8	Merc 230	22.8	4	140.8	95	3.92	3.15	22.9	1	4	2
9	Merc 280	19.2	6	167.6	123	3.92	3.44	18.3	1	4	4
10	Merc 280C	17.8	6	167.6	123	3.92	3.44	18.9	1	4	4

You can filter data according to the following syntax rules:

> means ‘greater than’, while >= means ‘equal to or greater than’. So for example, 5 > 5 would result in Python saying False, while 5 >= 5 is a True.
Similarly, < means ‘less than’, and <= means ‘equal to or less than’.
Conversely, a pair of double equal symbols, == exactly means ‘is equal to’. This works with strings too, so 'Hello' == 'Hello' is True, while 'Hello == 'hello' is False.
You can modify the double equals symbols to mean ‘not equal’, like this: !=. So 4 != 3 is True, while 5 != 5 is False.
You can combine these operations with the keywords and, or.
and will take two comparisons and return True if they are BOTH true. For example, 5 == 5 and 4 == 4 is True, while 5 == 5 and 4 == 3 is not.
or will take two comparisons and return True if ONE OR BOTH are true. For example, 5 == 5 or 4 == 4 is True, as well as 5 == 5 or 4 == 3 - because the first one is.

Dropping data, imputing values, and getting a summary statistic#

Sometimes, your data will have missing values. Here I replace a few values with a ‘missing’ indicator, and show how we can remove them, or replace them.

# Make some data missing!
mtcars.loc[[0, 1, 2, 5, 7], ['mpg', 'gear']] = pd.NA
display(mtcars.head())

	model	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
0	Mazda RX4	NaN	6	160.0	110	3.90	2.620	16.46	0	1	NaN	4
1	Mazda RX4 Wag	NaN	6	160.0	110	3.90	2.875	17.02	0	1	NaN	4
2	Datsun 710	NaN	4	108.0	93	3.85	2.320	18.61	1	1	NaN	1
3	Hornet 4 Drive	21.4	6	258.0	110	3.08	3.215	19.44	1	0	3.0	1
4	Hornet Sportabout	18.7	8	360.0	175	3.15	3.440	17.02	0	0	3.0	2

Similarly, we can find the missing data using .isna(), which will highlight which values are missing.

# Identify them
display(mtcars.isna())

	model	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
0	False	True	False	False	False	False	False	False	False	False	True	False
1	False	True	False	False	False	False	False	False	False	False	True	False
2	False	True	False	False	False	False	False	False	False	False	True	False
3	False	False	False	False	False	False	False	False	False	False	False	False
4	False	False	False	False	False	False	False	False	False	False	False	False
5	False	True	False	False	False	False	False	False	False	False	True	False
6	False	False	False	False	False	False	False	False	False	False	False	False
7	False	True	False	False	False	False	False	False	False	False	True	False
8	False	False	False	False	False	False	False	False	False	False	False	False
9	False	False	False	False	False	False	False	False	False	False	False	False
10	False	False	False	False	False	False	False	False	False	False	False	False
11	False	False	False	False	False	False	False	False	False	False	False	False
12	False	False	False	False	False	False	False	False	False	False	False	False
13	False	False	False	False	False	False	False	False	False	False	False	False
14	False	False	False	False	False	False	False	False	False	False	False	False
15	False	False	False	False	False	False	False	False	False	False	False	False
16	False	False	False	False	False	False	False	False	False	False	False	False
17	False	False	False	False	False	False	False	False	False	False	False	False
18	False	False	False	False	False	False	False	False	False	False	False	False
19	False	False	False	False	False	False	False	False	False	False	False	False
20	False	False	False	False	False	False	False	False	False	False	False	False
21	False	False	False	False	False	False	False	False	False	False	False	False
22	False	False	False	False	False	False	False	False	False	False	False	False
23	False	False	False	False	False	False	False	False	False	False	False	False
24	False	False	False	False	False	False	False	False	False	False	False	False
25	False	False	False	False	False	False	False	False	False	False	False	False
26	False	False	False	False	False	False	False	False	False	False	False	False
27	False	False	False	False	False	False	False	False	False	False	False	False
28	False	False	False	False	False	False	False	False	False	False	False	False
29	False	False	False	False	False	False	False	False	False	False	False	False
30	False	False	False	False	False	False	False	False	False	False	False	False
31	False	False	False	False	False	False	False	False	False	False	False	False

To summarise each column, we can ask if any or all the results are missing, like this:

# Find columns with missing values
display(mtcars.isna().any())

model    False
mpg       True
cyl      False
disp     False
hp       False
drat     False
wt       False
qsec     False
vs       False
am       False
gear      True
carb     False
dtype: bool

# We can drop rows that have NaN
no_nan = mtcars.dropna(how='any')
display(no_nan)

	model	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
3	Hornet 4 Drive	21.4	6	258.0	110	3.08	3.215	19.44	1	0	3.0	1
4	Hornet Sportabout	18.7	8	360.0	175	3.15	3.440	17.02	0	0	3.0	2
6	Duster 360	14.3	8	360.0	245	3.21	3.570	15.84	0	0	3.0	4
8	Merc 230	22.8	4	140.8	95	3.92	3.150	22.90	1	0	4.0	2
9	Merc 280	19.2	6	167.6	123	3.92	3.440	18.30	1	0	4.0	4
10	Merc 280C	17.8	6	167.6	123	3.92	3.440	18.90	1	0	4.0	4
11	Merc 450SE	16.4	8	275.8	180	3.07	4.070	17.40	0	0	3.0	3
12	Merc 450SL	17.3	8	275.8	180	3.07	3.730	17.60	0	0	3.0	3
13	Merc 450SLC	15.2	8	275.8	180	3.07	3.780	18.00	0	0	3.0	3
14	Cadillac Fleetwood	10.4	8	472.0	205	2.93	5.250	17.98	0	0	3.0	4
15	Lincoln Continental	10.4	8	460.0	215	3.00	5.424	17.82	0	0	3.0	4
16	Chrysler Imperial	14.7	8	440.0	230	3.23	5.345	17.42	0	0	3.0	4
17	Fiat 128	32.4	4	78.7	66	4.08	2.200	19.47	1	1	4.0	1
18	Honda Civic	30.4	4	75.7	52	4.93	1.615	18.52	1	1	4.0	2
19	Toyota Corolla	33.9	4	71.1	65	4.22	1.835	19.90	1	1	4.0	1
20	Toyota Corona	21.5	4	120.1	97	3.70	2.465	20.01	1	0	3.0	1
21	Dodge Challenger	15.5	8	318.0	150	2.76	3.520	16.87	0	0	3.0	2
22	AMC Javelin	15.2	8	304.0	150	3.15	3.435	17.30	0	0	3.0	2
23	Camaro Z28	13.3	8	350.0	245	3.73	3.840	15.41	0	0	3.0	4
24	Pontiac Firebird	19.2	8	400.0	175	3.08	3.845	17.05	0	0	3.0	2
25	Fiat X1-9	27.3	4	79.0	66	4.08	1.935	18.90	1	1	4.0	1
26	Porsche 914-2	26.0	4	120.3	91	4.43	2.140	16.70	0	1	5.0	2
27	Lotus Europa	30.4	4	95.1	113	3.77	1.513	16.90	1	1	5.0	2
28	Ford Pantera L	15.8	8	351.0	264	4.22	3.170	14.50	0	1	5.0	4
29	Ferrari Dino	19.7	6	145.0	175	3.62	2.770	15.50	0	1	5.0	6
30	Maserati Bora	15.0	8	301.0	335	3.54	3.570	14.60	0	1	5.0	8
31	Volvo 142E	21.4	4	121.0	109	4.11	2.780	18.60	1	1	4.0	2

# Or we can fill them with a specific value
display(mtcars.fillna(9999))

	model	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
0	Mazda RX4	9999.0	6	160.0	110	3.90	2.620	16.46	0	1	9999.0	4
1	Mazda RX4 Wag	9999.0	6	160.0	110	3.90	2.875	17.02	0	1	9999.0	4
2	Datsun 710	9999.0	4	108.0	93	3.85	2.320	18.61	1	1	9999.0	1
3	Hornet 4 Drive	21.4	6	258.0	110	3.08	3.215	19.44	1	0	3.0	1
4	Hornet Sportabout	18.7	8	360.0	175	3.15	3.440	17.02	0	0	3.0	2
5	Valiant	9999.0	6	225.0	105	2.76	3.460	20.22	1	0	9999.0	1
6	Duster 360	14.3	8	360.0	245	3.21	3.570	15.84	0	0	3.0	4
7	Merc 240D	9999.0	4	146.7	62	3.69	3.190	20.00	1	0	9999.0	2
8	Merc 230	22.8	4	140.8	95	3.92	3.150	22.90	1	0	4.0	2
9	Merc 280	19.2	6	167.6	123	3.92	3.440	18.30	1	0	4.0	4
10	Merc 280C	17.8	6	167.6	123	3.92	3.440	18.90	1	0	4.0	4
11	Merc 450SE	16.4	8	275.8	180	3.07	4.070	17.40	0	0	3.0	3
12	Merc 450SL	17.3	8	275.8	180	3.07	3.730	17.60	0	0	3.0	3
13	Merc 450SLC	15.2	8	275.8	180	3.07	3.780	18.00	0	0	3.0	3
14	Cadillac Fleetwood	10.4	8	472.0	205	2.93	5.250	17.98	0	0	3.0	4
15	Lincoln Continental	10.4	8	460.0	215	3.00	5.424	17.82	0	0	3.0	4
16	Chrysler Imperial	14.7	8	440.0	230	3.23	5.345	17.42	0	0	3.0	4
17	Fiat 128	32.4	4	78.7	66	4.08	2.200	19.47	1	1	4.0	1
18	Honda Civic	30.4	4	75.7	52	4.93	1.615	18.52	1	1	4.0	2
19	Toyota Corolla	33.9	4	71.1	65	4.22	1.835	19.90	1	1	4.0	1
20	Toyota Corona	21.5	4	120.1	97	3.70	2.465	20.01	1	0	3.0	1
21	Dodge Challenger	15.5	8	318.0	150	2.76	3.520	16.87	0	0	3.0	2
22	AMC Javelin	15.2	8	304.0	150	3.15	3.435	17.30	0	0	3.0	2
23	Camaro Z28	13.3	8	350.0	245	3.73	3.840	15.41	0	0	3.0	4
24	Pontiac Firebird	19.2	8	400.0	175	3.08	3.845	17.05	0	0	3.0	2
25	Fiat X1-9	27.3	4	79.0	66	4.08	1.935	18.90	1	1	4.0	1
26	Porsche 914-2	26.0	4	120.3	91	4.43	2.140	16.70	0	1	5.0	2
27	Lotus Europa	30.4	4	95.1	113	3.77	1.513	16.90	1	1	5.0	2
28	Ford Pantera L	15.8	8	351.0	264	4.22	3.170	14.50	0	1	5.0	4
29	Ferrari Dino	19.7	6	145.0	175	3.62	2.770	15.50	0	1	5.0	6
30	Maserati Bora	15.0	8	301.0	335	3.54	3.570	14.60	0	1	5.0	8
31	Volvo 142E	21.4	4	121.0	109	4.11	2.780	18.60	1	1	4.0	2

Aggregating#

We can aggregate columns to get specific statistics. pandas has shortcuts where we can state the name of a function we’d like to use as a string to the agg function in a dictionary - the key being the name of the column to work on!

# Get the mean and median of mpg, and std of disp
summaries = mtcars.agg({'mpg': ['mean', 'median'], 'disp': 'std'})
display(summaries)

	mpg	disp
mean	19.837037	NaN
median	18.700000	NaN
std	NaN	123.938694

Aggregating by groups#

Often, we may wish to aggregate our data by a specific variable or sets of them. For example, if we wanted to compute the average horsepower (hp) of each of the manual and automatic (am) cars, we need to do a groupby:

# Grouped average
grouped = mtcars.groupby(by=['am']).agg({'hp': 'mean'})
display(grouped)

	hp
am
0	160.263158
1	126.846154

Reshaping data#

Data comes in two formats - wide, and long.

Wide data has a single row per observation, and multiple columns that denote different variables. This is the format you’re used to seeing in programs like JASP or Excel.
Long data has multiple rows for a single observation, and columns that ‘track’ the scores on the variables.

You need to be able to move between these for things like plots or analysis functions. Fortunately, its not too difficult, but can take practice!

Wide to long#

To go from a wide dataset to a long one, we need to melt a dataframe.. This requires us to specify two things.

id_vars - the names of columns to keep as a ‘tracker’ for which observation the scores belong to.
value_vars - the names of columns we’d like to melt down into a single column, each entry tracking the name of a variable.
If we want, var_name and value_name allow us to specify the names of the column that tracks the variable name, and the associated score.

Let us melt the cars dataset, keeping the model as the identifier, and melting down the mpg, cyl, and disp variables.

# Long mtcars
mtcar_long = mtcars.melt(id_vars='model', 
                         value_vars=['mpg', 'cyl', 'disp'],
                         var_name='engine',
                         value_name='value')

# Sort by the model, not necessary but helps
mtcar_long = mtcar_long.sort_values(by='model')

# Show
display(mtcar_long.head(6), mtcar_long.tail(6))

	model	engine	value
86	AMC Javelin	disp	304.0
22	AMC Javelin	mpg	15.2
54	AMC Javelin	cyl	8.0
78	Cadillac Fleetwood	disp	472.0
46	Cadillac Fleetwood	cyl	8.0
14	Cadillac Fleetwood	mpg	10.4

	model	engine	value
69	Valiant	disp	225.0
37	Valiant	cyl	6.0
5	Valiant	mpg	NaN
31	Volvo 142E	mpg	21.4
63	Volvo 142E	cyl	4.0
95	Volvo 142E	disp	121.0

Long to wide#

To go from a long dataframe to a wide one, we must pivot it. This requires us to specify three things:

index, the name of the column that represents the identifier keeping track of what observation is what.
columns - the name of the columns that contain the variables we’d like to push into new columns.
values - the name of the column that contains the scores.

An example is instructive, undoing the earlier action of melting:

# Long to wide - note the `reset index`
mtcar_wide = mtcar_long.pivot(index='model', 
                              columns='engine',
                              values='value').reset_index()
display(mtcar_wide.head(10))

engine	model	cyl	disp	mpg
0	AMC Javelin	8.0	304.0	15.2
1	Cadillac Fleetwood	8.0	472.0	10.4
2	Camaro Z28	8.0	350.0	13.3
3	Chrysler Imperial	8.0	440.0	14.7
4	Datsun 710	4.0	108.0	NaN
5	Dodge Challenger	8.0	318.0	15.5
6	Duster 360	8.0	360.0	14.3
7	Ferrari Dino	6.0	145.0	19.7
8	Fiat 128	4.0	78.7	32.4
9	Fiat X1-9	4.0	79.0	27.3

Part 3 - Statistical Plots with Seaborn#

Pandas gives us the power to work with raw data, but Seaborn allows us to visualise that data easily. Seaborn can produce a vast array of plots, and we will use it to visualise different aspects of the data we work with. Lets see some basic examples.

# First, import seaborn
import seaborn as sns # traditionally as `sns`

# These lines are optional, but make our plots look nicer!
sns.set_style('whitegrid')
sns.set_context('talk')

# produce a scatter plot of mpg against number of cylinders, 
# but colour points by automatic/manual
sns.scatterplot(data=mtcars,
                x='mpg',
                y='cyl',
                hue='am')

<Axes: xlabel='mpg', ylabel='cyl'>

../_images/e7de223de53c1411f20bfc3f9fe9a40737a9f2daa481be7e3eea0171e8258775.png

# Make a barplot of the number of gears for automatic/manual
sns.barplot(data=mtcars, y='gear', hue='am')

<Axes: ylabel='gear'>

../_images/ae764b3b2e0b392298ec4e21c017f7d132ef11f2a71430cb6869f1f581d61357.png

# Very complex plots can be made easily
# using the 'relplot' (for an association) 
sns.relplot(data=mtcars, 
            kind='scatter',
            row='am',
            col='gear',
            x='mpg', y='hp',
            s=200, alpha=.3);

../_images/b8fac7a6bee60739a2ea90ec8d293dab6cac879e281b6a854349ca8c88bad15c.png

# ...or `catplot` for categorical data
sns.catplot(data=mtcars, 
            kind='violin',
            row='am',
            col='gear',
            y='mpg', hue='carb');

../_images/41d7ee4c118fef8c735f9bc2ab5c053a87990ee726cb24368a1e9a5fe972d1ef.png