Data surgery - reshaping data with melt
Contents
2. Data surgery - reshaping data with melt
#
If you’re used to analysing data with SPSS or Excel, you are probably used to wide format data. Each variable has its own column, and each participant has their own row. Consider the example below, taken from the paper by Ryan, Wilde, & Crist (2013), which examines the willingness of participants to kill an insect as a function of its ‘disgustingness’ and ‘frighteningness’. Data was taken from JASP’s datasets, describing an experiment where participants rated willingness to kill an insect across four conditions (e.g., low disgust, high fright). They are stored on the OSF, available to download using pd.read_csv
.
# Read and display
rm_data = pd.read_csv('https://osf.io/mrhjn/download')
display(rm_data.head())
Subject | Gender | Region | Education | Lo D, Lo F | Lo D, Hi F | Hi D, Lo F | Hi D, Hi F | |
---|---|---|---|---|---|---|---|---|
0 | 1 | Female | North | some | 6.0 | 6.0 | 9.0 | 10.0 |
1 | 2 | Female | North | advance | 10.0 | NaN | 10.0 | 10.0 |
2 | 3 | Female | Europe | college | 5.0 | 10.0 | 10.0 | 10.0 |
3 | 4 | Female | North | college | 6.0 | 9.0 | 6.0 | 9.0 |
4 | 5 | Female | North | some | 3.0 | 6.5 | 5.5 | 8.5 |
2.1. From wide to long#
For most analysis purposes, this format works well. But as we’ll see, some data operations (especially plotting) require the data in a different format - known as long
or sometimes as narrow
. In fact, most of the analysis world outside of SPSS require data in this long
format. What does a long
format version look like?
long = pd.melt(rm_data, id_vars=['Subject', 'Gender', 'Region', 'Education'],
value_vars=['Lo D, Lo F', 'Lo D, Hi F', 'Hi D, Lo F', 'Hi D, Hi F'],
var_name='Condition', value_name='Rating')
long.sort_values(by='Subject', inplace=True)
# Show some of the long format data
display(long[:8])
Subject | Gender | Region | Education | Condition | Rating | |
---|---|---|---|---|---|---|
0 | 1 | Female | North | some | Lo D, Lo F | 6.0 |
186 | 1 | Female | North | some | Hi D, Lo F | 9.0 |
279 | 1 | Female | North | some | Hi D, Hi F | 10.0 |
93 | 1 | Female | North | some | Lo D, Hi F | 6.0 |
94 | 2 | Female | North | advance | Lo D, Hi F | NaN |
1 | 2 | Female | North | advance | Lo D, Lo F | 10.0 |
187 | 2 | Female | North | advance | Hi D, Lo F | 10.0 |
280 | 2 | Female | North | advance | Hi D, Hi F | 10.0 |
Notice how the participant data is now repeated per row, and there is a separate row for each ‘Condition’. Rather than 4 columns for the data, there is now one, represented by ‘Rating’. This is long
data. Importantly, working with data in this way poses no additional problems for us. Consider computing the means in each condition, across ratings:
# Means for original data
display(rm_data.mean().sort_values())
# Means for long data - simple group-by works
display(long.groupby('Condition').Rating.mean().sort_values())
/var/folders/mw/xt4ddf0j2n3dr4qcr__qhqlr0000gn/T/ipykernel_2980/3531915259.py:2: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.
display(rm_data.mean().sort_values())
Lo D, Lo F 5.715054
Hi D, Lo F 6.741758
Lo D, Hi F 7.379121
Hi D, Hi F 7.853261
Subject 50.215054
dtype: float64
Condition
Lo D, Lo F 5.715054
Hi D, Lo F 6.741758
Lo D, Hi F 7.379121
Hi D, Hi F 7.853261
Name: Rating, dtype: float64
2.2. Melting data#
So how can we go from a ‘wide’ dataset our SPSS-using colleagues send us, and get it into a long format for other operations? There is a function dedicated solely for that - pd.melt()
.
This functions allows us to bend a dataset into a format where some columns are kept as ‘identifier’ variables, and all others are collapsed into a pair of columns - one that identifies the variable name, and the other that identifies the numeric value. Example:
# Subset rm_data to two participants; use for melt illustration
example = rm_data.iloc[0:2, :].copy()
# Melt!
melted_df = pd.melt(frame=example, id_vars=['Subject', 'Gender', 'Region', 'Education'],
value_vars=['Lo D, Lo F', 'Lo D, Hi F', 'Hi D, Lo F', 'Hi D, Hi F'],
var_name='Condition', value_name='Rating')
# Or use the identical 'melt' method each DataFrame has - same outcome here as above, use as you wish
example.melt(id_vars=['Subject', 'Gender', 'Region', 'Education'],
value_vars=['Lo D, Lo F', 'Lo D, Hi F', 'Hi D, Lo F', 'Hi D, Hi F'],
var_name='Condition', value_name='Rating')
display(example)
display(melted_df)
Subject | Gender | Region | Education | Lo D, Lo F | Lo D, Hi F | Hi D, Lo F | Hi D, Hi F | |
---|---|---|---|---|---|---|---|---|
0 | 1 | Female | North | some | 6.0 | 6.0 | 9.0 | 10.0 |
1 | 2 | Female | North | advance | 10.0 | NaN | 10.0 | 10.0 |
Subject | Gender | Region | Education | Condition | Rating | |
---|---|---|---|---|---|---|
0 | 1 | Female | North | some | Lo D, Lo F | 6.0 |
1 | 2 | Female | North | advance | Lo D, Lo F | 10.0 |
2 | 1 | Female | North | some | Lo D, Hi F | 6.0 |
3 | 2 | Female | North | advance | Lo D, Hi F | NaN |
4 | 1 | Female | North | some | Hi D, Lo F | 9.0 |
5 | 2 | Female | North | advance | Hi D, Lo F | 10.0 |
6 | 1 | Female | North | some | Hi D, Hi F | 10.0 |
7 | 2 | Female | North | advance | Hi D, Hi F | 10.0 |
The function is smart enough to know that anything not specified in id_vars
should be collapsed into the pair of columns specifying the conditions and the values. However, anything not specified in id_vars
or value_vars
is simply dropped from the result!
# Lazy usages
ex1 = pd.melt(frame=example, id_vars=['Subject', 'Gender', 'Region', 'Education'],
var_name='Condition', value_name='Rating')
# Remember to specify everything!
ex2 = pd.melt(frame=example, id_vars='Subject',
value_vars=['Lo D, Lo F', 'Lo D, Hi F', 'Hi D, Lo F'],
var_name='Condition', value_name='Rating')
display(ex1)
display(ex2)
Subject | Gender | Region | Education | Condition | Rating | |
---|---|---|---|---|---|---|
0 | 1 | Female | North | some | Lo D, Lo F | 6.0 |
1 | 2 | Female | North | advance | Lo D, Lo F | 10.0 |
2 | 1 | Female | North | some | Lo D, Hi F | 6.0 |
3 | 2 | Female | North | advance | Lo D, Hi F | NaN |
4 | 1 | Female | North | some | Hi D, Lo F | 9.0 |
5 | 2 | Female | North | advance | Hi D, Lo F | 10.0 |
6 | 1 | Female | North | some | Hi D, Hi F | 10.0 |
7 | 2 | Female | North | advance | Hi D, Hi F | 10.0 |
Subject | Condition | Rating | |
---|---|---|---|
0 | 1 | Lo D, Lo F | 6.0 |
1 | 2 | Lo D, Lo F | 10.0 |
2 | 1 | Lo D, Hi F | 6.0 |
3 | 2 | Lo D, Hi F | NaN |
4 | 1 | Hi D, Lo F | 9.0 |
5 | 2 | Hi D, Lo F | 10.0 |
If you don’t specify any arguments, the whole DataFrame will be collapsed into two columns, repeating the column header for every observation in that column:
# Total meltdown
display(pd.melt(example))
variable | value | |
---|---|---|
0 | Subject | 1 |
1 | Subject | 2 |
2 | Gender | Female |
3 | Gender | Female |
4 | Region | North |
5 | Region | North |
6 | Education | some |
7 | Education | advance |
8 | Lo D, Lo F | 6.0 |
9 | Lo D, Lo F | 10.0 |
10 | Lo D, Hi F | 6.0 |
11 | Lo D, Hi F | NaN |
12 | Hi D, Lo F | 9.0 |
13 | Hi D, Lo F | 10.0 |
14 | Hi D, Hi F | 10.0 |
15 | Hi D, Hi F | 10.0 |