2. Data surgery - reshaping data with melt#

If you’re used to analysing data with SPSS or Excel, you are probably used to wide format data. Each variable has its own column, and each participant has their own row. Consider the example below, taken from the paper by Ryan, Wilde, & Crist (2013), which examines the willingness of participants to kill an insect as a function of its ‘disgustingness’ and ‘frighteningness’. Data was taken from JASP’s datasets, describing an experiment where participants rated willingness to kill an insect across four conditions (e.g., low disgust, high fright). They are stored on the OSF, available to download using pd.read_csv.

# Read and display
rm_data = pd.read_csv('https://osf.io/mrhjn/download')
display(rm_data.head())
Subject Gender Region Education Lo D, Lo F Lo D, Hi F Hi D, Lo F Hi D, Hi F
0 1 Female North some 6.0 6.0 9.0 10.0
1 2 Female North advance 10.0 NaN 10.0 10.0
2 3 Female Europe college 5.0 10.0 10.0 10.0
3 4 Female North college 6.0 9.0 6.0 9.0
4 5 Female North some 3.0 6.5 5.5 8.5

2.1. From wide to long#

For most analysis purposes, this format works well. But as we’ll see, some data operations (especially plotting) require the data in a different format - known as long or sometimes as narrow. In fact, most of the analysis world outside of SPSS require data in this long format. What does a long format version look like?

long = pd.melt(rm_data, id_vars=['Subject', 'Gender', 'Region', 'Education'], 
                 value_vars=['Lo D, Lo F', 'Lo D, Hi F', 'Hi D, Lo F', 'Hi D, Hi F'],
                var_name='Condition', value_name='Rating')

long.sort_values(by='Subject', inplace=True)
# Show some of the long format data
display(long[:8])
Subject Gender Region Education Condition Rating
0 1 Female North some Lo D, Lo F 6.0
186 1 Female North some Hi D, Lo F 9.0
279 1 Female North some Hi D, Hi F 10.0
93 1 Female North some Lo D, Hi F 6.0
94 2 Female North advance Lo D, Hi F NaN
1 2 Female North advance Lo D, Lo F 10.0
187 2 Female North advance Hi D, Lo F 10.0
280 2 Female North advance Hi D, Hi F 10.0

Notice how the participant data is now repeated per row, and there is a separate row for each ‘Condition’. Rather than 4 columns for the data, there is now one, represented by ‘Rating’. This is long data. Importantly, working with data in this way poses no additional problems for us. Consider computing the means in each condition, across ratings:

# Means for original data
display(rm_data.mean().sort_values())

# Means for long data - simple group-by works
display(long.groupby('Condition').Rating.mean().sort_values())
/var/folders/mw/xt4ddf0j2n3dr4qcr__qhqlr0000gn/T/ipykernel_2980/3531915259.py:2: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  display(rm_data.mean().sort_values())
Lo D, Lo F     5.715054
Hi D, Lo F     6.741758
Lo D, Hi F     7.379121
Hi D, Hi F     7.853261
Subject       50.215054
dtype: float64
Condition
Lo D, Lo F    5.715054
Hi D, Lo F    6.741758
Lo D, Hi F    7.379121
Hi D, Hi F    7.853261
Name: Rating, dtype: float64

2.2. Melting data#

So how can we go from a ‘wide’ dataset our SPSS-using colleagues send us, and get it into a long format for other operations? There is a function dedicated solely for that - pd.melt().

This functions allows us to bend a dataset into a format where some columns are kept as ‘identifier’ variables, and all others are collapsed into a pair of columns - one that identifies the variable name, and the other that identifies the numeric value. Example:

# Subset rm_data to two participants; use for melt illustration
example = rm_data.iloc[0:2, :].copy()

# Melt!
melted_df = pd.melt(frame=example, id_vars=['Subject', 'Gender', 'Region', 'Education'],
                   value_vars=['Lo D, Lo F', 'Lo D, Hi F', 'Hi D, Lo F', 'Hi D, Hi F'],
                var_name='Condition', value_name='Rating')

# Or use the identical 'melt' method each DataFrame has - same outcome here as above, use as you wish
example.melt(id_vars=['Subject', 'Gender', 'Region', 'Education'],
                   value_vars=['Lo D, Lo F', 'Lo D, Hi F', 'Hi D, Lo F', 'Hi D, Hi F'],
                var_name='Condition', value_name='Rating')

display(example)
display(melted_df)
Subject Gender Region Education Lo D, Lo F Lo D, Hi F Hi D, Lo F Hi D, Hi F
0 1 Female North some 6.0 6.0 9.0 10.0
1 2 Female North advance 10.0 NaN 10.0 10.0
Subject Gender Region Education Condition Rating
0 1 Female North some Lo D, Lo F 6.0
1 2 Female North advance Lo D, Lo F 10.0
2 1 Female North some Lo D, Hi F 6.0
3 2 Female North advance Lo D, Hi F NaN
4 1 Female North some Hi D, Lo F 9.0
5 2 Female North advance Hi D, Lo F 10.0
6 1 Female North some Hi D, Hi F 10.0
7 2 Female North advance Hi D, Hi F 10.0

The function is smart enough to know that anything not specified in id_vars should be collapsed into the pair of columns specifying the conditions and the values. However, anything not specified in id_vars or value_vars is simply dropped from the result!

# Lazy usages
ex1 = pd.melt(frame=example, id_vars=['Subject', 'Gender', 'Region', 'Education'],
                var_name='Condition', value_name='Rating')

# Remember to specify everything!
ex2 = pd.melt(frame=example, id_vars='Subject', 
              value_vars=['Lo D, Lo F', 'Lo D, Hi F', 'Hi D, Lo F'],
                var_name='Condition', value_name='Rating')

display(ex1)
display(ex2)
Subject Gender Region Education Condition Rating
0 1 Female North some Lo D, Lo F 6.0
1 2 Female North advance Lo D, Lo F 10.0
2 1 Female North some Lo D, Hi F 6.0
3 2 Female North advance Lo D, Hi F NaN
4 1 Female North some Hi D, Lo F 9.0
5 2 Female North advance Hi D, Lo F 10.0
6 1 Female North some Hi D, Hi F 10.0
7 2 Female North advance Hi D, Hi F 10.0
Subject Condition Rating
0 1 Lo D, Lo F 6.0
1 2 Lo D, Lo F 10.0
2 1 Lo D, Hi F 6.0
3 2 Lo D, Hi F NaN
4 1 Hi D, Lo F 9.0
5 2 Hi D, Lo F 10.0

If you don’t specify any arguments, the whole DataFrame will be collapsed into two columns, repeating the column header for every observation in that column:

# Total meltdown
display(pd.melt(example))
variable value
0 Subject 1
1 Subject 2
2 Gender Female
3 Gender Female
4 Region North
5 Region North
6 Education some
7 Education advance
8 Lo D, Lo F 6.0
9 Lo D, Lo F 10.0
10 Lo D, Hi F 6.0
11 Lo D, Hi F NaN
12 Hi D, Lo F 9.0
13 Hi D, Lo F 10.0
14 Hi D, Hi F 10.0
15 Hi D, Hi F 10.0