2. Data surgery - reshaping data with `melt`#

If you’re used to analysing data with SPSS or Excel, you are probably used to wide format data. Each variable has its own column, and each participant has their own row. Consider the example below, taken from the paper by Ryan, Wilde, & Crist (2013), which examines the willingness of participants to kill an insect as a function of its ‘disgustingness’ and ‘frighteningness’. Data was taken from JASP’s datasets, describing an experiment where participants rated willingness to kill an insect across four conditions (e.g., low disgust, high fright). They are stored on the OSF, available to download using pd.read_csv.

# Read and display
rm_data = pd.read_csv('https://osf.io/mrhjn/download')
display(rm_data.head())

	Subject	Gender	Region	Education	Lo D, Lo F	Lo D, Hi F	Hi D, Lo F	Hi D, Hi F
0	1	Female	North	some	6.0	6.0	9.0	10.0
1	2	Female	North	advance	10.0	NaN	10.0	10.0
2	3	Female	Europe	college	5.0	10.0	10.0	10.0
3	4	Female	North	college	6.0	9.0	6.0	9.0
4	5	Female	North	some	3.0	6.5	5.5	8.5

2.1. From wide to long#

For most analysis purposes, this format works well. But as we’ll see, some data operations (especially plotting) require the data in a different format - known as long or sometimes as narrow. In fact, most of the analysis world outside of SPSS require data in this long format. What does a long format version look like?

long = pd.melt(rm_data, id_vars=['Subject', 'Gender', 'Region', 'Education'], 
                 value_vars=['Lo D, Lo F', 'Lo D, Hi F', 'Hi D, Lo F', 'Hi D, Hi F'],
                var_name='Condition', value_name='Rating')

long.sort_values(by='Subject', inplace=True)

# Show some of the long format data
display(long[:8])

	Subject	Gender	Region	Education	Condition	Rating
0	1	Female	North	some	Lo D, Lo F	6.0
186	1	Female	North	some	Hi D, Lo F	9.0
279	1	Female	North	some	Hi D, Hi F	10.0
93	1	Female	North	some	Lo D, Hi F	6.0
94	2	Female	North	advance	Lo D, Hi F	NaN
1	2	Female	North	advance	Lo D, Lo F	10.0
187	2	Female	North	advance	Hi D, Lo F	10.0
280	2	Female	North	advance	Hi D, Hi F	10.0

Notice how the participant data is now repeated per row, and there is a separate row for each ‘Condition’. Rather than 4 columns for the data, there is now one, represented by ‘Rating’. This is long data. Importantly, working with data in this way poses no additional problems for us. Consider computing the means in each condition, across ratings:

# Means for original data
display(rm_data.mean().sort_values())

# Means for long data - simple group-by works
display(long.groupby('Condition').Rating.mean().sort_values())

/var/folders/mw/xt4ddf0j2n3dr4qcr__qhqlr0000gn/T/ipykernel_2980/3531915259.py:2: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  display(rm_data.mean().sort_values())

Lo D, Lo F     5.715054
Hi D, Lo F     6.741758
Lo D, Hi F     7.379121
Hi D, Hi F     7.853261
Subject       50.215054
dtype: float64

Condition
Lo D, Lo F    5.715054
Hi D, Lo F    6.741758
Lo D, Hi F    7.379121
Hi D, Hi F    7.853261
Name: Rating, dtype: float64

2.2. Melting data#

So how can we go from a ‘wide’ dataset our SPSS-using colleagues send us, and get it into a long format for other operations? There is a function dedicated solely for that - pd.melt().

This functions allows us to bend a dataset into a format where some columns are kept as ‘identifier’ variables, and all others are collapsed into a pair of columns - one that identifies the variable name, and the other that identifies the numeric value. Example:

# Subset rm_data to two participants; use for melt illustration
example = rm_data.iloc[0:2, :].copy()

# Melt!
melted_df = pd.melt(frame=example, id_vars=['Subject', 'Gender', 'Region', 'Education'],
                   value_vars=['Lo D, Lo F', 'Lo D, Hi F', 'Hi D, Lo F', 'Hi D, Hi F'],
                var_name='Condition', value_name='Rating')

# Or use the identical 'melt' method each DataFrame has - same outcome here as above, use as you wish
example.melt(id_vars=['Subject', 'Gender', 'Region', 'Education'],
                   value_vars=['Lo D, Lo F', 'Lo D, Hi F', 'Hi D, Lo F', 'Hi D, Hi F'],
                var_name='Condition', value_name='Rating')

display(example)
display(melted_df)

	Subject	Gender	Region	Education	Lo D, Lo F	Lo D, Hi F	Hi D, Lo F	Hi D, Hi F
0	1	Female	North	some	6.0	6.0	9.0	10.0
1	2	Female	North	advance	10.0	NaN	10.0	10.0

	Subject	Gender	Region	Education	Condition	Rating
0	1	Female	North	some	Lo D, Lo F	6.0
1	2	Female	North	advance	Lo D, Lo F	10.0
2	1	Female	North	some	Lo D, Hi F	6.0
3	2	Female	North	advance	Lo D, Hi F	NaN
4	1	Female	North	some	Hi D, Lo F	9.0
5	2	Female	North	advance	Hi D, Lo F	10.0
6	1	Female	North	some	Hi D, Hi F	10.0
7	2	Female	North	advance	Hi D, Hi F	10.0

The function is smart enough to know that anything not specified in id_vars should be collapsed into the pair of columns specifying the conditions and the values. However, anything not specified in id_vars or value_vars is simply dropped from the result!

# Lazy usages
ex1 = pd.melt(frame=example, id_vars=['Subject', 'Gender', 'Region', 'Education'],
                var_name='Condition', value_name='Rating')

# Remember to specify everything!
ex2 = pd.melt(frame=example, id_vars='Subject', 
              value_vars=['Lo D, Lo F', 'Lo D, Hi F', 'Hi D, Lo F'],
                var_name='Condition', value_name='Rating')

display(ex1)
display(ex2)

	Subject	Gender	Region	Education	Condition	Rating
0	1	Female	North	some	Lo D, Lo F	6.0
1	2	Female	North	advance	Lo D, Lo F	10.0
2	1	Female	North	some	Lo D, Hi F	6.0
3	2	Female	North	advance	Lo D, Hi F	NaN
4	1	Female	North	some	Hi D, Lo F	9.0
5	2	Female	North	advance	Hi D, Lo F	10.0
6	1	Female	North	some	Hi D, Hi F	10.0
7	2	Female	North	advance	Hi D, Hi F	10.0

	Subject	Condition	Rating
0	1	Lo D, Lo F	6.0
1	2	Lo D, Lo F	10.0
2	1	Lo D, Hi F	6.0
3	2	Lo D, Hi F	NaN
4	1	Hi D, Lo F	9.0
5	2	Hi D, Lo F	10.0

If you don’t specify any arguments, the whole DataFrame will be collapsed into two columns, repeating the column header for every observation in that column:

# Total meltdown
display(pd.melt(example))

	variable	value
0	Subject	1
1	Subject	2
2	Gender	Female
3	Gender	Female
4	Region	North
5	Region	North
6	Education	some
7	Education	advance
8	Lo D, Lo F	6.0
9	Lo D, Lo F	10.0
10	Lo D, Hi F	6.0
11	Lo D, Hi F	NaN
12	Hi D, Lo F	9.0
13	Hi D, Lo F	10.0
14	Hi D, Hi F	10.0
15	Hi D, Hi F	10.0

An introduction to data analysis in Python

Data surgery - reshaping data with melt

Contents

2. Data surgery - reshaping data with `melt`#

2.1. From wide to long#

2.2. Melting data#

An introduction to data analysis in Python

Data surgery - reshaping data with melt

Contents

2. Data surgery - reshaping data with melt#

2.1. From wide to long#

2.2. Melting data#

2. Data surgery - reshaping data with `melt`#