{
"cells": [
{
"cell_type": "markdown",
"id": "fc63a79c-aff9-4cbb-aba8-8723f0973d65",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Data surgery - reshaping data with `melt`"
]
},
{
"cell_type": "markdown",
"id": "beb4de69-d6a0-49fa-a366-7f29988938a9",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"If you're used to analysing data with SPSS or Excel, you are probably used to *wide format* data. Each variable has its own column, and each participant has their own row. Consider the example below, taken from the paper by [Ryan, Wilde, & Crist (2013)](https://www.sciencedirect.com/science/article/pii/S0747563213000277?via%3Dihub), which examines the willingness of participants to kill an insect as a function of its 'disgustingness' and 'frighteningness'. Data was taken from JASP's datasets, describing an experiment where participants rated willingness to kill an insect across four conditions (e.g., low disgust, high fright). They are stored on the OSF, available to download using `pd.read_csv`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e9561510-a3ec-457a-b7e0-ae01a5bbcc28",
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "41c346b3-1d9b-451a-a59f-771bb090f90c",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Subject | \n",
" Gender | \n",
" Region | \n",
" Education | \n",
" Lo D, Lo F | \n",
" Lo D, Hi F | \n",
" Hi D, Lo F | \n",
" Hi D, Hi F | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" 6.0 | \n",
" 6.0 | \n",
" 9.0 | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" 10.0 | \n",
" NaN | \n",
" 10.0 | \n",
" 10.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" Female | \n",
" Europe | \n",
" college | \n",
" 5.0 | \n",
" 10.0 | \n",
" 10.0 | \n",
" 10.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" Female | \n",
" North | \n",
" college | \n",
" 6.0 | \n",
" 9.0 | \n",
" 6.0 | \n",
" 9.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" Female | \n",
" North | \n",
" some | \n",
" 3.0 | \n",
" 6.5 | \n",
" 5.5 | \n",
" 8.5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Subject Gender Region Education Lo D, Lo F Lo D, Hi F Hi D, Lo F \\\n",
"0 1 Female North some 6.0 6.0 9.0 \n",
"1 2 Female North advance 10.0 NaN 10.0 \n",
"2 3 Female Europe college 5.0 10.0 10.0 \n",
"3 4 Female North college 6.0 9.0 6.0 \n",
"4 5 Female North some 3.0 6.5 5.5 \n",
"\n",
" Hi D, Hi F \n",
"0 10.0 \n",
"1 10.0 \n",
"2 10.0 \n",
"3 9.0 \n",
"4 8.5 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Read and display\n",
"rm_data = pd.read_csv('https://osf.io/mrhjn/download')\n",
"display(rm_data.head())"
]
},
{
"cell_type": "markdown",
"id": "31c29adf-5079-4ade-851b-85acd4c6739b",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### From wide to long\n",
"For most analysis purposes, this format works well. But as we'll see, some data operations (especially plotting) require the data in a different format - known as `long` or sometimes as `narrow`. In fact, most of the analysis world outside of SPSS require data in this `long` format. What does a `long` format version look like?"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "5ba04bde-4b9f-4a4e-b2da-b8bc7a28004e",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"long = pd.melt(rm_data, id_vars=['Subject', 'Gender', 'Region', 'Education'], \n",
" value_vars=['Lo D, Lo F', 'Lo D, Hi F', 'Hi D, Lo F', 'Hi D, Hi F'],\n",
" var_name='Condition', value_name='Rating')\n",
"\n",
"long.sort_values(by='Subject', inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "92feac9c-edd4-45df-970d-2f0990df2f67",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Subject | \n",
" Gender | \n",
" Region | \n",
" Education | \n",
" Condition | \n",
" Rating | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" Lo D, Lo F | \n",
" 6.0 | \n",
"
\n",
" \n",
" 186 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" Hi D, Lo F | \n",
" 9.0 | \n",
"
\n",
" \n",
" 279 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" Hi D, Hi F | \n",
" 10.0 | \n",
"
\n",
" \n",
" 93 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" Lo D, Hi F | \n",
" 6.0 | \n",
"
\n",
" \n",
" 94 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" Lo D, Hi F | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" Lo D, Lo F | \n",
" 10.0 | \n",
"
\n",
" \n",
" 187 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" Hi D, Lo F | \n",
" 10.0 | \n",
"
\n",
" \n",
" 280 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" Hi D, Hi F | \n",
" 10.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Subject Gender Region Education Condition Rating\n",
"0 1 Female North some Lo D, Lo F 6.0\n",
"186 1 Female North some Hi D, Lo F 9.0\n",
"279 1 Female North some Hi D, Hi F 10.0\n",
"93 1 Female North some Lo D, Hi F 6.0\n",
"94 2 Female North advance Lo D, Hi F NaN\n",
"1 2 Female North advance Lo D, Lo F 10.0\n",
"187 2 Female North advance Hi D, Lo F 10.0\n",
"280 2 Female North advance Hi D, Hi F 10.0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Show some of the long format data\n",
"display(long[:8])"
]
},
{
"cell_type": "markdown",
"id": "a2cd296c-54e4-409a-9fa0-8e1dfcf39f52",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Notice how the participant data is now repeated per row, and there is a separate row for each 'Condition'. Rather than 4 columns for the data, there is now one, represented by 'Rating'. This is `long` data. Importantly, working with data in this way poses no additional problems for us. Consider computing the means in each condition, across ratings:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "fb2945ad-a2a7-45d9-8b30-a9e3dee0bd4e",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Lo D, Lo F 5.715054\n",
"Hi D, Lo F 6.741758\n",
"Lo D, Hi F 7.379121\n",
"Hi D, Hi F 7.853261\n",
"Subject 50.215054\n",
"dtype: float64"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Condition\n",
"Lo D, Lo F 5.715054\n",
"Hi D, Lo F 6.741758\n",
"Lo D, Hi F 7.379121\n",
"Hi D, Hi F 7.853261\n",
"Name: Rating, dtype: float64"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Means for original data\n",
"display(rm_data.mean().sort_values())\n",
"\n",
"# Means for long data - simple group-by works\n",
"display(long.groupby('Condition').Rating.mean().sort_values())"
]
},
{
"cell_type": "markdown",
"id": "5b826634-bbbf-44e5-b341-191ccc83dee9",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Melting data\n",
"So how can we go from a 'wide' dataset our SPSS-using colleagues send us, and get it into a long format for other operations? There is a function dedicated solely for that - `pd.melt()`.\n",
"\n",
"This functions allows us to bend a dataset into a format where some columns are kept as 'identifier' variables, and all others are collapsed into a pair of columns - one that identifies the variable name, and the other that identifies the numeric value. Example:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "e2470502-4dbc-400b-b74a-4960ab2f960a",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Subject | \n",
" Gender | \n",
" Region | \n",
" Education | \n",
" Lo D, Lo F | \n",
" Lo D, Hi F | \n",
" Hi D, Lo F | \n",
" Hi D, Hi F | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" 6.0 | \n",
" 6.0 | \n",
" 9.0 | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" 10.0 | \n",
" NaN | \n",
" 10.0 | \n",
" 10.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Subject Gender Region Education Lo D, Lo F Lo D, Hi F Hi D, Lo F \\\n",
"0 1 Female North some 6.0 6.0 9.0 \n",
"1 2 Female North advance 10.0 NaN 10.0 \n",
"\n",
" Hi D, Hi F \n",
"0 10.0 \n",
"1 10.0 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Subject | \n",
" Gender | \n",
" Region | \n",
" Education | \n",
" Condition | \n",
" Rating | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" Lo D, Lo F | \n",
" 6.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" Lo D, Lo F | \n",
" 10.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" Lo D, Hi F | \n",
" 6.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" Lo D, Hi F | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" Hi D, Lo F | \n",
" 9.0 | \n",
"
\n",
" \n",
" 5 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" Hi D, Lo F | \n",
" 10.0 | \n",
"
\n",
" \n",
" 6 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" Hi D, Hi F | \n",
" 10.0 | \n",
"
\n",
" \n",
" 7 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" Hi D, Hi F | \n",
" 10.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Subject Gender Region Education Condition Rating\n",
"0 1 Female North some Lo D, Lo F 6.0\n",
"1 2 Female North advance Lo D, Lo F 10.0\n",
"2 1 Female North some Lo D, Hi F 6.0\n",
"3 2 Female North advance Lo D, Hi F NaN\n",
"4 1 Female North some Hi D, Lo F 9.0\n",
"5 2 Female North advance Hi D, Lo F 10.0\n",
"6 1 Female North some Hi D, Hi F 10.0\n",
"7 2 Female North advance Hi D, Hi F 10.0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Subset rm_data to two participants; use for melt illustration\n",
"example = rm_data.iloc[0:2, :].copy()\n",
"\n",
"# Melt!\n",
"melted_df = pd.melt(frame=example, id_vars=['Subject', 'Gender', 'Region', 'Education'],\n",
" value_vars=['Lo D, Lo F', 'Lo D, Hi F', 'Hi D, Lo F', 'Hi D, Hi F'],\n",
" var_name='Condition', value_name='Rating')\n",
"\n",
"# Or use the identical 'melt' method each DataFrame has - same outcome here as above, use as you wish\n",
"example.melt(id_vars=['Subject', 'Gender', 'Region', 'Education'],\n",
" value_vars=['Lo D, Lo F', 'Lo D, Hi F', 'Hi D, Lo F', 'Hi D, Hi F'],\n",
" var_name='Condition', value_name='Rating')\n",
"\n",
"display(example)\n",
"display(melted_df)"
]
},
{
"cell_type": "markdown",
"id": "84863ee7-4269-4382-8cac-6081542515cd",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The function is smart enough to know that anything not specified in `id_vars` should be collapsed into the pair of columns specifying the conditions and the values. However, anything not specified in `id_vars` or `value_vars` is simply dropped from the result!"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "ecef8caa-1d65-4a12-9e55-aa0844bd0d3b",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Subject | \n",
" Gender | \n",
" Region | \n",
" Education | \n",
" Condition | \n",
" Rating | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" Lo D, Lo F | \n",
" 6.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" Lo D, Lo F | \n",
" 10.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" Lo D, Hi F | \n",
" 6.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" Lo D, Hi F | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" Hi D, Lo F | \n",
" 9.0 | \n",
"
\n",
" \n",
" 5 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" Hi D, Lo F | \n",
" 10.0 | \n",
"
\n",
" \n",
" 6 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" Hi D, Hi F | \n",
" 10.0 | \n",
"
\n",
" \n",
" 7 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" Hi D, Hi F | \n",
" 10.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Subject Gender Region Education Condition Rating\n",
"0 1 Female North some Lo D, Lo F 6.0\n",
"1 2 Female North advance Lo D, Lo F 10.0\n",
"2 1 Female North some Lo D, Hi F 6.0\n",
"3 2 Female North advance Lo D, Hi F NaN\n",
"4 1 Female North some Hi D, Lo F 9.0\n",
"5 2 Female North advance Hi D, Lo F 10.0\n",
"6 1 Female North some Hi D, Hi F 10.0\n",
"7 2 Female North advance Hi D, Hi F 10.0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Subject | \n",
" Condition | \n",
" Rating | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Lo D, Lo F | \n",
" 6.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" Lo D, Lo F | \n",
" 10.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" Lo D, Hi F | \n",
" 6.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 2 | \n",
" Lo D, Hi F | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" Hi D, Lo F | \n",
" 9.0 | \n",
"
\n",
" \n",
" 5 | \n",
" 2 | \n",
" Hi D, Lo F | \n",
" 10.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Subject Condition Rating\n",
"0 1 Lo D, Lo F 6.0\n",
"1 2 Lo D, Lo F 10.0\n",
"2 1 Lo D, Hi F 6.0\n",
"3 2 Lo D, Hi F NaN\n",
"4 1 Hi D, Lo F 9.0\n",
"5 2 Hi D, Lo F 10.0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Lazy usages\n",
"ex1 = pd.melt(frame=example, id_vars=['Subject', 'Gender', 'Region', 'Education'],\n",
" var_name='Condition', value_name='Rating')\n",
"\n",
"# Remember to specify everything!\n",
"ex2 = pd.melt(frame=example, id_vars='Subject', \n",
" value_vars=['Lo D, Lo F', 'Lo D, Hi F', 'Hi D, Lo F'],\n",
" var_name='Condition', value_name='Rating')\n",
"\n",
"display(ex1)\n",
"display(ex2)"
]
},
{
"cell_type": "markdown",
"id": "73d83a16-0d14-4eb5-a676-637173063b8e",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"If you don't specify any arguments, the whole DataFrame will be collapsed into two columns, repeating the column header for every observation in that column:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "25c61179-1aab-4bcc-85e6-dd7af2381546",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" variable | \n",
" value | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Subject | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" Subject | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" Gender | \n",
" Female | \n",
"
\n",
" \n",
" 3 | \n",
" Gender | \n",
" Female | \n",
"
\n",
" \n",
" 4 | \n",
" Region | \n",
" North | \n",
"
\n",
" \n",
" 5 | \n",
" Region | \n",
" North | \n",
"
\n",
" \n",
" 6 | \n",
" Education | \n",
" some | \n",
"
\n",
" \n",
" 7 | \n",
" Education | \n",
" advance | \n",
"
\n",
" \n",
" 8 | \n",
" Lo D, Lo F | \n",
" 6.0 | \n",
"
\n",
" \n",
" 9 | \n",
" Lo D, Lo F | \n",
" 10.0 | \n",
"
\n",
" \n",
" 10 | \n",
" Lo D, Hi F | \n",
" 6.0 | \n",
"
\n",
" \n",
" 11 | \n",
" Lo D, Hi F | \n",
" NaN | \n",
"
\n",
" \n",
" 12 | \n",
" Hi D, Lo F | \n",
" 9.0 | \n",
"
\n",
" \n",
" 13 | \n",
" Hi D, Lo F | \n",
" 10.0 | \n",
"
\n",
" \n",
" 14 | \n",
" Hi D, Hi F | \n",
" 10.0 | \n",
"
\n",
" \n",
" 15 | \n",
" Hi D, Hi F | \n",
" 10.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" variable value\n",
"0 Subject 1\n",
"1 Subject 2\n",
"2 Gender Female\n",
"3 Gender Female\n",
"4 Region North\n",
"5 Region North\n",
"6 Education some\n",
"7 Education advance\n",
"8 Lo D, Lo F 6.0\n",
"9 Lo D, Lo F 10.0\n",
"10 Lo D, Hi F 6.0\n",
"11 Lo D, Hi F NaN\n",
"12 Hi D, Lo F 9.0\n",
"13 Hi D, Lo F 10.0\n",
"14 Hi D, Hi F 10.0\n",
"15 Hi D, Hi F 10.0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Total meltdown\n",
"display(pd.melt(example))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}