{
"cells": [
{
"cell_type": "markdown",
"id": "d88565d7-bad7-4ba2-96ce-8557e557f23f",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## There and back again - reshaping data with `.pivot_table()`.\n",
"Melting DataFrames is essential for working with data under specific contexts. But it also makes other uses hard. For example, how could we compute the difference between two conditions in our melted example of the `Bugs` dataset? \n",
"\n",
"The only solution is subsetting - extract a column where Condition equals the condition you want, then pull out the numeric values. Do the same for the other variable, but make sure the index is the same - if you break pairs the results will mean nothing! It was far easier before...\n",
"\n",
"A simple solution would be to put the data into a `wide` format. This can be done using `pd.pivot_table` or its identical DataFrame method:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "92ddf0a6-d78d-45c9-a93f-ed95e9681fa1",
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"melted_df = (pd\n",
" .read_csv('https://osf.io/mrhjn/download')\n",
" .iloc[0:2, :]\n",
" .melt(id_vars=['Subject', 'Gender', 'Region', 'Education'],\n",
" value_vars=['Lo D, Lo F', 'Lo D, Hi F', 'Hi D, Lo F', 'Hi D, Hi F'],\n",
" var_name='Condition', \n",
" value_name='Rating')\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "86778ada-e6b4-4da6-8d31-f4ab494e39ee",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" | \n",
" Condition | \n",
" Hi D, Hi F | \n",
" Hi D, Lo F | \n",
" Lo D, Hi F | \n",
" Lo D, Lo F | \n",
"
\n",
" \n",
" Subject | \n",
" Gender | \n",
" Region | \n",
" Education | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" 10.0 | \n",
" 9.0 | \n",
" 6.0 | \n",
" 6.0 | \n",
"
\n",
" \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" 10.0 | \n",
" 10.0 | \n",
" NaN | \n",
" 10.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Condition Hi D, Hi F Hi D, Lo F Lo D, Hi F \\\n",
"Subject Gender Region Education \n",
"1 Female North some 10.0 9.0 6.0 \n",
"2 Female North advance 10.0 10.0 NaN \n",
"\n",
"Condition Lo D, Lo F \n",
"Subject Gender Region Education \n",
"1 Female North some 6.0 \n",
"2 Female North advance 10.0 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Reshape the melted data back to the original\n",
"wide = pd.pivot_table(melted_df, index=['Subject', 'Gender', 'Region', 'Education'], columns='Condition', values='Rating')\n",
"\n",
"display(wide)"
]
},
{
"cell_type": "markdown",
"id": "af1d4550-39b7-4494-bf8e-cafc86529c9e",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"It should be clear this works in more or less the opposite way to `melt`. The arguments `columns` and `values` want the names of the pair of columns that house, respectively, the condition labels, and the numeric values. Easy - but `index` wants a list of the identifier variables that you specify when carrying out a `melt`. The result is that the DataFrame that comes out of a `pivot_table` has a `MultiIndex`. You can fix this with `.reset_index()`."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "1d1ccb24-a1f4-4b66-bb42-0516ecc521f7",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" Condition | \n",
" Subject | \n",
" Gender | \n",
" Region | \n",
" Education | \n",
" Hi D, Hi F | \n",
" Hi D, Lo F | \n",
" Lo D, Hi F | \n",
" Lo D, Lo F | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" 10.0 | \n",
" 9.0 | \n",
" 6.0 | \n",
" 6.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" 10.0 | \n",
" 10.0 | \n",
" NaN | \n",
" 10.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Condition Subject Gender Region Education Hi D, Hi F Hi D, Lo F \\\n",
"0 1 Female North some 10.0 9.0 \n",
"1 2 Female North advance 10.0 10.0 \n",
"\n",
"Condition Lo D, Hi F Lo D, Lo F \n",
"0 6.0 6.0 \n",
"1 NaN 10.0 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Reset index\n",
"display(wide.reset_index())"
]
},
{
"cell_type": "markdown",
"id": "72ad9e3f-679e-4c36-b2bb-2986eabdb8c8",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Advanced uses in data reshaping and cleaning\n",
"When moving from wide to long data, you may encounter situations in which you don't gain complete separation of a variable. When melting `rm_data` down as we did earlier, we obtain a column with four levels. These levels denote *two* variables, each with *two* levels. Pandas is blind to this fact since it is all wrapped together. While no issue for us in interpreting data, this poses problems for various analyses and graphing functions we will cover soon.\n",
"\n",
"The solution is to split a column like into two using some string methods. We already have most of the tools at our disposal to do this."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "a888acbe-47e0-4a88-80f8-8c3d282a1887",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Subject | \n",
" Gender | \n",
" Region | \n",
" Education | \n",
" Rating | \n",
" Disgust | \n",
" Fright | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" 6.0 | \n",
" Lo | \n",
" Lo | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" 10.0 | \n",
" Lo | \n",
" Lo | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" 6.0 | \n",
" Lo | \n",
" Hi | \n",
"
\n",
" \n",
" 3 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" NaN | \n",
" Lo | \n",
" Hi | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" 9.0 | \n",
" Hi | \n",
" Lo | \n",
"
\n",
" \n",
" 5 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" 10.0 | \n",
" Hi | \n",
" Lo | \n",
"
\n",
" \n",
" 6 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" 10.0 | \n",
" Hi | \n",
" Hi | \n",
"
\n",
" \n",
" 7 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" 10.0 | \n",
" Hi | \n",
" Hi | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Subject Gender Region Education Rating Disgust Fright\n",
"0 1 Female North some 6.0 Lo Lo\n",
"1 2 Female North advance 10.0 Lo Lo\n",
"2 1 Female North some 6.0 Lo Hi\n",
"3 2 Female North advance NaN Lo Hi\n",
"4 1 Female North some 9.0 Hi Lo\n",
"5 2 Female North advance 10.0 Hi Lo\n",
"6 1 Female North some 10.0 Hi Hi\n",
"7 2 Female North advance 10.0 Hi Hi"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Fix 'Condition' column, and extract into two \n",
"# First define a function that will split the column\n",
"def col_splitter(df_col, split_on=None):\n",
" \n",
" # This splits the column into two, we know will be 'Lo D', 'Hi D' etc\n",
" a, b = df_col.split(split_on)\n",
" \n",
" # Now remove the 'D' and 'F' - they are no longer necessary\n",
" a = a.replace('D', '').strip(' ')\n",
" b = b.replace('F', '').strip(' ')\n",
" \n",
" return pd.Series([a, b])\n",
"\n",
"# Assign the output to a pair of columns in the original DataFrame\n",
"melted_df[['Disgust', 'Fright']] = melted_df['Condition'].apply(col_splitter, split_on=',')\n",
"\n",
"# Finally, drop the original Condition column - 'drop' method is easy to use\n",
"melted_df.drop('Condition', axis=1, inplace=True)\n",
"\n",
"display(melted_df)"
]
},
{
"cell_type": "markdown",
"id": "e79101d3-68e7-481b-ab1b-e7ee573640f6",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"This data is still a little untidy. The labels in `Disgust` and `Fright` are lazy, but can easily be replaced."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "a9e7274c-2392-4ea3-b71e-ee5fbdd565d8",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Subject | \n",
" Gender | \n",
" Region | \n",
" Education | \n",
" Rating | \n",
" Disgust | \n",
" Fright | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" 6.0 | \n",
" Low | \n",
" Low | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" 10.0 | \n",
" Low | \n",
" Low | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" 6.0 | \n",
" Low | \n",
" High | \n",
"
\n",
" \n",
" 3 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" NaN | \n",
" Low | \n",
" High | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" 9.0 | \n",
" High | \n",
" Low | \n",
"
\n",
" \n",
" 5 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" 10.0 | \n",
" High | \n",
" Low | \n",
"
\n",
" \n",
" 6 | \n",
" 1 | \n",
" Female | \n",
" North | \n",
" some | \n",
" 10.0 | \n",
" High | \n",
" High | \n",
"
\n",
" \n",
" 7 | \n",
" 2 | \n",
" Female | \n",
" North | \n",
" advance | \n",
" 10.0 | \n",
" High | \n",
" High | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Subject Gender Region Education Rating Disgust Fright\n",
"0 1 Female North some 6.0 Low Low\n",
"1 2 Female North advance 10.0 Low Low\n",
"2 1 Female North some 6.0 Low High\n",
"3 2 Female North advance NaN Low High\n",
"4 1 Female North some 9.0 High Low\n",
"5 2 Female North advance 10.0 High Low\n",
"6 1 Female North some 10.0 High High\n",
"7 2 Female North advance 10.0 High High"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Tidy up further\n",
"replacement = {'Lo': 'Low', 'Hi': 'High'}\n",
"melted_df.replace({'Disgust': replacement, 'Fright': replacement}, inplace=True)\n",
"\n",
"display(melted_df)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "858f20d6-c324-42ae-8691-a6258a50b110",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Disgust Fright\n",
"High High 10.0\n",
" Low 9.5\n",
"Low High 6.0\n",
" Low 8.0\n",
"Name: Rating, dtype: float64"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Analysis is just as easy - notice how by encapsulating in parentheses gives a nice readable 'chain'\n",
"condition_means = (melted_df\n",
" .groupby(by=['Disgust', 'Fright'])\n",
" .agg({'Rating': 'mean'})\n",
" )\n",
"\n",
"display(condition_means)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}