{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d88565d7-bad7-4ba2-96ce-8557e557f23f",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## There and back again - reshaping data with `.pivot_table()`.\n",
    "Melting DataFrames is essential for working with data under specific contexts. But it also makes other uses hard. For example, how could we compute the difference between two conditions in our melted example of the `Bugs` dataset? \n",
    "\n",
    "The only solution is subsetting - extract a column where Condition equals the condition you want, then pull out the numeric values. Do the same for the other variable, but make sure the index is the same - if you break pairs the results will mean nothing! It was far easier before...\n",
    "\n",
    "A simple solution would be to put the data into a `wide` format. This can be done using `pd.pivot_table` or its identical DataFrame method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92ddf0a6-d78d-45c9-a93f-ed95e9681fa1",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "melted_df = (pd\n",
    "             .read_csv('https://osf.io/mrhjn/download')\n",
    "             .iloc[0:2, :]\n",
    "             .melt(id_vars=['Subject', 'Gender', 'Region', 'Education'],\n",
    "                   value_vars=['Lo D, Lo F', 'Lo D, Hi F', 'Hi D, Lo F', 'Hi D, Hi F'],\n",
    "                   var_name='Condition', \n",
    "                   value_name='Rating')\n",
    "            )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "86778ada-e6b4-4da6-8d31-f4ab494e39ee",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>Condition</th>\n",
       "      <th>Hi D, Hi F</th>\n",
       "      <th>Hi D, Lo F</th>\n",
       "      <th>Lo D, Hi F</th>\n",
       "      <th>Lo D, Lo F</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Subject</th>\n",
       "      <th>Gender</th>\n",
       "      <th>Region</th>\n",
       "      <th>Education</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <th>Female</th>\n",
       "      <th>North</th>\n",
       "      <th>some</th>\n",
       "      <td>10.0</td>\n",
       "      <td>9.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <th>Female</th>\n",
       "      <th>North</th>\n",
       "      <th>advance</th>\n",
       "      <td>10.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Condition                        Hi D, Hi F  Hi D, Lo F  Lo D, Hi F  \\\n",
       "Subject Gender Region Education                                       \n",
       "1       Female North  some             10.0         9.0         6.0   \n",
       "2       Female North  advance          10.0        10.0         NaN   \n",
       "\n",
       "Condition                        Lo D, Lo F  \n",
       "Subject Gender Region Education              \n",
       "1       Female North  some              6.0  \n",
       "2       Female North  advance          10.0  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Reshape the melted data back to the original\n",
    "wide = pd.pivot_table(melted_df, index=['Subject', 'Gender', 'Region', 'Education'], columns='Condition', values='Rating')\n",
    "\n",
    "display(wide)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af1d4550-39b7-4494-bf8e-cafc86529c9e",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "It should be clear this works in more or less the opposite way to `melt`. The arguments `columns` and `values` want the names of the pair of columns that house, respectively, the condition labels, and the numeric values. Easy - but `index` wants a list of the identifier variables that you specify when carrying out a `melt`. The result is that the DataFrame that comes out of a `pivot_table` has a `MultiIndex`. You can fix this with `.reset_index()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "1d1ccb24-a1f4-4b66-bb42-0516ecc521f7",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>Condition</th>\n",
       "      <th>Subject</th>\n",
       "      <th>Gender</th>\n",
       "      <th>Region</th>\n",
       "      <th>Education</th>\n",
       "      <th>Hi D, Hi F</th>\n",
       "      <th>Hi D, Lo F</th>\n",
       "      <th>Lo D, Hi F</th>\n",
       "      <th>Lo D, Lo F</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Female</td>\n",
       "      <td>North</td>\n",
       "      <td>some</td>\n",
       "      <td>10.0</td>\n",
       "      <td>9.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Female</td>\n",
       "      <td>North</td>\n",
       "      <td>advance</td>\n",
       "      <td>10.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Condition  Subject  Gender Region Education  Hi D, Hi F  Hi D, Lo F  \\\n",
       "0                1  Female  North      some        10.0         9.0   \n",
       "1                2  Female  North   advance        10.0        10.0   \n",
       "\n",
       "Condition  Lo D, Hi F  Lo D, Lo F  \n",
       "0                 6.0         6.0  \n",
       "1                 NaN        10.0  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Reset index\n",
    "display(wide.reset_index())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "72ad9e3f-679e-4c36-b2bb-2986eabdb8c8",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Advanced uses in data reshaping and cleaning\n",
    "When moving from wide to long data, you may encounter situations in which you don't gain complete separation of a variable. When melting `rm_data` down as we did earlier, we obtain a column with four levels. These levels denote *two* variables, each with *two* levels. Pandas is blind to this fact since it is all wrapped together. While no issue for us in interpreting data, this poses problems for various analyses and graphing functions we will cover soon.\n",
    "\n",
    "The solution is to split a column like into two using some string methods. We already have most of the tools at our disposal to do this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "a888acbe-47e0-4a88-80f8-8c3d282a1887",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Subject</th>\n",
       "      <th>Gender</th>\n",
       "      <th>Region</th>\n",
       "      <th>Education</th>\n",
       "      <th>Rating</th>\n",
       "      <th>Disgust</th>\n",
       "      <th>Fright</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Female</td>\n",
       "      <td>North</td>\n",
       "      <td>some</td>\n",
       "      <td>6.0</td>\n",
       "      <td>Lo</td>\n",
       "      <td>Lo</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Female</td>\n",
       "      <td>North</td>\n",
       "      <td>advance</td>\n",
       "      <td>10.0</td>\n",
       "      <td>Lo</td>\n",
       "      <td>Lo</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>Female</td>\n",
       "      <td>North</td>\n",
       "      <td>some</td>\n",
       "      <td>6.0</td>\n",
       "      <td>Lo</td>\n",
       "      <td>Hi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>Female</td>\n",
       "      <td>North</td>\n",
       "      <td>advance</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Lo</td>\n",
       "      <td>Hi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>Female</td>\n",
       "      <td>North</td>\n",
       "      <td>some</td>\n",
       "      <td>9.0</td>\n",
       "      <td>Hi</td>\n",
       "      <td>Lo</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2</td>\n",
       "      <td>Female</td>\n",
       "      <td>North</td>\n",
       "      <td>advance</td>\n",
       "      <td>10.0</td>\n",
       "      <td>Hi</td>\n",
       "      <td>Lo</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>1</td>\n",
       "      <td>Female</td>\n",
       "      <td>North</td>\n",
       "      <td>some</td>\n",
       "      <td>10.0</td>\n",
       "      <td>Hi</td>\n",
       "      <td>Hi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2</td>\n",
       "      <td>Female</td>\n",
       "      <td>North</td>\n",
       "      <td>advance</td>\n",
       "      <td>10.0</td>\n",
       "      <td>Hi</td>\n",
       "      <td>Hi</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Subject  Gender Region Education  Rating Disgust Fright\n",
       "0        1  Female  North      some     6.0      Lo     Lo\n",
       "1        2  Female  North   advance    10.0      Lo     Lo\n",
       "2        1  Female  North      some     6.0      Lo     Hi\n",
       "3        2  Female  North   advance     NaN      Lo     Hi\n",
       "4        1  Female  North      some     9.0      Hi     Lo\n",
       "5        2  Female  North   advance    10.0      Hi     Lo\n",
       "6        1  Female  North      some    10.0      Hi     Hi\n",
       "7        2  Female  North   advance    10.0      Hi     Hi"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Fix 'Condition' column, and extract into two \n",
    "# First define a function that will split the column\n",
    "def col_splitter(df_col, split_on=None):\n",
    "    \n",
    "    # This splits the column into two, we know will be 'Lo D', 'Hi D' etc\n",
    "    a, b = df_col.split(split_on)\n",
    "    \n",
    "    # Now remove the 'D' and 'F' - they are no longer necessary\n",
    "    a = a.replace('D', '').strip(' ')\n",
    "    b = b.replace('F', '').strip(' ')\n",
    "    \n",
    "    return pd.Series([a, b])\n",
    "\n",
    "# Assign the output to a pair of columns in the original DataFrame\n",
    "melted_df[['Disgust', 'Fright']] = melted_df['Condition'].apply(col_splitter, split_on=',')\n",
    "\n",
    "# Finally, drop the original Condition column - 'drop' method is easy to use\n",
    "melted_df.drop('Condition', axis=1, inplace=True)\n",
    "\n",
    "display(melted_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e79101d3-68e7-481b-ab1b-e7ee573640f6",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "This data is still a little untidy. The labels in `Disgust` and `Fright` are lazy, but can easily be replaced."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "a9e7274c-2392-4ea3-b71e-ee5fbdd565d8",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Subject</th>\n",
       "      <th>Gender</th>\n",
       "      <th>Region</th>\n",
       "      <th>Education</th>\n",
       "      <th>Rating</th>\n",
       "      <th>Disgust</th>\n",
       "      <th>Fright</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Female</td>\n",
       "      <td>North</td>\n",
       "      <td>some</td>\n",
       "      <td>6.0</td>\n",
       "      <td>Low</td>\n",
       "      <td>Low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Female</td>\n",
       "      <td>North</td>\n",
       "      <td>advance</td>\n",
       "      <td>10.0</td>\n",
       "      <td>Low</td>\n",
       "      <td>Low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>Female</td>\n",
       "      <td>North</td>\n",
       "      <td>some</td>\n",
       "      <td>6.0</td>\n",
       "      <td>Low</td>\n",
       "      <td>High</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>Female</td>\n",
       "      <td>North</td>\n",
       "      <td>advance</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Low</td>\n",
       "      <td>High</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>Female</td>\n",
       "      <td>North</td>\n",
       "      <td>some</td>\n",
       "      <td>9.0</td>\n",
       "      <td>High</td>\n",
       "      <td>Low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2</td>\n",
       "      <td>Female</td>\n",
       "      <td>North</td>\n",
       "      <td>advance</td>\n",
       "      <td>10.0</td>\n",
       "      <td>High</td>\n",
       "      <td>Low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>1</td>\n",
       "      <td>Female</td>\n",
       "      <td>North</td>\n",
       "      <td>some</td>\n",
       "      <td>10.0</td>\n",
       "      <td>High</td>\n",
       "      <td>High</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2</td>\n",
       "      <td>Female</td>\n",
       "      <td>North</td>\n",
       "      <td>advance</td>\n",
       "      <td>10.0</td>\n",
       "      <td>High</td>\n",
       "      <td>High</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Subject  Gender Region Education  Rating Disgust Fright\n",
       "0        1  Female  North      some     6.0     Low    Low\n",
       "1        2  Female  North   advance    10.0     Low    Low\n",
       "2        1  Female  North      some     6.0     Low   High\n",
       "3        2  Female  North   advance     NaN     Low   High\n",
       "4        1  Female  North      some     9.0    High    Low\n",
       "5        2  Female  North   advance    10.0    High    Low\n",
       "6        1  Female  North      some    10.0    High   High\n",
       "7        2  Female  North   advance    10.0    High   High"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Tidy up further\n",
    "replacement = {'Lo': 'Low', 'Hi': 'High'}\n",
    "melted_df.replace({'Disgust': replacement, 'Fright': replacement}, inplace=True)\n",
    "\n",
    "display(melted_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "858f20d6-c324-42ae-8691-a6258a50b110",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Disgust  Fright\n",
       "High     High      10.0\n",
       "         Low        9.5\n",
       "Low      High       6.0\n",
       "         Low        8.0\n",
       "Name: Rating, dtype: float64"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Analysis is just as easy - notice how by encapsulating in parentheses gives a nice readable 'chain'\n",
    "condition_means = (melted_df\n",
    "                   .groupby(by=['Disgust', 'Fright'])\n",
    "                   .agg({'Rating': 'mean'})\n",
    "                  )\n",
    "\n",
    "display(condition_means)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}