{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Data surgery - joining DataFrames with `pd.concat` and `pd.merge`"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"So far, the data has come together neatly in a single DataFrame. This is almost always *not* how it works in the real world - you will probably have data from multiple sources that need to be collated into a single file (e.g., perhaps each participant has data stored in a separate file). \n",
"\n",
"You can't work with them separately, so you need to combine them. Pandas has a function, `pd.concat`, that is made to concatenate DataFrames together, and can combine along rows (stack on top) or along columns (stack next to each other)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ID | \n",
" Scores | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 32 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 89 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 32 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 56 | \n",
"
\n",
" \n",
" 0 | \n",
" 2 | \n",
" 11 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 48 | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" 49 | \n",
"
\n",
" \n",
" 3 | \n",
" 2 | \n",
" 92 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ID Scores\n",
"0 1 32\n",
"1 1 89\n",
"2 1 32\n",
"3 1 56\n",
"0 2 11\n",
"1 2 48\n",
"2 2 49\n",
"3 2 92"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Example\n",
"df1 = pd.DataFrame({'ID': [1]*4, 'Scores': [32, 89, 32, 56]})\n",
"df2 = pd.DataFrame({'ID': [2]*4, 'Scores': [11, 48, 49, 92]})\n",
"\n",
"# Stack on top to build a longer df\n",
"stack = pd.concat([df1, df2], axis=0)\n",
"display(stack)\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ID | \n",
" Scores | \n",
" ID | \n",
" Scores | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 32 | \n",
" 2 | \n",
" 11 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 89 | \n",
" 2 | \n",
" 48 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 32 | \n",
" 2 | \n",
" 49 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 56 | \n",
" 2 | \n",
" 92 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ID Scores ID Scores\n",
"0 1 32 2 11\n",
"1 1 89 2 48\n",
"2 1 32 2 49\n",
"3 1 56 2 92"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ID | \n",
" ID | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ID ID\n",
"0 1 2\n",
"1 1 2\n",
"2 1 2\n",
"3 1 2"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# The other direction works as well, but column names duplicate\n",
"joined = pd.concat([df1, df2], axis=1)\n",
"display(joined)\n",
"\n",
"# Confusing!\n",
"display(joined['ID'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Mismatched DataFrames\n",
"Behind the scenes, `pd.concat` will do its best to match on the either the columns by matching their names (and repeating the index values), or by matching on the index values and repeating the columns. A mismatch can have unintended results:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ID | \n",
" PID | \n",
" Result | \n",
" Scores | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" NaN | \n",
" 1.0 | \n",
" 32.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" NaN | \n",
" 1.0 | \n",
" 89.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" NaN | \n",
" 1.0 | \n",
" 32.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" NaN | \n",
" 1.0 | \n",
" 56.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 0 | \n",
" 2.0 | \n",
" NaN | \n",
" NaN | \n",
" 11.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.0 | \n",
" NaN | \n",
" NaN | \n",
" 48.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 2.0 | \n",
" NaN | \n",
" NaN | \n",
" 49.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 2.0 | \n",
" NaN | \n",
" NaN | \n",
" 92.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ID PID Result Scores\n",
"0 NaN 1.0 32.0 NaN\n",
"1 NaN 1.0 89.0 NaN\n",
"2 NaN 1.0 32.0 NaN\n",
"3 NaN 1.0 56.0 NaN\n",
"0 2.0 NaN NaN 11.0\n",
"1 2.0 NaN NaN 48.0\n",
"2 2.0 NaN NaN 49.0\n",
"3 2.0 NaN NaN 92.0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Change df1 columns to something else\n",
"df1.columns = ['PID', 'Result']\n",
"\n",
"odd_concat = pd.concat([df1, df2], axis=0, sort=True)\n",
"display(odd_concat)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Note how the index values are repeated, but NaN's are imputed because the column names don't match."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ID | \n",
" Scores | \n",
" ID | \n",
" Scores | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1.0 | \n",
" 32.0 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" b | \n",
" 1.0 | \n",
" 89.0 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" c | \n",
" 1.0 | \n",
" 32.0 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" d | \n",
" 1.0 | \n",
" 56.0 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 0 | \n",
" NaN | \n",
" NaN | \n",
" 2.0 | \n",
" 11.0 | \n",
"
\n",
" \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
" 2.0 | \n",
" 48.0 | \n",
"
\n",
" \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
" 2.0 | \n",
" 49.0 | \n",
"
\n",
" \n",
" 3 | \n",
" NaN | \n",
" NaN | \n",
" 2.0 | \n",
" 92.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ID Scores ID Scores\n",
"a 1.0 32.0 NaN NaN\n",
"b 1.0 89.0 NaN NaN\n",
"c 1.0 32.0 NaN NaN\n",
"d 1.0 56.0 NaN NaN\n",
"0 NaN NaN 2.0 11.0\n",
"1 NaN NaN 2.0 48.0\n",
"2 NaN NaN 2.0 49.0\n",
"3 NaN NaN 2.0 92.0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Fix df1 columns back\n",
"df1.columns = df2.columns\n",
"\n",
"# Change df1 index\n",
"df1.index = list('abcd')\n",
"\n",
"# Now concat along columns\n",
"odd_concat2 = pd.concat([df1, df2], axis=1)\n",
"display(odd_concat2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The same thing happens. Make sure things match before concatenation!"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Finally, when concatenating row-wise, you can pass to `pd.concat` the keyword `ignore_index=True` which will reset the index so it doesn't repeat."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['a', 'b', 'c', 'd'], dtype='object')"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"RangeIndex(start=0, stop=4, step=1)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ID | \n",
" Scores | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 32 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 89 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 32 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 56 | \n",
"
\n",
" \n",
" 4 | \n",
" 2 | \n",
" 11 | \n",
"
\n",
" \n",
" 5 | \n",
" 2 | \n",
" 48 | \n",
"
\n",
" \n",
" 6 | \n",
" 2 | \n",
" 49 | \n",
"
\n",
" \n",
" 7 | \n",
" 2 | \n",
" 92 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ID Scores\n",
"0 1 32\n",
"1 1 89\n",
"2 1 32\n",
"3 1 56\n",
"4 2 11\n",
"5 2 48\n",
"6 2 49\n",
"7 2 92"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(df1.index)\n",
"display(df2.index)\n",
"\n",
"fixed = pd.concat([df1, df2], axis=0, ignore_index=True)\n",
"display(fixed)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Advanced joins with `pd.merge`\n",
"Sometimes, you will end up in situations where you need to combine DataFrames in an intelligent way, rather than simply stacking them across either axis. \n",
"\n",
"Consider the following example. You have your participant demographics in a single DataFrame, but their data responses are in another. Worse still, while most participants completed the demographics section, not all of them completed the experiment. Some even completed study, but didn't log their demographics! So the DataFrames are of different sizes. \n",
"\n",
"How to place everything together into a single DataFrame in this nightmare scenario? A full understanding of `pd.merge` will solve most of these problems."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Make a demographics DataFrame\n",
"demographics = pd.DataFrame({'ID': ['001', '002', '003', '004', '005'],\n",
" 'Age': [21, 25, 18, 19, 20],\n",
" 'Sex': ['Female', 'Male', 'Male', 'Female', 'Male']})\n",
"\n",
"# Make a dataset\n",
"data = pd.DataFrame({'ID': ['001', '002', '004', '006'],\n",
" 'A_Score': [228, 900, 291, 440],\n",
" 'B_Score': [321, 225, 100, 300]})"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ID | \n",
" Age | \n",
" Sex | \n",
" ID | \n",
" A_Score | \n",
" B_Score | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 001 | \n",
" 21 | \n",
" Female | \n",
" 001 | \n",
" 228.0 | \n",
" 321.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 002 | \n",
" 25 | \n",
" Male | \n",
" 002 | \n",
" 900.0 | \n",
" 225.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 003 | \n",
" 18 | \n",
" Male | \n",
" 004 | \n",
" 291.0 | \n",
" 100.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 004 | \n",
" 19 | \n",
" Female | \n",
" 006 | \n",
" 440.0 | \n",
" 300.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 005 | \n",
" 20 | \n",
" Male | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ID Age Sex ID A_Score B_Score\n",
"0 001 21 Female 001 228.0 321.0\n",
"1 002 25 Male 002 900.0 225.0\n",
"2 003 18 Male 004 291.0 100.0\n",
"3 004 19 Female 006 440.0 300.0\n",
"4 005 20 Male NaN NaN NaN"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# First try pd.concat\n",
"try1 = pd.concat([demographics, data], axis=1)\n",
"display(try1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"`pd.concat` gives us a total mess. Let's use `pd.merge` and provide it with three arguments:\n",
"* `left`: The DataFrame on the left of the join.\n",
"* `right`: The DataFrame on the right of the join.\n",
"* `on`: A specific column header shared between both DataFrames."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ID | \n",
" Age | \n",
" Sex | \n",
" A_Score | \n",
" B_Score | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 001 | \n",
" 21 | \n",
" Female | \n",
" 228 | \n",
" 321 | \n",
"
\n",
" \n",
" 1 | \n",
" 002 | \n",
" 25 | \n",
" Male | \n",
" 900 | \n",
" 225 | \n",
"
\n",
" \n",
" 2 | \n",
" 004 | \n",
" 19 | \n",
" Female | \n",
" 291 | \n",
" 100 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ID Age Sex A_Score B_Score\n",
"0 001 21 Female 228 321\n",
"1 002 25 Male 900 225\n",
"2 004 19 Female 291 100"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Try joining on 'ID'\n",
"initial_merge = pd.merge(left=demographics, right=data, on='ID')\n",
"display(initial_merge)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"What's happened? `pd.merge` has looked at both DataFrames and seen that IDs `001`, `002`, and `004` are the only values that appear in *both* DataFrames, removed the values from `demographics` that are not in `data`, and joined the results."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Sometimes the value you might want to join on is in the index of one of the DataFrames. No problem - `pd.merge` has the optional keyword arguments `left_index` and `right_index`, and setting them to `True` will tell Pandas to look at the index of the DataFrame on the left/right. However, because we have told Pandas to look at the index of one of the DataFrames, we need to tell it what column to look at in the *other* - the keyword arguments `left_on` and `right_on` will take care of that. "
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A_Score | \n",
" B_Score | \n",
"
\n",
" \n",
" ID | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 001 | \n",
" 228 | \n",
" 321 | \n",
"
\n",
" \n",
" 002 | \n",
" 900 | \n",
" 225 | \n",
"
\n",
" \n",
" 004 | \n",
" 291 | \n",
" 100 | \n",
"
\n",
" \n",
" 006 | \n",
" 440 | \n",
" 300 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A_Score B_Score\n",
"ID \n",
"001 228 321\n",
"002 900 225\n",
"004 291 100\n",
"006 440 300"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Set the index of data to be the ID column\n",
"new_data = data.set_index('ID')\n",
"display(new_data)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ID | \n",
" Age | \n",
" Sex | \n",
" A_Score | \n",
" B_Score | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 001 | \n",
" 21 | \n",
" Female | \n",
" 228 | \n",
" 321 | \n",
"
\n",
" \n",
" 1 | \n",
" 002 | \n",
" 25 | \n",
" Male | \n",
" 900 | \n",
" 225 | \n",
"
\n",
" \n",
" 3 | \n",
" 004 | \n",
" 19 | \n",
" Female | \n",
" 291 | \n",
" 100 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ID Age Sex A_Score B_Score\n",
"0 001 21 Female 228 321\n",
"1 002 25 Male 900 225\n",
"3 004 19 Female 291 100"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Merge with extra keywords\n",
"index_merge = pd.merge(left=demographics, right=new_data, left_on='ID', right_index=True)\n",
"display(index_merge)"
]
},
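{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"As a quick sketch of the other combination mentioned above: if *both* DataFrames are indexed by ID, you can set `left_index=True` and `right_index=True` together, with no `on` columns at all (the `demo_indexed` name below is just for illustration)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# A sketch - index both DataFrames by ID, then merge on both indexes\n",
"demo_indexed = demographics.set_index('ID')\n",
"both_index_merge = pd.merge(left=demo_indexed, right=new_data, left_index=True, right_index=True)\n",
"display(both_index_merge)"
]
},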
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### How to join - inner, outer, left, and right\n",
"The default behaviour of `pd.merge` is to conduct what is known as an *inner join* - find elements of the specified columns that are in **both** DataFrames, discard the rest, and join together. However, sometimes you need other kinds of joins. Pandas provides these other options:\n",
"* The `outer` join: Takes items that are present in the specified column of **either** DataFrame and cleanly stacks the results together.\n",
"* The `left` join: Takes items present in the left DataFrame specified column, and will append matches from the right DataFrame or force them.\n",
"* The `right` join: Takes items present in the right DataFrame specified column, and will append matches from the left DataFrame or force them."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ID | \n",
" Age | \n",
" Sex | \n",
" A_Score | \n",
" B_Score | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 001 | \n",
" 21.0 | \n",
" Female | \n",
" 228.0 | \n",
" 321.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 002 | \n",
" 25.0 | \n",
" Male | \n",
" 900.0 | \n",
" 225.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 003 | \n",
" 18.0 | \n",
" Male | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" 004 | \n",
" 19.0 | \n",
" Female | \n",
" 291.0 | \n",
" 100.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 005 | \n",
" 20.0 | \n",
" Male | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 5 | \n",
" 006 | \n",
" NaN | \n",
" NaN | \n",
" 440.0 | \n",
" 300.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ID Age Sex A_Score B_Score\n",
"0 001 21.0 Female 228.0 321.0\n",
"1 002 25.0 Male 900.0 225.0\n",
"2 003 18.0 Male NaN NaN\n",
"3 004 19.0 Female 291.0 100.0\n",
"4 005 20.0 Male NaN NaN\n",
"5 006 NaN NaN 440.0 300.0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# What happens with outer?\n",
"outer_join = pd.merge(left=demographics, right=data, on='ID', how='outer')\n",
"display(outer_join)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"With `outer`, the values that were present in the `ID` column of both DataFrames were put together, and any missing values were \n",
"replaced with NaN."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ID | \n",
" Age | \n",
" Sex | \n",
" A_Score | \n",
" B_Score | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 001 | \n",
" 21 | \n",
" Female | \n",
" 228.0 | \n",
" 321.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 002 | \n",
" 25 | \n",
" Male | \n",
" 900.0 | \n",
" 225.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 003 | \n",
" 18 | \n",
" Male | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" 004 | \n",
" 19 | \n",
" Female | \n",
" 291.0 | \n",
" 100.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 005 | \n",
" 20 | \n",
" Male | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ID Age Sex A_Score B_Score\n",
"0 001 21 Female 228.0 321.0\n",
"1 002 25 Male 900.0 225.0\n",
"2 003 18 Male NaN NaN\n",
"3 004 19 Female 291.0 100.0\n",
"4 005 20 Male NaN NaN"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# What happens with left?\n",
"left_join = pd.merge(left=demographics, right=data, on='ID', how='left')\n",
"display(left_join)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"With `left`, all the elements in `ID` were retained, and matches from the right DataFrame were pulled in. If there were no matches, they were filled with NaN. Notice how participant `006` is missing here - it was only in the right DataFrame, so wasn't filled. \n",
"\n",
"This is a particularly useful approach if you have a set of demographics data and want the data for *all* participants who logged their demographics, even if they didn't complete their data."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ID | \n",
" Age | \n",
" Sex | \n",
" A_Score | \n",
" B_Score | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 001 | \n",
" 21.0 | \n",
" Female | \n",
" 228 | \n",
" 321 | \n",
"
\n",
" \n",
" 1 | \n",
" 002 | \n",
" 25.0 | \n",
" Male | \n",
" 900 | \n",
" 225 | \n",
"
\n",
" \n",
" 2 | \n",
" 004 | \n",
" 19.0 | \n",
" Female | \n",
" 291 | \n",
" 100 | \n",
"
\n",
" \n",
" 3 | \n",
" 006 | \n",
" NaN | \n",
" NaN | \n",
" 440 | \n",
" 300 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ID Age Sex A_Score B_Score\n",
"0 001 21.0 Female 228 321\n",
"1 002 25.0 Male 900 225\n",
"2 004 19.0 Female 291 100\n",
"3 006 NaN NaN 440 300"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# What happens with right? A reverse of left\n",
"right_join = pd.merge(left=demographics, right=data, on='ID', how='right')\n",
"display(right_join)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The same result could be achieved by swapping the DataFrames that were specified as left and right, and setting join to `left`. All entries from the right DataFrame are retained, and matches are pulled from the left DataFrame or are forced if they don't exist - note how participant `005` is missing here - they were only present in the left DataFrame, and not the right."
]
},
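{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"As a quick sketch of that equivalence (the `swapped_join` name is just for illustration): swapping the DataFrames and using `how='left'` keeps the same rows, though the column order differs because the score columns now come from the left DataFrame."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Swap left and right, then use how='left' - same rows as the right join above\n",
"swapped_join = pd.merge(left=data, right=demographics, on='ID', how='left')\n",
"display(swapped_join)"
]
},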
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### `pandas` close\n",
"With the concepts from this chapter and the previous ones, you have all the tools you might need for serious data wrangling. `pandas` is versatile and can handle computation on even very large datasets. At first, it will feel awkward and difficult; but keep practicing. \n",
"\n",
"If you want to dive deeper, try [Effective Pandas](https://hairysun.com/announcing-effective-pandas.html) by Matt Harrison. "
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}