{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "19f57f7d-0699-45e1-bf58-491d1870272e",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Advanced data handling with `pandas`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "43d9a2c7-1575-47b8-8a71-4709c7fa52df",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### The `.apply()` method\n",
    "So far, `.transform()` and `.agg()` have been powerful methods we can use to pass existing functions, or even custom functions, to our data. There is one more method that allows the passing of functions to a DataFrame in a flexible way - `.apply()`. This method allows you to pass a function to each row or column of a DataFrame, and the result can flexibly be a single number (unlike `.transform()`) or another DataFrame. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "13eaa770-ca0a-4cb2-af6d-2392ef6d04b9",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "5386747a-ae45-4155-bdc6-1c4869d72902",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# Demonstrate apply\n",
    "sample_df = pd.DataFrame({'item': ['mirror', 'chair', 'drawers', 'desk'],\n",
    "                          'cost':['£32.1023', '£21.29102022', '£90.21', '£99.1']})\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "483147f1-286d-491a-a849-5d0b4f2deec5",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>item</th>\n",
       "      <th>cost</th>\n",
       "      <th>describe</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>mirror</td>\n",
       "      <td>£32.1023</td>\n",
       "      <td>mirror costs: £32.1023</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chair</td>\n",
       "      <td>£21.29102022</td>\n",
       "      <td>chair costs: £21.29102022</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>drawers</td>\n",
       "      <td>£90.21</td>\n",
       "      <td>drawers costs: £90.21</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>desk</td>\n",
       "      <td>£99.1</td>\n",
       "      <td>desk costs: £99.1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      item          cost                   describe\n",
       "0   mirror      £32.1023     mirror costs: £32.1023\n",
       "1    chair  £21.29102022  chair costs: £21.29102022\n",
       "2  drawers        £90.21      drawers costs: £90.21\n",
       "3     desk         £99.1          desk costs: £99.1"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Define a function to create a nice print out of each row\n",
    "def df_printer(df_row):\n",
    "    return df_row['item'] + ' costs: ' + df_row['cost']\n",
    "\n",
    "# Apply along columns\n",
    "sample_df['describe'] =  sample_df.apply(df_printer, axis=1)\n",
    "\n",
    "display(sample_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "69b96437-cee1-4b59-94d2-8bb81fc9df59",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>item</th>\n",
       "      <th>cost</th>\n",
       "      <th>describe</th>\n",
       "      <th>clean_cost</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>mirror</td>\n",
       "      <td>£32.1023</td>\n",
       "      <td>mirror costs: £32.1023</td>\n",
       "      <td>32.102</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chair</td>\n",
       "      <td>£21.29102022</td>\n",
       "      <td>chair costs: £21.29102022</td>\n",
       "      <td>21.291</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>drawers</td>\n",
       "      <td>£90.21</td>\n",
       "      <td>drawers costs: £90.21</td>\n",
       "      <td>90.210</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>desk</td>\n",
       "      <td>£99.1</td>\n",
       "      <td>desk costs: £99.1</td>\n",
       "      <td>99.100</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      item          cost                   describe  clean_cost\n",
       "0   mirror      £32.1023     mirror costs: £32.1023      32.102\n",
       "1    chair  £21.29102022  chair costs: £21.29102022      21.291\n",
       "2  drawers        £90.21      drawers costs: £90.21      90.210\n",
       "3     desk         £99.1          desk costs: £99.1      99.100"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Clean the cost column up with a lambda\n",
    "sample_df['clean_cost'] = sample_df['cost'].apply(lambda x: round(float(x.replace('£', '')), 3))\n",
    "sample_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "58f9f130-ecbf-413c-a5e8-3eac5fc4374e",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "A more practical example might help more. How could we correlate each column of a DataFrame with a specific variable? This is a common task - perhaps we want to check how our predictors are correlated with our outcome measure. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "1b19708f-c538-47f8-9631-cee24a27461b",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# For this example, use the 'iris' data set available here: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data\n",
    "iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None,\n",
    "                  names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "450664f8-9afd-4f27-bfed-65c47034bae7",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pearsonr</th>\n",
       "      <th>pvalue</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>sepal_length</th>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>sepal_width</th>\n",
       "      <td>-0.109</td>\n",
       "      <td>0.183</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>petal_length</th>\n",
       "      <td>0.872</td>\n",
       "      <td>0.000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>petal_width</th>\n",
       "      <td>0.818</td>\n",
       "      <td>0.000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>species</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              pearsonr  pvalue\n",
       "sepal_length     1.000   0.000\n",
       "sepal_width     -0.109   0.183\n",
       "petal_length     0.872   0.000\n",
       "petal_width      0.818   0.000\n",
       "species            NaN     NaN"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# How to correlate, with p-value, each column with sepal_width?\n",
    "# import correlation function - more on this in later chapters\n",
    "from scipy.stats import pearsonr\n",
    "\n",
    "def sepal_correlate(df_col):\n",
    "    \n",
    "    # Check type, and just return a pair of NaN if so. \n",
    "    if df_col.dtype == 'O':\n",
    "        return pd.Series([np.nan, np.nan], index=['pearsonr', 'pvalue'])\n",
    "    \n",
    "    # Correlation function returns two value\n",
    "    r, pval = pearsonr(iris['sepal_length'], df_col)\n",
    "    \n",
    "    # Place these into a 'Series' - a single column from a DataFrame\n",
    "    output = pd.Series([r, pval], index=['pearsonr', 'pvalue'])\n",
    "    \n",
    "    return output\n",
    "\n",
    "# Apply\n",
    "corrs = iris.apply(sepal_correlate, axis=0)\n",
    "\n",
    "# Display and transpose\n",
    "display(corrs.T.round(3))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bbf8ebbf-a62a-4d22-8021-6b9d38ea82d9",
   "metadata": {},
   "source": [
    "There are other ways to achieve this that may be simpler, but the general idea of taking specific columns and applying functions to is an all-round good solution. `pandas` does have a built-in `.corr` method on DataFrames; but if you are interested in *p*-values, it will not return them. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}