{ "cells": [ { "cell_type": "markdown", "id": "19f57f7d-0699-45e1-bf58-491d1870272e", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Advanced data handling with `pandas`" ] }, { "cell_type": "markdown", "id": "43d9a2c7-1575-47b8-8a71-4709c7fa52df", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### The `.apply()` method\n", "So far, `.transform()` and `.agg()` have been powerful methods we can use to pass existing functions, or even custom functions, to our data. There is one more method that allows the passing of functions to a DataFrame in a flexible way - `.apply()`. This method allows you to pass a function to each row or column of a DataFrame, and the result can flexibly be a single number (unlike `.transform()`) or another DataFrame. " ] }, { "cell_type": "code", "execution_count": 1, "id": "13eaa770-ca0a-4cb2-af6d-2392ef6d04b9", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "id": "5386747a-ae45-4155-bdc6-1c4869d72902", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Demonstrate apply\n", "sample_df = pd.DataFrame({'item': ['mirror', 'chair', 'drawers', 'desk'],\n", " 'cost':['£32.1023', '£21.29102022', '£90.21', '£99.1']})\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "483147f1-286d-491a-a849-5d0b4f2deec5", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
itemcostdescribe
0mirror£32.1023mirror costs: £32.1023
1chair£21.29102022chair costs: £21.29102022
2drawers£90.21drawers costs: £90.21
3desk£99.1desk costs: £99.1
\n", "
" ], "text/plain": [ " item cost describe\n", "0 mirror £32.1023 mirror costs: £32.1023\n", "1 chair £21.29102022 chair costs: £21.29102022\n", "2 drawers £90.21 drawers costs: £90.21\n", "3 desk £99.1 desk costs: £99.1" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Define a function to create a nice print out of each row\n", "def df_printer(df_row):\n", " return df_row['item'] + ' costs: ' + df_row['cost']\n", "\n", "# Apply along columns\n", "sample_df['describe'] = sample_df.apply(df_printer, axis=1)\n", "\n", "display(sample_df)" ] }, { "cell_type": "code", "execution_count": 4, "id": "69b96437-cee1-4b59-94d2-8bb81fc9df59", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
itemcostdescribeclean_cost
0mirror£32.1023mirror costs: £32.102332.102
1chair£21.29102022chair costs: £21.2910202221.291
2drawers£90.21drawers costs: £90.2190.210
3desk£99.1desk costs: £99.199.100
\n", "
" ], "text/plain": [ " item cost describe clean_cost\n", "0 mirror £32.1023 mirror costs: £32.1023 32.102\n", "1 chair £21.29102022 chair costs: £21.29102022 21.291\n", "2 drawers £90.21 drawers costs: £90.21 90.210\n", "3 desk £99.1 desk costs: £99.1 99.100" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Clean the cost column up with a lambda\n", "sample_df['clean_cost'] = sample_df['cost'].apply(lambda x: round(float(x.replace('£', '')), 3))\n", "sample_df" ] }, { "cell_type": "markdown", "id": "58f9f130-ecbf-413c-a5e8-3eac5fc4374e", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "A more practical example might help more. How could we correlate each column of a DataFrame with a specific variable? This is a common task - perhaps we want to check how our predictors are correlated with our outcome measure. " ] }, { "cell_type": "code", "execution_count": 5, "id": "1b19708f-c538-47f8-9631-cee24a27461b", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# For this example, use the 'iris' data set available here: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data\n", "iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None,\n", " names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'])" ] }, { "cell_type": "code", "execution_count": 6, "id": "450664f8-9afd-4f27-bfed-65c47034bae7", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pearsonrpvalue
sepal_length1.0000.000
sepal_width-0.1090.183
petal_length0.8720.000
petal_width0.8180.000
speciesNaNNaN
\n", "
" ], "text/plain": [ " pearsonr pvalue\n", "sepal_length 1.000 0.000\n", "sepal_width -0.109 0.183\n", "petal_length 0.872 0.000\n", "petal_width 0.818 0.000\n", "species NaN NaN" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# How to correlate, with p-value, each column with sepal_width?\n", "# import correlation function - more on this in later chapters\n", "from scipy.stats import pearsonr\n", "\n", "def sepal_correlate(df_col):\n", " \n", " # Check type, and just return a pair of NaN if so. \n", " if df_col.dtype == 'O':\n", " return pd.Series([np.nan, np.nan], index=['pearsonr', 'pvalue'])\n", " \n", " # Correlation function returns two value\n", " r, pval = pearsonr(iris['sepal_length'], df_col)\n", " \n", " # Place these into a 'Series' - a single column from a DataFrame\n", " output = pd.Series([r, pval], index=['pearsonr', 'pvalue'])\n", " \n", " return output\n", "\n", "# Apply\n", "corrs = iris.apply(sepal_correlate, axis=0)\n", "\n", "# Display and transpose\n", "display(corrs.T.round(3))" ] }, { "cell_type": "markdown", "id": "bbf8ebbf-a62a-4d22-8021-6b9d38ea82d9", "metadata": {}, "source": [ "There are other ways to achieve this that may be simpler, but the general idea of taking specific columns and applying functions to is an all-round good solution. `pandas` does have a built-in `.corr` method on DataFrames; but if you are interested in *p*-values, it will not return them. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.4" } }, "nbformat": 4, "nbformat_minor": 5 }