{
"cells": [
{
"cell_type": "markdown",
"id": "19f57f7d-0699-45e1-bf58-491d1870272e",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Advanced data handling with `pandas`"
]
},
{
"cell_type": "markdown",
"id": "43d9a2c7-1575-47b8-8a71-4709c7fa52df",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### The `.apply()` method\n",
"So far, `.transform()` and `.agg()` have been powerful methods we can use to pass existing functions, or even custom functions, to our data. There is one more method that allows the passing of functions to a DataFrame in a flexible way - `.apply()`. This method allows you to pass a function to each row or column of a DataFrame, and the result can flexibly be a single number (unlike `.transform()`) or another DataFrame. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "13eaa770-ca0a-4cb2-af6d-2392ef6d04b9",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "5386747a-ae45-4155-bdc6-1c4869d72902",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Demonstrate apply\n",
"sample_df = pd.DataFrame({'item': ['mirror', 'chair', 'drawers', 'desk'],\n",
" 'cost':['£32.1023', '£21.29102022', '£90.21', '£99.1']})\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "483147f1-286d-491a-a849-5d0b4f2deec5",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" item | \n",
" cost | \n",
" describe | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" mirror | \n",
" £32.1023 | \n",
" mirror costs: £32.1023 | \n",
"
\n",
" \n",
" 1 | \n",
" chair | \n",
" £21.29102022 | \n",
" chair costs: £21.29102022 | \n",
"
\n",
" \n",
" 2 | \n",
" drawers | \n",
" £90.21 | \n",
" drawers costs: £90.21 | \n",
"
\n",
" \n",
" 3 | \n",
" desk | \n",
" £99.1 | \n",
" desk costs: £99.1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" item cost describe\n",
"0 mirror £32.1023 mirror costs: £32.1023\n",
"1 chair £21.29102022 chair costs: £21.29102022\n",
"2 drawers £90.21 drawers costs: £90.21\n",
"3 desk £99.1 desk costs: £99.1"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Define a function to create a nice print out of each row\n",
"def df_printer(df_row):\n",
" return df_row['item'] + ' costs: ' + df_row['cost']\n",
"\n",
"# Apply along columns\n",
"sample_df['describe'] = sample_df.apply(df_printer, axis=1)\n",
"\n",
"display(sample_df)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "69b96437-cee1-4b59-94d2-8bb81fc9df59",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" item | \n",
" cost | \n",
" describe | \n",
" clean_cost | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" mirror | \n",
" £32.1023 | \n",
" mirror costs: £32.1023 | \n",
" 32.102 | \n",
"
\n",
" \n",
" 1 | \n",
" chair | \n",
" £21.29102022 | \n",
" chair costs: £21.29102022 | \n",
" 21.291 | \n",
"
\n",
" \n",
" 2 | \n",
" drawers | \n",
" £90.21 | \n",
" drawers costs: £90.21 | \n",
" 90.210 | \n",
"
\n",
" \n",
" 3 | \n",
" desk | \n",
" £99.1 | \n",
" desk costs: £99.1 | \n",
" 99.100 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" item cost describe clean_cost\n",
"0 mirror £32.1023 mirror costs: £32.1023 32.102\n",
"1 chair £21.29102022 chair costs: £21.29102022 21.291\n",
"2 drawers £90.21 drawers costs: £90.21 90.210\n",
"3 desk £99.1 desk costs: £99.1 99.100"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Clean the cost column up with a lambda\n",
"sample_df['clean_cost'] = sample_df['cost'].apply(lambda x: round(float(x.replace('£', '')), 3))\n",
"sample_df"
]
},
{
"cell_type": "markdown",
"id": "58f9f130-ecbf-413c-a5e8-3eac5fc4374e",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A more practical example might help more. How could we correlate each column of a DataFrame with a specific variable? This is a common task - perhaps we want to check how our predictors are correlated with our outcome measure. "
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "1b19708f-c538-47f8-9631-cee24a27461b",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# For this example, use the 'iris' data set available here: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data\n",
"iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None,\n",
" names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'])"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "450664f8-9afd-4f27-bfed-65c47034bae7",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pearsonr | \n",
" pvalue | \n",
"
\n",
" \n",
" \n",
" \n",
" sepal_length | \n",
" 1.000 | \n",
" 0.000 | \n",
"
\n",
" \n",
" sepal_width | \n",
" -0.109 | \n",
" 0.183 | \n",
"
\n",
" \n",
" petal_length | \n",
" 0.872 | \n",
" 0.000 | \n",
"
\n",
" \n",
" petal_width | \n",
" 0.818 | \n",
" 0.000 | \n",
"
\n",
" \n",
" species | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" pearsonr pvalue\n",
"sepal_length 1.000 0.000\n",
"sepal_width -0.109 0.183\n",
"petal_length 0.872 0.000\n",
"petal_width 0.818 0.000\n",
"species NaN NaN"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# How to correlate, with p-value, each column with sepal_width?\n",
"# import correlation function - more on this in later chapters\n",
"from scipy.stats import pearsonr\n",
"\n",
"def sepal_correlate(df_col):\n",
" \n",
" # Check type, and just return a pair of NaN if so. \n",
" if df_col.dtype == 'O':\n",
" return pd.Series([np.nan, np.nan], index=['pearsonr', 'pvalue'])\n",
" \n",
" # Correlation function returns two value\n",
" r, pval = pearsonr(iris['sepal_length'], df_col)\n",
" \n",
" # Place these into a 'Series' - a single column from a DataFrame\n",
" output = pd.Series([r, pval], index=['pearsonr', 'pvalue'])\n",
" \n",
" return output\n",
"\n",
"# Apply\n",
"corrs = iris.apply(sepal_correlate, axis=0)\n",
"\n",
"# Display and transpose\n",
"display(corrs.T.round(3))"
]
},
{
"cell_type": "markdown",
"id": "bbf8ebbf-a62a-4d22-8021-6b9d38ea82d9",
"metadata": {},
"source": [
"There are other ways to achieve this that may be simpler, but the general idea of taking specific columns and applying functions to is an all-round good solution. `pandas` does have a built-in `.corr` method on DataFrames; but if you are interested in *p*-values, it will not return them. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}