1. Advanced data handling with pandas#

1.1. The .apply() method#

So far, .transform() and .agg() have been powerful methods for passing existing functions, or even custom functions, over our data. There is one more method that allows functions to be passed to a DataFrame in a flexible way - .apply(). This method passes each row or column of a DataFrame to a function, and the result can flexibly be a single value per row or column (unlike .transform()) or another DataFrame.

import pandas as pd
import numpy as np
# Demonstrate apply
sample_df = pd.DataFrame({'item': ['mirror', 'chair', 'drawers', 'desk'],
                          'cost': ['£32.1023', '£21.29102022', '£90.21', '£99.1']})
# Define a function to create a nice print out of each row
def df_printer(df_row):
    return df_row['item'] + ' costs: ' + df_row['cost']

# Apply the function to each row (axis=1)
sample_df['describe'] = sample_df.apply(df_printer, axis=1)

display(sample_df)
      item          cost                   describe
0   mirror      £32.1023     mirror costs: £32.1023
1    chair  £21.29102022  chair costs: £21.29102022
2  drawers        £90.21      drawers costs: £90.21
3     desk         £99.1          desk costs: £99.1
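To make the axis argument concrete, here is a small sketch using a hypothetical numeric frame (not part of the example above): with axis=0 (the default) the function receives each column as a Series, while axis=1 feeds it each row.

```python
import pandas as pd
import numpy as np

# A small, made-up numeric frame to illustrate the axis argument
num_df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 (default): the function receives each *column* as a Series
col_max = num_df.apply(np.max, axis=0)  # one value per column

# axis=1: the function receives each *row* as a Series
row_max = num_df.apply(np.max, axis=1)  # one value per row
```

Because df_printer above needs values from several columns of the same row, it is applied with axis=1.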
# Clean the cost column up with a lambda
sample_df['clean_cost'] = sample_df['cost'].apply(lambda x: round(float(x.replace('£', '')), 3))
sample_df
      item          cost                   describe  clean_cost
0   mirror      £32.1023     mirror costs: £32.1023      32.102
1    chair  £21.29102022  chair costs: £21.29102022      21.291
2  drawers        £90.21      drawers costs: £90.21      90.210
3     desk         £99.1          desk costs: £99.1      99.100
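As a design note, the same clean-up can be done without .apply() at all, using pandas' vectorised string accessor. This sketch reproduces the clean_cost column from the example above:

```python
import pandas as pd

sample_df = pd.DataFrame({'item': ['mirror', 'chair', 'drawers', 'desk'],
                          'cost': ['£32.1023', '£21.29102022', '£90.21', '£99.1']})

# Strip the currency symbol with the vectorised .str accessor,
# convert to float, then round - no Python-level loop required
sample_df['clean_cost'] = (sample_df['cost']
                           .str.replace('£', '', regex=False)
                           .astype(float)
                           .round(3))
```

Vectorised methods are usually faster than .apply() with a lambda, though .apply() remains the more general tool when no built-in vectorised equivalent exists.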

A more practical example may help. How could we correlate each column of a DataFrame with a specific variable? This is a common task - perhaps we want to check how our predictors are correlated with our outcome measure.

# For this example, use the 'iris' data set available here: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None,
                  names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'])
# How to correlate, with p-value, each column with sepal_length?
# import correlation function - more on this in later chapters
from scipy.stats import pearsonr

def sepal_correlate(df_col):
    
    # If the column is non-numeric (object dtype), return a pair of NaNs
    if df_col.dtype == 'O':
        return pd.Series([np.nan, np.nan], index=['pearsonr', 'pvalue'])
    
    # The correlation function returns two values: r and the p-value
    r, pval = pearsonr(iris['sepal_length'], df_col)
    
    # Place these into a 'Series' - a single column from a DataFrame
    output = pd.Series([r, pval], index=['pearsonr', 'pvalue'])
    
    return output

# Apply
corrs = iris.apply(sepal_correlate, axis=0)

# Display and transpose
display(corrs.T.round(3))
              pearsonr  pvalue
sepal_length     1.000   0.000
sepal_width     -0.109   0.183
petal_length     0.872   0.000
petal_width      0.818   0.000
species            NaN     NaN

There are other ways to achieve this that may be simpler, but the general idea of taking specific columns and applying functions to them is a good all-round solution. pandas does have a built-in .corr() method on DataFrames, but if you are interested in p-values, it will not return them.
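For comparison, here is a sketch of the built-in approach, using a small made-up frame rather than the iris data to keep it self-contained. .corr() returns the full correlation matrix of the numeric columns (the numeric_only argument skips the string column), but no p-values:

```python
import pandas as pd

# A small stand-in frame - the iris data would work the same way
df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                   'y': [2.0, 4.1, 5.9, 8.0],
                   'label': ['a', 'b', 'a', 'b']})

# Correlation matrix of the numeric columns; no p-values are returned
r_matrix = df.corr(numeric_only=True)

# Correlations of every numeric column with 'x'
corr_with_x = r_matrix['x']
```

If p-values matter, the .apply() approach above (or looping with scipy.stats.pearsonr) is the way to go.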