Advanced data handling with pandas
1.1. The .apply() method

So far, .transform() and .agg() have been powerful methods we can use to pass existing functions, or even custom functions, to our data. There is one more method that allows the passing of functions to a DataFrame in a flexible way - .apply(). This method allows you to pass a function to each row or column of a DataFrame, and the result can flexibly be a single number (unlike .transform()) or another DataFrame.
import pandas as pd
import numpy as np

# Demonstrate apply
sample_df = pd.DataFrame({'item': ['mirror', 'chair', 'drawers', 'desk'],
                          'cost': ['£32.1023', '£21.29102022', '£90.21', '£99.1']})

# Define a function to create a nice print out of each row
def df_printer(df_row):
    return df_row['item'] + ' costs: ' + df_row['cost']

# Apply row-wise - axis=1 passes each row to the function
sample_df['describe'] = sample_df.apply(df_printer, axis=1)
display(sample_df)
|   | item    | cost         | describe                  |
|---|---------|--------------|---------------------------|
| 0 | mirror  | £32.1023     | mirror costs: £32.1023    |
| 1 | chair   | £21.29102022 | chair costs: £21.29102022 |
| 2 | drawers | £90.21       | drawers costs: £90.21     |
| 3 | desk    | £99.1        | desk costs: £99.1         |
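The row-wise example above returns a string per row. For contrast, a reducing function applied with the default axis=0 collapses each column down to a single value - this is the "single number" case mentioned earlier. A minimal sketch, using made-up numbers rather than the furniture data:

```python
import pandas as pd

# Hypothetical numeric data for illustration
demo = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 (the default) passes each column to the function;
# a reducing function like this range yields one number per column
ranges = demo.apply(lambda col: col.max() - col.min(), axis=0)
print(ranges)  # a: 2, b: 20
```

The result is a Series indexed by column name, one entry per column.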
# Clean the cost column up with a lambda
sample_df['clean_cost'] = sample_df['cost'].apply(lambda x: round(float(x.replace('£', '')), 3))
sample_df
|   | item    | cost         | describe                  | clean_cost |
|---|---------|--------------|---------------------------|------------|
| 0 | mirror  | £32.1023     | mirror costs: £32.1023    | 32.102     |
| 1 | chair   | £21.29102022 | chair costs: £21.29102022 | 21.291     |
| 2 | drawers | £90.21       | drawers costs: £90.21     | 90.210     |
| 3 | desk    | £99.1        | desk costs: £99.1         | 99.100     |
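Worth noting: the same clean-up can be done without .apply() at all, using pandas' vectorised string methods, which avoid a Python-level call per row. A sketch, recreating the sample data from above:

```python
import pandas as pd

# Recreate the sample data used above
sample_df = pd.DataFrame({'item': ['mirror', 'chair', 'drawers', 'desk'],
                          'cost': ['£32.1023', '£21.29102022', '£90.21', '£99.1']})

# Strip the currency symbol, convert to float, and round - all column-at-once
sample_df['clean_cost'] = (sample_df['cost']
                           .str.replace('£', '', regex=False)
                           .astype(float)
                           .round(3))
```

For small tables the difference is negligible, but vectorised operations scale much better than row-wise .apply().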
A more practical example may make this clearer. How could we correlate each column of a DataFrame with a specific variable? This is a common task - perhaps we want to check how our predictors are correlated with our outcome measure.
# For this example, use the 'iris' data set available here: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None,
names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'])
# How to correlate, with p-value, each column with sepal_length?
# import correlation function - more on this in later chapters
from scipy.stats import pearsonr

def sepal_correlate(df_col):
    # If the column is not numeric (dtype 'O', i.e. object), return a pair of NaNs
    if df_col.dtype == 'O':
        return pd.Series([np.nan, np.nan], index=['pearsonr', 'pvalue'])
    # Correlation function returns two values
    r, pval = pearsonr(iris['sepal_length'], df_col)
    # Place these into a 'Series' - a single column from a DataFrame
    output = pd.Series([r, pval], index=['pearsonr', 'pvalue'])
    return output

# Apply column-wise - axis=0 passes each column to the function
corrs = iris.apply(sepal_correlate, axis=0)
# Display and transpose
display(corrs.T.round(3))
|              | pearsonr | pvalue |
|--------------|----------|--------|
| sepal_length | 1.000    | 0.000  |
| sepal_width  | -0.109   | 0.183  |
| petal_length | 0.872    | 0.000  |
| petal_width  | 0.818    | 0.000  |
| species      | NaN      | NaN    |
There are other ways to achieve this that may be simpler, but the general idea of taking specific columns and applying functions to them is an all-round good solution. pandas does have a built-in .corr() method on DataFrames, but if you are interested in p-values, it will not return them.
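A quick sketch of that built-in, with made-up numbers: .corr() returns the full pairwise correlation matrix in one call, but only the r values.

```python
import pandas as pd

# Hypothetical numeric data for illustration
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [2.0, 4.1, 5.9, 8.2],
                   'c': [4.0, 3.0, 2.0, 1.0]})

# Pairwise Pearson correlations for every pair of columns - no p-values
r_matrix = df.corr()
print(r_matrix.round(3))
```

To get significance values you would still reach for something like scipy.stats.pearsonr, as in the .apply() example above.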