Advanced data handling with pandas
1.1. The .apply() method

So far, .transform() and .agg() have been powerful methods we can use to pass existing functions, or even custom functions, to our data. There is one more method that allows the passing of functions to a DataFrame in a flexible way - .apply(). This method allows you to pass a function to each row or column of a DataFrame, and the result can flexibly be a single number (unlike .transform()) or another DataFrame.
import pandas as pd
import numpy as np

# Demonstrate apply
sample_df = pd.DataFrame({'item': ['mirror', 'chair', 'drawers', 'desk'],
                          'cost': ['£32.1023', '£21.29102022', '£90.21', '£99.1']})

# Define a function to create a nice print out of each row
def df_printer(df_row):
    return df_row['item'] + ' costs: ' + df_row['cost']

# Apply row-wise - axis=1 passes each row to the function
sample_df['describe'] = sample_df.apply(df_printer, axis=1)
display(sample_df)
|   | item    | cost         | describe                  |
|---|---------|--------------|---------------------------|
| 0 | mirror  | £32.1023     | mirror costs: £32.1023    |
| 1 | chair   | £21.29102022 | chair costs: £21.29102022 |
| 2 | drawers | £90.21       | drawers costs: £90.21     |
| 3 | desk    | £99.1        | desk costs: £99.1         |
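The row-wise example above returns a string per row. For contrast, a reducing function applied with the default axis=0 collapses each column down to a single value - this is the "single number" case mentioned earlier. A minimal sketch, using made-up numbers rather than the furniture data:

```python
import pandas as pd

# Hypothetical numeric data for illustration
demo = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 (the default) passes each column to the function;
# a reducing function like this range yields one number per column
ranges = demo.apply(lambda col: col.max() - col.min(), axis=0)
print(ranges)  # a: 2, b: 20
```

The result is a Series indexed by column name, one entry per column.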
# Clean the cost column up with a lambda
sample_df['clean_cost'] = sample_df['cost'].apply(lambda x: round(float(x.replace('£', '')), 3))
sample_df
|   | item    | cost         | describe                  | clean_cost |
|---|---------|--------------|---------------------------|------------|
| 0 | mirror  | £32.1023     | mirror costs: £32.1023    | 32.102     |
| 1 | chair   | £21.29102022 | chair costs: £21.29102022 | 21.291     |
| 2 | drawers | £90.21       | drawers costs: £90.21     | 90.210     |
| 3 | desk    | £99.1        | desk costs: £99.1         | 99.100     |
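Worth noting: the same clean-up can be done without .apply() at all, using pandas' vectorised string methods, which avoid a Python-level call per row. A sketch, recreating the sample data from above:

```python
import pandas as pd

# Recreate the sample data used above
sample_df = pd.DataFrame({'item': ['mirror', 'chair', 'drawers', 'desk'],
                          'cost': ['£32.1023', '£21.29102022', '£90.21', '£99.1']})

# Strip the currency symbol, convert to float, and round - all column-at-once
sample_df['clean_cost'] = (sample_df['cost']
                           .str.replace('£', '', regex=False)
                           .astype(float)
                           .round(3))
```

For small tables the difference is negligible, but vectorised operations scale much better than row-wise .apply().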
A more practical example may make this clearer. How could we correlate each column of a DataFrame with a specific variable? This is a common task - perhaps we want to check how our predictors are correlated with our outcome measure.
# For this example, use the 'iris' data set available here: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None,
names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'])
# How to correlate, with p-value, each column with sepal_length?
# import correlation function - more on this in later chapters
from scipy.stats import pearsonr

def sepal_correlate(df_col):
    # If the column is not numeric (dtype 'O', i.e. object), return a pair of NaNs
    if df_col.dtype == 'O':
        return pd.Series([np.nan, np.nan], index=['pearsonr', 'pvalue'])
    # Correlation function returns two values
    r, pval = pearsonr(iris['sepal_length'], df_col)
    # Place these into a 'Series' - a single column from a DataFrame
    output = pd.Series([r, pval], index=['pearsonr', 'pvalue'])
    return output

# Apply column-wise - axis=0 passes each column to the function
corrs = iris.apply(sepal_correlate, axis=0)
# Display and transpose
display(corrs.T.round(3))
|              | pearsonr | pvalue |
|--------------|----------|--------|
| sepal_length | 1.000    | 0.000  |
| sepal_width  | -0.109   | 0.183  |
| petal_length | 0.872    | 0.000  |
| petal_width  | 0.818    | 0.000  |
| species      | NaN      | NaN    |
There are other ways to achieve this that may be simpler, but the general idea of taking specific columns and applying functions to them is an all-round good solution. pandas does have a built-in .corr() method on DataFrames, but if you are interested in p-values, it will not return them.
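A quick sketch of that built-in, with made-up numbers: .corr() returns the full pairwise correlation matrix in one call, but only the r values.

```python
import pandas as pd

# Hypothetical numeric data for illustration
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [2.0, 4.1, 5.9, 8.2],
                   'c': [4.0, 3.0, 2.0, 1.0]})

# Pairwise Pearson correlations for every pair of columns - no p-values
r_matrix = df.corr()
print(r_matrix.round(3))
```

To get significance values you would still reach for something like scipy.stats.pearsonr, as in the .apply() example above.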