5. Custom functions

The last few sessions have seen you use an enormous range of different functions in different modules. Each of these is a piece of reusable code that achieves a specific outcome - one that is needed often enough that it's worth writing a function for.

Sometimes when analysing data, you will need to write your own functions. This is easily achievable in Python, using the special keyword def. Defining a function is like creating a blueprint - a set of commands that can be executed time and time again. Functions can be as simple or as complex as is needed (there is no limit) and so can be used to solve specific data problems you have.

# A very simple function example that works with strings
def simple(input_string):
    
    new_string = 'HELLO ' + input_string.upper()
    
    return new_string

5.1. Function definition breakdown

  • The function is defined with the def keyword.

  • The name of the function follows immediately after.

  • The function inputs are specified within parentheses, followed by a colon. The indented lines beneath do the actual work of the function.

  • The return keyword tells Python that the value immediately following it will be the output of the function.

# Use case
out = simple('there')
print(out)

# The variable 'new_string' exists only inside the function - globals is a built-in function returning
# every name defined at the top level of your Python session, so we are checking if 'new_string' is in there!
'new_string' in globals()
HELLO THERE
False
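To make the point about scope concrete, here is a minimal sketch that reuses the simple function from above and then tries to read new_string from outside it. The try/except wrapper and the message variable are additions for illustration only; they are there so the error can be shown without stopping the session.

```python
# Same function as above: new_string is created inside the function body
def simple(input_string):
    new_string = 'HELLO ' + input_string.upper()
    return new_string

out = simple('there')
print(out)  # HELLO THERE

# Trying to use new_string outside the function raises a NameError,
# because the name only exists while the function is running
message = ''
try:
    print(new_string)
except NameError as err:
    message = str(err)
    print('NameError:', message)
```

The only way to get at the value is through whatever the function returns, which is why out holds the result here.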

Functions can take multiple arguments, allowing for more complex operations, and return more than one output.

# Square the first argument, halve the second
def square_divide(a, b):
    
    first = a**2
    second = b/2
    
    return first, second

# 'Unpack' the outputs like this
i, j = square_divide(10, 20)

print(i, j)
100 10.0

The arguments seen so far are known as positional: each argument takes on the value passed in the matching position of the function call.

There are also keyword arguments, which you have already seen in action. A keyword argument is given a default value in the function definition, so you can leave it out of the call to accept the default, or set it by name to override it:

# Add a keyword argument to the square_divide function to specify the factor
def square_divide(a, b, factor=2):
    
    first = a**factor
    second = b/factor
    
    return first, second

# Default usage - what happens if you don't unpack
x = square_divide(2, 10)
print(x[0], x[1])

# Run with a new scaling factor, specifying with a keyword argument
x, y = square_divide(2, 10, factor=5)
print(x, y)
4 5.0
32 2.0
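As a rough illustration of the difference between passing by position and passing by name, the sketch below calls the same square_divide function in a few ways. The variable names by_position, by_keyword and swapped are made up for this example.

```python
def square_divide(a, b, factor=2):
    first = a**factor
    second = b/factor
    return first, second

# Positional: 2 goes to a, 10 goes to b, factor keeps its default of 2
by_position = square_divide(2, 10)

# Named: the order no longer matters, because each value is matched by name
by_keyword = square_divide(b=10, a=2)

# Swapping positional values changes which parameter receives which value
swapped = square_divide(10, 2)

print(by_position)  # (4, 5.0)
print(by_keyword)   # (4, 5.0)
print(swapped)      # (100, 1.0)
```

Note that when the two outputs are not unpacked, they come back together as a single tuple, which is why the earlier example indexed x[0] and x[1].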

5.2. Lambda functions

Lambda functions are a special kind of function definition used in Python. They are used to define simple functions that you may not need to re-use, but are nonetheless required for specific instances. These are quite abstract tools but are very powerful, especially with DataFrames. Lambda functions can be defined with the lambda keyword, followed by the arguments and actual function work. An example should make things clearer:

# Show use of lambda function - adds 5 to input
lamb = lambda x: x + 5

print(lamb(10))
15

Notice the different convention - everything is done on one line, and it returns only a single output. At this point, it does seem kind of useless…

But consider the example below. The function sorted takes a list and, by default, sorts it from low to high. However, sorted also takes a key argument, which lets you supply a function that picks out the value to sort by. How could we sort a nested list by one particular element of each inner list?

# Define messy list - IDs and scores are messed up
data_list = [[4, 81, '003'], [111, 2, '002'], [2, 87, '001']]

# Run sorted with default
default_sort = sorted(data_list)

print(default_sort)
[[2, 87, '001'], [4, 81, '003'], [111, 2, '002']]

# Now use a lambda function to specify a special key and sort by the string!
lambda_sort = sorted(data_list, key=lambda x:x[-1])

print(lambda_sort)
[[2, 87, '001'], [111, 2, '002'], [4, 81, '003']]

It may be easier to understand what’s going on by writing out the full function definition and using that as the sort key.

# Write out the function definition in full - clunky, but hopefully clarifies
def sort_key(element):
    
    return element[-1]

# Apply 
func_sort = sorted(data_list, key=sort_key)

print(func_sort)
[[2, 87, '001'], [111, 2, '002'], [4, 81, '003']]

Each element of the original list is itself a list. The key function is applied to each of these nested lists in turn (each one becomes x), the final element of x is extracted, and the sorted function then orders the nested lists by those extracted values!
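One way to see what the key function is doing is to apply it by hand and look at the values sorted ends up comparing. This is only a sketch: the names last_item, keys and by_score are invented here, and sorting by the middle element in descending order is an extra variation that is not part of the original example.

```python
data_list = [[4, 81, '003'], [111, 2, '002'], [2, 87, '001']]

# The same key as before, given a name so it can be called directly
last_item = lambda x: x[-1]

# These are the values that sorted compares, one per nested list
keys = [last_item(item) for item in data_list]
print(keys)  # ['003', '002', '001']

# A different key: sort by the middle element, highest first
by_score = sorted(data_list, key=lambda x: x[1], reverse=True)
print(by_score)  # [[2, 87, '001'], [4, 81, '003'], [111, 2, '002']]
```

The original list is never changed by sorted; it returns a new list each time.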

Lambda and normal functions can also be used with DataFrames for powerful effects. Let’s z-score some data in a DataFrame using some functions.

# NumPy and pandas are needed for the examples below
import numpy as np
import pandas as pd

df = pd.DataFrame({'gender':np.random.choice(['Female', 'Male'], size=15),
                   'age': np.random.normal(30, 10, 15), 
                   'RT': np.random.randint(300, 1500, size=(15,))
                  })
display(df.head())
   gender        age    RT
0    Male  44.776511  1052
1  Female  25.440464  1099
2  Female  30.872675   699
3  Female  21.358618  1157
4  Female  23.693729  1384

# Define a function that will take a column and z-score it - value minus mean, divided by SD
def zscorer(df_column):
    
    # Compute average and SD - because we will be passing a DataFrame column, can use its methods!
    avg = df_column.mean()
    sd = df_column.std()
    
    # Subtract mean from each element
    centred = df_column - avg
    
    # Divide each element by sd
    zscored = centred/sd
    
    # Return
    return zscored

Our function can now be applied to the columns using the .transform() method. However, we have to stick to the columns with numeric data, or it will break - the strings ‘Male’ and ‘Female’ in the gender column cannot be averaged or subtracted by the function.

func_zscore_data = df[['age', 'RT']].transform(zscorer)
display(func_zscore_data)
         age        RT
0   1.372054  0.211038
1  -0.502245  0.374535
2   0.024315 -1.016926
3  -0.897910  0.576297
4  -0.671561  1.365951
5   1.879427 -0.853429
6  -0.460162  1.164189
7  -0.376091 -0.314238
8  -0.074778 -2.217061
9  -0.639529 -0.446427
10 -1.933108  0.600647
11  0.119051  0.346706
12  1.328922  0.315398
13 -0.055604 -1.215209
14  0.887218  1.108530

Alternatively, this could be written much more concisely using a lambda function.

# Apply the lambda
lamb_zscore_data = df[['age', 'RT']].transform(lambda x: (x - x.mean())/x.std())

# Do you get the same output?
(lamb_zscore_data == func_zscore_data).all().all()
True
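To check the arithmetic behind a z-score without a DataFrame, here is a small plain-Python sketch on a three-value list. The function name zscore_list is hypothetical, and it relies on the standard library's statistics module rather than pandas. statistics.stdev computes the sample standard deviation (dividing by n - 1), which is also the default behaviour of the pandas .std() method used above.

```python
import statistics

def zscore_list(values):
    # Mean and sample standard deviation of the whole list
    avg = statistics.mean(values)
    sd = statistics.stdev(values)

    # Value minus mean, divided by SD, for every element
    return [(v - avg) / sd for v in values]

# Mean is 2 and sample SD is 1, so the z-scores are easy to verify by hand
print(zscore_list([1, 2, 3]))  # [-1.0, 0.0, 1.0]
```

The list comprehension plays the role that element-wise arithmetic on a DataFrame column plays in zscorer.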

5.3. Flow control - the if-else statement

There is one more important and very powerful set of keywords in Python that you haven’t yet encountered. The if-else statement allows our code to make decisions and take different actions depending on the values of certain variables. This statement is very useful in functions, where different outputs can be returned.

# Demonstrate if with a simple greeting - similar logic sits behind auto-generated emails
surname = 'Jones'
sex = 'male'

if sex == 'female':
    prefix = 'Miss'
else:
    prefix = 'Mr'

salutation = 'Dear ' + prefix + ' ' + surname
print(salutation)
Dear Mr Jones

# Another simple example
val = 10

if val > 10:
    print('Above 10')

# Why no output?

An if statement checks whether a condition is true and carries out the indented code if it is - ‘else’ it runs the code specified in the ‘else’ statement. You don’t have to have an else statement: in the second example above nothing is printed, because 10 > 10 is False and there is no else branch to fall back on.

An if statement can be extended with the elif keyword, which means ‘else-if’ - allowing you to check several conditions in turn.

# Demonstrate elif - conditions are checked from top to bottom,
# and only the first one that is true has its code run
age = 25

if age >= 30:
    print('30 or older')
    
elif age >= 20:
    print('In 20s')
    
else:
    print('Under 20')
In 20s
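Because the checks in an if/elif chain run from top to bottom and stop at the first true condition, later branches can rely on the earlier ones having failed. The sketch below wraps that idea in a function so several ages can be tried at once. The function name age_band and its labels are made up for this illustration.

```python
def age_band(age):
    # Checked first: anything 30 or above stops here
    if age >= 30:
        return '30 or older'
    # Only reached when age is below 30, so there is no need to test that again
    elif age >= 20:
        return 'In 20s'
    # Catches everything that failed both checks above
    else:
        return 'Under 20'

for age in (35, 25, 20, 12):
    print(age, age_band(age))
```

Finishing with else rather than another elif means boundary values such as exactly 20 cannot slip through without matching a branch.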

Another example with a list:

my_list = ['PID', 23, 921]

# Check if an item is in a list, then do something with it
if 'PID' in my_list:
    my_list.remove('PID')
else:
    print('List is clean')
    
print(my_list)
[23, 921]

These statements are very powerful in functions, where different outputs can be returned depending on the inputs. In an earlier example, we had to subset a DataFrame to z-score the data, because we knew the function we applied with .transform() would break on the column containing strings.

# Edit the function to cope with strings, by checking the dtype of the column!
# Define a function that will take a column and z-score it - value minus mean, divided by SD
def zscorer_new(df_column):
    
    # If it's an 'object' column (the dtype pandas has traditionally used for strings)
    if df_column.dtype == 'O':
        return df_column # Give it back unchanged
    
    else:
        # Compute average and SD - because we will be passing a DataFrame column, can use its methods!
        avg = df_column.mean()
        sd = df_column.std()

        # Subtract mean from each element
        centred = df_column - avg

        # Divide each element by sd
        zscored = centred/sd
    
        # Return
        return zscored

# Now transform the DataFrame; no need to subset it
full_zscore = df.transform(zscorer_new)
display(full_zscore.head())
   gender        age        RT
0    Male   1.372054  0.211038
1  Female  -0.502245  0.374535
2  Female   0.024315 -1.016926
3  Female  -0.897910  0.576297
4  Female  -0.671561  1.365951

5.3.1. Close

There’s a lot to take in there once more, but hopefully you can now see the benefits of working with NumPy and Pandas to get the most out of your data.

Try the exercises, and don’t feel you have to know all of this in one go. Mastering the basics of arrays and the general split-apply-combine approach accounts for a large proportion of data analysis.