2. The numpy module

While dictionaries and lists will get you places, they are not built for serious data manipulation and number crunching. NumPy’s main feature is a special data type, called the array. This is a data collection built for holding data of a single data type (e.g. integers, floats) in multiple dimensions, and is capable of performing mathematical operations on them quickly and efficiently.

2.1. np.array

To make a NumPy array, we pass a list of numbers to the np.array function - that is, the array function in the NumPy module, which we import under the short name np.

# Import
import numpy as np

# Make a small array
arr = np.array([3, 4, 8, 9, 10])

print(arr)
print(type(arr))
[ 3  4  8  9 10]
<class 'numpy.ndarray'>
np.random.seed(42)
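
The 'single data type' rule from the introduction can be checked with the .dtype attribute. The sketch below is a minimal illustration; the variable names ints and mixed are made up for this example.

```python
import numpy as np

# Integers only: NumPy picks an integer data type
ints = np.array([3, 4, 8])

# Mixing integers and floats: every element is stored as a float
mixed = np.array([3, 4.5, 8])

print(ints.dtype)   # an integer type; the exact width depends on your platform
print(mixed.dtype)  # float64
print(mixed)
```

Because an array holds one data type only, NumPy converts the integers to floats rather than storing a mixture.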

2.2. Mathematical operations with NumPy

With lists, if we wanted to multiply each element, we’d need a for loop, and a way to make sure the data was actually a number of some kind. Carrying out operations on a NumPy array is much simpler: an operation applied to the array is applied to every element. If you have two arrays with the same dimensions, the operation is carried out element by element, pairing each element with the one at the same position in the other array.

# Maths with NumPy #
# Add 5 to each element
arr_add5 = 5 + arr

# Multiply by 3
arr_times3 = 3 * arr

print(arr, arr_add5, arr_times3)
[ 3  4  8  9 10] [ 8  9 13 14 15] [ 9 12 24 27 30]
# It's even possible to do operations with two arrays of the same size - more on this later
arr1 = np.array([1, 5, 10])
arr2 = np.array([1, 1, 1])

added = arr1 + arr2
print(added)
[ 2  6 11]
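
To make the contrast with lists concrete, here is a small side-by-side sketch. The names values, tripled_list and tripled_arr are made up for this example, and the list comprehension stands in for the for loop mentioned above.

```python
import numpy as np

values = [3, 4, 8, 9, 10]

# With a list, we loop over the elements ourselves
tripled_list = [3 * v for v in values]

# With an array, the operation is applied to every element for us
arr = np.array(values)
tripled_arr = 3 * arr

print(tripled_list)  # [9, 12, 24, 27, 30]
print(tripled_arr)   # [ 9 12 24 27 30]

# Careful: multiplying a *list* by 3 repeats the list instead
print(len(3 * values))  # 15

# Two arrays of the same shape are combined position by position
arr1 = np.array([1, 5, 10])
arr2 = np.array([2, 2, 2])
print(arr1 * arr2)  # [ 2 10 20]
print(arr1 - arr2)  # [-1  3  8]
```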

2.3. Array dimensions - 1D

In NumPy, each array has an associated ‘shape’, which indicates its dimensions - that is, does the array live in 1, 2, or 3 dimensions? NumPy is not limited to three; it can store data in arrays with more dimensions than we can easily picture. Much of the power of data handling comes from understanding these multidimensional arrays. You can find the shape of an array with the .shape attribute. An attribute is accessed like a method but without parentheses: it is not a function call, it just returns some useful information about the array.

# Arrays in 1 dimension
arr_1d = np.array([1, 2, 3, 4, 5])

# Print shape
print(arr_1d.shape)
(5,)

So the above array has a single number in its shape - and that’s how you know it’s only 1-dimensional. It has 5 elements in dimension zero. In this case, it looks a lot like a list.
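
Alongside .shape, a few related checks can confirm what the shape is telling you. This is a small sketch; .ndim and .size are standard array attributes, and len is the usual Python built-in.

```python
import numpy as np

arr_1d = np.array([1, 2, 3, 4, 5])

print(arr_1d.shape)  # (5,)  one entry, so one dimension
print(arr_1d.ndim)   # 1     the number of dimensions
print(arr_1d.size)   # 5     the total number of elements
print(len(arr_1d))   # 5     the length of dimension zero
```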

2.4. Array dimensions - 2D

Think back to data you have analysed in something like SPSS and Excel. This data had a two dimensional structure - not a single row or column of data, but a set of them. Most data in psychology is stored as 2D arrays in some way, so understanding the 2D array is the building block of most data operations.

2D arrays are essentially a series of 1D arrays stacked on top of each other - each ‘row’ must contain the same number of elements. So:

# Define a 2D array like this
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])

print(arr_2d)
print(arr_2d.shape)
[[1 2 3]
 [4 5 6]]
(2, 3)

Now, .shape has two numbers - two dimensions. Dimension zero has two elements - two rows, and dimension one has three elements - three columns. This row, column format is the key to understanding these data structures.

We’ll be sticking mostly to 2D arrays for the remainder of the course, but you can explore arrays with three or more dimensions using NumPy. For example, a colour photograph is seen by Python as a 3D array - the height is the number of rows, the width is the number of columns, and the ‘depth’ represents the colour channels.
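
To get a feel for the photograph example, the sketch below builds a tiny stand-in for an image. The sizes (4 rows, 6 columns, 3 channels) and the variable names are assumptions chosen purely for illustration; no real image is loaded.

```python
import numpy as np

# A made-up 'image': 4 pixels high, 6 pixels wide, 3 colour channels
# np.zeros fills the array with zeros
image = np.zeros((4, 6, 3))

print(image.shape)  # (4, 6, 3)
print(image.ndim)   # 3

# Indexing takes one entry per dimension: row, column, channel
image[0, 0, 0] = 1.0       # set the first channel of the top-left pixel
top_left = image[0, 0, :]  # all channels for the top-left pixel
print(top_left)            # [1. 0. 0.]

# A single channel across the whole image is a 2D array again
first_channel = image[:, :, 0]
print(first_channel.shape)  # (4, 6)
```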

2.5. Slicing and dicing NumPy arrays

When you have data in an array, you will want to access it. You can access data in an array using a similar approach to lists, with a separate entry for each dimension.

# 1D array slicing example is easy
arr_1d = np.array([1, 2, 3, 4, 5, 6])

print(arr_1d[1:5:2])
[2 4]
# 2D slicing examples
arr_2d = np.array([[10, 11, 12, 13], [14, 15, 16, 17], [18, 19, 20, 21]])

# Get the element in the first row and the second column - by passing those indices (row 0, column 1)
print(arr_2d)
print('\n', arr_2d[0, 1])
[[10 11 12 13]
 [14 15 16 17]
 [18 19 20 21]]

 11
# 2D slicing examples
# How to extract a whole row? Or a whole column?
one_row = arr_2d[-1, :]
one_col = arr_2d[:, 2]

print(one_row, one_col)
print(one_row.shape, one_col.shape)
[18 19 20 21] [12 16 20]
(4,) (3,)
print(arr_2d)
[[10 11 12 13]
 [14 15 16 17]
 [18 19 20 21]]
# 2D slicing examples
# Select specific sets of numbers from different rows and columns
subset = arr_2d[[0, 1], [1, 0]]
print(subset)
[11 14]
# 2D slicing examples
# Strides are allowed
strided = arr_2d[:, :3:2]
print(strided)
[[10 12]
 [14 16]
 [18 20]]
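
One point that is easy to miss in the examples above: passing two lists of indices behaves differently from passing two slices. The sketch below contrasts the two on the same array; the names paired and block are made up for this example.

```python
import numpy as np

arr_2d = np.array([[10, 11, 12, 13], [14, 15, 16, 17], [18, 19, 20, 21]])

# Two lists of indices are paired up: (row 0, col 1) and (row 1, col 0)
paired = arr_2d[[0, 1], [1, 0]]
print(paired)        # [11 14]
print(paired.shape)  # (2,)

# Two slices select a block: rows 0-1 crossed with columns 0-1
block = arr_2d[0:2, 0:2]
print(block)
print(block.shape)   # (2, 2)
```

So lists pick out individual elements one pair at a time, while slices keep the row, column structure.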

2.6. Merging and shaping data arrays

Sometimes, your data doesn’t come in a neat package, and you need to stitch things together - perhaps an extra participant’s data was collected later, or a new column needs to be added to each data file. Like lists, arrays are mutable data types, so the values they hold can be changed. There is also a set of NumPy functions that allow you to merge arrays together into a new array, as long as their sizes match along every dimension except the one you are joining on (to add columns, for example, the arrays need the same number of rows). Remember, arrays must be square/rectangular!

Other times, you will need to alter the structure or shape of your data to suit some kind of purpose. That’s handled efficiently by NumPy too.

# Make some 1D arrays
arr_one = np.array(range(0,10))
arr_two = np.array(range(11,21))

print(arr_one.shape, arr_two.shape)
(10,) (10,)
# Put these next to each other as columns using 'column_stack'!
stacked = np.column_stack((arr_one, arr_two))
print(stacked, stacked.shape)
[[ 0 11]
 [ 1 12]
 [ 2 13]
 [ 3 14]
 [ 4 15]
 [ 5 16]
 [ 6 17]
 [ 7 18]
 [ 8 19]
 [ 9 20]] (10, 2)
# Once they are joined by column_stack, we can put arr_one on *top* of arr_two with a transpose, using the .T attribute!
transposed = stacked.T
print(transposed, transposed.shape)
[[ 0  1  2  3  4  5  6  7  8  9]
 [11 12 13 14 15 16 17 18 19 20]] (2, 10)
# Stacking 2D arrays is easy as long as dimensions match!
doubled_up = np.column_stack((stacked, stacked[::-1, ::-1]))
print(doubled_up, doubled_up.shape)
[[ 0 11 20  9]
 [ 1 12 19  8]
 [ 2 13 18  7]
 [ 3 14 17  6]
 [ 4 15 16  5]
 [ 5 16 15  4]
 [ 6 17 14  3]
 [ 7 18 13  2]
 [ 8 19 12  1]
 [ 9 20 11  0]] (10, 4)
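
The two scenarios mentioned at the start of this section, a late participant and a new column, might look like the sketch below. The scores are made-up numbers, and np.vstack is a NumPy function not otherwise used on this page: it stacks arrays on top of each other as rows.

```python
import numpy as np

# Hypothetical scores: 3 participants (rows) by 2 measures (columns)
scores = np.array([[10, 11], [12, 13], [14, 15]])

# A late participant is one extra row; it needs 2 values to match the columns
late_participant = np.array([16, 17])
with_row = np.vstack((scores, late_participant))
print(with_row.shape)  # (4, 2)

# A new measure is one extra column; it needs 3 values to match the rows
new_measure = np.array([1, 2, 3])
with_col = np.column_stack((scores, new_measure))
print(with_col.shape)  # (3, 3)

# The original array is unchanged: both functions return a new array
print(scores.shape)    # (3, 2)

# Sizes that do not match raise a ValueError
try:
    np.column_stack((scores, np.array([1, 2])))
except ValueError as err:
    print('Could not join:', err)
```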

2.7. np.column_stack

np.column_stack is a powerful function for joining arrays, but there is one thing about how it works behind the scenes that can really trip you up. NumPy has a general purpose function for joining arrays, called np.concatenate, that lets you join arrays along any axis you desire, with one caveat.

arr_one and arr_two are 1D arrays. Trying to join them as columns with np.concatenate, as we did with column_stack, ends badly:

# Try to concat - axis argument is '1' to join on columns - this will break!
concat_stack = np.concatenate((arr_one, arr_two), axis=1)
---------------------------------------------------------------------------
AxisError                                 Traceback (most recent call last)
Input In [17], in <cell line: 2>()
      1 # Try to concat - axis argument is '1' to join on columns - this will break!
----> 2 concat_stack = np.concatenate((arr_one, arr_two), axis=1)

File <__array_function__ internals>:5, in concatenate(*args, **kwargs)

AxisError: axis 1 is out of bounds for array of dimension 1

An error - axis 1 does not exist for a one dimensional array! np.column_stack silently converted each 1D input into a two dimensional array with a single element in the second dimension, that is, a column of shape (10, 1). How can we coerce the input arrays ourselves? Using the array method .reshape, which returns the array’s data arranged in the shape you give it as arguments.

# Demonstrate reshape
print(arr_one, arr_one.shape)

shaped = arr_one.reshape(1, 10)
print(shaped, shaped.shape)
[0 1 2 3 4 5 6 7 8 9] (10,)
[[0 1 2 3 4 5 6 7 8 9]] (1, 10)
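
Two details of .reshape are worth checking for yourself: the new shape must hold exactly the same number of elements, and the array you call it on keeps its own shape. A small sketch, with two_by_five as a made-up name:

```python
import numpy as np

arr_one = np.array(range(0, 10))

# Any shape works as long as it holds exactly 10 elements
two_by_five = arr_one.reshape(2, 5)
print(two_by_five)
print(two_by_five.shape)  # (2, 5)

# reshape returns a result with the new shape; arr_one keeps its own shape
print(arr_one.shape)      # (10,)

# 3 x 4 = 12 slots cannot hold 10 elements, so this raises a ValueError
try:
    arr_one.reshape(3, 4)
except ValueError as err:
    print('Could not reshape:', err)
```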

So to coerce the more general purpose np.concatenate to work, we need to reshape our arrays. You can get far with np.column_stack and transposing, but if you’re not aware of this caveat, you can land up very confused.

# Replicate column_stack functionality
concat_stack = np.concatenate((arr_one.reshape(10,1), arr_two.reshape(10,1)), axis=1)
print(concat_stack, concat_stack.shape)
[[ 0 11]
 [ 1 12]
 [ 2 13]
 [ 3 14]
 [ 4 15]
 [ 5 16]
 [ 6 17]
 [ 7 18]
 [ 8 19]
 [ 9 20]] (10, 2)
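
Two follow-ups may be useful here. Writing reshape(10, 1) hard-codes the length of the array; passing -1 for a dimension asks NumPy to work that size out for you. And with the default axis of 0, np.concatenate joins 1D arrays end to end without any reshaping. The sketch below shows both, using made-up names for the results.

```python
import numpy as np

arr_one = np.array(range(0, 10))
arr_two = np.array(range(11, 21))

# -1 asks NumPy to work out that dimension from the number of elements
as_columns = (arr_one.reshape(-1, 1), arr_two.reshape(-1, 1))
side_by_side = np.concatenate(as_columns, axis=1)
print(side_by_side.shape)  # (10, 2)

# Same result as np.column_stack
same = np.array_equal(side_by_side, np.column_stack((arr_one, arr_two)))
print(same)                # True

# With the default axis=0, 1D arrays are joined end to end
end_to_end = np.concatenate((arr_one, arr_two))
print(end_to_end.shape)    # (20,)
```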