3. Boolean operations with Pandas#

Like NumPy, you can subset and select values from a DataFrame using Boolean values - they work just the same as in NumPy by comparing values in your DataFrame to your chosen quantity.

# Demonstrate Boolean subsets. Split the data into two - 'am' specifies whether a car is automatic or manual
auto = mtcars['am'] == 1

display(mtcars[auto].head())
display(mtcars[~auto].head())
model mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
17 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
18 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
model mpg cyl disp hp drat wt qsec vs am gear carb
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
5 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
6 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
7 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# Or MPG between certain values, only for cars with 5 gears!
miles_per = ((mtcars['mpg'] < 17.0) | (mtcars['mpg'] > 32.0)) & (mtcars['gear'] == 5)

display(mtcars[miles_per])
model mpg cyl disp hp drat wt qsec vs am gear carb
28 Ford Pantera L 15.8 8 351.0 264 4.22 3.17 14.5 0 1 5 4
30 Maserati Bora 15.0 8 301.0 335 3.54 3.57 14.6 0 1 5 8

3.1. Cleaning data in Pandas#

As with NumPy arrays, there are special ways to remove data with NaN’s. Pandas has its own .isnull() method which can be chained with other methods to help clean up a dataset.

# First add NaN to rows
inds = np.random.choice([True, False], p=[.1, .9], size=mtcars.shape)

# Copy our original array so we don't overwrite it!
nan_cars = mtcars.copy()

# Add NaN
nan_cars[inds] = np.NaN

display(nan_cars.head())
model mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6.0 160.0 110.0 3.90 2.620 16.46 0.0 1.0 4.0 NaN
1 Mazda RX4 Wag 21.0 6.0 160.0 110.0 3.90 2.875 17.02 0.0 1.0 4.0 4.0
2 Datsun 710 22.8 4.0 108.0 93.0 3.85 2.320 18.61 NaN 1.0 4.0 1.0
3 Hornet 4 Drive 21.4 6.0 258.0 110.0 3.08 3.215 19.44 1.0 0.0 3.0 1.0
4 Hornet Sportabout 18.7 8.0 360.0 175.0 3.15 NaN NaN 0.0 0.0 3.0 2.0
# Use .isnull() to find problem participants with .any()
problems = nan_cars.isnull().any(axis=1)

# Subset!
display(nan_cars[~problems])
model mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 Wag 21.0 6.0 160.0 110.0 3.90 2.875 17.02 0.0 1.0 4.0 4.0
3 Hornet 4 Drive 21.4 6.0 258.0 110.0 3.08 3.215 19.44 1.0 0.0 3.0 1.0
6 Duster 360 14.3 8.0 360.0 245.0 3.21 3.570 15.84 0.0 0.0 3.0 4.0
8 Merc 230 22.8 4.0 140.8 95.0 3.92 3.150 22.90 1.0 0.0 4.0 2.0
16 Chrysler Imperial 14.7 8.0 440.0 230.0 3.23 5.345 17.42 0.0 0.0 3.0 4.0
20 Toyota Corona 21.5 4.0 120.1 97.0 3.70 2.465 20.01 1.0 0.0 3.0 1.0
24 Pontiac Firebird 19.2 8.0 400.0 175.0 3.08 3.845 17.05 0.0 0.0 3.0 2.0
25 Fiat X1-9 27.3 4.0 79.0 66.0 4.08 1.935 18.90 1.0 1.0 4.0 1.0
26 Porsche 914-2 26.0 4.0 120.3 91.0 4.43 2.140 16.70 0.0 1.0 5.0 2.0

3.2. Imputing values and summary statistics#

Dropping data is great, but most of the time you don’t want to do that. Psychological data can be hard to come by and you want to make the most of it. There is a solution - replacing missing values with some educated guesses. An example of this might be the mean or median.

In NumPy, computing summary statistics across vectors with missing data (np.NaN) will always result in nan - the missing value will cause problems. But in Pandas, calling .mean(), .median(), or .std() will ignore missing values and compute the statistic on the valid data.

And once you have those, you can replace the missing values with them. This is made very simple with Pandas, because of the .fillna() method. This method accepts either a constant value, or a vector of scores that match the axis of interest (e.g., same number of rows or columns) to fill the gaps with.

# Demonstrate fillna with a constant
display(nan_cars.fillna(9999).tail())
model mpg cyl disp hp drat wt qsec vs am gear carb
27 Lotus Europa 30.4 4.0 9999.0 113.0 3.77 1.513 16.9 1.0 9999.0 5.0 2.0
28 9999 15.8 8.0 351.0 264.0 4.22 3.170 14.5 0.0 1.0 5.0 4.0
29 Ferrari Dino 19.7 6.0 9999.0 175.0 3.62 2.770 15.5 0.0 1.0 5.0 6.0
30 Maserati Bora 15.0 8.0 301.0 335.0 3.54 3.570 9999.0 0.0 1.0 5.0 8.0
31 Volvo 142E 21.4 4.0 121.0 9999.0 4.11 2.780 18.6 1.0 1.0 4.0 2.0
# Compute means PER VARIABLE (across axis 0)
mean_scores = nan_cars.mean(axis=0) # altrenatively use `axis='rows'`

# Now fill
fixed = nan_cars.fillna(mean_scores)

display(fixed.tail())
display(mean_scores.T)
/var/folders/mw/xt4ddf0j2n3dr4qcr__qhqlr0000gn/T/ipykernel_2940/463963669.py:2: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  mean_scores = nan_cars.mean(axis=0) # altrenatively use `axis='rows'`
model mpg cyl disp hp drat wt qsec vs am gear carb
27 Lotus Europa 30.4 4.0 239.271429 113.000000 3.77 1.513 16.900000 1.0 0.4 5.0 2.0
28 NaN 15.8 8.0 351.000000 264.000000 4.22 3.170 14.500000 0.0 1.0 5.0 4.0
29 Ferrari Dino 19.7 6.0 239.271429 175.000000 3.62 2.770 15.500000 0.0 1.0 5.0 6.0
30 Maserati Bora 15.0 8.0 301.000000 335.000000 3.54 3.570 18.004286 0.0 1.0 5.0 8.0
31 Volvo 142E 21.4 4.0 121.000000 149.642857 4.11 2.780 18.600000 1.0 1.0 4.0 2.0
mpg      20.213333
cyl       6.068966
disp    239.271429
hp      149.642857
drat      3.633667
wt        3.235759
qsec     18.004286
vs        0.407407
am        0.400000
gear      3.689655
carb      2.733333
dtype: float64