Contents

Boolean operations with Pandas

Contents

3. Boolean operations with Pandas#

Like NumPy, you can subset and select values from a DataFrame using Boolean values - they work just the same as in NumPy by comparing values in your DataFrame to your chosen quantity.

# Demonstrate Boolean subsets. Split the data into two - 'am' specifies whether a car is automatic or manual
auto = mtcars['am'] == 1

display(mtcars[auto].head())
display(mtcars[~auto].head())

	model	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
0	Mazda RX4	21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
1	Mazda RX4 Wag	21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4
2	Datsun 710	22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1
17	Fiat 128	32.4	4	78.7	66	4.08	2.200	19.47	1	1	4	1
18	Honda Civic	30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2

	model	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
3	Hornet 4 Drive	21.4	6	258.0	110	3.08	3.215	19.44	1	0	3	1
4	Hornet Sportabout	18.7	8	360.0	175	3.15	3.440	17.02	0	0	3	2
5	Valiant	18.1	6	225.0	105	2.76	3.460	20.22	1	0	3	1
6	Duster 360	14.3	8	360.0	245	3.21	3.570	15.84	0	0	3	4
7	Merc 240D	24.4	4	146.7	62	3.69	3.190	20.00	1	0	4	2

# Or MPG between certain values, only for cars with 5 gears!
miles_per = ((mtcars['mpg'] < 17.0) | (mtcars['mpg'] > 32.0)) & (mtcars['gear'] == 5)

display(mtcars[miles_per])

	model	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
28	Ford Pantera L	15.8	8	351.0	264	4.22	3.17	14.5	0	1	5	4
30	Maserati Bora	15.0	8	301.0	335	3.54	3.57	14.6	0	1	5	8

3.1. Cleaning data in Pandas#

As with NumPy arrays, there are special ways to remove data with NaN’s. Pandas has its own .isnull() method which can be chained with other methods to help clean up a dataset.

# First add NaN to rows
inds = np.random.choice([True, False], p=[.1, .9], size=mtcars.shape)

# Copy our original array so we don't overwrite it!
nan_cars = mtcars.copy()

# Add NaN
nan_cars[inds] = np.NaN

display(nan_cars.head())

	model	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
0	Mazda RX4	21.0	6.0	160.0	110.0	3.90	2.620	16.46	0.0	1.0	4.0	NaN
1	Mazda RX4 Wag	21.0	6.0	160.0	110.0	3.90	2.875	17.02	0.0	1.0	4.0	4.0
2	Datsun 710	22.8	4.0	108.0	93.0	3.85	2.320	18.61	NaN	1.0	4.0	1.0
3	Hornet 4 Drive	21.4	6.0	258.0	110.0	3.08	3.215	19.44	1.0	0.0	3.0	1.0
4	Hornet Sportabout	18.7	8.0	360.0	175.0	3.15	NaN	NaN	0.0	0.0	3.0	2.0

# Use .isnull() to find problem participants with .any()
problems = nan_cars.isnull().any(axis=1)

# Subset!
display(nan_cars[~problems])

	model	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
1	Mazda RX4 Wag	21.0	6.0	160.0	110.0	3.90	2.875	17.02	0.0	1.0	4.0	4.0
3	Hornet 4 Drive	21.4	6.0	258.0	110.0	3.08	3.215	19.44	1.0	0.0	3.0	1.0
6	Duster 360	14.3	8.0	360.0	245.0	3.21	3.570	15.84	0.0	0.0	3.0	4.0
8	Merc 230	22.8	4.0	140.8	95.0	3.92	3.150	22.90	1.0	0.0	4.0	2.0
16	Chrysler Imperial	14.7	8.0	440.0	230.0	3.23	5.345	17.42	0.0	0.0	3.0	4.0
20	Toyota Corona	21.5	4.0	120.1	97.0	3.70	2.465	20.01	1.0	0.0	3.0	1.0
24	Pontiac Firebird	19.2	8.0	400.0	175.0	3.08	3.845	17.05	0.0	0.0	3.0	2.0
25	Fiat X1-9	27.3	4.0	79.0	66.0	4.08	1.935	18.90	1.0	1.0	4.0	1.0
26	Porsche 914-2	26.0	4.0	120.3	91.0	4.43	2.140	16.70	0.0	1.0	5.0	2.0

3.2. Imputing values and summary statistics#

Dropping data is great, but most of the time you don’t want to do that. Psychological data can be hard to come by and you want to make the most of it. There is a solution - replacing missing values with some educated guesses. An example of this might be the mean or median.

In NumPy, computing summary statistics across vectors with missing data (np.NaN) will always result in nan - the missing value will cause problems. But in Pandas, calling .mean(), .median(), or .std() will ignore missing values and compute the statistic on the valid data.

And once you have those, you can replace the missing values with them. This is made very simple with Pandas, because of the .fillna() method. This method accepts either a constant value, or a vector of scores that match the axis of interest (e.g., same number of rows or columns) to fill the gaps with.

# Demonstrate fillna with a constant
display(nan_cars.fillna(9999).tail())

	model	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
27	Lotus Europa	30.4	4.0	9999.0	113.0	3.77	1.513	16.9	1.0	9999.0	5.0	2.0
28	9999	15.8	8.0	351.0	264.0	4.22	3.170	14.5	0.0	1.0	5.0	4.0
29	Ferrari Dino	19.7	6.0	9999.0	175.0	3.62	2.770	15.5	0.0	1.0	5.0	6.0
30	Maserati Bora	15.0	8.0	301.0	335.0	3.54	3.570	9999.0	0.0	1.0	5.0	8.0
31	Volvo 142E	21.4	4.0	121.0	9999.0	4.11	2.780	18.6	1.0	1.0	4.0	2.0

# Compute means PER VARIABLE (across axis 0)
mean_scores = nan_cars.mean(axis=0) # altrenatively use `axis='rows'`

# Now fill
fixed = nan_cars.fillna(mean_scores)

display(fixed.tail())
display(mean_scores.T)

/var/folders/mw/xt4ddf0j2n3dr4qcr__qhqlr0000gn/T/ipykernel_2940/463963669.py:2: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  mean_scores = nan_cars.mean(axis=0) # altrenatively use `axis='rows'`

	model	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
27	Lotus Europa	30.4	4.0	239.271429	113.000000	3.77	1.513	16.900000	1.0	0.4	5.0	2.0
28	NaN	15.8	8.0	351.000000	264.000000	4.22	3.170	14.500000	0.0	1.0	5.0	4.0
29	Ferrari Dino	19.7	6.0	239.271429	175.000000	3.62	2.770	15.500000	0.0	1.0	5.0	6.0
30	Maserati Bora	15.0	8.0	301.000000	335.000000	3.54	3.570	18.004286	0.0	1.0	5.0	8.0
31	Volvo 142E	21.4	4.0	121.000000	149.642857	4.11	2.780	18.600000	1.0	1.0	4.0	2.0

mpg      20.213333
cyl       6.068966
disp    239.271429
hp      149.642857
drat      3.633667
wt        3.235759
qsec     18.004286
vs        0.407407
am        0.400000
gear      3.689655
carb      2.733333
dtype: float64