1. (Frequentist) Statistics in Python#

So far, we’ve dealt with data processing, handling, and visualisation. These steps are sometimes done ‘behind the scenes’ for early career researchers, who may be presented with a final dataset and asked to run statistical tests. But this hides a wealth of decision making and understanding of the data from you. There is no substitute for working with data yourself - and once you’re done with that, it’s time to actually do some statistical inference.

Python is fully capable of running a range of statistical analyses on data. It perhaps lacks some of the nuanced statistical functionality that R has; but this is the price you pay for using the second-best language for everything.

There are two ways of approaching statistics in Python, and the approach will dictate the kinds of packages that are used. But much like our experience of plotting, these two approaches can work directly with one another.

The first approach (and the most low-level) is the use of the scipy and statsmodels packages, which provide a dizzying array of statistical functionality. They are very general, suited to statistics across many domains, so may be confusing for psychologists.

We will start with another package that aims to make traditional frequentist statistical analysis straightforward, pingouin.

  • pingouin does not come with Anaconda, so you will need to install it. If you are working in a Jupyter notebook, in a fresh cell, you can use a magic command to pip install it, like so:

%pip install pingouin

Check out the documentation of the pingouin package.

# Import numpy, pandas, pingouin, matplotlib, and seaborn
import numpy as np
import pandas as pd
import pingouin as pg
import matplotlib.pyplot as plt
import seaborn as sns

1.1. A t-test in Python#

The most basic kind of analysis we can run is a t-test comparing two groups of participants on a single dependent variable. We have used the tips dataset before for data visualisation, and observed some differences - but we have not tested whether those differences are statistically significant. pingouin has a function that allows us to answer this question - pg.ttest().

# First load tips
tips = sns.load_dataset('tips')
display(tips.head())
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

Do male and female customers differ in the amount of tips they give? Conducting a t-test on this is straightforward. We just need to pass the scores of the first group as the first argument to the function, and the scores of the second group as the second.

If we look at tips we can see our data requires subsetting for that to happen, as it is in long format.

# Subset the data and store before passing, using loc
females = tips.loc[tips['sex'] == 'Female', 'tip']
males = tips.loc[tips['sex'] == 'Male', 'tip']

# Conduct t test
pg.ttest(females, males)
                T         dof alternative     p-val         CI95%   cohen-d   BF10     power
T-test  -1.489536  215.707021   two-sided  0.137807  [-0.6, 0.08]  0.185494  0.414  0.282179

pingouin returns output in a DataFrame, including the T value, the p-value, and the degrees of freedom. It also reports an effect size (Cohen’s d), achieved power, and a Bayes Factor assessing the strength of evidence for the alternative hypothesis - more than enough information for reporting your statistical test! Here we see a non-significant effect, with low power.
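Because the output is a DataFrame, individual values are easy to pull out for reporting. A small sketch, reusing the groups defined above:

# Store the result and extract single values for reporting
result = pg.ttest(females, males)
print(result['p-val'].iloc[0], result['cohen-d'].iloc[0])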

1.2. A repeated-measures t-test in Python#

By default, pg.ttest() assumes an independent samples test. A repeated measures comparison is achieved by setting the keyword argument paired to True. As an example, we could ask whether for each meal, the cost of the meal was higher than the tip - you would hope that this is the case!

results = pg.ttest(tips['total_bill'], tips['tip'], paired=True)
display(results)
                T  dof alternative         p-val          CI95%   cohen-d       BF10  power
T-test  32.646505  243   two-sided  8.020019e-91  [15.77, 17.8]  2.635205  1.222e+87    1.0

A very large difference, as you would expect.
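Since we have a directional hypothesis here - the bill should be greater than the tip - pg.ttest also accepts an alternative keyword for one-sided tests. A quick sketch:

# One-sided test: is the total bill greater than the tip?
pg.ttest(tips['total_bill'], tips['tip'], paired=True, alternative='greater')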

1.3. Correlations with Python#

Computing a correlation between a pair of variables is very straightforward. Simply pass the two variables to the pg.corr() function. For example, is there a correlation between the total bill and the tip?

corr_results = pg.corr(tips['total_bill'], tips['tip'])
display(corr_results)
           n         r        CI95%         p-val       BF10  power
pearson  244  0.675734  [0.6, 0.74]  6.692471e-34  4.952e+30    1.0

The output again provides a range of information - the n (of which your degrees of freedom are equal to n - 2), the r value itself, a 95% confidence interval around it, the p value, power, and again a Bayes Factor. Squaring r gives you the variance explained.

It’s also worth noting that pingouin supports partial correlations, repeated measures correlations, and has other convenience functions that make testing correlational hypotheses straightforward.
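For example, pg.partial_corr computes a correlation while controlling for other variables. A minimal sketch - the choice of party size as the covariate is just for illustration:

# Correlation between bill and tip, controlling for the size of the party
pg.partial_corr(data=tips, x='total_bill', y='tip', covar='size')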

1.4. One way ANOVA#

Things get more complicated once we move past simple tests and correlations, but thankfully pingouin makes things simple. Remember that a one way ANOVA examines differences between 3 or more groups on a single dependent variable. This is implemented in the pg.anova function.

Using the tips dataset, we can examine whether there are significant differences in the amount of tips received across the four different days the study was conducted (Thursday - Sunday).

# Conduct a one way ANOVA
one_way = pg.anova(data=tips, dv='tip', between='day', detailed=True)
display(one_way)
   Source          SS   DF        MS         F     p-unc       np2
0     day    9.525873    3  3.175291  1.672355  0.173589  0.020476
1  Within  455.686604  240  1.898694       NaN       NaN       NaN

You specify the dataset, the dependent variable, and the between-groups (grouping) factor. The detailed keyword returns more information. The output should look familiar from software like SPSS, including the df, MS, F, and p values, and an effect size of partial eta squared.
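If you suspect the groups have unequal variances, pingouin also offers Welch’s ANOVA, which relaxes the homogeneity assumption. A minimal sketch on the same data:

# Welch's ANOVA does not assume equal variances across groups
pg.welch_anova(data=tips, dv='tip', between='day')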

1.5. Two way ANOVA#

If we want to add an additional factor, that is straightforward - pass a list of columns to the between keyword.

We can examine the amount of tips given as a consequence of the day, and the sex of the customer.

# Conduct the two way ANOVA
two_way = pg.anova(data=tips, dv='tip', between=['day', 'sex'])
display(two_way)
      Source          SS     DF        MS         F     p-unc       np2
0        day    7.446900    3.0  2.482300  1.298061  0.275785  0.016233
1        sex    1.594561    1.0  1.594561  0.833839  0.362097  0.003521
2  day * sex    2.785891    3.0  0.928630  0.485606  0.692600  0.006135
3   Residual  451.306151  236.0  1.912314       NaN       NaN       NaN

1.6. Repeated measures designs - one and two way ANOVAs#

Handling repeated measures in an ANOVA context is dealt with by the pg.rm_anova function.

The bugs dataset, where participants rate how much they want to kill an insect based on how frightening and disgusting they perceive it to be, is a good example of fully repeated measures data:

# Read in bugs from the OSF
bugs = pd.read_csv('https://osf.io/mrhjn/download')
display(bugs.head())
   Subject  Gender  Region Education  Lo D, Lo F  Lo D, Hi F  Hi D, Lo F  Hi D, Hi F
0        1  Female   North      some         6.0         6.0         9.0        10.0
1        2  Female   North   advance        10.0         NaN        10.0        10.0
2        3  Female  Europe   college         5.0        10.0        10.0        10.0
3        4  Female   North   college         6.0         9.0         6.0         9.0
4        5  Female   North      some         3.0         6.5         5.5         8.5

Unfortunately, the data is in the wrong format for analysis - remember that repeated measures data often needs to be in ‘long’ format. The first step is to represent the data correctly.

# Melt the data
bugs_long = bugs.melt(id_vars=['Subject', 'Gender', 'Region', 'Education'],
                      value_vars=['Lo D, Lo F', 'Lo D, Hi F', 'Hi D, Lo F', 'Hi D, Hi F'],
                      var_name='Condition', value_name='Rating')

# Split the condition column into two, so as to have two variables
bugs_long[['Disgust_Level', 'Fright_Level']] = bugs_long['Condition'].str.split(', ', expand=True)

# Finally replace the text values
bugs_long.replace({'Disgust_Level': {'Lo D': 'Low', 'Hi D': 'High'}, 'Fright_Level': {'Lo F': 'Low', 'Hi F': 'High'}}, inplace=True)

display(bugs_long.head(), bugs_long.tail())
   Subject  Gender  Region Education   Condition  Rating Disgust_Level Fright_Level
0        1  Female   North      some  Lo D, Lo F     6.0           Low          Low
1        2  Female   North   advance  Lo D, Lo F    10.0           Low          Low
2        3  Female  Europe   college  Lo D, Lo F     5.0           Low          Low
3        4  Female   North   college  Lo D, Lo F     6.0           Low          Low
4        5  Female   North      some  Lo D, Lo F     3.0           Low          Low

     Subject  Gender  Region Education   Condition  Rating Disgust_Level Fright_Level
367       96    Male   North      high  Hi D, Hi F    10.0          High         High
368       97  Female   North       NaN  Hi D, Hi F    10.0          High         High
369       98  Female   North      some  Hi D, Hi F    10.0          High         High
370       99  Female   North      some  Hi D, Hi F    10.0          High         High
371      100  Female  Europe      some  Hi D, Hi F     3.0          High         High

The first step is a simple one way repeated measures ANOVA examining differences between disgust levels. Note that for a repeated measures ANOVA, a column denoting the participant identifier has to be specified.

rma = pg.rm_anova(data=bugs_long, dv='Rating', within='Disgust_Level', subject='Subject', detailed=True)
display(rma)
          Source          SS  DF         MS          F     p-unc       np2  eps
0  Disgust_Level   27.485215   1  27.485215  12.043878  0.000793  0.115758  1.0
1          Error  209.952285  92   2.282090        NaN       NaN       NaN  NaN

This produces some familiar output - numerator and denominator degrees of freedom, mean square, F and p values, effect sizes, and epsilon, an index of sphericity.
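To see which condition sits higher before interpreting that F value, a quick pandas check of the condition means does the job - a simple sketch using the bugs_long data from above:

# Mean rating at each disgust level - a quick check on the direction of the effect
bugs_long.groupby('Disgust_Level')['Rating'].mean()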

Extending this to the two way case is straightforward - pass a list of factors to the within keyword.

rma2 = pg.rm_anova(data=bugs_long, dv='Rating', within=['Disgust_Level', 'Fright_Level'], subject='Subject')
display(rma2)
                         Source          SS  ddof1  ddof2          MS          F         p-unc     p-GG-corr       np2  eps
0                 Disgust_Level   48.752841      1     87   48.752841  12.175190  7.623808e-04  7.623808e-04  0.122764  1.0
1                  Fright_Level  177.556818      1     87  177.556818  41.629663  6.011447e-09  6.011447e-09  0.323640  1.0
2  Disgust_Level * Fright_Level    6.545455      1     87    6.545455   2.152300  1.459622e-01  1.459622e-01  0.024142  1.0

This function is capable of removing data with missing values automatically, and provides a similar set of outcomes, including a p value corrected for violations of sphericity (p-GG-corr, the Greenhouse-Geisser correction).

1.7. Mixed ANOVA#

Often, researchers will blend both repeated measures and between-groups variables. For example, how do depression scores change between time one and time two (a repeated measure) for people who have received a new drug compared to those who have received a placebo (a between-groups measure)? The pg.mixed_anova() function handles cases where a dataset contains a single repeated measures factor and a single between groups factor.

The exercise dataset used to illustrate seaborn plots in the last chapter has this property - individuals are on a low fat or no fat diet, and their pulse is measured at 1, 15, and 30 minutes. Do these variables, or their interaction, significantly affect heart rate?

# Load dataset straight from the internet!
exercise = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/exercise.csv')
display(exercise.head())
   Unnamed: 0  id     diet  pulse    time  kind
0           0   1  low fat     85   1 min  rest
1           1   1  low fat     85  15 min  rest
2           2   1  low fat     88  30 min  rest
3           3   2  low fat     90   1 min  rest
4           4   2  low fat     92  15 min  rest
# Submit to the mixed ANOVA
mix = pg.mixed_anova(data=exercise, dv='pulse', within='time', between='diet', subject='id')
display(mix)
        Source           SS  DF1  DF2           MS          F     p-unc  p-GG-corr       np2       eps sphericity   W-spher   p-spher
0         diet  1261.877778    1   28  1261.877778   3.147101  0.086939        NaN  0.101040       NaN        NaN       NaN       NaN
1         time  2066.600000    2   56  1033.300000  11.807751  0.000053   0.000312  0.296619  0.746425      False  0.660281  0.002994
2  Interaction   192.822222    2   56    96.411111   1.101711  0.339395        NaN  0.037857       NaN        NaN       NaN       NaN

The results are a merger of the standard ANOVA and the repeated measures ANOVA, providing sphericity measures for the repeated measures factor.
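The sphericity test behind those columns is also available directly via pg.sphericity. A minimal sketch on the repeated measures factor - note that this simple call pools across the diet groups:

# Mauchly's test of sphericity for the within-subjects factor
spher = pg.sphericity(data=exercise, dv='pulse', subject='id', within='time')
print(spher)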

1.8. ANCOVA#

The final ANOVA design worth discussing is the ANCOVA, which allows for the addition of a covariate - a continuous variable whose variance the researcher wants to remove from the model before examining differences between the groups. This is handled by the pg.ancova function.

Currently, the function supports only between group designs, and not repeated measures.

As a simple example, we can compare whether males and females offer different tip amounts after controlling for the total cost of the bill.

# ANCOVA example
pg.ancova(data=tips, dv='tip', between='sex', covar='total_bill')
       Source          SS   DF           F         p-unc       np2
0         sex    0.038803    1    0.036999  8.476290e-01  0.000153
1  total_bill  208.789002    1  199.082735  2.332173e-33  0.452376
2    Residual  252.749941  241         NaN           NaN       NaN

Additional covariates can be specified as a list.
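For example, a quick sketch adding the size of the party as a second covariate alongside the bill:

# Multiple covariates are passed as a list
pg.ancova(data=tips, dv='tip', between='sex', covar=['total_bill', 'size'])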

1.9. Post hoc multiple comparisons#

Now you know how to run almost any kind of ANOVA design in Python, the next step is to explore follow up tests. After all, while an ANOVA tells you there are significant differences, it offers no information on what drives them.

pingouin offers a range of follow up tests to explore your data, but the focus here is on pg.pairwise_ttests, which computes t-tests between all pairs of levels of each factor in your data.

Let’s take a look at a dataset with an interaction. In the exercise dataset, there is an interaction between diet and the kind of exercise:

# Examine interaction in exercise
interaction = pg.anova(data=exercise, dv='pulse', between=['kind', 'diet'], detailed=True)
display(interaction)
        Source           SS  DF           MS          F         p-unc       np2
0         kind  8326.066667   2  4163.033333  37.824471  1.935335e-12  0.473846
1         diet  1261.877778   1  1261.877778  11.465164  1.080516e-03  0.120098
2  kind * diet   815.755556   2   407.877778   3.705894  2.868384e-02  0.081081
3     Residual  9245.200000  84   110.061905        NaN           NaN       NaN

We can see the interaction is significant. How do we follow this up with pingouin to see what is driving it? We use pg.pairwise_ttests like so:

# Follow up interaction using comparisons at all levels
comparisons = pg.pairwise_ttests(data=exercise, dv='pulse', between=['kind', 'diet'])
display(comparisons)
      Contrast     kind        A        B  Paired  Parametric         T   dof alternative         p-unc      BF10    hedges
0         kind        -     rest  running   False        True -6.561162  58.0   two-sided  1.593561e-08  5.95e+05 -1.672084
1         kind        -     rest  walking   False        True -2.674622  58.0   two-sided  9.704606e-03     4.824 -0.681616
2         kind        -  running  walking   False        True  5.183378  58.0   two-sided  2.881236e-06  5136.555  1.320961
3         diet        -  low fat   no fat   False        True -2.457504  88.0   two-sided  1.595161e-02       3.0 -0.513659
4  kind * diet     rest  low fat   no fat   False        True -1.434339  28.0   two-sided  1.625505e-01     0.742 -0.509591
5  kind * diet  running  low fat   no fat   False        True -2.754828  28.0   two-sided  1.020405e-02      5.01 -0.978734
6  kind * diet  walking  low fat   no fat   False        True -1.425097  28.0   two-sided  1.651816e-01     0.735 -0.506308

We can change the effsize and padjust keyword arguments to control the effect size returned and the type of p-value adjustment applied, respectively - when running lots of tests, you sometimes need to adjust your alpha level:

# More flexible use of function
pairs_bonf_cohen = pg.pairwise_ttests(data=exercise, dv='pulse', between=['kind', 'diet'], 
                                      padjust='bonf', effsize='cohen')

display(pairs_bonf_cohen)
      Contrast     kind        A        B  Paired  Parametric         T   dof alternative         p-unc        p-corr p-adjust      BF10     cohen
0         kind        -     rest  running   False        True -6.561162  58.0   two-sided  1.593561e-08  4.780682e-08     bonf  5.95e+05 -1.694085
1         kind        -     rest  walking   False        True -2.674622  58.0   two-sided  9.704606e-03  2.911382e-02     bonf     4.824 -0.690584
2         kind        -  running  walking   False        True  5.183378  58.0   two-sided  2.881236e-06  8.643708e-06     bonf  5136.555  1.338343
3         diet        -  low fat   no fat   False        True -2.457504  88.0   two-sided  1.595161e-02           NaN      NaN       3.0 -0.518087
4  kind * diet     rest  low fat   no fat   False        True -1.434339  28.0   two-sided  1.625505e-01  4.876516e-01     bonf     0.742 -0.523747
5  kind * diet  running  low fat   no fat   False        True -2.754828  28.0   two-sided  1.020405e-02  3.061214e-02     bonf      5.01 -1.005921
6  kind * diet  walking  low fat   no fat   False        True -1.425097  28.0   two-sided  1.651816e-01  4.955448e-01     bonf     0.735 -0.520372

There are lots of additional, flexible ways the function works - it can handle one and two way ANOVAs and mixed ANOVAs, just by specifying which variables are between and which are within (a sketch follows below). Remember to choose the level of adjustment and effect size that suit your needs - but with this function, you can explore interactions in sufficient detail.
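For instance, a sketch of the mixed design case, following up the exercise data with Bonferroni adjustment:

# Pairwise tests for a mixed design - within, between, and subject all specified
pg.pairwise_ttests(data=exercise, dv='pulse', within='time', between='diet',
                   subject='id', padjust='bonf')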

1.10. Linear regression#

The final statistical test that is common in psychology is linear regression, where a continuous outcome variable (such as age or height) is predicted by other continuous variables (e.g. weight). Regression is among the most flexible of all analysis types - in fact, ANOVA is just a special case of regression.

Linear regression is handled by the pg.linear_regression() function. It returns somewhat different output to the ANOVA functions, containing the regression coefficients (how much the outcome changes when the associated predictor increases by one unit), their p-values, and some information about the overall model fit.

In the following example, we attempt to build a linear model to predict the amount of loss suffered by an insurance company in a given car crash, based on the insurance premium of the driver, their alcohol consumption, and their speed at the time of the crash. This data is stored in the car_crashes dataset of seaborn.

# Load dataset
crash = sns.load_dataset('car_crashes')
display(crash.head())
   total  speeding  alcohol  not_distracted  no_previous  ins_premium  ins_losses abbrev
0   18.8     7.332    5.640          18.048       15.040       784.55      145.08     AL
1   18.1     7.421    4.525          16.290       17.014      1053.48      133.93     AK
2   18.6     6.510    5.208          15.624       17.856       899.47      110.35     AZ
3   22.4     4.032    5.824          21.056       21.280       827.34      142.39     AR
4   12.0     4.200    3.360          10.920       10.680       878.41      165.63     CA
# Perform linear regression
lin_reg = pg.linear_regression(crash[['speeding', 'alcohol', 'ins_premium']], crash['ins_losses'])
display(lin_reg)
         names       coef         se         T      pval        r2    adj_r2   CI[2.5%]  CI[97.5%]
0    Intercept  58.321239  17.891849  3.259654  0.002078  0.388636  0.349613  22.327480  94.314997
1     speeding  -0.297846   1.892755 -0.157361  0.875634  0.388636  0.349613  -4.105578   3.509885
2      alcohol   0.142754   2.234772  0.063878  0.949338  0.388636  0.349613  -4.353028   4.638535
3  ins_premium   0.086772   0.016143  5.375042  0.000002  0.388636  0.349613   0.054295   0.119248

Unlike SPSS, pingouin does not return a significance test for the overall model. However, each predictor is returned with an associated coefficient, t value, and p value, and the overall \(R^2\) and adjusted \(R^2\) of the model are included in each row. By default the function adds an intercept to the model via the add_intercept keyword - if you centre all of your variables, outcome included, by subtracting the mean from each (easily achieved in Pandas!), you can set this to False.
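A sketch of that workflow on the crash data - the centred variable names here are just for illustration:

# Centre the predictors and the outcome by subtracting their means
predictors = crash[['speeding', 'alcohol', 'ins_premium']]
X_centred = predictors - predictors.mean()
y_centred = crash['ins_losses'] - crash['ins_losses'].mean()

# With everything centred, the intercept is zero by construction
pg.linear_regression(X_centred, y_centred, add_intercept=False)

The slopes are identical to the model with the intercept - only the intercept term disappears.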

1.11. Power Analysis#

Power - the probability that a statistical test will detect an effect when one truly exists - can be calculated in a closed-form way (i.e. without the use of simulations) for some basic tests. pingouin has some great functionality for this. For example, calculating the sample size needed to detect a correlation of .10 with an alpha of .025 and power of .90 is easy:

# Correlation power
pg.power_corr(r=.10, alpha=.025, power=.90)
1235.0573893376597

Ouch.

There are a number of other power functions, as well as effect size estimator functions that can convert existing effect sizes (e.g. Cohen’s d) into others, or compute effect sizes from summary statistics in published papers (such as t or F values).
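A sketch combining the two ideas - converting a published r of .30 into Cohen’s d with pg.convert_effsize, then solving for the sample size per group with pg.power_ttest:

# Convert a correlation coefficient into Cohen's d
d = pg.convert_effsize(0.30, input_type='r', output_type='cohen')

# Solve for the n per group needed for 80% power at alpha = .05
pg.power_ttest(d=d, power=0.80, alpha=0.05)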

1.12. Reliability and beyond#

There is also the capability to estimate statistics of measurement reliability, such as the intraclass correlation or Cronbach’s alpha. Intraclass correlation is shown below, using the wine dataset from pingouin:

# Read wine from pingouin
wine = pg.read_dataset('icc')
display(wine.head())

# Compute intra-class correlation within judges on their wine scores
pg.intraclass_corr(data=wine, targets='Wine', 
                   raters='Judge',
                   ratings='Scores')
   Wine Judge  Scores
0     1     A       1
1     2     A       1
2     3     A       3
3     4     A       6
4     5     A       6

    Type              Description       ICC          F  df1  df2      pval        CI95%
0   ICC1   Single raters absolute  0.727521  11.680026    7   24  0.000002  [0.43, 0.93]
1   ICC2     Single random raters  0.727689  11.786693    7   21  0.000005  [0.43, 0.93]
2   ICC3      Single fixed raters  0.729487  11.786693    7   21  0.000005  [0.43, 0.93]
3  ICC1k  Average raters absolute  0.914384  11.680026    7   24  0.000002  [0.75, 0.98]
4  ICC2k    Average random raters  0.914450  11.786693    7   21  0.000005  [0.75, 0.98]
5  ICC3k     Average fixed raters  0.915159  11.786693    7   21  0.000005  [0.75, 0.98]
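Cronbach’s alpha works on wide data, with one column per item. Purely as a toy sketch - the four bug-rating conditions loaded earlier are not really a psychometric scale - it looks like this:

# Cronbach's alpha expects wide data, one column per 'item'
items = bugs[['Lo D, Lo F', 'Lo D, Hi F', 'Hi D, Lo F', 'Hi D, Hi F']]
pg.cronbach_alpha(data=items)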

In most cases, classical statistical techniques are handled exceptionally well in pingouin, and there is a huge range of functionality for more advanced tasks (power analysis, repeated measures correlations, etc.), and even statistical plotting functions. However, when things are limited, there are other options…