1. (Frequentist) Statistics in Python#

So far, we’ve dealt with data processing, handling, and visualisation. These steps are sometimes done ‘behind the scenes’ for early career researchers, who may be presented with a final dataset and asked to run statistical tests. But this hides a wealth of decision making and understanding of the data from you. There is no substitute for working with data yourself - and once you’re done with that, it’s time to actually do some statistical inference.

Python is fully capable of running a range of statistical analyses on data. It perhaps lacks some of the nuanced statistical functionality that R has; but this is the price you pay for using the second-best language for everything.

There are two ways of approaching statistics in Python, and the approach will dictate the kinds of packages that are used. But much like our experience of plotting, these two approaches can work directly with one another.

The first approach (and the most low-level) is the use of the scipy and statsmodels packages, which provide a dizzying array of statistical functionality. They are very general, suited to statistics across many domains, so may be confusing for psychologists.

We will start with another package that aims to make traditional frequentist statistical analysis straightforward, pingouin.

  • pingouin does not come with Anaconda, so you will need to install it. If you are working in a Jupyter notebook, in a fresh cell, you can use a magic command to pip install it, like so:

%pip install pingouin

Check out the documentation of the pingouin package.

# Import numpy, pandas, pingouin, matplotlib, and seaborn
import numpy as np
import pandas as pd
import pingouin as pg
import matplotlib.pyplot as plt
import seaborn as sns

1.1. A t-test in Python#

The most basic kind of analysis we can run is a t-test comparing two groups of participants on a single dependent variable. We have used the tips dataset before for data visualisation, and observed some differences - but we have not tested whether those differences are statistically significant. pingouin has a function that allows us to answer this question - pg.ttest().

# First load tips
tips = sns.load_dataset('tips')
display(tips.head())
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

Do male and female customers differ in the amount of tips they give? Conducting a t-test on this is straightforward. We just need to pass the scores of the first group as the first argument to the function, and the scores of the second group as the second.

If we look at tips we can see our data requires subsetting for that to happen, as it is in long format.

# Subset the data and store before passing, using loc
females = tips.loc[tips['sex'] == 'Female', 'tip']
males = tips.loc[tips['sex'] == 'Male', 'tip']

# Conduct t test
pg.ttest(females, males)
                T         dof alternative     p-val         CI95%   cohen-d   BF10     power
T-test  -1.489536  215.707021   two-sided  0.137807  [-0.6, 0.08]  0.185494  0.414  0.282179

pingouin returns output in a DataFrame, including the T value, the p-value, and the degrees of freedom. It also reports an effect size (Cohen’s d), achieved power, and a Bayes Factor assessing the strength of evidence for the alternative hypothesis - more than enough information for reporting your statistical test! Here we see a non-significant effect, with low power.
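Because the output is a DataFrame, individual values are easy to pull out for reporting. A small sketch, reusing the groups defined above:

# Store the result and extract single values for reporting
result = pg.ttest(females, males)
print(result['p-val'].iloc[0], result['cohen-d'].iloc[0])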

1.2. A repeated-measures t-test in Python#

By default, pg.ttest() assumes an independent samples test. A repeated measures comparison is achieved by setting the keyword argument paired to True. As an example, we could ask whether for each meal, the cost of the meal was higher than the tip - you would hope that this is the case!

results = pg.ttest(tips['total_bill'], tips['tip'], paired=True)
display(results)
                T  dof alternative         p-val          CI95%   cohen-d       BF10  power
T-test  32.646505  243   two-sided  8.020019e-91  [15.77, 17.8]  2.635205  1.222e+87    1.0

A very large difference, as you would expect.
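Since we have a directional hypothesis here - the bill should be greater than the tip - pg.ttest also accepts an alternative keyword for one-sided tests. A quick sketch:

# One-sided test: is the total bill greater than the tip?
pg.ttest(tips['total_bill'], tips['tip'], paired=True, alternative='greater')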

1.3. Correlations with Python#

Computing a correlation between a pair of variables is very straightforward. Simply pass the two variables to the pg.corr() function. For example, is there a correlation between the total bill and the tip?

corr_results = pg.corr(tips['total_bill'], tips['tip'])
display(corr_results)
           n         r        CI95%         p-val       BF10  power
pearson  244  0.675734  [0.6, 0.74]  6.692471e-34  4.952e+30    1.0

The output again provides a range of information - the n (of which your degrees of freedom are equal to n - 2), the r value itself, a 95% confidence interval around it, the p value, power, and again a Bayes Factor. Squaring r gives you the variance explained.

It’s also worth noting that pingouin supports partial correlations, repeated measures correlations, and has other convenience functions that make testing correlational hypotheses straightforward.
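For example, pg.partial_corr computes a correlation while controlling for other variables. A minimal sketch - the choice of party size as the covariate is just for illustration:

# Correlation between bill and tip, controlling for the size of the party
pg.partial_corr(data=tips, x='total_bill', y='tip', covar='size')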

1.4. One way ANOVA#

Things get more complicated once we move past simple tests and correlations, but thankfully pingouin makes things simple. Remember that a one way ANOVA examines differences between 3 or more groups on a single dependent variable. This is implemented in the pg.anova function.

Using the tips dataset, we can examine whether there are significant differences in the amount of tips received across the four different days the study was conducted (Thursday - Sunday).

# Conduct a one way ANOVA
one_way = pg.anova(data=tips, dv='tip', between='day', detailed=True)
display(one_way)
   Source          SS   DF        MS         F     p-unc       np2
0     day    9.525873    3  3.175291  1.672355  0.173589  0.020476
1  Within  455.686604  240  1.898694       NaN       NaN       NaN

You specify the dataset, the dependent variable, and the between-groups (grouping) factor. The detailed keyword returns more information. The output should look familiar from software like SPSS, including the df, MS, F, and p values, and an effect size of partial eta squared.
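If you suspect the groups have unequal variances, pingouin also offers Welch’s ANOVA, which relaxes the homogeneity assumption. A minimal sketch on the same data:

# Welch's ANOVA does not assume equal variances across groups
pg.welch_anova(data=tips, dv='tip', between='day')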

1.5. Two way ANOVA#

If we want to add an additional factor, that is straightforward - pass a list of columns to the between keyword.

We can examine the amount of tips given as a consequence of the day, and the sex of the customer.

# Conduct the two way ANOVA
two_way = pg.anova(data=tips, dv='tip', between=['day', 'sex'])
display(two_way)
      Source          SS     DF        MS         F     p-unc       np2
0        day    7.446900    3.0  2.482300  1.298061  0.275785  0.016233
1        sex    1.594561    1.0  1.594561  0.833839  0.362097  0.003521
2  day * sex    2.785891    3.0  0.928630  0.485606  0.692600  0.006135
3   Residual  451.306151  236.0  1.912314       NaN       NaN       NaN

1.6. Repeated measures designs - one and two way ANOVAs#

Handling repeated measures in an ANOVA context is dealt with by the pg.rm_anova function.

The bugs dataset, where participants rate how much they want to kill an insect based on how frightening and disgusting they perceive it to be, is a good example of fully repeated measures data:

# Read in bugs from the OSF
bugs = pd.read_csv('https://osf.io/mrhjn/download')
display(bugs.head())
   Subject  Gender  Region Education  Lo D, Lo F  Lo D, Hi F  Hi D, Lo F  Hi D, Hi F
0        1  Female   North      some         6.0         6.0         9.0        10.0
1        2  Female   North   advance        10.0         NaN        10.0        10.0
2        3  Female  Europe   college         5.0        10.0        10.0        10.0
3        4  Female   North   college         6.0         9.0         6.0         9.0
4        5  Female   North      some         3.0         6.5         5.5         8.5

Unfortunately, the data is in the wrong format for analysis - remember that repeated measures data often needs to be in ‘long’ format. The first step is to represent the data correctly.

# Melt the data
bugs_long = bugs.melt(id_vars=['Subject', 'Gender', 'Region', 'Education'],
                      value_vars=['Lo D, Lo F', 'Lo D, Hi F', 'Hi D, Lo F', 'Hi D, Hi F'],
                      var_name='Condition', value_name='Rating')

# Split the condition column into two, so as to have two variables
bugs_long[['Disgust_Level', 'Fright_Level']] = bugs_long['Condition'].str.split(', ', expand=True)

# Finally replace the text values
bugs_long.replace({'Disgust_Level': {'Lo D': 'Low', 'Hi D': 'High'}, 'Fright_Level': {'Lo F': 'Low', 'Hi F': 'High'}}, inplace=True)

display(bugs_long.head(), bugs_long.tail())
   Subject  Gender  Region Education   Condition  Rating Disgust_Level Fright_Level
0        1  Female   North      some  Lo D, Lo F     6.0           Low          Low
1        2  Female   North   advance  Lo D, Lo F    10.0           Low          Low
2        3  Female  Europe   college  Lo D, Lo F     5.0           Low          Low
3        4  Female   North   college  Lo D, Lo F     6.0           Low          Low
4        5  Female   North      some  Lo D, Lo F     3.0           Low          Low

     Subject  Gender  Region Education   Condition  Rating Disgust_Level Fright_Level
367       96    Male   North      high  Hi D, Hi F    10.0          High         High
368       97  Female   North       NaN  Hi D, Hi F    10.0          High         High
369       98  Female   North      some  Hi D, Hi F    10.0          High         High
370       99  Female   North      some  Hi D, Hi F    10.0          High         High
371      100  Female  Europe      some  Hi D, Hi F     3.0          High         High

The first step is a simple one way repeated measures ANOVA examining differences between disgust levels. Note that for a repeated measures ANOVA, a column denoting the participant identifier has to be specified.

rma = pg.rm_anova(data=bugs_long, dv='Rating', within='Disgust_Level', subject='Subject', detailed=True)
display(rma)
          Source          SS  DF         MS          F     p-unc       np2  eps
0  Disgust_Level   27.485215   1  27.485215  12.043878  0.000793  0.115758  1.0
1          Error  209.952285  92   2.282090        NaN       NaN       NaN  NaN

This produces some familiar output - numerator and denominator degrees of freedom, mean square, F and p values, effect sizes, and epsilon, an index of sphericity.
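To see which condition sits higher before interpreting that F value, a quick pandas check of the condition means does the job - a simple sketch using the bugs_long data from above:

# Mean rating at each disgust level - a quick check on the direction of the effect
bugs_long.groupby('Disgust_Level')['Rating'].mean()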

Extending this to the two way case is straightforward - pass a list of factors to the within keyword.

rma2 = pg.rm_anova(data=bugs_long, dv='Rating', within=['Disgust_Level', 'Fright_Level'], subject='Subject')
display(rma2)
                         Source          SS  ddof1  ddof2          MS          F         p-unc     p-GG-corr       np2  eps
0                 Disgust_Level   48.752841      1     87   48.752841  12.175190  7.623808e-04  7.623808e-04  0.122764  1.0
1                  Fright_Level  177.556818      1     87  177.556818  41.629663  6.011447e-09  6.011447e-09  0.323640  1.0
2  Disgust_Level * Fright_Level    6.545455      1     87    6.545455   2.152300  1.459622e-01  1.459622e-01  0.024142  1.0

This function is capable of removing data with missing values automatically, and provides a similar set of outcomes, including a p value corrected for violations of sphericity (p-GG-corr, the Greenhouse-Geisser correction).

1.7. Mixed ANOVA#

Often, researchers will blend both repeated measures and between-groups variables. For example, how do depression scores change between time one and time two (a repeated measure) for people who have received a new drug compared to those who have received a placebo (a between-groups measure)? The pg.mixed_anova() function handles cases where a dataset contains a single repeated measures factor and a single between groups factor.

The exercise dataset used to illustrate seaborn plots in the last chapter has this property - individuals are on a low fat or no fat diet, and their pulse is measured at 1, 15, and 30 minutes. Do these variables, or their interaction, significantly affect heart rate?

# Load dataset straight from the internet!
exercise = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/exercise.csv')
display(exercise.head())
   Unnamed: 0  id     diet  pulse    time  kind
0           0   1  low fat     85   1 min  rest
1           1   1  low fat     85  15 min  rest
2           2   1  low fat     88  30 min  rest
3           3   2  low fat     90   1 min  rest
4           4   2  low fat     92  15 min  rest
# Submit to the mixed ANOVA
mix = pg.mixed_anova(data=exercise, dv='pulse', within='time', between='diet', subject='id')
display(mix)
        Source           SS  DF1  DF2           MS          F     p-unc  p-GG-corr       np2       eps sphericity   W-spher   p-spher
0         diet  1261.877778    1   28  1261.877778   3.147101  0.086939        NaN  0.101040       NaN        NaN       NaN       NaN
1         time  2066.600000    2   56  1033.300000  11.807751  0.000053   0.000312  0.296619  0.746425      False  0.660281  0.002994
2  Interaction   192.822222    2   56    96.411111   1.101711  0.339395        NaN  0.037857       NaN        NaN       NaN       NaN

The results are a merger of the standard ANOVA and the repeated measures ANOVA, providing sphericity measures for the repeated measures factor.
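The sphericity test behind those columns is also available directly via pg.sphericity. A minimal sketch on the repeated measures factor - note that this simple call pools across the diet groups:

# Mauchly's test of sphericity for the within-subjects factor
spher = pg.sphericity(data=exercise, dv='pulse', subject='id', within='time')
print(spher)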

1.8. ANCOVA#

The final ANOVA design worth discussing is the ANCOVA, which allows for the addition of a covariate - a continuous variable whose variance the researcher wants to remove from the model before examining differences between the groups. This is handled by the pg.ancova function.

Currently, the function supports only between group designs, and not repeated measures.

As a simple example, we can compare whether males and females offer different tip amounts after controlling for the total cost of the bill.

# ANCOVA example
pg.ancova(data=tips, dv='tip', between='sex', covar='total_bill')
       Source          SS   DF           F         p-unc       np2
0         sex    0.038803    1    0.036999  8.476290e-01  0.000153
1  total_bill  208.789002    1  199.082735  2.332173e-33  0.452376
2    Residual  252.749941  241         NaN           NaN       NaN

Additional covariates can be specified as a list.
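For example, a quick sketch adding the size of the party as a second covariate alongside the bill:

# Multiple covariates are passed as a list
pg.ancova(data=tips, dv='tip', between='sex', covar=['total_bill', 'size'])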

1.9. Post hoc multiple comparisons#

Now you know how to run almost any kind of ANOVA design in Python, the next step is to explore follow up tests. After all, while an ANOVA tells you there are significant differences, it offers no information on what drives them.

pingouin offers a range of follow up tests to explore your data, but the focus here is on pg.pairwise_ttests, which computes t-tests between all pairs of levels of each factor in your data.

Let’s take a look at a dataset with an interaction. In the exercise dataset, there is an interaction between diet and the kind of exercise:

# Examine interaction in exercise
interaction = pg.anova(data=exercise, dv='pulse', between=['kind', 'diet'], detailed=True)
display(interaction)
        Source           SS  DF           MS          F         p-unc       np2
0         kind  8326.066667   2  4163.033333  37.824471  1.935335e-12  0.473846
1         diet  1261.877778   1  1261.877778  11.465164  1.080516e-03  0.120098
2  kind * diet   815.755556   2   407.877778   3.705894  2.868384e-02  0.081081
3     Residual  9245.200000  84   110.061905        NaN           NaN       NaN

We can see the interaction is significant. How do we follow this up with pingouin to see what is driving it? We use pg.pairwise_ttests like so:

# Follow up interaction using comparisons at all levels
comparisons = pg.pairwise_ttests(data=exercise, dv='pulse', between=['kind', 'diet'])
display(comparisons)
      Contrast     kind        A        B  Paired  Parametric         T   dof alternative         p-unc      BF10    hedges
0         kind        -     rest  running   False        True -6.561162  58.0   two-sided  1.593561e-08  5.95e+05 -1.672084
1         kind        -     rest  walking   False        True -2.674622  58.0   two-sided  9.704606e-03     4.824 -0.681616
2         kind        -  running  walking   False        True  5.183378  58.0   two-sided  2.881236e-06  5136.555  1.320961
3         diet        -  low fat   no fat   False        True -2.457504  88.0   two-sided  1.595161e-02       3.0 -0.513659
4  kind * diet     rest  low fat   no fat   False        True -1.434339  28.0   two-sided  1.625505e-01     0.742 -0.509591
5  kind * diet  running  low fat   no fat   False        True -2.754828  28.0   two-sided  1.020405e-02      5.01 -0.978734
6  kind * diet  walking  low fat   no fat   False        True -1.425097  28.0   two-sided  1.651816e-01     0.735 -0.506308

We can change the effsize and padjust keyword arguments to control the effect size returned and the type of p-value adjustment applied, respectively - when running lots of tests, you sometimes need to adjust your alpha level:

# More flexible use of function
pairs_bonf_cohen = pg.pairwise_ttests(data=exercise, dv='pulse', between=['kind', 'diet'], 
                                      padjust='bonf', effsize='cohen')

display(pairs_bonf_cohen)
      Contrast     kind        A        B  Paired  Parametric         T   dof alternative         p-unc        p-corr p-adjust      BF10     cohen
0         kind        -     rest  running   False        True -6.561162  58.0   two-sided  1.593561e-08  4.780682e-08     bonf  5.95e+05 -1.694085
1         kind        -     rest  walking   False        True -2.674622  58.0   two-sided  9.704606e-03  2.911382e-02     bonf     4.824 -0.690584
2         kind        -  running  walking   False        True  5.183378  58.0   two-sided  2.881236e-06  8.643708e-06     bonf  5136.555  1.338343
3         diet        -  low fat   no fat   False        True -2.457504  88.0   two-sided  1.595161e-02           NaN      NaN       3.0 -0.518087
4  kind * diet     rest  low fat   no fat   False        True -1.434339  28.0   two-sided  1.625505e-01  4.876516e-01     bonf     0.742 -0.523747
5  kind * diet  running  low fat   no fat   False        True -2.754828  28.0   two-sided  1.020405e-02  3.061214e-02     bonf      5.01 -1.005921
6  kind * diet  walking  low fat   no fat   False        True -1.425097  28.0   two-sided  1.651816e-01  4.955448e-01     bonf     0.735 -0.520372

There are lots of additional, flexible ways the function works - it can handle one and two way ANOVAs and mixed ANOVAs, just by specifying which variables are between and which are within (a sketch follows below). Remember to choose the level of adjustment and effect size that suit your needs - but with this function, you can explore interactions in sufficient detail.
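For instance, a sketch of the mixed design case, following up the exercise data with Bonferroni adjustment:

# Pairwise tests for a mixed design - within, between, and subject all specified
pg.pairwise_ttests(data=exercise, dv='pulse', within='time', between='diet',
                   subject='id', padjust='bonf')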

1.10. Linear regression#

The final statistical test that is common in psychology is linear regression, where a continuous outcome variable (such as age or height) is predicted by other continuous variables (e.g. weight). Regression is among the most flexible of all analysis types - in fact, ANOVA is just a special case of regression.

Linear regression is handled by the pg.linear_regression() function. It returns somewhat different output to the ANOVA functions, containing the regression coefficients (how much the outcome changes when the associated predictor increases by one unit), their p-values, and some information about the overall model fit.

In the following example, we attempt to build a linear model to predict the amount of loss suffered by an insurance company in a given car crash, based on the insurance premium of the driver, their alcohol consumption, and their speed at the time of the crash. This data is stored in the car_crashes dataset of seaborn.

# Load dataset
crash = sns.load_dataset('car_crashes')
display(crash.head())
   total  speeding  alcohol  not_distracted  no_previous  ins_premium  ins_losses abbrev
0   18.8     7.332    5.640          18.048       15.040       784.55      145.08     AL
1   18.1     7.421    4.525          16.290       17.014      1053.48      133.93     AK
2   18.6     6.510    5.208          15.624       17.856       899.47      110.35     AZ
3   22.4     4.032    5.824          21.056       21.280       827.34      142.39     AR
4   12.0     4.200    3.360          10.920       10.680       878.41      165.63     CA
# Perform linear regression
lin_reg = pg.linear_regression(crash[['speeding', 'alcohol', 'ins_premium']], crash['ins_losses'])
display(lin_reg)
         names       coef         se         T      pval        r2    adj_r2   CI[2.5%]  CI[97.5%]
0    Intercept  58.321239  17.891849  3.259654  0.002078  0.388636  0.349613  22.327480  94.314997
1     speeding  -0.297846   1.892755 -0.157361  0.875634  0.388636  0.349613  -4.105578   3.509885
2      alcohol   0.142754   2.234772  0.063878  0.949338  0.388636  0.349613  -4.353028   4.638535
3  ins_premium   0.086772   0.016143  5.375042  0.000002  0.388636  0.349613   0.054295   0.119248

Unlike SPSS, pingouin does not return a significance test for the overall model. However, each predictor is returned with an associated coefficient, t value, and p value, and the overall \(R^2\) and adjusted \(R^2\) of the model are included in each row. By default the function adds an intercept to the model via the add_intercept keyword - if you centre all of your variables, outcome included, by subtracting the mean from each (easily achieved in Pandas!), you can set this to False.
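A sketch of that workflow on the crash data - the centred variable names here are just for illustration:

# Centre the predictors and the outcome by subtracting their means
predictors = crash[['speeding', 'alcohol', 'ins_premium']]
X_centred = predictors - predictors.mean()
y_centred = crash['ins_losses'] - crash['ins_losses'].mean()

# With everything centred, the intercept is zero by construction
pg.linear_regression(X_centred, y_centred, add_intercept=False)

The slopes are identical to the model with the intercept - only the intercept term disappears.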

1.11. Power Analysis#

Power - the probability that a statistical test will detect an effect when one truly exists - can be calculated in a closed-form way (i.e. without the use of simulations) for some basic tests. pingouin has some great functionality for this. For example, calculating the sample size needed to detect a correlation of .10 with an alpha of .025 and power of .90 is easy:

# Correlation power
pg.power_corr(r=.10, alpha=.025, power=.90)
1235.0573893376597

Ouch.

There are a number of other power functions, as well as effect size estimator functions that can convert existing effect sizes (e.g. Cohen’s d) into others, or compute effect sizes from summary statistics in published papers (such as t or F values).
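A sketch combining the two ideas - converting a published r of .30 into Cohen’s d with pg.convert_effsize, then solving for the sample size per group with pg.power_ttest:

# Convert a correlation coefficient into Cohen's d
d = pg.convert_effsize(0.30, input_type='r', output_type='cohen')

# Solve for the n per group needed for 80% power at alpha = .05
pg.power_ttest(d=d, power=0.80, alpha=0.05)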

1.12. Reliability and beyond#

There is also the capability to estimate statistics of measurement reliability, such as the intraclass correlation or Cronbach’s alpha. Intraclass correlation is shown below, using the wine dataset from pingouin:

# Read wine from pingouin
wine = pg.read_dataset('icc')
display(wine.head())

# Compute intra-class correlation within judges on their wine scores
pg.intraclass_corr(data=wine, targets='Wine', 
                   raters='Judge',
                   ratings='Scores')
   Wine Judge  Scores
0     1     A       1
1     2     A       1
2     3     A       3
3     4     A       6
4     5     A       6

    Type              Description       ICC          F  df1  df2      pval        CI95%
0   ICC1   Single raters absolute  0.727521  11.680026    7   24  0.000002  [0.43, 0.93]
1   ICC2     Single random raters  0.727689  11.786693    7   21  0.000005  [0.43, 0.93]
2   ICC3      Single fixed raters  0.729487  11.786693    7   21  0.000005  [0.43, 0.93]
3  ICC1k  Average raters absolute  0.914384  11.680026    7   24  0.000002  [0.75, 0.98]
4  ICC2k    Average random raters  0.914450  11.786693    7   21  0.000005  [0.75, 0.98]
5  ICC3k     Average fixed raters  0.915159  11.786693    7   21  0.000005  [0.75, 0.98]
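Cronbach’s alpha works on wide data, with one column per item. Purely as a toy sketch - the four bug-rating conditions loaded earlier are not really a psychometric scale - it looks like this:

# Cronbach's alpha expects wide data, one column per 'item'
items = bugs[['Lo D, Lo F', 'Lo D, Hi F', 'Hi D, Lo F', 'Hi D, Hi F']]
pg.cronbach_alpha(data=items)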

In most cases, classical statistical techniques are handled exceptionally well in pingouin, and there is a huge range of functionality for more advanced tasks (power analysis, repeated measures correlations, etc.), and even statistical plotting functions. However, when things are limited, there are other options…