Hierarchical or mixed models
Contents
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
4. Hierarchical or mixed models#
Hierload_datamodels - sometimes known as mixed models, multilevel models, or random effects models, are experiencing massive adoption in psychology. Hierarchical models are suitable for datasets with, well, a hierarchy. The classic example is predicting exam scores for individual school students in different classrooms, in different schools, in different counties, or for modelling repeated-measures data, such as many participants giving many responses to different stimuli.
At their core, hierarchical models take care of the key independent-and-identically-distributed - i.i.d that has been implicitly present in every analysis or regression model covered so far. Simply, this assumption states that each observed datapoint should be independent from the others, and distributed in the same way (i.e., perhaps coming from a normal distribution, a Poisson, etc). In repeated measures data, this is obviously violated - some responses come from one participant, some from another, and so on. In the school example, students in a classroom are probably more similar than those in another, and perhaps schools in some counties are more similar overall than others. Indeed, the longer you think on this, the more likely it is you will start seeing hierarchies everywhere. The fundamental i.i.d asusmption is pervasive, and not often met. Psychologists have typically dodged this bullet by averaging data across stimuli or participants, and analysing the subsequent average scores. But these data do not represent the actual data collected or the data-generating process!
Hierarchical models incorporate specific effects, sometimes called random effects, that are unique to each “group” in the data. For example, in a repeated measures study, a single participant would have their own unique effect ‘tagged’ onto the predictions of their score, to account for the fact those scores come from the same person.
These ‘random effects’ take basically two forms. The most common is known as a random intercept. In this type of model, each grouping variable is given its own intercept value in regression (most typically an ‘offset’ from a global intercept), that denotes how the predictions of the data points from that group should be shifted up or down.
The second form is that of a random slope. This works in exactly the same way, but for a coefficient/predictor variable - imagine that school students are given a teaching intervention that involves some extra one-to-one tuition, measured in minutes. Students receive varying amounts of tuition minutes depending on their availability - some get more, some less. A random slope would mean that students within different classrooms could have a specific coefficient value that deviates from the global effect of tuition minutes.
In practice, both random slopes and random intercepts are combined in models.
4.1. Fitting hierarchical models with statsmodels
- random intercepts and slopes#
The mixedlm
function will estimate a hiearchical model for us. Note that this function is back in the realm of normal, linear regression - no more generalized linear madness now! To highlight the flexibility of these models, we will use the sleepstudy
dataset, which is ideally suited to highlight random effects. The data contains reaction times on a series of tasks while they are sleep deprived (three hours per night) over the course of 10 days. The first two days are baseline and adaptation.
First, we read it in:
# Read sleep data
sleep = pd.read_csv('https://figshare.com/ndownloader/files/31181002')
display(sleep)
Reaction | Days | Subject | |
---|---|---|---|
0 | 249.5600 | 0 | 308 |
1 | 258.7047 | 1 | 308 |
2 | 250.8006 | 2 | 308 |
3 | 321.4398 | 3 | 308 |
4 | 356.8519 | 4 | 308 |
... | ... | ... | ... |
175 | 329.6076 | 5 | 372 |
176 | 334.4818 | 6 | 372 |
177 | 343.2199 | 7 | 372 |
178 | 369.1417 | 8 | 372 |
179 | 364.1236 | 9 | 372 |
180 rows × 3 columns
We can plot the reactions time over each day, for each of the participants:
# Plot
with sns.plotting_context('talk'):
sns.catplot(data=sleep, x='Days', y='Reaction', col='Subject',
col_wrap=6, kind='point', join=False, color='black', alpha=.3)
There is a lot of variation here. Each subject has a different pattern! For example subject 309 is really consistent but subject 337 trends upwards (slowing their reaction time) across the study. With this dataset, we can show off the mixed-model capabilities of statsmodels
.
We want to:
Always predict the reaction time from days of sleep deprivation
Allow the model to vary the intercept (start-point) for each subject
Allow the model to vary the coefficient of day for each subject (effect of days of deprivation)
Allow the model to do both
4.1.1. Random intercepts only#
Lets get started. Below we show how part 1. The mixedlm
function accepts a formula as usual, but has an additional argument, groups
, that allows us to specify the grouping data.
# First use of mixedlm, a random-intercepts only model
random_intercepts = smf.mixedlm('Reaction ~ Days', groups='Subject', data=sleep).fit()
# Summary
random_intercepts.summary()
Model: | MixedLM | Dependent Variable: | Reaction |
No. Observations: | 180 | Method: | REML |
No. Groups: | 18 | Scale: | 960.4568 |
Min. group size: | 10 | Log-Likelihood: | -893.2325 |
Max. group size: | 10 | Converged: | Yes |
Mean group size: | 10.0 |
Coef. | Std.Err. | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 251.405 | 9.747 | 25.794 | 0.000 | 232.302 | 270.508 |
Days | 10.467 | 0.804 | 13.015 | 0.000 | 8.891 | 12.044 |
Subject Var | 1378.176 | 17.156 |
Interesting! The model shows us the number of groups (18), which is equal to the number of participants. It shows the group size is 10 - the number of observations provided by each participant. The Intercept
and Days
data have their usual interpretation - the intercept is the average reaction time when day is zero (so the start point) and Days is the increase in reaction time as each day goes by, so an additional 10.467 ms are added per day.
The Subject Var
coefficient represents the variability in reaction time due to the different subjects. We can divide that by the total variability if we wish to get a kind of ‘variance explained’:
# Compute variance explained
random_intercepts.cov_re / sleep['Reaction'].var()
Subject | |
---|---|
Subject | 0.434354 |
We’ve allowed each participant here their own intercept. Where are they? We can access that using the .random_effects
attribute. As can be seen, this returns a dictionary with keys as the individual participant identifiers, and a value which represents the random effect - actually each participants offset from the global intercept. We’d like this as a DataFrame, so first we extract it:
# Extract offsets to df using the .from_dict method in the constructor
intercepts = pd.DataFrame.from_dict(random_intercepts.random_effects, orient='index')
To see each participants individual intercept, we simply add the global intercept of the model! We can grab this from the .fe_params
attribute of the fitted model (fe
being ‘fixed effect’, i.e., not random)
# Add the global intercept
intercepts = intercepts + random_intercepts.fe_params['Intercept']
intercepts
Subject | |
---|---|
308 | 292.188809 |
309 | 173.555562 |
310 | 188.296546 |
330 | 255.811546 |
331 | 261.621293 |
332 | 259.626342 |
333 | 267.905597 |
334 | 248.408124 |
335 | 206.122984 |
337 | 323.587781 |
349 | 230.208859 |
350 | 265.516466 |
351 | 243.542885 |
352 | 287.783525 |
369 | 258.441485 |
370 | 245.042403 |
371 | 248.110832 |
372 | 269.520849 |
Each participants baseline reaction time is estimated. Let’s next examine the actual predictions of this model, which allows us to account for each participants baseline. First, we insert the .fittedvalues
into the sleep data.
Second, we have to do some slightly more complex graph action to show these results. We utilise the seaborn
class FacetGrid
, which allows us to split a dataset up by a grouping factor and map a specfic plotting function to each subgroup. First we plot the raw data, and then the estimated regression line.
# Insert
sleep['predicted_random_intercept'] = random_intercepts.fittedvalues
# Now plot using FacetGrid
with sns.plotting_context('poster', font_scale=0.5):
(
sns.FacetGrid(data=sleep, col='Subject', col_wrap=6)
.map_dataframe(sns.scatterplot, x='Days', y='Reaction', color='black', alpha=.3)
.map_dataframe(sns.lineplot, x='Days', y='predicted_random_intercept', color='black', linewidth=3)
)
Kind of amazing. It is evident that the random intercepts are true to their name - each participants estimated regression line starts at a different point, but the slope is absolutely identical across the participants, and sometimes with disastrous predictions (eg subject 335).
4.1.2. Random slopes only#
How about if we allow the effect of day to vary using a random slope? We can now demonstrate how to add this in (and keep the intercepts totally fixed!). We need to add an additional dictionary to the model specification under the keyword vc
, which stands for variance components. The '0 + Days'
description indicates to the model to remove the participant-specific intercept, and estimate only a random effect for the Days coefficient.
We estimate the model, store the predictions, and plot the data. Its also entirely possible to extract the random slopes themselves, and add the fixed effect of day to them, if required. The process is the same.
# Estimate a random slope only regression
random_slopes = smf.mixedlm('Reaction ~ Days', groups='Subject',
vc_formula={'Days': '0 + Days'},
data=sleep).fit()
random_slopes.summary()
Model: | MixedLM | Dependent Variable: | Reaction |
No. Observations: | 180 | Method: | REML |
No. Groups: | 18 | Scale: | 842.0310 |
Min. group size: | 10 | Log-Likelihood: | -883.2625 |
Max. group size: | 10 | Converged: | Yes |
Mean group size: | 10.0 |
Coef. | Std.Err. | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 251.405 | 4.020 | 62.539 | 0.000 | 243.526 | 259.284 |
Days | 10.467 | 1.870 | 5.599 | 0.000 | 6.803 | 14.132 |
Days Var | 52.707 | 0.692 |
# Insert
sleep['predicted_random_slope'] = random_slopes.fittedvalues
# Plot once more with facet grid
with sns.plotting_context('poster', font_scale=0.5):
(
sns.FacetGrid(data=sleep, col='Subject', col_wrap=6)
.map_dataframe(sns.scatterplot, x='Days', y='Reaction', color='black', alpha=.3)
.map_dataframe(sns.lineplot, x='Days', y='predicted_random_slope', color='black', linewidth=3)
)
We are now in eessentially the opposite position. Its clear that each slope varies, but notice that it starts in the same position on the y-axis, because the intercept is fixed across all participants! For some participants, the fit is pretty good already - e.g. Subject 372; but for others it is poor, e.g 309.
4.2. Putting it all together#
We are now in a position where we can essentially combine the two approaches. We can fit a model that has random intercepts and slopes, allowing the starting position and the slope of the line to be unique for each participant. Putting this together in statsmodels
is very easy, and relies on the use of the re_formula
keyword.
# Estimate a random intercepts and slopes regression
random_all = smf.mixedlm('Reaction ~ Days', groups='Subject',
re_formula='Days',
data=sleep).fit()
random_all.summary()
Model: | MixedLM | Dependent Variable: | Reaction |
No. Observations: | 180 | Method: | REML |
No. Groups: | 18 | Scale: | 654.9405 |
Min. group size: | 10 | Log-Likelihood: | -871.8141 |
Max. group size: | 10 | Converged: | Yes |
Mean group size: | 10.0 |
Coef. | Std.Err. | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 251.405 | 6.825 | 36.838 | 0.000 | 238.029 | 264.781 |
Days | 10.467 | 1.546 | 6.771 | 0.000 | 7.438 | 13.497 |
Subject Var | 612.096 | 11.881 | ||||
Subject x Days Cov | 9.605 | 1.821 | ||||
Days Var | 35.072 | 0.610 |
The output now includes a variance estimate for each of the intercepts, slopes, and their interaction (Subject, Days, and Subject x Days respectively). Accordingly, the .random_effects
dictionary will now contain two values per participant - the slope and intercept estimate.
Finally, lets see the predictions for this, using the same logic as before.
# Insert
sleep['predicted_random_all'] = random_all.fittedvalues
# Plot once more with facet grid
with sns.plotting_context('poster', font_scale=0.5):
(
sns.FacetGrid(data=sleep, col='Subject', col_wrap=6)
.map_dataframe(sns.scatterplot, x='Days', y='Reaction', color='black', alpha=.3)
.map_dataframe(sns.lineplot, x='Days', y='predicted_random_all', color='black', linewidth=3)
)
Perfect! Those random effects worked really well, allowing for the slope and intercept to vary for each participant.
But how should one choose whether to use random intercepts, slopes, or both? There are many schools of thought on this with no dominant view. In psychology, a popular approach is the ‘keep it maximal’ one of Barr et al, 2013, which suggests that you should include all the random intercepts and slopes that your data allows for. Others claim this makes the models overly saturated, unidentifiable, and ultimately uninterpretable Bates et al. Ultimately, the price we pay for more complex models like the hierarchical ones seen here is that we must make more decisions, and that we must be able to defend these decisions.
4.3. Further mixed models and notes#
The model shown above is a hierarchical cousin of the standard OLS model. Its entirely possible to fit generalized hierarchical models for binary or count data, and statsmodels
can do that, but computation of these complex models becomes tricky. Bayesian models can help here, but that is another story!
Finally, there are often times with experimental data when it would be good to include multiple random intercepts and slopes - for example, participants rating stimuli for attractiveness, where both participants and stimuli have different baselines and intercepts. These are less straightforward to implement and rely on a good working knowledge of the variance components side of mixedlm
. This is an unfortunate aspect of mixedlm
in contrast to other languages and packages (e.g. the lmer
library in \(R\)), but it does at least build significant knowledge when trying to work with complex data.