Linear mixed models for psychological data

or: not letting the statistical tail wag the theoretical dog

Alex Jones & Jeremy Tree

The data

These data are inspired by papers of mine (Mileva et al., 2016; Childs & Jones, 2022)
Broad questions:
- How do intrasexual competitiveness and cosmetics affect beauty perception?
45 participants, 70 faces, without and with makeup
But participants view only a subset of faces
Outcome: attractiveness rating (1–7)
Participants have a score on ICS, a questionnaire measuring competitiveness

Hypotheses

We have five hypotheses to test in this study.

H1. Do cosmetics increase attractiveness?
H2. Is the effect of makeup on attractiveness the same for all observers?
H3. Is the effect of makeup on attractiveness the same for all faces?
H4. For observers, does the ‘makeup effect’ depend on intrasexual competitiveness?
H5. Do faces differ in how much they are impacted by intrasexual competitiveness of observers?

We will (attempt to) test these using traditional statistics, and then again with a linear mixed model (LMM).

H1. Do cosmetics increase attractiveness?

Analyst degrees of freedom — how do we test this?

Traditionally, we have to average across one level — face or rater — to even start.

We go from this…

pid	faceid	ICS	attr	cosmetics_f
ppt_1	face_4	0.979	3	With
ppt_1	face_66	0.979	4	Without
ppt_1	face_32	0.979	1	With
ppt_1	face_21	0.979	3	With
ppt_1	face_67	0.979	5	Without
ppt_1	face_28	0.979	4	Without
ppt_1	face_33	0.979	3	With
ppt_1	face_52	0.979	4	With

The full dataset, 3,669 rows, to…

pid	ics	With	Without
ppt_1	0.979	3.50	4.00
ppt_10	1.396	3.41	2.94
ppt_11	-1.055	4.87	3.56
ppt_12	0.150	1.94	3.34
ppt_13	-0.488	2.33	3.59
ppt_14	-0.257	6.06	3.89
ppt_15	0.744	4.30	3.77
ppt_16	-0.211	4.85	3.46

By-participant averages, N = 45

faceid	With	Without
face_1	3.12	3.32
face_10	2.48	3.09
face_11	5.65	4.57
face_12	5.50	3.96
face_13	5.12	3.59
face_14	5.32	4.67
face_15	3.17	3.33
face_16	4.37	3.37

By-face averages, N = 70
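The aggregation step can be sketched in a few lines. This is a minimal illustration using only the eight long-format rows displayed earlier, so its means will not match the tables above, which use every rating; the function name `aggregate` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# The eight long-format rows shown earlier: (pid, faceid, attr, cosmetics_f).
rows = [
    ("ppt_1", "face_4", 3, "With"),
    ("ppt_1", "face_66", 4, "Without"),
    ("ppt_1", "face_32", 1, "With"),
    ("ppt_1", "face_21", 3, "With"),
    ("ppt_1", "face_67", 5, "Without"),
    ("ppt_1", "face_28", 4, "Without"),
    ("ppt_1", "face_33", 3, "With"),
    ("ppt_1", "face_52", 4, "With"),
]


def aggregate(rows, key_index):
    """Average ratings within each (unit, condition) cell.

    key_index=0 averages over faces (one row per participant);
    key_index=1 averages over participants (one row per face).
    """
    cells = defaultdict(list)
    for row in rows:
        unit, rating, condition = row[key_index], row[2], row[3]
        cells[(unit, condition)].append(rating)
    return {cell: mean(values) for cell, values in cells.items()}


by_participant = aggregate(rows, key_index=0)
by_face = aggregate(rows, key_index=1)

for (unit, condition), value in sorted(by_participant.items()):
    print(unit, condition, round(value, 2))
```

Whichever key we aggregate by, the other source of variation is averaged away before any test is run.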

Which to choose? By face has the larger N (70 vs. 45), so more power? But by participant also lets us test ICS, which we lose when averaging by face! Either way, a paired-samples t-test should help!

By participant, then by face

estimate	statistic	p.value	parameter	conf.low	conf.high	method	alternative
-0.431	-2.71	0.01	44	-0.752	-0.11	Paired t-test	two.sided

estimate	statistic	p.value	parameter	conf.low	conf.high	method	alternative
-0.386	-4.08	<0.001	69	-0.575	-0.197	Paired t-test	two.sided
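For reference, the paired t statistic is just the mean of the paired differences divided by its standard error. A rough sketch follows, using only the eight by-face rows displayed earlier, so it will not reproduce the N = 70 result. The helper `paired_t` is hypothetical, and the sign of the estimate simply reflects which condition is subtracted from which.

```python
from math import sqrt
from statistics import mean, stdev

# By-face averages for the eight faces displayed earlier: (With, Without).
face_means = {
    "face_1": (3.12, 3.32),
    "face_10": (2.48, 3.09),
    "face_11": (5.65, 4.57),
    "face_12": (5.50, 3.96),
    "face_13": (5.12, 3.59),
    "face_14": (5.32, 4.67),
    "face_15": (3.17, 3.33),
    "face_16": (4.37, 3.37),
}


def paired_t(pairs):
    """Return (mean difference, t statistic, degrees of freedom)."""
    diffs = [first - second for first, second in pairs]
    n = len(diffs)
    mean_diff = mean(diffs)
    standard_error = stdev(diffs) / sqrt(n)  # sample SD of the differences
    return mean_diff, mean_diff / standard_error, n - 1


# Differences are taken here as With minus Without.
mean_diff, t_stat, df = paired_t(face_means.values())
print(f"mean difference = {mean_diff:.3f}, t({df}) = {t_stat:.2f}")
```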

A mixed model can do either or both of these tests in a single fit.
- Recall a regression with a binary predictor is akin to a t-test!
- A mixed model uses the full data, no aggregation

“By face” = attr ~ cosmetics + (1|faceid)
“By participant” = attr ~ cosmetics + (1|pid)
All at once = attr ~ cosmetics + (1|pid) + (1|faceid)
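Written out, the "all at once" formula corresponds to a model with crossed random intercepts. The notation below is an assumption chosen for illustration (i indexes participants, j indexes faces), not something taken from the original slides:

```latex
\text{attr}_{ij} = \beta_0 + \beta_1\,\text{cosmetics}_{ij} + u_{0i} + w_{0j} + \varepsilon_{ij},
\qquad
u_{0i} \sim \mathcal{N}(0, \sigma^2_{\text{pid}}),\quad
w_{0j} \sim \mathcal{N}(0, \sigma^2_{\text{faceid}}),\quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Dropping the face term w gives the "by participant" model, and dropping the participant term u gives the "by face" model; in every case the fixed effect for cosmetics plays the role of the t-test's mean difference.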

While the mean difference is broadly similar, the uncertainty around it is very different. A mixed model loses no information to aggregation, so it is more certain about the effect.

In return you get individual-difference measures of each face and participant in the No Cosmetics condition (the intercept!):

The variances of those distributions are shared with us by the model, and we see how much variance can be partitioned by the different “sources”.

Group	Variance	SD	Proportion
faceid	0.551	0.742	0.217
pid	0.371	0.609	0.146
Residual	1.616	1.271	0.637
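The SD and Proportion columns follow directly from the variances: the SD is the square root, and the proportion is each variance divided by their sum. A small sketch of that arithmetic, using the values from the table above:

```python
from math import sqrt

# Variance components reported in the table above.
variances = {"faceid": 0.551, "pid": 0.371, "Residual": 1.616}

total = sum(variances.values())
proportions = {group: var / total for group, var in variances.items()}

for group, var in variances.items():
    print(f"{group:<9} variance={var:.3f}  sd={sqrt(var):.3f}  "
          f"proportion={proportions[group]:.3f}")
```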

H2. Does the ‘makeup effect’ on attractiveness vary across observers?

Are all observers' ratings equally affected by cosmetics?

Messy and indirect traditionally. Carry out separate regressions for each person (attr ~ cosmetics), collect their slopes, then test them against zero. This loses all face information and bakes in measurement error - we know not all participants are equal!

attr ~ cosmetics + (1+cosmetics|pid) + (1|faceid) allows the effect of cosmetics to differ for each participant, directly answering the question.

LMM estimate is more conservative for participants with smaller N
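One way to see why estimates for participants with few ratings are pulled harder toward the group average is the textbook partial-pooling weight for a random-intercept model. This is a simplified sketch: the slope model here has correlated intercepts and slopes, so its actual weights are more involved. The default variances are borrowed from the intercepts-only variance table purely for illustration, and `pooling_weight` is a hypothetical name.

```python
def pooling_weight(n_ratings, between_var=0.371, residual_var=1.616):
    """Weight given to a participant's own mean (0 = fully pooled, 1 = not pooled).

    Textbook formula for a random-intercept model:
    between_var / (between_var + residual_var / n_ratings).
    """
    return between_var / (between_var + residual_var / n_ratings)


for n in (5, 20, 80):
    print(f"n = {n:>2} ratings -> weight on own data = {pooling_weight(n):.2f}")
```

The fewer ratings a participant contributes, the smaller the weight on their own data and the more their estimate leans on everyone else's.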

H3. Does the ‘makeup effect’ on attractiveness vary across faces?

More simply, is the effect of cosmetics the same for all faces, or does it vary?

Harder problem than for observers! A separate regression is not possible because each face has two ‘versions’. The best we could manage is a correlation between each face's average scores with and without makeup, which removes individual variation. There is no easy way to test this traditionally, but mixed models solve it simply:

attr ~ cosmetics + (1+cosmetics|pid) + (1+cosmetics|faceid)

allows the effect of cosmetics to differ for each face (and observers too).

A free gift

Our model now includes intercepts and slopes for both observers and faces:

Intercept — higher baseline rating
Slope — change in ratings with cosmetics

The model also estimates the correlation between these effects
- Harsher raters are unaffected by cosmetics
- More attractive faces get more attractive with cosmetics

Comparing how the fixed effect differs across models

The random effects structure - intercepts and slopes - affects how ‘significant’ the fixed effect is, and its magnitude:

Model	Estimate	SE	df	t	p
Intercepts only: (1 | pid) + (1 | faceid)	0.375	0.042	3561.3	8.87	<0.001
Slope for participants: (1 + cosmetics | pid) + (1 | faceid)	0.404	0.158	44.7	2.56	0.014
Slope for faces: (1 | pid) + (1 + cosmetics | faceid)	0.377	0.095	69.3	3.99	<0.001
Slopes for both: (1 + cosmetics | pid) + (1 + cosmetics | faceid)	0.408	0.181	70.2	2.26	0.027

H4. Does an observer ‘makeup effect’ depend on intrasexual competitiveness?

An interaction. How to test that traditionally?

ANCOVA only controls for ICS, it doesn’t interact with it
Solution — chop ICS into subgroups, force into ANOVA?
Requires averaging over faces again + an arbitrary decision on how to chop

Median split (2 groups)

Effect	DFn	DFd	F	p
ics_group	1	43	2.71	0.107
cosmetics	1	43	8.04	0.007
ics_group:cosmetics	1	43	7.47	0.009

Tertile split (3 groups)

Effect	DFn	DFd	F	p
ics_group	2	42	1.45	0.245
cosmetics	1	42	8.11	0.007
ics_group:cosmetics	2	42	3.40	0.043

Interaction is significant under one chop and borderline under the other

The answer depends on how you chop

Forcing a continuous moderator into categories is not good!
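To make the arbitrariness concrete, here is a small sketch that chops the ICS scores of the eight participants displayed in the earlier by-participant table. It uses a simple rank-based split, which is only one of several ways to chop, and the helper `rank_split` is hypothetical.

```python
# ICS scores for the eight participants displayed in the by-participant table.
ics = {
    "ppt_1": 0.979, "ppt_10": 1.396, "ppt_11": -1.055, "ppt_12": 0.150,
    "ppt_13": -0.488, "ppt_14": -0.257, "ppt_15": 0.744, "ppt_16": -0.211,
}


def rank_split(scores, n_groups):
    """Assign each participant to one of n_groups by rank (0 = lowest scores).

    Ties are not handled specially; this is an illustration only.
    """
    ordered = sorted(scores, key=scores.get)
    n = len(ordered)
    return {pid: rank * n_groups // n for rank, pid in enumerate(ordered)}


median_groups = rank_split(ics, 2)   # 0 = low, 1 = high
tertile_groups = rank_split(ics, 3)  # 0 = low, 1 = middle, 2 = high

for pid in sorted(ics, key=ics.get):
    print(f"{pid:<7} ICS={ics[pid]:>6.3f}  median split={median_groups[pid]}  "
          f"tertile split={tertile_groups[pid]}")
```

The same participant can be "low" under one scheme and "middle" under another, even though their score never changed.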

In an LMM it is trivial to let a continuous, observer-level variable interact with a face-level variable. No chopping is required - variables stay ‘natural’

attr ~ cosmetics + ICS + cosmetics:ICS + (1+cosmetics|pid) + (1+cosmetics|faceid)

The p-value for the interaction falls between those from the two splits (0.009 and 0.043)!

Unpacking with simple effects/marginal means

While we have clear evidence of the interaction, we can use our mixed model to probe it further through estimated marginal means (EMMs), or simple effects. A typical approach is to ‘pick a point’ on one variable and take the difference between the predictions at the levels of the other variable.

Concretely, we could pin ICS at low, medium, and high values (the mean minus 1 SD, the mean, and the mean plus 1 SD, i.e. Z-scores of -1, 0, 1) and take the difference between the without and with cosmetics conditions to see where the difference holds:

ICS	term	contrast	estimate	std.error	statistic	p.value
-0.776	cosmetics	mean(1) - mean(0)	0.751	0.230	3.271	0.001
0.132	cosmetics	mean(1) - mean(0)	0.396	0.174	2.277	0.023
1.039	cosmetics	mean(1) - mean(0)	0.040	0.237	0.171	0.865
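These simple effects can be recovered, up to rounding, from the fixed effects alone: the cosmetics effect at a given ICS value is the cosmetics coefficient plus the interaction coefficient times that ICS value. A sketch, assuming the rounded coefficients from the fixed-effects table for the model without a face-level ICS slope (reported later in this document):

```python
# Rounded fixed-effect estimates (model without the face-level ICS slope).
b_cosmetics = 0.447
b_interaction = -0.392  # cosmetics:ICS


def cosmetics_effect(ics_value):
    """Predicted With-minus-Without difference at a given ICS score."""
    return b_cosmetics + b_interaction * ics_value


for ics_value in (-0.776, 0.132, 1.039):
    print(f"ICS = {ics_value:>6.3f} -> cosmetics effect = "
          f"{cosmetics_effect(ics_value):.3f}")
```

Small discrepancies in the third decimal place against the table come from working with rounded coefficients.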

Alternatively, we can compute a simple slope, which is the slope of the relationship between our DV and a predictor when we fix another predictor at a specific level.

Here we can estimate the slopes of ICS with attractiveness when we pin cosmetics to without (0) and with (1):

term	cosmetics	estimate	std.error	statistic	p.value
ICS	0	0.112	0.056	1.98	0.048
ICS	1	-0.280	0.186	-1.50	0.132

Notice that only one slope is even borderline significant. But their difference is significant!

term	estimate	std.error	statistic	p.value
b2=b1	-0.392	0.171	-2.28	0.022
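Why the difference between the two simple slopes is exactly the interaction term can be seen from the fixed part of the model. The beta notation below is an assumption for illustration; the numbers are the rounded estimates from the tables in this section:

```latex
\mathbb{E}[\text{attr}] = \beta_0 + \beta_1\,\text{cosmetics} + \beta_2\,\text{ICS} + \beta_3\,(\text{cosmetics} \times \text{ICS})

\frac{\partial\,\mathbb{E}[\text{attr}]}{\partial\,\text{ICS}} = \beta_2 + \beta_3\,\text{cosmetics}
\;\Rightarrow\;
\begin{cases}
\text{cosmetics} = 0: & \beta_2 \approx 0.112 \\
\text{cosmetics} = 1: & \beta_2 + \beta_3 \approx 0.112 - 0.392 = -0.280
\end{cases}

(\beta_2 + \beta_3) - \beta_2 = \beta_3 \approx -0.392
```

So testing whether the two simple slopes differ is the same as testing the cosmetics:ICS coefficient.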

H5. Do faces differ in how much they are impacted by intrasexual competitiveness of observers?

Take a moment to consider the ‘fixed effects’
attr ~ cosmetics + ICS + cosmetics:ICS
and random
(1 + cosmetics|faceid) + (1 + cosmetics|pid)
The simple rule of thumb is that a fixed effect with no matching random slope is assumed to be the same across units (here, ICS).
Relaxing that assumption allows substantial flexibility in analysis.
ICS is an observer-level score, so each face could be affected differently by ICS, rather than the effect being constant.

We can let the effect of ICS vary for faces - some faces are more impacted by observer ICS than others.
attr ~ cosmetics + ICS + cosmetics:ICS + (1+ICS+cosmetics|faceid) + (1+cosmetics|pid)
Genuinely no clear way to test this outside of LMMs
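Spelled out, the formula above corresponds to something like the following, with i indexing participants and j indexing faces. The symbols are assumptions chosen for illustration, not the authors' notation:

```latex
\text{attr}_{ij} =
(\beta_0 + u_{0i} + w_{0j})
+ (\beta_1 + u_{1i} + w_{1j})\,\text{cosmetics}_{ij}
+ (\beta_2 + w_{2j})\,\text{ICS}_i
+ \beta_3\,\text{cosmetics}_{ij}\,\text{ICS}_i
+ \varepsilon_{ij}

(u_{0i}, u_{1i}) \sim \mathcal{N}(\mathbf{0}, \Sigma_{\text{pid}}),
\qquad
(w_{0j}, w_{1j}, w_{2j}) \sim \mathcal{N}(\mathbf{0}, \Sigma_{\text{faceid}})
```

The new piece is the face-specific deviation in the ICS slope (w with subscript 2j): its standard deviation says how much faces differ in their sensitivity to observer ICS, and the off-diagonal entries of the face covariance matrix give the correlations among the face-level baseline, cosmetics effect, and ICS effect.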

SD of face-level ICS slope: 0.507

Correlation	estimate
Baseline ↔︎ Cosmetics effect	0.652
Baseline ↔︎ ICS effect	-0.219
Cosmetics effect ↔︎ ICS effect	-0.482

Faces with a higher baseline attractiveness tend to have lower ICS slopes: as observer ICS goes up, their ratings go down
Faces with a higher cosmetics slope (a bigger boost from makeup) also tend to have lower ICS slopes: as observer ICS goes up, their ratings go down
LMMs allow us to test complex questions and allow more realistic flexibility

Random effects change conclusions - things change with and without them

Without face-level ICS slope

Term	Estimate	SE	p
(Intercept)	3.556	0.074	<0.001
cosmetics	0.447	0.174	0.013
ICS	0.112	0.056	0.055
cosmetics:ICS	-0.392	0.171	0.027

With face-level ICS slope

Term	Estimate	SE	p
(Intercept)	3.555	0.075	<0.001
cosmetics	0.459	0.176	0.011
ICS	0.111	0.082	0.179
cosmetics:ICS	-0.416	0.172	0.020

Ignoring that the ICS effect varies across faces makes its uncertainty look smaller than it is.

Enough talk - let's try another dataset