or: not letting the statistical tail wag the theoretical dog
We have five hypotheses to test in this study.
We will (attempt to) test these using traditional statistics, and then again with a linear mixed model (LMM).
Traditionally, we have to average across one level — face or rater — to even start.
We go from this…
| pid | faceid | ICS | attr | cosmetics_f |
|---|---|---|---|---|
| ppt_1 | face_4 | 0.979 | 3 | Cosmetics |
| ppt_1 | face_66 | 0.979 | 4 | No cosmetics |
| ppt_1 | face_32 | 0.979 | 1 | Cosmetics |
| ppt_1 | face_21 | 0.979 | 3 | Cosmetics |
| ppt_1 | face_67 | 0.979 | 5 | No cosmetics |
| ppt_1 | face_28 | 0.979 | 4 | No cosmetics |
| ppt_1 | face_33 | 0.979 | 3 | Cosmetics |
| ppt_1 | face_52 | 0.979 | 4 | Cosmetics |
The full dataset, 26,364 rows, to…
| pid | ics | No cosmetics | Cosmetics |
|---|---|---|---|
| ppt_1 | 0.979 | 4.00 | 3.50 |
| ppt_10 | 1.396 | 2.94 | 3.41 |
| ppt_11 | -1.055 | 3.56 | 4.87 |
| ppt_12 | 0.150 | 3.34 | 1.94 |
| ppt_13 | -0.488 | 3.59 | 2.33 |
| ppt_14 | -0.257 | 3.89 | 6.06 |
| ppt_15 | 0.744 | 3.77 | 4.30 |
| ppt_16 | -0.211 | 3.46 | 4.85 |
By-participant averages, N = 45
| faceid | No cosmetics | Cosmetics |
|---|---|---|
| face_1 | 3.32 | 3.12 |
| face_10 | 3.09 | 2.48 |
| face_11 | 4.57 | 5.65 |
| face_12 | 3.96 | 5.50 |
| face_13 | 3.59 | 5.12 |
| face_14 | 4.67 | 5.32 |
| face_15 | 3.33 | 3.17 |
| face_16 | 3.37 | 4.37 |
By-face averages, N = 70
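The aggregation step above can be sketched in a few lines. This is only an illustration using the eight long-format rows shown earlier (all from ppt_1), so the averages will not match the full-data tables; the helper name `aggregate` is made up for this sketch.

```python
from collections import defaultdict
from statistics import mean

# The eight long-format rows shown earlier: (pid, faceid, attr, cosmetics_f)
rows = [
    ("ppt_1", "face_4", 3, "Cosmetics"),
    ("ppt_1", "face_66", 4, "No cosmetics"),
    ("ppt_1", "face_32", 1, "Cosmetics"),
    ("ppt_1", "face_21", 3, "Cosmetics"),
    ("ppt_1", "face_67", 5, "No cosmetics"),
    ("ppt_1", "face_28", 4, "No cosmetics"),
    ("ppt_1", "face_33", 3, "Cosmetics"),
    ("ppt_1", "face_52", 4, "Cosmetics"),
]


def aggregate(rows, key_index):
    """Average attr within each (unit, condition) cell.

    key_index 0 groups by pid, key_index 1 groups by faceid.
    """
    cells = defaultdict(list)
    for row in rows:
        cells[(row[key_index], row[3])].append(row[2])
    return {cell: mean(values) for cell, values in cells.items()}


by_participant = aggregate(rows, key_index=0)  # collapses over faces
by_face = aggregate(rows, key_index=1)         # collapses over participants

for cell, value in by_participant.items():
    print(cell, round(value, 2))
```

Either call throws away one of the two grouping factors, which is exactly the information the aggregated t-tests cannot use.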
Which to choose? By-face averages have the larger N, so perhaps more power? By-participant averages also let us test ICS, which we lose when averaging by face! Either way, a paired-samples t-test should help!
By participant (N = 45):

| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| -0.431 | -2.71 | 0.01 | 44 | -0.752 | -0.11 | Paired t-test | two.sided |

By face (N = 70):

| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| -0.386 | -4.08 | < 0.001 | 69 | -0.575 | -0.197 | Paired t-test | two.sided |
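For reference, a paired t-test is just a one-sample test on the within-unit differences. The sketch below uses only the first eight rows of the by-participant table, so its numbers will differ from the N = 45 result; the direction of the difference (No cosmetics minus Cosmetics) is an assumption based on the sign of the reported estimates.

```python
from math import sqrt
from statistics import mean, stdev

# First eight rows of the by-participant table: (No cosmetics, Cosmetics)
pairs = [
    (4.00, 3.50), (2.94, 3.41), (3.56, 4.87), (3.34, 1.94),
    (3.59, 2.33), (3.89, 6.06), (3.77, 4.30), (3.46, 4.85),
]

# Assumed direction: No cosmetics minus Cosmetics
diffs = [no_cos - cos for no_cos, cos in pairs]

n = len(diffs)
mean_diff = mean(diffs)          # the 'estimate' column
se = stdev(diffs) / sqrt(n)      # standard error of the mean difference
t_stat = mean_diff / se          # the 'statistic' column
df = n - 1                       # the 'parameter' column

print(f"estimate = {mean_diff:.3f}, t({df}) = {t_stat:.2f}")
```

The p-value and confidence interval would come from a t distribution with `df` degrees of freedom, which is left out here to keep the sketch free of third-party libraries.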
A mixed model can do either or both of these tests in a single fit.
- Recall a regression with a binary predictor is akin to a t-test!
- A mixed model uses the full data, no aggregation
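The link between a regression with a binary predictor and a t-test can be checked by hand: the least-squares slope for a 0/1 predictor equals the difference in group means, and the intercept equals the mean of the reference group. A minimal sketch using ppt_1's eight ratings from the long-format table (the 0/1 coding of cosmetics is assumed here):

```python
from statistics import mean

# ppt_1's eight ratings; x = 1 for Cosmetics, 0 for No cosmetics (assumed coding)
x = [1, 0, 1, 1, 0, 0, 1, 1]
y = [3, 4, 1, 3, 5, 4, 3, 4]

x_bar, y_bar = mean(x), mean(y)

# Ordinary least-squares slope and intercept for y ~ x
slope = (
    sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    / sum((xi - x_bar) ** 2 for xi in x)
)
intercept = y_bar - slope * x_bar

# Group means computed directly
mean_cos = mean([yi for xi, yi in zip(x, y) if xi == 1])
mean_no_cos = mean([yi for xi, yi in zip(x, y) if xi == 0])

print(f"slope = {slope:.3f}, mean difference = {mean_cos - mean_no_cos:.3f}")
print(f"intercept = {intercept:.3f}, No cosmetics mean = {mean_no_cos:.3f}")
```

This plain regression corresponds to an independent-samples comparison; the random intercepts in the formulas on this slide are what account for repeated ratings of the same participant or face.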
attr ~ cosmetics + (1|faceid)
attr ~ cosmetics + (1|pid)
attr ~ cosmetics + (1|pid) + (1|faceid)
While the mean difference is broadly similar, the uncertainty around it is very different. A mixed model loses no information to aggregation, so is more certain about the effect.
In return you get individual-difference measures of each face and participant in the No Cosmetics condition (the intercept!):
The model reports the variances of those distributions, so we can see how much of the total variance is attributable to each “source”.
| Group | Variance | SD | Proportion |
|---|---|---|---|
| faceid | 0.551 | 0.742 | 0.217 |
| pid | 0.371 | 0.609 | 0.146 |
| Residual | 1.616 | 1.271 | 0.637 |
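The SD and Proportion columns follow directly from the Variance column: each SD is the square root of its variance, and each proportion is that variance divided by the total. A quick check using the values in the table:

```python
from math import sqrt

# Variance components from the table above
variances = {"faceid": 0.551, "pid": 0.371, "Residual": 1.616}

total = sum(variances.values())
sds = {group: sqrt(var) for group, var in variances.items()}
proportions = {group: var / total for group, var in variances.items()}

for group in variances:
    print(f"{group}: SD = {sds[group]:.3f}, proportion = {proportions[group]:.3f}")
```

The faceid and pid proportions can be read as intraclass correlations: the share of rating variance attributable to stable differences between faces and between participants.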
Are all observers' ratings equally affected by cosmetics?
Traditionally, this is messy and indirect: carry out a separate regression for each person (attr ~ cosmetics), collect the slopes, then test them against zero. This loses all face information and bakes in measurement error by treating all participants as equal, which we know they are not!
attr ~ cosmetics + (1+cosmetics|pid) + (1|faceid) allows the effect of cosmetics to differ for each participant, directly answering the question.
The LMM estimate is more conservative (pulled toward the average effect) for participants with smaller N
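A rough way to see where this conservatism comes from is the partial-pooling weight from the simplest random-intercept model. This is a deliberate simplification: it ignores the face effects and the random slopes in the model above, and it borrows the pid and residual variances from the earlier variance table purely for illustration. The function names and the example means are hypothetical.

```python
def pooling_weight(n_obs, between_var=0.371, residual_var=1.616):
    """Weight given to a participant's own mean (0 = all pooled, 1 = no pooling).

    Simplified random-intercept formula: tau^2 / (tau^2 + sigma^2 / n).
    Default variances are the pid and Residual values from the variance table.
    """
    return between_var / (between_var + residual_var / n_obs)


def shrunken_mean(own_mean, grand_mean, n_obs):
    """Weighted average of a participant's own mean and the grand mean."""
    w = pooling_weight(n_obs)
    return w * own_mean + (1 - w) * grand_mean


# Hypothetical participant with a raw mean of 5.0 against a grand mean of 3.5
for n_obs in (5, 20, 140):
    print(n_obs, round(pooling_weight(n_obs), 3),
          round(shrunken_mean(5.0, 3.5, n_obs), 3))
```

With few ratings the weight is small and the estimate sits closer to the grand mean; with many ratings it approaches the participant's own mean.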
More simply, is the effect of cosmetics the same for all faces, or does it vary?
Harder problem than for observers! A separate regression per face is not possible because each face has two ‘versions’. The best we could manage is a correlation between the average scores for each face, which removes individual variation. There is no easy way to test this traditionally, but it is solved simply by mixed models.
attr ~ cosmetics + (1+cosmetics|pid) + (1+cosmetics|faceid)
allows the effect of cosmetics to differ for each face (and observers too).
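Written out, this formula corresponds to something like the following, where $i$ indexes participants and $j$ indexes faces. The symbols are notation assumed here for illustration, not taken from the slides:

$$
\begin{aligned}
\text{attr}_{ij} &= (\beta_0 + u_{0i} + w_{0j}) + (\beta_1 + u_{1i} + w_{1j})\,\text{cosmetics}_{ij} + \varepsilon_{ij} \\
(u_{0i}, u_{1i}) &\sim \mathcal{N}(\mathbf{0}, \Sigma_{\text{pid}}), \quad
(w_{0j}, w_{1j}) \sim \mathcal{N}(\mathbf{0}, \Sigma_{\text{faceid}}), \quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\end{aligned}
$$

Here $\beta_0$ and $\beta_1$ are the fixed intercept and cosmetics effect, the $u$ terms are participant-specific deviations, and the $w$ terms are face-specific deviations. The off-diagonal element of each $\Sigma$ is the intercept-slope covariance, which is where the correlations between baseline and cosmetics effect come from.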
Our model now includes intercepts and slopes for both observers and faces:
The model also estimates the correlation between these effects
- Harsher raters are unaffected by cosmetics
- More attractive faces get more attractive with cosmetics
An interaction. How to test that traditionally?
Median split (2 groups)
| Effect | DFn | DFd | F | p |
|---|---|---|---|---|
| ics_group | 1 | 43 | 2.71 | 0.107 |
| cosmetics | 1 | 43 | 8.04 | 0.007 |
| ics_group:cosmetics | 1 | 43 | 7.47 | 0.009 |
Tertile split (3 groups)
| Effect | DFn | DFd | F | p |
|---|---|---|---|---|
| ics_group | 2 | 42 | 1.45 | 0.245 |
| cosmetics | 1 | 42 | 8.11 | 0.007 |
| ics_group:cosmetics | 2 | 42 | 3.40 | 0.043 |
Interaction is significant under one chop and borderline under the other
Forcing a continuous moderator into categories is not good!
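To see what a split throws away, consider the eight ICS scores shown in the by-participant table. This is only an illustration: the cut point here is the median of these eight values, not of all 45 participants.

```python
from statistics import median

# ICS scores from the first eight rows of the by-participant table
ics = {
    "ppt_1": 0.979, "ppt_10": 1.396, "ppt_11": -1.055, "ppt_12": 0.150,
    "ppt_13": -0.488, "ppt_14": -0.257, "ppt_15": 0.744, "ppt_16": -0.211,
}

cut = median(ics.values())
group = {pid: ("high" if score > cut else "low") for pid, score in ics.items()}

# Two participants close together on ICS end up in different groups...
gap_across = abs(ics["ppt_12"] - ics["ppt_16"])
# ...while two participants far apart are treated as identical.
gap_within = abs(ics["ppt_10"] - ics["ppt_12"])

print(f"cut = {cut:.4f}")
print(f"ppt_16 ({group['ppt_16']}) vs ppt_12 ({group['ppt_12']}): gap {gap_across:.3f}")
print(f"ppt_12 ({group['ppt_12']}) vs ppt_10 ({group['ppt_10']}): gap {gap_within:.3f}")
```

Moving the cut point reshuffles who counts as ‘high’ or ‘low’, which is one reason the median and tertile splits give different answers.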
Trivial to let a continuous, observer-level variable interact with a face-level variable, with no chopping required - variables are ‘natural’
attr ~ cosmetics + ICS + cosmetics:ICS + (1+cosmetics|pid) + (1+cosmetics|faceid)
The p-value is between the two other approaches!
Unpacking with EMM
While we have clear evidence of the interaction, we can use our mixed model to probe it further through estimated marginal means (EMM), or ‘simple slopes’. A typical approach is to ‘pick a point’ on one variable and take the difference between the predictions at the levels of the other variable.
Concretely, we could pin ICS at low, medium, high (-1, 0, 1 Z-score) and take the difference between the without and with cosmetics conditions to see where the difference holds:
| ICS | term | contrast | estimate | std.error | statistic | p.value |
|---|---|---|---|---|---|---|
| -1 | cosmetics | mean(1) - mean(0) | 1.159 | 0.256 | 4.52 | 0.000 |
| 0 | cosmetics | mean(1) - mean(0) | 0.767 | 0.174 | 4.40 | 0.000 |
| 1 | cosmetics | mean(1) - mean(0) | 0.376 | 0.232 | 1.62 | 0.106 |
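Because the model is linear in ICS, moving one unit along ICS should change the cosmetics effect by the interaction coefficient. A small consistency check on the estimates in the table above, using -0.392 as the cosmetics:ICS estimate reported elsewhere in these slides (agreement is only expected up to rounding):

```python
# Pick-a-point estimates of the cosmetics effect at ICS = -1, 0, 1 (from the table)
effects = {-1: 1.159, 0: 0.767, 1: 0.376}

# cosmetics:ICS interaction estimate reported elsewhere in the slides
interaction = -0.392

# Change in the cosmetics effect for each one-unit step in ICS
steps = [effects[0] - effects[-1], effects[1] - effects[0]]

for step in steps:
    print(f"step = {step:.3f}, interaction = {interaction:.3f}")
```

In other words, the three rows are three points on a single straight line whose slope is the interaction term.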
Alternatively, we can compute a simple slope: the slope of the relationship between our DV and one predictor when we fix another predictor at a specific level.
Here we can estimate the slope of attractiveness on ICS when we pin cosmetics to without (0) and with (1):
| term | cosmetics | estimate | std.error | statistic | p.value |
|---|---|---|---|---|---|
| ICS | 0 | 0.112 | 0.056 | 1.98 | 0.048 |
| ICS | 1 | -0.280 | 0.186 | -1.50 | 0.132 |
Notice that only one slope is (borderline) significant on its own. But their difference is!
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| b2=b1 | -0.392 | 0.171 | -2.28 | 0.022 |
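The three tables above hang together arithmetically: the ICS slope with cosmetics is the slope without cosmetics plus the interaction, and the difference between the two slopes is the interaction itself. The sketch below reproduces the point estimates from the reported values; the p-value uses a normal (z) approximation, which is an assumption about how the reported p-values were obtained.

```python
from math import erfc, sqrt

b_ics = 0.112           # ICS slope when cosmetics = 0 (from the table)
b_interaction = -0.392  # difference between the two slopes (b2 = b1 row)


def ics_slope(cosmetics):
    """Simple slope of ICS at a given cosmetics level (0 = without, 1 = with)."""
    return b_ics + b_interaction * cosmetics


def two_sided_p(estimate, std_error):
    """Two-sided p-value under a normal (z) approximation."""
    z = estimate / std_error
    return erfc(abs(z) / sqrt(2))


slope_without = ics_slope(0)
slope_with = ics_slope(1)
p_difference = two_sided_p(b_interaction, 0.171)

print(f"ICS slope without cosmetics: {slope_without:.3f}")
print(f"ICS slope with cosmetics:    {slope_with:.3f}")
print(f"difference: {slope_with - slope_without:.3f}, p ~ {p_difference:.3f}")
```

This is also why a non-significant slope in one condition and a borderline slope in the other say little by themselves: the interaction is a test of the difference, not of either slope.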
Take a moment to consider the ‘fixed effects’
attr ~ cosmetics + ICS + cosmetics:ICS
and random
(1 + cosmetics|faceid) + (1 + cosmetics|pid)
A simple rule of thumb: a fixed effect without a corresponding random effect is assumed to be the same across units (here, ICS).
Relaxing that assumption allows substantial flexibility in analysis.
ICS is an observer-level score, so each face could be affected differently by ICS, rather than the effect being constant.
attr ~ cosmetics + ICS + cosmetics:ICS + (1+ICS+cosmetics|faceid) + (1+cosmetics|pid)
SD of face-level ICS slope: 0.507
| Correlation | estimate |
|---|---|
| Baseline ↔︎ Cosmetics effect | 0.652 |
| Baseline ↔︎ ICS effect | -0.220 |
| Cosmetics effect ↔︎ ICS effect | -0.482 |
ICS effect constant across faces, (1+cosmetics|faceid):

| Term | Estimate | SE | p |
|---|---|---|---|
| (Intercept) | 3.556 | 0.074 | 0.000 |
| cosmetics | 0.447 | 0.174 | 0.013 |
| ICS | 0.112 | 0.056 | 0.055 |
| cosmetics:ICS | -0.392 | 0.171 | 0.027 |

ICS effect varying across faces, (1+ICS+cosmetics|faceid):

| Term | Estimate | SE | p |
|---|---|---|---|
| (Intercept) | 3.555 | 0.075 | 0.000 |
| cosmetics | 0.459 | 0.176 | 0.011 |
| ICS | 0.111 | 0.082 | 0.179 |
| cosmetics:ICS | -0.416 | 0.172 | 0.020 |
Ignoring that the ICS effect varies across faces makes its uncertainty look smaller than it is.
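The size of that change can be read off the ICS rows of the two fixed-effects tables. The sketch assumes the table with the smaller ICS standard error (0.056) is the model with a constant ICS effect and the one with the larger standard error (0.082) is the model with a face-level ICS slope.

```python
# ICS rows from the two fixed-effects tables (assignment to models is assumed)
ics_constant = {"estimate": 0.112, "se": 0.056}  # no face-level ICS slope
ics_varying = {"estimate": 0.111, "se": 0.082}   # with face-level ICS slope

# How much the standard error grows once the ICS effect may vary across faces
se_ratio = ics_varying["se"] / ics_constant["se"]

# Estimate-to-SE ratios under each model
ratio_constant = ics_constant["estimate"] / ics_constant["se"]
ratio_varying = ics_varying["estimate"] / ics_varying["se"]

print(f"SE ratio: {se_ratio:.2f}")
print(f"estimate / SE: {ratio_constant:.2f} (constant) vs {ratio_varying:.2f} (varying)")
```

The point estimate barely moves, while its standard error grows by nearly half, which is what shifts the ICS p-value from 0.055 to 0.179.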
Enough talk - let's try another dataset