4 - The art of prediction#

Asking scientific questions of models#

This week we have a very important topic to cover - using a model to generate predictions, and testing our scientific theories against those model-based predictions.

This is likely one of the most important sessions we will cover, as it provides the basis for how we interact with models and have them answer our questions. We will cover:

  • Generating conditional predictions from a model

  • Generating marginal predictions and testing comparisons and slopes of those values

  • Translating our natural-language queries into statistical terms

The marginaleffects package#

Much of the heavy lifting of our foray into predictions will be handled by a new package called marginaleffects. It can take a little getting used to, but it is very powerful. We’ll explore how to use it and which of its functions turn our questions into answers.

Let’s import everything we need, including marginaleffects, which we will import as me.

# Import most of what we need
import pandas as pd # dataframes
import seaborn as sns # plots
import statsmodels.formula.api as smf # Models
import marginaleffects as me # marginal effects

# Set the style for plots
sns.set_style('whitegrid') # a different theme!
sns.set_context('talk')

Fitting a model#

We will work through some examples with the affairs dataset, fitting models of varying complexity and using conditional and marginal predictions to understand the implications and answer our questions.

First, let’s load in the dataset and use statsmodels to fit a simple model, akin to a t-test, which tests whether those with and without children differ in marital satisfaction:

# First read in the data
affairs = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/AER/Affairs.csv')

# Fit the simple model
simple_model = smf.ols('rating ~ children', data=affairs).fit()

# Simple summary
display(simple_model.summary(slim=True))
OLS Regression Results
Dep. Variable: rating R-squared: 0.039
Model: OLS Adj. R-squared: 0.037
No. Observations: 601 F-statistic: 24.00
Covariance Type: nonrobust Prob (F-statistic): 1.24e-06
coef std err t P>|t| [0.025 0.975]
Intercept 4.2749 0.083 51.635 0.000 4.112 4.437
children[T.yes] -0.4795 0.098 -4.899 0.000 -0.672 -0.287


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Using marginaleffects to get predictions#

With a fitted model in hand, let’s see some of the ways we can use marginaleffects to get results.

The first function is me.predictions, which is very general and returns a prediction for each datapoint that went into the model! Let’s see how this works. marginaleffects returns a different type of dataframe (a polars one, not a pandas one), so we use the to_pandas() method whenever we want to work with the output in pandas.

# Demonstrate prediction
me.predictions(simple_model).to_pandas().head(3) # Predict for the fitted model
rowid estimate std_error statistic p_value s_value conf_low conf_high rownames affairs gender age yearsmarried children religiousness education occupation rating
0 0 4.274854 0.082790 51.634708 0.0 inf 4.112588 4.437120 4 0 male 37.0 10.0 no 3 18 7 4
1 1 4.274854 0.082790 51.634708 0.0 inf 4.112588 4.437120 5 0 female 27.0 4.0 no 4 14 6 4
2 2 3.795349 0.052209 72.695638 0.0 inf 3.693022 3.897676 11 0 female 32.0 15.0 yes 1 12 1 4

Notice here we obtain, for each row/observation that went into the model:

  • an estimate (the prediction)

  • the std error

  • the t-statistic

  • the p-value (is it different from zero?)

  • s-value (we can ignore)

  • conf_low, conf_high - 95% confidence interval bounds

… and the rest of the dataset so we can see what went in!

This is great and can be very useful, but usually we want the model to predict the outcome for specific values of its predictors. That is, we want to know: given someone has children, what is their predicted satisfaction? What about if they do not have children? These are conditional predictions!
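One way to get these is to build a small table of inputs ourselves and hand it to me.predictions. Here is a minimal sketch, assuming newdata will accept a plain pandas DataFrame containing the predictor columns the model uses:

# A sketch: build the inputs by hand as a pandas DataFrame
# (assumes newdata accepts any dataframe with the model's predictors)
by_hand = pd.DataFrame({'children': ['yes', 'no']})
me.predictions(simple_model, newdata=by_hand).to_pandas()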

To do this, we ask the model for its predictions given a set of inputs that we specify. Building those inputs by hand quickly becomes tedious once a model has more predictors, so marginaleffects provides the very powerful me.datagrid function to generate them for us. Let’s explore this.

Generating inputs to get the right outputs#

me.datagrid allows us to specify, in a simple way, the values of the predictors we want the model’s outputs for. The first argument is the fitted model, and the remaining arguments are the names of the predictors and the values we’d like them to take, specified in a list (and the values must be of the same kind as those that went into the model).

So to generate a ‘datagrid’ (think - a set of questions we want to ask) for someone with and without children, we can do this:

# Demonstrate datagrid
inputs = me.datagrid(simple_model,
                     children=['yes', 'no'])

display(inputs.to_pandas())
children rownames affairs gender age yearsmarried religiousness education occupation rating
0 yes 1480 0 female 32.487521 8.177696 4 14 5 5
1 no 1480 0 female 32.487521 8.177696 4 14 5 5

This generates data, based on the original data the model saw, that varies only in the predictor we are interested in - the children column has a ‘yes’ row and a ‘no’ row, but everything else is held at a typical value (for example, age is set to its mean). We can then pass this to the me.predictions function via the newdata argument and obtain our conditional predictions!

# Obtain conditional predictions
cond_pred = me.predictions(simple_model, newdata=inputs)
display(cond_pred.to_pandas())
children rowid estimate std_error statistic p_value s_value conf_low conf_high rownames affairs gender age yearsmarried religiousness education occupation rating
0 yes 0 3.795349 0.052209 72.695638 0.0 inf 3.693022 3.897676 1480 0 female 32.487521 8.177696 4 14 5 5
1 no 1 4.274854 0.082790 51.634708 0.0 inf 4.112588 4.437120 1480 0 female 32.487521 8.177696 4 14 5 5

The estimate column reveals that those with children have lower predicted satisfaction than those without.

Comparing predictions#

We now see the predicted mean satisfaction for marriages with children is 3.79, and without, 4.27. We may now wish to know whether that difference is meaningful. We can compare those values with a hypothesis test, working out the difference between them and whether it is significant. This is achieved using the hypothesis argument, which accepts varied inputs. We will see some more complex ones later, but if we set it to the string 'pairwise', we obtain a test of the difference between those predictions, as it compares each row of the output to every other row!

# Difference between our predictions
me.predictions(simple_model, 
               newdata=inputs, 
               hypothesis='pairwise')
shape: (1, 8)
term           estimate   std_error  statistic  p_value    s_value    conf_low   conf_high
Row 1 - Row 2  -0.479505  0.097877   -4.899035  9.6308e-7  19.985837  -0.671341  -0.287669

Amazingly and beautifully, this estimate is equal to the children coefficient in the model! This is our test of interest, and we derived it entirely from the model - are those with children more satisfied in their marriage? No; if anything, they are significantly less satisfied.
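If you want to convince yourself of this, you can pull the coefficient straight out of the fitted statsmodels results and compare it to the pairwise estimate above. A quick sketch:

# Sanity check (a sketch): the pairwise estimate is just the children coefficient
print(simple_model.params['children[T.yes]'])  # ≈ -0.4795, matching the estimate above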

More complex models, the same approach#

While traditional approaches would require us to use our flow-chart to figure out which test to use, model-based inference makes tackling more complex problems easy.

Let’s extend the model to include the yearsmarried predictor. We will test whether the difference between those with and without children persists even when we adjust for how long they have been married.

# Build a new model
model2 = smf.ols('rating ~ children + scale(yearsmarried)', data=affairs).fit() # note 'scale'
model2.summary(slim=True)
OLS Regression Results
Dep. Variable: rating R-squared: 0.064
Model: OLS Adj. R-squared: 0.061
No. Observations: 601 F-statistic: 20.43
Covariance Type: nonrobust Prob (F-statistic): 2.63e-09
coef std err t P>|t| [0.025 0.975]
Intercept 4.0801 0.095 42.960 0.000 3.894 4.267
children[T.yes] -0.2073 0.118 -1.758 0.079 -0.439 0.024
scale(yearsmarried) -0.2144 0.053 -4.030 0.000 -0.319 -0.110


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

If we want to make marginal predictions - that is, predictions while we hold other variables constant - we have two choices: leave those variables out of the datagrid (they will then be held at a typical value, such as the mean) or fix them at a value of our choosing. Here we fix yearsmarried at 10 years and repeat the same procedure as before:

# Make a datagrid of what we want to know
new_datagrid = me.datagrid(model2, children=['yes', 'no'], yearsmarried=[10]) # yearsmarried held constant at 10
display('Predict this', new_datagrid)

# Model makes predictions on this
new_predictions = me.predictions(model2, newdata=new_datagrid)
display('Predictions', new_predictions)

# The contrast is done as before
contrast = me.predictions(model2, newdata=new_datagrid, hypothesis='pairwise')
display('The difference:', contrast)
'Predict this'
shape: (2, 10)
children  yearsmarried  rownames  affairs  gender  age        religiousness  education  occupation  rating
yes       10            1030      0        female  32.487521  4              14         5           5
no        10            1030      0        female  32.487521  4              14         5           5

'Predictions'
shape: (2, 18)
children  yearsmarried  rowid  estimate  std_error  statistic  p_value  s_value  conf_low  conf_high  rownames  affairs  gender  age        religiousness  education  occupation  rating
yes       10            0      3.802615  0.051589   73.710427  0.0      inf      3.701504  3.903727   1030      0        female  32.487521  4              14         5           5
no        10            1      4.009898  0.104915   38.220416  0.0      inf      3.804269  4.215528   1030      0        female  32.487521  4              14         5           5

'The difference:'
shape: (1, 8)
term           estimate   std_error  statistic  p_value   s_value   conf_low   conf_high
Row 1 - Row 2  -0.207283  0.117922   -1.757793  0.078783  3.665976  -0.438407  0.02384

Exploring interactions with greater complexity#

A huge bonus of this approach is that it makes answering complex questions seamless. Let’s build a model that includes an interaction between children and gender, along with the yearsmarried covariate. That is - what’s the difference in marital satisfaction between men and women who do and don’t have children, controlling for how long they have been married?

This will introduce us to some subtle ways of working with me.predictions, and particularly with the hypothesis argument.

# Specify the model
model3 = smf.ols('rating ~ gender * children + scale(yearsmarried)', data=affairs).fit()
display(model3.summary(slim=True))
OLS Regression Results
Dep. Variable: rating R-squared: 0.070
Model: OLS Adj. R-squared: 0.063
No. Observations: 601 F-statistic: 11.16
Covariance Type: nonrobust Prob (F-statistic): 9.72e-09
coef std err t P>|t| [0.025 0.975]
Intercept 4.1985 0.120 35.013 0.000 3.963 4.434
gender[T.male] -0.2607 0.166 -1.572 0.116 -0.586 0.065
children[T.yes] -0.3860 0.150 -2.571 0.010 -0.681 -0.091
gender[T.male]:children[T.yes] 0.3751 0.196 1.917 0.056 -0.009 0.759
scale(yearsmarried) -0.2049 0.053 -3.840 0.000 -0.310 -0.100


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Build a datagrid to answer our question
datagrid3 = me.datagrid(model3, 
                        children=['yes', 'no'], 
                        gender=['male', 'female'])
display(datagrid3)
shape: (4, 10)
children  gender  rownames  affairs  age        yearsmarried  religiousness  education  occupation  rating
yes       male    1131      0        32.487521  8.177696      4              14         5           5
yes       female  1131      0        32.487521  8.177696      4              14         5           5
no        male    1131      0        32.487521  8.177696      4              14         5           5
no        female  1131      0        32.487521  8.177696      4              14         5           5
# Obtain predictions
predictions3 = me.predictions(model3, newdata=datagrid3)
display(predictions3)
shape: (4, 18)
children  gender  rowid  estimate  std_error  statistic  p_value  s_value  conf_low  conf_high  rownames  affairs  age        yearsmarried  religiousness  education  occupation  rating
yes       male    0      3.926821  0.074764   52.522926  0.0      inf      3.780287  4.073356   1131      0        32.487521  8.177696      4              14         5           5
yes       female  1      3.812441  0.075985   50.173871  0.0      inf      3.663514  3.961368   1131      0        32.487521  8.177696      4              14         5           5
no        male    2      3.937813  0.132491   29.721272  0.0      inf      3.678134  4.197491   1131      0        32.487521  8.177696      4              14         5           5
no        female  3      4.19849   0.119912   35.013104  0.0      inf      3.963467  4.433513   1131      0        32.487521  8.177696      4              14         5           5

We can make a plot of these if we like with seaborn.

# visualise
sns.lineplot(data=predictions3,
             x='children', y='estimate',
             hue='gender')
<Axes: xlabel='children', ylabel='estimate'>

The output of our predictions is essentially four rows, one for each combination of gender and child status (e.g. female with children, male without children, and so on).

We can ask for specific differences here by referring to the rows we want to compare and asking whether they are equal. The rows are labelled with the letter ‘b’ followed by their position, so b1 is the first row of the output, b2 the second, and so on. This is a little confusing at first, so let’s see how it works. Here are the predictions again:

# Predictions
me.predictions(model3, 
               newdata=datagrid3)
shape: (4, 18)
children  gender  rowid  estimate  std_error  statistic  p_value  s_value  conf_low  conf_high  rownames  affairs  age        yearsmarried  religiousness  education  occupation  rating
yes       male    0      3.926821  0.074764   52.522926  0.0      inf      3.780287  4.073356   1131      0        32.487521  8.177696      4              14         5           5
yes       female  1      3.812441  0.075985   50.173871  0.0      inf      3.663514  3.961368   1131      0        32.487521  8.177696      4              14         5           5
no        male    2      3.937813  0.132491   29.721272  0.0      inf      3.678134  4.197491   1131      0        32.487521  8.177696      4              14         5           5
no        female  3      4.19849   0.119912   35.013104  0.0      inf      3.963467  4.433513   1131      0        32.487521  8.177696      4              14         5           5

If I want to compare males with children (ROW 1) to males without children (ROW 3), then this is the way:

# A specific difference
me.predictions(model3, 
               newdata=datagrid3,
               hypothesis='b1 = b3')
shape: (1, 8)
term   estimate   std_error  statistic  p_value   s_value   conf_low   conf_high
b1=b3  -0.010991  0.156498   -0.070233  0.944008  0.083129  -0.317722  0.295739

Are there differences between females with and without children? That’s row 2 and row 4, so:

# Another
me.predictions(model3, 
               newdata=datagrid3,
               hypothesis='b2 = b4')
shape: (1, 8)
term   estimate   std_error  statistic  p_value   s_value   conf_low  conf_high
b2=b4  -0.386049  0.150131   -2.571419  0.010128  6.625468  -0.6803   -0.091798

You might also be interested in the difference between men and women, averaged over whether they have children or not. You can ask marginaleffects to do this by passing the variable you want to keep to the by keyword; the predictions are then averaged over everything else in the datagrid. For example, if I want to see the predictions for men and women averaged across their child status, this is the way:

# Averaging over something
avg_over = me.predictions(model3,
                          newdata=datagrid3,
                          by='gender')
display(avg_over)
shape: (2, 8)
gender  estimate  std_error  statistic  p_value  s_value  conf_low  conf_high
female  4.005465  0.066644   60.102161  0.0      inf      3.874845  4.136086
male    3.932317  0.073817   53.271453  0.0      inf      3.787639  4.076995

Testing differences between those predictions proceeds as normal:

# Differences
me.predictions(model3,
               newdata=datagrid3,
               by='gender',
               hypothesis='b1=b2') # 'pairwise' also works here
shape: (1, 8)
term   estimate  std_error  statistic  p_value   s_value   conf_low   conf_high
b1=b2  0.073148  0.097444   0.750667   0.452853  1.142885  -0.117839  0.264136

Exploring more interactions - categorical and continuous#

What if we have categorical and continuous predictors interacting, and want to use our marginal effects to understand what is going on?

When we are working with continuous variables, we have two options - the comparison approach we used above, or the simple slopes approach. Let’s first explore the comparison approach.

Exploring more interactions - categorical and continuous - comparisons#

When working with continuous variables and predictions, we need to provide some candidate values across the range of the predictor for the model to work with. Let’s see how this works by fitting a model that predicts marital satisfaction from the interaction of gender and years married. That is, how does satisfaction change for females and males over the course of a marriage?

First we fit the model:

# New model
model4 = smf.ols('rating ~ gender * yearsmarried', data=affairs).fit()
model4.summary(slim=True)
OLS Regression Results
Dep. Variable: rating R-squared: 0.066
Model: OLS Adj. R-squared: 0.061
No. Observations: 601 F-statistic: 14.02
Covariance Type: nonrobust Prob (F-statistic): 7.68e-09
coef std err t P>|t| [0.025 0.975]
Intercept 4.4470 0.105 42.374 0.000 4.241 4.653
gender[T.male] -0.2668 0.156 -1.715 0.087 -0.572 0.039
yearsmarried -0.0633 0.011 -5.902 0.000 -0.084 -0.042
gender[T.male]:yearsmarried 0.0325 0.016 2.069 0.039 0.002 0.063


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We need to provide some values of yearsmarried for our model to predict at. What should we choose? There’s no right or wrong answer - we might have some specific values in mind (e.g., 6 months, 2 years, 10 years, and 20 years), or we could use something like the quartiles of the yearsmarried variable. Pandas can calculate those for us easily. Here we collect the 25th, 50th, and 75th percentiles of years married:

# Use pandas to get the quartiles
quartiles = affairs['yearsmarried'].quantile([.25, .5, .75])
display(quartiles)
0.25     4.0
0.50     7.0
0.75    15.0
Name: yearsmarried, dtype: float64

And now we pass those back into the datagrid approach, also asking for predictions for males and females:

# New datagrid
datagrid4 = me.datagrid(model4, 
                        gender=['male', 'female'],
                        yearsmarried=quartiles) # can pass right in

display(datagrid4)
shape: (6, 10)
gender  yearsmarried  rownames  affairs  age        children  religiousness  education  occupation  rating
male    4.0           1482      0        32.487521  yes       4              14         5           5
male    7.0           1482      0        32.487521  yes       4              14         5           5
male    15.0          1482      0        32.487521  yes       4              14         5           5
female  4.0           1482      0        32.487521  yes       4              14         5           5
female  7.0           1482      0        32.487521  yes       4              14         5           5
female  15.0          1482      0        32.487521  yes       4              14         5           5

And now we predict:

# Predict
predictions4 = me.predictions(model4, newdata=datagrid4)
display(predictions4)
shape: (6, 18)
gender  yearsmarried  rowid  estimate  std_error  statistic  p_value  s_value  conf_low  conf_high  rownames  affairs  age        children  religiousness  education  occupation  rating
male    4.0           0      4.057067  0.080599   50.336419  0.0      inf      3.899096  4.215038   1482      0        32.487521  yes       4              14         5           5
male    7.0           1      3.964758  0.065094   60.90815   0.0      inf      3.837176  4.09234    1482      0        32.487521  yes       4              14         5           5
male    15.0          2      3.7186    0.099094   37.525923  0.0      inf      3.524379  3.912821   1482      0        32.487521  yes       4              14         5           5
female  4.0           3      4.193857  0.074039   56.643673  0.0      inf      4.048742  4.338971   1482      0        32.487521  yes       4              14         5           5
female  7.0           4      4.004036  0.061207   65.418089  0.0      inf      3.884073  4.123999   1482      0        32.487521  yes       4              14         5           5
female  15.0          5      3.497848  0.096078   36.406417  0.0      inf      3.309539  3.686157   1482      0        32.487521  yes       4              14         5           5

We may choose to plot this for interpretability:

# Plot
sns.lineplot(data=predictions4,
             x='yearsmarried', y='estimate', 
             hue='gender')
<Axes: xlabel='yearsmarried', ylabel='estimate'>

Interpreting the comparisons now proceeds as before. Depending on our questions, we might want to know the differences between men and women at each time point. We do this simply by specifying our hypotheses. Once again here are the predictions:

# Predictions
display(predictions4)
shape: (6, 18)
gender  yearsmarried  rowid  estimate  std_error  statistic  p_value  s_value  conf_low  conf_high  rownames  affairs  age        children  religiousness  education  occupation  rating
male    4.0           0      4.057067  0.080599   50.336419  0.0      inf      3.899096  4.215038   1482      0        32.487521  yes       4              14         5           5
male    7.0           1      3.964758  0.065094   60.90815   0.0      inf      3.837176  4.09234    1482      0        32.487521  yes       4              14         5           5
male    15.0          2      3.7186    0.099094   37.525923  0.0      inf      3.524379  3.912821   1482      0        32.487521  yes       4              14         5           5
female  4.0           3      4.193857  0.074039   56.643673  0.0      inf      4.048742  4.338971   1482      0        32.487521  yes       4              14         5           5
female  7.0           4      4.004036  0.061207   65.418089  0.0      inf      3.884073  4.123999   1482      0        32.487521  yes       4              14         5           5
female  15.0          5      3.497848  0.096078   36.406417  0.0      inf      3.309539  3.686157   1482      0        32.487521  yes       4              14         5           5

There is no especially neat way to get all of these comparisons in one go. We can test them one at a time, e.g.:

me.predictions(model4, newdata=datagrid4, hypothesis='b1=b4')

or we can ask for all the pairwise comparisons and recall which rows we wanted the differences of (e.g. Row 1 - Row 4 compares men and women at 4 years of marriage), picking out the ones we care about afterwards (see the sketch after the table below):

# Clunky comparisons
me.predictions(model4, 
               newdata=datagrid4,
               hypothesis='pairwise').to_pandas()
term estimate std_error statistic p_value s_value conf_low conf_high
0 Row 1 - Row 2 0.092309 0.034453 2.679272 7.378243e-03 7.082507 0.024782 0.159836
1 Row 1 - Row 3 0.338467 0.126328 2.679272 7.378245e-03 7.082507 0.090869 0.586065
2 Row 1 - Row 4 -0.136790 0.109444 -1.249861 2.113505e-01 2.242291 -0.351296 0.077717
3 Row 1 - Row 5 0.053031 0.101205 0.523992 6.002839e-01 0.736283 -0.145328 0.251389
4 Row 1 - Row 6 0.559219 0.125408 4.459201 8.226583e-06 16.891275 0.313424 0.805014
5 Row 2 - Row 3 0.246158 0.091875 2.679272 7.378245e-03 7.082507 0.066086 0.426229
6 Row 2 - Row 4 -0.229099 0.098585 -2.323867 2.013263e-02 5.634321 -0.422323 -0.035875
7 Row 2 - Row 5 -0.039278 0.089351 -0.439598 6.602280e-01 0.598964 -0.214402 0.135845
8 Row 2 - Row 6 0.466910 0.116052 4.023267 5.739648e-05 14.088678 0.239451 0.694369
9 Row 3 - Row 4 -0.475257 0.123699 -3.842037 1.220176e-04 13.000623 -0.717702 -0.232811
10 Row 3 - Row 5 -0.285436 0.116473 -2.450664 1.425932e-02 6.131952 -0.513719 -0.057153
11 Row 3 - Row 6 0.220752 0.138024 1.599379 1.097365e-01 3.187885 -0.049769 0.491274
12 Row 4 - Row 5 0.189821 0.032160 5.902403 3.582456e-09 28.056404 0.126788 0.252853
13 Row 4 - Row 6 0.696009 0.117920 5.902403 3.582456e-09 28.056404 0.464891 0.927127
14 Row 5 - Row 6 0.506188 0.085760 5.902403 3.582456e-09 28.056404 0.338102 0.674274
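If we only care about the male-versus-female gap at each length of marriage, we can filter the pairwise output down to those three contrasts (rows 1 vs 4, 2 vs 5, and 3 vs 6 of the datagrid). A sketch using plain pandas:

# A sketch: keep only the male-vs-female contrasts at 4, 7 and 15 years
all_pairs = me.predictions(model4,
                           newdata=datagrid4,
                           hypothesis='pairwise').to_pandas()
keep = ['Row 1 - Row 4', 'Row 2 - Row 5', 'Row 3 - Row 6']
gender_gaps = all_pairs[all_pairs['term'].isin(keep)]
display(gender_gaps)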

This sort of approach is sometimes useful, but a cleaner technique is not to dwell on the individual predictions, but rather to look at the simple slopes!

Exploring more interactions - categorical and continuous - slopes#

When dealing with situations like this, it’s almost always better to focus instead on the simple slopes. These represent, quite literally, the slope of the continuous variable at a given level of another variable, which might be continuous or categorical itself. It’s akin to asking this question:

  • How much does Y change with a 1-unit increase in X, when Z is fixed to a specific category? Or, here:

  • How much does marital satisfaction change when years married increases by 1 year, when gender is fixed to male or female?

This is a much cleaner approach, as it removes the need to make lots of comparisons and instead directly examines the association between a predictor and the dependent variable while another variable is fixed at a specific value. Let’s see how this works, using the me.slopes function. First, examining the fitted lines visually makes it clear something is going on:

# Plot
sns.lineplot(data=predictions4,
             x='yearsmarried', y='estimate', 
             hue='gender')
<Axes: xlabel='yearsmarried', ylabel='estimate'>

Let’s use me.slopes to work those out. Notice that here we don’t specify a new datagrid; we simply say which variable to compute the slopes for (yearsmarried in this case), and which variable to compute them over (gender).

# slopes
slopes = me.slopes(model4, 
                   variables='yearsmarried', 
                   by='gender')
display(slopes)
shape: (2, 10)
gender  term          contrast     estimate   std_error  statistic  p_value    s_value    conf_low   conf_high
female  yearsmarried  mean(dY/dX)  -0.063274  0.010686   -5.921164  3.1967e-9  28.220764  -0.084218  -0.042329
male    yearsmarried  mean(dY/dX)  -0.03077   0.011479   -2.680526  0.007351   7.087913   -0.053268  -0.008271

Aligned with the figure, females show a steeper decline than males (a more negative slope). We can ask whether there is a difference between those slopes in the usual way, e.g.

# Test for differences therein
me.slopes(model4, 
          variables='yearsmarried', 
          by='gender',
          hypothesis='pairwise')
shape: (1, 8)
term           estimate   std_error  statistic  p_value   s_value   conf_low   conf_high
Row 1 - Row 2  -0.032504  0.015699   -2.070494  0.038406  4.702519  -0.063273  -0.001735

Exploring more interactions - continuous and continuous!#

Let us now consider a model with an interaction between two continuous predictors, and how marginal predictions can help us interpret it. Here, we build a model that predicts marital satisfaction from the number of years married and education. We are interested in whether those with more education and longer marriages are more satisfied than those with less education.

First we fit the model:

# Model 5
model5 = smf.ols('rating ~ yearsmarried * education', data=affairs).fit()
model5.summary(slim=True)
OLS Regression Results
Dep. Variable: rating R-squared: 0.097
Model: OLS Adj. R-squared: 0.092
No. Observations: 601 F-statistic: 21.28
Covariance Type: nonrobust Prob (F-statistic): 4.17e-13
coef std err t P>|t| [0.025 0.975]
Intercept 5.4446 0.589 9.246 0.000 4.288 6.601
yearsmarried -0.2574 0.054 -4.798 0.000 -0.363 -0.152
education -0.0700 0.036 -1.919 0.056 -0.142 0.002
yearsmarried:education 0.0130 0.003 3.924 0.000 0.006 0.019


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.26e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

How should we proceed in terms of inference, given we can see the interaction effect is significant?

Much like before, we can select some points on which to compare our variables, and try to interpret the differences between them.

Here, I’ve chosen to predict for marriages of lengths 1, 5, 10, and 20 years, and educational attainment of 10, 12, 18, and 20.

# Create a datagrid
datagrid5 = me.datagrid(model5,
                        yearsmarried=[1, 5, 10, 20],
                        education=[10, 12, 18, 20])

# Generate predictions
predictions5 = me.predictions(model5, newdata=datagrid5)

Plotting these predictions will help visualise what’s going on.

# Plot
sns.lineplot(data=predictions5, 
             x='yearsmarried', y='estimate',
             hue='education')
<Axes: xlabel='yearsmarried', ylabel='estimate'>

It is difficult to know where to begin! It is often simpler in these instances to pick one or two points and test differences.

Let’s say that we want to know whether the difference between educational attainment values of 12 and 18 is significant at 10 years of marriage, and again at 20 years. That is achievable:

# Sensible predictions
display(me.predictions(model5, 
                       newdata=me.datagrid(model5, yearsmarried=[10, 20], education=[12, 18])
                      )
       )
shape: (4, 18)
yearsmarried  education  rowid  estimate  std_error  statistic  p_value  s_value  conf_low  conf_high  rownames  affairs  gender  age        children  religiousness  occupation  rating
10            12         0      3.588739  0.087801   40.873459  0.0      inf      3.416652  3.760827   1138      0        female  32.487521  yes       4              5           5
10            18         1      3.947971  0.055455   71.192319  0.0      inf      3.839281  4.056661   1138      0        female  32.487521  yes       4              5           5
20            12         2      2.572324  0.189634   13.564647  0.0      inf      2.200648  2.944001   1138      0        female  32.487521  yes       4              5           5
20            18         3      3.71052   0.123796   29.972765  0.0      inf      3.467884  3.953156   1138      0        female  32.487521  yes       4              5           5

Now we repeat the predictions for 10 years of marriage only, and actually test the difference between rows 1 and 2 (education 12 versus 18):

# Testing one specific hypothesis
me.predictions(model5, 
               newdata=me.datagrid(model5, yearsmarried=[10], education=[12, 18]),
               hypothesis='b2=b1')
shape: (1, 8)
term   estimate  std_error  statistic  p_value   s_value    conf_low  conf_high
b2=b1  0.359231  0.107544   3.340332   0.000837  10.222862  0.14845   0.570013
# And repeat the test at 20 years of marriage
me.predictions(model5, 
               newdata=me.datagrid(model5, yearsmarried=[20], education=[12, 18]),
               hypothesis='b2=b1')
shape: (1, 8)
term   estimate  std_error  statistic  p_value    s_value    conf_low  conf_high
b2=b1  1.138196  0.232581   4.893761   9.8927e-7  19.947132  0.682345  1.594046

Clearly a bigger difference at 20 years between the low and high education groups…

But what if you want to know whether the difference between the education values at 20 years is significantly bigger than the difference at 10 years? That is a more complex question - a difference of differences - but you can ask it by writing that calculation out explicitly in the hypothesis argument, applied to the full set of predictions. Let’s see all the necessary predictions again first:

# These are the predictions
me.predictions(model5, 
               newdata=me.datagrid(model5,
                                   yearsmarried=[10, 20],
                                   education=[12, 18])
              )
shape: (4, 18)
yearsmarried  education  rowid  estimate  std_error  statistic  p_value  s_value  conf_low  conf_high  rownames  affairs  gender  age        children  religiousness  occupation  rating
10            12         0      3.588739  0.087801   40.873459  0.0      inf      3.416652  3.760827   490       0        female  32.487521  yes       4              5           5
10            18         1      3.947971  0.055455   71.192319  0.0      inf      3.839281  4.056661   490       0        female  32.487521  yes       4              5           5
20            12         2      2.572324  0.189634   13.564647  0.0      inf      2.200648  2.944001   490       0        female  32.487521  yes       4              5           5
20            18         3      3.71052   0.123796   29.972765  0.0      inf      3.467884  3.953156   490       0        female  32.487521  yes       4              5           5

We specifically want to know whether the difference between rows 3 and 4 (years married = 20, education 12 and 18) is bigger than the difference between rows 1 and 2 (years married = 10, education 12 and 18). We can do that by literally translating that query into a hypothesis.

# Answer the question!
me.predictions(model5, 
               newdata=me.datagrid(model5,
                                   yearsmarried=[10, 20],
                                   education=[12, 18]),
               hypothesis='b4-b3 = b2-b1'
              )
shape: (1, 8)
term         estimate  std_error  statistic  p_value   s_value    conf_low  conf_high
b4-b3=b2-b1  0.778964  0.198529   3.923689   0.000087  13.485259  0.389855  1.168073

This is indeed significant, and gives us a closer insight into what the model implies.
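As a sanity check, this ‘difference of differences’ maps directly onto the interaction coefficient: it is 10 years (20 minus 10) multiplied by 6 education units (18 minus 12) multiplied by the yearsmarried:education coefficient. A quick sketch:

# Sanity check (a sketch): the difference-of-differences comes from the interaction term
print(10 * 6 * model5.params['yearsmarried:education'])  # ≈ 0.779, matching the estimate above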

Of course, another approach is to compare slopes. We saw how to do this earlier in a continuous by categorical interaction situation, but here, everything is continuous.

The solution is essentially a combination of using a datagrid and the slopes command. We pick some candidate values for our predictors, generate a datagrid, and then compute the slopes for one over the other variable.

To be explicit, first we will plot the predictions whose slopes we want. We will take the slopes for three levels of education - 9, 14, and 20; call these low, medium, and high. We compute them across a range of years-married values, since the slope within each education level will be averaged over those values. This looks like the below:

# Get our datagrid
slopes_datagrid = me.datagrid(model5, 
                              yearsmarried=[0.5, 1, 2, 3, 5, 10, 20],
                              education=[9, 14, 20])

# Predict them
p = me.predictions(model5, newdata=slopes_datagrid)

# Plot them
sns.lineplot(data=p,
             x='yearsmarried', y='estimate',
             hue='education')
<Axes: xlabel='yearsmarried', ylabel='estimate'>

What do we expect? The slope for an educational attainment of 20 looks basically zero - a flat line! For 14, it is negative but not too steep, and for 9, it is very steep and negative. So let’s obtain those three values and see how they look, which we do with a combination of our existing datagrid and the variables/by arguments you’ve already seen:

# Slopes for ultimate insight
final_slopes = me.slopes(model5,
                         newdata=slopes_datagrid,
                         variables='yearsmarried', # slope with respect to this
                         by='education') # one averaged slope per education level
final_slopes
shape: (3, 10)
education  term          contrast     estimate   std_error  statistic  p_value     s_value    conf_low   conf_high
9          yearsmarried  mean(dY/dX)  -0.14059   0.024533   -5.730699  1.0002e-8   26.57517   -0.188673  -0.092506
14         yearsmarried  mean(dY/dX)  -0.075676  0.010247   -7.385323  1.5210e-13  42.58004   -0.095759  -0.055593
20         yearsmarried  mean(dY/dX)  0.00222    0.015143   0.146624   0.883429    0.178814   -0.02746   0.031901

The slope for an educational attainment of 20 is not significantly different from zero. As we predicted, for 14 it’s negative, and for 9, it’s much steeper. We can in fact compare those using hypothesis - e.g., does the jump from an education of 9 to 14 really slow the decline in marital satisfaction?

# Yes!
me.slopes(model5,
          newdata=slopes_datagrid,
          variables='yearsmarried', # slope with respect to this
          by='education', # one averaged slope per education level
          hypothesis='b2 = b1')
shape: (1, 8)
term   estimate  std_error  statistic  p_value   s_value    conf_low  conf_high
b2=b1  0.064914  0.016538   3.925054   0.000087  13.493439  0.032499  0.097328

Answering bespoke questions#

You have seen so far how the model-based approach can answer more complex questions than are usually posed and answered in psychology. We can now turn to an advanced use case, which is to use models to answer questions about individuals.

Imagine a female friend who has been married to her partner for 10 years, has a high level of educational attainment, and is aged 32. She and her partner are considering having a child, and she wants to know how this might change her marital satisfaction. How can we help? Traditional statistics in psychology cannot, despite this being a deeply psychological question. Our model can answer it for us. All we need do is build a model that includes the interactions of these variables, and have it predict two things.

The first is our friend’s characteristics exactly as they are now. The second is our friend’s altered characteristics - that is, after she has a child.

Then we can work out the difference and provide an answer. Let’s do this, using the exact same logic we have already seen.

First, the model.

# Fit the model with all variables in it
individual_mod = smf.ols('rating ~ gender * scale(yearsmarried) * scale(age) * scale(education) * children', data=affairs).fit()

# Create a datagrid that fits the profile
profile = me.datagrid(individual_mod,
                      gender='female',
                      yearsmarried=10,
                      age=32,
                      education=20,
                      children=['no', 'yes']
                     )

# Show the profile
display(profile)
shape: (2, 10)
gender  yearsmarried  age  education  children  rownames  affairs  religiousness  occupation  rating
female  10            32   20         no        1743      0        4              5           5
female  10            32   20         yes       1743      0        4              5           5

All we now need do is recover these predictions and examine the difference.

# Make the predictions
me.predictions(individual_mod, newdata=profile)
shape: (2, 18)
gender  yearsmarried  age  education  children  rowid  estimate  std_error  statistic  p_value   s_value    conf_low  conf_high  rownames  affairs  religiousness  occupation  rating
female  10            32   20         no        0      3.233479  0.726611   4.450081   0.000009  16.829955  1.809347  4.65761    1743      0        4              5           5
female  10            32   20         yes       1      4.414754  0.284761   15.503391  0.0       inf        3.856634  4.972875   1743      0        4              5           5

And the contrast:

me.predictions(individual_mod, newdata=profile, hypothesis='b2=b1')
shape: (1, 8)
term   estimate  std_error  statistic  p_value   s_value   conf_low   conf_high
b2=b1  1.181275  0.780418   1.513644   0.130116  2.942129  -0.348316  2.710867

According to the model, while our friend would see a rather large jump in her satisfaction of about 1.18 units, this estimate is very uncertain, and the difference is not statistically significant.
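If we want to emphasise just how uncertain that answer is, we can pull the confidence interval out of the contrast directly. A small sketch:

# A sketch: extract the estimate and its 95% interval from the contrast
contrast = me.predictions(individual_mod, newdata=profile,
                          hypothesis='b2=b1').to_pandas()
print(contrast[['estimate', 'conf_low', 'conf_high', 'p_value']])
# the interval runs from roughly -0.35 to +2.71 units of satisfaction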