Fitting latent variables onto data - Exercises & Answers#

1. Grit revisited - are we sure it's there?#

Armed with new knowledge of CFA, we can now examine directly whether the two-factor structure claimed in the research literature exists in the data we have to hand. While our use of EFA explored the latent variables in that dataset indirectly, with CFA we explicitly test for the presence of two factors, specifying which questions load onto each.

First, let's re-obtain the dataset that contains the grit questionnaire.

Download the data from this link: https://openpsychometrics.org/_rawdata/duckworth-grit-scale-data.zip

You will need to unzip it and grab the data.csv file.
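If you would rather script the download and extraction, a minimal sketch using only the standard library is below (assuming the URL above is still live; the zip may place data.csv inside a subfolder, so check where it ends up):

# Download and unzip the grit data (a sketch - the zip may extract data.csv into a subfolder)
import urllib.request
import zipfile

url = 'https://openpsychometrics.org/_rawdata/duckworth-grit-scale-data.zip'
urllib.request.urlretrieve(url, 'duckworth-grit-scale-data.zip')

with zipfile.ZipFile('duckworth-grit-scale-data.zip') as zf:
    zf.extractall('.')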

Import everything we need first, adding semopy to the list of packages we will use.

Important#

After importing everything, run np.random.seed(36) just underneath your imports. This ensures the results will match when you run this code elsewhere.

Hide code cell source
# Your answer here
# Import what we need
import pandas as pd # dataframes
import seaborn as sns # plots
import statsmodels.formula.api as smf # Models
import marginaleffects as me # marginal effects
import numpy as np # numpy for some functions
import pingouin as pg
from factor_analyzer import FactorAnalyzer # Note we write from factor_analyzer
from horns import parallel_analysis
import semopy as sem # semopy imported here

np.random.seed(36)

Read the data into a dataframe called grit, specifying the separator as '\t' (sep='\t'), as before.

Hide code cell source
# Your answer here
# Read in
grit = pd.read_csv('data.csv', sep='\t')
grit.head(10)
country surveyelapse GS1 GS2 GS3 GS4 GS5 GS6 GS7 GS8 ... O7 O8 O9 O10 operatingsystem browser screenw screenh introelapse testelapse
0 RO 174 1 1 3 3 3 2 3 1 ... 5 4 5 4 Windows Chrome 1366 768 69590 307
1 US 120 2 2 3 3 2 1 3 3 ... 4 3 4 5 Macintosh Chrome 1280 800 33657 134
2 US 99 3 3 3 3 4 3 4 4 ... 5 5 4 4 Windows Firefox 1920 1080 95550 138
3 KE 5098 1 3 4 2 4 1 5 4 ... 4 2 5 4 Windows Chrome 1600 900 4 4440
4 JP 340 1 2 3 3 2 2 2 4 ... 4 1 3 2 Windows Firefox 1920 1080 3 337
5 AU 515 1 2 5 1 3 1 4 5 ... 5 2 5 5 Windows Chrome 1920 1080 2090 554
6 US 126 2 1 3 4 1 1 1 1 ... 5 5 5 5 Windows Chrome 1366 768 36 212
7 RO 208 3 1 1 4 1 3 4 4 ... 5 3 4 3 Windows Chrome 1366 768 6 207
8 EU 130 1 3 3 1 4 1 5 4 ... 5 1 4 5 Windows Microsoft Internet Explorer 1600 1000 14 183
9 NZ 129 2 3 2 2 4 2 4 4 ... 4 3 4 4 Macintosh Chrome 1440 900 68 143

10 rows × 98 columns

As before, get the grit-related columns by keeping only the columns with 'GS' in their names. Store them in a dataframe called grit2.

Hide code cell source
# Your answer here
grit2 = grit.filter(regex=r'GS\d+')
grit2.head()
GS1 GS2 GS3 GS4 GS5 GS6 GS7 GS8 GS9 GS10 GS11 GS12
0 1 1 3 3 3 2 3 1 3 2 3 3
1 2 2 3 3 2 1 3 3 2 1 3 2
2 3 3 3 3 4 3 4 4 3 3 3 3
3 1 3 4 2 4 1 5 4 1 1 3 1
4 1 2 3 3 2 2 2 4 3 3 4 4

Building a latent factor model of grit#

We take the model directly from the original Duckworth et al. (2007) paper that introduced the grit scale. The questions loading onto the two latent grit factors are as follows:

  • Consistency in interest

    • GS2

    • GS3

    • GS5

    • GS7

    • GS8

    • GS11

  • Perseverance in effort

    • GS1

    • GS4

    • GS6

    • GS9

    • GS10

    • GS12

With that in mind, create a CFA model that tests whether the two latent variables (consistency and perseverance) are measured by the questionnaire variables listed above. Create and fit it in semopy.

Hide code cell source
# Your answer here
# Model string
mdspec = """
consistency =~ GS2 + GS3 + GS5 + GS7 + GS8 + GS11
perseverance =~ GS1 + GS4 + GS6 + GS9 + GS10 + GS12
"""

# Create model
model = sem.Model(mdspec)

# Fit it
model.fit(grit2)
SolverResult(fun=0.3782276478265812, success=True, n_it=24, x=array([ 1.122537  ,  1.36719906,  1.39375519,  1.44706539,  0.80596871,
        0.85883226,  1.32252661,  1.43824065,  1.57281968,  1.28650847,
        0.68358768,  1.09217894,  0.98222584,  0.57969277,  0.8487556 ,
        0.96281907,  1.19213773,  0.82104992,  0.57220757,  0.61039529,
        0.78652695,  0.66075116,  0.45345085, -0.2184887 ,  0.29510074]), message='Optimization terminated successfully', name_method='SLSQP', name_obj='MLW')

If you have fitted the model successfully, inspect the standardised loadings. Do they appear significant and sensible?

Hide code cell source
# Your answer here
model.inspect(std_est=True)
lval op rval Estimate Est. Std Std. Err z-value p-value
0 GS2 ~ consistency 1.000000 0.590099 - - -
1 GS3 ~ consistency 1.122537 0.610272 0.035904 31.264817 0.0
2 GS5 ~ consistency 1.367199 0.712711 0.039375 34.72214 0.0
3 GS7 ~ consistency 1.393755 0.768558 0.038401 36.295186 0.0
4 GS8 ~ consistency 1.447065 0.739557 0.040748 35.512611 0.0
5 GS11 ~ consistency 0.805969 0.480314 0.030976 26.019401 0.0
6 GS1 ~ perseverance 1.000000 0.549115 - - -
7 GS4 ~ perseverance 0.858832 0.392929 0.041181 20.85489 0.0
8 GS6 ~ perseverance 1.322527 0.688657 0.042987 30.765874 0.0
9 GS9 ~ perseverance 1.438241 0.692969 0.046594 30.867441 0.0
10 GS10 ~ perseverance 1.572820 0.632947 0.053624 29.330766 0.0
11 GS12 ~ perseverance 1.286508 0.676221 0.042229 30.464843 0.0
12 consistency ~~ consistency 0.453451 1.000000 0.023116 19.616305 0.0
13 consistency ~~ perseverance -0.218489 -0.597281 0.010538 -20.732445 0.0
14 perseverance ~~ perseverance 0.295101 1.000000 0.016762 17.605834 0.0
15 GS1 ~~ GS1 0.683588 0.698473 0.016434 41.597162 0.0
16 GS10 ~~ GS10 1.092179 0.599378 0.027985 39.027627 0.0
17 GS11 ~~ GS11 0.982226 0.769299 0.02249 43.672973 0.0
18 GS12 ~~ GS12 0.579693 0.542725 0.015614 37.126657 0.0
19 GS2 ~~ GS2 0.848756 0.651783 0.02037 41.667457 0.0
20 GS3 ~~ GS3 0.962819 0.627568 0.023394 41.157212 0.0
21 GS4 ~~ GS4 1.192138 0.845607 0.026929 44.269472 0.0
22 GS5 ~~ GS5 0.821050 0.492043 0.021991 37.336067 0.0
23 GS6 ~~ GS6 0.572208 0.525751 0.015686 36.478381 0.0
24 GS7 ~~ GS7 0.610395 0.409318 0.018072 33.77512 0.0
25 GS8 ~~ GS8 0.786527 0.453055 0.021966 35.806484 0.0
26 GS9 ~~ GS9 0.660751 0.519794 0.018232 36.241337 0.0

Finally, check the fit statistics. Is this model any good? Does it describe the data well?

Hide code cell source
# Your answer here
sem.calc_stats(model)
DoF DoF Baseline chi2 chi2 p-value chi2 Baseline CFI GFI AGFI NFI TLI RMSEA AIC BIC LogLik
Value 53 66 1615.032056 0.0 15958.962524 0.901715 0.898801 0.873979 0.898801 0.877608 0.083089 49.243545 208.227772 0.378228

What do the guidelines suggest we should make of this model? Do the statistics suggest it fits the data well?

Hide code cell source
# Your answer here
# Not fully. It's close in many regards but only just squeaks across the line -
# CFI and TLI fall short of the usual .95 guideline, and RMSEA is above .06.
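If you want to make the comparison explicit, here is a small sketch that pulls the indices out of calc_stats and checks them against commonly cited guidelines (the .95 and .06 thresholds below are conventions, not hard rules):

# Compare the fit indices against conventional guideline values
stats = sem.calc_stats(model)
cfi = stats.loc['Value', 'CFI']
tli = stats.loc['Value', 'TLI']
rmsea = stats.loc['Value', 'RMSEA']
print(f'CFI   = {cfi:.3f} (guideline: >= .95)')
print(f'TLI   = {tli:.3f} (guideline: >= .95)')
print(f'RMSEA = {rmsea:.3f} (guideline: <= .06)')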

2. Grit and conscientiousness#

Let us now expand the use of CFA and dip into the idea of SEM a little. Here we'll build on our previous exercise, where we conducted an EFA on scores from the Big 5 trait Conscientiousness alongside the grit scale. In the following we'll specify our grit model exactly as before, but also include the Conscientiousness data and its associated questions. As such, we'll fit three latent variables (the two grit-related ones from the grit questionnaire, and one for Conscientiousness).

First, get the right questions out of the main grit dataframe and store them in a dataframe called grit_consc. It should include all the GS questions and all the C questions.

Hide code cell source
# Your answer here
# Get grit and conscientiousness
grit_consc_names = ['GS1', 'GS2', 'GS3', 'GS4', 'GS5', 'GS6', 'GS7', 'GS8', 'GS9', 'GS10', 'GS11', 'GS12',
                    'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'C10']

# Extract them
grit_consc = grit[grit_consc_names]

# Show 
grit_consc.head()
GS1 GS2 GS3 GS4 GS5 GS6 GS7 GS8 GS9 GS10 ... C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
0 1 1 3 3 3 2 3 1 3 2 ... 2 4 4 3 2 4 3 2 2 4
1 2 2 3 3 2 1 3 3 2 1 ... 4 3 4 3 1 3 5 2 5 3
2 3 3 3 3 4 3 4 4 3 3 ... 2 2 4 2 3 4 5 3 3 4
3 1 3 4 2 4 1 5 4 1 1 ... 4 1 5 1 4 1 4 1 4 3
4 1 2 3 3 2 2 2 4 3 3 ... 3 1 3 1 4 2 3 2 3 4

5 rows × 22 columns
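As an aside, if you would rather not type out the column names, the same selection can be sketched with a regex filter (this assumes the only columns matching these patterns are the grit and conscientiousness items):

# Alternative: select the GS and C columns with a regex
# (assumes no other columns look like 'GS<number>' or 'C<number>')
grit_consc = grit.filter(regex=r'^(GS|C)\d+$')
grit_consc.head()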

Set up a CFA model in which the two grit latent variables are captured exactly like in the last exercise, and the conscientiousness variable is captured by its own questions.

Hide code cell source
# Your answer here
# Model string
mdspec = """
consistency =~ GS2 + GS3 + GS5 + GS7 + GS8 + GS11
perseverance =~ GS1 + GS4 + GS6 + GS9 + GS10 + GS12
conscientious =~ C1 + C2 + C3 + C4 + C5 + C6 + C7 + C8 + C9 + C10
"""

# Create model
model = sem.Model(mdspec)

# Fit it
model.fit(grit_consc)
SolverResult(fun=0.9710712086824149, success=True, n_it=48, x=array([ 1.09283328,  1.33418234,  1.36291197,  1.42914314,  0.78710305,
        0.85078183,  1.39157688,  1.49479537,  1.55776988,  1.34527766,
       -0.94991487,  0.48338527, -1.02911509,  1.07204046, -1.08695932,
        0.76819775, -0.8109484 ,  1.0327781 ,  0.65677164,  0.75019843,
        0.79143721,  1.38136428,  0.83958205,  1.02988488,  0.96542705,
        1.30328443,  0.96080557,  0.90506302,  0.95856894,  0.70078748,
        1.14495165,  0.98523927,  0.56304364,  0.83198401,  0.97233703,
        1.20771199,  0.83141956,  0.54803132,  0.61782435,  0.77576666,
        0.64744105,  0.56479771,  0.32254792, -0.29715799,  0.47034206,
       -0.21648202,  0.27902947]), message='Optimization terminated successfully', name_method='SLSQP', name_obj='MLW')

Inspect the standardised loadings once the model has been estimated.

Hide code cell source
# Your answer here
model.inspect(std_est=True)
lval op rval Estimate Est. Std Std. Err z-value p-value
0 GS2 ~ consistency 1.000000 0.600962 - - -
1 GS3 ~ consistency 1.092833 0.605118 0.034543 31.6372 0.0
2 GS5 ~ consistency 1.334182 0.708337 0.037735 35.35657 0.0
3 GS7 ~ consistency 1.362912 0.765356 0.03673 37.106282 0.0
4 GS8 ~ consistency 1.429143 0.743799 0.039179 36.476844 0.0
5 GS11 ~ consistency 0.787103 0.477756 0.029984 26.250568 0.0
6 GS1 ~ perseverance 1.000000 0.533645 - - -
7 GS4 ~ perseverance 0.850782 0.378515 0.041879 20.315416 0.0
8 GS6 ~ perseverance 1.391577 0.704603 0.044876 31.009702 0.0
9 GS9 ~ perseverance 1.494795 0.700406 0.048353 30.914318 0.0
10 GS10 ~ perseverance 1.557770 0.609603 0.054507 28.57938 0.0
11 GS12 ~ perseverance 1.345278 0.687617 0.04394 30.616507 0.0
12 C1 ~ conscientious 1.000000 0.655366 - - -
13 C2 ~ conscientious -0.949915 -0.519141 0.032092 -29.599427 0.0
14 C3 ~ conscientious 0.483385 0.368559 0.022368 21.61034 0.0
15 C4 ~ conscientious -1.029115 -0.606145 0.030385 -33.868641 0.0
16 C5 ~ conscientious 1.072040 0.634066 0.030477 35.175309 0.0
17 C6 ~ conscientious -1.086959 -0.581919 0.033231 -32.708767 0.0
18 C7 ~ conscientious 0.768198 0.507498 0.026483 29.007307 0.0
19 C8 ~ conscientious -0.810948 -0.539424 0.026485 -30.619698 0.0
20 C9 ~ conscientious 1.032778 0.621230 0.029867 34.578726 0.0
21 C10 ~ conscientious 0.656772 0.485152 0.023576 27.858042 0.0
22 conscientious ~~ conscientious 0.564798 1.000000 0.025115 22.488144 0.0
23 conscientious ~~ consistency 0.322548 0.625808 0.013958 23.108344 0.0
24 conscientious ~~ perseverance -0.297158 -0.748542 0.012763 -23.282097 0.0
25 consistency ~~ consistency 0.470342 1.000000 0.023376 20.121145 0.0
26 consistency ~~ perseverance -0.216482 -0.597572 0.010434 -20.746923 0.0
27 perseverance ~~ perseverance 0.279029 1.000000 0.016157 17.270103 0.0
28 C1 ~~ C1 0.750198 0.570495 0.018656 40.211662 0.0
29 C10 ~~ C10 0.791437 0.764628 0.018083 43.767014 0.0
30 C2 ~~ C2 1.381364 0.730493 0.031916 43.280672 0.0
31 C3 ~~ C3 0.839582 0.864165 0.018673 44.962816 0.0
32 C4 ~~ C4 1.029885 0.632588 0.024763 41.589541 0.0
33 C5 ~~ C5 0.965427 0.597961 0.02363 40.856799 0.0
34 C6 ~~ C6 1.303284 0.661370 0.030928 42.139464 0.0
35 C7 ~~ C7 0.960806 0.742446 0.02211 43.456143 0.0
36 C8 ~~ C8 0.905063 0.709022 0.021072 42.950379 0.0
37 C9 ~~ C9 0.958569 0.614074 0.023262 41.208169 0.0
38 GS1 ~~ GS1 0.700787 0.715223 0.016475 42.536573 0.0
39 GS10 ~~ GS10 1.144952 0.628384 0.02811 40.73177 0.0
40 GS11 ~~ GS11 0.985239 0.771750 0.022495 43.798237 0.0
41 GS12 ~~ GS12 0.563044 0.527183 0.014875 37.852765 0.0
42 GS2 ~~ GS2 0.831984 0.638845 0.020011 41.576408 0.0
43 GS3 ~~ GS3 0.972337 0.633833 0.023445 41.47376 0.0
44 GS4 ~~ GS4 1.207712 0.856727 0.027035 44.672902 0.0
45 GS5 ~~ GS5 0.831420 0.498259 0.021946 37.885028 0.0
46 GS6 ~~ GS6 0.548031 0.503535 0.014807 37.011208 0.0
47 GS7 ~~ GS7 0.617824 0.414231 0.017916 34.484275 0.0
48 GS8 ~~ GS8 0.775767 0.446763 0.021583 35.942844 0.0
49 GS9 ~~ GS9 0.647441 0.509432 0.017391 37.228265 0.0

Take a look at the model fit statistics now. What does this suggest to us?

Hide code cell source
# Your answer here
sem.calc_stats(model)
DoF DoF Baseline chi2 chi2 p-value chi2 Baseline CFI GFI AGFI NFI TLI RMSEA AIC BIC LogLik
Value 206 231 4146.474061 0.0 29702.88149 0.866297 0.860402 0.84346 0.860402 0.850071 0.066939 92.057858 390.948206 0.971071

What should we conclude? Do these three separate latent factors represent the data well?

Hide code cell source
# Your answer here
# Not quite! CFI and TLI drop to around .87 and .85, and although RMSEA improves
# slightly to about .067, the model still falls short of the usual guidelines.

3. Latent regressions#

For the final trick, let's extend the last model we made with an additional part - this time, we'll use the two grit latent variables to predict the latent conscientiousness score. We'd like to see how well these latent variables predict conscientiousness. If they measure the 'same' kind of trait, we'd expect to see large coefficients.

Rebuild the model from question 2, but include an additional part where the two grit latent variables predict latent conscientiousness.

Hide code cell source
# Your answer here
# Model string
mdspec = """
consistency =~ GS2 + GS3 + GS5 + GS7 + GS8 + GS11
perseverance =~ GS1 + GS4 + GS6 + GS9 + GS10 + GS12
conscientious =~ C1 + C2 + C3 + C4 + C5 + C6 + C7 + C8 + C9 + C10

conscientious ~ consistency + perseverance
"""

# Create model
model = sem.Model(mdspec)

# Fit it
model.fit(grit_consc)
SolverResult(fun=0.9710712122615321, success=True, n_it=44, x=array([ 1.09372032,  1.33481513,  1.36326389,  1.42906892,  0.78786201,
        0.8498987 ,  1.3911391 ,  1.4946746 ,  1.55663121,  1.34467594,
       -0.94951483,  0.48329874, -1.02884432,  1.07183657, -1.0865845 ,
        0.768163  , -0.81085453,  1.03271978,  0.65668753,  0.30459112,
       -0.82905879,  0.75009116,  0.79133413,  1.38197185,  0.83954118,
        1.03002355,  0.96546841,  1.30376258,  0.96068456,  0.9050144 ,
        0.95839995,  0.70040618,  1.14535444,  0.98499342,  0.56307038,
        0.83191806,  0.9719378 ,  1.20764937,  0.83116974,  0.54796473,
        0.61777322,  0.7761055 ,  0.64744345,  0.22025482,  0.46986339,
       -0.21635048,  0.27907538]), message='Optimization terminated successfully', name_method='SLSQP', name_obj='MLW')

Once estimated, inspect the coefficients of that latent regression.

Hide code cell source
# Your answer here
model.inspect(std_est=True)
lval op rval Estimate Est. Std Std. Err z-value p-value
0 conscientious ~ consistency 0.304591 0.277782 0.022774 13.37429 0.0
1 conscientious ~ perseverance -0.829059 -0.582703 0.038286 -21.654358 0.0
2 GS2 ~ consistency 1.000000 0.600782 - - -
3 GS3 ~ consistency 1.093720 0.605312 0.034572 31.636412 0.0
4 GS5 ~ consistency 1.334815 0.708377 0.037764 35.346361 0.0
5 GS7 ~ consistency 1.363264 0.765289 0.036755 37.09102 0.0
6 GS8 ~ consistency 1.429069 0.743540 0.039199 36.456336 0.0
7 GS11 ~ consistency 0.787862 0.477969 0.030008 26.255331 0.0
8 GS1 ~ perseverance 1.000000 0.533780 - - -
9 GS4 ~ perseverance 0.849899 0.378213 0.041859 20.303666 0.0
10 GS6 ~ perseverance 1.391139 0.704542 0.044855 31.014121 0.0
11 GS9 ~ perseverance 1.494675 0.700406 0.04834 30.920059 0.0
12 GS10 ~ perseverance 1.556631 0.609287 0.054476 28.574866 0.0
13 GS12 ~ perseverance 1.344676 0.687476 0.043917 30.618746 0.0
14 C1 ~ conscientious 1.000000 0.655438 - - -
15 C2 ~ conscientious -0.949515 -0.518943 0.032088 -29.591337 0.0
16 C3 ~ conscientious 0.483299 0.368548 0.022364 21.610512 0.0
17 C4 ~ conscientious -1.028844 -0.606064 0.030378 -33.867662 0.0
18 C5 ~ conscientious 1.071837 0.634031 0.03047 35.176885 0.0
19 C6 ~ conscientious -1.086584 -0.581762 0.033225 -32.703737 0.0
20 C7 ~ conscientious 0.768163 0.507550 0.026478 29.01177 0.0
21 C8 ~ conscientious -0.810855 -0.539436 0.026479 -30.622414 0.0
22 C9 ~ conscientious 1.032720 0.621287 0.029861 34.584457 0.0
23 C10 ~ conscientious 0.656688 0.485173 0.02357 27.860751 0.0
24 conscientious ~~ conscientious 0.220255 0.389878 0.011722 18.790372 0.0
25 consistency ~~ consistency 0.469863 1.000000 0.023361 20.113187 0.0
26 consistency ~~ perseverance -0.216350 -0.597463 0.010429 -20.744672 0.0
27 perseverance ~~ perseverance 0.279075 1.000000 0.016155 17.274642 0.0
28 C1 ~~ C1 0.750091 0.570401 0.018655 40.209597 0.0
29 C10 ~~ C10 0.791334 0.764607 0.018081 43.766842 0.0
30 C2 ~~ C2 1.381972 0.730698 0.031928 43.283861 0.0
31 C3 ~~ C3 0.839541 0.864173 0.018672 44.962958 0.0
32 C4 ~~ C4 1.030024 0.632686 0.024765 41.591692 0.0
33 C5 ~~ C5 0.965468 0.598005 0.02363 40.858009 0.0
34 C6 ~~ C6 1.303763 0.661553 0.030937 42.142977 0.0
35 C7 ~~ C7 0.960685 0.742393 0.022107 43.455503 0.0
36 C8 ~~ C8 0.905014 0.709009 0.021071 42.950317 0.0
37 C9 ~~ C9 0.958400 0.614002 0.023258 41.206853 0.0
38 GS1 ~~ GS1 0.700406 0.715078 0.016468 42.532551 0.0
39 GS10 ~~ GS10 1.145354 0.628769 0.028115 40.738819 0.0
40 GS11 ~~ GS11 0.984993 0.771545 0.022491 43.79462 0.0
41 GS12 ~~ GS12 0.563070 0.527377 0.014874 37.856134 0.0
42 GS2 ~~ GS2 0.831918 0.639061 0.020008 41.579265 0.0
43 GS3 ~~ GS3 0.971938 0.633597 0.023439 41.467314 0.0
44 GS4 ~~ GS4 1.207649 0.856955 0.027032 44.675186 0.0
45 GS5 ~~ GS5 0.831170 0.498202 0.021942 37.880345 0.0
46 GS6 ~~ GS6 0.547965 0.503621 0.014806 37.010857 0.0
47 GS7 ~~ GS7 0.617773 0.414332 0.017914 34.485515 0.0
48 GS8 ~~ GS8 0.776105 0.447148 0.021585 35.955693 0.0
49 GS9 ~~ GS9 0.647443 0.509432 0.017393 37.224808 0.0

By how much does an increase in latent grit alter conscientiousness?
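We can read this straight off the regression rows of the output above; a small sketch (my own filtering, using the lval and rval columns shown in that output) isolates the two structural paths:

# Pull out just the structural paths (conscientious regressed on the grit factors)
paths = model.inspect(std_est=True)
paths[(paths['lval'] == 'conscientious')
      & (paths['rval'].isin(['consistency', 'perseverance']))]

From those rows, a one-unit increase in the latent consistency factor is associated with an increase of about 0.30 in latent conscientiousness (standardised ≈ 0.28), while a one-unit increase in latent perseverance is associated with a change of about -0.83 (standardised ≈ -0.58). Bear in mind that none of the items in this raw dataset have been reverse-scored, so the signs of these coefficients should be interpreted with care.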

4. Testing the presence of the Big Five - in Big Data#

We’ll now test the presence of the Big Five in a massive dataset of over 1 million respondents! You can find the dataset here: https://openpsychometrics.org/_rawdata/IPIP-FFM-data-8Nov2018.zip

Download it, unzip it, and read in the data-final.csv file, which is an enormous dataset containing responses to a Big 5 questionnaire (the IPIP). These are the questions, with short-hand prefixes showing which trait each question measures:

  • EXT1 - I am the life of the party.

  • EXT2 - I don’t talk a lot.

  • EXT3 - I feel comfortable around people.

  • EXT4 - I keep in the background.

  • EXT5 - I start conversations.

  • EXT6 - I have little to say.

  • EXT7 - I talk to a lot of different people at parties.

  • EXT8 - I don’t like to draw attention to myself.

  • EXT9 - I don’t mind being the center of attention.

  • EXT10 - I am quiet around strangers.

  • EST1 - I get stressed out easily.

  • EST2 - I am relaxed most of the time.

  • EST3 - I worry about things.

  • EST4 - I seldom feel blue.

  • EST5 - I am easily disturbed.

  • EST6 - I get upset easily.

  • EST7 - I change my mood a lot.

  • EST8 - I have frequent mood swings.

  • EST9 - I get irritated easily.

  • EST10 - I often feel blue.

  • AGR1 - I feel little concern for others.

  • AGR2 - I am interested in people.

  • AGR3 - I insult people.

  • AGR4 - I sympathize with others’ feelings.

  • AGR5 - I am not interested in other people’s problems.

  • AGR6 - I have a soft heart.

  • AGR7 - I am not really interested in others.

  • AGR8 - I take time out for others.

  • AGR9 - I feel others’ emotions.

  • AGR10 - I make people feel at ease.

  • CSN1 - I am always prepared.

  • CSN2 - I leave my belongings around.

  • CSN3 - I pay attention to details.

  • CSN4 - I make a mess of things.

  • CSN5 - I get chores done right away.

  • CSN6 - I often forget to put things back in their proper place.

  • CSN7 - I like order.

  • CSN8 - I shirk my duties.

  • CSN9 - I follow a schedule.

  • CSN10 - I am exacting in my work.

  • OPN1 - I have a rich vocabulary.

  • OPN2 - I have difficulty understanding abstract ideas.

  • OPN3 - I have a vivid imagination.

  • OPN4 - I am not interested in abstract ideas.

  • OPN5 - I have excellent ideas.

  • OPN6 - I do not have a good imagination.

  • OPN7 - I am quick to understand things.

  • OPN8 - I use difficult words.

  • OPN9 - I spend time reflecting on things.

  • OPN10 - I am full of ideas.

Hide code cell source
# Your answer here
big5 = pd.read_csv('data-final.csv', sep='\t')
big5.head()
EXT1 EXT2 EXT3 EXT4 EXT5 EXT6 EXT7 EXT8 EXT9 EXT10 ... dateload screenw screenh introelapse testelapse endelapse IPC country lat_appx_lots_of_err long_appx_lots_of_err
0 4.0 1.0 5.0 2.0 5.0 1.0 5.0 2.0 4.0 1.0 ... 2016-03-03 02:01:01 768.0 1024.0 9.0 234.0 6 1 GB 51.5448 0.1991
1 3.0 5.0 3.0 4.0 3.0 3.0 2.0 5.0 1.0 5.0 ... 2016-03-03 02:01:20 1360.0 768.0 12.0 179.0 11 1 MY 3.1698 101.706
2 2.0 3.0 4.0 4.0 3.0 2.0 1.0 3.0 2.0 5.0 ... 2016-03-03 02:01:56 1366.0 768.0 3.0 186.0 7 1 GB 54.9119 -1.3833
3 2.0 2.0 2.0 3.0 4.0 2.0 2.0 4.0 1.0 4.0 ... 2016-03-03 02:02:02 1920.0 1200.0 186.0 219.0 7 1 GB 51.75 -1.25
4 3.0 3.0 3.0 3.0 5.0 3.0 3.0 5.0 3.0 4.0 ... 2016-03-03 02:02:57 1366.0 768.0 8.0 315.0 17 2 KE 1.0 38.0

5 rows × 110 columns

There are some other columns, which can be removed by running the following code:

Hide code cell source
# Run this to keep only needed columns
big5 = big5.filter(regex=r'[A-Z]\d+').loc[:, lambda x: ~x.columns.str.contains('_E')]
big5.head()
EXT1 EXT2 EXT3 EXT4 EXT5 EXT6 EXT7 EXT8 EXT9 EXT10 ... OPN1 OPN2 OPN3 OPN4 OPN5 OPN6 OPN7 OPN8 OPN9 OPN10
0 4.0 1.0 5.0 2.0 5.0 1.0 5.0 2.0 4.0 1.0 ... 5.0 1.0 4.0 1.0 4.0 1.0 5.0 3.0 4.0 5.0
1 3.0 5.0 3.0 4.0 3.0 3.0 2.0 5.0 1.0 5.0 ... 1.0 2.0 4.0 2.0 3.0 1.0 4.0 2.0 5.0 3.0
2 2.0 3.0 4.0 4.0 3.0 2.0 1.0 3.0 2.0 5.0 ... 5.0 1.0 2.0 1.0 4.0 2.0 5.0 3.0 4.0 4.0
3 2.0 2.0 2.0 3.0 4.0 2.0 2.0 4.0 1.0 4.0 ... 4.0 2.0 5.0 2.0 3.0 1.0 4.0 4.0 3.0 3.0
4 3.0 3.0 3.0 3.0 5.0 3.0 3.0 5.0 3.0 4.0 ... 5.0 1.0 5.0 1.0 5.0 1.0 5.0 3.0 5.0 5.0

5 rows × 50 columns
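One practical note before fitting: with over a million rows and fifty columns this model can take a while to estimate. If it is too slow on your machine, one option (my own suggestion, not part of the original exercise) is to fit on a random subsample; the output shown below was produced on the full dataset, so your numbers will differ if you do this.

# Optional: work with a random subsample to speed up estimation
# (dropping incomplete rows first is a precaution; adjust n to taste)
big5_small = big5.dropna().sample(n=100_000, random_state=36)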

With the dataset ready, prepare a CFA model that tests whether each set of ten questions loads onto its respective latent factor (e.g. EXT1 to EXT10 onto Extraversion, OPN1 to OPN10 onto Openness, and so on).

Hide code cell source
# Your answer here
# Model string
mdspec = """
extra =~ EXT1 + EXT2 + EXT3 + EXT4 + EXT5 + EXT6 + EXT7 + EXT8 + EXT9 + EXT10
open =~ OPN1 + OPN2 + OPN3 + OPN4 + OPN5 + OPN6 + OPN7 + OPN8 + OPN9 + OPN10
consc =~ CSN1 + CSN2 + CSN3 + CSN4 + CSN5 + CSN6 + CSN7 + CSN8 + CSN9 + CSN10
agree =~ AGR1 + AGR2 + AGR3 + AGR4 + AGR5 + AGR6 + AGR7 + AGR8 + AGR9 + AGR10
neuro =~ EST1 + EST2 + EST3 + EST4 + EST5 + EST6 + EST7 + EST8 + EST9 + EST10
"""

# Create model
model = sem.Model(mdspec)

# Fit it
model.fit(big5)
SolverResult(fun=5.36296039998346, success=True, n_it=57, x=array([-1.06200145,  0.95417843, -1.030203  ,  1.08329996, -0.81910411,
        1.18386593, -0.80722522,  0.94436833, -1.02606913, -0.86188908,
        0.96142689, -0.74789625,  1.01326436, -0.82631873,  0.78080518,
        1.01344284,  0.62361959,  1.18156628, -1.09149167,  0.5441523 ,
       -1.07460931,  1.11229093, -1.2430879 ,  0.77959885, -0.80312943,
        1.03367876,  0.60953939, -1.12362769,  0.68604513, -1.43895163,
        1.20265913, -1.16395247,  1.14358636, -1.07432794, -1.40957848,
       -0.85553933, -0.65361763,  0.74696009, -0.480897  ,  0.72871028,
        1.06146634,  1.0741923 ,  1.14944716,  0.97562448,  0.94016725,
        1.43674748,  0.90548163,  0.84803146,  1.4625594 ,  0.52875896,
        0.85281192,  1.00831611,  0.78594071,  0.78610704,  0.64851546,
        0.87507333,  0.91986045,  1.32519691,  0.94056273,  0.9785403 ,
        0.99411219,  1.21806196,  0.96280453,  0.97306684,  1.08424766,
        0.9874805 ,  1.02212687,  1.1566158 ,  0.89358005,  1.37933939,
        1.18518101,  0.83121266,  0.72093211,  0.71466859,  0.89804327,
        0.86073212,  0.92737012,  0.92128132,  0.8048323 ,  0.74866263,
        0.76710141,  1.00770347,  0.92762256,  1.13732951,  1.1543183 ,
        0.93733664,  0.50929538,  0.93945862,  0.8324595 ,  0.96393237,
        0.57244287,  0.94478514,  0.77220248,  1.16343713,  0.92790156,
        0.35866303,  0.50955039, -0.06273015,  0.73729111, -0.17700163,
        0.05657929, -0.1776681 ,  0.82428062, -0.01288336, -0.18473819,
        0.40169197, -0.06727421,  0.04077642,  0.11323818, -0.03575431]), message='Optimization terminated successfully', name_method='SLSQP', name_obj='MLW')

Check the model fit statistics:

Hide code cell source
# Your answer here
sem.calc_stats(model)
DoF DoF Baseline chi2 chi2 p-value chi2 Baseline CFI GFI AGFI NFI TLI RMSEA AIC BIC LogLik
Value 1165 1225 5.445234e+06 0.0 1.902557e+07 0.713837 0.713794 0.699054 0.713794 0.699099 0.067841 209.274079 1510.654937 5.36296

Based on this truly massive dataset, what do we make of the Big 5 model?
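Based on the indices above (CFI and TLI of roughly .70, RMSEA around .068), this strict simple-structure model falls well short of the usual guidelines, even with over a million respondents - a result that suggests the items do not load cleanly onto just their own trait, rather than proof that the Big Five do not exist.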

5. Exploring and confirming#

For the final exercise, we’ll see how EFA and CFA work together.

The last dataset we’ll see contains a series of scales that researchers thought might measure the ‘DISC’ personality model, which has four central traits:

  • Dominance: active use of force to overcome resistance in the environment

  • Inducement: use of charm in order to deal with obstacles

  • Submission: warm and voluntary acceptance of the need to fulfill a request

  • Compliance: fearful adjustment to a superior force.

This model of personality is used in business, but has no real empirical underpinning. To try to address this, researchers took four sub-scale measures from the IPIP, which measure similar sorts of things - namely, assertiveness, social confidence, adventurousness, and dominance.

Our goal now will be to explore a set of latent factors in half of this data, and then confirm them in the other half. Thus, we will use both EFA and CFA.

First, let's read in the data, which is accessible from here: http://openpsychometrics.org/_rawdata/AS+SC+AD+DO.zip

Extract the data.csv file, read it in, and show the head. I’ve renamed it data-disc.csv on my end and read it in under that name.

Hide code cell source
# Your answer here
disc = pd.read_csv('data-disc.csv')
disc.head()
AS1 AS2 AS3 AS4 AS5 AS6 AS7 AS8 AS9 AS10 ... DO3 DO4 DO5 DO6 DO7 DO8 DO9 DO10 age gender
0 4 4 3 3 5 4 1 3 1 1 ... 3 1 3 2 5 4 2 1 29 2
1 4 3 4 4 3 2 3 3 4 3 ... 3 2 3 2 3 3 2 2 49 2
2 5 4 4 5 3 3 2 2 1 1 ... 3 3 3 4 4 5 2 3 52 1
3 4 3 3 2 3 3 4 3 4 1 ... 3 3 4 4 4 5 3 1 34 2
4 4 4 4 4 4 3 2 1 2 0 ... 4 3 4 3 5 5 4 4 52 2

5 rows × 42 columns

The column names here represent the four sub-scales, like so:

  • Assertiveness

    • AS1 Express myself easily.

    • AS2 Try to lead others.

    • AS3 Automatically take charge.

    • AS4 Know how to convince others.

    • AS5 Am the first to act.

    • AS6 Take control of things.

    • AS7 Wait for others to lead the way.

    • AS8 Let others make the decisions.

    • AS9 Am not highly motivated to succeed.

    • AS10 Can’t come up with new ideas.

  • Social Confidence

    • SC1 Feel comfortable around people.

    • SC2 Don’t mind being the center of attention.

    • SC3 Am good at making impromptu speeches.

    • SC4 Express myself easily.

    • SC5 Have a natural talent for influencing people.

    • SC6 Hate being the center of attention.

    • SC7 Lack the talent for influencing people.

    • SC8 Often feel uncomfortable around others.

    • SC9 Don’t like to draw attention to myself.

    • SC10 Have little to say.

  • Adventurousness

    • AD1 Prefer variety to routine.

    • AD2 Like to visit new places.

    • AD3 Interested in many things.

    • AD4 Like to begin new things.

    • AD5 Prefer to stick with things that I know.

    • AD6 Dislike changes.

    • AD7 Don’t like the idea of change.

    • AD8 Am a creature of habit.

    • AD9 Dislike new foods.

    • AD10 Am attached to conventional ways.

  • Dominance

    • DO1 Try to surpass others’ accomplishments.

    • DO2 Try to outdo others.

    • DO3 Am quick to correct others.

    • DO4 Impose my will on others.

    • DO5 Demand explanations from others.

    • DO6 Want to control the conversation.

    • DO7 Am not afraid of providing criticism.

    • DO8 Challenge others’ points of view.

    • DO9 Lay down the law to others.

    • DO10 Put people under pressure.

The first task is to prepare the data. Drop the age and gender columns, and then use the .sample dataframe method to extract 50% of the data for exploration, keeping the other half for confirmation. This can be a tricky step, so take care.

Hide code cell source
# Your answer here
# Drop columns
disc = disc.drop(columns=['age', 'gender'])

# Sample half
explore = disc.sample(frac=.50, random_state=42)

# Get the other half not in the first half
confirm = disc[~disc.index.isin(explore.index)]
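A quick sanity check (my addition, not required by the exercise) that the two halves do not overlap and together cover every row:

# Sanity check: the halves are disjoint and jointly cover the full dataset
assert explore.index.intersection(confirm.index).empty
assert len(explore) + len(confirm) == len(disc)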

Fit an EFA to the first half of the data. Let's start with four factors, which would represent the four subscales. What does that solution look like? Fit it and examine a plot of the loadings.

If you want the figure to be larger, you can run the following lines of code before you make a heatmap:

import matplotlib.pyplot as plt
plt.figure(figsize=(10, 10))

Hide code cell source
# Your answer here
# EFA
efa = FactorAnalyzer(n_factors=4).fit(explore)

# Get loadings
loadings = pd.DataFrame(efa.loadings_, index=explore.columns)

# Plot
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 10))
sns.heatmap(loadings, annot=True, cmap='Grays', fmt='.2f')
<Axes: >
[Figure: heatmap of the loadings for the four-factor solution]

We can see immediately that aspects of the Assertiveness and Social Confidence scales align with the first factor, dominance with the second, and adventurousness with the third, while some of the adventurousness questions (e.g. AD1-AD4) seem to align with the fourth factor. Examine the variance explained next.

Hide code cell source
# Your answer here
efa.get_factor_variance()
(array([6.4104039 , 4.89054046, 3.33641926, 1.72314879]),
 array([0.1602601 , 0.12226351, 0.08341048, 0.04307872]),
 array([0.1602601 , 0.28252361, 0.36593409, 0.40901281]))
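The raw tuple returned by get_factor_variance can be a little hard to read; one option is to wrap it in a labelled DataFrame (the row labels below are my own, following the order factor_analyzer returns: sums of squared loadings, proportion of variance, cumulative variance):

# Label the variance output for readability
variance = pd.DataFrame(
    efa.get_factor_variance(),
    index=['SS loadings', 'Proportion of variance', 'Cumulative variance']
)
variance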

This solution explains around 40% of the variance, with most of it coming from the first two factors. Perhaps we could do away with a four-factor solution and retain a three-factor one? Fit that below.

Hide code cell source
# Your answer here
# EFA
efa = FactorAnalyzer(n_factors=3).fit(explore)

# Get loadings
loadings = pd.DataFrame(efa.loadings_, index=explore.columns)

# Plot
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 10))
sns.heatmap(loadings, annot=True, cmap='Grays', fmt='.2f')
<Axes: >
[Figure: heatmap of the loadings for the three-factor solution]

Examine the variance explained by this fit.

Hide code cell source
# Your answer here
efa.get_factor_variance()
(array([6.54110144, 5.0344487 , 3.69496587]),
 array([0.16352754, 0.12586122, 0.09237415]),
 array([0.16352754, 0.28938875, 0.3817629 ]))

This loses only around 2% of the variance and gives somewhat more interpretable factors. We'll therefore say there are three factors underpinning this four-questionnaire test. The first captures assertiveness and social confidence, the second dominance, and the third adventurousness.

Now, how might we translate this setup into a CFA model that we can test on the other half of the data?

One straightforward way would be to use the coarse grouping just described - that is, all questions for assertiveness and social confidence go on one factor, all dominance questions on another, and all adventurousness questions on the final one.

Build that below, but using the other slice of the data to confirm whether this structure ‘holds’. Check the fit statistics.

Hide code cell source
# Your answer here
# Model string
mdspec = """
latent1 =~ AS1 + AS2 + AS3 + AS4 + AS5 + AS6 + AS7 + AS8 + AS9 + AS10 + SC1 + SC2 + SC3 + SC4 + SC5 + SC6 + SC7 + SC8 + SC9 + SC10
latent2 =~ DO1 + DO2 + DO3 + DO4 + DO5 + DO6 + DO7 + DO8 + DO9 + DO10
latent3 =~ AD1 + AD2 + AD3 + AD4 + AD5 + AD6 + AD7 + AD8 + AD9 + AD10
"""

# Create model
model = sem.Model(mdspec)

# Fit it
model.fit(confirm) 

# Fit statistics
sem.calc_stats(model)
DoF DoF Baseline chi2 chi2 p-value chi2 Baseline CFI GFI AGFI NFI TLI RMSEA AIC BIC LogLik
Value 737 780 4041.373173 0.0 9776.734335 0.632714 0.586634 0.562516 0.586634 0.611285 0.094506 149.930922 500.239906 8.034539

That does not look great! Inspect whether the individual coefficients are associated in the way we’d expect:

Hide code cell source
# Your answer here
model.inspect(std_est=True).query('op == "~"')
lval op rval Estimate Est. Std Std. Err z-value p-value
0 AS1 ~ latent1 1.000000 0.704852 - - -
1 AS2 ~ latent1 0.818380 0.609185 0.063324 12.923657 0.0
2 AS3 ~ latent1 0.826251 0.596697 0.065244 12.664005 0.0
3 AS4 ~ latent1 0.803585 0.645872 0.058724 13.68413 0.0
4 AS5 ~ latent1 0.668898 0.496057 0.063347 10.559247 0.0
5 AS6 ~ latent1 0.621050 0.471490 0.061842 10.042551 0.0
6 AS7 ~ latent1 -0.679325 -0.518800 0.061552 -11.036618 0.0
7 AS8 ~ latent1 -0.452216 -0.344830 0.061406 -7.364376 0.0
8 AS9 ~ latent1 -0.400243 -0.269313 0.069506 -5.758391 0.0
9 AS10 ~ latent1 -0.311342 -0.229752 0.063346 -4.914959 0.000001
10 SC1 ~ latent1 0.894873 0.589880 0.071463 12.522108 0.0
11 SC2 ~ latent1 1.086737 0.657465 0.07805 13.923668 0.0
12 SC3 ~ latent1 1.029209 0.600678 0.080742 12.746809 0.0
13 SC4 ~ latent1 0.957403 0.651346 0.069391 13.797292 0.0
14 SC5 ~ latent1 0.890399 0.631887 0.066474 13.394672 0.0
15 SC6 ~ latent1 -0.957331 -0.614929 0.073398 -13.042958 0.0
16 SC7 ~ latent1 -0.773334 -0.599738 0.060762 -12.727266 0.0
17 SC8 ~ latent1 -0.874095 -0.579628 0.071015 -12.30852 0.0
18 SC9 ~ latent1 -0.910533 -0.601252 0.071365 -12.758753 0.0
19 SC10 ~ latent1 -0.769383 -0.533247 0.067851 -11.33938 0.0
20 DO1 ~ latent2 1.000000 0.538283 - - -
21 DO2 ~ latent2 1.142825 0.592763 0.113915 10.032241 0.0
22 DO3 ~ latent2 1.144616 0.666074 0.106152 10.782812 0.0
23 DO4 ~ latent2 1.285773 0.744064 0.112106 11.46927 0.0
24 DO5 ~ latent2 1.282494 0.688627 0.116664 10.99304 0.0
25 DO6 ~ latent2 1.035638 0.610491 0.101301 10.223355 0.0
26 DO7 ~ latent2 0.996247 0.571367 0.101729 9.793166 0.0
27 DO8 ~ latent2 0.957089 0.598186 0.094842 10.09137 0.0
28 DO9 ~ latent2 1.040693 0.604591 0.102426 10.160443 0.0
29 DO10 ~ latent2 1.177656 0.670289 0.108812 10.822828 0.0
30 AD1 ~ latent3 1.000000 0.478837 - - -
31 AD2 ~ latent3 0.596162 0.342152 0.094078 6.336898 0.0
32 AD3 ~ latent3 0.437053 0.268654 0.084017 5.201967 0.0
33 AD4 ~ latent3 0.769467 0.427006 0.103145 7.460015 0.0
34 AD5 ~ latent3 -1.349486 -0.698710 0.13627 -9.902998 0.0
35 AD6 ~ latent3 -1.606016 -0.819433 0.152228 -10.550087 0.0
36 AD7 ~ latent3 -1.390536 -0.778319 0.134248 -10.357974 0.0
37 AD8 ~ latent3 -1.274979 -0.643348 0.133857 -9.524967 0.0
38 AD9 ~ latent3 -0.792595 -0.407009 0.109885 -7.212967 0.0
39 AD10 ~ latent3 -1.226555 -0.634494 0.129663 -9.459547 0.0

Broadly, these seem to match the pattern seen in the loadings, but the fit statistics suggest the model does not describe this half of the data well. As a final push, we could try to reflect the loadings we saw in the EFA more closely. From the loadings matrix we can discern which factor each question has the highest affinity with by using the .idxmax(axis='columns') method, like so:

loadings.idxmax(axis='columns')

Run that below and examine the output.

Hide code cell source
# You answer here
loadings.idxmax(axis='columns')
AS1     0
AS2     0
AS3     1
AS4     0
AS5     1
AS6     1
AS7     2
AS8     2
AS9     2
AS10    2
SC1     0
SC2     0
SC3     0
SC4     0
SC5     0
SC6     1
SC7     2
SC8     1
SC9     1
SC10    2
AD1     1
AD2     0
AD3     1
AD4     1
AD5     2
AD6     2
AD7     2
AD8     2
AD9     2
AD10    2
DO1     1
DO2     1
DO3     1
DO4     1
DO5     1
DO6     1
DO7     1
DO8     1
DO9     1
DO10    1
dtype: int64

What is interesting here is that this reveals a rather disparate pattern for the assertiveness scale. Some questions load on factor 0 (the first), while others are more associated with the second factor, which was related to dominance. Consider AS3 - “automatically take charge”. Should we be surprised this is more closely associated with a different factor? We can try to recreate this structure by tying each question to the factor shown above. This is a tricky process. See if you can build this model below and examine its fit statistics. It will more closely resemble the EFA, and may emerge as a better model.
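If typing the model string out by hand feels error-prone, here is a sketch (my own helper, not part of the original answer) that builds the specification directly from the idxmax output; it should reproduce the same grouping as the hand-written model below, just with the items in a different order:

# Build the semopy model string from the EFA assignments
assignments = loadings.idxmax(axis='columns')
spec_lines = []
for factor in sorted(assignments.unique()):
    items = assignments[assignments == factor].index
    spec_lines.append(f'latent{factor + 1} =~ ' + ' + '.join(items))
mdspec = '\n'.join(spec_lines)
print(mdspec)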

Hide code cell source
# Your answer here
mdspec = """
latent1 =~ AS1 + AS2 + AS4 + SC1 + SC2 + SC3 + SC4 + SC5 + AD2
latent2 =~ DO1 + DO2 + DO3 + DO4 + DO5 + DO6 + DO7 + DO8 + DO9 + DO10 + AS3 + AS5 + AS6 + SC6 + SC8 + SC9 + AD1 + AD3 + AD4
latent3 =~ AS7 + AS8 + AS9 + AS10 + SC7 + SC10 + AD5 + AD6 + AD7 + AD8 + AD9 + AD10
"""

# Create model
model = sem.Model(mdspec)

# Fit it
model.fit(confirm) 

# Fit statistics
sem.calc_stats(model)
DoF DoF Baseline chi2 chi2 p-value chi2 Baseline CFI GFI AGFI NFI TLI RMSEA AIC BIC LogLik
Value 737 780 4906.10706 0.0 9776.734335 0.536598 0.498185 0.468907 0.498185 0.509561 0.106154 146.492616 496.8016 9.753692

Amusingly, this is even worse. Despite our best efforts, we're unable to create a solid, stable set of latent variables that underpins this collection of data. This is a common experience, and the field of psychology has numerous measurement issues to which poor latent factor models contribute. Nonetheless, EFA and CFA are incredibly powerful approaches, but they need to be treated with a good deal of circumspection.