Bayesian Ordinal Regression, in PyMC
Ordinal data is everywhere in the social and behavioural sciences. People indicate their responses to questionnaire items by selecting one of a set of available options, such as disagree, neutral, or agree, or rate their preferences on a numeric scale with a few discrete options (e.g. 1, 2, 3, 4). The longstanding approach to analysing this data is to simply treat it as though it were truly continuous, applying the usual techniques that assume Gaussian distributions, like OLS and ANOVA. This ignores the very real possibility that the distances between the categories are neither fixed nor equal. It's easy to imagine situations where the gap between a 3 and a 4 on a scale is small, but there is a big jump between a 4 and a 5.
Accordingly, a lot of work over the last few years convincingly demonstrates that treating ordinal data as continuous is not a good idea: it can bias conclusions and even get the sign of relationships wrong. Ordinal data is modelled much more accurately using an appropriate likelihood function that can express the ordinal nature of the observed variable. Unfortunately, fitting and interpreting these models can be difficult, and is something I've struggled with for quite a while.
Using Bayesian inference, it's straightforward to write down a model for ordinal data, and then extend it to include predictors, which is almost always the needed use case in applied statistics. We'll work through some use cases with PyMC, laying out the bare bones of the model and a few different approaches, and then move on to full regression modelling with an ordinal outcome across a few different datasets.
Are the latent variables in the room with us?
First though, some theory. Understanding this part can be tricky, but it is key to grasping how ordinal models work. Ordinal models assume a latent, continuous variable that is unobserved. This is the variable of interest - a person's tendency to endorse the statement, find a face attractive, their level of agreement, and so on. We make assumptions about the kind of distribution this variable can take - typically the normal or logistic distribution. But to realise the discrete and ordered nature of the responses, we must impose a set of thresholds or cutpoints onto this distribution to slice it into categories. We can then work out the probability mass within each category, and use those probabilities to assess the frequency with which a response will appear. If this is confusing, don't worry. Writing down the steps and plotting some data will hopefully clear things up.
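Concretely, for K response categories we choose K − 1 ordered cutpoints c_1 < c_2 < … < c_{K−1}, and the probability of category k is the latent mass between adjacent cutpoints:

P(Y = k) = F(c_k) − F(c_{k−1}),

where F is the CDF of the latent distribution (the standard normal CDF, Φ, for a probit-style model), with the conventions c_0 = −∞ and c_K = +∞.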
Simulating ordinal data with PyMC from scratch
First, let's import some packages, and then make some assumptions about what we want to see. Imagine we are trying to simulate a dataset where 1000 participants indicate their response to the statement "I like chocolate". We will imagine there is a latent normal distribution that governs this response, with a mean of zero and a standard deviation of 1 (i.e., the standard normal). People are, however, forced to respond to this statement on a 1-5 scale, indicating their agreement (1 = strong dislike, 5 = strong like).
What might a distribution of ordinal responses to this question look like? For me, I'd imagine a lot of people would respond with a four or a five - i.e., those categories have a high probability of occurring. A smaller group might select three, and not many people would select one or two. Let's see if we can have the normal distribution help us realise that, using SciPy.
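Here's a minimal sketch of that idea. The cutpoint values below are my own assumptions, chosen to pile most of the latent mass onto fours and fives:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Cutpoints slicing the standard normal latent variable into 5 categories.
# These values are assumptions chosen to make 4s and 5s most likely.
cutpoints = np.array([-2.0, -1.5, -1.0, 0.0])

# P(Y = k) = F(c_k) - F(c_{k-1}), padding with -inf and +inf at the ends
cdf = stats.norm.cdf(np.concatenate([[-np.inf], cutpoints, [np.inf]]))
category_probs = np.diff(cdf)  # probabilities of responses 1..5, sums to 1

# Simulate 1000 participants answering on the 1-5 scale
responses = rng.choice([1, 2, 3, 4, 5], size=1000, p=category_probs)
print(category_probs.round(3))   # ~ [0.023, 0.044, 0.092, 0.341, 0.5]
print(np.bincount(responses)[1:])
```

With these cutpoints, about half the mass sits above zero (the fives), a third falls between −1 and 0 (the fours), and very little is left for the bottom categories - just the shape we imagined.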
PyMC is an awesome probabilistic programming language that makes it easy to build Bayesian statistical models, and I've used it exclusively for several years for data analysis problems. Despite its flexibility, one of the few sticking points I've encountered is trying to generate out-of-sample predictions. In the age of machine learning and predictive modelling, simply fitting a model to data and assessing the quality of in-sample predictions is no longer the gold standard - rather, our models should be able to predict data they have never seen before. Python libraries like scikit-learn place the predictive utility of models front and centre, but it is not so straightforward to do this in PyMC. This is a shame, because Bayesian models come with free uncertainty around predictions - an incredibly powerful feature when uncertainty is so important to quantify, and yet point estimates or confusing frequentist confidence intervals are the main currency of prediction.
This problem is especially compounded with hierarchical models, because they present several kinds of choices when doing out-of-sample predictions! For example:
- A hierarchical model could predict new, unobserved data for an "average" person, omitting the group-specific effects entirely - this is the default in many packages, such as lme4 in R.
- A hierarchical model could predict new, unobserved data for a known group-specific (or random-effect) value. An example would be using the model to predict new scores on an unobserved quantity for an individual already in the dataset.
- A hierarchical model might predict unobserved data for an unknown group-specific effect. An example of this would be predicting scores for unseen, new groups.
Doing this in PyMC is possible, but not immediately obvious. Here, I'll work through some examples of out-of-sample predictions using PyMC. Much of this was inspired by the excellent webinar on out-of-sample predictions by Ricardo Vieira and Alex Andorra, on the Learning Bayesian Statistics podcast. It's also worth noting that the excellent bambi package abstracts much of this away and makes prediction very easy, but there are still many occasions when I need to write a model myself that bambi doesn't yet support, or where I need more specific control and detail.
Let’s go!
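As a rough sketch of the core mechanism that all three scenarios build on (the model and variable names here are illustrative, not from any particular example below): wrap the predictors in pm.Data, fit the model, then swap in unseen values before sampling from the posterior predictive.

```python
import numpy as np
import pymc as pm

# Toy regression data (purely illustrative)
rng = np.random.default_rng(1)
x_train = rng.normal(size=100)
y_train = 1.0 + 2.0 * x_train + rng.normal(scale=0.5, size=100)

with pm.Model() as model:
    x = pm.Data("x", x_train)  # mutable container we can swap out later
    alpha = pm.Normal("alpha", 0, 1)
    beta = pm.Normal("beta", 0, 1)
    sigma = pm.HalfNormal("sigma", 1)
    # shape=x.shape lets the likelihood resize when new data comes in
    pm.Normal("y", mu=alpha + beta * x, sigma=sigma,
              observed=y_train, shape=x.shape)
    idata = pm.sample(random_seed=1)

# Out-of-sample: replace the predictor values and sample predictions,
# which arrive with full posterior uncertainty for free
with model:
    pm.set_data({"x": np.linspace(-3, 3, 50)})
    preds = pm.sample_posterior_predictive(idata, var_names=["y"])
```

The hierarchical cases add wrinkles on top of this - which group index to feed in, and where new groups' effects come from - and those are what the examples below unpack.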
To get a break from academic work over the summer, I took part in another data science competition. This one was difficult, involving finding clusters of different customers in a multi-table database, and involved some tricky feature engineering and data imputation. It also struck me how effective the general approach of dimension reduction plus clustering is; I've used it in some papers and the logic here was the same. One thing I picked up was the utility of non-linear dimension-reduction approaches like spectral embedding, which work by preserving the neighbourhood structure between points as the higher-dimensional data is projected onto a lower-dimensional manifold. I also had some great Altair plots in this, but on submission they all got stripped, so I had to remake some static plots at the last minute. The final version is here, but I much preferred this one. Anyway, second place isn't so bad for a stab at a very applied problem.
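For reference, the dimension-reduction-plus-clustering recipe is only a few lines with scikit-learn. This is a minimal sketch on toy data, not the competition's features:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import SpectralEmbedding
from sklearn.cluster import KMeans

# Toy high-dimensional data standing in for engineered customer features
X, _ = make_blobs(n_samples=500, n_features=20, centers=4, random_state=0)

# Non-linear projection to 2D that preserves neighbourhood structure
embedding = SpectralEmbedding(n_components=2, random_state=0).fit_transform(X)

# Cluster in the low-dimensional space
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embedding)
print(np.bincount(labels))
```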
Earlier this year I responded to an invitation to take part in the MultiAnalysis Collaboration Project. I think the project is being run by the Centre for Open Science, and it's aiming to obtain at least 5 different analyses for each of 100 papers published, with open data, in psychology. I opted for the paper below, published in 2016 - Is Eco-Friendly Unmanly? The Green-Feminine Stereotype and Its Effect on Sustainable Consumption. The organisers have selected a single claim from each paper that appears to be the ‘central’ conclusion, and participants in the project have to test it - that's more or less it! The claim in this paper is:
…consumers who engaged in green behavior were perceived …. as more feminine than consumers who engaged in nongreen behavior. (p. 571.)
In April 2021 I got my first COVID-19 vaccination. The side effects were pretty hard going - for about 24 hours I was laid out, and woke up in the middle of the night to my smart watch giving me an “extreme heart rate warning”, which is just the kind of thing you want to see after getting injected with a vaccine for a virus sweeping the globe!
Once I felt more alive, I had the idea of extracting the heart rate data from my watch. As it turns out, there is an excellent Python package, fitparse, that allows the parsing of the .FIT files from Fitbit watches. I won't post the raw data or the associated code here, but it was pretty simple to get a few days' worth of data from the watch and plot it. It was easy to see how bad things were, which made me extremely grateful for getting vaccinated and not catching the real thing!
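The general shape of reading heart-rate records with fitparse looks something like this - a sketch with a hypothetical file name, not my actual data or script:

```python
from fitparse import FitFile
import matplotlib.pyplot as plt

# Hypothetical file name - fitparse reads the binary .FIT format directly
fitfile = FitFile("2021-04-17.fit")

timestamps, heart_rates = [], []
# "record" messages carry the per-sample sensor readings
for record in fitfile.get_messages("record"):
    hr = record.get_value("heart_rate")
    ts = record.get_value("timestamp")
    if hr is not None and ts is not None:
        timestamps.append(ts)
        heart_rates.append(hr)

plt.plot(timestamps, heart_rates)
plt.xlabel("Time")
plt.ylabel("Heart rate (bpm)")
plt.show()
```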