Alex Jones, Ph.D.

Customer segmentation with Bayes, Spectral Embedding, and a second-place win

To get a break from academic work over the summer, I took part in another data science competition. This one was difficult: it involved finding clusters of different customers in a multi-table database, and required some tricky feature engineering and data imputation. It also struck me how effective the general approach of dimension reduction followed by clustering is; I’ve used it in some papers and the logic here was the same. One thing I picked up was the utility of non-linear dimension-reduction approaches like spectral embedding, which work by preserving local distances between points as the higher-dimensional data are projected onto a lower-dimensional manifold. I also had some great Altair plots in this, but on submission they all got stripped, so I had to remake some static plots at the last minute. The final version is here, but I much preferred this one. Anyway, second place isn’t so bad for a stab at a very applied problem.

# Imports
import arviz as az
import altair as alt
import numpy as np
import pandas as pd
import pymc as pm
import seaborn as sns

import matplotlib.pyplot as plt
import scipy.stats as st

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.manifold import SpectralEmbedding
from sklearn.impute import KNNImputer
from statsmodels.distributions.empirical_distribution import ECDF

Data-driven segmentation reveals three doctor profiles

Executive Summary

This report highlights a new, data-driven perspective on the company's customer base. The historical approach of segmenting doctors by their region has been shown to be a poor predictor of their support or purchasing behaviour. The analysis described here reveals three new customer segments with distinct profiles that will inform product and marketing development alongside customer service needs.

The report focuses on three key questions:

  1. What is the composition of each region in terms of number of doctors and their average number of purchases?
  2. Is there a relationship between purchases and number of complaints?
  3. Can new segments be found, and what variables best describe these new segments if they exist?

Key Insights

New doctor segments:

  1. Segment One - Doctors here have little experience with the company, lower rates of purchases, lower R and incidence rates, and give fewer complaints. They also treat fewer unique conditions and report only average satisfaction. There are more general practitioners in this segment.
  2. Segment Two - Doctors here have high R rates, give more complaints, and have a high level of experience with the company. They make relatively fewer purchases and seem to treat fewer unique conditions. This segment, the largest in number, is diffuse in terms of its members and their rank and category.
  3. Segment Three - Doctors here make a lot of purchases, have high incidence rates, are relatively satisfied, and treat the largest number of unique conditions. They make fewer complaints but have less experience with the company. This segment contains a large number of Ambassadors.

Key recommendations

Doctor-by-region analysis, interactively

The main data sources here are split across four tables, which comprise doctor details, order details, complaint details, and instruction details (y/n); they are read in below.

There is missing data in the doctors dataset about satisfaction, which will be imputed after highlighting the doctor-by-region analysis.

# Read in data files
doctors = pd.read_csv('data/doctors.csv')            # doctor details
orders = pd.read_csv('data/orders.csv')              # order details
complaints = pd.read_csv('data/complaints.csv')      # complaint details
instructions = pd.read_csv('data/instructions.csv')  # instruction details (y/n)
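
As a quick check on the missing satisfaction data flagged above, the non-numeric entries can be counted directly. This is a small sketch; it treats any entry that cannot be parsed as a number as missing, matching the errors='coerce' handling used in the feature engineering later.

# Count Satisfaction entries that cannot be parsed as numbers (sketch; mirrors the
# pd.to_numeric(..., errors='coerce') step used in the feature engineering below)
n_missing_satisfaction = pd.to_numeric(doctors['Satisfaction'], errors='coerce').isna().sum()
print(f'{n_missing_satisfaction} doctors have no usable satisfaction score')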

# Aggregate the doctors dataset
query1 = (doctors
          .groupby(by='Region', as_index=False)
          .agg(number_of_doctors=('DoctorID', pd.Series.nunique),
               average_purchase_number=('Purchases', 'mean'))
          .round(3))

## Display interactively with Altair ##
input_dropdown = alt.binding_select(options=list(query1['Region'].unique()), name='Select Region! ')
selection = alt.selection_single(fields=['Region'], bind=input_dropdown)

query1_chart = (alt.Chart(query1, width=900, height=400)
                .mark_point(filled=True, size=300, opacity=.75)
                .encode(x=alt.X('number_of_doctors', title='Number of Doctors in Region'),
                        y=alt.Y('average_purchase_number', title='Average Number of Purchases in Region', scale=alt.Scale(domain=(0, 140))),
                        color=alt.condition(selection, alt.Color('Region:N', legend=None), alt.value('grey')),
                        tooltip=[alt.Tooltip('Region'),
                                 alt.Tooltip('number_of_doctors', title='Number of Doctors'),
                                 alt.Tooltip('average_purchase_number', title='Average Purchase #')])
                .configure_axis(labelFontSize=16,
                                titleFontSize=16)
                .configure_legend(labelFontSize=16,
                                  titleFontSize=16)   
               ).add_selection(selection)
# Render
query1_chart

Hover over a point to view detailed information, or use the dropdown to isolate a specific region. A brief glance at this chart highlights why region is a poor predictor of the number of purchases that can be made. Overall, there is no relationship between number of doctors and their average number of purchases across regions, aside from a large outlier.
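
The absence of a relationship can also be checked numerically; the sketch below computes a rank correlation across regions. The use of Spearman's correlation here is an illustrative choice, not part of the original analysis.

# Rank correlation between region size and average purchases (illustrative sketch)
rho, p = st.spearmanr(query1['number_of_doctors'], query1['average_purchase_number'])
print(f'Spearman rho = {rho:.2f}, p = {p:.2f}')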

Assessing the purchase ~ complaint association probabilistically

While there may be a simple association between purchases and complaints, the relationship may be driven by other variables (e.g. satisfaction). Using a combination of careful feature engineering and a Bayesian linear model, the influence of the other variables in the dataset (see Technical Annexe for details) can be controlled for, and the number of complaints examined as a predictor of the number of purchases. Before modelling, 55 missing datapoints in the satisfaction variable were imputed, and a final sample of 74 doctors was analysed.

## Feature engineering
# Doctors cleaning
doctors_processed = (doctors
                     .assign(**pd.get_dummies(doctors[['Category', 'Rank']], drop_first=True))     # One hot encodes category and rank
                     .assign(Satisfaction=pd.to_numeric(doctors['Satisfaction'], errors='coerce')) # Fixes the string column of Satisfaction
                     .set_index('DoctorID')
                     .filter(regex='rate|Satisfaction|Experience|Purchases|Rank_|Category_')       # Select relevant variables
                    )

# Orders cleaning - calculate unique-condition-count - number of unique treatments
orders_processed = (orders
                    .set_index('DoctorID')               # Push doctor ID into index
                    .filter(regex='Condition [A-I]')     # Keep conditions A-I; J seems related to when devices are administered
                    .groupby('DoctorID')                 # Groupby Doctor as some doctors appear more than once
                    .apply(lambda x: x.sum(axis='rows')) # Take the sum across rows - each doctor then has a count of how many times they treat a condition
                    .gt(0)                               # where that is above zero, indicate true
                    .sum(axis='columns')                 # The sum of those booleans indicates a condition has been treated by this doctor
                    .to_frame('Unique Condition Index')
                   )

# Complaints - count the total number of complaints a given doctor gives, across complaint types
complaints_processed = (complaints
                        .groupby('DoctorID')
                        .agg(Complaints=('Qty', 'sum'))
                       )

# Merge the data and scale appropriately
z_score_these = ['Incidence rate', 'R rate', 'Satisfaction', 
                 'Experience', 'Complaints', 'Unique Condition Index']

# Merge the data together
df = (complaints_processed
      .join(doctors_processed, how='inner')
      .join(orders_processed, how='left')
     )

# Impute the missing data in Satisfaction
imputed_data = KNNImputer(n_neighbors=4).fit_transform(df)

# Recreate a dataframe, scale necessary variables
df = pd.DataFrame(imputed_data, index=df.index, columns=df.columns).apply(lambda z: st.zscore(z) if z.name in z_score_these else z)
# Test association using a PyMC negative binomial model
X = df.drop(columns='Purchases')
y = df['Purchases'].values

with pm.Model(coords={'coefs': X.columns}) as bayesian_negbinom_regression:
    
    # Define priors over intercept and all coefficients
    β0 = pm.Normal('β0', mu=0, sigma=1)
    β = pm.Normal('β', mu=0, sigma=1, dims='coefs')
    
    # Compute linear predictor
    λ = β0 + pm.math.dot(X.values, β)
    
    # Take the exponent
    λ_exp = pm.Deterministic('λ_exp', pm.math.exp(λ))
    
    # Negative-binomial likelihood for the purchase counts, with dispersion α
    α = pm.Exponential('nb', 0.01)
    pm.NegativeBinomial('likelihood', mu=λ_exp, alpha=α, observed=y)
    
    # Sample
    trace = pm.sample(2000, tune=2000, target_accept=.90, random_seed=42)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [β0, β, nb]
100.00% [16000/16000 00:19<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 43 seconds.
# Extract the posterior distribution of the "complaints" predictor
coef_trace = np.exp(trace['posterior'].sel(coefs='Complaints')['β'].values.flatten())

# Take its empirical cumulative distribution
ecdf = ECDF(coef_trace)(coef_trace)

# and its survival function
esf = 1 - ecdf

# Build a kernel density estimate and its probability density function of the coefficient
kde = st.gaussian_kde(coef_trace).pdf(coef_trace)

# Store as a df
traceplot = pd.DataFrame({'trace': coef_trace, 'density': kde, 'ecdf': ecdf, 'esf': esf}).round(3)

# The kdeplot
alt.data_transformers.enable('default', max_rows=8_000)
kdeplot = (alt.Chart(traceplot, width=500, height=500)
           .mark_area(opacity=.5, color='black', width=10, point=alt.OverlayMarkDef(color='black'))
           .encode(x=alt.X('trace:Q', title='Change in rate of purchasing with increasing complaints'),
                   y=alt.Y('density:Q', title='', axis=alt.Axis(labels=False)),
                   tooltip=[alt.Tooltip('trace:Q', title='Plausible purchase change rate'),
                            alt.Tooltip('ecdf:O', title='Probability effect is less than X'),
                            alt.Tooltip('esf:O', title='Probability effect is bigger than X')])
          )

# Text labels
text_μ = (alt.Chart({'values': [{'x': 1.40, 'y': 1.84}]})
          .mark_text(text=f'Average effect = {coef_trace.mean():.2f}', size=16)
          .encode(x='x:Q', y='y:Q'))
text_θ = (alt.Chart({'values': [{'x': 1.40, 'y': 1.60}]})
          .mark_text(text='95% interval [{0:.2f}, {1:.2f}]'.format(*az.hdi(coef_trace, .95)), size=16)
          .encode(x='x:Q', y='y:Q'))

# Add together in a layer, adjust font size
(kdeplot + text_μ + text_θ).configure_axis(labelFontSize=16, titleFontSize=16)

Hover over the above plot to examine plausible rates of change in purchases with increasing numbers of complaints, along with their associated probabilities. For example, hovering over a value of one, which indicates no change in purchases as complaints increase, shows that the probability the effect is bigger than this - that is, that increasing complaints means increasing purchases - is 75%. Similarly, the probability that the effect is smaller than this (complaints predicting fewer purchases) is 25%. The plot highlights clearly that the most likely rate of increase is 1.06, or 6%, but the rate is plausibly within the region of 0.89 to 1.25. From this we are able to infer the effect of increased complaints on purchases across a variety of contexts.
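
The probabilities shown in the tooltips come straight from the posterior draws; the chance that the effect exceeds any given rate is simply the proportion of draws above it. A minimal sketch using the coef_trace array built above:

# Posterior probabilities computed directly from the draws
prob_increase = np.mean(coef_trace > 1)       # probability complaints predict more purchases
prob_decrease = np.mean(coef_trace < 1)       # probability complaints predict fewer purchases
prob_over_10pct = np.mean(coef_trace > 1.10)  # probability the increase exceeds 10%
print(prob_increase, prob_decrease, prob_over_10pct)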

Customer segmentation - revealing three profiles

With the same dataset used above, machine learning techniques (see Technical Annexe for details) can uncover latent structure in the data, which can then be clustered to reveal an optimal grouping of doctors. The below chart, also interactive, allows for the exploration of each segment and its distinct customer profile. Doctors were segmented using a two-feature representation of a wider feature space, specifically:

# First scale the Purchases measure
df = df.assign(Purchases=st.zscore(df['Purchases']))

# Use Spectral embedding to preserve local distances in a 2D representation
spectra = SpectralEmbedding(n_components=2, random_state=42).fit_transform(df)

# Use silhouette scoring to select K with K-means
sc = {}
for k in range(2, 10):
    
    # Fit KMeans
    km = KMeans(n_clusters=k, random_state=42).fit(spectra)
    
    # Extract the labels and compute the silhouette score
    sc[k] = silhouette_score(spectra, km.labels_)

# Refit KMeans with the max silhouette score
best_K, best_sc = max(sc.items(), key=lambda y: y[-1])
final_k = KMeans(n_clusters=best_K, random_state=42).fit(spectra)

# Reveal dimension reduction and clusters in an interactive plot, with additional data on mouseover!
spectra_df = (pd.DataFrame(spectra, columns=['spec_1', 'spec_2'], index=df.index)
              .assign(Segment=final_k.labels_+1)
              .join(doctors[['DoctorID', 'Category', 'Rank', 'Region']].set_index('DoctorID'))
              .join(df)
              .reset_index()
             )

# Per group, take mean of continuous features, count of Category/Rank
average = ['Complaints', 'Incidence rate', 'R rate', 'Satisfaction', 'Experience', 'Purchases', 'Unique Condition Index']

# Groupby object
base = spectra_df.groupby('Segment')
chart = (base.agg({x: 'mean' for x in average})
         .join(base['Category'].value_counts().unstack('Category'))
         .join(base['Rank'].value_counts().unstack('Rank'))
         .fillna(0)
         .reset_index()
         .melt(id_vars='Segment', var_name='Attribute', value_name='Aggregate')
         .assign(prep=lambda x: np.where(x['Attribute'].isin(average), 'con', 'cat'))
        )

# Build continuous/cat plots
left = (alt.Chart(chart.query('prep == "con"'), width=450, height=400)
        .mark_bar()
        .encode(x='Aggregate', 
                y=alt.Y('Attribute', title='Metric variable'),
                color='Segment:N')
       )

right = (alt.Chart(chart.query('prep == "cat"'), width=450, height=400)
         .mark_bar()
         .encode(x=alt.X('Aggregate', title='Number of doctors'),
                 y=alt.Y('Attribute', title='Rank and Category'),
                 color='Segment:N')
       )

# Dropdown selectors 
input_dropdown = alt.binding_select(options=list(chart['Segment'].unique()), name='SELECT SEGMENT:   ')
selection = alt.selection_single(fields=['Segment'], bind=input_dropdown)

# Add to plots
left = left.add_selection(selection).transform_filter(selection)
right = right.add_selection(selection).transform_filter(selection)

# Stack, and configure appearance
(left | right).configure_axis(labelFontSize=16, titleFontSize=16).configure_legend(labelFontSize=16, titleFontSize=16)   
# Segment profile
(spectra_df['Segment']
 .value_counts(normalize=True)
 .sort_index()
 .round(2)
 .to_frame('Proportion of sample')
 .rename_axis(index='Segment Number'))
                Proportion of sample
Segment Number
1                               0.20
2                               0.53
3                               0.27

Using the interactive dropdown above, it is easy to explore the new segments and their important features, which mirror the profiles summarised in the Key Insights section.

A simple, memorable mapping could be:

Technical Annexe

Significantly greater technical detail on the above sections is provided below, for interested readers.

Feature engineering

The features used across the analysis were generated from the doctors, complaints, and orders datasets. Despite a large number of observations in the doctors dataset (437), relatively few of these were in the complaints dataset (just 16%). Given that complaints and purchases were a primary query, the smaller dataset was analysed. Features were engineered from each dataset before being merged into one (see the feature-engineering code in the main report).
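
The overlap quoted above can be verified directly from the raw tables; this is a quick sketch, assuming DoctorID is the join key, as in the merge code in the main report.

# Fraction of doctors who ever appear in the complaints table (sketch)
overlap = doctors['DoctorID'].isin(complaints['DoctorID']).mean()
print(f'{overlap:.0%} of doctors appear in the complaints data')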

The final dataset contained 74 unique doctors, and 16 features. After merging, there were 55 missing data points for the satisfaction feature. This was imputed using K-nearest-neighbours imputation.
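
For readers unfamiliar with the method, the toy example below illustrates how KNNImputer fills a missing value from its nearest neighbours (hypothetical data, not the competition dataset).

# Toy illustration of KNN imputation (hypothetical data)
toy = pd.DataFrame({'experience':   [1.0,  2.0,    3.0,  4.0],
                    'satisfaction': [10.0, np.nan, 30.0, 40.0]})

# The NaN is replaced by the mean satisfaction of the two rows closest on experience
print(KNNImputer(n_neighbors=2).fit_transform(toy))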

Assessing purchase ~ complaint association probabilistically

To assess the relationship between complaints and purchases, PyMC was used to fit a Bayesian negative-binomial regression, with complaints entered as a predictor alongside the 14 other features described above (purchases being the target variable). Negative-binomial regressions are well suited to count data, particularly where the variance of the counts is greater than the mean (here, the average purchase count is 10, while the variance is 130). Priors were set to be restrictive, following a $\mathcal{N}(0, 1)$ distribution, to help with the small dataset and avoid overfitting.
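
The overdispersion claim is easy to check from the regression target itself, and the negative binomial handles the excess variance through its dispersion parameter. A quick sketch using the y array defined in the modelling code above:

# Check overdispersion in the raw purchase counts used as the regression target (sketch)
print(np.mean(y), np.var(y))

# Under the negative-binomial likelihood above, Var(y) = mu + mu**2 / alpha,
# so any variance in excess of the mean is absorbed by the dispersion parameter alpha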

The central feature is of course the complaint predictor, but the full model summary is printed below, highlighting those predictors which have a high (or indeed low) probability of being associated with increasing purchases, which may be valuable information. For example, the unique condition index is much more likely to have a negative association with purchases; doctors who treat more conditions make fewer purchases. Those with higher incidence rates make more purchases. There is uncertainty in many of these estimates, but they are provided for further insight.

# Tabulate results
(az
 .summary(trace, var_names='β', 
          filter_vars='like', 
          kind='stats', 
          hdi_prob=.95,
          stat_funcs={'Purchase Rate Change': lambda x: np.mean(np.exp(x)),
                      '95% lower': lambda x: az.hdi(np.exp(x), .95)[0],
                      '95% upper': lambda x: az.hdi(np.exp(x), .95)[1],
                      'Probability Positive': lambda x: np.mean(np.exp(x) > 1)})
 .drop(index=['β0'])
 .assign(Feature=lambda x: x.index.str.replace('β|\\[|\\]', '', regex=True).str.replace('_', ': '))
 .set_index('Feature')
 .filter(items=['Purchase Rate Change', '95% lower', '95% upper', 'Probability Positive'])
 .style
 .applymap(lambda x: 'color: blue' if x < .20 or x > .80 else None, subset=['Probability Positive'])
 .format('{:.2f}')
)
Feature                   Purchase Rate Change   95% lower   95% upper   Probability Positive
Complaints                                1.06        0.90        1.24                   0.76
Incidence rate                            1.09        0.92        1.26                   0.84
R rate                                    1.03        0.83        1.25                   0.61
Satisfaction                              1.08        0.89        1.27                   0.79
Experience                                1.15        0.94        1.36                   0.92
Category: Specialist                      2.26        0.97        3.85                   0.99
Rank: Gold                                0.38        0.08        0.80                   0.02
Rank: Gold Plus                           0.21        0.05        0.42                   0.00
Rank: Platinum                            0.29        0.15        0.46                   0.00
Rank: Platinum Plus                       0.32        0.16        0.51                   0.00
Rank: Silver                              1.61        0.03        4.89                   0.50
Rank: Silver Plus                         0.38        0.16        0.68                   0.00
Rank: Titanium                            0.49        0.30        0.71                   0.00
Rank: Titanium Plus                       1.63        0.03        5.03                   0.50
Unique Condition Index                    0.93        0.79        1.08                   0.17

Customer segmentation - revealing three profiles

The full 16-feature dataset was reduced to two dimensions using spectral embedding. This was chosen as, unlike other dimension-reduction methods, it focuses on maintaining local distances between observations in the reduced space. Following the embedding, K-Means was used to determine clusters. K was chosen by iterating over candidate cluster numbers and assessing the silhouette coefficient, which compares the average distance from a point to the other members of its own cluster with the average distance to the members of the nearest other cluster. Three clusters emerged as having the highest coefficient, 0.47.
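
To make the silhouette criterion concrete, the sketch below computes the coefficient for a single embedded point by hand and compares it with scikit-learn's per-sample value. The choice of point is arbitrary, and the extra imports are not part of the main analysis.

# Manual silhouette for one point, checked against scikit-learn (sketch)
from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_samples

labels = final_k.labels_
i = 0  # any point will do

# a: mean distance to the other members of the point's own cluster
same = spectra[labels == labels[i]]
a = cdist([spectra[i]], same).sum() / (len(same) - 1)

# b: mean distance to the members of the nearest other cluster
b = min(cdist([spectra[i]], spectra[labels == c]).mean()
        for c in set(labels) if c != labels[i])

print((b - a) / max(a, b), silhouette_samples(spectra, labels)[i])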

An interactive scatter plot of the resulting two-dimensional embedding and clusters is shown below, highlighting how the embedding retained distances that K-Means identified easily. Hover over each datapoint to visualise its underlying full-dimensional data.

# Show scatter plot
query2_chart = (alt.Chart(spectra_df, width=600, height=400)
                .mark_point(filled=True, size=200, opacity=.50)
                .encode(x=alt.X('spec_1', title='Embedding 1'),
                        y=alt.Y('spec_2', title='Embedding 2'),
                        color=alt.Color('Segment:N'),
                        tooltip=[alt.Tooltip('DoctorID'),
                                 alt.Tooltip('Purchases'),
                                 alt.Tooltip('Complaints', title='Complaint Count'),
                                 alt.Tooltip('Category'),
                                 alt.Tooltip('Rank'),
                                 alt.Tooltip('Region'),
                                 alt.Tooltip('Incidence rate'),
                                 alt.Tooltip('R rate'),
                                 alt.Tooltip('Satisfaction'),
                                 alt.Tooltip('Experience')])
                .configure_axis(labelFontSize=16,
                                titleFontSize=16)
                .configure_legend(labelFontSize=16,
                                  titleFontSize=16)
               )

# Render
query2_chart
%load_ext watermark
%watermark --iversions
pymc      : 4.0.0
arviz     : 0.12.1
seaborn   : 0.12.0rc0
altair    : 4.2.0
numpy     : 1.21.6
matplotlib: 3.5.2
scipy     : 1.8.1
pandas    : 1.4.2