Clustering with a mix of binary and continuous variables: the bools seem to 'dominate' everything

I have a data set that contains 2 binary variables and 7 continuous variables, and I would like to cluster it. After scaling my variables, I initially tried kmeans, but when looking at the results I noticed that the binary variables caused perfect separation among the cluster groups, which was unexpected.

After some research I read this post on SO.

I then gave PAM clustering a shot with a distance matrix. However, I get a result very similar to my initial kmeans attempt: the 2 binary variables seem to determine the clusters and everything else is just a sideshow.

Here's my data, a csv with 3,200 rows and disguised field names and scaled data.

Here are the steps to reproduce:

pacman::p_load(tidyverse, fpc, cluster)
cluster_data <- readr::read_csv('cluster_data.csv')

# count the Bool columns (they are the first columns of cluster_data)
bool_cols <- cluster_data |> select(matches('Bool')) |> ncol()
g.dist = daisy(cluster_data, metric = "gower", type = list(symm = 1:bool_cols)) # I think/hope that I'm telling daisy to treat the first two columns as bools here
pc = pamk(g.dist, krange=2:10, criterion = "asw")

# get 4 clusters
pc$nc # 4

# summary of clusters:
cluster_data$cluster <- pc$pamobject$clustering

cluster_summary <- cluster_data |> 
  group_by(cluster) |> 
  summarise(
    N = n(),
    across(matches('Bool|Count'), mean, .names = 'Avg_{.col}')) |>
  mutate_at(vars(-cluster), ~ round(., 4)) 

cluster_summary |> View()

# binary variables cause perfect separation, expected some more mixing
cluster_data |> 
  group_by(cluster) |> 
  summarise(N = n(),
            Bool1Total = sum(Bool1),
            Bool2Total = sum(Bool2)
            )

The data frame cluster_summary shows averages of each field by cluster. The last block just illustrates the separation of clusters along the binary variables:

cluster_data |> 
  group_by(cluster) |> 
  summarise(N = n(),
            Bool1Total = sum(Bool1),
            Bool2Total = sum(Bool2)
            )
# A tibble: 4 × 4
  cluster     N Bool1Total Bool2Total
    <int> <int>      <dbl>      <dbl>
1       1  1944          0       1944
2       2   518        518        518
3       3   663          0          0
4       4   149        149          0

I might expect some clusters to have a higher or lower proportion of the bool variables, but not 100% separation as I have just now.

While reading the SO post I linked to, it sounded like PAM clustering with a Gower distance matrix was the way to get around not being able to use binary variables in kmeans. Nonetheless, when I use PAM, I still get perfect separation along these variables.

Is this expected? Is my approach wrong? How can I best cluster this data set?

Could you fix the data storage? The screenshot is the whole page; doesn't seem to be an option to download.

Hi sorry about that. On mobile right now but I tried in the Nextcloud app. Does this new link work?


Bingo! This one has a download button in the upper right part of the blue bar header. Will dig tonight (PST)


I was playing around and decided to simplify the problem by looking at the data excluding the booleans and found only two clusters among the non-binary variables. Wouldn't that make them perfectly separated as well?
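
For readers following along, a run like this might look roughly like the sketch below. It is not the code used above: it runs on synthetic stand-in data (the `toy` data frame, its column names, and the two planted groups are all made up for illustration) and picks k by average silhouette width from `cluster::pam()`, the same criterion that `criterion = "asw"` selects in `pamk()`.

```r
library(cluster)

set.seed(123)
# Synthetic stand-in for scaled count columns (hypothetical names and values):
# two well-separated groups of 150 rows each
n <- 150
toy <- as.data.frame(rbind(
  matrix(rnorm(n * 3, mean = 0), ncol = 3),
  matrix(rnorm(n * 3, mean = 10), ncol = 3)
))
names(toy) <- c("A_Count", "B_Count", "C_Count")

# Fit PAM for each candidate k and record the average silhouette width
k_range <- 2:6
asw <- sapply(k_range, function(k) pam(toy, k = k)$silinfo$avg.width)
names(asw) <- k_range

# The k with the largest average silhouette width is the one ASW would pick
best_k <- k_range[which.max(asw)]
print(round(asw, 3))
print(best_k)
```

Looking at the whole `asw` vector, not just the winner, is the useful part: if even the best value is low, the "clusters" may be little more than an arbitrary partition of one cloud of points.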

Thanks for taking a look.

Wouldn't that make them perfectly separated as well?

Are they separated along one particular count variable? Maybe there's just not detectable clusters with the data set I have.
All the variables except bool1 relate to in-product engagement data, with bool2 signifying whether the user completed an optional getting-started flow. The first bool just indicates whether the engagement with the service was done on a desktop device or not.

I would like to include the bools if they have such an impact on user behavior (regular, business-intelligence-style analysis looking at just ratios suggests they do).

Something I tried yesterday was adding a weights arg to my call to pamk() and after giving the bools just 0.1 weighting, I removed the perfect separation. I'm unsure if this is a sound practice or not. The results made more sense when I read them, but I cannot tell if that's just wishful thinking on my part.

When I did PCA and plotted the data points on a 2D x,y plane (leaving out the bools), I didn't see much in the way of distinguishable groups. I might need more data. Or maybe my approach is fine but there's just not detectable clusters with the data set as is :confused:

Do you think my approach is sound? PAM with weighting?
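
A minimal sketch of the down-weighting idea, assuming the weights are supplied through the `weights` argument of `cluster::daisy()` when the Gower matrix is built. The data are synthetic, the column names only mimic the ones in this thread, and the weight of 0.1 is simply the value mentioned above, not a recommendation.

```r
library(cluster)

set.seed(42)
# Synthetic stand-in data (hypothetical): 2 binary and 3 continuous columns
n <- 300
toy <- data.frame(
  Bool1   = rbinom(n, 1, 0.3),
  Bool2   = rbinom(n, 1, 0.6),
  A_Count = rnorm(n),
  B_Count = rnorm(n),
  C_Count = rnorm(n)
)

# Gower distance with equal weights for every column
d_equal <- daisy(toy, metric = "gower", type = list(symm = 1:2))

# Same distance, but each binary column counts a tenth as much as a continuous one
d_down <- daisy(toy, metric = "gower", type = list(symm = 1:2),
                weights = c(0.1, 0.1, 1, 1, 1))

pam_equal <- pam(d_equal, k = 4, diss = TRUE)
pam_down  <- pam(d_down,  k = 4, diss = TRUE)

# Cross-tabulate clusters against the four Bool1/Bool2 combinations
bool_combo <- interaction(toy$Bool1, toy$Bool2)
print(table(cluster = pam_equal$clustering, bool_combo))
print(table(cluster = pam_down$clustering,  bool_combo))
```

Comparing the two tables shows how strongly the cluster labels track the Boolean combinations under each weighting. It does not by itself say which weighting is right; that is a modelling decision about how much a Boolean mismatch should count relative to a difference in the continuous columns.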

I'm gonna call BS on myself. I feel like I can usually suss stuff like this that I haven't done before, but I don't feel I'm getting any traction on trying to cluster with this.

I did run across the notion that the bools dominate. I speculate that this is a consequence of their short Euclidean distance compared to the continuous vars. But ...? So, underweighting them would mitigate that.
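
One way to look at the dominance under Gower distance, which is what the original post uses: each column contributes a value between 0 and 1 per pair of rows, namely |x_i - x_j| / range for a numeric column and 0 or 1 (match or mismatch) for a symmetric binary column. The sketch below computes those per-column contributions by hand on synthetic data; the skewed `count` column is an assumption standing in for count-like engagement data, not the thread's actual data.

```r
set.seed(1)
n <- 200
bool1 <- rbinom(n, 1, 0.5)  # hypothetical binary column
count <- rexp(n)            # hypothetical skewed, count-like column

# Per-pair Gower contribution of a single column: |x_i - x_j| / range(x)
# (for a 0/1 column this is exactly 0 for a match and 1 for a mismatch)
gower_part <- function(x) as.vector(dist(x)) / diff(range(x))

mean_bool  <- mean(gower_part(bool1))
mean_count <- mean(gower_part(count))

print(c(bool = mean_bool, count = mean_count))
```

A mismatch on a binary column always contributes the maximum possible amount, while two rows rarely sit at opposite ends of a continuous column's range, so on average a binary column moves the total distance more than a continuous one. That is consistent with down-weighting the bools reducing their influence.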

As far as my run at it omitting the bools, it was just dropping those two at the start. I ended up with everything falling into cluster 1 or cluster 2.

I found getting a handle on count data squirrelly, and found doing visualization with {vcd} helpful.

Interesting that you got two clusters... would you mind sharing the code that arrived at that? Was that with kmeans or something else?

By the way, here's the PCA and resulting visualization I did, not sure if this is helpful or not. I was looking to eyeball real distinct groups but it all just seems a bit messy and all over the place, which makes me doubt these data are 'clusterable'.

Given the above maybe 2 clusters does make more sense than the 4 I got.

library(factoextra)

cdata_princomp <- cluster_data |> princomp()
fviz_pca_ind(cdata_princomp,
             palette = "jco",
             addEllipses = FALSE,
             label = "none",
             col.var = "black",
             repel = TRUE,
             legend.title = "Converted")

Just substituted

cluster_no_bool <- cluster_data[3:9]

But I'm getting a different PCA result, with 4 distinct groupings

library(factoextra)
#> Loading required package: ggplot2
#> Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

cluster_data <- readr::read_csv('/home/roc/projects/demo/cluster_data.csv')
#> Rows: 3274 Columns: 9
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (9): Bool1, Bool2, A_Count, B_Count, C_Count, D_Count, E_Count, F_Count,...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

cdata_princomp <- cluster_data |> princomp()
fviz_pca_ind(cdata_princomp,
             palette = "jco",
             addEllipses = FALSE,
             label = "none",
             col.var = "black",
             repel = TRUE,
             legend.title = "Converted")

Created on 2022-11-13 by the reprex package (v2.0.1)

Ah. Yeah, I see. I shared an image using an earlier data set, not the one I shared here. Please ignore the PCA image I shared above; it's a distraction.

Here's the same PCA but using cluster_no_bool instead of cluster_data

Seems like no definable clusters :cry:

I think I need to look at my source data and see what else I can pull.


The difference between the last two plots pretty well shows how dominant the booleans are in the clustering algorithm.

I think that, given you have continuous and dummy variables, you should use the Gower distance instead of the Euclidean distance, since it can be applied to a data set with variables of different natures. Here is a complete post about this.


Hi @Adan1 right, yes that's what we tried, see this line of code from the main post:

g.dist = daisy(cluster_data, metric = "gower", type = list(symm = 1:bool_cols)) # I think/hope that I'm telling daisy to treat the first two columns as bools here

Still, with this data we struggled to pull meaningful clusters.

Do you have similar data with no clustering? Maybe this set just lacks any.

Surely, the presence of binary variables has a strong effect on clustering (and on PCA too). If you think carefully, two boolean variables may identify up to four natural groups (0-0, 1-0, 0-1 and 1-1). Methodologically, computing a Euclidean distance between such variables is also wrong, because a hypothetical distance matrix constructed only on the boolean variables contains only three distinct values: 0 between identical combinations, d1 between combinations that differ in one variable (dist(0-0,1-0)=dist(0-0,0-1)=dist(1-1,1-0)=dist(1-1,0-1)=d1), and d2 between combinations that differ in both (dist(1-0,0-1)=dist(0-0,1-1)=d2), like in a square where you compute all the possible distances between pairs of vertices. You have several strategies:

  1. (Quantify boolean variables) You quantify the boolean variables using the scores of a correspondence analysis and then use those scores together with the other continuous variables in the clustering step, but it is better to have more than two boolean variables;
  2. (Categorize continuous variables) You can transform the continuous variables into categorical ones (binning them), then use a correspondence analysis followed by a clustering technique. The advantage of categorizing is that it can also reveal non-linear patterns among data features; the problem is the choice of binning parameters;
    2.1) after categorization (or vector quantization) you can apply clustering on categorical data.
  3. Gower can be a solution, with several cautions!
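
The point about the square can be checked directly in base R. This is only an illustration of the geometry described above, using unscaled 0/1 coordinates (with scaled columns the sides of the rectangle change length, but there are still only a handful of distinct values).

```r
# The four possible combinations of two boolean variables, as corners of a square
corners <- rbind(
  "0-0" = c(0, 0),
  "1-0" = c(1, 0),
  "0-1" = c(0, 1),
  "1-1" = c(1, 1)
)

# Full Euclidean distance matrix between the corners (diagonal included)
d <- as.matrix(dist(corners))
print(round(d, 3))

# Distinct values: 0 (same corner), the side length d1, and the diagonal d2
distinct <- sort(unique(round(as.vector(d), 6)))
print(distinct)
```

With so few possible values, any distance built only from the two booleans can do no more than say which of the four corners a row sits on, which is why they so readily carve the data into up to four groups.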