Tidymodels KKNN -- how do I return which categories belong to which clusters?

I've been exploring here:

The video shows mostly continuous predictors. However, I would like to:

  1. Use categorical predictors
  2. Return which categories (or columns, see explanation below) exhibit a high level of 'belonging' to which clusters

Side note: I think UMAP would also work (though in Tidymodels it is only available as step_umap() at the moment).

So to clarify, let's take the following scenario:

  1. One-hot encode a bunch of categorical variables
  2. Run KNN (or UMAP)
  3. Check whether more than X% (e.g. 90%; could this be tuned?) of the values from one dummy/category belong to one cluster
  4. Repeat step 3 for the rest of the clusters, as a multinomial outcome (bonus: tune the optimal number of clusters)
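
The four steps above can be sketched in base R roughly as follows. This is only an illustration under a few assumptions: the toy data frame `df` and its column names are made up, the 0.9 threshold is the example value from step 3, and `kmeans()` is swapped in for the clustering step, because a k-nearest-neighbours model by itself does not produce cluster assignments.

```r
set.seed(42)

# Hypothetical toy data: two categorical predictors (names are made up)
df <- data.frame(
  colour = factor(sample(c("red", "green", "blue"), 200, replace = TRUE)),
  size   = factor(sample(c("small", "large"), 200, replace = TRUE))
)

# Step 1: one-hot encode, keeping an indicator column for every level
dummies <- model.matrix(
  ~ . - 1,
  data = df,
  contrasts.arg = lapply(df, contrasts, contrasts = FALSE)
)

# Step 2: cluster the indicator matrix.
# Assumption: kmeans() is swapped in here to obtain cluster ids.
k  <- 3
km <- kmeans(dummies, centers = k, nstart = 10)

# Step 3: for each category, the share of its rows that falls in each cluster
counts <- t(rowsum(dummies, group = km$cluster))  # categories x clusters
share  <- prop.table(counts, margin = 1)          # each row sums to 1

# Step 4: pick the dominant cluster per category and apply the threshold
threshold <- 0.9
dominant <- data.frame(
  category  = rownames(share),
  cluster   = colnames(share)[max.col(share, ties.method = "first")],
  share     = apply(share, 1, max),
  row.names = NULL
)
dominant$belongs <- dominant$share >= threshold
print(dominant)
```

Both `threshold` and `k` are plain variables here, so they could be looped over a grid if you want to tune them.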

Essentially, I would like KKNN to return which columns/categories belong to which clusters. Euclidean distance can be problematic in high dimensions, so I'm also considering Pearson's or chi-square, but I'm unsure how to implement the decision-making process for the boundaries. Would that be possible?

It seems like you are asking how to calculate the correlation of a categorical variable with a cluster id. That seems relatively straightforward to me.

Ah yes, that's one way to look at it. How can I extract the correlation of each categorical variable with each cluster? Would this also be doable with step_knn() from tidymodels?

The only ingredients you would need are your original data and the cluster assignment from the KNN on that data.
Then I think the cor() function with the Spearman method would do it.
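
A minimal sketch of that idea, under some assumptions: the toy data and column names are hypothetical, `kmeans()` is used only as a stand-in to get a cluster id per row, and the cluster id is treated as nominal, so it is expanded into one 0/1 indicator per cluster before being passed to `cor()`.

```r
set.seed(42)

# Hypothetical toy data: two categorical predictors (names are made up)
df <- data.frame(
  colour = factor(sample(c("red", "green", "blue"), 200, replace = TRUE)),
  size   = factor(sample(c("small", "large"), 200, replace = TRUE))
)

# One indicator column per category level
dummies <- model.matrix(
  ~ . - 1,
  data = df,
  contrasts.arg = lapply(df, contrasts, contrasts = FALSE)
)

# Assumption: kmeans() stands in for whatever produced the cluster assignment
k <- 3
cluster_id <- kmeans(dummies, centers = k, nstart = 10)$cluster

# Cluster ids are labels, not ordered values, so use one indicator per cluster
cluster_ind <- sapply(seq_len(k), function(j) as.numeric(cluster_id == j))
colnames(cluster_ind) <- paste0("cluster_", seq_len(k))

# Categories in rows, clusters in columns
assoc <- cor(dummies, cluster_ind, method = "spearman")
print(round(assoc, 2))
```

Because both inputs are 0/1 indicators, the Spearman and Pearson results coincide here (this is the phi coefficient), so the choice of method matters less than it would for ordinal data.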
