Count the number of clusters on a ggplot scatterplot (without providing labels)?

I have a data frame with columns 'x' and 'y' holding the x/y coordinates of a scatterplot I've made using ggplot2. I'm looking for some way to ask, "how many clusters exist here?" I understand that some user input may be needed to define what counts as a 'cluster'.

I have had some success using Seurat, because it contains a function to label clusters. However, it works more like finding the clusters that correspond to a vector of labels provided by the user (e.g., I provide 5 unique labels, so it just goes and finds 5 clusters).

Min Reprex:

Seurat's LabelClusters function is very useful for labeling clusters starting solely from X/Y coordinates:

library(umap)
library(Seurat)
library(ggplot2)  # ggplot() and geom_point() are used below
my_umap <- umap(iris[,c(1:4)])
my_umap <- as.data.frame(my_umap$layout)
my_umap$id <- iris$Species
colnames(my_umap) <- c("x","y", "id")

p <- ggplot(my_umap, aes(x=x,y=y,color=id)) + geom_point()

LabelClusters(plot=p, id='id', color="black")

However, I need to detect the total number of clusters from these data without providing labels. By this I mean first detecting how many clusters exist. Here, for example, we might see 5 clusters instead of 3.

Can this be achieved in some way?

I'm not aware of a way to automatically choose the number of clusters. kmeans() requires you to specify the number of clusters. You can try a range of cluster counts and compare the within- and between-cluster sums of squares to choose the number that strikes the desired balance. Or do a hierarchical clustering and choose from those results.

library(tidyverse)

# pca and k-means clusters
pc <- iris %>% select_if(is.numeric) %>% princomp()
km <- iris %>% select_if(is.numeric) %>% list() %>% map2(1:10, kmeans) %>% set_names(1:length(.))

# sum of squares analysis  
clus <- km %>% map_dfr(~.["tot.withinss"], .id = "num_clusters")
print(clus)
#> # A tibble: 10 x 2
#>    num_clusters tot.withinss
#>    <chr>               <dbl>
#>  1 1                   681. 
#>  2 2                   152. 
#>  3 3                    78.9
#>  4 4                    71.4
#>  5 5                    49.8
#>  6 6                    39.4
#>  7 7                    36.9
#>  8 8                    33.1
#>  9 9                    32.3
#> 10 10                   27.0

# add principal components and cluster ids to the data frame
# (km[[4]] is the 4-cluster fit; change the index to try another k)
df <- iris %>% 
  as_tibble() %>%
  bind_cols(pc$scores, cluster = km[[4]]$cluster) %>%
  mutate(cluster = as.factor(cluster))

# plot
df %>%
  ggplot() + 
  aes(Comp.1, Comp.2, color = cluster) + 
  geom_point()
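
To make the "strikes the desired balance" step a little more concrete, here is a minimal base-R sketch of the usual elbow inspection. It is not part of the reprex above: the seed, `nstart = 25`, and the `rel_drop` helper column are my own assumptions, and the numbers will not match the table printed earlier because k-means starts are random.

```r
# Sketch only: inspect how much each extra cluster reduces tot.withinss.
set.seed(42)                      # assumption: fixed seed for repeatable k-means starts
x  <- iris[, 1:4]                 # same numeric columns as above
ks <- 1:10

# total within-cluster sum of squares for each candidate k
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# fraction of the remaining within-cluster SS removed by going from k-1 to k
rel_drop <- -diff(wss) / wss[-length(wss)]
print(data.frame(k = ks[-1], rel_drop = round(rel_drop, 2)))

# the "elbow" is where the curve stops dropping steeply
plot(ks, wss, type = "b",
     xlab = "number of clusters", ylab = "total within-cluster SS")
```

You still have to decide by eye where the curve flattens, so this narrows the choice rather than making it for you.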

Created on 2022-01-31 by the reprex package (v2.0.1)
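
For the hierarchical route mentioned above, a rough sketch using only base R (`dist`, `hclust`, `cutree`) might look like the following. The scaling, the Ward linkage, and the "largest gap between merge heights" rule are my assumptions for illustration; it is one heuristic among several, not a definitive way to count clusters, and it often favors a small number of clusters.

```r
# Sketch only: hierarchical clustering, then cut the tree at the largest height gap.
xy <- scale(iris[, 1:4])                      # assumption: standardize columns first
hc <- hclust(dist(xy), method = "ward.D2")    # assumption: Ward linkage

# hc$height has one entry per merge (n - 1 in total), in merge order
h    <- hc$height
gaps <- diff(h)

# cutting between merge i and merge i + 1 leaves n - i clusters,
# and n equals length(h) + 1
k_guess  <- length(h) + 1 - which.max(gaps)
clusters <- cutree(hc, k = k_guess)

print(k_guess)
print(table(clusters))

plot(hc, labels = FALSE)                      # dendrogram, to judge the cut by eye
```

The same calls work on a two-column x/y data frame in place of `iris[, 1:4]`. Looking at the dendrogram is usually more informative than trusting `k_guess` on its own.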
