Find the ideal cluster

My colleagues and I developed a hierarchical clustering algorithm in R to find the main clusters of agricultural industries for a given city (e.g. London). It works as intended. Using the filters built into the algorithm, we generated 6 clustering scenarios for London: for example, the first scenario produced 2 clusters, the second 5 clusters, and so on. I would like help choosing the most appropriate scenario. I have seen packages that help with this, such as pvclust, but I couldn't get it to work for my case. Below is a short reproducible example that shows the essence of what I want.

Any help is welcome! If you know how to do this with another package, feel free to describe it.

Best Regards.

library(rdist)
library(geosphere)
library(fpc)
 
 
df<-structure(list(Industries = c(1,2,3,4,5,6), 
                   Latitude = c(-23.8, -23.8, -23.9, -23.7, -23.7,-23.7), 
                   Longitude = c(-49.5, -49.6, -49.7, -49.8, -49.6,-49.9), 
                   Waste = c(526, 350, 526, 469, 534, 346)), class = "data.frame", row.names = c(NA, -6L))
 
df1<-df
 
#clusters
coordinates<-df[c("Latitude","Longitude")]
d<-as.dist(distm(coordinates[,2:1]))
fit.average<-hclust(d,method="average") 
 
clusters<-cutree(fit.average, k=2) 
df$cluster <- clusters 
df
#>   Industries Latitude Longitude Waste cluster
#> 1          1    -23.8     -49.5   526       1
#> 2          2    -23.8     -49.6   350       1
#> 3          3    -23.9     -49.7   526       1
#> 4          4    -23.7     -49.8   469       2
#> 5          5    -23.7     -49.6   534       1
#> 6          6    -23.7     -49.9   346       2

clusters1<-cutree(fit.average, k=5) 
df1$cluster <- clusters1
df1
#>   Industries Latitude Longitude Waste cluster
#> 1          1    -23.8     -49.5   526       1
#> 2          2    -23.8     -49.6   350       1
#> 3          3    -23.9     -49.7   526       2
#> 4          4    -23.7     -49.8   469       3
#> 5          5    -23.7     -49.6   534       4
#> 6          6    -23.7     -49.9   346       5

Deciding on the "correct" number of clusters is always difficult, since there is no fundamental reason why one number is better than another (e.g. it is always correct to say that there is 1 cluster, and it is always correct to say that each point is its own cluster).
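
To make those two extremes concrete, here is a small base R sketch using the coordinates from the question. It assumes plain Euclidean distance on the degree values, purely to keep the example free of extra packages (the question uses geosphere::distm instead). Every k from 1 up to the number of points is a valid cut of the same tree.

```r
# Coordinates copied from the question's data frame.
coords <- data.frame(
  Latitude  = c(-23.8, -23.8, -23.9, -23.7, -23.7, -23.7),
  Longitude = c(-49.5, -49.6, -49.7, -49.8, -49.6, -49.9)
)

# Assumption: Euclidean distance on raw degrees, for illustration only.
fit <- hclust(dist(coords), method = "average")

# cutree() accepts a vector of k and returns one column per cut.
cuts <- cutree(fit, k = 1:nrow(coords))

# Count the distinct cluster labels in each column.
n_clusters <- apply(cuts, 2, function(col) length(unique(col)))
print(n_clusters)
```

The tree itself does not prefer any of these cuts, which is why some outside criterion (stability, domain knowledge, a validity index) is needed.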

I've had some success in the past with ConsensusClusterPlus. The principle of consensus clustering is to rerun the clustering algorithm many times with small changes in the starting data and see how often each pair of data points ends up in the same cluster. The package provides tools to look at the stability of the clusters obtained, so you can run it with different numbers of clusters and see when the results become unstable.
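
The resampling idea can be sketched by hand in base R. This is not the ConsensusClusterPlus API, only a rough illustration of the principle. The helpers `haversine_matrix` and `consensus_matrix` and the `ambiguity` score are hypothetical names invented for this example, and the haversine function is a dependency-free stand-in for geosphere::distm.

```r
# Coordinates copied from the question's data frame.
coords <- data.frame(
  Latitude  = c(-23.8, -23.8, -23.9, -23.7, -23.7, -23.7),
  Longitude = c(-49.5, -49.6, -49.7, -49.8, -49.6, -49.9)
)

# Hypothetical helper: pairwise haversine distance in metres.
haversine_matrix <- function(lat, lon, radius = 6371000) {
  lat <- lat * pi / 180
  lon <- lon * pi / 180
  dlat <- outer(lat, lat, "-")
  dlon <- outer(lon, lon, "-")
  a <- sin(dlat / 2)^2 + outer(cos(lat), cos(lat)) * sin(dlon / 2)^2
  2 * radius * asin(pmin(sqrt(a), 1))
}

# Hypothetical helper: fraction of resampling runs in which each pair of
# points lands in the same cluster, among runs where both were drawn.
consensus_matrix <- function(d, k, reps = 200, frac = 0.8) {
  n <- nrow(d)
  together <- matrix(0, n, n)  # times a pair shared a cluster
  sampled  <- matrix(0, n, n)  # times a pair was drawn together
  for (r in seq_len(reps)) {
    idx <- sort(sample(n, size = max(k, ceiling(frac * n))))
    fit <- hclust(as.dist(d[idx, idx]), method = "average")
    cl  <- cutree(fit, k = k)
    together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
    sampled[idx, idx]  <- sampled[idx, idx] + 1
  }
  together / pmax(sampled, 1)
}

set.seed(1)
d  <- haversine_matrix(coords$Latitude, coords$Longitude)
ks <- 2:4

# Ad hoc summary (an assumption, not what the package reports):
# mean of c * (1 - c) over all pairs. 0 means every pair is always
# together or always apart; 0.25 is the most ambiguous case.
ambiguity <- sapply(ks, function(k) {
  cm <- consensus_matrix(d, k)
  pairs <- cm[upper.tri(cm)]
  mean(pairs * (1 - pairs))
})
names(ambiguity) <- ks
print(round(ambiguity, 3))
```

With only six points the numbers themselves mean little; the sketch is there to show the mechanics. On a real data set, a k whose score stays close to 0 gives clusters that survive resampling, while a jump in the score suggests the extra clusters are unstable.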

Thank you so much @AlexisW

I will check.
