Find the ideal cluster

JojoSouza · December 12, 2020, 1:39pm

Some colleagues and I developed a hierarchical clustering algorithm in R to find the main clusters of agricultural industries for a particular city (e.g. London). It is working perfectly. Depending on the filters we apply in the algorithm, we were able to generate 6 clustering scenarios for London. For example, the first scenario generated 2 clusters, the second scenario 5 clusters, and so on. I would therefore like some help on how to choose the most appropriate one. I saw that there are some packages that help with this, like pvclust, but I couldn't get it to work for my case. Below is a brief executable example that shows the essence of what I want.

Any help is welcome! If you know how to do this using another package, feel free to describe it.

Best Regards.

library(rdist)
library(geosphere)
library(fpc)
 
 
df<-structure(list(Industries = c(1,2,3,4,5,6), 
                   Latitude = c(-23.8, -23.8, -23.9, -23.7, -23.7,-23.7), 
                   Longitude = c(-49.5, -49.6, -49.7, -49.8, -49.6,-49.9), 
                   Waste = c(526, 350, 526, 469, 534, 346)), class = "data.frame", row.names = c(NA, -6L))
 
df1<-df
 
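# clusters: distm() expects (longitude, latitude) column order and returns distances in metres
coordinates<-df[c("Latitude","Longitude")]
d<-as.dist(distm(coordinates[,2:1]))
# average-linkage hierarchical clustering on the geographic distances
fit.average<-hclust(d,method="average") 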
 
clusters<-cutree(fit.average, k=2) 
df$cluster <- clusters 
df
#   Industries Latitude Longitude Waste cluster
# 1          1    -23.8     -49.5   526       1
# 2          2    -23.8     -49.6   350       1
# 3          3    -23.9     -49.7   526       1
# 4          4    -23.7     -49.8   469       2
# 5          5    -23.7     -49.6   534       1
# 6          6    -23.7     -49.9   346       2

clusters1<-cutree(fit.average, k=5) 
df1$cluster <- clusters1
df1
#   Industries Latitude Longitude Waste cluster
# 1          1    -23.8     -49.5   526       1
# 2          2    -23.8     -49.6   350       1
# 3          3    -23.9     -49.7   526       2
# 4          4    -23.7     -49.8   469       3
# 5          5    -23.7     -49.6   534       4
# 6          6    -23.7     -49.9   346       5

AlexisW · December 14, 2020, 5:09am

Deciding the "correct" number of clusters is always difficult, since there is no fundamental reason why one number is better than another (e.g. it is always correct to say that there is 1 cluster, and it is equally correct to say that each point is its own cluster).

I've had some success in the past with ConsensusClusterPlus. The principle of consensus clustering is to rerun the clustering algorithm many times with small changes in the starting data, and see how often each pair of data points ends up in the same cluster. The package provides tools to look at the stability of the clusters obtained, so you can try running it with different numbers of clusters and see when the results become unstable.
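
To make the principle concrete, here is a minimal base-R sketch of the resampling idea, applied to the six example points from the question. It is not the ConsensusClusterPlus implementation: the helper names (haversine, consensus_matrix), the 80% subsampling rate, the 200 repetitions and the "ambiguity" summary are all assumptions chosen for illustration, and the haversine function is written out by hand only so the sketch runs without extra packages.

```r
# Example coordinates copied from the question.
lat <- c(-23.8, -23.8, -23.9, -23.7, -23.7, -23.7)
lon <- c(-49.5, -49.6, -49.7, -49.8, -49.6, -49.9)

# Hypothetical helper: great-circle (haversine) distance in metres.
haversine <- function(lat1, lon1, lat2, lon2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

n <- length(lat)
dmat <- outer(seq_len(n), seq_len(n),
              function(i, j) haversine(lat[i], lon[i], lat[j], lon[j]))

# Hypothetical helper: for a given k, repeatedly cluster a random subset of
# the points and record how often each pair lands in the same cluster.
consensus_matrix <- function(dmat, k, reps = 200, p_item = 0.8, seed = 1) {
  set.seed(seed)
  n <- nrow(dmat)
  m <- max(k, ceiling(p_item * n))  # a cut into k clusters needs >= k points
  together <- matrix(0, n, n)       # times a pair shared a cluster
  sampled  <- matrix(0, n, n)       # times a pair was drawn together
  for (r in seq_len(reps)) {
    idx <- sort(sample(n, m))
    fit <- hclust(as.dist(dmat[idx, idx]), method = "average")
    cl  <- cutree(fit, k = k)
    sampled[idx, idx]  <- sampled[idx, idx] + 1
    together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
  }
  together / pmax(sampled, 1)
}

# Illustrative summary (an assumption, not a standard index): pairs that are
# always or never together contribute 0, pairs that flip contribute up to 1.
for (k in 2:5) {
  cm  <- consensus_matrix(dmat, k)
  off <- cm[upper.tri(cm)]
  cat(sprintf("k = %d  ambiguity = %.3f\n", k, mean(4 * off * (1 - off))))
}
```

Lower ambiguity suggests a more stable partition for that k. With only six points the numbers are purely illustrative, and the extreme cases are trivially stable (for example, cutting five sampled points into five clusters always gives singletons), so the comparison is only meaningful on the real data and for intermediate values of k. ConsensusClusterPlus itself summarises stability differently, through consensus CDF and delta-area plots.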

JojoSouza · December 14, 2020, 3:31pm

Thank you so much @AlexisW

I will check.
