Problem generating graphics using hclust and pvclust in R

I would like help with the following question.
I am using hierarchical clustering on my data, and I am running the same example with both the hclust function and pvclust (which uses hclust internally).

I ran two tests: the first on a data set with 8 properties and the second on one with 19 properties. The first worked: the two dendrograms are the same, as I expected, since both approaches use the hclust function. With 19 properties, however, the two dendrograms are different. Could you help me understand and solve this problem?

Thank you!

library(rdist)
library(pvclust)
library(geosphere)

### USING HCLUST
df <- structure(
  list(Propertie = c(1,2,3,4,5,6,7,8),
       Latitude = c(-24.779225, -24.789635, -24.763461, -24.794394, -24.747102,-24.781307,-24.761081,-24.761084),
       Longitude = c(-49.934816, -49.922324, -49.911616, -49.906262, -49.890796,-49.8875254,-49.8875254,-49.922244),
       Waste = c(526, 350, 526, 469, 285, 433, 456,825)),
  class = "data.frame", row.names = c(NA, -8L))

coordinates<-subset(df,select=c("Latitude","Longitude")) 
d<- dist(distm(coordinates[,2:1]), method="euclidean")
fit.average<-hclust(d,method="average") 
plot(fit.average,hang=-1,cex=.8,main="Average Linkage Clustering")

[Figure: hclust dendrogram for the 8-property data]

### USING PVCLUST
coordinates<-subset(df,select=c("Latitude","Longitude")) 
d<-dist(distm(coordinates[,2:1]))
mat <- as.matrix(d)
mat <- t(mat)
fit <- pvclust(mat, method.hclust="average", method.dist="euclidean", 
               nboot=10)
plot(fit,hang=-1,cex=.8, cex.pv=.5, print.num=FALSE, print.pv=FALSE, 
      main="Average Linkage Clustering") 

[Figure: pvclust dendrogram for the 8-property data]

### FOR DATABASE DF WITH 19 PROPERTIES

df <- structure(
  list(Propertie = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19),
       Latitude = c(-23.8, -23.9, -23.5, -23.4, -23.6,-23.9, -23.2, -23.5, -23.8, -23.7, -23.8, -23.9, -23.4, -23.9,-23.9, -23.2, -23.3, -23.7, -23.8),
       Longitude = c(-49.1, -49.3,-49.4, -49.7, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.7,-49.2, -49.5, -49.8, -49.5, -49.3, -49.3, -49.2, -49.5),
       Waste = c(526,350, 526, 469, 285, 175, 175, 350, 350, 175, 350, 175, 175, 364,175, 175, 350, 45.5, 54.6)),
  class = "data.frame", row.names = c(NA, -19L))

[Figure: hclust dendrogram for the 19-property data]

[Figure: pvclust dendrogram for the 19-property data]

By "different", do you mean the heights, the clustering, or both?

library(rdist)
library(pvclust)
library(geosphere)

fit_avg <- function(x) {
  coordinates <- subset(x, select = c("Latitude", "Longitude"))
  d <- dist(distm(coordinates[, 2:1]), method = "euclidean")
  fit.average <- hclust(d, method = "average")
  plot(fit.average, hang = -1, cex = .8, main = "Average Linkage Clustering")
}

fit_mat <- function(x) {
  coordinates <- subset(x, select = c("Latitude", "Longitude"))
  d <- dist(distm(coordinates[, 2:1]))
  mat <- as.matrix(d)
  mat <- t(mat)
  fit <- pvclust(mat,
                 method.hclust = "average", method.dist = "euclidean",
                 nboot = 1000
  )
  plot(fit,
       hang = -1, cex = .8, cex.pv = .5, print.num = FALSE, print.pv = FALSE,
       main = "Average Linkage Clustering"
  )
}

DF8 <- structure(
  list(
    Propertie = c(1, 2, 3, 4, 5, 6, 7, 8), Latitude = c(-24.779225, -24.789635, -24.763461, -24.794394, -24.747102, -24.781307, -24.761081, -24.761084),
    Longitude = c(-49.934816, -49.922324, -49.911616, -49.906262, -49.890796, -49.8875254, -49.8875254, -49.922244),
    Waste = c(526, 350, 526, 469, 285, 433, 456, 825)
  ),
  class = "data.frame", row.names = c(NA, -8L)
)

DF19 <- structure(list(
  Propertie = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19), Latitude = c(-23.8, -23.9, -23.5, -23.4, -23.6, -23.9, -23.2, -23.5, -23.8, -23.7, -23.8, -23.9, -23.4, -23.9, -23.9, -23.2, -23.3, -23.7, -23.8),
  Longitude = c(-49.1, -49.3, -49.4, -49.7, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.7, -49.2, -49.5, -49.8, -49.5, -49.3, -49.3, -49.2, -49.5),
  Waste = c(526, 350, 526, 469, 285, 175, 175, 350, 350, 175, 350, 175, 175, 364, 175, 175, 350, 45.5, 54.6)
), class = "data.frame", row.names = c(NA, -19L))


fit_avg(DF8)

fit_mat(DF8)
#> Bootstrap (r = 0.5)... Done.
#> Bootstrap (r = 0.62)... Done.
#> Bootstrap (r = 0.75)... Done.
#> Bootstrap (r = 0.88)... Done.
#> Bootstrap (r = 1.0)... Done.
#> Bootstrap (r = 1.12)... Done.
#> Bootstrap (r = 1.25)... Done.
#> Bootstrap (r = 1.38)... Done.

fit_avg(DF19)

fit_mat(DF19)
#> Bootstrap (r = 0.47)... Done.
#> Bootstrap (r = 0.58)... Done.
#> Bootstrap (r = 0.68)... Done.
#> Bootstrap (r = 0.79)... Done.
#> Bootstrap (r = 0.89)... Done.
#> Bootstrap (r = 1.0)... Done.
#> Bootstrap (r = 1.05)... Done.
#> Bootstrap (r = 1.16)... Done.
#> Bootstrap (r = 1.26)... Done.
#> Bootstrap (r = 1.37)... Done.

Hi,

As a start, I've pinpointed where the difference arises.

a1$order
#> [1] 6 5 7 3 8 4 1 2
a2$hclust$order
#> [1] 6 5 7 3 8 4 1 2
b1$order
#> [1] 4 8 3 13 7 16 17 2 12 1 18 14 5 10 11 6 15 9 19
b2$hclust$order
#> [1] 17 7 16 4 13 3 8 18 2 1 12 5 10 14 6 15 19 9 11


The objects above were returned by my functions after I modified them to return the fit objects instead of plotting. Still to solve: why is `order` different in one case but not the other?
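For anyone who wants to reproduce that comparison, here is a minimal sketch of what such modified functions could look like. The names `fit_avg_obj` and `fit_mat_obj` are hypothetical (they are not in the code above), I am assuming `a1`/`a2` are the hclust and pvclust fits for the 8-property data, and `nboot = 10` is only there to keep the run short.

```r
library(geosphere)  # distm()
library(pvclust)    # pvclust()

# Coordinates of the 8-property data (same values as DF8 above).
DF8 <- data.frame(
  Latitude  = c(-24.779225, -24.789635, -24.763461, -24.794394,
                -24.747102, -24.781307, -24.761081, -24.761084),
  Longitude = c(-49.934816, -49.922324, -49.911616, -49.906262,
                -49.890796, -49.8875254, -49.8875254, -49.922244)
)

# Hypothetical variant of fit_avg(): returns the hclust object, no plot.
fit_avg_obj <- function(x) {
  coordinates <- subset(x, select = c("Latitude", "Longitude"))
  d <- dist(distm(coordinates[, 2:1]), method = "euclidean")
  hclust(d, method = "average")
}

# Hypothetical variant of fit_mat(): returns the pvclust object, no plot.
fit_mat_obj <- function(x, nboot = 10) {
  coordinates <- subset(x, select = c("Latitude", "Longitude"))
  d <- dist(distm(coordinates[, 2:1]))
  mat <- t(as.matrix(d))
  pvclust(mat, method.hclust = "average", method.dist = "euclidean",
          nboot = nboot)
}

a1 <- fit_avg_obj(DF8)  # assumed to correspond to a1 above
a2 <- fit_mat_obj(DF8)  # assumed to correspond to a2 above

# The leaf ordering of each dendrogram; pvclust stores its hclust fit
# in the $hclust component.
a1$order
a2$hclust$order
identical(a1$order, a2$hclust$order)
```

The same two calls on the 19-property data would give the `b1` and `b2` objects.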

I'm sorry, I don't understand your question. You can see that b1 and b2 are not the same. But shouldn't they be the same?

Sorry to be obscure. I meant that I've pinpointed the object that contains the differences, and now I'm back to trying to figure out how that object gets output. More to come.

Great @technocrat ! Thank you so much for this!

Best regards.

I've been unable to untangle this. The problem lies either with the logic (but I found no similar issue reported) or with the shape of the matrix (19x2 vs 8x2). I suggest asking the package author at the address at the foot of the GitHub repo.
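One thing that may be worth checking before writing to the author: the two pipelines do not appear to hand hclust the same distances. As far as I understand pvclust, it clusters the columns of its input and computes the distance itself from `method.dist`, so passing it `as.matrix(dist(distm(...)))` would add one more `dist()` step than the plain hclust call has. The base-R sketch below only illustrates that idea. It is an assumption-laden stand-in, not a confirmed diagnosis: plain Euclidean distance on the raw coordinates replaces `distm()` so that no packages are needed, and the `dist(t(mat))` line is my reading of what pvclust does internally.

```r
# Coordinates of the 19-property data (same values as DF19 above).
lat <- c(-23.8, -23.9, -23.5, -23.4, -23.6, -23.9, -23.2, -23.5, -23.8, -23.7,
         -23.8, -23.9, -23.4, -23.9, -23.9, -23.2, -23.3, -23.7, -23.8)
lon <- c(-49.1, -49.3, -49.4, -49.7, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6,
         -49.7, -49.2, -49.5, -49.8, -49.5, -49.3, -49.3, -49.2, -49.5)

# ASSUMPTION: Euclidean distance on raw degrees stands in for
# geosphere::distm(), only to keep this sketch in base R.
m <- as.matrix(dist(cbind(lon, lat)))

# hclust pipeline: hclust() receives dist(m).
d_hclust <- dist(m, method = "euclidean")

# pvclust pipeline: the data handed over is as.matrix(dist(m)) ...
mat <- t(as.matrix(dist(m)))
# ... and (ASSUMPTION about pvclust internals) a further distance is then
# computed between the columns of that matrix.
d_pvclust <- dist(t(mat), method = "euclidean")

# Are the two distance objects the same?
isTRUE(all.equal(as.vector(d_hclust), as.vector(d_pvclust)))

# Cluster both and compare the leaf ordering.
h1 <- hclust(d_hclust, method = "average")
h2 <- hclust(d_pvclust, method = "average")
identical(h1$order, h2$order)
```

If that reading is right, the matching trees for the 8-property data would be a coincidence of that small data set, and passing the `distm()` matrix itself to pvclust, e.g. `pvclust(distm(coordinates[, 2:1]), method.hclust = "average", method.dist = "euclidean")`, should put both calls on the same distances. I have not verified this against the plots in this thread.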

No problem, @technocrat! Thank you so much for trying. I will send a message, as you suggested. Thank you so much!

Thanks for the reply, @technocrat. I meant that the clustering is different. To me, the two figures for the data set with 19 properties should be the same, but they are different. Note that for the data set with 8 properties they are the same. I would like to understand why that is.