Random forest for clustering: step by step example

I'm trying to follow these three steps for clustering with random forest:

  1. The unsupervised Random Forest algorithm was used to generate a proximity matrix using all listed clinical variables.

  2. PAM clustering of this first proximity matrix generated the initial classes

  3. A supervised Random Forest analysis of the initial classes a) indicated out of bag error rates of about 25–30%. b) had variable importance plots demonstrating that FEV1%, FVC%, Age, Height, Weight, BMI, Age, FEV1% product and Brasfield scores were the most important variables in forming the classes. (Fig. 2A). This pattern was very similar with the k = 3, k = 4, and k = 6 classifications.

It is from page 5 of this paper
https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0122705&type=printable

This is my data
https://www.mediafire.com/file/b6b8a1rim2dr8cp/dat.xlsx/file

This is my code:

    library(readxl)
    library(randomForest)
    library(dplyr)
    dat <- read_xlsx("C:/Desktop/dat.xlsx")
    # Convert every variable to a factor
    cols <- c("studies", "age", "gender", paste0("VAR", 1:10))
    dat[cols] <- lapply(dat[cols], as.factor)

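    # Step 1: unsupervised random forest via a synthetic second class

    set.seed(23)
    n <- nrow(dat)
    # Resample each column independently to build the synthetic data
    # (a lambda replaces the deprecated funs())
    datBS <- mutate(dat, across(everything(), ~ sample(.x, replace = TRUE)))
    # Class 1 = original rows, class 2 = synthetic rows
    y <- factor(c(rep(1, n), rep(2, n)))
    rf <- randomForest(x = rbind(dat, datBS), y = y, proximity = TRUE)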


    # Step 2

    library(cluster)
    gower_dist <- daisy(1 - rf$proximity, metric = "gower")
    pam_fit <- pam(gower_dist, diss = TRUE, k = 2)
    sil_width <- pam_fit$silinfo$avg.width
    sil_width
    pam_fit$clustering

I got Step 1 from here, and it seems to be OK.

In Step 2 I run PAM on 1 - proximity, which I think is similar to what they did.
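One thing that may be worth checking in Step 2: `1 - proximity` is already a dissimilarity matrix, so it can probably go to `pam()` through `as.dist()`. `daisy()` expects a data matrix, so it would compute Gower distances between the rows of that matrix instead. Below is a minimal sketch of the idea, not the paper's exact procedure. It uses the built-in `iris` measurements as stand-in data and relies on `randomForest()` running in unsupervised mode when `y` is omitted, in which case the proximity matrix covers only the original rows. Both choices are assumptions made for illustration.

```r
library(randomForest)
library(cluster)

set.seed(23)
x <- iris[, 1:4]  # stand-in data (assumption), not the data from the post

# Omitting y makes randomForest() run in unsupervised mode
rf_unsup <- randomForest(x = x, proximity = TRUE)

# 1 - proximity is already a dissimilarity; wrap it as a "dist" object
diss <- as.dist(1 - rf_unsup$proximity)

# PAM directly on the dissimilarities (k = 2 only mirrors the post)
pam_fit <- pam(diss, k = 2, diss = TRUE)
sil_width <- pam_fit$silinfo$avg.width
table(pam_fit$clustering)
```

Comparing `sil_width` across several values of `k` is one way to pick the number of clusters.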

In Step 3, if I understand correctly, I need to run a supervised random forest on my original data with the classes obtained from PAM. But, as remarked here, I have twice the number of rows. Where are the classes of the original data?
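A possible way to read this: because `rbind(dat, datBS)` stacks the original rows first, rows and columns `1:n` of the proximity matrix belong to the original data, and the remaining `n` belong to the synthetic copy. Under that assumption, a sketch of Steps 2 and 3 could look like the following. Here `iris` is only a stand-in for the real data, `k = 3` is arbitrary, and `prox_orig`, `classes` and `rf_sup` are names made up for this example.

```r
library(randomForest)
library(cluster)

set.seed(23)
dat <- iris[, 1:4]  # stand-in for the real data (assumption)
n <- nrow(dat)

# Step 1 as in the post: synthetic copy with each column resampled independently
datBS <- as.data.frame(lapply(dat, function(col) sample(col, replace = TRUE)))
y <- factor(c(rep(1, n), rep(2, n)))
rf <- randomForest(x = rbind(dat, datBS), y = y, proximity = TRUE)

# The original rows come first, so keep only that n x n block
prox_orig <- rf$proximity[1:n, 1:n]

# Step 2: PAM on the dissimilarities of the original rows only
pam_fit <- pam(as.dist(1 - prox_orig), k = 3, diss = TRUE)
classes <- factor(pam_fit$clustering)  # one label per original row

# Step 3: supervised forest predicting the PAM classes from the original data
rf_sup <- randomForest(x = dat, y = classes, importance = TRUE)
oob_error <- rf_sup$err.rate[rf_sup$ntree, "OOB"]  # OOB error after the last tree
importance(rf_sup)
```

`varImpPlot(rf_sup)` would then draw the kind of variable importance plot the paper mentions.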
