Random forest for clustering: step by step example

anacho · August 16, 2019, 10:17am

I'm trying to follow these 3 steps for clustering using random forest:

The unsupervised Random Forest algorithm was used to generate a proximity matrix using all listed clinical variables.
PAM clustering of this first proximity matrix generated the initial classes
A supervised Random Forest analysis of the initial classes a) indicated out of bag error rates of about 25–30%. b) had variable importance plots demonstrating that FEV1%, FVC%, Age, Height, Weight, BMI, Age, FEV1% product and Brasfield scores were the most important variables in forming the classes. (Fig. 2A). This pattern was very similar with the k = 3, k = 4, and k = 6 classifications.

It is from page 5 of this paper
https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0122705&type=printable

This is my data
https://www.mediafire.com/file/b6b8a1rim2dr8cp/dat.xlsx/file

This is my code:

library(readxl)
library(randomForest)
library(dplyr)

dat <- read_xlsx("C:/Desktop/dat.xlsx")

# Convert every variable used below to a factor
factor_cols <- c("studies", "age", "gender", paste0("VAR", 1:10))
dat[factor_cols] <- lapply(dat[factor_cols], as.factor)

#Step 1:

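set.seed(23)
n <- nrow(dat)
# Synthetic data: resample each column independently (with replacement)
datBS <- mutate_all(dat, ~ sample(., replace = TRUE))
# Class 1 = original rows, class 2 = synthetic rows
y <- factor(c(rep(1, n), rep(2, n)))
rf <- randomForest(x = rbind(dat, datBS), y = y, proximity = TRUE)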


#Step 2

library(cluster)
# 1 - proximity is already a dissimilarity, so pass it to PAM directly
rf_dist <- as.dist(1 - rf$proximity)
pam_fit <- pam(rf_dist, diss = TRUE, k = 2)
sil_width <- pam_fit$silinfo$avg.width
sil_width
pam_fit$clustering

I got Step 1 from here, and it seems to be OK.
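
For comparison, here is a minimal sketch of the same idea using the unsupervised mode built into randomForest (calling it without y), which builds the synthetic second class internally. The built-in iris data stands in for dat (an assumption, since the spreadsheet is not reproduced here), and the name urf is only illustrative. The proximity matrix it returns should cover only the original rows:

```r
library(randomForest)

# Stand-in data (assumption): the four numeric iris columns instead of dat.
x <- iris[, -5]
n <- nrow(x)

set.seed(23)
# Omitting y makes randomForest run in unsupervised mode: it contrasts the
# real rows with a synthetic copy that it generates internally.
urf <- randomForest(x = x, ntree = 500, proximity = TRUE)

# Proximity between the original observations only.
prox <- urf$proximity
dim(prox)
```

If this behaves as expected, dim(prox) has one row and one column per original observation, so no extra rows need to be dropped before clustering.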

In Step 2 I run PAM on 1 - proximity; I think that is similar to what they did.
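
A rough sketch of how PAM could be run on 1 - proximity for several values of k, since the paper compares k = 3, 4, and 6. It assumes iris as stand-in data and takes the proximity from randomForest called without y (its unsupervised mode). The names rf_dist, ks, and sil are illustrative, not from the paper:

```r
library(randomForest)
library(cluster)

set.seed(23)
# Stand-in data (assumption): iris measurements instead of dat.
urf <- randomForest(x = iris[, -5], proximity = TRUE)

# Treat 1 - proximity directly as the dissimilarity between observations.
rf_dist <- as.dist(1 - urf$proximity)

# Average silhouette width for a few candidate numbers of clusters.
ks <- 2:6
sil <- sapply(ks, function(k) {
  pam(rf_dist, k = k, diss = TRUE)$silinfo$avg.width
})
names(sil) <- ks
sil
```

A larger average silhouette width suggests better separated clusters, which is one possible way to compare the candidate values of k.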

In Step 3, if I understand it right, I need to run a supervised random forest with my original data and the classes obtained with PAM. But, as remarked here, I have twice the number of rows. Where are the classes of the original data?
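
One possible way to get classes for the original rows only, sketched under assumptions: iris stands in for dat, k = 3 is an arbitrary choice, and dat_demo, dat_bs, prox_orig, classes, and rf_sup are illustrative names. Because the original data are stacked first in rbind(), rows 1..n of the proximity matrix belong to the original observations, so only that block is clustered:

```r
library(randomForest)
library(cluster)

# Stand-in data (assumption): iris measurements instead of dat.
dat_demo <- iris[, -5]
n <- nrow(dat_demo)

set.seed(23)
# Step 1 as in the post: resample each column independently to build a
# synthetic class, then separate real (class 1) from synthetic (class 2) rows.
dat_bs <- as.data.frame(lapply(dat_demo, function(col) sample(col, replace = TRUE)))
y <- factor(c(rep(1, n), rep(2, n)))
rf <- randomForest(x = rbind(dat_demo, dat_bs), y = y, proximity = TRUE)

# Step 2: rows 1..n of the stacked data are the original observations, so
# only that block of the 2n x 2n proximity matrix is clustered.
prox_orig <- rf$proximity[1:n, 1:n]
pam_fit <- pam(as.dist(1 - prox_orig), k = 3, diss = TRUE)
classes <- factor(pam_fit$clustering)

# Step 3: supervised forest on the original data with the PAM classes.
rf_sup <- randomForest(x = dat_demo, y = classes, importance = TRUE)
rf_sup$err.rate[rf_sup$ntree, "OOB"]  # out-of-bag error rate
importance(rf_sup)                    # variable importance table
```

Calling varImpPlot(rf_sup) afterwards would draw a variable importance plot comparable in kind to the one described in the paper.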
