I'm trying to follow these three steps for clustering using random forest:
- The unsupervised Random Forest algorithm was used to generate a proximity matrix using all listed clinical variables.
- PAM clustering of this first proximity matrix generated the initial classes
- A supervised Random Forest analysis of the initial classes a) indicated out of bag error rates of about 25–30%. b) had variable importance plots demonstrating that FEV1%, FVC%, Age, Height, Weight, BMI, Age, FEV1% product and Brasfield scores were the most important variables in forming the classes. (Fig. 2A). This pattern was very similar with the k = 3, k = 4, and k = 6 classifications.
This is from page 5 of this paper:
https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0122705&type=printable
This is my data:
https://www.mediafire.com/file/b6b8a1rim2dr8cp/dat.xlsx/file
This is my code:
library(readxl)
library(randomForest)
library(dplyr)
dat <- read_xlsx("C:/Desktop/dat.xlsx")
# Convert every listed column to a factor
factor_cols <- c("studies", "age", "gender", paste0("VAR", 1:10))
dat[factor_cols] <- lapply(dat[factor_cols], as.factor)
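# Step 1: unsupervised random forest (real data vs. a resampled synthetic copy)
set.seed(23)
n <- nrow(dat)
# Resample each column independently, with replacement, to build the synthetic data
# (across() replaces the deprecated funs() / mutate_all() idiom)
datBS <- mutate(dat, across(everything(), ~ sample(.x, replace = TRUE)))
# Label the original rows as class 1 and the synthetic rows as class 2
y <- factor(c(rep(1, n), rep(2, n)))
rf <- randomForest(x = rbind(dat, datBS), y = y, proximity = TRUE)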
#Step 2
library(cluster)
gower_dist <- daisy((1-rf$proximity), metric = "gower")
pam_fit <- pam(gower_dist, diss = TRUE,k = 2)
sil_width <- pam_fit$silinfo$avg.width
sil_width
pam_fit$clustering
I got Step 1 from here, and it seems to be OK.
In Step 2 I run PAM with 1 - proximity; I think that is similar to what they did.
In Step 3 I need to run a supervised random forest with my original data and the classes obtained with PAM, if I understand it right. But, as remarked here, I have twice the number of rows. Where are the classes of the original data?
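A minimal sketch of one possible reading follows; it is not confirmed to be what the paper did. It assumes that, because `rbind()` stacks the synthetic rows below the originals, rows and columns 1..n of the proximity matrix belong to the original observations, so only that block is clustered. It also swaps `daisy()` for `as.dist(1 - proximity)`, which is a substitution made here, and it runs on made-up toy data. The names `toy`, `toyBS`, `prox_orig`, `classes` and `rf_sup` are hypothetical.

```r
library(randomForest)
library(cluster)

set.seed(23)

# Toy stand-in for the real data (hypothetical columns, all factors)
n <- 60
toy <- data.frame(
  a = factor(sample(c("x", "y"), n, replace = TRUE)),
  b = factor(sample(c("lo", "mid", "hi"), n, replace = TRUE)),
  c = factor(sample(c("0", "1"), n, replace = TRUE))
)

# Step 1: resample each column independently and stack the synthetic
# rows below the original rows
toyBS <- as.data.frame(lapply(toy, function(col) sample(col, replace = TRUE)))
y <- factor(c(rep(1, n), rep(2, n)))
rf <- randomForest(x = rbind(toy, toyBS), y = y, proximity = TRUE)

# The proximity matrix is 2n x 2n; keep only the block for the original rows
# (assumption: rows 1..n are the originals because of the rbind() order)
prox_orig <- rf$proximity[1:n, 1:n]

# Step 2: PAM on 1 - proximity, used directly as a dissimilarity
pam_fit <- pam(as.dist(1 - prox_orig), k = 2, diss = TRUE)
classes <- factor(pam_fit$clustering)

# Step 3: supervised random forest on the original rows with the PAM classes
rf_sup <- randomForest(x = toy, y = classes, importance = TRUE)

# Out-of-bag error after the last tree, and variable importance
rf_sup$err.rate[rf_sup$ntree, "OOB"]
importance(rf_sup)
```

With this reading, `classes` has one label per original row, so it can be passed straight to the supervised forest together with the original data.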