I'm trying to follow these 3 steps for clustering using random forest:
The unsupervised Random Forest algorithm was used to generate a proximity matrix using all listed clinical variables.
PAM clustering of this first proximity matrix generated the initial classes.
A supervised Random Forest analysis of the initial classes a) indicated out of bag error rates of about 25–30%. b) had variable importance plots demonstrating that FEV1%, FVC%, Age, Height, Weight, BMI, Age, FEV1% product and Brasfield scores were the most important variables in forming the classes. (Fig. 2A). This pattern was very similar with the k = 3, k = 4, and k = 6 classifications.
It is from page 5 of this paper
This is my data
This is my code:
    library(readxl)
    library(randomForest)
    library(dplyr)

    dat <- read_xlsx("C:/Desktop/dat.xlsx")
    dat$studies <- as.factor(dat$studies)
    dat$age <- as.factor(dat$age)
    dat$gender <- as.factor(dat$gender)
    dat$VAR1 <- as.factor(dat$VAR1)
    dat$VAR2 <- as.factor(dat$VAR2)
    dat$VAR3 <- as.factor(dat$VAR3)
    dat$VAR4 <- as.factor(dat$VAR4)
    dat$VAR5 <- as.factor(dat$VAR5)
    dat$VAR6 <- as.factor(dat$VAR6)
    dat$VAR7 <- as.factor(dat$VAR7)
    dat$VAR8 <- as.factor(dat$VAR8)
    dat$VAR9 <- as.factor(dat$VAR9)
    dat$VAR10 <- as.factor(dat$VAR10)

    #Step 1:
    set.seed(23)
    n <- nrow(dat)
    datBS <- mutate_all(dat, funs(sample(., replace = TRUE)))
    y <- factor(c(rep(1, n), rep(2, n)))
    rf <- randomForest(x = rbind(dat, datBS), y = y, proximity = TRUE)

    #Step 2
    library(cluster)
    gower_dist <- daisy((1 - rf$proximity), metric = "gower")
    pam_fit <- pam(gower_dist, diss = TRUE, k = 2)
    sil_width <- pam_fit$silinfo$avg.width
    sil_width
    pam_fit$clustering
I got Step 1 from here, and it seems to work.
In Step 2 I run PAM on 1 - proximity, which I think is similar to what they did.
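To make the Step 2 call concrete, here is a minimal sketch on a made-up 3 x 3 proximity matrix (the numbers are hypothetical, not from my data). It compares using 1 - proximity directly as a dissimilarity with passing it through `daisy(..., metric = "gower")` as my code does. The two give different distances, and I am not certain which one matches the paper.

```r
library(cluster)

# Hypothetical symmetric proximity matrix with 1 on the diagonal
prox <- matrix(c(1.0, 0.8, 0.1,
                 0.8, 1.0, 0.2,
                 0.1, 0.2, 1.0), nrow = 3, byrow = TRUE)

# Option A: treat 1 - proximity itself as the dissimilarity
d_direct <- as.dist(1 - prox)

# Option B (what my code does): treat each row of 1 - proximity as an
# observation and compute a Gower distance between those rows
d_gower <- daisy(1 - prox, metric = "gower")

d_direct
d_gower
```

Either object can be handed to `pam(..., diss = TRUE)`, so the code runs both ways; the difference is only in what is being clustered.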
In Step 3, if I understand it correctly, I need to run a supervised random forest on my original data, using the classes obtained with PAM as the response. But, as remarked here, I now have twice the number of rows. Which of the classes belong to the original data?
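One thing I have considered, though I am not sure it is what the authors did, is to keep only the part of the proximity matrix that belongs to the real rows. Since the real data comes first in `rbind()`, those would be rows and columns 1 to n. The sketch below uses `iris` as a stand-in for my data, and `k = 3` is an arbitrary choice for illustration.

```r
library(randomForest)
library(cluster)

set.seed(23)
dat <- iris[, 1:4]   # stand-in for my data (assumption, not the real file)
n <- nrow(dat)

# Synthetic copy: each column resampled independently, as in Step 1
synth <- as.data.frame(lapply(dat, function(col) sample(col, replace = TRUE)))

# Step 1: real rows labelled 1, synthetic rows labelled 2
y <- factor(c(rep(1, n), rep(2, n)))
rf <- randomForest(x = rbind(dat, synth), y = y, proximity = TRUE)
dim(rf$proximity)    # 2n x 2n: real rows first, synthetic rows after

# Keep only the real-vs-real block of the proximity matrix
prox_real <- rf$proximity[1:n, 1:n]

# Step 2: PAM on 1 - proximity, used directly as a dissimilarity
pam_fit <- pam(as.dist(1 - prox_real), k = 3, diss = TRUE)
classes <- factor(pam_fit$clustering)   # one label per original row

# Step 3: supervised forest on the original data with the PAM classes
rf_sup <- randomForest(x = dat, y = classes, importance = TRUE)
rf_sup$err.rate[rf_sup$ntree, "OOB"]    # out-of-bag error rate
importance(rf_sup)                      # variable importance
```

With this, `classes` has exactly n entries, so it lines up with the original rows. Is this subsetting a valid way to do it?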