Storing variable importance from randomForest using loop

Nile · March 13, 2023, 8:30pm

Hi,
I am trying to fit a series of random forest models by looping over regions with the following code, but I keep getting this error:

Error in `$<-.data.frame`(`*tmp*`, "region", value = 10) : 
  replacement has 1 row, data has 0

Can you please help me see what I am missing in this method?

Here is a glimpse of the data:

Rows: 6,978
Columns: 5
$ age     <dbl> 15, 18, 18, 18, 15, 16, 16, 18, 15, 15, 17, 17, 18, 18, 19, 17, 15, 16, 16, 15, 16, …
$ fs      <dbl> 12, 12, 12, 12, 10, 10, 10, 12, 10, 10, 10, 12, 12, 11, 12, 12, 11, 10, 11, 12, 12, …
$ sex     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ marital <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ region  <dbl> 10, 12, 10, 9, 9, 7, 9, 9, 9, 8, 1, 4, 9, 4, 12, 10, 4, 9, 9, 4, 4, 4, 10, 4, 6, 10,…

And here is the code:

result <- data.frame()

for (i in unique(bb$region)) {
    sub_train <- subset(bb, region == i)
    rf <- randomForest(fs ~ age + sex + marital, data = sub_train, ntree = 5000, mtry = 3)
    imp <- importance(rf, type = 1)
    if (nrow(imp) > 0) {
        imp_table <- as.data.frame(t(imp))
        imp_table$region <- i
        colnames(imp_table) <- c("MeanDecreaseAccuracy", "MeanDecreaseGini", "region")
        if (nrow(result) == 0) {
            result <- imp_table
        } else {
            result <- rbind(result, imp_table)
        }
    }
}

result

technocrat · March 14, 2023, 7:51am

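importance(rf, type = 1) comes back with the predictor names as rows but no columns, which ruins everything downstream: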

library(randomForest)
#> randomForest 4.7-1.1
#> Type rfNews() to see new features/changes/bug fixes.
N <- 7000
bb <- data.frame(age = sample(15:25,N,replace = TRUE),
                 fs = sample(10:18,N,replace = TRUE),
                 sex = sample(0:1,N,replace = TRUE),
                 marital = sample(0:1,N,replace = TRUE),
                 region = sample(1:19,N,replace = TRUE))

# receiver object
result <- bb[0,]

# check logic by hand
train <- bb[which(bb$region == 1),]
rf = randomForest(fs ~ age  + sex + marital , data = train, ntree = 5000, mtry = 3)
(imp <- importance(rf, type = 1))
#>        
#> age    
#> sex    
#> marital
(if(nrow(imp) > 0) imp_table = as.data.frame(t(imp)))
#> [1] age     sex     marital
#> <0 rows> (or 0-length row.names)

for (i in unique(bb$region)) {
  sub_train = subset(bb, region == i)
  rf = randomForest(fs ~ age  + sex + marital , data = sub_train, ntree = 5000, mtry = 3)
  imp = importance(rf, type = 1)
  if (nrow(imp) > 0) {
    imp_table = as.data.frame(t(imp))
    imp_table$region = i
    colnames(imp_table) = c("MeanDecreaseAccuracy", "MeanDecreaseGini", "region")
    if (nrow(result) == 0) {
      result = imp_table
    } else {
      result = rbind(result, imp_table)
    }
  }
}
#> Error in `$<-.data.frame`(`*tmp*`, "region", value = 3L): replacement has 1 row, data has 0

result
#> [1] age     fs      sex     marital region 
#> <0 rows> (or 0-length row.names)

Created on 2023-03-14 with reprex v2.0.2
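To make the failure mode in the reprex easier to see, here is a minimal base-R sketch that does not fit any model. It assumes the empty importance result behaves like a matrix with three named rows and zero columns, which is what the printed output above suggests; the hand-built imp matrix is a stand-in for illustration, not actual randomForest output.

```r
# Stand-in for the empty result printed above: three named rows, zero columns.
# (Assumption: built by hand to mimic importance(rf, type = 1) in the reprex.)
imp <- matrix(numeric(0), nrow = 3, ncol = 0,
              dimnames = list(c("age", "sex", "marital"), NULL))

nrow(imp)   # 3 rows, so the `nrow(imp) > 0` guard lets it through

# Transposing swaps the dimensions: 0 rows, 3 columns
imp_table <- as.data.frame(t(imp))
dim(imp_table)

# Adding a length-1 column to a 0-row data frame is what raises the error
err <- tryCatch({
  imp_table$region <- 10
  NULL
}, error = function(e) conditionMessage(e))
err
```

Under that assumption, a guard on ncol(imp) rather than nrow(imp) would have skipped the empty case, although it would hide the underlying problem instead of fixing it.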

nirgrahamuk · March 14, 2023, 9:35am

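Indeed; randomForest() takes an importance argument, which the help page describes as:

importance
Should importance of predictors be assessed?

It defaults to FALSE, so the permutation-based measure requested with type = 1 is never computed. The call would need to be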

rf = randomForest(fs ~ age  + sex + marital ,
                  data = train,
                  ntree = 5000, 
                  mtry = 3,
                  importance=TRUE)
(imp <- importance(rf, type = 1))
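
Putting that back into the loop, one possible sketch of the whole thing looks like this. It reuses the simulated bb from the reprex above (made smaller, with a seed and ntree = 500, purely so it runs quickly; those values are assumptions, not recommendations). The names regions and imp_list are introduced here for illustration. Results are collected in a list and bound once at the end instead of growing result inside the loop. The colnames() line from the original is left out: with type = 1 the transposed table has one column per predictor (age, sex, marital), so assigning "MeanDecreaseAccuracy" and "MeanDecreaseGini" would mislabel them.

```r
library(randomForest)

set.seed(42)  # assumption: only here to make the sketch repeatable
N <- 700
bb <- data.frame(age = sample(15:25, N, replace = TRUE),
                 fs = sample(10:18, N, replace = TRUE),
                 sex = sample(0:1, N, replace = TRUE),
                 marital = sample(0:1, N, replace = TRUE),
                 region = sample(1:5, N, replace = TRUE))

regions <- sort(unique(bb$region))
imp_list <- vector("list", length(regions))

for (k in seq_along(regions)) {
  sub_train <- subset(bb, region == regions[k])
  rf <- randomForest(fs ~ age + sex + marital, data = sub_train,
                     ntree = 500, mtry = 3,
                     importance = TRUE)   # needed for the type = 1 measure
  imp <- importance(rf, type = 1)         # one column, one row per predictor
  imp_table <- as.data.frame(t(imp))      # one row: age, sex, marital
  imp_table$region <- regions[k]
  imp_list[[k]] <- imp_table
}

# Bind the per-region rows into a single data frame
result <- do.call(rbind, imp_list)
rownames(result) <- NULL
result
```

Since fs is numeric, randomForest fits a regression forest here, so the type = 1 values are the permutation importance for regression rather than a mean decrease in accuracy.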
