tunning the parameters of a random forest model by h2o grid

shamimashrafiyan · December 14, 2022, 9:25am

Hello,
I have written code to tune the parameters of a regression random forest model. I defined a search grid by h2o grid. It finds the best model but when I run the best model to predict the target for the new dataset(test data) all values of the target are the same so when I calculate the correlation between the actual target and the predicted target, the result is NAN.
I need to mention, when this function finds all models I sort them from best to worse based on error and then I get the first one as the best model then it gives the NAN result. but I tried some of the other models from the middle of the sorted models and they gave so better results. here is my code:
Can you help me with what the problem is?

library(h2o)
h2o.init()
h2o.clusterInfo()
library(tidyverse)

df = original_dataset # a dataset which has 173(rows) samples and 1850 features(col)
normalize <- function(x) {
  if(max(x) == min(x)){
    return(0)
  }
  return ((x - min(x)) / (max(x) - min(x)))
}
df = df[,2: ncol(df)]
maxmindf <- as.data.frame(lapply(df, normalize))
attach(maxmindf)
df_norm<-as.matrix(maxmindf)

h_df <- as.h2o(df_norm)

#split the data to train and test
df.split <- h2o.splitFrame(data = h_df, ratios = 0.8, seed = 200)
h_train <- df.split[[1]]
h_test <- df.split[[2]]

target <- "Expression"
features <- setdiff(colnames(df), target)

# different values for the mtries
a1 = floor((ncol(h_train)/3))
a2 = floor(sqrt(ncol(h_train)))

#search grid
hyper_grid.h2o <- list(ntrees = seq(501, 801, by = 100),
                       mtries = c(a1,a2)
                       )

hyper_grid.h2o

#number of model
sapply(hyper_grid.h2o, length) %>% prod()

#finding the best model
system.time(grid_cartesian <- h2o.grid(algorithm = "randomForest",
                                        grid_id = "rf_grid1",
                                       x = features,
                                       y = target,
                                       seed = 200,
                                       # nfolds = 5,
                                       training_frame = h_train,
                                       hyper_params = hyper_grid.h2o,
                                       search_criteria = list(strategy = "Cartesian"),
                                       parallelism = 64 
                                       )
            )

grid_cartesian


grid_perf <- h2o.getGrid(grid_id = "rf_grid1",
                             sort_by = "residual_deviance",
                             decreasing = FALSE)
grid_perf@summary_table

best_model1 <- h2o.getModel(grid_perf@model_ids[[1]]) #select the best model
best_model1

#predict the test data
pred <- h2o.predict (object = best_model1, newdata = h_test)
sqrt(mean((as.vector(h_test$Expression) - as.vector(pred)) ^2))
Corelation1 = cor(h_test$Expression , pred) # this one returns NAN
 print(Corelation1)

nirgrahamuk · December 14, 2022, 9:53am

This is non reproducible code as it depends on private data 'original_dataset ' so likey support will be limited.
Does your issue recur on other data ? I.e. datasets that come bundled with R ? (iris/mtcars etc. etc.)

sidenote:
attach(maxmindf)
This is immediately scary as it indicates you may be using a lack of discipline, or regard to whether objects are in a data.frame or without.

shamimashrafiyan · December 14, 2022, 10:03am

Thanks for the answer, I tries 200 datasets, and for 98% of them, I got the NAN.
the problem is when I check these 200 datasets with ordinary Random forests (without tunning) I get good results.
The default values (like nrees = 501) also is in the search grid but why the function doesn't select that one? why it selects a model which doesn't work well on test data?
How can I send the data set to you? it is a text file.

nirgrahamuk · December 14, 2022, 10:21am

If one of the 200 is an r bundled dataset then simply state its name

shamimashrafiyan · December 14, 2022, 10:30am

No these are private datasets, all of them have 173 rows or samples and different numbers of columns or features. the target column is "Expression" which is common in all of them (last column)

nirgrahamuk · December 14, 2022, 10:37am

My suggestion is to try on inbuilt datasets, because if it reveals the same problem, its easy way to make your issue reproducible.

shamimashrafiyan · December 19, 2022, 11:08am

Thanks for the answer, I tried some of the inbuilt datasets and got NAN again but I found the problem. The problem is the search grid has an "ID", and this function cash it from the previous run, and based on that one can't predict a new dataset and then shows NAN. So for each new data set, I have to assign a new ID to the search grid. As I have 16000 datasets I used a loop and for each new dataset, I assigned a new ID so it works well.

I mean this line of code and grid_id

grid_perf <- h2o.getGrid(grid_id = "rf_grid1",

system · December 26, 2022, 11:08am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.