Using a FOR-loop to calculate AUC of multiple dataframes

Konar · August 26, 2021, 10:09am

Hello,

For a certain problem I want to run a RandomForest classifier on multiple datasets and compare the AUCs of those datasets. I want to use a 'lazy' approach, so instead of running the classification by hand n times, once per dataset, I want a for-loop to do this for me.

So, a for-loop that loops over multiple datasets, performs random forest classification, calculates the AUC, and stores this AUC in an empty matrix/dataframe. The result should be a table/matrix with one column per dataset and a single row holding the AUC of each dataset.

I prepped some code using the Iris dataset to get started, but I don't have any experience with using for-loops on this kind of problem. Hopefully somebody can help me out, or at least get me thinking in the right direction.

Example:

require(pROC)
require(randomForest)

#use the Iris dataset as example
data(iris)

#make a simple 2-class outcome over the Iris dataset
iris <- iris[-which(iris$Species=="setosa"),]
iris$Species<-as.factor(as.character(iris$Species))

#create list of dataframes we want to use
df1 <- iris
df2 <- iris
df_list <- list(df1, df2)

#create empty matrix to store results in
results_matrix <- matrix(ncol=2, nrow=1)

#create a for loop to calculate and store AUC of each dataframe 
for(df in df_list){
  rf_model <- randomForest::randomForest(Species ~., data = df)
  rf_model_roc <- roc(df$Species,rf_model$votes[,2])
  df_auc <- auc(rf_model_roc)
  
  #store df_auc of each df in results_matrix
    }

pieterjanvc · August 26, 2021, 9:05pm

Hi

Thanks for creating a nice reprex! It's always great if you can just start working with the code immediately without having to figure out how to recreate the issue.

Here is one way to solve your problem:

require(pROC)
require(randomForest)
require(purrr)

#use the Iris dataset as example
data(iris)

#make a simple 2-class outcome over the Iris dataset
iris <- iris[-which(iris$Species=="setosa"),]
iris$Species<-as.factor(as.character(iris$Species))

#create list of dataframes we want to use
df1 <- iris
df2 <- iris
df_list <- list(df1, df2)

#Calculate and store AUC of each dataframe 
results = map_df(seq_along(df_list), function(i){
  
  rf_model <- randomForest::randomForest(Species ~., data = df_list[[i]])
  #use the response of the current dataset, not the global iris object
  rf_model_roc <- roc(df_list[[i]]$Species,rf_model$votes[,2])
  df_auc <- auc(rf_model_roc)
  
  data.frame(
    dataset = paste0("dataset", i),
    auc = as.numeric(df_auc)
  )
  
})

results

> results
   dataset    auc
1 dataset1 0.9844
2 dataset2 0.9828

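I use the map_df function from the purrr package (part of the Tidyverse) and 'loop' over the datasets by index i. The last expression in the mapped function is its return value, in this case a one-row data frame with the result for that model. map_df then binds these rows together into a single data frame. The AUC values differ slightly between the two identical datasets because random forests are fitted with random sampling and no seed is set.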
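
If you would rather stay with a plain for-loop and a pre-allocated matrix, as in the original question, something along these lines might work. This is only a sketch: the list names `df1`/`df2`, the `iris2` object, and the fixed seed are my own assumptions for illustration, and `quiet = TRUE` is only there to silence pROC's messages.

```r
library(pROC)
library(randomForest)

# Two-class version of iris, as in the question
data(iris)
iris2 <- droplevels(iris[iris$Species != "setosa", ])

# Named list so the names can be reused as column names (names are illustrative)
df_list <- list(df1 = iris2, df2 = iris2)

# One row, one column per dataset, filled with NA until the loop runs
results_matrix <- matrix(NA_real_, nrow = 1, ncol = length(df_list),
                         dimnames = list("AUC", names(df_list)))

set.seed(42)  # assumption: fixed seed so repeated runs give the same numbers

for (i in seq_along(df_list)) {
  df <- df_list[[i]]
  rf_model <- randomForest(Species ~ ., data = df)
  # votes[, 2] is the out-of-bag vote fraction for the second class level
  rf_model_roc <- roc(df$Species, rf_model$votes[, 2], quiet = TRUE)
  # store the AUC of this dataset in its own column
  results_matrix[1, i] <- as.numeric(auc(rf_model_roc))
}

results_matrix
```

Looping over the index rather than over the data frames themselves is what makes it possible to write each AUC into the matching column.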

You can of course add more columns to this data frame if you want other statistics as well.
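
As a rough illustration of adding extra statistics, the sketch below returns the out-of-bag error rate and the number of rows next to the AUC. The column names `oob_error` and `n_obs` and the seed are assumptions of mine, and I use base R's `lapply` with `do.call(rbind, ...)` here instead of `map_df`, purely to keep the example free of extra dependencies.

```r
library(pROC)
library(randomForest)

# Two-class version of iris, as in the question
data(iris)
iris2 <- droplevels(iris[iris$Species != "setosa", ])
df_list <- list(iris2, iris2)

set.seed(42)  # assumption: fixed seed for repeatable results

rows <- lapply(seq_along(df_list), function(i) {
  df <- df_list[[i]]
  rf_model <- randomForest(Species ~ ., data = df)
  rf_model_roc <- roc(df$Species, rf_model$votes[, 2], quiet = TRUE)

  # one row per dataset; add further columns here as needed
  data.frame(
    dataset   = paste0("dataset", i),
    auc       = as.numeric(auc(rf_model_roc)),
    # OOB error after the final tree (last row of the err.rate matrix)
    oob_error = unname(rf_model$err.rate[rf_model$ntree, "OOB"]),
    n_obs     = nrow(df)
  )
})

# bind the one-row data frames into a single table
results <- do.call(rbind, rows)
results
```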

Hope this helps,
PJ
