Hi,
As part of a larger study, I'm simulating decisions about how to subset data. I currently use nested for loops, as shown in the example below.
However, my full code runs over 1 million iterations, so I am trying to reduce the execution time as much as possible.
I have optimized the code to the best of my knowledge; switching to data.table gave a small speed increase.
Some iterations will inevitably produce empty data frames. I tried using if/else/next to skip the current iteration when the subset has nrow == 0, but that markedly increased the running time.
Is there any way I can optimize my code to decrease the running time?
Does it make any sense to parallelize it using foreach when the task for each iteration is so small?
library(data.table)
library(tidyverse)
my_df <- data.table(
  id = c("id1", "id1", "id1", "id2", "id2"),
  bin_year = c(1, 1, 1, 2, 2),
  outcome = c("outcome1", "outcome1", "outcome2", "outcome2", "outcome3"),
  bin_interv = c(1, 2, 3, 1, 2)
)
unq_outcome <- unique(my_df$outcome)
loop_output <- list()
for (l in 1:max(my_df$bin_year)) {
  for (o in 1:(max(my_df$bin_interv) + 3)) {
    for (p in 1:(n_distinct(unq_outcome) + 1)) {
      # iterations
      iteration <- str_c(l, o, p)
      # selectors
      select_year <- 1:l
      select_interv <- if (o <= max(my_df$bin_interv)) {
        o
      } else if (o == max(my_df$bin_interv) + 1) {
        c(2, 4)
      } else if (o == max(my_df$bin_interv) + 2) {
        c(1, 5)
      } else {
        1:max(my_df$bin_interv)
      }
      select_outcome <- if (p <= n_distinct(unq_outcome)) unq_outcome[p] else unq_outcome
      # subset data
      loop_output[[iteration]] <- my_df[bin_year %in% select_year &
                                          bin_interv %in% select_interv &
                                          outcome %in% select_outcome]
    }
  }
}
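
For discussion, here is a minimal sketch of one possible restructuring of the loop above. It is not a benchmarked solution. It assumes the selector sets depend only on l, o and p, so each %in% test can be computed once as a logical mask and reused, and the max() and unique() calls can be hoisted out of the loops. The names interv_sets, outcome_sets, year_masks, interv_masks, outcome_masks and grid are hypothetical helpers that are not part of the code above. Unlike the loop above, empty subsets are left as NULL instead of being stored as zero-row tables.

```r
library(data.table)

my_df <- data.table(
  id = c("id1", "id1", "id1", "id2", "id2"),
  bin_year = c(1, 1, 1, 2, 2),
  outcome = c("outcome1", "outcome1", "outcome2", "outcome2", "outcome3"),
  bin_interv = c(1, 2, 3, 1, 2)
)

# Loop-invariant values, computed once instead of on every iteration
max_year    <- max(my_df$bin_year)
max_interv  <- max(my_df$bin_interv)
unq_outcome <- unique(my_df$outcome)

# Selector sets (hypothetical helpers): one list element per value of o and p
interv_sets  <- c(as.list(seq_len(max_interv)),
                  list(c(2, 4), c(1, 5), seq_len(max_interv)))
outcome_sets <- c(as.list(unq_outcome), list(unq_outcome))

# One logical mask per selector, so each %in% test runs only once
year_masks    <- lapply(seq_len(max_year), function(l) my_df$bin_year %in% seq_len(l))
interv_masks  <- lapply(interv_sets, function(s) my_df$bin_interv %in% s)
outcome_masks <- lapply(outcome_sets, function(s) my_df$outcome %in% s)

# All (l, o, p) combinations, in the same order as the nested loops
grid <- CJ(l = seq_len(max_year),
           o = seq_along(interv_sets),
           p = seq_along(outcome_sets))

# Preallocate the output list instead of growing it element by element
loop_output <- vector("list", nrow(grid))
names(loop_output) <- paste0(grid$l, grid$o, grid$p)

for (i in seq_len(nrow(grid))) {
  keep <- year_masks[[grid$l[i]]] &
    interv_masks[[grid$o[i]]] &
    outcome_masks[[grid$p[i]]]
  if (!any(keep)) next  # empty subset: skip before subsetting, entry stays NULL
  loop_output[[i]] <- my_df[keep]
}
```

The emptiness check here runs on the combined logical mask before any subsetting happens, which differs from checking nrow on an already created subset. Whether this actually helps on the full data would need to be timed.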