how to speed up the for loop over data frame in R?

Hi to all
I have two dataframes (p.D, p.G and test) with 900K rows,


now i want to fill columns from C1 to C5 in p.D by following certain conditions. i followed code to do so

 a = Sys.time()
  for( i in c(1:nrow(test))){ # For each Combination in test
    mm = c()
    for( k in c(2:ncol(p.G))){ # For each marker in p.G
      mm[k-1] = ifelse( is.na(p.G[ which( p.G$GEN == test$P2[i]), k]) & 
                          is.na(p.G[ which( p.G$GEN == test$P1[i]), k]) , 3,   # Both missing
                        
                        ifelse( is.na(p.G[ which( p.G$GEN == test$P1[i]), k]) & 
                                  (p.G[ which( p.G$GEN == test$P2[i]), k] == 0 | 
                                     p.G[ which( p.G$GEN == test$P2[i]), k] == 2) , 3.5, # One is missing one is Hom
                                ifelse( is.na(p.G[ which( p.G$GEN == test$P1[i]), k]) & 
                                          p.G[ which( p.G$GEN == test$P2[i]), k] == 1 , 1.5,  # One is missing one is Het
                                        
                                        ifelse( is.na(p.G[ which( p.G$GEN == test$P2[i]), k]) & 
                                                  (p.G[ which( p.G$GEN == test$P1[i]), k] == 0 | 
                                                     p.G[ which( p.G$GEN == test$P1[i]), k] == 2) , 3.5,  # One is missing one is Hom
                                                ifelse( is.na(p.G[ which( p.G$GEN == test$P2[i]), k]) & 
                                                          p.G[ which( p.G$GEN == test$P1[i]), k] == 1 , 1.5,  # One is missing one is Het
                                                        
                                                        ifelse(p.G[ which( p.G$GEN == test$P1[i]), k] == 0 & 
                                                                 p.G[ which( p.G$GEN == test$P2[i]), k] == 2, 0,          # AA + BB 
                                                               ifelse(p.G[ which( p.G$GEN == test$P1[i]), k] == 2 & 
                                                                        p.G[ which( p.G$GEN == test$P2[i]), k] == 0, 0,          # BB + AA 
                                                                      
                                                                      ifelse(p.G[ which( p.G$GEN == test$P1[i]), k] == 0 & 
                                                                               p.G[ which( p.G$GEN == test$P2[i]), k] == 0, 10,    # BB + BB
                                                                             ifelse(p.G[ which( p.G$GEN == test$P1[i]), k] == 2 & 
                                                                                      p.G[ which( p.G$GEN == test$P2[i]), k] == 2, 10,   # AA + AA
                                                                                    
                                                                                    ifelse(p.G[ which( p.G$GEN == test$P1[i]), k] == 1 & 
                                                                                             p.G[ which( p.G$GEN == test$P2[i]), k] == 2, 1,    # Het + Hom A
                                                                                           ifelse(p.G[ which( p.G$GEN == test$P1[i]), k] == 1 & 
                                                                                                    p.G[ which( p.G$GEN == test$P2[i]), k] == 0, 1,   # Het + Hom B
                                                                                                  
                                                                                                  ifelse(p.G[ which( p.G$GEN == test$P1[i]), k] == 0 & 
                                                                                                           p.G[ which( p.G$GEN == test$P2[i]), k] == 1, 1,  # Het + Hom B
                                                                                                         ifelse(p.G[ which( p.G$GEN == test$P1[i]), k] == 2 & 
                                                                                                                  p.G[ which( p.G$GEN == test$P2[i]), k] == 1, 1,  # Het + Hom A
                                                                                                                
                                                                                                                ifelse(p.G[ which( p.G$GEN == test$P1[i]), k] == 1 & 
                                                                                                                         p.G[ which( p.G$GEN == test$P2[i]), k] == 1, 2,    # Het + Het
                                                                                                                       NA))))))))))))))
    }
    
    test$C1[i] = mean(mm)
    test$C2[i] = length(which(mm == 10))
    test$C3[i] = length(which(mm == 0))
    test$C4[i] = length(which(mm == 3 | mm == 3.5 | mm == 1.5))
    test$C5[i] = p.D[test$row[i], test$col[i]]
    
    if(i %% 1000 == 0){
      print(paste0(round((i / nrow(test)) * 100, 3), 
                   "% (i = ", i,") completed in ", 
                   round(Sys.time() - a, 2), " mins"))
    }
  }
  return(test)
}

here numbers 0, 1,2 means
0=missing
1=het (both letters are not same i.e. A/T not same letters like A/A
2=homo 9same letters i.e. A/A or T/T
Here i have issue only with speed not with any part of the code. I tried run this code its running more than 12 hours and not finished yet. are there any ways to increase the speed of this above code? any help in this regard is highly appreciated
Thanks in advance

This is lots of nested ifelse statements - nothing wrong with that.

The inner for loop however IMHO can easily be vectorized which will lead to significant speedup.

Below I have tried to create a simple yet similar example. I am using the microbenchmark package to measure execution time.

PS: If you try to rerun the code, please be aware of the time units - you will see microseconds and milliseconds as units (1 millisecond = 1000 microseconds)

library(microbenchmark)

# x: 200 random uniform variables between 0 and 1

x<-runif(200)

# simple ifelse - for loop approach 
microbenchmark(
  for (i in 1:length(x)) {
    ifelse(x[i]>0.5,1,0)
  }
)

# simple ifelse - vectorize it (200x improvement)
microbenchmark(
  ifelse(x>0.5,1,0)
)

# nested ifelse - for loop approach 
microbenchmark(
  for (i in 1:length(x)) {
    ifelse(x[i]>0.5,1,ifelse(x[i]>0.1,0,ifelse(x[i]**2-0.5>0,1,0)))
  }
)

# nested ifelse - vectorize it (150x improvement)
microbenchmark(
  ifelse(x>0.5,1,ifelse(x>0.1,0,ifelse(x**2-0.5>0,1,0)))
)

Happy to discuss further as needed.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.