Anonymous functions vs. compiled named functions: which one should be faster?

Anantadinath · June 3, 2019, 2:00pm

I was wondering if calling a function from lapply would be faster than creating a function inside lapply.

I tried to create a small reprex but feel free to add your input.

library(data.table)
library(compiler)
library(microbenchmark)

f <- cmpfun(
  function(a, b) {
    if (a < 0 || b < 0) {
      "useless"
    } else if (a > b) {
      "greater"
    } else if (a == b) {
      "lesser"
    } else {
      "equal"
    }
  })

dt <- data.table(a = rnorm(20000),
           b = rnorm(20000))

This is a compiled function, so it should run faster. If I call it from inside lapply, it should beat the same function written inline as an anonymous function. So I tried the anonymous version first:

microbenchmark(
  dt[,lapply(.SD$a, function(a,b){
    if(a < 0 || b < 0 ){
      "useless"
    } else if(a>b){
      "greater"
    }else if( a == b){
      "lesser"
    } else{
      "equal"
    }
  }, .SD$b)]
,times = 5L)

This gave me the results

min       lq     mean   median       uq      max neval
 6.323885 6.375043 6.446601 6.401394 6.565713 6.566967     5

While calling the compiled function like this

microbenchmark(
  dt[,lapply(.SD$a,f, .SD$b)]  
  ,times = 5L)

gave me almost exactly the same timings:

Unit: seconds
                          expr      min       lq     mean   median       uq      max neval
 dt[, lapply(.SD$a, f, .SD$b)] 6.221426 6.230694 6.263112 6.237644 6.269459 6.356337     5

Does anybody have any idea what is happening here?

Any input is appreciated. It is just for my own understanding.

Anantadinath · June 4, 2019, 6:00am

I have learned that calling a named function has the overhead of managing environments and the call stack, while an anonymous function has the overhead of being defined at the call site. Both overheads are negligible, so the two techniques should run equally fast.
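A minimal base-R sketch of that comparison, under a few assumptions of mine: the helper name classify_one is made up for illustration, the labels are kept exactly as in the first post, and mapply is used instead of lapply so that a and b are paired element by element. (With lapply(x, f, y), every call receives the whole of y, and recent R versions raise an error when if or || gets a condition longer than one.) Since R 3.4.0 the byte-code compiler is switched on by default for closures, which is probably part of why wrapping a function in cmpfun changes so little here.

```r
# Scalar classifier; labels kept as in the first post of this thread.
classify_one <- function(a, b) {
  if (a < 0 || b < 0) {
    "useless"
  } else if (a > b) {
    "greater"
  } else if (a == b) {
    "lesser"
  } else {
    "equal"
  }
}

set.seed(1)
a <- rnorm(2000)
b <- rnorm(2000)

# Named function, applied pairwise to (a[i], b[i]).
named <- mapply(classify_one, a, b, USE.NAMES = FALSE)

# Same body written as an anonymous function at the call site.
anonymous <- mapply(function(a, b) {
  if (a < 0 || b < 0) {
    "useless"
  } else if (a > b) {
    "greater"
  } else if (a == b) {
    "lesser"
  } else {
    "equal"
  }
}, a, b, USE.NAMES = FALSE)

# Both approaches give the same answer.
identical(named, anonymous)

# Rough timings; the per-element call overhead dominates in both cases.
system.time(for (i in 1:20) mapply(classify_one, a, b))
system.time(for (i in 1:20) mapply(function(a, b) classify_one(a, b), a, b))
```

On most machines the two timings should come out close to each other, which is consistent with the benchmark results in the first post.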

If I want better results, the function should be vectorized for more speed.

library(data.table)
library(compiler)
library(microbenchmark)
library(reprex)

f <- cmpfun(
  function(a, b) {
    ifelse(a < 0 | b < 0,
           "useless",
           ifelse(a > b,
                  "greater",
                  ifelse(a == b,
                         "lesser",
                         "equal")))
  })

dt <- data.table(a = rnorm(20000),
           b = rnorm(20000))



microbenchmark(
  dt[,f(a,b)]  
  ,times = 5L)
#> Unit: milliseconds
#>           expr     min       lq     mean   median      uq      max neval
#>  dt[, f(a, b)] 28.8243 29.02629 31.01628 29.29363 29.3849 38.55229     5

nutterb · June 4, 2019, 11:07am

And if you want to go all out on the optimization, use indexing instead of ifelse:

f <- cmpfun(
  function(a,b){
    out <- character(max(length(a), length(b)))
    g <- a > b
    l <- a < b
    e <- a == b
    u <- a < 0 | b < 0
    
    out[g] <- "greater"
    out[l] <- "equal" # Deliberately incorrect to match @Anantadinath output
    out[e] <- "lesser" # Deliberately incorrect to match @Anantadinath output
    out[u] <- "useless"
    
    out
  })

microbenchmark(
  dt[,f(a,b)]
  ,times = 5L)

# Unit: milliseconds
#           expr      min       lq     mean   median       uq      max neval
#  dt[, f(a, b)] 1.052468 1.057161 1.108538 1.063905 1.068597 1.300557     5

Anantadinath · June 4, 2019, 12:14pm

Interesting
Really very interesting

Thanks for replying on the thread and taking time in answering it.

I have never come across a solution like this. Where can I read more about this type of optimization? I would really like to explore it further. And why is it faster?

nutterb · June 4, 2019, 12:38pm

I'm not educated well enough to give a really good description of why indexed replacement is faster than ifelse. I think the general gist is:

ifelse includes a bunch of error proofing that my indexing solution skipped. There's some gain there
ifelse builds result sets for every set of conditions and then tries to figure out how to merge them into the final solution. With ifelse, the final type of the object is unknown. In the indexed solution, we work with a character vector the entire way.

I'm sure there are other contributions. It's a very small gain in efficiency, and one I often ignore for a simple ifelse, but I will often avoid nested ifelse statements if I believe it is a function that might get called repeatedly.
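A small sketch of those two points, using toy vectors of my own rather than the data from this thread. The first part shows that ifelse only settles on a result type as it fills in values; the second shows that the yes branch is computed for every element, including the ones that are later discarded.

```r
# 1. The result type is decided while filling in values:
#    integer "yes" plus character "no" ends up as character.
test <- c(TRUE, FALSE, TRUE)
mixed <- ifelse(test, 1L, "no")
typeof(mixed)   # "character"

# 2. Both branches are evaluated in full. sqrt() is called on the
#    negative value too, so it warns even though that result is unused.
x <- c(-1, 1, 4)
warned <- FALSE
roots <- withCallingHandlers(
  ifelse(x > 0, sqrt(x), NA_real_),
  warning = function(w) {
    warned <<- TRUE
    invokeRestart("muffleWarning")
  }
)
warned

# Index replacement only touches the selected elements and keeps
# one known type (double) the whole way through.
pos <- x > 0
roots_indexed <- rep(NA_real_, length(x))
roots_indexed[pos] <- sqrt(x[pos])
all.equal(roots, roots_indexed)
```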

Anantadinath · June 4, 2019, 12:41pm

Thanks a ton. This tells me that R code can sometimes be sped up by a factor of more than 1000, so how you write code really matters in R.

nwerth · June 4, 2019, 1:23pm

I wish I could give this 100 likes.

Here's the code from inside ifelse:

# print(ifelse)
function (test, yes, no) 
{
    if (is.atomic(test)) {
        if (typeof(test) != "logical") 
            storage.mode(test) <- "logical"
        if (length(test) == 1 && is.null(attributes(test))) {
            if (is.na(test)) 
                return(NA)
            else if (test) {
                if (length(yes) == 1) {
                  yat <- attributes(yes)
                  if (is.null(yat) || (is.function(yes) && identical(names(yat), 
                    "srcref"))) 
                    return(yes)
                }
            }
            else if (length(no) == 1) {
                nat <- attributes(no)
                if (is.null(nat) || (is.function(no) && identical(names(nat), 
                  "srcref"))) 
                  return(no)
            }
        }
    }
    else test <- if (isS4(test)) 
        methods::as(test, "logical")
    else as.logical(test)
    ans <- test
    len <- length(ans)
    ypos <- which(test)
    npos <- which(!test)
    if (length(ypos) > 0L) 
        ans[ypos] <- rep(yes, length.out = len)[ypos]
    if (length(npos) > 0L) 
        ans[npos] <- rep(no, length.out = len)[npos]
    ans
}

Nutterb is right. ifelse does some sanity and safety checks in the beginning, but most of those shouldn't take long. What takes the most time are these lines:

    if (length(ypos) > 0L) 
        ans[ypos] <- rep(yes, length.out = len)[ypos]
    if (length(npos) > 0L) 
        ans[npos] <- rep(no, length.out = len)[npos]

ifelse can't assume that the values it's given for test, yes, and no are the same length. So it extends them. If you know the lengths of inputs match up, then doing it yourself lets you skip some steps. And, because you're using this with the columns of a data.table, you're safe to assume they're equal lengths.
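To check that skipping the recycling step does not change the answer, here is a rough base-R sketch. The function names nested and indexed are mine, the labels are kept as used earlier in the thread, and the indexed version assumes that a and b have equal length and contain no NA values.

```r
set.seed(42)
a <- rnorm(20000)
b <- rnorm(20000)

# Nested ifelse: each level recycles yes/no to full length.
nested <- function(a, b) {
  ifelse(a < 0 | b < 0, "useless",
         ifelse(a > b, "greater",
                ifelse(a == b, "lesser", "equal")))
}

# Index replacement: start from the fallback label, then overwrite.
# Assumes length(a) == length(b) and no NAs.
indexed <- function(a, b) {
  out <- rep("equal", length(a))
  out[a == b] <- "lesser"
  out[a > b] <- "greater"
  out[a < 0 | b < 0] <- "useless"   # assigned last so it takes priority
  out
}

# Same labels from both versions.
identical(nested(a, b), indexed(a, b))

# Rough timings; exact numbers depend on the machine and R version.
system.time(for (i in 1:20) nested(a, b))
system.time(for (i in 1:20) indexed(a, b))
```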

I've looped back around to appreciating index replacement. It's simple to read, very fast, and a descriptive name for the test vector makes the code self-documenting.

Anantadinath · June 4, 2019, 2:02pm

Glad you enjoyed the thread. I was just trying to learn whether anonymous functions are slower than named ones, but instead I found an entirely different optimization.

I think it's time to read Advanced R and Efficient R Programming.

Thanks for replying.

nutterb · June 4, 2019, 9:05pm

I would just add a caveat that indexing isn't the "right" way to go about these kinds of operations. I still use ifelse a lot. Usually when I'm doing things interactively or when I'm preparing a data set for analysis. The kinds of things that get run maybe a handful of times. It really is a useful function.

You'll notice that in my code, I had to assign the "useless" components last. If I put them first, they get overwritten. Your nested ifelse avoided that problem. So when I'm trying to write code quickly, I tend to use ifelse because it can avoid those kinds of traps.
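A tiny illustration of that trap, with four hand-picked pairs of my own and the labels kept as in the thread:

```r
a <- c(-1, 2, 3, 1)
b <- c( 5, 1, 3, 4)

# "useless" assigned last: it overrides the other labels.
right <- character(length(a))
right[a > b] <- "greater"
right[a < b] <- "equal"    # labels kept as in the thread
right[a == b] <- "lesser"
right[a < 0 | b < 0] <- "useless"

# "useless" assigned first: a later assignment overwrites it.
wrong <- character(length(a))
wrong[a < 0 | b < 0] <- "useless"
wrong[a > b] <- "greater"
wrong[a < b] <- "equal"
wrong[a == b] <- "lesser"

right   # "useless" "greater" "lesser" "equal"
wrong   # "equal"   "greater" "lesser" "equal"
```

The first pair has a negative a, so it should be "useless", but in the second version the a < b assignment runs afterwards and replaces it.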

But when I'm writing functions/packages, I lean more toward indexing.

Don't become overly reliant on one tool. Happy indexing!

Anantadinath · June 5, 2019, 1:48am

I noticed that while reading the code, and I will use indexing only when optimization is really needed.

Thanks a ton for replying.
