Efficiency of anonymous functions within pipes

I'm unclear about how R works "under the hood", and that makes me wonder whether I really should write some of the code that I can write.

For example, take the following. The intention is to use a dataframe as the source of the data for a processing chain if it exists, and otherwise to load the data from a database. For this, I've written:

MessagesSummaryData <-
  (function() {
    if (exists("pcapData2")) {
      print("Using dataframe")

      pcapData2 %>%
        select(LocalCt.Bin, ToTpf, RemAdr, serverPort) %>%
        filter(!(serverPort %in% c(20, 21, 23, 25, 26, 53, 69)))
    } else {
      print("Using database")

      dbcon <- dbConnect(RSQLite::SQLite(), dbName)

      tbl(dbcon, "pcapData2") %>%
        select(LocalCt.Bin, ToTpf, RemAdr, serverPort) %>%
        filter(!(serverPort %in% c(20, 21, 23, 25, 26, 53, 69)) | is.na(serverPort)) %T>%
        explain() %>%
        collect() %>%
        # convert/restore variables mangled by the db
        mutate(ToTpf = as.logical(ToTpf), )
    }
  })() %>%
  mutate( ... ) %>%
  ...

But does this build the processing chain efficiently, or am I introducing inefficiencies by doing it this way?

Also, there seems to be no way of closing the database connection with this approach. That gives me some pause, though it doesn't seem to have caused any issues (yet).

Why can't you close it after the mutate( ToTpf = as.logical(ToTpf), )?
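In other words, if you collect() inside that branch and assign the result, the connection can be closed before the value is handed on. A minimal sketch, reusing the connection and table names from the question:

dbcon <- dbConnect(RSQLite::SQLite(), dbName)

result <- tbl(dbcon, "pcapData2") %>%
  select(LocalCt.Bin, ToTpf, RemAdr, serverPort) %>%
  filter(!(serverPort %in% c(20, 21, 23, 25, 26, 53, 69)) | is.na(serverPort)) %>%
  collect() %>%
  mutate(ToTpf = as.logical(ToTpf))

# collect() has already pulled the data into a local tibble,
# so the connection is no longer needed
dbDisconnect(dbcon)

result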

I would go with naming the function: it is big enough that it's not easy to read, and an explicit name makes reading the whole code a lot easier.

In terms of performance, I don't see how this function would be less performant than other approaches. But note that execution speed is always hard to predict just by reading the code. The best approach is to first try the easiest way and then see whether you have performance issues: if it's fast enough, there's no need to do more work; if it isn't, use profiling to find out which part is too slow.
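If you want to convince yourself, you can also measure it directly. A minimal sketch, assuming the microbenchmark package is installed (bench::mark would work just as well), comparing a pipe fed by an anonymous function with one that uses the data frame directly:

library(dplyr)
library(microbenchmark)

microbenchmark(
  anonymous = (function() mtcars)() %>% summarise(m = mean(mpg)),
  direct    = mtcars %>% summarise(m = mean(mpg)),
  times = 100
)

The anonymous-function call itself adds only a tiny, constant overhead; the real cost sits in the dplyr verbs and, in your case, the database round trip. For a full pipeline, wrapping the whole chain in profvis::profvis() will show which step actually dominates.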

Finally, I don't think you actually need a function; you could just put the if/else directly in the pipe. For example, this code works as expected:

library(dplyr)

choice <- TRUE

{
  if(choice){
    data.frame(x = 5)
  } else{
    data.frame(x = 6)
  }
} %>%
  mutate(x2 = x + 1)
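The outer braces are doing real work there: they turn the whole if/else into a single expression that %>% can pipe from. Without them, the pipe attaches only to the else branch, so with choice <- TRUE the mutate() is silently skipped. A small sketch of that pitfall:

library(dplyr)

choice <- TRUE

# without the enclosing braces, %>% binds to the else branch only,
# so this returns data.frame(x = 5) with no x2 column
if (choice) {
  data.frame(x = 5)
} else {
  data.frame(x = 6)
} %>%
  mutate(x2 = x + 1)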

Then the question is really which is more readable. In your case, I would go with a function with an explicit name:

get_from_df_or_db <- function(dataset_name) {
  if (exists(dataset_name)) {
    print("Using dataframe")

    # look the data frame up by name so the function works for any dataset_name
    get(dataset_name) %>%
      ...
  } else {
    print("Using database")

    dbcon <- dbConnect(RSQLite::SQLite(), dbName)

    tbl(dbcon, dataset_name) %>%
      ...
  }
}

MessagesSummaryData <- get_from_df_or_db("pcapData2") %>%
  mutate( ... ) %>%
  ...

That last one is especially powerful if you need to retrieve different datasets, change the db name, etc., since you can add all of that as parameters. But even in a simple case, just by reading the code you know in a second what that line is for, rather than having to read the whole if/else block.
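For illustration, here is one way that parameterised version could look. This is a sketch only: the db_name argument and the on.exit() cleanup are my additions rather than anything the original code requires, and it assumes the DBI/dbplyr/RSQLite setup from the question. The on.exit() call also takes care of closing the connection discussed above:

library(DBI)
library(dplyr)

get_from_df_or_db <- function(dataset_name, db_name = dbName) {
  if (exists(dataset_name)) {
    print("Using dataframe")

    get(dataset_name) %>%
      select(LocalCt.Bin, ToTpf, RemAdr, serverPort) %>%
      filter(!(serverPort %in% c(20, 21, 23, 25, 26, 53, 69)))
  } else {
    print("Using database")

    dbcon <- dbConnect(RSQLite::SQLite(), db_name)
    # close the connection when the function returns, even if the pipeline fails
    on.exit(dbDisconnect(dbcon), add = TRUE)

    tbl(dbcon, dataset_name) %>%
      select(LocalCt.Bin, ToTpf, RemAdr, serverPort) %>%
      filter(!(serverPort %in% c(20, 21, 23, 25, 26, 53, 69)) | is.na(serverPort)) %>%
      collect() %>%
      mutate(ToTpf = as.logical(ToTpf))
  }
}

MessagesSummaryData <- get_from_df_or_db("pcapData2")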
