Is it bad to write "large" functions (> 0.5 mb) in packages?

eoppe1022 · December 2, 2018, 1:33am

This is more of a general question, but if we assume that -- for the most part -- a function only holds the "essential" components it needs to do what it needs to do, is it okay for a function to be "large" in size?

Like, should I try and separate the components more to make a few more, smaller functions?

Here's an example of a "large" function (1.2 mb) I have that kinda-sorta needs to be the size that it is.

library(tidyverse)
library(RSelenium)
library(rvest)
library(progress)

get_box_score <- function(..., progress = TRUE) {
  
  if (progress) {
    
    pb <- progress::progress_bar$new(format = "get_box_score() [:bar] :percent eta: :eta", clear = FALSE, total = nrow(...), show_after = 0) 
    
    pb$tick(0)}
  
  driver <- rsDriver(verbose = FALSE)
  
  on.exit(driver$client$close())
  on.exit(driver$server$stop())
  
  .get_box_score <- function(url, league, season, ...) {
    
    seq(2, 5, by = 0.001) %>%
      sample(1) %>%
      Sys.sleep()
    
    driver$client$navigate(url)
    
    Sys.sleep(3)
    
    page <- driver$client$getPageSource() %>%
      purrr::pluck(1) %>%
      read_html()
    
    league_alternative_name <- case_when(league == "OHL" ~ "ohl", 
                                         league == "WHL" ~ "whl", 
                                         league == "QMJHL" ~ "lhjmq")
    
    teams <- page %>% 
      html_nodes(".gamecentre-playbyplay-event--goal") %>%
      {tibble(team = as(., "character"))} %>%
      mutate(team = str_split(team, str_c('<div class="gamecentre-playbyplay-event team-border--', league_alternative_name, '-', sep = ""), simplify = TRUE, n = 2)[,2]) %>%
      mutate(team = str_split(team, 'gamecentre-playbyplay-event--goal', simplify = TRUE, n = 2)[,1]) %>%
      mutate(team = toupper(team))
    
    goal_info <- page %>%
      html_nodes(".gamecentre-playbyplay-event--goal") %>%
      html_text() %>%
      as_tibble() %>%
      set_names("messy_data") %>%
      mutate(period = str_split(messy_data, " ", simplify = TRUE, n = 2)[,1]) %>%
      mutate(period = str_split(period, "Goal", simplify = TRUE, n = 2)[,2]) %>%
      mutate(period = str_replace_all(period, c("ST" = "", "ND" = "", "RD" = ""))) %>%
      mutate(time = str_split(messy_data, " ", simplify = TRUE, n = 2)[,2]) %>%
      mutate(time = str_split(time, "\\#", simplify = TRUE, n = 2)[,1]) %>%
      mutate(goal = str_split(messy_data, " ", simplify = TRUE, n = 3)[,3]) %>%
      mutate(goal = str_split(goal, "\\(", simplify = TRUE, n = 2)[,1]) %>%
      mutate(assists = str_split(messy_data, "Assists\\:", simplify = TRUE, n = 2)[,2]) %>%
      mutate(assists = str_split(assists, "\\+/-", simplify = TRUE, n = 2)[,1]) %>%
      mutate(game_strength = case_when(str_detect(messy_data, "Short Handed") & str_detect(messy_data, "Empty Net") ~ "SH EN",
                                       str_detect(messy_data, "Power Play") & str_detect(messy_data, "Empty Net") ~ "PP EN",
                                       str_detect(messy_data, "Short Handed") & str_detect(messy_data, "Penalty Shot") ~ "SH PS",
                                       str_detect(messy_data, "Power Play") & str_detect(messy_data, "Penalty Shot") ~ "PP PS",
                                       str_detect(messy_data, "Empty Net") ~ "EN",
                                       str_detect(messy_data, "Short Handed") ~ "SH",
                                       str_detect(messy_data, "Power Play") ~ "PP",
                                       str_detect(messy_data, "Penalty Shot") ~ "PS",
                                       TRUE ~ "EV")) %>%
      mutate(assists = str_replace_all(assists, c("Power Play" = "", 
                                                  "Short Handed" = "", 
                                                  "Empty Net" = "", 
                                                  "Penalty Shot" = "",
                                                  "Game Winning" = "", 
                                                  "Game Tying" = "",
                                                  "Insurance Goal" = ""))) %>%
      mutate(primary_assist = str_split(assists, ",", simplify = TRUE, n = 2)[,1]) %>%
      mutate(primary_assist = str_replace_all(primary_assist, "\\#[0-9]{1,2}", "")) %>%
      mutate(secondary_assist = str_split(assists, ",", simplify = TRUE, n = 2)[,2]) %>%
      mutate(secondary_assist = str_replace_all(secondary_assist, "\\#[0-9]{1,2}", ""))
      
    
    box_score_data <- teams %>%
      bind_cols(goal_info) %>%
      mutate(season = season) %>%
      mutate(league = league) %>%
      mutate(game_url = url) %>%
      select(time, period, game_strength, team, goal, primary_assist, secondary_assist, season, league, game_url) %>%
      mutate_all(str_squish) %>%
      mutate_all(~na_if(., ""))
    
    if (progress) {pb$tick()}
    
    return(box_score_data)
    
  }
  
  persistently_get_box_score <- elite::persistently(.get_box_score, max_attempts = 10, wait_seconds = 0.0001)
  
  try_get_box_score <- function(url, league, season, ...) {
    
    tryCatch(persistently_get_box_score(url, league, season, ...), 
             
             error = function(e) {
               print(e) 
               print(url)
               data_frame()},
             
             warning = function(w) {
               print(w) 
               print(url)
               data_frame()})
  }
  
  
  all_box_score_data <- pmap_dfr(..., try_get_box_score)
  
  return(all_box_score_data)
  
}

rensa · December 2, 2018, 1:59am

Hey @eoppe1022! I'm not sure what you mean by a large function. In the example above, are you suggesting that the code of the function takes up 1.2 mb, or do you mean that it returns an object that 1.2 mb in size?

eoppe1022 · December 2, 2018, 2:09am

Oh I mean the code -- or at least what this shows.

rensa · December 2, 2018, 3:38am

Hmmm. I wonder if RStudio is reporting the function's environment as well? Might need someone with more RStudio experience to chip in

technocrat · December 2, 2018, 4:55am

@hadley anticipated your question at http://adv-r.had.co.nz/memory.html

There's a package to measure object size. Your function is nowhere so large as reported.

For completeness:

> library(pryr)
> library(tidyverse)
── Attaching packages ──────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.8
✔ tidyr   0.8.2     ✔ stringr 1.3.1
✔ readr   1.2.1     ✔ forcats 0.3.0
── Conflicts ─────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ purrr::partial() masks pryr::partial()
> library(RSelenium)
> library(rvest)
Loading required package: xml2

Attaching package: ‘rvest’

The following object is masked from ‘package:purrr’:

    pluck

The following object is masked from ‘package:readr’:

    guess_encoding

> library(progress)
> 
> get_box_score <- function(..., progress = TRUE) {
+   
+   if (progress) {
+     
+     pb <- progress::progress_bar$new(format = "get_box_score() [:bar] :percent eta: :eta", clear = FALSE, total = nrow(...), show_after = 0) 
+     
+     pb$tick(0)}
+   
+   driver <- rsDriver(verbose = FALSE)
+   
+   on.exit(driver$client$close())
+   on.exit(driver$server$stop())
+   
+   .get_box_score <- function(url, league, season, ...) {
+     
+     seq(2, 5, by = 0.001) %>%
+       sample(1) %>%
+       Sys.sleep()
+     
+     driver$client$navigate(url)
+     
+     Sys.sleep(3)
+     
+     page <- driver$client$getPageSource() %>%
+       purrr::pluck(1) %>%
+       read_html()
+     
+     league_alternative_name <- case_when(league == "OHL" ~ "ohl", 
+                                          league == "WHL" ~ "whl", 
+                                          league == "QMJHL" ~ "lhjmq")
+     
+     teams <- page %>% 
+       html_nodes(".gamecentre-playbyplay-event--goal") %>%
+       {tibble(team = as(., "character"))} %>%
+       mutate(team = str_split(team, str_c('<div class="gamecentre-playbyplay-event team-border--', league_alternative_name, '-', sep = ""), simplify = TRUE, n = 2)[,2]) %>%
+       mutate(team = str_split(team, 'gamecentre-playbyplay-event--goal', simplify = TRUE, n = 2)[,1]) %>%
+       mutate(team = toupper(team))
+     
+     goal_info <- page %>%
+       html_nodes(".gamecentre-playbyplay-event--goal") %>%
+       html_text() %>%
+       as_tibble() %>%
+       set_names("messy_data") %>%
+       mutate(period = str_split(messy_data, " ", simplify = TRUE, n = 2)[,1]) %>%
+       mutate(period = str_split(period, "Goal", simplify = TRUE, n = 2)[,2]) %>%
+       mutate(period = str_replace_all(period, c("ST" = "", "ND" = "", "RD" = ""))) %>%
+       mutate(time = str_split(messy_data, " ", simplify = TRUE, n = 2)[,2]) %>%
+       mutate(time = str_split(time, "\\#", simplify = TRUE, n = 2)[,1]) %>%
+       mutate(goal = str_split(messy_data, " ", simplify = TRUE, n = 3)[,3]) %>%
+       mutate(goal = str_split(goal, "\\(", simplify = TRUE, n = 2)[,1]) %>%
+       mutate(assists = str_split(messy_data, "Assists\\:", simplify = TRUE, n = 2)[,2]) %>%
+       mutate(assists = str_split(assists, "\\+/-", simplify = TRUE, n = 2)[,1]) %>%
+       mutate(game_strength = case_when(str_detect(messy_data, "Short Handed") & str_detect(messy_data, "Empty Net") ~ "SH EN",
+                                        str_detect(messy_data, "Power Play") & str_detect(messy_data, "Empty Net") ~ "PP EN",
+                                        str_detect(messy_data, "Short Handed") & str_detect(messy_data, "Penalty Shot") ~ "SH PS",
+                                        str_detect(messy_data, "Power Play") & str_detect(messy_data, "Penalty Shot") ~ "PP PS",
+                                        str_detect(messy_data, "Empty Net") ~ "EN",
+                                        str_detect(messy_data, "Short Handed") ~ "SH",
+                                        str_detect(messy_data, "Power Play") ~ "PP",
+                                        str_detect(messy_data, "Penalty Shot") ~ "PS",
+                                        TRUE ~ "EV")) %>%
+       mutate(assists = str_replace_all(assists, c("Power Play" = "", 
+                                                   "Short Handed" = "", 
+                                                   "Empty Net" = "", 
+                                                   "Penalty Shot" = "",
+                                                   "Game Winning" = "", 
+                                                   "Game Tying" = "",
+                                                   "Insurance Goal" = ""))) %>%
+       mutate(primary_assist = str_split(assists, ",", simplify = TRUE, n = 2)[,1]) %>%
+       mutate(primary_assist = str_replace_all(primary_assist, "\\#[0-9]{1,2}", "")) %>%
+       mutate(secondary_assist = str_split(assists, ",", simplify = TRUE, n = 2)[,2]) %>%
+       mutate(secondary_assist = str_replace_all(secondary_assist, "\\#[0-9]{1,2}", ""))
+       
+     
+     box_score_data <- teams %>%
+       bind_cols(goal_info) %>%
+       mutate(season = season) %>%
+       mutate(league = league) %>%
+       mutate(game_url = url) %>%
+       select(time, period, game_strength, team, goal, primary_assist, secondary_assist, season, league, game_url) %>%
+       mutate_all(str_squish) %>%
+       mutate_all(~na_if(., ""))
+     
+     if (progress) {pb$tick()}
+     
+     return(box_score_data)
+     
+   }
+   
+   persistently_get_box_score <- elite::persistently(.get_box_score, max_attempts = 10, wait_seconds = 0.0001)
+   
+   try_get_box_score <- function(url, league, season, ...) {
+     
+     tryCatch(persistently_get_box_score(url, league, season, ...), 
+              
+              error = function(e) {
+                print(e) 
+                print(url)
+                data_frame()},
+              
+              warning = function(w) {
+                print(w) 
+                print(url)
+                data_frame()})
+   }
+   
+   
+   all_box_score_data <- pmap_dfr(..., try_get_box_score)
+   
+   return(all_box_score_data)
+   
+ }
> object_size(get_box_score)
 71160 
>

On the other hand, there's much more loaded with all the libraries:

> mem_used()
 68660560

hughparsonage · December 2, 2018, 7:56am

To answer the title, yes it's bad as it affects the install time and can play havoc with R's byte compilation.

However, your function is nowhere near the size that would cause problems. The only instances where I have found the size of a function's source code to be problematic have been when a very large object is defined explicitly in the function body, rather than saved in the data/ or R/sysdata.dta parts of the package.

For example, if I wanted a function identical2mtcars(x) that returned TRUE if x is the mtcars dataset. The right way would be to create mtcars outside the package, then save it to R/sysdata.dta. Then have:

identical2mtcars <- function(x) identical(x, mtcars)

The wrong way would be to write:

identical2mtcars <- function(x) identical(x,
structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 
24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 
30.4, 33.9, 21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8, 
19.7, 15, 21.4), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 
8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4), 
    disp = c(160, 160, 108, 258, 360, 225, 360, 146.7, 140.8, 
    167.6, 167.6, 275.8, 275.8, 275.8, 472, 460, 440, 78.7, 75.7, 
    71.1, 120.1, 318, 304, 350, 400, 79, 120.3, 95.1, 351, 145, 
    301, 121), hp = c(110, 110, 93, 110, 175, 105, 245, 62, 95, 
    123, 123, 180, 180, 180, 205, 215, 230, 66, 52, 65, 97, 150, 
    150, 245, 175, 66, 91, 113, 264, 175, 335, 109), drat = c(3.9, 
    3.9, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92, 
    3.07, 3.07, 3.07, 2.93, 3, 3.23, 4.08, 4.93, 4.22, 3.7, 2.76, 
    3.15, 3.73, 3.08, 4.08, 4.43, 3.77, 4.22, 3.62, 3.54, 4.11
    ), wt = c(2.62, 2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 
    3.15, 3.44, 3.44, 4.07, 3.73, 3.78, 5.25, 5.424, 5.345, 2.2, 
    1.615, 1.835, 2.465, 3.52, 3.435, 3.84, 3.845, 1.935, 2.14, 
    1.513, 3.17, 2.77, 3.57, 2.78), qsec = c(16.46, 17.02, 18.61, 
    19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3, 18.9, 17.4, 17.6, 
    18, 17.98, 17.82, 17.42, 19.47, 18.52, 19.9, 20.01, 16.87, 
    17.3, 15.41, 17.05, 18.9, 16.7, 16.9, 14.5, 15.5, 14.6, 18.6
    ), vs = c(0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 
    0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1), am = c(1, 
    1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 
    0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1), gear = c(4, 4, 4, 3, 
    3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 
    3, 3, 4, 5, 5, 5, 5, 5, 4), carb = c(4, 4, 1, 1, 2, 1, 4, 
    2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2, 2, 4, 2, 1, 
    2, 2, 4, 6, 8, 2)), row.names = c("Mazda RX4", "Mazda RX4 Wag", 
"Datsun 710", "Hornet 4 Drive", "Hornet Sportabout", "Valiant", 
"Duster 360", "Merc 240D", "Merc 230", "Merc 280", "Merc 280C", 
"Merc 450SE", "Merc 450SL", "Merc 450SLC", "Cadillac Fleetwood", 
"Lincoln Continental", "Chrysler Imperial", "Fiat 128", "Honda Civic", 
"Toyota Corolla", "Toyota Corona", "Dodge Challenger", "AMC Javelin", 
"Camaro Z28", "Pontiac Firebird", "Fiat X1-9", "Porsche 914-2", 
"Lotus Europa", "Ford Pantera L", "Ferrari Dino", "Maserati Bora", 
"Volvo 142E"), class = "data.frame"))

(Obviously the problems arise when the object so created is enormous.)

eoppe1022 · December 2, 2018, 4:54pm

Is there a semi-objective point at which you think a function is too large? Like 10 mb?

Or is it more contextual?

alistaire · December 2, 2018, 6:09pm

If a package is going to CRAN, data is supposed to be smaller than a megabyte according to R Packages. Writing R Extensions is fuzzier. I swear somewhere I saw that packages were supposed to be smaller than 5 Mb altogether (or get an exemption), but I can't find a source for that, and there are lots of exceptions.

All that size tends to come from data, though. To make a function that big, you'd have to wrap thousands of lines of code all into one function. If the functions here are as big as they say, I suspect RSelenium is somehow storing a bunch of data in the closure, but I'm not really sure how.

eoppe1022 · December 3, 2018, 1:12am

Thanks for the response. Yeah, it might have to do with RSelenium or having a purrr::pmap in there or something else, but who knows.

So in summary, it's not a problem in and of itself to have a large function unless something's in there that probably shouldn't be in there, right?

alistaire · December 3, 2018, 1:33am

Theoretically it's not a problem; practically it may be because lots of R code probably assumes functions are small, and therefore doesn't optimize for ones that aren't.

Really, though, it's just weird, and merits more investigation. Code doesn't run when you assign a function:

f <- function(x) not_ok()

only when you call it:

f()
#> Error in not_ok(): could not find function "not_ok"

so it's not clear where the size is coming from if it's not in the code itself. Profiling your code may be helpful to see what's happening.

rensa · December 4, 2018, 11:16am

For what it's worth, someone on RStudio Support also asked this question a few years ago:

https://support.rstudio.com/hc/en-us/community/posts/202192398-What-is-a-Large-function-

It doesn't look like they ever got to the bottom of it, but they did establish that sourcing the function with keep.source = FALSE lowered the size of the produced object, suggesting that the captured source was causing the blow-out. I wonder if there was some sort of weirdness going on with the encoding of the source somewhere between it being lifted from disk and being stored by R that caused it to inflate in size?

eoppe1022 · December 4, 2018, 7:00pm

Interesting. I'll look into that. Thanks!

hadley · December 4, 2018, 7:48pm

@eoppe1022 — your function is not large, and there is nothing you need to do about it.

Yes, it is possible to create large function by copying and pasting thousands of lines of data into a function, but you can not do this by accident, so it is not a failure mode that you need to worry about.

R CMD check will automatically report if your package is too large; only then do you need to take action.

system · December 25, 2018, 7:57pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.