Quickly recoding large(ish) vectors in a tibble

I think you'll get roughly a 2x speed-up from using factors. With a recode, at least, R should only have to change the 3 level labels of the factor rather than alter every element of the vector.
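
To illustrate why that should be cheap, here is a small base-R sketch on a toy vector (not part of the benchmark, and the object names are made up for the example). A factor stores one integer code per element plus a short vector of level labels, so assigning new labels leaves the per-element codes alone.

```r
# A factor is an integer vector of codes plus a small set of level labels
codes <- factor(c("a", "c", "b", "a", "c"))

# Assign new labels positionally: levels are sorted, so a, b, c in that order
fruit <- codes
levels(fruit) <- c("apple", "banana", "cherry")

# The per-element integer codes are the same before and after;
# only the 3 level labels differ
identical(as.integer(codes), as.integer(fruit))
levels(codes)
levels(fruit)
```

The same idea should carry over to the 1e+06-element column below: the number of labels to rewrite stays at 3 however long the vector is.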

suppressWarnings(library(tidyverse))  
suppressWarnings(library(microbenchmark))

# object to decode and the key

tbl <- tibble(CODE = sample(letters[1:3], 1e+06, replace = TRUE))

key <- tribble(~CODE,   ~FRUIT, 
               "a",  "apple", 
               "b", "banana", 
               "c",  "cherry"
)

tbl_with_factor <- mutate(tbl, CODE = as.factor(CODE))
key_with_factor <- mutate(key, CODE = as.factor(CODE))

# Speed tests

microbenchmark::microbenchmark(
  # Fastest method from previous discussion
  left_join(tbl, key, by = "CODE"),
  
  # Try a recode with factors
  mutate(tbl_with_factor, FRUIT = recode(CODE, a = "apple", b = "banana", c = "cherry")),
  
  # Try the same thing but with recode from forcats
  mutate(tbl_with_factor, FRUIT = forcats::fct_recode(CODE, apple = "a", banana = "b", cherry = "c")),
  
  # Use base R and change the levels
  mutate(tbl_with_factor, FRUIT = `levels<-`(CODE, list("apple" = "a", "banana" = "b", "cherry" = "c"))),
  
  # Left join when everything is a factor
  left_join(tbl_with_factor, key_with_factor, by = "CODE")
)

This gives some nice results:

Unit: milliseconds
                                                                                                     expr      min       lq
                                                                         left_join(tbl, key, by = "CODE") 50.89433 55.58770
              mutate(tbl_with_factor, FRUIT = recode(CODE, a = "apple", b = "banana",      c = "cherry")) 22.98268 29.49390
 mutate(tbl_with_factor, FRUIT = forcats::fct_recode(CODE, apple = "a",      banana = "b", cherry = "c")) 24.25023 28.93651
    mutate(tbl_with_factor, FRUIT = `levels<-`(CODE, list(apple = "a",      banana = "b", cherry = "c"))) 22.41556 26.74911
                                                 left_join(tbl_with_factor, key_with_factor, by = "CODE") 36.87323 41.05750
     mean   median       uq      max neval
 64.41534 58.53341 61.98173 177.3578   100
 38.77766 31.69801 35.41894 154.4150   100
 36.24761 31.47930 34.62983 155.7492   100
 35.36276 29.22803 32.72580 156.9444   100
 46.23981 43.02139 47.31327 162.5166   100

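Looking at the medians, the three factor-based relabelling runs (recode(), fct_recode() and `levels<-`) come out best at roughly 29 to 32 ms, against about 58.5 ms for the character left_join you were using before, so close to 2x faster. The all-factor left_join lands in between at about 43 ms.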
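
One thing to keep in mind when comparing these runs: the relabelling approaches hand back FRUIT as a factor, while a join against the character key gives a character column. Here is a quick base-R sketch (toy data and illustrative names, not the benchmark itself) to check that the two routes agree once the factor is converted back:

```r
# Toy data, base R only; all names here are illustrative
set.seed(1)
code <- sample(letters[1:3], 20, replace = TRUE)
key <- c(a = "apple", b = "banana", c = "cherry")

# Character lookup: indexes the key once per element
by_lookup <- unname(key[code])

# Factor relabel: swap the level labels, result stays a factor
by_factor <- factor(code, levels = names(key))
levels(by_factor) <- unname(key)

# Same values once the factor is converted back to character
stopifnot(identical(by_lookup, as.character(by_factor)))
class(by_factor)
```

If downstream code needs a character column, the as.character() step has to touch every element again, so it may eat into the gain. I haven't timed that here.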
