It is puzzling to me that recoding a factor could take so long. I thought only the levels are stored, so the character representation of each level is not repeated for every row. Is there a faster way to achieve the result below?
library(bench)
library(tidyverse)

df <- data.frame("y" = rnorm(3E7),
                 "Grp" = rep(c("A_something", "B_something", "C_something"), each = 1E7))

bench::mark(
  mutate(df, grp = str_replace(Grp, "_something", ""))
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 x 10
#>   expression   min  mean median   max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <bch> <bch> <bch:> <bch>     <dbl> <bch:byt> <dbl> <int>
#> 1 "mutate(d~ 22.4s 22.4s  22.4s 22.4s    0.0446     573MB     1     1
#> # ... with 1 more variable: total_time <bch:tm>
Created on 2018-09-25 by the reprex package (v0.2.0).
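
For what it's worth, here is a minimal sketch of the idea in the question, namely that a factor stores integer codes plus a short vector of levels, so in principle only the levels need rewriting. It assumes `Grp` really is a factor (the default for `data.frame()` before R 4.0, set explicitly here), uses a small stand-in data frame rather than the 3E7-row one, and has not been benchmarked. The slowness above is probably because `str_replace()` converts the factor to a character vector first and then processes every row.

```r
# Small stand-in for the data above. stringsAsFactors = TRUE is set explicitly
# (an assumption) because R >= 4.0 no longer creates factors by default.
df <- data.frame(
  y   = rnorm(9),
  Grp = rep(c("A_something", "B_something", "C_something"), each = 3),
  stringsAsFactors = TRUE
)

# A factor is an integer vector of codes plus a character vector of levels.
typeof(df$Grp)
levels(df$Grp)

# Copy the column, then rewrite only the three level labels
# instead of touching every row.
df$grp <- df$Grp
levels(df$grp) <- sub("_something", "", levels(df$grp), fixed = TRUE)

levels(df$grp)
```

The result stays a factor, whereas the `str_replace()` version returns a character column. If staying within the tidyverse matters, `forcats::fct_relabel()` applies a function to the levels in the same spirit.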