Is it normal for renaming factor levels to take a long time?


#1

It is puzzling to me that recoding a factor could take so long. I thought only the levels were stored and that the character representation of each value was not repeated. Is there a faster way to achieve the result below?

library(bench)
library(tidyverse)

df <- data.frame("y" = rnorm(3E7), "Grp" = rep(c("A_something", "B_something", "C_something"), each = 1E7))

bench::mark(
  mutate(df, grp = str_replace(Grp, "_something", ""))
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 x 10
#>   expression   min  mean median   max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <bch> <bch> <bch:> <bch>     <dbl> <bch:byt> <dbl> <int>
#> 1 "mutate(d~ 22.4s 22.4s  22.4s 22.4s    0.0446     573MB     1     1
#> # ... with 1 more variable: total_time <bch:tm>

Created on 2018-09-25 by the reprex package (v0.2.0).


#2

Hey @Dong! I think part of the problem here is that stringr::str_replace() takes 'Either a character vector, or something coercible to one.'

A factor is essentially an integer vector where the possible labels are stored once, separately. By using str_replace(), you're converting your factor to character (essentially causing the entire vector to be re-written), searching and replacing every value, and then converting the whole thing back. The same is happening with the creation: you create a character column and then data.frame converts it to a factor automatically.
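
To make that storage model concrete, here is a minimal sketch on a tiny, made-up factor (not the 3E7-row column from the question). It only uses base R and shows the integer codes, the labels stored once, and what coercion to character produces:

```r
# Small illustrative factor (hypothetical data, far smaller than the real column)
f <- factor(c("A_something", "B_something", "A_something", "C_something"))

typeof(f)        # "integer": the values are stored as integer codes
as.integer(f)    # 1 2 1 3
levels(f)        # the three labels, stored once as an attribute

# Coercing to character materialises one string per element,
# which is what a string function applied to the whole column has to work on
as.character(f)
```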

I think both your factor creation and releveling would go a lot faster this way, using the forcats package to change the levels without touching the values:

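library(forcats)
library(stringr)  # str_replace(), plus the %>% pipe

df <- data.frame(
  "y" = rnorm(3E7),
  # integer codes 1:3 are mapped to the labels; no 3E7-long character vector is built
  "Grp" = factor(rep(1:3, each = 1E7), levels = 1:3,
                 labels = c("A_something", "B_something", "C_something")))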

df$grp = df$Grp %>% fct_relabel(str_replace, "_something", "")

The original releveling took about a minute on my fairly new laptop; using fct_relabel took a fraction of a second :slight_smile: Creating the original data frame column directly as a factor also helps a bit; it took 2–3 seconds versus about 10 using a character vector!


#3

One thing I forgot to mention explicitly is that forcats::fct_relabel() causes str_replace to operate on the set of factor labels (length 3), not on the vector values (length 3E7)!
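
A rough way to see this for yourself, assuming forcats is installed: the sketch below passes fct_relabel() a hypothetical helper, strip_suffix(), that records how many strings it is handed. The helper and the small vector length are made up for illustration.

```r
library(forcats)

f <- factor(rep(c("A_something", "B_something", "C_something"), each = 1000))

# Hypothetical helper that records how many strings it was asked to process
seen <- integer(0)
strip_suffix <- function(x) {
  seen <<- c(seen, length(x))
  sub("_something", "", x, fixed = TRUE)
}

g <- fct_relabel(f, strip_suffix)

seen        # lengths the helper received: the 3 levels, never the 3000 values
levels(g)   # "A" "B" "C"
length(g)   # 3000: the values themselves were not rewritten one by one
```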


#4

Thanks @rensa for the clear explanation. I was trying to use str_replace to do the work of fct_relabel and got exactly what I deserved :frowning:

Again, thanks for introducing this forcats function to me.


#5

That's okay! As a long-time user of factors, I'm ashamed to say that I've only just started using forcats myself :sweat_smile:


#6

By the way, I noticed that @rensa's method also works on a data.table, but about 10x slower than on a data.frame. I wonder if some conversion is going on.

I have been using data.table for performance and memory reasons. If readers have a solution for relabelling factors in a data.table, please share it as well.


#7

Having the column as a factor will help you relabel it. You can do it with base functions, which work on a data.frame and therefore on a data.table or tibble too.
levels() gives you a character vector of the level values, which you can edit and assign back. There are far fewer values there than in your Grp character column.

library(data.table)

df <- data.table("y" = rnorm(3E7), "Grp" = rep(c("A_something", "B_something", "C_something"), each = 1E7))
# transform into factor
df[, Grp := as.factor(Grp)]

levels(df$Grp) <- gsub("_something", "", levels(df$Grp))
df
#>                     y Grp
#>        1: -1.61195065   A
#>        2:  0.98342872   A
#>        3: -1.55122757   A
#>        4:  1.17911409   A
#>        5: -2.24083948   A
#>       ---                
#> 29999996:  0.89209690   C
#> 29999997: -0.14506757   C
#> 29999998:  0.57133525   C
#> 29999999: -0.01521659   C
#> 30000000:  0.17231753   C

Created on 2018-09-26 by the reprex package (v0.2.1)

I'll leave it to you to bench::mark() whichever approaches you want to compare.
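
As a possible data.table-flavoured variant of the same idea, data.table's setattr() can replace the levels attribute by reference. This is only a sketch on a small stand-in table named dt, not something timed in this thread:

```r
library(data.table)

# Small stand-in for the 3E7-row table above
dt <- data.table(
  y   = rnorm(6),
  Grp = factor(rep(c("A_something", "B_something", "C_something"), each = 2))
)

# Replace the levels attribute by reference; the integer codes are untouched
setattr(dt$Grp, "levels", sub("_something", "", levels(dt$Grp), fixed = TRUE))

levels(dt$Grp)   # "A" "B" "C"
```

Whether this is measurably faster than levels<- on your data is worth benchmarking rather than assuming.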


#8

Thanks for teaching me how to use levels. The times I got from tictoc are now roughly comparable.

  1. df$Grp = df$Grp %>% fct_relabel(str_replace, "_something", "") 0.59 sec
  2. dt$Grp = dt$Grp %>% fct_relabel(str_replace, "_something", "") 0.72 sec
  3. levels(dt$Grp) <- gsub("_something", "", levels(dt$Grp)) 0.97 sec

So my previous "10x" observation was not correct. Sorry for the confusion.
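
For anyone who wants to reproduce the broader comparison without waiting on 3E7 rows, here is a base-R sketch that times the whole-vector route against the levels-only route on a deliberately smaller factor. The size n is an arbitrary choice and the timings will vary by machine:

```r
n <- 1e5  # deliberately small; the thread used 1E7 per group
f <- factor(rep(c("A_something", "B_something", "C_something"), each = n))

# Whole-vector route: coerce to character, edit every element, rebuild the factor
t_full <- system.time(
  g_full <- factor(sub("_something", "", as.character(f), fixed = TRUE))
)

# Levels-only route: edit the three labels and leave the integer codes alone
t_lvls <- system.time({
  g_lvls <- f
  levels(g_lvls) <- sub("_something", "", levels(g_lvls), fixed = TRUE)
})

# Check that both routes produce the same factor before comparing timings
identical(g_full, g_lvls)
c(full = t_full[["elapsed"]], levels = t_lvls[["elapsed"]])
```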


#9

If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems.