The manual future() call is so slow because you are calling it 1000 times. You are giving each element of mytib$text its own future to run in, but your computer probably only has ~8 cores to run on. So it sends out 8 requests, then has to wait until one finishes before sending out the next one, and it repeats that for all 1000 elements.
future_map() is much smarter. It chunks your input into 8 groups of roughly equal size (one per worker in this case), and sends out just those 8 chunks.
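The chunking idea can be sketched by hand with plain {future}. This is only a rough illustration of "one future per chunk instead of one future per element", not furrr's actual internals. The worker count of 2, the use of parallel::splitIndices() to build the chunks, and base grepl() in place of str_detect() are all assumptions made to keep the sketch small.

```r
library(future)

# Assumption: 2 background R sessions; use whatever your machine supports
plan(multisession, workers = 2)

text <- rep("hello this is very interesting", times = 1000)

# Split the element indices into one roughly equal chunk per worker
chunks <- parallel::splitIndices(length(text), nbrOfWorkers())

# Create one future per chunk (2 futures here, not 1000)
futures <- lapply(chunks, function(idx) {
  future(grepl("very", text[idx], fixed = TRUE))
})

# Collect the chunk results and stitch them back together in order
res <- unlist(lapply(futures, value))

# Shut the workers down again
plan(sequential)
```

Each worker receives a single request carrying a whole chunk, so the per-future overhead is paid once per worker rather than once per element.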
plan(multicore) on my Mac is fairly fast. It uses forked processes that share memory, so things don't have to be copied between multiple R sessions, but this is unsafe in RStudio. See https://github.com/HenrikBengtsson/future/blob/63688db8fa9f609ae0800058eada348696fa371a/NEWS#L183-L192
plan(multisession) is safer, and is what {future} falls back to when you use plan(multiprocess) on a Mac in RStudio. But it is slower to start up because it has to copy resources over to the separate R sessions.
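If you would rather pick the backend explicitly than rely on plan(multiprocess) (which, as far as I know, was deprecated in later {future} releases), a minimal sketch could look like the following. It assumes a {future} version that exports supportsMulticore(); the worker count of 2 and the variable names are just for illustration.

```r
library(future)

# How many cores {future} believes it is allowed to use
n_cores <- availableCores()

# FALSE on Windows and, by default, when running inside RStudio
use_fork <- supportsMulticore()

if (use_fork) {
  # Forked workers: fast to start, shared memory
  plan(multicore, workers = 2)
} else {
  # Separate background R sessions: safer, slower to start
  plan(multisession, workers = 2)
}

chosen <- if (use_fork) "multicore" else "multisession"
message("Using plan(", chosen, ") with ", nbrOfWorkers(), " workers")

# Reset to the default when done
plan(sequential)
```

Run from a terminal on a Mac this should pick multicore; run from RStudio it should fall through to multisession.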
library(purrr)
library(stringr)
library(tictoc)
library(tibble)
library(furrr)
#> Loading required package: future
mytib <- tibble(text = rep('hello this is very interesting', times = 1000))
# sequential baseline: purrr::map() on a single core
tic()
res_sequential <- map(mytib$text, ~str_detect(.x, regex("very")))
toc()
#> 0.09 sec elapsed
# forked workers (run outside RStudio)
plan(multicore)
tic()
res_furrr_multicore <- future_map(mytib$text, ~str_detect(.x, regex("very")))
toc()
#> 0.066 sec elapsed
# separate background R sessions
plan(multisession)
tic()
res_furrr_multisession <- future_map(mytib$text, ~str_detect(.x, regex("very")))
toc()
#> 0.451 sec elapsed
That said, always look to vectorization over parallelization where possible. Your original example can be vectorized because str_detect() is vectorized over the input. This timing uses the full 1000000 repetitions (not the smaller example above with 1000):
library(stringr)
library(tibble)
library(tictoc)
mytib <- tibble(text = rep('hello this is very interesting', times = 1000000))
tic()
xx <- str_detect(mytib$text, regex("very"))
toc()
#> 0.36 sec elapsed
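As a quick sanity check that the vectorized call really is a drop-in replacement, here is a small sketch comparing it against an element-by-element loop. It uses the smaller 1000-element input, and vapply() stands in for map() so that both sides are plain logical vectors; that substitution is my own choice, not part of the original example.

```r
library(stringr)

text <- rep("hello this is very interesting", times = 1000)

# Element by element: one str_detect() call per string
res_loop <- vapply(
  text,
  function(x) str_detect(x, regex("very")),
  logical(1),
  USE.NAMES = FALSE
)

# Vectorized: a single str_detect() call over the whole vector
res_vec <- str_detect(text, regex("very"))

# Both approaches should give the same logical vector
same <- identical(res_loop, res_vec)
```

Note that map() returns a list, so if you keep the purrr version and want a logical vector like the one str_detect() returns, map_lgl() is the closer equivalent.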