Multiprocessing in Linux with purrr (without using furrr)?

Hello there!

I was curious how I can use multiprocessing with purrr without using furrr.

Indeed, since the last updates it seems the furrr package does not work as well as it used to (in some cases, worse). I think this has nothing to do with the package creator (who is an amazing coder, god-level like :slight_smile: ) but more with the limitations of multiprocessing on some platforms.

Can I use future and purrr in a manual way? Am I completely mistaken here?

Thanks!

Sure, you can use them separately. My go-to for that is this blog post by the future author: https://www.jottr.org/2017/06/05/many-faced-future/.
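The basic pattern (best suited to a handful of expensive tasks rather than many tiny ones) looks roughly like this; slow_square() is just a made-up placeholder for your real computation:

library(future)
library(purrr)

plan(multisession)  # or plan(multicore) outside RStudio

# slow_square() is a hypothetical stand-in for an expensive function
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# launch one future per task...
futs <- map(1:8, ~ future(slow_square(.x)))

# ...then block until they resolve and collect the results
res <- map(futs, value)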


Actually, do you know what would be the correct syntax to use future in the following context? I want to parallelize the compute part.

Thanks!

library(dplyr)
library(purrr)
library(stringr)
library(tibble)
library(tictoc)

mytib <- tibble(text = rep('hello this is very interesting', times = 1000000))

#sequential
tic()
mytib %>% mutate(compute = map_dbl(text, ~str_detect(., regex('very'))))
toc()

Actually, no matter what I try, the parallel part is much slower. I tried to simplify it as much as possible, but it should give an idea of how future can be used:

library(dplyr) 
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(future)
library(purrr)
library(stringr)
library(tictoc)
plan(multicore)

mytib <- tibble(text = rep('hello this is very interesting', times = 1000))

#sequential 
tic()
res_sequential <- map(mytib$text, ~str_detect(.x, regex("very")))
toc()
#> 0.03 sec elapsed

#multiprocess
tic()
res_parallel <- map(mytib$text, ~future(str_detect(.x, regex('very')))) %>% values()
toc()
#> 4.512 sec elapsed

Created on 2020-04-15 by the reprex package (v0.3.0)

As I've said, I'm still unsure what makes it so much slower and how to change that.


Perhaps we can summon the master? @davis :slight_smile:

The reason the manual future() call is so slow is that you are calling it 1000 times. You are giving each element of mytib$text its own session to run in, but your computer probably only has ~8 cores to run on. So it sends out 8 requests, then has to wait until one finishes before sending out the next one... 1000 times.

future_map() is much smarter. It chunks your input into 8 groups of roughly equal size, and sends out just those 8 chunks.
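To make the chunking idea concrete, here is a rough sketch of doing it by hand with future() and purrr (furrr does this, plus a lot more bookkeeping, for you):

library(future)
library(purrr)
library(stringr)

plan(multicore)  # use plan(multisession) inside RStudio

texts <- rep('hello this is very interesting', times = 1000)

# one chunk per worker instead of one future per element
n_workers <- availableCores()
chunks <- split(texts, cut(seq_along(texts), n_workers, labels = FALSE))

# each future applies str_detect() element-wise to its own chunk,
# mirroring what future_map() does internally
futs <- map(chunks, ~ future(map_lgl(.x, str_detect, pattern = regex("very"))))

# collect and flatten back into a single logical vector
res <- futs %>% map(value) %>% flatten_lgl()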

plan(multicore) on my Mac is fairly fast. It uses forked processes with shared memory, so things don't have to be copied between multiple R sessions, but this is unsafe in RStudio. See https://github.com/HenrikBengtsson/future/blob/63688db8fa9f609ae0800058eada348696fa371a/NEWS#L183-L192

plan(multisession) is safer, and it is what plan(multiprocess) resolves to in {future} on a Mac inside RStudio. But it is slower to start up because it has to copy resources over to the different R sessions.

library(purrr)
library(stringr)
library(tictoc)
library(tibble)
library(furrr)
#> Loading required package: future

mytib <- tibble(text = rep('hello this is very interesting', times = 1000))

#sequential 
tic()
res_sequential <- map(mytib$text, ~str_detect(.x, regex("very")))
toc()
#> 0.09 sec elapsed

plan(multicore)

tic()
res_furrr_multicore <- future_map(mytib$text, ~str_detect(.x, regex("very")))
toc()
#> 0.066 sec elapsed

plan(multisession)

tic()
res_furrr_multisession <- future_map(mytib$text, ~str_detect(.x, regex("very")))
toc()
#> 0.451 sec elapsed

That said, always look to vectorization over parallelization where possible. Your original example can be vectorized because str_detect() is vectorized over the input. This is with the full 1000000 times (not the smaller example above with 1000 times):

library(stringr)
library(tibble)
library(tictoc)

mytib <- tibble(text = rep('hello this is very interesting', times = 1000000))

tic()
xx <- str_detect(mytib$text, regex("very"))
toc()
#> 0.36 sec elapsed

Ha!! Thank you @davis! I always wondered how future_map() chunks the data. Is there any way to control the chunking then? What if I nest my compute variable into different groups and then call future_map()? Something like df %>% group_by(mygroups) %>% nest() %>% mutate(parallel = future_map(data, ~myfunc(.x))), as sketched below.
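To be concrete, this is roughly the pattern I have in mind; mygroups and myfunc() are just placeholders, and I am not sure this is actually better than calling future_map() on the ungrouped data:

library(dplyr)
library(tidyr)
library(tibble)
library(furrr)

plan(multisession)

# toy data; mygroups would define the chunks
df <- tibble(
  mygroups = rep(c("a", "b", "c"), each = 100),
  value    = rnorm(300)
)

# myfunc() stands in for the real per-group computation
myfunc <- function(d) nrow(d)

df %>%
  group_by(mygroups) %>%
  nest() %>%       # one list-column row per group
  ungroup() %>%    # so the next mutate() sees all groups at once
  mutate(parallel = future_map(data, ~ myfunc(.x)))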

This is potentially the key point. So what you are saying is that, in order to make it work, I need to run the R script from the terminal directly (that is, without opening RStudio)?

Thank you again!!!

@davis you were right, running with Rscript did some magic. However, could you please tell me how to customize furrr a bit more? For instance, choosing the number of chunks seems important, and is furrr using all the processors by default? Is something like the sketch below the right idea?
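This is what I have pieced together so far from the docs; the workers argument to plan() is the part I am fairly sure about, while the chunking side I still do not know how to control:

library(future)
library(furrr)

# how many cores does future see on this machine?
availableCores()

# limit the number of workers instead of grabbing all of them
plan(multicore, workers = 4)

res <- future_map(1:1000, ~ .x^2)

# (newer furrr releases also seem to expose chunking knobs such as
#  chunk_size and scheduling via furrr_options(), if I read the docs correctly)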

Thank you again for all the amazing work you do (slider is a gem)
