Error in parallel execution in R

JorgeBS94 · June 9, 2023, 12:52pm

Hello,

This is my first post here. I have run into a problem with the last two versions of RStudio on my machine. I am running a simple vectorised loop using the future_map() function from the furrr package, which is built on the future package. I have used this function many times and have always had good performance. Now, however, when I try to execute any task in parallel (I have tried several scripts), the computer does not use all cores. Instead, it runs at about 20% CPU usage and just 0.8 GHz (my processor goes up to nearly 5 GHz). Occasionally during the computation the usage rises to 70%, with the clock speed still only around 1.3 GHz, and then it drops back to 20%. I have tried using clusterApply() instead of future_map() and the behaviour is exactly the same.

All my packages are up to date, and the sessionInfo() output is the following:

R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Spain.utf8 LC_CTYPE=Spanish_Spain.utf8 LC_MONETARY=Spanish_Spain.utf8 LC_NUMERIC=C
[5] LC_TIME=Spanish_Spain.utf8

time zone: Europe/Paris
tzcode source: internal

attached base packages:
[1] grid stats graphics grDevices utils datasets methods base

other attached packages:
[1] furrr_0.3.1 future_1.32.0 mgsub_1.7.3 graphite_1.46.0 ROntoTools_2.28.0 Rgraphviz_2.44.0 KEGGgraph_1.60.0
[8] KEGGREST_1.40.0 boot_1.3-28.1 graph_1.78.0 BiocGenerics_0.46.0 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
[15] dplyr_1.1.2 purrr_1.0.1 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.2 tidyverse_2.0.0

loaded via a namespace (and not attached):
[1] gtable_0.3.3 Biobase_2.60.0 tzdb_0.3.0 vctrs_0.6.2 tools_4.3.0 bitops_1.0-7
[7] generics_0.1.3 stats4_4.3.0 parallel_4.3.0 fansi_1.0.4 AnnotationDbi_1.62.1 RSQLite_2.3.1
[13] blob_1.2.4 pkgconfig_2.0.3 S4Vectors_0.38.1 lifecycle_1.0.3 GenomeInfoDbData_1.2.10 compiler_4.3.0
[19] Biostrings_2.68.0 munsell_0.5.0 codetools_0.2-19 GenomeInfoDb_1.36.0 RCurl_1.98-1.12 pillar_1.9.0
[25] crayon_1.5.2 cachem_1.0.8 org.Hs.eg.db_3.17.0 parallelly_1.35.0 digest_0.6.31 tidyselect_1.2.0
[31] stringi_1.7.12 listenv_0.9.0 fastmap_1.1.1 colorspace_2.1-0 cli_3.6.1 magrittr_2.0.3
[37] XML_3.99-0.14 utf8_1.2.3 withr_2.5.0 rappdirs_0.3.3 scales_1.2.1 bit64_4.0.5
[43] timechange_0.2.0 XVector_0.40.0 httr_1.4.6 globals_0.16.2 bit_4.0.5 png_0.1-8
[49] hms_1.1.3 memoise_2.0.1 IRanges_2.34.0 rlang_1.1.1 glue_1.6.2 DBI_1.1.3
[55] rstudioapi_0.14 R6_2.5.1 zlibbioc_1.46.0

Does anyone know what this is about?

Thank you very much for your help,

Jorge

nirgrahamuk · June 9, 2023, 1:41pm

Are you using plan(multisession) and setting the number of workers?

JorgeBS94 · June 9, 2023, 2:10pm

Yes, sure. I do:

plan(multisession,18)

I have tried changing the number of workers and the result is the same every time.
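One way to check whether the requested workers are really being started is to ask future directly. The following is a minimal sketch, not taken from the thread: it assumes the future package is installed, and `workers = 2` is an arbitrary value chosen only to keep the check quick.

```r
library(future)

# Number of cores R believes it may use on this machine
availableCores()

# Assumption: 2 workers, chosen only for illustration
plan(multisession, workers = 2)

# Number of workers the current plan provides
nbrOfWorkers()

# Launch two futures that each sleep briefly and then report the process ID
# of the R session that evaluated them. The sleep keeps the first worker busy
# so that the second future has to run on a different worker.
fs <- lapply(1:2, function(i) future({
  Sys.sleep(1)
  Sys.getpid()
}))
pids <- vapply(fs, value, integer(1))

# Worker PIDs should differ from the PID of the main session
pids
Sys.getpid()

# Return to sequential processing and shut the workers down
plan(sequential)
```

If the reported PIDs are all different from the main session's PID, the background R sessions exist and are doing the work, so low CPU usage would have to come from somewhere else.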

nirgrahamuk · June 9, 2023, 2:44pm

Your title implies that RStudio may be at fault. I think one way to investigate that is to use RGui instead of RStudio, or else start R from the system console and run your code outside of RStudio. It's possible, but I think unlikely, that RStudio is at fault.

JorgeBS94 · June 9, 2023, 3:47pm

I have done as you suggested and I see the same performance issues, so I guess it is not RStudio's fault. I have updated the title accordingly.

nirgrahamuk · June 9, 2023, 3:50pm

I checked what the intro vignette on furrr's homepage says about how to prove that parallelism is at work. There is a good example there.
Can I ask you to try it as is and report back your experience?

library(furrr)
library(tictoc)

# This should take 6 seconds in total running sequentially
plan(sequential)

tic()
nothingness <- future_map(c(2, 2, 2), ~Sys.sleep(.x))
toc()
#> 6.08 sec elapsed
# This should take ~2 seconds running in parallel, with a little overhead
# in `future_map()` from sending data to the workers. There is generally also
# a one time cost from `plan(multisession)` setting up the workers.
plan(multisession, workers = 3)

tic()
nothingness <- future_map(c(2, 2, 2), ~Sys.sleep(.x))
toc()

JorgeBS94 · June 9, 2023, 4:15pm

The computer passes the test:

library(furrr)
library(tictoc)

# This should take 6 seconds in total running sequentially
plan(sequential)

tic()
nothingness <- future_map(c(2, 2, 2), ~Sys.sleep(.x))
toc()
6.13 sec elapsed
#> 6.08 sec elapsed
# This should take ~2 seconds running in parallel, with a little overhead
# in `future_map()` from sending data to the workers. There is generally also
# a one time cost from `plan(multisession)` setting up the workers.
plan(multisession, workers = 3)

tic()
nothingness <- future_map(c(2, 2, 2), ~Sys.sleep(.x))
toc()
4.05 sec elapsed

However, in this test you are telling the computer to sleep, not to calculate anything. The code works, but it is not representative of my real examples, where an expensive computation runs in parallel. So my computer can parallelise, but it is not doing it properly.

nirgrahamuk · June 9, 2023, 4:22pm

It shows that the sleep calls occur on different cores; you should be able to drop in some long-running calculation. OK, here's one I came up with:

library(furrr)
library(tictoc)

my_func <- function(n){combn(x = 1:(n*200),3)}
# Sequential baseline: the three calls run one after another
plan(sequential)

tic()
nothingness <- future_map(c(2, 2, 2), ~my_func(.x))
toc()

plan(multisession, workers = 3)

tic()
nothingness <- future_map(c(2, 2, 2), ~my_func(.x))
toc()

I get about 24 seconds for the sequential run and 10 seconds for the parallelised one.
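Note that in the example above each task returns a large matrix (combn(1:400, 3) has 3 rows and 10,586,800 columns), so part of the parallel time may be spent sending results back from the workers rather than computing. The following is a rough sketch of a variant that returns only one number per task, so the timings mostly reflect CPU work. The function `cpu_task` and the value of `n` are hypothetical choices for illustration, and base R's system.time() is used here instead of tictoc.

```r
library(furrr)

# Hypothetical CPU-bound task: loops over n values and returns a single number,
# so almost no data has to travel back from the worker
cpu_task <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + sqrt(i)
  total
}

# Assumption: n is large enough that each task takes a noticeable amount of time
n <- 2e7

# Sequential baseline
plan(sequential)
t_seq <- system.time(
  res_seq <- future_map_dbl(rep(n, 3), cpu_task)
)[["elapsed"]]

# Same work spread over three background R sessions
plan(multisession, workers = 3)
t_par <- system.time(
  res_par <- future_map_dbl(rep(n, 3), cpu_task)
)[["elapsed"]]

plan(sequential)

# Elapsed times in seconds and the resulting speedup factor
c(sequential = t_seq, parallel = t_par, speedup = t_seq / t_par)
```

If the speedup stays close to 1 with a task like this, while a system monitor shows the worker processes mostly idle or the clock speed staying low, the bottleneck is probably outside R.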

