For anyone interested in another example of purrr parallelization with the future package besides the one in the tweet, here is a silly random forest example using the weather data set from nycflights13. It is only meant to show the time difference between the two approaches, and that the parallelization actually works. The elapsed time is the number to watch.
Note that parallelization has some overhead, so spreading the 3 models over 3 cores is not exactly 3 times as fast!
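To put a number on that overhead, here is a minimal sketch that computes the speedup from the two elapsed times reported in the test output at the end of this post. The variable names are my own, and the "efficiency" figure simply assumes all 3 workers were available for the whole run.

```r
# Elapsed times (seconds) copied from the two tests in this post
elapsed_parallel   <- 4.145
elapsed_sequential <- 8.667
n_workers <- 3  # assumption: one worker per model

# How many times faster the parallel run was
speedup <- elapsed_sequential / elapsed_parallel

# Fraction of the ideal 3x speedup that was actually achieved
efficiency <- speedup / n_workers

round(speedup, 2)
round(efficiency, 2)
```

So the parallel run comes out roughly twice as fast rather than three times; the rest is most likely the cost of starting the workers and shipping the data to them.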
library(future)
library(tidyverse)
library(nycflights13)
weather_nest <- weather %>%
  group_by(origin) %>%
  nest()
# Silly random forest model
weather_model <- function(data) {
  randomForest::randomForest(temp ~ dewp + humid + precip, data = data, na.action = na.omit)
}
# Test 1
t1 <- proc.time()
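# multiprocess (deprecated in newer releases of future) used to choose between
# multicore (Mac/Linux) and multisession (Windows). multisession works on
# every platform, so it is named explicitly here.
plan(multisession)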
# This returns instantly and begins running the models.
# If you ran just this you would still be able to control your R
# session and run other code. It is "non-blocking" because the computation
# is being done somewhere else. On my Mac, I can open Activity Monitor
# and see that rsession is listed 4 times. Once for this session and 3 other
# times for the 3 other cores (one per model)
weather_nest_future <- mutate(weather_nest,
                              weather_future = map(data, ~ future(weather_model(.x))))
# Once we run this, we "block" the R session that we are in, because we are
# waiting for value() to return the results of the random forest.
# Note that value() is a function from future, not randomForest
# (older releases of future called the list version values()).
mutate(weather_nest_future, weather_value = value(weather_future))
proc.time() - t1
#    user  system elapsed
#  10.769   0.987   4.145
# Test 2
t2 <- proc.time()
# This runs them normally, in sequence
mutate(weather_nest,
       weather_model_sequential = map(data, ~ weather_model(.x)))
proc.time() - t2
#    user  system elapsed
#   8.261   0.399   8.667
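
If the "non-blocking" versus "blocking" distinction in the comments above is hard to picture, here is a small standalone sketch of it. It does not use the weather data; `slow_square()` is a hypothetical stand-in for a slow model fit, and I am assuming a multisession plan with 2 workers. `resolved()` checks on a future without waiting for it, while `value()` waits until the result is ready.

```r
library(future)

# Assumption: background R sessions as workers
plan(multisession, workers = 2)

# Hypothetical stand-in for a slow model fit
slow_square <- function(x) {
  Sys.sleep(2)
  x^2
}

# Returns right away; the work runs in a background worker
f <- future(slow_square(4))

# Polls the future without blocking the current session
is_done_immediately <- resolved(f)

# Blocks until the worker has finished, then returns the result
result <- value(f)
is_done_after <- resolved(f)

# Go back to sequential processing and shut down the workers
plan(sequential)
```

Right after `future()` is called, the check in `is_done_immediately` would normally still be `FALSE`, since the worker is busy sleeping; after `value()` has returned, `resolved()` reports `TRUE`.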