Can you advise me on the best way to parallelize this problem?
I have a tibble of 103,884 rows x 1,373 columns from which I want to remove some experimental bias. To this end, I am exploring modeling the bias with a nonlinear function that I fit to the data in each column, grouped by subset, and keeping only the residuals. In practice that means applying a fairly costly function to 1,351 columns in 270 groups, or 364,770 times in total. It is currently written as:
    ti.output <- ti.input %>%
      group_by(barcode) %>%
      mutate_at(.vars = vars(one_of(numNames)),
                .funs = funs(nonLinearCorrection(., !!xCol)))
where the column 'barcode' defines the grouping, 'numNames' is a character vector with the names of all the columns I want to apply the function to, 'xCol' refers to the shared column of x values, and 'nonLinearCorrection(y, x)' is the function, which takes two numeric vectors and returns a vector of replacement y values.
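For a concrete picture, a minimal stand-in for the real function might look like the sketch below. The loess model is only illustrative (the actual function is assumed to be costlier); the firm contract is just "two numeric vectors in, replacement y values out":

    # Illustrative stand-in only: fit a smooth trend of y on x and keep
    # the residuals as the bias-corrected y values. The real
    # nonLinearCorrection() is assumed to be considerably more expensive.
    nonLinearCorrection <- function(y, x) {
      fit <- loess(y ~ x, na.action = na.exclude)  # na.exclude keeps full length
      as.numeric(residuals(fit))                   # replacement y values
    }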
I cannot think of a good way to parallelize it. I have considered using furrr::future_map_dfc with a wrapper function that maps over the vector of column names, but that doesn't work because every worker needs its own copy of the whole data frame, which is too big. A rough sketch of that attempt is below. I also considered multidplyr, but apparently it is not supported on our installed R 3.5.1.
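For reference, the furrr attempt looked roughly like this (the wrapper name, result name, and the multisession backend are illustrative, not my exact code). Because the wrapper references ti.input, furrr exports it as a global, so each worker receives the full tibble:

    library(dplyr)
    library(furrr)
    plan(multisession)   # illustrative backend choice

    # Correct a single column across all barcode groups and return it
    # as a one-column tibble named after the input column.
    correctOneColumn <- function(colName) {
      ti.input %>%                      # <- exported to every worker
        group_by(barcode) %>%
        mutate(.corrected = nonLinearCorrection(.data[[colName]], !!xCol)) %>%
        ungroup() %>%
        select(.corrected) %>%
        rlang::set_names(colName)
    }

    # Each worker needs its own copy of the 103,884 x 1,373 tibble,
    # which is where this approach runs out of memory.
    ti.corrected <- future_map_dfc(numNames, correctOneColumn)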
What am I missing?