Grouped and parallel curve fitting

Hi,

Can you advise me about the best way to parallelize this problem?

I have a tibble of 103,884 rows x 1,373 columns from which I want to remove some experimental bias. To this end, I am exploring modeling the biases with a nonlinear function that I fit to the data in each column, grouped by subset, keeping only the residuals. In practice that means applying a fairly costly function to 1,351 columns in 270 groups, or 364,770 times in total. Currently it is written as:

ti.output <- ti.input %>%
    group_by(barcode) %>%                             # one fit per barcode group
    mutate_at(
        .vars = vars(numNames),                       # the columns to correct
        .funs = funs(nonLinearCorrection(., !! xCol)) # replace each column with its corrected values
    )

where the column 'barcode' defines the grouping, 'numNames' is a character vector with the names of all the columns I want to apply the function to, and 'nonLinearCorrection(y, x)' is the function, which takes two numeric vectors and returns a vector of replacement y values.
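For concreteness, nonLinearCorrection() has roughly this shape; this is only a sketch of the interface, where the exponential-decay model, the use of nls(), and the start values are placeholders, not the actual implementation (which uses optimx):

# Sketch only: fit a nonlinear trend of y on x, return residuals as the
# corrected y values. The model and fitting method here are placeholders.
nonLinearCorrection <- function(y, x) {
    fit <- try(
        nls(y ~ a * exp(-b * x) + y0,
            start = list(a = max(y, na.rm = TRUE), b = 0.1, y0 = 0)),
        silent = TRUE
    )
    if (inherits(fit, "try-error")) {
        return(y)                                   # if the fit fails, leave the column unchanged
    }
    y - predict(fit, newdata = data.frame(x = x))   # keep only the residuals
}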

I cannot think of a good way to parallelize it. I have considered using furrr::future_map_dfc with a wrapper function that takes the vector of column names to iterate over, but that does not work because every worker process needs its own copy of the data frame, and it is too big. I considered using multidplyr, but apparently it is not supported on our installed R 3.5.1.
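Roughly, the rejected attempt looked like this (a sketch; the wrapper name correctOneColumn is made up, and xCol is assumed to be a pre-quoted column symbol as in the code above), which makes the memory problem visible:

library(dplyr)
library(furrr)
library(purrr)

plan(multisession, workers = 16)

# Each future captures ti.input as a global, so every worker receives a
# full copy of the whole table -- that is the memory problem.
correctOneColumn <- function(colName) {
    ti.input %>%
        group_by(barcode) %>%
        mutate(.corrected = nonLinearCorrection(.data[[colName]], !! xCol)) %>%
        pull(.corrected)        # corrected values in the original row order
}

ti.corrected <- future_map_dfc(set_names(numNames), correctOneColumn)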

What am I missing?

Emmanuel

Hi,

Just to comment, I finally installed multidplyr from GitHub (it is not yet on CRAN), and then it worked fine once I did the necessary preparatory work:

# set up a cluster
cluster <- new_cluster(16)
# function and objects needed 
cluster_copy(cluster, "nonLinearCorrection")    
cluster_copy(cluster, "numNames")    
cluster_copy(cluster, "xCol")    
# packages needed by the nonLinearCorrection() function
cluster_library(cluster, "optimx")
cluster_library(cluster, "tibble")
cluster_library(cluster, "dplyr")
cluster_library(cluster, "magrittr")
    
ti.output <- ti.input %>%
    group_by(barcode) %>%
    partition(cluster) %>%                          # split the groups across the 16 workers
    mutate_at(
        .vars = vars(numNames),
        .funs = list(~ nonLinearCorrection(., !! xCol))
    ) %>%
    collect()                                       # reassemble the corrected tibble

This worked better than some of the alternatives I considered because it slices the data set horizontally rather than vertically: each worker gets a copy of only its share of the rows, not the whole table. No doubt there are still more efficient ways to do the same.
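To illustrate the horizontal split on a toy table (not the real data), partition() distributes whole groups across the workers, so a grouped mutate runs locally on each worker:

library(dplyr)
library(multidplyr)

# Toy example: four groups split over two workers; each worker holds
# complete groups, so no rows need to move during the grouped mutate.
cl <- new_cluster(2)
toy <- tibble(barcode = rep(c("a", "b", "c", "d"), each = 3),
              y = rnorm(12))
toy %>%
    group_by(barcode) %>%
    partition(cl)        # printing shows the per-worker row counts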

Emmanuel