Grouped and parallel curve fitting


Can you advise me about the best way to parallelize this problem?

I have a tibble of 103,884 rows x 1,373 columns from which I want to remove some experimental bias. To do this, I am exploring modeling the biases with a nonlinear function that I fit to the data in each column, grouped by subset, keeping only the residuals. In practice that means applying a fairly costly function to 1,351 columns in 270 groups, or 364,770 times in total. It is currently written as:

ti.output <- ti.input %>%
    group_by(barcode) %>%
    mutate_at(
        .vars = vars(numNames),
        .funs = funs(nonLinearCorrection(., !! xCol))
    )

where the column 'barcode' defines the grouping, 'numNames' is a character vector with the names of all the columns I want to apply the function to, and 'nonLinearCorrection(y, x)' is the function, which takes two numeric vectors and returns a vector of replacement y values.

I cannot think of a good way to parallelize it. I have considered using furrr::future_map_dfc with a wrapper function that iterates over the list of column names, but that doesn't work because every worker needs a copy of the data frame, and it is too big. I also considered using multidplyr, but apparently it is not supported for our installed R 3.5.1.

What am I missing?



Just to follow up: I finally installed multidplyr from git (it is not yet on CRAN), and it worked fine once I did the necessary preparatory work:

library(dplyr)
library(multidplyr)

# set up a cluster with 16 workers
cluster <- new_cluster(16)
# copy the function and objects the workers need
cluster_copy(cluster, "nonLinearCorrection")
cluster_copy(cluster, "numNames")
cluster_copy(cluster, "xCol")
# load the packages needed by the nonLinearCorrection() function on each worker
cluster_library(cluster, "optimx")
cluster_library(cluster, "tibble")
cluster_library(cluster, "dplyr")
cluster_library(cluster, "magrittr")
# distribute the groups over the workers, correct each column, and gather the result
ti.output <- ti.input %>%
        group_by(barcode) %>%
        partition(cluster) %>%
        mutate_at(
            .vars = vars(numNames),
            .funs = list(~ nonLinearCorrection(., !! xCol))
        ) %>%
        collect()
This worked better than some alternatives that I considered, because it slices the data set horizontally (by groups of rows) rather than vertically (by columns), so each worker gets a copy of only part of the data. No doubt there are still more efficient ways to do the same.
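For anyone who cannot install multidplyr, the same idea of splitting by rows can be sketched with the base parallel package. This is only an illustration under assumptions, not what I actually ran: toy_correction() is a hypothetical stand-in for nonLinearCorrection() (it just returns the residuals of a linear fit), and the small data frame, the column names y1, y2 and x, and the helper correct_group() are made up for the example.

```r
library(parallel)

# Hypothetical stand-in for nonLinearCorrection(y, x):
# returns the residuals of a simple linear fit of y on x.
toy_correction <- function(y, x) {
  as.numeric(residuals(lm(y ~ x)))
}

# Made-up example data: 3 groups, 2 columns to correct.
set.seed(1)
df <- data.frame(
  barcode = rep(c("a", "b", "c"), each = 20),
  x       = rep(seq_len(20), times = 3),
  y1      = rnorm(60),
  y2      = rnorm(60)
)
num_names <- c("y1", "y2")

# Slice horizontally: one data frame per barcode.
pieces <- split(df, df$barcode)

# Apply the correction to every requested column of one group.
correct_group <- function(piece, cols, x_col) {
  piece[cols] <- lapply(piece[cols], toy_correction, x = piece[[x_col]])
  piece
}

# Each worker receives only the groups it is asked to process.
cl <- makeCluster(2)
clusterExport(cl, "toy_correction")
corrected <- parLapply(cl, pieces, correct_group,
                       cols = num_names, x_col = "x")
stopCluster(cl)

# Stack the corrected groups back into one data frame.
result <- do.call(rbind, corrected)
```

As with partition(), the split is by group, so memory use per worker scales with the size of the groups it handles rather than with the whole table. The trade-off is that the helper function has to be exported to the workers by hand, much like cluster_copy() above.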

