Converting nested for loops: the nested parallel foreach doesn't work


I have a table with ~1M rows (each row is an insurance contract, so one client can have multiple contracts) and the columns client_id, names and adresses. The problem I am trying to solve is that the same client can be given a different client_id for each new contract.

To resolve this I have done the following:

  1. Create a New_ID as a 4th column in the table
  2. Iterate twice over names and calculate the name similarity for each combination
  3. Iterate twice over adresses and calculate the address similarity for each combination
  4. Inside each iteration: if name_similarity > 0.9 & adresses_similarity > 0.85 then New_ID takes the value of j
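
To make steps 2 to 4 concrete, here is a minimal sketch of the decision for a single pair of rows. The helper name is_same_client is hypothetical and only used for illustration, and using "osa" for both columns is an assumption (the sequential loop in this post uses "qgram" for the addresses).

```r
library(stringdist)

# Hypothetical helper (not part of stringdist): applies both thresholds to one pair
is_same_client <- function(name_a, name_b, adr_a, adr_b) {
  name_similarity     <- stringsim(name_a, name_b, method = "osa")
  adresses_similarity <- stringsim(adr_a, adr_b, method = "osa")
  name_similarity > 0.9 & adresses_similarity > 0.85
}

is_same_client("Name", "Name",  "Adress", "Adress")   # TRUE: identical strings have similarity 1
is_same_client("Name", "Namee", "Adress", "Adresss")  # FALSE: name similarity is 1 - 1/5 = 0.8
```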

Used packages + fake data:

library(stringdist) # strings' similarities 
library(parallel)   # parallel programming 
library(foreach)    # parallel programming 
library(doParallel) # parallel programming 
library(doSNOW)     # parallel programming 

# Fake data 
client_id <- 1:6
names <- c("Name", "Naaame", "Name", "Namee", "Nammee", "Nammee")
adresses <- c("Adress", "Adressss", "Adress", "Adresss", "Aadressss", "Aadressss")

# Build the data frame directly: no dplyr needed, and client_id stays numeric
A <- data.frame(client_id, names, adresses,
                New_ID = NA_integer_, stringsAsFactors = FALSE)

Nested for loops

The nested for loops below work well:

for (i in seq_along(A$client_id)) {
  for (j in seq_along(A$client_id)) {
    # calculate the name similarity between rows i and j
    name_similarity <- stringdist::stringsim(A$names[i],
                                             A$names[j],
                                             method = "osa",
                                             useBytes = TRUE)
    # calculate the address similarity between rows i and j
    adresses_similarity <- stringdist::stringsim(A$adresses[i],
                                                 A$adresses[j],
                                                 method = "qgram",
                                                 useBytes = TRUE)
    # Decision & New_ID attribution
    if (name_similarity > 0.9) {
      if (adresses_similarity > 0.85) {
        A[i, 4] <- j # New ID
      }
    } # decision end
  } # Close j loop
} # Close i loop

Although the script above produces the expected result, it would take days to run on the real data (~1M rows), because the number of comparisons grows with the square of the row count. So I thought of parallel programming.

Parallel programming:

I have tried to nest two foreach loops using the %:% operator and run them in parallel using the %dopar% operator from the foreach package, with a doSNOW backend.

cl <- makeCluster(detectCores())         # Initiate the cluster (I have 8 cores on my local machine)
registerDoSNOW(cl)                       # register the cluster as the parallel backend for foreach
clusterExport(cl, "A")                   # export the data to the workers
clusterEvalQ(cl, c(library(tidyverse),
                   library(stringdist))) # load the packages used on each worker

foreach(i = seq_along(A$client_id)) %:%
  foreach(j = seq_along(A$client_id)) %dopar% {
    # calculate the name similarity between rows i and j
    name_similarity <- stringdist::stringsim(A$names[i],
                                             A$names[j],
                                             method = "osa",
                                             useBytes = TRUE)
    # calculate the address similarity between rows i and j
    adresses_similarity <- stringdist::stringsim(A$adresses[i],
                                                 A$adresses[j],
                                                 method = "osa",
                                                 useBytes = TRUE)
    # Decision & New_ID attribution
    if (name_similarity > 0.9) {
      if (adresses_similarity > 0.85) {
        A[i, 4] <- j # New ID
      }
    } # decision end
  } # Close foreach

However, after running the parallel nested foreach loops, the New_ID column is still empty. I've also tried to unlist() the result, since foreach returns its values as a list, but that doesn't work either.

How can I write the nested parallel foreach to obtain the same result as in the nested for loops? Thanks
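
A likely explanation, offered as a sketch rather than a definitive answer: with %dopar% each worker modifies its own copy of A, so an assignment such as A[i, 4] = j never reaches the data frame in the main session. foreach is built to return values instead. The version below makes several assumptions: a doParallel backend with 2 workers, "osa" for both columns, and a vectorised stringsim call in place of the inner j loop. The name new_id is introduced here only for illustration.

```r
library(foreach)
library(doParallel)
library(stringdist)

A <- data.frame(
  client_id = 1:6,
  names     = c("Name", "Naaame", "Name", "Namee", "Nammee", "Nammee"),
  adresses  = c("Adress", "Adressss", "Adress", "Adresss", "Aadressss", "Aadressss"),
  stringsAsFactors = FALSE
)

cl <- makeCluster(2)      # assumption: 2 workers, enough for the illustration
registerDoParallel(cl)

# Each iteration returns one value; .combine = c collects them into a vector
new_id <- foreach(i = seq_len(nrow(A)), .combine = c,
                  .packages = "stringdist") %dopar% {
  # Compare row i against every row at once (replaces the inner j loop)
  name_similarity     <- stringsim(A$names[i], A$names, method = "osa")
  adresses_similarity <- stringsim(A$adresses[i], A$adresses, method = "osa")
  hits <- which(name_similarity > 0.9 & adresses_similarity > 0.85)
  # Keep the last matching j, as the overwrite in the sequential loop does
  if (length(hits) > 0) max(hits) else NA_integer_
}

stopCluster(cl)

A$New_ID <- new_id        # assigned in the main session, so the change persists
```

On the fake data this gives New_ID = 3, 2, 3, 4, 6, 6. Note that it still compares every row with every other row, so parallelism alone is unlikely to make ~1M rows practical.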
