Parallel (Multicore, MultiCluster) Processing on Windows via ODBC Connection and SQL/HQL Query

When trying to export a connection I get error: Error in connection_info(dbObj@ptr) : external pointer is not valid

Here is my code:


cl <- makeCluster(detectCores(logical = FALSE))
clusterEvalQ(cl=cl, {require(tidyverse)
                    require(parallel)
                    require(DBI)
                    require(odbc)
                    con <- dbConnect(odbc::odbc(), 
                     Driver = "Cloudera ODBC Driver for Impala", 
                     Host = "server", Port = 1234)
                    })

clusterExport(cl=cl,"QueryList")
clusterExport(cl=cl,"con")
              # ,envir = environment()
# clusterEvalQ(cl=cl,QueryList)
# clusterEvalQ(cl=cl,con)



QryResults <- parLapply(cl=cl,QueryList,dbGetQuery,conn=con)


#stopCluster(cl=cl)
stopCluster(cl=cl); print("Cluster stopped.")

I tried both clusterEvalQ and clusterExport as above and got the same result.

Would anyone happen to know the solution to this problem?

Parallel/ Multicore processing is a powerful way to speed up the processing capabilities of running your analysis . Specially when there are aspects of your analysis that can be executed in tandem but run independently of each other to later then be brought together. A clear example of this is a 36 union all #SQL #HQL query taking 10min instead of 30min ( ~ Time / # of Cores). On a Mac it is simple because of it’s capabilities to fork ( split instances of your loaded packages and connections). On a Windows machine it is far more challenging

@andresrcs any thoughts ? I know you helped me with a similar topic in the past ?

DBIConnection object, such as the con object in your example, must not be exported to external R processes. They only work in the R process where they were created. The error is from the parallel worker and basically tells you that that con object was useless to the worker.

There is a related 'dbi' example on this in the future vignette 'Non-Exportable Objects'. That example and the vignette in general should help bring some lights to what the underlying problem is.

Thanks for the post @HenrikBengtsson. Are you saying those articles help to solve the problem? or simply explain the problem? I ask this because this is totally possible on a Mac. I had my co-worker run almost the same code and it handled the connections sharing/exporting just fine- because of the ability to Fork in those systems (Windows does not allow forking)

Are you saying those articles help to solve the problem? or simply explain the problem?

That 'Non-exportable objects' vignette explains the problem and why some objects cannot be exported to external R processes.

I ask this because this is totally possible on a Mac. I had my co-worker run almost the same code and it handled the connections sharing/exporting just fine- because of the ability to Fork in those systems

Yes, forked processing can probably handle a few cases of non-exportable objects. However, I would stride with great care. Just because you don't get a run-time error, it does not mean it is safe to do so. Are you sure that DBI::DBIConnection objects are fork-safe? I don't know but I would not use it in critical, production pipelines until I know for sure. I can imagine many reasons for why it would not be safe. It's a bit like having multiple people talking on the same phone line. As long as they don't speak at the same time, the receiving end (here the database) will receive everything just fine. When two or more speak at the same time, things might be breaking down.

I would reach out to the {DBI} maintainer(s) and ask them to confirm that a DBIConnection can be used by multiple forked processes at the same time, or not.

PS. I think the safest would be to have each worker to open its own DBIConnection, instead of sharing the same.

2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

This topic was automatically closed after 14 days. New replies are no longer allowed.