Parallel processing slower than without

I'm trying to perform a large matrix cross product and a bunch of other calculations following that, but it's slower than I'd like. I feel like there must be a way to utilise parallel processing to speed the task up, but I don't know if the overhead associated with parallel processing makes this task doomed from the start.

I know there are alternative BLAS libraries to the default one that ships with R that are great for this sort of thing (Intel MKL, OpenBLAS), but I don't really understand how to install and set them up, and I don't even know whether they apply to Windows. The tutorials I've seen for setting them up on Windows are old and don't seem to work for me. So I'm trying to use futures and the furrr package in R to obtain speed improvements, but failing.
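For reference, a minimal sketch of how to check which BLAS and LAPACK libraries the current R session is linked against, using only base R. The variable name `blas_info` is just illustrative, and on some platforms either entry may come back as an empty string.

```r
# Report the BLAS and LAPACK libraries this R session is using.
# extSoftVersion() and La_library() are base R functions (R >= 3.4.0).
blas_info <- c(
  BLAS   = unname(extSoftVersion()["BLAS"]),
  LAPACK = La_library()
)
print(blas_info)
```

`sessionInfo()` prints the same paths alongside the rest of the session details, which can be handy when comparing setups.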

Below is a reproducible example of the time difference between a standard and a parallel matrix cross product. The parallel version is actually slower.

#Matrices - a little slow to run the first line
a <- Matrix::rsparsematrix(2000000, 30, 0.5)
b <- Matrix::rsparsematrix(200, 30, 0.8)

#Packages needed
library(tictoc)
library(future)

#Standard call - takes a good few seconds--------------------
tic()
test1 <- Matrix::tcrossprod(a, b)
toc()
#> 35.21 sec elapsed

##Parallel approach------------------------------------------
#Split data up into 4 chunks
nrows <- seq_len(nrow(a))
a_chunked <- split.data.frame(a, cut(nrows, pretty(nrows, 4)))

#Run each chunk in parallel
plan(multisession)
tic()
test2 <- furrr::future_map(a_chunked, ~Matrix::tcrossprod(., b),
                           .options = furrr::furrr_options(seed = NULL))
toc()
#> 49.5 sec elapsed

In reality, I would be doing subsequent calculations on the matrix after the cross product, and in the end each "worker" would return a small data frame of results rather than a large sparse matrix like in the above example. So the overhead of sending data out to each worker may be high, but the amount of data sent back by each worker would be small.
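A minimal sketch of that pattern, assuming small stand-in matrices (`a_small`, `b_small`) and a hypothetical `summarise_chunk()` helper whose summary columns are placeholders for the real follow-up calculations. Each chunk is reduced to a one-row data frame before anything is returned. The sketch falls back to `lapply()` if furrr is not installed, and it runs sequentially unless `plan(multisession)` has been called first.

```r
library(Matrix)

set.seed(1)
# Small stand-ins for the matrices in the example above
a_small <- rsparsematrix(2000, 30, 0.5)
b_small <- rsparsematrix(200, 30, 0.8)

# Split the rows of a_small into 4 contiguous chunks
chunk_id <- cut(seq_len(nrow(a_small)), 4, labels = FALSE)
a_chunks <- lapply(split(seq_len(nrow(a_small)), chunk_id),
                   function(i) a_small[i, , drop = FALSE])

# Hypothetical per-chunk work: cross product, then reduce to one small row
summarise_chunk <- function(chunk, b) {
  cp <- Matrix::tcrossprod(chunk, b)
  data.frame(n_rows = nrow(chunk), sum_abs = sum(abs(cp)))
}

# Use furrr::future_map if furrr is available, otherwise plain lapply
map_fun <- if (requireNamespace("furrr", quietly = TRUE)) furrr::future_map else lapply

res_list <- map_fun(a_chunks, function(chunk) summarise_chunk(chunk, b_small))
results <- do.call(rbind, res_list)
print(results)
```

Only the small `results` rows travel back from the workers, but each chunk of `a` (plus all of `b`) still has to be shipped out, so whether this beats the sequential call depends on how expensive the per-chunk work is relative to that transfer.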

Am I approaching this in the completely wrong way, or is there not much I can really do to get speed improvements here without an alternative BLAS library?

I get a 3x speed improvement by applying tcrossprod to a_ and b_, where

a_ <- as.matrix(a)
b_ <- as.matrix(b)

i.e. not using the dgCMatrix class but using a regular R matrix.

Unfortunately, with the data I am using for my projects, dense matrices take up too much space (so I run into memory issues), and the matrices are very sparse in nature (more so than in the example given here). So I am looking for speed improvements for a tcrossprod between two sparse matrices.
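One thing that may be worth checking on the real data is how sparse the product itself turns out to be. The sketch below uses scaled-down matrices with the same densities as the example (0.5 and 0.8, 30 columns) and a hypothetical helper `frac_nonzero()`. With those densities almost every entry of the product is nonzero, so the sparse format stores a nearly dense result; with much sparser inputs the picture could be quite different.

```r
library(Matrix)

set.seed(42)
# Scaled-down versions of the example matrices (same densities, fewer rows)
a_small <- rsparsematrix(2000, 30, 0.5)
b_small <- rsparsematrix(200, 30, 0.8)

cp <- tcrossprod(a_small, b_small)

# Hypothetical helper: fraction of entries that are nonzero
frac_nonzero <- function(m) nnzero(m) / prod(dim(m))

dens <- c(a = frac_nonzero(a_small),
          b = frac_nonzero(b_small),
          product = frac_nonzero(cp))
print(dens)

# Compare memory used by the sparse result and its dense equivalent
print(object.size(cp), units = "MB")
print(object.size(as.matrix(cp)), units = "MB")
```

If the product on the real data is also close to fully dense, the sparse representation of the result is unlikely to save memory, which might be part of why the dense version is faster in the example.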
