Conditional column summing of large sparse matrix

I am trying to compute a conditional sum for each column of a massive sparse matrix, but I am running into memory issues. Column sums work fine when there is no condition, but as soon as I apply a condition such as >= 1 I run out of memory, which I believe is caused by the comparison trying to replace values in the sparse matrix.

I need this to both be as fast as possible and avoid the memory issue... is this possible? In my actual internal R package, there are other calculations that take place, and speed is extremely important for the users of the package.

Reproducible example below:

#Matrices - a little slow to run
a <- Matrix::rsparsematrix(2000000, 30, 0.5)
b <- Matrix::rsparsematrix(200, 30, 0.8)
c <- Matrix::tcrossprod(b, a)

#Works
colsums <- Matrix::colSums(c)

#Doesn't work - not enough memory
colsums_1plus <- Matrix::colSums(c >= 1)
#> Error: cannot allocate vector of size 3.0 Gb

I've played around with a few different approaches, so far with no luck. I've tried different matrix types, and I've also tried splitting the large matrix into chunks and doing the calculations in parallel with furrr::future_map. That either gave the same error, or some chunks ran while others didn't, and the overall run time was significantly higher!

Thanks in advance for any help with this.

Hm... I wonder if the issue is the logical result of c >= 1 combined with the sparse matrix class.

Does this work for you?

By the way, it can be risky to mask the base c() function with a variable named c.

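#Copy the matrix, then recode only the stored values in the x slot
cgt1 <- c
cgt1@x <- as.double(cgt1@x >= 1)
#Each column sum is now the count of entries >= 1
colsums_1plus <- Matrix::colSums(cgt1)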
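
To check that recoding the x slot gives the same answer as summing c >= 1, here is a minimal sketch on a much smaller matrix. The sizes, the seed, and the names a_small, b_small and m are assumptions chosen for illustration (m is used instead of c so that base c() is not masked). The idea relies on the fact that entries not stored in the sparse matrix are zero and so can never satisfy >= 1.

```r
library(Matrix)

set.seed(1)
# Small stand-ins for the matrices in the question (illustrative sizes only)
a_small <- rsparsematrix(1000, 30, 0.5)
b_small <- rsparsematrix(20, 30, 0.8)
m <- tcrossprod(b_small, a_small)   # 20 x 1000 sparse result

# Reference result via a dense logical matrix (cheap at this size)
expected <- colSums(as.matrix(m) >= 1)

# Slot-based approach: recode only the stored values, never build m >= 1
m_ge1 <- m
m_ge1@x <- as.double(m_ge1@x >= 1)
result <- Matrix::colSums(m_ge1)

# Both routes should agree column by column
stopifnot(isTRUE(all.equal(unname(result), unname(expected))))
```

Note that this still makes one full copy of the matrix, so on the real data it needs roughly as much extra memory as the original c already occupies.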

A couple of points.
First, it's not the colSums function per se that generates the huge vector and crashes; it's the inner calculation, i.e.
c >= 1

Secondly, tcrossprod(b, a) computes b %*% t(a), and with these inputs the result is not at all sparse, so at that point it makes sense to revert to an ordinary base R matrix, or something else.
For example:

a_ <- Matrix::rsparsematrix(2000000, 30, 0.5)
b_ <- Matrix::rsparsematrix(200, 30, 0.8)
tradmat <- as.matrix(Matrix::tcrossprod(b_, a_))

tradmatgt1 <- tradmat >= 1
colSums(tradmatgt1)
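
If converting the whole product to a dense matrix is itself too large for memory, one option is to do the same dense comparison on a block of columns at a time. The following is a sketch only: col_count_ge, its threshold argument and the block_size value are hypothetical names and settings, not part of the Matrix package, and the matrix sizes are small stand-ins for the real data.

```r
library(Matrix)

# Hypothetical helper: count entries >= threshold in each column,
# converting only block_size columns to a dense matrix at a time.
col_count_ge <- function(m, threshold = 1, block_size = 10000L) {
  n <- ncol(m)
  out <- numeric(n)
  if (n == 0L) return(out)
  for (s in seq.int(1L, n, by = block_size)) {
    idx <- s:min(s + block_size - 1L, n)
    block <- as.matrix(m[, idx, drop = FALSE])  # dense copy of this block only
    out[idx] <- colSums(block >= threshold)
  }
  out
}

set.seed(42)
# Small stand-ins for the matrices in the question (illustrative sizes only)
a_small <- rsparsematrix(1000, 30, 0.5)
b_small <- rsparsematrix(20, 30, 0.8)
m <- tcrossprod(b_small, a_small)

# Fraction of non-zero entries: close to 1 here, i.e. effectively dense
density <- nnzero(m) / prod(dim(m))

# A deliberately awkward block size to exercise the last partial block
counts <- col_count_ge(m, threshold = 1, block_size = 7L)
```

Peak extra memory is then governed by block_size rather than by the full number of columns, at the cost of a loop in R. Whether this is faster than the single dense conversion on the real data would need to be benchmarked.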
