How to efficiently calculate pairwise overlaps of many sets?

jdblischak · March 17, 2020, 1:26am

I have a list of many sets, and I want to calculate the overlap of the elements for each pairwise comparison. In my actual application, the sets contain names of genes grouped by some shared characteristic (e.g. from an annotation database such as gene ontology or reactome). But to distill the problem, the input data is a named list of character vectors.

I have written a brute force approach to calculate each pairwise comparison. It works fine when the number of sets is small (<1000 sets). The problem is that these biological ontologies can have tens of thousands of sets, and the combinatorial explosion kills my brute force approach. For example, 10,000 sets requires {{10,000}\choose{2}}=49,995,000 pairwise comparisons. I've done multiple iterations of profiling to speed up my solution (e.g. by pre-allocating objects), but it is still quite slow.

I feel like this is a general enough problem that someone has probably come up with a more efficient solution. Is there a data structure and/or algorithm that would make it faster to compute a bunch of pairwise comparisons?

Below is my reproducible example. The first part simulates the input list of character vectors. 100 sets runs quickly. Change n_sets to 10^4 to experience the inefficiency.

# Simulate data ----------------------------------------------------------------

n_sets <- 10^2
universe <- c(letters, LETTERS)
sets <- replicate(n_sets,
                  sample(universe, size = rpois(1, lambda = 15)),
                  simplify = FALSE)
names(sets) <- paste0("set", seq(n_sets))
lengths(sets)

# Overlap function -------------------------------------------------------------

# https://en.wikipedia.org/wiki/Overlap_coefficient
calc_overlap <- function(set1, set2) {
  length(intersect(set1, set2)) / min(length(set1), length(set2))
}

# Calculate pairwise overlaps (the slow part) ----------------------------------

n_overlaps <- choose(n = n_sets, k = 2)
vec_name1 <- character(length = n_overlaps)
vec_name2 <- character(length = n_overlaps)
vec_overlap <- numeric(length = n_overlaps)
overlaps_index <- 1
for (i in seq(n_sets - 1)) {
  name1 <- names(sets)[i]
  set1 <- sets[[i]]
  for (j in seq(i + 1, n_sets)) {
    name2 <- names(sets)[j]
    set2 <- sets[[j]]
    overlap <- calc_overlap(set1, set2)
    vec_name1[overlaps_index] <- name1
    vec_name2[overlaps_index] <- name2
    vec_overlap[overlaps_index] <- overlap
    overlaps_index <- overlaps_index + 1
  }
}
result <- data.frame(name1 = vec_name1,
                     name2 = vec_name2,
                     overlap = vec_overlap,
                     stringsAsFactors = FALSE)

nirgrahamuk · March 17, 2020, 11:26pm

I profiled your code, and found that some significant computation time was spent on an internal call to 'unique' within the intersect function. I figured that your reprex assumed uniques to inputs, and therefore could assume uniqueness of intersect results, so the unique call could be eliminated.
I created this script

# Overlap function -------------------------------------------------------------
non_unique_intersect <- function(set1,set2){
set1[match(set2, set1, 0L)]
  }

# https://en.wikipedia.org/wiki/Overlap_coefficient
calc_overlap <- function(set1, set2) {
  length(non_unique_intersect(set1, set2)) / min(length(set1), length(set2))
}

# Calculate pairwise overlaps (the slow part) ----------------------------------

n_overlaps <- choose(n = n_sets, k = 2)
vec_name1 <- character(length = n_overlaps)
vec_name2 <- character(length = n_overlaps)
vec_overlap <- numeric(length = n_overlaps)
overlaps_index <- 1
for (i in seq(n_sets - 1)) {
  name1 <- names(sets)[i]
  set1 <- sets[[i]]
  for (j in seq(i + 1, n_sets)) {
    name2 <- names(sets)[j]
    set2 <- sets[[j]]
    overlap <- calc_overlap(set1, set2)
    vec_name1[overlaps_index] <- name1
    vec_name2[overlaps_index] <- name2
    vec_overlap[overlaps_index] <- overlap
    overlaps_index <- overlaps_index + 1
  }
}
result2 <- data.frame(name1 = vec_name1,
                     name2 = vec_name2,
                     overlap = vec_overlap,
                     stringsAsFactors = FALSE)
}

I wrapped your original script, and my version, into functions orig() and newer() and benchmarked them. with the n_sets <- 10^4

Unit: seconds
   expr     min      lq     mean   median       uq      max neval
 orig() 430.923 433.053 434.9667 433.6698 435.6975 441.4901     5
Unit: seconds
    expr      min       lq     mean   median       uq      max neval
 newer() 164.6912 165.5209 166.1947 166.0552 166.5096 168.1965     5

so I got a 3x speedup from that.

jdblischak · March 20, 2020, 2:04am

Thanks so much @nirgrahamuk! I had noticed in the profiling results that the unique() calls from intersect() were time-consuming, but I didn't think there was anything I could do about it. Your idea to bypass this by assuming unique inputs is very clever.

I confirmed that I also get a ~3x speed up for 10^4 sets (from ~9 min down to ~3 min):

> system.time(orig(sets))
   user  system elapsed 
 526.19    0.20  526.30 

> system.time(new(sets))
   user  system elapsed 
 199.22    0.11  199.33

Click here for the script I ran to obtain the results above:


# Simulate data ----------------------------------------------------------------

n_sets <- 10^4
universe <- c(letters, LETTERS)
sets <- replicate(n_sets,
                  sample(universe, size = rpois(1, lambda = 15)),
                  simplify = FALSE)
names(sets) <- paste0("set", seq(n_sets))
lengths(sets)

orig <- function(sets) {
  
  n_sets <- length(sets)

  # Overlap function -------------------------------------------------------------
  
  # https://en.wikipedia.org/wiki/Overlap_coefficient
  calc_overlap <- function(set1, set2) {
    length(intersect(set1, set2)) / min(length(set1), length(set2))
  }
  
  # Calculate pairwise overlaps (the slow part) ----------------------------------
  
  n_overlaps <- choose(n = n_sets, k = 2)
  vec_name1 <- character(length = n_overlaps)
  vec_name2 <- character(length = n_overlaps)
  vec_overlap <- numeric(length = n_overlaps)
  overlaps_index <- 1
  for (i in seq(n_sets - 1)) {
    name1 <- names(sets)[i]
    set1 <- sets[[i]]
    for (j in seq(i + 1, n_sets)) {
      name2 <- names(sets)[j]
      set2 <- sets[[j]]
      overlap <- calc_overlap(set1, set2)
      vec_name1[overlaps_index] <- name1
      vec_name2[overlaps_index] <- name2
      vec_overlap[overlaps_index] <- overlap
      overlaps_index <- overlaps_index + 1
    }
  }
  result <- data.frame(name1 = vec_name1,
                       name2 = vec_name2,
                       overlap = vec_overlap,
                       stringsAsFactors = FALSE)
  return(result)
}

new <- function(sets) {

  n_sets <- length(sets)
  
  # Overlap function -------------------------------------------------------------
  
  non_unique_intersect <- function(set1,set2){
    set1[match(set2, set1, 0L)]
  }
  
  # https://en.wikipedia.org/wiki/Overlap_coefficient
  calc_overlap <- function(set1, set2) {
    length(non_unique_intersect(set1, set2)) / min(length(set1), length(set2))
  }
  
  # Calculate pairwise overlaps (the slow part) ----------------------------------
  
  n_overlaps <- choose(n = n_sets, k = 2)
  vec_name1 <- character(length = n_overlaps)
  vec_name2 <- character(length = n_overlaps)
  vec_overlap <- numeric(length = n_overlaps)
  overlaps_index <- 1
  for (i in seq(n_sets - 1)) {
    name1 <- names(sets)[i]
    set1 <- sets[[i]]
    for (j in seq(i + 1, n_sets)) {
      name2 <- names(sets)[j]
      set2 <- sets[[j]]
      overlap <- calc_overlap(set1, set2)
      vec_name1[overlaps_index] <- name1
      vec_name2[overlaps_index] <- name2
      vec_overlap[overlaps_index] <- overlap
      overlaps_index <- overlaps_index + 1
    }
  }
  result <- data.frame(name1 = vec_name1,
                       name2 = vec_name2,
                       overlap = vec_overlap,
                       stringsAsFactors = FALSE)
  return(result)
  
}

system.time(orig(sets))
system.time(new(sets))

jdblischak · March 27, 2020, 2:04am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

jdblischak · March 27, 2020, 4:53pm

I wrote a blog post to share what I learned from optimizing this functionality:

How to efficiently calculate pairwise overlaps of many sets

nirgrahamuk · March 28, 2020, 12:54pm

you went deep, and considered union as well I see
nice write up.

If you want to use these in your work, I suppose the next step would be to look at parallelisation, because I think each pass through the inner j loop, is independent, and so might be calculated on a different cpu-core

jdblischak · March 30, 2020, 9:35pm

Thanks!

And good point about parallelization. I had considered that before, though I was thinking this entire function would run in a parallel process. In other words, if I have 3 distinct lists of sets, I would spawn 3 separate processes to run the pairwise overlaps for each. I thought fewer spawning events would be better, because at least for the future package, on Windows the parallel processes are actually separate R sessions that would need to be started and stopped.

Do you have any advice for writing cross-platform parallelized code? My understanding is that getting parallelized code to run on Windows is non-trivial.