I have a list of many sets, and I want to calculate the overlap of the elements for each pairwise comparison. In my actual application, the sets contain names of genes grouped by some shared characteristic (e.g. from an annotation database such as gene ontology or reactome). But to distill the problem, the input data is a named list of character vectors.
I have written a brute force approach to calculate each pairwise comparison. It works fine when the number of sets is small (<1000 sets). The problem is that these biological ontologies can have tens of thousands of sets, and the combinatorial explosion kills my brute force approach. For example, 10,000 sets requires {{10,000}\choose{2}}=49,995,000 pairwise comparisons. I've done multiple iterations of profiling to speed up my solution (e.g. by pre-allocating objects), but it is still quite slow.
I feel like this is a general enough problem that someone has probably come up with a more efficient solution. Is there a data structure and/or algorithm that would make it faster to compute a bunch of pairwise comparisons?
Below is my reproducible example. The first part simulates the input list of character vectors. 100 sets runs quickly. Change n_sets
to 10^4
to experience the inefficiency.
# Simulate data ----------------------------------------------------------------
n_sets <- 10^2
universe <- c(letters, LETTERS)
sets <- replicate(n_sets,
sample(universe, size = rpois(1, lambda = 15)),
simplify = FALSE)
names(sets) <- paste0("set", seq(n_sets))
lengths(sets)
# Overlap function -------------------------------------------------------------
# https://en.wikipedia.org/wiki/Overlap_coefficient
calc_overlap <- function(set1, set2) {
length(intersect(set1, set2)) / min(length(set1), length(set2))
}
# Calculate pairwise overlaps (the slow part) ----------------------------------
n_overlaps <- choose(n = n_sets, k = 2)
vec_name1 <- character(length = n_overlaps)
vec_name2 <- character(length = n_overlaps)
vec_overlap <- numeric(length = n_overlaps)
overlaps_index <- 1
for (i in seq(n_sets - 1)) {
name1 <- names(sets)[i]
set1 <- sets[[i]]
for (j in seq(i + 1, n_sets)) {
name2 <- names(sets)[j]
set2 <- sets[[j]]
overlap <- calc_overlap(set1, set2)
vec_name1[overlaps_index] <- name1
vec_name2[overlaps_index] <- name2
vec_overlap[overlaps_index] <- overlap
overlaps_index <- overlaps_index + 1
}
}
result <- data.frame(name1 = vec_name1,
name2 = vec_name2,
overlap = vec_overlap,
stringsAsFactors = FALSE)