My dataset looks like this, but I have around 5*10^5 sequences/strings to compare pairwise and calculate their Levenshtein distances.
I understand that this would lead to a matrix of 5*10^5 * 5*10^5 = 2.5*10^11 elements.
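For context on the scale, here is a rough back-of-envelope calculation (assuming the distances would be stored as R doubles):

```r
n <- 5e5            # number of sequences
cells <- n * n      # pairwise matrix: 2.5e11 entries
bytes <- cells * 8  # a numeric (double) in R takes 8 bytes
bytes / 2^40        # ~1.8 TiB -- far more than typical node RAM
```

So even before any computation, simply holding the full dense matrix in memory is infeasible.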
So far I have tried the following packages on our HPC, but none of them can handle a matrix of that size.
```r
library(tidyverse)

x <- c("T", "A", "C", "G")
data <- expand.grid(rep(list(x), 5)) %>%
  unite("sequences", 1:5, sep = "")

head(data)
#>   sequences
#> 1     TTTTT
#> 2     ATTTT
#> 3     CTTTT
#> 4     GTTTT
#> 5     TATTT
#> 6     AATTT
```
Created on 2022-02-22 by the reprex package (v2.0.1)
Is there any trick I can follow to compute the Levenshtein distances at this scale?
Can I parallelise the process, and if so, how? Would that make sense?
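One idea I am considering (a sketch only, using base R's `adist()`, which computes a generalized Levenshtein distance) is to process the matrix in row chunks, so the full matrix never has to live in memory at once:

```r
seqs <- c("TTTTT", "ATTTT", "CTTTT", "GTTTT", "TATTT", "AATTT")
chunk_size <- 2  # toy value; on the real data this would be a few thousand rows

for (s in seq(1, length(seqs), by = chunk_size)) {
  idx <- s:min(s + chunk_size - 1, length(seqs))
  d <- adist(seqs[idx], seqs)            # base-R (generalized) Levenshtein
  hits <- which(d <= 1, arr.ind = TRUE)  # keep only near-identical pairs
  # ...write `hits` (or the chunk itself) to disk instead of accumulating
}
```

Since the chunks are independent, I assume they could be dispatched as separate HPC array jobs or via `parallel::parLapply()`; the `stringdist` package's `stringdistmatrix()` (if I understand its docs correctly) also accepts `method = "lv"` and an `nthread` argument for multithreading. Would an approach along these lines be sensible?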
I appreciate your time. Any guidance and help are highly appreciated.