Finding distance among thousands of strings

My dataset looks like this, but I have around 5*10^5 sequences/strings to compare pairwise and calculate their Levenshtein distance.
I understand that this would lead to a matrix of 5*10^5 * 5*10^5 elements.
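
As a rough sanity check on that size, here is a minimal back-of-the-envelope sketch in R. It assumes each distance is stored as a 4-byte integer; the variable names are only illustrative.

```r
n <- 5e5                           # number of sequences stated above

full_cells   <- n^2                # cells in the full n x n matrix
unique_pairs <- n * (n - 1) / 2    # distinct unordered pairs (the distance is symmetric)

bytes_per_int <- 4                 # assumption: one 4-byte integer per distance
full_tb   <- full_cells   * bytes_per_int / 1e12   # terabytes for the full matrix
unique_tb <- unique_pairs * bytes_per_int / 1e12   # terabytes for one triangle only
```

Under these assumptions the full matrix needs about 1 TB and a single triangle about 0.5 TB, before any working memory used by a package.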

So far I have tried the following packages on our HPC, but none of them can handle the size of the matrix.


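library(tidyverse)  # provides %>% and tidyr::unite()

# Build every 5-letter sequence over the alphabet T, A, C, G (4^5 = 1024 rows)
x <- c("T", "A", "C", "G")
data <- expand.grid(rep(list(x), 5)) %>% 
  unite("sequences", 1:5, sep="")
head(data)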

#>   sequences
#> 1     TTTTT
#> 2     ATTTT
#> 3     CTTTT
#> 4     GTTTT
#> 5     TATTT
#> 6     AATTT

Created on 2022-02-22 by the reprex package (v2.0.1)

Is there any trick I can use to compute the Levenshtein (lv) distance at this scale?
Can I parallelise the process, and if so, how? Would it make sense?

I appreciate your time. Any guidance or help is highly appreciated.

If I understood you correctly, you are hoping to compute as many as (5*10^5)^2 = 2.5*10^11 comparisons.
That is 250 billion distances, which seems quite inconceivable to hold as one matrix.
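
If the full matrix is not strictly required, one possible workaround is to process the sequences in blocks and keep only the pairs that are close to each other. The sketch below is just that, a sketch under assumptions: `close_pairs()`, `max_dist`, `block_size` and `cores` are hypothetical names, and keeping only pairs within a threshold is a change of goal compared with the full matrix asked for. It uses base R's `adist()` for the Levenshtein distance and `parallel::mclapply()` to spread blocks over cores (forking is not available on Windows, so `cores` must stay at 1 there).

```r
library(parallel)

# Hypothetical helper: return all pairs (i < j) whose Levenshtein distance
# is <= max_dist. Rows are processed in blocks, so only a
# block_size x (n - s + 1) slice of the matrix exists at any one time.
close_pairs <- function(seqs, max_dist = 1, block_size = 1000, cores = 1) {
  n <- length(seqs)
  starts <- seq(1, n, by = block_size)

  blocks <- mclapply(starts, function(s) {
    rows <- s:min(s + block_size - 1, n)
    # Compare this block only against sequences from position s onwards,
    # because earlier blocks have already covered the earlier columns.
    d <- adist(seqs[rows], seqs[s:n])
    hits <- which(d <= max_dist, arr.ind = TRUE)
    i <- rows[hits[, 1]]          # global row index
    j <- hits[, 2] + s - 1        # global column index
    keep <- i < j                 # drop self-pairs and mirrored duplicates
    data.frame(i = i[keep], j = j[keep], dist = d[hits][keep])
  }, mc.cores = cores)

  do.call(rbind, blocks)
}
```

With the example data from the question this could be called as `close_pairs(data$sequences, max_dist = 1, block_size = 256)`. The number of distance computations is still roughly n^2 / 2, so for 5*10^5 sequences this only addresses memory, not run time; whether parallelising makes it feasible depends on the sequence length and the number of cores available.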
