Find the distance between two groups of strings in R

I have a very large dataset, which looks like this.

I have two data frames:

  1. my reference data.frame
ref=c("cake","brownies")

  2. my experimental data.frame
expr=c("cak","cakee","cake", "rownies","browwnies")

I want to match the ref and expr data.frames and find the Levenshtein distance between them. The output could look like this:

ref   expr      distance 
cake  cak         1
cake  cakee       1
cake  cake        0
cake  rownies    ...

After I have measured the Levenshtein distances, I want to put every string with a distance of less than 3 into one cluster, so that my data might look like this:

ref        expr      distance  cluster
cake       cak         1         1
cake       cakee       1         1
cake       cake        0         1
brownies   rownies     1         2 
brownies   browwnies   1         2

Any help or advice on how to move on is appreciated. At the moment I am trying a lot
of R packages to find the distance between data frames, such as

library("DescTools")

but they do not seem to work well.

library(stringdist)
ref=c("cake","brownies")
expr=c("cak","cakee","cake", "rownies","browwnies")

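# Build every (reference, experimental) pair; keep the strings as characters, not factors
d <- expand.grid(ref, expr, stringsAsFactors = FALSE)
colnames(d) <- c("source", "target")
# method = "lv" requests Levenshtein distance (the stringdist default is "osa")
d$dist <- stringdist(d$source, d$target, method = "lv")
# Flag close pairs (distance < 3) as 1 and all other pairs as 2
d$clust <- ifelse(d$dist < 3, 1, 2)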
d
#>      source    target dist clust
#> 1      cake       cak    1     1
#> 2  brownies       cak    8     2
#> 3      cake     cakee    1     1
#> 4  brownies     cakee    7     2
#> 5      cake      cake    0     1
#> 6  brownies      cake    7     2
#> 7      cake   rownies    6     2
#> 8  brownies   rownies    1     1
#> 9      cake browwnies    8     2
#> 10 brownies browwnies    1     1
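
Note that the clust column above only separates close pairs (1) from distant pairs (2); it does not give each reference word its own cluster as in the table from the question. A minimal sketch of one way to get there, assuming each experimental string should go to its nearest reference and that the reference's position in ref can serve as the cluster id. It uses adist() from base R (generalized Levenshtein distance) instead of stringdist, and the names dmat, nearest, res and max_dist are made up for this illustration:

```r
ref  <- c("cake", "brownies")
expr <- c("cak", "cakee", "cake", "rownies", "browwnies")

# adist() (base R, utils) returns a length(ref) x length(expr) matrix
# of generalized Levenshtein distances.
dmat <- adist(ref, expr)

# For each experimental string (column), find the row index of the
# closest reference string.
nearest <- apply(dmat, 2, which.min)

res <- data.frame(
  ref      = ref[nearest],
  expr     = expr,
  # Matrix indexing picks the distance to the chosen reference.
  distance = dmat[cbind(nearest, seq_along(expr))],
  # Assumption: the index of the reference doubles as the cluster id.
  cluster  = nearest,
  stringsAsFactors = FALSE
)

# Assumed cutoff taken from the question: keep matches with distance < 3.
max_dist <- 3
res <- res[res$distance < max_dist, ]
res
```

For the example strings this should keep all five rows, with the cake variants in cluster 1 and the brownies variants in cluster 2. Strings that are not within the cutoff of any reference are simply dropped here; whether they should instead get their own cluster depends on the real data.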

Thank you. I learnt a lot and I'll come back with more posts and questions. Thank you for the help.
