Fuzzy String matching with stringdist package

snandy2011 · March 5, 2018, 5:25am

Hi,
I would like to understand how to interpret the output of the stringdist function in the stringdist package.

I am doing fuzzy string matching with the stringdist package on six fruit names. Please find the file below.

I have run the stringdist function. The code is below:

library(stringdist)

x <- read.csv("fruits.csv")
df1 <- data.frame(seqid = seq(1:6), name = x)

df1

dfr <- data.frame(n1=df1$name,n2=df1$name)

dfr

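# levels() needs factors; since R 4.0.0 data.frame() keeps strings as character,
# so wrap each column in factor() before taking its levels
ndf <- expand.grid(lapply(dfr, function(col) levels(factor(col))))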

ndf

View(ndf)

ndf <- ndf[order(ndf$n1),]

ndf
View(ndf)

method_list <- c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex")

for( i in method_list)
{
  ndf[,i] <- stringdist(ndf$n1,ndf$n2,method=i)
}

suspicious_match <- ndf[ndf$cosine < 0.20 & ndf$cosine != 0 & ndf$qgram < 10, ]

suspicious_match <- suspicious_match[order(suspicious_match$n1,suspicious_match$cosine),]


View(suspicious_match)

The code runs successfully with no errors, but I am having some difficulty interpreting the output.

For example,
"Grapes green seedless" and "Grapes seedless Red" are different fruit names, but the soundex method shows them as the same. What are the other methods (osa, lv, dl, hamming, etc.) saying?

I have googled those methods, but I did not understand how to interpret them.

Can you tell me how to interpret these methods, so that I can identify that the above two fruits are different, not the same?

Any suggestions would be much appreciated.

Thanks,
snandy

mara · March 5, 2018, 2:56pm

Hi @snandy2011,

Could you please turn this into a self-contained reprex (short for minimal reproducible example)? That way we can see the output you're having trouble understanding without everyone having to run the code themselves. In essence, it will help us help you if we can be sure we're all working with/looking at the same stuff.

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page. The reprex dos and don'ts are also useful.

snandy2011 · March 6, 2018, 4:34am

I have tried multiple times to create the reprex, but it shows me the error below.

No input provided and clipboard is not available. Unable to put result on the clipboard.
Can you please tell me why this appears, or else upload a video showing how to create a reprex?

mara · March 6, 2018, 10:43am

So, the "unable to put result on the clipboard" part isn't a big deal, since you can send the reprex to an "outfile". The "no input provided" part, however, means that you're not passing it the code you're trying to reprex.

There's a great video from an rOpenSci community call wherein Jenny Bryan explains reprex (the most relevant material starts ~10:40).

You can also see the slides from that video here:

Since I don't have the csv you're working with (you attached a pdf) I can't actually run the reprex for you. What you'll need to do is save the script (minus the View() commands, since those are for viewing in an interactive session) as a file (e.g. fruits_stringdist.R) and, just to make things easier, save your fruits.csv in the same directory (be sure that this is your working directory when you run the reprex command below). You'll then run something to the effect of:

reprex::reprex(input = "fruits_stringdist.R", outfile = "fruits_stringdist.md")

and then you copy and paste the content of the markdown file here.

mara · March 6, 2018, 11:02am

OK, so I made my best guess of what your csv would look like here, but I'm unclear on what you're trying to accomplish. Like, I understand the stringdist package, and the concepts therein, but I'm confused by what it is you're trying to interpret here.

library(stringdist)
fruits <- tibble::tibble(name = c("apple", "mango", "Grapes Red Seedless", "orange", "Grapes green seedless", "Grapes seedless Red"))
x <- fruits

df1 <- data.frame(seqid = seq(1:6), name = x)
df1
#>   seqid                  name
#> 1     1                 apple
#> 2     2                 mango
#> 3     3   Grapes Red Seedless
#> 4     4                orange
#> 5     5 Grapes green seedless
#> 6     6   Grapes seedless Red

dfr <- data.frame(n1 = df1$name, n2 = df1$name)
dfr
#>                      n1                    n2
#> 1                 apple                 apple
#> 2                 mango                 mango
#> 3   Grapes Red Seedless   Grapes Red Seedless
#> 4                orange                orange
#> 5 Grapes green seedless Grapes green seedless
#> 6   Grapes seedless Red   Grapes seedless Red

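# levels() needs factors; since R 4.0.0 data.frame() keeps strings as character,
# so wrap each column in factor() before taking its levels
ndf <- expand.grid(lapply(dfr, function(col) levels(factor(col))))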
ndf
#>                       n1                    n2
#> 1                  apple                 apple
#> 2  Grapes green seedless                 apple
#> 3    Grapes Red Seedless                 apple
#> 4    Grapes seedless Red                 apple
#> 5                  mango                 apple
#> 6                 orange                 apple
#> 7                  apple Grapes green seedless
#> 8  Grapes green seedless Grapes green seedless
#> 9    Grapes Red Seedless Grapes green seedless
#> 10   Grapes seedless Red Grapes green seedless
#> 11                 mango Grapes green seedless
#> 12                orange Grapes green seedless
#> 13                 apple   Grapes Red Seedless
#> 14 Grapes green seedless   Grapes Red Seedless
#> 15   Grapes Red Seedless   Grapes Red Seedless
#> 16   Grapes seedless Red   Grapes Red Seedless
#> 17                 mango   Grapes Red Seedless
#> 18                orange   Grapes Red Seedless
#> 19                 apple   Grapes seedless Red
#> 20 Grapes green seedless   Grapes seedless Red
#> 21   Grapes Red Seedless   Grapes seedless Red
#> 22   Grapes seedless Red   Grapes seedless Red
#> 23                 mango   Grapes seedless Red
#> 24                orange   Grapes seedless Red
#> 25                 apple                 mango
#> 26 Grapes green seedless                 mango
#> 27   Grapes Red Seedless                 mango
#> 28   Grapes seedless Red                 mango
#> 29                 mango                 mango
#> 30                orange                 mango
#> 31                 apple                orange
#> 32 Grapes green seedless                orange
#> 33   Grapes Red Seedless                orange
#> 34   Grapes seedless Red                orange
#> 35                 mango                orange
#> 36                orange                orange

ndf <- ndf[order(ndf$n1), ]
ndf
#>                       n1                    n2
#> 1                  apple                 apple
#> 7                  apple Grapes green seedless
#> 13                 apple   Grapes Red Seedless
#> 19                 apple   Grapes seedless Red
#> 25                 apple                 mango
#> 31                 apple                orange
#> 2  Grapes green seedless                 apple
#> 8  Grapes green seedless Grapes green seedless
#> 14 Grapes green seedless   Grapes Red Seedless
#> 20 Grapes green seedless   Grapes seedless Red
#> 26 Grapes green seedless                 mango
#> 32 Grapes green seedless                orange
#> 3    Grapes Red Seedless                 apple
#> 9    Grapes Red Seedless Grapes green seedless
#> 15   Grapes Red Seedless   Grapes Red Seedless
#> 21   Grapes Red Seedless   Grapes seedless Red
#> 27   Grapes Red Seedless                 mango
#> 33   Grapes Red Seedless                orange
#> 4    Grapes seedless Red                 apple
#> 10   Grapes seedless Red Grapes green seedless
#> 16   Grapes seedless Red   Grapes Red Seedless
#> 22   Grapes seedless Red   Grapes seedless Red
#> 28   Grapes seedless Red                 mango
#> 34   Grapes seedless Red                orange
#> 5                  mango                 apple
#> 11                 mango Grapes green seedless
#> 17                 mango   Grapes Red Seedless
#> 23                 mango   Grapes seedless Red
#> 29                 mango                 mango
#> 35                 mango                orange
#> 6                 orange                 apple
#> 12                orange Grapes green seedless
#> 18                orange   Grapes Red Seedless
#> 24                orange   Grapes seedless Red
#> 30                orange                 mango
#> 36                orange                orange

method_list <- c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex")

for (i in method_list)
{
  ndf[, i] <- stringdist(ndf$n1, ndf$n2, method = i)
}

suspicious_match <- ndf[ndf$cosine < 0.20 & ndf$cosine != 0 & ndf$qgram < 10, ]

suspicious_match
#>                       n1                    n2 osa lv dl hamming lcs qgram
#> 14 Grapes green seedless   Grapes Red Seedless   5  5  5     Inf   8     8
#> 20 Grapes green seedless   Grapes seedless Red  10 10 10     Inf  10     6
#> 9    Grapes Red Seedless Grapes green seedless   5  5  5     Inf   8     8
#> 21   Grapes Red Seedless   Grapes seedless Red   9  9  9      10  10     2
#> 10   Grapes seedless Red Grapes green seedless  10 10 10     Inf  10     6
#> 16   Grapes seedless Red   Grapes Red Seedless   9  9  9      10  10     2
#>        cosine    jaccard        jw soundex
#> 14 0.05755000 0.30769231 0.1840800       0
#> 20 0.04454718 0.25000000 0.1770112       0
#> 9  0.05755000 0.30769231 0.1840800       0
#> 21 0.01759449 0.09090909 0.1486068       0
#> 10 0.04454718 0.25000000 0.1770112       0
#> 16 0.01759449 0.09090909 0.1486068       0

suspicious_match <- suspicious_match[order(suspicious_match$n1, suspicious_match$cosine), ]

suspicious_match
#>                       n1                    n2 osa lv dl hamming lcs qgram
#> 20 Grapes green seedless   Grapes seedless Red  10 10 10     Inf  10     6
#> 14 Grapes green seedless   Grapes Red Seedless   5  5  5     Inf   8     8
#> 21   Grapes Red Seedless   Grapes seedless Red   9  9  9      10  10     2
#> 9    Grapes Red Seedless Grapes green seedless   5  5  5     Inf   8     8
#> 16   Grapes seedless Red   Grapes Red Seedless   9  9  9      10  10     2
#> 10   Grapes seedless Red Grapes green seedless  10 10 10     Inf  10     6
#>        cosine    jaccard        jw soundex
#> 20 0.04454718 0.25000000 0.1770112       0
#> 14 0.05755000 0.30769231 0.1840800       0
#> 21 0.01759449 0.09090909 0.1486068       0
#> 9  0.05755000 0.30769231 0.1840800       0
#> 16 0.01759449 0.09090909 0.1486068       0
#> 10 0.04454718 0.25000000 0.1770112       0
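
Two columns in the output above are worth a closer look, and the following is only a minimal sketch of why they behave that way. Soundex keeps the first letter of a string plus three digits, so long names that share their opening word end up with the same code and a soundex distance of 0. Hamming distance is only defined for strings of equal length, which is why it shows Inf for some pairs. The variable names a, b, codes, lens, d_soundex and d_hamming are made up for this illustration.

```r
library(stringdist)

a <- "Grapes green seedless"
b <- "Grapes seedless Red"

# phonetic() returns the Soundex codes that method = "soundex" compares.
codes <- phonetic(c(a, b), method = "soundex")
codes

# Both names start with "Grapes", so the codes match and the distance is 0.
d_soundex <- stringdist(a, b, method = "soundex")
d_soundex

# Hamming distance needs equal-length strings; stringdist() returns Inf
# when the lengths differ (21 vs 19 characters here).
lens <- nchar(c(a, b))
d_hamming <- stringdist(a, b, method = "hamming")
d_hamming
```

So a soundex value of 0 means "same phonetic code", not "same string", and for long multi-word names it says very little.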

snandy2011 · March 6, 2018, 11:28am

Hi Mara,

Thank you very much for your effort. The way you are trying to solve my problem is really awesome.

I am going through your reprex video example. I hope I will manage it.

Now, to answer your question, this is what I am trying to interpret:

If soundex is 0, it is an exact match, and if it is 1, the strings are different. This much I have understood.
Now, what are the other methods saying? I mean,
suppose one osa value is 5 and another is 10. Does a lower osa value indicate a closer match, or a higher one? Please correct me if I am wrong.
Does the same concept apply to the other methods (lv, dl, hamming, lcs, etc.)?

I know these questions may seem a little odd, but I need to get the concept clear as I am a newbie.
If you have any difficulty understanding my questions, please let me know and
I will explain further.

Thanks once again for your effort

mara · March 6, 2018, 12:20pm

Oh, you're asking about interpreting different string distance metrics! I am not an expert (or even a novice, really) in this area (it's likely a little out of scope for this forum), but I'd recommend checking out Mark van der Loo's paper that accompanied the package (here). The bibliography of the paper will be a good guide for better understanding the various methods.
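
As a rough starting point before reading the paper, here is a small sketch (not taken from the paper, and using made-up example words) of how the numbers behave. Every method in stringdist() returns a distance: 0 means the strings are identical under that method and larger values mean they are further apart, so an osa of 5 is a closer match than an osa of 10. The edit-based methods (osa, lv, dl, hamming, lcs) count operations, while cosine, jaccard and jw lie between 0 and 1. The names d_same, d_subst, d_swap_lv, d_swap_osa and s_subst are illustrative only.

```r
library(stringdist)

# Identical strings: distance 0.
d_same <- stringdist("apple", "apple", method = "lv")

# One substituted character: Levenshtein distance 1.
d_subst <- stringdist("apple", "apply", method = "lv")

# Two adjacent characters swapped: lv needs 2 substitutions,
# while osa counts the swap as a single transposition.
d_swap_lv  <- stringdist("apple", "appel", method = "lv")
d_swap_osa <- stringdist("apple", "appel", method = "osa")

# stringsim() rescales a distance to a similarity between 0 and 1,
# where 1 means identical; for lv this is 1 - distance / longest length.
s_subst <- stringsim("apple", "apply", method = "lv")

c(d_same, d_subst, d_swap_lv, d_swap_osa, s_subst)
```

Because the raw edit counts grow with string length, stringsim() can be easier to compare across pairs of different lengths than the distances themselves.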

There's also a slide deck from a presentation he gave at useR2014 about stringdist:

danr · March 6, 2018, 12:34pm

There are some issues with the version of reprex that is on CRAN.

Until CRAN catches up with the latest version, install reprex with

devtools::install_github("tidyverse/reprex")