Joining on partial string matches and scoring the quality of the identification

Wkp · October 2, 2020, 1:10pm

I need to get the join shown below, detecting even partial matches, maybe with some string-distance technique, and I also need a score variable that indicates the quality of the identification. If the match is perfect, the score should be 100; if the match is not perfect, it should be less than 100.


df1<- data.frame(var1= 1:1, Type= c("megane business hiter"))

df2<- data.frame(var1= 1:3, Type= c("megane business hiter", 
                                    "megan businss limited", 
                                       "meganee busi lim."))

I need a result as below:


df_Result <- data.frame(
  Type.x = c("megane business hiter", "megane business hiter", "megane business hiter"),
  Type.y = c("megane business hiter", "megan businss limited", "meganee busi lim."),
  score_result = c("100 (full identification)", "less than 100, e.g. 75", "less than 100, e.g. 45"))

mara · October 2, 2020, 1:34pm

I'd recommend taking a look at the fuzzyjoin and stringdist packages.
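As a rough sketch of how a 0-100 score could be built with stringdist (this is only an illustration, not a finished solution): `stringsim()` returns a similarity between 0 and 1, which can be scaled to 100. The choice of `method = "jw"` with `p = 0.1` and the column name `score` are assumptions made for the example.

```r
library(stringdist)

df1 <- data.frame(var1 = 1L, Type = "megane business hiter",
                  stringsAsFactors = FALSE)
df2 <- data.frame(var1 = 1:3,
                  Type = c("megane business hiter",
                           "megan businss limited",
                           "meganee busi lim."),
                  stringsAsFactors = FALSE)

# Cartesian product: every row of df1 paired with every row of df2.
# Shared column names receive the .x / .y suffixes.
pairs <- merge(df1, df2, by = NULL, suffixes = c(".x", ".y"))

# stringsim() gives a similarity in [0, 1]; 1 means identical strings.
# Scale to 0-100 so a perfect match scores 100.
pairs$score <- round(100 * stringsim(pairs$Type.x, pairs$Type.y,
                                     method = "jw", p = 0.1))

pairs[, c("Type.x", "Type.y", "score")]
```

The identical pair scores 100 and the two partial matches score below 100; the exact values depend on the method chosen.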

Wkp · October 2, 2020, 3:00pm

Could someone help me write the script?

nirgrahamuk · October 5, 2020, 8:52am

Presumably you studied the recommended resources, but then got stuck.
Where did you get stuck?

Wkp · October 6, 2020, 11:19am

Hi, I got a script working using the Jaro distance method:
library(tidyverse)
library(fuzzyjoin)

dfresult = stringdist_join(df1, df2, by= "Type", mode = "left", 
                            ignore_case = FALSE, method = "jw",p=.15, max_dist = 8 ,
                            distance_col= "dist") %>% group_by(Type.x) %>% top_n(1, -dist)

dfresult$dist = 1-dfresult$dist

dfresult

Result:
# A tibble: 1 x 5
# Groups:   Type.x [1]
  var1.x Type.x                var1.y Type.y                 dist
   <int> <chr>                  <int> <chr>                 <dbl>
1      1 megane business hiter      1 megane business hiter     1

I am happy with that. Still, I would like to investigate a better way, because this kind of method checks the strings letter by letter. See this example:

df1<- data.frame(Reference = c("11a11b11cd"), ID = 1:1)


df2<-data.frame (Reference = c ( "11abcd", "111111abcd", "001abdc1"), ID= 1:3)

dfresult = stringdist_join(df1, df2, by= "Reference", mode = "left", 
                            ignore_case = FALSE, method = "jw",p=.15, max_dist = 8 ,
                            distance_col= "dist") %>% group_by(Reference.x) %>% top_n(1, -dist)
dfresult$dist = 1-dfresult$dist

dfresult
Result:

# A tibble: 1 x 5
# Groups:   Reference.x [1]
  Reference.x  ID.x Reference.y  ID.y  dist
  <chr>       <int> <chr>       <int> <dbl>
1 11a11b11cd      1 111111abcd      2 0.953

So, how is it possible that the result in this last case is a 95% match? I would like to apply another method. Could you please help?

nirgrahamuk · October 6, 2020, 11:41am

The Jaro distance (method='jw', p=0), is a number between 0 (exact match) and 1 (completely dissimilar) measuring dissimilarity between strings. It is defined to be 0 when both strings have length 0, and 1 when there are no character matches between a and b. Otherwise, the Jaro distance is defined as 1 - (1/3)(w_1 m/|a| + w_2 m/|b| + w_3 (m - t)/m). Here, |a| indicates the number of characters in a, m is the number of character matches and t the number of transpositions of matching characters. The w_i are weights associated with the characters in a, characters in b and with transpositions. A character c of a matches a character from b when c occurs in b, and the index of c in a differs less than max(|a|,|b|)/2 - 1 (where we use integer division) from the index of c in b. Two matching characters are transposed when they are matched but they occur in different order in string a and b.
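To see why "11a11b11cd" and "111111abcd" score so high, the formula can be worked through by hand. This is a sketch that assumes unit weights (w_1 = w_2 = w_3 = 1) and my own count of matches and transpositions; the variable names are made up for the example.

```r
# Worked arithmetic for a = "11a11b11cd" and b = "111111abcd"
len_a   <- 10   # |a|
len_b   <- 10   # |b|
m       <- 10   # every character of a finds a partner in b within the window
n_trans <- 2    # "a", "b" and two "1"s sit out of order: 4 characters, 2 transpositions

jaro_dist <- 1 - (m / len_a + m / len_b + (m - n_trans) / m) / 3

# Winkler correction: the distance shrinks by l * p * d, where l is the
# length of the common prefix (here "11", so l = 2; l is capped at 4)
l <- 2
p <- 0.15
jw_dist <- jaro_dist - l * p * jaro_dist

round(1 - jw_dist, 3)   # 0.953, the value reported in the thread
```

Because both strings contain exactly the same characters, m equals the full length, and only the ordering is penalised. That is why this metric rates the pair as a near match.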

You reversed the score:

dfresult$dist = 1-dfresult$dist

Wkp · October 6, 2020, 12:30pm

I know, I reversed the score on purpose.
Thanks for the text from the R help; I had seen it.
Can you please create a script similar to the one I created, using the "lv" method?
Thanks

nirgrahamuk · October 6, 2020, 12:32pm

It seems like you are asking me to copy your code and type lv instead of jw for you?
What am I missing?

Wkp · October 6, 2020, 12:44pm

Yes, please do that and check whether it runs.

nirgrahamuk · October 6, 2020, 12:45pm

This is so strange to me.
Why do you want me to do it, and not do it yourself?

Wkp · October 6, 2020, 1:07pm

I would like to apply stringdist_join with the lv method. I have had several problems getting it to work. If you cannot help me, please move on to another topic.

nirgrahamuk · October 6, 2020, 1:09pm

What have you tried so far? What is your specific problem? We are more inclined towards helping you with specific coding problems than doing your work for you.

library(fuzzyjoin)

df1 <- data.frame(Reference = c("11a11b11cd"), ID = 1:1)

df2 <- data.frame(Reference = c("11abcd", "111111abcd", "001abdc1"), ID = 1:3)

# max_dist = Inf keeps every candidate pair; dist holds the Levenshtein distance
(dfresult <- stringdist_join(df1, df2, by = "Reference", mode = "left",
                             ignore_case = FALSE, method = "lv", max_dist = Inf,
                             distance_col = "dist"))

This uses the lv (Levenshtein) method.
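With lv, dist is a count of edits rather than a value between 0 and 1, so `1 - dist` no longer works as a score. One possible way to get back to a 0-100 scale is to divide by the length of the longer string. This is a sketch; the `score` column and the `longest` helper are names I made up for the example.

```r
library(fuzzyjoin)

df1 <- data.frame(Reference = "11a11b11cd", ID = 1L,
                  stringsAsFactors = FALSE)
df2 <- data.frame(Reference = c("11abcd", "111111abcd", "001abdc1"), ID = 1:3,
                  stringsAsFactors = FALSE)

dfresult <- stringdist_join(df1, df2, by = "Reference", mode = "left",
                            method = "lv", max_dist = Inf,
                            distance_col = "dist")

# The Levenshtein distance can never exceed the length of the longer string,
# so dist / longest lies in [0, 1] and the score lies in [0, 100].
longest <- pmax(nchar(dfresult$Reference.x), nchar(dfresult$Reference.y))
dfresult$score <- round(100 * (1 - dfresult$dist / longest))

dfresult
```

By my count "111111abcd" needs 4 edits over 10 characters, so it scores 60 here instead of the 95 that jw gave.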

Wkp · October 7, 2020, 12:26pm

Hi EconomiCurtis,
Thanks for your observation. If you read the whole conversation, you will understand why I was asking for that.
Cheers,
WKP

Wkp · October 7, 2020, 12:29pm

Hi Nirgrahamuk,
Thank you very much for that.
Something in my R code got stuck when I tried to use the lv method.
Thanks again for your great support.
Cheers,
