Hi, I have a script that uses the Jaro-Winkler distance method (method = "jw"):
library(tidyverse)
library(fuzzyjoin)
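# Fuzzy left join on Type; "jw" distances lie in [0, 1], so max_dist = 8 never filters anything
dfresult = stringdist_join(df1, df2, by = "Type", mode = "left",
                           ignore_case = FALSE, method = "jw", p = .15, max_dist = 8,
                           distance_col = "dist") %>%
  group_by(Type.x) %>%
  top_n(1, -dist)  # keep the closest match per Type.x
# Convert the distance to a similarity score
dfresult$dist = 1 - dfresult$dist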
dfresult
Result:
# A tibble: 1 x 5
# Groups:   Type.x [1]
  var1.x Type.x                var1.y Type.y                 dist
   <int> <chr>                  <int> <chr>                 <dbl>
1      1 megane business hiter      1 megane business hiter     1
That works well. However, I would like to find a better approach, because this method compares the strings character by character. Consider this example:
df1 <- data.frame(Reference = c("11a11b11cd"), ID = 1:1)
df2 <- data.frame(Reference = c("11abcd", "111111abcd", "001abdc1"), ID = 1:3)
dfresult = stringdist_join(df1, df2, by = "Reference", mode = "left",
                           ignore_case = FALSE, method = "jw", p = .15, max_dist = 8,
                           distance_col = "dist") %>%
  group_by(Reference.x) %>%
  top_n(1, -dist)
dfresult$dist = 1 - dfresult$dist
dfresult
Result:
# A tibble: 1 x 5
# Groups:   Reference.x [1]
  Reference.x  ID.x Reference.y  ID.y  dist
  <chr>       <int> <chr>       <int> <dbl>
1 11a11b11cd      1 111111abcd      2 0.953
So how is it possible that the result in this last case is a 95% match? I would like to apply another method. Could you please help?
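To make the 0.953 easier to inspect, here is a minimal sketch that scores the same strings directly with the stringdist package (the package fuzzyjoin relies on for these metrics) and puts a Levenshtein-based score next to the Jaro-Winkler one. The names query, candidates, jw_sim, lv_dist and lv_sim are made up for illustration, and dividing the Levenshtein distance by the longer string's length is only one possible normalisation, not the only reasonable choice.

```r
library(stringdist)

query      <- "11a11b11cd"
candidates <- c("11abcd", "111111abcd", "001abdc1")

# Jaro-Winkler distance with the same prefix weight as in the question (p = 0.15),
# converted to a similarity the same way as 1 - dist above.
jw_dist <- stringdist(query, candidates, method = "jw", p = 0.15)
jw_sim  <- 1 - jw_dist

# Levenshtein distance: number of insertions, deletions and substitutions needed.
lv_dist <- stringdist(query, candidates, method = "lv")

# Assumed normalisation: divide by the longer string so the score falls in [0, 1].
lv_sim <- 1 - lv_dist / pmax(nchar(query), nchar(candidates))

# Side-by-side comparison of the two metrics for every candidate.
comparison <- data.frame(candidate = candidates,
                         jw_sim    = round(jw_sim, 3),
                         lv_dist   = lv_dist,
                         lv_sim    = round(lv_sim, 3))
print(comparison)
```

For "111111abcd", every one of the ten characters finds a partner within Jaro's search window and only two transpositions are counted, so the Jaro similarity is (1 + 1 + 8/10) / 3 ≈ 0.933. The shared prefix "11" with p = 0.15 then lifts it to about 0.953. Levenshtein needs 4 edits for the same pair, which gives 0.6 under the normalisation assumed here. If an edit-based metric suits the data better, stringdist_join accepts method = "lv" as well. In that case max_dist is a number of edits rather than a value between 0 and 1.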