Grouping Fuzzy Duplicates and cleansing with tidystringdist

Hi,

This is a continuation of the question posed in identifying Identifying Fuzzy Duplicates from a column
I have been playing around with the example and i appear to have hit a wall

If i take the previous example with the aim of cleansing the duplicates to a single Item, I am coming a bit unstuck.

To give you an example, if i look at the shipper "CANON PVT. LTD" i get four matches back. The match with itself and the match with "CANON PVT. LTD." (the difference is the full stop)
Equally the latter appears again matching to the former, so i now have 4 records. I would require only two records. For example I know that "CANON PVT. LTD" could act as a key for all four matches. Can anyone see a way on how to remove the additional double match

Equally in the example, i would like to cleanse Antila,Thomas and ANTILA,THOMAS to just Antila,Thomas in maybe a third grouping column

Apologies if this seems a bit confusing

library(tidystringdist)
library(tidyverse)

# Sample data
df <- data.frame(stringsAsFactors = FALSE,
                 Name = as.factor(c(" CANON PVT. LTD ", " Antila,Thomas ", " Greg ",
                                    " St.Luke's Hospital ", " Z_SANDSTONE COOLING LTD ",
                                    " St.Luke's Hospital ", " CANON PVT. LTD. ",
                                    " SANDSTONE COOLING LTD ", " Greg ", " ANTILA,THOMAS "))
)

fuzzy_match_string <- function(x, y, jw_dist){
  
  grid_df <- expand.grid(unique(x),
                         unique(y)) %>% 
    rename(V1 = Var1, V2 = Var2) %>%
    mutate_all(as.character) %>% 
    tidy_stringdist(.) %>% 
    filter(jw < jw_dist) %>% 
    select(V1, V2, jw) %>% 
    unique()
  
}

map_df(df$Name, fuzzy_match_string, df$Name, 0.07) %>% 
  unique() %>% 
  arrange(V1)

#> # A tibble: 12 x 3
#>    V1                          V2                              jw
#>    <chr>                       <chr>                        <dbl>
#>  1 " Antila,Thomas "           " Antila,Thomas "           0     
#>  2 " ANTILA,THOMAS "           " ANTILA,THOMAS "           0     
#>  3 " CANON PVT. LTD "          " CANON PVT. LTD "          0     
#>  4 " CANON PVT. LTD "          " CANON PVT. LTD. "         0.0196
#>  5 " CANON PVT. LTD. "         " CANON PVT. LTD "          0.0196
#>  6 " CANON PVT. LTD. "         " CANON PVT. LTD. "         0     
#>  7 " Greg "                    " Greg "                    0     
#>  8 " SANDSTONE COOLING LTD "   " Z_SANDSTONE COOLING LTD " 0.0267
#>  9 " SANDSTONE COOLING LTD "   " SANDSTONE COOLING LTD "   0     
#> 10 " St.Luke's Hospital "      " St.Luke's Hospital "      0     
#> 11 " Z_SANDSTONE COOLING LTD " " Z_SANDSTONE COOLING LTD " 0     
#> 12 " Z_SANDSTONE COOLING LTD " " SANDSTONE COOLING LTD "   0.0267

1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.