Identifying Fuzzy Duplicates from a column

April · July 12, 2019, 8:37pm

I have a table which contains name of vendors along with their other details such as address, telephone no etc. I need to identify the name of vendors who are similar to each other. I was successful in finding exact duplicate vendors, but it becomes difficult with fuzzy duplicates. Here is just a sample data set:

| Name                    |   City  |
|-------------------------|:-------:|
| CANON PVT. LTD          | Georgia |
| Antila,Thomas           | Georgia |
| Greg                    | Georgia |
| St.Luke's Hospital      | Georgia |
| Z_SANDSTONE COOLING LTD | Georgia |
| St.Luke's Hospital      | Georgia |
| CANON PVT. LTD.         | Georgia |
| SANDSTONE COOLING LTD   | Georgia |
| Greg                    | Georgia |
| ANTILA,THOMAS           | Georgia |

I want to identify only the fuzzy duplicates so that the output looks like this:

| Name                    |   City  |
|-------------------------|:-------:|
| CANON PVT. LTD          | Georgia |
| CANON PVT. LTD.         | Georgia |
| Antila,Thomas           | Georgia |
| ANTILA, THOMAS          | Georgia |
| Z_SANDSTONE COOLING LTD | Georgia |
| SANDSTONE COOLING LTD   | Georgia |

I went through a similar post like this identifying exact or near duplicate names in a dataset and tried using tidy_comb_all and tidy_stringdist, however, it gives me the result of the column 'Name', whereas, I want the column 'city' as well (i.e, the entire dataframe. My original data has many other columns and I want information of all of them). How is it possible to achieve this? @john.smith Would you be able to look into this problem? Thanks in advance!

andresrcs · July 13, 2019, 3:23pm

This would be one way to do it

library(tidystringdist)
library(tidyverse)

# Sample data
df <- data.frame(stringsAsFactors = FALSE,
  Name = as.factor(c(" CANON PVT. LTD ", " Antila,Thomas ", " Greg ",
                     " St.Luke's Hospital ", " Z_SANDSTONE COOLING LTD ",
                     " St.Luke's Hospital ", " CANON PVT. LTD. ",
                     " SANDSTONE COOLING LTD ", " Greg ", " ANTILA,THOMAS ")),
  City = as.factor(c(" Georgia ", " Georgia ", " Georgia ", " Georgia ",
                     " Georgia ", " Georgia ", " Georgia ", " Georgia ",
                     " Georgia ", " Georgia "))
)


match <- df %>% 
  tidy_comb_all(Name) %>% 
  tidy_stringdist() %>% 
  filter(soundex == 0) %>% # Set a threshold
  gather(x, match, starts_with("V")) %>% 
  .$match
  

df %>% 
  filter(Name %in% match) %>% 
  arrange(Name)
#>                        Name      City
#> 1            Antila,Thomas   Georgia 
#> 2            ANTILA,THOMAS   Georgia 
#> 3           CANON PVT. LTD   Georgia 
#> 4          CANON PVT. LTD.   Georgia 
#> 5    SANDSTONE COOLING LTD   Georgia 
#> 6  Z_SANDSTONE COOLING LTD   Georgia

^{Created on 2019-07-13 by the reprex package (v0.2.1)}

April · July 14, 2019, 12:50am

@andresrcs Thanks a lot Andresrcs for taking out time to solve this problem! This code seems to work perfect with the sample code. But I fail to understand the logic of parameters inside filter and gather functions. What does 'soundex==0' and 'starts_with("V")' imply?
When I try to work with this code on my original dataset, it gives out names in the output which are little similar but with large differences, e.g., 'Aadiyta' and 'Aaram Techserve'. Also, which method is the code using for calculating string distance?
I had written the following code for my original dataset:

comb <- tidy_comb_all(df$`Name 1`)
out=tidy_stringdist(comb,method="lcs")
out=subset(out,(out$lcs>=1 & out$lcs<=10))

This worked pretty well in identifying the differences, but the problem with this code is that it would compare 'Antila, Thomas' with 'ANTILA, THOMAS' once and would again compare 'ANTILA, THOMAS' with 'Antila, Thomas', so, it is generating duplicates in the output. Also, it only lists out the 'Name' column whereas, I want to have all the other columns of my original dataset in the output. How is that achievable?

andresrcs · July 14, 2019, 1:04am

I simply choosed a random metric and threshold as example, selecting the most appropriate method, metric and threshold is up to you and, in my opinion, requires more domain specific knowledge for fine tuning.

starts_with("V") is for selecting the columns containing the matching combinations, both start with "V" (i.e. V1, V2).

April · July 14, 2019, 2:07am

@andresrcs Thank you for your reply! When I broke down the code step wise, I understood it. Thanks once again!

andresrcs · July 14, 2019, 2:09am

If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it:

April · July 14, 2019, 2:12am

Since your piece of code solved the problem, I have marked your post as the solution. Thanks a ton again!

system · July 21, 2019, 2:26am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.