identifying exact or near duplicate names in a dataset


#1

Below are the supplier names as example, which are exact duplicates as well as near duplicates, how can we identify this is with R,
3M
3M Company
3M Co
A & R LOGISTICS INC
AR LOGISTICS INC
A & R LOGISTICS LTD
ABB GROUP
ABB LTD
ABB INC
how do I tag these into one group by fuzzy logic to normalize the names. Output can be like below, also open for better suggestions

3M 1
3M Company 1
3M Co 1
A & R LOGISTICS INC 2
AR LOGISTICS INC 2
A & R LOGISTICS LTD 2
ABB GROUP 3
ABB LTD 3
ABB INC 3


#2

Hi,

I have this problem as well in my day job :slight_smile:
The bad news is there is no easy way to do it automatically but I'm far from an expert but if anyone has any ideas i would love to hear your thoughts also.

These look like shipper names which are probably not standardized. I can only suggest a number of ways that i looked at it but your mileage may vary.

  • You could try tidystringdist which has an assortment of fuzzy string matchers and is very intuitive. Each metric compares a string and gives you a similarity score sometimes scaled between 0-100. When we were stuck with this, we compared every shipper name to every other shipper name and matched it to the highest matched value. We decided early on that anything below a 95% match needed to be investigated manually. We would then stick this into a reference table for future matching. The next time we had a bunch of shipper names we would first check the reference table then check for a fuzzy match.

  • The second method could be to use a clustering technique and use the distance measure such as a levenschtein distance which measures the number of edits needed to make one shipper name look like another. It would in theory cluster the names that were similar and then we would go in to manually check them to create

I hope this is useful

Thanks


#3

John,
Thank for the reply. It is very helpful. But I am a beginner to R & not good at writing the scripts with syntax to execute & get the required output.
Can you guide me on this? My sheet does have one column with supplier names.

Regards,
Ravana


#4

A really good way to ask this is with a minimal REPRoducible EXample (reprex)? A reprex makes it much easier for others to understand your issue and figure out how to help.

Here's a video guiding you through how to make one: [Video] Reproducible Examples and the `reprex` package


In past projects I've use stringdist with good success. Colin Fay created a few vinettes on this method here:
https://cran.r-project.org/web/packages/tidystringdist/vignettes/Getting_started.html

For a longer exploration, Colin has a blog post below, which works up to string distance on the Game of Thornes dataset (about halfway down for where the string dist discussion starts).

String distance might be problematic if you have too many companies with similar names.


Stack overflow has a nice discussion:


#5

Hi @Ravana

I would recommend that you look at the links supplied by @EconomiCurtis as they are excellent.

Below is a rough draft i think of what you are looking for. The end result gives you a similarity score to each word and as you can see once the value drops below 0.5 the matches are basically useless. You can use whatever similarity score you think provides the most merit

The high matches do catch similar words with sight variations. The code below assumes that V1 is a reference table

I hope it helps you in your project

library(tidystringdist)
#> Warning: package 'tidystringdist' was built under R version 3.5.1
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.5.1
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Takes the data and converts it to a dataframe format
companies <- tibble(comp = c("3M",
  "3M Company",
  "3M Co",
  "A & R LOGISTICS INC",
  "AR LOGISTICS INC",
  "A & R LOGISTICS LTD",
  "ABB GROUP",
  "ABB LTD",
  "ABB INC"))

# Setup so that each company is compared to every other company
compare <- tidy_comb_all(companies, comp)

# Have a quick look at the data to show the company 
# 3M being compared to all other companies
head(compare, 9)
#> # A tibble: 9 x 2
#>   V1         V2                 
#>   <chr>      <chr>              
#> 1 3M         3M Company         
#> 2 3M         3M Co              
#> 3 3M         A & R LOGISTICS INC
#> 4 3M         AR LOGISTICS INC   
#> 5 3M         A & R LOGISTICS LTD
#> 6 3M         ABB GROUP          
#> 7 3M         ABB LTD            
#> 8 3M         ABB INC            
#> 9 3M Company 3M Co

# Compare all the company names to each other
# Added a sim score based on jaccard because its a little bit easier to read
comparisons <- tidy_stringdist(compare) %>% 
  mutate(sim = 1-jaccard) %>% 
  select(V1, V2, sim)

# Now we need to see the top score per group
recommendation <- comparisons %>% 
  group_by(V1) %>% 
  summarise(max_sim = max(sim)) %>% 
  ungroup()

# Match it back to the original data
comparisons %>% 
  inner_join(recommendation, by = c("V1" = "V1", "sim" = "max_sim"))
#> # A tibble: 9 x 3
#>   V1                  V2                    sim
#>   <chr>               <chr>               <dbl>
#> 1 3M                  3M Co               0.4  
#> 2 3M Company          3M Co               0.5  
#> 3 3M Co               ABB INC             0.222
#> 4 A & R LOGISTICS INC AR LOGISTICS INC    0.917
#> 5 AR LOGISTICS INC    A & R LOGISTICS LTD 0.769
#> 6 A & R LOGISTICS LTD ABB LTD             0.385
#> 7 ABB GROUP           ABB LTD             0.273
#> 8 ABB GROUP           ABB INC             0.273
#> 9 ABB LTD             ABB INC             0.333

Created on 2018-10-24 by the reprex package (v0.2.1)