identifying exact or near duplicate names in a dataset

john.smith · October 24, 2018, 9:53am

I would recommend that you look at the links supplied by @EconomiCurtis as they are excellent.

Below is a rough draft i think of what you are looking for. The end result gives you a similarity score to each word and as you can see once the value drops below 0.5 the matches are basically useless. You can use whatever similarity score you think provides the most merit

The high matches do catch similar words with sight variations. The code below assumes that V1 is a reference table

I hope it helps you in your project

library(tidystringdist)
#> Warning: package 'tidystringdist' was built under R version 3.5.1
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.5.1
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Takes the data and converts it to a dataframe format
companies <- tibble(comp = c("3M",
  "3M Company",
  "3M Co",
  "A & R LOGISTICS INC",
  "AR LOGISTICS INC",
  "A & R LOGISTICS LTD",
  "ABB GROUP",
  "ABB LTD",
  "ABB INC"))

# Setup so that each company is compared to every other company
compare <- tidy_comb_all(companies, comp)

# Have a quick look at the data to show the company 
# 3M being compared to all other companies
head(compare, 9)
#> # A tibble: 9 x 2
#>   V1         V2                 
#>   <chr>      <chr>              
#> 1 3M         3M Company         
#> 2 3M         3M Co              
#> 3 3M         A & R LOGISTICS INC
#> 4 3M         AR LOGISTICS INC   
#> 5 3M         A & R LOGISTICS LTD
#> 6 3M         ABB GROUP          
#> 7 3M         ABB LTD            
#> 8 3M         ABB INC            
#> 9 3M Company 3M Co

# Compare all the company names to each other
# Added a sim score based on jaccard because its a little bit easier to read
comparisons <- tidy_stringdist(compare) %>% 
  mutate(sim = 1-jaccard) %>% 
  select(V1, V2, sim)

# Now we need to see the top score per group
recommendation <- comparisons %>% 
  group_by(V1) %>% 
  summarise(max_sim = max(sim)) %>% 
  ungroup()

# Match it back to the original data
comparisons %>% 
  inner_join(recommendation, by = c("V1" = "V1", "sim" = "max_sim"))
#> # A tibble: 9 x 3
#>   V1                  V2                    sim
#>   <chr>               <chr>               <dbl>
#> 1 3M                  3M Co               0.4  
#> 2 3M Company          3M Co               0.5  
#> 3 3M Co               ABB INC             0.222
#> 4 A & R LOGISTICS INC AR LOGISTICS INC    0.917
#> 5 AR LOGISTICS INC    A & R LOGISTICS LTD 0.769
#> 6 A & R LOGISTICS LTD ABB LTD             0.385
#> 7 ABB GROUP           ABB LTD             0.273
#> 8 ABB GROUP           ABB INC             0.273
#> 9 ABB LTD             ABB INC             0.333

Created on 2018-10-24 by the reprex package (v0.2.1)