identifying exact or near duplicate names in a dataset


#1

Below are the supplier names as example, which are exact duplicates as well as near duplicates, how can we identify this is with R,
3M
3M Company
3M Co
A & R LOGISTICS INC
AR LOGISTICS INC
A & R LOGISTICS LTD
ABB GROUP
ABB LTD
ABB INC
how do I tag these into one group by fuzzy logic to normalize the names. Output can be like below, also open for better suggestions

3M 1
3M Company 1
3M Co 1
A & R LOGISTICS INC 2
AR LOGISTICS INC 2
A & R LOGISTICS LTD 2
ABB GROUP 3
ABB LTD 3
ABB INC 3


#2

Hi,

I have this problem as well in my day job :slight_smile:
The bad news is there is no easy way to do it automatically but I'm far from an expert but if anyone has any ideas i would love to hear your thoughts also.

These look like shipper names which are probably not standardized. I can only suggest a number of ways that i looked at it but your mileage may vary.

  • You could try tidystringdist which has an assortment of fuzzy string matchers and is very intuitive. Each metric compares a string and gives you a similarity score sometimes scaled between 0-100. When we were stuck with this, we compared every shipper name to every other shipper name and matched it to the highest matched value. We decided early on that anything below a 95% match needed to be investigated manually. We would then stick this into a reference table for future matching. The next time we had a bunch of shipper names we would first check the reference table then check for a fuzzy match.

  • The second method could be to use a clustering technique and use the distance measure such as a levenschtein distance which measures the number of edits needed to make one shipper name look like another. It would in theory cluster the names that were similar and then we would go in to manually check them to create

I hope this is useful

Thanks


#3

John,
Thank for the reply. It is very helpful. But I am a beginner to R & not good at writing the scripts with syntax to execute & get the required output.
Can you guide me on this? My sheet does have one column with supplier names.

Regards,
Ravana


#4

A really good way to ask this is with a minimal REPRoducible EXample (reprex)? A reprex makes it much easier for others to understand your issue and figure out how to help.

Here's a video guiding you through how to make one: [Video] Reproducible Examples and the `reprex` package


In past projects I've use stringdist with good success. Colin Fay created a few vinettes on this method here:
https://cran.r-project.org/web/packages/tidystringdist/vignettes/Getting_started.html

For a longer exploration, Colin has a blog post below, which works up to string distance on the Game of Thornes dataset (about halfway down for where the string dist discussion starts).

String distance might be problematic if you have too many companies with similar names.


Stack overflow has a nice discussion:


#5

Hi @Ravana

I would recommend that you look at the links supplied by @EconomiCurtis as they are excellent.

Below is a rough draft i think of what you are looking for. The end result gives you a similarity score to each word and as you can see once the value drops below 0.5 the matches are basically useless. You can use whatever similarity score you think provides the most merit

The high matches do catch similar words with sight variations. The code below assumes that V1 is a reference table

I hope it helps you in your project

library(tidystringdist)
#> Warning: package 'tidystringdist' was built under R version 3.5.1
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.5.1
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Takes the data and converts it to a dataframe format
companies <- tibble(comp = c("3M",
  "3M Company",
  "3M Co",
  "A & R LOGISTICS INC",
  "AR LOGISTICS INC",
  "A & R LOGISTICS LTD",
  "ABB GROUP",
  "ABB LTD",
  "ABB INC"))

# Setup so that each company is compared to every other company
compare <- tidy_comb_all(companies, comp)

# Have a quick look at the data to show the company 
# 3M being compared to all other companies
head(compare, 9)
#> # A tibble: 9 x 2
#>   V1         V2                 
#>   <chr>      <chr>              
#> 1 3M         3M Company         
#> 2 3M         3M Co              
#> 3 3M         A & R LOGISTICS INC
#> 4 3M         AR LOGISTICS INC   
#> 5 3M         A & R LOGISTICS LTD
#> 6 3M         ABB GROUP          
#> 7 3M         ABB LTD            
#> 8 3M         ABB INC            
#> 9 3M Company 3M Co

# Compare all the company names to each other
# Added a sim score based on jaccard because its a little bit easier to read
comparisons <- tidy_stringdist(compare) %>% 
  mutate(sim = 1-jaccard) %>% 
  select(V1, V2, sim)

# Now we need to see the top score per group
recommendation <- comparisons %>% 
  group_by(V1) %>% 
  summarise(max_sim = max(sim)) %>% 
  ungroup()

# Match it back to the original data
comparisons %>% 
  inner_join(recommendation, by = c("V1" = "V1", "sim" = "max_sim"))
#> # A tibble: 9 x 3
#>   V1                  V2                    sim
#>   <chr>               <chr>               <dbl>
#> 1 3M                  3M Co               0.4  
#> 2 3M Company          3M Co               0.5  
#> 3 3M Co               ABB INC             0.222
#> 4 A & R LOGISTICS INC AR LOGISTICS INC    0.917
#> 5 AR LOGISTICS INC    A & R LOGISTICS LTD 0.769
#> 6 A & R LOGISTICS LTD ABB LTD             0.385
#> 7 ABB GROUP           ABB LTD             0.273
#> 8 ABB GROUP           ABB INC             0.273
#> 9 ABB LTD             ABB INC             0.333

Created on 2018-10-24 by the reprex package (v0.2.1)


#6

Hi @john.smith,

Thank you for providing this great example. It is super helpful.

The code seems to be working well to identify all of the fuzzy matches in my dataset. However, it seems to be excluding all of the exact matches. The issue arises in the first step when I use 'tidy_comb_all': company names that are exact matches (i.e. "McDonalds" and "McDonalds") are not compared to one another, while company names that are fuzzy matches (i.e. "McDonalds" and "McDonald's") are compared to one another. I can tell by inspecting the resulting "compare" dataset as well as the fact that the number of observations in the "compare" dataset is less than the number of possible combinations.

Is this an issue that can be overcome, or does 'tidy_comb_all' only find fuzzy matches? Thank you so much.


#7

Hi @IlanaR

Actually yes, there is a way to implement what you are looking for
The comparison is set up by the below line which excludes the the comparison being made with itself

It can be replaced with the following which should give you what you need

library(tidyverse)
library(tidystringdist)

companies <- tibble(comp = c("3M",
                             "3M Company",
                             "3M Co",
                             "A & R LOGISTICS INC",
                             "AR LOGISTICS INC",
                             "A & R LOGISTICS LTD",
                             "ABB GROUP",
                             "ABB LTD",
                             "ABB INC"))



companies_translation <- expand.grid(companies$comp, companies$comp) %>% 
  rename(V1 = Var1, V2 = Var2) %>%
  tidy_stringdist(.)

Hope it helps


#8

Hi @john.smith,

Thank you so much. This is great.

Ilana


#9

If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it:


closed #10

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.