Remove Duplicates / Near Duplicates / Repeat Entries

library(tidyverse)
library(reprex)
df <- tibble::tribble(
  ~V1,  ~V2,
  "5th Generation Builder", "5th Generation Builder, LLC",
  "5th Generation Builders Inc.",   "5th Generation Builders",
  "89 Contractors LLC", "89 Contractors LLC",
  "906 Studio Architects LLC",  "906 Studio Architects LLC",
  "A & A Glass Co.",    "Paragon Const.",
  "A & E Farm", "A & E Farm",
  "A & H GLASS",    "C & C Contractors",
  "A & J Homeworks,Painting, and Restoration",  "A.W. Builders",
  "A & K Construction Co.", "J. Jones Restoration",
  "A & L Construction", "A & L Const.")


output <- df %>%
  distinct(V1, V2, .keep_all = TRUE)

output <- df %>%
  group_by(V1, V2) %>%
  dplyr::mutate(row_number = dplyr::row_number()) %>%
  filter(row_number == max(row_number))

Created on 2020-04-03 by the reprex package (v0.3.0)

I am very new to the community, so I'll apologize in advance.

I am trying to produce a list, resulting from V1 as the CRM ( reference ) comparing to V2 ( Leads ) Filter the duplicates out or even index them as such, leaving me with a list such as a V3 column, indicating anything in V2 that was not identified in V1.

I feel I've tried most everything related to duplicate removal, but I can't seem to fabricate the output I'm looking for.

I truly hope this helps and that my reprex is acceptable.

It could be if you actually post the reprex instead of a screenshot of it, the output is ready to be posted here directly.

Very sorry, the copy plot didn't work as expected.

Think it's good now.

Very sorry about that my friend.

I think you forgot to include a sample of how V3 column should look like, I'm trying to understand what you are trying to do but it is not clear enough for me.
For what I can manage to understand maybe this package could help you identifying fuzzy duplicates.

I have reviewed Colin’s paper, and i don’t think it will output in my favor.
The idea I have for the V3 column is simply anything in V2 that does not appear in V1 ( the reference column )

Is the result you expect line by line ? like which of V1 = V2 ?
Or like a setdiff between the two vectors ?

If you can give an example (even build manually) of the expected output, it would help

library(tidyverse)
df <- tibble::tribble(
  ~V1,  ~V2,
  "5th Generation Builder", "5th Generation Builder, LLC",
  "5th Generation Builders Inc.",   "5th Generation Builders",
  "89 Contractors LLC", "89 Contractors LLC",
  "906 Studio Architects LLC",  "906 Studio Architects LLC",
  "A & A Glass Co.",    "Paragon Const.",
  "A & E Farm", "A & E Farm",
  "A & H GLASS",    "C & C Contractors",
  "A & J Homeworks,Painting, and Restoration",  "A.W. Builders",
  "A & K Construction Co.", "J. Jones Restoration",
  "A & L Construction", "A & L Const.")


df %>%
  filter(V2 != V1)
#> # A tibble: 7 x 2
#>   V1                                        V2                         
#>   <chr>                                     <chr>                      
#> 1 5th Generation Builder                    5th Generation Builder, LLC
#> 2 5th Generation Builders Inc.              5th Generation Builders    
#> 3 A & A Glass Co.                           Paragon Const.             
#> 4 A & H GLASS                               C & C Contractors          
#> 5 A & J Homeworks,Painting, and Restoration A.W. Builders              
#> 6 A & K Construction Co.                    J. Jones Restoration       
#> 7 A & L Construction                        A & L Const.

setdiff(df$V2, df$V1)
#> [1] "5th Generation Builder, LLC" "5th Generation Builders"    
#> [3] "Paragon Const."              "C & C Contractors"          
#> [5] "A.W. Builders"               "J. Jones Restoration"       
#> [7] "A & L Const."

Created on 2020-04-04 by the reprex package (v0.3.0.9001)

R_Test_Screenshot

Thank you so much for the help, this screenshot is generally the result target.

The dataset has a little over 1k in the V1 (reference or CRM ) a little over 3k in the V2 ( leads ).

my goal is to take the leads, compare to what is in our current CRM and send the leads out.

In the past I've had numerous people work on this manually which has taken quite some time and given the current state of things that luxury is gone.

I do hope this helps.

df <- tibble::tribble(
   ~V1, ~V2,
   "5th Generation Builder",
"5th Generation Builders Inc.",
"89 Contractors LLC",
"906 Studio Architects LLC",
"A & A Glass Co.",
"A & E Farm",
"A & H GLASS",
"A & J Homeworks,Painting, and Restoration",
"A & K Construction Co.",
"A & L Construction",
"5th Generation Builder, LLC",
"5th Generation Builders",
"89 Contractors LLC",
"906 Studio Architects LLC",
"Paragon Const.",
"A & E Farm",
"C & C Contractors",
"A.W. Builders",
"J. Jones Restoration",
"A & L Const.")

tidy_comb_df <- tidy_comb_all(df$V1, df$V2)
#> Error in tidy_comb_all(df$V1, df$V2): could not find function "tidy_comb_all"

tidy_stringdist(tidy_comb_df)
#> Error in tidy_stringdist(tidy_comb_df): could not find function "tidy_stringdist"
tidy_stringdist(tidy_comb_df, method = c("osa","jw"))
#> Error in tidy_stringdist(tidy_comb_df, method = c("osa", "jw")): could not find function "tidy_stringdist"

df2 <- tidy_stringdist(tidy_comb_df, method= "osa") %>%
  filter(osa > 18) %>%
  arrange(desc(osa))
#> Error in tidy_stringdist(tidy_comb_df, method = "osa") %>% filter(osa > : could not find function "%>%"

Created on 2020-04-04 by the reprex package (v0.3.0)

I have to set the osa to roughly 18 to catch the resulting 4. But that also allows many duplicates to occur.

Ultimately the V3 or result column would show only
** Paragon Const.*
** C & C Contractors*
** A.W. Builders*
** J. Jones Restoration*

I know this is not as precise as you might want (since the term fuzzy) and might require fine-tuning and manually check, but you would end up with much less than 3k rows to check.

library(tidystringdist)
library(dplyr)

df <- tibble::tribble(
    ~V1,  ~V2,
    "5th Generation Builder", "5th Generation Builder, LLC",
    "5th Generation Builders Inc.",   "5th Generation Builders",
    "89 Contractors LLC", "89 Contractors LLC",
    "906 Studio Architects LLC",  "906 Studio Architects LLC",
    "A & A Glass Co.",    "Paragon Const.",
    "A & E Farm", "A & E Farm",
    "A & H GLASS",    "C & C Contractors",
    "A & J Homeworks,Painting, and Restoration",  "A.W. Builders",
    "A & K Construction Co.", "J. Jones Restoration",
    "A & L Construction", "A & L Const.")

tidy_stringdist(df) %>% 
    filter(osa>10) %>% 
    select(V1, V2)
#> # A tibble: 4 x 2
#>   V1                                        V2                  
#>   <chr>                                     <chr>               
#> 1 A & A Glass Co.                           Paragon Const.      
#> 2 A & H GLASS                               C & C Contractors   
#> 3 A & J Homeworks,Painting, and Restoration A.W. Builders       
#> 4 A & K Construction Co.                    J. Jones Restoration

Maybe you should take a look a NLP techniques.

tex2vec has tools for document similarity. Maybe it could help...
http://text2vec.org/similarity.html

Looks interesting, I appreciate the info.
I think I’d be ok with the result being completely dissimilar or even just return a logical reference?

library(tidyverse)
library(reprex)
df <- tibble::tribble(
  ~V1,  ~V2,
  "5th Generation Builder", "5th Generation Builder, LLC",
  "5th Generation Builders Inc.",   "5th Generation Builders",
  "89 Contractors LLC", "89 Contractors LLC",
  "906 Studio Architects LLC",  "906 Studio Architects LLC",
  "A & A Glass Co.",    "Paragon Const.",
  "A & E Farm", "A & E Farm",
  "A & H GLASS",    "C & C Contractors",
  "A & J Homeworks,Painting, and Restoration",  "A.W. Builders",
  "A & K Construction Co.", "J. Jones Restoration",
  "A & L Construction", "A & L Const.")
str(df)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
#>  $ V1: chr  "5th Generation Builder" "5th Generation Builders Inc." "89 Contractors LLC" "906 Studio Architects LLC" ...
#>  $ V2: chr  "5th Generation Builder, LLC" "5th Generation Builders" "89 Contractors LLC" "906 Studio Architects LLC" ...
df
#> # A tibble: 10 x 2
#>    V1                                        V2                         
#>    <chr>                                     <chr>                      
#>  1 5th Generation Builder                    5th Generation Builder, LLC
#>  2 5th Generation Builders Inc.              5th Generation Builders    
#>  3 89 Contractors LLC                        89 Contractors LLC         
#>  4 906 Studio Architects LLC                 906 Studio Architects LLC  
#>  5 A & A Glass Co.                           Paragon Const.             
#>  6 A & E Farm                                A & E Farm                 
#>  7 A & H GLASS                               C & C Contractors          
#>  8 A & J Homeworks,Painting, and Restoration A.W. Builders              
#>  9 A & K Construction Co.                    J. Jones Restoration       
#> 10 A & L Construction                        A & L Const.


tidy_stringdist(df) %>% 
  filter(osa>10) %>% 
  select(V1, V2)
#> Error in tidy_stringdist(df): could not find function "tidy_stringdist"


tidy_df <- tidy_stringdist(df) %>% 
  filter(soundex>=1) %>% 
  select(-V1, V2)
#> Error in tidy_stringdist(df): could not find function "tidy_stringdist"

Created on 2020-04-05 by the reprex package (v0.3.0)

What are your thoughts using the soundex criteria?
Thought I was on to something subtracting "osa" from "soundex" since soundex is a single digit format.
Thoughts?

Hi @sbaumbaugh: It looks like folks don't have access to osa or soundex -- are they in the output of tidy_stringdist()? And if so, could you make it accessible? Is it your own or part of a package?

Yes that last big should have it in the output.
library(stringiest) which i will add to the reprex.
I apologize.

Still struggling with the output, V1 is the propriety information, If V2 is not in V1 can I use mutate to new column?

I've used soundex and osa metrics to help filter this further but I'm not sure this is valid output.

Can anyone shed some light?

The goal remains that V1 is the reference, so if an element in V2 exists in V1 it should be dropped as a repeat or duplicate.

The results return 4 obs. but should only return 3 unique elements of V2 that do not appear in V1.

Thoughts?

df <- tibble::tribble(
  ~V1,  ~V2,
  "5th Generation Builder", "5th Generation Builder, LLC",
  "5th Generation Builders Inc.",   "5th Generation Builders",
  "89 Contractors LLC", "89 Contractors LLC",
  "906 Studio Architects LLC",  "906 Studio Architects",
  "A & A Glass Co.",    "Paragon Const.",
  "A & E Farm", "A & E Farm",
  "A & H GLASS",    "C & C Contractors",
  "A & J Homeworks,Painting, and Restoration",  "A.W. Builders",
  "Paragon Const.", "J. Jones Restoration",
  "A & L Construction", "A & L Const.")

tidy_e <- tidy_stringdist(df) %>% 
  filter(soundex>=1) %>% 
  select(-V1, V2) %>% 
  arrange(V2,osa) %>% 
  mutate(V2, sim = soundex/ osa) %>% 
  distinct(V2, osa, soundex, sim) %>% 
  rename('Leads'= 'V2')
#> Error in tidy_stringdist(df) %>% filter(soundex >= 1) %>% select(-V1, : could not find function "%>%"

Created on 2020-04-14 by the reprex package (v0.3.0)

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.