Find similars rows in R

juandmaz · January 5, 2022, 3:08am

I have a very large database (48 thousand rows) and I want to review rows that are similar or the same but I can't find a way to do it.

I have this example base:

tribble(
  ~id, ~name, ~lastname, ~pasport, ~State,
  1, "Peter", "Gomez", "1234", "Texas",
  2, "Maria", "Perez", "4567", "Texas",
  3, "Peterr", "Gomes", "1234", "Texas",
  4, "Maria", "Perez", "4567", "Texas",
  5, "Lucy", "Batista", "5784", "California",
  6, "Peter", "Gomez", "1234", "Texaas",
  7, "Maria", "Perezz", "4567", "Texas",
  8, "John", "Mark", "9423", "California",
  9, "Ben", "Aro", "3201", "Washington",
  10, "Jennifer", "Cruz", "3456", "Ohio")

This is the result:

id    name  lastname pasport   State
<dbl> <chr>   <chr>    <chr>   <chr>    
1     Peter  Gomez    1234   Texas  
2     Maria  Perez    4567   Texas  
3     Peterr Gomes    1234   Texas  
4     Maria  Perez    4567   Texas  
5     Lucy  Batista  5784   California   
6     Peter  Gomez    1234   Texaas 
7     Maria  Perezz   4567   Texas  
8     John   Mark     9423   California   
9     Ben    Aro      3201   Washington    
10    Jennifer Cruz   3456   Ohio

I want a way that identifies similar or identicals rows and shows them to me. The output I want would be this (or something similar):

id  name  lastname pasport   State

1  Peter  Gomez    1234   Texas
3  Peterr Gomes    1234   Texas
6  Peter  Gomez    1234   Texaas

2  Maria  Perez    4567   Texas 
4  Maria  Perez    4567   Texas
7  Maria  Perezz   4567   Texas

So I can know which observations have problems or were entered incorrectly.

I did not find a function or package that can do this. Anyone know a way? Thanks a lot!

andresrcs · January 5, 2022, 3:53am

In your specific example, the passport is a unique identifier so you can simply count and filter when more than one instance is present

library(tidyverse)

sample_df <- tribble(
    ~id, ~name, ~lastname, ~pasport, ~State,
    1, "Peter", "Gomez", "1234", "Texas",
    2, "Maria", "Perez", "4567", "Texas",
    3, "Peterr", "Gomes", "1234", "Texas",
    4, "Maria", "Perez", "4567", "Texas",
    5, "Lucy", "Batista", "5784", "California",
    6, "Peter", "Gomez", "1234", "Texaas",
    7, "Maria", "Perezz", "4567", "Texas",
    8, "John", "Mark", "9423", "California",
    9, "Ben", "Aro", "3201", "Washington",
    10, "Jennifer", "Cruz", "3456", "Ohio") 

sample_df %>%
    add_count(pasport) %>% 
    filter(n > 1) %>% 
    arrange(pasport) %>% 
    select(-n)
#> # A tibble: 6 × 5
#>      id name   lastname pasport State 
#>   <dbl> <chr>  <chr>    <chr>   <chr> 
#> 1     1 Peter  Gomez    1234    Texas 
#> 2     3 Peterr Gomes    1234    Texas 
#> 3     6 Peter  Gomez    1234    Texaas
#> 4     2 Maria  Perez    4567    Texas 
#> 5     4 Maria  Perez    4567    Texas 
#> 6     7 Maria  Perezz   4567    Texas

^{Created on 2022-01-04 by the reprex package (v2.0.1)}

juandmaz · January 5, 2022, 6:08am

thanks for the answer but it does not work for me because I am looking for a systematic way to apply it on a larger base, and there will not be a unique identifier for each person as in this case. Maybe I should have taken the variable pasport out of the example or put a different value for the same person in another observation. But with your way I am not looking for discrepancies, but simply grouping by a variable.

mikecro · January 5, 2022, 9:20am

Seems like you have to:
identify correct spellings (by frequency)
identify the misspellings - but note that some rare spellings are just rare names
match those misspellings to the correct spellings
group by corrected spelling

xvalda · January 5, 2022, 12:06pm

Then pasting all columns (except id) and counting those that have more than one occurence could work?
Is it what you're looking for @juandmaz ?

sample_df %>% unite("pasted", -id, remove = FALSE) %>% 
  add_count(pasted) %>% 
  filter(n > 1) %>% 
  select(-pasted)

#> # A tibble: 2 x 6
#>      id name  lastname pasport State     n
#>   <dbl> <chr> <chr>    <chr>   <chr> <int>
#> 1     2 Maria Perez    4567    Texas     2
#> 2     4 Maria Perez    4567    Texas     2

andresrcs · January 5, 2022, 12:15pm

In that case, I think you can look into "string distances" or Natural Language Processing (NLP) in general, here is an example of the use of string distances to identify near duplicates

juandmaz · January 5, 2022, 7:25pm

Hi Andres, thanks for the reply! It is a great solution that you gave me, and it is very close to what I need, but it still does not work since that way only applies to a single variable (the one that is chosen) and I need to apply it to all the variables of my base . In my example, it could only be applied to name or lastname, when I need to apply it to name and lastname.

andresrcs · January 5, 2022, 7:40pm

Why not to simply apply it to a new variable that pastes name and last name together?

juandmaz · January 5, 2022, 7:44pm

Do you propose that for this example create a new variable that copies name, lastname and state? and then apply to it

andresrcs · January 5, 2022, 8:01pm

I think it's worth giving it a try, after all you want to account for similarity in those three fields together and not individually so I believe it makes sense to merge them.

brunocristiano · January 6, 2022, 2:05am

Hi Juan, try jaccard similarity, you may have to adapt tthe following code. Hope it works!

a <- c('potato', 'tomotto', 'chips', 'baloon')
b <- c('car', 'chips', 'bird', 'salt')

jaccard <- function(a, b) {
    intersection = length(intersect(a, b))
    union = length(a) + length(b) - intersection
    return (intersection/union)
}

jaccard(a, b)

system · January 27, 2022, 2:06am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.