removing different types of duplicates in dataframe using hablar in r

jasongeslois · December 2, 2020, 4:59am

Hi, would someone be able to help me figure out different ways to remove duplicate rows under certain conditions? Ideally what I'm doing is comparing two different dataframes that are setup with the same information, and once combined, to be able to remove duplicates, and identify the ones that are unique. I've discovered the hablar package which helps with this greatly, but there's a few ways I still would like to be able to figure out, and am open to other packages or ways around it. Thanks for any help.


library(tidyverse)
library(hablar)

#creating reproducible data 
data1 <- data.frame(FirstName = as.character(c("JOHN", "JAMES", "Jeff", "John B.", "Smith")), 
                    LastName = as.character(c("SMITH", "JONES", "Marks", "Smith", "John")), 
                    DateOfBirth = as.Date(c("1955-01-31", "1987-02-03", "1974-03-05", "1955-01-31", "1955-01-31")), stringsAsFactors = FALSE)
data1

data2 <- data.frame(FirstName = as.character(c("John", "James", "Tina")), 
                    LastName = as.character(c("Smith", "Jones", "Marks")), 
                    DateOfBirth = as.Date(c("1955-01-31", "1987-02-03", "1975-06-05")), stringsAsFactors = FALSE)
data2

#combining the two data frames
joind <- full_join(data1, data2, by = c("FirstName", "LastName", "DateOfBirth"))
joind

#cleaning step to trim any white spaces, and convert all to upper case
joind$FirstName <- toupper(joind$FirstName)
joind$FirstName <- trimws(joind$FirstName, which = c("right"))
joind$FirstName <- trimws(joind$FirstName, which = c("left"))
joind$LastName <- toupper(joind$LastName)
joind$LastName <- trimws(joind$LastName, which = c("right"))
joind$LastName <- trimws(joind$LastName, which = c("left"))

#running hablar package
dup <- joind %>%
  find_duplicates(FirstName, LastName, DateOfBirth) 
dup

first problem - this works great to identify which ones are duplicates, but is there a way so that it identifies the ones that are not, for ex. to show Jeff Marks, and Tina Marks? or is there an alternative tidy or base r method?

#second problem - how to identify when a first name is in the last name spot and vice versa, ex with Smith John 1955-01-31, which is a duplicate, but was just entered incorrectly. So my thoughts might be something like an IF statement, I'm just not sure how to write it or something maybe that would say:

# if row A and row B have the same DateOfBirth, AND 
# FirstName of row A = LastName of row B, AND 
# LastName of row B = FirstName of row A, THEN
# it is a duplicate

third problem - how to identify the duplicate John B. Smith

Running the below lines will somewhat help solve for problems 2 and 3 to just look for the same DateOfBirth, but it may be cumbersome when there is really long list of names. Would there be an alternative method that might work better?

dup <- joind %>%
  find_duplicates(DateOfBirth) 
dup

Thanks again for any help.

jasongeslois · December 17, 2020, 5:15am

Would anyone have any thoughts to help with this problem? Thanks again

system · January 7, 2021, 5:15am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.