Hi, would someone be able to help me figure out different ways to remove duplicate rows under certain conditions? Ideally what I'm doing is comparing two different dataframes that are setup with the same information, and once combined, to be able to remove duplicates, and identify the ones that are unique. I've discovered the hablar package which helps with this greatly, but there's a few ways I still would like to be able to figure out, and am open to other packages or ways around it. Thanks for any help.
library(tidyverse) library(hablar) #creating reproducible data data1 <- data.frame(FirstName = as.character(c("JOHN", "JAMES", "Jeff", "John B.", "Smith")), LastName = as.character(c("SMITH", "JONES", "Marks", "Smith", "John")), DateOfBirth = as.Date(c("1955-01-31", "1987-02-03", "1974-03-05", "1955-01-31", "1955-01-31")), stringsAsFactors = FALSE) data1 data2 <- data.frame(FirstName = as.character(c("John", "James", "Tina")), LastName = as.character(c("Smith", "Jones", "Marks")), DateOfBirth = as.Date(c("1955-01-31", "1987-02-03", "1975-06-05")), stringsAsFactors = FALSE) data2 #combining the two data frames joind <- full_join(data1, data2, by = c("FirstName", "LastName", "DateOfBirth")) joind #cleaning step to trim any white spaces, and convert all to upper case joind$FirstName <- toupper(joind$FirstName) joind$FirstName <- trimws(joind$FirstName, which = c("right")) joind$FirstName <- trimws(joind$FirstName, which = c("left")) joind$LastName <- toupper(joind$LastName) joind$LastName <- trimws(joind$LastName, which = c("right")) joind$LastName <- trimws(joind$LastName, which = c("left")) #running hablar package dup <- joind %>% find_duplicates(FirstName, LastName, DateOfBirth) dup
first problem - this works great to identify which ones are duplicates, but is there a way so that it identifies the ones that are not, for ex. to show Jeff Marks, and Tina Marks? or is there an alternative tidy or base r method?
#second problem - how to identify when a first name is in the last name spot and vice versa, ex with Smith John 1955-01-31, which is a duplicate, but was just entered incorrectly. So my thoughts might be something like an IF statement, I'm just not sure how to write it or something maybe that would say:
# if row A and row B have the same DateOfBirth, AND # FirstName of row A = LastName of row B, AND # LastName of row B = FirstName of row A, THEN # it is a duplicate
third problem - how to identify the duplicate John B. Smith
Running the below lines will somewhat help solve for problems 2 and 3 to just look for the same DateOfBirth, but it may be cumbersome when there is really long list of names. Would there be an alternative method that might work better?
dup <- joind %>% find_duplicates(DateOfBirth) dup
Thanks again for any help.