Detect and delete rows where at least two columns have the same content

Hi All,

I'm creating a data frame and I need to delete all the rows where at least two columns have the same content (text). Empty cells (NA) shouldn't be considered duplicates. For example, in the following data frame, I would need to delete only the first and second rows.

df = cbind(A = c('a', 'b', 'c', 'd', 'NA'),
           B = c('a', 'c', 'd', 'e', 'g'),
           C = c('e', 'b', 'a', 'f', 'NA'))

df <- data.frame(df)

But I have more than 10,000 rows, so I need code that detects the rows where two or more cells have the same content and deletes them. How could I do this?

Another solution could be to concatenate the contents of all 25 columns into one cell (per row) and ask R to delete the rows where the string in that cell contains the same value twice.
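
To illustrate, something like this row-wise check is what I have in mind (an untested sketch, and it assumes the empty cells are real NA values rather than the string 'NA'):

# flag the rows where any non-NA value appears in more than one cell
has_dup <- apply(df, 1, function(r) any(duplicated(r[!is.na(r)])))
df[!has_dup, ]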

I hope I've been as clear as possible; please ask if you need any clarification.

Many thanks

Perhaps something like this, though it might not be that practical if you have lots of columns.

library(dplyr)
df %>% 
  as_tibble() %>% 
  # flag rows where any pair of columns matches
  mutate(duplicates = A == B | A == C | B == C) %>% 
  filter(!duplicates)

# A tibble: 2 x 4
  A     B     C     duplicates
  <chr> <chr> <chr> <lgl>     
1 c     d     a     FALSE     
2 d     e     f     FALSE    

Be sure to create a data frame or tibble, not just use cbind.
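
For example, cbind() on vectors returns a character matrix, not a data frame:

m <- cbind(A = c('a', NA), B = c('b', 'b'))
class(m)   # "matrix" "array"
df2 <- data.frame(A = c('a', NA), B = c('b', 'b'))
class(df2) # "data.frame"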


It works, thank you. The problem is that some cells are empty (NA), so it treats those as duplicates. Moreover, since I have many columns, I'd rather not write out every combination by hand; I tried nested for loops but couldn't get them to work. Would you have an idea of how to solve those problems too? Sorry if I'm being insistent, I'm an R novice :grin:

It would be best if your example reflected this part of your challenge (real NA cells). Could you update it?


df <- data.frame(A = c('a', 'b', 'c', 'd', NA),
                 B = c('a', 'c', 'd', 'e', 'g'),
                 C = c('e', 'b', 'a', 'f', NA))
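
df
#      A B    C
# 1    a a    e
# 2    b c    b
# 3    c d    a
# 4    d e    f
# 5 <NA> g <NA>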

# get the vector of column names in df

(nmv <- names(df))
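# [1] "A" "B" "C"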

# get all pairs of column names
(name_pairs <- combn(nmv,2,simplify = FALSE))
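# a list of the column pairs: c("A","B"), c("A","C"), c("B","C")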

# build the list of pairwise filter comparisons
library(dplyr)
library(purrr)
library(glue)
library(rlang)
(filters_1 <- map_chr(name_pairs,~glue("({.x[1]}!={.x[2]} | is.na({.x[1]}))")))
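# [1] "(A!=B | is.na(A))" "(A!=C | is.na(A))" "(B!=C | is.na(B))"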
# combine into a single filter rule
filters_2 <- parse_exprs(paste0(filters_1 , collapse = " & "))[[1]]
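# filters_2 is the single expression:
# (A != B | is.na(A)) & (A != C | is.na(A)) & (B != C | is.na(B))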

# apply the rule
df %>% filter(eval(filters_2))
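#   A B C
# 1 c d a
# 2 d e f
# note: the row with NAs is dropped too -- see the follow-up below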

But this way I'm going to lose the rows with NAs, while I want to keep them if there are no other duplicates.

Sorry, my bad, I didn't notice I had written the opposite. To be clear, empty cells (NA) shouldn't be considered duplicates for the purposes of my research.

OK, I think I understood; I was just sloppy.
Extend the inner rules like so:

(filters_1 <- map_chr(name_pairs,~glue("({.x[1]}!={.x[2]} | is.na({.x[1]}) | is.na({.x[2]}))")))
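
Re-running the last two steps with the corrected rules should now keep the NA row:

filters_2 <- parse_exprs(paste0(filters_1, collapse = " & "))[[1]]
df %>% filter(eval(filters_2))
#      A B    C
# 1    c d    a
# 2    d e    f
# 3 <NA> g <NA>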
