If there are duplicates in a df, keep one according to a rule

Let's say I have the following dataframe:

users <- data.frame(name = c('John', 'John', 'Bob'),
                    age = c(18, 18, 28),
                    country = c('Brazil', 'Brazil', 'US'),
                    Grade = c('A', 'B', 'C'))

If I run the code below, only the first and third rows are kept.

users %>%
  distinct(name, age, country, .keep_all = TRUE)

However, I would like to keep the second John instead. Whenever there is a duplicate, the row with the lower grade should be chosen, or perhaps the row whose Grade column contains a given substring. How can I do this in a tidyverse way?

In essence, you just need to group_by() the variables that identify a duplicate (name in your example, plus age and country if those should count too) and then filter() on the variable that decides which row to keep. So for your example:

users <- data.frame(name = c('John', 'John', 'Bob'),
                    age = c(18, 18, 28),
                    country = c('Brazil', 'Brazil', 'US'),
                    Grade = c('A', 'B', 'C'))

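library(dplyr)

users %>%
  group_by(name, age, country) %>%  # same columns you passed to distinct()
  filter(as.character(Grade) == max(as.character(Grade))) %>%  # max() picks the alphabetically last letter, so 'B' beats 'A'
  ungroup()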
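One thing to be aware of: the filter() approach keeps every row that ties for the maximum, so a group can still end up with more than one row. If you need exactly one row per group, a possible alternative (a sketch, not the only way) is to sort first and then reuse your original distinct() call, since distinct() keeps the first row it sees for each combination. The name deduped is just for illustration.

```r
library(dplyr)

users <- data.frame(name = c('John', 'John', 'Bob'),
                    age = c(18, 18, 28),
                    country = c('Brazil', 'Brazil', 'US'),
                    Grade = c('A', 'B', 'C'),
                    stringsAsFactors = FALSE)

deduped <- users %>%
  arrange(desc(Grade)) %>%                         # preferred rows (lower grade = later letter) come first
  distinct(name, age, country, .keep_all = TRUE)   # keeps the first row per name/age/country

deduped
```

Note that the output comes back in sorted order rather than the original row order.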

For your second example, looking for a substring, you can use str_detect() from stringr inside filter().
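A minimal sketch of that idea, assuming the substring you care about is 'B' (a placeholder, swap in your own) and that rows without a duplicate should always be kept. The n() == 1 check is my addition so that Bob is not dropped just because his grade does not match; the name kept is only for illustration.

```r
library(dplyr)
library(stringr)

users <- data.frame(name = c('John', 'John', 'Bob'),
                    age = c(18, 18, 28),
                    country = c('Brazil', 'Brazil', 'US'),
                    Grade = c('A', 'B', 'C'),
                    stringsAsFactors = FALSE)

kept <- users %>%
  group_by(name, age, country) %>%
  # keep a row if it has no duplicates, or if its Grade contains the substring
  filter(n() == 1 | str_detect(Grade, fixed('B'))) %>%
  ungroup()

kept
```

If none of the rows in a duplicated group contain the substring, this drops the whole group, so you may want a fallback rule for that case.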
