How to filter a dataframe based on multiple conditions

Ibrahim · April 5, 2021, 5:55am

Hi everyone. I'm cleaning a dataset that looks this:

df <- data.frame(First_Name = c("Adam", "John", "Daniel", "Jack", "David","Emily"),
                        Second_Name = c("White", NA, "Brown", "White", "Simpsons","Simpson"),
                        Carer_Number = c("1010101", "9494949", NA, "9494949", "464646", "9494949"),
                        Company = c(NA, "CompanyB","CompanyC", "CompanyD","CompanyE",NA))

My aim is to remove duplicates in the Carer_Number column. My questions are:

When I use the code below to filter duplicated values in the column, I only get two duplicates ( expecting 3). I'm doing something wrong here?

repeats_df <- df %>%
  filter(duplicated(Carer_Number))

How do I keep a duplicated number in the column based on other conditions? For example, only keep a duplicated Carer_Number if the Company and Second Name column is not empty.

Thanks.

Shadrack_Kibet · April 5, 2021, 6:09am

You could use the following for the first problem. You need to install questionr package.

repeats_df  <- df %>%
  filter(questionr::duplicated2(Carer_Number))

Shadrack_Kibet · April 5, 2021, 6:15am

For the second question try the following.

df %>%
  filter(duplicated(Carer_Number)&!is.na(Second_Name)&!is.na(Company))

pradeeps · April 5, 2021, 6:17pm

without using another package you can try

df <- data.frame(First_Name = c("Adam", "John", "Daniel", "Jack", "David","Emily"),
                 Second_Name = c("White", NA, "Brown", "White", "Simpsons","Simpson"),
                 Carer_Number = c("1010101", "9494949", NA, "9494949", "464646", "9494949"),
                 Company = c(NA, "CompanyB","CompanyC", "CompanyD","CompanyE",NA)) 

multiocurrence<-df %>% 
  group_by(Carer_Number) %>% 
  summarise(cnt=n()) %>% 
  filter(cnt>1) %>% 
  select(Carer_Number) %>% 
  pull()

repeats_df <- df %>% filter(Carer_Number %in% multiocurrence)
repeats_df

For the second question:

pradeeps · April 5, 2021, 6:20pm

For the second question

df <- data.frame(First_Name = c("Adam", "John", "Daniel", "Jack", "David","Emily"),
                 Second_Name = c("White", NA, "Brown", "White", "Simpsons","Simpson"),
                 Carer_Number = c("1010101", "9494949", NA, "9494949", "464646", "9494949"),
                 Company = c(NA, "CompanyB","CompanyC", "CompanyD","CompanyE",NA)) 

multiocurrence<-df %>% 
  filter(!is.na(Second_Name)&!is.na(Company)) %>% 
  group_by(Carer_Number) %>% 
  summarise(cnt=n()) %>% 
  filter(cnt>1) %>% 
  select(Carer_Number) %>% 
  pull()

repeats_df <- df %>% filter(Carer_Number %in% multiocurrence)
repeats_df

awprc · April 9, 2021, 5:02pm

I think it would be helpful in your questions to be more clear about what is being filtered out. Are you trying to filter out duplicate rows (i.e. remove them from dataset)? Or are you trying to filter your dataset so that it only includes rows that have the same value in one column?

It's just not clear to me what you mean since none of your rows are duplicates of another.

Because if you are trying to filter out rows that do not have duplicates in your Carer column, then the following works as has been pointed out.

df %>%
    group_by(Carer_Number) %>% 
    filter(n() > 1) %>%
    ungroup()

Another function to consider would be distinct().

system · April 16, 2021, 5:03pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.