Filter out ALL rows with duplicate values

tmauch · September 28, 2019, 6:37pm

I am still very new to R and looking for help with a problem I'm trying to solve. I have a large data set I'm working with and I then have a subset of that data, Subset A. I want to be able to subtract Subset A from the main data to create a second subset of what's left - so essentially I would have Subset A and Subset B, and when you put them together, you would have the whole data set.

To do this, I've merged the main dataset and Subset A to identify and filter out the duplicates (based on duplicate values in a specific column; the whole rows are not completely identical). However, after playing around with duplicate(), distinct(), and unique(), the problem I keep running into is that I can't filter out or account for ALL duplicates. The functions will always leave me with one unique row to represent each duplicate value.

So for example, in the below data frame:

name = c("Mary", "Frank", "Tom", "Frank", "Mary")
color = c("red", "red", "red", "red", "red")
data = data.frame(name, color)
data

   name color
1  Mary   red
2 Frank   red
3   Tom   red
4 Frank   red
5  Mary   red

I would want to subset so that only the row with "Tom" remains.

name color
1  Tom   red

I do NOT want a subset that returns:


 name color
1  Mary   red
2 Frank   red
3   Tom   red

Does that make sense? I should note that in this dataset, I have a lot of values to account for. Am I going about this the completely wrong way? Or am I just missing something? I know how I can do this in Excel, but I would like to learn how to do it in R.

andresrcs · September 28, 2019, 7:05pm

A simple solution

library(dplyr)

data = data.frame(stringsAsFactors = FALSE,
                  name = c("Mary", "Frank", "Tom", "Frank", "Mary"),
                  color = c("red", "red", "red", "red", "red"))
data %>% 
    count(name, color) %>% 
    filter(n == 1) %>% 
    select(-n)
#> # A tibble: 1 x 2
#>   name  color
#>   <chr> <chr>
#> 1 Tom   red

^{Created on 2019-09-28 by the reprex package (v0.3.0.9000)}

tmauch · September 28, 2019, 7:27pm

That did it, thank you!

joels · September 28, 2019, 7:28pm

In general, you would group by the columns for which you want to return combinations that have only a single row. In this case, you have only one column to check, so you could do:

# Return names which have only a single row of data
data %>% 
  group_by(name) %>% 
  filter(n()==1)

# Return names which have more than one row of data
data %>% 
  group_by(name) %>% 
  filter(n() != 1)

You could also keep the entire data frame, but add a column that marks names with only a single row and names with more than one row:

data = data %>% 
  group_by(name) %>% 
  mutate(duplicate.flag = n() > 1)

Then, you could use filter to subset each group, as needed:

data %>% filter(duplicate.flag)
data %>% filter(!duplicate.flag)

system · October 5, 2019, 7:29pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.