Keep only duplicates based on group of duplicates

Hi, I have a dataset of more than 1.5 million instances. there are a few problem instances where they share the same value in one variable, and sometimes do not match in another. I only want to keep the instance (row) if they match in both variables.

Here is a sample dataset:

a<-c("Ada~R0254110050~Ag","R0254110050",10000)
b<-c("Ada~R0254110050~OOC Residential","R0254110050",10000)
c<-c("Ada~R0254110050~Non-OOC Residential","R0254110050",9000)
d<-c("Ada~R0254110051~OOC Residential","R0254110051",11000)
e<-c("Ada~R0254110051~OOC Residential","R0254110051",12000)
f<-c("Ada~R0254110052~Ag","R0254110052",11000)
g<-c("Ada~R0254110052~Ag","R0254110052",12000)
h<-c("Ada~R0254110053~Non-OOC Residential","R0254110053",10000)
i<-c("Ada~R0254110053~Non-OOC Residential","R0254110053",9000)
j<-c("Ada~R0254110054~Commercial","R0254110054",9000)
k<-c("Ada~R0254110055~Timber","R0254110055",9000)
l<-c("Ada~R0254110056~Non-OOC Residential","R0254110056",9000)

data <- as.data.frame(rbind(a,b,c,d,e,f,g,h,i,j,k,l))
names(data) <- c("parcel.id", "parcel", "value")

I want to group by parcel, and only keep rows where the parcel.id also match each other.

The output would remove only rows a,b, and c (because they they share the same parcel number, but not the same parcel.id).

Seems like a simple fix, but I cannot get it for the life of me. I have tried this:

data %>%
  group_by(parcel) %>%
  filter((duplicated(parcel.id) | duplicated(parcel.id, fromLast = T)))

but it also removes j,k, and l, which I need to keep. I tried sub-setting, but couldn't figure out how to group, and I also considered a if statement that would only filter if the parcel number was duplicated, but couldn't get that to work either.

I would prefer doing this as a subset, but dplyr (or any other elegant solution) also works.

Please help, and thanks in advance!

I bet there is a more elegant way to do this but it works.

library(dplyr)
a<-c("Ada~R0254110050~Ag","R0254110050",10000)
b<-c("Ada~R0254110050~OOC Residential","R0254110050",10000)
c<-c("Ada~R0254110050~Non-OOC Residential","R0254110050",9000)
d<-c("Ada~R0254110051~OOC Residential","R0254110051",11000)
e<-c("Ada~R0254110051~OOC Residential","R0254110051",12000)
f<-c("Ada~R0254110052~Ag","R0254110052",11000)
g<-c("Ada~R0254110052~Ag","R0254110052",12000)
h<-c("Ada~R0254110053~Non-OOC Residential","R0254110053",10000)
i<-c("Ada~R0254110053~Non-OOC Residential","R0254110053",9000)
j<-c("Ada~R0254110054~Commercial","R0254110054",9000)
k<-c("Ada~R0254110055~Timber","R0254110055",9000)
l<-c("Ada~R0254110056~Non-OOC Residential","R0254110056",9000)

data <- as.data.frame(rbind(a,b,c,d,e,f,g,h,i,j,k,l))
names(data) <- c("parcel.id", "parcel", "value")

MyFunc <- function(Col) {
  if (length(unique(Col)) > 1) 0 else 1
}
FilteredData <- data %>% group_by(parcel) %>% mutate(Flag = MyFunc(parcel.id)) %>% 
  filter(Flag == 1) %>% select(-Flag)
FilteredData
#> # A tibble: 9 x 3
#> # Groups:   parcel [6]
#>   parcel.id                           parcel      value
#>   <chr>                               <chr>       <chr>
#> 1 Ada~R0254110051~OOC Residential     R0254110051 11000
#> 2 Ada~R0254110051~OOC Residential     R0254110051 12000
#> 3 Ada~R0254110052~Ag                  R0254110052 11000
#> 4 Ada~R0254110052~Ag                  R0254110052 12000
#> 5 Ada~R0254110053~Non-OOC Residential R0254110053 10000
#> 6 Ada~R0254110053~Non-OOC Residential R0254110053 9000 
#> 7 Ada~R0254110054~Commercial          R0254110054 9000 
#> 8 Ada~R0254110055~Timber              R0254110055 9000 
#> 9 Ada~R0254110056~Non-OOC Residential R0254110056 9000

Created on 2021-08-24 by the reprex package (v0.3.0)

Perfect. This worked amazingly! Thanks so much! :slight_smile:

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.