Hi, I have a dataset of more than 1.5 million instances. there are a few problem instances where they share the same value in one variable, and sometimes do not match in another. I only want to keep the instance (row) if they match in both variables.
Here is a sample dataset:
a<-c("Ada~R0254110050~Ag","R0254110050",10000)
b<-c("Ada~R0254110050~OOC Residential","R0254110050",10000)
c<-c("Ada~R0254110050~Non-OOC Residential","R0254110050",9000)
d<-c("Ada~R0254110051~OOC Residential","R0254110051",11000)
e<-c("Ada~R0254110051~OOC Residential","R0254110051",12000)
f<-c("Ada~R0254110052~Ag","R0254110052",11000)
g<-c("Ada~R0254110052~Ag","R0254110052",12000)
h<-c("Ada~R0254110053~Non-OOC Residential","R0254110053",10000)
i<-c("Ada~R0254110053~Non-OOC Residential","R0254110053",9000)
j<-c("Ada~R0254110054~Commercial","R0254110054",9000)
k<-c("Ada~R0254110055~Timber","R0254110055",9000)
l<-c("Ada~R0254110056~Non-OOC Residential","R0254110056",9000)
data <- as.data.frame(rbind(a,b,c,d,e,f,g,h,i,j,k,l))
names(data) <- c("parcel.id", "parcel", "value")
I want to group by parcel, and only keep rows where the parcel.id also match each other.
The output would remove only rows a,b, and c (because they they share the same parcel number, but not the same parcel.id).
Seems like a simple fix, but I cannot get it for the life of me. I have tried this:
data %>%
group_by(parcel) %>%
filter((duplicated(parcel.id) | duplicated(parcel.id, fromLast = T)))
but it also removes j,k, and l, which I need to keep. I tried sub-setting, but couldn't figure out how to group, and I also considered a if statement that would only filter if the parcel number was duplicated, but couldn't get that to work either.
I would prefer doing this as a subset, but dplyr (or any other elegant solution) also works.
Please help, and thanks in advance!