Hi, I have a dataset of more than 1.5 million instances. there are a few problem instances where they share the same value in one variable, and sometimes do not match in another. I only want to keep the instance (row) if they match in both variables.
Here is a sample dataset:
a<-c("Ada~R0254110050~Ag","R0254110050",10000) b<-c("Ada~R0254110050~OOC Residential","R0254110050",10000) c<-c("Ada~R0254110050~Non-OOC Residential","R0254110050",9000) d<-c("Ada~R0254110051~OOC Residential","R0254110051",11000) e<-c("Ada~R0254110051~OOC Residential","R0254110051",12000) f<-c("Ada~R0254110052~Ag","R0254110052",11000) g<-c("Ada~R0254110052~Ag","R0254110052",12000) h<-c("Ada~R0254110053~Non-OOC Residential","R0254110053",10000) i<-c("Ada~R0254110053~Non-OOC Residential","R0254110053",9000) j<-c("Ada~R0254110054~Commercial","R0254110054",9000) k<-c("Ada~R0254110055~Timber","R0254110055",9000) l<-c("Ada~R0254110056~Non-OOC Residential","R0254110056",9000) data <- as.data.frame(rbind(a,b,c,d,e,f,g,h,i,j,k,l)) names(data) <- c("parcel.id", "parcel", "value")
I want to group by parcel, and only keep rows where the parcel.id also match each other.
The output would remove only rows a,b, and c (because they they share the same parcel number, but not the same parcel.id).
Seems like a simple fix, but I cannot get it for the life of me. I have tried this:
data %>% group_by(parcel) %>% filter((duplicated(parcel.id) | duplicated(parcel.id, fromLast = T)))
but it also removes j,k, and l, which I need to keep. I tried sub-setting, but couldn't figure out how to group, and I also considered a if statement that would only filter if the parcel number was duplicated, but couldn't get that to work either.
I would prefer doing this as a subset, but dplyr (or any other elegant solution) also works.
Please help, and thanks in advance!