Serious problems with filter and dplyr

Hello,
I'm working with some data and I've run into "errors" where the code doesn't behave as I expect.
For example, if I filter something
data %>% filter(region=="Glasgow" & age==30)
I get the rows I expect. But when I try
data=data %>% filter(!(region=="Glasgow" & age==30))
in order to exclude those specific rows, I noticed that R deletes not only the ones I specified but other rows as well.
To avoid that, I think I need
data=data %>% filter(!(region=="Glasgow" & age==30 & !is.na(age) & !is.na(region)))
In Stata, the equivalent is just
drop if region=="Glasgow" & age==30
I've lost confidence in doing this kind of task.
I've read about anti_join, but I'm not sure about it either.
What is the best way to deal with data and exclude rows with confidence?
Thanks for your time and interest.

Can you provide a reproducible example?

This is true, interesting! It also removes the row where region is Glasgow and age is NA, so it must have something to do with how the (negated) NA is handled.
If you don't need the rows with NA in age, you could also remove them beforehand.
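
For what it's worth, this looks like R's three-valued logic at work. As far as I understand, filter() keeps only the rows where the condition is TRUE, so a row whose condition evaluates to NA is dropped just like a FALSE one. Here is a minimal base-R sketch of what happens to three kinds of rows (the values are made up for illustration, not taken from the real data):

```r
# Three illustrative rows: a Glasgow/30 match, a Glasgow row with missing age,
# and an Edinburgh row with missing age (made-up values).
region <- c("Glasgow", "Glasgow", "Edinburgh")
age    <- c(30L, NA, NA)

cond <- region == "Glasgow" & age == 30
cond
# TRUE & TRUE is TRUE; TRUE & NA is NA; FALSE & NA is FALSE

negated <- !cond
negated
# !TRUE is FALSE; !NA is still NA; !FALSE is TRUE

# Keeping only the positions that are TRUE (NA is skipped) leaves just the
# third row, so the Glasgow row with missing age disappears as well.
kept <- which(negated)
kept
```

So only the rows where region is Glasgow and age is NA should get lost; rows from other regions with a missing age still evaluate to TRUE after negation and survive.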

library(dplyr)

data = data.frame(
  stringsAsFactors = FALSE,
  region = c("Glasgow","Edinburgh","Inverness","Aberdeen",
             "Glasgow","Edinburgh", "Inverness","Aberdeen",
             "Glasgow","Edinburgh", "Inverness","Aberdeen",
             "Glasgow", "Edinburgh","Inverness","Aberdeen",
             "Glasgow", "Edinburgh","Inverness","Aberdeen",
             "Glasgow", "Edinburgh","Inverness","Aberdeen",
             "Glasgow", "Edinburgh","Inverness","Aberdeen",
             "Glasgow", "Edinburgh","Inverness","Aberdeen"),
  age = c(30L,42L,44L,NA,30L,48L,32L,42L,28L,NA,
          48L,30L,42L,38L,30L,26L,46L,28L,43L,39L,20L,
          35L,33L,NA,37L,43L,30L,49L,NA,27L,41L,33L),
  line = 1:32
)

data %>% filter(region=="Glasgow" & age==30)
# this gives (correctly) 2 hits: line 1 and line 5

data %>% filter(!(region=="Glasgow" & age==30))
# this gives 29 rows: lines 1 and 5 are removed as intended, but line 29 (Glasgow, age NA) is missing too

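# using anti_join: first collect the rows to drop, then remove them
kick_out = filter(data, region=="Glasgow" & age==30)

# with no `by` given, anti_join matches on all shared columns (region, age, line)
data %>%
  anti_join(kick_out)
# this gives 30 rows, correctly without line 1 and line 5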
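
If you would rather stay with filter(), one option is to turn the NA results into FALSE before negating. This is only a sketch, assuming dplyr is installed; the small `toy` data frame is made up for illustration, and coalesce() is dplyr's function for replacing NA with a fallback value:

```r
library(dplyr)

# Made-up example data: one real match, one Glasgow row with missing age,
# one ordinary Glasgow row, and one non-Glasgow row with missing age.
toy <- data.frame(
  region = c("Glasgow", "Glasgow", "Glasgow", "Edinburgh"),
  age    = c(30L, NA, 28L, NA),
  line   = 1:4
)

# Treat "unknown" (NA) as "not a match", then negate.
safe <- toy %>%
  filter(!coalesce(region == "Glasgow" & age == 30, FALSE))

# Same idea in base-R terms: NA %in% TRUE is FALSE, so only definite
# matches are excluded.
safe2 <- toy %>%
  filter(!((region == "Glasgow" & age == 30) %in% TRUE))

safe$line
safe2$line
```

Both versions should drop only line 1 and keep the Glasgow row with the missing age.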

Reading your code convinces me that R can be very tricky when handling big datasets.
Besides, I think it is tedious to check every time I write a filter or a negated filter.
I will take the anti_join approach in the future. It seems the safest.
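
One way to make that checking less tedious might be a quick row-count comparison: the number of rows removed should equal the number of rows the positive condition matches. A rough sketch, assuming dplyr is installed; the `toy` data frame and the variable names are made up for illustration:

```r
library(dplyr)

# Made-up example data with one real match and two rows with missing age.
toy <- data.frame(
  region = c("Glasgow", "Glasgow", "Glasgow", "Edinburgh"),
  age    = c(30L, NA, 28L, NA),
  line   = 1:4
)

matched <- toy %>% filter(region == "Glasgow" & age == 30)
naive   <- toy %>% filter(!(region == "Glasgow" & age == 30))
by_join <- toy %>% anti_join(matched, by = c("region", "age", "line"))

# Rows removed should equal rows matched; a mismatch hints that rows with
# NA in the condition were dropped silently.
naive_ok <- (nrow(toy) - nrow(naive)) == nrow(matched)
join_ok  <- (nrow(toy) - nrow(by_join)) == nrow(matched)

naive_ok  # FALSE: the negated filter removed one row too many
join_ok   # TRUE: anti_join removed exactly the matched row
```

Wrapping the comparison in stopifnot() would turn it into a guard that stops the script whenever a filter removes more rows than intended.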
