I lost observations that exist when I changed the dplyr call from 'filter(x, is.na(...))' to 'filter(x, !is.na(...))'

Hello, everyone!

I ran the code below and it worked perfectly until I changed 'filter(x, is.na(dep_time))' to 'filter(x, !is.na(dep_time))'. After that, all of my observations were gone from the environment. However, when I cleared the environment and ran 'filter(x, !is.na(dep_time))' again, it worked.

library(dplyr)
library(nycflights13)
fl = flights
fl = fl %>% filter(is.na(dep_time))
dim(fl)

Why does this happen?

This is most likely because the first filter, fl = fl %>% filter(is.na(dep_time)), overwrites fl. It takes fl from 336,776 x 19 down to 8,255 x 19 (only the rows where dep_time is NA).

Then, when you change the line to fl = fl %>% filter(!is.na(dep_time)), you are filtering the already filtered fl, so you go from 8,255 x 19 to 0 x 19. No rows with a non-missing dep_time are left, because they were all removed by the first filter. Clearing the environment fixes it because fl is then rebuilt from the full flights data before the second filter runs.
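To see the effect in isolation, here is a minimal sketch on a made-up five-row table (toy and its values are hypothetical stand-ins for flights, not real data). It overwrites fl the same way as in the question, then compares that with filtering the untouched original.

```r
library(dplyr)

# Hypothetical stand-in for flights: 5 rows, 2 of them with a missing dep_time
toy <- tibble(
  flight   = 1:5,
  dep_time = c(517L, NA, 533L, NA, 542L)
)

fl <- toy
fl <- fl %>% filter(is.na(dep_time))    # fl now holds only the 2 NA rows
n_after_first <- nrow(fl)

fl <- fl %>% filter(!is.na(dep_time))   # runs on those 2 NA rows, not on all 5
n_after_second <- nrow(fl)

# Filtering the untouched original gives the rows you actually wanted
n_fresh <- toy %>% filter(!is.na(dep_time)) %>% nrow()

c(n_after_first, n_after_second, n_fresh)
#> [1] 2 0 3
```

Writing each filtered result to a new name, as in the reprex that follows, avoids the problem because the original data frame is never overwritten.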

suppressPackageStartupMessages({
  library(dplyr)
  library(nycflights13)
  })  

fl <- flights
fl_na <- fl %>% filter(is.na(dep_time))
fl_not_na <- fl %>% filter(!is.na(dep_time))
dim(fl) # original
#> [1] 336776     19
dim(fl_na) # filtered to select only dep_time NA
#> [1] 8255   19
dim(fl_not_na) # filtered to exclude dep_time NA
#> [1] 328521     19

Created on 2020-10-13 by the reprex package (v0.3.0.9001)
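As a quick sanity check on those dimensions, the two subsets should together account for every row of the original, since each dep_time is either NA or not (8,255 + 328,521 = 336,776). A small sketch of that check, assuming nycflights13 is installed; n_na and n_not_na are names introduced here for illustration:

```r
library(dplyr)
library(nycflights13)

fl <- flights
n_na     <- fl %>% filter(is.na(dep_time))  %>% nrow()
n_not_na <- fl %>% filter(!is.na(dep_time)) %>% nrow()

# Stops with an error if the two counts do not add up to the original row count
stopifnot(n_na + n_not_na == nrow(fl))
```

If this ever fails in your own session, it is a hint that fl was overwritten somewhere earlier.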
