Using filter() with across() to keep all rows of a data frame that include a missing value for any variable

brad.cannell · June 2, 2020, 9:27pm

Sometimes I want to view every row of a data frame that would be dropped if I removed all rows with a missing value in any variable. In this case, I'm specifically interested in how to do this with dplyr 1.0's across() function inside the filter() verb.

Here is an example data frame:

library(dplyr)

df <- tribble(
  ~id, ~x, ~y,
  1, 1, 0,
  2, 1, 1,
  3, NA, 1,
  4, 0, 0,
  5, 1, NA
)

Code for keeping rows that DO NOT include any missing values is provided on the tidyverse website. Specifically, I can use:

df %>% 
  filter(
    across(
      .cols = everything(),
      .fns = ~ !is.na(.x)
    )
  )

Which returns:

# A tibble: 3 x 3
     id     x     y
  <dbl> <dbl> <dbl>
1     1     1     0
2     2     1     1
3     4     0     0

However, I can't figure out how to return the opposite -- rows with a missing value in any variable. The result I'm looking for is:

# A tibble: 2 x 3
     id     x     y
  <dbl> <dbl> <dbl>
1     3    NA     1
2     5     1    NA

My first thought was just to remove the !:

df %>% 
  filter(
    across(
      .cols = everything(),
      .fns = ~ is.na(.x)
    )
  )

But, that returns zero rows.
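
A likely explanation for the empty result, sketched under the assumption that filter() combines the columns returned by across() with a logical AND: with is.na() as the function, a row is kept only if every column is missing at once. The hand-written equivalent below is an illustration of that assumption, not code from the dplyr documentation.

```r
library(dplyr)

df <- tribble(
  ~id, ~x, ~y,
  1, 1, 0,
  2, 1, 1,
  3, NA, 1,
  4, 0, 0,
  5, 1, NA
)

# Spelled out by hand: keep rows where id AND x AND y are all missing.
# No row in df has every column missing, so nothing survives the filter.
all_missing <- df %>%
  filter(is.na(id) & is.na(x) & is.na(y))

nrow(all_missing)
```

The same reading explains why the negated version works: "not missing in id AND not missing in x AND not missing in y" is exactly the set of complete rows.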

Of course, I can get the answer I want with this code if I know all variables that have a missing value ahead of time:

df %>% 
  filter(is.na(x) | is.na(y))

But, I'm looking for a solution that doesn't require me to know which variables have a missing value ahead of time. Additionally, I'm aware of how to do this with the filter_all() function:

df %>% 
  filter_all(any_vars(is.na(.)))

But, the filter_all() function has been superseded by the use of across() in an existing verb. See https://dplyr.tidyverse.org/articles/colwise.html

Other unsuccessful attempts I've made are:

df %>% 
  filter(
    across(
      .cols = everything(),
      .fns = ~any_vars(is.na(.x))
    )
  )

df %>% 
  filter(
    across(
      .cols = everything(),
      .fns = ~!!any_vars(is.na(.x))
    )
  )

df %>% 
  filter(
    across(
      .cols = everything(),
      .fns = ~!!any_vars(is.na(.))
    )
  )

df %>% 
  filter(
    across(
      .cols = everything(),
      .fns = ~any(is.na(.x))
    )
  )

df %>% 
  filter(
    across(
      .cols = everything(),
      .fns = ~any(is.na(.))
    )
  )
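
One plausible reason the any() attempts do not help: any() collapses a whole column to a single TRUE/FALSE, so the per-row information that filter() needs is lost. (any_vars() is a helper designed for the scoped verbs such as filter_all(), so it is not expected to work inside across().) The base R sketch below only illustrates the difference in shape; the name col_has_na is made up for this example.

```r
library(dplyr)

df <- tribble(
  ~id, ~x, ~y,
  1, 1, 0,
  2, 1, 1,
  3, NA, 1,
  4, 0, 0,
  5, 1, NA
)

# any() answers "does this column contain a missing value anywhere?"
# and returns one value per column, not one value per row.
col_has_na <- vapply(df, function(col) any(is.na(col)), logical(1))
col_has_na

# is.na() alone keeps one value per row, which is the shape filter() needs.
row_flags_x <- is.na(df$x)
row_flags_x
```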

This question is also posted on Stack Overflow.

HanOostdijk · June 2, 2020, 10:44pm

In the same article you mention (on the tidyverse website) there is a "trick" that uses the rowSums() function. You can use it like this:

# TRUE for each row in which at least one column is TRUE
rowAny <- function(x) rowSums(x) > 0

df %>% 
  filter(
    rowAny(
      across(
        .cols = everything(),
        .fns = ~ is.na(.x)
      )
    )
  )
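
For readers on a later dplyr release, a shorter form may be available. The sketch below assumes dplyr 1.0.4 or newer, where if_any() exists; it did not exist when this thread was written and is not the approach proposed in this thread, so treat it as an alternative rather than a replacement for the rowSums() trick.

```r
library(dplyr)

df <- tribble(
  ~id, ~x, ~y,
  1, 1, 0,
  2, 1, 1,
  3, NA, 1,
  4, 0, 0,
  5, 1, NA
)

# Assumption: dplyr >= 1.0.4, which provides if_any().
# Keep rows where at least one column is missing.
rows_with_na <- df %>%
  filter(if_any(everything(), is.na))

# Base R cross-check: is.na(df) is a logical matrix, so rowSums() counts
# the missing values in each row.
rows_with_na_base <- df[rowSums(is.na(df)) > 0, ]

rows_with_na$id
```

The companion helper if_all() covers the opposite case of keeping rows where the condition holds in every selected column.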

brad.cannell · June 3, 2020, 12:19am

Thank you, @HanOostdijk! This definitely works, and meets the criteria of using dplyr and across().

Also, I can't help adding a little commentary here. I'll say again, the solution above works. So, thank you @HanOostdijk! However, it feels unsatisfying to me. For me, dplyr has always been about simplifying code and making it easier to read. The code above does not feel like it meets either of those criteria. Don't get me wrong, I'm a huge fan of dplyr and the tidyverse (thank you for giving away these amazing tools!), but as of this moment, across() doesn't feel like an improvement over _if, _at, and _all. I hope it grows on me!

mara · June 3, 2020, 1:47pm

Don't worry, the scoped variants aren't going away (superseded just means we're not adding new features to them), and across() is still in its infancy of examples, etc., so hopefully it will continue to improve!

brad.cannell · June 3, 2020, 2:04pm

Thanks for the clarification, @mara! I know across() will continue to improve (and hopefully I improve at using it!). Thanks for all you all do!

brad.cannell · June 3, 2020, 2:30pm

Forgive me if this isn't the appropriate place to post this comment, @mara, but I think I figured out why across() feels a little uncomfortable to me. In my mind, across() should only select the columns to be operated on (in the spirit of "each function does one thing"). In reality, across() both selects the columns and receives the operation to execute on them.

For me, I think across() would feel more natural if it could be used like, for example:

df %>% 
  group_by(g1, g2) %>% 
  summarise(across(a:d), mean)

Instead of:

df %>% 
  group_by(g1, g2) %>% 
  summarise(across(a:d, mean))

Or, using the motivating example for this post:

df %>% 
  filter(
    across(everything()), 
    any_vars(is.na(.))
  )

Instead of:

df %>% 
  filter(
    rowAny(
      across(
        .cols = everything(),
        .fns = ~ is.na(.x)
      )
    )
  )

I'm sure that ship has sailed and I'm sure there was a great reason for implementing across() in the way that it was implemented. But, I thought I would share my two cents just in case it ends up being useful.

mara · June 3, 2020, 2:33pm

Hey Brad,

I hear what you're saying, and it makes sense to me. I'm pretty sure this had to do with challenges of implementation, but I'll pass the word along.
